Commit graph

347 commits

Author SHA1 Message Date
Chaitany
96eb0d448a
feat(#1357): Lexical chunk retriever (#1392)
<!-- .github/pull_request_template.md -->

## Description
<!-- 
Please provide a clear, human-generated description of the changes in
this PR.
DO NOT use AI-generated descriptions. We want to understand your thought
process and reasoning.
-->
I implemented a lexical chunk retriever. The LexicalRetriever class
inherits from BaseRetriever, and the DocumentChunk objects are
lazy-loaded the first time a query is made, which saves time during
object initialization. The get_context and get_completion functions are
implemented the same way as in ChunksRetriever; the only difference is
that the DocumentChunk objects are converted to match the output type
of ChunksRetriever using the get_own_properties function in utils.
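
The lazy loading described above could look roughly like the sketch below. It is only an illustration of the pattern, not the code in this PR: `LazyChunkStore` and the `load_chunks` callable are hypothetical names, and the real retriever may load its chunks differently (for example asynchronously).

```python
class LazyChunkStore:
    """Hypothetical helper: defers loading chunks until the first query."""

    def __init__(self, load_chunks):
        # load_chunks is an assumed zero-argument callable that returns
        # an iterable of chunks; nothing is loaded at construction time.
        self._load_chunks = load_chunks
        self._chunks = None

    def chunks(self):
        # Load once on first access, then reuse the cached list.
        if self._chunks is None:
            self._chunks = list(self._load_chunks())
        return self._chunks
```

Constructing the store stays cheap, and the loading cost is paid only by the first call to `chunks()`.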

## Type of Change
<!-- Please check the relevant option -->
- [-] Bug fix (non-breaking change that fixes an issue)
- [-] New feature (non-breaking change that adds functionality)
- [-] Breaking change (fix or feature that would cause existing
functionality to change)
- [-] Documentation update
- [-] Code refactoring
- [-] Performance improvement
- [-] Other (please specify):

## Changes Made
<!-- List the specific changes made in this PR -->
- Added LexicalRetriever base class with customizable tokenizer & scorer
    - Implemented caching of DocumentChunk tokens and payloads
- Added robust initialization with error handling and logging
- Implemented get_context with top_k ranking and optional scores
- Implemented get_completion consistent with BaseRetriever interface
- Added JaccardChunksRetriever demo using set/multiset Jaccard
similarity
- Support for stopwords and multiset frequency-aware similarity
- Integrated logging for initialization, scoring, and retrieval
## Testing

- Manual tests: initialized retriever, retrieved chunks with toy corpus
    - Edge cases: empty corpus, empty query, scorer/tokenizer errors 
    - Verified Jaccard similarity results for single/multiset cases 
    - Code formatted and linted


## Screenshots/Videos (if applicable)
<!-- Add screenshots or videos to help explain your changes -->

## Pre-submission Checklist
<!-- Please check all boxes that apply before submitting your PR -->
- [-] **I have tested my changes thoroughly before submitting this PR**
- [-] **This PR contains minimal changes necessary to address the
issue/feature**
- [-] My code follows the project's coding standards and style
guidelines
- [-] I have added tests that prove my fix is effective or that my
feature works
- [-] I have added necessary documentation (if applicable)
- [-] All new and existing tests pass
- [-] I have searched existing PRs to ensure this change hasn't been
submitted already
- [-] I have linked any relevant issues in the description
- [-] My commits have clear and descriptive messages

## Related Issues
<!-- Link any related issues using "Fixes #issue_number" or "Relates to
#issue_number" -->
Relates to #1392

## Additional Notes
<!-- Add any additional notes, concerns, or context for reviewers -->
In cognee/modules/chunking/models/DocumentChunk.py, don't remove
Optional from the is_part_of attribute.

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Andrej Milicevic <milicevicandrej@yahoo.com>
Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Igor Ilic <igorilic03@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com>
2025-09-19 18:24:33 +02:00
Igor Ilic
52b25882b3 refactor: Move cli tests to run in parallel with all tests 2025-09-11 22:34:26 +02:00
Igor Ilic
136b5a2f95 test: Add test for memify pipeline 2025-09-11 17:58:42 +02:00
hajdul88
a4e59b7583
Merge branch 'dev' into feature/cog-2746-time-graph-to-cognify 2025-09-01 09:47:37 +02:00
vasilije
a3da74a01d add open router 2025-08-29 21:49:28 +02:00
hajdul88
0fac4da2d0 feat: adds temporal graph integration and structural tests 2025-08-29 18:21:24 +02:00
Igor Ilic
3a3274b5f9
Update .github/workflows/test_different_operating_systems.yml
Co-authored-by: Boris <boris@topoteretes.com>
2025-08-27 07:38:05 -04:00
Igor Ilic
e4e1a5438e refactor: Add read permissions only for gh token 2025-08-27 12:47:23 +02:00
Igor Ilic
23ea1c1659
Potential fix for code scanning alert no. 187: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-08-27 06:42:30 -04:00
Igor Ilic
58655ca41e refactor: Add proper path to test file 2025-08-26 21:51:42 +02:00
Igor Ilic
229a7a1db3 refactor: Speed up CI/CD execution time 2025-08-26 21:28:11 +02:00
Igor Ilic
65542ecec7 refactor: Make CI/CD faster add more OS tests 2025-08-26 21:05:30 +02:00
Igor Ilic
8c69653912 fix: Resolve issue with Windows path 2025-08-26 20:22:20 +02:00
Andrej Milicevic
cf7d41cec2 Fix 2025-08-26 09:35:53 +02:00
Andrej Milicevic
dce525e27b Fix again. 2025-08-25 20:52:12 +02:00
Andrej Milicevic
de7a5a1c5d Fixed yml files 2025-08-25 20:45:11 +02:00
Andrej Milicevic
454358ea28 Solution for the API key error. 2025-08-25 20:26:59 +02:00
Andrej Milicevic
4b593aa523 And another potential fix for the API error. 2025-08-25 20:07:15 +02:00
Andrej Milicevic
31b33a98d3 Yet another potential fix for the invalid API key. 2025-08-25 19:40:39 +02:00
vasilije
fe6c9000fa added fix 2025-08-19 08:57:21 +02:00
vasilije
f5d702f8fb added fixes to integraton tests 2025-08-19 08:48:15 +02:00
vasilije
d084d00a4d added tests 2025-08-18 22:58:14 +02:00
Vasilije
c4ec6799a6
Merge branch 'dev' into move_to_gpt5 2025-08-17 12:20:57 +02:00
vasilije
b0e3f89340 move to gpt5 2025-08-17 12:19:34 +02:00
Daulet Amirkhanov
e4e0512856 feat: add reusable GitHub Action to set up Neo4j with Graph Data Science for testing 2025-08-15 13:29:54 +01:00
Daulet Amirkhanov
5f7598d59d test: use neo4j_metrics_test in descriptive tests instead of networkx 2025-08-15 13:13:15 +01:00
Daulet Amirkhanov
bcdbadc468 fix: unintentionally uninstall required deps when "uv sync" 2025-08-15 09:48:23 +01:00
Daulet Amirkhanov
1ab332828f fix: uv uninstalls rest of packages in some workflows 2025-08-15 09:48:23 +01:00
Daulet Amirkhanov
c60627306f Refactor CI workflows to replace Poetry with uv for dependency management and execution 2025-08-15 09:48:23 +01:00
Igor Ilic
741188f788
Main merge vol5 (#1252)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Pavel Zorin <pazonec@yandex.ru>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-08-14 21:17:17 +02:00
Igor Ilic
4543890a70
Loader separation (#1240)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: vasilije <vas.markovic@gmail.com>
2025-08-14 19:55:39 +02:00
Daulet Amirkhanov
b297289060
Fix/add async lock to all vector databases (#1244)
<!-- .github/pull_request_template.md -->

## Description
1. Cleans up VectorDB adapters that have been migrated to the
`cognee-community` repo
2. Adds async lock protection to the create_collection method in the
remaining VectorDB adapter, ChromaDB

See #1222
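
For readers unfamiliar with the pattern, the async lock protection mentioned above can be sketched as follows. This is an assumption-laden illustration, not the ChromaDB adapter code: `InMemoryVectorAdapter`, `has_collection`, and `create_calls` are hypothetical names, and the `asyncio.sleep(0)` calls only stand in for real I/O.

```python
import asyncio


class InMemoryVectorAdapter:
    """Hypothetical stand-in for a vector DB adapter (not cognee's ChromaDB adapter)."""

    def __init__(self):
        self._collections = {}
        self.create_calls = 0
        self._lock = asyncio.Lock()

    async def has_collection(self, name):
        await asyncio.sleep(0)  # simulate an I/O round trip
        return name in self._collections

    async def create_collection(self, name):
        # The lock makes the check-then-create sequence atomic, so
        # concurrent callers cannot both see "missing" and both create.
        async with self._lock:
            if not await self.has_collection(name):
                await asyncio.sleep(0)  # simulate the create call
                self._collections[name] = []
                self.create_calls += 1


async def demo():
    adapter = InMemoryVectorAdapter()
    # Five concurrent requests for the same collection.
    await asyncio.gather(*(adapter.create_collection("docs") for _ in range(5)))
    return adapter.create_calls
```

Without the lock, several coroutines could pass the existence check before any of them finishes creating the collection.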

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2025-08-14 15:57:34 +02:00
Vasilije
dabd0912f8
feat: Cog 2082 add BAML to cognee (#1054)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Signed-off-by: Raj2604 <rajmandhare26@gmail.com>
Co-authored-by: Daulet Amirkhanov <damirkhanov01@gmail.com>
Co-authored-by: Hande <159312713+hande-k@users.noreply.github.com>
Co-authored-by: Igor Ilic <igorilic03@gmail.com>
Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
Co-authored-by: Matea Pesic <80577904+matea16@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions@users.noreply.github.com>
Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>
Co-authored-by: Raj Mandhare <96978537+Raj2604@users.noreply.github.com>
Co-authored-by: Pedro Thompson <thompsonp17@hotmail.com>
Co-authored-by: Pedro Henrique Thompson Furtado <pedrothompson@petrobras.com.br>
2025-08-06 10:41:47 +02:00
Vasilije
daa4e9acc4
fix: Remove weaviate (#1139)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-23 19:34:35 +02:00
hajdul88
dad7da2e7b
fix:Fixes missing entity to entity edges (#1118)
<!-- .github/pull_request_template.md -->

## Description
Fixes missing entity to entity edges

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-22 11:48:56 +02:00
Vasilije
7af7e3834f
feat: Cog 2340 remove graphistry (#1080)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Igor Ilic <igorilic03@gmail.com>
2025-07-21 15:06:23 -04:00
Boris
468186789c
fix: s3 file system env vars (#1112)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-19 15:56:15 +02:00
Boris
c5bd6bed40
fix: s3 file storage (#1095)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-16 20:36:18 +02:00
Vasilije
67c006bd2f
fix: Remove milvus from core (#1096)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-16 15:56:34 +02:00
Boris
46c4463cb2
feat: s3 storage (#988)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: vasilije <vas.markovic@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
2025-07-14 21:47:08 +02:00
Vasilije
4bcb893a54
feat: Weighted edges (#1068)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
Co-authored-by: Igor Ilic <igorilic03@gmail.com>
2025-07-14 21:26:25 +02:00
vasilije
9fd300112d removed memgraph 2025-07-13 20:43:39 +02:00
Igor Ilic
e51de46163
feat: Add test for permissions, change Cognee search return value (#1058)
<!-- .github/pull_request_template.md -->

## Description
Add tests for permissions for Cognee

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-08 13:33:03 +02:00
Igor Ilic
0d75b6dc76 Merge branch 'main' into main-merge 2025-06-30 12:24:24 +02:00
hajdul88
d1a9cab17d
Feature: Set default database to Kuzu (#1022)
<!-- .github/pull_request_template.md -->

## Description
Set the default db to Kuzu and remove the networkx adapter because of
the community repo adapter

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-06-27 08:50:58 +02:00
hajdul88
97d05f105e
feat: Adds core db tests for main search (#1006)
<!-- .github/pull_request_template.md -->

## Description
Adds core db tests for main search

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-06-24 10:51:34 +02:00
Igor Ilic
31809d98df
feat: Fix python312 issue on main (#1011)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: vasilije <vas.markovic@gmail.com>
2025-06-21 09:49:03 +02:00
hajdul88
acdcb0e8d9
feat: replace Owlready2 with RDFLib (#981)
<!-- .github/pull_request_template.md -->

## Description
Replaces Owlready2 with RDFLib

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Igor Ilic <igorilic03@gmail.com>
2025-06-17 14:49:53 +02:00
Igor Ilic
456f3b58c0
Mcp test (#980)
<!-- .github/pull_request_template.md -->

## Description
Add tests for MCP functionality and MCP server startup; fix some MCP
and LanceDB issues

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-06-13 07:52:48 -04:00
hajdul88
21a4217301
Feature: Makes s3 pathway imports optional so cognee can run without s3fs (#978)
<!-- .github/pull_request_template.md -->

## Description
Makes s3 pathway imports optional so cognee can run without s3fs
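
One common way to make an import like this optional is sketched below. It is a generic pattern under stated assumptions, not the change made in this PR: `optional_import` and `open_s3_filesystem` are hypothetical helper names, while `s3fs.S3FileSystem` is the class provided by the s3fs package.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Resolved once at import time; stays None when s3fs is not installed.
s3fs = optional_import("s3fs")


def open_s3_filesystem():
    """Hypothetical accessor that fails only when S3 is actually requested."""
    if s3fs is None:
        raise RuntimeError("S3 support requires the optional 's3fs' package")
    return s3fs.S3FileSystem()
```

With this shape, code paths that never touch S3 keep working without s3fs, and the error appears only when an S3 path is actually used.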

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-06-13 08:53:30 +02:00