cognee

Author	SHA1	Message	Date
hajdul88	001fbe699e	feat: Adds edge centered payload and embedding structure during ingestion (#1853) ## Description This pull request introduces edge-centered payloads to the ingestion process. Payloads are stored in the Triplet_text collection, which is compatible with the triplet_embedding memify pipeline. Changes in this PR: - Refactored custom edge handling: custom edges can now be passed to the add_data_points method, so ingestion is centralized in one place - Added private methods in add_data_points.py to handle edge-centered payload creation - Added unit tests to cover the new functionality - Added integration tests - Added e2e tests Acceptance Criteria and Testing Scenario 1: - Set TRIPLET_EMBEDDING env var to True - Run prune, add, cognify - Verify the vector DB contains a non-empty Triplet_text collection and that the number of triplets matches the number of edges in the graph database - Use the new triplet_completion search type and confirm it works correctly. Scenario 2: - Set TRIPLET_EMBEDDING env var to False - Run prune, add, cognify - Verify the vector DB does not have the Triplet_text collection - You should receive an error indicating that Triplet_text is not available ## Type of Change - [x] New feature (non-breaking change that adds functionality) ## Pre-submission Checklist - [x] I have tested my changes thoroughly before submitting this PR - [x] This PR contains minimal changes necessary to address the issue/feature - [x] My code follows the project's coding standards and style guidelines - [x] I have added tests that prove my fix is effective or that my feature works - [x] I have added necessary documentation (if applicable) - [x] All new and existing tests pass - [x] I have searched existing PRs to ensure this change hasn't been submitted already - [x] I have linked any relevant issues in the description - [x] My commits have clear and descriptive messages ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin. ## Summary by CodeRabbit * New Features * Triplet embeddings supported: embeddings created from graph edges plus connected node text * Ability to supply custom edges when adding data points * New configuration toggle to enable/disable triplet embedding * Tests * Added comprehensive unit and end-to-end tests for edge-centered payloads and triplet embedding * New CI job to run the edge-centered payload e2e test * Bug Fixes * Adjusted server start behavior to surface process output in parent logs --------- Co-authored-by: Pavel Zorin <pazonec@yandex.ru>	2025-12-10 17:10:06 +01:00
lxobr	6223ecf05b	feat: optimize repeated entity extraction (#1682) ## Description - Added an `edge_text` field to edges that auto-fills from `relationship_type` if not provided. - Contains edges now store descriptions for better embedding - Updated and refactored indexing so that edge_text gets embedded and exposed - Updated retrieval to use the new embeddings - Added a test to verify edge_text exists in the graph with the correct format. ## Type of Change - [x] New feature (non-breaking change that adds functionality) - [x] Code refactoring - [x] Performance improvement ## Pre-submission Checklist - [x] I have tested my changes thoroughly before submitting this PR - [x] This PR contains minimal changes necessary to address the issue/feature - [x] My code follows the project's coding standards and style guidelines - [x] I have added tests that prove my fix is effective or that my feature works - [x] I have searched existing PRs to ensure this change hasn't been submitted already	2025-10-30 13:56:06 +01:00
Igor Ilic	13d1133680	chore: Change comments	2025-10-10 17:14:10 +02:00
Igor Ilic	757d745b5d	refactor: Optimize cognification speed	2025-10-10 17:12:09 +02:00
Igor Ilic	abfcbc69d6	refactor: Have embedding calls run in async gather	2025-10-10 15:36:36 +02:00
hajdul88	faeca138d9	fix: fixes distributed pipeline (#1454) ## Description This PR fixes the distributed pipeline and updates core changes in the distributed logic. ## Type of Change - [x] Bug fix (non-breaking change that fixes an issue) - [x] New feature (non-breaking change that adds functionality) - [x] Code refactoring - [x] Performance improvement ## Changes Made Fixes distributed pipeline: - Changed spawning logic and added incremental loading to run_tasks_distributed - Adds batching to consumer nodes - Fixes consumer stopping criteria by adding a stop signal and handling it - Changed edge embedding solution to avoid huge network load in a multi-container environment ## Testing Tested by running 1GB on Modal and manually ## Pre-submission Checklist - [x] I have tested my changes thoroughly before submitting this PR - [x] This PR contains minimal changes necessary to address the issue/feature --------- Co-authored-by: Boris <boris@topoteretes.com> Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>	2025-10-09 14:06:25 +02:00
Igor Ilic	38cdacbcb6	fix: Resolve issue with Gemini adapter (#1494) ## Description Resolve Gemini adapter issues: 1. Resolve the embedding batch issue. 2. Resolve slowness caused by the Gemini tokenizer sending text word by word to Google's API to count tokens (OpenAI's local tokenizer is now used to count tokens for Gemini). 3. Update the deprecated library and move to instructor. ## Type of Change - [x] Bug fix (non-breaking change that fixes an issue)	2025-10-07 18:04:18 +02:00
hajdul88	1d63da7923	chore: removes duplicated func def	2025-08-18 13:26:45 +02:00
hajdul88	fbb7d72461	fix: ruff formatting	2025-08-18 13:24:14 +02:00
hajdul88	dc637f70b0	fix: fixes add datapoints params	2025-08-18 13:23:57 +02:00
hajdul88	d53ebb2164	Merge branch 'dev' into feature/cog-2734-cognee-feedbacks-interactions-poc-to-prod	2025-08-18 13:17:13 +02:00
hajdul88	711c805c83	feat: adds cognee-user interactions to search	2025-08-18 13:14:06 +02:00
hajdul88	affbc557d2	chore: ruff formatting	2025-08-14 14:17:35 +02:00
hajdul88	63d071f0d8	feat: adds input checks for add datapoints and summarization tasks	2025-08-14 14:17:13 +02:00
hajdul88	32996aa0d0	feat: adds new error classes to llm and databases + introduces loglevel and logging from child error	2025-08-13 13:40:50 +02:00
Igor Ilic	a75a79f012	Lancedb async lock (#1222) --------- Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>	2025-08-12 08:46:15 -04:00
Igor Ilic	8d4ed35cbe	Fix low level pipeline (#1203)	2025-08-05 17:01:48 +02:00
Igor Ilic	5b6e946c43	fix: Add async lock for dynamic vector table creation (#1175) ## Description Add async lock for dynamic table creation	2025-08-01 15:12:04 +02:00
hajdul88	f78af0cec3	feature: solve edge embedding duplicates in edge collection + retriever optimization (#1151) --------- Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>	2025-07-29 12:35:38 +02:00
Boris Arzentar	90ffd513f9	Merge remote-tracking branch 'origin/dev' into feat/modal-parallelization	2025-04-22 14:30:08 +02:00
Daniel Molnar	9ba12b25ef	feat: add delete by document (#668) ## Description Delete by document.	2025-04-17 15:42:10 +02:00
Boris Arzentar	99ff4d73e6	intermediate commit	2025-04-17 15:14:05 +02:00
hajdul88	f13607cf18	fix: Index graph edges embedding error (#750) ## Description Fixes the embedding error for index graph edges	2025-04-17 12:48:27 +02:00
Igor Ilic	a036787ad1	Embedding string fix [COG-1900] (#742) ## Description Allow embedding of big strings to support full row embedding in SQL databases	2025-04-16 22:39:06 +02:00
Boris	9536395468	Revert "feat: pipeline tasks needs mapping" (#717) Reverts topoteretes/cognee#690	2025-04-10 12:10:12 +02:00
Boris	0ce6fad24a	feat: pipeline tasks needs mapping (#690)	2025-04-03 10:52:59 +02:00
Igor Ilic	9f587a01a4	feat: Relational db to graph db [COG-1468] (#644) ## Description Add ability to migrate relational database to graph database	2025-03-26 11:40:06 +01:00
Daniel Molnar	73db1a5a53	fix: human readable logs (#658) ## Description Introducing structlog.	2025-03-25 11:54:40 +01:00
Boris	8f84713b54	fix: support structured data conversion to data points (#512) ## Summary by CodeRabbit - New Features - Introduced version tracking and enhanced metadata in core data models for improved data consistency. - Bug Fixes - Improved error handling during graph data loading to prevent disruptions from unexpected identifier formats. - Refactor - Centralized identifier parsing and streamlined model definitions, ensuring smoother and more consistent operations across search, retrieval, and indexing workflows.	2025-02-10 17:16:13 +01:00
Igor Ilic	f3e296b171	chore: Ruff format fix (#516) ## Summary by CodeRabbit - Style - Made minor internal formatting adjustments to improve consistency. These changes do not affect any visible functionality for end-users.	2025-02-10 06:40:16 -05:00
Boris Arzentar	95ee62ad79	fix: extract index and field name from indexable data points	2025-02-08 15:21:59 +01:00
Boris	f75e35c337	fix: custom model pipeline (#508) ## Summary by CodeRabbit - New Features • Graph visualizations now allow exporting to a user-specified file path for more flexible output management. • The text embedding process has been enhanced with an additional tokenizer option for improved performance. • A new `ExtendableDataPoint` class has been introduced for future extensions. • New JSON files for companies and individuals have been added to facilitate testing and data processing. - Improvements • Search functionality now uses updated identifiers for more reliable content retrieval. • Metadata handling has been streamlined across various classes by removing unnecessary type specifications. • Enhanced serialization of properties in the Neo4j adapter for improved handling of complex structures. • The setup process for databases has been improved with a new asynchronous setup function. - Chores • Dependency and configuration updates improve overall stability and performance.	2025-02-08 02:00:15 +01:00
alekszievr	5119992fd8	feat: Add graph metrics getter in graph db interface and adapters [COG-1082] (#483) Dummy implementation of graph metrics to demonstrate what the interface will look like ## Summary by CodeRabbit - New Features - Introduced asynchronous functionality for retrieving comprehensive graph metrics, including counts and connectivity details, across different systems. - Refactor - Streamlined metrics processing and storage by shifting to direct retrieval from the graph engine. - Updated naming conventions for the `GraphMetrics` database table and reorganized module imports to enhance internal consistency. - Chores - Removed dataset deletion functionalities while introducing the ability to store descriptive metrics. --------- Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>	2025-02-03 15:25:04 +01:00
hajdul88	f843c256e4	feat: Use unwind for batch edge save and add unit tests for get_graph_from_model * feat: adds some unit tests for get_graph_from_model * feat: updates neo4j add_edges cypher and deletes shallow get_graph_from_model * fix: fixing merge conflict false resolve * chore: deletes old only_root unit test	2025-01-31 13:14:04 +01:00
alekszievr	a79f7133fd	Feat: add number of tokens and descriptive graph metrics to metric table [COG-1132] (#481) * Count the number of tokens in documents * save token count to relational db * Add metrics to metric table * Store list as json instead of array in relational db table * Sum in sql instead of python * Unify naming * Return data_points in descriptive metric calculation task --------- Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>	2025-01-30 12:39:14 +01:00
vasilije	41b1486cff	Fix visualization	2025-01-08 13:13:52 +01:00
hajdul88	18c8bc3c33	Merge branch 'dev' into COG-adding_html_graph_render	2025-01-08 10:44:11 +01:00
hajdul88	bd644a1434	fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints	2025-01-07 13:33:05 +01:00
vasilije	60c8fd103b	ruff format	2025-01-05 19:09:08 +01:00
lxobr	262deee26e	Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code summarization (#387) * feat: update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <44192193+alekszievr@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2024-12-26 13:53:38 +01:00
alekszievr	bfa0f06fb4	Add type to DataPoint metadata (#364) * Add type to DataPoint metadata * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere	2024-12-16 16:27:03 +01:00
Boris	348610e73c	fix: refactor get_graph_from_model to return nodes and edges correctly (#257) * fix: handle rate limit error coming from llm model * fix: fixes lost edges and nodes in get_graph_from_model * fix: fixes database pruning issue in pgvector (#261) * fix: cognee_demo notebook pipeline is not saving summaries --------- Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>	2024-12-06 12:52:01 +01:00
hajdul88	c20ee11e80	feat: implements graph edge indexing	2024-12-04 15:37:48 +01:00
Boris Arzentar	0b8b270933	fix: make get_embeddable_data static	2024-12-03 21:47:23 +01:00
Boris Arzentar	27416afed0	fix: lancedb batch merge	2024-12-03 21:13:50 +01:00
Boris Arzentar	11acabdb6a	fix: remove duplicate nodes and edges before saving; Fix FalkorDB vector index;	2024-12-02 10:10:18 +01:00
Boris	6403d15a76	fix: enable falkordb and add test for it (#31)	2024-11-27 22:55:30 +01:00
Boris	64b8aac86f	feat: code graph swe integration Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com> Co-authored-by: hande-k <handekafkas7@gmail.com> Co-authored-by: Igor Ilic <igorilic03@gmail.com> Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com> Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>	2024-11-27 09:32:29 +01:00
Igor Ilic	15b7b8ef2b	fix: Resolve issue with table names in SQL commands Some SQL commands require lowercase characters in table names unless table name is wrapped in quotes. Renamed all new tables to use lowercase Fix COG-677	2024-11-20 14:54:35 +01:00
Boris	52180eb6b5	feat: COG-184 add falkordb (#192) * feat: add falkordb adapter --------- Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>	2024-11-11 18:20:52 +01:00

51 commits