cognee

Author	SHA1	Message	Date
hajdul88	7510c6f572	Fix: fixes None last_cut_type pydantic errors by introducing default cut type	2025-07-02 09:51:07 +02:00
Daniel Molnar	bb68d6a0df	Docstring tasks. (#878) ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.	2025-05-27 21:33:16 +02:00
alekszievr	c1f7b667d1	feat: Eliminate the use of max_chunk_tokens and use a unified max_chunk_size instead [cog-1381] (#626) ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin ## Summary by CodeRabbit - Refactor - Simplified text processing by unifying multiple size-related parameters into a single metric across chunking and extraction functionalities. - Streamlined logic for text segmentation by removing redundant calculations and checks, resulting in a more consistent chunk management process. - Chores - Removed the `modal` package as a dependency. - Documentation - Updated the README.md to include a new demo video link and clarified default environment variable settings. - Enhanced the CONTRIBUTING.md to improve clarity and engagement for potential contributors. - Bug Fixes - Improved handling of sentence-ending punctuation in text processing to include additional characters. - Version Update - Updated project version to 0.1.33 in the pyproject.toml file.	2025-03-12 14:03:41 +01:00
Daniel Molnar	d27f847753	Transition to new retrievers, update searches (#585) ## Description Delete legacy search implementations after migrating to new retriever classes ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin ## Summary by CodeRabbit - New Features - Enhanced search and retrieval capabilities, providing improved context resolution for code queries, completions, summaries, and graph connections. - Refactor - Shifted to a modular, object-oriented approach that consolidates query logic and streamlines error management for a more robust and scalable experience. - Bug Fixes - Improved error handling for unsupported search types and retrieval operations.	2025-02-27 15:25:24 +01:00
lxobr	9cc357ac1c	Feat/cog 1365 unify retrievers (#572) ## Description - Created the `BaseRetriever` class to unify all the retrievers and searches. - Implemented seven specialized retrievers (summaries, chunks, completions, graph, graph-summary, insights, code) with consistent get_context/get_completion interfaces. - Added json context dumping feature in the current completion implementations to enable context comparisons. - Built a comparison framework to validate old vs new implementations. ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin ## Summary by CodeRabbit - New Features - Introduced multiple retrieval classes for enhanced search capabilities, including `BaseRetriever`, `ChunksRetriever`, `CodeRetriever`, `CompletionRetriever`, `GraphCompletionRetriever`, `GraphSummaryCompletionRetriever`, `InsightsRetriever`, and `SummariesRetriever`. - Enhanced query completions with optional context saving for improved data persistence. - Implemented advanced tools to compare retrieval outcomes across different implementations. - Refactor - Streamlined internal module organization and updated references for increased maintainability and consistency. - Added comments indicating future maintenance tasks related to code merging. --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-02-27 12:13:21 +01:00
Boris	f75e35c337	fix: custom model pipeline (#508) ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin ## Summary by CodeRabbit - New Features • Graph visualizations now allow exporting to a user-specified file path for more flexible output management. • The text embedding process has been enhanced with an additional tokenizer option for improved performance. • A new `ExtendableDataPoint` class has been introduced for future extensions. • New JSON files for companies and individuals have been added to facilitate testing and data processing. - Improvements • Search functionality now uses updated identifiers for more reliable content retrieval. • Metadata handling has been streamlined across various classes by removing unnecessary type specifications. • Enhanced serialization of properties in the Neo4j adapter for improved handling of complex structures. • The setup process for databases has been improved with a new asynchronous setup function. - Chores • Dependency and configuration updates improve overall stability and performance.	2025-02-08 02:00:15 +01:00
Igor Ilic	8879f3fbbe	feat: Add gemini support [COG-1023] (#485) ## Description PR to test Gemini PR from holchan 1. Add Gemini LLM and Gemini Embedding support 2. Fix CodeGraph issue with chunks being bigger than maximum token value 3. Add Tokenizer adapters to CodeGraph ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin ## Summary by CodeRabbit - New Features - Added support for the Gemini LLM provider. - Expanded LLM configuration options. - Introduced a new GitHub Actions workflow for multimetric QA evaluation. - Added new environment variables for LLM and embedding configurations across various workflows. - Bug Fixes - Improved error handling in various components. - Updated tokenization and embedding processes. - Removed warning related to missing `dict` method in data items. - Refactor - Simplified token extraction and decoding methods. - Updated tokenizer interfaces. - Removed deprecated dependencies. - Enhanced retry logic and error handling in embedding processes. - Documentation - Updated configuration comments and settings. - Chores - Updated GitHub Actions workflows to accommodate new secrets and environment variables. - Modified evaluation parameters. - Adjusted dependency management for optional libraries. --------- Co-authored-by: holchan <61059652+holchan@users.noreply.github.com> Co-authored-by: Boris <boris@topoteretes.com>	2025-01-31 18:03:23 +01:00
Igor Ilic	3db7f85c9c	feat: Add max_chunk_tokens value to chunkers Add formula and forwarding of max_chunk_tokens value through Cognee	2025-01-28 14:32:00 +01:00
Igor Ilic	0a9f1349f2	refactor: Change variable and function names based on PR comments	2025-01-28 10:10:29 +01:00
Igor Ilic	844d99cb72	docs: Remove commented code	2025-01-23 18:24:26 +01:00
Igor Ilic	6d5679f9d2	Merge branch 'dev' into COG-970-refactor-tokenizing	2025-01-23 18:14:49 +01:00
Igor Ilic	93249c72c5	fix: Initial commit to resolve issue with using tokenizer based on LLMs Currently TikToken is used for tokenizing by default which is only supported by OpenAI, this is an initial commit in an attempt to add Cognee tokenizing support for multiple LLMs	2025-01-21 19:53:22 +01:00
hande-k	2c351c499d	add docstrings and typing to cognee tasks	2025-01-17 10:30:34 +01:00
Rita Aleksziev	5635da6e38	Adjust unit tests	2025-01-09 10:53:03 +01:00
Rita Aleksziev	97814e334f	Get embedding engine instead of passing it in code chunking.	2025-01-08 13:45:04 +01:00
Rita Aleksziev	34a9267f41	Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.	2025-01-08 13:23:17 +01:00
Rita Aleksziev	fb13a1b61a	Handle azure models as well	2025-01-07 15:00:58 +01:00
Rita Aleksziev	a774191ed3	Adjust AudioDocument and handle None token limit	2025-01-07 13:38:23 +01:00
alekszievr	4802567871	Overcome ContextWindowExceededError by checking token count while chunking (#413)	2025-01-07 11:46:46 +01:00
vasilije	60c8fd103b	ruff format	2025-01-05 19:09:08 +01:00
Igor Ilic	15b7b8ef2b	fix: Resolve issue with table names in SQL commands Some SQL commands require lowercase characters in table names unless table name is wrapped in quotes. Renamed all new tables to use lowercase Fix COG-677	2024-11-20 14:54:35 +01:00
Leon Luithlen	e40e7386a0	Refactor word_type yielding in chunk_by_sentence	2024-11-14 17:16:04 +01:00
Leon Luithlen	84c98f16bb	Remove chunk_index attribute from chunk_by_sentence return value	2024-11-14 16:49:13 +01:00
Leon Luithlen	15420dd864	Fix paragraph_ids handling	2024-11-14 16:47:51 +01:00
Leon Luithlen	d6a6a9eaba	Return sentence_cut instead of word in chunk_by_paragraph	2024-11-14 15:03:09 +01:00
0xideas	8b681529b1	Update cognee/tasks/chunks/chunk_by_paragraph.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2024-11-14 14:42:15 +01:00
Leon Luithlen	73f24f9e4d	Fix sentence_cut return value in inappropriate places	2024-11-14 14:40:42 +01:00
Leon Luithlen	eaf9167fa1	Change chunk_by_word to collect newlines in prior words	2024-11-14 14:19:34 +01:00
Leon Luithlen	57d8149732	Save paragraph_ids in chunk_by_paragraph	2024-11-14 13:59:54 +01:00
Leon Luithlen	6721eaee83	Fix chunk_index bug in chunk_by_paragraph	2024-11-14 13:50:40 +01:00
0xideas	f2206a09c0	Update cognee/tasks/chunks/chunk_by_word.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2024-11-14 13:16:17 +01:00
Leon Luithlen	d90698305b	Simplify chunk_by_word	2024-11-14 09:43:10 +01:00
Leon Luithlen	45a60b7f19	Remove assert and move is_real_paragraph_end outside loop	2024-11-13 16:35:47 +01:00
Leon Luithlen	b787407db7	Add more adversarial examples	2024-11-13 16:23:14 +01:00
Leon Luithlen	9ea2634480	Replace word_count with maximum_length in if clause	2024-11-13 15:53:44 +01:00
Leon Luithlen	9b2fb09c59	Fix PdfDocument test, give chunk_by_sentence a maximum_length arg	2024-11-13 15:39:17 +01:00
Leon Luithlen	f8e5b529c3	Add maximum_length argument to chunk_sentences	2024-11-13 15:35:03 +01:00
Leon Luithlen	ce498d97dd	Refactor chunk_by_paragraph to be isomorphic	2024-11-13 15:35:03 +01:00
Leon Luithlen	ab55a73d18	Adapt chunk_by_sentence to isomorphic chunk_by_word	2024-11-13 15:35:03 +01:00
Leon Luithlen	c054e897a3	Make chunk_by_word isomorphic	2024-11-13 15:35:03 +01:00
Leon Luithlen	6f0637a028	Small cosmetic changes	2024-11-13 15:35:02 +01:00
Leon Luithlen	fbd011560a	Rebase onto main	2024-11-12 16:47:28 +01:00
Leon Luithlen	d7ffef1979	Remove old __tests__ folders	2024-11-12 16:47:28 +01:00
Leon Luithlen	86e726d741	Complete migrating unit tests	2024-11-12 16:47:28 +01:00
Boris	52180eb6b5	feat: COG-184 add falkordb (#192) * feat: add falkordb adapter --------- Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>	2024-11-11 18:20:52 +01:00

45 commits