cognee

Author	SHA1	Message	Date
Igor Ilic	59f8d12fa3	Merge branch 'main' into merge-main-vol7	2025-12-11 19:11:24 +01:00
EricXiao	4c609d6074	Merge branch 'dev' into feat/csv-ingestion Signed-off-by: EricXiao <taoiaox@gmail.com>	2025-11-14 14:46:11 +08:00
martin0731	3acb581bd0	Removed check_permissions_on_dataset.py and related references	2025-11-13 08:31:15 -05:00
Daulet Amirkhanov	3e2dbd1846	Update deprecated Exception status codes	2025-10-22 17:38:41 +01:00
EricXiao	742866b4c9	feat: csv ingestion loader & chunk Signed-off-by: EricXiao <taoiaox@gmail.com>	2025-10-22 16:56:46 +08:00
hajdul88	a7d7e12d4c	ruff fix	2025-08-14 14:48:35 +02:00
hajdul88	df3a3df117	feat: adds errors to classify, and chunking top level	2025-08-14 13:12:08 +02:00
hajdul88	c99b453d96	feat: adds WrongDataDocumentError to classify documents	2025-08-14 10:57:16 +02:00
Igor Ilic	14ba3e8829	feat: Enable async execution of data items for incremental loading (#1092) Description: Attempt at making incremental loading run async. DCO Affirmation: I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.	2025-07-29 10:39:31 -04:00
Boris	46c4463cb2	feat: s3 storage (#988) Co-authored-by: vasilije <vas.markovic@gmail.com> Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>	2025-07-14 21:47:08 +02:00
hajdul88	3c3c89a140	fix: Adds graceful handling quick fix for damaged pdf files (#1047)	2025-07-06 13:09:42 +02:00
Boris	e7644f4b3a	feat: migrate new UI to cognee (#966) Co-authored-by: Igor Ilic <igorilic03@gmail.com>	2025-06-18 20:56:44 +02:00
hajdul88	21a4217301	Feature: Makes s3 pathway imports optional so cognee can run without s3fs (#978)	2025-06-13 08:53:30 +02:00
Igor Ilic	1ed6cfd918	feat: new Dataset permissions (#869) Co-authored-by: Boris Arzentar <borisarzentar@gmail.com> Co-authored-by: Boris <boris@topoteretes.com>	2025-06-06 14:20:57 +02:00
Daniel Molnar	bb68d6a0df	Docstring tasks. (#878)	2025-05-27 21:33:16 +02:00
Vasilije	bb7eaa017b	feat: Group DataPoints into NodeSets (#680) Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com> Co-authored-by: Boris <boris@topoteretes.com> Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>	2025-04-19 20:21:04 +02:00
Boris	ebf1f81b35	fix: code cleanup [COG-781] (#667)	2025-03-26 18:32:43 +01:00
alekszievr	c1f7b667d1	feat: Eliminate the use of max_chunk_tokens and use a unified max_chunk_size instead [cog-1381] (#626) Summary by CodeRabbit. Refactor: Simplified text processing by unifying multiple size-related parameters into a single metric across chunking and extraction functionalities; streamlined logic for text segmentation by removing redundant calculations and checks, resulting in a more consistent chunk management process. Chores: Removed the `modal` package as a dependency. Documentation: Updated the README.md to include a new demo video link and clarified default environment variable settings; enhanced the CONTRIBUTING.md to improve clarity and engagement for potential contributors. Bug Fixes: Improved handling of sentence-ending punctuation in text processing to include additional characters. Version Update: Updated project version to 0.1.33 in the pyproject.toml file.	2025-03-12 14:03:41 +01:00
alekszievr	a61df966c6	feat: use external chunker [cog-1354] (#551) Summary by CodeRabbit. New Features: Introduced a modular content chunking interface that offers flexible text segmentation with configurable chunk size and overlap; added new chunkers for enhanced text processing, including `LangchainChunker` and improved `TextChunker`. Refactor: Unified the chunk extraction mechanism across various document types for improved consistency and type safety; updated method signatures to enhance clarity and type safety regarding chunker usage; enhanced error handling and logging during text segmentation to guide adjustments when content exceeds limits. Bug Fixes: Adjusted expected output in tests to reflect changes in chunking logic and configurations.	2025-02-21 14:10:59 +01:00
alekszievr	2a167fa1ab	feat: externalize chunkers [cog-1354] (#547) Summary by CodeRabbit. New Features: Enhanced document chunk extraction for improved processing consistency across multiple formats. Refactor: Streamlined the configuration for text chunking by replacing indirect mappings with a direct instantiation approach across document types; updated method signatures across various document classes to accept chunker class references instead of string identifiers. Chores: Removed legacy configuration utilities related to document chunking to simplify processing. Co-authored-by: Boris <boris@topoteretes.com>	2025-02-19 13:26:11 +01:00
alekszievr	edae2771a5	Count the number of tokens in documents [COG-1071] (#476) * Count the number of tokens in documents * save token count to relational db --------- Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>	2025-01-29 11:29:09 +01:00
Igor Ilic	3db7f85c9c	feat: Add max_chunk_tokens value to chunkers Add formula and forwarding of max_chunk_tokens value through Cognee	2025-01-28 14:32:00 +01:00
Igor Ilic	6d5679f9d2	Merge branch 'dev' into COG-970-refactor-tokenizing	2025-01-23 18:14:49 +01:00
Igor Ilic	40c0279ec5	Merge branch 'COG-793-metadata-rework' of github.com:topoteretes/cognee into COG-793-metadata-rework	2025-01-22 16:13:11 +01:00
Igor Ilic	80e67b0619	refactor: Rename foreign to external metadata Rename foreign metadata to external metadata for metadata coming outside of Cognee	2025-01-22 16:07:35 +01:00
Igor Ilic	93249c72c5	fix: Initial commit to resolve issue with using tokenizer based on LLMs Currently TikToken is used for tokenizing by default which is only supported by OpenAI, this is an initial commit in an attempt to add Cognee tokenizing support for multiple LLMs	2025-01-21 19:53:22 +01:00
Igor Ilic	655ab0b8cc	Merge branch 'dev' into COG-793-metadata-rework	2025-01-21 18:20:49 +01:00
Igor Ilic	ab8d95cc30	refactor: As neo4j can't support dictionaries, add foreign metadata as string	2025-01-20 17:28:14 +01:00
Igor Ilic	49ad292592	refactor: Reduce complexity of metadata handling Have foreign metadata be a table column in data instead of its own table to reduce complexity Refactor COG-793	2025-01-20 16:39:05 +01:00
hande-k	2c351c499d	add docstrings and typing to cognee tasks	2025-01-17 10:30:34 +01:00
Rita Aleksziev	5635da6e38	Adjust unit tests	2025-01-09 10:53:03 +01:00
Rita Aleksziev	34a9267f41	Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.	2025-01-08 13:23:17 +01:00
alekszievr	4802567871	Overcome ContextWindowExceededError by checking token count while chunking (#413)	2025-01-07 11:46:46 +01:00
vasilije	60c8fd103b	ruff format	2025-01-05 19:09:08 +01:00
hajdul88	9e7ab6492a	feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task	2024-12-17 11:31:31 +01:00
Igor Ilic	62db3f8598	feat: Remove the need for libmagic for unstructured documents Remove the need for libmagic for unstructured documents by providing mime_type information Feature COG-685	2024-12-08 14:37:50 +01:00
Igor Ilic	78214456a6	feat: Add unstructured document handler Added unstructured library and handling of certain document types through their library Feature COG-685	2024-12-06 17:50:22 +01:00
Boris	348610e73c	fix: refactor get_graph_from_model to return nodes and edges correctly (#257) * fix: handle rate limit error coming from llm model * fix: fixes lost edges and nodes in get_graph_from_model * fix: fixes database pruning issue in pgvector (#261) * fix: cognee_demo notebook pipeline is not saving summaries --------- Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>	2024-12-06 12:52:01 +01:00
Leon Luithlen	15802237e9	Get metadata from metadata table	2024-11-28 09:18:49 +01:00
Leon Luithlen	cd0e505ac0	WIP	2024-11-28 09:18:49 +01:00
Leon Luithlen	7324564655	Add metadata_id attribute to Document and DocumentChunk, make ingest_with_metadata default	2024-11-28 09:18:49 +01:00
Boris	64b8aac86f	feat: code graph swe integration Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com> Co-authored-by: hande-k <handekafkas7@gmail.com> Co-authored-by: Igor Ilic <igorilic03@gmail.com> Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com> Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>	2024-11-27 09:32:29 +01:00
Leon Luithlen	cd80525420	Revert to EXTENSION_TO_DOCUMENT_CLASS implementation of classify_documents	2024-11-13 14:32:10 +01:00
Leon Luithlen	826de0edbf	Remove orphan dictionary	2024-11-12 16:47:28 +01:00
Leon Luithlen	83995fa548	Try old version of classify_documents	2024-11-12 16:47:28 +01:00
Leon Luithlen	8107709e98	Remove duplicate pdf key	2024-11-12 16:47:28 +01:00
Leon Luithlen	66fb2948f8	Small cleanup pull request	2024-11-12 15:37:03 +01:00
Boris	52180eb6b5	feat: COG-184 add falkordb (#192) * feat: add falkordb adapter --------- Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>	2024-11-11 18:20:52 +01:00

48 commits