Commit graph

48 commits

Author SHA1 Message Date
Igor Ilic
59f8d12fa3 Merge branch 'main' into merge-main-vol7 2025-12-11 19:11:24 +01:00
EricXiao
4c609d6074 Merge branch 'dev' into feat/csv-ingestion
Signed-off-by: EricXiao <taoiaox@gmail.com>
2025-11-14 14:46:11 +08:00
martin0731
3acb581bd0 Removed check_permissions_on_dataset.py and related references 2025-11-13 08:31:15 -05:00
Daulet Amirkhanov
3e2dbd1846 Update deprecated Exception status codes 2025-10-22 17:38:41 +01:00
EricXiao
742866b4c9 feat: csv ingestion loader & chunk
Signed-off-by: EricXiao <taoiaox@gmail.com>
2025-10-22 16:56:46 +08:00
hajdul88
a7d7e12d4c ruff fix 2025-08-14 14:48:35 +02:00
hajdul88
df3a3df117 feat: adds errors to classify, and chunking top level 2025-08-14 13:12:08 +02:00
hajdul88
c99b453d96 feat: adds WrongDataDocumentError to classify documents 2025-08-14 10:57:16 +02:00
Igor Ilic
14ba3e8829
feat: Enable async execution of data items for incremental loading (#1092)

## Description
Attempt at making incremental loading run async

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-29 10:39:31 -04:00
Boris
46c4463cb2
feat: s3 storage (#988)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: vasilije <vas.markovic@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
2025-07-14 21:47:08 +02:00
hajdul88
3c3c89a140
fix: Adds graceful handling quick fix for damaged pdf files (#1047)

## Description
fix: Adds graceful handling quick fix for damaged pdf files

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-06 13:09:42 +02:00
Boris
e7644f4b3a
feat: migrate new UI to cognee (#966)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Igor Ilic <igorilic03@gmail.com>
2025-06-18 20:56:44 +02:00
hajdul88
21a4217301
Feature: Makes s3 pathway imports optional so cognee can run without s3fs (#978)

## Description
Makes s3 pathway imports optional so cognee can run without s3fs

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-06-13 08:53:30 +02:00
Igor Ilic
1ed6cfd918
feat: new Dataset permissions (#869)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>
Co-authored-by: Boris <boris@topoteretes.com>
2025-06-06 14:20:57 +02:00
Daniel Molnar
bb68d6a0df
Docstring tasks. (#878)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-05-27 21:33:16 +02:00
Vasilije
bb7eaa017b
feat: Group DataPoints into NodeSets (#680)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>
2025-04-19 20:21:04 +02:00
Boris
ebf1f81b35
fix: code cleanup [COG-781] (#667)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin
2025-03-26 18:32:43 +01:00
alekszievr
c1f7b667d1
feat: Eliminate the use of max_chunk_tokens and use a unified max_chunk_size instead [cog-1381] (#626)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


## Summary by CodeRabbit

- **Refactor**
  - Simplified text processing by unifying multiple size-related parameters into a single metric across chunking and extraction functionalities.
  - Streamlined logic for text segmentation by removing redundant calculations and checks, resulting in a more consistent chunk management process.
- **Chores**
  - Removed the `modal` package as a dependency.
- **Documentation**
  - Updated the README.md to include a new demo video link and clarified default environment variable settings.
  - Enhanced the CONTRIBUTING.md to improve clarity and engagement for potential contributors.
- **Bug Fixes**
  - Improved handling of sentence-ending punctuation in text processing to include additional characters.
- **Version Update**
  - Updated project version to 0.1.33 in the pyproject.toml file.
2025-03-12 14:03:41 +01:00
alekszievr
a61df966c6
feat: use external chunker [cog-1354] (#551)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


## Summary by CodeRabbit

- **New Features**
  - Introduced a modular content chunking interface that offers flexible text segmentation with configurable chunk size and overlap.
  - Added new chunkers for enhanced text processing, including `LangchainChunker` and improved `TextChunker`.

- **Refactor**
  - Unified the chunk extraction mechanism across various document types for improved consistency and type safety.
  - Updated method signatures to enhance clarity and type safety regarding chunker usage.
  - Enhanced error handling and logging during text segmentation to guide adjustments when content exceeds limits.

- **Bug Fixes**
  - Adjusted expected output in tests to reflect changes in chunking logic and configurations.
2025-02-21 14:10:59 +01:00
alekszievr
2a167fa1ab
feat: externalize chunkers [cog-1354] (#547)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


## Summary by CodeRabbit

- **New Features**
  - Enhanced document chunk extraction for improved processing consistency across multiple formats.

- **Refactor**
  - Streamlined the configuration for text chunking by replacing indirect mappings with a direct instantiation approach across document types.
  - Updated method signatures across various document classes to accept chunker class references instead of string identifiers.

- **Chores**
  - Removed legacy configuration utilities related to document chunking to simplify processing.

---------

Co-authored-by: Boris <boris@topoteretes.com>
2025-02-19 13:26:11 +01:00
alekszievr
edae2771a5
Count the number of tokens in documents [COG-1071] (#476)
* Count the number of tokens in documents

* save token count to relational db

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2025-01-29 11:29:09 +01:00
Igor Ilic
3db7f85c9c feat: Add max_chunk_tokens value to chunkers
Add formula and forwarding of max_chunk_tokens value through Cognee
2025-01-28 14:32:00 +01:00
Igor Ilic
6d5679f9d2 Merge branch 'dev' into COG-970-refactor-tokenizing 2025-01-23 18:14:49 +01:00
Igor Ilic
40c0279ec5 Merge branch 'COG-793-metadata-rework' of github.com:topoteretes/cognee into COG-793-metadata-rework 2025-01-22 16:13:11 +01:00
Igor Ilic
80e67b0619 refactor: Rename foreign to external metadata
Rename foreign metadata to external metadata for metadata coming outside of Cognee
2025-01-22 16:07:35 +01:00
Igor Ilic
93249c72c5 fix: Initial commit to resolve issue with using tokenizer based on LLMs
Currently TikToken is used for tokenizing by default which is only supported by OpenAI,
this is an initial commit in an attempt to add Cognee tokenizing support for multiple LLMs
2025-01-21 19:53:22 +01:00
Igor Ilic
655ab0b8cc
Merge branch 'dev' into COG-793-metadata-rework 2025-01-21 18:20:49 +01:00
Igor Ilic
ab8d95cc30 refactor: As neo4j can't support dictionaries, add foreign metadata as string 2025-01-20 17:28:14 +01:00
Igor Ilic
49ad292592 refactor: Reduce complexity of metadata handling
Have foreign metadata be a table column in data instead of its own table to reduce complexity

Refactor COG-793
2025-01-20 16:39:05 +01:00
hande-k
2c351c499d add docstrings and typing to cognee tasks 2025-01-17 10:30:34 +01:00
Rita Aleksziev
5635da6e38 Adjust unit tests 2025-01-09 10:53:03 +01:00
Rita Aleksziev
34a9267f41 Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. 2025-01-08 13:23:17 +01:00
alekszievr
4802567871
Overcome ContextWindowExceededError by checking token count while chunking (#413) 2025-01-07 11:46:46 +01:00
vasilije
60c8fd103b ruff format 2025-01-05 19:09:08 +01:00
hajdul88
9e7ab6492a
feat: outsources chunking parameters to extract chunk from documents … (#289)
* feat: outsources chunking parameters to extract chunk from documents task
2024-12-17 11:31:31 +01:00
Igor Ilic
62db3f8598 feat: Remove the need for libmagic for unstructured documents
Remove the need for libmagic for unstructured documents by providing mime_type information

Feature COG-685
2024-12-08 14:37:50 +01:00
Igor Ilic
78214456a6 feat: Add unstructured document handler
Added unstructured library and handling of certain document types through their library

Feature COG-685
2024-12-06 17:50:22 +01:00
Boris
348610e73c
fix: refactor get_graph_from_model to return nodes and edges correctly (#257)
* fix: handle rate limit error coming from llm model

* fix: fixes lost edges and nodes in get_graph_from_model

* fix: fixes database pruning issue in pgvector (#261)

* fix: cognee_demo notebook pipeline is not saving summaries

---------

Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
2024-12-06 12:52:01 +01:00
Leon Luithlen
15802237e9 Get metadata from metadata table 2024-11-28 09:18:49 +01:00
Leon Luithlen
cd0e505ac0 WIP 2024-11-28 09:18:49 +01:00
Leon Luithlen
7324564655 Add metadata_id attribute to Document and DocumentChunk, make ingest_with_metadata default 2024-11-28 09:18:49 +01:00
Boris
64b8aac86f
feat: code graph swe integration
Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
Co-authored-by: hande-k <handekafkas7@gmail.com>
Co-authored-by: Igor Ilic <igorilic03@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2024-11-27 09:32:29 +01:00
Leon Luithlen
cd80525420 Revert to EXTENSION_TO_DOCUMENT_CLASS implementation of classify_documents 2024-11-13 14:32:10 +01:00
Leon Luithlen
826de0edbf Remove orphan dictionary 2024-11-12 16:47:28 +01:00
Leon Luithlen
83995fa548 Try old version of classify_documents 2024-11-12 16:47:28 +01:00
Leon Luithlen
8107709e98 Remove duplicate pdf key 2024-11-12 16:47:28 +01:00
Leon Luithlen
66fb2948f8 Small cleanup pull request 2024-11-12 15:37:03 +01:00
Boris
52180eb6b5
feat: COG-184 add falkordb (#192)
* feat: add falkordb adapter

---------

Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
2024-11-11 18:20:52 +01:00