Commit graph

45 commits

Author SHA1 Message Date
hajdul88
7510c6f572 Fix: fixes None last_cut_type pydantic errors by introducing default cut type 2025-07-02 09:51:07 +02:00
Daniel Molnar
bb68d6a0df
Docstring tasks. (#878)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-05-27 21:33:16 +02:00
alekszievr
c1f7b667d1
feat: Eliminate the use of max_chunk_tokens and use a unified max_chunk_size instead [cog-1381] (#626)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Simplified text processing by unifying multiple size-related
parameters into a single metric across chunking and extraction
functionalities.
- Streamlined logic for text segmentation by removing redundant
calculations and checks, resulting in a more consistent chunk management
process.
- **Chores**
  - Removed the `modal` package as a dependency.
- **Documentation**
- Updated the README.md to include a new demo video link and clarified
default environment variable settings.
- Enhanced the CONTRIBUTING.md to improve clarity and engagement for
potential contributors.
- **Bug Fixes**
- Improved handling of sentence-ending punctuation in text processing to
include additional characters.
- **Version Update**
  - Updated project version to 0.1.33 in the pyproject.toml file.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-03-12 14:03:41 +01:00
Daniel Molnar
d27f847753
Transition to new retrievers, update searches (#585)
<!-- .github/pull_request_template.md -->

## Description
Delete legacy search implementations after migrating to new retriever
classes

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced search and retrieval capabilities, providing improved context
resolution for code queries, completions, summaries, and graph
connections.
  
- **Refactor**
- Shifted to a modular, object-oriented approach that consolidates query
logic and streamlines error management for a more robust and scalable
experience.
  
- **Bug Fixes**
- Improved error handling for unsupported search types and retrieval
operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-02-27 15:25:24 +01:00
lxobr
9cc357ac1c
Feat/cog 1365 unify retrievers (#572)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->
- Created the `BaseRetriever` class to unify all the retrievers and
searches.
- Implemented seven specialized retrievers (summaries, chunks,
completions, graph, graph-summary, insights, code) with consistent
get_context/get_completion interfaces.
- Added json context dumping feature in the current completion
implementations to enable context comparisons.
- Built a comparison framework to validate old vs new implementations.
## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced multiple retrieval classes for enhanced search
capabilities, including `BaseRetriever`, `ChunksRetriever`,
`CodeRetriever`, `CompletionRetriever`, `GraphCompletionRetriever`,
`GraphSummaryCompletionRetriever`, `InsightsRetriever`, and
`SummariesRetriever`.
- Enhanced query completions with optional context saving for improved
data persistence.
- Implemented advanced tools to compare retrieval outcomes across
different implementations.

- **Refactor**
- Streamlined internal module organization and updated references for
increased maintainability and consistency.
- Added comments indicating future maintenance tasks related to code
merging.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-02-27 12:13:21 +01:00
Boris
f75e35c337
fix: custom model pipeline (#508)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
• Graph visualizations now allow exporting to a user-specified file path
for more flexible output management.
• The text embedding process has been enhanced with an additional
tokenizer option for improved performance.
• A new `ExtendableDataPoint` class has been introduced for future
extensions.
• New JSON files for companies and individuals have been added to
facilitate testing and data processing.

- **Improvements**
• Search functionality now uses updated identifiers for more reliable
content retrieval.
• Metadata handling has been streamlined across various classes by
removing unnecessary type specifications.
• Enhanced serialization of properties in the Neo4j adapter for improved
handling of complex structures.
• The setup process for databases has been improved with a new
asynchronous setup function.

- **Chores**
• Dependency and configuration updates improve overall stability and
performance.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-02-08 02:00:15 +01:00
Igor Ilic
8879f3fbbe
feat: Add gemini support [COG-1023] (#485)
<!-- .github/pull_request_template.md -->

## Description
PR to test Gemini PR from holchan

1. Add Gemini LLM and Gemini Embedding support 
2. Fix CodeGraph issue with chunks being bigger than maximum token value
3. Add Tokenizer adapters to CodeGraph

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
    - Added support for the Gemini LLM provider.
    - Expanded LLM configuration options.
- Introduced a new GitHub Actions workflow for multimetric QA
evaluation.
- Added new environment variables for LLM and embedding configurations
across various workflows.

- **Bug Fixes**
    - Improved error handling in various components.
    - Updated tokenization and embedding processes.
    - Removed warning related to missing `dict` method in data items.

- **Refactor**
    - Simplified token extraction and decoding methods.
    - Updated tokenizer interfaces.
    - Removed deprecated dependencies.
    - Enhanced retry logic and error handling in embedding processes.

- **Documentation**
    - Updated configuration comments and settings.

- **Chores**
- Updated GitHub Actions workflows to accommodate new secrets and
environment variables.
    - Modified evaluation parameters.
    - Adjusted dependency management for optional libraries.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: holchan <61059652+holchan@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
2025-01-31 18:03:23 +01:00
Igor Ilic
3db7f85c9c feat: Add max_chunk_tokens value to chunkers
Add formula and forwarding of max_chunk_tokens value through Cognee
2025-01-28 14:32:00 +01:00
Igor Ilic
0a9f1349f2 refactor: Change variable and function names based on PR comments
Change variable and function names based on PR comments
2025-01-28 10:10:29 +01:00
Igor Ilic
844d99cb72 docs: Remove commented code 2025-01-23 18:24:26 +01:00
Igor Ilic
6d5679f9d2 Merge branch 'dev' into COG-970-refactor-tokenizing 2025-01-23 18:14:49 +01:00
Igor Ilic
93249c72c5 fix: Initial commit to resolve issue with using tokenizer based on LLMs
Currently TikToken is used for tokenizing by default which is only supported by OpenAI,
this is an initial commit in an attempt to add Cognee tokenizing support for multiple LLMs
2025-01-21 19:53:22 +01:00
hande-k
2c351c499d add docstrings any typing to cognee tasks 2025-01-17 10:30:34 +01:00
Rita Aleksziev
5635da6e38 Adjust unit tests 2025-01-09 10:53:03 +01:00
Rita Aleksziev
97814e334f Get embedding engine instead of passing it in code chunking. 2025-01-08 13:45:04 +01:00
Rita Aleksziev
34a9267f41 Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. 2025-01-08 13:23:17 +01:00
Rita Aleksziev
fb13a1b61a Handle azure models as well 2025-01-07 15:00:58 +01:00
Rita Aleksziev
a774191ed3 Adjust AudioDocument and handle None token limit 2025-01-07 13:38:23 +01:00
alekszievr
4802567871
Overcome ContextWindowExceededError by checking token count while chunking (#413) 2025-01-07 11:46:46 +01:00
vasilije
60c8fd103b ruff format 2025-01-05 19:09:08 +01:00
Igor Ilic
15b7b8ef2b fix: Resolve issue with table names in SQL commands
Some SQL commands require lowercase characters in table names unless table name is wrapped in quotes. Renamed all new tables to use lowercase

Fix COG-677
2024-11-20 14:54:35 +01:00
Leon Luithlen
e40e7386a0 Refactor word_type yielding in chuck_by_sentence 2024-11-14 17:16:04 +01:00
Leon Luithlen
84c98f16bb Remove chunk_index attribute from chunk_by_sentence return value 2024-11-14 16:49:13 +01:00
Leon Luithlen
15420dd864 Fix paragraph_ids handling 2024-11-14 16:47:51 +01:00
Leon Luithlen
d6a6a9eaba Return sentence_cut instead of word in chunk_by_paragraph 2024-11-14 15:03:09 +01:00
0xideas
8b681529b1
Update cognee/tasks/chunks/chunk_by_paragraph.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2024-11-14 14:42:15 +01:00
Leon Luithlen
73f24f9e4d Fix sentence_cut return value in inappropriate places 2024-11-14 14:40:42 +01:00
Leon Luithlen
eaf9167fa1 Change chunk_by_word to collect newlines in prior words 2024-11-14 14:19:34 +01:00
Leon Luithlen
57d8149732 Save paragraph_ids in chunk_by_paragraph 2024-11-14 13:59:54 +01:00
Leon Luithlen
6721eaee83 Fix chunk_index bug in chunk_by_paragraph 2024-11-14 13:50:40 +01:00
0xideas
f2206a09c0
Update cognee/tasks/chunks/chunk_by_word.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2024-11-14 13:16:17 +01:00
Leon Luithlen
d90698305b Simplify chunk_by_word 2024-11-14 09:43:10 +01:00
Leon Luithlen
45a60b7f19 Remove assert and move is_real_paragraph_end outside loop 2024-11-13 16:35:47 +01:00
Leon Luithlen
b787407db7 Add more adversarial examples 2024-11-13 16:23:14 +01:00
Leon Luithlen
9ea2634480 Replace word_count with maximum_length in if clause 2024-11-13 15:53:44 +01:00
Leon Luithlen
9b2fb09c59 Fix PdfDocument teset, give chunk_by_sentence a maximum_length arg 2024-11-13 15:39:17 +01:00
Leon Luithlen
f8e5b529c3 Add maximum_length argument to chunk_sentences 2024-11-13 15:35:03 +01:00
Leon Luithlen
ce498d97dd Refactor chunk_by_paragraph to be isomorphic 2024-11-13 15:35:03 +01:00
Leon Luithlen
ab55a73d18 Adapt chunk_by_sentence to isomorphic chunk_by_word 2024-11-13 15:35:03 +01:00
Leon Luithlen
c054e897a3 Make chunk_by_word isomorphic 2024-11-13 15:35:03 +01:00
Leon Luithlen
6f0637a028 Small cosmetic changes 2024-11-13 15:35:02 +01:00
Leon Luithlen
fbd011560a Rebase onto main 2024-11-12 16:47:28 +01:00
Leon Luithlen
d7ffef1979 Remove old __tests__ folders 2024-11-12 16:47:28 +01:00
Leon Luithlen
86e726d741 Complete migrating unit tests 2024-11-12 16:47:28 +01:00
Boris
52180eb6b5
feat: COG-184 add falkordb (#192)
* feat: add falkordb adapter

---------

Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
2024-11-11 18:20:52 +01:00