Commit graph

140 commits

Author SHA1 Message Date
Igor Ilic
dbdf04c089
Data model migration (#1143)
<!-- .github/pull_request_template.md -->

## Description
Data model migration for new release

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-24 15:03:16 +02:00
Boris
d6727a1b4a
fix: UnstructuredDocument read method (#1141)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-24 13:23:27 +02:00
Boris
c5bd6bed40
fix: s3 file storage (#1095)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-16 20:36:18 +02:00
Boris
46c4463cb2
feat: s3 storage (#988)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: vasilije <vas.markovic@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
2025-07-14 21:47:08 +02:00
Igor Ilic
f68fd59b95
feat: Data size info tracking (#1088)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-14 19:03:58 +02:00
Boris Arzentar
fa5ea44345
Merge remote-tracking branch 'origin/dev' into feat/modal-parallelization 2025-07-06 21:03:10 +02:00
Boris Arzentar
685d282f5c
fix: add error handling 2025-07-06 21:03:02 +02:00
hajdul88
3c3c89a140
fix: Adds graceful handling quick fix for damaged pdf files (#1047)
<!-- .github/pull_request_template.md -->

## Description
fix: Adds graceful handling quick fix for damaged pdf files

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-07-06 13:09:42 +02:00
Boris Arzentar
4eba76ca1f
Merge remote-tracking branch 'origin/dev' into feat/modal-parallelization 2025-07-04 15:37:57 +02:00
Vasilije
ada3f7b086
fix: Logger suppresion and database logs (#1041)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
Co-authored-by: Igor Ilic <igorilic03@gmail.com>
2025-07-03 20:08:27 +02:00
Boris Arzentar
86bd3e4a5a
Merge remote-tracking branch 'origin/dev' into feat/modal-parallelization 2025-07-02 11:28:22 +02:00
Igor Ilic
0d75b6dc76 Merge branch 'main' into main-merge 2025-06-30 12:24:24 +02:00
Hashem Aldhaheri
fd77e92cc4
Fix: Handle file:// URLs in open_data_file function (#1019)
## Summary
This PR fixes an asymmetry issue where files saved with `file://`
prefixes could not be read back, causing "file not found" errors.

## Problem
The Cognee framework has a bug where:
- `save_data_to_file.py` adds `file://` prefix when saving files
- `open_data_file.py` doesn't handle the `file://` prefix when reading
files
- This causes saved files to appear as "lost" with cryptic "file not
found" errors

## Solution
Added proper handling for `file://` URLs in `open_data_file.py` by:
- Checking if the file path starts with `"file://"`
- Stripping the prefix using `replace("file://", "", 1)`
- Following the same pattern as S3 URL handling

## Changes
- Modified
`cognee/modules/data/processing/document_types/open_data_file.py` to
handle `file://` URLs
- Added comprehensive unit tests in
`cognee/tests/unit/modules/data/test_open_data_file.py`

## Testing
Added 6 test cases covering:
- Regular file paths (ensuring backward compatibility)
- file:// URLs in text mode
- file:// URLs in binary mode
- file:// URLs with specific encoding
- Nonexistent files with file:// URLs
- Edge case with multiple file:// prefixes

All tests pass successfully.

## Notes
- This is a minimal fix that maintains backward compatibility
- The fix follows the existing pattern used for S3 URL handling
- No breaking changes to the API

I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

Signed-off-by: Hashem Aldhaheri <aenawi@gmail.com>
2025-06-30 11:55:34 +02:00
Igor Ilic
14be2a5f5d
feat: Add dataset_id to pipeline run info and status (#1009)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-06-30 11:53:17 +02:00
hajdul88
21a4217301
Feature: Makes s3 pathway imports optional so cognee can run without s3fs (#978)
<!-- .github/pull_request_template.md -->

## Description
Makes s3 pathway imports optional so cognee can run without s3fs

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-06-13 08:53:30 +02:00
Igor Ilic
1ed6cfd918
feat: new Dataset permissions (#869)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>
Co-authored-by: Boris <boris@topoteretes.com>
2025-06-06 14:20:57 +02:00
Boris
0aac93e9c4
Merge dev to main (#827)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: vasilije <vas.markovic@gmail.com>
Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
Co-authored-by: Igor Ilic <igorilic03@gmail.com>
Co-authored-by: Hande <159312713+hande-k@users.noreply.github.com>
Co-authored-by: Matea Pesic <80577904+matea16@users.noreply.github.com>
Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
Co-authored-by: Daniel Molnar <soobrosa@gmail.com>
Co-authored-by: Diego Baptista Theuerkauf <34717973+diegoabt@users.noreply.github.com>
2025-05-15 13:15:49 +02:00
Igor Ilic
966e337500
feat: add MCP check status tool [COG-1784] (#793)
<!-- .github/pull_request_template.md -->

## Description
Added tools to check current cognify and codify status

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-05-13 12:09:14 -04:00
Boris Arzentar
a1e605ca97 fix: batch datapoints on save to limit bandwidth size 2025-05-12 11:28:13 +02:00
Igor Ilic
60da1c899e
fix: graph prompt path (#769)
<!-- .github/pull_request_template.md -->

## Description
Fix graph prompt path

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Evain Arthur <arthur.evain35@gmail.com>
Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
2025-04-23 12:03:51 +02:00
Vasilije
bb7eaa017b
feat: Group DataPoints into NodeSets (#680)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>
2025-04-19 20:21:04 +02:00
Boris
675b66175f
test: make search unit tests deterministic (#726)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Daniel Molnar <soobrosa@gmail.com>
2025-04-18 21:55:24 +02:00
Daniel Molnar
9ba12b25ef
feat: add delete by document (#668)
<!-- .github/pull_request_template.md -->

## Description
Delete by document.

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin
2025-04-17 15:42:10 +02:00
hajdul88
0121a2b5fc
feature: Adds S3 functionality (#731)
<!-- .github/pull_request_template.md -->

## Description
Adds S3 support


## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-04-17 08:56:40 +02:00
lxobr
8207dc8643
feat: make graph creation prompt configurable (#686)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->
- Added new graph creation prompts
- Exposed graph creation prompts in .cognify via get_default tasks
- Exposed graph creation prompts in eval framework
## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
2025-04-03 11:14:33 +02:00
Boris
ebf1f81b35
fix: code cleanup [COG-781] (#667)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin
2025-03-26 18:32:43 +01:00
Daniel Molnar
73db1a5a53
fix: human readable logs (#658)
<!-- .github/pull_request_template.md -->

## Description
Introducing scructlog.

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin
2025-03-25 11:54:40 +01:00
Boris
d192d1fe20
chore: remove unused dependencies and make some optional (#661)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin
2025-03-25 10:19:52 +01:00
Igor Ilic
7bf30f7373
fix: Cognee backend fixes (#659)
<!-- .github/pull_request_template.md -->

## Description
Cognee backend fixes

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Improved handling of `tenant_id` in JWT payload for enhanced type
safety.
- Unique identifier generation for datasets now considers the owner ID,
allowing for multiple users to share the same dataset name.

- **Bug Fixes**
- Disabled user role permissions in the permission check logic
temporarily during a rework.

- **Refactor**
  - Simplified dependencies by removing unnecessary model imports.
- Updated parameter name from `tenant` to `tenant_id` for clarity in JWT
creation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-03-20 21:51:35 +01:00
Igor Ilic
88ed411f03
feat: user authorization [COG-1189] (#593)
<!-- .github/pull_request_template.md -->

## Description
Added user authorization through JWT header, reworked user and relevant
RBAC models to accompany future User Permission system.

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
  - Introduced an automated workflow to validate server startup.
  - Added secure JWT token generation for improved session handling.
- Enabled a new structure for permission management with role and
tenant-based controls, including endpoints for creating roles, tenants,
and assigning permissions.
- Added methods for assigning default permissions to roles, tenants, and
users.
- Introduced new classes for managing default permissions for roles,
tenants, and users.

- **Refactor**
- Streamlined authentication and user management flows with enhanced
error handling.

- **Tests**
- Upgraded integration tests with improved database initialization and
data pruning for a more stable environment.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
2025-03-13 13:33:42 +01:00
alekszievr
c1f7b667d1
feat: Eliminate the use of max_chunk_tokens and use a unified max_chunk_size instead [cog-1381] (#626)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Simplified text processing by unifying multiple size-related
parameters into a single metric across chunking and extraction
functionalities.
- Streamlined logic for text segmentation by removing redundant
calculations and checks, resulting in a more consistent chunk management
process.
- **Chores**
  - Removed the `modal` package as a dependency.
- **Documentation**
- Updated the README.md to include a new demo video link and clarified
default environment variable settings.
- Enhanced the CONTRIBUTING.md to improve clarity and engagement for
potential contributors.
- **Bug Fixes**
- Improved handling of sentence-ending punctuation in text processing to
include additional characters.
- **Version Update**
  - Updated project version to 0.1.33 in the pyproject.toml file.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-03-12 14:03:41 +01:00
lxobr
f033f733b5
feat: entity brute force triplet search [COG-1325] (#589)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->
- Refactored `brute_force_triplet_search`, extracting memory projection.
- Built **TripletSearchContextProvider** (extends
**BaseContextProvider**) to create a single memory projection and
perform a triplet search for each entity.
- Refactored `entity_completion` into **EntityCompletionRetriever**
(extends **BaseRetriever**).
- Added **SummarizedTripletSearchContextProvider** (extends
**TripletSearchContextProvider**) for an alternative summarized output
format.
- Developed and tested an example showcasing both context providers,
comparing raw triplets, summaries, and standard search results.
## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced text summarization now delivers clearer, more concise
overviews of search results.
- Improved search performance with optimized context retrieval and
memory reuse for faster, more reliable results.
- Introduced advanced entity-based completion for generating more
relevant, context-aware responses.

- **Refactor**
- Streamlined internal workflows and error handling to ensure a smoother
overall experience.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Boris <boris@topoteretes.com>
2025-03-05 11:17:58 +01:00
alekszievr
6d7a68dbba
Feat: Store descriptive metrics identified by pipeline run id [cog-1260] (#582)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new analytic capability that calculates descriptive graph
metrics for pipeline runs when enabled.
- Updated the execution flow to include an option for activating the
graph metrics step.

- **Chores**
- Removed the previous mechanism for storing descriptive metrics to
streamline the system.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
2025-03-03 19:09:35 +01:00
alekszievr
a61df966c6
feat: use external chunker [cog-1354] (#551)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a modular content chunking interface that offers flexible
text segmentation with configurable chunk size and overlap.
- Added new chunkers for enhanced text processing, including
`LangchainChunker` and improved `TextChunker`.

- **Refactor**
- Unified the chunk extraction mechanism across various document types
for improved consistency and type safety.
- Updated method signatures to enhance clarity and type safety regarding
chunker usage.
- Enhanced error handling and logging during text segmentation to guide
adjustments when content exceeds limits.

- **Bug Fixes**
- Adjusted expected output in tests to reflect changes in chunking logic
and configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-02-21 14:10:59 +01:00
alekszievr
2a167fa1ab
feat: externalize chunkers [cog-1354] (#547)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced document chunk extraction for improved processing consistency
across multiple formats.

- **Refactor**
- Streamlined the configuration for text chunking by replacing indirect
mappings with a direct instantiation approach across document types.
- Updated method signatures across various document classes to accept
chunker class references instead of string identifiers.

- **Chores**
- Removed legacy configuration utilities related to document chunking to
simplify processing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Boris <boris@topoteretes.com>
2025-02-19 13:26:11 +01:00
hajdul88
6a0c0e3ef8
feat: Cognee evaluation framework development (#498)
<!-- .github/pull_request_template.md -->

This PR contains the evaluation framework development for cognee

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Expanded evaluation framework now integrates asynchronous corpus
building, question answering, and performance evaluation with adaptive
benchmarks for improved metrics (correctness, exact match, and F1
score).

- **Infrastructure**
- Added database integration for persistent storage of questions,
answers, and metrics.
- Launched an interactive metrics dashboard featuring advanced
visualizations.
- Introduced an automated testing workflow for continuous quality
assurance.

- **Documentation**
  - Updated guidelines for generating concise, clear answers.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-02-11 16:31:54 +01:00
Boris
f75e35c337
fix: custom model pipeline (#508)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
• Graph visualizations now allow exporting to a user-specified file path
for more flexible output management.
• The text embedding process has been enhanced with an additional
tokenizer option for improved performance.
• A new `ExtendableDataPoint` class has been introduced for future
extensions.
• New JSON files for companies and individuals have been added to
facilitate testing and data processing.

- **Improvements**
• Search functionality now uses updated identifiers for more reliable
content retrieval.
• Metadata handling has been streamlined across various classes by
removing unnecessary type specifications.
• Enhanced serialization of properties in the Neo4j adapter for improved
handling of complex structures.
• The setup process for databases has been improved with a new
asynchronous setup function.

- **Chores**
• Dependency and configuration updates improve overall stability and
performance.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-02-08 02:00:15 +01:00
alekszievr
8396fed9a1
feat: metrics in neo4j adapter [COG-1082] (#487)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced graph management capabilities allow users to verify graph
existence, project complete graphs, and remove graphs, delivering more
comprehensive graph insights.
  
- **Refactor**
  - Adjusted default task behavior for streamlined performance.
- Updated timestamp handling to ensure accurate and consistent record
tracking.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2025-02-07 15:58:43 +01:00
Igor Ilic
1260fc7db0
fix: Add reraising of general exception handling in cognee [COG-1062] (#490)
<!-- .github/pull_request_template.md -->

## Description
Add re-raising of errors in general exception handling 

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes & Stability Improvements**
- Enhanced error handling throughout the system, ensuring issues during
operations like server startup, data processing, and graph management
are properly logged and reported.

- **Refactor**
- Standardized logging practices replace basic output statements,
improving traceability and providing better insights for
troubleshooting.

- **New Features**
- Updated search functionality now returns only unique results,
enhancing data consistency and the overall user experience.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: holchan <61059652+holchan@users.noreply.github.com>
Co-authored-by: Boris <boris@topoteretes.com>
2025-02-04 10:51:05 +01:00
alekszievr
2858a674f5
feat: Calculate graph metrics for networkx graph [COG-1082] (#484)
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enabled an option to retrieve more detailed metrics, providing
comprehensive analytics for graph and descriptive data.

- **Refactor**
- Standardized the way metrics are obtained across components for
consistent behavior and improved data accuracy.
  
- **Chore**
- Made internal enhancements to support optional detailed metric
calculations, streamlining system performance and ensuring future
scalability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2025-02-03 18:05:53 +01:00
alekszievr
5119992fd8
feat: Add graph metrics getter in graph db interface and adapters [COG-1082] (#483)
Dummy implementation of graph metrics to demonstrate how the interface
will look like

<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced asynchronous functionality for retrieving comprehensive
graph metrics, including counts and connectivity details, across
different systems.
  
- **Refactor**
- Streamlined metrics processing and storage by shifting to direct
retrieval from the graph engine.
- Updated naming conventions for the `GraphMetrics` database table and
reorganized module imports to enhance internal consistency.
  
- **Chores**
- Removed dataset deletion functionalities while introducing the ability
to store descriptive metrics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2025-02-03 15:25:04 +01:00
alekszievr
a79f7133fd
Feat: add number of tokens and descriptive graph metrics to metric table [COG-1132] (#481)
* Count the number of tokens in documents

* save token count to relational db

* Add metrics to metric table

* Store list as json instead of array in relational db table

* Sum in sql instead of python

* Unify naming

* Return data_points in descriptive metric calculation task

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2025-01-30 12:39:14 +01:00
alekszievr
edae2771a5
Count the number of tokens in documents [COG-1071] (#476)
* Count the number of tokens in documents

* save token count to relational db

---------

Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com>
2025-01-29 11:29:09 +01:00
Igor Ilic
860218632f refactor: add suggestions from PR
Add suggestsions made by CodeRabbit on pull request
2025-01-28 17:15:25 +01:00
Igor Ilic
710ca78d6e
Merge branch 'dev' into COG-970-refactor-tokenizing 2025-01-28 16:31:11 +01:00
alekszievr
98f0f60980
Feat: [cog-1089] Define pydantic models for descriptive graph metrics and input metrics (#466)
* feat: make tasks a configurable argument in the cognify function

* fix: add data points task

* Define pydantic models for descriptive graph metrics and input metrics

* remove to_json method

* Use just one MetricData class instead of two

---------

Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com>
2025-01-28 16:11:31 +01:00
Igor Ilic
3db7f85c9c feat: Add max_chunk_tokens value to chunkers
Add formula and forwarding of max_chunk_tokens value through Cognee
2025-01-28 14:32:00 +01:00
Igor Ilic
6d5679f9d2 Merge branch 'dev' into COG-970-refactor-tokenizing 2025-01-23 18:14:49 +01:00
Igor Ilic
80e67b0619 refactor: Rename foreign to external metadata
Rename foreign metadata to external metadata for metadata coming outside of Cognee
2025-01-22 16:07:35 +01:00
Igor Ilic
93249c72c5 fix: Initial commit to resolve issue with using tokenizer based on LLMs
Currently TikToken is used for tokenizing by default which is only supported by OpenAI,
this is an initial commit in an attempt to add Cognee tokenizing support for multiple LLMs
2025-01-21 19:53:22 +01:00