Commit graph

75 commits

Author SHA1 Message Date
Vasilije
9d6081c7f7
feat: Add support for multiple audio and image formats (#12)
Added support for multiple audio and image formats with example

The added formats are the extension values that the filetype library can
return for audio and images

Feature COG-507
2024-11-23 16:31:55 +01:00
lxobr
a8aefd57ef
COG-546 get_local_script_dependencies (#6)
A utility function, `get_local_script_dependencies`:

- Extracts and resolves local dependencies of a Python script using
`jedi` and `parso`.
- Returns a sorted list of unique module paths
- Optionally filters out dependencies outside a specified repository
path
- Includes an example/checker in `cognee/tasks/code`.

Will be used for creating a graph from a repo.
2024-11-20 16:36:03 +01:00
Igor Ilic
15b7b8ef2b fix: Resolve issue with table names in SQL commands
Some SQL commands require lowercase characters in table names unless the table name is wrapped in quotes. Renamed all new tables to use lowercase names.

Fix COG-677
2024-11-20 14:54:35 +01:00
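To illustrate the kind of problem this fix addresses, here is a minimal sketch that assumes PostgreSQL-style identifier folding (the commit does not name the database). Unquoted identifiers are folded to lowercase, so a table created under a mixed-case name is only reachable when quoted. The helper names and the table names in the usage note are hypothetical.

```python
def fold_identifier(name: str) -> str:
    """Mimic PostgreSQL-style case folding: lowercase unless double-quoted."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        # Quoted identifiers keep their case; a doubled quote is an escaped quote.
        return name[1:-1].replace('""', '"')
    return name.lower()


def needs_quoting(table_name: str) -> bool:
    """True if the name changes under folding, i.e. it contains uppercase letters."""
    return fold_identifier(table_name) != table_name
```

With these helpers, `needs_quoting("DatasetData")` is true while `needs_quoting("dataset_data")` is false, which is why renaming tables to lowercase avoids the need for quotes in every statement.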
Igor Ilic
57783a979a feat: Add support for multiple audio and image formats
Added support for multiple audio and image formats with example

Feature COG-507
2024-11-20 14:03:14 +01:00
lxobr
f27dc0c91a fix: Rename, extract checker into a separate script 2024-11-20 12:28:10 +01:00
lxobr
263ecb9149 fix: Add input validation and error handling for paths 2024-11-20 12:28:10 +01:00
lxobr
8bc26bba97 fix: Add error handling for path conversion 2024-11-20 12:28:10 +01:00
lxobr
ebb811af87 fix: Filter out None values in module paths 2024-11-20 12:28:10 +01:00
lxobr
2417d18607 fix: Add logging instead of print 2024-11-20 12:28:10 +01:00
lxobr
1a1452e177 fix: Add error handling for Jedi analysis, with debug mode 2024-11-20 12:28:10 +01:00
lxobr
3aadda9a89 feat: Add argparse for testing purposes 2024-11-20 12:28:10 +01:00
lxobr
4bf2281cd5 feat: Enable async processing 2024-11-20 12:28:10 +01:00
lxobr
742792b6c1 refactor: Remove a comment 2024-11-20 12:28:10 +01:00
lxobr
2be2b802c0 feat: Safely handle file read errors 2024-11-20 12:28:10 +01:00
lxobr
e148d32c14 refactor: Modify sys.path in context manager 2024-11-20 12:28:10 +01:00
lxobr
ba83d71269 feat: extract script dependencies 2024-11-20 12:28:10 +01:00
lxobr
26e2dc852d feat: new repo-to-graph task 2024-11-20 12:28:10 +01:00
Boris
22a0e43d4a
Merge branch 'main' into COG-417-chunking-unit-tests 2024-11-17 13:40:32 +01:00
Igor Ilic
d30adb53f3
Cog 337 llama index support (#186)
* feat: Add support for LlamaIndex Document type

Added support for LlamaIndex Document type

Feature #COG-337

* docs: Add Jupyter Notebook for cognee with llama index document type

Added jupyter notebook which demonstrates cognee with LlamaIndex document type usage

Docs #COG-337

* feat: Add metadata migration from LlamaIndex document type

Allow usage of metadata from LlamaIndex documents

Feature #COG-337

* refactor: Change llama index migration function name

Change name of llama index function

Refactor #COG-337

* chore: Add llama index core dependency

Downgrade needed on tenacity and instructor modules to support llama index

Chore #COG-337

* Feature: Add ingest_data_with_metadata task

Added task that will have access to metadata if data is provided from different data ingestion tools

Feature #COG-337

* docs: Add description on why specific type checking is done

Explained why specific type checking is used instead of isinstance, as isinstance returns True for child classes as well

Docs #COG-337

* fix: Add missing parameter to function call

Added missing parameter to function call

Fix #COG-337

* refactor: Move storing of data from async to sync function

Moved data storing from async to sync

Refactor #COG-337

* refactor: Pretend ingest_data was changed instead of having two tasks

Refactor so ingest_data file was modified instead of having two ingest tasks

Refactor #COG-337

* refactor: Use old name for data ingestion with metadata

Merged new and old data ingestion tasks into one

Refactor #COG-337

* refactor: Return ingest_data and save_data_to_storage Tasks

Returned ingest_data and save_data_to_storage tasks

Refactor #COG-337

* refactor: Return previous ingestion Tasks to add function

Returned previous ingestion tasks to add function

Refactor #COG-337

* fix: Remove dict and use string for search query

Remove dictionary and use string for query in notebook and simple example

Fix COG-337

* refactor: Add changes requested in pull request

Added the following changes that were requested in pull request:

Added synchronize label,
Made uniform syntax in if statement in workflow,
fixed instructor dependency,
added llama-index to be optional

Refactor COG-337

* fix: Resolve issue with llama-index being mandatory

Resolve issue with llama-index being mandatory to run cognee

Fix COG-337

* fix: Add install of llama-index to notebook

Removed additional references to llama-index from core cognee lib.
Added llama-index-core install from notebook

Fix COG-337

---------
2024-11-17 11:47:08 +01:00
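The squash commit above mentions using specific type checking instead of `isinstance` because `isinstance` also returns True for child classes. A minimal sketch of that distinction follows; the classes `Document` and `ImageDocument` are hypothetical stand-ins, not types taken from cognee or LlamaIndex.

```python
class Document:
    """Hypothetical base document type."""


class ImageDocument(Document):
    """Hypothetical subclass of Document."""


def is_exact_document(value) -> bool:
    # type(...) is ... matches only the exact class, never a subclass.
    return type(value) is Document


def is_any_document(value) -> bool:
    # isinstance(...) also accepts instances of subclasses.
    return isinstance(value, Document)
```

Here `is_any_document(ImageDocument())` is true, while `is_exact_document(ImageDocument())` is false, so the exact check lets a subclass be routed to different handling than its parent.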
Leon Luithlen
e40e7386a0 Refactor word_type yielding in chunk_by_sentence 2024-11-14 17:16:04 +01:00
Leon Luithlen
84c98f16bb Remove chunk_index attribute from chunk_by_sentence return value 2024-11-14 16:49:13 +01:00
Leon Luithlen
15420dd864 Fix paragraph_ids handling 2024-11-14 16:47:51 +01:00
Leon Luithlen
d6a6a9eaba Return sentence_cut instead of word in chunk_by_paragraph 2024-11-14 15:03:09 +01:00
0xideas
8b681529b1
Update cognee/tasks/chunks/chunk_by_paragraph.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2024-11-14 14:42:15 +01:00
Leon Luithlen
73f24f9e4d Fix sentence_cut return value in inappropriate places 2024-11-14 14:40:42 +01:00
Leon Luithlen
eaf9167fa1 Change chunk_by_word to collect newlines in prior words 2024-11-14 14:19:34 +01:00
Leon Luithlen
57d8149732 Save paragraph_ids in chunk_by_paragraph 2024-11-14 13:59:54 +01:00
Leon Luithlen
6721eaee83 Fix chunk_index bug in chunk_by_paragraph 2024-11-14 13:50:40 +01:00
0xideas
f2206a09c0
Update cognee/tasks/chunks/chunk_by_word.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2024-11-14 13:16:17 +01:00
Leon Luithlen
d90698305b Simplify chunk_by_word 2024-11-14 09:43:10 +01:00
Leon Luithlen
45a60b7f19 Remove assert and move is_real_paragraph_end outside loop 2024-11-13 16:35:47 +01:00
Leon Luithlen
b787407db7 Add more adversarial examples 2024-11-13 16:23:14 +01:00
Leon Luithlen
9ea2634480 Replace word_count with maximum_length in if clause 2024-11-13 15:53:44 +01:00
Leon Luithlen
9b2fb09c59 Fix PdfDocument test, give chunk_by_sentence a maximum_length arg 2024-11-13 15:39:17 +01:00
Leon Luithlen
f8e5b529c3 Add maximum_length argument to chunk_sentences 2024-11-13 15:35:03 +01:00
Leon Luithlen
ce498d97dd Refactor chunk_by_paragraph to be isomorphic 2024-11-13 15:35:03 +01:00
Leon Luithlen
ab55a73d18 Adapt chunk_by_sentence to isomorphic chunk_by_word 2024-11-13 15:35:03 +01:00
Leon Luithlen
c054e897a3 Make chunk_by_word isomorphic 2024-11-13 15:35:03 +01:00
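A possible reading of "isomorphic" in the commit above, stated here as an assumption, is that concatenating the chunks reproduces the input text exactly. The sketch below shows one way a word chunker can satisfy that property by attaching trailing whitespace, including newlines, to the preceding word. It is an illustration only and not the code from `cognee/tasks/chunks/chunk_by_word.py`.

```python
import re


def chunk_by_word(text: str):
    """Yield each word together with the whitespace that follows it."""
    # \S+\s* takes a word plus any trailing whitespace (newlines included);
    # ^\s+ covers whitespace that appears before the first word.
    for match in re.finditer(r"^\s+|\S+\s*", text):
        yield match.group()
```

Because every character of the input lands in exactly one chunk, `"".join(chunk_by_word(text)) == text` holds for any string, which makes the property easy to assert in a unit test.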
Leon Luithlen
6f0637a028 Small cosmetic changes 2024-11-13 15:35:02 +01:00
Leon Luithlen
cd80525420 Revert to EXTENSION_TO_DOCUMENT_CLASS implementation of classify_documents 2024-11-13 14:32:10 +01:00
Leon Luithlen
826de0edbf Remove orphan dictionary 2024-11-12 16:47:28 +01:00
Leon Luithlen
83995fa548 Try old version of classify_documents 2024-11-12 16:47:28 +01:00
Leon Luithlen
8107709e98 Remove duplicate pdf key 2024-11-12 16:47:28 +01:00
Leon Luithlen
fbd011560a Rebase onto main 2024-11-12 16:47:28 +01:00
Leon Luithlen
d7ffef1979 Remove old __tests__ folders 2024-11-12 16:47:28 +01:00
Leon Luithlen
86e726d741 Complete migrating unit tests 2024-11-12 16:47:28 +01:00
Leon Luithlen
66fb2948f8 Small cleanup pull request 2024-11-12 15:37:03 +01:00
Leon Luithlen
adaf69c127 Readd infer_data_ontology models 2024-11-12 09:05:51 +01:00
Boris Arzentar
b1b6b79ca4 fix: convert qdrant search results to ScoredPoint 2024-11-12 09:01:03 +01:00
Boris Arzentar
68700f32c7 fix: add code graph generation pipeline 2024-11-12 09:01:03 +01:00