From 0f97f8f71be8d44088a630e4bbe90b25dac700c6 Mon Sep 17 00:00:00 2001 From: Boris Date: Mon, 13 Jan 2025 22:36:42 +0100 Subject: [PATCH] Version 0.1.22 (#438) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Revert "fix: Add metadata reflection fix to sqlite as well" This reverts commit 394a0b2dfb9645e58ed31835e8eaec7c90970358. * COG-810 Implement a top-down dependency graph builder tool (#268) * feat: parse repo to call graph * Update/repo_processor/top_down_repo_parse.py task * fix: minor improvements * feat: file parsing jedi script optimisation --------- * Add type to DataPoint metadata (#364) * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere * feat: Add search by dataset for cognee Added ability to search by datasets for cognee users Feature COG-912 * feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task * fix: Remove backend lock from UI Removed lock that prevented using multiple datasets in cognify Fix COG-912 * COG 870 Remove duplicate edges from the code graph (#293) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings --------- Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com> Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com> Co-authored-by: Boris * test: Added test for getting of documents for search Added test to verify getting documents related to datasets intended for search Test COG-912 * Structured code summarization (#375) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the 
pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings * Structured code summarization * add missing prompt file * Remove summarization_model argument from summarize_code and fix typehinting * minor refactors --------- Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com> Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com> Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com> Co-authored-by: Boris * fix: Resolve issue with cognify router graph model default value Resolve issue with default value for graph model in cognify endpoint Fix * chore: Resolve typo in getting documents code Resolve typo in code chore COG-912 * Update .github/workflows/dockerhub.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update get_cognify_router.py * fix: Resolve syntax issue with cognify router Resolve syntax issue with cognify router Fix * feat: Add ruff pre-commit hook for linting and formatting Added formatting and linting on pre-commit hook Feature COG-650 * chore: Update ruff lint options in pyproject file Update ruff lint options in pyproject file Chore * test: Add ruff linter github action Added linting check with ruff in github actions Test COG-650 * feat: deletes executor limit from get_repo_file_dependencies * feat: implements mock feature in LiteLLM engine * refactor: Remove changes to cognify router Remove changes to cognify router Refactor COG-650 * fix: fixing boolean env for github actions * test: Add test for ruff format for cognee code Test if code is formatted for cognee Test COG-650 * refactor: Rename ruff gh actions Rename ruff gh actions to be more understandable 
Refactor COG-650 * chore: Remove checking of ruff lint and format on push Remove checking of ruff lint and format on push Chore COG-650 * feat: Add deletion of local files when deleting data Delete local files when deleting data from cognee Feature COG-475 * fix: changes back the max workers to 12 * feat: Adds mock summary for codegraph pipeline * refacotr: Add current development status Save current development status Refactor * Fix langfuse * Fix langfuse * Fix langfuse * Add evaluation notebook * Rename eval notebook * chore: Add temporary state of development Add temp development state to branch Chore * fix: Add poetry.lock file, make langfuse mandatory Added langfuse as mandatory dependency, added poetry.lock file Fix * Fix: fixes langfuse config settings * feat: Add deletion of local files made by cognee through data endpoint Delete local files made by cognee when deleting data from database through endpoint Feature COG-475 * test: Revert changes on test_pgvector Revert changes on test_pgvector which were made to test deletion of local files Test COG-475 * chore: deletes the old test for the codegraph pipeline * test: Add test to verify deletion of local files Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475 * chore: deletes unused old version of the codegraph * chore: deletes unused imports from code_graph_pipeline * Ingest non-code files * Fixing review findings * Ingest non-code files (#395) * Ingest non-code files * Fixing review findings * test: Update test regarding message Update assertion message, add veryfing of file existence * Handle retryerrors in code summary (#396) * Handle retryerrors in code summary * Log instead of print * fix: updates the acreate_structured_output * chore: Add logging to sentry when file which should exist can't be found Log to sentry that a file which should exist can't be found Chore COG-475 * Fix diagram * fix: refactor mcp * Add Smithery CLI 
installation instructions and badge * Move readme * Update README.md * Update README.md * Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code summarization (#387) * feat: update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <44192193+alekszievr@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Fix diagram * Fix diagram * Fix instructions * Fix instructions * adding and fixing files * Update README.md * ruff format * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Implement PR review * Comment out profiling * Comment out profiling * Comment out profiling * fix: add allowed extensions * fix: adhere UnstructuredDocument.read() to Document * feat: time code graph run and add mock support * Fix ollama, work on visualization * fix: Fixes faulty logging format and sets up error logging in dynamic steps example * Overcome ContextWindowExceededError by checking token count while chunking (#413) * fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints * Adjust 
AudioDocument and handle None token limit * Handle azure models as well * Fix visualization * Fix visualization * Fix visualization * Add clean logging to code graph example * Remove setting envvars from arg * fix: fixes create_cognee_style_network_with_logo unit test * fix: removes accidental remained print * Fix visualization * Fix visualization * Fix visualization * Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. * Fix visualization * Fix visualization * Fix poetry issues * Get embedding engine instead of passing it in code chunking. * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * chore: Update version of poetry install action * chore: Update action to trigger on pull request for any branch * chore: Remove if in github action to allow triggering on push * chore: Remove if condition to allow gh actions to trigger on push to PR * chore: Update poetry version in github actions * chore: Set fixed ubuntu version to 22.04 * chore: Update py lint to use ubuntu 22.04 * chore: update ubuntu version to 22.04 * feat: implements the first version of graph based completion in search * chore: Update python 3.9 gh action to use 3.12 instead * chore: Update formatting of utils.py * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Adjust integration tests * fix: Fixes ruff formatting * Handle circular import * fix: Resolve profiler issue with partial and recursive logger imports Resolve issue for profiler with partial and recursive logger imports * fix: Remove logger from __init__.py file * test: Test profiling on HEAD branch * test: Return profiler to base branch * Set max_tokens in config * Adjust SWE-bench script to code graph pipeline call * Adjust SWE-bench script to code graph pipeline call * fix: Add fix for accessing 
dictionary elements that don't exits Using get for the text key instead of direct access to handle situation if the text key doesn't exist * feat: Add ability to change graph database configuration through cognee * feat: adds pydantic types to graph layer models * test: Test ubuntu 24.04 * test: change all actions to ubuntu-latest * feat: adds basic retriever for swe bench * Match Ruff version in config to the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * docs: Add LlamaIndex Cognee integration notebook Added LlamaIndex Cognee integration notebook * test: Add github action for testing llama index cognee integration notebook * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 * fix: update dependencies of the mcp server * Update README.md * Fix: Fixes logging setup * feat: deletes on the fly embeddings as uses edge collections * fix: Change nbformat on llama index integration notebook * fix: Resolve api key issue with llama index integration notebook * fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault * version: Increase version to 0.1.22 --------- Co-authored-by: vasilije Co-authored-by: Igor Ilic Co-authored-by: Igor Ilic <30923996+dexters1@users.noreply.github.com> Co-authored-by: lxobr <122801072+lxobr@users.noreply.github.com> Co-authored-by: alekszievr <44192193+alekszievr@users.noreply.github.com> Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com> Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Rita Aleksziev Co-authored-by: Henry Mao <1828968+calclavia@users.noreply.github.com> --- .github/workflows/dockerhub.yml | 2 +- .github/workflows/py_lint.yml | 2 +- 
.github/workflows/reusable_notebook.yml | 1 + .github/workflows/ruff_format.yaml | 2 +- .github/workflows/ruff_lint.yaml | 2 +- .github/workflows/test_deduplication.yml | 2 +- ...lama_index_cognee_integration_notebook.yml | 20 ++ .github/workflows/test_qdrant.yml | 2 +- .github/workflows/test_weaviate.yml | 2 +- README.md | 12 +- .../modules/graph/cognee_graph/CogneeGraph.py | 38 +-- .../retrieval/brute_force_triplet_search.py | 25 +- .../modules/users/methods/get_default_user.py | 4 +- cognee/shared/utils.py | 14 +- examples/python/dynamic_steps_example.py | 2 +- .../llama_index_cognee_integration.ipynb | 285 ++++++++++++++++++ pyproject.toml | 2 +- 17 files changed, 336 insertions(+), 81 deletions(-) create mode 100644 .github/workflows/test_llama_index_cognee_integration_notebook.yml create mode 100644 notebooks/llama_index_cognee_integration.ipynb diff --git a/.github/workflows/dockerhub.yml b/.github/workflows/dockerhub.yml index 20f0bde96..b48dde2cc 100644 --- a/.github/workflows/dockerhub.yml +++ b/.github/workflows/dockerhub.yml @@ -7,7 +7,7 @@ on: jobs: docker-build-and-push: - runs-on: ubuntu-22.04 + runs-on: ubuntu-latest steps: - name: Checkout repository diff --git a/.github/workflows/py_lint.yml b/.github/workflows/py_lint.yml index 543a0d221..11d0a8b7d 100644 --- a/.github/workflows/py_lint.yml +++ b/.github/workflows/py_lint.yml @@ -16,7 +16,7 @@ jobs: fail-fast: true matrix: os: - - ubuntu-22.04 + - ubuntu-latest python-version: ["3.10.x", "3.11.x"] defaults: diff --git a/.github/workflows/reusable_notebook.yml b/.github/workflows/reusable_notebook.yml index 8034aca97..9bc09c3a6 100644 --- a/.github/workflows/reusable_notebook.yml +++ b/.github/workflows/reusable_notebook.yml @@ -51,6 +51,7 @@ jobs: env: ENV: 'dev' LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }} + OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }} GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }} run: | diff --git 
a/.github/workflows/ruff_format.yaml b/.github/workflows/ruff_format.yaml index a75a795e7..959b7fc4b 100644 --- a/.github/workflows/ruff_format.yaml +++ b/.github/workflows/ruff_format.yaml @@ -3,7 +3,7 @@ on: [ pull_request ] jobs: ruff: - runs-on: ubuntu-22.04 + runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: astral-sh/ruff-action@v2 diff --git a/.github/workflows/ruff_lint.yaml b/.github/workflows/ruff_lint.yaml index 4c4fb81e3..214e8ec6d 100644 --- a/.github/workflows/ruff_lint.yaml +++ b/.github/workflows/ruff_lint.yaml @@ -3,7 +3,7 @@ on: [ pull_request ] jobs: ruff: - runs-on: ubuntu-22.04 + runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: astral-sh/ruff-action@v2 diff --git a/.github/workflows/test_deduplication.yml b/.github/workflows/test_deduplication.yml index 923bbb68c..2f97e4ea6 100644 --- a/.github/workflows/test_deduplication.yml +++ b/.github/workflows/test_deduplication.yml @@ -16,7 +16,7 @@ env: jobs: run_deduplication_test: name: test - runs-on: ubuntu-22.04 + runs-on: ubuntu-latest defaults: run: shell: bash diff --git a/.github/workflows/test_llama_index_cognee_integration_notebook.yml b/.github/workflows/test_llama_index_cognee_integration_notebook.yml new file mode 100644 index 000000000..aacc31eb5 --- /dev/null +++ b/.github/workflows/test_llama_index_cognee_integration_notebook.yml @@ -0,0 +1,20 @@ +name: test | llama index cognee integration notebook + +on: + workflow_dispatch: + pull_request: + types: [labeled, synchronize] + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} + cancel-in-progress: true + +jobs: + run_notebook_test: + uses: ./.github/workflows/reusable_notebook.yml + with: + notebook-location: notebooks/llama_index_cognee_integration.ipynb + secrets: + OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} + GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }} + GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }} diff --git 
a/.github/workflows/test_qdrant.yml b/.github/workflows/test_qdrant.yml index d1447c65c..e2cf9abe8 100644 --- a/.github/workflows/test_qdrant.yml +++ b/.github/workflows/test_qdrant.yml @@ -17,7 +17,7 @@ jobs: run_qdrant_integration_test: name: test - runs-on: ubuntu-22.04 + runs-on: ubuntu-latest defaults: run: shell: bash diff --git a/.github/workflows/test_weaviate.yml b/.github/workflows/test_weaviate.yml index 159fce194..81cc2603f 100644 --- a/.github/workflows/test_weaviate.yml +++ b/.github/workflows/test_weaviate.yml @@ -17,7 +17,7 @@ jobs: run_weaviate_integration_test: name: test - runs-on: ubuntu-22.04 + runs-on: ubuntu-latest defaults: run: shell: bash diff --git a/README.md b/README.md index f35829783..8ff2d71d5 100644 --- a/README.md +++ b/README.md @@ -101,15 +101,9 @@ cognee.config.set_graphistry_config({ }) ``` -(Optional) To run the UI, go to cognee-frontend directory and run: -``` -npm run dev -``` -or run everything in a docker container: -``` -docker-compose up -``` -Then navigate to localhost:3000 +(Optional) To run cognee with a UI, go to the cognee-mcp directory and follow the instructions there. +You will be able to use cognee as an MCP tool to create graphs and query them. 
+ If you want to use Cognee with PostgreSQL, make sure to set the following values in the .env file: ``` diff --git a/cognee/modules/graph/cognee_graph/CogneeGraph.py b/cognee/modules/graph/cognee_graph/CogneeGraph.py index 279a73b19..491f83b5a 100644 --- a/cognee/modules/graph/cognee_graph/CogneeGraph.py +++ b/cognee/modules/graph/cognee_graph/CogneeGraph.py @@ -8,7 +8,7 @@ from cognee.infrastructure.databases.graph.graph_db_interface import GraphDBInte from cognee.modules.graph.cognee_graph.CogneeGraphElements import Node, Edge from cognee.modules.graph.cognee_graph.CogneeAbstractGraph import CogneeAbstractGraph import heapq -from graphistry import edges +import asyncio class CogneeGraph(CogneeAbstractGraph): @@ -127,51 +127,25 @@ class CogneeGraph(CogneeAbstractGraph): else: print(f"Node with id {node_id} not found in the graph.") - async def map_vector_distances_to_graph_edges( - self, vector_engine, query - ) -> None: # :TODO: When we calculate edge embeddings in vector db change this similarly to node mapping + async def map_vector_distances_to_graph_edges(self, vector_engine, query) -> None: try: - # Step 1: Generate the query embedding query_vector = await vector_engine.embed_data([query]) query_vector = query_vector[0] if query_vector is None or len(query_vector) == 0: raise ValueError("Failed to generate query embedding.") - # Step 2: Collect all unique relationship types - unique_relationship_types = set() - for edge in self.edges: - relationship_type = edge.attributes.get("relationship_type") - if relationship_type: - unique_relationship_types.add(relationship_type) + edge_distances = await vector_engine.get_distance_from_collection_elements( + "edge_type_relationship_name", query_text=query + ) - # Step 3: Embed all unique relationship types - unique_relationship_types = list(unique_relationship_types) - relationship_type_embeddings = await vector_engine.embed_data(unique_relationship_types) + embedding_map = {result.payload["text"]: result.score for 
result in edge_distances} - # Step 4: Map relationship types to their embeddings and calculate distances - embedding_map = {} - for relationship_type, embedding in zip( - unique_relationship_types, relationship_type_embeddings - ): - edge_vector = np.array(embedding) - - # Calculate cosine similarity - similarity = np.dot(query_vector, edge_vector) / ( - np.linalg.norm(query_vector) * np.linalg.norm(edge_vector) - ) - distance = 1 - similarity - - # Round the distance to 4 decimal places and store it - embedding_map[relationship_type] = round(distance, 4) - - # Step 4: Assign precomputed distances to edges for edge in self.edges: relationship_type = edge.attributes.get("relationship_type") if not relationship_type or relationship_type not in embedding_map: print(f"Edge {edge} has an unknown or missing relationship type.") continue - # Assign the precomputed distance edge.attributes["vector_distance"] = embedding_map[relationship_type] except Exception as ex: diff --git a/cognee/modules/retrieval/brute_force_triplet_search.py b/cognee/modules/retrieval/brute_force_triplet_search.py index 9c778505d..c27e90766 100644 --- a/cognee/modules/retrieval/brute_force_triplet_search.py +++ b/cognee/modules/retrieval/brute_force_triplet_search.py @@ -62,24 +62,6 @@ async def brute_force_triplet_search( return retrieved_results -def delete_duplicated_vector_db_elements( - collections, results -): #:TODO: This is just for now to fix vector db duplicates - results_dict = {} - for collection, results in zip(collections, results): - seen_ids = set() - unique_results = [] - for result in results: - if result.id not in seen_ids: - unique_results.append(result) - seen_ids.add(result.id) - else: - print(f"Duplicate found in collection '{collection}': {result.id}") - results_dict[collection] = unique_results - - return results_dict - - async def brute_force_search( query: str, user: User, top_k: int, collections: List[str] = None ) -> list: @@ -125,10 +107,7 @@ async def 
brute_force_search( ] ) - ############################################# :TODO: Change when vector db does not contain duplicates - node_distances = delete_duplicated_vector_db_elements(collections, results) - # node_distances = {collection: result for collection, result in zip(collections, results)} - ############################################## + node_distances = {collection: result for collection, result in zip(collections, results)} memory_fragment = CogneeGraph() @@ -140,14 +119,12 @@ async def brute_force_search( await memory_fragment.map_vector_distances_to_graph_nodes(node_distances=node_distances) - #:TODO: Change when vectordb contains edge embeddings await memory_fragment.map_vector_distances_to_graph_edges(vector_engine, query) results = await memory_fragment.calculate_top_triplet_importances(k=top_k) send_telemetry("cognee.brute_force_triplet_search EXECUTION STARTED", user.id) - #:TODO: Once we have Edge pydantic models we should retrieve the exact edge and node objects from graph db return results except Exception as e: diff --git a/cognee/modules/users/methods/get_default_user.py b/cognee/modules/users/methods/get_default_user.py index c67d9d71f..2bb15ea95 100644 --- a/cognee/modules/users/methods/get_default_user.py +++ b/cognee/modules/users/methods/get_default_user.py @@ -1,4 +1,4 @@ -from sqlalchemy.orm import joinedload +from sqlalchemy.orm import selectinload from sqlalchemy.future import select from cognee.modules.users.models import User from cognee.infrastructure.databases.relational import get_relational_engine @@ -11,7 +11,7 @@ async def get_default_user(): async with db_engine.get_async_session() as session: query = ( select(User) - .options(joinedload(User.groups)) + .options(selectinload(User.groups)) .where(User.email == "default_user@example.com") ) diff --git a/cognee/shared/utils.py b/cognee/shared/utils.py index e57decde1..affd92c87 100644 --- a/cognee/shared/utils.py +++ b/cognee/shared/utils.py @@ -468,16 +468,20 @@ def 
graph_to_tuple(graph): def setup_logging(log_level=logging.INFO): - """This method sets up the logging configuration.""" + """Sets up the logging configuration.""" formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s\n") + stream_handler = logging.StreamHandler(sys.stdout) stream_handler.setFormatter(formatter) stream_handler.setLevel(log_level) - logging.basicConfig( - level=log_level, - handlers=[stream_handler], - ) + root_logger = logging.getLogger() + + if root_logger.hasHandlers(): + root_logger.handlers.clear() + + root_logger.addHandler(stream_handler) + root_logger.setLevel(log_level) # ---------------- Example Usage ---------------- diff --git a/examples/python/dynamic_steps_example.py b/examples/python/dynamic_steps_example.py index 11596a5e2..4422dd39d 100644 --- a/examples/python/dynamic_steps_example.py +++ b/examples/python/dynamic_steps_example.py @@ -192,7 +192,7 @@ async def main(enable_steps): if __name__ == "__main__": - setup_logging(logging.INFO) + setup_logging(logging.ERROR) rebuild_kg = True retrieve = True diff --git a/notebooks/llama_index_cognee_integration.ipynb b/notebooks/llama_index_cognee_integration.ipynb new file mode 100644 index 000000000..772c0a8c7 --- /dev/null +++ b/notebooks/llama_index_cognee_integration.ipynb @@ -0,0 +1,285 @@ +{ + "cells": [ + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## LlamaIndex Cognee GraphRAG Integration\n", + "\n", + "Connecting external knowledge to the LLM efficiently and retrieving it is a key challenge faced by developers. For developers and data scientists, integrating structured and unstructured data into AI workflows often involves multiple tools, complex pipelines, and time-consuming processes.\n", + "\n", + "Enter **cognee,** a powerful framework for knowledge and memory management, and LlamaIndex, a versatile data integration library. 
Together, they enable us to transform retrieval-augmented generation (RAG) pipelines, into GraphRAG pipelines, streamlining the path from raw data to actionable insights.\n", + "\n", + "In this post, we’ll explore a demo that leverages cognee and LlamaIndex to create a knowledge graph from a LlamaIndex document, process it into a meaningful structure, and extract useful insights. By the end, you’ll see how these tools can give you new insights into your data by connecting various data sources in one big semantic layer you can analyze.\n", + "\n", + "## RAG - Recap\n", + "\n", + "RAG enhances LLMs by integrating external knowledge sources during inference. It does so by turning the data into a vector representation and storing it in a vector store.\n", + "\n", + "### Key Benefits of RAG:\n", + "\n", + "1. Connecting domain specific data to LLMs\n", + "2. Cost savings\n", + "3. Higher accuracy than base LLM\n", + "\n", + "However, building a RAG system presents challenges: handling diverse data formats, data updates, creating a robust metadata layer, and mediocre accuracy\n", + "\n", + "## Introducing cognee and LlamaIndex more\n", + "\n", + "cognee simplifies knowledge and memory management for LLMs, while LlamaIndex facilitates connecting LLMs to structured data sources and enabling agentic use-cases\n", + "\n", + "cognee is inspired by human mind and higer cognitive functions. 
It mimics ways we construct our mental map of the world and build a semantic understanding of various objects, terms and issues in our everyday lives.\n", + "\n", + "cognee brings this approach to code by allowing developers to create semantic layers that would allow users to store their ontologies which are **a formalised depiction of knowledge** in graphs.\n", + "\n", + "This lets you use the knowledge you have about a system connect it to LLMs in a modular way, with best data engineering practices, wide choice of vector and graph stores and various LLMs you can use.\n", + "\n", + "Together, they:\n", + "\n", + "- Turn unstructured and semi-structured data into a graph/vector representation.\n", + "- Enable ontology generation for particular domains, making unique graphs for every vertical\n", + "- Provide a deterministic layer for LLM outputs, ensuring consistency and reliability.\n", + "\n", + "## Step-by-Step Demo: Building a RAG System with Cognee and LlamaIndex\n", + "\n", + "### 1. Setting Up the Environment\n", + "\n", + "Start by importing the required libraries and defining the environment:" + ], + "id": "d0d7a82d729bbef6" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": "!pip install llama-index-graph-rag-cognee==0.1.1", + "id": "598b52e384086512" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "import os\n", + "import asyncio\n", + "from llama_index.core import Document\n", + "from llama_index.graph_rag.cognee import CogneeGraphRAG\n", + "\n", + "if \"OPENAI_API_KEY\" not in os.environ:\n", + " os.environ[\"OPENAI_API_KEY\"] = \"\"" + ], + "id": "892a1b1198ec662f" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "Ensure you’ve set up your API keys and installed necessary dependencies.\n", + "\n", + "### 2. 
Preparing the Dataset\n", + "\n", + "We’ll use a brief profile of an individual as our sample dataset:" + ], + "id": "a1f16f5ca5249ebb" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "documents = [\n", + " Document(\n", + " text=\"Jessica Miller, Experienced Sales Manager with a strong track record in driving sales growth and building high-performing teams.\"\n", + " ),\n", + " Document(\n", + " text=\"David Thompson, Creative Graphic Designer with over 8 years of experience in visual design and branding.\"\n", + " ),\n", + " ]" + ], + "id": "198022c34636a3a0" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### 3. Initializing CogneeGraphRAG\n", + "\n", + "Instantiate the Cognee framework with configurations for LLM, graph, and database providers:" + ], + "id": "781ae78e52ff49a" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "cogneeRAG = CogneeGraphRAG(\n", + " llm_api_key=os.environ[\"OPENAI_API_KEY\"],\n", + " llm_provider=\"openai\",\n", + " llm_model=\"gpt-4o-mini\",\n", + " graph_db_provider=\"networkx\",\n", + " vector_db_provider=\"lancedb\",\n", + " relational_db_provider=\"sqlite\",\n", + " relational_db_name=\"cognee_db\",\n", + ")" + ], + "id": "17e466821ab88d50" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### 4. Adding Data to Cognee\n", + "\n", + "Load the dataset into the cognee framework:" + ], + "id": "2a55d5be9de0ce81" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": "await cogneeRAG.add(documents, \"test\")", + "id": "238b716429aba541" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "This step prepares the data for graph-based processing.\n", + "\n", + "### 5. 
Processing Data into a Knowledge Graph\n", + "\n", + "Transform the data into a structured knowledge graph:" + ], + "id": "23e5316aa7e5dbc7" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": "await cogneeRAG.process_data(\"test\")", + "id": "c3b3063d428b07a2" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "The graph now contains nodes and relationships derived from the dataset, creating a powerful structure for exploration.\n", + "\n", + "### 6. Performing Searches\n", + "\n", + "### Answer prompt based on knowledge graph approach:" + ], + "id": "e32327de54e98dc8" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "search_results = await cogneeRAG.search(\"Tell me who are the people mentioned?\")\n", + "\n", + "print(\"\\n\\nAnswer based on knowledge graph:\\n\")\n", + "for result in search_results:\n", + " print(f\"{result}\\n\")" + ], + "id": "fddbf5916d1e50e5" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### Answer prompt based on RAG approach:", + "id": "9246aed7f69ceb7e" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "search_results = await cogneeRAG.rag_search(\"Tell me who are the people mentioned?\")\n", + "\n", + "print(\"\\n\\nAnswer based on RAG:\\n\")\n", + "for result in search_results:\n", + " print(f\"{result}\\n\")" + ], + "id": "fe77c7a7c57fe4e4" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "In conclusion, the results demonstrate a significant advantage of the knowledge graph-based approach (Graphrag) over the RAG approach. Graphrag successfully identified all the mentioned individuals across multiple documents, showcasing its ability to aggregate and infer information from a global context. 
In contrast, the RAG approach was limited to identifying individuals within a single document due to its chunking-based processing constraints. This highlights Graphrag's superior capability in comprehensively resolving queries that span across a broader corpus of interconnected data.", + "id": "89cc99628392eb99" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### 7. Finding Related Nodes\n", + "\n", + "Explore relationships in the knowledge graph:" + ], + "id": "44c9b67c09763610" + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "related_nodes = await cogneeRAG.get_related_nodes(\"person\")\n", + "\n", + "print(\"\\n\\nRelated nodes are:\\n\")\n", + "for node in related_nodes:\n", + " print(f\"{node}\\n\")" + ], + "id": "efbc1511586f46fe" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## Why Choose Cognee and LlamaIndex?\n", + "\n", + "### 1. Agentic Framework and Memory tied together\n", + "\n", + "Your agents can now get long-term, short-term memory and memory specific to their domains\n", + "\n", + "### 2. Enhanced Querying and Insights\n", + "\n", + "Your memory can now automatically optimize itself and allow to respond to questions better\n", + "\n", + "### 3. 
Simplified Deployment\n", + "\n", + "You can use the standard tools out of the box and get things done without much effort\n", + "\n", + "## Visualizing the Knowledge Graph\n", + "\n", + "Imagine a graph structure where each node represents a document or entity, and edges indicate relationships.\n", + "\n", + "Here’s the visualized knowledge graph from the simple example above:\n", + "\n", + "![example.png]()\n", + "\n", + "\n", + "## Conclusion\n", + "\n", + "Try running it yourself\n", + "\n", + "Join cognee community" + ], + "id": "d0f82c2c6eb7793" + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/pyproject.toml b/pyproject.toml index 5a0e83057..446e807de 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [tool.poetry] name = "cognee" -version = "0.1.21" +version = "0.1.22" description = "Cognee - is a library for enriching LLM context with a semantic layer for better understanding and reasoning." authors = ["Vasilije Markovic", "Boris Arzentar"] readme = "README.md"