
cognee


We build for developers who need a reliable, production-ready data layer for AI applications.

What is cognee?

Cognee implements scalable, modular ECL (Extract, Cognify, Load) pipelines that allow you to interconnect and retrieve past conversations, documents, and audio transcriptions while reducing hallucinations, developer effort, and cost. Try it in a Google Colab notebook or have a look at our documentation.

If you have questions, join our Discord community.

📦 Installation

With pip

pip install cognee

With pip with PostgreSQL support

pip install 'cognee[postgres]'

With poetry

poetry add cognee

With poetry with PostgreSQL support

poetry add cognee -E postgres

💻 Basic Usage

Setup

import os

os.environ["LLM_API_KEY"] = "YOUR_OPENAI_API_KEY"

or

import cognee
cognee.config.set_llm_api_key("YOUR_OPENAI_API_KEY")

You can also set the variables by creating a .env file; here is our template. To use different LLM providers, check out our documentation for more info.
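
For reference, a minimal .env can contain just the two values used in this README (both LLM_API_KEY and VECTOR_DB_PROVIDER appear in the sections below):

LLM_API_KEY="YOUR_OPENAI_API_KEY"
VECTOR_DB_PROVIDER="lancedb"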

If you are using NetworkX, create an account on Graphistry to visualize results:

cognee.config.set_graphistry_config({
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD"
})

(Optional) To run the UI, go to cognee-frontend directory and run:

npm run dev

or run everything in a docker container:

docker-compose up

Then navigate to localhost:3000

If you want to use the UI with PostgreSQL through docker-compose, make sure to set the following values in the .env file:

DB_PROVIDER=postgres
DB_HOST=postgres
DB_PORT=5432
DB_NAME=cognee_db
DB_USERNAME=cognee
DB_PASSWORD=cognee

Simple example

First, copy .env.template to .env and add your OpenAI API key to the LLM_API_KEY field.

Optionally, set VECTOR_DB_PROVIDER="lancedb" in .env to simplify setup.

This script will run the default pipeline:

import cognee
import asyncio
from cognee.api.v1.search import SearchType

async def main():
    # Reset cognee data
    await cognee.prune.prune_data()
    # Reset cognee system state
    await cognee.prune.prune_system(metadata=True)

    text = """
    Natural language processing (NLP) is an interdisciplinary
    subfield of computer science and information retrieval.
    """

    # Add text to cognee
    await cognee.add(text)

    # Use LLMs and cognee to create knowledge graph
    await cognee.cognify()

    # Search cognee for insights
    search_results = await cognee.search(
        SearchType.INSIGHTS,
        "Tell me about NLP",
    )

    # Display results
    for result_text in search_results:
        print(result_text)
        # natural_language_processing is_a field
        # natural_language_processing is_subfield_of computer_science
        # natural_language_processing is_subfield_of information_retrieval

asyncio.run(main())

A version of this example is here: examples/python/simple_example.py

Create your own memory store

The cognee framework consists of tasks that can be grouped into pipelines. Each task is an independent piece of business logic that can be tied to other tasks to form a pipeline. These tasks persist data into your memory store, enabling you to search for relevant context from past conversations, documents, or any other data you have stored.
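
As a minimal sketch of that idea (the import paths below are assumptions; Task and run_tasks are used the same way in the classifier example further down):

from cognee.modules.pipelines import run_tasks  # assumed import path
from cognee.modules.pipelines.tasks.Task import Task  # assumed import path

async def normalize(text: str) -> str:
    # Any independent piece of business logic can be a task.
    return " ".join(text.split())

async def count_words(text: str) -> int:
    return len(text.split())

# Group the tasks into a pipeline; each task passes its output on to the next.
pipeline = run_tasks([Task(normalize), Task(count_words)], "raw   input   text")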

Example: Classify your documents

Here is an example of how it looks for a default cognify pipeline:

  1. To prepare the data for the pipeline run, we first need to add it to our metastore and normalize it:

Start with:

text = """Natural language processing (NLP) is an interdisciplinary
       subfield of computer science and information retrieval"""

await cognee.add(text) # Add a new piece of information
  2. In the next step we create a task. The task can be any business logic we need, but the important part is that it is encapsulated in one function.

Here we show an example of creating a naive LLM classifier that takes a Pydantic model and then stores the data in both the graph and vector stores after analyzing each chunk. We provided just a snippet for reference, but feel free to check out the implementation in our repo.

import asyncio
from typing import Type
from uuid import NAMESPACE_OID, uuid5

from pydantic import BaseModel

# DocumentChunk, extract_categories, and get_vector_engine come from cognee's
# internals; see the full implementation in our repo.

async def chunk_naive_llm_classifier(
    data_chunks: list[DocumentChunk],
    classification_model: Type[BaseModel]
):
    # Extract classifications for all chunks concurrently
    chunk_classifications = await asyncio.gather(
        *(extract_categories(chunk.text, classification_model) for chunk in data_chunks)
    )

    # Collect classification data points, using a set to avoid duplicates
    classification_data_points = {
        uuid5(NAMESPACE_OID, cls.label.type)
        for cls in chunk_classifications
    } | {
        uuid5(NAMESPACE_OID, subclass.value)
        for cls in chunk_classifications
        for subclass in cls.label.subclass
    }

    vector_engine = get_vector_engine()
    collection_name = "classification"

    # Define the payload schema
    class Keyword(BaseModel):
        uuid: str
        text: str
        chunk_id: str
        document_id: str

    # Ensure the collection exists and retrieve existing data points
    if not await vector_engine.has_collection(collection_name):
        await vector_engine.create_collection(collection_name, payload_schema=Keyword)
        existing_points_map = {}
    else:
        # Retrieve previously stored points here (elided in this snippet)
        existing_points_map = {}

    # ... store the new classification points in the graph and vector stores ...

    return data_chunks

We have many tasks that can be used in your pipelines, and you can also create your own tasks to fit your business logic.

  3. Once we have our tasks, it is time to group them into a pipeline. This simplified snippet demonstrates how tasks can be added to a pipeline and how they pass information forward from one to another.

tasks = [
    Task(
        chunk_naive_llm_classifier,
        classification_model = cognee_config.classification_model,
    ),
]

pipeline = run_tasks(tasks, documents)

To see the working code, check the cognee.api.v1.cognify default pipeline in our repo.

Vector retrieval, Graphs and LLMs

Cognee supports a variety of tools and services for different operations:

  • Modular: Cognee is modular by nature, using tasks grouped into pipelines.

  • Local Setup: By default, LanceDB runs locally with NetworkX and OpenAI.

  • Vector Stores: Cognee supports LanceDB, Qdrant, PGVector, and Weaviate for vector storage (a configuration sketch follows this list).

  • Language Models (LLMs): You can use either Anyscale or Ollama as your LLM provider.

  • Graph Stores: In addition to NetworkX, Neo4j is also supported for graph storage.

  • User management: Create individual user graphs and manage permissions.
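
As a sketch of how a provider can be selected (VECTOR_DB_PROVIDER and DB_PROVIDER are the variable names used earlier in this README; the alternative values are assumptions based on the list above):

import os

# Vector store: "lancedb" is the local default; "qdrant", "pgvector", and
# "weaviate" are assumed spellings for the other supported stores.
os.environ["VECTOR_DB_PROVIDER"] = "lancedb"

# Relational database provider, as in the PostgreSQL setup above.
os.environ["DB_PROVIDER"] = "postgres"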

Demo

Check out our demo notebook here.

Get Started

Install Server

Please see the cognee Quick Start Guide for important configuration information.

docker compose up

Install SDK

Please see the cognee Development Guide for important beta information and usage instructions.

pip install cognee

Star History

Star History Chart

💫 Contributors

contributors