cognee

We build for developers who need a reliable, production-ready data layer for AI applications

What is cognee?

Cognee implements scalable, modular ECL (Extract, Cognify, Load) pipelines that let you interconnect and retrieve past conversations, documents, and audio transcriptions while reducing hallucinations, developer effort, and cost. Try it in a Google Colab notebook or have a look at our documentation.

If you have questions, join our Discord community.

📦 Installation

With pip

pip install cognee

With pip, including PostgreSQL support

pip install 'cognee[postgres]'

With poetry

poetry add cognee

With poetry, including PostgreSQL support

poetry add cognee -E postgres

💻 Basic Usage

Setup

import os

os.environ["LLM_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Alternatively, set the key through cognee's config:
import cognee
cognee.config.set_llm_api_key("YOUR_OPENAI_API_KEY")

You can also set these variables by creating a .env file; here is our template. To use a different LLM provider, check out our documentation for more info.
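As a rough illustration of the .env format, the sketch below parses KEY=VALUE lines into a dictionary. The helper name `parse_env_text` is hypothetical and not part of cognee; a dedicated library such as python-dotenv covers quoting and escaping rules that this sketch ignores.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one layer of matching surrounding quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Placeholder content shaped like a .env file
sample = """
# LLM settings
LLM_API_KEY="YOUR_OPENAI_API_KEY"
"""

settings = parse_env_text(sample)
print(settings["LLM_API_KEY"])
```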

If you are using NetworkX, create an account on Graphistry to visualize results:

cognee.config.set_graphistry_config({
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD"
})

(Optional) To run the UI, go to the cognee-frontend directory and run:

npm run dev

Or run everything in a Docker container:

docker-compose up

Then navigate to localhost:3000.

If you want to use Cognee with PostgreSQL, make sure to set the following values in the .env file:

DB_PROVIDER=postgres

DB_HOST=postgres
DB_PORT=5432

DB_NAME=cognee_db
DB_USERNAME=cognee
DB_PASSWORD=cognee
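To sanity-check these values before starting cognee, you could assemble them into a connection URL and try it with any PostgreSQL client. The sketch below only builds the URL string. `postgres_url_from_env` is a hypothetical helper, and the URL follows the standard PostgreSQL URI form rather than anything taken from cognee's internals.

```python
import os
from urllib.parse import quote


def postgres_url_from_env(env=None) -> str:
    """Build a PostgreSQL URI from the DB_* settings listed above."""
    env = os.environ if env is None else env
    required = ["DB_HOST", "DB_PORT", "DB_NAME", "DB_USERNAME", "DB_PASSWORD"]
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    # Percent-encode credentials so special characters do not break the URI
    user = quote(env["DB_USERNAME"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"


# The same values as in the .env example above
example = {
    "DB_HOST": "postgres",
    "DB_PORT": "5432",
    "DB_NAME": "cognee_db",
    "DB_USERNAME": "cognee",
    "DB_PASSWORD": "cognee",
}
print(postgres_url_from_env(example))
```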

Simple example

First, copy .env.template to .env and add your OpenAI API key to the LLM_API_KEY field.

This script will run the default pipeline:

import cognee
import asyncio
from cognee.api.v1.search import SearchType

async def main():
    # Create a clean slate for cognee -- reset data and system state
    print("Resetting cognee data...")
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    print("Data reset complete.\n")

    # cognee knowledge graph will be created based on this text
    text = """
    Natural language processing (NLP) is an interdisciplinary
    subfield of computer science and information retrieval.
    """
    
    print("Adding text to cognee:")
    print(text.strip())  
    # Add the text, and make it available for cognify
    await cognee.add(text)
    print("Text added successfully.\n")

    
    print("Running cognify to create knowledge graph...\n")
    print("Cognify process steps:")
    print("1. Classifying the document: Determining the type and category of the input text.")
    print("2. Checking permissions: Ensuring the user has the necessary rights to process the text.")
    print("3. Extracting text chunks: Breaking down the text into sentences or phrases for analysis.")
    print("4. Adding data points: Storing the extracted chunks for processing.")
    print("5. Generating knowledge graph: Extracting entities and relationships to form a knowledge graph.")
    print("6. Summarizing text: Creating concise summaries of the content for quick insights.\n")
    
    # Use LLMs and cognee to create knowledge graph
    await cognee.cognify()
    print("Cognify process complete.\n")

    
    query_text = 'Tell me about NLP'
    print(f"Searching cognee for insights with query: '{query_text}'")
    # Query cognee for insights on the added text
    search_results = await cognee.search(
        SearchType.INSIGHTS, query_text=query_text
    )
    
    print("Search results:")
    # Display results
    for result_text in search_results:
        print(result_text)

    # Example output:
    # ({'id': UUID('bc338a39-64d6-549a-acec-da60846dd90d'), 'updated_at': datetime.datetime(2024, 11, 21, 12, 23, 1, 211808, tzinfo=datetime.timezone.utc), 'name': 'natural language processing', 'description': 'An interdisciplinary subfield of computer science and information retrieval.'}, {'relationship_name': 'is_a_subfield_of', 'source_node_id': UUID('bc338a39-64d6-549a-acec-da60846dd90d'), 'target_node_id': UUID('6218dbab-eb6a-5759-a864-b3419755ffe0'), 'updated_at': datetime.datetime(2024, 11, 21, 12, 23, 15, 473137, tzinfo=datetime.timezone.utc)}, {'id': UUID('6218dbab-eb6a-5759-a864-b3419755ffe0'), 'updated_at': datetime.datetime(2024, 11, 21, 12, 23, 1, 211808, tzinfo=datetime.timezone.utc), 'name': 'computer science', 'description': 'The study of computation and information processing.'})
    # (...)
    #
    # It represents nodes and relationships in the knowledge graph:
    # - The first element is the source node (e.g., 'natural language processing').
    # - The second element is the relationship between nodes (e.g., 'is_a_subfield_of').
    # - The third element is the target node (e.g., 'computer science').

if __name__ == '__main__':
    asyncio.run(main())

When you run this script, you will see step-by-step messages in the console that help you trace the execution flow and understand what the script is doing at each stage. A version of this example is here: examples/python/simple_example.py
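Since each INSIGHTS result in the example output is a (source node, relationship, target node) triple of dictionaries, the results can be condensed into one readable line each. This sketch assumes the result shape shown in that example output; `format_triple` is a hypothetical helper, and the sample data is trimmed from the output above.

```python
def format_triple(triple) -> str:
    """Render a (source, relationship, target) result as one line."""
    source, relationship, target = triple
    return f"{source['name']} --[{relationship['relationship_name']}]--> {target['name']}"


# Sample shaped like the example output, with ids and timestamps omitted
sample_results = [
    (
        {"name": "natural language processing"},
        {"relationship_name": "is_a_subfield_of"},
        {"name": "computer science"},
    ),
]

for line in map(format_triple, sample_results):
    print(line)
```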

Create your own memory store

The cognee framework consists of tasks that can be grouped into pipelines. Each task can be an independent piece of business logic that can be chained with other tasks to form a pipeline. These tasks persist data into your memory store, enabling you to search for relevant context from past conversations, documents, or any other data you have stored.

Example: Classify your documents

The example below follows the default cognify pipeline. To prepare the data for the pipeline run, we first add it to our metastore and normalize it:

text = """Natural language processing (NLP) is an interdisciplinary
       subfield of computer science and information retrieval"""

await cognee.add(text) # Add a new piece of information

In the next step we make a task. The task can be any business logic we need, but the important part is that it should be encapsulated in one function.

Here we show an example of creating a naive LLM classifier that takes a Pydantic model and, after analyzing each chunk, stores the data in both the graph and vector stores. We provide only a snippet for reference, but feel free to check out the full implementation in our repo.

import asyncio
from typing import Type
from uuid import NAMESPACE_OID, uuid5

from pydantic import BaseModel

# DocumentChunk, extract_categories, and get_vector_engine are provided by
# cognee; see the repo for their import paths.

async def chunk_naive_llm_classifier(
    data_chunks: list[DocumentChunk],
    classification_model: Type[BaseModel]
):
    # Extract classifications asynchronously
    chunk_classifications = await asyncio.gather(
        *(extract_categories(chunk.text, classification_model) for chunk in data_chunks)
    )

    # Collect classification data points using a set to avoid duplicates
    classification_data_points = {
        uuid5(NAMESPACE_OID, cls.label.type)
        for cls in chunk_classifications
    } | {
        uuid5(NAMESPACE_OID, subclass.value)
        for cls in chunk_classifications
        for subclass in cls.label.subclass
    }

    vector_engine = get_vector_engine()
    collection_name = "classification"

    # Define the payload schema
    class Keyword(BaseModel):
        uuid: str
        text: str
        chunk_id: str
        document_id: str

    # Ensure the collection exists and retrieve existing data points
    if not await vector_engine.has_collection(collection_name):
        await vector_engine.create_collection(collection_name, payload_schema=Keyword)
        existing_points_map = {}
    else:
        # Lookup of existing data points is omitted from this snippet
        existing_points_map = {}

    return data_chunks

...
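The set in the snippet above removes duplicates because `uuid5` is deterministic: the same namespace and name always produce the same UUID. The sketch below shows this with plain strings; the label values are made up for illustration.

```python
from uuid import NAMESPACE_OID, uuid5

# Made-up labels; "text" appears twice, as it might across several chunks
labels = ["text", "article", "text"]

# Identical names map to identical UUIDs, so the set collapses repeats
ids = {uuid5(NAMESPACE_OID, label) for label in labels}

print(len(labels), "labels ->", len(ids), "unique ids")
```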

We have many tasks that can be used in your pipelines, and you can also create your own tasks to fit your business logic.

Once we have our tasks, it is time to group them into a pipeline. This simplified snippet demonstrates how tasks can be added to a pipeline, and how they can pass the information forward from one to another.

            

# Task and run_tasks are provided by cognee; see the repo for import paths
tasks = [
    Task(
        chunk_naive_llm_classifier,
        classification_model=cognee_config.classification_model,
    ),
]

pipeline = run_tasks(tasks, documents)

To see the working code, check cognee.api.v1.cognify default pipeline in our repo.
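To make the idea of tasks passing data forward more concrete, here is a toy pipeline in plain asyncio. It is not cognee's implementation: `ToyTask` and `run_toy_tasks` are hypothetical stand-ins for `Task` and `run_tasks`, and the real versions do considerably more.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToyTask:
    """A coroutine function plus the extra keyword arguments it receives."""
    func: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)


async def run_toy_tasks(tasks: list[ToyTask], data: Any) -> Any:
    """Feed each task's return value into the next task."""
    for task in tasks:
        data = await task.func(data, **task.kwargs)
    return data


async def split_sentences(text: str) -> list[str]:
    # Naive splitting on periods, good enough for the illustration
    return [part.strip() for part in text.split(".") if part.strip()]


async def tag_chunks(chunks: list[str], label: str) -> list[dict]:
    return [{"text": chunk, "label": label} for chunk in chunks]


tasks = [ToyTask(split_sentences), ToyTask(tag_chunks, {"label": "note"})]
result = asyncio.run(run_toy_tasks(tasks, "NLP is a field. It studies language."))
print(result)
```

Each task only needs to accept the previous task's output as its first argument, which keeps tasks independent and easy to reorder.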

Vector retrieval, Graphs and LLMs

Cognee supports a variety of tools and services for different operations:

Modular: Cognee is modular by nature, using tasks grouped into pipelines.
Local Setup: By default, LanceDB runs locally with NetworkX and OpenAI.
Vector Stores: Cognee supports LanceDB, Qdrant, PGVector and Weaviate for vector storage.
Language Models (LLMs): You can use either Anyscale or Ollama as your LLM provider.
Graph Stores: In addition to NetworkX, Neo4j is also supported for graph storage.
User management: Create individual user graphs and manage permissions.

Demo

Check out our demo notebook here

Get Started

Install Server

Please see the cognee Quick Start Guide for important configuration information.

docker compose up

Install SDK

Please see the cognee Development Guide for important beta information and usage instructions.

pip install cognee

Vector & Graph Databases Implementation State

Name	Type	Current state	Known Issues
Qdrant	Vector	Stable ✅
Weaviate	Vector	Stable ✅
LanceDB	Vector	Stable ✅
Neo4j	Graph	Stable ✅
NetworkX	Graph	Stable ✅
FalkorDB	Vector/Graph	Unstable ❌
PGVector	Vector	Unstable ❌	Postgres DB returns a timeout error