Rewrite cognee documentation and apply theme (#130)

* Update docs

* fix: add cognee colors and logo

* fix: add link to community discord

---------

Co-authored-by: Boris Arzentar <borisarzentar@gmail.com>
Vasilije 2024-08-22 13:38:16 +02:00 committed by GitHub
parent 7c8efb0a57
commit 22c0dd5b2d
13 changed files with 792 additions and 509 deletions

View file

@ -232,13 +232,15 @@ async def add(
datasetId,
)
return JSONResponse(
status_code=200,
content="OK"
status_code = 200,
content = {
"message": "OK"
}
)
except Exception as error:
return JSONResponse(
status_code=409,
content={"error": str(error)}
status_code = 409,
content = {"error": str(error)}
)
class CognifyPayload(BaseModel):
@ -252,7 +254,9 @@ async def cognify(payload: CognifyPayload):
await cognee_cognify(payload.datasets)
return JSONResponse(
status_code = 200,
content = "OK"
content = {
"message": "OK"
}
)
except Exception as error:
return JSONResponse(

View file

@ -1,122 +1,262 @@
# cognee API Reference
# Cognee API Reference
## Overview
The Cognee API provides a set of endpoints for managing datasets, performing cognitive tasks, and configuring various settings in the system. The API is built on FastAPI and includes multiple routes to handle different functionalities. This reference outlines the available endpoints and their usage.
The Cognee API has:
## Base URL
1. Python library configuration entry points
2. FastAPI server
The base URL for all API requests is determined by the server's deployment environment. Typically, this will be:
- **Development**: `http://localhost:8000`
- **Production**: Depends on your server setup.
## Python Library
## Endpoints
# Module: cognee.config
### 1. Root
This module provides functionalities to configure various aspects of the system's operation in the cognee library.
It interfaces with a set of Pydantic settings singleton classes to manage the system's configuration.
- **URL**: `/`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Root endpoint that returns a welcome message.
**Response**:
```json
{
"message": "Hello, World, I am alive!"
}
```
## Overview
The config class in this module offers a series of static methods to configure the system's directories, various machine learning models, and other parameters.
### 2. Health Check
- **URL**: `/health`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Health check endpoint that returns the server status.
**Response**:
```json
{
"status": "OK"
}
```
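For a quick check from Python, a minimal sketch using `requests` (the base URL is the development default from above):

```python
import requests

# Base URL for a local development deployment (see "Base URL" above).
BASE_URL = "http://localhost:8000"

response = requests.get(f"{BASE_URL}/health")
print(response.json())  # expected: {"status": "OK"}
```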
## Methods
### 3. Get Datasets
- **URL**: `/datasets`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Retrieve a list of available datasets.
**Response**:
```json
[
{
"id": "dataset_id_1",
"name": "Dataset Name 1",
"description": "Description of Dataset 1",
...
},
...
]
```
### system_root_directory(system_root_directory: str)
Sets the root directory of the system, where essential system files and operations are managed.
Parameters:
system_root_directory (str): The path to set as the system's root directory.
Example:
```python
cognee.config.system_root_directory('/path/to/system/root')
```
### 4. Delete Dataset
### data_root_directory(data_root_directory: str)
Sets the directory for storing data used and generated by the system.
Parameters:
data_root_directory (str): The path to set as the data root directory.
- **URL**: `/datasets/{dataset_id}`
- **Method**: `DELETE`
- **Auth Required**: No
- **Description**: Delete a specific dataset by its ID.
**Path Parameters**:
- `dataset_id`: The ID of the dataset to delete.
**Response**:
```json
{
"status": "OK"
}
```
Example:
```python
import cognee
cognee.config.data_root_directory('/path/to/data/root')
```
### 5. Get Dataset Graph
### set_classification_model(classification_model: object)
Assigns a machine learning model for classification tasks within the system.
Parameters:
classification_model (object): The Pydantic model to use for classification.
Check cognee.shared.data_models for existing models.
Example:
```python
import cognee
cognee.config.set_classification_model(model)
```
- **URL**: `/datasets/{dataset_id}/graph`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Retrieve the graph visualization URL for a specific dataset.
**Path Parameters**:
- `dataset_id`: The ID of the dataset.
**Response**:
```json
"http://example.com/path/to/graph"
```
### set_summarization_model(summarization_model: object)
Sets the Pydantic model to be used for summarization tasks.
Parameters:
summarization_model (object): The model to use for summarization.
Check cognee.shared.data_models for existing models.
Example:
```python
import cognee
cognee.config.set_summarization_model(my_summarization_model)
```
### 6. Get Dataset Data
### set_llm_model(llm_model: object)
Sets the model to use for LLM tasks.
Parameters:
llm_model (object): The model to use for LLMs.
Example:
```python
import cognee
cognee.config.set_llm_model("openai")
```
- **URL**: `/datasets/{dataset_id}/data`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Retrieve data associated with a specific dataset.
**Path Parameters**:
- `dataset_id`: The ID of the dataset.
**Response**:
```json
[
{
"data_id": "data_id_1",
"content": "Data content here",
...
},
...
]
```
### graph_database_provider(graph_engine: string)
Sets the engine to manage graph processing tasks.
Parameters:
graph_engine (str): The engine to use for graph processing tasks.
Example:
```python
from cognee.shared.data_models import GraphDBType
### 7. Get Dataset Status
cognee.config.set_graph_engine(GraphDBType.NEO4J)
```
- **URL**: `/datasets/status`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Retrieve the status of one or more datasets.
**Query Parameters**:
- `dataset`: A list of dataset IDs to check status for.
**Response**:
```json
{
"dataset_id_1": "Status 1",
"dataset_id_2": "Status 2",
...
}
```
### 8. Get Raw Data
- **URL**: `/datasets/{dataset_id}/data/{data_id}/raw`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Retrieve the raw data file for a specific data entry in a dataset.
**Path Parameters**:
- `dataset_id`: The ID of the dataset.
- `data_id`: The ID of the data entry.
**Response**: Raw file download.
## API
### 9. Add Data
- **URL**: `/add`
- **Method**: `POST`
- **Auth Required**: No
- **Description**: Add new data to a dataset. The data can be uploaded from a file or a URL.
**Form Parameters**:
- `datasetId`: The ID of the dataset to add data to.
- `data`: A list of files to upload.
**Request**
```json
{
"dataset_id": "ID_OF_THE_DATASET_TO_PUT_DATA_IN", // Optional, we use "main" as default.
"files": File[]
}
```
**Response**:
```json
{
"message": "OK"
}
```
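As an illustration, a hedged sketch of calling this endpoint with Python's `requests` library; the base URL and file name are placeholders:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed development server

# "example.pdf" is a placeholder for any local file you want to ingest.
with open("example.pdf", "rb") as file:
    response = requests.post(
        f"{BASE_URL}/add",
        data={"datasetId": "main"},
        files=[("data", file)],
    )

print(response.status_code, response.json())  # expected: 200 {"message": "OK"}
```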
For each API endpoint, provide the following details:
### 10. Cognify
- **URL**: `/cognify`
- **Method**: `POST`
- **Auth Required**: No
- **Description**: Perform cognitive processing on the specified datasets.
**Request Body**:
```json
{
"datasets": ["ID_OF_THE_DATASET_1", "ID_OF_THE_DATASET_2", ...]
}
```
**Response**:
```json
{
"message": "OK"
}
```
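A similar sketch for triggering processing from Python (base URL and dataset ID are placeholders):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed development server

response = requests.post(f"{BASE_URL}/cognify", json={"datasets": ["main"]})
print(response.json())  # expected: {"message": "OK"}
```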
### 11. Search
### Endpoint 1: Root
- URL: /add
- Method: POST
- Auth Required: No
- Description: Root endpoint that returns a welcome message.
- **URL**: `/search`
- **Method**: `POST`
- **Auth Required**: No
- **Description**: Search for nodes in the graph based on the provided query parameters.
<!-- **Request Body**:
- `query_params`: A dictionary of query parameters. -->
**Request Body**:
```json
{
"query_params": [{
"query": "QUERY_TO_MATCH_DATA",
"searchType": "SIMILARITY", // or TRAVERSE, ADJACENT, SUMMARY
}]
}
```
**Response**
```json
{
"results": [
{
"node_id": "node_id_1",
"attributes": {...},
...
},
...
]
}
```
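And a sketch of issuing a search from Python, reusing the request body shown above (base URL is a placeholder):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed development server

response = requests.post(
    f"{BASE_URL}/search",
    json={
        "query_params": [{
            "query": "Tell me about NLP",
            "searchType": "SIMILARITY",
        }]
    },
)
print(response.json())
```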
#### Response
```json
{
"message": "Hello, World, I am alive!"
}
```
### 12. Get Settings
### Endpoint 1: Health Check
- URL: /health
- Method: GET
- Auth Required: No
- Description: Health check endpoint that returns the server status.
#### Response
```json
{
"status": "OK"
}
```
- **URL**: `/settings`
- **Method**: `GET`
- **Auth Required**: No
- **Description**: Retrieve the current system settings.
**Response**:
```json
{
"llm": {...},
"vectorDB": {...},
...
}
```
More endpoints are available in the FastAPI server. Documentation is in progress.
### 13. Save Settings
- **URL**: `/settings`
- **Method**: `POST`
- **Auth Required**: No
- **Description**: Save new settings for the system, including LLM and vector DB configurations.
**Request Body**:
- `llm`: Optional. The configuration for the LLM provider.
- `vectorDB`: Optional. The configuration for the vector database provider.
**Response**:
```json
{
"status": "OK"
}
```

BIN
docs/assets/favicon.png Normal file

Binary file not shown.


BIN
docs/assets/logo.png Normal file

Binary file not shown.


View file

@ -32,21 +32,14 @@ Data Types and Their Handling
### Concept 2: Data Enrichment with LLMs
LLMs are adept at processing unstructured data. They can easily extract summaries, keywords, and other useful information from documents. We use function calling with Pydantic models to extract the data and dspy to train our functions.
LLMs are adept at processing unstructured data. They can easily extract summaries, keywords, and other useful information from documents. We use function calling with Pydantic models to extract information from the unstructured data.
<figure markdown>
![Data Enrichment](img/enrichment.png)
<figcaption>Data Enrichment Example</figcaption>
</figure>
We decompose the loaded content into graphs, allowing us to more precisely map out the relationships between entities and concepts.
### Concept 3: Linguistic Analysis
LLMs are probabilistic models, meaning they can make mistakes.
To mitigate this, we can use a combination of NLP and LLMs to determine how to analyze the data and score each part of the text.
<figure markdown>
![Linguistic analysis](img/linguistic_analysis.png)
<figcaption>Linguistic analysis</figcaption>
</figure>
### Concept 4: Graphs
### Concept 3: Graphs
Knowledge graphs simply map out knowledge, linking specific facts and their connections.
When Large Language Models (LLMs) process text, they infer these links, leading to occasional inaccuracies due to their probabilistic nature.
@ -57,11 +50,12 @@ This structured approach can extend beyond concepts to document layouts, pages,
![Graph structure](img/graph_structure.png)
<figcaption>Graph Structure</figcaption>
</figure>
### Concept 5: Vector and Graph Retrieval
### Concept 4: Vector and Graph Retrieval
Cognee lets you use multiple vector and graph retrieval methods to find the most relevant information.
!!! info "Learn more?"
Check out learning materials to see how you can use these methods in your projects.
### Concept 6: Auto-Optimizing Pipelines
### Concept 5: Auto-Optimizing Pipelines
Integrating knowledge graphs into Retrieval-Augmented Generation (RAG) pipelines leads to an intriguing outcome: the system's adeptness at contextual understanding allows it to be evaluated in a way Machine Learning (ML) engineers are accustomed to.
This involves bombarding the RAG system with hundreds of synthetic questions, enabling the knowledge graph to evolve and refine its context autonomously over time.
@ -80,10 +74,9 @@ Main components:
- **Data Pipelines**: Responsible for ingesting, processing, and transforming data from various sources.
- **LLMs**: Large Language Models that process unstructured data and generate text.
- **Graphs**: Knowledge graphs that represent relationships between entities and concepts.
- **Vector Stores**: Databases that store vector representations of data for efficient retrieval.
- **dspy module**: Pipelines that automatically adjust based on feedback and data changes.
- **Search wrapper**: Retrieves relevant information from the knowledge graph and vector stores.
- **Graph Store**: Knowledge graphs that represent relationships between entities and concepts.
- **Vector Store**: Database that stores vector representations of data for efficient retrieval.
- **Search**: Retrieves relevant information from the knowledge graph and vector stores.
## How It Fits Into Your Projects

93
docs/configuration.md Normal file
View file

@ -0,0 +1,93 @@
# Configuration
## 🚀 Configure Vector and Graph Stores
You can configure the vector and graph stores using the environment variables in your .env file or programmatically.
We use [Pydantic Settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/#dotenv-env-support).
We have a global configuration object (cognee.config) and individual configurations at the pipeline and data store levels.
Check available configuration options:
``` python
from cognee.infrastructure.databases.vector import get_vectordb_config
from cognee.infrastructure.databases.graph.config import get_graph_config
from cognee.infrastructure.databases.relational import get_relational_config
print(get_vectordb_config().to_dict())
print(get_graph_config().to_dict())
print(get_relational_config().to_dict())
```
Set the environment variables in your .env file, and Pydantic will pick them up:
```bash
GRAPH_DATABASE_PROVIDER = 'lancedb'
```
Otherwise, you can set the configuration yourself:
```python
cognee.config.llm_provider = 'ollama'
```
## 🚀 Getting Started with Local Models
You'll need to run the local model on your machine or use one of the providers hosting the model.
!!! note "We had some success with mixtral, but 7b models did not work well. We recommend using mixtral for now."
### Ollama
Set up Ollama by following instructions on [Ollama website](https://ollama.com/)
Set the environment variable in your .env file to use the model:
```bash
LLM_PROVIDER = 'ollama'
```
Otherwise, you can set the configuration for the model:
```python
cognee.config.llm_provider = 'ollama'
```
You can also set the HOST and model name:
```python
cognee.config.llm_endpoint = "http://localhost:11434/v1"
cognee.config.llm_model = "mistral:instruct"
```
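Putting the Ollama settings together, a minimal Python sketch (the endpoint and model name are the values used above and may differ on your machine):

```python
import cognee

# Assumes Ollama is running locally and serving an OpenAI-compatible API.
cognee.config.llm_provider = "ollama"
cognee.config.llm_endpoint = "http://localhost:11434/v1"
cognee.config.llm_model = "mistral:instruct"
```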
### Anyscale
```bash
LLM_PROVIDER = 'custom'
```
Otherwise, you can set the configuration for the model:
```python
cognee.config.llm_provider = 'custom'
```
You can also set the HOST and model name:
```bash
LLM_MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
LLM_ENDPOINT = "https://api.endpoints.anyscale.com/v1"
LLM_API_KEY = "your_api_key"
```
You can set the HOST and model name in the same way for any other provider that exposes an API endpoint.

46
docs/data_ingestion.md Normal file
View file

@ -0,0 +1,46 @@
# How data ingestion with cognee works
# Why bother with data ingestion?
In order to use cognee, you need to ingest data into the cognee data store.
This data can be events, customer data, or third-party data.
In order to build reliable models and pipelines, we need to structure and process various types of datasets and data sources in the same way.
Some of the operations like normalization, deduplication, and data cleaning are common across all data sources.
This is where cognee comes in. It provides a unified interface to ingest data from various sources and process it in a consistent way.
For this, we use dlt (Data Load Tool), which is part of the cognee infrastructure.
# Example
Let's say you have a dataset of customer reviews in a PDF file. You want to ingest this data into cognee and use it to train a model.
You can use the following code to ingest the data:
```python
dataset_name = "artificial_intelligence"
ai_text_file_path = os.path.join(pathlib.Path(__file__).parent, "test_data/artificial-intelligence.pdf")
await cognee.add([ai_text_file_path], dataset_name)
```
cognee uses dlt to ingest the data and allows you to use:
1. SQL databases. Supports PostgreSQL, MySQL, MS SQL Server, BigQuery, Redshift, and more.
2. REST API generic source. Loads data from REST APIs using declarative configuration.
3. OpenAPI source generator. Generates a source from an OpenAPI 3.x spec using the REST API source.
4. Cloud and local storage. Retrieves data from AWS S3, Google Cloud Storage, Azure Blob Storage, local files, and more.
# What happens under the hood?
We use dlt as a loader to ingest data into the cognee metadata store. We can ingest data from various sources like SQL databases, REST APIs, OpenAPI specs, and cloud storage.
This enables us to have a common data model we can then use to build models and pipelines.
The models and pipelines we build in this way end up in the cognee data store, which is a unified interface to access the data.
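As a rough end-to-end sketch, the ingestion step above can be followed by processing; the dataset-list argument to `cognee.cognify` mirrors the `/cognify` API payload and is an assumption here:

```python
import asyncio
import os
import pathlib

import cognee

async def main():
    dataset_name = "artificial_intelligence"

    # Ingest the local PDF into the dataset, as in the example above.
    ai_text_file_path = os.path.join(pathlib.Path(__file__).parent, "test_data/artificial-intelligence.pdf")
    await cognee.add([ai_text_file_path], dataset_name)

    # Process the ingested data into the graph and vector stores
    # (passing a dataset list mirrors the /cognify API payload).
    await cognee.cognify([dataset_name])

asyncio.run(main())
```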

View file

@ -1,381 +1,31 @@
# cognee
# New to cognee?
The getting started guide covers adding a cognee data store to your AI app, sending data, identifying users, extracting actions and insights, and interconnecting separate datasets.
[Get started](quickstart.md)
#### Deterministic LLMs Outputs for AI Engineers
## Ingest Data
Learn how to manage the ingestion of events, customer data, or third-party data for use with cognee.
[Explore](data_ingestion.md)
_Open-source framework for loading and structuring LLM context to create accurate and explainable AI solutions using knowledge graphs and vector stores_
## Tasks and Pipelines
Analyze and enrich your data and improve LLM answers with a series of tasks and pipelines.
[Learn about tasks](templates.md)
---
## API
Push or pull data to build custom functionality or create bespoke views for your business needs.
[Explore](api_reference.md)
[![Twitter Follow](https://img.shields.io/twitter/follow/tricalt?style=social)](https://twitter.com/tricalt)
## Resources
### Resources
[![Downloads](https://img.shields.io/pypi/dm/cognee.svg)](https://pypi.python.org/pypi/cognee)
- [Research](research.md)
- [Community](https://discord.gg/52QTb5JK){:target="_blank"}
[![Star on GitHub](https://img.shields.io/github/stars/topoteretes/cognee.svg?style=social)](https://github.com/topoteretes/cognee)
### Let's learn about cogneeHub!
cogneeHub is a free, open-source learning platform for those interested in creating deterministic LLM outputs. We help developers by adding graphs, LLMs, and vector retrieval to their Machine Learning stack.
- **Get started** — [Get started with cognee quickly and try it out for yourself.](quickstart.md)
- **Conceptual Overview** — Learn about the [core concepts](conceptual_overview.md) of cognee and how it fits into your projects.
- **Data Engineering and LLMOps** — Learn about some [data engineering and llmops](data_engineering_llm_ops.md) core concepts that will help you build better AI apps.
- **RAGs** — We provide easy-to-follow [learning materials](rags.md) to help you learn about RAGs.
- **Research** — A list of resources to help you learn more about [cognee and LLM memory research](research.md)
- **Blog** — A blog where you can read about the [latest news and updates](blog/index.md) about cognee.
- **Support** — [Book time](https://www.cognee.ai/#bookTime) with our team.
[//]: # (- **Case Studies** — Read about [case studies](case_studies.md) that show how cognee can be used in real-world applications.)
### Vision
![Vision](img/roadmap.png)
### Architecture
![Architecture](img/architecture.png)
### Why use cognee?
The question of whether to use cognee is fundamentally a question of why you want deterministic outputs from your LLM workflows.
1. **Cost-effective** — cognee extends the capabilities of your LLMs without the need for expensive data processing tools.
2. **Self-contained** — cognee runs as a library and is simple to use
3. **Interpretable** — Navigate graphs instead of embeddings to understand your data.
4. **User Guided** — cognee lets you control your input and provide your own Pydantic data models
## License
This project is licensed under the terms of the Apache License 2.0.
[//]: # (<style>)
[//]: # ()
[//]: # (.container {)
[//]: # ()
[//]: # ( display: flex;)
[//]: # ()
[//]: # ( justify-content: space-around;)
[//]: # ()
[//]: # ( margin-top: 20px;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.container div {)
[//]: # ()
[//]: # ( width: 28%;)
[//]: # ()
[//]: # ( padding: 20px;)
[//]: # ()
[//]: # ( box-sizing: border-box;)
[//]: # ()
[//]: # ( border: 1px solid #e0e0e0;)
[//]: # ()
[//]: # ( border-radius: 8px;)
[//]: # ()
[//]: # ( background-color: #f9f9f9;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.container h2 {)
[//]: # ()
[//]: # ( font-size: 1.25em;)
[//]: # ()
[//]: # ( margin-bottom: 10px;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.container p {)
[//]: # ()
[//]: # ( margin-bottom: 20px;)
[//]: # ()
[//]: # ( line-height: 1.6;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.button-container {)
[//]: # ()
[//]: # ( text-align: center;)
[//]: # ()
[//]: # ( margin: 30px 0;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.button-container a {)
[//]: # ()
[//]: # ( display: inline-block;)
[//]: # ()
[//]: # ( padding: 15px 25px;)
[//]: # ()
[//]: # ( background-color: #007bff;)
[//]: # ()
[//]: # ( color: white;)
[//]: # ()
[//]: # ( text-decoration: none;)
[//]: # ()
[//]: # ( border-radius: 5px;)
[//]: # ()
[//]: # ( font-size: 1em;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.button-container a:hover {)
[//]: # ()
[//]: # ( background-color: #0056b3;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.resources {)
[//]: # ()
[//]: # ( margin-top: 40px;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.resources h2 {)
[//]: # ()
[//]: # ( font-size: 1.5em;)
[//]: # ()
[//]: # ( margin-bottom: 20px;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.resources ul {)
[//]: # ()
[//]: # ( list-style: none;)
[//]: # ()
[//]: # ( padding: 0;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.resources li {)
[//]: # ()
[//]: # ( margin-bottom: 10px;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.resources a {)
[//]: # ()
[//]: # ( color: #007bff;)
[//]: # ()
[//]: # ( text-decoration: none;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # ()
[//]: # (.resources a:hover {)
[//]: # ()
[//]: # ( text-decoration: underline;)
[//]: # ()
[//]: # (})
[//]: # ()
[//]: # (</style>)
[//]: # ()
[//]: # ()
[//]: # (# New to cognee?)
[//]: # ()
[//]: # ()
[//]: # (The getting started guide covers adding a GraphRAG data store to your AI app, sending events, identifying users, extracting actions and insights, and interconnecting separate datasets.)
[//]: # ()
[//]: # ()
[//]: # (<div class="button-container">)
[//]: # ()
[//]: # ( <a href="./quickstart.md">Get started</a>)
[//]: # ()
[//]: # (</div>)
[//]: # ()
[//]: # ()
[//]: # (<div class="container">)
[//]: # ()
[//]: # ( <div>)
[//]: # ()
[//]: # ( <h2>Ingest Data</h2>)
[//]: # ()
[//]: # ( <p>Learn how to manage ingestion of events, customer data or third party data for use with cognee.</p>)
[//]: # ()
[//]: # ( <a href="#">Explore</a>)
[//]: # ()
[//]: # ( </div>)
[//]: # ()
[//]: # ( <div>)
[//]: # ()
[//]: # ( <h2>Templates</h2>)
[//]: # ()
[//]: # ( <p>Analyze and enrich your data and improve LLM answers with a series of templates using cognee tasks and pipelines.</p>)
[//]: # ()
[//]: # ( <a href="#">Browse templates</a>)
[//]: # ()
[//]: # ( </div>)
[//]: # ()
[//]: # ( <div>)
[//]: # ()
[//]: # ( <h2>API</h2>)
[//]: # ()
[//]: # ( <p>Push or pull data to build custom functionality or create bespoke views for your business needs.</p>)
[//]: # ()
[//]: # ( <a href="#">Explore</a>)
[//]: # ()
[//]: # ( </div>)
[//]: # ()
[//]: # (</div>)
[//]: # ()
[//]: # ()
[//]: # (<div class="resources">)
[//]: # ()
[//]: # ( <h2>Resources</h2>)
[//]: # ()
[//]: # ( <ul>)
[//]: # ()
[//]: # ( <li><a href="#">What is GraphRAG</a></li>)
[//]: # ()
[//]: # ( <li><a href="#">Research</a></li>)
[//]: # ()
[//]: # ( <li><a href="#">Community</a></li>)
[//]: # ()
[//]: # ( <li><a href="#">Community</a></li>)
[//]: # ()
[//]: # ( <li><a href="#">API Reference</a></li>)
[//]: # ()
[//]: # ( <li><a href="#">Support</a></li>)
[//]: # ()
[//]: # ( </ul>)
[//]: # ()
[//]: # (</div>)

View file

@ -4,11 +4,17 @@
## Setup
You will need a Weaviate instance and an OpenAI API key to use cognee.
Weaviate lets you run an instance for 14 days for free. You can sign up at their website: [Weaviate](https://www.semi.technology/products/weaviate.html)
To run cognee, you will need the following:
1. A running Postgres instance
2. An OpenAI API key (Ollama or Anyscale can work as [well](local_models.md))
You can also use Ollama or Anyscale as your LLM provider. For more info on local models, check [here](local_models.md).
Navigate to the cognee folder and run:
```
docker compose up postgres
```
Add your LLM API key to the environment variables
```
import os
@ -28,6 +34,9 @@ If you are using Networkx, create an account on Graphistry to vizualize results:
```
## Run
cognee is asynchronous by design, meaning that operations like adding information, processing it, and querying it can run concurrently without blocking the execution of other tasks.
Make sure to await the results of the functions that you call.
```
import cognee
@ -42,4 +51,7 @@ search_results = cognee.search("SIMILARITY", {'query': 'Tell me about NLP'}) # Q
for result_text in search_results[0]:
print(result_text)
```
```
In the example above, we add a piece of information to cognee, use LLMs to create a GraphRAG, and then query cognee for the knowledge.
cognee is composable, and you can build your own cognee pipelines using our [templates](templates.md).
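Since every call is asynchronous, here is a minimal sketch of wrapping the quickstart flow in an event loop; the exact `add` and `cognify` arguments are assumptions based on the examples in these docs:

```python
import asyncio

import cognee

async def main():
    # Add a piece of text, build the knowledge store, then query it.
    # (Adding raw text without a dataset name is an assumption here.)
    await cognee.add("Natural language processing (NLP) is a field of AI.")
    await cognee.cognify()

    search_results = await cognee.search("SIMILARITY", {"query": "Tell me about NLP"})
    for result_text in search_results[0]:
        print(result_text)

asyncio.run(main())
```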

59
docs/search.md Normal file
View file

@ -0,0 +1,59 @@
## Cognee Search Module
This module contains the search function that is used to search for nodes in the graph. It supports various search types and integrates with user permissions to filter results accordingly.
### Search Types
The `SearchType` enum defines the different types of searches that can be performed:
- `ADJACENT`: Search for nodes adjacent to a given node.
- `TRAVERSE`: Traverse the graph to find related nodes.
- `SIMILARITY`: Find nodes similar to a given node.
- `SUMMARY`: Retrieve a summary of the node.
- `SUMMARY_CLASSIFICATION`: Classify the summary of the node.
- `NODE_CLASSIFICATION`: Classify the node.
- `DOCUMENT_CLASSIFICATION`: Classify the document.
- `CYPHER`: Perform a Cypher query on the graph.
### Search Parameters
The `SearchParameters` class is a Pydantic model that validates and holds the search parameters:
```python
class SearchParameters(BaseModel):
search_type: SearchType
params: Dict[str, Any]
@field_validator("search_type", mode="before")
def convert_string_to_enum(cls, value):
if isinstance(value, str):
return SearchType.from_str(value)
return value
```
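For example, a small sketch of how the validator behaves when given the string form (assuming the imports above are in scope):

```python
# The "before" validator converts the string into a SearchType member.
params = SearchParameters(
    search_type="SIMILARITY",
    params={"query": "Tell me about NLP"},
)
assert params.search_type == SearchType.SIMILARITY
```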
### Search Function
The `search` function is the main entry point for performing a search. It handles user authentication, retrieves document IDs for the user, and filters the search results based on user permissions.
```python
async def search(search_type: str, params: Dict[str, Any], user: User = None) -> List:
if user is None:
user = await get_default_user()
own_document_ids = await get_document_ids_for_user(user.id)
search_params = SearchParameters(search_type=search_type, params=params)
search_results = await specific_search([search_params])
from uuid import UUID
filtered_search_results = []
for search_result in search_results:
document_id = search_result["document_id"] if "document_id" in search_result else None
document_id = UUID(document_id) if type(document_id) == str else document_id
if document_id is None or document_id in own_document_ids:
filtered_search_results.append(search_result)
return filtered_search_results
```
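A hedged usage sketch; the import path for `search` is omitted here and depends on where the module lives in your installation:

```python
import asyncio

async def main():
    # `search` is the function shown above; the default user is resolved
    # automatically when no user is passed.
    results = await search("SIMILARITY", {"query": "Tell me about NLP"})
    for result in results:
        print(result)

asyncio.run(main())
```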

View file

@ -0,0 +1,51 @@
[data-md-color-scheme = "cognee"] {
color-scheme: dark;
--md-default-bg-color: #0C0121;
--md-default-bg-color--light: #240067;
--md-default-fg-color: #57DFD7;
--md-default-fg-color--light: #85ded8;
--md-default-fg-color--dark: #4dc6be;
/* --md-primary-fg-color: #0C0121; */
--md-primary-fg-color: #7233BA;
--md-primary-fg-color--light: #8a49d4;
--md-primary-fg-color--dark: #522488;
/* --md-primary-bg-color: hsla(0, 0%, 100%, 1);
--md-primary-bg-color--light: */
--md-accent-fg-color: #41a29b;
--md-typeset-color: white;
--md-typeset-a-color: #57DFD7;
--md-footer-bg-color: #0C0121;
--md-footer-bg-color--dark: #0C0121;
}
.md-header {
background-color: var(--md-default-bg-color);
}
/* Remove unnecessary title from the header */
.md-header__title {
display: none;
}
/* Spread header elements evenly when there is no title */
.md-header__inner {
justify-content: space-between;
}
.md-tabs {
background-color: var(--md-default-bg-color);
}
.md-button--primary:hover {
background-color: #8a49d4 !important;
}
.md-typeset .md-button {
border-radius: 32px;
}

242
docs/templates.md Normal file
View file

@ -0,0 +1,242 @@
# TASKS
!!! tip "cognee uses tasks grouped into pipelines to populate graph and vector stores"
Cognee uses tasks grouped into pipelines to populate graph and vector stores. These tasks are designed to analyze and enrich your data, improving the answers generated by Large Language Models (LLMs).
In this section, you'll find a template that you can use to structure your data and build pipelines.
These tasks are designed to help you get started with cognee and build reliable LLM pipelines.
## Task 1: Category Extraction
Data enrichment is the process of enhancing raw data with additional information to make it more valuable. This template is a sample task that extracts categories from a document and populates a graph with the extracted categories.
Let's go over the steps to use this template ([full code provided here](https://github.com/topoteretes/cognee/blob/main/cognee/tasks/chunk_naive_llm_classifier/chunk_naive_llm_classifier.py)):
This function is designed to classify chunks of text using a specified language model. The goal is to categorize the text, map relationships, and store the results in a vector engine and a graph engine. The function is asynchronous, allowing for concurrent execution of tasks like classification and data point creation.
### Parameters
- `data_chunks: list[DocumentChunk]`: A list of text chunks to be classified. Each chunk represents a piece of text and includes metadata like `chunk_id` and `document_id`.
- `classification_model: Type[BaseModel]`: The model used to classify each chunk of text. This model is expected to output labels that categorize the text.
### Steps in the Function
#### Check for Empty Input
```python
if len(data_chunks) == 0:
return data_chunks
```
If there are no data chunks provided, the function returns immediately with the input list (which is empty).
#### Classify Each Chunk
```python
chunk_classifications = await asyncio.gather(
*[extract_categories(chunk.text, classification_model) for chunk in data_chunks],
)
```
The function uses `asyncio.gather` to concurrently classify each chunk of text. `extract_categories` is called for each chunk, and the results are collected in `chunk_classifications`.
#### Initialize Data Structures
```python
classification_data_points = []
```
A list is initialized to store the classification data points that will be used later for mapping relationships and storing in the vector engine.
#### Generate UUIDs for Classifications
The function loops through each chunk and generates unique identifiers (UUIDs) for both the main classification type and its subclasses:
```python
classification_data_points.append(uuid5(NAMESPACE_OID, chunk_classification.label.type))
classification_data_points.append(uuid5(NAMESPACE_OID, classification_subclass.value))
```
These UUIDs are used to uniquely identify classifications and ensure consistency.
#### Retrieve or Create Vector Collection
```python
vector_engine = get_vector_engine()
collection_name = "classification"
```
The function interacts with a vector engine. It checks if the collection named "classification" exists. If it does, it retrieves existing data points to avoid duplicates. Otherwise, it creates the collection.
#### Prepare Data Points, Nodes, and Edges
The function then builds a list of `data_points` (representing the classification results) and constructs nodes and edges to represent relationships between chunks and their classifications:
```python
data_points.append(DataPoint[Keyword](...))
nodes.append((...))
edges.append((...))
```
- **Nodes**: Represent classifications (e.g., media type, subtype).
- **Edges**: Represent relationships between chunks and classifications (e.g., "is_media_type", "is_subtype_of").
#### Create Data Points and Relationships
If there are new nodes or edges to add, the function stores the data points in the vector engine and updates the graph engine with the new nodes and edges:
```python
await vector_engine.create_data_points(collection_name, data_points)
await graph_engine.add_nodes(nodes)
await graph_engine.add_edges(edges)
```
#### Return the Processed Chunks
Finally, the function returns the processed `data_chunks`, which can now be used further as needed:
```python
return data_chunks
```
## Pipeline 1: cognee pipeline
This is the main pipeline currently implemented in cognee. It is designed to process data in a structured way and populate the graph and vector stores with the results.
This function is the entry point for processing datasets. It handles dataset retrieval, user authorization, and manages the execution of a pipeline of tasks that process documents.
### Parameters
- `datasets: Union[str, list[str]] = None`: A string or list of dataset names to be processed.
- `user: User = None`: The user requesting the processing. If not provided, the default user is retrieved.
### Steps in the Function
#### Database Engine Initialization
```python
db_engine = get_relational_engine()
```
The function starts by getting an instance of the relational database engine, which is used to retrieve datasets and other necessary data.
#### Handle Empty or String Dataset Input
```python
if datasets is None or len(datasets) == 0:
return await cognify(await db_engine.get_datasets())
if type(datasets[0]) == str:
datasets = await retrieve_datasets(datasets)
```
If no datasets are provided, the function retrieves all available datasets from the database. If a list of dataset names (strings) is provided, they are converted into dataset objects.
#### User Authentication
```python
if user is None:
user = await get_default_user()
```
If no user is provided, the function retrieves the default user.
#### Run Cognify Pipeline for Each Dataset
```python
async def run_cognify_pipeline(dataset: Dataset):
# Pipeline logic goes here...
```
The `run_cognify_pipeline` function is defined within `cognify` and is responsible for processing a single dataset. This is where most of the heavy lifting occurs.
#### Retrieve Dataset Data
The function fetches all the data associated with the dataset.
```python
data: list[Data] = await get_dataset_data(dataset_id=dataset.id)
```
#### Create Document Objects
Based on the file type (e.g., PDF, Audio, Image, Text), corresponding document objects are created.
```python
documents = [...]
```
#### Check Permissions
The user's permissions are checked to ensure they can access the documents.
```python
await check_permissions_on_documents(user, "read", document_ids)
```
#### Pipeline Status Logging
The function logs the start and end of the pipeline processing.
```python
async with update_status_lock:
task_status = await get_pipeline_status([dataset_id])
if dataset_id in task_status and task_status[dataset_id] == "DATASET_PROCESSING_STARTED":
logger.info("Dataset %s is already being processed.", dataset_name)
return
await log_pipeline_status(dataset_id, "DATASET_PROCESSING_STARTED", {...})
```
#### Pipeline Tasks
The pipeline consists of several tasks, each responsible for different parts of the processing:
- `document_to_ontology`: Maps documents to an ontology structure.
- `source_documents_to_chunks`: Splits documents into chunks.
- `chunk_to_graph_decomposition`: Defines the graph structure for chunks.
- `chunks_into_graph`: Integrates chunks into the knowledge graph.
- `chunk_update_check`: Checks for updated or new chunks.
- `save_chunks_to_store`: Saves chunks to a vector store and graph database.
- Parallel tasks: `chunk_extract_summary` and `chunk_naive_llm_classifier` run in parallel to summarize and classify chunks.
- `chunk_remove_disconnected`: Cleans up obsolete chunks.
The tasks are managed and executed asynchronously using the `run_tasks` and `run_tasks_parallel` functions.
```python
pipeline = run_tasks(tasks, documents)
async for result in pipeline:
print(result)
```
#### Handle Errors
If any errors occur during processing, they are logged, and the exception is raised.
```python
except Exception as error:
await log_pipeline_status(dataset_id, "DATASET_PROCESSING_ERROR", {...})
raise error
```
#### Processing Multiple Datasets
The function prepares to process multiple datasets concurrently using `asyncio.gather`.
```python
awaitables = []
for dataset in datasets:
dataset_name = generate_dataset_name(dataset.name)
if dataset_name in existing_datasets:
awaitables.append(run_cognify_pipeline(dataset))
return await asyncio.gather(*awaitables)
```
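In practice, the whole pipeline is kicked off through the public `cognify` entry point; a minimal sketch (the dataset name is a placeholder):

```python
import asyncio

import cognee

# Kick off the cognify pipeline described above for a single dataset
# ("my_dataset" is a placeholder name).
asyncio.run(cognee.cognify(["my_dataset"]))
```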

View file

@ -8,6 +8,8 @@ edit_uri: edit/main/docs/
copyright: Copyright &copy; 2024 cognee
theme:
name: material
logo: assets/logo.png
favicon: assets/favicon.png
icon:
repo: fontawesome/brands/github
edit: material/pencil
@ -45,31 +47,25 @@ theme:
- navigation.prune
- navigation.sections
- navigation.tabs
# - navigation.tabs.sticky
- navigation.top
- navigation.tracking
- navigation.path
- search.highlight
- search.share
- search.suggest
- toc.follow
# - toc.integrate
palette:
- scheme: default
primary: black
accent: indigo
toggle:
icon: material/brightness-7
name: Switch to dark mode
- scheme: slate
primary: black
accent: indigo
toggle:
icon: material/brightness-4
name: Switch to light mode
- scheme: cognee
primary: custom
font:
text: Roboto
code: Roboto Mono
custom_dir: docs/overrides
extra_css:
- stylesheets/extra.css
# Extensions
markdown_extensions:
- abbr
@ -117,25 +113,22 @@ markdown_extensions:
- pymdownx.tasklist:
custom_checkbox: true
nav:
- Introduction:
- Welcome to cognee: 'index.md'
- Quickstart: 'quickstart.md'
- Conceptual overview: 'conceptual_overview.md'
- Learning materials: 'rags.md'
- Blog: 'blog/index.md'
- Research: 'research.md'
- Local models: 'local_models.md'
- Api reference: 'api_reference.md'
- Overview:
- Overview: 'index.md'
- Start here:
- Installation: 'quickstart.md'
- Add data: 'data_ingestion.md'
- Create LLM enriched data store: 'templates.md'
- Explore data: 'search.md'
# - SDK:
# - Overview: 'sdk_overview.md'
- Configuration: 'configuration.md'
- What is cognee:
- Introduction: 'conceptual_overview.md'
- API reference: 'api_reference.md'
- Blog:
- "blog/index.md"
- Why cognee:
- "why.md"
- Research:
- "research.md"
- Api reference:
- 'api_reference.md'
- Team:
- "team.md"
plugins:
- mkdocs-jupyter: