# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Cognee is an open-source AI memory platform that transforms raw data into persistent knowledge graphs for AI agents. It replaces traditional RAG (Retrieval-Augmented Generation) with an ECL (Extract, Cognify, Load) pipeline combining vector search, graph databases, and LLM-powered entity extraction.

**Requirements**: Python 3.9 - 3.12

## Development Commands

### Setup
```bash
# Create virtual environment (recommended: uv)
uv venv && source .venv/bin/activate

# Install with pip, poetry, or uv
uv pip install -e .

# Install with dev dependencies
uv pip install -e ".[dev]"

# Install with specific extras
uv pip install -e ".[postgres,neo4j,docs,chromadb]"

# Set up pre-commit hooks
pre-commit install
```

### Available Installation Extras
- **postgres** / **postgres-binary** - PostgreSQL + PGVector support
- **neo4j** - Neo4j graph database support
- **neptune** - AWS Neptune support
- **chromadb** - ChromaDB vector database
- **docs** - Document processing (unstructured library)
- **scraping** - Web scraping (Tavily, BeautifulSoup, Playwright)
- **langchain** - LangChain integration
- **llama-index** - LlamaIndex integration
- **anthropic** - Anthropic Claude models
- **gemini** - Google Gemini models
- **ollama** - Ollama local models
- **mistral** - Mistral AI models
- **groq** - Groq API support
- **llama-cpp** - Llama.cpp local inference
- **huggingface** - HuggingFace transformers
- **aws** - S3 storage backend
- **redis** - Redis caching
- **graphiti** - Graphiti-core integration
- **baml** - BAML structured output
- **dlt** - Data load tool (dlt) integration
- **docling** - Docling document processing
- **codegraph** - Code graph extraction
- **evals** - Evaluation tools
- **deepeval** - DeepEval testing framework
- **posthog** - PostHog analytics
- **monitoring** - Sentry + Langfuse observability
- **distributed** - Modal distributed execution
- **dev** - All development tools (pytest, mypy, ruff, etc.)
- **debug** - Debugpy for debugging

### Testing
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=cognee --cov-report=html

# Run specific test file
pytest cognee/tests/test_custom_model.py

# Run specific test function
pytest cognee/tests/test_custom_model.py::test_function_name

# Run unit tests only
pytest cognee/tests/unit/

# Run integration tests only
pytest cognee/tests/integration/
```

### Code Quality
```bash
# Run ruff linter
ruff check .

# Run ruff formatter
ruff format .

# Run both linting and formatting (pre-commit)
pre-commit run --all-files

# Type checking with mypy
mypy cognee/

# Run pylint
pylint cognee/
```

### Running Cognee
```bash
# Using Python SDK
python examples/python/simple_example.py

# Using CLI
cognee-cli add "Your text here"
cognee-cli cognify
cognee-cli search "Your query"
cognee-cli delete --all

# Launch full stack with UI
cognee-cli -ui
```

## Architecture Overview

### Core Workflow: add → cognify → search/memify

1. **add()** - Ingest data (files, URLs, text) into datasets
2. **cognify()** - Extract entities/relationships and build knowledge graph
3. **search()** - Query knowledge using various retrieval strategies
4. **memify()** - Enrich graph with additional context and rules
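
A minimal end-to-end sketch of this workflow (a hedged example: the `SearchType` import path is assumed from the module layout described under SEARCH below):

```python
import asyncio

import cognee
from cognee.modules.search.types import SearchType  # assumed import path


async def main():
    # 1. add: ingest raw text into a dataset
    await cognee.add("Cognee builds knowledge graphs from raw data.", dataset_name="demo")

    # 2. cognify: extract entities/relationships and build the graph
    await cognee.cognify(datasets=["demo"])

    # 3. search: query the graph (GRAPH_COMPLETION is the default type)
    results = await cognee.search(
        query_text="What does cognee build?",
        query_type=SearchType.GRAPH_COMPLETION,
    )
    print(results)


asyncio.run(main())
```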

### Key Architectural Patterns

#### 1. Pipeline-Based Processing
All data flows through task-based pipelines (`cognee/modules/pipelines/`). Tasks are composable units that can run sequentially or in parallel. Example pipeline tasks: `classify_documents`, `extract_graph_from_data`, `add_data_points`.
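
As a sketch of what composability means here, tasks are plain async functions wrapped in `Task` objects (the wrapper usage mirrors the Common Patterns section below; both stage functions are placeholders):

```python
from cognee.modules.pipelines.tasks.Task import Task


async def clean_text(data):
    # Placeholder stage: normalize input before extraction.
    return data


async def tag_metadata(data):
    # Placeholder stage: annotate items produced by the previous task.
    return data


# Sequential composition: the orchestrator in cognee/modules/pipelines/
# runs the tasks in order, feeding each task's output to the next.
tasks = [Task(clean_text), Task(tag_metadata)]
```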

#### 2. Interface-Based Database Adapters
Multiple backends are supported through adapter interfaces:
- **Graph**: Kuzu (default), Neo4j, Neptune via `GraphDBInterface`
- **Vector**: LanceDB (default), ChromaDB, PGVector via `VectorDBInterface`
- **Relational**: SQLite (default), PostgreSQL

Key files:
- `cognee/infrastructure/databases/graph/graph_db_interface.py`
- `cognee/infrastructure/databases/vector/vector_db_interface.py`

#### 3. Multi-Tenant Access Control
User → Dataset → Data hierarchy with permission-based filtering. Enable with `ENABLE_BACKEND_ACCESS_CONTROL=True`. Each user+dataset combination can have isolated graph/vector databases (when using supported backends: Kuzu, LanceDB, SQLite, Postgres).

### Layer Structure

```
API Layer (cognee/api/v1/)
        ↓
Main Functions (add, cognify, search, memify)
        ↓
Pipeline Orchestrator (cognee/modules/pipelines/)
        ↓
Task Execution Layer (cognee/tasks/)
        ↓
Domain Modules (graph, retrieval, ingestion, etc.)
        ↓
Infrastructure Adapters (LLM, databases)
        ↓
External Services (OpenAI, Kuzu, LanceDB, etc.)
```

### Critical Data Flow Paths

#### ADD: Data Ingestion
`add()` → `resolve_data_directories` → `ingest_data` → `save_data_item_to_storage` → Create Dataset + Data records in relational DB

Key files: `cognee/api/v1/add/add.py`, `cognee/tasks/ingestion/ingest_data.py`
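
Per the project overview, `add()` accepts files, URLs, or plain text; a hedged sketch (passing a list of mixed inputs is assumed to be supported by the ingestion pipeline):

```python
import asyncio

import cognee


async def ingest():
    # A local file, a URL, and raw text, all ingested into one dataset.
    await cognee.add(
        ["./notes/meeting.md", "https://example.com/article.html", "Plain text works too."],
        dataset_name="research",
    )


asyncio.run(ingest())
```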

#### COGNIFY: Knowledge Graph Construction
`cognify()` → `classify_documents` → `extract_chunks_from_documents` → `extract_graph_from_data` (LLM extracts entities/relationships using Instructor) → `summarize_text` → `add_data_points` (store in graph + vector DBs)

Key files:
- `cognee/api/v1/cognify/cognify.py`
- `cognee/tasks/graph/extract_graph_from_data.py`
- `cognee/tasks/storage/add_data_points.py`

#### SEARCH: Retrieval
`search(query_text, query_type)` → route to retriever type → filter by permissions → return results

Available search types (from `cognee/modules/search/types/SearchType.py`):
- **GRAPH_COMPLETION** (default) - Graph traversal + LLM completion
- **GRAPH_SUMMARY_COMPLETION** - Uses pre-computed summaries with graph context
- **GRAPH_COMPLETION_COT** - Chain-of-thought reasoning over graph
- **GRAPH_COMPLETION_CONTEXT_EXTENSION** - Extended context graph retrieval
- **TRIPLET_COMPLETION** - Triplet-based (subject-predicate-object) search
- **RAG_COMPLETION** - Traditional RAG with chunks
- **CHUNKS** - Vector similarity search over chunks
- **CHUNKS_LEXICAL** - Lexical (keyword) search over chunks
- **SUMMARIES** - Search pre-computed document summaries
- **CYPHER** - Direct Cypher query execution (requires `ALLOW_CYPHER_QUERY=True`)
- **NATURAL_LANGUAGE** - Natural language to structured query
- **TEMPORAL** - Time-aware graph search
- **FEELING_LUCKY** - Automatic search type selection
- **FEEDBACK** - User feedback-based refinement
- **CODING_RULES** - Code-specific search rules
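
Switching strategies is just a matter of passing a different enum member; a short sketch contrasting the default graph completion with raw chunk retrieval (same assumed import path as above):

```python
import cognee
from cognee.modules.search.types import SearchType  # assumed import path


async def compare(query: str):
    # LLM answer grounded in graph traversal (the default search type).
    answer = await cognee.search(query_text=query, query_type=SearchType.GRAPH_COMPLETION)

    # Raw chunks by vector similarity, with no LLM completion step.
    chunks = await cognee.search(query_text=query, query_type=SearchType.CHUNKS)

    return answer, chunks
```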

Key files:
- `cognee/api/v1/search/search.py`
- `cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py`
- `cognee/modules/search/types/SearchType.py`

### Core Data Models

#### Engine Models (`cognee/infrastructure/engine/models/`)
- **DataPoint** - Base class for all graph nodes (versioned, with metadata)
- **Edge** - Graph relationships (source, target, relationship type)
- **Triplet** - (Subject, Predicate, Object) representation

#### Graph Models (`cognee/shared/data_models.py`)
- **KnowledgeGraph** - Container for nodes and edges
- **Node** - Entity (id, name, type, description)
- **Edge** - Relationship (source_node_id, target_node_id, relationship_name)
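
A hedged sketch of constructing these models by hand, using only the fields listed above (assumes they are Pydantic models exported from `cognee.shared.data_models`):

```python
from cognee.shared.data_models import Edge, KnowledgeGraph, Node  # assumed exports

# A two-node graph: "Cognee" --uses--> "Kuzu".
graph = KnowledgeGraph(
    nodes=[
        Node(id="n1", name="Cognee", type="Software", description="AI memory platform"),
        Node(id="n2", name="Kuzu", type="Database", description="Embedded graph database"),
    ],
    edges=[
        Edge(source_node_id="n1", target_node_id="n2", relationship_name="uses"),
    ],
)
```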

### Key Infrastructure Components

#### LLM Gateway (`cognee/infrastructure/llm/LLMGateway.py`)
Unified interface for multiple LLM providers: OpenAI, Anthropic, Gemini, Ollama, Mistral, Bedrock. Uses Instructor for structured output extraction.

#### Embedding Engines
Factory pattern for embeddings: `cognee/infrastructure/databases/vector/embeddings/get_embedding_engine.py`
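
A hedged sketch of going through the factory (import path inferred from the file location above; the returned engine's methods vary by provider):

```python
from cognee.infrastructure.databases.vector.embeddings.get_embedding_engine import (
    get_embedding_engine,
)

# The factory reads the configured provider settings (EMBEDDING_PROVIDER,
# EMBEDDING_MODEL, ...) and returns the matching engine instance.
embedding_engine = get_embedding_engine()
```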

#### Document Loaders
Support for PDF, DOCX, CSV, images, audio, code files in `cognee/infrastructure/files/`

## Important Configuration

### Environment Setup
Copy `.env.template` to `.env` and configure:

```bash
# Minimal setup (defaults to OpenAI + local file-based databases)
LLM_API_KEY="your_openai_api_key"
LLM_MODEL="openai/gpt-4o-mini" # Default model
```

**Important**: If you configure only LLM or only embeddings, the other defaults to OpenAI. Ensure you have a working OpenAI API key, or configure both to avoid unexpected defaults.

Default databases (no extra setup needed):
- **Relational**: SQLite (metadata and state storage)
- **Vector**: LanceDB (embeddings for semantic search)
- **Graph**: Kuzu (knowledge graph and relationships)

All are stored under `.venv` by default. Override with `DATA_ROOT_DIRECTORY` and `SYSTEM_ROOT_DIRECTORY`.

### Switching Databases

#### Relational Databases
```bash
# PostgreSQL (requires postgres extra: pip install cognee[postgres])
DB_PROVIDER=postgres
DB_HOST=localhost
DB_PORT=5432
DB_USERNAME=cognee
DB_PASSWORD=cognee
DB_NAME=cognee_db
```

#### Vector Databases
Supported: lancedb (default), pgvector, chromadb, qdrant, weaviate, milvus
```bash
# ChromaDB (requires chromadb extra)
VECTOR_DB_PROVIDER=chromadb

# PGVector (requires postgres extra)
VECTOR_DB_PROVIDER=pgvector
VECTOR_DB_URL=postgresql://cognee:cognee@localhost:5432/cognee_db
```

#### Graph Databases
Supported: kuzu (default), neo4j, neptune, kuzu-remote
```bash
# Neo4j (requires neo4j extra: pip install cognee[neo4j])
GRAPH_DATABASE_PROVIDER=neo4j
GRAPH_DATABASE_URL=bolt://localhost:7687
GRAPH_DATABASE_NAME=neo4j
GRAPH_DATABASE_USERNAME=neo4j
GRAPH_DATABASE_PASSWORD=yourpassword

# Remote Kuzu
GRAPH_DATABASE_PROVIDER=kuzu-remote
GRAPH_DATABASE_URL=http://localhost:8000
GRAPH_DATABASE_USERNAME=your_username
GRAPH_DATABASE_PASSWORD=your_password
```

### LLM Provider Configuration

Supported providers: OpenAI (default), Azure OpenAI, Google Gemini, Anthropic, AWS Bedrock, Ollama, LM Studio, Custom (OpenAI-compatible APIs)

#### OpenAI (Recommended - Minimal Setup)
```bash
LLM_API_KEY="your_openai_api_key"
LLM_MODEL="openai/gpt-4o-mini" # or gpt-4o, gpt-4-turbo, etc.
LLM_PROVIDER="openai"
```

#### Azure OpenAI
```bash
LLM_PROVIDER="azure"
LLM_MODEL="azure/gpt-4o-mini"
LLM_ENDPOINT="https://YOUR-RESOURCE.openai.azure.com/openai/deployments/gpt-4o-mini"
LLM_API_KEY="your_azure_api_key"
LLM_API_VERSION="2024-12-01-preview"
```

#### Google Gemini (requires gemini extra)
```bash
LLM_PROVIDER="gemini"
LLM_MODEL="gemini/gemini-2.0-flash-exp"
LLM_API_KEY="your_gemini_api_key"
```

#### Anthropic Claude (requires anthropic extra)
```bash
LLM_PROVIDER="anthropic"
LLM_MODEL="claude-3-5-sonnet-20241022"
LLM_API_KEY="your_anthropic_api_key"
```

#### Ollama (Local - requires ollama extra)
```bash
LLM_PROVIDER="ollama"
LLM_MODEL="llama3.1:8b"
LLM_ENDPOINT="http://localhost:11434/v1"
LLM_API_KEY="ollama"
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="nomic-embed-text:latest"
EMBEDDING_ENDPOINT="http://localhost:11434/api/embed"
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
```

#### Custom / OpenRouter / vLLM
```bash
LLM_PROVIDER="custom"
LLM_MODEL="openrouter/google/gemini-2.0-flash-lite-preview-02-05:free"
LLM_ENDPOINT="https://openrouter.ai/api/v1"
LLM_API_KEY="your_api_key"
```

#### AWS Bedrock (requires aws extra)
```bash
LLM_PROVIDER="bedrock"
LLM_MODEL="anthropic.claude-3-sonnet-20240229-v1:0"
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID="your_access_key"
AWS_SECRET_ACCESS_KEY="your_secret_key"
# Optional for temporary credentials:
# AWS_SESSION_TOKEN="your_session_token"
```

#### LLM Rate Limiting
```bash
LLM_RATE_LIMIT_ENABLED=true
LLM_RATE_LIMIT_REQUESTS=60 # Requests per interval
LLM_RATE_LIMIT_INTERVAL=60 # Interval in seconds
```

#### Instructor Mode (Structured Output)
```bash
# LLM_INSTRUCTOR_MODE controls how structured data is extracted
# Each LLM has its own default (e.g., gpt-4o models use "json_schema_mode")
# Override if needed:
LLM_INSTRUCTOR_MODE="json_schema_mode" # or "tool_call", "md_json", etc.
```

### Structured Output Framework
```bash
# Use Instructor (default, via litellm)
STRUCTURED_OUTPUT_FRAMEWORK="instructor"

# Or use BAML (requires baml extra: pip install cognee[baml])
STRUCTURED_OUTPUT_FRAMEWORK="baml"
BAML_LLM_PROVIDER=openai
BAML_LLM_MODEL="gpt-4o-mini"
BAML_LLM_API_KEY="your_api_key"
```

### Storage Backend
```bash
# Local filesystem (default)
STORAGE_BACKEND="local"

# S3 (requires aws extra: pip install cognee[aws])
STORAGE_BACKEND="s3"
STORAGE_BUCKET_NAME="your-bucket-name"
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID="your_access_key"
AWS_SECRET_ACCESS_KEY="your_secret_key"
DATA_ROOT_DIRECTORY="s3://your-bucket/cognee/data"
SYSTEM_ROOT_DIRECTORY="s3://your-bucket/cognee/system"
```

## Extension Points

### Adding New Functionality

1. **New Task Type**: Create task function in `cognee/tasks/`, return Task object, register in pipeline
2. **New Database Backend**: Implement `GraphDBInterface` or `VectorDBInterface` in `cognee/infrastructure/databases/`
3. **New LLM Provider**: Add configuration in LLM config (uses litellm)
4. **New Document Processor**: Extend loaders in `cognee/modules/data/processing/`
5. **New Search Type**: Add to `SearchType` enum and implement retriever in `cognee/modules/retrieval/`
6. **Custom Graph Models**: Define Pydantic models extending `DataPoint` in your code
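
For item 6, a minimal sketch of a custom graph model (a hedged example: the `DataPoint` import path is assumed from the Engine Models section, and the fields are illustrative):

```python
from cognee.infrastructure.engine import DataPoint  # assumed import path


class Person(DataPoint):
    """Custom node type: each instance becomes a node in the knowledge graph."""

    name: str
    role: str
```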

### Working with Ontologies
Cognee supports ontology-based entity extraction to ground knowledge graphs in standardized semantic frameworks (e.g., OWL ontologies).

Configuration:
```bash
ONTOLOGY_RESOLVER=rdflib # Default: uses rdflib and OWL files
MATCHING_STRATEGY=fuzzy # Default: fuzzy matching with 80% similarity
ONTOLOGY_FILE_PATH=/path/to/your/ontology.owl # Full path to ontology file
```

Implementation: `cognee/modules/ontology/`

## Branching Strategy

**IMPORTANT**: Always branch from `dev`, not `main`. The `dev` branch is the active development branch.

```bash
git checkout dev
git pull origin dev
git checkout -b feature/your-feature-name
```

## Code Style

- **Formatter**: Ruff (configured in `pyproject.toml`)
- **Line length**: 100 characters
- **String quotes**: Use double quotes `"` not single quotes `'` (enforced by ruff-format)
- **Pre-commit hooks**: Run ruff linting and formatting automatically
- **Type hints**: Encouraged (mypy checks enabled)
- **Important**: Always run `pre-commit run --all-files` before committing to catch formatting issues

## Testing Strategy

Tests are organized in `cognee/tests/`:
- `unit/` - Unit tests for individual modules
- `integration/` - Full pipeline integration tests
- `cli_tests/` - CLI command tests
- `tasks/` - Task-specific tests

When adding features, add corresponding tests. Integration tests should cover the full add → cognify → search flow.
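
A hedged sketch of such an integration test (assumes `pytest-asyncio` from the dev extras; the dataset name and assertion are illustrative):

```python
import pytest

import cognee
from cognee.modules.search.types import SearchType  # assumed import path


@pytest.mark.asyncio
async def test_add_cognify_search_roundtrip():
    # Exercise the full pipeline: ingest, build the graph, query it.
    await cognee.add("Ada Lovelace wrote the first computer program.", dataset_name="it_demo")
    await cognee.cognify(datasets=["it_demo"])

    results = await cognee.search(
        query_text="Who wrote the first computer program?",
        query_type=SearchType.GRAPH_COMPLETION,
    )
    assert results, "expected at least one result from the knowledge graph"
```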

## API Structure

FastAPI application with versioned routes under `cognee/api/v1/`:
- `/add` - Data ingestion
- `/cognify` - Knowledge graph processing
- `/search` - Query interface
- `/memify` - Graph enrichment
- `/datasets` - Dataset management
- `/users` - Authentication (if `REQUIRE_AUTHENTICATION=True`)
- `/visualize` - Graph visualization server

## Python SDK Entry Points

Main functions exported from `cognee/__init__.py`:
- `add(data, dataset_name)` - Ingest data
- `cognify(datasets)` - Build knowledge graph
- `search(query_text, query_type)` - Query knowledge
- `memify(extraction_tasks, enrichment_tasks)` - Enrich graph
- `delete(data_id)` - Remove data
- `config()` - Configuration management
- `datasets()` - Dataset operations

All functions are async - use `await` or `asyncio.run()`.
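
For example, from synchronous code:

```python
import asyncio

import cognee

# Every SDK entry point is a coroutine; wrap calls when not already in an event loop.
asyncio.run(cognee.add("some text", dataset_name="demo"))
```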

## Security Considerations

Several security environment variables in `.env`:
- `ACCEPT_LOCAL_FILE_PATH` - Allow local file paths (default: True)
- `ALLOW_HTTP_REQUESTS` - Allow HTTP requests from Cognee (default: True)
- `ALLOW_CYPHER_QUERY` - Allow raw Cypher queries (default: True)
- `REQUIRE_AUTHENTICATION` - Enable API authentication (default: False)
- `ENABLE_BACKEND_ACCESS_CONTROL` - Multi-tenant isolation (default: True)

For production deployments, review and tighten these settings.

## Common Patterns

### Creating a Custom Pipeline Task
```python
from cognee.modules.pipelines.tasks.Task import Task


async def my_custom_task(data):
    # Your logic here; keep the function async so the pipeline can await it.
    processed_data = [item for item in data if item]  # e.g. drop empty items
    return processed_data


# Use in pipeline
task = Task(my_custom_task)
```

### Accessing Databases Directly
```python
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.infrastructure.databases.vector import get_vector_engine

# Both factories are async; call them from within an async function.
graph_engine = await get_graph_engine()
vector_engine = await get_vector_engine()
```

### Using LLM Gateway
```python
from pydantic import BaseModel

from cognee.infrastructure.llm.get_llm_client import get_llm_client


class Summary(BaseModel):
    # Example response model; use any Pydantic model matching your schema.
    text: str


# Call from within an async function.
llm_client = get_llm_client()
response = await llm_client.acreate_structured_output(
    text_input="Your prompt",
    system_prompt="System instructions",
    response_model=Summary,
)
```

## Key Concepts

### Datasets
Datasets are project-level containers that support organization, permissions, and isolated processing workflows. Each user can have multiple datasets with different access permissions.

```python
# Create/use a dataset
await cognee.add(data, dataset_name="my_project")
await cognee.cognify(datasets=["my_project"])
```

### DataPoints
Atomic knowledge units that form the foundation of graph structures. All graph nodes extend the `DataPoint` base class with versioning and metadata support.

### Permissions System
Multi-tenant architecture with users, roles, and Access Control Lists (ACLs):
- Read, write, delete, and share permissions per dataset
- Enable with `ENABLE_BACKEND_ACCESS_CONTROL=True`
- Supports isolated databases per user+dataset (Kuzu, LanceDB, SQLite, Postgres)

### Graph Visualization
Launch the visualization server:
```bash
# Via CLI
cognee-cli -ui # Launches full stack with UI at http://localhost:3000
```

```python
# Via Python (await from within an async function)
from cognee.api.v1.visualize import start_visualization_server

await start_visualization_server(port=8080)
```

## Debugging & Troubleshooting

### Debug Configuration
- Set `LITELLM_LOG="DEBUG"` for verbose LLM logs (default: "ERROR")
- Enable debug mode: `ENV="development"` or `ENV="debug"`
- Disable telemetry: `TELEMETRY_DISABLED=1`
- Check logs in structured format (uses structlog)
- Use `debugpy` optional dependency for debugging: `pip install cognee[debug]`

### Common Issues

**Ollama + OpenAI Embeddings NoDataError**
- Issue: Mixing Ollama with OpenAI embeddings can cause errors
- Solution: Configure both LLM and embeddings to use the same provider, or ensure `HUGGINGFACE_TOKENIZER` is set when using Ollama

**LM Studio Structured Output**
- Issue: LM Studio requires explicit instructor mode
- Solution: Set `LLM_INSTRUCTOR_MODE="json_schema_mode"` (or appropriate mode)

**Default Provider Fallback**
- Issue: Configuring only LLM or only embeddings defaults the other to OpenAI
- Solution: Always configure both LLM and embedding providers, or ensure a valid OpenAI API key

**Permission Denied on Search**
- Behavior: Returns empty list rather than error (prevents information leakage)
- Solution: Check dataset permissions and user access rights

**Database Connection Issues**
- Check: Verify database URLs, credentials, and that services are running
- Docker users: Use `DB_HOST=host.docker.internal` for local databases

**Rate Limiting Errors**
- Enable client-side rate limiting: `LLM_RATE_LIMIT_ENABLED=true`
- Adjust limits: `LLM_RATE_LIMIT_REQUESTS` and `LLM_RATE_LIMIT_INTERVAL`

## Resources

- [Documentation](https://docs.cognee.ai/)
- [Discord Community](https://discord.gg/NQPKmU5CCg)
- [GitHub Issues](https://github.com/topoteretes/cognee/issues)
- [Example Notebooks](examples/python/)
- [Research Paper](https://arxiv.org/abs/2505.24478) - Optimizing knowledge graphs for LLM reasoning