Merge branch 'dev' into neo4j-multiuser-delete

commit 47054200c9
16 changed files with 686 additions and 58 deletions

.github/workflows/basic_tests.yml (vendored): 37 changes

@@ -34,43 +34,6 @@ env:
      ENV: 'dev'

jobs:
  lint:
    name: Run Linting
    runs-on: ubuntu-22.04
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Cognee Setup
        uses: ./.github/actions/cognee_setup
        with:
          python-version: ${{ inputs.python-version }}

      - name: Run Linting
        uses: astral-sh/ruff-action@v2

  format-check:
    name: Run Formatting Check
    runs-on: ubuntu-22.04
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Cognee Setup
        uses: ./.github/actions/cognee_setup
        with:
          python-version: ${{ inputs.python-version }}

      - name: Run Formatting Check
        uses: astral-sh/ruff-action@v2
        with:
          args: "format --check"

  unit-tests:
    name: Run Unit Tests
    runs-on: ubuntu-22.04
.gitignore (vendored): 1 change

@@ -148,6 +148,7 @@ ENV/
env.bak/
venv.bak/
mise.toml
deployment/helm/values-local.yml

# Spyder project settings
.spyderproject

CLAUDE.md (new file): 588 lines

@@ -0,0 +1,588 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Cognee is an open-source AI memory platform that transforms raw data into persistent knowledge graphs for AI agents. It replaces traditional RAG (Retrieval-Augmented Generation) with an ECL (Extract, Cognify, Load) pipeline combining vector search, graph databases, and LLM-powered entity extraction.

**Requirements**: Python 3.9 - 3.12

## Development Commands

### Setup
```bash
# Create virtual environment (recommended: uv)
uv venv && source .venv/bin/activate

# Install with pip, poetry, or uv
uv pip install -e .

# Install with dev dependencies
uv pip install -e ".[dev]"

# Install with specific extras
uv pip install -e ".[postgres,neo4j,docs,chromadb]"

# Set up pre-commit hooks
pre-commit install
```

### Available Installation Extras
- **postgres** / **postgres-binary** - PostgreSQL + PGVector support
- **neo4j** - Neo4j graph database support
- **neptune** - AWS Neptune support
- **chromadb** - ChromaDB vector database
- **docs** - Document processing (unstructured library)
- **scraping** - Web scraping (Tavily, BeautifulSoup, Playwright)
- **langchain** - LangChain integration
- **llama-index** - LlamaIndex integration
- **anthropic** - Anthropic Claude models
- **gemini** - Google Gemini models
- **ollama** - Ollama local models
- **mistral** - Mistral AI models
- **groq** - Groq API support
- **llama-cpp** - Llama.cpp local inference
- **huggingface** - HuggingFace transformers
- **aws** - S3 storage backend
- **redis** - Redis caching
- **graphiti** - Graphiti-core integration
- **baml** - BAML structured output
- **dlt** - Data load tool (dlt) integration
- **docling** - Docling document processing
- **codegraph** - Code graph extraction
- **evals** - Evaluation tools
- **deepeval** - DeepEval testing framework
- **posthog** - PostHog analytics
- **monitoring** - Sentry + Langfuse observability
- **distributed** - Modal distributed execution
- **dev** - All development tools (pytest, mypy, ruff, etc.)
- **debug** - Debugpy for debugging

### Testing
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=cognee --cov-report=html

# Run specific test file
pytest cognee/tests/test_custom_model.py

# Run specific test function
pytest cognee/tests/test_custom_model.py::test_function_name

# Run unit tests only
pytest cognee/tests/unit/

# Run integration tests only (verbose)
pytest -v cognee/tests/integration/
```

### Code Quality
```bash
# Run ruff linter
ruff check .

# Run ruff formatter
ruff format .

# Run both linting and formatting (pre-commit)
pre-commit run --all-files

# Type checking with mypy
mypy cognee/

# Run pylint
pylint cognee/
```

### Running Cognee
```bash
# Using Python SDK
python examples/python/simple_example.py

# Using CLI
cognee-cli add "Your text here"
cognee-cli cognify
cognee-cli search "Your query"
cognee-cli delete --all

# Launch full stack with UI
cognee-cli -ui
```

## Architecture Overview

### Core Workflow: add → cognify → search/memify

1. **add()** - Ingest data (files, URLs, text) into datasets
2. **cognify()** - Extract entities/relationships and build knowledge graph
3. **search()** - Query knowledge using various retrieval strategies
4. **memify()** - Enrich graph with additional context and rules

### Key Architectural Patterns

#### 1. Pipeline-Based Processing
All data flows through task-based pipelines (`cognee/modules/pipelines/`). Tasks are composable units that can run sequentially or in parallel. Example pipeline tasks: `classify_documents`, `extract_graph_from_data`, `add_data_points`.
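As a rough illustration of that pattern, sequentially composed async tasks can be sketched in generic `asyncio` code (this is not cognee's actual `Task` implementation; all names here are invented for the example):

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical stand-in for a pipeline task: an async callable whose
# output becomes the next task's input.
TaskFn = Callable[[Any], Awaitable[Any]]

async def run_pipeline(tasks: list[TaskFn], data: Any) -> Any:
    # Sequential execution: each task consumes the previous task's result.
    for task in tasks:
        data = await task(data)
    return data

async def classify(docs: list[str]) -> list[dict]:
    # Toy classifier: wrap each raw string in a document record.
    return [{"text": d, "type": "text_document"} for d in docs]

async def chunk(docs: list[dict]) -> list[str]:
    # Toy chunker: split each document into whitespace tokens.
    return [token for doc in docs for token in doc["text"].split()]

result = asyncio.run(run_pipeline([classify, chunk], ["alpha beta", "gamma"]))
print(result)  # ['alpha', 'beta', 'gamma']
```

Parallel stages would swap the loop body for `asyncio.gather` over independent tasks.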
#### 2. Interface-Based Database Adapters
Multiple backends are supported through adapter interfaces:
- **Graph**: Kuzu (default), Neo4j, Neptune via `GraphDBInterface`
- **Vector**: LanceDB (default), ChromaDB, PGVector via `VectorDBInterface`
- **Relational**: SQLite (default), PostgreSQL

Key files:
- `cognee/infrastructure/databases/graph/graph_db_interface.py`
- `cognee/infrastructure/databases/vector/vector_db_interface.py`

#### 3. Multi-Tenant Access Control
User → Dataset → Data hierarchy with permission-based filtering. Enable with `ENABLE_BACKEND_ACCESS_CONTROL=True`. Each user+dataset combination can have isolated graph/vector databases (when using supported backends: Kuzu, LanceDB, SQLite, Postgres).
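The permission-based filtering idea can be sketched with a toy ACL model (purely illustrative; the dataclass and field names are invented and do not mirror cognee's actual models):

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    # Hypothetical ACL: user_id -> set of granted permissions
    # ("read", "write", "delete", "share").
    acl: dict = field(default_factory=dict)

def readable_datasets(datasets: list[Dataset], user_id: str) -> list[str]:
    # Permission-based filtering: only datasets granting "read" survive.
    return [d.name for d in datasets if "read" in d.acl.get(user_id, set())]

datasets = [
    Dataset("public_docs", {"alice": {"read"}, "bob": {"read", "write"}}),
    Dataset("private_notes", {"bob": {"read", "write", "delete"}}),
]
print(readable_datasets(datasets, "alice"))  # ['public_docs']
```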
### Layer Structure

```
API Layer (cognee/api/v1/)
    ↓
Main Functions (add, cognify, search, memify)
    ↓
Pipeline Orchestrator (cognee/modules/pipelines/)
    ↓
Task Execution Layer (cognee/tasks/)
    ↓
Domain Modules (graph, retrieval, ingestion, etc.)
    ↓
Infrastructure Adapters (LLM, databases)
    ↓
External Services (OpenAI, Kuzu, LanceDB, etc.)
```

### Critical Data Flow Paths

#### ADD: Data Ingestion
`add()` → `resolve_data_directories` → `ingest_data` → `save_data_item_to_storage` → Create Dataset + Data records in relational DB

Key files: `cognee/api/v1/add/add.py`, `cognee/tasks/ingestion/ingest_data.py`

#### COGNIFY: Knowledge Graph Construction
`cognify()` → `classify_documents` → `extract_chunks_from_documents` → `extract_graph_from_data` (LLM extracts entities/relationships using Instructor) → `summarize_text` → `add_data_points` (store in graph + vector DBs)

Key files:
- `cognee/api/v1/cognify/cognify.py`
- `cognee/tasks/graph/extract_graph_from_data.py`
- `cognee/tasks/storage/add_data_points.py`

#### SEARCH: Retrieval
`search(query_text, query_type)` → route to retriever type → filter by permissions → return results

Available search types (from `cognee/modules/search/types/SearchType.py`):
- **GRAPH_COMPLETION** (default) - Graph traversal + LLM completion
- **GRAPH_SUMMARY_COMPLETION** - Uses pre-computed summaries with graph context
- **GRAPH_COMPLETION_COT** - Chain-of-thought reasoning over graph
- **GRAPH_COMPLETION_CONTEXT_EXTENSION** - Extended context graph retrieval
- **TRIPLET_COMPLETION** - Triplet-based (subject-predicate-object) search
- **RAG_COMPLETION** - Traditional RAG with chunks
- **CHUNKS** - Vector similarity search over chunks
- **CHUNKS_LEXICAL** - Lexical (keyword) search over chunks
- **SUMMARIES** - Search pre-computed document summaries
- **CYPHER** - Direct Cypher query execution (requires `ALLOW_CYPHER_QUERY=True`)
- **NATURAL_LANGUAGE** - Natural language to structured query
- **TEMPORAL** - Time-aware graph search
- **FEELING_LUCKY** - Automatic search type selection
- **FEEDBACK** - User feedback-based refinement
- **CODING_RULES** - Code-specific search rules

Key files:
- `cognee/api/v1/search/search.py`
- `cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py`
- `cognee/modules/search/types/SearchType.py`
```
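The "route to retriever type" step is essentially enum-based dispatch. A generic sketch of the idea (the enum subset and retriever functions below are invented for illustration; see `SearchType.py` and `cognee/modules/retrieval/` for the real ones):

```python
from enum import Enum

class SearchType(Enum):
    # Invented two-member subset of the real enum, for illustration only.
    CHUNKS = "chunks"
    SUMMARIES = "summaries"

def chunk_retriever(query: str) -> list[str]:
    return [f"chunk match for {query!r}"]

def summary_retriever(query: str) -> list[str]:
    return [f"summary match for {query!r}"]

# Dispatch table: each search type maps to its retriever.
RETRIEVERS = {
    SearchType.CHUNKS: chunk_retriever,
    SearchType.SUMMARIES: summary_retriever,
}

def search(query: str, query_type: SearchType = SearchType.CHUNKS) -> list[str]:
    # Route to the retriever registered for this query type.
    return RETRIEVERS[query_type](query)

print(search("graph databases", SearchType.SUMMARIES))
```

Adding a new search type then means adding an enum member and registering its retriever, which matches the extension-point description later in this file.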
### Core Data Models

#### Engine Models (`cognee/infrastructure/engine/models/`)
- **DataPoint** - Base class for all graph nodes (versioned, with metadata)
- **Edge** - Graph relationships (source, target, relationship type)
- **Triplet** - (Subject, Predicate, Object) representation

#### Graph Models (`cognee/shared/data_models.py`)
- **KnowledgeGraph** - Container for nodes and edges
- **Node** - Entity (id, name, type, description)
- **Edge** - Relationship (source_node_id, target_node_id, relationship_name)
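The graph models above can be sketched as plain dataclasses (field names follow the descriptions above; the real classes are Pydantic models in `cognee/shared/data_models.py`, so treat this only as a shape reference):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    name: str
    type: str
    description: str

@dataclass
class Edge:
    source_node_id: str
    target_node_id: str
    relationship_name: str

@dataclass
class KnowledgeGraph:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

# A two-node graph with one snake_case relationship, as the prompt
# guidelines later in this commit recommend.
kg = KnowledgeGraph(
    nodes=[
        Node("1", "Alan Turing", "Person", "Mathematician"),
        Node("2", "Enigma", "Machine", "Cipher device"),
    ],
    edges=[Edge("1", "2", "worked_on")],
)
print(len(kg.nodes), kg.edges[0].relationship_name)  # 2 worked_on
```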
### Key Infrastructure Components

#### LLM Gateway (`cognee/infrastructure/llm/LLMGateway.py`)
Unified interface for multiple LLM providers: OpenAI, Anthropic, Gemini, Ollama, Mistral, Bedrock. Uses Instructor for structured output extraction.

#### Embedding Engines
Factory pattern for embeddings: `cognee/infrastructure/databases/vector/embeddings/get_embedding_engine.py`

#### Document Loaders
Support for PDF, DOCX, CSV, images, audio, and code files in `cognee/infrastructure/files/`

## Important Configuration

### Environment Setup
Copy `.env.template` to `.env` and configure:

```bash
# Minimal setup (defaults to OpenAI + local file-based databases)
LLM_API_KEY="your_openai_api_key"
LLM_MODEL="openai/gpt-4o-mini"  # Default model
```

**Important**: If you configure only the LLM or only embeddings, the other defaults to OpenAI. Ensure you have a working OpenAI API key, or configure both to avoid unexpected defaults.

Default databases (no extra setup needed):
- **Relational**: SQLite (metadata and state storage)
- **Vector**: LanceDB (embeddings for semantic search)
- **Graph**: Kuzu (knowledge graph and relationships)

All stored in `.venv` by default. Override with `DATA_ROOT_DIRECTORY` and `SYSTEM_ROOT_DIRECTORY`.

### Switching Databases

#### Relational Databases
```bash
# PostgreSQL (requires postgres extra: pip install cognee[postgres])
DB_PROVIDER=postgres
DB_HOST=localhost
DB_PORT=5432
DB_USERNAME=cognee
DB_PASSWORD=cognee
DB_NAME=cognee_db
```

#### Vector Databases
Supported: lancedb (default), pgvector, chromadb, qdrant, weaviate, milvus
```bash
# ChromaDB (requires chromadb extra)
VECTOR_DB_PROVIDER=chromadb

# PGVector (requires postgres extra)
VECTOR_DB_PROVIDER=pgvector
VECTOR_DB_URL=postgresql://cognee:cognee@localhost:5432/cognee_db
```

#### Graph Databases
Supported: kuzu (default), neo4j, neptune, kuzu-remote
```bash
# Neo4j (requires neo4j extra: pip install cognee[neo4j])
GRAPH_DATABASE_PROVIDER=neo4j
GRAPH_DATABASE_URL=bolt://localhost:7687
GRAPH_DATABASE_NAME=neo4j
GRAPH_DATABASE_USERNAME=neo4j
GRAPH_DATABASE_PASSWORD=yourpassword

# Remote Kuzu
GRAPH_DATABASE_PROVIDER=kuzu-remote
GRAPH_DATABASE_URL=http://localhost:8000
GRAPH_DATABASE_USERNAME=your_username
GRAPH_DATABASE_PASSWORD=your_password
```

### LLM Provider Configuration

Supported providers: OpenAI (default), Azure OpenAI, Google Gemini, Anthropic, AWS Bedrock, Ollama, LM Studio, Custom (OpenAI-compatible APIs)

#### OpenAI (Recommended - Minimal Setup)
```bash
LLM_API_KEY="your_openai_api_key"
LLM_MODEL="openai/gpt-4o-mini"  # or gpt-4o, gpt-4-turbo, etc.
LLM_PROVIDER="openai"
```

#### Azure OpenAI
```bash
LLM_PROVIDER="azure"
LLM_MODEL="azure/gpt-4o-mini"
LLM_ENDPOINT="https://YOUR-RESOURCE.openai.azure.com/openai/deployments/gpt-4o-mini"
LLM_API_KEY="your_azure_api_key"
LLM_API_VERSION="2024-12-01-preview"
```

#### Google Gemini (requires gemini extra)
```bash
LLM_PROVIDER="gemini"
LLM_MODEL="gemini/gemini-2.0-flash-exp"
LLM_API_KEY="your_gemini_api_key"
```

#### Anthropic Claude (requires anthropic extra)
```bash
LLM_PROVIDER="anthropic"
LLM_MODEL="claude-3-5-sonnet-20241022"
LLM_API_KEY="your_anthropic_api_key"
```

#### Ollama (Local - requires ollama extra)
```bash
LLM_PROVIDER="ollama"
LLM_MODEL="llama3.1:8b"
LLM_ENDPOINT="http://localhost:11434/v1"
LLM_API_KEY="ollama"
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="nomic-embed-text:latest"
EMBEDDING_ENDPOINT="http://localhost:11434/api/embed"
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
```

#### Custom / OpenRouter / vLLM
```bash
LLM_PROVIDER="custom"
LLM_MODEL="openrouter/google/gemini-2.0-flash-lite-preview-02-05:free"
LLM_ENDPOINT="https://openrouter.ai/api/v1"
LLM_API_KEY="your_api_key"
```

#### AWS Bedrock (requires aws extra)
```bash
LLM_PROVIDER="bedrock"
LLM_MODEL="anthropic.claude-3-sonnet-20240229-v1:0"
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID="your_access_key"
AWS_SECRET_ACCESS_KEY="your_secret_key"
# Optional for temporary credentials:
# AWS_SESSION_TOKEN="your_session_token"
```

#### LLM Rate Limiting
```bash
LLM_RATE_LIMIT_ENABLED=true
LLM_RATE_LIMIT_REQUESTS=60  # Requests per interval
LLM_RATE_LIMIT_INTERVAL=60  # Interval in seconds
```
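The two knobs above describe a fixed-window limiter: at most N requests per interval. A generic sketch of that policy (not cognee's implementation; the clock is injected so the example is deterministic):

```python
class FixedWindowRateLimiter:
    """Allow at most `max_requests` calls per `interval` seconds."""

    def __init__(self, max_requests: int, interval: float, clock):
        self.max_requests = max_requests
        self.interval = interval
        self.clock = clock  # callable returning current time in seconds
        self.window_start = clock()
        self.count = 0

    def allow(self) -> bool:
        now = self.clock()
        if now - self.window_start >= self.interval:
            # Interval elapsed: start a fresh window and reset the counter.
            self.window_start = now
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False

# Simulated clock (a mutable cell) keeps the demo deterministic;
# production code would pass time.monotonic.
t = [0.0]
limiter = FixedWindowRateLimiter(max_requests=2, interval=60.0, clock=lambda: t[0])
print([limiter.allow() for _ in range(3)])  # [True, True, False]
t[0] = 61.0
print(limiter.allow())  # True (new window)
```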
#### Instructor Mode (Structured Output)
```bash
# LLM_INSTRUCTOR_MODE controls how structured data is extracted
# Each LLM has its own default (e.g., gpt-4o models use "json_schema_mode")
# Override if needed:
LLM_INSTRUCTOR_MODE="json_schema_mode"  # or "tool_call", "md_json", etc.
```

### Structured Output Framework
```bash
# Use Instructor (default, via litellm)
STRUCTURED_OUTPUT_FRAMEWORK="instructor"

# Or use BAML (requires baml extra: pip install cognee[baml])
STRUCTURED_OUTPUT_FRAMEWORK="baml"
BAML_LLM_PROVIDER=openai
BAML_LLM_MODEL="gpt-4o-mini"
BAML_LLM_API_KEY="your_api_key"
```

### Storage Backend
```bash
# Local filesystem (default)
STORAGE_BACKEND="local"

# S3 (requires aws extra: pip install cognee[aws])
STORAGE_BACKEND="s3"
STORAGE_BUCKET_NAME="your-bucket-name"
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID="your_access_key"
AWS_SECRET_ACCESS_KEY="your_secret_key"
DATA_ROOT_DIRECTORY="s3://your-bucket/cognee/data"
SYSTEM_ROOT_DIRECTORY="s3://your-bucket/cognee/system"
```

## Extension Points

### Adding New Functionality

1. **New Task Type**: Create a task function in `cognee/tasks/`, return a Task object, and register it in a pipeline
2. **New Database Backend**: Implement `GraphDBInterface` or `VectorDBInterface` in `cognee/infrastructure/databases/`
3. **New LLM Provider**: Add configuration in the LLM config (uses litellm)
4. **New Document Processor**: Extend loaders in `cognee/modules/data/processing/`
5. **New Search Type**: Add to the `SearchType` enum and implement a retriever in `cognee/modules/retrieval/`
6. **Custom Graph Models**: Define Pydantic models extending `DataPoint` in your code

### Working with Ontologies
Cognee supports ontology-based entity extraction to ground knowledge graphs in standardized semantic frameworks (e.g., OWL ontologies).

Configuration:
```bash
ONTOLOGY_RESOLVER=rdflib  # Default: uses rdflib and OWL files
MATCHING_STRATEGY=fuzzy   # Default: fuzzy matching with 80% similarity
ONTOLOGY_FILE_PATH=/path/to/your/ontology.owl  # Full path to ontology file
```

Implementation: `cognee/modules/ontology/`
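Fuzzy matching against ontology terms at an 80% similarity threshold can be illustrated with the standard library's `difflib` (a sketch of the idea only; cognee's actual resolver lives in `cognee/modules/ontology/` and may use a different similarity measure):

```python
from difflib import SequenceMatcher

def fuzzy_match(entity: str, ontology_terms: list[str], threshold: float = 0.8):
    """Return the best ontology term whose similarity clears the threshold, else None."""
    best_term, best_score = None, 0.0
    for term in ontology_terms:
        # Case-insensitive similarity ratio in [0.0, 1.0].
        score = SequenceMatcher(None, entity.lower(), term.lower()).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None

terms = ["Person", "Organization", "Location"]
print(fuzzy_match("organisation", terms))  # 'Organization'
print(fuzzy_match("gadget", terms))        # None
```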
## Branching Strategy

**IMPORTANT**: Always branch from `dev`, not `main`. The `dev` branch is the active development branch.

```bash
git checkout dev
git pull origin dev
git checkout -b feature/your-feature-name
```

## Code Style

- Ruff for linting and formatting (configured in `pyproject.toml`)
- Line length: 100 characters
- Pre-commit hooks run ruff automatically
- Type hints encouraged (mypy checks enabled)

## Testing Strategy

Tests are organized in `cognee/tests/`:
- `unit/` - Unit tests for individual modules
- `integration/` - Full pipeline integration tests
- `cli_tests/` - CLI command tests
- `tasks/` - Task-specific tests

When adding features, add corresponding tests. Integration tests should cover the full add → cognify → search flow.

## API Structure

FastAPI application with versioned routes under `cognee/api/v1/`:
- `/add` - Data ingestion
- `/cognify` - Knowledge graph processing
- `/search` - Query interface
- `/memify` - Graph enrichment
- `/datasets` - Dataset management
- `/users` - Authentication (if `REQUIRE_AUTHENTICATION=True`)
- `/visualize` - Graph visualization server

## Python SDK Entry Points

Main functions exported from `cognee/__init__.py`:
- `add(data, dataset_name)` - Ingest data
- `cognify(datasets)` - Build knowledge graph
- `search(query_text, query_type)` - Query knowledge
- `memify(extraction_tasks, enrichment_tasks)` - Enrich graph
- `delete(data_id)` - Remove data
- `config()` - Configuration management
- `datasets()` - Dataset operations

All functions are async - use `await` or `asyncio.run()`.

## Security Considerations

Several security environment variables in `.env`:
- `ACCEPT_LOCAL_FILE_PATH` - Allow local file paths (default: True)
- `ALLOW_HTTP_REQUESTS` - Allow HTTP requests from Cognee (default: True)
- `ALLOW_CYPHER_QUERY` - Allow raw Cypher queries (default: True)
- `REQUIRE_AUTHENTICATION` - Enable API authentication (default: False)
- `ENABLE_BACKEND_ACCESS_CONTROL` - Multi-tenant isolation (default: True)

For production deployments, review and tighten these settings.

## Common Patterns

### Creating a Custom Pipeline Task
```python
from cognee.modules.pipelines.tasks.Task import Task

async def my_custom_task(data):
    # Your logic here
    processed_data = process(data)
    return processed_data

# Use in pipeline
task = Task(my_custom_task)
```

### Accessing Databases Directly
```python
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.infrastructure.databases.vector import get_vector_engine

graph_engine = await get_graph_engine()
vector_engine = await get_vector_engine()
```

### Using LLM Gateway
```python
from cognee.infrastructure.llm.get_llm_client import get_llm_client

llm_client = get_llm_client()
response = await llm_client.acreate_structured_output(
    text_input="Your prompt",
    system_prompt="System instructions",
    response_model=YourPydanticModel
)
```

## Key Concepts

### Datasets
Datasets are project-level containers that support organization, permissions, and isolated processing workflows. Each user can have multiple datasets with different access permissions.

```python
# Create/use a dataset
await cognee.add(data, dataset_name="my_project")
await cognee.cognify(datasets=["my_project"])
```

### DataPoints
Atomic knowledge units that form the foundation of graph structures. All graph nodes extend the `DataPoint` base class with versioning and metadata support.

### Permissions System
Multi-tenant architecture with users, roles, and Access Control Lists (ACLs):
- Read, write, delete, and share permissions per dataset
- Enable with `ENABLE_BACKEND_ACCESS_CONTROL=True`
- Supports isolated databases per user+dataset (Kuzu, LanceDB, SQLite, Postgres)

### Graph Visualization
Launch the visualization server:
```bash
# Via CLI
cognee-cli -ui  # Launches full stack with UI at http://localhost:3000
```

```python
# Via Python
from cognee.api.v1.visualize import start_visualization_server
await start_visualization_server(port=8080)
```

## Debugging & Troubleshooting

### Debug Configuration
- Set `LITELLM_LOG="DEBUG"` for verbose LLM logs (default: "ERROR")
- Enable debug mode: `ENV="development"` or `ENV="debug"`
- Disable telemetry: `TELEMETRY_DISABLED=1`
- Check logs in structured format (uses structlog)
- Use the `debugpy` optional dependency for debugging: `pip install cognee[debug]`

### Common Issues

**Ollama + OpenAI Embeddings NoDataError**
- Issue: Mixing Ollama with OpenAI embeddings can cause errors
- Solution: Configure both LLM and embeddings to use the same provider, or ensure `HUGGINGFACE_TOKENIZER` is set when using Ollama

**LM Studio Structured Output**
- Issue: LM Studio requires an explicit instructor mode
- Solution: Set `LLM_INSTRUCTOR_MODE="json_schema_mode"` (or the appropriate mode)

**Default Provider Fallback**
- Issue: Configuring only LLM or only embeddings defaults the other to OpenAI
- Solution: Always configure both LLM and embedding providers, or ensure a valid OpenAI API key

**Permission Denied on Search**
- Behavior: Returns an empty list rather than an error (prevents information leakage)
- Solution: Check dataset permissions and user access rights

**Database Connection Issues**
- Check: Verify database URLs, credentials, and that services are running
- Docker users: Use `DB_HOST=host.docker.internal` for local databases

**Rate Limiting Errors**
- Enable client-side rate limiting: `LLM_RATE_LIMIT_ENABLED=true`
- Adjust limits: `LLM_RATE_LIMIT_REQUESTS` and `LLM_RATE_LIMIT_INTERVAL`

## Resources

- [Documentation](https://docs.cognee.ai/)
- [Discord Community](https://discord.gg/NQPKmU5CCg)
- [GitHub Issues](https://github.com/topoteretes/cognee/issues)
- [Example Notebooks](examples/python/)
- [Research Paper](https://arxiv.org/abs/2505.24478) - Optimizing knowledge graphs for LLM reasoning
@@ -8,12 +8,14 @@ from fastapi.encoders import jsonable_encoder

 from cognee.modules.search.types import SearchType, SearchResult, CombinedSearchResult
 from cognee.api.DTO import InDTO, OutDTO
-from cognee.modules.users.exceptions.exceptions import PermissionDeniedError
+from cognee.modules.users.exceptions.exceptions import PermissionDeniedError, UserNotFoundError
 from cognee.modules.users.models import User
 from cognee.modules.search.operations import get_history
 from cognee.modules.users.methods import get_authenticated_user
 from cognee.shared.utils import send_telemetry
 from cognee import __version__ as cognee_version
+from cognee.infrastructure.databases.exceptions import DatabaseNotCreatedError
+from cognee.exceptions import CogneeValidationError


 # Note: Datasets sent by name will only map to datasets owned by the request sender

@@ -138,6 +140,17 @@ def get_search_router() -> APIRouter:
             )

             return jsonable_encoder(results)
+        except (DatabaseNotCreatedError, UserNotFoundError, CogneeValidationError) as e:
+            # Return a clear 422 with actionable guidance instead of leaking a stacktrace
+            status_code = getattr(e, "status_code", 422)
+            return JSONResponse(
+                status_code=status_code,
+                content={
+                    "error": "Search prerequisites not met",
+                    "detail": str(e),
+                    "hint": "Run `await cognee.add(...)` then `await cognee.cognify()` before searching.",
+                },
+            )
         except PermissionDeniedError:
             return []
         except Exception as error:
@@ -11,6 +11,9 @@ from cognee.modules.data.methods import get_authorized_existing_datasets
 from cognee.modules.data.exceptions import DatasetNotFoundError
 from cognee.context_global_variables import set_session_user_context_variable
 from cognee.shared.logging_utils import get_logger
+from cognee.infrastructure.databases.exceptions import DatabaseNotCreatedError
+from cognee.exceptions import CogneeValidationError
+from cognee.modules.users.exceptions.exceptions import UserNotFoundError

 logger = get_logger()

@@ -176,7 +179,18 @@ async def search(
         datasets = [datasets]

     if user is None:
-        user = await get_default_user()
+        try:
+            user = await get_default_user()
+        except (DatabaseNotCreatedError, UserNotFoundError) as error:
+            # Provide a clear, actionable message instead of surfacing low-level stacktraces
+            raise CogneeValidationError(
+                message=(
+                    "Search prerequisites not met: no database/default user found. "
+                    "Initialize Cognee before searching by:\n"
+                    "• running `await cognee.add(...)` followed by `await cognee.cognify()`."
+                ),
+                name="SearchPreconditionError",
+            ) from error

     await set_session_user_context_variable(user)
@@ -19,8 +19,8 @@ The aim is to achieve simplicity and clarity in the knowledge graph.
 - **Naming Convention**: Use snake_case for relationship names, e.g., `acted_in`.
 # 3. Coreference Resolution
 - **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
-If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
-always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the Persons ID.
+If an entity is mentioned multiple times in the text but is referred to by different names or pronouns,
+always use the most complete identifier for that entity throughout the knowledge graph.
 Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
 # 4. Strict Compliance
 Adhere to the rules strictly. Non-compliance will result in termination.
@@ -22,7 +22,7 @@ You are an advanced algorithm designed to extract structured information to build...
 3. **Coreference Resolution**:
    - Maintain one consistent node ID for each real-world entity.
    - Resolve aliases, acronyms, and pronouns to the most complete form.
-   - *Example*: Always use "John Doe" even if later referred to as "Doe" or "he".
+   - *Example*: Always use the full identifier even if the entity is later referred to in a similar but slightly different way.

 **Property & Data Guidelines**:
@@ -42,10 +42,10 @@ You are an advanced algorithm designed to extract structured information from un...
 - **Rule**: Resolve all aliases, acronyms, and pronouns to one canonical identifier.

 > **One-Shot Example**:
-> **Input**: "John Doe is an author. Later, Doe published a book. He is well-known."
+> **Input**: "X is an author. Later, Doe published a book. He is well-known."
 > **Output Node**:
 > ```
-> John Doe (Person)
+> X (Person)
 > ```

 ---
@@ -15,7 +15,7 @@ You are an advanced algorithm that extracts structured data into a knowledge gra...
 - Properties are key-value pairs; do not use escaped quotes.

 3. **Coreference Resolution**
-   - Use a single, complete identifier for each entity (e.g., always "John Doe" not "Joe" or "he").
+   - Use a single, complete identifier for each entity.

 4. **Relationship Labels**:
    - Use descriptive, lowercase, snake_case names for edges.
@@ -26,7 +26,7 @@ Use **basic atomic types** for node labels. Always prefer general types over spe...
 - Good: "Alan Turing", "Google Inc.", "World War II"
 - Bad: "Entity_001", "1234", "he", "they"
 - Never use numeric or autogenerated IDs.
-- Prioritize **most complete form** of entity names for consistency (e.g., always use "John Doe" instead of "John" or "he").
+- Prioritize the **most complete form** of entity names for consistency.

 2. Dates, Numbers, and Properties
 ---------------------------------
@ -8,7 +8,8 @@ import http.server
|
|||
import socketserver
|
||||
from threading import Thread
|
||||
import pathlib
|
||||
from uuid import uuid4, uuid5, NAMESPACE_OID
|
||||
from typing import Union, Any, Dict, List
|
||||
from uuid import uuid4, uuid5, NAMESPACE_OID, UUID
|
||||
|
||||
from cognee.base_config import get_base_config
|
||||
from cognee.shared.logging_utils import get_logger
|
||||
|
|
@ -58,7 +59,7 @@ def get_anonymous_id():
|
|||
return anonymous_id
|
||||
|
||||
|
||||
def _sanitize_nested_properties(obj, property_names: list[str]):
|
||||
def _sanitize_nested_properties(obj: Any, property_names: list[str]) -> Any:
|
||||
"""
|
||||
Recursively replaces any property whose key matches one of `property_names`
|
||||
(e.g., ['url', 'path']) in a nested dict or list with a uuid5 hash
|
||||
|
|
@@ -78,7 +79,9 @@ def _sanitize_nested_properties(obj, property_names: list[str]):
     return obj


-def send_telemetry(event_name: str, user_id, additional_properties: dict = {}):
+def send_telemetry(event_name: str, user_id: Union[str, UUID], additional_properties: dict = {}):
+    if additional_properties is None:
+        additional_properties = {}
     if os.getenv("TELEMETRY_DISABLED"):
         return
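The added `None` guard lets callers pass `additional_properties=None` safely; more generally, the `props=None` guard pattern is the standard fix for Python's shared-mutable-default pitfall. A minimal illustration (the helper names here are hypothetical, not from the diff):

```python
def log_event_bad(name, props={}):
    # The default dict is created once at definition time...
    props[name] = True
    # ...so every call that omits `props` mutates the same object.
    return props

def log_event_good(name, props=None):
    # Guard pattern: allocate a fresh dict per call.
    if props is None:
        props = {}
    props[name] = True
    return props

a = log_event_bad("first")
b = log_event_bad("second")
print(a is b)   # True: both calls share one dict

c = log_event_good("first")
d = log_event_good("second")
print(c is d)   # False: each call gets its own dict
```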
@@ -108,7 +111,7 @@ def send_telemetry(event_name: str, user_id, additional_properties: dict = {}):
             print(f"Error sending telemetry through proxy: {response.status_code}")


-def embed_logo(p, layout_scale, logo_alpha, position):
+def embed_logo(p: Any, layout_scale: float, logo_alpha: float, position: str):
     """
     Embed a logo into the graph visualization as a watermark.
     """
@@ -138,7 +141,11 @@ def embed_logo(p, layout_scale, logo_alpha, position):


 def start_visualization_server(
-    host="0.0.0.0", port=8001, handler_class=http.server.SimpleHTTPRequestHandler
+    host: str = "0.0.0.0",
+    port: int = 8001,
+    handler_class: type[
+        http.server.SimpleHTTPRequestHandler
+    ] = http.server.SimpleHTTPRequestHandler,
 ):
     """
     Spin up a simple HTTP server in a background thread to serve files.
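`start_visualization_server` serves files from a background thread so the caller is not blocked. A minimal standalone sketch of that pattern (the function name and defaults here are illustrative, not cognee's API):

```python
import http.server
import socketserver
from threading import Thread

def start_server(host="127.0.0.1", port=8001,
                 handler_class=http.server.SimpleHTTPRequestHandler):
    # Tolerate quick restarts without "address already in use" errors.
    socketserver.TCPServer.allow_reuse_address = True
    httpd = socketserver.TCPServer((host, port), handler_class)
    # Daemon thread: the server dies with the main process instead of hanging it.
    Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd  # caller can invoke httpd.shutdown() when finished
```

Passing `port=0` lets the OS pick a free port, which is handy in tests; the bound port is then available as `httpd.server_address[1]`.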
@@ -1,6 +1,7 @@
-# cognee-infra-helm
-General infrastructure setup for Cognee on Kubernetes using a Helm chart.
+# Example helm chart
+Example Helm chart for Cognee with PostgreSQL and pgvector extension.
+It is not ready for production usage.

 ## Prerequisites
 Before deploying the Helm chart, ensure the following prerequisites are met:
@@ -13,13 +14,22 @@ Before deploying the Helm chart, ensure the following prerequisites are met:

 Clone the Repository: clone this repository to your local machine and navigate to the directory.

-## Deploy Helm Chart:
+## Example deploy Helm Chart:

 ```bash
-helm install cognee ./cognee-chart
+helm upgrade --install cognee deployment/helm \
+  --namespace cognee --create-namespace \
+  --set cognee.env.LLM_API_KEY="$YOUR_KEY"
 ```

 **Uninstall Helm Release**:
 ```bash
 helm uninstall cognee
 ```
+
+## Port forwarding
+To access cognee, run
+```
+kubectl port-forward svc/cognee-service -n cognee 8000
+```
+and it will be available at localhost:8000.
@@ -20,12 +20,35 @@ spec:
       ports:
         - containerPort: {{ .Values.cognee.port }}
       env:
+        - name: ENABLE_BACKEND_ACCESS_CONTROL
+          value: "false"
         - name: HOST
           value: {{ .Values.cognee.env.HOST }}
         - name: ENVIRONMENT
           value: {{ .Values.cognee.env.ENVIRONMENT }}
         - name: PYTHONPATH
          value: {{ .Values.cognee.env.PYTHONPATH }}
+        - name: VECTOR_DB_PROVIDER
+          value: pgvector
+        - name: DB_HOST
+          value: {{ .Release.Name }}-postgres
+        - name: DB_PORT
+          value: "{{ .Values.postgres.port }}"
+        - name: DB_NAME
+          value: {{ .Values.postgres.env.POSTGRES_DB }}
+        - name: DB_USERNAME
+          value: {{ .Values.postgres.env.POSTGRES_USER }}
+        - name: DB_PASSWORD
+          value: {{ .Values.postgres.env.POSTGRES_PASSWORD }}
+        - name: LLM_API_KEY
+          valueFrom:
+            secretKeyRef:
+              name: {{ .Release.Name }}-llm-api-key
+              key: LLM_API_KEY
+        - name: LLM_MODEL
+          value: {{ .Values.cognee.env.LLM_MODEL }}
+        - name: LLM_PROVIDER
+          value: {{ .Values.cognee.env.LLM_PROVIDER }}
       resources:
         limits:
           cpu: {{ .Values.cognee.resources.cpu }}
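The injected `DB_*` variables assemble into a standard Postgres connection string. A rough sketch (the helper name is hypothetical; the values come from the chart's defaults, with `DB_HOST` resolving to `cognee-postgres` for a release named `cognee`):

```python
def postgres_dsn(host, port, name, user, password):
    # Mirrors the DB_* env vars the deployment template injects.
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"

dsn = postgres_dsn("cognee-postgres", 5432, "cognee_db", "cognee", "cognee")
# "postgresql://cognee:cognee@cognee-postgres:5432/cognee_db"
```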
@@ -5,7 +5,7 @@ metadata:
   labels:
     app: {{ .Release.Name }}-cognee
 spec:
-  type: NodePort
+  type: ClusterIP
   ports:
     - port: {{ .Values.cognee.port }}
       targetPort: {{ .Values.cognee.port }}

deployment/helm/templates/secrets.yml (new file)
@@ -0,0 +1,7 @@
+apiVersion: v1
+kind: Secret
+metadata:
+  name: {{ .Release.Name }}-llm-api-key
+type: Opaque
+data:
+  LLM_API_KEY: {{ .Values.cognee.env.LLM_API_KEY | b64enc | quote }}
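Helm's `b64enc` is plain base64, as Kubernetes requires for `Secret.data` values, so the stored value can be reproduced or checked outside the cluster (the key string below is illustrative, not a real key):

```python
import base64

api_key = "sk-example-not-a-real-key"  # illustrative placeholder, not a real key

# What `b64enc` stores in the Secret's data field:
encoded = base64.b64encode(api_key.encode()).decode()

# What the kubelet hands the container after decoding:
decoded = base64.b64decode(encoded).decode()
assert decoded == api_key
```

Note that base64 is an encoding, not encryption; anyone who can read the Secret can recover the key.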
@@ -7,9 +7,11 @@ cognee:
     HOST: "0.0.0.0"
     ENVIRONMENT: "local"
     PYTHONPATH: "."
+    LLM_MODEL: "openai/gpt-4o-mini"
+    LLM_PROVIDER: "openai"
   resources:
     cpu: "4.0"
-    memory: "8Gi"
+    memory: "2Gi"

 # Configuration for the 'postgres' database service
 postgres:
@@ -19,4 +21,4 @@ postgres:
     POSTGRES_USER: "cognee"
     POSTGRES_PASSWORD: "cognee"
     POSTGRES_DB: "cognee_db"
-  storage: "8Gi"
+  storage: "2Gi"