From 8499258272ad52a1b4a6f5cdce35f735a7d5f8dc Mon Sep 17 00:00:00 2001 From: vasilije Date: Sun, 28 Dec 2025 20:00:29 +0100 Subject: [PATCH 01/10] resolve jon doe issue --- cognee/infrastructure/llm/prompts/generate_graph_prompt.txt | 4 ++-- .../llm/prompts/generate_graph_prompt_guided.txt | 2 +- .../llm/prompts/generate_graph_prompt_oneshot.txt | 4 ++-- .../llm/prompts/generate_graph_prompt_simple.txt | 2 +- .../llm/prompts/generate_graph_prompt_strict.txt | 2 +- 5 files changed, 7 insertions(+), 7 deletions(-) diff --git a/cognee/infrastructure/llm/prompts/generate_graph_prompt.txt b/cognee/infrastructure/llm/prompts/generate_graph_prompt.txt index 6392cdc33..ce3317381 100644 --- a/cognee/infrastructure/llm/prompts/generate_graph_prompt.txt +++ b/cognee/infrastructure/llm/prompts/generate_graph_prompt.txt @@ -19,8 +19,8 @@ The aim is to achieve simplicity and clarity in the knowledge graph. - **Naming Convention**: Use snake_case for relationship names, e.g., `acted_in`. # 3. Coreference Resolution - **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency. - If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"), - always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the Persons ID. + If an entity is mentioned multiple times in the text but is referred to by different names or pronouns, + always use the most complete identifier for that entity throughout the knowledge graph. Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial. # 4. Strict Compliance Adhere to the rules strictly. 
Non-compliance will result in termination diff --git a/cognee/infrastructure/llm/prompts/generate_graph_prompt_guided.txt b/cognee/infrastructure/llm/prompts/generate_graph_prompt_guided.txt index a216b835f..b087755d3 100644 --- a/cognee/infrastructure/llm/prompts/generate_graph_prompt_guided.txt +++ b/cognee/infrastructure/llm/prompts/generate_graph_prompt_guided.txt @@ -22,7 +22,7 @@ You are an advanced algorithm designed to extract structured information to buil 3. **Coreference Resolution**: - Maintain one consistent node ID for each real-world entity. - Resolve aliases, acronyms, and pronouns to the most complete form. - - *Example*: Always use "John Doe" even if later referred to as "Doe" or "he". + - *Example*: Always use the full identifier even if the entity is later referred to in a shortened or slightly different way. **Property & Data Guidelines**: diff --git a/cognee/infrastructure/llm/prompts/generate_graph_prompt_oneshot.txt b/cognee/infrastructure/llm/prompts/generate_graph_prompt_oneshot.txt index adc31f469..6375e6eb1 100644 --- a/cognee/infrastructure/llm/prompts/generate_graph_prompt_oneshot.txt +++ b/cognee/infrastructure/llm/prompts/generate_graph_prompt_oneshot.txt @@ -42,10 +42,10 @@ You are an advanced algorithm designed to extract structured information from un - **Rule**: Resolve all aliases, acronyms, and pronouns to one canonical identifier. > **One-Shot Example**: -> **Input**: "John Doe is an author. Later, Doe published a book. He is well-known." +> **Input**: "X is an author. Later, X published a book. He is well-known." 
> **Output Node**: > ``` -> John Doe (Person) +> X (Person) > ``` --- diff --git a/cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt b/cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt index 4a166c027..177c9f34a 100644 --- a/cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt +++ b/cognee/infrastructure/llm/prompts/generate_graph_prompt_simple.txt @@ -15,7 +15,7 @@ You are an advanced algorithm that extracts structured data into a knowledge gra - Properties are key-value pairs; do not use escaped quotes. 3. **Coreference Resolution** - - Use a single, complete identifier for each entity (e.g., always "John Doe" not "Joe" or "he"). + - Use a single, complete identifier for each entity. 4. **Relationship Labels**: - Use descriptive, lowercase, snake_case names for edges. diff --git a/cognee/infrastructure/llm/prompts/generate_graph_prompt_strict.txt b/cognee/infrastructure/llm/prompts/generate_graph_prompt_strict.txt index a8191033f..08c117ee4 100644 --- a/cognee/infrastructure/llm/prompts/generate_graph_prompt_strict.txt +++ b/cognee/infrastructure/llm/prompts/generate_graph_prompt_strict.txt @@ -26,7 +26,7 @@ Use **basic atomic types** for node labels. Always prefer general types over spe - Good: "Alan Turing", "Google Inc.", "World War II" - Bad: "Entity_001", "1234", "he", "they" - Never use numeric or autogenerated IDs. -- Prioritize **most complete form** of entity names for consistency (e.g., always use "John Doe" instead of "John" or "he"). +- Prioritize **most complete form** of entity names for consistency. 2. 
Dates, Numbers, and Properties --------------------------------- From ce685557bbe992e3474f536eb8afb09fb7211143 Mon Sep 17 00:00:00 2001 From: vasilije Date: Sun, 28 Dec 2025 20:26:47 +0100 Subject: [PATCH 02/10] added fix that raises error if database doesnt exist --- .../api/v1/search/routers/get_search_router.py | 15 ++++++++++++++- cognee/api/v1/search/search.py | 16 +++++++++++++++- 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/cognee/api/v1/search/routers/get_search_router.py b/cognee/api/v1/search/routers/get_search_router.py index 171c03e49..1aaed7f39 100644 --- a/cognee/api/v1/search/routers/get_search_router.py +++ b/cognee/api/v1/search/routers/get_search_router.py @@ -8,12 +8,14 @@ from fastapi.encoders import jsonable_encoder from cognee.modules.search.types import SearchType, SearchResult, CombinedSearchResult from cognee.api.DTO import InDTO, OutDTO -from cognee.modules.users.exceptions.exceptions import PermissionDeniedError +from cognee.modules.users.exceptions.exceptions import PermissionDeniedError, UserNotFoundError from cognee.modules.users.models import User from cognee.modules.search.operations import get_history from cognee.modules.users.methods import get_authenticated_user from cognee.shared.utils import send_telemetry from cognee import __version__ as cognee_version +from cognee.infrastructure.databases.exceptions import DatabaseNotCreatedError +from cognee.exceptions import CogneeValidationError # Note: Datasets sent by name will only map to datasets owned by the request sender @@ -138,6 +140,17 @@ def get_search_router() -> APIRouter: ) return jsonable_encoder(results) + except (DatabaseNotCreatedError, UserNotFoundError, CogneeValidationError) as e: + # Return a clear 422 with actionable guidance instead of leaking a stacktrace + status_code = getattr(e, "status_code", 422) + return JSONResponse( + status_code=status_code, + content={ + "error": "Search prerequisites not met", + "detail": str(e), + "hint": "Run `await 
cognee.add(...)` then `await cognee.cognify()` before searching.", + }, + ) except PermissionDeniedError: return [] except Exception as error: diff --git a/cognee/api/v1/search/search.py b/cognee/api/v1/search/search.py index 354331c57..ee7408758 100644 --- a/cognee/api/v1/search/search.py +++ b/cognee/api/v1/search/search.py @@ -11,6 +11,9 @@ from cognee.modules.data.methods import get_authorized_existing_datasets from cognee.modules.data.exceptions import DatasetNotFoundError from cognee.context_global_variables import set_session_user_context_variable from cognee.shared.logging_utils import get_logger +from cognee.infrastructure.databases.exceptions import DatabaseNotCreatedError +from cognee.exceptions import CogneeValidationError +from cognee.modules.users.exceptions.exceptions import UserNotFoundError logger = get_logger() @@ -176,7 +179,18 @@ async def search( datasets = [datasets] if user is None: - user = await get_default_user() + try: + user = await get_default_user() + except (DatabaseNotCreatedError, UserNotFoundError) as error: + # Provide a clear, actionable message instead of surfacing low-level stacktraces + raise CogneeValidationError( + message=( + "Search prerequisites not met: no database/default user found. " + "Initialize Cognee before searching by:\n" + "β€’ running `await cognee.add(...)` followed by `await cognee.cognify()`." 
+ ), + name="SearchPreconditionError", + ) from error await set_session_user_context_variable(user) From aeb2f39fd847efc6ff6a1ad3e400ea7583ee7d4f Mon Sep 17 00:00:00 2001 From: Pavel Zorin Date: Fri, 9 Jan 2026 14:15:36 +0100 Subject: [PATCH 03/10] Chore: Remove Lint and Format check in favor to pre-commit --- .github/workflows/basic_tests.yml | 37 ------------------------------- 1 file changed, 37 deletions(-) diff --git a/.github/workflows/basic_tests.yml b/.github/workflows/basic_tests.yml index 1fc20a148..6d9ddce3a 100644 --- a/.github/workflows/basic_tests.yml +++ b/.github/workflows/basic_tests.yml @@ -34,43 +34,6 @@ env: ENV: 'dev' jobs: - - lint: - name: Run Linting - runs-on: ubuntu-22.04 - steps: - - name: Check out repository - uses: actions/checkout@v4 - with: - fetch-depth: 0 - - - name: Cognee Setup - uses: ./.github/actions/cognee_setup - with: - python-version: ${{ inputs.python-version }} - - - name: Run Linting - uses: astral-sh/ruff-action@v2 - - format-check: - name: Run Formatting Check - runs-on: ubuntu-22.04 - steps: - - name: Check out repository - uses: actions/checkout@v4 - with: - fetch-depth: 0 - - - name: Cognee Setup - uses: ./.github/actions/cognee_setup - with: - python-version: ${{ inputs.python-version }} - - - name: Run Formatting Check - uses: astral-sh/ruff-action@v2 - with: - args: "format --check" - unit-tests: name: Run Unit Tests runs-on: ubuntu-22.04 From fb4796204a92a84436f083b6c93b3ed41a719839 Mon Sep 17 00:00:00 2001 From: Pavel Zorin Date: Fri, 9 Jan 2026 18:06:08 +0100 Subject: [PATCH 04/10] Chore: Fix helm chart --- .gitignore | 1 + deployment/helm/README.md | 18 +++++++++++---- .../helm/templates/cognee_deployment.yaml | 23 +++++++++++++++++++ deployment/helm/templates/cognee_service.yaml | 2 +- deployment/helm/templates/secrets.yml | 7 ++++++ deployment/helm/values.yaml | 6 +++-- 6 files changed, 50 insertions(+), 7 deletions(-) create mode 100644 deployment/helm/templates/secrets.yml diff --git a/.gitignore 
b/.gitignore index 3d7d33d3c..7c3095d08 100644 --- a/.gitignore +++ b/.gitignore @@ -148,6 +148,7 @@ ENV/ env.bak/ venv.bak/ mise.toml +deployment/helm/values-local.yml # Spyder project settings .spyderproject diff --git a/deployment/helm/README.md b/deployment/helm/README.md index 6ee79cc80..3b496c54b 100644 --- a/deployment/helm/README.md +++ b/deployment/helm/README.md @@ -1,6 +1,7 @@ -# cognee-infra-helm -General infrastructure setup for Cognee on Kubernetes using a Helm chart. +# Example Helm chart +Example Helm chart for Cognee with PostgreSQL and the pgvector extension. +It is not ready for production use. ## Prerequisites Before deploying the Helm chart, ensure the following prerequisites are met: @@ -13,13 +14,22 @@ Before deploying the Helm chart, ensure the following prerequisites are met: Clone the Repository Clone this repository to your local machine and navigate to the directory. -## Deploy Helm Chart: +## Example: deploy the Helm chart ```bash - helm install cognee ./cognee-chart + helm upgrade --install cognee deployment/helm \ --namespace cognee --create-namespace \ --set cognee.env.LLM_API_KEY="$YOUR_KEY" ``` **Uninstall Helm Release**: ```bash helm uninstall cognee ``` + +## Port forwarding +To access Cognee, run +``` +kubectl port-forward svc/cognee-service -n cognee 8000 +``` +It will be available at localhost:8000. diff --git a/deployment/helm/templates/cognee_deployment.yaml b/deployment/helm/templates/cognee_deployment.yaml index f16a475ec..cf44d7301 100644 --- a/deployment/helm/templates/cognee_deployment.yaml +++ b/deployment/helm/templates/cognee_deployment.yaml @@ -20,12 +20,35 @@ spec: ports: - containerPort: {{ .Values.cognee.port }} env: + - name: ENABLE_BACKEND_ACCESS_CONTROL + value: "false" - name: HOST value: {{ .Values.cognee.env.HOST }} - name: ENVIRONMENT value: {{ .Values.cognee.env.ENVIRONMENT }} - name: PYTHONPATH value: {{ .Values.cognee.env.PYTHONPATH }} + - name: VECTOR_DB_PROVIDER + value: pgvector + - name: DB_HOST + 
value: {{ .Release.Name }}-postgres + - name: DB_PORT + value: "{{ .Values.postgres.port }}" + - name: DB_NAME + value: {{ .Values.postgres.env.POSTGRES_DB }} + - name: DB_USERNAME + value: {{ .Values.postgres.env.POSTGRES_USER }} + - name: DB_PASSWORD + value: {{ .Values.postgres.env.POSTGRES_PASSWORD }} + - name: LLM_API_KEY + valueFrom: + secretKeyRef: + name: {{ .Release.Name }}-llm-api-key + key: LLM_API_KEY + - name: LLM_MODEL + value: {{ .Values.cognee.env.LLM_MODEL }} + - name: LLM_PROVIDER + value: {{ .Values.cognee.env.LLM_PROVIDER }} resources: limits: cpu: {{ .Values.cognee.resources.cpu }} diff --git a/deployment/helm/templates/cognee_service.yaml b/deployment/helm/templates/cognee_service.yaml index 21e9e470e..b3ecbd5e3 100644 --- a/deployment/helm/templates/cognee_service.yaml +++ b/deployment/helm/templates/cognee_service.yaml @@ -5,7 +5,7 @@ metadata: labels: app: {{ .Release.Name }}-cognee spec: - type: NodePort + type: ClusterIP ports: - port: {{ .Values.cognee.port }} targetPort: {{ .Values.cognee.port }} diff --git a/deployment/helm/templates/secrets.yml b/deployment/helm/templates/secrets.yml new file mode 100644 index 000000000..1088865d2 --- /dev/null +++ b/deployment/helm/templates/secrets.yml @@ -0,0 +1,7 @@ +apiVersion: v1 +kind: Secret +metadata: + name: {{ .Release.Name }}-llm-api-key +type: Opaque +data: + LLM_API_KEY: {{ .Values.cognee.env.LLM_API_KEY | b64enc | quote }} diff --git a/deployment/helm/values.yaml b/deployment/helm/values.yaml index 278312373..4a8fd4622 100644 --- a/deployment/helm/values.yaml +++ b/deployment/helm/values.yaml @@ -7,9 +7,11 @@ cognee: HOST: "0.0.0.0" ENVIRONMENT: "local" PYTHONPATH: "." 
+ LLM_MODEL: "openai/gpt-4o-mini" + LLM_PROVIDER: "openai" resources: cpu: "4.0" - memory: "8Gi" + memory: "2Gi" # Configuration for the 'postgres' database service postgres: @@ -19,4 +21,4 @@ postgres: POSTGRES_USER: "cognee" POSTGRES_PASSWORD: "cognee" POSTGRES_DB: "cognee_db" - storage: "8Gi" + storage: "2Gi" From f73457ef725772670969cdd25ce8878b2af77266 Mon Sep 17 00:00:00 2001 From: HectorSin Date: Sat, 10 Jan 2026 20:24:00 +0900 Subject: [PATCH 05/10] refactor: add type hints for user_id and visualization server args Signed-off-by: HectorSin --- cognee/shared/utils.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/cognee/shared/utils.py b/cognee/shared/utils.py index 7c76cfa59..17501a512 100644 --- a/cognee/shared/utils.py +++ b/cognee/shared/utils.py @@ -8,7 +8,8 @@ import http.server import socketserver from threading import Thread import pathlib -from uuid import uuid4, uuid5, NAMESPACE_OID +from typing import Union +from uuid import uuid4, uuid5, NAMESPACE_OID, UUID from cognee.base_config import get_base_config from cognee.shared.logging_utils import get_logger @@ -78,7 +79,7 @@ def _sanitize_nested_properties(obj, property_names: list[str]): return obj -def send_telemetry(event_name: str, user_id, additional_properties: dict = {}): +def send_telemetry(event_name: str, user_id: Union[str, UUID], additional_properties: dict = {}): if os.getenv("TELEMETRY_DISABLED"): return @@ -138,7 +139,9 @@ def embed_logo(p, layout_scale, logo_alpha, position): def start_visualization_server( - host="0.0.0.0", port=8001, handler_class=http.server.SimpleHTTPRequestHandler + host: str = "0.0.0.0", + port: int = 8001, + handler_class=http.server.SimpleHTTPRequestHandler, ): """ Spin up a simple HTTP server in a background thread to serve files. 
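The patch above tightens the signature of `start_visualization_server`, which serves files from a background thread. A minimal stdlib sketch of that pattern, under the same type hints (the function and variable names here are illustrative, not Cognee's implementation):

```python
import http.server
import socketserver
import threading


def start_background_server(
    host: str = "127.0.0.1",
    port: int = 0,  # 0 lets the OS pick a free port
    handler_class: type[
        http.server.SimpleHTTPRequestHandler
    ] = http.server.SimpleHTTPRequestHandler,
):
    """Serve files from the current directory on a daemon thread."""
    httpd = socketserver.TCPServer((host, port), handler_class)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    return httpd, thread


httpd, thread = start_background_server()
print(f"serving on port {httpd.server_address[1]}")
httpd.shutdown()      # stops the serve_forever() loop
httpd.server_close()  # releases the listening socket
```

Running the server as a daemon thread keeps it from blocking the caller and lets the process exit without an explicit join.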
From ebf2aaaa5c4f049bf5e9af893912d7053f175a9c Mon Sep 17 00:00:00 2001 From: HectorSin Date: Sat, 10 Jan 2026 20:58:08 +0900 Subject: [PATCH 06/10] refactor: add type hint for handler_class Signed-off-by: HectorSin --- cognee/shared/utils.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/cognee/shared/utils.py b/cognee/shared/utils.py index 17501a512..460a10bfb 100644 --- a/cognee/shared/utils.py +++ b/cognee/shared/utils.py @@ -141,7 +141,9 @@ def embed_logo(p, layout_scale, logo_alpha, position): def start_visualization_server( host: str = "0.0.0.0", port: int = 8001, - handler_class=http.server.SimpleHTTPRequestHandler, + handler_class: type[ + http.server.SimpleHTTPRequestHandler + ] = http.server.SimpleHTTPRequestHandler, ): """ Spin up a simple HTTP server in a background thread to serve files. From 46c12cc0ee7dda7b379b52f34a3f84f6607b3c6f Mon Sep 17 00:00:00 2001 From: HectorSin Date: Sat, 10 Jan 2026 21:13:10 +0900 Subject: [PATCH 07/10] refactor: resolve remaining ANN001 errors in utils.py Signed-off-by: HectorSin --- cognee/shared/utils.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/cognee/shared/utils.py b/cognee/shared/utils.py index 460a10bfb..7eb036ef1 100644 --- a/cognee/shared/utils.py +++ b/cognee/shared/utils.py @@ -8,7 +8,7 @@ import http.server import socketserver from threading import Thread import pathlib -from typing import Union +from typing import Union, Any, Dict, List from uuid import uuid4, uuid5, NAMESPACE_OID, UUID from cognee.base_config import get_base_config @@ -59,7 +59,7 @@ def get_anonymous_id(): return anonymous_id -def _sanitize_nested_properties(obj, property_names: list[str]): +def _sanitize_nested_properties(obj: Union[Dict, List, Any], property_names: list[str]): """ Recursively replaces any property whose key matches one of `property_names` (e.g., ['url', 'path']) in a nested dict or list with a uuid5 hash @@ -109,7 +109,7 @@ def send_telemetry(event_name: str, user_id: 
Union[str, UUID], additional_proper print(f"Error sending telemetry through proxy: {response.status_code}") -def embed_logo(p, layout_scale, logo_alpha, position): +def embed_logo(p: Any, layout_scale: float, logo_alpha: float, position: str): """ Embed a logo into the graph visualization as a watermark. """ From da5660b7169c32a744430b06201c282a338dcf76 Mon Sep 17 00:00:00 2001 From: HectorSin Date: Sat, 10 Jan 2026 21:21:04 +0900 Subject: [PATCH 08/10] refactor: fix mutable default argument in send_telemetry Signed-off-by: HectorSin --- cognee/shared/utils.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/cognee/shared/utils.py b/cognee/shared/utils.py index 7eb036ef1..ce9fb2169 100644 --- a/cognee/shared/utils.py +++ b/cognee/shared/utils.py @@ -80,6 +80,8 @@ def _sanitize_nested_properties(obj: Union[Dict, List, Any], property_names: lis def send_telemetry(event_name: str, user_id: Union[str, UUID], additional_properties: dict = {}): + if additional_properties is None: + additional_properties = {} if os.getenv("TELEMETRY_DISABLED"): return From 4189cda89518c779d329b7cd05c9050533b6a40e Mon Sep 17 00:00:00 2001 From: HectorSin Date: Sat, 10 Jan 2026 21:25:25 +0900 Subject: [PATCH 09/10] refactor: simplify type hint and add return type for sanitize function Signed-off-by: HectorSin --- cognee/shared/utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cognee/shared/utils.py b/cognee/shared/utils.py index ce9fb2169..3b6e7ba14 100644 --- a/cognee/shared/utils.py +++ b/cognee/shared/utils.py @@ -59,7 +59,7 @@ def get_anonymous_id(): return anonymous_id -def _sanitize_nested_properties(obj: Union[Dict, List, Any], property_names: list[str]): +def _sanitize_nested_properties(obj: Any, property_names: list[str]) -> Any: """ Recursively replaces any property whose key matches one of `property_names` (e.g., ['url', 'path']) in a nested dict or list with a uuid5 hash From ab990f7c5c2c21a02f85f0c383ae6f157b72bfe9 Mon Sep 17 00:00:00 2001 From: 
vasilije Date: Sun, 11 Jan 2026 16:04:11 +0100 Subject: [PATCH 10/10] docs: add CLAUDE.md for Claude Code guidance MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add comprehensive CLAUDE.md file to guide future Claude Code instances working in this repository. Includes: - Development commands (setup, testing, code quality) - Architecture overview (ECL pipeline, data flows, key patterns) - Complete configuration guide (LLM providers, databases, storage) - All 15 search types with descriptions - Extension points for custom functionality - Troubleshooting common issues πŸ€– Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 --- CLAUDE.md | 588 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 588 insertions(+) create mode 100644 CLAUDE.md diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 000000000..7ac4f01d0 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,588 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +Cognee is an open-source AI memory platform that transforms raw data into persistent knowledge graphs for AI agents. It replaces traditional RAG (Retrieval-Augmented Generation) with an ECL (Extract, Cognify, Load) pipeline combining vector search, graph databases, and LLM-powered entity extraction. + +**Requirements**: Python 3.9 - 3.12 + +## Development Commands + +### Setup +```bash +# Create virtual environment (recommended: uv) +uv venv && source .venv/bin/activate + +# Install with pip, poetry, or uv +uv pip install -e . 
+ +# Install with dev dependencies +uv pip install -e ".[dev]" + +# Install with specific extras +uv pip install -e ".[postgres,neo4j,docs,chromadb]" + +# Set up pre-commit hooks +pre-commit install +``` + +### Available Installation Extras +- **postgres** / **postgres-binary** - PostgreSQL + PGVector support +- **neo4j** - Neo4j graph database support +- **neptune** - AWS Neptune support +- **chromadb** - ChromaDB vector database +- **docs** - Document processing (unstructured library) +- **scraping** - Web scraping (Tavily, BeautifulSoup, Playwright) +- **langchain** - LangChain integration +- **llama-index** - LlamaIndex integration +- **anthropic** - Anthropic Claude models +- **gemini** - Google Gemini models +- **ollama** - Ollama local models +- **mistral** - Mistral AI models +- **groq** - Groq API support +- **llama-cpp** - Llama.cpp local inference +- **huggingface** - HuggingFace transformers +- **aws** - S3 storage backend +- **redis** - Redis caching +- **graphiti** - Graphiti-core integration +- **baml** - BAML structured output +- **dlt** - Data load tool (dlt) integration +- **docling** - Docling document processing +- **codegraph** - Code graph extraction +- **evals** - Evaluation tools +- **deepeval** - DeepEval testing framework +- **posthog** - PostHog analytics +- **monitoring** - Sentry + Langfuse observability +- **distributed** - Modal distributed execution +- **dev** - All development tools (pytest, mypy, ruff, etc.) 
+- **debug** - Debugpy for debugging + +### Testing +```bash +# Run all tests +pytest + +# Run with coverage +pytest --cov=cognee --cov-report=html + +# Run specific test file +pytest cognee/tests/test_custom_model.py + +# Run specific test function +pytest cognee/tests/test_custom_model.py::test_function_name + +# Run async tests +pytest -v cognee/tests/integration/ + +# Run unit tests only +pytest cognee/tests/unit/ + +# Run integration tests only +pytest cognee/tests/integration/ +``` + +### Code Quality +```bash +# Run ruff linter +ruff check . + +# Run ruff formatter +ruff format . + +# Run both linting and formatting (pre-commit) +pre-commit run --all-files + +# Type checking with mypy +mypy cognee/ + +# Run pylint +pylint cognee/ +``` + +### Running Cognee +```bash +# Using Python SDK +python examples/python/simple_example.py + +# Using CLI +cognee-cli add "Your text here" +cognee-cli cognify +cognee-cli search "Your query" +cognee-cli delete --all + +# Launch full stack with UI +cognee-cli -ui +``` + +## Architecture Overview + +### Core Workflow: add β†’ cognify β†’ search/memify + +1. **add()** - Ingest data (files, URLs, text) into datasets +2. **cognify()** - Extract entities/relationships and build knowledge graph +3. **search()** - Query knowledge using various retrieval strategies +4. **memify()** - Enrich graph with additional context and rules + +### Key Architectural Patterns + +#### 1. Pipeline-Based Processing +All data flows through task-based pipelines (`cognee/modules/pipelines/`). Tasks are composable units that can run sequentially or in parallel. Example pipeline tasks: `classify_documents`, `extract_graph_from_data`, `add_data_points`. + +#### 2. 
Interface-Based Database Adapters +Multiple backends are supported through adapter interfaces: +- **Graph**: Kuzu (default), Neo4j, Neptune via `GraphDBInterface` +- **Vector**: LanceDB (default), ChromaDB, PGVector via `VectorDBInterface` +- **Relational**: SQLite (default), PostgreSQL + +Key files: +- `cognee/infrastructure/databases/graph/graph_db_interface.py` +- `cognee/infrastructure/databases/vector/vector_db_interface.py` + +#### 3. Multi-Tenant Access Control +User β†’ Dataset β†’ Data hierarchy with permission-based filtering. Enable with `ENABLE_BACKEND_ACCESS_CONTROL=True`. Each user+dataset combination can have isolated graph/vector databases (when using supported backends: Kuzu, LanceDB, SQLite, Postgres). + +### Layer Structure + +``` +API Layer (cognee/api/v1/) + ↓ +Main Functions (add, cognify, search, memify) + ↓ +Pipeline Orchestrator (cognee/modules/pipelines/) + ↓ +Task Execution Layer (cognee/tasks/) + ↓ +Domain Modules (graph, retrieval, ingestion, etc.) + ↓ +Infrastructure Adapters (LLM, databases) + ↓ +External Services (OpenAI, Kuzu, LanceDB, etc.) 
+``` + +### Critical Data Flow Paths + +#### ADD: Data Ingestion +`add()` β†’ `resolve_data_directories` β†’ `ingest_data` β†’ `save_data_item_to_storage` β†’ Create Dataset + Data records in relational DB + +Key files: `cognee/api/v1/add/add.py`, `cognee/tasks/ingestion/ingest_data.py` + +#### COGNIFY: Knowledge Graph Construction +`cognify()` β†’ `classify_documents` β†’ `extract_chunks_from_documents` β†’ `extract_graph_from_data` (LLM extracts entities/relationships using Instructor) β†’ `summarize_text` β†’ `add_data_points` (store in graph + vector DBs) + +Key files: +- `cognee/api/v1/cognify/cognify.py` +- `cognee/tasks/graph/extract_graph_from_data.py` +- `cognee/tasks/storage/add_data_points.py` + +#### SEARCH: Retrieval +`search(query_text, query_type)` β†’ route to retriever type β†’ filter by permissions β†’ return results + +Available search types (from `cognee/modules/search/types/SearchType.py`): +- **GRAPH_COMPLETION** (default) - Graph traversal + LLM completion +- **GRAPH_SUMMARY_COMPLETION** - Uses pre-computed summaries with graph context +- **GRAPH_COMPLETION_COT** - Chain-of-thought reasoning over graph +- **GRAPH_COMPLETION_CONTEXT_EXTENSION** - Extended context graph retrieval +- **TRIPLET_COMPLETION** - Triplet-based (subject-predicate-object) search +- **RAG_COMPLETION** - Traditional RAG with chunks +- **CHUNKS** - Vector similarity search over chunks +- **CHUNKS_LEXICAL** - Lexical (keyword) search over chunks +- **SUMMARIES** - Search pre-computed document summaries +- **CYPHER** - Direct Cypher query execution (requires `ALLOW_CYPHER_QUERY=True`) +- **NATURAL_LANGUAGE** - Natural language to structured query +- **TEMPORAL** - Time-aware graph search +- **FEELING_LUCKY** - Automatic search type selection +- **FEEDBACK** - User feedback-based refinement +- **CODING_RULES** - Code-specific search rules + +Key files: +- `cognee/api/v1/search/search.py` +- `cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py` +- 
`cognee/modules/search/types/SearchType.py` + +### Core Data Models + +#### Engine Models (`cognee/infrastructure/engine/models/`) +- **DataPoint** - Base class for all graph nodes (versioned, with metadata) +- **Edge** - Graph relationships (source, target, relationship type) +- **Triplet** - (Subject, Predicate, Object) representation + +#### Graph Models (`cognee/shared/data_models.py`) +- **KnowledgeGraph** - Container for nodes and edges +- **Node** - Entity (id, name, type, description) +- **Edge** - Relationship (source_node_id, target_node_id, relationship_name) + +### Key Infrastructure Components + +#### LLM Gateway (`cognee/infrastructure/llm/LLMGateway.py`) +Unified interface for multiple LLM providers: OpenAI, Anthropic, Gemini, Ollama, Mistral, Bedrock. Uses Instructor for structured output extraction. + +#### Embedding Engines +Factory pattern for embeddings: `cognee/infrastructure/databases/vector/embeddings/get_embedding_engine.py` + +#### Document Loaders +Support for PDF, DOCX, CSV, images, audio, code files in `cognee/infrastructure/files/` + +## Important Configuration + +### Environment Setup +Copy `.env.template` to `.env` and configure: + +```bash +# Minimal setup (defaults to OpenAI + local file-based databases) +LLM_API_KEY="your_openai_api_key" +LLM_MODEL="openai/gpt-4o-mini" # Default model +``` + +**Important**: If you configure only LLM or only embeddings, the other defaults to OpenAI. Ensure you have a working OpenAI API key, or configure both to avoid unexpected defaults. + +Default databases (no extra setup needed): +- **Relational**: SQLite (metadata and state storage) +- **Vector**: LanceDB (embeddings for semantic search) +- **Graph**: Kuzu (knowledge graph and relationships) + +All stored in `.venv` by default. Override with `DATA_ROOT_DIRECTORY` and `SYSTEM_ROOT_DIRECTORY`. 
+ +### Switching Databases + +#### Relational Databases +```bash +# PostgreSQL (requires postgres extra: pip install cognee[postgres]) +DB_PROVIDER=postgres +DB_HOST=localhost +DB_PORT=5432 +DB_USERNAME=cognee +DB_PASSWORD=cognee +DB_NAME=cognee_db +``` + +#### Vector Databases +Supported: lancedb (default), pgvector, chromadb, qdrant, weaviate, milvus +```bash +# ChromaDB (requires chromadb extra) +VECTOR_DB_PROVIDER=chromadb + +# PGVector (requires postgres extra) +VECTOR_DB_PROVIDER=pgvector +VECTOR_DB_URL=postgresql://cognee:cognee@localhost:5432/cognee_db +``` + +#### Graph Databases +Supported: kuzu (default), neo4j, neptune, kuzu-remote +```bash +# Neo4j (requires neo4j extra: pip install cognee[neo4j]) +GRAPH_DATABASE_PROVIDER=neo4j +GRAPH_DATABASE_URL=bolt://localhost:7687 +GRAPH_DATABASE_NAME=neo4j +GRAPH_DATABASE_USERNAME=neo4j +GRAPH_DATABASE_PASSWORD=yourpassword + +# Remote Kuzu +GRAPH_DATABASE_PROVIDER=kuzu-remote +GRAPH_DATABASE_URL=http://localhost:8000 +GRAPH_DATABASE_USERNAME=your_username +GRAPH_DATABASE_PASSWORD=your_password +``` + +### LLM Provider Configuration + +Supported providers: OpenAI (default), Azure OpenAI, Google Gemini, Anthropic, AWS Bedrock, Ollama, LM Studio, Custom (OpenAI-compatible APIs) + +#### OpenAI (Recommended - Minimal Setup) +```bash +LLM_API_KEY="your_openai_api_key" +LLM_MODEL="openai/gpt-4o-mini" # or gpt-4o, gpt-4-turbo, etc. 
+LLM_PROVIDER="openai" +``` + +#### Azure OpenAI +```bash +LLM_PROVIDER="azure" +LLM_MODEL="azure/gpt-4o-mini" +LLM_ENDPOINT="https://YOUR-RESOURCE.openai.azure.com/openai/deployments/gpt-4o-mini" +LLM_API_KEY="your_azure_api_key" +LLM_API_VERSION="2024-12-01-preview" +``` + +#### Google Gemini (requires gemini extra) +```bash +LLM_PROVIDER="gemini" +LLM_MODEL="gemini/gemini-2.0-flash-exp" +LLM_API_KEY="your_gemini_api_key" +``` + +#### Anthropic Claude (requires anthropic extra) +```bash +LLM_PROVIDER="anthropic" +LLM_MODEL="claude-3-5-sonnet-20241022" +LLM_API_KEY="your_anthropic_api_key" +``` + +#### Ollama (Local - requires ollama extra) +```bash +LLM_PROVIDER="ollama" +LLM_MODEL="llama3.1:8b" +LLM_ENDPOINT="http://localhost:11434/v1" +LLM_API_KEY="ollama" +EMBEDDING_PROVIDER="ollama" +EMBEDDING_MODEL="nomic-embed-text:latest" +EMBEDDING_ENDPOINT="http://localhost:11434/api/embed" +HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5" +``` + +#### Custom / OpenRouter / vLLM +```bash +LLM_PROVIDER="custom" +LLM_MODEL="openrouter/google/gemini-2.0-flash-lite-preview-02-05:free" +LLM_ENDPOINT="https://openrouter.ai/api/v1" +LLM_API_KEY="your_api_key" +``` + +#### AWS Bedrock (requires aws extra) +```bash +LLM_PROVIDER="bedrock" +LLM_MODEL="anthropic.claude-3-sonnet-20240229-v1:0" +AWS_REGION="us-east-1" +AWS_ACCESS_KEY_ID="your_access_key" +AWS_SECRET_ACCESS_KEY="your_secret_key" +# Optional for temporary credentials: +# AWS_SESSION_TOKEN="your_session_token" +``` + +#### LLM Rate Limiting +```bash +LLM_RATE_LIMIT_ENABLED=true +LLM_RATE_LIMIT_REQUESTS=60 # Requests per interval +LLM_RATE_LIMIT_INTERVAL=60 # Interval in seconds +``` + +#### Instructor Mode (Structured Output) +```bash +# LLM_INSTRUCTOR_MODE controls how structured data is extracted +# Each LLM has its own default (e.g., gpt-4o models use "json_schema_mode") +# Override if needed: +LLM_INSTRUCTOR_MODE="json_schema_mode" # or "tool_call", "md_json", etc. 
+``` + +### Structured Output Framework +```bash +# Use Instructor (default, via litellm) +STRUCTURED_OUTPUT_FRAMEWORK="instructor" + +# Or use BAML (requires baml extra: pip install cognee[baml]) +STRUCTURED_OUTPUT_FRAMEWORK="baml" +BAML_LLM_PROVIDER=openai +BAML_LLM_MODEL="gpt-4o-mini" +BAML_LLM_API_KEY="your_api_key" +``` + +### Storage Backend +```bash +# Local filesystem (default) +STORAGE_BACKEND="local" + +# S3 (requires aws extra: pip install cognee[aws]) +STORAGE_BACKEND="s3" +STORAGE_BUCKET_NAME="your-bucket-name" +AWS_REGION="us-east-1" +AWS_ACCESS_KEY_ID="your_access_key" +AWS_SECRET_ACCESS_KEY="your_secret_key" +DATA_ROOT_DIRECTORY="s3://your-bucket/cognee/data" +SYSTEM_ROOT_DIRECTORY="s3://your-bucket/cognee/system" +``` + +## Extension Points + +### Adding New Functionality + +1. **New Task Type**: Create task function in `cognee/tasks/`, return Task object, register in pipeline +2. **New Database Backend**: Implement `GraphDBInterface` or `VectorDBInterface` in `cognee/infrastructure/databases/` +3. **New LLM Provider**: Add configuration in LLM config (uses litellm) +4. **New Document Processor**: Extend loaders in `cognee/modules/data/processing/` +5. **New Search Type**: Add to `SearchType` enum and implement retriever in `cognee/modules/retrieval/` +6. **Custom Graph Models**: Define Pydantic models extending `DataPoint` in your code + +### Working with Ontologies +Cognee supports ontology-based entity extraction to ground knowledge graphs in standardized semantic frameworks (e.g., OWL ontologies). + +Configuration: +```bash +ONTOLOGY_RESOLVER=rdflib # Default: uses rdflib and OWL files +MATCHING_STRATEGY=fuzzy # Default: fuzzy matching with 80% similarity +ONTOLOGY_FILE_PATH=/path/to/your/ontology.owl # Full path to ontology file +``` + +Implementation: `cognee/modules/ontology/` + +## Branching Strategy + +**IMPORTANT**: Always branch from `dev`, not `main`. The `dev` branch is the active development branch. 
+ +```bash +git checkout dev +git pull origin dev +git checkout -b feature/your-feature-name +``` + +## Code Style + +- Ruff for linting and formatting (configured in `pyproject.toml`) +- Line length: 100 characters +- Pre-commit hooks run ruff automatically +- Type hints encouraged (mypy checks enabled) + +## Testing Strategy + +Tests are organized in `cognee/tests/`: +- `unit/` - Unit tests for individual modules +- `integration/` - Full pipeline integration tests +- `cli_tests/` - CLI command tests +- `tasks/` - Task-specific tests + +When adding features, add corresponding tests. Integration tests should cover the full add β†’ cognify β†’ search flow. + +## API Structure + +FastAPI application with versioned routes under `cognee/api/v1/`: +- `/add` - Data ingestion +- `/cognify` - Knowledge graph processing +- `/search` - Query interface +- `/memify` - Graph enrichment +- `/datasets` - Dataset management +- `/users` - Authentication (if `REQUIRE_AUTHENTICATION=True`) +- `/visualize` - Graph visualization server + +## Python SDK Entry Points + +Main functions exported from `cognee/__init__.py`: +- `add(data, dataset_name)` - Ingest data +- `cognify(datasets)` - Build knowledge graph +- `search(query_text, query_type)` - Query knowledge +- `memify(extraction_tasks, enrichment_tasks)` - Enrich graph +- `delete(data_id)` - Remove data +- `config()` - Configuration management +- `datasets()` - Dataset operations + +All functions are async - use `await` or `asyncio.run()`. 
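+
+A minimal usage sketch tying these entry points together (assumes a configured `.env` with a valid `LLM_API_KEY`; the sample text and query are illustrative, and `search` is shown without `query_type` on the assumption it has a sensible default):
+
+```python
+import asyncio
+
+import cognee
+
+async def main():
+    # Ingest a document into a named dataset
+    await cognee.add("Cognee turns documents into knowledge graphs.", dataset_name="demo")
+    # Build the knowledge graph for that dataset
+    await cognee.cognify(datasets=["demo"])
+    # Query the graph
+    results = await cognee.search(query_text="What does cognee do?")
+    print(results)
+
+asyncio.run(main())
+```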
+ +## Security Considerations + +Several security environment variables in `.env`: +- `ACCEPT_LOCAL_FILE_PATH` - Allow local file paths (default: True) +- `ALLOW_HTTP_REQUESTS` - Allow HTTP requests from Cognee (default: True) +- `ALLOW_CYPHER_QUERY` - Allow raw Cypher queries (default: True) +- `REQUIRE_AUTHENTICATION` - Enable API authentication (default: False) +- `ENABLE_BACKEND_ACCESS_CONTROL` - Multi-tenant isolation (default: True) + +For production deployments, review and tighten these settings. + +## Common Patterns + +### Creating a Custom Pipeline Task +```python +from cognee.modules.pipelines.tasks.Task import Task + +async def my_custom_task(data): + # Your logic here + processed_data = process(data) + return processed_data + +# Use in pipeline +task = Task(my_custom_task) +``` + +### Accessing Databases Directly +```python +from cognee.infrastructure.databases.graph import get_graph_engine +from cognee.infrastructure.databases.vector import get_vector_engine + +graph_engine = await get_graph_engine() +vector_engine = await get_vector_engine() +``` + +### Using LLM Gateway +```python +from cognee.infrastructure.llm.get_llm_client import get_llm_client + +llm_client = get_llm_client() +response = await llm_client.acreate_structured_output( + text_input="Your prompt", + system_prompt="System instructions", + response_model=YourPydanticModel +) +``` + +## Key Concepts + +### Datasets +Datasets are project-level containers that support organization, permissions, and isolated processing workflows. Each user can have multiple datasets with different access permissions. + +```python +# Create/use a dataset +await cognee.add(data, dataset_name="my_project") +await cognee.cognify(datasets=["my_project"]) +``` + +### DataPoints +Atomic knowledge units that form the foundation of graph structures. All graph nodes extend the `DataPoint` base class with versioning and metadata support. 
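+
+A hedged sketch of custom graph models extending `DataPoint` (the import path and the relationship-as-field pattern are assumptions drawn from typical usage, not a verified API):
+
+```python
+from cognee.infrastructure.engine import DataPoint  # import path is an assumption
+
+class Person(DataPoint):
+    name: str
+    occupation: str
+
+class Organization(DataPoint):
+    name: str
+    # Fields holding other DataPoints become edges in the graph (assumed convention)
+    employs: list[Person] = []
+```
+
+Models like these can then be passed to `cognify` via its graph-model option, letting extraction follow your schema instead of the default one.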
+ +### Permissions System +Multi-tenant architecture with users, roles, and Access Control Lists (ACLs): +- Read, write, delete, and share permissions per dataset +- Enable with `ENABLE_BACKEND_ACCESS_CONTROL=True` +- Supports isolated databases per user+dataset (Kuzu, LanceDB, SQLite, Postgres) + +### Graph Visualization +Launch visualization server: +```bash +# Via CLI +cognee-cli -ui # Launches full stack with UI at http://localhost:3000 + +# Via Python +from cognee.api.v1.visualize import start_visualization_server +await start_visualization_server(port=8080) +``` + +## Debugging & Troubleshooting + +### Debug Configuration +- Set `LITELLM_LOG="DEBUG"` for verbose LLM logs (default: "ERROR") +- Enable debug mode: `ENV="development"` or `ENV="debug"` +- Disable telemetry: `TELEMETRY_DISABLED=1` +- Check logs in structured format (uses structlog) +- Use `debugpy` optional dependency for debugging: `pip install cognee[debug]` + +### Common Issues + +**Ollama + OpenAI Embeddings NoDataError** +- Issue: Mixing Ollama with OpenAI embeddings can cause errors +- Solution: Configure both LLM and embeddings to use the same provider, or ensure `HUGGINGFACE_TOKENIZER` is set when using Ollama + +**LM Studio Structured Output** +- Issue: LM Studio requires explicit instructor mode +- Solution: Set `LLM_INSTRUCTOR_MODE="json_schema_mode"` (or appropriate mode) + +**Default Provider Fallback** +- Issue: Configuring only LLM or only embeddings defaults the other to OpenAI +- Solution: Always configure both LLM and embedding providers, or ensure valid OpenAI API key + +**Permission Denied on Search** +- Behavior: Returns empty list rather than error (prevents information leakage) +- Solution: Check dataset permissions and user access rights + +**Database Connection Issues** +- Check: Verify database URLs, credentials, and that services are running +- Docker users: Use `DB_HOST=host.docker.internal` for local databases + +**Rate Limiting Errors** +- Enable client-side rate 
limiting: `LLM_RATE_LIMIT_ENABLED=true` +- Adjust limits: `LLM_RATE_LIMIT_REQUESTS` and `LLM_RATE_LIMIT_INTERVAL` + +## Resources + +- [Documentation](https://docs.cognee.ai/) +- [Discord Community](https://discord.gg/NQPKmU5CCg) +- [GitHub Issues](https://github.com/topoteretes/cognee/issues) +- [Example Notebooks](examples/python/) +- [Research Paper](https://arxiv.org/abs/2505.24478) - Optimizing knowledge graphs for LLM reasoning