cognee/HEALTH_CHECK_IMPLEMENTATION.md
2025-08-02 09:42:25 -05:00

Cognee Health Check System Implementation

Overview

This document describes the health check system for the Cognee API. The system monitors the critical backend components and reports per-component status for production deployments, container orchestration, and monitoring systems.

Implementation Files

1. /cognee/api/health.py

  • HealthChecker class: Main health checking logic
  • Health models: Pydantic models for structured responses
  • Component checkers: Individual health check methods for each service

2. /cognee/api/client.py (Updated)

  • Enhanced health endpoints: Three new endpoints replacing the basic health check
  • Proper HTTP status codes: Returns appropriate status codes based on health status

Health Check Endpoints

1. GET /health - Basic Liveness Probe

  • Purpose: Basic liveness check for container orchestration
  • Response: HTTP 200 (healthy/degraded) or 503 (unhealthy)
  • Use case: Kubernetes liveness probe, load balancer health checks

2. GET /health/ready - Readiness Probe

  • Purpose: Reports whether the service is ready to accept traffic
  • Response: JSON with ready/not ready status
  • Use case: Kubernetes readiness probe, deployment verification

3. GET /health/detailed - Comprehensive Health Status

  • Purpose: Detailed health information for monitoring and debugging
  • Response: Complete health status with component details
  • Use case: Monitoring dashboards, troubleshooting, operational visibility

Health Check Components

Critical Services (Failure = HTTP 503)

  1. Relational Database (SQLite/PostgreSQL)

    • Tests database connectivity and session creation
    • Validates schema accessibility
  2. Vector Database (LanceDB/Qdrant/PGVector/ChromaDB)

    • Tests vector database connectivity
    • Validates index accessibility
  3. Graph Database (Kuzu/Neo4j/FalkorDB/Memgraph)

    • Tests graph database connectivity
    • Validates schema and basic operations
  4. File Storage (Local/S3)

    • Tests file system or S3 accessibility
    • Validates read/write permissions
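
A read/write validation like the one described for local file storage can be sketched as a write, read-back, and delete of a small probe file. This is a minimal illustration under assumptions, not the code in health.py: the function name `check_local_storage`, the probe file naming, and the returned dictionary keys (borrowed from the response format in this document) are hypothetical.

```python
import contextlib
import os
import time
import uuid


def check_local_storage(root: str) -> dict:
    """Write, read back, and delete a small probe file under `root`."""
    start = time.perf_counter()
    # A unique name avoids collisions between concurrent health checks.
    probe = os.path.join(root, f".health_probe_{uuid.uuid4().hex}")
    try:
        with open(probe, "w", encoding="utf-8") as fh:
            fh.write("ok")
        with open(probe, "r", encoding="utf-8") as fh:
            readable = fh.read() == "ok"
        status = "healthy" if readable else "unhealthy"
        details = "Storage accessible" if readable else "Read-back mismatch"
    except OSError as exc:
        # Report only the exception type so paths do not leak into the response.
        status = "unhealthy"
        details = f"Storage check failed: {type(exc).__name__}"
    finally:
        # Best-effort cleanup; the probe may not exist if the write failed.
        with contextlib.suppress(OSError):
            os.remove(probe)
    return {
        "status": status,
        "provider": "local",
        "response_time_ms": int((time.perf_counter() - start) * 1000),
        "details": details,
    }
```

An S3-backed check would follow the same write/read/delete pattern against a bucket, using whatever storage client the deployment already configures.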

Non-Critical Services (Failure = Degraded Status)

  1. LLM Provider (OpenAI/Ollama/Anthropic/Gemini)

    • Validates configuration and API key presence
    • Non-blocking for core functionality
  2. Embedding Service

    • Tests embedding engine accessibility
    • Non-blocking for core functionality
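
The configuration check for the LLM provider can be illustrated without calling any external API. The sketch below only inspects settings; the variable names `LLM_PROVIDER` and `LLM_API_KEY`, the `KEYLESS_PROVIDERS` set, and the choice to report "degraded" at the component level are assumptions made for the example, not confirmed details of the implementation.

```python
import os
from typing import Mapping, Optional

# Providers assumed to run locally without an API key (illustrative only).
KEYLESS_PROVIDERS = {"ollama"}


def check_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Validate that a provider is set and, where needed, an API key is present."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    api_key = env.get("LLM_API_KEY", "").strip()

    if not provider:
        return {"status": "degraded", "provider": "unknown",
                "details": "No LLM provider configured"}
    if provider not in KEYLESS_PROVIDERS and not api_key:
        # Only presence is reported; the key itself is never echoed back.
        return {"status": "degraded", "provider": provider,
                "details": "API key missing"}
    return {"status": "healthy", "provider": provider,
            "details": "Configuration valid"}
```

Because the check never sends a request, it stays fast and cannot fail on provider outages, which fits the non-blocking role described above.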

Response Format

{
  "status": "healthy|degraded|unhealthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "uptime": 3600,
  "components": {
    "relational_db": {
      "status": "healthy",
      "provider": "sqlite",
      "response_time_ms": 45,
      "details": "Connection successful"
    },
    "vector_db": {
      "status": "healthy",
      "provider": "lancedb",
      "response_time_ms": 120,
      "details": "Index accessible"
    },
    "graph_db": {
      "status": "healthy",
      "provider": "kuzu",
      "response_time_ms": 89,
      "details": "Schema validated"
    },
    "file_storage": {
      "status": "healthy",
      "provider": "local",
      "response_time_ms": 156,
      "details": "Storage accessible"
    },
    "llm_provider": {
      "status": "healthy",
      "provider": "openai",
      "response_time_ms": 1250,
      "details": "Configuration valid"
    },
    "embedding_service": {
      "status": "healthy",
      "provider": "configured",
      "response_time_ms": 890,
      "details": "Embedding engine accessible"
    }
  }
}
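
The payload above could be modeled with typed structures along the following lines. The document states that health.py uses Pydantic models; this sketch substitutes standard-library dataclasses so that it runs without dependencies, and the class names `HealthStatus`, `ComponentHealth`, and `HealthResponse` are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    status: HealthStatus
    provider: str
    response_time_ms: int
    details: str


def _utc_now() -> str:
    # ISO 8601 UTC timestamp in the same shape as the example payload.
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class HealthResponse:
    status: HealthStatus
    version: str
    uptime: int  # seconds since process start
    components: Dict[str, ComponentHealth] = field(default_factory=dict)
    timestamp: str = field(default_factory=_utc_now)

    def to_json(self) -> str:
        # HealthStatus subclasses str, so json.dumps emits the plain value.
        return json.dumps(asdict(self))
```

A response is then built by filling `components` with one `ComponentHealth` per checked service and calling `to_json()`.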

Health Status Logic

Overall Status Determination

  • UNHEALTHY: Any critical service is unhealthy
  • DEGRADED: All critical services healthy, but non-critical services have issues
  • HEALTHY: All services are functioning properly

HTTP Status Codes

  • 200: Healthy or degraded (service operational)
  • 503: Unhealthy (service not ready/available)
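
The status rules and HTTP mapping above can be expressed in a few lines. This is a sketch rather than the actual HealthChecker logic: the function names and the `CRITICAL` set are assumptions (the component keys are taken from the response format in this document), and treating any non-healthy component as grounds for "degraded" is one possible reading of the rules.

```python
from typing import Dict

# Component keys as they appear in the response format; failures here map to 503.
CRITICAL = {"relational_db", "vector_db", "graph_db", "file_storage"}


def overall_status(components: Dict[str, str]) -> str:
    """Reduce per-component statuses to a single overall status."""
    critical_down = any(
        status == "unhealthy"
        for name, status in components.items()
        if name in CRITICAL
    )
    if critical_down:
        return "unhealthy"
    # Assumption: any remaining non-healthy component degrades the service.
    if any(status != "healthy" for status in components.values()):
        return "degraded"
    return "healthy"


def http_status_code(status: str) -> int:
    """Healthy and degraded both count as operational (200); unhealthy is 503."""
    return 503 if status == "unhealthy" else 200
```

For example, an unhealthy `llm_provider` alongside healthy databases yields "degraded" and HTTP 200, while an unhealthy `graph_db` yields "unhealthy" and HTTP 503.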

Usage Examples

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cognee-api
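spec:
  selector:
    matchLabels:
      app: cognee-api  # illustrative label; must match the template labels
  template:
    metadata:
      labels:
        app: cognee-api
    spec: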
      containers:
      - name: cognee
        image: cognee:latest
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5

Docker Compose Health Check

version: '3.8'
services:
  cognee-api:
    image: cognee:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

Monitoring Integration

# Basic health check
curl http://localhost:8000/health

# Detailed health status for monitoring
curl http://localhost:8000/health/detailed | jq '.components'

# Readiness check
curl http://localhost:8000/health/ready
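
For monitoring scripts that need more than curl and jq, the detailed payload can be inspected programmatically. The following is a sketch using only the standard library: the helper names and the 1000 ms threshold are assumptions, and `fetch_detailed` assumes that a 503 response still carries a JSON body, which this document does not confirm.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict, List


def fetch_detailed(base_url: str = "http://localhost:8000") -> Dict[str, Any]:
    """Fetch /health/detailed and decode the JSON body."""
    try:
        with urllib.request.urlopen(f"{base_url}/health/detailed", timeout=10) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumption: an unhealthy service answers 503 with a JSON body.
        return json.load(err)


def failing_components(payload: Dict[str, Any]) -> List[str]:
    """Names of components whose status is anything other than healthy."""
    components = payload.get("components", {})
    return sorted(n for n, c in components.items() if c.get("status") != "healthy")


def slow_components(payload: Dict[str, Any], threshold_ms: int = 1000) -> List[str]:
    """Names of components whose response time exceeds the threshold."""
    components = payload.get("components", {})
    return sorted(
        n for n, c in components.items()
        if c.get("response_time_ms", 0) > threshold_ms
    )
```

A cron job or exporter could call `fetch_detailed()` periodically and alert when either helper returns a non-empty list.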

Implementation Benefits

  1. Production Ready: Proper HTTP status codes and structured responses
  2. Container Orchestration: Kubernetes-compatible liveness and readiness probes
  3. Monitoring Integration: Detailed component status for observability
  4. Graceful Degradation: Distinguishes between critical and non-critical failures
  5. Performance Tracking: Response time metrics for each component
  6. Troubleshooting: Detailed error messages and component status

Error Handling

  • All health checks are wrapped in try/except blocks
  • Individual component failures don't crash the health check system
  • Detailed error messages are provided for troubleshooting
  • Timeouts and response times are tracked for performance monitoring
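
The error-handling points above can be illustrated with a small wrapper that isolates each check, enforces a timeout, and records the elapsed time. This is a sketch under assumptions: `run_check`, `run_all`, and the 5-second default timeout are hypothetical names and values, and the real component checkers may be structured differently.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict


async def run_check(
    name: str,
    check: Callable[[], Awaitable[str]],
    timeout_s: float = 5.0,
) -> Dict[str, object]:
    """Run one component check; never raise, always return a result dict."""
    start = time.perf_counter()
    try:
        details = await asyncio.wait_for(check(), timeout=timeout_s)
        status = "healthy"
    except asyncio.TimeoutError:
        status, details = "unhealthy", f"Timed out after {timeout_s}s"
    except Exception as exc:
        # Report only the exception type; the message may contain secrets.
        status, details = "unhealthy", f"Check failed: {type(exc).__name__}"
    return {
        "name": name,
        "status": status,
        "response_time_ms": int((time.perf_counter() - start) * 1000),
        "details": details,
    }


async def run_all(
    checks: Dict[str, Callable[[], Awaitable[str]]],
    timeout_s: float = 5.0,
) -> Dict[str, Dict[str, object]]:
    """Run all checks concurrently so one slow component does not delay the rest."""
    results = await asyncio.gather(
        *(run_check(name, check, timeout_s) for name, check in checks.items())
    )
    return {r["name"]: r for r in results}
```

Because `run_check` catches every exception itself, `asyncio.gather` never sees a failure, and a broken component shows up as an "unhealthy" entry instead of aborting the whole health report.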

Security Considerations

  • Health endpoints don't expose sensitive configuration details
  • Error messages are sanitized to prevent information leakage
  • No authentication required for basic health checks (standard practice)
  • Detailed endpoint can be restricted if needed via reverse proxy rules
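
One way to sanitize error messages before they reach a health response is to redact credentials that commonly appear in connection strings. The sketch below is illustrative only: the function name `sanitize_error` and both patterns are assumptions, and a real deployment would extend them to cover its own secret formats.

```python
import re

# user:password@ inside a URL, e.g. postgresql://admin:s3cret@db:5432/cognee
_URL_CREDENTIALS = re.compile(r"(\w+://)[^/\s:@]+:[^/\s@]+@")
# key=value pairs whose key suggests a secret
_SECRET_PAIRS = re.compile(r"(?i)\b(password|passwd|api_key|token)=([^\s&;]+)")


def sanitize_error(message: str, max_len: int = 200) -> str:
    """Redact likely credentials and cap the message length."""
    message = _URL_CREDENTIALS.sub(r"\1***:***@", message)
    message = _SECRET_PAIRS.sub(r"\1=***", message)
    return message[:max_len]
```

Pattern-based redaction is a safety net, not a guarantee; returning only the exception type for unexpected errors remains the more conservative option.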

Together, these endpoints give container orchestrators, load balancers, and monitoring systems a consistent view of the health of Cognee's backend components.