Pavan Chilukuri 54e5be39e1 Add health checks

2025-08-02 09:42:25 -05:00

Cognee Health Check System Implementation

Overview

This implementation adds a health check system to the Cognee API. It monitors every critical backend component and reports detailed health status for production deployments, container orchestration, and monitoring systems.

Implementation Files

1. `/cognee/api/health.py`

HealthChecker class: Main health checking logic
Health models: Pydantic models for structured responses
Component checkers: Individual health check methods for each service

2. `/cognee/api/client.py` (Updated)

Enhanced health endpoints: Three new endpoints replacing the basic health check
Proper HTTP status codes: Returns appropriate status codes based on health status

Health Check Endpoints

1. `GET /health` - Basic Liveness Probe

Purpose: Basic liveness check for container orchestration
Response: HTTP 200 (healthy/degraded) or 503 (unhealthy)
Use case: Kubernetes liveness probe, load balancer health checks

2. `GET /health/ready` - Readiness Probe

Purpose: Reports whether the service is ready to receive traffic
Response: JSON with ready/not ready status
Use case: Kubernetes readiness probe, deployment verification

3. `GET /health/detailed` - Comprehensive Health Status

Purpose: Detailed health information for monitoring and debugging
Response: Complete health status with component details
Use case: Monitoring dashboards, troubleshooting, operational visibility

Health Check Components

Critical Services (Failure = HTTP 503)

Relational Database (SQLite/PostgreSQL)
- Tests database connectivity and session creation
- Validates schema accessibility
Vector Database (LanceDB/Qdrant/PGVector/ChromaDB)
- Tests vector database connectivity
- Validates index accessibility
Graph Database (Kuzu/Neo4j/FalkorDB/Memgraph)
- Tests graph database connectivity
- Validates schema and basic operations
File Storage (Local/S3)
- Tests file system or S3 accessibility
- Validates read/write permissions
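
As an illustration of the read/write validation described for file storage, a local check could write a small probe file, read it back, and delete it. This is a minimal sketch, not the code in `health.py`: the function name `check_local_storage`, the probe file name, and the returned dictionary layout (modelled on the response format in this document) are assumptions.

```python
import time
import uuid
from pathlib import Path


def check_local_storage(root: str) -> dict:
    """Write, read back, and delete a small probe file under `root`.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    started = time.perf_counter()
    probe = Path(root) / f".health_probe_{uuid.uuid4().hex}"
    payload = b"health-probe"
    try:
        probe.write_bytes(payload)  # validates write permission
        if probe.read_bytes() != payload:  # validates read permission
            raise OSError("probe file content mismatch")
        status, details = "healthy", "Storage accessible"
    except OSError as exc:
        # Report only the exception type so paths do not leak into the response.
        status, details = "unhealthy", f"Storage check failed: {type(exc).__name__}"
    finally:
        try:
            probe.unlink()
        except OSError:
            pass  # nothing to clean up, or cleanup itself failed
    elapsed_ms = round((time.perf_counter() - started) * 1000)
    return {
        "status": status,
        "provider": "local",
        "response_time_ms": elapsed_ms,
        "details": details,
    }
```

An S3 check would need the same write/read/delete cycle against a bucket, which is outside the scope of this sketch.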

Non-Critical Services (Failure = Degraded Status)

LLM Provider (OpenAI/Ollama/Anthropic/Gemini)
- Validates configuration and API key presence
- Non-blocking for core functionality
Embedding Service
- Tests embedding engine accessibility
- Non-blocking for core functionality
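
The LLM provider check is described as validating configuration and API key presence rather than calling the provider. A minimal sketch of that idea follows. The environment variable names `LLM_PROVIDER` and `LLM_API_KEY`, the function name, and the choice to report a missing key as `degraded` at component level are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def check_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Check that an LLM API key is configured, without contacting the provider.

    Hypothetical helper; the variable names below are assumptions.
    """
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "unknown")
    api_key = env.get("LLM_API_KEY", "")
    if api_key.strip():
        status, details = "healthy", "Configuration valid"
    else:
        # Non-critical service: a missing key degrades the API, it does not fail it.
        status, details = "degraded", "API key not configured"
    # The key itself is never included in the result.
    return {"status": status, "provider": provider, "details": details}
```

Passing the environment as a parameter keeps the check easy to exercise in tests without touching the real process environment.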

Response Format

{
  "status": "healthy|degraded|unhealthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "uptime": 3600,
  "components": {
    "relational_db": {
      "status": "healthy",
      "provider": "sqlite",
      "response_time_ms": 45,
      "details": "Connection successful"
    },
    "vector_db": {
      "status": "healthy",
      "provider": "lancedb",
      "response_time_ms": 120,
      "details": "Index accessible"
    },
    "graph_db": {
      "status": "healthy",
      "provider": "kuzu",
      "response_time_ms": 89,
      "details": "Schema validated"
    },
    "file_storage": {
      "status": "healthy",
      "provider": "local",
      "response_time_ms": 156,
      "details": "Storage accessible"
    },
    "llm_provider": {
      "status": "healthy",
      "provider": "openai",
      "response_time_ms": 1250,
      "details": "Configuration valid"
    },
    "embedding_service": {
      "status": "healthy",
      "provider": "configured",
      "response_time_ms": 890,
      "details": "Embedding engine accessible"
    }
  }
}
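
The response shape above could be modelled with typed classes. The source describes Pydantic models; this sketch substitutes standard-library dataclasses so that it runs without third-party packages. The class names `ComponentHealth` and `HealthResponse` and the helper `utc_timestamp` are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, Literal

# The three status values shown in the response format.
Status = Literal["healthy", "degraded", "unhealthy"]


@dataclass
class ComponentHealth:
    """One entry of the "components" object (hypothetical class name)."""
    status: Status
    provider: str
    response_time_ms: int
    details: str


@dataclass
class HealthResponse:
    """Top-level health payload (hypothetical class name)."""
    status: Status
    timestamp: str
    version: str
    uptime: int
    components: Dict[str, ComponentHealth] = field(default_factory=dict)


def utc_timestamp() -> str:
    """Format the current UTC time like the example, e.g. 2024-01-15T10:30:45Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

`dataclasses.asdict(response)` turns a `HealthResponse`, including its nested components, into a plain dictionary that `json.dumps` can serialize.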

Health Status Logic

Overall Status Determination

UNHEALTHY: Any critical service is unhealthy
DEGRADED: All critical services healthy, but non-critical services have issues
HEALTHY: All services are functioning properly
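
These three rules can be sketched as a small function over per-component statuses. The component keys come from the response format in this document; the function name and the handling of a critical component that reports `degraded` (treated here as an overall `degraded`) are assumptions.

```python
from typing import Mapping

# Components listed as critical in this document.
CRITICAL_COMPONENTS = frozenset(
    {"relational_db", "vector_db", "graph_db", "file_storage"}
)


def overall_status(component_status: Mapping[str, str]) -> str:
    """Derive the overall status from a mapping of component name to status."""
    # Rule 1: any critical service that is unhealthy makes the whole API unhealthy.
    for name, status in component_status.items():
        if name in CRITICAL_COMPONENTS and status == "unhealthy":
            return "unhealthy"
    # Rule 2: any remaining issue only degrades the API.
    if any(status != "healthy" for status in component_status.values()):
        return "degraded"
    # Rule 3: everything is functioning.
    return "healthy"
```

Checking the critical set first means a failed LLM provider can never mask a failed database.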

HTTP Status Codes

200: Healthy or degraded (service operational)
503: Unhealthy (service not ready/available)
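
The mapping from overall status to HTTP status code is small enough to express directly. In this sketch the function name is hypothetical, and returning 503 for an unrecognized status is an assumption (failing closed), not something the source specifies.

```python
from http import HTTPStatus


def http_status_for(overall: str) -> int:
    """Map an overall health status to the HTTP status code described above."""
    if overall in ("healthy", "degraded"):
        return int(HTTPStatus.OK)  # 200: service operational
    # "unhealthy", or any unexpected value, is reported as unavailable.
    return int(HTTPStatus.SERVICE_UNAVAILABLE)  # 503
```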

Usage Examples

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cognee-api
spec:
  template:
    spec:
      containers:
      - name: cognee
        image: cognee:latest
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5

Docker Compose Health Check

version: '3.8'
services:
  cognee-api:
    image: cognee:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
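
The `curl`-based test assumes `curl` is installed in the image. Where it is not, a probe written against the Python standard library could serve the same purpose. The sketch below is an illustration under that assumption; the function name `probe` is hypothetical.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> int:
    """Return 0 if `url` answers with a 2xx status, otherwise 1.

    The return value is meant to be used as a process exit code.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, OSError, ValueError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses
        # such as 503; OSError covers timeouts and refused connections;
        # ValueError covers malformed URLs.
        return 1
```

A small script that calls `sys.exit(probe("http://localhost:8000/health"))` would behave like `curl -f` for health check purposes: exit code 0 on success, non-zero otherwise.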

Monitoring Integration

# Basic health check
curl http://localhost:8000/health

# Detailed health status for monitoring
curl http://localhost:8000/health/detailed | jq '.components'

# Readiness check
curl http://localhost:8000/health/ready
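
For alerting, a monitoring script might reduce the detailed response to the list of components that need attention. The following sketch assumes the payload follows the response format in this document; the function name `failing_components` is hypothetical.

```python
import json
from typing import List


def failing_components(detailed_json: str) -> List[str]:
    """Return the sorted names of components whose status is not "healthy"."""
    report = json.loads(detailed_json)
    components = report.get("components", {})
    return sorted(
        name
        for name, component in components.items()
        if component.get("status") != "healthy"
    )
```

The input would be the body returned by `GET /health/detailed`, for example the output of the `curl` command above.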

Implementation Benefits

Production Ready: Proper HTTP status codes and structured responses
Container Orchestration: Kubernetes-compatible liveness and readiness probes
Monitoring Integration: Detailed component status for observability
Graceful Degradation: Distinguishes between critical and non-critical failures
Performance Tracking: Response time metrics for each component
Troubleshooting: Detailed error messages and component status

Error Handling

All health checks are wrapped in try/except blocks
Individual component failures don't crash the health check system
Detailed error messages are provided for troubleshooting
Timeouts and response times are tracked for performance monitoring
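
The error-handling behaviour listed here can be sketched as a wrapper that isolates each check, enforces a timeout, and records the elapsed time. This is an assumption-laden illustration rather than the code in `health.py`: the names `run_check` and `run_all`, the 5-second default timeout, and running the checks concurrently are all hypothetical choices.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict


async def run_check(
    name: str, check: Callable[[], Awaitable[str]], timeout_s: float = 5.0
) -> dict:
    """Run one check; convert exceptions and timeouts into an unhealthy result."""
    started = time.perf_counter()
    try:
        details = await asyncio.wait_for(check(), timeout=timeout_s)
        status = "healthy"
    except asyncio.TimeoutError:
        status, details = "unhealthy", f"Check timed out after {timeout_s}s"
    except Exception as exc:
        # Report only the exception type so messages cannot leak secrets.
        status, details = "unhealthy", f"Check failed: {type(exc).__name__}"
    elapsed_ms = round((time.perf_counter() - started) * 1000)
    return {
        "name": name,
        "status": status,
        "response_time_ms": elapsed_ms,
        "details": details,
    }


async def run_all(
    checks: Dict[str, Callable[[], Awaitable[str]]], timeout_s: float = 5.0
) -> Dict[str, dict]:
    """Run every check concurrently; one failing check never affects the others."""
    results = await asyncio.gather(
        *(run_check(name, check, timeout_s) for name, check in checks.items())
    )
    return {result.pop("name"): result for result in results}
```

Because `run_check` never raises, `asyncio.gather` always returns a result for every component, which matches the requirement that individual failures do not crash the health check system.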

Security Considerations

Health endpoints don't expose sensitive configuration details
Error messages are sanitized to prevent information leakage
No authentication is required for the basic health checks (standard practice)
The detailed endpoint can be restricted via reverse proxy rules if needed

Together, these endpoints give container orchestrators and monitoring systems a consistent view of Cognee's critical and non-critical dependencies.
