gmakstutis/cognee

Pavan Chilukuri 54e5be39e1 Add health checks

2025-08-02 09:42:25 -05:00

5.2 KiB

Health Check System Implementation Summary

What Was Implemented

1. Core Health Check Module (`cognee/api/health.py`)

HealthChecker class: Comprehensive health checking system
Pydantic models: Structured response models for health data
Component checkers: Individual health check methods for each backend service
Status determination logic: Proper classification of healthy/degraded/unhealthy states
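As a rough illustration of the structured response models, the sketch below mirrors the field names from the response format shown in this summary. It is not the code in `cognee/api/health.py`: the source uses Pydantic models, while this sketch swaps in standard-library dataclasses so it runs without dependencies, and the class names `HealthStatus`, `ComponentHealth`, and `HealthResponse` are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class HealthStatus(str, Enum):
    """The three states named in this summary."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    """Health of one backend component (hypothetical model name)."""
    status: HealthStatus
    provider: str
    response_time_ms: int
    details: str


@dataclass
class HealthResponse:
    """Top-level payload, shaped like the response format example."""
    status: HealthStatus
    timestamp: str
    version: str
    uptime: int
    components: dict[str, ComponentHealth] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses; the str-based enum
        # serializes as its plain string value.
        return json.dumps(asdict(self))
```

With Pydantic, the same shape would typically be declared as `BaseModel` subclasses, which adds validation on construction.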

2. Enhanced API Endpoints (`cognee/api/client.py`)

GET /health: Basic liveness probe (replaces the existing basic endpoint)
GET /health/ready: Kubernetes readiness probe
GET /health/detailed: Comprehensive health status with component details

3. Backend Component Health Checks

Critical Services (Failure = HTTP 503)

Relational Database: SQLite/PostgreSQL connectivity and session validation
Vector Database: LanceDB/Qdrant/PGVector/ChromaDB connectivity and index access
Graph Database: Kuzu/Neo4j/FalkorDB/Memgraph connectivity and schema validation
File Storage: Local filesystem/S3 accessibility and permissions

Non-Critical Services (Failure = Degraded Status)

LLM Provider: OpenAI/Ollama/Anthropic/Gemini configuration validation
Embedding Service: Embedding engine accessibility check
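The critical/non-critical split above can be sketched as a small aggregation function. The component keys come from the response format example in this summary; the function name `overall_status` and the exact rule (an unhealthy critical component makes the whole system unhealthy, any other non-healthy component makes it degraded) are assumptions about how the classification might be implemented.

```python
# Component keys as they appear in the response format example.
CRITICAL = {"relational_db", "vector_db", "graph_db", "file_storage"}


def overall_status(component_statuses: dict[str, str]) -> str:
    """Collapse per-component statuses into one overall status.

    Assumed rule: an unhealthy critical component wins; otherwise any
    component that is not healthy degrades the system.
    """
    if any(component_statuses.get(name) == "unhealthy" for name in CRITICAL):
        return "unhealthy"
    if any(status != "healthy" for status in component_statuses.values()):
        return "degraded"
    return "healthy"
```

Under this rule, a failed LLM provider alone yields `"degraded"`, while a failed graph database yields `"unhealthy"` regardless of the other components.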

Key Features

1. Production-Ready Design

Proper HTTP status codes (200 for healthy/degraded, 503 for unhealthy)
Structured JSON responses with detailed component information
Response time tracking for performance monitoring
Graceful error handling and detailed error messages
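The status-code rule stated above (200 for healthy or degraded, 503 for unhealthy) amounts to a one-line mapping. The helper name `http_status_for` is hypothetical, and rejecting unknown status strings is a choice made for this sketch.

```python
from http import HTTPStatus


def http_status_for(overall: str) -> int:
    """Map an overall health status to the HTTP code described above."""
    if overall in ("healthy", "degraded"):
        return int(HTTPStatus.OK)  # 200
    if overall == "unhealthy":
        return int(HTTPStatus.SERVICE_UNAVAILABLE)  # 503
    # Assumption: unknown statuses are a programming error, not a health state.
    raise ValueError(f"unknown health status: {overall!r}")
```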

2. Container Orchestration Support

Kubernetes-compatible liveness and readiness probes
Docker health check support
Proper startup and runtime health validation

3. Monitoring Integration

Detailed component status for observability platforms
Performance metrics (response times)
Version and uptime information
Structured logging for troubleshooting

4. Robust Error Handling

Individual component failures don't crash the health system
Detailed error messages for troubleshooting
Timeout handling and performance tracking
Graceful degradation for non-critical services
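One way to get the isolation, timeout, and timing behaviour listed above is to wrap every component check in a common runner. This is a minimal sketch under stated assumptions: `run_check` and `run_all` are hypothetical names, each check is assumed to be an async callable returning a details string, and the 5-second default timeout is illustrative.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_check(name: str, check: Callable[[], Awaitable[str]],
                    timeout_s: float = 5.0) -> dict:
    """Run one check; never raise, always report status and elapsed time."""
    start = time.perf_counter()
    try:
        details = await asyncio.wait_for(check(), timeout=timeout_s)
        status = "healthy"
    except asyncio.TimeoutError:
        status, details = "unhealthy", f"Timed out after {timeout_s}s"
    except Exception as exc:  # a failing component must not crash the endpoint
        status, details = "unhealthy", f"{type(exc).__name__}: {exc}"
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return {"name": name, "status": status,
            "response_time_ms": elapsed_ms, "details": details}


async def run_all(checks: dict[str, Callable[[], Awaitable[str]]],
                  timeout_s: float = 5.0) -> dict:
    """Run all checks concurrently and key the results by component name."""
    results = await asyncio.gather(
        *(run_check(name, check, timeout_s) for name, check in checks.items())
    )
    return {r.pop("name"): r for r in results}
```

Running the checks concurrently keeps the total latency close to that of the slowest component rather than the sum of all of them. Whether a failed component counts as degraded or unhealthy overall is decided separately, by the critical/non-critical classification.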

Response Format Example

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0-local",
  "uptime": 3600,
  "components": {
    "relational_db": {
      "status": "healthy",
      "provider": "sqlite",
      "response_time_ms": 45,
      "details": "Connection successful"
    },
    "vector_db": {
      "status": "healthy",
      "provider": "lancedb",
      "response_time_ms": 120,
      "details": "Index accessible"
    },
    "graph_db": {
      "status": "healthy",
      "provider": "kuzu",
      "response_time_ms": 89,
      "details": "Schema validated"
    },
    "file_storage": {
      "status": "healthy",
      "provider": "local",
      "response_time_ms": 156,
      "details": "Storage accessible"
    },
    "llm_provider": {
      "status": "healthy",
      "provider": "openai",
      "response_time_ms": 25,
      "details": "Configuration valid"
    },
    "embedding_service": {
      "status": "healthy",
      "provider": "configured",
      "response_time_ms": 30,
      "details": "Embedding engine accessible"
    }
  }
}
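A monitoring script could consume a payload of this shape as follows. The sketch only relies on the fields shown in the example above; the function name `summarize` and the choice of which facts to extract are illustrative.

```python
import json


def summarize(payload: str) -> dict:
    """Pull the overall status, failing components, and slowest component."""
    data = json.loads(payload)
    components = data.get("components", {})
    failing = sorted(
        name for name, comp in components.items()
        if comp.get("status") != "healthy"
    )
    slowest = max(
        components,
        key=lambda name: components[name].get("response_time_ms", 0),
        default=None,  # no components reported
    )
    return {"status": data.get("status"), "failing": failing, "slowest": slowest}
```

Applied to the example response, this reports no failing components and identifies `file_storage` (156 ms) as the slowest.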

Files Created/Modified

New Files

cognee/api/health.py - Core health check system
examples/health_check_example.py - Usage examples and monitoring script
HEALTH_CHECK_IMPLEMENTATION.md - Detailed documentation
HEALTH_CHECK_SUMMARY.md - This summary file

Modified Files

cognee/api/client.py - Enhanced with new health endpoints

Usage Examples

Basic Health Check

curl http://localhost:8000/health
# Returns: HTTP 200 (healthy/degraded) or 503 (unhealthy)

Readiness Check

curl http://localhost:8000/health/ready
# Returns: {"status": "ready"} or {"status": "not ready", "reason": "..."}

Detailed Health Status

curl http://localhost:8000/health/detailed
# Returns: Complete health status with component details
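The same endpoints can be called from Python with only the standard library. In this sketch the helper names `interpret` and `fetch_health` are hypothetical, and it assumes that a 503 response may still carry a JSON body worth reading. Connection failures (for example, the server not running) are deliberately left to propagate as `urllib.error.URLError`.

```python
import json
import urllib.error
import urllib.request


def interpret(status_code: int, body: bytes) -> dict:
    """Turn a raw HTTP result into a small summary dict."""
    try:
        payload = json.loads(body.decode("utf-8")) if body else {}
    except ValueError:  # covers invalid JSON and undecodable bytes
        payload = {}
    return {"ok": status_code == 200, "status_code": status_code, "payload": payload}


def fetch_health(url: str, timeout_s: float = 5.0) -> dict:
    """GET a health endpoint, treating 503 as a result rather than an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return interpret(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx codes such as 503; the body is still readable.
        return interpret(err.code, err.read())
```

For example, `fetch_health("http://localhost:8000/health/detailed")` would return the parsed component details when the server is reachable.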

Kubernetes Integration

livenessProbe:
  httpGet:
    path: /health
    port: 8000
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000

Benefits Achieved

Comprehensive Monitoring: All critical backend services are monitored
Production Ready: Proper HTTP status codes and error handling
Container Orchestration: Kubernetes and Docker compatibility
Observability: Detailed metrics and status information
Troubleshooting: Clear error messages and component status
Performance Tracking: Response time metrics for each component
Graceful Degradation: Distinguishes critical vs non-critical failures

Implementation Notes

Health checks are designed to be lightweight and fast
Critical service failures result in HTTP 503 (service unavailable)
Non-critical service failures result in degraded status but HTTP 200
All health checks include proper error handling and timeout management
The system is extensible for adding new backend components
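Extensibility of this kind is often achieved with a registry of check functions. The sketch below is one possible design, not the mechanism used in `cognee/api/health.py`: the `health_check` decorator, the `CHECKS` registry, and the `cache` component are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: component name -> check function and criticality.
CHECKS: Dict[str, dict] = {}


def health_check(name: str, critical: bool = False):
    """Register a function as the health check for a named component."""
    def decorator(fn: Callable[[], str]) -> Callable[[], str]:
        CHECKS[name] = {"fn": fn, "critical": critical}
        return fn
    return decorator


@health_check("cache", critical=False)  # illustrative component, not in the source
def check_cache() -> str:
    return "Cache reachable"


def run_registered() -> Dict[str, dict]:
    """Run every registered check, isolating failures per component."""
    results = {}
    for name, entry in CHECKS.items():
        try:
            status, details = "healthy", entry["fn"]()
        except Exception as exc:
            status, details = "unhealthy", str(exc)
        results[name] = {"status": status, "details": details,
                         "critical": entry["critical"]}
    return results
```

Adding a new backend then means writing one decorated function, with no change to the endpoint code.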

This implementation provides a health check system suited to production deployments, container orchestration, and monitoring: it reports per-component status, separates critical from non-critical failures, and exposes dedicated liveness and readiness endpoints.
