Cognee Health Check System Implementation
Overview
This implementation adds a health check system to the Cognee API. It monitors the critical backend components and reports detailed health status for production deployments, container orchestration, and monitoring systems.
Implementation Files
1. /cognee/api/health.py
- HealthChecker class: Main health checking logic
- Health models: Pydantic models for structured responses
- Component checkers: Individual health check methods for each service
2. /cognee/api/client.py (Updated)
- Enhanced health endpoints: Three new endpoints replacing the basic health check
- Proper HTTP status codes: Returns appropriate status codes based on health status
Health Check Endpoints
1. GET /health - Basic Liveness Probe
- Purpose: Basic liveness check for container orchestration
- Response: HTTP 200 (healthy/degraded) or 503 (unhealthy)
- Use case: Kubernetes liveness probe, load balancer health checks
2. GET /health/ready - Readiness Probe
- Purpose: Reports whether the service is ready to accept traffic
- Response: JSON with ready/not ready status
- Use case: Kubernetes readiness probe, deployment verification
3. GET /health/detailed - Comprehensive Health Status
- Purpose: Detailed health information for monitoring and debugging
- Response: Complete health status with component details
- Use case: Monitoring dashboards, troubleshooting, operational visibility
Health Check Components
Critical Services (Failure = HTTP 503)
- Relational Database (SQLite/PostgreSQL)
  - Tests database connectivity and session creation
  - Validates schema accessibility
- Vector Database (LanceDB/Qdrant/PGVector/ChromaDB)
  - Tests vector database connectivity
  - Validates index accessibility
- Graph Database (Kuzu/Neo4j/FalkorDB/Memgraph)
  - Tests graph database connectivity
  - Validates schema and basic operations
- File Storage (Local/S3)
  - Tests file system or S3 accessibility
  - Validates read/write permissions
Non-Critical Services (Failure = Degraded Status)
- LLM Provider (OpenAI/Ollama/Anthropic/Gemini)
  - Validates configuration and API key presence
  - Non-blocking for core functionality
- Embedding Service
  - Tests embedding engine accessibility
  - Non-blocking for core functionality
Response Format
{
  "status": "healthy|degraded|unhealthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "uptime": 3600,
  "components": {
    "relational_db": {
      "status": "healthy",
      "provider": "sqlite",
      "response_time_ms": 45,
      "details": "Connection successful"
    },
    "vector_db": {
      "status": "healthy",
      "provider": "lancedb",
      "response_time_ms": 120,
      "details": "Index accessible"
    },
    "graph_db": {
      "status": "healthy",
      "provider": "kuzu",
      "response_time_ms": 89,
      "details": "Schema validated"
    },
    "file_storage": {
      "status": "healthy",
      "provider": "local",
      "response_time_ms": 156,
      "details": "Storage accessible"
    },
    "llm_provider": {
      "status": "healthy",
      "provider": "openai",
      "response_time_ms": 1250,
      "details": "Configuration valid"
    },
    "embedding_service": {
      "status": "healthy",
      "provider": "configured",
      "response_time_ms": 890,
      "details": "Embedding engine accessible"
    }
  }
}
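The response shape above could be modeled roughly as follows. This is a sketch, not the code in health.py: the source says the real models are Pydantic models, while this version uses standard-library dataclasses so it runs without dependencies. The class names `HealthStatus`, `ComponentHealth`, and `HealthReport` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict


class HealthStatus(str, Enum):
    """The three status values used in the response format."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    """Status of one backend component (hypothetical model name)."""
    status: HealthStatus
    provider: str
    response_time_ms: int
    details: str


@dataclass
class HealthReport:
    """Top-level health response (hypothetical model name)."""
    status: HealthStatus
    timestamp: str
    version: str
    uptime: int
    components: Dict[str, ComponentHealth] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() recurses into the nested component dataclasses;
        # the str-based enum serializes as its plain string value.
        return json.dumps(asdict(self))
```

Building a report with one component and calling `to_json()` yields a document with the same keys as the example response.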
Health Status Logic
Overall Status Determination
- UNHEALTHY: Any critical service is unhealthy
- DEGRADED: All critical services healthy, but non-critical services have issues
- HEALTHY: All services are functioning properly
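The three rules above can be sketched as a single function. This is an illustration under assumptions: the function name `overall_status` and the `CRITICAL` set are hypothetical, and the component keys are taken from the example response. The source does not say how a critical service reporting "degraded" should be treated; this sketch counts it as degraded overall.

```python
from typing import Dict

# Component keys as they appear in the example response (assumed to be stable).
CRITICAL = {"relational_db", "vector_db", "graph_db", "file_storage"}


def overall_status(component_status: Dict[str, str]) -> str:
    """Collapse per-component statuses into one overall status."""
    critical_down = any(
        status == "unhealthy"
        for name, status in component_status.items()
        if name in CRITICAL
    )
    if critical_down:
        return "unhealthy"
    # Anything else that is not fully healthy only degrades the service.
    if any(status != "healthy" for status in component_status.values()):
        return "degraded"
    return "healthy"
```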
HTTP Status Codes
- 200: Healthy or degraded (service operational)
- 503: Unhealthy (service not ready/available)
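A minimal sketch of this mapping, assuming the overall status arrives as one of the three strings from the response format. The function name `http_status_for` is hypothetical.

```python
from http import HTTPStatus


def http_status_for(overall: str) -> int:
    """Return 503 only when the overall status is 'unhealthy'."""
    if overall == "unhealthy":
        return int(HTTPStatus.SERVICE_UNAVAILABLE)
    # Both 'healthy' and 'degraded' mean the service is still operational.
    return int(HTTPStatus.OK)
```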
Usage Examples
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cognee-api
spec:
  template:
    spec:
      containers:
        - name: cognee
          image: cognee:latest
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
Docker Compose Health Check
version: '3.8'
services:
  cognee-api:
    image: cognee:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
Monitoring Integration
# Basic health check
curl http://localhost:8000/health
# Detailed health status for monitoring
curl -s http://localhost:8000/health/detailed | jq '.components'
# Readiness check
curl http://localhost:8000/health/ready
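For a monitoring script rather than a shell one-liner, the same check might look like the sketch below. The helper names `failing_components` and `fetch_detailed` are hypothetical, and the code assumes the detailed endpoint returns the JSON shape shown under Response Format, including when it answers with 503.

```python
import json
import urllib.error
import urllib.request
from typing import List


def failing_components(payload: dict) -> List[str]:
    """Return the names of components whose status is not 'healthy'."""
    components = payload.get("components", {})
    return sorted(
        name
        for name, info in components.items()
        if info.get("status") != "healthy"
    )


def fetch_detailed(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> dict:
    """Fetch and decode the detailed health report."""
    url = f"{base_url}/health/detailed"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # urlopen raises on 503; assume the body still holds the JSON report.
        return json.load(err)
```

A scheduled job could call `failing_components(fetch_detailed())` and alert when the returned list is non-empty.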
Implementation Benefits
- Production Ready: Proper HTTP status codes and structured responses
- Container Orchestration: Kubernetes-compatible liveness and readiness probes
- Monitoring Integration: Detailed component status for observability
- Graceful Degradation: Distinguishes between critical and non-critical failures
- Performance Tracking: Response time metrics for each component
- Troubleshooting: Detailed error messages and component status
Error Handling
- All health checks are wrapped in try/except blocks
- Individual component failures don't crash the health check system
- Detailed error messages are provided for troubleshooting
- Timeouts and response times are tracked for performance monitoring
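The pattern described above can be sketched as one wrapper that every component check goes through. This is an assumption-laden illustration, not the `HealthChecker` code: the function name `run_check`, its `timeout_s` parameter, and the async check signature are all hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict


async def run_check(
    name: str,
    check: Callable[[], Awaitable[str]],
    timeout_s: float = 5.0,
) -> Dict[str, object]:
    """Run one component check, capturing failures and elapsed time."""
    started = time.perf_counter()
    try:
        # The check returns a short human-readable detail string on success.
        details = await asyncio.wait_for(check(), timeout=timeout_s)
        status = "healthy"
    except asyncio.TimeoutError:
        status = "unhealthy"
        details = f"{name} check timed out after {timeout_s}s"
    except Exception as exc:
        # A failing component must not crash the health check itself.
        # Only the exception type is reported, not its message.
        status = "unhealthy"
        details = f"{name} check failed: {type(exc).__name__}"
    elapsed_ms = int((time.perf_counter() - started) * 1000)
    return {"status": status, "response_time_ms": elapsed_ms, "details": details}
```

Reporting only the exception type is a deliberately conservative choice in this sketch; a real implementation might include more detail after sanitizing it.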
Security Considerations
- Health endpoints don't expose sensitive configuration details
- Error messages are sanitized to prevent information leakage
- No authentication required for basic health checks (standard practice)
- Detailed endpoint can be restricted if needed via reverse proxy rules
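One way error messages might be sanitized before they reach the `details` field is sketched below. The source does not describe the actual sanitization rules, so the function name `sanitize_error` and both patterns are assumptions chosen for illustration. They mask credentials embedded in connection URLs and values that follow a few sensitive key names.

```python
import re

# Matches "user:password@" directly after "://" in a connection URL.
_URL_CREDENTIALS = re.compile(r"(?<=://)([^:/@\s]+):([^@\s]+)@")

# Matches "key=value" or "key: value" for a few sensitive key names.
_SECRET_PAIRS = re.compile(
    r"\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def sanitize_error(message: str) -> str:
    """Mask credentials in an error message before exposing it."""
    # Keep the user name, replace the password part of the URL.
    message = _URL_CREDENTIALS.sub(r"\1:***@", message)
    # Keep the key name and separator, replace the value.
    return _SECRET_PAIRS.sub(r"\1\2***", message)
```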
Together, these endpoints give the Cognee API liveness, readiness, and detailed component reporting suited to monitoring, observability, and container orchestration in production.