cognee/HEALTH_CHECK_IMPLEMENTATION.md

# Cognee Health Check System Implementation

## Overview

This implementation provides a comprehensive health check system for the Cognee API that monitors all critical backend components and provides detailed health status information for production deployments, container orchestration, and monitoring systems.

## Implementation Files

### 1. `/cognee/api/health.py`
- **HealthChecker class**: Main health checking logic
- **Health models**: Pydantic models for structured responses
- **Component checkers**: Individual health check methods for each service

### 2. `/cognee/api/client.py` (Updated)
- **Enhanced health endpoints**: Three new endpoints replacing the basic health check
- **Proper HTTP status codes**: Returns appropriate status codes based on health status

## Health Check Endpoints

### 1. `GET /health` - Basic Liveness Probe
- **Purpose**: Basic liveness check for container orchestration
- **Response**: HTTP 200 (healthy/degraded) or 503 (unhealthy)
- **Use case**: Kubernetes liveness probe, load balancer health checks

### 2. `GET /health/ready` - Readiness Probe
- **Purpose**: Kubernetes readiness probe
- **Response**: JSON with ready/not ready status
- **Use case**: Kubernetes readiness probe, deployment verification

### 3. `GET /health/detailed` - Comprehensive Health Status
- **Purpose**: Detailed health information for monitoring and debugging
- **Response**: Complete health status with component details
- **Use case**: Monitoring dashboards, troubleshooting, operational visibility

## Health Check Components

### Critical Services (Failure = HTTP 503)
1. **Relational Database** (SQLite/PostgreSQL)
   - Tests database connectivity and session creation
   - Validates schema accessibility

2. **Vector Database** (LanceDB/Qdrant/PGVector/ChromaDB)
   - Tests vector database connectivity
   - Validates index accessibility

3. **Graph Database** (Kuzu/Neo4j/FalkorDB/Memgraph)
   - Tests graph database connectivity
   - Validates schema and basic operations

4. **File Storage** (Local/S3)
   - Tests file system or S3 accessibility
   - Validates read/write permissions

### Non-Critical Services (Failure = Degraded Status)
1. **LLM Provider** (OpenAI/Ollama/Anthropic/Gemini)
   - Validates configuration and API key presence
   - Non-blocking for core functionality

2. **Embedding Service**
   - Tests embedding engine accessibility
   - Non-blocking for core functionality

## Response Format

```json
{
  "status": "healthy|degraded|unhealthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "uptime": 3600,
  "components": {
    "relational_db": {
      "status": "healthy",
      "provider": "sqlite",
      "response_time_ms": 45,
      "details": "Connection successful"
    },
    "vector_db": {
      "status": "healthy",
      "provider": "lancedb",
      "response_time_ms": 120,
      "details": "Index accessible"
    },
    "graph_db": {
      "status": "healthy",
      "provider": "kuzu",
      "response_time_ms": 89,
      "details": "Schema validated"
    },
    "file_storage": {
      "status": "healthy",
      "provider": "local",
      "response_time_ms": 156,
      "details": "Storage accessible"
    },
    "llm_provider": {
      "status": "healthy",
      "provider": "openai",
      "response_time_ms": 1250,
      "details": "Configuration valid"
    },
    "embedding_service": {
      "status": "healthy",
      "provider": "configured",
      "response_time_ms": 890,
      "details": "Embedding engine accessible"
    }
  }
}
```

## Health Status Logic

### Overall Status Determination
- **UNHEALTHY**: Any critical service is unhealthy
- **DEGRADED**: All critical services healthy, but non-critical services have issues
- **HEALTHY**: All services are functioning properly

### HTTP Status Codes
- **200**: Healthy or degraded (service operational)
- **503**: Unhealthy (service not ready/available)

## Usage Examples

### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cognee-api
spec:
  template:
    spec:
      containers:
      - name: cognee
        image: cognee:latest
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
```

### Docker Compose Health Check
```yaml
version: '3.8'
services:
  cognee-api:
    image: cognee:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

### Monitoring Integration
```bash
# Basic health check
curl http://localhost:8000/health

# Detailed health status for monitoring
curl http://localhost:8000/health/detailed | jq '.components'

# Readiness check
curl http://localhost:8000/health/ready
```

## Implementation Benefits

1. **Production Ready**: Proper HTTP status codes and structured responses
2. **Container Orchestration**: Kubernetes-compatible liveness and readiness probes
3. **Monitoring Integration**: Detailed component status for observability
4. **Graceful Degradation**: Distinguishes between critical and non-critical failures
5. **Performance Tracking**: Response time metrics for each component
6. **Troubleshooting**: Detailed error messages and component status

## Error Handling

- All health checks are wrapped in try-catch blocks
- Individual component failures don't crash the health check system
- Detailed error messages are provided for troubleshooting
- Timeouts and response times are tracked for performance monitoring

## Security Considerations

- Health endpoints don't expose sensitive configuration details
- Error messages are sanitized to prevent information leakage
- No authentication required for basic health checks (standard practice)
- Detailed endpoint can be restricted if needed via reverse proxy rules

This implementation provides a robust, production-ready health check system that meets enterprise requirements for monitoring, observability, and container orchestration.