# Cognee Health Check System Implementation

## Overview

This implementation adds a health check system to the Cognee API. It monitors all critical backend components and reports detailed health status for production deployments, container orchestration, and monitoring systems.

## Implementation Files

### 1. `/cognee/api/health.py`
- **HealthChecker class**: Main health checking logic
- **Health models**: Pydantic models for structured responses
- **Component checkers**: Individual health check methods for each service

### 2. `/cognee/api/client.py` (Updated)
- **Enhanced health endpoints**: Three new endpoints replacing the basic health check
- **Proper HTTP status codes**: Returns appropriate status codes based on health status

## Health Check Endpoints

### 1. `GET /health` - Basic Liveness Probe
- **Purpose**: Basic liveness check for container orchestration
- **Response**: HTTP 200 (healthy/degraded) or 503 (unhealthy)
- **Use case**: Kubernetes liveness probe, load balancer health checks

### 2. `GET /health/ready` - Readiness Probe
- **Purpose**: Reports whether the service is ready to accept traffic
- **Response**: JSON with ready/not ready status
- **Use case**: Kubernetes readiness probe, deployment verification

### 3. `GET /health/detailed` - Comprehensive Health Status
- **Purpose**: Detailed health information for monitoring and debugging
- **Response**: Complete health status with component details
- **Use case**: Monitoring dashboards, troubleshooting, operational visibility

## Health Check Components

### Critical Services (Failure = HTTP 503)
1. **Relational Database** (SQLite/PostgreSQL)
   - Tests database connectivity and session creation
   - Validates schema accessibility

2. **Vector Database** (LanceDB/Qdrant/PGVector/ChromaDB)
   - Tests vector database connectivity
   - Validates index accessibility

3. **Graph Database** (Kuzu/Neo4j/FalkorDB/Memgraph)
   - Tests graph database connectivity
   - Validates schema and basic operations

4. **File Storage** (Local/S3)
   - Tests file system or S3 accessibility
   - Validates read/write permissions

### Non-Critical Services (Failure = Degraded Status)
1. **LLM Provider** (OpenAI/Ollama/Anthropic/Gemini)
   - Validates configuration and API key presence
   - Non-blocking for core functionality

2. **Embedding Service**
   - Tests embedding engine accessibility
   - Non-blocking for core functionality

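As a rough illustration of what a single component checker might look like, the sketch below probes local file storage by writing, reading back, and deleting a temporary file, and records how long that took. The function name `check_local_storage` is hypothetical and the returned keys simply mirror the response format documented here; this is not the actual `HealthChecker` code.

```python
import os
import tempfile
import time


def check_local_storage(directory: str) -> dict:
    """Hypothetical checker: verify read/write access to a local directory."""
    started = time.perf_counter()
    try:
        # Create a small probe file, read it back, then always clean it up.
        fd, path = tempfile.mkstemp(prefix="healthcheck_", dir=directory)
        try:
            with os.fdopen(fd, "w+") as probe:
                probe.write("ok")
                probe.seek(0)
                readable = probe.read() == "ok"
        finally:
            os.remove(path)
        status = "healthy" if readable else "unhealthy"
        details = "Storage accessible" if readable else "Read-back mismatch"
    except OSError as error:
        # Report only the error type so that paths are not leaked.
        status = "unhealthy"
        details = f"Storage check failed: {type(error).__name__}"
    elapsed_ms = int((time.perf_counter() - started) * 1000)
    return {
        "status": status,
        "provider": "local",
        "response_time_ms": elapsed_ms,
        "details": details,
    }
```

An S3-backed checker would follow the same shape, with the write/read/delete steps replaced by the storage client's equivalents.
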
## Response Format

```json
{
  "status": "healthy|degraded|unhealthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "uptime": 3600,
  "components": {
    "relational_db": {
      "status": "healthy",
      "provider": "sqlite",
      "response_time_ms": 45,
      "details": "Connection successful"
    },
    "vector_db": {
      "status": "healthy",
      "provider": "lancedb",
      "response_time_ms": 120,
      "details": "Index accessible"
    },
    "graph_db": {
      "status": "healthy",
      "provider": "kuzu",
      "response_time_ms": 89,
      "details": "Schema validated"
    },
    "file_storage": {
      "status": "healthy",
      "provider": "local",
      "response_time_ms": 156,
      "details": "Storage accessible"
    },
    "llm_provider": {
      "status": "healthy",
      "provider": "openai",
      "response_time_ms": 1250,
      "details": "Configuration valid"
    },
    "embedding_service": {
      "status": "healthy",
      "provider": "configured",
      "response_time_ms": 890,
      "details": "Embedding engine accessible"
    }
  }
}
```

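The shape of this response can be sketched as typed models. The real implementation uses Pydantic models according to the file list above; the example below substitutes standard-library dataclasses so it runs without dependencies, and the class names `HealthStatus`, `ComponentHealth`, and `HealthResponse` are assumptions rather than the names used in `health.py`.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict


class HealthStatus(str, Enum):
    """The three status values that appear in the response."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    status: HealthStatus
    provider: str
    response_time_ms: int
    details: str


@dataclass
class HealthResponse:
    status: HealthStatus
    timestamp: str  # ISO 8601, UTC
    version: str
    uptime: int  # seconds since process start
    components: Dict[str, ComponentHealth] = field(default_factory=dict)


# Build a response with one component and serialize it to JSON.
response = HealthResponse(
    status=HealthStatus.HEALTHY,
    timestamp="2024-01-15T10:30:45Z",
    version="1.0.0",
    uptime=3600,
    components={
        "relational_db": ComponentHealth(
            HealthStatus.HEALTHY, "sqlite", 45, "Connection successful"
        ),
    },
)
payload = json.dumps(asdict(response))
```

Because `HealthStatus` also subclasses `str`, its members serialize as plain strings such as `"healthy"`.
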
## Health Status Logic

### Overall Status Determination
- **UNHEALTHY**: Any critical service is unhealthy
- **DEGRADED**: All critical services healthy, but non-critical services have issues
- **HEALTHY**: All services are functioning properly

### HTTP Status Codes
- **200**: Healthy or degraded (service operational)
- **503**: Unhealthy (service not ready/available)

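The rules above can be sketched as two small functions. This is a minimal illustration, not the project's code: the function names and the `CRITICAL_COMPONENTS` set are assumptions (the component keys are taken from the response format), and treating any non-healthy component as a cause for "degraded" is one possible reading of "have issues".

```python
from typing import Dict

# Assumed keys for the four critical services listed in this document.
CRITICAL_COMPONENTS = {"relational_db", "vector_db", "graph_db", "file_storage"}


def overall_status(component_statuses: Dict[str, str]) -> str:
    """Combine per-component statuses into a single overall status."""
    # Any unhealthy critical service makes the whole system unhealthy.
    for name, status in component_statuses.items():
        if name in CRITICAL_COMPONENTS and status == "unhealthy":
            return "unhealthy"
    # Otherwise, any component that is not fully healthy means degraded.
    if any(status != "healthy" for status in component_statuses.values()):
        return "degraded"
    return "healthy"


def http_status_code(status: str) -> int:
    """Healthy and degraded both map to 200; only unhealthy maps to 503."""
    return 503 if status == "unhealthy" else 200
```

For example, a failed `llm_provider` alongside healthy databases yields `"degraded"` and HTTP 200, while a failed `graph_db` yields `"unhealthy"` and HTTP 503.
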
## Usage Examples

### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cognee-api
spec:
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value used here is illustrative.
  selector:
    matchLabels:
      app: cognee-api
  template:
    metadata:
      labels:
        app: cognee-api
    spec:
      containers:
      - name: cognee
        image: cognee:latest
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
```

### Docker Compose Health Check
```yaml
version: '3.8'
services:
  cognee-api:
    image: cognee:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

### Monitoring Integration
```bash
# Basic health check
curl http://localhost:8000/health

# Detailed health status for monitoring (-s hides curl's progress output when piping)
curl -s http://localhost:8000/health/detailed | jq '.components'

# Readiness check
curl http://localhost:8000/health/ready
```

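For monitoring scripts that need more than `jq`, the same check can be sketched in Python using only the standard library. The helper names below are hypothetical, and the assumption that a 503 response still carries the JSON body is not confirmed by this document.

```python
import json
import urllib.error
import urllib.request
from typing import List


def unhealthy_components(payload: dict) -> List[str]:
    """Return the names of components whose status is not 'healthy'."""
    components = payload.get("components", {})
    return sorted(
        name for name, info in components.items()
        if info.get("status") != "healthy"
    )


def fetch_detailed_health(base_url: str = "http://localhost:8000",
                          timeout: float = 5.0) -> dict:
    """Fetch /health/detailed and decode the JSON body."""
    url = f"{base_url}/health/detailed"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        # Assumption: an unhealthy system answers 503 with the same JSON body.
        if error.code == 503:
            return json.load(error)
        raise


if __name__ == "__main__":
    failing = unhealthy_components(fetch_detailed_health())
    print("failing components:", ", ".join(failing) or "none")
```

Keeping the parsing in `unhealthy_components` separate from the network call makes the alerting logic easy to test against recorded responses.
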
## Implementation Benefits

1. **Production Ready**: Proper HTTP status codes and structured responses
2. **Container Orchestration**: Kubernetes-compatible liveness and readiness probes
3. **Monitoring Integration**: Detailed component status for observability
4. **Graceful Degradation**: Distinguishes between critical and non-critical failures
5. **Performance Tracking**: Response time metrics for each component
6. **Troubleshooting**: Detailed error messages and component status

## Error Handling

- All health checks are wrapped in try/except blocks
- Individual component failures don't crash the health check system
- Detailed error messages are provided for troubleshooting
- Timeouts and response times are tracked for performance monitoring

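One way to get this behaviour is a single wrapper that every component check runs through. The sketch below is an assumption about how that could be done with `asyncio`, not the actual `HealthChecker` code; `run_check`, the 5-second default timeout, and the sample checks are all hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_check(name: str,
                    check: Callable[[], Awaitable[str]],
                    timeout_s: float = 5.0) -> dict:
    """Run one component check; never raise, always return a result dict."""
    started = time.perf_counter()
    try:
        details = await asyncio.wait_for(check(), timeout=timeout_s)
        status = "healthy"
    except asyncio.TimeoutError:
        status, details = "unhealthy", f"Timed out after {timeout_s}s"
    except Exception as error:
        # A failing component is reported, not propagated.
        status, details = "unhealthy", f"Check failed: {type(error).__name__}"
    return {
        "name": name,
        "status": status,
        "response_time_ms": int((time.perf_counter() - started) * 1000),
        "details": details,
    }


# Hypothetical checks used only to exercise the wrapper.
async def sample_ok() -> str:
    return "Connection successful"


async def sample_failure() -> str:
    raise ConnectionError("database unreachable")


async def sample_slow() -> str:
    await asyncio.sleep(1)
    return "too late"
```

Running `asyncio.run(run_check("relational_db", sample_failure))` returns an unhealthy result instead of raising, which is what lets one broken component coexist with healthy ones in the same response.
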
## Security Considerations

- Health endpoints don't expose sensitive configuration details
- Error messages are sanitized to prevent information leakage
- No authentication required for basic health checks (standard practice)
- Detailed endpoint can be restricted if needed via reverse proxy rules

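The document does not say how error messages are sanitized. As one possible approach, the sketch below masks credentials embedded in connection URLs and `key=value` style secrets before a message is returned; the function name `sanitize_error`, the regular expressions, and the length limit are assumptions for illustration only.

```python
import re

# user:password pairs inside URLs such as postgresql://user:pass@host/db
_URL_CREDENTIALS = re.compile(r"(?<=://)[^/\s:@]+:[^/\s@]+(?=@)")
# key=value or key: value pairs whose key looks like a secret
_SECRET_PAIRS = re.compile(
    r"(?i)\b(api[_-]?key|password|token|secret)\s*[=:]\s*\S+"
)


def sanitize_error(message: str, max_length: int = 200) -> str:
    """Mask obvious secrets and truncate the message."""
    message = _URL_CREDENTIALS.sub("***:***", message)
    message = _SECRET_PAIRS.sub(lambda m: f"{m.group(1)}=***", message)
    return message[:max_length]
```

Pattern-based masking like this is best-effort; returning only the exception type in `details` is a stricter alternative.
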
This implementation provides a robust, production-ready health check system that meets enterprise requirements for monitoring, observability, and container orchestration.