# ADR 005: Security Analysis and Mitigation Strategies

## Status: Proposed

## Overview

This document identifies security considerations, potential vulnerabilities, and mitigation strategies for the multi-tenant architecture.

## Security Principles

### Zero Trust Model

Every request is treated as potentially untrusted:

- All tenant/KB context must be explicitly verified
- No implicit assumptions about user access
- Cross-tenant data access denied by default

### Defense in Depth

Multiple layers of security:

1. Authentication (identity verification)
2. Authorization (permission checking)
3. Data isolation (storage-layer filtering)
4. Audit logging (forensic capability)
5. Rate limiting (abuse prevention)

### Complete Mediation

All data access is controlled through the API layer; direct storage access is never permitted.
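A minimal sketch of this principle, assuming a `TenantContext` produced by the API layer's auth dependency and an illustrative `MediatedStorage` wrapper (neither name is taken from the existing codebase): route handlers receive only the wrapper, never a raw connection, so every read and write passes through one choke point that applies tenant scoping.

```python
# Illustrative sketch only: a single mediated access point for storage.
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    kb_id: str


class MediatedStorage:
    """All reads/writes go through this wrapper; the raw pool stays private."""

    def __init__(self, pool):
        self._pool = pool  # not exposed to callers

    async def fetch(self, ctx: TenantContext, sql: str, params: List[Any]):
        if not isinstance(ctx, TenantContext):
            raise PermissionError("Storage access requires a verified TenantContext")
        # Tenant/KB scoping is applied here, not left to individual callers.
        # (Assumes the caller's SQL already has a WHERE clause; see threat #1
        # below for the general WHERE/AND handling.)
        scoped_sql = f"{sql} AND tenant_id = ? AND kb_id = ?"
        async with self._pool.acquire() as conn:
            return await conn.execute(scoped_sql, [*params, ctx.tenant_id, ctx.kb_id])
```

Handlers would receive `MediatedStorage` via dependency injection, so bypassing the API layer's tenant checks also means losing the only handle to storage.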
## Threat Model

### Attack Vectors & Mitigations

#### 1. Unauthorized Cross-Tenant Access

**Threat**: Attacker gains access to another tenant's data

```
Attacker (Tenant A) → Exploit → Access Tenant B data
```

**Likelihood**: HIGH (if not mitigated)
**Impact**: CRITICAL (data breach)

**Mitigation Strategies**:

```python
# 1. Strict tenant validation in dependency injection
async def get_tenant_context(
    tenant_id: str = Path(...),
    kb_id: str = Path(...),
    authorization: str = Header(...),
    token_service = Depends(get_token_service)
) -> TenantContext:
    # Decode and validate token
    token_data = token_service.validate_token(authorization)

    # CRITICAL: Verify tenant in token matches path parameter
    if token_data["tenant_id"] != tenant_id:
        logger.warning(
            f"Tenant mismatch: token claims {token_data['tenant_id']}, "
            f"but path requests {tenant_id}",
            extra={"user_id": token_data["sub"], "request_id": request_id}
        )
        raise HTTPException(status_code=403, detail="Tenant mismatch")

    # Verify KB accessibility
    if kb_id not in token_data["knowledge_base_ids"] and "*" not in token_data["knowledge_base_ids"]:
        raise HTTPException(status_code=403, detail="KB not accessible")

    return TenantContext(tenant_id=tenant_id, kb_id=kb_id, ...)


# 2. Storage-layer filtering (defense in depth)
async def query_with_tenant_filter(
    sql: str,
    tenant_id: str,
    kb_id: str,
    params: List[Any]
):
    # Always add tenant/kb filter to the WHERE clause
    if "WHERE" in sql:
        sql += " AND tenant_id = ? AND kb_id = ?"
    else:
        sql += " WHERE tenant_id = ? AND kb_id = ?"

    params.extend([tenant_id, kb_id])
    return await execute(sql, params)


# 3. Composite key validation
def validate_composite_key(entity_id: str, expected_tenant: str, expected_kb: str):
    parts = entity_id.split(":")
    if len(parts) != 3 or parts[0] != expected_tenant or parts[1] != expected_kb:
        raise ValueError(f"Invalid entity_id: {entity_id}")
```

#### 2. Authentication Bypass via Token Manipulation

**Threat**: Attacker forges or modifies a JWT token to gain unauthorized access

```
Valid Token → Modify claims → Invalid signature but accepted
```

**Likelihood**: MEDIUM (if not mitigated)
**Impact**: CRITICAL

**Mitigation Strategies**:

```python
# 1. Strong signature verification
def validate_token(token: str) -> TokenPayload:
    try:
        # Use a strong algorithm (HS256 minimum, RS256 preferred)
        payload = jwt.decode(
            token,
            settings.jwt_secret_key,  # Keep secret secure
            algorithms=["HS256"],     # Only allow expected algorithms
            options={"verify_signature": True}
        )

        # Verify required claims
        required_claims = ["sub", "tenant_id", "exp", "iat"]
        for claim in required_claims:
            if claim not in payload:
                raise jwt.InvalidTokenError(f"Missing claim: {claim}")

        # Check expiration
        if payload["exp"] < time.time():
            raise jwt.ExpiredSignatureError("Token expired")

        # Check issued-at time (reject tokens issued in the future)
        if payload["iat"] > time.time() + 60:  # 60-second clock-skew tolerance
            raise jwt.InvalidTokenError("Token issued in the future")

        return TokenPayload(**payload)

    except jwt.DecodeError as e:
        logger.warning(f"Invalid token signature: {e}")
        raise HTTPException(status_code=401, detail="Invalid token")
```
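The sketch above uses a shared HS256 secret; the JWT Configuration section later in this document prefers RS256 in production. A minimal sketch of asymmetric verification with PyJWT, assuming the public key has already been distributed to API nodes (the file path is illustrative):

```python
import jwt  # PyJWT

# Illustrative: public key distributed to API nodes; the private key stays
# with the auth service that issues tokens.
with open("public_key.pem") as f:
    PUBLIC_KEY = f.read()


def validate_token_rs256(token: str) -> dict:
    # API nodes can verify signatures but cannot mint new tokens.
    return jwt.decode(
        token,
        PUBLIC_KEY,
        algorithms=["RS256"],  # never accept "none" or HS256 here
        options={"require": ["sub", "tenant_id", "exp", "iat"]},
    )
```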
#### 3. Parameter Injection / Path Traversal

**Threat**: Attacker passes a malicious tenant_id to access unintended data

```
GET /api/v1/tenants/../../admin/data
POST /api/v1/tenants/"; DROP TABLE tenants; --
```

**Likelihood**: MEDIUM
**Impact**: HIGH

**Mitigation Strategies**:

```python
# 1. Strict input validation
from pydantic import BaseModel, constr

class TenantPathParams(BaseModel):
    tenant_id: constr(regex="^[a-f0-9-]{36}$")  # UUID format only
    kb_id: constr(regex="^[a-f0-9-]{36}$")      # UUID format only

@router.get("/api/v1/tenants/{tenant_id}")
async def get_tenant(params: TenantPathParams = Depends()):
    # tenant_id is guaranteed to be in valid UUID format
    pass


# 2. Parameterized queries (prevent SQL injection)
# VULNERABLE:
query = f"SELECT * FROM tenants WHERE tenant_id = '{tenant_id}'"

# SAFE:
query = "SELECT * FROM tenants WHERE tenant_id = ?"
result = await db.execute(query, [tenant_id])


# 3. API rate limiting per tenant
class RateLimitMiddleware:
    async def __call__(self, request: Request, call_next):
        tenant_id = request.path_params.get("tenant_id")
        rate_limit_key = f"tenant:{tenant_id}:ratelimit"

        if await redis.incr(rate_limit_key) > RATE_LIMIT:
            raise HTTPException(status_code=429, detail="Rate limit exceeded")

        await redis.expire(rate_limit_key, 60)
        return await call_next(request)
```

#### 4. Information Disclosure via Error Messages

**Threat**: Detailed error messages leak information about system structure

```
Error: "User john@acme.com does not have access to tenant-id-xyz"
```

**Likelihood**: HIGH
**Impact**: MEDIUM (reconnaissance for further attacks)

**Mitigation Strategies**:

```python
# 1. Generic error messages
# VULNERABLE:
if tenant is None:
    return {"error": f"Tenant '{tenant_id}' not found in system"}

# SAFE: same response whether the tenant is missing or merely inaccessible
if tenant is None or not has_access:
    return {
        "status": "error",
        "code": "ACCESS_DENIED",
        "message": "Access denied"
    }


# 2. Detailed logging (not exposed to the client)
logger.warning(
    "Unauthorized access attempt",
    extra={
        "user_id": user_id,
        "requested_tenant": tenant_id,
        "user_tenants": user_tenants,
        "ip_address": client_ip,
        "request_id": request_id
    }
)


# 3. Generic HTTP status codes
# 401: Authentication failed (invalid token)
# 403: Authorization failed (valid token, but no access)
# 404: Not found (could mean the resource doesn't exist OR the caller has no access)
```

#### 5. Denial of Service (DoS) via Resource Exhaustion

**Threat**: Attacker uses the API to exhaust resources

```
Attacker sends 100k queries/sec → Exhausts database connections → System unavailable
```

**Likelihood**: MEDIUM
**Impact**: HIGH

**Mitigation Strategies**:

```python
# 1. Per-tenant rate limiting
class TenantRateLimiter:
    async def check_limit(self, tenant_id: str, operation: str):
        key = f"limit:{tenant_id}:{operation}"
        current = await redis.get(key)

        limits = {
            "query": 100,        # 100 queries per minute
            "document_add": 10,  # 10 documents per hour
            "api_call": 1000,    # 1000 API calls per hour
        }

        if int(current or 0) >= limits[operation]:
            raise HTTPException(
                status_code=429,
                detail="Rate limit exceeded",
                headers={"Retry-After": "60"}
            )

        pipe = redis.pipeline()
        pipe.incr(key)
        pipe.expire(key, 60)
        await pipe.execute()


# 2. Query complexity limits
async def validate_query_complexity(query_param: QueryParam):
    complexity_score = 0

    # Penalize expensive operations
    if query_param.mode == "global":
        complexity_score += 10
    if query_param.top_k > 50:
        complexity_score += query_param.top_k - 50

    # Check against quota
    tenant = await get_current_tenant()
    max_complexity = tenant.quota.max_monthly_api_calls
    if complexity_score > max_complexity:
        raise HTTPException(status_code=429, detail="Quota exceeded")


# 3. Connection pooling limits
# In the storage implementation:
class DatabasePool:
    def __init__(self, max_connections: int = 50):
        self.pool = create_pool(max_size=max_connections)

    async def execute(self, query: str, params: List):
        async with self.pool.acquire() as conn:
            return await conn.execute(query, params)
```

#### 6. Data Leakage via Logs

**Threat**: Sensitive data is logged and exposed via log access

```
Log: "Processing document for tenant-acme with content: [secret API key]"
```

**Likelihood**: MEDIUM
**Impact**: HIGH

**Mitigation Strategies**:

```python
# 1. Data sanitization in logs
def sanitize_for_logging(data: Any) -> Any:
    """Remove sensitive fields before logging"""
    sensitive_fields = {
        "password", "api_key", "secret", "token", "auth_header",
        "llm_binding_api_key", "embedding_binding_api_key"
    }

    if isinstance(data, dict):
        return {
            k: "***REDACTED***" if k in sensitive_fields else v
            for k, v in data.items()
        }
    return data


# 2. Structured logging with field control
logger.warning(
    "Authentication failed",
    extra={
        "user_id": user_id,
        "tenant_id": tenant_id,
        "reason": "Invalid token",
        # Sensitive fields not included
    }
)


# 3. Log retention and access control
# - Keep logs only as long as needed (e.g., 90 days)
# - Encrypt logs at rest
# - Restrict access to logs (RBAC)
# - Audit log access


# 4. PII handling
# Strip/hash PII in logs
def hash_email(email: str) -> str:
    import hashlib
    return hashlib.sha256(email.encode()).hexdigest()[:8]

logger.info(
    "Document added",
    extra={"created_by": hash_email(user_email)}
)
```

#### 7. Replay Attacks

**Threat**: Attacker replays captured API requests

```
Attacker captures: POST /query with response
Attacker replays: Same request multiple times
```

**Likelihood**: LOW-MEDIUM
**Impact**: MEDIUM

**Mitigation Strategies**:

```python
# 1. Nonce/JTI (JWT ID) tracking
class TokenBlacklist:
    def __init__(self):
        self.blacklist = set()

    async def revoke_token(self, jti: str):
        self.blacklist.add(jti)
        # Remove the entry once the token itself has expired
        scheduler.schedule_removal(jti, expiration_time)

    async def is_revoked(self, jti: str) -> bool:
        return jti in self.blacklist


# 2. Request idempotency for mutation operations
class IdempotencyMiddleware:
    async def __call__(self, request: Request, call_next):
        if request.method in ["POST", "PUT", "DELETE"]:
            idempotency_key = request.headers.get("Idempotency-Key")
            if idempotency_key:
                # Check if already processed
                cached_response = await redis.get(f"idempotency:{idempotency_key}")
                if cached_response:
                    return JSONResponse(cached_response)

                # Process request
                response = await call_next(request)

                # Cache response
                await redis.setex(
                    f"idempotency:{idempotency_key}",
                    3600,  # 1 hour
                    response.body
                )
                return response

        return await call_next(request)


# 3. Timestamp validation
async def validate_request_timestamp(request: Request):
    timestamp = request.headers.get("X-Timestamp")
    if not timestamp:
        raise HTTPException(status_code=400, detail="Missing timestamp")

    request_time = datetime.fromisoformat(timestamp)
    current_time = datetime.utcnow()

    # Reject requests older than 5 minutes
    if abs((current_time - request_time).total_seconds()) > 300:
        raise HTTPException(status_code=400, detail="Request expired")
```
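Revocation only takes effect if validation actually consults the blacklist. A sketch of wiring the `TokenBlacklist` above into the `validate_token` dependency from threat #2 (the wrapper name is illustrative, not existing code):

```python
token_blacklist = TokenBlacklist()


async def validate_token_with_revocation(token: str) -> TokenPayload:
    # Signature, claim, and expiry checks from threat #2
    payload = validate_token(token)

    jti = getattr(payload, "jti", None)
    if jti is None or await token_blacklist.is_revoked(jti):
        raise HTTPException(status_code=401, detail="Token revoked")

    return payload
```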
## Security Configuration

### 1. JWT Configuration

```python
# settings.py
class JWTSettings:
    # Use RS256 (asymmetric) in production instead of HS256
    ALGORITHM = "RS256"  # Production: asymmetric

    # Generate a key pair:
    #   openssl genrsa -out private_key.pem 2048
    #   openssl rsa -in private_key.pem -pubout -out public_key.pem
    PRIVATE_KEY = load_private_key()
    PUBLIC_KEY = load_public_key()

    # Token expiration times (keep short)
    ACCESS_TOKEN_EXPIRE_MINUTES = 15
    REFRESH_TOKEN_EXPIRE_DAYS = 7

    # Token claims validation
    REQUIRED_CLAIMS = ["sub", "tenant_id", "exp", "iat", "jti"]
```

### 2. API Key Security

```python
class APIKeySettings:
    # Use bcrypt for hashing API keys
    HASH_ALGORITHM = "bcrypt"

    # Require a minimum key length
    MIN_KEY_LENGTH = 32

    # Key rotation policy
    KEY_ROTATION_DAYS = 90

    # Revocation tracking
    TRACK_REVOKED_KEYS = True
    REVOKED_KEY_RETENTION_DAYS = 30
```
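A sketch of generating and verifying keys under this policy with the `bcrypt` package; only the hash is persisted, and the plaintext is shown to the caller once at creation time (the function names are illustrative, not existing code):

```python
import secrets

import bcrypt


def generate_api_key() -> tuple[str, bytes]:
    """Return (plaintext_key, key_hash); persist only the hash."""
    plaintext = secrets.token_urlsafe(APIKeySettings.MIN_KEY_LENGTH)
    key_hash = bcrypt.hashpw(plaintext.encode(), bcrypt.gensalt())
    return plaintext, key_hash


def verify_api_key(presented: str, stored_hash: bytes) -> bool:
    # bcrypt.checkpw performs a constant-time comparison internally
    return bcrypt.checkpw(presented.encode(), stored_hash)
```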
### 3. TLS/HTTPS Configuration

```python
# Enforce HTTPS in production
if settings.environment == "production":
    # Force HTTPS redirect
    app.add_middleware(HTTPSRedirectMiddleware)

    # HSTS header (1 year)
    async def add_hsts_header(request: Request, call_next):
        response = await call_next(request)
        response.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
        return response

    app.add_middleware(BaseHTTPMiddleware, dispatch=add_hsts_header)
```

### 4. CORS Configuration

```python
# Restrict CORS origins
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://lightrag.example.com",
        "https://app.example.com"
    ],
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Content-Type", "Authorization"],
    allow_credentials=True,
    max_age=3600
)
```

## Audit Logging

### Audit Trail

```python
class AuditLog(BaseModel):
    audit_id: str = Field(default_factory=lambda: str(uuid4()))
    timestamp: datetime = Field(default_factory=datetime.utcnow)
    user_id: str
    tenant_id: str
    kb_id: Optional[str]
    action: str                        # create_document, query, delete_entity, etc.
    resource_type: str                 # document, entity, relationship, etc.
    resource_id: str
    changes: Optional[Dict[str, Any]]  # What changed
    status: str                        # success | failure
    status_code: int                   # HTTP status
    ip_address: str
    user_agent: str
    error_message: Optional[str]


# Store audit logs (cannot be modified after creation)
async def log_audit_event(event: AuditLog):
    # Store in append-only log storage
    await audit_storage.insert(event.dict())

    # Also emit to the audit stream for real-time monitoring
    await audit_event_stream.publish(event)


# Example events to audit
AUDIT_EVENTS = [
    "tenant_created",
    "tenant_modified",
    "kb_created",
    "kb_deleted",
    "document_added",
    "document_deleted",
    "entity_modified",
    "query_executed",
    "api_key_created",
    "api_key_revoked",
    "user_access_denied",
    "quota_exceeded",
]
```

## Vulnerability Scanning

### Regular Security Activities

1. **Dependencies Audit**

   ```bash
   # Monthly
   pip-audit
   safety check
   bandit -r lightrag/
   ```

2. **SAST (Static Application Security Testing)**

   ```bash
   # On every commit
   bandit -r lightrag/

   # Scan for hardcoded secrets
   git-secrets scan
   detect-secrets scan
   ```

3. **DAST (Dynamic Application Security Testing)**
   - Run against staging before deployment
   - Test common OWASP Top 10 vulnerabilities

4. **Penetration Testing**
   - Quarterly, by an external security firm
   - Focus on multi-tenant isolation

## Security Checklist

- [ ] All API endpoints require authentication
- [ ] All endpoints verify that the tenant context matches the user's token
- [ ] All queries include tenant/KB filters at the storage layer
- [ ] Error messages don't leak system information
- [ ] Rate limiting enabled per tenant
- [ ] JWT tokens have short expiration (< 1 hour)
- [ ] API keys hashed with bcrypt, never stored in plain text
- [ ] All sensitive data sanitized from logs
- [ ] HTTPS enforced in production
- [ ] CORS properly configured
- [ ] Audit logging for all sensitive operations
- [ ] Secret keys rotated regularly
- [ ] Dependencies audited for vulnerabilities
- [ ] SAST tools run on every commit
- [ ] Regular penetration testing scheduled

## Compliance Considerations

- **GDPR**: Data deletion, right to be forgotten
- **SOC 2 Type II**: Audit trails, access controls
- **ISO 27001**: Information security management
- **HIPAA** (if healthcare): Data encryption, audit trails

---

**Document Version**: 1.0
**Last Updated**: 2025-11-20
**Related Files**: 004-api-design.md, 002-implementation-strategy.md