LightRAG/docs/archives/adr/007-deployment-guide-quick-reference.md
Raphael MANSUY 2b292d4924
2025-12-04 18:09:15 +08:00

ADR 007: Deployment Guide and Quick Reference

Status: Proposed

Summary of Multi-Tenant Architecture

Core Components

| Component | Purpose | Responsibility |
|-----------|---------|----------------|
| Tenant | Top-level isolation boundary | Grouping of knowledge bases |
| Knowledge Base | Domain-specific RAG system | Contains documents, entities, relationships |
| TenantContext | Request-scoped isolation | Passed through entire call stack |
| RAGManager | Instance caching | Creates/caches LightRAG per tenant/KB |
| Storage Layer Filters | Defense in depth | All queries scoped to tenant/KB |

Key Design Decisions

┌──────────────────────────────────────┐
│   Composite Isolation Strategy       │
├──────────────────────────────────────┤
│ Tenant ID (UUID)                     │
│ └─ Knowledge Base ID (UUID)          │
│    └─ Composite Key: t:k:entity_id   │
│       └─ Storage filters all queries │
└──────────────────────────────────────┘
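
The composite key in the diagram can be sketched as follows. This is a minimal illustration, not the project's implementation: the `CompositeKey` class, the `belongs_to` helper, and the choice of `:` as the delimiter are assumptions inferred from the `t:k:entity_id` notation above.

```python
from dataclasses import dataclass

SEPARATOR = ":"  # assumed delimiter, inferred from the "t:k:entity_id" notation


@dataclass(frozen=True)
class CompositeKey:
    """Hypothetical value object for a tenant/KB-scoped storage key."""

    tenant_id: str
    kb_id: str
    entity_id: str

    def encode(self) -> str:
        # Tenant and KB IDs are UUIDs, so they must never contain the separator.
        for part in (self.tenant_id, self.kb_id):
            if not part or SEPARATOR in part:
                raise ValueError(f"invalid scope component: {part!r}")
        return SEPARATOR.join((self.tenant_id, self.kb_id, self.entity_id))

    @classmethod
    def decode(cls, raw: str) -> "CompositeKey":
        # Split at most twice so the entity ID may itself contain ':'.
        parts = raw.split(SEPARATOR, 2)
        if len(parts) != 3:
            raise ValueError(f"not a composite key: {raw!r}")
        return cls(*parts)


def belongs_to(raw_key: str, tenant_id: str, kb_id: str) -> bool:
    """Return True if raw_key is scoped to the given tenant and KB."""
    key = CompositeKey.decode(raw_key)
    return key.tenant_id == tenant_id and key.kb_id == kb_id
```

A storage layer could call `belongs_to` as a last-line check before returning a record, which is the "defense in depth" role the table assigns to storage filters.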

Files Modified/Created

New Files (11 total):

  1. lightrag/models/tenant.py - Tenant/KB models
  2. lightrag/services/tenant_service.py - Tenant management
  3. lightrag/tenant_rag_manager.py - Instance caching
  4. lightrag/api/dependencies.py - DI for tenant context
  5. lightrag/api/models/requests.py - API request models
  6. lightrag/api/routers/tenant_routes.py - Tenant endpoints
  7. tests/test_tenant_isolation.py - Unit tests
  8. tests/test_api_tenant_routes.py - Integration tests
  9. scripts/migrate_workspace_to_tenant.py - Migration script
  10. lightrag/kg/migrations/001_add_tenant_schema.sql - DB schema
  11. lightrag/kg/migrations/mongo_001_add_tenant_collections.py - MongoDB schema

Modified Files (7 total):

  1. lightrag/base.py - Add tenant/kb to StorageNameSpace
  2. lightrag/lightrag.py - Add tenant context to query/insert
  3. lightrag/kg/postgres_impl.py - Add tenant filtering to all queries
  4. lightrag/kg/json_kv_impl.py - Add tenant/kb directories
  5. lightrag/api/lightrag_server.py - Register new routes
  6. lightrag/api/auth.py - Tenant-aware JWT validation
  7. lightrag/api/config.py - Add tenant configuration

Quick Start for Developers

1. Setting Up Development Environment

# Install dependencies
pip install -r requirements.txt

# Set up PostgreSQL for tenant metadata
docker run -d --name lightrag-postgres \
  -e POSTGRES_PASSWORD=password \
  -p 5432:5432 \
  postgres:15

# Run migrations
psql postgresql://postgres:password@localhost:5432/postgres < \
  lightrag/kg/migrations/001_add_tenant_schema.sql

# Set environment variables
export LIGHTRAG_KV_STORAGE=PGKVStorage
export TENANT_DB_HOST=localhost
export TENANT_DB_USER=postgres
export TENANT_DB_PASSWORD=password

2. Testing Locally

# Run unit tests
pytest tests/test_tenant_isolation.py -v

# Run integration tests
pytest tests/test_api_tenant_routes.py -v

# Run with coverage
pytest --cov=lightrag tests/ --cov-report=html

# Test tenant isolation (this test fails if isolation is broken)
pytest tests/test_tenant_isolation.py::TestTenantIsolation::test_cross_tenant_data_isolation -v

3. Manual Testing via cURL

# 1. Create tenant (admin)
ADMIN_TOKEN="eyJhbGc..."  # From auth system
curl -X POST http://localhost:9621/api/v1/tenants \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tenant_name": "Test Tenant"}'

# Response:
# {
#   "status": "success",
#   "data": {
#     "tenant_id": "550e8400-e29b-41d4-a716-446655440000",
#     "tenant_name": "Test Tenant",
#     "is_active": true,
#     "created_at": "2025-11-20T10:00:00Z"
#   }
# }

TENANT_ID="550e8400-e29b-41d4-a716-446655440000"

# 2. Create knowledge base
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/knowledge-bases \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"kb_name": "Test KB"}'

# Take the KB ID from the response
KB_ID="660e8400-e29b-41d4-a716-446655440000"

# 3. Create API key for tenant
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/api-keys \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "key_name": "test-key",
    "knowledge_base_ids": ["'$KB_ID'"],
    "permissions": ["query:run", "document:read"]
  }'

# Response includes: {"key": "sk-..."}
API_KEY="sk-..."

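# 4. Add document with API key
# Note: the key created in step 3 lists only query:run and document:read.
# If uploads require a separate write permission, add it to the key first.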
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/knowledge-bases/$KB_ID/documents/add \
  -H "X-API-Key: $API_KEY" \
  -F "file=@test_document.pdf"

# 5. Query knowledge base
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/knowledge-bases/$KB_ID/query \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is this document about?",
    "mode": "mix",
    "top_k": 10
  }'

# 6. Verify cross-tenant isolation (should fail)
TENANT_B_ID="770e8400-e29b-41d4-a716-446655440001"
curl -X GET http://localhost:9621/api/v1/tenants/$TENANT_B_ID \
  -H "X-API-Key: $API_KEY"

# Response: 403 Forbidden (the API key is scoped to Tenant A only)

Backward Compatibility

Migrating from Workspace to Tenant

# 1. Backup existing data
cp -r ./rag_storage ./rag_storage.backup

# 2. Run migration script
python scripts/migrate_workspace_to_tenant.py \
  --working-dir ./rag_storage

# 3. Verify migration
python -c "
from lightrag.services.tenant_service import TenantService
import asyncio

async def verify():
    service = TenantService(...)
    tenants = await service.list_all_tenants()
    for t in tenants:
        print(f'Tenant: {t.tenant_id} ({t.tenant_name})')
        kbs = await service.list_knowledge_bases(t.tenant_id)
        for kb in kbs:
            print(f'  KB: {kb.kb_id} ({kb.kb_name})')

asyncio.run(verify())
"

# 4. Test that the old workspace is still accessible via its tenant
# Legacy workspace 'myworkspace' becomes tenant 'myworkspace'

Configuration Examples

Docker Compose

version: '3.8'

services:
  postgres:
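    # Note: PGVectorStorage and PGGraphStorage rely on the pgvector and
    # Apache AGE extensions, which the stock postgres image does not ship.
    # Substitute an image that provides them before using those backends.
    image: postgres:15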
    environment:
      POSTGRES_DB: lightrag
      POSTGRES_PASSWORD: secret
    ports:
      - "5432:5432"
    volumes:
      - ./lightrag/kg/migrations/001_add_tenant_schema.sql:/docker-entrypoint-initdb.d/01_schema.sql

  redis:
    image: redis:7
    ports:
      - "6379:6379"

  lightrag:
    build: .
    environment:
      # Tenant Configuration
      TENANT_ENABLED: "true"
      MAX_CACHED_INSTANCES: "100"
      
      # Storage Configuration
      LIGHTRAG_KV_STORAGE: "PGKVStorage"
      LIGHTRAG_VECTOR_STORAGE: "PGVectorStorage"
      LIGHTRAG_GRAPH_STORAGE: "PGGraphStorage"
      
      # Database
      PG_HOST: "postgres"
      PG_DATABASE: "lightrag"
      PG_USER: "postgres"
      PG_PASSWORD: "secret"
      
      # LLM Configuration
      LLM_BINDING: "openai"
      LLM_MODEL: "gpt-4o-mini"
      LLM_BINDING_API_KEY: "${OPENAI_API_KEY}"
      
      # Embedding Configuration
      EMBEDDING_BINDING: "openai"
      EMBEDDING_MODEL: "text-embedding-3-small"
      EMBEDDING_DIM: "1536"
      
      # Authentication
      JWT_ALGORITHM: "HS256"
      TOKEN_SECRET: "your-secret-key-change-in-production"
      TOKEN_EXPIRE_HOURS: "24"
      
      # API
      CORS_ORIGINS: "*"
      LOG_LEVEL: "INFO"
    
    ports:
      - "9621:9621"
    
    depends_on:
      - postgres
      - redis
    
    volumes:
      - ./rag_storage:/app/rag_storage

Environment Variables

# Tenant Manager
TENANT_ENABLED=true
MAX_CACHED_INSTANCES=100
TENANT_CONFIG_SYNC_INTERVAL=300

# Database
LIGHTRAG_KV_STORAGE=PGKVStorage
LIGHTRAG_VECTOR_STORAGE=PGVectorStorage
LIGHTRAG_GRAPH_STORAGE=PGGraphStorage

# PostgreSQL Connection
PG_HOST=localhost
PG_PORT=5432
PG_DATABASE=lightrag
PG_USER=postgres
PG_PASSWORD=secret

# Authentication
JWT_ALGORITHM=HS256
TOKEN_SECRET=your-secret-key
TOKEN_EXPIRE_HOURS=24
GUEST_TOKEN_EXPIRE_HOURS=1

# LLM Configuration
LLM_BINDING=openai
LLM_MODEL=gpt-4o-mini
LLM_BINDING_API_KEY=${OPENAI_API_KEY}
EMBEDDING_BINDING=openai
EMBEDDING_MODEL=text-embedding-3-small

# Quotas
MAX_DOCUMENTS=10000
MAX_STORAGE_GB=100
MAX_KB_PER_TENANT=50

# Rate Limiting
RATE_LIMIT_QUERIES_PER_MINUTE=100
RATE_LIMIT_DOCUMENTS_PER_HOUR=50
RATE_LIMIT_API_CALLS_PER_MONTH=100000

# Monitoring
LOG_LEVEL=INFO
ENABLE_AUDIT_LOGGING=true
AUDIT_LOG_RETENTION_DAYS=90
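
A minimal sketch of how a service might read a few of these variables at startup. The `TenantSettings` dataclass and `load_tenant_settings` function are hypothetical names, and the fallback values other than `MAX_CACHED_INSTANCES` (documented default 100) are assumptions copied from the example values above rather than confirmed defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    # Accept the common truthy spellings used in .env files.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TenantSettings:
    """Hypothetical container for the tenant-related settings."""

    enabled: bool
    max_cached_instances: int
    max_kb_per_tenant: int
    queries_per_minute: int


def load_tenant_settings(env: Mapping[str, str] = os.environ) -> TenantSettings:
    # Fallbacks mirror the example values in this guide (assumed, not confirmed).
    settings = TenantSettings(
        enabled=_as_bool(env.get("TENANT_ENABLED", "false")),
        max_cached_instances=int(env.get("MAX_CACHED_INSTANCES", "100")),
        max_kb_per_tenant=int(env.get("MAX_KB_PER_TENANT", "50")),
        queries_per_minute=int(env.get("RATE_LIMIT_QUERIES_PER_MINUTE", "100")),
    )
    # Fail fast on values that would make the instance cache unusable.
    if settings.max_cached_instances < 1:
        raise ValueError("MAX_CACHED_INSTANCES must be at least 1")
    return settings
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to unit test.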

Monitoring and Observability

Metrics to Track

# Key metrics for multi-tenant system

METRICS = {
    "tenant_management": {
        "active_tenants": "Gauge",
        "total_kbs": "Gauge",
        "tenant_creation_time": "Histogram",
    },
    "isolation": {
        "cross_tenant_access_attempts": "Counter",  # Should be 0
        "cross_kb_access_attempts": "Counter",      # Should be 0
        "isolation_violations": "Counter",           # Should be 0
    },
    "performance": {
        "query_latency_per_tenant": "Histogram",
        "document_processing_time": "Histogram",
        "rag_instance_cache_hits": "Counter",
        "rag_instance_cache_misses": "Counter",
    },
    "security": {
        "failed_auth_attempts": "Counter",
        "permission_denials": "Counter",
        "api_key_usage": "Counter (per key)",
    },
    "quotas": {
        "storage_used_per_tenant": "Gauge",
        "documents_per_tenant": "Gauge",
        "api_calls_per_tenant": "Counter",
    }
}

Example Prometheus Queries

# p95 query latency per tenant
histogram_quantile(0.95, sum by (le, tenant_id) (rate(query_latency_per_tenant_bucket[5m])))

# Cache hit rate
sum(rate(rag_instance_cache_hits[5m]))
  / (sum(rate(rag_instance_cache_hits[5m])) + sum(rate(rag_instance_cache_misses[5m])))

# Failed auth attempts
rate(failed_auth_attempts[5m])

# Cross-tenant access attempts (should be 0)
increase(cross_tenant_access_attempts[1h])

Logging

# Structured logging for debugging

import structlog

logger = structlog.get_logger()

# Example log entry
logger.info(
    "query_executed",
    user_id="user-123",
    tenant_id="acme",
    kb_id="docs",
    query="What is...",
    mode="mix",
    latency_ms=145,
    result_count=5,
    request_id="req-abc-123"
)

Rollout Strategy

Phase 1: Soft Launch (Week 1)

- Deploy with TENANT_ENABLED=false (features off)
- Run in parallel with existing system
- Test against staging data
- Monitor for issues (none expected while the feature is off)

Phase 2: Closed Beta (Week 2)

- TENANT_ENABLED=true for 10% of traffic
- Small set of trusted customers
- Monitor metrics closely
- Rollback plan ready

Phase 3: Gradual Rollout (Week 3)

- 25% → 50% → 100%
- Staggered by time of day
- Monitor isolation violations (should be 0)
- Customer education under way
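
One way to implement the percentage steps is to assign each tenant a stable bucket, so a tenant enabled at 25% stays enabled at 50% and 100%. This is a sketch under assumptions: the guide does not say how traffic is split, and `rollout_bucket` and `tenant_mode_enabled` are hypothetical helpers that split by tenant ID rather than by request.

```python
import hashlib


def rollout_bucket(tenant_id: str) -> int:
    """Map a tenant ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer; any fixed slice would do.
    return int.from_bytes(digest[:8], "big") % 100


def tenant_mode_enabled(tenant_id: str, rollout_percent: int) -> bool:
    """Return True if this tenant falls inside the current rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return rollout_bucket(tenant_id) < rollout_percent
```

Because the bucket depends only on the tenant ID, raising the percentage only ever adds tenants, and rolling back to a lower percentage removes the most recently added ones first.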

Phase 4: Full Production (Week 4)

- 100% of traffic on multi-tenant system
- Legacy workspace mode deprecated (6-month timeline)
- Full monitoring and alerting active
- Support team trained

Troubleshooting Guide

Issue: Cross-Tenant Data Visible

Symptom: User can see Tenant B data while using Tenant A credentials
Solution:
1. Check TokenPayload.tenant_id == request.path.tenant_id
2. Check storage filters include WHERE tenant_id = ? AND kb_id = ?
3. Review TenantContext creation in get_tenant_context()
4. Check RAGManager.get_rag_instance() is called with correct IDs
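
The first two checks can be sketched as below. `TokenPayload` and `TenantContext` are named in this guide, but their fields here are assumptions, and `build_tenant_context`, `scoped_where`, and `TenantAccessDenied` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


class TenantAccessDenied(Exception):
    """Raised when credentials do not cover the requested tenant or KB."""


@dataclass(frozen=True)
class TokenPayload:
    tenant_id: str
    knowledge_base_ids: FrozenSet[str]  # assumed field: KBs the token may access


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    kb_id: str


def build_tenant_context(
    token: TokenPayload, path_tenant_id: str, path_kb_id: str
) -> TenantContext:
    # Check 1: the tenant in the token must match the tenant in the URL path.
    if token.tenant_id != path_tenant_id:
        raise TenantAccessDenied("token is not valid for this tenant")
    if path_kb_id not in token.knowledge_base_ids:
        raise TenantAccessDenied("token is not valid for this knowledge base")
    return TenantContext(tenant_id=path_tenant_id, kb_id=path_kb_id)


def scoped_where(ctx: TenantContext) -> Tuple[str, Tuple[str, str]]:
    # Check 2: every query carries both IDs as bound parameters,
    # never interpolated into the SQL string.
    return "WHERE tenant_id = %s AND kb_id = %s", (ctx.tenant_id, ctx.kb_id)
```

If cross-tenant data is visible, logging the resulting `TenantContext` next to the executed WHERE clause usually shows which of the two layers let the request through.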

Issue: Slow Queries

Symptom: Queries taking >1 second
Solution:
1. Check indexes on (tenant_id, kb_id) columns
2. Verify RAG instance cache is working (check metrics)
3. Check whether the RAG instance is being recreated on every request
4. Profile with: EXPLAIN ANALYZE SELECT * FROM documents WHERE tenant_id=? AND kb_id=?

Issue: High Memory Usage

Symptom: Memory growing over time
Solution:
1. Check MAX_CACHED_INSTANCES setting (default 100)
2. Monitor rag_instance_cache_size metric
3. Verify finalize_storages() called on eviction
4. Check for memory leaks in embedding cache
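
Checks 1 and 3 describe a bounded cache that releases storage when an instance is evicted. The sketch below shows that behaviour in isolation. `InstanceCache` is a hypothetical stand-in for the RAGManager, and it assumes each cached object exposes an async `finalize_storages()` method, as check 3 implies.

```python
import asyncio
from collections import OrderedDict
from typing import Any, Awaitable, Callable, List, Tuple

Key = Tuple[str, str]  # (tenant_id, kb_id)
Factory = Callable[[str, str], Awaitable[Any]]


class InstanceCache:
    """Hypothetical LRU cache of per-tenant/KB RAG instances."""

    def __init__(self, factory: Factory, max_instances: int = 100) -> None:
        self._factory = factory
        self._max = max_instances
        self._items: "OrderedDict[Key, Any]" = OrderedDict()
        self._lock = asyncio.Lock()

    async def get(self, tenant_id: str, kb_id: str) -> Any:
        key = (tenant_id, kb_id)
        evicted: List[Any] = []
        async with self._lock:
            if key in self._items:
                # Cache hit: mark as most recently used.
                self._items.move_to_end(key)
                return self._items[key]
            # Cache miss: build the instance while holding the lock
            # (a simplification that serialises instance creation).
            instance = await self._factory(tenant_id, kb_id)
            self._items[key] = instance
            while len(self._items) > self._max:
                _, old = self._items.popitem(last=False)
                evicted.append(old)
        # Release storage outside the lock so slow finalizers do not block lookups.
        for old in evicted:
            await old.finalize_storages()
        return instance

    def __len__(self) -> int:
        return len(self._items)
```

If memory still grows with a cache like this in place, the leak is more likely in objects that outlive eviction, such as a shared embedding cache, than in the instance count itself.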

Support and Resources

Documentation

  • Architecture Overview: adr/001-multi-tenant-architecture-overview.md
  • Implementation Guide: adr/002-implementation-strategy.md
  • Data Models: adr/003-data-models-and-storage.md
  • API Design: adr/004-api-design.md
  • Security: adr/005-security-analysis.md
  • Diagrams & Alternatives: adr/006-architecture-diagrams-alternatives.md

Code Examples

  • See examples/multi_tenant_demo.py for complete usage example
  • See tests/test_api_tenant_routes.py for API testing examples
  • See scripts/migrate_workspace_to_tenant.py for migration examples

Getting Help

Success Criteria

Multi-tenant implementation is successful when:

Functional Requirements Met

  • All API endpoints working with tenant/KB routing
  • Data isolation verified (cross-tenant access prevented)
  • RBAC enforcement working correctly
  • Audit logging capturing all operations
  • Migration from workspace to tenant successful

Performance Targets Met

  • Query latency < 200ms p99 (including tenant filtering)
  • Storage overhead < 3%
  • Instance cache hit rate > 90%
  • API response time < 150ms average

Security Requirements Met

  • Zero cross-tenant data access
  • JWT validation on all requests
  • Permission checking on every operation
  • Rate limiting preventing abuse
  • Audit logs tamper-proof and retained
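
As an illustration of the rate-limiting requirement, here is a minimal per-tenant sliding-window limiter. It is a single-process sketch under assumptions: `TenantRateLimiter` is a hypothetical class, and a multi-instance deployment would need a shared store for the counters.

```python
import time
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque


class TenantRateLimiter:
    """Hypothetical in-memory sliding-window limiter keyed by tenant ID."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        hits = self._hits[tenant_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

With `limit` set from `RATE_LIMIT_QUERIES_PER_MINUTE`, a rejected call would map to an HTTP 429 response. The injectable `clock` exists so tests can advance time without sleeping.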

Operational Readiness

  • Monitoring/alerting configured
  • Runbooks for common issues
  • Disaster recovery plan tested
  • Support team trained
  • Documentation complete

Document Version: 1.0
Last Updated: 2025-11-20
Deployment Timeline: 4 weeks
Success Criteria: All items checked off
Status: Ready for Implementation