LightRAG/docs/archives/adr/007-deployment-guide-quick-reference.md
Raphael MANSUY 2b292d4924
docs: Enterprise Edition & Multi-tenancy attribution (#5)
2025-12-04 18:09:15 +08:00


# ADR 007: Deployment Guide and Quick Reference
## Status: Proposed
## Summary of Multi-Tenant Architecture
### Core Components
| Component | Purpose | Responsibility |
|-----------|---------|-----------------|
| **Tenant** | Top-level isolation boundary | Grouping of knowledge bases |
| **Knowledge Base** | Domain-specific RAG system | Contains documents, entities, relationships |
| **TenantContext** | Request-scoped isolation | Passed through entire call stack |
| **RAGManager** | Instance caching | Creates/caches LightRAG per tenant/KB |
| **Storage Layer Filters** | Defense in depth | All queries scoped to tenant/KB |
### Key Design Decisions
```
┌──────────────────────────────────────┐
│ Composite Isolation Strategy         │
├──────────────────────────────────────┤
│ Tenant ID (UUID)                     │
│  └─ Knowledge Base ID (UUID)         │
│     └─ Composite Key: t:k:entity_id  │
│        └─ Storage filters all queries│
└──────────────────────────────────────┘
```
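The composite key in the diagram can be illustrated with a small sketch. It assumes that `t` and `k` stand for the tenant and knowledge base UUIDs and that the separator is a colon; the `CompositeKey` class and its methods are hypothetical names used for illustration, not part of the LightRAG codebase.

```python
from dataclasses import dataclass
from uuid import UUID


@dataclass(frozen=True)
class CompositeKey:
    """Hypothetical helper: a storage key scoped to one tenant and one KB."""

    tenant_id: UUID
    kb_id: UUID
    entity_id: str

    def encode(self) -> str:
        # UUID strings never contain ':', so the first two separators are unambiguous.
        return f"{self.tenant_id}:{self.kb_id}:{self.entity_id}"

    @classmethod
    def decode(cls, raw: str) -> "CompositeKey":
        # Split at most twice so that entity IDs may themselves contain ':'.
        tenant, kb, entity = raw.split(":", 2)
        return cls(UUID(tenant), UUID(kb), entity)

    def belongs_to(self, tenant_id: UUID, kb_id: UUID) -> bool:
        # Defense-in-depth check a storage layer could run on every key it returns.
        return self.tenant_id == tenant_id and self.kb_id == kb_id
```

Putting the tenant and KB IDs first means every key for one knowledge base shares a prefix, which is what lets a storage backend filter by prefix or by leading index columns.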
### Files Modified/Created
**New Files (11 total)**:
1. `lightrag/models/tenant.py` - Tenant/KB models
2. `lightrag/services/tenant_service.py` - Tenant management
3. `lightrag/tenant_rag_manager.py` - Instance caching
4. `lightrag/api/dependencies.py` - DI for tenant context
5. `lightrag/api/models/requests.py` - API request models
6. `lightrag/api/routers/tenant_routes.py` - Tenant endpoints
7. `tests/test_tenant_isolation.py` - Unit tests
8. `tests/test_api_tenant_routes.py` - Integration tests
9. `scripts/migrate_workspace_to_tenant.py` - Migration script
10. `lightrag/kg/migrations/001_add_tenant_schema.sql` - DB schema
11. `lightrag/kg/migrations/mongo_001_add_tenant_collections.py` - MongoDB schema

**Modified Files (7 total)**:
1. `lightrag/base.py` - Add tenant/kb to StorageNameSpace
2. `lightrag/lightrag.py` - Add tenant context to query/insert
3. `lightrag/kg/postgres_impl.py` - Add tenant filtering to all queries
4. `lightrag/kg/json_kv_impl.py` - Add tenant/kb directories
5. `lightrag/api/lightrag_server.py` - Register new routes
6. `lightrag/api/auth.py` - Tenant-aware JWT validation
7. `lightrag/api/config.py` - Add tenant configuration
## Quick Start for Developers
### 1. Setting Up Development Environment
```bash
# Install dependencies
pip install -r requirements.txt
# Set up PostgreSQL for tenant metadata
docker run -d --name lightrag-postgres \
  -e POSTGRES_PASSWORD=password \
  -p 5432:5432 \
  postgres:15
# Run migrations
psql postgresql://postgres:password@localhost:5432/postgres < \
  lightrag/kg/migrations/001_add_tenant_schema.sql
# Set environment variables
export LIGHTRAG_KV_STORAGE=PGKVStorage
export TENANT_DB_HOST=localhost
export TENANT_DB_USER=postgres
export TENANT_DB_PASSWORD=password
```
### 2. Testing Locally
```bash
# Run unit tests
pytest tests/test_tenant_isolation.py -v
# Run integration tests
pytest tests/test_api_tenant_routes.py -v
# Run with coverage
pytest --cov=lightrag tests/ --cov-report=html
# Test tenant isolation (should fail if not working)
pytest tests/test_tenant_isolation.py::TestTenantIsolation::test_cross_tenant_data_isolation -v
```
### 3. Manual Testing via cURL
```bash
# 1. Create tenant (admin)
ADMIN_TOKEN="eyJhbGc..." # From auth system
curl -X POST http://localhost:9621/api/v1/tenants \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tenant_name": "Test Tenant"}'
# Response:
# {
#   "status": "success",
#   "data": {
#     "tenant_id": "550e8400-e29b-41d4-a716-446655440000",
#     "tenant_name": "Test Tenant",
#     "is_active": true,
#     "created_at": "2025-11-20T10:00:00Z"
#   }
# }
TENANT_ID="550e8400-e29b-41d4-a716-446655440000"
# 2. Create knowledge base
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/knowledge-bases \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"kb_name": "Test KB"}'
# Copy kb_id from the response
KB_ID="660e8400-e29b-41d4-a716-446655440000"
# 3. Create API key for tenant
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/api-keys \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "key_name": "test-key",
    "knowledge_base_ids": ["'$KB_ID'"],
    "permissions": ["query:run", "document:read"]
  }'
# Response includes: {"key": "sk-..."}
API_KEY="sk-..."
# 4. Add document with API key
# Note: the key created in step 3 has only query:run and document:read.
# If the upload is rejected, add the appropriate document write permission to the key.
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/knowledge-bases/$KB_ID/documents/add \
  -H "X-API-Key: $API_KEY" \
  -F "file=@test_document.pdf"
# 5. Query knowledge base
curl -X POST http://localhost:9621/api/v1/tenants/$TENANT_ID/knowledge-bases/$KB_ID/query \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is this document about?",
    "mode": "mix",
    "top_k": 10
  }'
# 6. Verify cross-tenant isolation (should fail)
TENANT_B_ID="770e8400-e29b-41d4-a716-446655440001"
curl -i -X GET http://localhost:9621/api/v1/tenants/$TENANT_B_ID \
  -H "X-API-Key: $API_KEY"
# Response: 403 Forbidden (the API key is valid only for Tenant A)
```
## Backward Compatibility
### Migrating from Workspace to Tenant
```bash
# 1. Backup existing data
cp -r ./rag_storage ./rag_storage.backup
# 2. Run migration script
python scripts/migrate_workspace_to_tenant.py \
  --working-dir ./rag_storage
# 3. Verify migration
python -c "
from lightrag.services.tenant_service import TenantService
import asyncio

async def verify():
    # Constructor arguments are elided here; supply your storage configuration
    service = TenantService(...)
    tenants = await service.list_all_tenants()
    for t in tenants:
        print(f'Tenant: {t.tenant_id} ({t.tenant_name})')
        kbs = await service.list_knowledge_bases(t.tenant_id)
        for kb in kbs:
            print(f' KB: {kb.kb_id} ({kb.kb_name})')

asyncio.run(verify())
"
# 4. Test that the old workspace is still accessible via its tenant
# Legacy workspace 'myworkspace' becomes tenant 'myworkspace'
```
## Configuration Examples
### Docker Compose
```yaml
version: '3.8'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: lightrag
      POSTGRES_PASSWORD: secret
    ports:
      - "5432:5432"
    volumes:
      - ./lightrag/kg/migrations/001_add_tenant_schema.sql:/docker-entrypoint-initdb.d/01_schema.sql
  redis:
    image: redis:7
    ports:
      - "6379:6379"
  lightrag:
    build: .
    environment:
      # Tenant Configuration
      TENANT_ENABLED: "true"
      MAX_CACHED_INSTANCES: "100"
      # Storage Configuration
      LIGHTRAG_KV_STORAGE: "PGKVStorage"
      LIGHTRAG_VECTOR_STORAGE: "PGVectorStorage"
      LIGHTRAG_GRAPH_STORAGE: "PGGraphStorage"
      # Database
      PG_HOST: "postgres"
      PG_DATABASE: "lightrag"
      PG_USER: "postgres"
      PG_PASSWORD: "secret"
      # LLM Configuration
      LLM_BINDING: "openai"
      LLM_MODEL: "gpt-4o-mini"
      LLM_BINDING_API_KEY: "${OPENAI_API_KEY}"
      # Embedding Configuration
      EMBEDDING_BINDING: "openai"
      EMBEDDING_MODEL: "text-embedding-3-small"
      EMBEDDING_DIM: "1536"
      # Authentication
      JWT_ALGORITHM: "HS256"
      TOKEN_SECRET: "your-secret-key-change-in-production"
      TOKEN_EXPIRE_HOURS: "24"
      # API
      CORS_ORIGINS: "*"
      LOG_LEVEL: "INFO"
    ports:
      - "9621:9621"
    depends_on:
      - postgres
      - redis
    volumes:
      - ./rag_storage:/app/rag_storage
```
### Environment Variables
```bash
# Tenant Manager
TENANT_ENABLED=true
MAX_CACHED_INSTANCES=100
TENANT_CONFIG_SYNC_INTERVAL=300
# Database
LIGHTRAG_KV_STORAGE=PGKVStorage
LIGHTRAG_VECTOR_STORAGE=PGVectorStorage
LIGHTRAG_GRAPH_STORAGE=PGGraphStorage
# PostgreSQL Connection
PG_HOST=localhost
PG_PORT=5432
PG_DATABASE=lightrag
PG_USER=postgres
PG_PASSWORD=secret
# Authentication
JWT_ALGORITHM=HS256
TOKEN_SECRET=your-secret-key
TOKEN_EXPIRE_HOURS=24
GUEST_TOKEN_EXPIRE_HOURS=1
# LLM Configuration
LLM_BINDING=openai
LLM_MODEL=gpt-4o-mini
LLM_BINDING_API_KEY=${OPENAI_API_KEY}
EMBEDDING_BINDING=openai
EMBEDDING_MODEL=text-embedding-3-small
# Quotas
MAX_DOCUMENTS=10000
MAX_STORAGE_GB=100
MAX_KB_PER_TENANT=50
# Rate Limiting
RATE_LIMIT_QUERIES_PER_MINUTE=100
RATE_LIMIT_DOCUMENTS_PER_HOUR=50
RATE_LIMIT_API_CALLS_PER_MONTH=100000
# Monitoring
LOG_LEVEL=INFO
ENABLE_AUDIT_LOGGING=true
AUDIT_LOG_RETENTION_DAYS=90
```
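As a rough illustration of how a service might consume a few of these variables, the sketch below reads them into a typed settings object. `TenantSettings` and `load_tenant_settings` are hypothetical names, and the fallback defaults are assumptions copied from the example values above (only the `MAX_CACHED_INSTANCES` default of 100 is stated elsewhere in this guide).

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    # Accept the usual spellings of "true" in environment files.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TenantSettings:
    """Hypothetical container for a subset of the variables listed above."""

    enabled: bool
    max_cached_instances: int
    max_kb_per_tenant: int
    queries_per_minute: int


def load_tenant_settings(env: Mapping[str, str] = os.environ) -> TenantSettings:
    # Defaults are assumptions taken from the example values in this guide.
    settings = TenantSettings(
        enabled=_as_bool(env.get("TENANT_ENABLED", "false")),
        max_cached_instances=int(env.get("MAX_CACHED_INSTANCES", "100")),
        max_kb_per_tenant=int(env.get("MAX_KB_PER_TENANT", "50")),
        queries_per_minute=int(env.get("RATE_LIMIT_QUERIES_PER_MINUTE", "100")),
    )
    # Fail fast on values that would make the instance cache unusable.
    if settings.max_cached_instances < 1:
        raise ValueError("MAX_CACHED_INSTANCES must be at least 1")
    return settings
```

Passing the environment in as a mapping keeps the loader easy to test, since a plain dictionary can stand in for `os.environ`.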
## Monitoring and Observability
### Metrics to Track
```python
# Key metrics for multi-tenant system
METRICS = {
    "tenant_management": {
        "active_tenants": "Gauge",
        "total_kbs": "Gauge",
        "tenant_creation_time": "Histogram",
    },
    "isolation": {
        "cross_tenant_access_attempts": "Counter",  # Should be 0
        "cross_kb_access_attempts": "Counter",  # Should be 0
        "isolation_violations": "Counter",  # Should be 0
    },
    "performance": {
        "query_latency_per_tenant": "Histogram",
        "document_processing_time": "Histogram",
        "rag_instance_cache_hits": "Counter",
        "rag_instance_cache_misses": "Counter",
    },
    "security": {
        "failed_auth_attempts": "Counter",
        "permission_denials": "Counter",
        "api_key_usage": "Counter (per key)",
    },
    "quotas": {
        "storage_used_per_tenant": "Gauge",
        "documents_per_tenant": "Gauge",
        "api_calls_per_tenant": "Counter",
    },
}
```
### Example Prometheus Queries
```promql
# 95th-percentile query latency per tenant
histogram_quantile(0.95, sum by (le, tenant_id) (rate(query_latency_per_tenant_bucket[5m])))
# Cache hit rate over the last 5 minutes
sum(rate(rag_instance_cache_hits[5m]))
  / (sum(rate(rag_instance_cache_hits[5m])) + sum(rate(rag_instance_cache_misses[5m])))
# Failed auth attempts per second
rate(failed_auth_attempts[5m])
# Cross-tenant access attempts in the last hour (should be 0)
increase(cross_tenant_access_attempts[1h])
```
### Logging
```python
# Structured logging for debugging
import structlog
logger = structlog.get_logger()
# Example log entry
logger.info(
    "query_executed",
    user_id="user-123",
    tenant_id="acme",
    kb_id="docs",
    query="What is...",
    mode="mix",
    latency_ms=145,
    result_count=5,
    request_id="req-abc-123",
)
```
## Rollout Strategy
### Phase 1: Soft Launch (Week 1)
```
- Deploy with TENANT_ENABLED=false (features off)
- Run in parallel with existing system
- Test against staging data
- Monitor for issues: 0 expected
```
### Phase 2: Closed Beta (Week 2)
```
- TENANT_ENABLED=true for 10% of traffic
- Small set of trusted customers
- Monitor metrics closely
- Rollback plan ready
```
### Phase 3: Gradual Rollout (Week 3)
```
- 25% → 50% → 100%
- Staggered by time of day
- Monitor isolation violations (should be 0)
- Customer education happening
```
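One way to implement the percentage steps is deterministic bucketing, sketched below. It assumes the rollout is decided per tenant rather than per request, so that a given tenant never flips between legacy and multi-tenant mode; `rollout_bucket` and `tenant_mode_enabled` are hypothetical helpers, not existing LightRAG functions.

```python
import hashlib


def rollout_bucket(tenant_id: str) -> int:
    """Map a tenant ID to a stable bucket in the range 0..99."""
    # A cryptographic hash is used only for its stable, even distribution.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def tenant_mode_enabled(tenant_id: str, rollout_percent: int) -> bool:
    """Return True if this tenant falls inside the current rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    # Buckets below the threshold are enabled, so raising the percentage
    # only ever adds tenants and never removes one that was already enabled.
    return rollout_bucket(tenant_id) < rollout_percent
```

Because the bucket depends only on the tenant ID, moving from 25% to 50% keeps every tenant from the earlier step enabled, which makes issues easier to attribute to the newly added cohort.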
### Phase 4: Full Production (Week 4)
```
- 100% of traffic on multi-tenant system
- Legacy workspace mode deprecated (6-month timeline)
- Full monitoring and alerting active
- Support team trained
```
## Troubleshooting Guide
### Issue: Cross-Tenant Data Visible
```
Symptom: User can see Tenant B data while using Tenant A credentials
Solution:
1. Check TokenPayload.tenant_id == request.path.tenant_id
2. Check storage filters include WHERE tenant_id = ? AND kb_id = ?
3. Review TenantContext creation in get_tenant_context()
4. Check RAGManager.get_rag_instance() is called with correct IDs
```
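The first and third checks can be pictured with the sketch below. The class names `TokenPayload` and `TenantContext` come from this guide, but their fields, the `TenantAccessDenied` exception, and the `build_tenant_context` function are assumptions made for illustration; the real `get_tenant_context()` may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet


class TenantAccessDenied(Exception):
    """Raised when credentials are scoped to a different tenant or KB."""


@dataclass(frozen=True)
class TokenPayload:
    # Field names are assumed, not taken from the actual auth module.
    user_id: str
    tenant_id: str
    knowledge_base_ids: FrozenSet[str]


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    kb_id: str
    user_id: str


def build_tenant_context(
    token: TokenPayload, path_tenant_id: str, path_kb_id: str
) -> TenantContext:
    # The tenant in the URL must match the tenant the token was issued for.
    if token.tenant_id != path_tenant_id:
        raise TenantAccessDenied("token is not valid for this tenant")
    # The token must also be scoped to the requested knowledge base.
    if path_kb_id not in token.knowledge_base_ids:
        raise TenantAccessDenied("token is not valid for this knowledge base")
    return TenantContext(path_tenant_id, path_kb_id, token.user_id)
```

In an API layer, the exception would typically be translated into the 403 response shown in the cURL walkthrough.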
### Issue: Slow Queries
```
Symptom: Queries taking >1 second
Solution:
1. Check indexes on (tenant_id, kb_id) columns
2. Verify RAG instance cache is working (check metrics)
3. Check whether the RAG instance is being recreated on every request
4. Profile with: EXPLAIN ANALYZE SELECT * FROM documents WHERE tenant_id=? AND kb_id=?
```
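The index check in step 1 can be tried out locally with the sketch below. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, so the planner output differs from what `EXPLAIN ANALYZE` prints in production; the table layout and the index name `idx_documents_tenant_kb` are assumptions for illustration.

```python
import sqlite3

# In-memory database with a minimal, assumed version of the documents table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, "
    "kb_id TEXT NOT NULL, content TEXT)"
)
# Composite index matching the tenant/KB filter applied to every query.
conn.execute("CREATE INDEX idx_documents_tenant_kb ON documents (tenant_id, kb_id)")
conn.executemany(
    "INSERT INTO documents (tenant_id, kb_id, content) VALUES (?, ?, ?)",
    [(f"tenant-{i % 5}", f"kb-{i % 3}", f"doc {i}") for i in range(300)],
)


def query_plan(connection, tenant_id, kb_id):
    """Return the planner's description of the tenant-scoped lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM documents WHERE tenant_id = ? AND kb_id = ?",
        (tenant_id, kb_id),
    ).fetchall()
    # The last column of each row is the human-readable plan step.
    return [row[3] for row in rows]


plan = query_plan(conn, "tenant-1", "kb-2")
print(plan)
```

If the plan mentions the composite index, the filter is served by an index search; a plan that reports a full table scan suggests the index is missing or does not lead with `tenant_id`.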
### Issue: High Memory Usage
```
Symptom: Memory growing over time
Solution:
1. Check MAX_CACHED_INSTANCES setting (default 100)
2. Monitor rag_instance_cache_size metric
3. Verify finalize_storages() called on eviction
4. Check for memory leaks in embedding cache
```
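The relationship between `MAX_CACHED_INSTANCES` and `finalize_storages()` can be sketched as a bounded least-recently-used cache. This is a minimal illustration, not the actual `RAGManager`: `InstanceCache` and `FakeRAG` are hypothetical, and the only behavior modelled is that an evicted instance has its cleanup hook awaited.

```python
import asyncio
from collections import OrderedDict


class FakeRAG:
    """Stand-in for a LightRAG instance; only the cleanup hook is modelled."""

    def __init__(self, tenant_id, kb_id):
        self.key = (tenant_id, kb_id)
        self.finalized = False

    async def finalize_storages(self):
        self.finalized = True


class InstanceCache:
    """Hypothetical LRU cache keyed by (tenant_id, kb_id)."""

    def __init__(self, factory, max_instances=100):
        self._factory = factory
        self._max = max_instances
        self._items = OrderedDict()
        self._lock = asyncio.Lock()
        self.hits = 0
        self.misses = 0

    async def get(self, tenant_id, kb_id):
        key = (tenant_id, kb_id)
        async with self._lock:
            if key in self._items:
                # Mark as most recently used.
                self._items.move_to_end(key)
                self.hits += 1
                return self._items[key]
            self.misses += 1
            instance = self._factory(tenant_id, kb_id)
            self._items[key] = instance
            # Evict least recently used entries and release their resources.
            while len(self._items) > self._max:
                _, evicted = self._items.popitem(last=False)
                await evicted.finalize_storages()
            return instance
```

A production cache would also need to avoid finalizing an instance that an in-flight request is still using, for example by reference counting; that concern is left out of this sketch.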
## Support and Resources
### Documentation
- Architecture Overview: `adr/001-multi-tenant-architecture-overview.md`
- Implementation Guide: `adr/002-implementation-strategy.md`
- Data Models: `adr/003-data-models-and-storage.md`
- API Design: `adr/004-api-design.md`
- Security: `adr/005-security-analysis.md`
- Diagrams & Alternatives: `adr/006-architecture-diagrams-alternatives.md`
### Code Examples
- See `examples/multi_tenant_demo.py` for complete usage example
- See `tests/test_api_tenant_routes.py` for API testing examples
- See `scripts/migrate_workspace_to_tenant.py` for migration examples
### Getting Help
- GitHub Issues: [LightRAG/issues](https://github.com/HKUDS/LightRAG/issues)
- Discussions: [LightRAG/discussions](https://github.com/HKUDS/LightRAG/discussions)
- Discord: [LightRAG Community](https://discord.gg/yF2MmDJyGJ)
## Success Criteria
Multi-tenant implementation is successful when:

**Functional Requirements Met**
- [ ] All API endpoints working with tenant/KB routing
- [ ] Data isolation verified (cross-tenant access is prevented)
- [ ] RBAC enforcement working correctly
- [ ] Audit logging capturing all operations
- [ ] Migration from workspace to tenant successful

**Performance Targets Met**
- [ ] Query latency < 200ms p99 (including tenant filtering)
- [ ] Storage overhead < 3%
- [ ] Instance cache hit rate > 90%
- [ ] API response time < 150ms average

**Security Requirements Met**
- [ ] Zero cross-tenant data access
- [ ] JWT token validation in all requests
- [ ] Permission checking on every operation
- [ ] Rate limiting preventing abuse
- [ ] Audit logs tamper-proof and retained

**Operational Readiness**
- [ ] Monitoring/alerting configured
- [ ] Runbooks for common issues
- [ ] Disaster recovery plan tested
- [ ] Support team trained
- [ ] Documentation complete
---
**Document Version**: 1.0
**Last Updated**: 2025-11-20
**Deployment Timeline**: 4 weeks
**Success Criteria**: All items checked off
**Status**: Ready for Implementation