LightRAG/docs/archives/adr/001-multi-tenant-architecture-overview.md
Raphael MANSUY 2b292d4924
docs: Enterprise Edition & Multi-tenancy attribution (#5)
2025-12-04 18:09:15 +08:00

# ADR 001: Multi-Tenant, Multi-Knowledge-Base Architecture for LightRAG
## Status: Proposed
## Context
### Current State
LightRAG is a retrieval-augmented generation (RAG) system that currently runs as a single instance with basic workspace-level data isolation. The existing architecture uses:
- **Workspace concept**: Directory-based or database-field-based isolation for file/database storage
- **Single LightRAG instance**: One RAG system per server process, configured at startup
- **Basic authentication**: JWT tokens and API key support without tenant/knowledge-base awareness
- **Shared configuration**: All data uses the same LLM, embedding, and storage configurations
### Limitations of Current Architecture
1. **No true multi-tenancy**: Cannot serve multiple independent tenants securely
2. **No knowledge base isolation**: All data belongs to a single knowledge base
3. **Shared compute resources**: LLM and embedding calls are shared across all workspaces
4. **Static configuration**: All tenants must use the same models and settings
5. **Cross-tenant data leak risk**: Workspace isolation is not cryptographically enforced
6. **No resource quotas**: No limits on storage, compute, or API usage per tenant
7. **Authentication limitations**: JWT tokens don't support fine-grained access control
### Existing Code Evidence
- **Workspace in `base.py`**: The `StorageNameSpace` class (line 176) includes a `workspace` field for basic isolation
- **Namespace concept**: The `NameSpace` class in `namespace.py` defines storage categories but has no tenant/KB concept
- **Storage implementations**: Each storage type (PostgreSQL, JSON, Neo4j) implements workspace filtering:
  - The `PostgreSQLDB` constructor accepts a `workspace` parameter (line 56 in `postgres_impl.py`)
  - `JsonKVStorage` creates workspace directories (lines 30-39 in `json_kv_impl.py`)
- **API configuration**: `lightrag_server.py` accepts a `--workspace` flag but no tenant/KB parameters
- **Authentication**: `auth.py` provides JWT tokens with roles but no tenant/KB scoping
### Business Requirements
Organizations deploying LightRAG need to:
1. Serve multiple independent customers (tenants) from a single instance
2. Support multiple knowledge bases per tenant for different use cases
3. Enforce complete data isolation between tenants
4. Manage per-tenant resource quotas and billing
5. Support per-tenant configuration (models, parameters, API keys)
6. Provide audit trails and access logs per tenant
## Decision
### High-Level Architecture
Implement a **multi-tenant, multi-knowledge-base (MT-MKB)** architecture that:
1. **Adds tenant abstraction layer** above the current workspace concept
2. **Introduces knowledge base concept** as a first-class entity
3. **Implements tenant-aware routing** at the API level
4. **Enforces data isolation** through composite keys and access control
5. **Supports per-tenant/KB configuration** for models and parameters
6. **Adds role-based access control (RBAC)** for fine-grained permissions
### Core Design Principles
1. **Backward Compatibility**: Existing single-workspace setups continue to work
2. **Layered Isolation**: Tenant > Knowledge Base > Document > Chunk/Entity
3. **Zero Trust**: All data access requires explicit tenant/KB context
4. **Default Deny**: Cross-tenant access is explicitly blocked unless authorized
5. **Audit Trail**: All operations logged with tenant/KB context
6. **Resource Aware**: Quotas and limits per tenant/KB
### Architecture Overview
```
┌────────────────────────────────────────────────────────────────────┐
│                  FastAPI Server (Single Instance)                  │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐  │
│  │ API Router       │  │ Auth/Middleware  │  │ Request Handler  │  │
│  │ Layer            │  │ (Tenant Extract) │  │ Layer            │  │
│  └──────┬───────────┘  └──────┬───────────┘  └──────┬───────────┘  │
│         │                     │                     │              │
│  ┌──────▼─────────────────────▼─────────────────────▼───────────┐  │
│  │         Tenant Context (TenantID + KnowledgeBaseID)          │  │
│  │        Injected via Dependency Injection / Middleware        │  │
│  └──────┬───────────────────────────────────────────────────────┘  │
│         │                                                          │
│  ┌──────▼───────────────────────────────────────────────────────┐  │
│  │            Tenant-Aware LightRAG Instance Manager            │  │
│  │                (Caches instances per tenant)                 │  │
│  └──────┬───────────────────────────────────────────────────────┘  │
│         │                                                          │
│  ┌──────▼───────────────────────────────────────────────────────┐  │
│  │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │  │
│  │   │ Tenant 1     │    │ Tenant 2     │    │ Tenant N     │   │  │
│  │   │ KB1, KB2     │    │ KB1, KB3     │    │ KB1, ...     │   │  │
│  │   └──────────────┘    └──────────────┘    └──────────────┘   │  │
│  │                                                              │  │
│  │      Multiple LightRAG Instances (per tenant or cached)      │  │
│  └──────┬───────────────────────────────────────────────────────┘  │
│         │                                                          │
│  ┌──────▼───────────────────────────────────────────────────────┐  │
│  │          Storage Access Layer with Tenant Filtering          │  │
│  │           (Adds tenant/KB filters to all queries)            │  │
│  └──────┬───────────────────────────────────────────────────────┘  │
│         │                                                          │
│  ┌──────▼───────────────────────────────────────────────────────┐  │
│  │  ┌────────────────┐    ┌────────────┐    ┌────────────────┐  │  │
│  │  │ PostgreSQL     │    │ Neo4j      │    │ Redis/Milvus   │  │  │
│  │  │ (Shared DB)    │    │ (Shared)   │    │ (Shared)       │  │  │
│  │  └────────────────┘    └────────────┘    └────────────────┘  │  │
│  │                                                              │  │
│  │      All queries filtered by tenant/KB at storage layer      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
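The "Tenant-Aware LightRAG Instance Manager" box implies some form of bounded cache keyed by tenant and knowledge base. The sketch below shows one possible shape, assuming least-recently-used eviction. The class name `InstanceManager`, the `max_instances` parameter, and the factory signature are hypothetical and not part of LightRAG; any object can stand in for a real instance.

```python
from collections import OrderedDict
from typing import Callable, Generic, Tuple, TypeVar

T = TypeVar("T")
Key = Tuple[str, str]  # (tenant_id, kb_id)


class InstanceManager(Generic[T]):
    """Hypothetical LRU cache of per-tenant/KB instances (illustration only)."""

    def __init__(self, factory: Callable[[str, str], T], max_instances: int = 8) -> None:
        if max_instances < 1:
            raise ValueError("max_instances must be at least 1")
        self._factory = factory
        self._max = max_instances
        self._cache: "OrderedDict[Key, T]" = OrderedDict()

    def get(self, tenant_id: str, kb_id: str) -> T:
        key = (tenant_id, kb_id)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        instance = self._factory(tenant_id, kb_id)
        self._cache[key] = instance
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return instance

    def __len__(self) -> int:
        return len(self._cache)
```

This version is neither thread-safe nor async-aware, and it drops evicted instances without closing them; a production manager would likely need a lock and a cleanup hook for storage connections.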
### Key Components
#### 1. Tenant Model
- **TenantID**: Unique identifier (UUID or slug)
- **TenantName**: Human-readable name
- **Configuration**: Per-tenant LLM, embedding, and rerank model configs
- **ResourceQuotas**: Storage, API calls, concurrent requests limits
- **CreatedAt/UpdatedAt**: Audit timestamps
#### 2. Knowledge Base Model
- **KnowledgeBaseID**: Unique within tenant
- **TenantID**: Parent tenant reference
- **KBName**: Display name
- **Description**: Purpose and content overview
- **Configuration**: Per-KB indexing and query parameters
- **Status**: Active/Archived
- **Metadata**: Custom fields for tenant-specific data
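The two models above could be expressed as plain dataclasses. This is a minimal sketch, not the project's schema: the field names follow the lists above, while the `ResourceQuotas` field names, the default limits, and the lowercase status values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ResourceQuotas:
    # All limits are hypothetical defaults chosen for illustration.
    max_storage_bytes: int = 10 * 1024**3
    max_api_calls_per_day: int = 10_000
    max_concurrent_requests: int = 10


@dataclass
class Tenant:
    tenant_id: str  # UUID or slug
    tenant_name: str
    configuration: Dict[str, Any] = field(default_factory=dict)  # LLM/embedding/rerank
    quotas: ResourceQuotas = field(default_factory=ResourceQuotas)
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


@dataclass
class KnowledgeBase:
    kb_id: str  # unique within the tenant
    tenant_id: str  # parent tenant reference
    kb_name: str
    description: str = ""
    configuration: Dict[str, Any] = field(default_factory=dict)  # indexing/query params
    status: str = "active"  # "active" or "archived"
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.status not in ("active", "archived"):
            raise ValueError(f"invalid status: {self.status!r}")
```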
#### 3. Storage Isolation Strategy
All storage operations will include tenant/KB filters:
- **Document storage**: `workspace = f"{tenant_id}_{kb_id}"`
- **Vector storage**: Add `tenant_id` and `kb_id` metadata fields
- **Graph storage**: Store tenant/KB info as node/edge attributes
- **KV storage**: Prefix keys with `tenant_id:kb_id:entity_id`
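One detail worth checking in this scheme is separator ambiguity: with `f"{tenant_id}_{kb_id}"`, the pairs (`a_b`, `c`) and (`a`, `b_c`) would map to the same workspace. The sketch below avoids this by assuming identifiers are restricted to lowercase letters, digits, and hyphens, which covers lowercase UUIDs and slugs. The helper names and the exact pattern are hypothetical.

```python
import re

# Assumption: IDs are lowercase slugs or UUIDs, so neither "_" nor ":" can
# appear inside them and both separators stay unambiguous.
_ID_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,62}[a-z0-9])?")


def validate_id(value: str) -> str:
    """Return the identifier unchanged, or raise if it has unexpected characters."""
    if not _ID_RE.fullmatch(value):
        raise ValueError(f"invalid identifier: {value!r}")
    return value


def workspace_name(tenant_id: str, kb_id: str) -> str:
    """Composite workspace used for document storage."""
    return f"{validate_id(tenant_id)}_{validate_id(kb_id)}"


def kv_key(tenant_id: str, kb_id: str, entity_id: str) -> str:
    """Prefixed key for KV storage; entity_id comes last and is not restricted."""
    return f"{validate_id(tenant_id)}:{validate_id(kb_id)}:{entity_id}"


def vector_metadata(tenant_id: str, kb_id: str) -> dict:
    """Metadata fields attached to each vector record."""
    return {"tenant_id": validate_id(tenant_id), "kb_id": validate_id(kb_id)}
```

Because the entity ID is the final component, a key can be split back into its parts with `key.split(":", 2)` even when the entity ID itself contains colons.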
#### 4. API Routing
```
POST /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/add
GET /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/{doc_id}
POST /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/query
GET /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/graph
```
#### 5. Authentication & Authorization
```python
# JWT token payload (illustrative values)
{
    "sub": "user_id",                      # User identifier
    "tenant_id": "tenant_uuid",            # Assigned tenant
    "knowledge_base_ids": ["kb1", "kb2"],  # Accessible KBs
    "role": "admin|editor|viewer",         # Role within tenant (one of the three)
    "exp": 1234567890,                     # Expiration (Unix timestamp)
    "permissions": {
        "create_kb": True,
        "delete_documents": True,
        "run_queries": True,
    },
}
```
#### 6. Dependency Injection for Tenant Context
```python
from fastapi import Depends

# FastAPI dependency to extract and validate tenant context.
# `get_auth_token` and `TenantContext` are assumed to be defined elsewhere.
async def get_tenant_context(
    tenant_id: str,
    kb_id: str,
    token: str = Depends(get_auth_token),
) -> TenantContext:
    # Verify user can access this tenant/KB
    # Return validated context object
    raise NotImplementedError
```
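The dependency above leaves the access check itself as comments. One way that check might look, written as a framework-independent function, is sketched below. It assumes the JWT signature has already been verified and the payload decoded into a mapping with the claim names from the token payload above. `AccessDenied`, `authorize`, the extra `TenantContext` fields, and the fallback to the `viewer` role are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Any, Mapping, Optional


class AccessDenied(Exception):
    """Raised when claims do not grant access (hypothetical error type)."""


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    kb_id: str
    user_id: str
    role: str


def authorize(
    claims: Mapping[str, Any],
    tenant_id: str,
    kb_id: str,
    permission: Optional[str] = None,
    now: Optional[float] = None,
) -> TenantContext:
    """Check already-verified JWT claims against the requested tenant/KB."""
    current = time.time() if now is None else now
    user_id = claims.get("sub")
    if not user_id:
        raise AccessDenied("missing subject")
    if claims.get("exp", 0) <= current:
        raise AccessDenied("token expired")
    if claims.get("tenant_id") != tenant_id:
        raise AccessDenied("tenant mismatch")
    if kb_id not in claims.get("knowledge_base_ids", []):
        raise AccessDenied("knowledge base not accessible")
    # Default deny: a permission must be present and truthy to pass.
    if permission is not None and not claims.get("permissions", {}).get(permission, False):
        raise AccessDenied(f"missing permission: {permission}")
    return TenantContext(tenant_id, kb_id, user_id, claims.get("role", "viewer"))
```

Inside a FastAPI dependency, `AccessDenied` would typically be translated into an HTTP 403 response.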
## Consequences
### Positive
1. **True Multi-Tenancy**: Complete data isolation between tenants
2. **Scalability**: Support hundreds of tenants in a single instance
3. **Cost Efficiency**: Shared infrastructure reduces per-tenant costs
4. **Flexibility**: Per-tenant model and parameter configuration
5. **Security**: Fine-grained access control and audit trails
6. **Resource Management**: Per-tenant quotas prevent resource abuse
7. **Operational Simplicity**: Single instance to manage
### Negative/Tradeoffs
1. **Increased Complexity**: More code, more testing required (~2-3x development effort)
2. **Performance Overhead**: Tenant/KB filtering on every query (~5-10% latency impact)
3. **Storage Overhead**: Tenant/KB metadata increases storage footprint (~2-3%)
4. **Operational Complexity**: More configuration options, training needed
5. **Breaking Changes**: API endpoints change, requires migration scripts
6. **Backward Compatibility**: Existing workspaces need a migration strategy
### Security Considerations
1. **Data Isolation**: Tenant-aware queries prevent cross-tenant leaks
2. **Authentication**: JWT tokens must include tenant scope
3. **Authorization**: RBAC prevents unauthorized access to KBs
4. **Audit Trail**: All operations logged for compliance
5. **Key Management**: Per-tenant API keys need separate management
6. **Potential Vulnerabilities**:
   - Parameter injection in tenant/KB IDs (mitigate: strict validation)
   - JWT token hijacking (mitigate: short expiry, rate limiting)
   - Side-channel attacks via timing (mitigate: constant-time comparisons)
   - Resource exhaustion (mitigate: quotas and rate limiting)
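For the timing side channel, Python's standard library provides `hmac.compare_digest`, which compares two values in a way designed not to leak where they first differ. The sketch below applies it to per-tenant API keys, assuming only a SHA-256 digest of each key is stored. The function names and that storage choice are illustrative assumptions.

```python
import hashlib
import hmac


def hash_api_key(api_key: str) -> str:
    """Return the SHA-256 hex digest of the key; store this, not the key itself."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def api_key_matches(presented_key: str, stored_digest: str) -> bool:
    """Compare digests with hmac.compare_digest to reduce timing leaks."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_digest)
```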
### Performance Impact
Expected impact relative to the current single-workspace setup:
- **Query Latency**: +5-10% from additional filtering
- **Storage Size**: +2-3% for tenant/KB metadata
- **Memory Usage**: +20-30% from maintaining multiple LightRAG instances
- **CPU Usage**: +10-15% from authentication/authorization checks
### Migration Path for Existing Deployments
1. **Phase 1**: Deploy with backward compatibility (single tenant = existing workspace)
2. **Phase 2**: Provide migration script to convert workspaces to tenants
3. **Phase 3**: Support hybrid mode (legacy workspaces + new tenants)
4. **Phase 4**: Deprecate workspace mode in favor of tenant mode
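Phases 1 and 3 rely on legacy workspaces and tenant-scoped requests coexisting. A minimal sketch of that resolution step follows, assuming the composite workspace format from the storage isolation strategy; the function name `resolve_workspace` and its signature are hypothetical.

```python
from typing import Optional


def resolve_workspace(
    legacy_workspace: str,
    tenant_id: Optional[str] = None,
    kb_id: Optional[str] = None,
) -> str:
    """Pick the storage workspace for a request during the migration phases."""
    if tenant_id is None and kb_id is None:
        # Legacy mode: behave exactly like the existing single-workspace setup.
        return legacy_workspace
    if tenant_id is None or kb_id is None:
        # A half-specified context is rejected rather than guessed.
        raise ValueError("tenant_id and kb_id must be provided together")
    return f"{tenant_id}_{kb_id}"
```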
## Implementation Plan (Summary)
See `002-implementation-strategy.md` for the detailed, step-by-step implementation guide.
### High-Level Phases
1. **Phase 1 (2-3 weeks)**: Core infrastructure
   - Database schema changes
   - Tenant/KB models
   - Storage access layer updates
2. **Phase 2 (2-3 weeks)**: API layer
   - Tenant-aware routing
   - Request/response models
   - Authentication/authorization
3. **Phase 3 (1-2 weeks)**: LightRAG integration
   - Instance manager
   - Per-tenant configurations
   - Query execution
4. **Phase 4 (1 week)**: Testing & deployment
   - Unit/integration tests
   - Migration scripts
   - Documentation
## Alternatives Considered
### 1. Separate Database Per Tenant
- **Approach**: Each tenant gets its own database/storage instance
- **Rejected because**:
  - Massive operational overhead (n×database connections, backups, upgrades)
  - Expensive (n×database licensing)
  - Complex to manage tenants across instances
  - Makes sharing resources impossible
### 2. Dedicated Server Instance Per Tenant
- **Approach**: Each tenant runs their own LightRAG instance
- **Rejected because**:
  - Massive resource waste (minimum resources per instance)
  - Very expensive at scale (n×server costs)
  - Difficult to manage and monitor
  - Cannot share LLM/embedding infrastructure
### 3. Simple Workspace Extension
- **Approach**: Just rename "workspace" to "tenant"
- **Rejected because**:
  - No knowledge base concept (multiple KBs per tenant are not possible)
  - Cannot enforce cross-tenant access prevention
  - No RBAC or fine-grained permissions
  - Cannot manage per-tenant configuration
  - No resource quotas
### 4. Sharding by Tenant Hash
- **Approach**: Hash tenant ID to determine shard, send queries to correct shard
- **Rejected because**:
  - Breaks operational simplicity (multiple instances to manage)
  - Rebalancing is complex when adding/removing tenants
  - Doesn't reduce resource overhead
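The rebalancing concern can be illustrated with a short sketch. It assumes the simplest placement rule, a hash of the tenant ID modulo the shard count, which this ADR does not specify; consistent hashing would move fewer tenants but adds its own complexity. Both function names are hypothetical.

```python
import hashlib
from typing import Sequence


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard using a stable hash (modulo placement)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(tenant_ids: Sequence[str], old_count: int, new_count: int) -> float:
    """Fraction of tenants whose shard changes when the shard count changes."""
    moved = sum(
        1 for t in tenant_ids if shard_for(t, old_count) != shard_for(t, new_count)
    )
    return moved / len(tenant_ids)
```

Under this rule, growing from 4 to 5 shards reassigns most tenants, since a uniformly distributed hash keeps the same remainder for both moduli only about one time in five.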
## Evidence/References
### Code References
- **Storage base class**: `lightrag/base.py:176-185` (StorageNameSpace)
- **Namespace constants**: `lightrag/namespace.py` (NameSpace class)
- **Workspace implementation**: `lightrag/kg/json_kv_impl.py:28-39` (JsonKVStorage)
- **PostgreSQL workspace support**: `lightrag/kg/postgres_impl.py:44-59`
- **API server architecture**: `lightrag/api/lightrag_server.py:1-300`
- **Authentication**: `lightrag/api/auth.py` (JWT token management)
- **Config**: `lightrag/api/config.py:200-220` (workspace argument)
### Related Documentation
- Current workspace isolation documented in `lightrag/api/README-zh.md:165-173`
- Storage implementations in `lightrag/kg/` directory
## Next Steps
1. Review and approve this ADR
2. Create detailed design documents for each component (see ADR 002-007)
3. Conduct security review of proposed architecture
4. Estimate development effort and allocate resources
5. Create implementation tickets and sprint planning
---
**Document Version**: 1.0
**Last Updated**: 2025-11-20
**Author**: Architecture Design Process
**Status**: Proposed - Awaiting Review and Approval