# ADR 001: Multi-Tenant, Multi-Knowledge-Base Architecture for LightRAG

## Status: Proposed

## Context

### Current State
LightRAG is a retrieval-augmented generation system that currently operates as a single-instance system with basic workspace-level data isolation. The existing architecture uses:

- **Workspace concept**: Directory-based or database-field-based isolation for file/database storage
- **Single LightRAG instance**: One RAG system per server process, configured at startup
- **Basic authentication**: JWT tokens and API key support without tenant/knowledge-base awareness
- **Shared configuration**: All data uses the same LLM, embedding, and storage configurations

### Limitations of Current Architecture
1. **No true multi-tenancy**: Cannot serve multiple independent tenants securely
2. **No knowledge base isolation**: All data belongs to a single knowledge base
3. **Shared compute resources**: LLM and embedding calls are shared across all workspaces
4. **Static configuration**: All tenants must use the same models and settings
5. **Cross-tenant data leak risk**: Workspace isolation is not cryptographically enforced
6. **No resource quotas**: No limits on storage, compute, or API usage per tenant
7. **Authentication limitations**: JWT tokens don't support fine-grained access control
### Existing Code Evidence
- **Workspace in base.py**: `StorageNameSpace` class (line 176) includes a `workspace` field for basic isolation
- **Namespace concept**: `NameSpace` class in `namespace.py` defines storage categories but no tenant/KB concept
- **Storage implementations**: Each storage type (PostgreSQL, JSON, Neo4j) implements workspace filtering:
  - `PostgreSQLDB` constructor accepts a workspace parameter (line 56 in postgres_impl.py)
  - `JsonKVStorage` creates workspace directories (lines 30-39 in json_kv_impl.py)
- **API configuration**: `lightrag_server.py` accepts a `--workspace` flag but no tenant/KB parameters
- **Authentication**: `auth.py` provides JWT tokens with roles but no tenant/KB scoping
### Business Requirements
Organizations deploying LightRAG need to:
1. Serve multiple independent customers (tenants) from a single instance
2. Support multiple knowledge bases per tenant for different use cases
3. Enforce complete data isolation between tenants
4. Manage per-tenant resource quotas and billing
5. Support per-tenant configuration (models, parameters, API keys)
6. Provide audit trails and access logs per tenant
## Decision

### High-Level Architecture
Implement a **multi-tenant, multi-knowledge-base (MT-MKB)** architecture that:

1. **Adds a tenant abstraction layer** above the current workspace concept
2. **Introduces the knowledge base** as a first-class entity
3. **Implements tenant-aware routing** at the API level
4. **Enforces data isolation** through composite keys and access control
5. **Supports per-tenant/KB configuration** for models and parameters
6. **Adds role-based access control (RBAC)** for fine-grained permissions

### Core Design Principles
1. **Backward Compatibility**: Existing single-workspace setups continue to work
2. **Layered Isolation**: Tenant > Knowledge Base > Document > Chunk/Entity
3. **Zero Trust**: All data access requires explicit tenant/KB context
4. **Default Deny**: Cross-tenant access is blocked unless explicitly authorized
5. **Audit Trail**: All operations are logged with tenant/KB context
6. **Resource Aware**: Quotas and limits apply per tenant/KB
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                FastAPI Server (Single Instance)                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐  │
│ │    API Router    │ │ Auth/Middleware  │ │ Request Handler  │  │
│ │      Layer       │ │ (Tenant Extract) │ │      Layer       │  │
│ └──────┬───────────┘ └──────┬───────────┘ └──────┬───────────┘  │
│        │                    │                    │              │
│ ┌──────▼────────────────────▼────────────────────▼───────────┐  │
│ │        Tenant Context (TenantID + KnowledgeBaseID)         │  │
│ │       Injected via Dependency Injection / Middleware       │  │
│ └──────┬─────────────────────────────────────────────────────┘  │
│        │                                                        │
│ ┌──────▼─────────────────────────────────────────────────────┐  │
│ │           Tenant-Aware LightRAG Instance Manager           │  │
│ │               (Caches instances per tenant)                │  │
│ └──────┬─────────────────────────────────────────────────────┘  │
│        │                                                        │
│ ┌──────▼─────────────────────────────────────────────────────┐  │
│ │   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    │  │
│ │   │  Tenant 1   │    │  Tenant 2   │    │  Tenant N   │    │  │
│ │   │  KB1, KB2   │    │  KB1, KB3   │    │  KB1, ...   │    │  │
│ │   └─────────────┘    └─────────────┘    └─────────────┘    │  │
│ │                                                            │  │
│ │    Multiple LightRAG Instances (per tenant or cached)     │  │
│ └──────┬─────────────────────────────────────────────────────┘  │
│        │                                                        │
│ ┌──────▼─────────────────────────────────────────────────────┐  │
│ │         Storage Access Layer with Tenant Filtering         │  │
│ │          (Adds tenant/KB filters to all queries)           │  │
│ └──────┬─────────────────────────────────────────────────────┘  │
│        │                                                        │
│ ┌──────▼─────────────────────────────────────────────────────┐  │
│ │                                                            │  │
│ │  ┌────────────────┐   ┌────────────┐   ┌────────────────┐  │  │
│ │  │   PostgreSQL   │   │   Neo4j    │   │  Redis/Milvus  │  │  │
│ │  │  (Shared DB)   │   │  (Shared)  │   │    (Shared)    │  │  │
│ │  └────────────────┘   └────────────┘   └────────────────┘  │  │
│ │                                                            │  │
│ │     All queries filtered by tenant/KB at storage layer     │  │
│ └────────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
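
The instance manager in the diagram is only described as caching instances per tenant. One way it might work is a small least-recently-used cache, sketched below. The class name `InstanceCache`, the `(tenant_id, kb_id)` cache key, the `factory` callback, and the `max_size` limit are all assumptions made for illustration; none of them come from the LightRAG codebase.

```python
from collections import OrderedDict
from typing import Callable, Generic, Tuple, TypeVar

T = TypeVar("T")


class InstanceCache(Generic[T]):
    """LRU cache that lazily builds one instance per (tenant_id, kb_id).

    Hypothetical sketch: `factory` stands in for whatever constructs a
    configured RAG instance for the given tenant and knowledge base.
    """

    def __init__(self, factory: Callable[[str, str], T], max_size: int = 8) -> None:
        self._factory = factory
        self._max_size = max_size
        self._items: "OrderedDict[Tuple[str, str], T]" = OrderedDict()

    def get(self, tenant_id: str, kb_id: str) -> T:
        key = (tenant_id, kb_id)
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        instance = self._factory(tenant_id, kb_id)
        self._items[key] = instance
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the least recently used
        return instance

    def __len__(self) -> int:
        return len(self._items)
```

Bounding the cache is one possible way to limit the memory cost of holding many instances at once; a real manager would also need to release storage connections when an instance is evicted.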
### Key Components

#### 1. Tenant Model
- **TenantID**: Unique identifier (UUID or slug)
- **TenantName**: Human-readable name
- **Configuration**: Per-tenant LLM, embedding, and rerank model configs
- **ResourceQuotas**: Storage, API calls, concurrent requests limits
- **CreatedAt/UpdatedAt**: Audit timestamps

#### 2. Knowledge Base Model
- **KnowledgeBaseID**: Unique within tenant
- **TenantID**: Parent tenant reference
- **KBName**: Display name
- **Description**: Purpose and content overview
- **Configuration**: Per-KB indexing and query parameters
- **Status**: Active/Archived
- **Metadata**: Custom fields for tenant-specific data
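
The two models above could be expressed as plain dataclasses. The following is a minimal sketch, not the project's actual schema: the snake_case field names, the `ResourceQuotas` fields, the `KBStatus` enum, and all defaults are assumptions derived from the bullet lists.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict


class KBStatus(Enum):
    ACTIVE = "active"
    ARCHIVED = "archived"


@dataclass
class ResourceQuotas:
    # Hypothetical quota fields; 0 means "not set" in this sketch.
    max_storage_bytes: int = 0
    max_api_calls_per_day: int = 0
    max_concurrent_requests: int = 0


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Tenant:
    tenant_id: str                      # UUID or slug
    tenant_name: str                    # human-readable name
    configuration: Dict[str, Any] = field(default_factory=dict)
    resource_quotas: ResourceQuotas = field(default_factory=ResourceQuotas)
    created_at: datetime = field(default_factory=_utc_now)
    updated_at: datetime = field(default_factory=_utc_now)


@dataclass
class KnowledgeBase:
    kb_id: str                          # unique within the tenant
    tenant_id: str                      # parent tenant reference
    kb_name: str
    description: str = ""
    configuration: Dict[str, Any] = field(default_factory=dict)
    status: KBStatus = KBStatus.ACTIVE
    metadata: Dict[str, Any] = field(default_factory=dict)
```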
#### 3. Storage Isolation Strategy
All storage operations will include tenant/KB filters:
- **Document storage**: `workspace = f"{tenant_id}_{kb_id}"`
- **Vector storage**: Add `tenant_id` and `kb_id` metadata fields
- **Graph storage**: Store tenant/KB info as node/edge attributes
- **KV storage**: Prefix keys with `tenant_id:kb_id:entity_id`
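
The naming rules in this list can be sketched as a few helper functions. The function names and the separator check are assumptions for illustration. The check is there because joining IDs with `_` is ambiguous if IDs may themselves contain `_`: tenant `a_b` with KB `c` and tenant `a` with KB `b_c` would both map to `a_b_c`.

```python
from typing import Any, Dict, Optional

# Characters used as joiners in composite keys; IDs must not contain them.
_SEPARATORS = ("_", ":")


def _check_id(value: str) -> str:
    """Reject empty IDs and IDs containing a composite-key separator."""
    if not value or any(sep in value for sep in _SEPARATORS):
        raise ValueError(f"invalid identifier: {value!r}")
    return value


def workspace_name(tenant_id: str, kb_id: str) -> str:
    """Document storage: workspace = f"{tenant_id}_{kb_id}"."""
    return f"{_check_id(tenant_id)}_{_check_id(kb_id)}"


def kv_key(tenant_id: str, kb_id: str, entity_id: str) -> str:
    """KV storage: key prefixed as tenant_id:kb_id:entity_id."""
    return f"{_check_id(tenant_id)}:{_check_id(kb_id)}:{entity_id}"


def scoped_metadata(
    tenant_id: str, kb_id: str, metadata: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Vector/graph storage: copy metadata and attach tenant_id and kb_id."""
    scoped = dict(metadata or {})
    scoped.update(tenant_id=_check_id(tenant_id), kb_id=_check_id(kb_id))
    return scoped
```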
#### 4. API Routing
```
POST /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/add
GET  /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/{doc_id}
POST /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/query
GET  /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/graph
```
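To illustrate how the tenant and KB identifiers could be pulled out of these paths and strictly validated, here is a framework-independent sketch using only the standard library. The ID pattern (lowercase letters, digits, and hyphens, up to 64 characters) is an assumed rule, not one stated in this ADR, and `parse_route` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Assumed ID rule: starts with a lowercase letter or digit, then up to 63
# more lowercase letters, digits, or hyphens.
_ID = r"[a-z0-9][a-z0-9-]{0,63}"

_ROUTE = re.compile(
    rf"/api/v1/tenants/(?P<tenant_id>{_ID})"
    rf"/knowledge-bases/(?P<kb_id>{_ID})"
    r"/(?P<rest>.+)"
)


def parse_route(path: str) -> Optional[Dict[str, str]]:
    """Return tenant_id, kb_id, and the remaining path, or None if invalid."""
    match = _ROUTE.fullmatch(path)
    return match.groupdict() if match else None
```

With this rule, `parse_route("/api/v1/tenants/acme/knowledge-bases/kb1/query")` yields the tenant `acme`, the KB `kb1`, and the remainder `query`, while a path whose tenant segment is `..` is rejected.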
#### 5. Authentication & Authorization
```python
# JWT Token Payload
{
    "sub": "user_id",                      # User identifier
    "tenant_id": "tenant_uuid",            # Assigned tenant
    "knowledge_base_ids": ["kb1", "kb2"],  # Accessible KBs
    "role": "admin|editor|viewer",         # Role within tenant
    "exp": 1234567890,                     # Expiration
    "permissions": {
        "create_kb": True,
        "delete_documents": True,
        "run_queries": True
    }
}
```
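A claims check against this payload might look like the sketch below. It assumes the token signature has already been verified elsewhere and that `claims` is the decoded payload; the function name `can_access` and its parameters are hypothetical.

```python
import time
from typing import Any, Mapping, Optional


def can_access(
    claims: Mapping[str, Any],
    tenant_id: str,
    kb_id: str,
    permission: str,
    now: Optional[float] = None,
) -> bool:
    """Default-deny check of decoded JWT claims for one tenant/KB/permission."""
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return False  # expired or missing expiry
    if claims.get("tenant_id") != tenant_id:
        return False  # token is scoped to a different tenant
    if kb_id not in claims.get("knowledge_base_ids", []):
        return False  # KB not granted to this user
    # Only an explicit True grants the permission.
    return claims.get("permissions", {}).get(permission) is True
```

Every branch falls through to a denial unless the claim is present and matches, which follows the Default Deny principle stated earlier.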
#### 6. Dependency Injection for Tenant Context
```python
from fastapi import Depends

# get_auth_token and TenantContext are defined elsewhere in the API layer.


# FastAPI dependency to extract and validate tenant context
async def get_tenant_context(
    tenant_id: str,
    kb_id: str,
    token: str = Depends(get_auth_token),
) -> TenantContext:
    # Verify user can access this tenant/KB
    # Return validated context object
    pass
```
## Consequences

### Positive
1. **True Multi-Tenancy**: Complete data isolation between tenants
2. **Scalability**: Support hundreds of tenants in a single instance
3. **Cost Efficiency**: Shared infrastructure reduces per-tenant costs
4. **Flexibility**: Per-tenant model and parameter configuration
5. **Security**: Fine-grained access control and audit trails
6. **Resource Management**: Per-tenant quotas prevent resource abuse
7. **Operational Simplicity**: Single instance to manage

### Negative/Tradeoffs
1. **Increased Complexity**: More code, more testing required (~2-3x development effort)
2. **Performance Overhead**: Tenant/KB filtering on every query (~5-10% latency impact)
3. **Storage Overhead**: Tenant/KB metadata increases storage footprint (~2-3%)
4. **Operational Complexity**: More configuration options, training needed
5. **Breaking Changes**: API endpoints change, which requires migration scripts
6. **Backward Compatibility**: Existing workspaces need a migration strategy
### Security Considerations
1. **Data Isolation**: Tenant-aware queries prevent cross-tenant leaks
2. **Authentication**: JWT tokens must include tenant scope
3. **Authorization**: RBAC prevents unauthorized access to KBs
4. **Audit Trail**: All operations logged for compliance
5. **Key Management**: Per-tenant API keys need separate management
6. **Potential Vulnerabilities**:
   - Parameter injection in tenant/KB IDs (mitigate: strict validation)
   - JWT token hijacking (mitigate: short expiry, rate limiting)
   - Side-channel attacks via timing (mitigate: constant-time comparisons)
   - Resource exhaustion (mitigate: quotas and rate limiting)
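
As an example of the constant-time comparison mitigation, per-tenant API keys could be checked with `hmac.compare_digest` from the Python standard library. Storing only a SHA-256 hash of each key is an assumption of this sketch, and the function names are hypothetical.

```python
import hashlib
import hmac


def hash_api_key(api_key: str) -> str:
    """Return the hex SHA-256 digest that would be stored instead of the key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def api_key_matches(presented_key: str, stored_hash: str) -> bool:
    """Compare in constant time so response timing does not leak key prefixes."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```

A plain `==` on strings can return as soon as the first differing character is found, which is the timing signal this comparison avoids.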
### Performance Impact
- **Query Latency**: +5-10% from additional filtering
- **Storage Size**: +2-3% for tenant/KB metadata
- **Memory Usage**: +20-30% from maintaining multiple LightRAG instances
- **CPU Usage**: +10-15% from authentication/authorization checks
### Migration Path for Existing Deployments
1. **Phase 1**: Deploy with backward compatibility (single tenant = existing workspace)
2. **Phase 2**: Provide migration script to convert workspaces to tenants
3. **Phase 3**: Support hybrid mode (legacy workspaces + new tenants)
4. **Phase 4**: Deprecate workspace mode in favor of tenant mode
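
One way the hybrid mode in Phases 1 to 3 might resolve a legacy workspace name to a tenant/KB scope is sketched below. The `Scope` type, the `"default"` placeholder IDs, and the `migrated` lookup table are assumptions for illustration; the actual migration design is left to the later ADRs.

```python
from typing import Dict, NamedTuple, Optional

DEFAULT_ID = "default"  # hypothetical placeholder for unnamed tenants/KBs


class Scope(NamedTuple):
    tenant_id: str
    kb_id: str


def resolve_scope(
    workspace: str, migrated: Optional[Dict[str, Scope]] = None
) -> Scope:
    """Map a legacy workspace name to a tenant/KB scope.

    Workspaces already converted by a migration script are looked up in
    `migrated`; any other workspace is treated as its own single tenant
    with one default knowledge base.
    """
    if migrated and workspace in migrated:
        return migrated[workspace]
    return Scope(tenant_id=workspace or DEFAULT_ID, kb_id=DEFAULT_ID)
```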
## Implementation Plan (Summary)

See `002-implementation-strategy.md` for the detailed, step-by-step implementation guide.

### High-Level Phases
1. **Phase 1 (2-3 weeks)**: Core infrastructure
   - Database schema changes
   - Tenant/KB models
   - Storage access layer updates

2. **Phase 2 (2-3 weeks)**: API layer
   - Tenant-aware routing
   - Request/response models
   - Authentication/authorization

3. **Phase 3 (1-2 weeks)**: LightRAG integration
   - Instance manager
   - Per-tenant configurations
   - Query execution

4. **Phase 4 (1 week)**: Testing & deployment
   - Unit/integration tests
   - Migration scripts
   - Documentation
## Alternatives Considered

### 1. Separate Database Per Tenant
- **Approach**: Each tenant gets its own database/storage instance
- **Rejected because**:
  - Massive operational overhead (n×database connections, backups, upgrades)
  - Expensive (n×database licensing)
  - Complex to manage tenants across instances
  - Makes sharing resources impossible

### 2. Dedicated Server Instance Per Tenant
- **Approach**: Each tenant runs its own LightRAG instance
- **Rejected because**:
  - Massive resource waste (minimum resources per instance)
  - Very expensive at scale (n×server costs)
  - Difficult to manage and monitor
  - Cannot share LLM/embedding infrastructure

### 3. Simple Workspace Extension
- **Approach**: Just rename "workspace" to "tenant"
- **Rejected because**:
  - No knowledge base concept (multiple KBs per tenant cannot be represented)
  - Cannot enforce cross-tenant access prevention
  - No RBAC or fine-grained permissions
  - Cannot manage per-tenant configuration
  - No resource quotas

### 4. Sharding by Tenant Hash
- **Approach**: Hash tenant ID to determine shard, send queries to correct shard
- **Rejected because**:
  - Breaks operational simplicity (multiple instances to manage)
  - Rebalancing is complex when adding/removing tenants
  - Doesn't reduce resource overhead
## Evidence/References

### Code References
- **Storage base class**: `lightrag/base.py:176-185` (StorageNameSpace)
- **Namespace constants**: `lightrag/namespace.py` (NameSpace class)
- **Workspace implementation**: `lightrag/kg/json_kv_impl.py:28-39` (JsonKVStorage)
- **PostgreSQL workspace support**: `lightrag/kg/postgres_impl.py:44-59`
- **API server architecture**: `lightrag/api/lightrag_server.py:1-300`
- **Authentication**: `lightrag/api/auth.py` (JWT token management)
- **Config**: `lightrag/api/config.py:200-220` (workspace argument)

### Related Documentation
- Current workspace isolation documented in `lightrag/api/README-zh.md:165-173`
- Storage implementations in `lightrag/kg/` directory
## Next Steps
1. Review and approve this ADR
2. Create detailed design documents for each component (see ADR 002-007)
3. Conduct security review of proposed architecture
4. Estimate development effort and allocate resources
5. Create implementation tickets and sprint planning

---

**Document Version**: 1.0
**Last Updated**: 2025-11-20
**Author**: Architecture Design Process
**Status**: Proposed - Awaiting Review and Approval