# ADR 001: Multi-Tenant, Multi-Knowledge-Base Architecture for LightRAG

## Status: Proposed

## Context

### Current State
LightRAG is a retrieval-augmented generation system that currently operates as a single-instance system with basic workspace-level data isolation. The existing architecture uses:

- **Workspace concept**: Directory-based or database-field-based isolation for file/database storage
- **Single LightRAG instance**: One RAG system per server process, configured at startup
- **Basic authentication**: JWT tokens and API key support without tenant/knowledge-base awareness
- **Shared configuration**: All data uses the same LLM, embedding, and storage configurations

### Limitations of Current Architecture
1. **No true multi-tenancy**: Cannot serve multiple independent tenants securely
2. **No knowledge base isolation**: All data belongs to a single knowledge base
3. **Shared compute resources**: LLM and embedding calls are shared across all workspaces
4. **Static configuration**: All tenants must use the same models and settings
5. **Cross-tenant data leak risk**: Workspace isolation relies on directory or field filtering; it is neither bound to the caller's identity nor cryptographically enforced
6. **No resource quotas**: No limits on storage, compute, or API usage per tenant
7. **Authentication limitations**: JWT tokens carry roles but no tenant or knowledge-base scope, so fine-grained access control is not possible

### Existing Code Evidence
- **Workspace in base.py**: The `StorageNameSpace` class (line 176) includes a `workspace` field for basic isolation
- **Namespace concept**: The `NameSpace` class in `namespace.py` defines storage categories but has no tenant/KB concept
- **Storage implementations**: Each storage type (PostgreSQL, JSON, Neo4j) implements workspace filtering:
  - The `PostgreSQLDB` constructor accepts a `workspace` parameter (line 56 in postgres_impl.py)
  - `JsonKVStorage` creates workspace directories (lines 30-39 in json_kv_impl.py)
- **API configuration**: `lightrag_server.py` accepts a `--workspace` flag but no tenant/KB parameters
- **Authentication**: `auth.py` provides JWT tokens with roles but no tenant/KB scoping

### Business Requirements
Organizations deploying LightRAG need to:
1. Serve multiple independent customers (tenants) from a single instance
2. Support multiple knowledge bases per tenant for different use cases
3. Enforce complete data isolation between tenants
4. Manage per-tenant resource quotas and billing
5. Support per-tenant configuration (models, parameters, API keys)
6. Provide audit trails and access logs per tenant

## Decision

### High-Level Architecture
Implement a **multi-tenant, multi-knowledge-base (MT-MKB)** architecture that:

1. **Adds a tenant abstraction layer** above the current workspace concept
2. **Introduces the knowledge base** as a first-class entity
3. **Implements tenant-aware routing** at the API level
4. **Enforces data isolation** through composite keys and access control
5. **Supports per-tenant/KB configuration** for models and parameters
6. **Adds role-based access control (RBAC)** for fine-grained permissions

### Core Design Principles
1. **Backward Compatibility**: Existing single-workspace setups continue to work
2. **Layered Isolation**: Tenant > Knowledge Base > Document > Chunk/Entity
3. **Zero Trust**: All data access requires explicit tenant/KB context
4. **Default Deny**: Cross-tenant access is explicitly blocked unless authorized
5. **Audit Trail**: All operations logged with tenant/KB context
6. **Resource Aware**: Quotas and limits per tenant/KB
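
As an illustration of the Zero Trust and Audit Trail principles, the sketch below keeps the tenant/KB context in a `contextvars.ContextVar` and refuses to run any audited operation without it. The names `set_scope`, `require_scope`, `audit`, and the logger name are hypothetical and not part of the current codebase.

```python
import contextvars
import json
import logging
from typing import Tuple

logger = logging.getLogger("lightrag.audit")  # hypothetical logger name

# Holds (tenant_id, kb_id) for the current request; None means "no context".
_scope = contextvars.ContextVar("tenant_scope", default=None)


def set_scope(tenant_id: str, kb_id: str) -> None:
    """Record the validated tenant/KB pair for the current execution context."""
    _scope.set((tenant_id, kb_id))


def require_scope() -> Tuple[str, str]:
    """Zero trust: refuse to proceed when no tenant/KB context has been set."""
    scope = _scope.get()
    if scope is None:
        raise PermissionError("no tenant/KB context for this operation")
    return scope


def audit(operation: str) -> str:
    """Emit one JSON audit record tagged with the current tenant/KB."""
    tenant_id, kb_id = require_scope()
    record = json.dumps({"op": operation, "tenant_id": tenant_id, "kb_id": kb_id})
    logger.info(record)
    return record
```

Because the context is looked up inside `audit`, an operation that was never given a tenant/KB scope fails instead of being logged without one.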

### Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                    FastAPI Server (Single Instance)              │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  │  API Router      │  │ Auth/Middleware  │  │  Request Handler │
│  │  Layer           │  │ (Tenant Extract) │  │  Layer           │
│  └──────┬───────────┘  └──────┬───────────┘  └──────┬───────────┘
│         │                      │                      │
│  ┌──────▼──────────────────────▼──────────────────────▼──────┐
│  │        Tenant Context (TenantID + KnowledgeBaseID)       │
│  │        Injected via Dependency Injection / Middleware    │
│  └──────┬─────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │         Tenant-Aware LightRAG Instance Manager             │
│  │         (Caches instances per tenant)                      │
│  └──────┬─────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐        │
│  │  │  Tenant 1   │  │  Tenant 2   │  │  Tenant N    │        │
│  │  │  KB1, KB2   │  │  KB1, KB3   │  │  KB1, ...    │        │
│  │  └─────────────┘  └─────────────┘  └──────────────┘        │
│  │                                                             │
│  │  Multiple LightRAG Instances (per tenant or cached)        │
│  └──────┬──────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │         Storage Access Layer with Tenant Filtering         │
│  │         (Adds tenant/KB filters to all queries)            │
│  └──────┬─────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │                                                              │
│  │  ┌────────────────┐  ┌────────────┐  ┌────────────────┐   │
│  │  │  PostgreSQL    │  │  Neo4j     │  │  Redis/Milvus │   │
│  │  │  (Shared DB)   │  │  (Shared)  │  │  (Shared)      │   │
│  │  └────────────────┘  └────────────┘  └────────────────┘   │
│  │                                                              │
│  │  All queries filtered by tenant/KB at storage layer        │
│  └────────────────────────────────────────────────────────────┘
│                                                                   │
└─────────────────────────────────────────────────────────────────┘
```
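
The instance manager in the diagram could be sketched as a small LRU cache keyed by `(tenant_id, kb_id)`. This is only an illustration: `InstanceManager`, the `factory` callable, and `max_instances` are hypothetical names, and a plain object stands in for a real LightRAG instance.

```python
from collections import OrderedDict
from typing import Callable, Tuple

Key = Tuple[str, str]  # (tenant_id, kb_id)


class InstanceManager:
    """Caches one instance per (tenant_id, kb_id), evicting the least recently used."""

    def __init__(self, factory: Callable[[str, str], object], max_instances: int = 100):
        if max_instances < 1:
            raise ValueError("max_instances must be at least 1")
        self._factory = factory  # hypothetical: builds an instance for a tenant/KB
        self._max = max_instances
        self._cache: "OrderedDict[Key, object]" = OrderedDict()

    def get(self, tenant_id: str, kb_id: str) -> object:
        key = (tenant_id, kb_id)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        instance = self._factory(tenant_id, kb_id)
        self._cache[key] = instance
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # drop the least recently used entry
        return instance

    def __len__(self) -> int:
        return len(self._cache)
```

A production version would also need to release the resources of evicted instances and guard `get` against concurrent creation of the same instance; both are left out here for brevity.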

### Key Components

#### 1. Tenant Model
- **TenantID**: Unique identifier (UUID or slug)
- **TenantName**: Human-readable name
- **Configuration**: Per-tenant LLM, embedding, and rerank model configs
- **ResourceQuotas**: Limits on storage, API calls, and concurrent requests
- **CreatedAt/UpdatedAt**: Audit timestamps
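
A minimal sketch of how the tenant model might look as Python dataclasses. The field names follow the list above; the quota defaults and the `allows_request` helper are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict
from uuid import uuid4


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ResourceQuotas:
    # Illustrative defaults; the ADR does not specify concrete limits.
    max_storage_mb: int = 1024
    max_api_calls_per_day: int = 10_000
    max_concurrent_requests: int = 10


@dataclass
class Tenant:
    tenant_name: str
    tenant_id: str = field(default_factory=lambda: str(uuid4()))
    configuration: Dict[str, str] = field(default_factory=dict)  # e.g. model names
    quotas: ResourceQuotas = field(default_factory=ResourceQuotas)
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

    def allows_request(self, concurrent_requests: int) -> bool:
        """True if one more request fits under the concurrency quota."""
        return concurrent_requests < self.quotas.max_concurrent_requests
```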

#### 2. Knowledge Base Model
- **KnowledgeBaseID**: Unique within tenant
- **TenantID**: Parent tenant reference
- **KBName**: Display name
- **Description**: Purpose and content overview
- **Configuration**: Per-KB indexing and query parameters
- **Status**: Active/Archived
- **Metadata**: Custom fields for tenant-specific data
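
The knowledge base model could be sketched the same way. Because `KnowledgeBaseID` is only unique within a tenant, the hypothetical `scope` property below exposes the `(tenant_id, kb_id)` pair that would have to be used wherever a globally unique reference is needed. `KBStatus` and the field names are illustrative.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Tuple


class KBStatus(str, Enum):
    ACTIVE = "active"
    ARCHIVED = "archived"


@dataclass
class KnowledgeBase:
    kb_id: str       # unique within the tenant only
    tenant_id: str   # parent tenant reference
    kb_name: str
    description: str = ""
    configuration: Dict[str, Any] = field(default_factory=dict)  # indexing/query params
    status: KBStatus = KBStatus.ACTIVE
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def scope(self) -> Tuple[str, str]:
        """(tenant_id, kb_id) pair; kb_id alone is not globally unique."""
        return (self.tenant_id, self.kb_id)
```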

#### 3. Storage Isolation Strategy
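All storage operations will include tenant/KB filters:
- **Document storage**: `workspace = f"{tenant_id}_{kb_id}"` (IDs must not contain the separator, otherwise distinct tenant/KB pairs can map to the same workspace)
- **Vector storage**: Add `tenant_id` and `kb_id` metadata fields and filter on them in every search
- **Graph storage**: Store tenant/KB info as node/edge attributes
- **KV storage**: Prefix keys with `tenant_id:kb_id:`, giving full keys of the form `tenant_id:kb_id:entity_id`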
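
The composite-key scheme above might be centralized in a few helpers so that every storage backend builds its scope the same way. The function names are hypothetical, and rejecting IDs that contain `_` or `:` is an assumption made here to keep the composed keys unambiguous.

```python
def _check(value: str) -> str:
    # Reject empty IDs and IDs containing a separator character, so that
    # ("a_b", "c") and ("a", "b_c") can never produce the same workspace name.
    if not value or "_" in value or ":" in value:
        raise ValueError(f"invalid identifier: {value!r}")
    return value


def workspace_name(tenant_id: str, kb_id: str) -> str:
    """Workspace string used for document storage."""
    return f"{_check(tenant_id)}_{_check(kb_id)}"


def kv_key(tenant_id: str, kb_id: str, entity_id: str) -> str:
    """KV storage key: tenant/KB prefix followed by the entity ID."""
    return f"{_check(tenant_id)}:{_check(kb_id)}:{entity_id}"


def vector_filter(tenant_id: str, kb_id: str) -> dict:
    """Metadata filter to attach to every vector search."""
    return {"tenant_id": _check(tenant_id), "kb_id": _check(kb_id)}
```

For example, `kv_key("acme", "kb1", "ent-42")` yields `"acme:kb1:ent-42"`, while `workspace_name("a_b", "c")` raises `ValueError`.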

#### 4. API Routing
```
POST   /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/add
GET    /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/{doc_id}
POST   /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/query
GET    /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/graph
```

#### 5. Authentication & Authorization
```python
# JWT token payload (decoded claims), shown as a Python dict
jwt_payload = {
    "sub": "user_id",                      # User identifier
    "tenant_id": "tenant_uuid",            # Assigned tenant
    "knowledge_base_ids": ["kb1", "kb2"],  # Accessible KBs
    "role": "admin",                       # Role within tenant: admin, editor, or viewer
    "exp": 1234567890,                     # Expiration (Unix timestamp)
    "permissions": {
        "create_kb": True,
        "delete_documents": True,
        "run_queries": True,
    },
}
```
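
A hedged sketch of how such claims might be checked against a requested action. It assumes the token signature has already been verified elsewhere; `is_authorized` is a hypothetical helper, and any missing claim results in denial.

```python
import time
from typing import Any, Mapping, Optional


def is_authorized(
    claims: Mapping[str, Any],
    tenant_id: str,
    kb_id: str,
    permission: str,
    now: Optional[float] = None,
) -> bool:
    """Default-deny check of decoded JWT claims against a tenant/KB action."""
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        return False  # expired, or no expiry claim at all
    if claims.get("tenant_id") != tenant_id:
        return False  # token belongs to another tenant
    if kb_id not in claims.get("knowledge_base_ids", []):
        return False  # KB not granted to this user
    # Only an explicit True grants the permission.
    return claims.get("permissions", {}).get(permission) is True
```

The `now` parameter exists only to make the expiry check testable with a fixed clock.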

#### 6. Dependency Injection for Tenant Context
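```python
from dataclasses import dataclass
from fastapi import Depends, HTTPException

@dataclass
class TenantContext:
    tenant_id: str
    kb_id: str
    user_id: str

# FastAPI dependency to extract and validate tenant context.
# get_auth_token and decode_token are placeholders for the project's auth helpers.
async def get_tenant_context(
    tenant_id: str,
    kb_id: str,
    token: str = Depends(get_auth_token),
) -> TenantContext:
    claims = decode_token(token)  # assumed to return verified JWT claims
    # Verify user can access this tenant/KB
    if claims["tenant_id"] != tenant_id or kb_id not in claims["knowledge_base_ids"]:
        raise HTTPException(status_code=403, detail="No access to this tenant/KB")
    return TenantContext(tenant_id, kb_id, claims["sub"])
```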

## Consequences

### Positive
1. **True Multi-Tenancy**: Complete data isolation between tenants
2. **Scalability**: Support hundreds of tenants in a single instance
3. **Cost Efficiency**: Shared infrastructure reduces per-tenant costs
4. **Flexibility**: Per-tenant model and parameter configuration
5. **Security**: Fine-grained access control and audit trails
6. **Resource Management**: Per-tenant quotas prevent resource abuse
7. **Operational Simplicity**: Single instance to manage

### Negative/Tradeoffs
1. **Increased Complexity**: More code, more testing required (~2-3x development effort)
2. **Performance Overhead**: Tenant/KB filtering on every query (~5-10% latency impact)
3. **Storage Overhead**: Tenant/KB metadata increases storage footprint (~2-3%)
4. **Operational Complexity**: More configuration options, training needed
5. **Breaking Changes**: Tenant-scoped endpoints change the API surface, so clients need updates and data needs migration scripts
6. **Backward Compatibility**: Existing workspaces need a migration strategy

### Security Considerations
1. **Data Isolation**: Tenant-aware queries prevent cross-tenant leaks
2. **Authentication**: JWT tokens must include tenant scope
3. **Authorization**: RBAC prevents unauthorized access to KBs
4. **Audit Trail**: All operations logged for compliance
5. **Key Management**: Per-tenant API keys need separate management
6. **Potential Vulnerabilities**:
   - Parameter injection in tenant/KB IDs (mitigate: strict validation)
   - JWT token hijacking (mitigate: short expiry, rate limiting)
   - Side-channel attacks via timing (mitigate: constant-time comparisons)
   - Resource exhaustion (mitigate: quotas and rate limiting)
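
Two of the mitigations above, strict ID validation and constant-time comparison, can be sketched with the standard library. The allowed ID format (lowercase slug or UUID, at most 63 characters) is an assumption, as are the helper names.

```python
import hmac
import re

# Assumed ID format: lowercase letters, digits, and hyphens, 1 to 63 characters,
# starting and ending with a letter or digit.
ID_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def validate_id(value: str) -> str:
    """Strict allow-list validation for tenant and KB IDs taken from the URL."""
    if not isinstance(value, str) or not ID_PATTERN.fullmatch(value):
        raise ValueError("invalid tenant or knowledge base ID")
    return value


def api_key_matches(presented: str, expected: str) -> bool:
    """Constant-time comparison, so response timing does not leak key prefixes."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Validating against an allow-list pattern, instead of trying to strip dangerous characters, means inputs such as `../etc` or `a:b` are rejected before they reach a path or a storage key.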

### Performance Impact
Estimated overhead (to be confirmed by benchmarking):
- **Query Latency**: +5-10% from additional filtering
- **Storage Size**: +2-3% for tenant/KB metadata
- **Memory Usage**: +20-30% from maintaining multiple LightRAG instances
- **CPU Usage**: +10-15% from authentication/authorization checks

### Migration Path for Existing Deployments
1. **Phase 1**: Deploy with backward compatibility (single tenant = existing workspace)
2. **Phase 2**: Provide migration script to convert workspaces to tenants
3. **Phase 3**: Support hybrid mode (legacy workspaces + new tenants)
4. **Phase 4**: Deprecate workspace mode in favor of tenant mode
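
For Phase 1, the "single tenant = existing workspace" rule could be expressed as a small mapping function. The reserved `"default"` IDs and the function name are assumptions for illustration; the actual convention is not specified here.

```python
from typing import NamedTuple, Optional

DEFAULT_TENANT_ID = "default"  # hypothetical tenant used when no workspace is set
DEFAULT_KB_ID = "default"      # hypothetical single KB holding all legacy data


class Scope(NamedTuple):
    tenant_id: str
    kb_id: str


def scope_for_legacy_workspace(workspace: Optional[str]) -> Scope:
    """Map a pre-migration workspace name onto a (tenant, KB) pair."""
    name = (workspace or "").strip()
    # Each existing workspace becomes its own tenant with one default KB.
    return Scope(name or DEFAULT_TENANT_ID, DEFAULT_KB_ID)
```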

## Implementation Plan (Summary)

See `002-implementation-strategy.md` for a detailed, step-by-step implementation guide.

### High-Level Phases
1. **Phase 1 (2-3 weeks)**: Core infrastructure
   - Database schema changes
   - Tenant/KB models
   - Storage access layer updates

2. **Phase 2 (2-3 weeks)**: API layer
   - Tenant-aware routing
   - Request/response models
   - Authentication/authorization

3. **Phase 3 (1-2 weeks)**: LightRAG integration
   - Instance manager
   - Per-tenant configurations
   - Query execution

4. **Phase 4 (1 week)**: Testing & deployment
   - Unit/integration tests
   - Migration scripts
   - Documentation

## Alternatives Considered

### 1. Separate Database Per Tenant
- **Approach**: Each tenant gets its own database/storage instance
- **Rejected because**:
  - Massive operational overhead (n×database connections, backups, upgrades)
  - Expensive (n×database licensing)
  - Complex to manage tenants across instances
  - Makes sharing resources impossible

### 2. Dedicated Server Instance Per Tenant
- **Approach**: Each tenant runs their own LightRAG instance
- **Rejected because**:
  - Massive resource waste (minimum resources per instance)
  - Very expensive at scale (n×server costs)
  - Difficult to manage and monitor
  - Cannot share LLM/embedding infrastructure

### 3. Simple Workspace Extension
- **Approach**: Just rename "workspace" to "tenant"
- **Rejected because**:
  - No knowledge base concept (multiple KBs per tenant are not supported)
  - Cannot enforce cross-tenant access prevention
  - No RBAC or fine-grained permissions
  - Cannot manage per-tenant configuration
  - No resource quotas

### 4. Sharding by Tenant Hash
- **Approach**: Hash tenant ID to determine shard, send queries to correct shard
- **Rejected because**:
  - Breaks operational simplicity (multiple instances to manage)
  - Rebalancing is complex when tenant growth or churn forces a change in the number of shards
  - Doesn't reduce resource overhead
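
The rebalancing concern can be illustrated with a short sketch of modulo-based placement. This is not a proposed implementation; `shard_for` and `moved_tenants` are hypothetical helpers that only show how shard assignments shift when the shard count changes.

```python
import hashlib
from typing import Iterable


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Stable shard index derived from a SHA-256 digest of the tenant ID."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_tenants(tenant_ids: Iterable[str], old_count: int, new_count: int) -> int:
    """Number of tenants whose shard changes when the shard count changes."""
    return sum(shard_for(t, old_count) != shard_for(t, new_count) for t in tenant_ids)
```

With a uniform hash, going from 4 to 5 shards reassigns roughly 80% of tenants, because a hash value `h` keeps its shard only when `h % 4 == h % 5`, which holds for 4 of every 20 values. Each reassigned tenant would need its data moved between shards.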

## Evidence/References

### Code References
- **Storage base class**: `lightrag/base.py:176-185` (StorageNameSpace)
- **Namespace constants**: `lightrag/namespace.py` (NameSpace class)
- **Workspace implementation**: `lightrag/kg/json_kv_impl.py:28-39` (JsonKVStorage)
- **PostgreSQL workspace support**: `lightrag/kg/postgres_impl.py:44-59`
- **API server architecture**: `lightrag/api/lightrag_server.py:1-300`
- **Authentication**: `lightrag/api/auth.py` (JWT token management)
- **Config**: `lightrag/api/config.py:200-220` (workspace argument)

### Related Documentation
- Current workspace isolation documented in `lightrag/api/README-zh.md:165-173`
- Storage implementations in `lightrag/kg/` directory

## Next Steps
1. Review and approve this ADR
2. Create detailed design documents for each component (see ADR 002-007)
3. Conduct security review of proposed architecture
4. Estimate development effort and allocate resources
5. Create implementation tickets and sprint planning

---

**Document Version**: 1.0
**Last Updated**: 2025-11-20
**Author**: Architecture Design Process
**Status**: Proposed - Awaiting Review and Approval