LightRAG/docs/archives/adr/001-multi-tenant-architecture-overview.md
Raphael MANSUY 2b292d4924
docs: Enterprise Edition & Multi-tenancy attribution (#5)
* Remove outdated documentation files: Quick Start Guide, Apache AGE Analysis, and Scratchpad.

* Add multi-tenant testing strategy and ADR index documentation

- Introduced ADR 008 detailing the multi-tenant testing strategy for the ./starter environment, covering compatibility and multi-tenant modes, testing scenarios, and implementation details.
- Created a comprehensive ADR index (README.md) summarizing all architecture decision records related to the multi-tenant implementation, including purpose, key sections, and reading paths for different roles.

* feat(docs): Add comprehensive multi-tenancy guide and README for LightRAG Enterprise

- Introduced `0008-multi-tenancy.md` detailing multi-tenancy architecture, key concepts, roles, permissions, configuration, and API endpoints.
- Created `README.md` as the main documentation index, outlining features, quick start, system overview, and deployment options.
- Documented the LightRAG architecture, storage backends, LLM integrations, and query modes.
- Established a task log (`2025-01-21-lightrag-documentation-log.md`) summarizing documentation creation actions, decisions, and insights.
2025-12-04 18:09:15 +08:00

16 KiB
Raw Blame History

ADR 001: Multi-Tenant, Multi-Knowledge-Base Architecture for LightRAG

Status: Proposed

Context

Current State

LightRAG is a retrieval-augmented generation system that currently operates as a single-instance system with basic workspace-level data isolation. The existing architecture uses:

  • Workspace concept: Directory-based or database-field-based isolation for file/database storage
  • Single LightRAG instance: One RAG system per server process, configured at startup
  • Basic authentication: JWT tokens and API key support without tenant/knowledge-base awareness
  • Shared configuration: All data uses the same LLM, embedding, and storage configurations

Limitations of Current Architecture

  1. No true multi-tenancy: Cannot serve multiple independent tenants securely
  2. No knowledge base isolation: All data belongs to a single knowledge base
  3. Shared compute resources: LLM and embedding calls are shared across all workspaces
  4. Static configuration: All tenants must use the same models and settings
  5. Cross-tenant data leak risk: Workspace isolation is not cryptographically enforced
  6. No resource quotas: No limits on storage, compute, or API usage per tenant
  7. Authentication limitations: JWT tokens don't support fine-grained access control

Existing Code Evidence

  • Workspace in base.py: StorageNameSpace class (line 176) includes workspace field for basic isolation
  • Namespace concept: NameSpace class in namespace.py defines storage categories but no tenant/KB concept
  • Storage implementations: Each storage type (PostgreSQL, JSON, Neo4j) implements workspace filtering:
    • PostgreSQLDB constructor accepts workspace parameter (line 56 in postgres_impl.py)
    • JsonKVStorage creates workspace directories (line 30-39 in json_kv_impl.py)
  • API configuration: lightrag_server.py accepts --workspace flag but no tenant/KB parameters
  • Authentication: auth.py provides JWT tokens with roles but no tenant/KB scoping

Business Requirements

Organizations deploying LightRAG need to:

  1. Serve multiple independent customers (tenants) from a single instance
  2. Support multiple knowledge bases per tenant for different use cases
  3. Enforce complete data isolation between tenants
  4. Manage per-tenant resource quotas and billing
  5. Support per-tenant configuration (models, parameters, API keys)
  6. Provide audit trails and access logs per tenant

Decision

High-Level Architecture

Implement a multi-tenant, multi-knowledge-base (MT-MKB) architecture that:

  1. Adds tenant abstraction layer above the current workspace concept
  2. Introduces knowledge base concept as a first-class entity
  3. Implements tenant-aware routing at the API level
  4. Enforces data isolation through composite keys and access control
  5. Supports per-tenant/KB configuration for models and parameters
  6. Adds role-based access control (RBAC) for fine-grained permissions

Core Design Principles

  1. Backward Compatibility: Existing single-workspace setups continue to work
  2. Layered Isolation: Tenant > Knowledge Base > Document > Chunk/Entity
  3. Zero Trust: All data access requires explicit tenant/KB context
  4. Default Deny: Cross-tenant access is explicitly blocked unless authorized
  5. Audit Trail: All operations logged with tenant/KB context
  6. Resource Aware: Quotas and limits per tenant/KB

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    FastAPI Server (Single Instance)              │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  │  API Router      │  │ Auth/Middleware  │  │  Request Handler │
│  │  Layer           │  │ (Tenant Extract) │  │  Layer           │
│  └──────┬───────────┘  └──────┬───────────┘  └──────┬───────────┘
│         │                      │                      │
│  ┌──────▼──────────────────────▼──────────────────────▼──────┐
│  │        Tenant Context (TenantID + KnowledgeBaseID)       │
│  │        Injected via Dependency Injection / Middleware    │
│  └──────┬─────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │         Tenant-Aware LightRAG Instance Manager             │
│  │         (Caches instances per tenant)                      │
│  └──────┬─────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐        │
│  │  │  Tenant 1   │  │  Tenant 2   │  │  Tenant N    │        │
│  │  │  KB1, KB2   │  │  KB1, KB3   │  │  KB1, ...    │        │
│  │  └─────────────┘  └─────────────┘  └──────────────┘        │
│  │                                                             │
│  │  Multiple LightRAG Instances (per tenant or cached)        │
│  └──────┬──────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │         Storage Access Layer with Tenant Filtering         │
│  │         (Adds tenant/KB filters to all queries)            │
│  └──────┬─────────────────────────────────────────────────────┘
│         │
│  ┌──────▼──────────────────────────────────────────────────────┐
│  │                                                              │
│  │  ┌────────────────┐  ┌────────────┐  ┌────────────────┐   │
│  │  │  PostgreSQL    │  │  Neo4j     │  │  Redis/Milvus │   │
│  │  │  (Shared DB)   │  │  (Shared)  │  │  (Shared)      │   │
│  │  └────────────────┘  └────────────┘  └────────────────┘   │
│  │                                                              │
│  │  All queries filtered by tenant/KB at storage layer        │
│  └────────────────────────────────────────────────────────────┘
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Key Components

1. Tenant Model

  • TenantID: Unique identifier (UUID or slug)
  • TenantName: Human-readable name
  • Configuration: Per-tenant LLM, embedding, and rerank model configs
  • ResourceQuotas: Storage, API calls, concurrent requests limits
  • CreatedAt/UpdatedAt: Audit timestamps

2. Knowledge Base Model

  • KnowledgeBaseID: Unique within tenant
  • TenantID: Parent tenant reference
  • KBName: Display name
  • Description: Purpose and content overview
  • Configuration: Per-KB indexing and query parameters
  • Status: Active/Archived
  • Metadata: Custom fields for tenant-specific data

3. Storage Isolation Strategy

All storage operations will include tenant/KB filters:

  • Document storage: workspace = f"{tenant_id}_{kb_id}"
  • Vector storage: Add tenant_id and kb_id metadata fields
  • Graph storage: Store tenant/KB info as node/edge attributes
  • KV storage: Prefix keys with tenant_id:kb_id:entity_id

4. API Routing

POST   /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/add
GET    /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/{doc_id}
POST   /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/query
GET    /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/graph

5. Authentication & Authorization

# JWT Token Payload
{
    "sub": "user_id",                    # User identifier
    "tenant_id": "tenant_uuid",          # Assigned tenant
    "knowledge_base_ids": ["kb1", "kb2"], # Accessible KBs
    "role": "admin|editor|viewer",       # Role within tenant
    "exp": 1234567890,                   # Expiration
    "permissions": {
        "create_kb": true,
        "delete_documents": true,
        "run_queries": true
    }
}

6. Dependency Injection for Tenant Context

# FastAPI dependency to extract and validate tenant context
async def get_tenant_context(
    tenant_id: str, 
    kb_id: str,
    token: str = Depends(get_auth_token)
) -> TenantContext:
    # Verify user can access this tenant/KB
    # Return validated context object
    pass

Consequences

Positive

  1. True Multi-Tenancy: Complete data isolation between tenants
  2. Scalability: Support hundreds of tenants in single instance
  3. Cost Efficiency: Shared infrastructure reduces per-tenant costs
  4. Flexibility: Per-tenant model and parameter configuration
  5. Security: Fine-grained access control and audit trails
  6. Resource Management: Per-tenant quotas prevent resource abuse
  7. Operational Simplicity: Single instance to manage

Negative/Tradeoffs

  1. Increased Complexity: More code, more testing required (~2-3x development effort)
  2. Performance Overhead: Tenant/KB filtering on every query (~5-10% latency impact)
  3. Storage Overhead: Tenant/KB metadata increases storage footprint (~2-3%)
  4. Operational Complexity: More configuration options, training needed
  5. Breaking Changes: API endpoints change, requires migration scripts
  6. Backward Compatibility: Existing workspaces need migration strategy

Security Considerations

  1. Data Isolation: Tenant-aware queries prevent cross-tenant leaks
  2. Authentication: JWT tokens must include tenant scope
  3. Authorization: RBAC prevents unauthorized access to KBs
  4. Audit Trail: All operations logged for compliance
  5. Key Management: Per-tenant API keys need separate management
  6. Potential Vulnerabilities:
    • Parameter injection in tenant/KB IDs (mitigate: strict validation)
    • JWT token hijacking (mitigate: short expiry, rate limiting)
    • Side-channel attacks via timing (mitigate: constant-time comparisons)
    • Resource exhaustion (mitigate: quotas and rate limiting)

Performance Impact

  • Query Latency: +5-10% from additional filtering
  • Storage Size: +2-3% for tenant/KB metadata
  • Memory Usage: +20-30% from maintaining multiple LightRAG instances
  • CPU Usage: +10-15% from authentication/authorization checks

Migration Path for Existing Deployments

  1. Phase 1: Deploy with backward compatibility (single tenant = existing workspace)
  2. Phase 2: Provide migration script to convert workspaces to tenants
  3. Phase 3: Support hybrid mode (legacy workspaces + new tenants)
  4. Phase 4: Deprecate workspace mode in favor of tenant mode

Implementation Plan (Summary)

See 002-implementation-strategy.md for detailed step-by-step implementation guide.

High-Level Phases

  1. Phase 1 (2-3 weeks): Core infrastructure

    • Database schema changes
    • Tenant/KB models
    • Storage access layer updates
  2. Phase 2 (2-3 weeks): API layer

    • Tenant-aware routing
    • Request/response models
    • Authentication/authorization
  3. Phase 3 (1-2 weeks): LightRAG integration

    • Instance manager
    • Per-tenant configurations
    • Query execution
  4. Phase 4 (1 week): Testing & deployment

    • Unit/integration tests
    • Migration scripts
    • Documentation

Alternatives Considered

1. Separate Database Per Tenant

  • Approach: Each tenant gets its own database/storage instance
  • Rejected because:
    • Massive operational overhead (n×database connections, backups, upgrades)
    • Expensive (n×database licensing)
    • Complex to manage tenants across instances
    • Makes sharing resources impossible

2. Dedicated Server Instance Per Tenant

  • Approach: Each tenant runs their own LightRAG instance
  • Rejected because:
    • Massive resource waste (minimum resources per instance)
    • Very expensive at scale (n×server costs)
    • Difficult to manage and monitor
    • Cannot share LLM/embedding infrastructure

3. Simple Workspace Extension

  • Approach: Just rename "workspace" to "tenant"
  • Rejected because:
    • No knowledge base concept (multiple KB per tenant fails)
    • Cannot enforce cross-tenant access prevention
    • No RBAC or fine-grained permissions
    • Cannot manage per-tenant configuration
    • No resource quotas

4. Sharding by Tenant Hash

  • Approach: Hash tenant ID to determine shard, send queries to correct shard
  • Rejected because:
    • Breaks operational simplicity (multiple instances to manage)
    • Rebalancing is complex when adding/removing tenants
    • Doesn't reduce resource overhead

Evidence/References

Code References

  • Storage base class: lightrag/base.py:176-185 (StorageNameSpace)
  • Namespace constants: lightrag/namespace.py (NameSpace class)
  • Workspace implementation: lightrag/kg/json_kv_impl.py:28-39 (JsonKVStorage)
  • PostgreSQL workspace support: lightrag/kg/postgres_impl.py:44-59
  • API server architecture: lightrag/api/lightrag_server.py:1-300
  • Authentication: lightrag/api/auth.py (JWT token management)
  • Config: lightrag/api/config.py:200-220 (workspace argument)
  • Current workspace isolation documented in lightrag/api/README-zh.md:165-173
  • Storage implementations in lightrag/kg/ directory

Next Steps

  1. Review and approve this ADR
  2. Create detailed design documents for each component (see ADR 002-007)
  3. Conduct security review of proposed architecture
  4. Estimate development effort and allocate resources
  5. Create implementation tickets and sprint planning

Document Version: 1.0
Last Updated: 2025-11-20
Author: Architecture Design Process
Status: Proposed - Awaiting Review and Approval