docs: Enterprise Edition & Multi-tenancy attribution (#5)

* Remove outdated documentation files: Quick Start Guide, Apache AGE Analysis, and Scratchpad.

* Add multi-tenant testing strategy and ADR index documentation

- Introduced ADR 008 detailing the multi-tenant testing strategy for the ./starter environment, covering compatibility and multi-tenant modes, testing scenarios, and implementation details.
- Created a comprehensive ADR index (README.md) summarizing all architecture decision records related to the multi-tenant implementation, including purpose, key sections, and reading paths for different roles.

* feat(docs): Add comprehensive multi-tenancy guide and README for LightRAG Enterprise

- Introduced `0008-multi-tenancy.md` detailing multi-tenancy architecture, key concepts, roles, permissions, configuration, and API endpoints.
- Created `README.md` as the main documentation index, outlining features, quick start, system overview, and deployment options.
- Documented the LightRAG architecture, storage backends, LLM integrations, and query modes.
- Established a task log (`2025-01-21-lightrag-documentation-log.md`) summarizing documentation creation actions, decisions, and insights.

2025-12-04 18:09:15 +08:00


ADR 001: Multi-Tenant, Multi-Knowledge-Base Architecture for LightRAG

Status: Proposed

Context

Current State

LightRAG is a retrieval-augmented generation (RAG) system that currently runs as a single instance with basic workspace-level data isolation. The existing architecture uses:

Workspace concept: Directory-based or database-field-based isolation for file/database storage
Single LightRAG instance: One RAG system per server process, configured at startup
Basic authentication: JWT tokens and API key support without tenant/knowledge-base awareness
Shared configuration: All workspaces share the same LLM, embedding, and storage configuration

Limitations of Current Architecture

No true multi-tenancy: Cannot serve multiple independent tenants securely
No knowledge base isolation: All data belongs to a single knowledge base
Shared compute resources: LLM and embedding calls are shared across all workspaces
Static configuration: All tenants must use the same models and settings
Cross-tenant data leak risk: Workspace isolation is not cryptographically enforced
No resource quotas: No limits on storage, compute, or API usage per tenant
Authentication limitations: JWT tokens carry roles but no tenant- or KB-level scope, so access control cannot be fine-grained

Existing Code Evidence

Workspace in base.py: The StorageNameSpace class (line 176) includes a workspace field for basic isolation
Namespace concept: The NameSpace class in namespace.py defines storage categories but has no tenant/KB concept
Storage implementations: Each storage type (PostgreSQL, JSON, Neo4j) implements workspace filtering:
- The PostgreSQLDB constructor accepts a workspace parameter (line 56 in postgres_impl.py)
- JsonKVStorage creates workspace directories (lines 30-39 in json_kv_impl.py)
API configuration: lightrag_server.py accepts a --workspace flag but no tenant/KB parameters
Authentication: auth.py provides JWT tokens with roles but no tenant/KB scoping

Business Requirements

Organizations deploying LightRAG need to:

Serve multiple independent customers (tenants) from a single instance
Support multiple knowledge bases per tenant for different use cases
Enforce complete data isolation between tenants
Manage per-tenant resource quotas and billing
Support per-tenant configuration (models, parameters, API keys)
Provide audit trails and access logs per tenant

Decision

High-Level Architecture

Implement a multi-tenant, multi-knowledge-base (MT-MKB) architecture that:

Adds a tenant abstraction layer above the current workspace concept
Introduces the knowledge base as a first-class entity
Implements tenant-aware routing at the API level
Enforces data isolation through composite keys and access control
Supports per-tenant/KB configuration for models and parameters
Adds role-based access control (RBAC) for fine-grained permissions

Core Design Principles

Backward Compatibility: Existing single-workspace setups continue to work
Layered Isolation: Tenant > Knowledge Base > Document > Chunk/Entity
Zero Trust: All data access requires explicit tenant/KB context
Default Deny: Cross-tenant access is explicitly blocked unless authorized
Audit Trail: All operations logged with tenant/KB context
Resource Aware: Quotas and limits per tenant/KB

Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                 FastAPI Server (Single Instance)                 │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐   │
│  │ API Router      │  │ Auth/Middleware │  │ Request Handler │   │
│  │ Layer           │  │ (Tenant Extract)│  │ Layer           │   │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘   │
│           │                    │                    │            │
│  ┌────────▼────────────────────▼────────────────────▼────────┐   │
│  │        Tenant Context (TenantID + KnowledgeBaseID)        │   │
│  │      Injected via Dependency Injection / Middleware       │   │
│  └────────┬──────────────────────────────────────────────────┘   │
│           │                                                      │
│  ┌────────▼──────────────────────────────────────────────────┐   │
│  │          Tenant-Aware LightRAG Instance Manager           │   │
│  │               (Caches instances per tenant)               │   │
│  └────────┬──────────────────────────────────────────────────┘   │
│           │                                                      │
│  ┌────────▼──────────────────────────────────────────────────┐   │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │   │
│  │  │ Tenant 1      │  │ Tenant 2      │  │ Tenant N      │  │   │
│  │  │ KB1, KB2      │  │ KB1, KB3      │  │ KB1, ...      │  │   │
│  │  └───────────────┘  └───────────────┘  └───────────────┘  │   │
│  │                                                           │   │
│  │    Multiple LightRAG Instances (per tenant or cached)     │   │
│  └────────┬──────────────────────────────────────────────────┘   │
│           │                                                      │
│  ┌────────▼──────────────────────────────────────────────────┐   │
│  │        Storage Access Layer with Tenant Filtering         │   │
│  │          (Adds tenant/KB filters to all queries)          │   │
│  └────────┬──────────────────────────────────────────────────┘   │
│           │                                                      │
│  ┌────────▼──────────────────────────────────────────────────┐   │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │   │
│  │  │ PostgreSQL    │  │ Neo4j         │  │ Redis/Milvus  │  │   │
│  │  │ (Shared DB)   │  │ (Shared)      │  │ (Shared)      │  │   │
│  │  └───────────────┘  └───────────────┘  └───────────────┘  │   │
│  │                                                           │   │
│  │    All queries filtered by tenant/KB at storage layer     │   │
│  └───────────────────────────────────────────────────────────┘   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
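The "Tenant-Aware LightRAG Instance Manager" in the diagram can be sketched as a bounded LRU cache keyed by (tenant_id, kb_id). A size limit of this kind also caps the memory overhead of keeping many instances alive. The sketch below is minimal and rests on several assumptions: `factory` is a hypothetical async callable that builds an initialized RAG instance for a given workspace string, and the class name and eviction policy are illustrative. Neither comes from LightRAG itself.

```python
import asyncio
from collections import OrderedDict
from typing import Any, Awaitable, Callable

# Hypothetical factory: builds (and initializes) one RAG instance for a workspace name.
Factory = Callable[[str], Awaitable[Any]]


class TenantInstanceManager:
    """Caches one instance per (tenant_id, kb_id) and evicts the least recently used."""

    def __init__(self, factory: Factory, max_instances: int = 32) -> None:
        self._factory = factory
        self._max = max_instances
        self._cache: OrderedDict = OrderedDict()  # (tenant_id, kb_id) -> instance
        self._lock = asyncio.Lock()

    async def get(self, tenant_id: str, kb_id: str) -> Any:
        key = (tenant_id, kb_id)
        async with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)  # mark as most recently used
                return self._cache[key]
            # Workspace naming follows this ADR's f"{tenant_id}_{kb_id}" convention.
            instance = await self._factory(f"{tenant_id}_{kb_id}")
            self._cache[key] = instance
            if len(self._cache) > self._max:
                self._cache.popitem(last=False)  # evict the least recently used entry
            return instance
```

Holding the lock while the factory runs keeps the sketch simple and prevents duplicate instances for the same key. The cost is that all instance creation is serialized. A production manager might use per-key locks instead, and it would need to release an evicted instance's storage connections before dropping it.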

Key Components

1. Tenant Model

TenantID: Unique identifier (UUID or slug)
TenantName: Human-readable name
Configuration: Per-tenant LLM, embedding, and rerank model configs
ResourceQuotas: Storage, API calls, concurrent requests limits
CreatedAt/UpdatedAt: Audit timestamps
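The fields listed above map naturally onto Python dataclasses. In this minimal sketch the field names mirror the list. The individual quota fields and their default values are assumptions for illustration only; the ADR does not specify them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class ResourceQuotas:
    # Illustrative limits; real values would come from tenant provisioning.
    max_storage_bytes: int = 10 * 1024**3
    max_api_calls_per_day: int = 100_000
    max_concurrent_requests: int = 10


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Tenant:
    tenant_id: str      # unique identifier (UUID or slug)
    tenant_name: str    # human-readable name
    # Per-tenant model settings, e.g. {"llm": {...}, "embedding": {...}, "rerank": {...}}
    configuration: Dict[str, Any] = field(default_factory=dict)
    resource_quotas: ResourceQuotas = field(default_factory=ResourceQuotas)
    created_at: datetime = field(default_factory=_utc_now)  # audit timestamps
    updated_at: datetime = field(default_factory=_utc_now)
```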

2. Knowledge Base Model

KnowledgeBaseID: Unique within tenant
TenantID: Parent tenant reference
KBName: Display name
Description: Purpose and content overview
Configuration: Per-KB indexing and query parameters
Status: Active/Archived
Metadata: Custom fields for tenant-specific data
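A matching sketch of the knowledge base model follows, under the same caveat that the names are illustrative. The detail it highlights is that KnowledgeBaseID is unique only within a tenant. As a result, any lookup must use the (tenant_id, kb_id) pair rather than kb_id alone.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Tuple


class KBStatus(str, Enum):
    ACTIVE = "active"
    ARCHIVED = "archived"


@dataclass
class KnowledgeBase:
    kb_id: str                  # unique within its tenant, not globally
    tenant_id: str              # parent tenant reference
    kb_name: str                # display name
    description: str = ""       # purpose and content overview
    configuration: Dict[str, Any] = field(default_factory=dict)  # indexing/query params
    status: KBStatus = KBStatus.ACTIVE
    metadata: Dict[str, Any] = field(default_factory=dict)       # tenant-specific fields

    @property
    def key(self) -> Tuple[str, str]:
        # kb_id alone is ambiguous across tenants, so identify a KB by the pair.
        return (self.tenant_id, self.kb_id)
```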

3. Storage Isolation Strategy

All storage operations will include tenant/KB filters:

Document storage: workspace = f"{tenant_id}_{kb_id}"
Vector storage: Add tenant_id and kb_id metadata fields
Graph storage: Store tenant/KB info as node/edge attributes
KV storage: Prefix keys with tenant_id:kb_id:entity_id
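These composite keys are unambiguous only if IDs cannot contain the separator. Without that rule, tenant "acme_eu" with KB "kb" and tenant "acme" with KB "eu_kb" both produce the workspace "acme_eu_kb". The minimal sketch below therefore validates IDs before building keys. The allowed ID format (lowercase letters, digits, and hyphens, at most 63 characters) and the helper names are assumptions for illustration.

```python
import re

# Assumed ID format. Excluding "_" and ":" keeps the composite keys below unambiguous.
_ID_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")


def validate_id(value: str, kind: str) -> str:
    """Reject anything outside the allowed ID format (also blocks parameter injection)."""
    if not _ID_RE.fullmatch(value):
        raise ValueError(f"invalid {kind}: {value!r}")
    return value


def workspace_for(tenant_id: str, kb_id: str) -> str:
    """Document storage: workspace = f"{tenant_id}_{kb_id}"."""
    return f"{validate_id(tenant_id, 'tenant_id')}_{validate_id(kb_id, 'kb_id')}"


def scope_metadata(tenant_id: str, kb_id: str, **fields) -> dict:
    """Vector payloads or graph node/edge attributes carrying tenant_id and kb_id."""
    return {
        "tenant_id": validate_id(tenant_id, "tenant_id"),
        "kb_id": validate_id(kb_id, "kb_id"),
        **fields,
    }


def kv_key(tenant_id: str, kb_id: str, entity_id: str) -> str:
    """KV storage: keys prefixed as tenant_id:kb_id:entity_id."""
    return f"{validate_id(tenant_id, 'tenant_id')}:{validate_id(kb_id, 'kb_id')}:{entity_id}"
```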

4. API Routing

POST   /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/add
GET    /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/{doc_id}
POST   /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/query
GET    /api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/graph
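All four routes carry tenant_id and kb_id in the path, so every handler receives both before touching data. In FastAPI these would simply be declared as path parameters. The framework-free matcher below illustrates the same extraction and is intended only as a sketch: `ROUTES` and `match_route` are hypothetical names, and the templates are assumed to contain no regex metacharacters apart from the `{name}` placeholders.

```python
import re
from typing import Dict, Optional, Tuple

# Route table from above. Each {name} placeholder becomes a named regex group.
ROUTES = [
    ("POST", "/api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/add"),
    ("GET", "/api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/{doc_id}"),
    ("POST", "/api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/query"),
    ("GET", "/api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/graph"),
]


def _compile(template: str) -> "re.Pattern":
    # A placeholder matches exactly one path segment, so IDs can never contain "/".
    return re.compile(re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template))


_COMPILED = [(method, template, _compile(template)) for method, template in ROUTES]


def match_route(method: str, path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (template, path parameters) for the first matching route, else None."""
    for route_method, template, regex in _COMPILED:
        hit = regex.fullmatch(path)
        if route_method == method and hit:
            return template, hit.groupdict()
    return None
```

The extracted tenant_id and kb_id would still need the strict validation listed under Security Considerations before they are used.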

5. Authentication & Authorization

# JWT token payload, written as a Python dict
jwt_payload = {
    "sub": "user_id",                     # User identifier
    "tenant_id": "tenant_uuid",           # Assigned tenant
    "knowledge_base_ids": ["kb1", "kb2"], # Accessible KBs
    "role": "editor",                     # Role within tenant: "admin", "editor" or "viewer"
    "exp": 1234567890,                    # Expiration (Unix timestamp, seconds)
    "permissions": {
        "create_kb": True,
        "delete_documents": True,
        "run_queries": True,
    },
}
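The claims above are enough for a default-deny check on every request. The minimal sketch below assumes the token's signature has already been verified by whatever JWT library the auth module uses. Any missing or mismatched claim denies access. The names `authorize` and `AccessDenied` are hypothetical.

```python
import time
from typing import Any, Mapping, Optional


class AccessDenied(Exception):
    """Raised when a request falls outside the token's tenant/KB scope."""


def authorize(claims: Mapping[str, Any], tenant_id: str, kb_id: str,
              permission: str, now: Optional[float] = None) -> None:
    """Default deny: return only if every check passes, otherwise raise AccessDenied."""
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        raise AccessDenied("token expired")
    if claims.get("tenant_id") != tenant_id:
        raise AccessDenied("cross-tenant access")
    if kb_id not in claims.get("knowledge_base_ids", []):
        raise AccessDenied("knowledge base not granted")
    if claims.get("permissions", {}).get(permission) is not True:
        raise AccessDenied(f"missing permission: {permission}")
```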

6. Dependency Injection for Tenant Context

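# FastAPI dependency to extract and validate tenant context.
# get_auth_token, decode_token and TenantContext are assumed helpers (hypothetical names).
from fastapi import Depends, HTTPException, status

async def get_tenant_context(
    tenant_id: str,
    kb_id: str,
    token: str = Depends(get_auth_token),
) -> TenantContext:
    claims = decode_token(token)  # verifies signature and expiry, raises on failure
    # Default deny: the token must name this tenant and grant this KB
    if claims.get("tenant_id") != tenant_id or kb_id not in claims.get("knowledge_base_ids", []):
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Access denied")
    return TenantContext(tenant_id=tenant_id, kb_id=kb_id, user_id=claims["sub"], role=claims["role"])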

Consequences

Positive

True Multi-Tenancy: Complete data isolation between tenants
Scalability: Support for hundreds of tenants in a single instance
Cost Efficiency: Shared infrastructure reduces per-tenant costs
Flexibility: Per-tenant model and parameter configuration
Security: Fine-grained access control and audit trails
Resource Management: Per-tenant quotas prevent resource abuse
Operational Simplicity: Single instance to manage

Negative/Tradeoffs

Increased Complexity: More code, more testing required (~2-3x development effort)
Performance Overhead: Tenant/KB filtering on every query (~5-10% latency impact)
Storage Overhead: Tenant/KB metadata increases storage footprint (~2-3%)
Operational Complexity: More configuration options, training needed
Breaking Changes: API endpoints change, and migration scripts are required
Backward Compatibility: Existing workspaces need a migration strategy

Security Considerations

Data Isolation: Tenant-aware queries prevent cross-tenant leaks
Authentication: JWT tokens must include tenant scope
Authorization: RBAC prevents unauthorized access to KBs
Audit Trail: All operations logged for compliance
Key Management: Per-tenant API keys need separate management
Potential Vulnerabilities:
- Parameter injection in tenant/KB IDs (mitigate: strict validation)
- JWT token hijacking (mitigate: short expiry, rate limiting)
- Side-channel attacks via timing (mitigate: constant-time comparisons)
- Resource exhaustion (mitigate: quotas and rate limiting)
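Two of the mitigations listed above can be sketched with the standard library alone: constant-time API key verification and per-tenant rate limiting. The sketch makes several assumptions. It stores only a SHA-256 digest of each tenant's key, it uses an in-process token bucket, and the names and parameters are hypothetical. A multi-process deployment would need the bucket state in a shared store.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional, Tuple


def hash_api_key(raw_key: str) -> str:
    # Persist only a digest of each tenant's API key, never the key itself.
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def verify_api_key(presented: str, stored_digest: str) -> bool:
    # compare_digest does not short-circuit on the first differing character,
    # and both hex digests have the same length.
    return hmac.compare_digest(hash_api_key(presented), stored_digest)


class TenantRateLimiter:
    """One token bucket per tenant: refills at `rate` tokens/second, holds up to `capacity`."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self._buckets: Dict[str, Tuple[float, float]] = {}  # tenant_id -> (tokens, last_seen)

    def allow(self, tenant_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(tenant_id, (float(self.capacity), now))
        tokens = min(float(self.capacity), tokens + (now - last) * self.rate)  # refill
        if tokens >= 1:
            self._buckets[tenant_id] = (tokens - 1, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Because each tenant has its own bucket, one tenant exhausting its budget does not affect others. This is the isolation property the quota requirement is after.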

Performance Impact

Query Latency: +5-10% from additional filtering
Storage Size: +2-3% for tenant/KB metadata
Memory Usage: +20-30% from maintaining multiple LightRAG instances
CPU Usage: +10-15% from authentication/authorization checks

Migration Path for Existing Deployments

Phase 1: Deploy with backward compatibility (single tenant = existing workspace)
Phase 2: Provide migration script to convert workspaces to tenants
Phase 3: Support hybrid mode (legacy workspaces + new tenants)
Phase 4: Deprecate workspace mode in favor of tenant mode
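During Phases 1 to 3, one way to support legacy workspaces and new tenants side by side is to resolve the storage workspace from whichever context a request carries. In this minimal sketch, `resolve_workspace` is a hypothetical helper. It assumes that legacy requests carry only the workspace configured at startup via --workspace, where an empty string means no workspace was set.

```python
from typing import Optional


def resolve_workspace(tenant_id: Optional[str] = None,
                      kb_id: Optional[str] = None,
                      legacy_workspace: str = "") -> str:
    """Choose the storage workspace for a request in hybrid mode.

    Tenant-aware requests map to f"{tenant_id}_{kb_id}". Legacy requests keep the
    workspace configured at startup unchanged, so existing data stays reachable
    before any migration script has run.
    """
    if tenant_id and kb_id:
        return f"{tenant_id}_{kb_id}"
    if tenant_id or kb_id:
        # A half-specified context is rejected rather than guessed (default deny).
        raise ValueError("tenant_id and kb_id must be provided together")
    return legacy_workspace
```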

Implementation Plan (Summary)

See 002-implementation-strategy.md for the detailed step-by-step implementation guide.

High-Level Phases

Phase 1 (2-3 weeks): Core infrastructure
- Database schema changes
- Tenant/KB models
- Storage access layer updates
Phase 2 (2-3 weeks): API layer
- Tenant-aware routing
- Request/response models
- Authentication/authorization
Phase 3 (1-2 weeks): LightRAG integration
- Instance manager
- Per-tenant configurations
- Query execution
Phase 4 (1 week): Testing & deployment
- Unit/integration tests
- Migration scripts
- Documentation

Alternatives Considered

1. Separate Database Per Tenant

Approach: Each tenant gets its own database/storage instance
Rejected because:
- Massive operational overhead (n×database connections, backups, upgrades)
- Expensive (n×database licensing)
- Complex to manage tenants across instances
- Makes sharing resources impossible

2. Dedicated Server Instance Per Tenant

Approach: Each tenant runs their own LightRAG instance
Rejected because:
- Massive resource waste (minimum resources per instance)
- Very expensive at scale (n×server costs)
- Difficult to manage and monitor
- Cannot share LLM/embedding infrastructure

3. Simple Workspace Extension

Approach: Just rename "workspace" to "tenant"
Rejected because:
- No knowledge base concept (cannot support multiple KBs per tenant)
- Cannot enforce cross-tenant access prevention
- No RBAC or fine-grained permissions
- Cannot manage per-tenant configuration
- No resource quotas

4. Sharding by Tenant Hash

Approach: Hash tenant ID to determine shard, send queries to correct shard
Rejected because:
- Breaks operational simplicity (multiple instances to manage)
- Rebalancing is complex when shards are added or removed, since changing the shard count remaps tenants to different shards
- Doesn't reduce resource overhead
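The rebalancing cost of hash sharding is easy to quantify with simple modulo placement. Growing from 4 to 5 shards leaves a tenant in place only when h mod 4 equals h mod 5. For a uniform hash h, that happens only when h mod 20 is 0 to 3, so roughly 80% of tenants, along with their stored data, would have to move. The sketch below is illustrative and uses SHA-256 as a stable hash. Consistent hashing reduces this movement but adds its own machinery.

```python
import hashlib
from typing import Sequence


def shard_for(tenant_id: str, num_shards: int) -> int:
    # Stable across processes, unlike Python's salted built-in hash() for str.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(tenant_ids: Sequence[str], old_shards: int, new_shards: int) -> float:
    moved = sum(shard_for(t, old_shards) != shard_for(t, new_shards) for t in tenant_ids)
    return moved / len(tenant_ids)


tenants = [f"tenant-{i}" for i in range(1000)]
print(f"{fraction_moved(tenants, 4, 5):.0%} of tenants change shard going from 4 to 5 shards")
```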

Evidence/References

Code References

Storage base class: lightrag/base.py:176-185 (StorageNameSpace)
Namespace constants: lightrag/namespace.py (NameSpace class)
Workspace implementation: lightrag/kg/json_kv_impl.py:28-39 (JsonKVStorage)
PostgreSQL workspace support: lightrag/kg/postgres_impl.py:44-59
API server architecture: lightrag/api/lightrag_server.py:1-300
Authentication: lightrag/api/auth.py (JWT token management)
Config: lightrag/api/config.py:200-220 (workspace argument)

Current workspace isolation documented in lightrag/api/README-zh.md:165-173
Storage implementations in lightrag/kg/ directory

Next Steps

Review and approve this ADR
Create detailed design documents for each component (see ADR 002-007)
Conduct security review of proposed architecture
Estimate development effort and allocate resources
Create implementation tickets and sprint planning

Document Version: 1.0
Last Updated: 2025-11-20
Author: Architecture Design Process
Status: Proposed - Awaiting Review and Approval
