docs: Enterprise Edition & Multi-tenancy attribution (#5 )

* Remove outdated documentation files: Quick Start Guide, Apache AGE Analysis, and Scratchpad.

* Add multi-tenant testing strategy and ADR index documentation

- Introduced ADR 008 detailing the multi-tenant testing strategy for the ./starter environment, covering compatibility and multi-tenant modes, testing scenarios, and implementation details.
- Created a comprehensive ADR index (README.md) summarizing all architecture decision records related to the multi-tenant implementation, including purpose, key sections, and reading paths for different roles.

* feat(docs): Add comprehensive multi-tenancy guide and README for LightRAG Enterprise

- Introduced `0008-multi-tenancy.md` detailing multi-tenancy architecture, key concepts, roles, permissions, configuration, and API endpoints.
- Created `README.md` as the main documentation index, outlining features, quick start, system overview, and deployment options.
- Documented the LightRAG architecture, storage backends, LLM integrations, and query modes.
- Established a task log (`2025-01-21-lightrag-documentation-log.md`) summarizing documentation creation actions, decisions, and insights.

2025-12-04 18:09:15 +08:00


ADR 003: Data Models and Storage Design

Status: Proposed

Overview

This document details the data models for tenants and knowledge bases, and the storage architecture that keeps each tenant's data fully isolated.

Data Models

1. Core Entity Models

1.1 Tenant Model

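from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional

@dataclass
class Tenant:
    """
    Represents a tenant in the multi-tenant system.
    A tenant is the top-level isolation boundary.
    """
    # Required fields come first: a dataclass field without a default
    # cannot follow a field that has one.
    tenant_id: str  # UUID: e.g., "550e8400-e29b-41d4-a716-446655440000"
    tenant_name: str  # Display name: e.g., "Acme Corp"
    
    # Configuration (quoted because both classes are defined in section 1.3)
    config: "TenantConfig"
    quota: "ResourceQuota"
    
    description: Optional[str] = None  # Free-text description
    
    # Lifecycle
    is_active: bool = True
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    created_by: Optional[str] = None
    updated_by: Optional[str] = None
    
    # Metadata
    metadata: Dict[str, Any] = field(default_factory=dict)
    
    # Statistics
    kb_count: int = 0
    total_documents: int = 0
    total_storage_mb: float = 0.0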

1.2 Knowledge Base Model

@dataclass
class KnowledgeBase:
    """
    Represents a knowledge base within a tenant.
    Contains documents, entities, and relationships for a specific domain.
    """
    kb_id: str  # UUID: e.g., "660e8400-e29b-41d4-a716-446655440000"
    tenant_id: str  # Foreign key to Tenant
    kb_name: str  # Display name: e.g., "Product Documentation"
    # Every field below has a default, as dataclasses require once
    # the first defaulted field appears.
    description: Optional[str] = None
    
    # Status and lifecycle
    is_active: bool = True
    status: str = "ready"  # ready | indexing | error
    
    # Statistics
    document_count: int = 0
    entity_count: int = 0
    relationship_count: int = 0
    chunk_count: int = 0
    storage_used_mb: float = 0.0
    
    # Indexing info
    last_indexed_at: Optional[datetime] = None
    index_version: int = 1
    
    # Configuration (can override tenant defaults; KBConfig is defined in section 1.3)
    config: Optional["KBConfig"] = None
    
    # Timestamps
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    
    # Metadata
    metadata: Dict[str, Any] = field(default_factory=dict)

1.3 Configuration Models

@dataclass
class TenantConfig:
    """Per-tenant model and parameter configuration"""
    # Model selection
    llm_model: str = "gpt-4o-mini"
    embedding_model: str = "bge-m3:latest"
    rerank_model: Optional[str] = None
    
    # LLM parameters
    llm_model_kwargs: Dict[str, Any] = field(default_factory=dict)
    llm_temperature: float = 1.0
    llm_max_tokens: int = 4096
    
    # Embedding parameters
    embedding_dim: int = 1024
    embedding_batch_num: int = 10
    
    # Query defaults
    top_k: int = 40
    chunk_top_k: int = 20
    cosine_threshold: float = 0.2
    enable_llm_cache: bool = True
    enable_rerank: bool = True
    
    # Chunking defaults
    chunk_size: int = 1200
    chunk_overlap: int = 100
    
    # Custom tenant metadata
    custom_metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class KBConfig:
    """Per-knowledge-base configuration (overrides tenant defaults)"""
    # Only include fields that override tenant config
    top_k: Optional[int] = None
    chunk_size: Optional[int] = None
    cosine_threshold: Optional[float] = None
    custom_metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class ResourceQuota:
    """Resource limits for a tenant"""
    max_documents: int = 10000
    max_storage_gb: float = 100.0
    max_concurrent_queries: int = 10
    max_monthly_api_calls: int = 100000
    max_kb_per_tenant: int = 50
    max_entities_per_kb: int = 100000
    max_relationships_per_kb: int = 500000
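
The override relationship between `KBConfig` and `TenantConfig` can be illustrated with a small resolver. This is a minimal sketch, not the project's implementation: `resolve_config` is a hypothetical helper, and the two dataclasses are trimmed copies that keep only the overridable fields.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class TenantConfig:
    # Trimmed to the fields that KBConfig is allowed to override.
    top_k: int = 40
    chunk_size: int = 1200
    cosine_threshold: float = 0.2


@dataclass
class KBConfig:
    top_k: Optional[int] = None
    chunk_size: Optional[int] = None
    cosine_threshold: Optional[float] = None


def resolve_config(tenant_cfg: TenantConfig, kb_cfg: Optional[KBConfig]) -> TenantConfig:
    """Return a copy of tenant_cfg with every non-None KB override applied."""
    if kb_cfg is None:
        return replace(tenant_cfg)
    tenant_fields = {f.name for f in fields(tenant_cfg)}
    overrides = {
        f.name: getattr(kb_cfg, f.name)
        for f in fields(kb_cfg)
        if f.name in tenant_fields and getattr(kb_cfg, f.name) is not None
    }
    return replace(tenant_cfg, **overrides)


effective = resolve_config(TenantConfig(), KBConfig(top_k=10))
print(effective.top_k, effective.chunk_size)
```

Treating `None` as "inherit from tenant" keeps the KB record small, at the cost of not being able to override a value back to `None`.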

1.4 Request Context

from typing import List
from uuid import uuid4

@dataclass
class TenantContext:
    """
    Request-scoped tenant context.
    Injected into all request handlers and passed through the call stack.
    """
    tenant_id: str
    kb_id: str
    user_id: str
    role: str  # admin | editor | viewer | viewer:read-only
    
    # Authorization
    permissions: Dict[str, bool] = field(default_factory=dict)
    knowledge_base_ids: List[str] = field(default_factory=list)  # Accessible KBs
    
    # Request tracking
    request_id: str = field(default_factory=lambda: str(uuid4()))
    ip_address: Optional[str] = None
    user_agent: Optional[str] = None
    
    # Computed properties
    @property
    def workspace_namespace(self) -> str:
        """Backward compatible workspace namespace"""
        return f"{self.tenant_id}_{self.kb_id}"
    
    def can_access_kb(self, kb_id: str) -> bool:
        """Check if user can access specific KB"""
        return kb_id in self.knowledge_base_ids or "*" in self.knowledge_base_ids
    
    def has_permission(self, permission: str) -> bool:
        """Check if user has specific permission"""
        return self.permissions.get(permission, False)
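
A request handler would typically combine the two checks before touching storage. The sketch below assumes a hypothetical `authorize` helper and uses a reduced copy of `TenantContext`; neither the helper name nor the use of `PermissionError` comes from the design above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TenantContext:
    # Reduced copy of the model above, enough for the guard below.
    tenant_id: str
    kb_id: str
    user_id: str
    role: str
    permissions: Dict[str, bool] = field(default_factory=dict)
    knowledge_base_ids: List[str] = field(default_factory=list)

    def can_access_kb(self, kb_id: str) -> bool:
        return kb_id in self.knowledge_base_ids or "*" in self.knowledge_base_ids

    def has_permission(self, permission: str) -> bool:
        return self.permissions.get(permission, False)


def authorize(ctx: TenantContext, permission: str) -> None:
    """Raise PermissionError unless ctx may use `permission` on ctx.kb_id."""
    if not ctx.can_access_kb(ctx.kb_id):
        raise PermissionError(f"KB {ctx.kb_id} is not accessible to user {ctx.user_id}")
    if not ctx.has_permission(permission):
        raise PermissionError(f"missing permission: {permission}")


ctx = TenantContext(
    tenant_id="t-1", kb_id="kb-1", user_id="u-1", role="viewer",
    permissions={"query:run": True}, knowledge_base_ids=["kb-1"],
)
authorize(ctx, "query:run")  # passes silently
```

Checking KB access before the permission keeps the error for an out-of-scope KB distinct from a missing permission.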

Storage Architecture

2. Storage Isolation Strategy

2.1 Composite Key Design

All data items are identified using composite keys that enforce tenant/KB isolation:

<tenant_id>:<kb_id>:<entity_id>

Examples:

Document: acme:prod-docs:doc-12345
Entity: acme:prod-docs:ent-company-apple
Chunk: acme:prod-docs:chunk-doc-12345-001
Relationship: acme:prod-docs:rel-apple-ceo-tim_cook
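
Building and parsing these keys could look like the following. The helper names (`make_key`, `parse_key`, `belongs_to`) are illustrative assumptions, and the sketch rejects `:` inside any part so that a key always splits into exactly three pieces.

```python
from typing import NamedTuple

SEPARATOR = ":"


class CompositeKey(NamedTuple):
    tenant_id: str
    kb_id: str
    item_id: str


def make_key(tenant_id: str, kb_id: str, item_id: str) -> str:
    """Join the three parts, rejecting empty parts and embedded separators."""
    parts = (tenant_id, kb_id, item_id)
    if any(not p or SEPARATOR in p for p in parts):
        raise ValueError(f"invalid composite key parts: {parts!r}")
    return SEPARATOR.join(parts)


def parse_key(key: str) -> CompositeKey:
    """Split a composite key back into its parts."""
    parts = key.split(SEPARATOR)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed composite key: {key!r}")
    return CompositeKey(*parts)


def belongs_to(key: str, tenant_id: str, kb_id: str) -> bool:
    """True when the key is well formed and scoped to the given tenant and KB."""
    try:
        parsed = parse_key(key)
    except ValueError:
        return False
    return parsed.tenant_id == tenant_id and parsed.kb_id == kb_id


print(make_key("acme", "prod-docs", "doc-12345"))
```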

2.2 Storage-Specific Implementation

2.3 PostgreSQL Storage

Schema Design

-- Tenants table
CREATE TABLE tenants (
    tenant_id UUID PRIMARY KEY,
    tenant_name VARCHAR(255) NOT NULL,
    description TEXT,
    llm_model VARCHAR(255) DEFAULT 'gpt-4o-mini',
    embedding_model VARCHAR(255) DEFAULT 'bge-m3:latest',
    rerank_model VARCHAR(255),
    chunk_size INTEGER DEFAULT 1200,
    chunk_overlap INTEGER DEFAULT 100,
    top_k INTEGER DEFAULT 40,
    cosine_threshold FLOAT DEFAULT 0.2,
    max_documents INTEGER DEFAULT 10000,
    max_storage_gb FLOAT DEFAULT 100.0,
    is_active BOOLEAN DEFAULT TRUE,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(255),
    CONSTRAINT valid_tenant_name CHECK (length(tenant_name) > 0)
);

-- Knowledge bases table
CREATE TABLE knowledge_bases (
    kb_id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id) ON DELETE CASCADE,
    kb_name VARCHAR(255) NOT NULL,
    description TEXT,
    document_count INTEGER DEFAULT 0,  -- matches KnowledgeBase.document_count
    entity_count INTEGER DEFAULT 0,
    relationship_count INTEGER DEFAULT 0,
    chunk_count INTEGER DEFAULT 0,
    storage_used_mb FLOAT DEFAULT 0.0,
    is_active BOOLEAN DEFAULT TRUE,
    status VARCHAR(50) DEFAULT 'ready',
    last_indexed_at TIMESTAMP,
    index_version INTEGER DEFAULT 1,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(255),
    UNIQUE(tenant_id, kb_name),
    CONSTRAINT valid_kb_name CHECK (length(kb_name) > 0)
);

-- Documents table (updated with tenant/kb)
CREATE TABLE documents (
    doc_id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    kb_id UUID NOT NULL REFERENCES knowledge_bases(kb_id),
    doc_name VARCHAR(255) NOT NULL,
    doc_path TEXT,
    file_type VARCHAR(50),
    file_size INTEGER,
    chunk_count INTEGER DEFAULT 0,
    content_hash VARCHAR(64),  -- SHA256 for deduplication
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(255),
    CONSTRAINT fk_tenant_kb UNIQUE (tenant_id, kb_id, doc_id)
);

-- Chunks table (text chunks with tenant/kb filtering)
CREATE TABLE chunks (
    chunk_id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    kb_id UUID NOT NULL REFERENCES knowledge_bases(kb_id),
    doc_id UUID NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
    chunk_index INTEGER,
    content TEXT NOT NULL,
    token_count INTEGER,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_tenant_kb_chunk UNIQUE (tenant_id, kb_id, chunk_id)
);

-- Entities table (knowledge graph entities)
CREATE TABLE entities (
    entity_id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    kb_id UUID NOT NULL REFERENCES knowledge_bases(kb_id),
    entity_name VARCHAR(500) NOT NULL,
    entity_type VARCHAR(100),
    description TEXT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_tenant_kb_entity UNIQUE (tenant_id, kb_id, entity_id)
);

-- Relationships table (knowledge graph relationships)
CREATE TABLE relationships (
    rel_id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    kb_id UUID NOT NULL REFERENCES knowledge_bases(kb_id),
    source_entity_id UUID NOT NULL REFERENCES entities(entity_id) ON DELETE CASCADE,
    target_entity_id UUID NOT NULL REFERENCES entities(entity_id) ON DELETE CASCADE,
    relation_type VARCHAR(100) NOT NULL,
    description TEXT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_tenant_kb_rel UNIQUE (tenant_id, kb_id, rel_id)
);

-- Vector embeddings table
CREATE TABLE vector_embeddings (
    vector_id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    kb_id UUID NOT NULL REFERENCES knowledge_bases(kb_id),
    entity_id UUID NOT NULL REFERENCES entities(entity_id) ON DELETE CASCADE,
    embedding vector(1024),  -- pgvector extension required
    embedding_model VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_tenant_kb_vector UNIQUE (tenant_id, kb_id, vector_id)
);

-- Create indexes for tenant/kb filtering on all tables
CREATE INDEX idx_documents_tenant_kb ON documents(tenant_id, kb_id);
CREATE INDEX idx_chunks_tenant_kb ON chunks(tenant_id, kb_id, doc_id);
CREATE INDEX idx_entities_tenant_kb ON entities(tenant_id, kb_id);
CREATE INDEX idx_relationships_tenant_kb ON relationships(tenant_id, kb_id);
CREATE INDEX idx_vectors_tenant_kb ON vector_embeddings(tenant_id, kb_id);

-- Full-text search index
CREATE INDEX idx_chunks_fts ON chunks USING GIN(to_tsvector('english', content));

-- Composite indexes for common queries
CREATE INDEX idx_docs_tenant_active ON documents(tenant_id, kb_id, is_active);
CREATE INDEX idx_entities_tenant_type ON entities(tenant_id, kb_id, entity_type);
CREATE INDEX idx_rel_tenant_source ON relationships(tenant_id, kb_id, source_entity_id);

Query Examples

-- Get all documents for a tenant/KB
SELECT * FROM documents 
WHERE tenant_id = $1 AND kb_id = $2 AND is_active = true;

-- Get all chunks for a document (with tenant isolation)
SELECT * FROM chunks 
WHERE tenant_id = $1 AND kb_id = $2 AND doc_id = $3
ORDER BY chunk_index;

-- Search entities by name and type (tenant-scoped)
SELECT * FROM entities 
WHERE tenant_id = $1 AND kb_id = $2 
AND entity_name ILIKE '%' || $3 || '%'
AND entity_type = $4;

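-- Find related chunks for an entity (tenant-scoped)
-- Note: chunk_entity_links is not defined in the schema above; it is assumed
-- to be a link table with tenant_id, kb_id, chunk_id, and entity_id columns.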
SELECT DISTINCT c.* FROM chunks c
WHERE c.tenant_id = $1 AND c.kb_id = $2
AND c.chunk_id IN (
    SELECT chunk_id FROM chunk_entity_links
    WHERE tenant_id = $1 AND kb_id = $2
    AND entity_id = $3
);
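
Since every query above repeats the same `tenant_id = $1 AND kb_id = $2` prefix, one way to avoid forgetting it is to generate the statement centrally. The function below is a hypothetical sketch that only builds the SQL text and parameter list; executing it with a PostgreSQL driver is left out, and the table whitelist simply mirrors the schema above.

```python
from typing import Any, Dict, List, Optional, Tuple

# Tables from the schema above that carry tenant_id and kb_id columns.
SCOPED_TABLES = {"documents", "chunks", "entities", "relationships", "vector_embeddings"}


def scoped_select(
    table: str,
    tenant_id: str,
    kb_id: str,
    filters: Optional[Dict[str, Any]] = None,
) -> Tuple[str, List[Any]]:
    """Build a SELECT that always filters on tenant_id and kb_id first."""
    if table not in SCOPED_TABLES:
        raise ValueError(f"table is not tenant-scoped: {table!r}")
    clauses = ["tenant_id = $1", "kb_id = $2"]
    params: List[Any] = [tenant_id, kb_id]
    for column, value in (filters or {}).items():
        # Column names are interpolated, so only plain identifiers are accepted.
        if not column.isidentifier():
            raise ValueError(f"invalid column name: {column!r}")
        params.append(value)
        clauses.append(f"{column} = ${len(params)}")
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses), params


sql, params = scoped_select("documents", "tenant-uuid", "kb-uuid", {"is_active": True})
print(sql)
print(params)
```

Values always travel as numbered parameters; only the table and column names, which are validated, are placed into the SQL text.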

2.4 Neo4j Storage

Schema Design

// Tenant node
CREATE CONSTRAINT unique_tenant_id IF NOT EXISTS
  FOR (t:Tenant) REQUIRE t.tenant_id IS UNIQUE;

// Knowledge base node
CREATE CONSTRAINT unique_kb_id IF NOT EXISTS
  FOR (k:KnowledgeBase) REQUIRE k.kb_id IS UNIQUE;

// Entity node with tenant/kb scope
CREATE CONSTRAINT unique_entity IF NOT EXISTS
  FOR (e:Entity) REQUIRE (e.tenant_id, e.kb_id, e.entity_id) IS UNIQUE;

// Create nodes with tenant/kb properties
CREATE (t:Tenant {
  tenant_id: 'tenant-uuid',
  tenant_name: 'Acme Corp',
  created_at: timestamp()
});

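// MATCH the existing tenant first. Repeating the full node pattern inside
// CREATE would try to create a second Tenant node and violate unique_tenant_id.
MATCH (t:Tenant {tenant_id: 'tenant-uuid'})
CREATE (kb:KnowledgeBase {
  kb_id: 'kb-uuid',
  tenant_id: 'tenant-uuid',
  kb_name: 'Product Docs',
  created_at: timestamp()
})-[:BELONGS_TO]->(t);

// Entity with tenant/kb scope, attached to the existing knowledge base
MATCH (kb:KnowledgeBase {kb_id: 'kb-uuid'})
CREATE (e:Entity {
  entity_id: 'entity-uuid',
  tenant_id: 'tenant-uuid',
  kb_id: 'kb-uuid',
  name: 'Apple Inc',
  type: 'Organization'
})-[:IN_KB]->(kb);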

Query Examples

// Get all entities in a KB
MATCH (e:Entity {tenant_id: $tenant_id, kb_id: $kb_id})
RETURN e;

// Get entities connected to another entity (tenant-scoped)
MATCH (e1:Entity {tenant_id: $tenant_id, kb_id: $kb_id, entity_id: $entity_id})
-[r:RELATES_TO]-
(e2:Entity {tenant_id: $tenant_id, kb_id: $kb_id})
RETURN e1, r, e2;

// Prevent cross-tenant queries
MATCH (e:Entity)
WHERE e.tenant_id = $tenant_id AND e.kb_id = $kb_id
RETURN e;

// Enforce scope in relationship queries
MATCH (e1:Entity {tenant_id: $tenant_id, kb_id: $kb_id})
-[r:RELATES_TO]->
(e2:Entity {tenant_id: $tenant_id, kb_id: $kb_id})
RETURN e1, r, e2;
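
On the application side, a thin wrapper can refuse to run Cypher that never mentions the scoping parameters. This is only a sketch with a hypothetical `scoped_cypher` helper: the check is textual, so it catches a forgotten filter but does not prove that every node pattern in the query is scoped.

```python
import re
from typing import Any, Dict, Tuple


def scoped_cypher(
    query: str, tenant_id: str, kb_id: str, **extra: Any
) -> Tuple[str, Dict[str, Any]]:
    """Return (query, params), rejecting queries that omit $tenant_id or $kb_id."""
    for name in ("tenant_id", "kb_id"):
        # \b stops $kb_id from matching a longer name such as $kb_id_list.
        if not re.search(r"\$" + name + r"\b", query):
            raise ValueError(f"query does not reference ${name}")
    params = {"tenant_id": tenant_id, "kb_id": kb_id, **extra}
    return query, params


query, params = scoped_cypher(
    "MATCH (e:Entity {tenant_id: $tenant_id, kb_id: $kb_id, entity_id: $entity_id}) RETURN e",
    "tenant-uuid",
    "kb-uuid",
    entity_id="entity-uuid",
)
print(params)
```

The returned pair can then be handed to whichever Neo4j driver call the storage layer uses.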

2.5 Vector Database Storage (Milvus/Qdrant)

Collection Schema

# Milvus collection
collection_schema = {
    "fields": [
        {"name": "id", "type": "VARCHAR", "params": {"max_length": 512}},
        {"name": "tenant_id", "type": "VARCHAR", "params": {"max_length": 36}},
        {"name": "kb_id", "type": "VARCHAR", "params": {"max_length": 36}},
        {"name": "entity_id", "type": "VARCHAR", "params": {"max_length": 512}},
        {"name": "entity_type", "type": "VARCHAR", "params": {"max_length": 100}},
        {"name": "embedding", "type": "FLOAT_VECTOR", "params": {"dim": 1024}},
        {"name": "text", "type": "VARCHAR", "params": {"max_length": 4096}},
        {"name": "metadata", "type": "JSON"},
        {"name": "created_at", "type": "INT64"},
    ],
    "primary_field": "id",
    "vector_field": "embedding"
}

# Create index with tenant/kb partitioning
index_params = {
    "metric_type": "L2",  # or "IP" for inner product
    "index_type": "HNSW",
    "params": {"efConstruction": 200, "M": 16}
}

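# Partition by tenant/KB pair for better performance.
# Partition names may only contain letters, digits, and underscores,
# so hyphens in the UUIDs are replaced and a letter prefix is added.
partition_name = "p_" + f"{tenant_id}_{kb_id}".replace("-", "_")
collection.create_partition(partition_name=partition_name)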

Query Examples

# Search with tenant/kb filter
# The documented logical operators in Milvus expressions are `and` / `&&`.
expr = f'tenant_id == "{tenant_id}" and kb_id == "{kb_id}"'
results = collection.search(
    data=[query_embedding],  # search() expects a list of query vectors
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 100}},
    limit=10,
    expr=expr,
    output_fields=["entity_id", "text", "metadata"]
)

# Prevent cross-tenant queries
# Always include tenant/kb filter in expr
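
Because the filter expression is assembled by string formatting, the identifiers should be validated before they are interpolated. The helpers below are a hypothetical sketch that assumes both IDs are UUIDs, as the data models state; they produce only the expression and partition name and do not call Milvus.

```python
from uuid import UUID


def _canonical_uuid(value: str) -> str:
    """Return the canonical hyphenated form; UUID() raises ValueError on bad input."""
    return str(UUID(value))


def tenant_filter(tenant_id: str, kb_id: str) -> str:
    """Build the boolean filter expression for a tenant/KB pair."""
    return (
        f'tenant_id == "{_canonical_uuid(tenant_id)}" '
        f'and kb_id == "{_canonical_uuid(kb_id)}"'
    )


def partition_name(tenant_id: str, kb_id: str) -> str:
    """Build a partition name from letters, digits, and underscores only."""
    return f"p_{UUID(tenant_id).hex}_{UUID(kb_id).hex}"


tenant = "550e8400-e29b-41d4-a716-446655440000"
kb = "660e8400-e29b-41d4-a716-446655440000"
print(tenant_filter(tenant, kb))
print(partition_name(tenant, kb))
```

Parsing through `UUID` means a value containing quotes or operators is rejected before it can alter the expression.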

Access Control Lists (ACL)

3.1 Role Definitions

from enum import Enum

class Role(str, Enum):
    ADMIN = "admin"           # Full control
    EDITOR = "editor"         # Create/update/delete documents and KBs
    VIEWER = "viewer"         # Query and read-only access
    VIEWER_READONLY = "viewer:read-only"  # Query access only

class Permission(str, Enum):
    # Tenant-level permissions
    MANAGE_TENANT = "tenant:manage"
    MANAGE_MEMBERS = "tenant:manage_members"
    MANAGE_BILLING = "tenant:manage_billing"
    
    # KB-level permissions
    CREATE_KB = "kb:create"
    DELETE_KB = "kb:delete"
    MANAGE_KB = "kb:manage"
    
    # Document-level permissions
    CREATE_DOCUMENT = "document:create"
    UPDATE_DOCUMENT = "document:update"
    DELETE_DOCUMENT = "document:delete"
    READ_DOCUMENT = "document:read"
    
    # Query permissions
    RUN_QUERY = "query:run"
    ACCESS_KB = "kb:access"

ROLE_PERMISSIONS = {
    Role.ADMIN: list(Permission),  # every permission, as enum members like the other roles
    Role.EDITOR: [
        Permission.CREATE_KB,
        Permission.DELETE_KB,
        Permission.CREATE_DOCUMENT,
        Permission.UPDATE_DOCUMENT,
        Permission.DELETE_DOCUMENT,
        Permission.READ_DOCUMENT,
        Permission.RUN_QUERY,
        Permission.ACCESS_KB,
    ],
    Role.VIEWER: [
        Permission.READ_DOCUMENT,
        Permission.RUN_QUERY,
        Permission.ACCESS_KB,
    ],
    Role.VIEWER_READONLY: [
        Permission.RUN_QUERY,
        Permission.ACCESS_KB,
    ],
}
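
The role table can be expanded into the flat `permission -> bool` mapping that `TenantContext.permissions` expects. The sketch below uses an abbreviated copy of the enums and a hypothetical `permission_claims` helper; the real mapping would cover every member of `Permission`.

```python
from enum import Enum
from typing import Dict, List


class Role(str, Enum):
    ADMIN = "admin"
    VIEWER = "viewer"


class Permission(str, Enum):
    # Abbreviated copy of the enum above, for illustration only.
    CREATE_KB = "kb:create"
    READ_DOCUMENT = "document:read"
    RUN_QUERY = "query:run"
    ACCESS_KB = "kb:access"


ROLE_PERMISSIONS: Dict[Role, List[Permission]] = {
    Role.ADMIN: list(Permission),
    Role.VIEWER: [Permission.READ_DOCUMENT, Permission.RUN_QUERY, Permission.ACCESS_KB],
}


def permission_claims(role: Role) -> Dict[str, bool]:
    """Map every permission string to True or False for the given role."""
    granted = set(ROLE_PERMISSIONS.get(role, []))
    return {p.value: (p in granted) for p in Permission}


print(permission_claims(Role.VIEWER))
```

Emitting explicit `False` entries is a design choice; omitting them would also work, because `has_permission` defaults to `False` for unknown keys.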

3.2 JWT Token Payload with Permissions

{
    "sub": "user-123",
    "tenant_id": "acme-corp",
    "knowledge_base_ids": ["kb-1", "kb-2"],  # Accessible KBs
    "role": "admin",  # or editor, viewer
    "permissions": {
        "kb:create": true,
        "kb:delete": true,
        "document:create": true,
        "query:run": true,
        ...
    },
    "exp": 1703123456,
    "iat": 1703100000,
    "iss": "lightrag-server",
    "metadata": {
        "department": "engineering",
        "cost_center": "cc-123"
    }
}
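
Once a JWT library has verified the signature and decoded the token, the claims still need to be checked and mapped onto the request context. The function below is a sketch under that assumption: `context_fields_from_claims` is a hypothetical name, it performs no signature verification itself, and it returns plain keyword arguments rather than constructing `TenantContext` directly.

```python
import time
from typing import Any, Dict, Mapping, Optional


def context_fields_from_claims(
    claims: Mapping[str, Any], kb_id: str, now: Optional[float] = None
) -> Dict[str, Any]:
    """Validate decoded claims and return keyword arguments for TenantContext."""
    now = time.time() if now is None else now
    for required in ("sub", "tenant_id", "role", "exp"):
        if required not in claims:
            raise ValueError(f"missing claim: {required}")
    if claims["exp"] <= now:
        raise PermissionError("token expired")
    allowed = list(claims.get("knowledge_base_ids", []))
    if kb_id not in allowed and "*" not in allowed:
        raise PermissionError(f"KB not granted by token: {kb_id}")
    return {
        "tenant_id": claims["tenant_id"],
        "kb_id": kb_id,
        "user_id": claims["sub"],
        "role": claims["role"],
        "permissions": dict(claims.get("permissions", {})),
        "knowledge_base_ids": allowed,
    }


claims = {
    "sub": "user-123",
    "tenant_id": "acme-corp",
    "knowledge_base_ids": ["kb-1", "kb-2"],
    "role": "admin",
    "permissions": {"kb:create": True, "query:run": True},
    "exp": 1703123456,
}
print(context_fields_from_claims(claims, "kb-1", now=1703100000))
```

The `now` parameter exists only to make the expiry check testable with a fixed clock.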

Backward Compatibility

4.1 Legacy Workspace to Tenant Migration

For existing single-workspace deployments:

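Auto-create the tenant on startup if it does not exist:

async def initialize_tenant_from_workspace(workspace: str) -> Tenant:
    """Create tenant from legacy workspace name"""
    # Note: a legacy workspace name is not a UUID, so it will not pass
    # TenantValidator.validate_tenant_id from section 5.1 as written.
    tenant_id = workspace if workspace else "default"
    tenant = Tenant(
        tenant_id=tenant_id,
        tenant_name=workspace or "default",
        config=TenantConfig(),  # Tenant declares config and quota without defaults,
        quota=ResourceQuota(),  # so the section 1.3 defaults are passed explicitly
        metadata={"legacy_workspace": True}
    )
    return tenant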

Transparent workspace → tenant mapping:

def get_workspace_namespace(tenant_id: str, kb_id: str) -> str:
    """Backward compatible workspace string"""
    return f"{tenant_id}_{kb_id}"

A migration script is provided to convert existing data.

Data Validation & Constraints

5.1 Validation Rules

from uuid import UUID

def _is_uuid(value: str) -> bool:
    """Return True if value parses as a UUID (UUID() raises on bad input)."""
    try:
        UUID(value)
    except (ValueError, TypeError, AttributeError):
        return False
    return True

class TenantValidator:
    @staticmethod
    def validate_tenant_id(tenant_id: str) -> bool:
        """Validate tenant ID format (UUID)"""
        return _is_uuid(tenant_id)
    
    @staticmethod
    def validate_tenant_name(name: str) -> bool:
        """Validate tenant name"""
        return 1 <= len(name) <= 255

class KBValidator:
    @staticmethod
    def validate_kb_id(kb_id: str) -> bool:
        """Validate KB ID format"""
        return _is_uuid(kb_id)
    
    @staticmethod
    def validate_kb_name(name: str, tenant_id: str) -> bool:
        """Validate KB name is unique within tenant"""
        # Placeholder: needs a uniqueness check against the database
        raise NotImplementedError("KB name uniqueness check requires database access")

class EntityValidator:
    @staticmethod
    def validate_entity_id(entity_id: str, tenant_id: str, kb_id: str) -> bool:
        """Validate entity belongs to tenant/KB"""
        # Parse composite key of the form <tenant_id>:<kb_id>:<entity_id>
        parts = entity_id.split(':')
        return len(parts) == 3 and parts[0] == tenant_id and parts[1] == kb_id

Summary Table

| Component | Single-Tenant | Multi-Tenant |
|---|---|---|
| Isolation Boundary | Workspace | Tenant + KB |
| Data Sharing | N/A | Cross-KB within tenant possible |
| Configuration | Global | Per-tenant + per-KB |
| Storage Model | Shared | Tenant-scoped queries |
| Authentication | Simple JWT | Tenant-aware JWT |
| Complexity | Low | Medium |
| Performance | Baseline | +5-10% overhead |

Document Version: 1.0
Last Updated: 2025-11-20
Related Files: 002-implementation-strategy.md, 004-api-design.md
