# ADR 002: Implementation Strategy - Multi-Tenant, Multi-Knowledge-Base Architecture

## Status: Proposed

## Overview
This document provides a detailed, step-by-step implementation strategy for the multi-tenant, multi-knowledge-base (MT-MKB) architecture. It includes specific code changes, file modifications, new components, and testing strategies.

## Phase 1: Core Infrastructure (Weeks 1-3)

### 1.1 Database Schema Changes

#### Files to Create/Modify
- **New**: `lightrag/models/tenant.py` - Tenant and KnowledgeBase models
- **New**: `lightrag/models/__init__.py` - Model exports
- **Modify**: All storage implementations (PostgreSQL, Neo4j, MongoDB, etc.)

#### 1.1.1 Tenant and KnowledgeBase Models

**File**: `lightrag/models/tenant.py`
```python
from dataclasses import dataclass, field
from typing import Optional, Dict, Any
from datetime import datetime
from uuid import uuid4

@dataclass
class ResourceQuota:
    """Resource limits for a tenant"""
    max_documents: int = 10000
    max_storage_gb: float = 100.0
    max_concurrent_queries: int = 10
    max_monthly_api_calls: int = 100000
    max_kb_per_tenant: int = 50

@dataclass
class TenantConfig:
    """Per-tenant configuration for models and parameters"""
    llm_model: str = "gpt-4o-mini"
    embedding_model: str = "bge-m3:latest"
    rerank_model: Optional[str] = None
    chunk_size: int = 1200
    chunk_overlap: int = 100
    top_k: int = 40
    cosine_threshold: float = 0.2
    enable_llm_cache: bool = True
    custom_metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Tenant:
    """Tenant representation"""
    tenant_id: str = field(default_factory=lambda: str(uuid4()))
    tenant_name: str = ""
    description: Optional[str] = None
    config: TenantConfig = field(default_factory=TenantConfig)
    quota: ResourceQuota = field(default_factory=ResourceQuota)
    is_active: bool = True
    created_at: datetime = field(default_factory=datetime.utcnow)
    updated_at: datetime = field(default_factory=datetime.utcnow)
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class KnowledgeBase:
    """Knowledge Base representation"""
    kb_id: str = field(default_factory=lambda: str(uuid4()))
    tenant_id: str = ""  # Foreign key to Tenant
    kb_name: str = ""
    description: Optional[str] = None
    is_active: bool = True
    doc_count: int = 0
    storage_used_mb: float = 0.0
    last_indexed_at: Optional[datetime] = None
    created_at: datetime = field(default_factory=datetime.utcnow)
    updated_at: datetime = field(default_factory=datetime.utcnow)
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class TenantContext:
    """Request-scoped tenant context"""
    tenant_id: str
    kb_id: str
    user_id: str
    role: str  # admin, editor, viewer
    permissions: Dict[str, bool] = field(default_factory=dict)
    
    @property
    def workspace_namespace(self) -> str:
        """Backward compatible workspace namespace"""
        return f"{self.tenant_id}_{self.kb_id}"
```

#### 1.1.2 PostgreSQL Schema Migration

**File**: `lightrag/kg/migrations/001_add_tenant_schema.sql`
```sql
-- Create tenants table
CREATE TABLE IF NOT EXISTS tenants (
    tenant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_name VARCHAR(255) NOT NULL,
    description TEXT,
    llm_model VARCHAR(255) DEFAULT 'gpt-4o-mini',
    embedding_model VARCHAR(255) DEFAULT 'bge-m3:latest',
    rerank_model VARCHAR(255),
    chunk_size INTEGER DEFAULT 1200,
    chunk_overlap INTEGER DEFAULT 100,
    top_k INTEGER DEFAULT 40,
    cosine_threshold FLOAT DEFAULT 0.2,
    enable_llm_cache BOOLEAN DEFAULT TRUE,
    max_documents INTEGER DEFAULT 10000,
    max_storage_gb FLOAT DEFAULT 100.0,
    max_concurrent_queries INTEGER DEFAULT 10,
    max_monthly_api_calls INTEGER DEFAULT 100000,
    is_active BOOLEAN DEFAULT TRUE,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(255),
    updated_by VARCHAR(255)
);

-- Create knowledge_bases table
CREATE TABLE IF NOT EXISTS knowledge_bases (
    kb_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id) ON DELETE CASCADE,
    kb_name VARCHAR(255) NOT NULL,
    description TEXT,
    doc_count INTEGER DEFAULT 0,
    storage_used_mb FLOAT DEFAULT 0.0,
    is_active BOOLEAN DEFAULT TRUE,
    last_indexed_at TIMESTAMP,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(255),
    updated_by VARCHAR(255),
    UNIQUE(tenant_id, kb_name),
    INDEX idx_tenant_kb (tenant_id, kb_id)
);

-- Create api_keys table (for per-tenant API keys)
CREATE TABLE IF NOT EXISTS api_keys (
    api_key_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id) ON DELETE CASCADE,
    key_name VARCHAR(255) NOT NULL,
    hashed_key VARCHAR(255) NOT NULL UNIQUE,
    knowledge_base_ids UUID[] DEFAULT '{}',  -- NULL = all KBs
    permissions TEXT[] DEFAULT ARRAY['query', 'document:read'],
    is_active BOOLEAN DEFAULT TRUE,
    last_used_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP,
    created_by VARCHAR(255)
);

-- Add tenant/kb columns to existing tables with defaults for backward compatibility
ALTER TABLE IF EXISTS kv_store_full_docs 
ADD COLUMN IF NOT EXISTS tenant_id UUID DEFAULT NULL,
ADD COLUMN IF NOT EXISTS kb_id UUID DEFAULT NULL;

ALTER TABLE IF EXISTS kv_store_text_chunks 
ADD COLUMN IF NOT EXISTS tenant_id UUID DEFAULT NULL,
ADD COLUMN IF NOT EXISTS kb_id UUID DEFAULT NULL;

ALTER TABLE IF EXISTS vector_store_entities 
ADD COLUMN IF NOT EXISTS tenant_id UUID DEFAULT NULL,
ADD COLUMN IF NOT EXISTS kb_id UUID DEFAULT NULL;

-- Create indexes for tenant/kb filtering
CREATE INDEX IF NOT EXISTS idx_kv_store_tenant_kb ON kv_store_full_docs(tenant_id, kb_id);
CREATE INDEX IF NOT EXISTS idx_chunks_tenant_kb ON kv_store_text_chunks(tenant_id, kb_id);
CREATE INDEX IF NOT EXISTS idx_vectors_tenant_kb ON vector_store_entities(tenant_id, kb_id);
```

#### 1.1.3 MongoDB Schema

**File**: `lightrag/kg/migrations/mongo_001_add_tenant_collections.py`
```python
from typing import Any
import motor.motor_asyncio  # type: ignore

async def migrate_add_tenant_collections(client: motor.motor_asyncio.AsyncMotorClient):
    """Add tenant and knowledge base collections to MongoDB"""
    db = client.lightrag
    
    # Create tenants collection with schema validation
    await db.create_collection("tenants", validator={
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["tenant_id", "tenant_name", "created_at"],
            "properties": {
                "tenant_id": {"bsonType": "string"},
                "tenant_name": {"bsonType": "string"},
                "description": {"bsonType": "string"},
                "llm_model": {"bsonType": "string", "default": "gpt-4o-mini"},
                "embedding_model": {"bsonType": "string", "default": "bge-m3:latest"},
                "is_active": {"bsonType": "bool", "default": True},
                "metadata": {"bsonType": "object"},
                "created_at": {"bsonType": "date"},
                "updated_at": {"bsonType": "date"},
            }
        }
    })
    
    # Create knowledge_bases collection
    await db.create_collection("knowledge_bases", validator={
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["kb_id", "tenant_id", "kb_name"],
            "properties": {
                "kb_id": {"bsonType": "string"},
                "tenant_id": {"bsonType": "string"},
                "kb_name": {"bsonType": "string"},
                "description": {"bsonType": "string"},
                "is_active": {"bsonType": "bool", "default": True},
                "metadata": {"bsonType": "object"},
                "created_at": {"bsonType": "date"},
            }
        }
    })
    
    # Create indexes
    await db.tenants.create_index("tenant_id", unique=True)
    await db.knowledge_bases.create_index([("tenant_id", 1), ("kb_id", 1)], unique=True)
    await db.knowledge_bases.create_index([("tenant_id", 1)])
    
    # Add tenant_id and kb_id indexes to existing collections
    for collection_name in ["documents", "chunks", "entities"]:
        col = db[collection_name]
        await col.create_index([("tenant_id", 1), ("kb_id", 1)])
```

### 1.2 Create Tenant Management Service

**File**: `lightrag/services/tenant_service.py`
```python
from typing import Optional, List, Dict, Any
from uuid import UUID
from lightrag.models.tenant import Tenant, KnowledgeBase, TenantContext, TenantConfig
from lightrag.base import BaseKVStorage

class TenantService:
    """Service for managing tenants and knowledge bases"""
    
    def __init__(self, kv_storage: BaseKVStorage):
        self.kv_storage = kv_storage
        self.tenant_namespace = "__tenants__"
        self.kb_namespace = "__knowledge_bases__"
    
    async def create_tenant(self, tenant_name: str, config: Optional[TenantConfig] = None) -> Tenant:
        """Create a new tenant"""
        tenant = Tenant(tenant_name=tenant_name, config=config or TenantConfig())
        await self.kv_storage.upsert({
            f"{self.tenant_namespace}:{tenant.tenant_id}": {
                "id": tenant.tenant_id,
                "name": tenant.tenant_name,
                "config": asdict(tenant.config),
                "quota": asdict(tenant.quota),
                "is_active": tenant.is_active,
                "created_at": tenant.created_at.isoformat(),
                "updated_at": tenant.updated_at.isoformat(),
            }
        })
        return tenant
    
    async def get_tenant(self, tenant_id: str) -> Optional[Tenant]:
        """Retrieve a tenant by ID"""
        data = await self.kv_storage.get_by_id(f"{self.tenant_namespace}:{tenant_id}")
        if not data:
            return None
        return self._deserialize_tenant(data)
    
    async def create_knowledge_base(self, tenant_id: str, kb_name: str, description: Optional[str] = None) -> KnowledgeBase:
        """Create a new knowledge base for a tenant"""
        # Verify tenant exists
        tenant = await self.get_tenant(tenant_id)
        if not tenant:
            raise ValueError(f"Tenant {tenant_id} not found")
        
        kb = KnowledgeBase(
            tenant_id=tenant_id,
            kb_name=kb_name,
            description=description
        )
        await self.kv_storage.upsert({
            f"{self.kb_namespace}:{tenant_id}:{kb.kb_id}": {
                "id": kb.kb_id,
                "tenant_id": kb.tenant_id,
                "kb_name": kb.kb_name,
                "description": kb.description,
                "is_active": kb.is_active,
                "created_at": kb.created_at.isoformat(),
            }
        })
        return kb
    
    async def list_knowledge_bases(self, tenant_id: str) -> List[KnowledgeBase]:
        """List all knowledge bases for a tenant"""
        # Implementation depends on storage backend
        pass
    
    def _deserialize_tenant(self, data: Dict[str, Any]) -> Tenant:
        """Convert stored data to Tenant object"""
        pass
```

### 1.3 Update Storage Base Classes

**File**: `lightrag/base.py` (Modifications)

Add tenant context to all StorageNameSpace classes:
```python
@dataclass
class StorageNameSpace(ABC):
    namespace: str
    workspace: str  # Keep for backward compatibility
    global_config: dict[str, Any]
    tenant_id: Optional[str] = None  # NEW
    kb_id: Optional[str] = None      # NEW

    async def initialize(self):
        """Initialize the storage"""
        pass

    # Helper method to build composite workspace key
    def _get_composite_workspace(self) -> str:
        """Build workspace key with tenant/kb isolation"""
        if self.tenant_id and self.kb_id:
            return f"{self.tenant_id}_{self.kb_id}"
        elif self.workspace:
            return self.workspace
        else:
            return "_"  # Default for backward compatibility
```

### 1.4 Update Storage Implementations

#### PostgreSQL Storage Update

**File**: `lightrag/kg/postgres_impl.py` (Key modifications)

```python
# Modify all queries to include tenant/kb filters
class PGKVStorage(BaseKVStorage):
    async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
        # Add tenant/kb columns when upserting
        for key, value in data.items():
            if self.tenant_id and self.kb_id:
                value['tenant_id'] = self.tenant_id
                value['kb_id'] = self.kb_id
        
        # Original upsert logic with tenant/kb in WHERE clause
        # ... existing code ...
    
    async def query_with_tenant_filter(self, query: str) -> List[Any]:
        """Execute query with automatic tenant/kb filtering"""
        if self.tenant_id and self.kb_id:
            # Add WHERE clause filters
            if "WHERE" in query:
                query += f" AND tenant_id = $1 AND kb_id = $2"
            else:
                query += f" WHERE tenant_id = $1 AND kb_id = $2"
            return await self._execute(query, [self.tenant_id, self.kb_id])
        return await self._execute(query)

class PGVectorStorage(BaseVectorStorage):
    async def query(self, query: str, top_k: int, query_embedding: list[float] = None) -> list[dict[str, Any]]:
        # Add tenant/kb filtering
        sql = """
            SELECT * FROM vector_store_entities 
            WHERE tenant_id = $1 AND kb_id = $2
            AND vector <-> $3 < $4
            ORDER BY vector <-> $3
            LIMIT $5
        """
        # Filter results by tenant/kb
        results = await self._execute(sql, [self.tenant_id, self.kb_id, query_embedding, threshold, top_k])
        return results
```

#### JSON Storage Update

**File**: `lightrag/kg/json_kv_impl.py` (Key modifications)

```python
@dataclass
class JsonKVStorage(BaseKVStorage):
    async def _get_file_path(self) -> str:
        """Get file path with tenant/kb isolation"""
        working_dir = self.global_config["working_dir"]
        
        # Build tenant/kb specific directory
        if self.tenant_id and self.kb_id:
            dir_path = os.path.join(working_dir, self.tenant_id, self.kb_id)
            file_name = f"kv_store_{self.namespace}.json"
        elif self.workspace:
            dir_path = os.path.join(working_dir, self.workspace)
            file_name = f"kv_store_{self.namespace}.json"
        else:
            dir_path = working_dir
            file_name = f"kv_store_{self.namespace}.json"
        
        os.makedirs(dir_path, exist_ok=True)
        return os.path.join(dir_path, file_name)
    
    async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
        """Insert with tenant/kb context"""
        # Add tenant/kb to metadata
        for key, value in data.items():
            if self.tenant_id:
                value['__tenant_id__'] = self.tenant_id
            if self.kb_id:
                value['__kb_id__'] = self.kb_id
        
        # Original upsert logic
        # ... existing code ...
```

## Phase 2: API Layer (Weeks 2-3)

### 2.1 Create Tenant-Aware Request Models

**File**: `lightrag/api/models/requests.py` (New)

```python
from pydantic import BaseModel, Field, validator
from typing import Optional, List
from uuid import UUID

class TenantRequest(BaseModel):
    """Base model for tenant-scoped requests"""
    tenant_id: str = Field(..., description="Tenant identifier")
    kb_id: str = Field(..., description="Knowledge base identifier")

class CreateTenantRequest(BaseModel):
    tenant_name: str = Field(..., min_length=1, max_length=255)
    description: Optional[str] = None
    llm_model: Optional[str] = None
    embedding_model: Optional[str] = None

class CreateKnowledgeBaseRequest(BaseModel):
    kb_name: str = Field(..., min_length=1, max_length=255)
    description: Optional[str] = None

class DocumentAddRequest(TenantRequest):
    """Request to add documents to a knowledge base"""
    document_path: str = Field(..., description="Path to document")
    metadata: Optional[dict] = None

class QueryRequest(TenantRequest):
    """Request to query a knowledge base"""
    query: str = Field(..., min_length=3)
    mode: str = Field(default="mix", regex="local|global|hybrid|naive|mix|bypass")
    top_k: Optional[int] = None
    stream: Optional[bool] = None
```

### 2.2 Create Tenant-Aware Dependency Injection

**File**: `lightrag/api/dependencies.py` (New)

```python
from fastapi import Depends, HTTPException, status, Path, Header
from typing import Optional
from lightrag.models.tenant import TenantContext
from lightrag.services.tenant_service import TenantService
from lightrag.api.auth import validate_token, get_tenant_from_token

async def get_tenant_context(
    tenant_id: str = Path(..., description="Tenant ID"),
    kb_id: str = Path(..., description="Knowledge Base ID"),
    authorization: Optional[str] = Header(None),
    api_key: Optional[str] = Header(None, alias="X-API-Key"),
    tenant_service: TenantService = Depends(get_tenant_service),
) -> TenantContext:
    """
    Dependency to extract and validate tenant context from request.
    Verifies user has access to the specified tenant/KB.
    """
    
    # Determine authentication method
    if authorization and authorization.startswith("Bearer "):
        # JWT token authentication
        token = authorization[7:]
        try:
            token_data = await validate_token(token)
        except Exception as e:
            raise HTTPException(status_code=401, detail="Invalid token")
        
        user_id = token_data.get("sub")
        token_tenant_id = token_data.get("tenant_id")
        
        # Verify user's tenant matches request tenant
        if token_tenant_id != tenant_id:
            raise HTTPException(status_code=403, detail="Access denied: tenant mismatch")
        
        # Verify user can access this KB
        accessible_kbs = token_data.get("knowledge_base_ids", [])
        if kb_id not in accessible_kbs and "*" not in accessible_kbs:
            raise HTTPException(status_code=403, detail="Access denied: KB not accessible")
    
    elif api_key:
        # API key authentication
        user_id = await validate_api_key(api_key, tenant_id, kb_id)
        if not user_id:
            raise HTTPException(status_code=401, detail="Invalid API key")
    
    else:
        raise HTTPException(status_code=401, detail="Missing authentication")
    
    # Verify tenant and KB exist
    tenant = await tenant_service.get_tenant(tenant_id)
    if not tenant or not tenant.is_active:
        raise HTTPException(status_code=404, detail="Tenant not found")
    
    # Return validated context
    return TenantContext(
        tenant_id=tenant_id,
        kb_id=kb_id,
        user_id=user_id,
        role=token_data.get("role", "viewer"),
        permissions=token_data.get("permissions", {})
    )

async def get_tenant_service() -> TenantService:
    """Get singleton tenant service"""
    # This should be initialized at app startup
    pass
```

### 2.3 Create Tenant-Aware API Routes

**File**: `lightrag/api/routers/tenant_routes.py` (New)

```python
from fastapi import APIRouter, Depends, HTTPException
from typing import List, Optional
from lightrag.api.models.requests import CreateTenantRequest, CreateKnowledgeBaseRequest
from lightrag.api.dependencies import get_tenant_context, get_tenant_service
from lightrag.models.tenant import TenantContext

router = APIRouter(prefix="/api/v1/tenants", tags=["tenants"])

@router.post("")
async def create_tenant(
    request: CreateTenantRequest,
    tenant_service = Depends(get_tenant_service),
) -> dict:
    """Create a new tenant"""
    tenant = await tenant_service.create_tenant(
        tenant_name=request.tenant_name,
        config=request.dict(exclude_none=True)
    )
    return {"status": "success", "data": tenant}

@router.get("/{tenant_id}")
async def get_tenant(
    tenant_context: TenantContext = Depends(get_tenant_context),
    tenant_service = Depends(get_tenant_service),
) -> dict:
    """Get tenant details"""
    tenant = await tenant_service.get_tenant(tenant_context.tenant_id)
    return {"status": "success", "data": tenant}

@router.post("/{tenant_id}/knowledge-bases")
async def create_knowledge_base(
    request: CreateKnowledgeBaseRequest,
    tenant_context: TenantContext = Depends(get_tenant_context),
    tenant_service = Depends(get_tenant_service),
) -> dict:
    """Create a knowledge base in a tenant"""
    kb = await tenant_service.create_knowledge_base(
        tenant_id=tenant_context.tenant_id,
        kb_name=request.kb_name,
        description=request.description
    )
    return {"status": "success", "data": kb}

@router.get("/{tenant_id}/knowledge-bases")
async def list_knowledge_bases(
    tenant_context: TenantContext = Depends(get_tenant_context),
    tenant_service = Depends(get_tenant_service),
) -> dict:
    """List all knowledge bases in a tenant"""
    kbs = await tenant_service.list_knowledge_bases(tenant_context.tenant_id)
    return {"status": "success", "data": kbs}
```

### 2.4 Update Query Routes for Multi-Tenancy

**File**: `lightrag/api/routers/query_routes.py` (Modifications)

```python
@router.post("/api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/query")
async def query_knowledge_base(
    request: QueryRequest,
    tenant_context: TenantContext = Depends(get_tenant_context),
    rag_manager = Depends(get_rag_instance_manager),
) -> QueryResponse:
    """
    Query a specific knowledge base with tenant isolation.
    
    The request context is automatically scoped to the tenant/KB
    via dependency injection.
    """
    
    # Get tenant-specific RAG instance (with per-tenant config)
    rag = await rag_manager.get_rag_instance(
        tenant_id=tenant_context.tenant_id,
        kb_id=tenant_context.kb_id
    )
    
    # Execute query with tenant context
    result = await rag.aquery(
        query=request.query,
        param=QueryParam(mode=request.mode, top_k=request.top_k or 40),
        # Inject tenant context into query execution
        tenant_context=tenant_context
    )
    
    return QueryResponse(response=result["response"])
```

### 2.5 Update Document Routes for Multi-Tenancy

**File**: `lightrag/api/routers/document_routes.py` (Modifications)

```python
@router.post("/api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/add")
async def add_document(
    file: UploadFile = File(...),
    tenant_context: TenantContext = Depends(get_tenant_context),
    rag_manager = Depends(get_rag_instance_manager),
) -> dict:
    """
    Add a document to a specific knowledge base.
    
    Tenant/KB context is enforced through dependency injection.
    """
    
    # Get tenant-specific RAG instance
    rag = await rag_manager.get_rag_instance(
        tenant_id=tenant_context.tenant_id,
        kb_id=tenant_context.kb_id
    )
    
    # Insert document with tenant/KB context automatically
    result = await rag.ainsert(
        file_path=file.filename,
        tenant_id=tenant_context.tenant_id,
        kb_id=tenant_context.kb_id
    )
    
    return {"status": "success", "data": result}

@router.delete("/api/v1/tenants/{tenant_id}/knowledge-bases/{kb_id}/documents/{doc_id}")
async def delete_document(
    doc_id: str,
    tenant_context: TenantContext = Depends(get_tenant_context),
    rag_manager = Depends(get_rag_instance_manager),
) -> dict:
    """Delete document with tenant isolation"""
    
    rag = await rag_manager.get_rag_instance(
        tenant_id=tenant_context.tenant_id,
        kb_id=tenant_context.kb_id
    )
    
    # Verify document belongs to this tenant/KB before deletion
    result = await rag.adelete_by_doc_id(
        doc_id=doc_id,
        tenant_id=tenant_context.tenant_id,
        kb_id=tenant_context.kb_id
    )
    
    return {"status": "success", "message": "Document deleted"}
```

## Phase 3: LightRAG Integration (Weeks 2-4)

### 3.1 Create Tenant-Aware LightRAG Instance Manager

**File**: `lightrag/tenant_rag_manager.py` (New)

```python
from typing import Dict, Optional, Tuple
from lightrag import LightRAG
from lightrag.models.tenant import TenantContext, TenantConfig
from lightrag.services.tenant_service import TenantService
import asyncio
from functools import lru_cache

class TenantRAGManager:
    """
    Manages LightRAG instances per tenant/KB combination.
    Handles caching, initialization, and cleanup of instances.
    """
    
    def __init__(
        self,
        base_working_dir: str,
        tenant_service: TenantService,
        max_cached_instances: int = 100,
    ):
        self.base_working_dir = base_working_dir
        self.tenant_service = tenant_service
        self.max_cached_instances = max_cached_instances
        self._instances: Dict[Tuple[str, str], LightRAG] = {}
        self._lock = asyncio.Lock()
    
    async def get_rag_instance(
        self,
        tenant_id: str,
        kb_id: str,
    ) -> LightRAG:
        """
        Get or create a LightRAG instance for a tenant/KB combination.
        
        Instances are cached to avoid repeated initialization.
        Each instance uses a separate namespace for complete isolation.
        """
        cache_key = (tenant_id, kb_id)
        
        # Return cached instance if exists
        if cache_key in self._instances:
            instance = self._instances[cache_key]
            if instance._storages_status.value >= 1:  # INITIALIZED
                return instance
        
        async with self._lock:
            # Double-check locking pattern
            if cache_key in self._instances:
                return self._instances[cache_key]
            
            # Get tenant config
            tenant = await self.tenant_service.get_tenant(tenant_id)
            if not tenant:
                raise ValueError(f"Tenant {tenant_id} not found")
            
            # Create tenant-specific working directory
            tenant_working_dir = os.path.join(
                self.base_working_dir,
                tenant_id,
                kb_id
            )
            
            # Create LightRAG instance with tenant-specific config and workspace
            instance = LightRAG(
                working_dir=tenant_working_dir,
                workspace=f"{tenant_id}_{kb_id}",  # Backward compatible workspace
                # Use tenant-specific models and settings
                llm_model_name=tenant.config.llm_model,
                embedding_func=self._get_embedding_func(tenant),
                llm_model_func=self._get_llm_func(tenant),
                # ... other tenant-specific configurations ...
            )
            
            # Initialize storages
            await instance.initialize_storages()
            
            # Cache the instance
            if len(self._instances) >= self.max_cached_instances:
                # Evict oldest entry
                oldest_key = next(iter(self._instances))
                await self._instances[oldest_key].finalize_storages()
                del self._instances[oldest_key]
            
            self._instances[cache_key] = instance
            return instance
    
    async def cleanup_instance(self, tenant_id: str, kb_id: str) -> None:
        """Clean up and remove a cached instance"""
        cache_key = (tenant_id, kb_id)
        if cache_key in self._instances:
            await self._instances[cache_key].finalize_storages()
            del self._instances[cache_key]
    
    async def cleanup_all(self) -> None:
        """Clean up all cached instances"""
        for instance in self._instances.values():
            await instance.finalize_storages()
        self._instances.clear()
    
    def _get_embedding_func(self, tenant: TenantConfig):
        """Create embedding function with tenant-specific model"""
        # Use tenant's embedding model configuration
        # Can be overridden from global config
        pass
    
    def _get_llm_func(self, tenant: TenantConfig):
        """Create LLM function with tenant-specific model"""
        # Use tenant's LLM model configuration
        pass
```

### 3.2 Modify LightRAG Query Methods

**File**: `lightrag/lightrag.py` (Key modifications)

```python
async def aquery(
    self,
    query: str,
    param: QueryParam,
    tenant_context: Optional[TenantContext] = None,  # NEW
) -> QueryResult:
    """
    Query with optional tenant context for filtering.
    
    Args:
        query: The query string
        param: Query parameters
        tenant_context: Tenant context for data isolation (NEW)
    """
    
    # If tenant context provided, inject it into all storage operations
    if tenant_context:
        # Temporarily set tenant/kb context on storages
        original_tenant = getattr(self, '_tenant_id', None)
        original_kb = getattr(self, '_kb_id', None)
        
        self._tenant_id = tenant_context.tenant_id
        self._kb_id = tenant_context.kb_id
    
    try:
        # Existing query logic
        # All storage operations will now respect tenant/kb context
        result = await self._execute_query(query, param)
        return result
    finally:
        # Restore original context
        if tenant_context:
            self._tenant_id = original_tenant
            self._kb_id = original_kb

async def ainsert(
    self,
    file_path: str,
    tenant_id: Optional[str] = None,  # NEW
    kb_id: Optional[str] = None,      # NEW
    **kwargs,
) -> InsertionResult:
    """Insert documents with optional tenant/KB context"""
    
    if tenant_id:
        self._tenant_id = tenant_id
    if kb_id:
        self._kb_id = kb_id
    
    # Existing insertion logic
    # Documents will be stored with tenant/kb metadata
    result = await self._process_documents(file_path, **kwargs)
    return result
```

## Phase 4: Testing & Deployment (Week 4)

### 4.1 Unit Tests

**File**: `tests/test_tenant_isolation.py` (New)

```python
import pytest
from lightrag.models.tenant import Tenant, KnowledgeBase, TenantContext
from lightrag.services.tenant_service import TenantService

@pytest.mark.asyncio
class TestTenantIsolation:
    
    async def test_tenant_creation(self, tenant_service):
        """Test creating a tenant"""
        tenant = await tenant_service.create_tenant("Test Tenant")
        assert tenant.tenant_name == "Test Tenant"
        assert tenant.is_active is True
    
    async def test_knowledge_base_creation(self, tenant_service):
        """Test creating KB in a tenant"""
        tenant = await tenant_service.create_tenant("Tenant 1")
        kb = await tenant_service.create_knowledge_base(
            tenant.tenant_id, 
            "KB 1"
        )
        assert kb.tenant_id == tenant.tenant_id
    
    async def test_cross_tenant_data_isolation(self, tenant_service, rag_manager):
        """Test that data from one tenant cannot be accessed by another"""
        # Create two tenants
        tenant1 = await tenant_service.create_tenant("Tenant 1")
        tenant2 = await tenant_service.create_tenant("Tenant 2")
        
        # Create KBs
        kb1 = await tenant_service.create_knowledge_base(tenant1.tenant_id, "KB1")
        kb2 = await tenant_service.create_knowledge_base(tenant2.tenant_id, "KB2")
        
        # Add documents to each KB
        rag1 = await rag_manager.get_rag_instance(tenant1.tenant_id, kb1.kb_id)
        rag2 = await rag_manager.get_rag_instance(tenant2.tenant_id, kb2.kb_id)
        
        # Verify documents are isolated
        # Query in tenant2 should not return documents from tenant1
        pass
    
    async def test_query_with_tenant_context(self, rag_manager):
        """Test queries include tenant context"""
        context = TenantContext(
            tenant_id="tenant1",
            kb_id="kb1",
            user_id="user1",
            role="admin"
        )
        # Execute query with context
        # Verify only tenant1/kb1 data returned
        pass
```

### 4.2 Integration Tests

**File**: `tests/test_api_tenant_routes.py` (New)

```python
import pytest
from fastapi.testclient import TestClient

@pytest.mark.asyncio
class TestTenantAPIs:
    
    async def test_create_tenant_endpoint(self, client: TestClient, auth_token):
        """Test POST /api/v1/tenants"""
        response = client.post(
            "/api/v1/tenants",
            json={"tenant_name": "New Tenant"},
            headers={"Authorization": f"Bearer {auth_token}"}
        )
        assert response.status_code == 201
        data = response.json()
        assert data["status"] == "success"
        assert "tenant_id" in data["data"]
    
    async def test_create_knowledge_base_endpoint(self, client: TestClient, tenant_id, auth_token):
        """Test POST /api/v1/tenants/{tenant_id}/knowledge-bases"""
        response = client.post(
            f"/api/v1/tenants/{tenant_id}/knowledge-bases",
            json={"kb_name": "KB 1"},
            headers={"Authorization": f"Bearer {auth_token}"}
        )
        assert response.status_code == 201
        data = response.json()
        assert "kb_id" in data["data"]
    
    async def test_cross_tenant_access_denied(self, client: TestClient, tenant1_token, tenant2_id):
        """Test accessing tenant2 with tenant1 token fails"""
        response = client.get(
            f"/api/v1/tenants/{tenant2_id}",
            headers={"Authorization": f"Bearer {tenant1_token}"}
        )
        assert response.status_code == 403
    
    async def test_query_with_tenant_isolation(self, client: TestClient, tenant_id, kb_id, auth_token):
        """Test query is isolated to tenant/KB"""
        # Add document to KB
        # Query should only search that KB
        pass
```

### 4.3 Migration Script

**File**: `scripts/migrate_workspace_to_tenant.py` (New)

```python
"""
Migration script to convert existing workspaces to multi-tenant architecture.
Creates a default tenant for each workspace.
"""

import asyncio
import argparse
from lightrag.services.tenant_service import TenantService
from lightrag.models.tenant import Tenant
import uuid

async def migrate_workspaces_to_tenants(
    working_dir: str,
    storage_config: dict
):
    """
    Migrate existing workspace-based deployments to multi-tenant.
    
    For each workspace directory:
    1. Create a tenant with that workspace name
    2. Create a default KB
    3. Map workspace data to tenant/KB
    """
    
    tenant_service = TenantService(storage_config)
    
    # Scan working directory for existing workspaces
    workspaces = []  # Get from directory structure
    
    for workspace_name in workspaces:
        print(f"Migrating workspace: {workspace_name}")
        
        # Create tenant from workspace
        tenant = await tenant_service.create_tenant(
            tenant_name=workspace_name or "default",
            metadata={"migrated_from_workspace": workspace_name}
        )
        
        # Create default KB
        kb = await tenant_service.create_knowledge_base(
            tenant.tenant_id,
            kb_name="default",
            description="Default knowledge base (migrated from workspace)"
        )
        
        # Migrate data from workspace files to tenant/KB storage
        # Update storage paths and metadata
        
        print(f"  ✓ Created tenant {tenant.tenant_id}")
        print(f"  ✓ Created KB {kb.kb_id}")
    
    print("\nMigration complete!")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Migrate workspaces to multi-tenant")
    parser.add_argument("--working-dir", required=True)
    args = parser.parse_args()
    
    asyncio.run(migrate_workspaces_to_tenants(args.working_dir, {}))
```

### 4.4 Deployment Checklist

```markdown
## Pre-Deployment Checklist

### Database & Schema
- [ ] Database migration scripts tested on staging
- [ ] Backup of production database created
- [ ] Index creation verified on prod-like data volume
- [ ] Schema rollback scripts prepared

### Code Changes
- [ ] All unit tests passing (100% coverage of new code)
- [ ] Integration tests passing
- [ ] Load testing completed (1000+ tenant/KB combinations)
- [ ] Security audit completed
- [ ] Code review approved by 2+ team members

### Documentation
- [ ] API documentation updated
- [ ] Migration guide prepared
- [ ] Tenant management guide written
- [ ] Troubleshooting guide created

### Deployment
- [ ] Feature flag to enable multi-tenancy (default: off)
- [ ] Gradual rollout: 10% → 50% → 100%
- [ ] Health checks monitor tenant isolation
- [ ] Rollback plan tested
- [ ] Team trained on new architecture
- [ ] On-call engineer assigned for release window

### Post-Deployment
- [ ] Monitor error rates and latency
- [ ] Verify tenant data isolation (spot checks)
- [ ] Collect feedback from early adopters
- [ ] Performance baseline established
```

## Configuration Examples

### Environment Variables

```bash
# Tenant Manager Configuration
TENANT_ENABLED=true
MAX_CACHED_INSTANCES=100
TENANT_CONFIG_SYNC_INTERVAL=300

# Storage Configuration (remains the same)
LIGHTRAG_KV_STORAGE=PGKVStorage
LIGHTRAG_VECTOR_STORAGE=PGVectorStorage
LIGHTRAG_GRAPH_STORAGE=PGGraphStorage

# Tenant Service Configuration
TENANT_SERVICE_STORAGE=PostgreSQL
TENANT_DB_HOST=localhost
TENANT_DB_PORT=5432
TENANT_DB_NAME=lightrag_tenants
```

### Python Configuration

```python
# In config.py or app initialization
class TenantConfig:
    ENABLED = os.getenv("TENANT_ENABLED", "false").lower() == "true"
    MAX_CACHED_INSTANCES = int(os.getenv("MAX_CACHED_INSTANCES", "100"))
    SYNC_INTERVAL = int(os.getenv("TENANT_CONFIG_SYNC_INTERVAL", "300"))
    
    # Storage for tenant metadata
    STORAGE_TYPE = os.getenv("TENANT_SERVICE_STORAGE", "PostgreSQL")
    STORAGE_CONFIG = {
        "host": os.getenv("TENANT_DB_HOST"),
        "port": int(os.getenv("TENANT_DB_PORT", "5432")),
        "database": os.getenv("TENANT_DB_NAME", "lightrag_tenants"),
    }
```

## Testing Strategy

### Unit Testing (40% of tests)
- Tenant service operations
- Storage isolation logic
- Configuration management
- Authentication/authorization

### Integration Testing (40% of tests)
- API endpoint functionality
- Cross-component data flow
- Tenant context propagation
- Error handling

### System Testing (20% of tests)
- End-to-end workflows per tenant
- Multi-tenant concurrent operations
- Resource quota enforcement
- Performance under load

## Performance Targets

| Metric | Target | Measurement |
|--------|--------|-------------|
| Query latency | <10ms overhead | Per query with/without tenant filtering |
| API response time | <200ms p99 | Single query endpoint |
| Storage overhead | <3% | Per-tenant metadata vs. data |
| Memory per instance | <500MB | Per cached LightRAG instance |
| Tenant isolation overhead | <15% | Compare to single-tenant baseline |

## Known Limitations & Future Work

### Phase 1 Limitations
1. No cross-tenant queries or data sharing
2. No tenant-to-tenant access delegation
3. No per-tenant storage encryption
4. No real-time multi-region replication
5. No automatic tenant data backup management

### Future Enhancements (Phase 2)
1. **Cross-tenant sharing**: Allow tenants to share specific KB data
2. **Advanced RBAC**: Support custom roles and fine-grained permissions
3. **Encryption at rest**: Per-tenant data encryption
4. **Audit logging**: Comprehensive audit trail with retention policies
5. **Multi-region**: Replicate tenant data across regions
6. **Tenant quotas**: Storage, API call, and compute quotas with enforcement
7. **SSO integration**: Enterprise SSO (SAML, OIDC) support

---

**Document Version**: 1.0  
**Last Updated**: 2025-11-20  
**Phase Duration**: 3-4 weeks  
**Estimated Effort**: 160 developer hours  
**Team Size**: 2-3 backend engineers