LightRAG/specs/001-multi-workspace-server/research.md
Clément THOMAS 62b2a71dda feat(api): add multi-workspace server support for multi-tenant deployments
Enable a single LightRAG server instance to serve multiple isolated workspaces
via HTTP header-based routing. This allows multi-tenant SaaS deployments where
each tenant's data is completely isolated.

Key features:
- Header-based workspace routing (LIGHTRAG-WORKSPACE, X-Workspace-ID fallback)
- Process-local pool of LightRAG instances with LRU eviction
- FastAPI dependency (get_rag) for workspace resolution per request
- Full backward compatibility - existing deployments work unchanged
- Strict multi-tenant mode option (LIGHTRAG_ALLOW_DEFAULT_WORKSPACE=false)
- Configurable pool size (LIGHTRAG_MAX_WORKSPACES_IN_POOL)
- Graceful shutdown with workspace finalization

Configuration:
- LIGHTRAG_DEFAULT_WORKSPACE: Default workspace (falls back to WORKSPACE)
- LIGHTRAG_ALLOW_DEFAULT_WORKSPACE: Require explicit header when false
- LIGHTRAG_MAX_WORKSPACES_IN_POOL: Max concurrent workspace instances (default: 50)

Files:
- New: lightrag/api/workspace_manager.py (core multi-workspace module)
- New: tests/test_multi_workspace_server.py (17 unit tests)
- New: render.yaml (Render deployment blueprint)
- Modified: All route files to use get_rag dependency
- Updated: README.md, env.example with documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-01 12:07:22 +01:00


# Research: Multi-Workspace Server Support
**Date**: 2025-12-01
**Feature**: 001-multi-workspace-server
## Executive Summary
Research confirms that the existing LightRAG codebase provides a solid foundation for multi-workspace support at the server level. The core library already isolates data by workspace; the gap is purely at the API server layer.
## Research Findings
### 1. Existing Workspace Support in LightRAG Core
**Decision**: Leverage existing `workspace` parameter in `LightRAG` class
**Findings**:
- `LightRAG` class accepts `workspace: str` parameter (default: `os.getenv("WORKSPACE", "")`)
- Storage implementations use `get_final_namespace(namespace, workspace)` to create isolated keys
- Namespace format: `"{workspace}:{namespace}"` when workspace is set, else just `"{namespace}"`
- Pipeline status, locks, and in-memory state are all workspace-aware via `shared_storage.py`
- `DocumentManager` creates workspace-specific input directories
**Evidence**:
```python
# lightrag/lightrag.py
workspace: str = field(default_factory=lambda: os.getenv("WORKSPACE", ""))

# lightrag/kg/shared_storage.py
def get_final_namespace(namespace: str, workspace: str | None = None) -> str:
    if workspace is None:
        workspace = get_default_workspace()
    if not workspace:
        return namespace
    return f"{workspace}:{namespace}"
```
**Implications**: No changes needed to core isolation; just need to instantiate separate `LightRAG` objects with different `workspace` values.
### 2. Current Server Architecture
**Decision**: Refactor from closure pattern to FastAPI dependency injection
**Findings**:
- Server creates a single global `LightRAG` instance in `create_app(args)`
- Routes receive the RAG instance via closure (factory function pattern):
```python
def create_document_routes(rag: LightRAG, doc_manager, api_key):
    @router.post("/scan")
    async def scan_for_new_documents(...):
        # rag captured from enclosing scope
```
- This pattern prevents per-request workspace switching
**Alternative Considered**: Keep closure pattern and add workspace switching to existing instance
- **Rejected Because**: LightRAG instance configuration is immutable after creation; switching workspace would require re-initializing storage connections
**Chosen Approach**: Replace closure with FastAPI `Depends()` that resolves workspace → instance
### 3. Instance Pool Design
**Decision**: Use `asyncio.Lock` protected dictionary with LRU eviction
**Findings**:
- Python's `asyncio.Lock` serializes pool access across coroutines on a single event loop and can be held across `await` points, which suits async initialization
- LRU eviction via `collections.OrderedDict` or manual tracking
- Instance initialization is async (`await rag.initialize_storages()`)
- Concurrent requests for the same new workspace must share a single initialization
**Pattern**:
```python
_instances: dict[str, LightRAG] = {}
_lock = asyncio.Lock()
_lru_order: list[str] = []  # Most recent at end

async def get_instance(workspace: str) -> LightRAG:
    async with _lock:
        if workspace in _instances:
            # Move to end of LRU list
            _lru_order.remove(workspace)
            _lru_order.append(workspace)
            return _instances[workspace]

        # Evict if at capacity
        if len(_instances) >= max_pool_size:
            oldest = _lru_order.pop(0)
            await _instances[oldest].finalize_storages()
            del _instances[oldest]

        # Create and initialize
        instance = LightRAG(workspace=workspace, ...)
        await instance.initialize_storages()
        _instances[workspace] = instance
        _lru_order.append(workspace)
        return instance
```
**Alternative Considered**: Use `async_lru` library or `cachetools.TTLCache`
- **Rejected Because**: Adds external dependency; simple dict+lock is sufficient and well-understood
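
The pattern above can also be written with `collections.OrderedDict`, the first option listed in the findings. The following is a minimal, runnable sketch rather than the actual `workspace_manager.py` code: `FakeRAG`, `WorkspacePool`, `init_count`, and `demo` are hypothetical names introduced for illustration, and `FakeRAG` only models the two lifecycle calls mentioned in this section.

```python
import asyncio
from collections import OrderedDict


class FakeRAG:
    """Hypothetical stand-in for LightRAG; only the lifecycle calls are modeled."""

    def __init__(self, workspace: str) -> None:
        self.workspace = workspace
        self.initialized = False
        self.finalized = False

    async def initialize_storages(self) -> None:
        await asyncio.sleep(0)  # yield control, as real storage setup would
        self.initialized = True

    async def finalize_storages(self) -> None:
        self.finalized = True


class WorkspacePool:
    """LRU pool keyed by workspace; the least recently used entry comes first."""

    def __init__(self, max_size: int) -> None:
        self._max_size = max_size
        self._instances: "OrderedDict[str, FakeRAG]" = OrderedDict()
        self._lock = asyncio.Lock()
        self.init_count = 0  # illustrative counter, not part of the design

    async def get(self, workspace: str) -> FakeRAG:
        async with self._lock:
            if workspace in self._instances:
                self._instances.move_to_end(workspace)  # mark as most recent
                return self._instances[workspace]
            if len(self._instances) >= self._max_size:
                # Drop the least recently used entry and release its resources.
                _, evicted = self._instances.popitem(last=False)
                await evicted.finalize_storages()
            instance = FakeRAG(workspace)
            await instance.initialize_storages()
            self.init_count += 1
            self._instances[workspace] = instance
            return instance

    def workspaces(self) -> list:
        return list(self._instances)


async def demo():
    pool = WorkspacePool(max_size=2)
    # Two concurrent requests for the same new workspace share one initialization.
    await asyncio.gather(pool.get("a"), pool.get("a"))
    await pool.get("b")
    await pool.get("a")  # "a" becomes most recent, so "b" is now the LRU entry
    await pool.get("c")  # pool is full, so "b" is evicted
    return pool.workspaces(), pool.init_count
```

Because the lock is held for the whole of `get`, the second concurrent request for `"a"` waits and then finds the instance already in the pool. The trade-off of a single lock is that initializing one workspace briefly blocks lookups for all others.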
### 4. Header Routing Strategy
**Decision**: `LIGHTRAG-WORKSPACE` primary, `X-Workspace-ID` fallback
**Findings**:
- Custom headers have conventionally used an `X-` prefix, but RFC 6648 deprecates that convention
- Product-specific headers (e.g., `LIGHTRAG-WORKSPACE`) are clearer and recommended
- Fallback to common convention (`X-Workspace-ID`) aids adoption
**Implementation**:
```python
def get_workspace_from_request(request: Request) -> str | None:
    workspace = request.headers.get("LIGHTRAG-WORKSPACE", "").strip()
    if not workspace:
        workspace = request.headers.get("X-Workspace-ID", "").strip()
    return workspace or None
```
### 5. Configuration Schema
**Decision**: Three new environment variables
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `LIGHTRAG_DEFAULT_WORKSPACE` | str | `""` (from `WORKSPACE`) | Default workspace when no header |
| `LIGHTRAG_ALLOW_DEFAULT_WORKSPACE` | bool | `true` | If false, reject requests without header |
| `LIGHTRAG_MAX_WORKSPACES_IN_POOL` | int | `50` | Maximum concurrent workspace instances |

**Rationale**:
- `LIGHTRAG_` prefix namespaces new vars to avoid conflicts
- `ALLOW_DEFAULT_WORKSPACE=false` enables strict multi-tenant mode
- Default pool size of 50 balances memory vs. reinitialization overhead
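
As a rough illustration of how the three variables could be read, the sketch below loads them from an environment mapping. `WorkspaceSettings`, `load_settings`, and `_parse_bool` are hypothetical names, and the accepted boolean spellings and the treatment of an empty `LIGHTRAG_DEFAULT_WORKSPACE` (falling back to `WORKSPACE`) are assumptions of this sketch, not confirmed behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkspaceSettings:
    default_workspace: str
    allow_default_workspace: bool
    max_workspaces_in_pool: int


def _parse_bool(raw: str) -> bool:
    # Accepted spellings are an assumption of this sketch.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> WorkspaceSettings:
    # An unset or empty LIGHTRAG_DEFAULT_WORKSPACE falls back to WORKSPACE.
    default_ws = env.get("LIGHTRAG_DEFAULT_WORKSPACE") or env.get("WORKSPACE", "")
    return WorkspaceSettings(
        default_workspace=default_ws,
        allow_default_workspace=_parse_bool(
            env.get("LIGHTRAG_ALLOW_DEFAULT_WORKSPACE", "true")
        ),
        max_workspaces_in_pool=int(env.get("LIGHTRAG_MAX_WORKSPACES_IN_POOL", "50")),
    )
```

Passing the environment in as a mapping keeps the function easy to exercise in offline tests without mutating `os.environ`.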
### 6. Workspace Identifier Validation
**Decision**: Alphanumeric, hyphens, underscores; 1-64 characters
**Findings**:
- Must be safe for filesystem paths (workspace creates subdirectories)
- Must be safe for database keys (used in storage namespacing)
- Must prevent injection attacks (path traversal, SQL injection)
**Validation Regex**: `^[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}$`
- Starts with alphanumeric (prevents hidden directories like `.hidden`)
- Allows hyphens and underscores for readability
- Max 64 chars (reasonable for identifiers, fits in most DB column sizes)
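
A minimal sketch of the check, using the regex above; `WORKSPACE_PATTERN` and `is_valid_workspace` are hypothetical names. It uses `fullmatch` because, with `re.match`, a trailing `$` also matches just before a final newline, which would let `"tenant\n"` through.

```python
import re

# Regex taken from the validation rule above.
WORKSPACE_PATTERN = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}$")


def is_valid_workspace(identifier: str) -> bool:
    """Return True if the identifier is 1-64 chars of [a-zA-Z0-9_-], starting alphanumeric."""
    return WORKSPACE_PATTERN.fullmatch(identifier) is not None
```

Identifiers such as `tenant-1` or `Team_A` pass, while `.hidden`, `../etc`, an empty string, or anything longer than 64 characters is rejected.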
### 7. Error Handling
**Decision**: Return 400 for missing/invalid workspace; 503 for initialization failures
| Scenario | HTTP Status | Error Message |
|----------|-------------|---------------|
| Missing header, default disabled | 400 | `Missing LIGHTRAG-WORKSPACE header` |
| Invalid workspace identifier | 400 | `Invalid workspace identifier: must be alphanumeric...` |
| Workspace initialization fails | 503 | `Failed to initialize workspace: {details}` |
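
The mapping in the table can be sketched without any web framework. `WorkspaceError`, `resolve_workspace`, and `initialization_failed` are hypothetical names used for illustration only. The invalid-identifier message is shortened because its full wording is elided in the table.

```python
import re
from typing import Mapping

_WORKSPACE_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}$")


class WorkspaceError(Exception):
    """Hypothetical error type carrying the HTTP status to return."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def resolve_workspace(
    headers: Mapping[str, str], default_workspace: str, allow_default: bool
) -> str:
    # HTTP header names are case-insensitive, so compare lower-cased keys.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    workspace = lowered.get("lightrag-workspace") or lowered.get("x-workspace-id")
    if not workspace:
        if allow_default:
            return default_workspace
        raise WorkspaceError(400, "Missing LIGHTRAG-WORKSPACE header")
    if _WORKSPACE_RE.fullmatch(workspace) is None:
        # Full message wording is elided in the table above.
        raise WorkspaceError(400, "Invalid workspace identifier")
    return workspace


def initialization_failed(exc: Exception) -> WorkspaceError:
    # Wrap a storage/initialization failure as a 503.
    return WorkspaceError(503, f"Failed to initialize workspace: {exc}")
```

In a FastAPI route dependency, one plausible translation is to catch `WorkspaceError` and raise `HTTPException(status_code=err.status_code, detail=err.detail)`.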
### 8. Logging Strategy
**Decision**: Log workspace identifier at INFO level; never log credentials
**Implementation**:
- Log workspace on request: `logger.info(f"Request to workspace: {workspace}")`
- Log pool events: `logger.info(f"Initialized workspace: {workspace}")`
- Log evictions: `logger.info(f"Evicted workspace from pool: {workspace}")`
- NEVER log: API keys, storage credentials, auth tokens
### 9. Test Strategy
**Decision**: Pytest with markers following existing patterns
**Test Categories**:
1. **Unit tests** (`@pytest.mark.offline`): Workspace resolution, validation, pool logic
2. **Integration tests** (`@pytest.mark.integration`): Full request flow with mock LLM/embedding
3. **Backward compatibility tests** (`@pytest.mark.offline`): Single-workspace mode unchanged
**Key Test Scenarios**:
- Two workspaces → ingest document in A → query from B returns nothing
- No header + `ALLOW_DEFAULT_WORKSPACE=true` → uses default
- No header + `ALLOW_DEFAULT_WORKSPACE=false` → returns 400
- Pool at capacity → evicts LRU → new workspace initializes
## Resolved Questions
| Question | Resolution |
|----------|------------|
| How to handle concurrent init of same workspace? | `asyncio.Lock` ensures single initialization; others wait |
| Should evicted workspace finalize storage? | Yes, call `finalize_storages()` to release resources |
| How to share config between instances? | Clone config; only `workspace` differs per instance |
| Where to put pool management code? | New module `workspace_manager.py` |
## Next Steps
1. Create `data-model.md` with entity definitions
2. Document contracts (no new API endpoints; header-based routing is transparent)
3. Create `quickstart.md` for multi-workspace deployment