* Remove outdated documentation files: Quick Start Guide, Apache AGE Analysis, and Scratchpad. * Add multi-tenant testing strategy and ADR index documentation - Introduced ADR 008 detailing the multi-tenant testing strategy for the ./starter environment, covering compatibility and multi-tenant modes, testing scenarios, and implementation details. - Created a comprehensive ADR index (README.md) summarizing all architecture decision records related to the multi-tenant implementation, including purpose, key sections, and reading paths for different roles. * feat(docs): Add comprehensive multi-tenancy guide and README for LightRAG Enterprise - Introduced `0008-multi-tenancy.md` detailing multi-tenancy architecture, key concepts, roles, permissions, configuration, and API endpoints. - Created `README.md` as the main documentation index, outlining features, quick start, system overview, and deployment options. - Documented the LightRAG architecture, storage backends, LLM integrations, and query modes. - Established a task log (`2025-01-21-lightrag-documentation-log.md`) summarizing documentation creation actions, decisions, and insights. |
||
|---|---|---|
| .. | ||
| 001-multi-tenant-architecture-overview.md | ||
| 002-implementation-strategy.md | ||
| 003-data-models-and-storage.md | ||
| 004-api-design.md | ||
| 005-security-analysis.md | ||
| 006-architecture-diagrams-alternatives.md | ||
| 007-deployment-guide-quick-reference.md | ||
| 008-multi-tenant-testing-strategy.md | ||
| README.md | ||
LightRAG Multi-Tenant Architecture - Complete ADR Index
Document Overview
This collection of 7 Architecture Decision Records provides comprehensive guidance for implementing a multi-tenant, multi-knowledge-base system in LightRAG. All recommendations are grounded in actual codebase analysis and include detailed implementation specifications.
📋 Complete Document Index
ADR 001: Multi-Tenant Architecture Overview
Purpose: Establish the core architectural decision and rationale
Length: ~400 lines
Key Sections:
- Current state analysis (single-instance, workspace-level isolation)
- Architectural decision (multi-tenant with per-KB scoping)
- Consequences (complexity, performance, security trade-offs)
- Code evidence (6 direct references to existing patterns)
- Alternative approaches evaluated (4 alternatives considered)
When to Read: First - understand why multi-tenant is necessary
For Roles: Architects, Tech Leads, Decision Makers
Decision Status: Proposed (Ready for stakeholder approval)
ADR 002: Implementation Strategy
Purpose: Detailed roadmap for implementation across 4 phases
Length: ~800 lines
Key Sections:
- Phase 1 (2-3 weeks): Database schema, tenant models, core infrastructure
- Phase 2 (2-3 weeks): API layer, tenant routing, permission checking
- Phase 3 (1-2 weeks): LightRAG integration, instance caching, query modification
- Phase 4 (1 week): Testing, migration, deployment
- Configuration examples with real environment variables
- Performance targets and success metrics
- Known limitations and future work
Total Effort: ~160 developer hours across 4 weeks
When to Read: Second - use for sprint planning and task breakdown
For Roles: Engineering Leads, Project Managers, Developers
Implementation Detail: High-level code examples (not pseudo-code)
ADR 003: Data Models and Storage Design
Purpose: Complete specification of data models and storage schema
Length: ~700 lines
Key Sections:
- Core data models with Python dataclass definitions
- PostgreSQL schema with 8 tables, composite indexes, and migration scripts
- Neo4j schema with Cypher examples
- MongoDB/Vector DB schema with partition strategies
- Access control lists and role-based permissions
- Data validation rules and constraints
- Backward compatibility mapping for workspace-to-tenant migration
When to Read: Before database migration work begins
For Roles: Database Engineers, Backend Developers
Schema Completeness: 100% (Production-ready SQL)
ADR 004: API Design and Routing
Purpose: Complete REST API specification for multi-tenant system
Length: ~900 lines
Key Sections:
- API versioning and base URL structure (
/api/v1/tenants/{tenant_id}/...) - Authentication mechanisms (JWT RS256, API keys with rotation)
- Tenant management endpoints (CRUD operations)
- Knowledge base endpoints (lifecycle management)
- Document endpoints (upload, status, deletion)
- Query endpoints (standard, streaming, with data)
- Error handling with 8 error codes and examples
- Rate limiting configuration per tenant
- 10+ cURL examples for all operations
- OpenAPI/Swagger documentation structure
Endpoint Count: 30+ endpoints defined
When to Read: Before API development begins
For Roles: API Developers, Frontend Engineers, QA
Specification Completeness: 100% (Ready to implement)
ADR 005: Security Analysis and Mitigation
Purpose: Comprehensive security analysis with threat modeling
Length: ~900 lines
Key Sections:
- Security principles (Zero Trust, Defense in Depth, Complete Mediation)
- Threat model with 7 attack vectors:
- Unauthorized cross-tenant access → Dependency injection validation
- Authentication bypass → Strong JWT signature verification
- Parameter injection/path traversal → UUID validation + parameterized queries
- Information disclosure → Generic errors + log sanitization
- DoS via resource exhaustion → Per-tenant rate limits
- Data leakage via logs → Field redaction + PII hashing
- Replay attacks → JTI tracking + idempotency keys
- JWT security configuration (RS256 recommended)
- API key security (bcrypt hashing, rotation policy)
- CORS and TLS/HTTPS configuration
- Audit logging structure with 14 event types
- Vulnerability scanning strategy
- Compliance considerations (GDPR, SOC 2, ISO 27001, HIPAA)
- Security checklist with 13 verification items
When to Read: Before security implementation phase
For Roles: Security Engineers, Backend Developers, Compliance Officers
Threat Coverage: Comprehensive (All major attack vectors)
ADR 006: Architecture Diagrams and Alternatives
Purpose: Visual representation of architecture and detailed alternatives analysis
Length: ~700 lines
Key Sections:
- Full system architecture ASCII diagram (6 layers)
- Query execution flow diagram (10 steps)
- Document upload flow diagram (7 steps)
- 5 alternative approaches with pros/cons:
- Database per Tenant (Rejected: 100x cost, operational nightmare)
- Server per Tenant (Rejected: Resource waste, uneconomical)
- Workspace Rename (Rejected: No KB isolation, weak security)
- Shared Single Instance (Rejected: Data isolation risk too high)
- Sharding by Hash (Rejected: Complexity without sufficient benefit)
- Comparison matrix showing why proposed approach wins
- Risk assessment for each alternative
When to Read: For architectural validation and decision support
For Roles: Architects, Tech Leads, Stakeholders
Visualization Quality: High (ASCII diagrams suitable for documentation/slides)
ADR 007: Deployment Guide and Quick Reference
Purpose: Practical guide for deployment, testing, and operations
Length: ~800 lines
Key Sections:
- Quick start for developers (setup, testing, manual testing)
- Docker Compose configuration for complete stack
- Environment variable reference
- Backward compatibility and migration from workspace model
- Monitoring and observability setup
- Prometheus queries for key metrics
- Rollout strategy (4-phase soft launch to production)
- Troubleshooting guide with solutions
- Success criteria checklist
- Support resources and documentation index
When to Read: During deployment and operational phases
For Roles: DevOps Engineers, Operators, Support Teams
Operational Readiness: Complete (All runbooks provided)
🎯 Reading Paths by Role
👨💼 For Executives/Product Managers
- Executive Summary (this document, sections below)
- ADR 001 - Sections: Decision, Consequences, Alternatives
- ADR 002 - Sections: Timeline, Effort, Success Metrics
- ADR 007 - Sections: Rollout Strategy, Success Criteria
Time Investment: 45 minutes
Key Takeaway: What we're building, why it matters, and when it ships
🏗️ For Architects/Tech Leads
- ADR 001 - Complete
- ADR 006 - Complete (diagrams + alternatives)
- ADR 003 - Sections: Core Models, Storage Strategy
- ADR 002 - Sections: Phase Overview, Configuration
- ADR 005 - Sections: Threat Model, Security Checklist
Time Investment: 3 hours
Key Takeaway: Complete architectural vision with design justification
👨💻 For Developers (API/Backend)
- ADR 002 - Complete (detailed code examples)
- ADR 004 - Complete (endpoint specifications)
- ADR 003 - Sections: Core Models, PostgreSQL Schema
- ADR 005 - Sections: Threat Mitigations (code-level)
- ADR 007 - Sections: Quick Start, Testing
Time Investment: 6 hours
Key Takeaway: Exact code changes needed, APIs to implement, test strategy
🔐 For Security/DevOps
- ADR 005 - Complete (threat model, mitigations, compliance)
- ADR 007 - Complete (monitoring, troubleshooting)
- ADR 004 - Sections: Authentication, Error Handling
- ADR 002 - Sections: Configuration, Testing
- ADR 001 - Sections: Consequences (security)
Time Investment: 4 hours
Key Takeaway: Security architecture, deployment checklist, monitoring strategy
📊 For Database Engineers
- ADR 003 - Complete
- ADR 002 - Sections: Phase 1 (Database changes)
- ADR 001 - Sections: Current Architecture
- ADR 005 - Sections: Parameter Injection Mitigation
Time Investment: 4 hours
Key Takeaway: Schema changes, migration scripts, storage isolation strategy
📌 Executive Summary
The Opportunity
LightRAG currently supports single-instance deployments with basic workspace-level isolation. To serve multiple organizations and knowledge domains (SaaS model), we need true multi-tenancy with knowledge base-level isolation.
The Decision
Implement multi-tenant architecture with multi-knowledge-base support using:
- Tenant abstraction layer (UUID-based isolation)
- Knowledge bases as first-class entities
- Composite key strategy (
tenant_id:kb_id:entity_id) - Storage layer automatic filtering (defense in depth)
- Per-tenant RAG instance caching (performance optimization)
Investment Required
- Effort: ~160 developer-hours
- Timeline: 4 weeks (1 week per phase)
- Team Size: 4 developers + 1 tech lead
- Infrastructure: Database migration, Redis for caching
Business Impact
- Enables: Multi-customer SaaS model
- Reduces: Per-customer hosting costs by 10-50x
- Improves: Data isolation and security posture
- Provides: RBAC and audit logging for compliance
- Supports: Future expansion to 100+ concurrent tenants
Risk Assessment
| Risk | Severity | Mitigation |
|---|---|---|
| Cross-tenant data access | Critical | Defense-in-depth filters + automated tests |
| Performance degradation | High | Instance caching, indexed queries, monitoring |
| Migration failures | Medium | Dual-write period, rollback plan, testing |
| Operational complexity | Medium | Comprehensive monitoring, runbooks, training |
Success Metrics
✓ Functional: All API endpoints working with tenant isolation
✓ Security: Zero cross-tenant data access in production
✓ Performance: Query latency < 200ms p99, cache hit rate > 90%
✓ Operational: 99.5% uptime, <5min incident response time
✓ Business: Support 50+ active tenants on single instance
🚀 Quick Implementation Checklist
Pre-Implementation (Week 0)
- Review all 7 ADRs with team (30-45 minutes)
- Secure stakeholder approval
- Create detailed Jira tickets from ADR 002
- Set up development databases (PostgreSQL, Redis)
- Brief security team on threat model (ADR 005)
Phase 1: Core Infrastructure (Week 1-2)
- Create database schema (ADR 003)
- Implement tenant models (dataclasses)
- Create TenantService for CRUD
- Add tenant/KB columns to storage base classes
- Run unit tests on isolation
Phase 2: API Layer (Week 2-3)
- Implement tenant routes (CRUD)
- Implement KB routes (CRUD)
- Create dependency injection for TenantContext
- Update document/query routes with tenant filtering
- Test with API examples from ADR 004
Phase 3: RAG Integration (Week 3)
- Implement TenantRAGManager (instance caching)
- Modify LightRAG.query() to accept tenant context
- Modify LightRAG.insert() to accept tenant context
- Set up monitoring (Prometheus metrics)
- Run integration tests
Phase 4: Deployment (Week 4)
- Run security audit against ADR 005 checklist
- Run load tests with multiple tenants
- Prepare migration script for existing workspaces
- Deploy to staging (1 week soak test)
- Deploy to production (4-phase rollout)
- Run incident response drills
📚 Document Navigation
adr/
├── 001-multi-tenant-architecture-overview.md [START HERE - Why]
├── 002-implementation-strategy.md [Then read - How & When]
├── 003-data-models-and-storage.md [Reference - Database design]
├── 004-api-design.md [Reference - API specs]
├── 005-security-analysis.md [Reference - Security checklist]
├── 006-architecture-diagrams-alternatives.md [Reference - Visual overview]
├── 007-deployment-guide-quick-reference.md [Reference - Operations]
└── README.md [This file - Navigation]
🔄 Decision Record Details
| Aspect | Details |
|---|---|
| Decision | Multi-tenant, multi-KB architecture |
| Status | Proposed (Awaiting approval) |
| Stakeholders | Engineering, Security, Product, Operations |
| Effort Estimate | 160 developer-hours over 4 weeks |
| Risk Level | Medium (Well-scoped, tested patterns) |
| Alternatives | 5 considered, 4 rejected with justification |
| Security Review | Required before Phase 1 start |
| Rollout Plan | 4-phase soft launch (25%→50%→75%→100%) |
| Success Criteria | 13 items in ADR 007 |
| Contingency | 2-week delay buffer, rollback to v1.0 if needed |
❓ Frequently Asked Questions
Q: Why multi-tenant and not just multi-workspace?
A: Current workspace is implicit and lacks KB-level isolation. Multi-tenant provides explicit isolation, RBAC, audit logging, and SaaS-readiness. See ADR 001 and ADR 006 (alternatives) for detailed comparison.
Q: Will this break existing installations?
A: No. Legacy workspace deployments continue working - they automatically become a tenant with KB named "default". See ADR 003 (Backward Compatibility) for migration details.
Q: What's the performance impact?
A: Approximately 5-10% latency overhead (tenant filtering in queries) offset by instance caching (>90% hit rate). Net impact: negligible for most workloads. See ADR 002 (Performance Targets) for details.
Q: How do we ensure data isolation?
A: Defense in depth:
- API Layer: TenantContext dependency validates token and extracts tenant_id
- Storage Layer: All queries auto-filtered by
WHERE tenant_id = ? AND kb_id = ? - Testing: Automated tests verify cross-tenant access is denied See ADR 005 (Threat Model) for complete security analysis.
Q: Can we support 100+ tenants on one instance?
A: Yes. Architecture supports ~100 concurrent cached instances (configurable). For 100+ tenants, use: instance caching (active tenants), database scaling (PostgreSQL replication), and monitoring. See ADR 002 (Known Limitations) for scaling guidance.
Q: What if a tenant hits the storage quota?
A: System enforces ResourceQuota (configurable per tenant). Exceeding quota returns 429 (Too Many Requests). Tenant admin receives alerts. See ADR 003 (ResourceQuota Model) and ADR 004 (Error Handling).
Q: Can we migrate from workspace without downtime?
A: Yes, with dual-write period:
- Deploy v1.5 (supports both models)
- Activate background migration job
- Verify all data migrated
- Remove workspace support Total downtime: 0 minutes. See ADR 007 (Migration Strategy).
📞 Getting Help
Questions about Architecture?
→ Review ADR 001, 006 or ask technical lead
Need Implementation Details?
→ See ADR 002 (phased approach) or ADR 003/004 (specs)
Security Concerns?
→ Review ADR 005 (threat model) or contact security team
Deployment/Operations?
→ See ADR 007 (deployment guide, troubleshooting)
Want to See Alternatives?
→ Review ADR 006 (5 alternatives with pros/cons)
Document Set Version: 1.0
Last Updated: 2025-11-20
Total Pages: ~4,000 lines across 7 documents
Status: ✅ Ready for Review and Implementation
Next Step: Schedule ADR review meeting with stakeholders