* Remove outdated documentation files: Quick Start Guide, Apache AGE Analysis, and Scratchpad. * Add multi-tenant testing strategy and ADR index documentation - Introduced ADR 008 detailing the multi-tenant testing strategy for the ./starter environment, covering compatibility and multi-tenant modes, testing scenarios, and implementation details. - Created a comprehensive ADR index (README.md) summarizing all architecture decision records related to the multi-tenant implementation, including purpose, key sections, and reading paths for different roles. * feat(docs): Add comprehensive multi-tenancy guide and README for LightRAG Enterprise - Introduced `0008-multi-tenancy.md` detailing multi-tenancy architecture, key concepts, roles, permissions, configuration, and API endpoints. - Created `README.md` as the main documentation index, outlining features, quick start, system overview, and deployment options. - Documented the LightRAG architecture, storage backends, LLM integrations, and query modes. - Established a task log (`2025-01-21-lightrag-documentation-log.md`) summarizing documentation creation actions, decisions, and insights.
389 lines
17 KiB
Markdown
389 lines
17 KiB
Markdown
# LightRAG Multi-Tenant Architecture - Complete ADR Index
|
|
|
|
## Document Overview
|
|
|
|
This collection of 7 Architecture Decision Records provides comprehensive guidance for implementing a multi-tenant, multi-knowledge-base system in LightRAG. All recommendations are grounded in actual codebase analysis and include detailed implementation specifications.
|
|
|
|
---
|
|
|
|
## 📋 Complete Document Index
|
|
|
|
### [ADR 001: Multi-Tenant Architecture Overview](./001-multi-tenant-architecture-overview.md)
|
|
**Purpose**: Establish the core architectural decision and rationale
|
|
**Length**: ~400 lines
|
|
**Key Sections**:
|
|
- Current state analysis (single-instance, workspace-level isolation)
|
|
- Architectural decision (multi-tenant with per-KB scoping)
|
|
- Consequences (complexity, performance, security trade-offs)
|
|
- Code evidence (6 direct references to existing patterns)
|
|
- Alternative approaches evaluated (4 alternatives considered)
|
|
|
|
**When to Read**: First - understand why multi-tenant is necessary
|
|
**For Roles**: Architects, Tech Leads, Decision Makers
|
|
**Decision Status**: **Proposed** (Ready for stakeholder approval)
|
|
|
|
---
|
|
|
|
### [ADR 002: Implementation Strategy](./002-implementation-strategy.md)
|
|
**Purpose**: Detailed roadmap for implementation across 4 phases
|
|
**Length**: ~800 lines
|
|
**Key Sections**:
|
|
- **Phase 1** (2-3 weeks): Database schema, tenant models, core infrastructure
|
|
- **Phase 2** (2-3 weeks): API layer, tenant routing, permission checking
|
|
- **Phase 3** (1-2 weeks): LightRAG integration, instance caching, query modification
|
|
- **Phase 4** (1 week): Testing, migration, deployment
|
|
- Configuration examples with real environment variables
|
|
- Performance targets and success metrics
|
|
- Known limitations and future work
|
|
|
|
**Total Effort**: ~160 developer hours across 4 weeks
|
|
**When to Read**: Second - use for sprint planning and task breakdown
|
|
**For Roles**: Engineering Leads, Project Managers, Developers
|
|
**Implementation Detail**: **High-level code examples** (not pseudo-code)
|
|
|
|
---
|
|
|
|
### [ADR 003: Data Models and Storage Design](./003-data-models-and-storage.md)
|
|
**Purpose**: Complete specification of data models and storage schema
|
|
**Length**: ~700 lines
|
|
**Key Sections**:
|
|
- Core data models with Python dataclass definitions
|
|
- PostgreSQL schema with 8 tables, composite indexes, and migration scripts
|
|
- Neo4j schema with Cypher examples
|
|
- MongoDB/Vector DB schema with partition strategies
|
|
- Access control lists and role-based permissions
|
|
- Data validation rules and constraints
|
|
- Backward compatibility mapping for workspace-to-tenant migration
|
|
|
|
**When to Read**: Before database migration work begins
|
|
**For Roles**: Database Engineers, Backend Developers
|
|
**Schema Completeness**: **100%** (Production-ready SQL)
|
|
|
|
---
|
|
|
|
### [ADR 004: API Design and Routing](./004-api-design.md)
|
|
**Purpose**: Complete REST API specification for multi-tenant system
|
|
**Length**: ~900 lines
|
|
**Key Sections**:
|
|
- API versioning and base URL structure (`/api/v1/tenants/{tenant_id}/...`)
|
|
- Authentication mechanisms (JWT RS256, API keys with rotation)
|
|
- Tenant management endpoints (CRUD operations)
|
|
- Knowledge base endpoints (lifecycle management)
|
|
- Document endpoints (upload, status, deletion)
|
|
- Query endpoints (standard, streaming, with data)
|
|
- Error handling with 8 error codes and examples
|
|
- Rate limiting configuration per tenant
|
|
- 10+ cURL examples for all operations
|
|
- OpenAPI/Swagger documentation structure
|
|
|
|
**Endpoint Count**: 30+ endpoints defined
|
|
**When to Read**: Before API development begins
|
|
**For Roles**: API Developers, Frontend Engineers, QA
|
|
**Specification Completeness**: **100%** (Ready to implement)
|
|
|
|
---
|
|
|
|
### [ADR 005: Security Analysis and Mitigation](./005-security-analysis.md)
|
|
**Purpose**: Comprehensive security analysis with threat modeling
|
|
**Length**: ~900 lines
|
|
**Key Sections**:
|
|
- Security principles (Zero Trust, Defense in Depth, Complete Mediation)
|
|
- Threat model with 7 attack vectors:
|
|
1. Unauthorized cross-tenant access → Dependency injection validation
|
|
2. Authentication bypass → Strong JWT signature verification
|
|
3. Parameter injection/path traversal → UUID validation + parameterized queries
|
|
4. Information disclosure → Generic errors + log sanitization
|
|
5. DoS via resource exhaustion → Per-tenant rate limits
|
|
6. Data leakage via logs → Field redaction + PII hashing
|
|
7. Replay attacks → JTI tracking + idempotency keys
|
|
- JWT security configuration (RS256 recommended)
|
|
- API key security (bcrypt hashing, rotation policy)
|
|
- CORS and TLS/HTTPS configuration
|
|
- Audit logging structure with 14 event types
|
|
- Vulnerability scanning strategy
|
|
- Compliance considerations (GDPR, SOC 2, ISO 27001, HIPAA)
|
|
- Security checklist with 13 verification items
|
|
|
|
**When to Read**: Before security implementation phase
|
|
**For Roles**: Security Engineers, Backend Developers, Compliance Officers
|
|
**Threat Coverage**: **Comprehensive** (All major attack vectors)
|
|
|
|
---
|
|
|
|
### [ADR 006: Architecture Diagrams and Alternatives](./006-architecture-diagrams-alternatives.md)
|
|
**Purpose**: Visual representation of architecture and detailed alternatives analysis
|
|
**Length**: ~700 lines
|
|
**Key Sections**:
|
|
- Full system architecture ASCII diagram (6 layers)
|
|
- Query execution flow diagram (10 steps)
|
|
- Document upload flow diagram (7 steps)
|
|
- 5 alternative approaches with pros/cons:
|
|
1. Database per Tenant (Rejected: 100x cost, operational nightmare)
|
|
2. Server per Tenant (Rejected: Resource waste, uneconomical)
|
|
3. Workspace Rename (Rejected: No KB isolation, weak security)
|
|
4. Shared Single Instance (Rejected: Data isolation risk too high)
|
|
5. Sharding by Hash (Rejected: Complexity without sufficient benefit)
|
|
- Comparison matrix showing why proposed approach wins
|
|
- Risk assessment for each alternative
|
|
|
|
**When to Read**: For architectural validation and decision support
|
|
**For Roles**: Architects, Tech Leads, Stakeholders
|
|
**Visualization Quality**: **High** (ASCII diagrams suitable for documentation/slides)
|
|
|
|
---
|
|
|
|
### [ADR 007: Deployment Guide and Quick Reference](./007-deployment-guide-quick-reference.md)
|
|
**Purpose**: Practical guide for deployment, testing, and operations
|
|
**Length**: ~800 lines
|
|
**Key Sections**:
|
|
- Quick start for developers (setup, testing, manual testing)
|
|
- Docker Compose configuration for complete stack
|
|
- Environment variable reference
|
|
- Backward compatibility and migration from workspace model
|
|
- Monitoring and observability setup
|
|
- Prometheus queries for key metrics
|
|
- Rollout strategy (4-phase soft launch to production)
|
|
- Troubleshooting guide with solutions
|
|
- Success criteria checklist
|
|
- Support resources and documentation index
|
|
|
|
**When to Read**: During deployment and operational phases
|
|
**For Roles**: DevOps Engineers, Operators, Support Teams
|
|
**Operational Readiness**: **Complete** (All runbooks provided)
|
|
|
|
---
|
|
|
|
## 🎯 Reading Paths by Role
|
|
|
|
### 👨💼 For Executives/Product Managers
|
|
1. **Executive Summary** (this document, sections below)
|
|
2. [ADR 001](./001-multi-tenant-architecture-overview.md) - Sections: Decision, Consequences, Alternatives
|
|
3. [ADR 002](./002-implementation-strategy.md) - Sections: Timeline, Effort, Success Metrics
|
|
4. [ADR 007](./007-deployment-guide-quick-reference.md) - Sections: Rollout Strategy, Success Criteria
|
|
|
|
**Time Investment**: 45 minutes
|
|
**Key Takeaway**: What we're building, why it matters, and when it ships
|
|
|
|
---
|
|
|
|
### 🏗️ For Architects/Tech Leads
|
|
1. [ADR 001](./001-multi-tenant-architecture-overview.md) - Complete
|
|
2. [ADR 006](./006-architecture-diagrams-alternatives.md) - Complete (diagrams + alternatives)
|
|
3. [ADR 003](./003-data-models-and-storage.md) - Sections: Core Models, Storage Strategy
|
|
4. [ADR 002](./002-implementation-strategy.md) - Sections: Phase Overview, Configuration
|
|
5. [ADR 005](./005-security-analysis.md) - Sections: Threat Model, Security Checklist
|
|
|
|
**Time Investment**: 3 hours
|
|
**Key Takeaway**: Complete architectural vision with design justification
|
|
|
|
---
|
|
|
|
### 👨💻 For Developers (API/Backend)
|
|
1. [ADR 002](./002-implementation-strategy.md) - Complete (detailed code examples)
|
|
2. [ADR 004](./004-api-design.md) - Complete (endpoint specifications)
|
|
3. [ADR 003](./003-data-models-and-storage.md) - Sections: Core Models, PostgreSQL Schema
|
|
5. [ADR 005](./005-security-analysis.md) - Sections: Threat Mitigations (code-level)
|
|
6. [ADR 007](./007-deployment-guide-quick-reference.md) - Sections: Quick Start, Testing
|
|
|
|
**Time Investment**: 6 hours
|
|
**Key Takeaway**: Exact code changes needed, APIs to implement, test strategy
|
|
|
|
---
|
|
|
|
### 🔐 For Security/DevOps
|
|
1. [ADR 005](./005-security-analysis.md) - Complete (threat model, mitigations, compliance)
|
|
2. [ADR 007](./007-deployment-guide-quick-reference.md) - Complete (monitoring, troubleshooting)
|
|
3. [ADR 004](./004-api-design.md) - Sections: Authentication, Error Handling
|
|
4. [ADR 002](./002-implementation-strategy.md) - Sections: Configuration, Testing
|
|
5. [ADR 001](./001-multi-tenant-architecture-overview.md) - Sections: Consequences (security)
|
|
|
|
**Time Investment**: 4 hours
|
|
**Key Takeaway**: Security architecture, deployment checklist, monitoring strategy
|
|
|
|
---
|
|
|
|
### 📊 For Database Engineers
|
|
1. [ADR 003](./003-data-models-and-storage.md) - Complete
|
|
2. [ADR 002](./002-implementation-strategy.md) - Sections: Phase 1 (Database changes)
|
|
3. [ADR 001](./001-multi-tenant-architecture-overview.md) - Sections: Current Architecture
|
|
4. [ADR 005](./005-security-analysis.md) - Sections: Parameter Injection Mitigation
|
|
|
|
**Time Investment**: 4 hours
|
|
**Key Takeaway**: Schema changes, migration scripts, storage isolation strategy
|
|
|
|
---
|
|
|
|
## 📌 Executive Summary
|
|
|
|
### The Opportunity
|
|
LightRAG currently supports single-instance deployments with basic workspace-level isolation. To serve multiple organizations and knowledge domains (SaaS model), we need true multi-tenancy with knowledge base-level isolation.
|
|
|
|
### The Decision
|
|
Implement **multi-tenant architecture with multi-knowledge-base support** using:
|
|
- Tenant abstraction layer (UUID-based isolation)
|
|
- Knowledge bases as first-class entities
|
|
- Composite key strategy (`tenant_id:kb_id:entity_id`)
|
|
- Storage layer automatic filtering (defense in depth)
|
|
- Per-tenant RAG instance caching (performance optimization)
|
|
|
|
### Investment Required
|
|
- **Effort**: ~160 developer-hours
|
|
- **Timeline**: 4 weeks (1 week per phase)
|
|
- **Team Size**: 4 developers + 1 tech lead
|
|
- **Infrastructure**: Database migration, Redis for caching
|
|
|
|
### Business Impact
|
|
- **Enables**: Multi-customer SaaS model
|
|
- **Reduces**: Per-customer hosting costs by 10-50x
|
|
- **Improves**: Data isolation and security posture
|
|
- **Provides**: RBAC and audit logging for compliance
|
|
- **Supports**: Future expansion to 100+ concurrent tenants
|
|
|
|
### Risk Assessment
|
|
| Risk | Severity | Mitigation |
|
|
|------|----------|-----------|
|
|
| Cross-tenant data access | **Critical** | Defense-in-depth filters + automated tests |
|
|
| Performance degradation | **High** | Instance caching, indexed queries, monitoring |
|
|
| Migration failures | **Medium** | Dual-write period, rollback plan, testing |
|
|
| Operational complexity | **Medium** | Comprehensive monitoring, runbooks, training |
|
|
|
|
### Success Metrics
|
|
✓ **Functional**: All API endpoints working with tenant isolation
|
|
✓ **Security**: Zero cross-tenant data access in production
|
|
✓ **Performance**: Query latency < 200ms p99, cache hit rate > 90%
|
|
✓ **Operational**: 99.5% uptime, <5min incident response time
|
|
✓ **Business**: Support 50+ active tenants on single instance
|
|
|
|
---
|
|
|
|
## 🚀 Quick Implementation Checklist
|
|
|
|
### Pre-Implementation (Week 0)
|
|
- [ ] Review all 7 ADRs with team (30-45 minutes)
|
|
- [ ] Secure stakeholder approval
|
|
- [ ] Create detailed Jira tickets from ADR 002
|
|
- [ ] Set up development databases (PostgreSQL, Redis)
|
|
- [ ] Brief security team on threat model (ADR 005)
|
|
|
|
### Phase 1: Core Infrastructure (Week 1-2)
|
|
- [ ] Create database schema (ADR 003)
|
|
- [ ] Implement tenant models (dataclasses)
|
|
- [ ] Create TenantService for CRUD
|
|
- [ ] Add tenant/KB columns to storage base classes
|
|
- [ ] Run unit tests on isolation
|
|
|
|
### Phase 2: API Layer (Week 2-3)
|
|
- [ ] Implement tenant routes (CRUD)
|
|
- [ ] Implement KB routes (CRUD)
|
|
- [ ] Create dependency injection for TenantContext
|
|
- [ ] Update document/query routes with tenant filtering
|
|
- [ ] Test with API examples from ADR 004
|
|
|
|
### Phase 3: RAG Integration (Week 3)
|
|
- [ ] Implement TenantRAGManager (instance caching)
|
|
- [ ] Modify LightRAG.query() to accept tenant context
|
|
- [ ] Modify LightRAG.insert() to accept tenant context
|
|
- [ ] Set up monitoring (Prometheus metrics)
|
|
- [ ] Run integration tests
|
|
|
|
### Phase 4: Deployment (Week 4)
|
|
- [ ] Run security audit against ADR 005 checklist
|
|
- [ ] Run load tests with multiple tenants
|
|
- [ ] Prepare migration script for existing workspaces
|
|
- [ ] Deploy to staging (1 week soak test)
|
|
- [ ] Deploy to production (4-phase rollout)
|
|
- [ ] Run incident response drills
|
|
|
|
---
|
|
|
|
## 📚 Document Navigation
|
|
|
|
```
|
|
adr/
|
|
├── 001-multi-tenant-architecture-overview.md [START HERE - Why]
|
|
├── 002-implementation-strategy.md [Then read - How & When]
|
|
├── 003-data-models-and-storage.md [Reference - Database design]
|
|
├── 004-api-design.md [Reference - API specs]
|
|
├── 005-security-analysis.md [Reference - Security checklist]
|
|
├── 006-architecture-diagrams-alternatives.md [Reference - Visual overview]
|
|
├── 007-deployment-guide-quick-reference.md [Reference - Operations]
|
|
└── README.md [This file - Navigation]
|
|
```
|
|
|
|
---
|
|
|
|
## 🔄 Decision Record Details
|
|
|
|
| Aspect | Details |
|
|
|--------|---------|
|
|
| **Decision** | Multi-tenant, multi-KB architecture |
|
|
| **Status** | Proposed (Awaiting approval) |
|
|
| **Stakeholders** | Engineering, Security, Product, Operations |
|
|
| **Effort Estimate** | 160 developer-hours over 4 weeks |
|
|
| **Risk Level** | Medium (Well-scoped, tested patterns) |
|
|
| **Alternatives** | 5 considered, 4 rejected with justification |
|
|
| **Security Review** | Required before Phase 1 start |
|
|
| **Rollout Plan** | 4-phase soft launch (25%→50%→75%→100%) |
|
|
| **Success Criteria** | 13 items in ADR 007 |
|
|
| **Contingency** | 2-week delay buffer, rollback to v1.0 if needed |
|
|
|
|
---
|
|
|
|
## ❓ Frequently Asked Questions
|
|
|
|
### Q: Why multi-tenant and not just multi-workspace?
|
|
**A**: Current workspace is implicit and lacks KB-level isolation. Multi-tenant provides explicit isolation, RBAC, audit logging, and SaaS-readiness. See ADR 001 and ADR 006 (alternatives) for detailed comparison.
|
|
|
|
### Q: Will this break existing installations?
|
|
**A**: No. Legacy workspace deployments continue working - they automatically become a tenant with KB named "default". See ADR 003 (Backward Compatibility) for migration details.
|
|
|
|
### Q: What's the performance impact?
|
|
**A**: Approximately 5-10% latency overhead (tenant filtering in queries) offset by instance caching (>90% hit rate). Net impact: negligible for most workloads. See ADR 002 (Performance Targets) for details.
|
|
|
|
### Q: How do we ensure data isolation?
|
|
**A**: Defense in depth:
|
|
1. **API Layer**: TenantContext dependency validates token and extracts tenant_id
|
|
2. **Storage Layer**: All queries auto-filtered by `WHERE tenant_id = ? AND kb_id = ?`
|
|
3. **Testing**: Automated tests verify cross-tenant access is denied
|
|
See ADR 005 (Threat Model) for complete security analysis.
|
|
|
|
### Q: Can we support 100+ tenants on one instance?
|
|
**A**: Yes. Architecture supports ~100 concurrent cached instances (configurable). For 100+ tenants, use: instance caching (active tenants), database scaling (PostgreSQL replication), and monitoring. See ADR 002 (Known Limitations) for scaling guidance.
|
|
|
|
### Q: What if a tenant hits the storage quota?
|
|
**A**: System enforces ResourceQuota (configurable per tenant). Exceeding quota returns 429 (Too Many Requests). Tenant admin receives alerts. See ADR 003 (ResourceQuota Model) and ADR 004 (Error Handling).
|
|
|
|
### Q: Can we migrate from workspace without downtime?
|
|
**A**: Yes, with dual-write period:
|
|
1. Deploy v1.5 (supports both models)
|
|
2. Activate background migration job
|
|
3. Verify all data migrated
|
|
4. Remove workspace support
|
|
Total downtime: 0 minutes. See ADR 007 (Migration Strategy).
|
|
|
|
---
|
|
|
|
## 📞 Getting Help
|
|
|
|
**Questions about Architecture?**
|
|
→ Review ADR 001, 006 or ask technical lead
|
|
|
|
**Need Implementation Details?**
|
|
→ See ADR 002 (phased approach) or ADR 003/004 (specs)
|
|
|
|
**Security Concerns?**
|
|
→ Review ADR 005 (threat model) or contact security team
|
|
|
|
**Deployment/Operations?**
|
|
→ See ADR 007 (deployment guide, troubleshooting)
|
|
|
|
**Want to See Alternatives?**
|
|
→ Review ADR 006 (5 alternatives with pros/cons)
|
|
|
|
---
|
|
|
|
**Document Set Version**: 1.0
|
|
**Last Updated**: 2025-11-20
|
|
**Total Pages**: ~4,000 lines across 7 documents
|
|
**Status**: ✅ Ready for Review and Implementation
|
|
**Next Step**: Schedule ADR review meeting with stakeholders
|