Raphael MANSUY fe9b8ec02a tests: stabilize integration tests + skip external services; fix multi-tenant API behavior and idempotency (#4)	2025-12-04 16:04:21 +08:00
001-multi-tenant-architecture-overview.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00
002-implementation-strategy.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00
003-data-models-and-storage.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00
004-api-design.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00
005-security-analysis.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00
006-architecture-diagrams-alternatives.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00
007-deployment-guide-quick-reference.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00
008-multi-tenant-testing-strategy.md	tests: stabilize integration tests + skip external services; fix multi-tenant API behavior and idempotency (#4 )	2025-12-04 16:04:21 +08:00
README.md	feat: Add multi-tenant architecture ADRs and deployment guide	2025-11-20 15:27:31 +08:00

README.md

LightRAG Multi-Tenant Architecture - Complete ADR Index

Document Overview

This collection of 7 Architecture Decision Records provides comprehensive guidance for implementing a multi-tenant, multi-knowledge-base system in LightRAG. All recommendations are grounded in actual codebase analysis and include detailed implementation specifications.

📋 Complete Document Index

ADR 001: Multi-Tenant Architecture Overview

Purpose: Establish the core architectural decision and rationale
Length: ~400 lines
Key Sections:

Current state analysis (single-instance, workspace-level isolation)
Architectural decision (multi-tenant with per-KB scoping)
Consequences (complexity, performance, security trade-offs)
Code evidence (6 direct references to existing patterns)
Alternative approaches evaluated (4 alternatives considered)

When to Read: First - understand why multi-tenant is necessary
For Roles: Architects, Tech Leads, Decision Makers
Decision Status: Proposed (Ready for stakeholder approval)

ADR 002: Implementation Strategy

Purpose: Detailed roadmap for implementation across 4 phases
Length: ~800 lines
Key Sections:

Phase 1 (2-3 weeks): Database schema, tenant models, core infrastructure
Phase 2 (2-3 weeks): API layer, tenant routing, permission checking
Phase 3 (1-2 weeks): LightRAG integration, instance caching, query modification
Phase 4 (1 week): Testing, migration, deployment
Configuration examples with real environment variables
Performance targets and success metrics
Known limitations and future work

Total Effort: ~160 developer hours across 4 weeks
When to Read: Second - use for sprint planning and task breakdown
For Roles: Engineering Leads, Project Managers, Developers
Implementation Detail: High-level code examples (not pseudo-code)

ADR 003: Data Models and Storage Design

Purpose: Complete specification of data models and storage schema
Length: ~700 lines
Key Sections:

Core data models with Python dataclass definitions
PostgreSQL schema with 8 tables, composite indexes, and migration scripts
Neo4j schema with Cypher examples
MongoDB/Vector DB schema with partition strategies
Access control lists and role-based permissions
Data validation rules and constraints
Backward compatibility mapping for workspace-to-tenant migration

When to Read: Before database migration work begins
For Roles: Database Engineers, Backend Developers
Schema Completeness: 100% (Production-ready SQL)

ADR 004: API Design and Routing

Purpose: Complete REST API specification for multi-tenant system
Length: ~900 lines
Key Sections:

API versioning and base URL structure (/api/v1/tenants/{tenant_id}/...)
Authentication mechanisms (JWT RS256, API keys with rotation)
Tenant management endpoints (CRUD operations)
Knowledge base endpoints (lifecycle management)
Document endpoints (upload, status, deletion)
Query endpoints (standard, streaming, with data)
Error handling with 8 error codes and examples
Rate limiting configuration per tenant
10+ cURL examples for all operations
OpenAPI/Swagger documentation structure

Endpoint Count: 30+ endpoints defined
When to Read: Before API development begins
For Roles: API Developers, Frontend Engineers, QA
Specification Completeness: 100% (Ready to implement)

ADR 005: Security Analysis and Mitigation

Purpose: Comprehensive security analysis with threat modeling
Length: ~900 lines
Key Sections:

Security principles (Zero Trust, Defense in Depth, Complete Mediation)
Threat model with 7 attack vectors:
1. Unauthorized cross-tenant access → Dependency injection validation
2. Authentication bypass → Strong JWT signature verification
3. Parameter injection/path traversal → UUID validation + parameterized queries
4. Information disclosure → Generic errors + log sanitization
5. DoS via resource exhaustion → Per-tenant rate limits
6. Data leakage via logs → Field redaction + PII hashing
7. Replay attacks → JTI tracking + idempotency keys
JWT security configuration (RS256 recommended)
API key security (bcrypt hashing, rotation policy)
CORS and TLS/HTTPS configuration
Audit logging structure with 14 event types
Vulnerability scanning strategy
Compliance considerations (GDPR, SOC 2, ISO 27001, HIPAA)
Security checklist with 13 verification items

When to Read: Before security implementation phase
For Roles: Security Engineers, Backend Developers, Compliance Officers
Threat Coverage: Comprehensive (All major attack vectors)
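
Threat 3 in the summary above pairs UUID validation with parameterized queries. The validation half could look roughly like the sketch below. The function name `parse_tenant_id` and the generic error message are illustrative assumptions, not taken from ADR 005.

```python
from uuid import UUID


def parse_tenant_id(raw: str) -> UUID:
    """Return a UUID if `raw` is a canonical UUID string, else raise ValueError.

    Hypothetical helper: the name and error text are assumptions for illustration.
    """
    try:
        value = UUID(raw)
    except (ValueError, AttributeError, TypeError):
        # Generic message so the response does not echo attacker input.
        raise ValueError("invalid tenant identifier") from None
    # uuid.UUID also accepts braces, a urn: prefix, and hyphen-less hex;
    # accept only the canonical hyphenated spelling.
    if str(value) != raw.lower():
        raise ValueError("invalid tenant identifier")
    return value
```

Rejecting non-canonical spellings means each tenant has exactly one valid string form, which keeps cache keys and log entries consistent. Path fragments such as `../../etc/passwd` fail at the parsing step.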

ADR 006: Architecture Diagrams and Alternatives

Purpose: Visual representation of architecture and detailed alternatives analysis
Length: ~700 lines
Key Sections:

Full system architecture ASCII diagram (6 layers)
Query execution flow diagram (10 steps)
Document upload flow diagram (7 steps)
5 alternative approaches with pros/cons:
1. Database per Tenant (Rejected: 100x cost, operational nightmare)
2. Server per Tenant (Rejected: Resource waste, uneconomical)
3. Workspace Rename (Rejected: No KB isolation, weak security)
4. Shared Single Instance (Rejected: Data isolation risk too high)
5. Sharding by Hash (Rejected: Complexity without sufficient benefit)
Comparison matrix showing why proposed approach wins
Risk assessment for each alternative

When to Read: For architectural validation and decision support
For Roles: Architects, Tech Leads, Stakeholders
Visualization Quality: High (ASCII diagrams suitable for documentation/slides)

ADR 007: Deployment Guide and Quick Reference

Purpose: Practical guide for deployment, testing, and operations
Length: ~800 lines
Key Sections:

Quick start for developers (setup, testing, manual testing)
Docker Compose configuration for complete stack
Environment variable reference
Backward compatibility and migration from workspace model
Monitoring and observability setup
Prometheus queries for key metrics
Rollout strategy (4-phase soft launch to production)
Troubleshooting guide with solutions
Success criteria checklist
Support resources and documentation index

When to Read: During deployment and operational phases
For Roles: DevOps Engineers, Operators, Support Teams
Operational Readiness: Complete (All runbooks provided)

🎯 Reading Paths by Role

👨‍💼 For Executives/Product Managers

Executive Summary (this document, sections below)
ADR 001 - Sections: Decision, Consequences, Alternatives
ADR 002 - Sections: Timeline, Effort, Success Metrics
ADR 007 - Sections: Rollout Strategy, Success Criteria

Time Investment: 45 minutes
Key Takeaway: What we're building, why it matters, and when it ships

🏗️ For Architects/Tech Leads

ADR 001 - Complete
ADR 006 - Complete (diagrams + alternatives)
ADR 003 - Sections: Core Models, Storage Strategy
ADR 002 - Sections: Phase Overview, Configuration
ADR 005 - Sections: Threat Model, Security Checklist

Time Investment: 3 hours
Key Takeaway: Complete architectural vision with design justification

👨‍💻 For Developers (API/Backend)

ADR 002 - Complete (detailed code examples)
ADR 004 - Complete (endpoint specifications)
ADR 003 - Sections: Core Models, PostgreSQL Schema
ADR 005 - Sections: Threat Mitigations (code-level)
ADR 007 - Sections: Quick Start, Testing

Time Investment: 6 hours
Key Takeaway: Exact code changes needed, APIs to implement, test strategy

🔐 For Security/DevOps

ADR 005 - Complete (threat model, mitigations, compliance)
ADR 007 - Complete (monitoring, troubleshooting)
ADR 004 - Sections: Authentication, Error Handling
ADR 002 - Sections: Configuration, Testing
ADR 001 - Sections: Consequences (security)

Time Investment: 4 hours
Key Takeaway: Security architecture, deployment checklist, monitoring strategy

📊 For Database Engineers

ADR 003 - Complete
ADR 002 - Sections: Phase 1 (Database changes)
ADR 001 - Sections: Current Architecture
ADR 005 - Sections: Parameter Injection Mitigation

Time Investment: 4 hours
Key Takeaway: Schema changes, migration scripts, storage isolation strategy

📌 Executive Summary

The Opportunity

LightRAG currently supports single-instance deployments with basic workspace-level isolation. To serve multiple organizations and knowledge domains (SaaS model), we need true multi-tenancy with knowledge base-level isolation.

The Decision

Implement multi-tenant architecture with multi-knowledge-base support using:

Tenant abstraction layer (UUID-based isolation)
Knowledge bases as first-class entities
Composite key strategy (tenant_id:kb_id:entity_id)
Storage layer automatic filtering (defense in depth)
Per-tenant RAG instance caching (performance optimization)
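
As a minimal sketch of the composite key strategy listed above, the helpers below build and parse keys of the form `tenant_id:kb_id:entity_id`. The names `make_key`, `parse_key`, and `CompositeKey` are hypothetical. The sketch also assumes that tenant and KB identifiers never contain a colon (true for UUIDs), while entity identifiers may.

```python
from typing import NamedTuple


class CompositeKey(NamedTuple):
    tenant_id: str
    kb_id: str
    entity_id: str


def make_key(tenant_id: str, kb_id: str, entity_id: str) -> str:
    """Join the three parts into 'tenant_id:kb_id:entity_id'."""
    for part in (tenant_id, kb_id):
        # A colon in the tenant or KB part would make the key ambiguous.
        if not part or ":" in part:
            raise ValueError(f"invalid key part: {part!r}")
    if not entity_id:
        raise ValueError("entity_id must not be empty")
    return f"{tenant_id}:{kb_id}:{entity_id}"


def parse_key(key: str) -> CompositeKey:
    """Split a composite key back into its three parts."""
    # Split on the first two colons only, so entity_id may contain ':'.
    parts = key.split(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed composite key: {key!r}")
    return CompositeKey(*parts)
```

Because the tenant and KB parts come first, all keys for one knowledge base share a common prefix, which suits prefix scans in key-value backends.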

Investment Required

Effort: ~160 developer-hours
Timeline: 4 weeks (4 phases, some overlapping)
Team Size: 4 developers + 1 tech lead
Infrastructure: Database migration, Redis for caching

Business Impact

Enables: Multi-customer SaaS model
Reduces: Per-customer hosting costs by 10-50x
Improves: Data isolation and security posture
Provides: RBAC and audit logging for compliance
Supports: Future expansion to 100+ concurrent tenants

Risk Assessment

Risk	Severity	Mitigation
Cross-tenant data access	Critical	Defense-in-depth filters + automated tests
Performance degradation	High	Instance caching, indexed queries, monitoring
Migration failures	Medium	Dual-write period, rollback plan, testing
Operational complexity	Medium	Comprehensive monitoring, runbooks, training

Success Metrics

✓ Functional: All API endpoints working with tenant isolation
✓ Security: Zero cross-tenant data access in production
✓ Performance: Query latency < 200ms p99, cache hit rate > 90%
✓ Operational: 99.5% uptime, <5min incident response time
✓ Business: Support 50+ active tenants on single instance

🚀 Quick Implementation Checklist

Pre-Implementation (Week 0)

Review all 7 ADRs with team (30-45 minutes)
Secure stakeholder approval
Create detailed Jira tickets from ADR 002
Set up development databases (PostgreSQL, Redis)
Brief security team on threat model (ADR 005)

Phase 1: Core Infrastructure (Week 1-2)

Create database schema (ADR 003)
Implement tenant models (dataclasses)
Create TenantService for CRUD
Add tenant/KB columns to storage base classes
Run unit tests on isolation

Phase 2: API Layer (Week 2-3)

Implement tenant routes (CRUD)
Implement KB routes (CRUD)
Create dependency injection for TenantContext
Update document/query routes with tenant filtering
Test with API examples from ADR 004

Phase 3: RAG Integration (Week 3)

Implement TenantRAGManager (instance caching)
Modify LightRAG.query() to accept tenant context
Modify LightRAG.insert() to accept tenant context
Set up monitoring (Prometheus metrics)
Run integration tests

Phase 4: Deployment (Week 4)

Run security audit against ADR 005 checklist
Run load tests with multiple tenants
Prepare migration script for existing workspaces
Deploy to staging (1 week soak test)
Deploy to production (4-phase rollout)
Run incident response drills

adr/
├── 001-multi-tenant-architecture-overview.md      [START HERE - Why]
├── 002-implementation-strategy.md                 [Then read - How & When]
├── 003-data-models-and-storage.md                [Reference - Database design]
├── 004-api-design.md                              [Reference - API specs]
├── 005-security-analysis.md                       [Reference - Security checklist]
├── 006-architecture-diagrams-alternatives.md     [Reference - Visual overview]
├── 007-deployment-guide-quick-reference.md       [Reference - Operations]
└── README.md                                      [This file - Navigation]

🔄 Decision Record Details

Aspect	Details
Decision	Multi-tenant, multi-KB architecture
Status	Proposed (Awaiting approval)
Stakeholders	Engineering, Security, Product, Operations
Effort Estimate	160 developer-hours over 4 weeks
Risk Level	Medium (Well-scoped, tested patterns)
Alternatives	5 considered, all rejected with justification (ADR 006)
Security Review	Required before Phase 1 start
Rollout Plan	4-phase soft launch (25%→50%→75%→100%)
Success Criteria	13 items in ADR 007
Contingency	2-week delay buffer, rollback to v1.0 if needed
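
The staged rollout in the table above (25%→50%→75%→100%) could be gated per tenant with a stable hash bucket, as in the sketch below. This assumes the percentages refer to tenants rather than traffic, which the table does not state. The function names are hypothetical.

```python
import hashlib

ROLLOUT_STAGES = (25, 50, 75, 100)


def rollout_bucket(tenant_id: str) -> int:
    """Map a tenant to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(tenant_id: str, stage_percent: int) -> bool:
    """True if the tenant falls inside the current rollout percentage."""
    return rollout_bucket(tenant_id) < stage_percent
```

Because the bucket depends only on the tenant ID, a tenant enabled at 25% stays enabled at every later stage.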

❓ Frequently Asked Questions

Q: Why multi-tenant and not just multi-workspace?

A: The current workspace model is implicit and lacks KB-level isolation. A multi-tenant design provides explicit isolation, RBAC, audit logging, and SaaS-readiness. See ADR 001 and ADR 006 (alternatives) for a detailed comparison.

Q: Will this break existing installations?

A: No. Legacy workspace deployments continue working: each one automatically becomes a tenant with a KB named "default". See ADR 003 (Backward Compatibility) for migration details.
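
One way to picture this mapping is the sketch below. It derives a deterministic tenant UUID from the legacy workspace name and pairs it with the "default" KB. The use of `uuid5`, the namespace string, and the names `TenantContext` and `tenant_for_workspace` are assumptions for illustration. ADR 003 defines the actual mapping.

```python
from dataclasses import dataclass
from uuid import NAMESPACE_URL, uuid5

# Hypothetical namespace, so derived IDs do not collide with other uuid5 uses.
LEGACY_NAMESPACE = uuid5(NAMESPACE_URL, "lightrag:legacy-workspace")


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    kb_name: str


def tenant_for_workspace(workspace: str) -> TenantContext:
    """Map a legacy workspace name to a tenant with a KB named 'default'."""
    # uuid5 is deterministic: the same workspace always yields the same tenant.
    tenant_id = str(uuid5(LEGACY_NAMESPACE, workspace))
    return TenantContext(tenant_id=tenant_id, kb_name="default")
```

A deterministic ID makes the migration safe to re-run, since repeating it for the same workspace cannot create a second tenant.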

Q: What's the performance impact?

A: Tenant filtering adds approximately 5-10% query latency, offset by instance caching (>90% hit rate). Net impact: negligible for most workloads. See ADR 002 (Performance Targets) for details.

Q: How do we ensure data isolation?

A: Defense in depth:

API Layer: TenantContext dependency validates token and extracts tenant_id
Storage Layer: All queries auto-filtered by WHERE tenant_id = ? AND kb_id = ?
Testing: Automated tests verify cross-tenant access is denied

See ADR 005 (Threat Model) for complete security analysis.
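
The storage-layer filter described in this answer can be sketched as follows. The example uses an in-memory SQLite database as a stand-in for PostgreSQL. The `documents` table layout and the `TenantScopedStore` class are hypothetical.

```python
import sqlite3


class TenantScopedStore:
    """Hypothetical wrapper: every read is filtered by tenant_id and kb_id."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str, kb_id: str):
        self._conn = conn
        self._tenant_id = tenant_id
        self._kb_id = kb_id

    def get_documents(self):
        # Parameterized query: the tenant scope is bound, never interpolated.
        rows = self._conn.execute(
            "SELECT doc_id, content FROM documents "
            "WHERE tenant_id = ? AND kb_id = ? ORDER BY doc_id",
            (self._tenant_id, self._kb_id),
        )
        return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (tenant_id TEXT, kb_id TEXT, doc_id TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("tenant-a", "kb-1", "d1", "alpha"),
        ("tenant-a", "kb-2", "d2", "beta"),
        ("tenant-b", "kb-1", "d3", "gamma"),
    ],
)

store_a = TenantScopedStore(conn, "tenant-a", "kb-1")
store_b = TenantScopedStore(conn, "tenant-b", "kb-1")
```

Callers never pass a tenant ID to `get_documents`, so application code that forgets to filter still cannot read another tenant's rows.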

Q: Can we support 100+ tenants on one instance?

A: Yes. The architecture supports ~100 concurrently cached instances (configurable). For 100+ tenants, combine instance caching for active tenants, database scaling (PostgreSQL replication), and monitoring. See ADR 002 (Known Limitations) for scaling guidance.
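
A bounded cache of per-tenant instances could be sketched as a small LRU structure like the one below. The class name `InstanceCache`, the `(tenant_id, kb_id)` key, and the factory callback are assumptions, not the interface of the TenantRAGManager described in ADR 002. The sketch is not thread-safe or async-aware.

```python
from collections import OrderedDict
from typing import Callable, Generic, Tuple, TypeVar

T = TypeVar("T")


class InstanceCache(Generic[T]):
    """Hypothetical LRU cache of instances keyed by (tenant_id, kb_id)."""

    def __init__(self, factory: Callable[[str, str], T], max_instances: int = 100):
        self._factory = factory
        self._max = max_instances
        self._items: "OrderedDict[Tuple[str, str], T]" = OrderedDict()

    def get(self, tenant_id: str, kb_id: str) -> T:
        key = (tenant_id, kb_id)
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        instance = self._factory(tenant_id, kb_id)
        self._items[key] = instance
        if len(self._items) > self._max:
            self._items.popitem(last=False)  # evict the least recently used
        return instance

    def __len__(self) -> int:
        return len(self._items)

    def __contains__(self, key: object) -> bool:
        return key in self._items
```

With `max_instances=100`, the 101st distinct tenant/KB pair evicts whichever instance has gone longest without a request. Inactive tenants therefore hold no cached instance, while active ones stay warm.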

Q: What if a tenant hits the storage quota?

A: The system enforces a ResourceQuota (configurable per tenant). Exceeding the quota returns 429 (Too Many Requests), and the tenant admin receives an alert. See ADR 003 (ResourceQuota Model) and ADR 004 (Error Handling).
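
A quota check of this kind might look like the sketch below. The field `max_storage_bytes`, the `QuotaExceeded` exception, and the `check_storage_quota` function are illustrative assumptions. ADR 003 specifies the real ResourceQuota model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceQuota:
    # Illustrative field; the actual model is defined in ADR 003.
    max_storage_bytes: int


class QuotaExceeded(Exception):
    """Raised when a request would exceed the tenant's quota."""

    status_code = 429  # Too Many Requests, per the answer above


def check_storage_quota(quota: ResourceQuota, used_bytes: int, upload_bytes: int) -> None:
    """Raise QuotaExceeded if the upload would push usage past the quota."""
    projected = used_bytes + upload_bytes
    if projected > quota.max_storage_bytes:
        raise QuotaExceeded(
            f"storage quota exceeded: {projected} > {quota.max_storage_bytes} bytes"
        )
```

Checking the projected total before accepting an upload means a tenant can fill its quota exactly but never exceed it.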

Q: Can we migrate from workspace without downtime?

A: Yes, with dual-write period:

1. Deploy v1.5 (supports both models)
2. Activate background migration job
3. Verify all data migrated
4. Remove workspace support

Total downtime: 0 minutes. See ADR 007 (Migration Strategy).

📞 Getting Help

Questions about Architecture?
→ Review ADR 001, 006 or ask technical lead

Need Implementation Details?
→ See ADR 002 (phased approach) or ADR 003/004 (specs)

Security Concerns?
→ Review ADR 005 (threat model) or contact security team

Deployment/Operations?
→ See ADR 007 (deployment guide, troubleshooting)

Want to See Alternatives?
→ Review ADR 006 (5 alternatives with pros/cons)

Document Set Version: 1.0
Last Updated: 2025-11-20
Total Length: ~5,200 lines across 7 documents
Status: ✅ Ready for Review and Implementation
Next Step: Schedule ADR review meeting with stakeholders