LightRAG/scripts/migrate_workspace_to_tenant.py
Raphael MANSUY fe9b8ec02a
tests: stabilize integration tests + skip external services; fix multi-tenant API behavior and idempotency (#4)
* feat: Implement multi-tenant architecture with tenant and knowledge base models

- Added data models for tenants, knowledge bases, and related configurations.
- Introduced role and permission management for users in the multi-tenant system.
- Created a service layer for managing tenants and knowledge bases, including CRUD operations.
- Developed a tenant-aware instance manager for LightRAG with caching and isolation features.
- Added a migration script to transition existing workspace-based deployments to the new multi-tenant architecture.

* chore: ignore lightrag/api/webui/assets/ directory

* chore: stop tracking lightrag/api/webui/assets (ignore in .gitignore)

* feat: Initialize LightRAG Multi-Tenant Stack with PostgreSQL

- Added README.md for project overview, setup instructions, and architecture details.
- Created docker-compose.yml to define services: PostgreSQL, Redis, LightRAG API, and Web UI.
- Introduced env.example for environment variable configuration.
- Implemented init-postgres.sql for PostgreSQL schema initialization with multi-tenant support.
- Added reproduce_issue.py for testing default tenant access via API.

* feat: Enhance TenantSelector and update related components for improved multi-tenant support

* feat: Enhance testing capabilities and update documentation

- Updated Makefile to include new test commands for various modes (compatibility, isolation, multi-tenant, security, coverage, and dry-run).
- Modified API health check endpoint in Makefile to reflect new port configuration.
- Updated QUICK_START.md and README.md to reflect changes in service URLs and ports.
- Added environment variables for testing modes in env.example.
- Introduced run_all_tests.sh script to automate testing across different modes.
- Created conftest.py for pytest configuration, including database fixtures and mock services.
- Implemented database helper functions for streamlined database operations in tests.
- Added test collection hooks to skip tests based on the current MULTITENANT_MODE.

* feat: Implement multi-tenant support with demo mode enabled by default

- Added multi-tenant configuration to the environment and Docker setup.
- Created pre-configured demo tenants (acme-corp and techstart) for testing.
- Updated API endpoints to support tenant-specific data access.
- Enhanced Makefile commands for better service management and database operations.
- Introduced user-tenant membership system with role-based access control.
- Added comprehensive documentation for multi-tenant setup and usage.
- Fixed issues with document visibility in multi-tenant environments.
- Implemented necessary database migrations for user memberships and legacy support.

* feat(audit): Add final audit report for multi-tenant implementation

- Documented overall assessment, architecture overview, test results, security findings, and recommendations.
- Included detailed findings on critical security issues and architectural concerns.

fix(security): Implement security fixes based on audit findings

- Removed global RAG fallback and enforced strict tenant context.
- Configured super-admin access and required user authentication for tenant access.
- Cleared localStorage on logout and improved error handling in WebUI.

chore(logs): Create task logs for audit and security fixes implementation

- Documented actions, decisions, and next steps for both audit and security fixes.
- Summarized test results and remaining recommendations.

chore(scripts): Enhance development stack management scripts

- Added scripts for cleaning, starting, and stopping the development stack.
- Improved output messages and ensured graceful shutdown of services.

feat(starter): Initialize PostgreSQL with AGE extension support

- Created initialization scripts for PostgreSQL extensions including uuid-ossp, vector, and AGE.
- Ensured successful installation and verification of extensions.

* feat: Implement auto-select for first tenant and KB on initial load in WebUI

- Removed WEBUI_INITIAL_STATE_FIX.md as the issue is resolved.
- Added useTenantInitialization hook to automatically select the first available tenant and KB on app load.
- Integrated the new hook into the Root component of the WebUI.
- Updated RetrievalTesting component to ensure a KB is selected before allowing user interaction.
- Created end-to-end tests for multi-tenant isolation and real service interactions.
- Added scripts for starting, stopping, and cleaning the development stack.
- Enhanced API and tenant routes to support tenant-specific pipeline status initialization.
- Updated constants for backend URL to reflect the correct port.
- Improved error handling and logging in various components.

* feat: Add multi-tenant support with enhanced E2E testing scripts and client functionality

* update client

* Add integration and unit tests for multi-tenant API, models, security, and storage

- Implement integration tests for tenant and knowledge base management endpoints in `test_tenant_api_routes.py`.
- Create unit tests for tenant isolation, model validation, and role permissions in `test_tenant_models.py`.
- Add security tests to enforce role-based permissions and context validation in `test_tenant_security.py`.
- Develop tests for tenant-aware storage operations and context isolation in `test_tenant_storage_phase3.py`.

* feat(e2e): Implement OpenAI model support and database reset functionality

* Add comprehensive test suite for gpt-5-nano compatibility

- Introduced tests for parameter normalization, embeddings, and entity extraction.
- Implemented direct API testing for gpt-5-nano.
- Validated .env configuration loading and OpenAI API connectivity.
- Analyzed reasoning token overhead with various token limits.
- Documented test procedures and expected outcomes in README files.
- Ensured all tests pass for production readiness.

* kg(postgres_impl): ensure AGE extension is loaded in session and configure graph initialization

* dev: add hybrid dev helper scripts, Makefile, docker-compose.dev-db and local development docs

* feat(dev): add dev helper scripts and local development documentation for hybrid setup

* feat(multi-tenant): add detailed specifications and logs for multi-tenant improvements, including UX, backend handling, and ingestion pipeline

* feat(migration): add generated tenant/kb columns, indexes, triggers; drop unused tables; update schema and docs

* test(backward-compat): adapt tests to new StorageNameSpace/TenantService APIs (use concrete dummy storages)

* chore: multi-tenant and UX updates — docs, webui, storage, tenant service adjustments

* tests: stabilize integration tests + skip external services; fix multi-tenant API behavior and idempotency

- gpt5_nano_compatibility: add pytest-asyncio markers, skip when OPENAI key missing, prevent module-level asyncio.run collection, add conftest
- Ollama tests: add server availability check and skip markers; avoid pytest collection warnings by renaming helper classes
- Graph storage tests: rename interactive test functions to avoid pytest collection
- Document & Tenant routes: support external_ids for idempotency; ensure HTTPExceptions are re-raised
- LightRAG core: support external_ids in apipeline_enqueue_documents and idempotent logic
- Tests updated to match API changes (tenant routes & document routes)
- Add logs and scripts for inspection and audit
2025-12-04 16:04:21 +08:00


#!/usr/bin/env python
"""
Workspace-to-Tenant Migration Script

Migrates existing single-tenant workspace-based deployments to multi-tenant architecture.

This script:
1. Scans existing workspace directories
2. Creates a default tenant for each workspace
3. Creates a default knowledge base within each tenant
4. Preserves all existing data structure for backward compatibility

Usage:
    python migrate_workspace_to_tenant.py --working-dir /path/to/rag_storage
    python migrate_workspace_to_tenant.py --working-dir /path/to/rag_storage --dry-run
    python migrate_workspace_to_tenant.py --working-dir /path/to/rag_storage --skip-backup
"""
import asyncio
import argparse
import os
import sys
import shutil
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional

from lightrag.services.tenant_service import TenantService
from lightrag.models.tenant import Tenant, TenantConfig
from lightrag.utils import logger


class WorkspaceToTenantMigrator:
    """
    Handles migration from workspace-based to multi-tenant architecture.
    """

    def __init__(self, working_dir: str, dry_run: bool = False, backup: bool = True):
        """
        Initialize the migrator.

        Args:
            working_dir: Root directory containing workspace folders
            dry_run: If True, simulate migration without making changes
            backup: If True, create backup before migration
        """
        self.working_dir = Path(working_dir)
        self.dry_run = dry_run
        self.backup = backup
        self.tenant_service = TenantService()
        self.migration_log: List[str] = []
        self.error_log: List[str] = []

    def validate_working_dir(self) -> bool:
        """Validate that working directory exists."""
        if not self.working_dir.exists():
            self.error_log.append(f"Working directory does not exist: {self.working_dir}")
            return False
        if not self.working_dir.is_dir():
            self.error_log.append(f"Path is not a directory: {self.working_dir}")
            return False
        return True

    def discover_workspaces(self) -> List[str]:
        """
        Discover existing workspace directories.

        A subdirectory counts as a workspace if its name starts with
        "workspace_" or if it contains one of these RAG storage files:
        - kv_store_{full_docs,text_chunks,entities,relations}.json
        - doc_status_storage.json

        Returns:
            Sorted list of workspace directory names
        """
        workspaces = []
        for item in self.working_dir.iterdir():
            if not item.is_dir():
                continue
            # Skip hidden and special directories (e.g. .git, __pycache__)
            if item.name.startswith(('.', '__')):
                continue
            # Check if directory contains RAG storage files
            has_rag_files = any(
                (item / f"kv_store_{name}.json").exists()
                for name in ["full_docs", "text_chunks", "entities", "relations"]
            ) or (item / "doc_status_storage.json").exists()
            if has_rag_files or item.name.startswith("workspace_"):
                workspaces.append(item.name)
        return sorted(workspaces)

    def backup_working_dir(self) -> Optional[Path]:
        """
        Create a backup of the working directory.

        Returns:
            Path to backup directory, or None if backup failed
        """
        if not self.backup:
            return None
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        backup_dir = self.working_dir.parent / f"{self.working_dir.name}_backup_{timestamp}"
        try:
            msg = f"Creating backup at {backup_dir}"
            logger.info(msg)
            self.migration_log.append(msg)
            if not self.dry_run:
                shutil.copytree(self.working_dir, backup_dir)
            return backup_dir
        except Exception as e:
            msg = f"Failed to create backup: {e}"
            logger.error(msg)
            self.error_log.append(msg)
            return None

    async def migrate_workspace(self, workspace_name: str) -> bool:
        """
        Migrate a single workspace to multi-tenant structure.

        Args:
            workspace_name: Name of the workspace to migrate

        Returns:
            True if migration successful, False otherwise
        """
        try:
            msg = f"\nMigrating workspace: {workspace_name}"
            logger.info(msg)
            self.migration_log.append(msg)
            # Create tenant from workspace; an empty name falls back to "default"
            tenant_name = workspace_name or "default"
            if not self.dry_run:
                tenant = await self.tenant_service.create_tenant(
                    tenant_name=tenant_name,
                    config=None  # Use default config
                )
                msg = f" ✓ Created tenant '{tenant_name}' with ID: {tenant.tenant_id}"
                logger.info(msg)
                self.migration_log.append(msg)
                # Create default knowledge base
                kb = await self.tenant_service.create_knowledge_base(
                    tenant_id=tenant.tenant_id,
                    kb_name="default",
                    description="Default knowledge base (migrated from workspace)"
                )
                msg = f" ✓ Created default KB with ID: {kb.kb_id}"
                logger.info(msg)
                self.migration_log.append(msg)
            else:
                msg = f" [DRY RUN] Would create tenant '{tenant_name}' with default KB"
                logger.info(msg)
                self.migration_log.append(msg)
            return True
        except Exception as e:
            msg = f" ✗ Failed to migrate workspace '{workspace_name}': {e}"
            logger.error(msg)
            self.error_log.append(msg)
            return False

    async def migrate_all_workspaces(self, workspaces: List[str]) -> Dict[str, bool]:
        """
        Migrate all discovered workspaces.

        Args:
            workspaces: List of workspace names to migrate

        Returns:
            Dictionary mapping workspace name to migration status
        """
        results = {}
        for workspace in workspaces:
            success = await self.migrate_workspace(workspace)
            results[workspace] = success
        return results

    def generate_report(self, workspaces: List[str], results: Dict[str, bool]) -> str:
        """
        Generate a migration report.

        Args:
            workspaces: List of workspaces processed
            results: Migration results

        Returns:
            Formatted report string
        """
        successful = sum(1 for v in results.values() if v)
        failed = len(workspaces) - successful
        report = f"""
╔══════════════════════════════════════════════════════════════╗
║ WORKSPACE-TO-TENANT MIGRATION REPORT ║
╚══════════════════════════════════════════════════════════════╝
Working Directory: {self.working_dir}
Dry Run Mode: {self.dry_run}
Workspaces Processed: {len(workspaces)}
Successfully Migrated: {successful}
Failed: {failed}
Migration Log:
"""
        for line in self.migration_log:
            report += f"\n{line}"
        if self.error_log:
            report += "\n\nErrors Encountered:"
            for error in self.error_log:
                report += f"\n{error}"
        report += "\n"
        return report

    async def run(self) -> bool:
        """
        Execute the migration process.

        Returns:
            True if migration completed successfully, False otherwise
        """
        # Validate setup
        if not self.validate_working_dir():
            # Surface the specific reason recorded by validate_working_dir()
            for error in self.error_log:
                logger.error(error)
            logger.error("Validation failed")
            return False
        # Discover workspaces
        workspaces = self.discover_workspaces()
        if not workspaces:
            msg = "No workspaces found to migrate"
            logger.warning(msg)
            self.migration_log.append(msg)
            return True
        msg = f"Discovered {len(workspaces)} workspace(s): {', '.join(workspaces)}"
        logger.info(msg)
        self.migration_log.append(msg)
        # Create backup if not dry-run
        if not self.dry_run:
            backup_path = self.backup_working_dir()
            if not backup_path and self.backup:
                logger.warning("Backup failed but continuing with migration")
        # Migrate workspaces
        results = await self.migrate_all_workspaces(workspaces)
        # Generate and display report
        report = self.generate_report(workspaces, results)
        print(report)
        # Save report to file (skipped in dry-run mode)
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        report_path = self.working_dir / f"migration_report_{timestamp}.txt"
        try:
            if not self.dry_run:
                # The report contains non-ASCII characters, so write it as UTF-8
                # instead of relying on the platform default encoding.
                with open(report_path, 'w', encoding='utf-8') as f:
                    f.write(report)
                logger.info(f"Migration report saved to {report_path}")
        except Exception as e:
            logger.error(f"Failed to save migration report: {e}")
        # Return success if no failures
        return all(results.values())


def main():
    """Main entry point for migration script."""
    parser = argparse.ArgumentParser(
        description="Migrate workspace-based deployment to multi-tenant architecture",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Perform actual migration
  python migrate_workspace_to_tenant.py --working-dir /path/to/rag_storage

  # Preview what would be migrated without making changes
  python migrate_workspace_to_tenant.py --working-dir /path/to/rag_storage --dry-run

  # Migrate without creating backup
  python migrate_workspace_to_tenant.py --working-dir /path/to/rag_storage --skip-backup
"""
    )
    parser.add_argument(
        "--working-dir",
        required=True,
        help="Path to the working directory containing workspaces"
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Simulate migration without making actual changes"
    )
    parser.add_argument(
        "--skip-backup",
        action="store_true",
        help="Skip creating a backup of the working directory"
    )
    args = parser.parse_args()

    # Create migrator
    migrator = WorkspaceToTenantMigrator(
        working_dir=args.working_dir,
        dry_run=args.dry_run,
        backup=not args.skip_backup
    )

    # Run migration
    try:
        success = asyncio.run(migrator.run())
        sys.exit(0 if success else 1)
    except KeyboardInterrupt:
        logger.warning("Migration interrupted by user")
        sys.exit(1)
    except Exception as e:
        logger.error(f"Migration failed: {e}", exc_info=True)
        sys.exit(1)


if __name__ == "__main__":
    main()
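
The directory-matching rule in `discover_workspaces()` can be tried on its own before pointing the script at real data. What follows is a minimal sketch, not part of the script: `looks_like_workspace`, `discover`, and `demo` are hypothetical helper names introduced for illustration, and the marker file names are copied from the method above so the sketch runs without `lightrag` installed.

```python
import tempfile
from pathlib import Path
from typing import List

# Marker files copied from discover_workspaces(); kept local so this sketch
# has no lightrag dependency.
KV_STORE_NAMES = ["full_docs", "text_chunks", "entities", "relations"]


def looks_like_workspace(path: Path) -> bool:
    """Hypothetical helper: apply the same heuristic as discover_workspaces()."""
    if not path.is_dir() or path.name.startswith((".", "__")):
        return False
    has_rag_files = any(
        (path / f"kv_store_{name}.json").exists() for name in KV_STORE_NAMES
    ) or (path / "doc_status_storage.json").exists()
    return has_rag_files or path.name.startswith("workspace_")


def discover(root: Path) -> List[str]:
    """Return the sorted names of subdirectories that match the heuristic."""
    return sorted(p.name for p in root.iterdir() if looks_like_workspace(p))


def demo() -> List[str]:
    """Build a throwaway directory tree and run discovery against it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "alpha").mkdir()
        (root / "alpha" / "kv_store_full_docs.json").write_text("{}")
        (root / "workspace_beta").mkdir()  # matched by name prefix alone
        (root / "notes").mkdir()  # no storage files, so ignored
        (root / ".cache").mkdir()
        (root / ".cache" / "doc_status_storage.json").write_text("{}")  # hidden, so ignored
        (root / "stray.json").write_text("{}")  # a plain file, so ignored
        return discover(root)


if __name__ == "__main__":
    print(demo())  # ['alpha', 'workspace_beta']
```

Note that an empty directory named `workspace_beta` is still reported, so a `--dry-run` pass is a reasonable way to review the discovered list before any tenants are created.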