LLM Cache Migration Tool - User Guide

Overview

This tool migrates LightRAG's LLM response cache between different KV storage implementations. It migrates only the caches generated during document extraction (the default mode): entity-extraction caches and summary caches.

Supported Storage Types

  1. JsonKVStorage - File-based JSON storage
  2. RedisKVStorage - Redis database storage
  3. PGKVStorage - PostgreSQL database storage
  4. MongoKVStorage - MongoDB database storage

Cache Types

The tool migrates the following cache types:

  • default:extract:* - Entity and relationship extraction caches
  • default:summary:* - Entity and relationship summary caches

Note: Query caches (modes like local, global, etc.) are NOT migrated.

Prerequisites

1. Environment Variable Configuration

Ensure the relevant storage environment variables are configured in your .env file:

Workspace Configuration (Optional)

# Generic workspace (shared by all storages)
WORKSPACE=space1

# Or configure independent workspace for specific storage
POSTGRES_WORKSPACE=pg_space
MONGODB_WORKSPACE=mongo_space
REDIS_WORKSPACE=redis_space

Workspace Priority: Storage-specific > Generic WORKSPACE > Empty string

JsonKVStorage

WORKING_DIR=./rag_storage

RedisKVStorage

REDIS_URI=redis://localhost:6379

PGKVStorage

POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_password
POSTGRES_DATABASE=your_database

MongoKVStorage

MONGO_URI=mongodb://root:root@localhost:27017/
MONGO_DATABASE=LightRAG

2. Install Dependencies

Ensure LightRAG and its dependencies are installed:

pip install -r requirements.txt

Usage

Basic Usage

Run from the LightRAG project root directory:

python tools/migrate_llm_cache.py

Interactive Workflow

The tool guides you through the following steps:

1. Select Source Storage Type

Supported KV Storage Types:
[1] JsonKVStorage
[2] RedisKVStorage
[3] PGKVStorage
[4] MongoKVStorage

Select Source storage type (1-4) (Press Enter or 0 to exit): 1

Note: You can press Enter or type 0 at the source storage selection to exit gracefully.

2. Source Storage Validation

The tool will:

  • Check required environment variables
  • Auto-detect workspace configuration
  • Initialize and connect to storage
  • Count cache records available for migration
Checking environment variables...
✓ All required environment variables are set

Initializing Source storage...
- Storage Type: JsonKVStorage
- Workspace: space1
- Connection Status: ✓ Success

Counting cache records...
- Total: 8,734 records

Progress Display by Storage Type:

  • JsonKVStorage: Fast in-memory counting, no progress display needed
  • RedisKVStorage: Real-time scanning progress
    Scanning Redis keys... found 8,734 records
    
  • PostgreSQL: Shows timing if the operation takes >1 second
    Counting PostgreSQL records... (took 2.3s)
    
  • MongoDB: Shows timing if the operation takes >1 second (counting sketch below)
    Counting MongoDB documents... (took 1.8s)
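
For illustration, counting with a slow-operation notice can be sketched with pymongo as follows; the collection name llm_response_cache is an assumption, not necessarily the tool's actual schema:

# Hedged sketch: count default-mode cache documents in MongoDB and print
# the elapsed time only when the operation is slow. Collection name assumed.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://root:root@localhost:27017/")
coll = client["LightRAG"]["llm_response_cache"]  # hypothetical collection name

start = time.monotonic()
total = coll.count_documents({"_id": {"$regex": "^default:(extract|summary):"}})
elapsed = time.monotonic() - start

if elapsed > 1:
    print(f"Counting MongoDB documents... (took {elapsed:.1f}s)")
print(f"- Total: {total:,} records")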
    

3. Select Target Storage Type

Repeat steps 1-2 to select and validate the target storage.

4. Confirm Migration

==================================================
Migration Confirmation
Source: JsonKVStorage (workspace: space1) - 8,734 records
Target: MongoKVStorage (workspace: space1) - 0 records
Batch Size: 1,000 records/batch

(If the target storage already contains records, a warning is shown here, since migration overwrites records with the same keys.)

Continue? (y/n): y

5. Execute Migration

Observe migration progress:

=== Starting Migration ===
Batch 1/9: ██░░░░░░░░░░░░░░░░░░ 1000/8734 (11%) - default:extract
Batch 2/9: █████░░░░░░░░░░░░░░░ 2000/8734 (23%) - default:extract
...
Batch 9/9: ████████████████████ 8734/8734 (100%) - default:summary

Persisting data to disk...

6. Review Migration Report

The tool provides a comprehensive final report showing statistics and any errors encountered:

Successful Migration:

Migration Complete - Final Report

📊 Statistics:
  Total source records:    8,734
  Total batches:           9
  Successful batches:      9
  Failed batches:          0
  Successfully migrated:   8,734
  Failed to migrate:       0
  Success rate:            100.00%

✓ SUCCESS: All records migrated successfully!

Migration with Errors:

Migration Complete - Final Report

📊 Statistics:
  Total source records:    8,734
  Total batches:           9
  Successful batches:      8
  Failed batches:          1
  Successfully migrated:   7,734
  Failed to migrate:       1,000
  Success rate:            88.55%

⚠️  Errors encountered: 1

Error Details:
------------------------------------------------------------

Error Summary:
  - ConnectionError: 1 occurrence(s)

First 5 errors:

  1. Batch 2
     Type: ConnectionError
     Message: Connection timeout after 30s
     Records lost: 1,000

⚠️  WARNING: Migration completed with errors!
   Please review the error details above.

Technical Details

Workspace Handling

The tool retrieves the workspace in the following priority order (a small resolution sketch follows the list):

  1. Storage-specific workspace environment variables

    • PGKVStorage: POSTGRES_WORKSPACE
    • MongoKVStorage: MONGODB_WORKSPACE
    • RedisKVStorage: REDIS_WORKSPACE
  2. Generic workspace environment variable

    • WORKSPACE
  3. Default value

    • Empty string (uses storage's default workspace)
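
As a rough sketch (plain os.getenv lookups, not necessarily the tool's exact code), the priority chain can be expressed as:

# Illustrative sketch of the documented priority chain.
import os

def resolve_workspace(storage_specific_var: str) -> str:
    # Storage-specific variable > generic WORKSPACE > empty string
    return os.getenv(storage_specific_var) or os.getenv("WORKSPACE") or ""

workspace = resolve_workspace("POSTGRES_WORKSPACE")  # e.g. for PGKVStorage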

Batch Migration

  • Default batch size: 1000 records/batch
  • Avoids memory overflow from loading too much data at once
  • Each batch is committed independently, supporting resume capability (see the sketch below)
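
The pattern can be sketched as follows; migrate_in_batches and upsert_batch are illustrative names, not the tool's actual API:

# Hedged sketch: commit records in independent batches so one failure
# does not undo earlier batches. Names are illustrative.
from itertools import islice

def migrate_in_batches(pairs, upsert_batch, batch_size=1000):
    it = iter(pairs)                   # pairs: iterable of (key, value)
    errors, batch_no = [], 0
    while True:
        batch = dict(islice(it, batch_size))
        if not batch:
            break
        batch_no += 1
        try:
            upsert_batch(batch)        # each batch is committed independently
        except Exception as exc:       # a failed batch is recorded, not fatal
            errors.append((batch_no, exc))
    return errors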

Memory-Efficient Pagination

For large datasets, the tool implements storage-specific pagination strategies:

  • JsonKVStorage: Direct in-memory access (data already loaded in shared storage)
  • RedisKVStorage: Cursor-based SCAN with pipeline batching (1000 keys/batch)
  • PGKVStorage: SQL LIMIT/OFFSET pagination (1000 records/batch)
  • MongoKVStorage: Cursor streaming with batch_size (1000 documents/batch)

This ensures the tool can handle millions of cache records without memory issues.
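
As an illustration of the PostgreSQL strategy, here is a hedged sketch using asyncpg; the table name lightrag_llm_cache is an assumption, and the columns follow the field mapping described in the next section:

# Hedged sketch of LIMIT/OFFSET pagination with asyncpg; table name assumed.
import asyncpg

async def iter_cache_batches(dsn: str, batch_size: int = 1000):
    conn = await asyncpg.connect(dsn)
    try:
        offset = 0
        while True:
            rows = await conn.fetch(
                "SELECT id, return_value FROM lightrag_llm_cache "
                "WHERE id LIKE 'default:%' ORDER BY id LIMIT $1 OFFSET $2",
                batch_size, offset,
            )
            if not rows:
                break
            yield rows                 # one batch of at most batch_size rows
            offset += batch_size
    finally:
        await conn.close()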

Prefix Filtering Implementation

The tool uses optimized filtering methods for different storage types (the Redis approach is sketched after this list):

  • JsonKVStorage: Direct dictionary iteration with lock protection
  • RedisKVStorage: SCAN command with namespace-prefixed patterns + pipeline for bulk GET
  • PGKVStorage: SQL LIKE queries with proper field mapping (id, return_value, etc.)
  • MongoKVStorage: MongoDB regex queries on _id field with cursor streaming
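
For example, the Redis approach can be sketched with redis-py; the key-prefix layout shown is an assumed example:

# Hedged sketch: cursor-based SCAN plus pipelined GETs for bulk reads.
import redis

r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)
pattern = "space1:llm_response_cache:default:extract:*"  # assumed key layout

cursor = 0
while True:
    cursor, keys = r.scan(cursor=cursor, match=pattern, count=1000)
    if keys:
        pipe = r.pipeline()
        for key in keys:
            pipe.get(key)
        for key, value in zip(keys, pipe.execute()):
            pass  # handle one (key, value) cache record here
    if cursor == 0:
        break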

Error Handling & Resilience

The tool implements comprehensive error tracking to ensure transparent and resilient migrations:

Batch-Level Error Tracking

  • Each batch is independently error-checked
  • Failed batches are logged but don't stop the migration
  • Successful batches are committed even if later batches fail
  • Real-time progress shows ✓ (success) or ✗ (failed) for each batch

Error Reporting

After migration completes, a detailed report includes (a grouping sketch follows this list):

  • Statistics: Total records, success/failure counts, success rate
  • Error Summary: Grouped by error type with occurrence counts
  • Error Details: Batch number, error type, message, and records lost
  • Recommendations: Clear indication of success or need for review
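
Grouping recorded errors by type for the summary can be as simple as the following illustrative sketch, matching the report format shown above:

# Illustrative sketch: summarize (batch_no, exception) pairs by error type.
from collections import Counter

def summarize_errors(errors):
    counts = Counter(type(exc).__name__ for _, exc in errors)
    print("Error Summary:")
    for name, count in counts.items():
        print(f"  - {name}: {count} occurrence(s)")
    print("First 5 errors:")
    for i, (batch_no, exc) in enumerate(errors[:5], 1):
        print(f"  {i}. Batch {batch_no}: {type(exc).__name__}: {exc}")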

No Double Data Loading

  • Unlike traditional verification approaches, the tool does NOT reload all target data
  • Errors are detected during migration, not after
  • This eliminates memory overhead and handles pre-existing target data correctly

Important Notes

  1. Data Overwrite Warning

    • Migration will overwrite records with the same keys in the target storage
    • Tool displays a warning if target storage already has data
    • Pre-existing data in target storage is handled correctly
  2. Workspace Consistency

    • Recommended to use the same workspace for source and target
    • Cache data in different workspaces are completely isolated
  3. Interrupt and Resume

    • Migration can be interrupted at any time (Ctrl+C)
    • Already migrated data will remain in target storage
    • Re-running will overwrite existing records
    • Failed batches can be manually retried
  4. Performance Considerations

    • Large data migration may take considerable time
    • Recommend migrating during off-peak hours
    • Ensure stable network connection (for remote databases)
    • Memory usage stays constant regardless of dataset size

Troubleshooting

Missing Environment Variables

✗ Missing required environment variables: POSTGRES_USER, POSTGRES_PASSWORD

Solution: Add missing variables to your .env file

Connection Failed

✗ Initialization failed: Connection refused

Solutions:

  • Check if database service is running
  • Verify connection parameters (host, port, credentials)
  • Check firewall settings

Migration Completed with Errors

⚠️  WARNING: Migration completed with errors!

Solutions:

  • Check migration process for error logs
  • Re-run migration tool
  • Check target storage capacity and permissions

Example Scenarios

Scenario 1: JSON to MongoDB Migration

Use case: Migrating from single-machine development to production

# 1. Configure environment variables
WORKSPACE=production
MONGO_URI=mongodb://user:pass@prod-server:27017/
MONGO_DATABASE=LightRAG

# 2. Run tool
python tools/migrate_llm_cache.py

# 3. Select: 1 (JsonKVStorage) -> 4 (MongoKVStorage)

Scenario 2: PostgreSQL Database Switch

Use case: Database migration or upgrade

# 1. Configure old and new databases
POSTGRES_WORKSPACE=old_db  # Source
# ... Configure the new database via the standard POSTGRES_* variables

# 2. Run tool and select same storage type

Scenario 3: Redis to PostgreSQL

Use case: Migrating from cache storage to relational database

# 1. Ensure both databases are accessible
REDIS_URI=redis://old-redis:6379
POSTGRES_HOST=new-postgres-server
# ... Other PostgreSQL configs

# 2. Run tool
python tools/migrate_llm_cache.py

# 3. Select: 2 (RedisKVStorage) -> 3 (PGKVStorage)

Tool Limitations

  1. Only Default Mode Caches

    • Only migrates default:extract:* and default:summary:*
    • Query caches are not included
  2. Workspace Isolation

    • Different workspaces are treated as completely separate
    • Cross-workspace migration requires manual workspace reconfiguration
  3. Network Dependency

    • Tool requires stable network connection for remote databases
    • Large datasets may fail if connection is interrupted

Best Practices

  1. Backup Before Migration

    • Always backup your data before migration
    • Test migration on non-production data first
  2. Verify Results

    • Review the final migration report after migration completes
    • Manually verify a few cache entries if needed
  3. Monitor Performance

    • Watch database resource usage during migration
    • Consider migrating in smaller batches if needed
  4. Clean Old Data

    • After successful migration, consider cleaning old cache data
    • Keep backups for a reasonable period before deletion

Support

For issues or questions:

  • Check LightRAG documentation
  • Review error logs for detailed information
  • Ensure all environment variables are correctly configured