LightRAG/lightrag/tools/README_CLEAN_LLM_QUERY_CACHE.md
yangdx 1485cb82e9 Add LLM query cache cleanup tool for KV storage backends
- Interactive cleanup workflow
- Supports all KV storage types
- Batch deletion with progress
- Comprehensive error reporting
- Preserves workspace isolation
2025-11-09 13:37:33 +08:00

LLM Query Cache Cleanup Tool - User Guide

Overview

This tool removes LightRAG's LLM query caches from KV storage backends. It targets only the caches generated during RAG query operations (modes: mix, hybrid, local, global), covering both query result caches and keywords extraction caches.

Supported Storage Types

  1. JsonKVStorage - File-based JSON storage
  2. RedisKVStorage - Redis database storage
  3. PGKVStorage - PostgreSQL database storage
  4. MongoKVStorage - MongoDB database storage

Cache Types

The tool cleans up the following query cache types:

Query Cache Modes (4 types)

  • mix:* - Mixed mode query caches
  • hybrid:* - Hybrid mode query caches
  • local:* - Local mode query caches
  • global:* - Global mode query caches

Cache Content Types (2 types)

  • *:query:* - Query result caches
  • *:keywords:* - Keywords extraction caches

Cache Key Format

<mode>:<cache_type>:<hash>

Examples:

  • mix:query:5ce04d25e957c290216cee5bfe6344fa
  • mix:keywords:fee77b98244a0b047ce95e21060de60e
  • global:query:abc123def456...
  • local:keywords:789xyz...

Important Note: This tool does NOT clean extraction caches (default:extract:* and default:summary:*). To copy those caches to another storage backend, use the migration tool; to remove them, delete them manually.
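
The key format above can be checked with a few lines of Python. This is a minimal sketch, not the tool's own code: the names CacheKey and parse_query_cache_key are hypothetical, and it assumes the hash part never needs validating beyond being non-empty.

```python
from typing import NamedTuple, Optional

QUERY_MODES = ("mix", "hybrid", "local", "global")
CACHE_TYPES = ("query", "keywords")


class CacheKey(NamedTuple):
    """Parsed form of <mode>:<cache_type>:<hash> (hypothetical helper type)."""
    mode: str
    cache_type: str
    digest: str


def parse_query_cache_key(key: str) -> Optional[CacheKey]:
    """Return the parsed key if it is a query cache key, otherwise None."""
    parts = key.split(":", 2)
    if len(parts) != 3:
        return None
    mode, cache_type, digest = parts
    if mode in QUERY_MODES and cache_type in CACHE_TYPES and digest:
        return CacheKey(mode, cache_type, digest)
    # Extraction caches such as default:extract:* fall through here.
    return None
```

With this sketch, mix:query:5ce04d25e957c290216cee5bfe6344fa parses into its three parts, while default:extract:* and default:summary:* keys return None and would be left alone.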

Prerequisites

  • The tool reads storage configuration from environment variables or config.ini
  • Ensure the target storage is properly configured and accessible
  • Backup important data before running cleanup operations

Usage

Basic Usage

Run from the LightRAG project root directory:

python -m lightrag.tools.clean_llm_query_cache
# or
python lightrag/tools/clean_llm_query_cache.py

Interactive Workflow

The tool guides you through the following steps:

1. Select Storage Type

============================================================
LLM Query Cache Cleanup Tool - LightRAG
============================================================

=== Storage Setup ===

Supported KV Storage Types:
[1] JsonKVStorage
[2] RedisKVStorage
[3] PGKVStorage
[4] MongoKVStorage

Select storage type (1-4) (Press Enter to exit): 1

Note: You can press Enter or type 0 at any prompt to exit gracefully.

2. Storage Validation

The tool will:

  • Check required environment variables
  • Auto-detect workspace configuration
  • Initialize and connect to storage
  • Verify connection status

Checking configuration...
✓ All required environment variables are set

Initializing storage...
- Storage Type: JsonKVStorage
- Workspace: space1
- Connection Status: ✓ Success

3. View Cache Statistics

The tool displays a detailed breakdown of query caches by mode and type:

Counting query cache records...

📊 Query Cache Statistics (Before Cleanup):
┌────────────┬────────────┬────────────┬────────────┐
│ Mode       │ Query      │ Keywords   │ Total      │
├────────────┼────────────┼────────────┼────────────┤
│ mix        │      1,234 │        567 │      1,801 │
│ hybrid     │        890 │        423 │      1,313 │
│ local      │      2,345 │      1,123 │      3,468 │
│ global     │        678 │        345 │      1,023 │
├────────────┼────────────┼────────────┼────────────┤
│ Total      │      5,147 │      2,458 │      7,605 │
└────────────┴────────────┴────────────┴────────────┘
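
A breakdown like the one above can be produced by tallying keys per mode and cache type. The following is a rough sketch under assumptions: count_query_caches and format_row are hypothetical helpers, and the column width of 10 is inferred from the sample output rather than taken from the tool's source.

```python
from typing import Dict, Iterable

MODES = ("mix", "hybrid", "local", "global")
CACHE_TYPES = ("query", "keywords")


def count_query_caches(keys: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Tally query cache keys by mode and cache type; other keys are ignored."""
    counts = {mode: {ctype: 0 for ctype in CACHE_TYPES} for mode in MODES}
    for key in keys:
        parts = key.split(":", 2)
        if len(parts) == 3 and parts[0] in MODES and parts[1] in CACHE_TYPES:
            counts[parts[0]][parts[1]] += 1
    return counts


def format_row(label: str, query: int, keywords: int) -> str:
    """Render one table row with right-aligned, comma-grouped counts."""
    total = query + keywords
    return f"│ {label:<10} │ {query:>10,} │ {keywords:>10,} │ {total:>10,} │"
```

Calling format_row("mix", 1234, 567) yields a row in the same layout as the mix line in the table.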

4. Select Cleanup Scope

Choose what type of caches to delete:

=== Cleanup Options ===
[1] Delete all query caches (both query and keywords)
[2] Delete query caches only (keep keywords)
[3] Delete keywords caches only (keep query)
[0] Cancel

Select cleanup option (0-3): 1

Cleanup Types:

  • Option 1 (all): Deletes both query and keywords caches across all modes
  • Option 2 (query): Deletes only query caches, preserves keywords caches
  • Option 3 (keywords): Deletes only keywords caches, preserves query caches
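
One way to picture the three options is as a mapping from cleanup type to the key prefixes that will be matched. This is only an illustrative sketch; prefixes_for is a hypothetical name and the tool may organise its patterns differently.

```python
from typing import List

MODES = ("mix", "hybrid", "local", "global")
CACHE_TYPES = ("query", "keywords")


def prefixes_for(cleanup_type: str) -> List[str]:
    """Return the key prefixes covered by a cleanup type: all, query, or keywords."""
    if cleanup_type == "all":
        selected = CACHE_TYPES
    elif cleanup_type in CACHE_TYPES:
        selected = (cleanup_type,)
    else:
        raise ValueError(f"unknown cleanup type: {cleanup_type!r}")
    # Every option spans all four modes; only the cache type varies.
    return [f"{mode}:{ctype}:" for mode in MODES for ctype in selected]
```

Under this reading, "all" expands to eight prefixes (four modes times two cache types), and "query" or "keywords" to four each.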

5. Confirm Deletion

Review the cleanup plan and confirm:

============================================================
Cleanup Confirmation
============================================================
Storage: JsonKVStorage (workspace: space1)
Cleanup Type: all
Records to Delete: 7,605 / 7,605

⚠️  WARNING: This will delete ALL query caches across all modes!

Continue with deletion? (y/n): y

6. Execute Cleanup

The tool performs batch deletion with real-time progress:

JsonKVStorage Example:

=== Starting Cleanup ===
💡 Processing 1,000 records at a time from JsonKVStorage

Batch 1/8: ████░░░░░░░░░░░░░░░░ 1,000/7,605 (13.1%) ✓
Batch 2/8: ████████░░░░░░░░░░░░ 2,000/7,605 (26.3%) ✓
...
Batch 8/8: ████████████████████ 7,605/7,605 (100.0%) ✓

Persisting changes to storage...
✓ Changes persisted successfully

RedisKVStorage Example:

=== Starting Cleanup ===
💡 Processing Redis keys in batches of 1,000

Batch 1: Deleted 1,000 keys (Total: 1,000) ✓
Batch 2: Deleted 1,000 keys (Total: 2,000) ✓
...

PGKVStorage Example:

=== Starting Cleanup ===
💡 Executing PostgreSQL DELETE query

✓ Deleted 7,605 records in 0.45s

MongoKVStorage Example:

=== Starting Cleanup ===
💡 Executing MongoDB deleteMany operations

Pattern 1/8: Deleted 1,234 records ✓
Pattern 2/8: Deleted 567 records ✓
...
Total deleted: 7,605 records

7. Review Cleanup Report

The tool provides a comprehensive final report:

Successful Cleanup:

============================================================
Cleanup Complete - Final Report
============================================================

📊 Statistics:
  Total records to delete:  7,605
  Total batches:            8
  Successful batches:       8
  Failed batches:           0
  Successfully deleted:     7,605
  Failed to delete:         0
  Success rate:             100.00%

📈 Before/After Comparison:
  Total caches before:      7,605
  Total caches after:       0
  Net reduction:            7,605

============================================================
✓ SUCCESS: All records cleaned up successfully!
============================================================

📊 Query Cache Statistics (After Cleanup):
┌────────────┬────────────┬────────────┬────────────┐
│ Mode       │ Query      │ Keywords   │ Total      │
├────────────┼────────────┼────────────┼────────────┤
│ mix        │          0 │          0 │          0 │
│ hybrid     │          0 │          0 │          0 │
│ local      │          0 │          0 │          0 │
│ global     │          0 │          0 │          0 │
├────────────┼────────────┼────────────┼────────────┤
│ Total      │          0 │          0 │          0 │
└────────────┴────────────┴────────────┴────────────┘

Cleanup with Errors:

============================================================
Cleanup Complete - Final Report
============================================================

📊 Statistics:
  Total records to delete:  7,605
  Total batches:            8
  Successful batches:       7
  Failed batches:           1
  Successfully deleted:     6,605
  Failed to delete:         1,000
  Success rate:             86.85%

📈 Before/After Comparison:
  Total caches before:      7,605
  Total caches after:       1,000
  Net reduction:            6,605

⚠️  Errors encountered: 1

Error Details:
------------------------------------------------------------

Error Summary:
  - ConnectionError: 1 occurrence(s)

First 5 errors:

  1. Batch 3
     Type: ConnectionError
     Message: Connection timeout after 30s
     Records lost: 1,000

============================================================
⚠️  WARNING: Cleanup completed with errors!
   Please review the error details above.
============================================================

Technical Details

Workspace Handling

The tool resolves the workspace in the following priority order:

  1. Storage-specific workspace environment variables

    • PGKVStorage: POSTGRES_WORKSPACE
    • MongoKVStorage: MONGODB_WORKSPACE
    • RedisKVStorage: REDIS_WORKSPACE
  2. Generic workspace environment variable

    • WORKSPACE
  3. Default value

    • Empty string (uses storage's default workspace)
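
The priority order above amounts to a short lookup chain. The sketch below is an assumption-laden illustration: resolve_workspace is a hypothetical function, and it treats an empty or whitespace-only variable as unset, which the tool may or may not do.

```python
import os
from typing import Mapping, Optional

# Storage-specific workspace variables listed in this guide.
STORAGE_WORKSPACE_ENV = {
    "PGKVStorage": "POSTGRES_WORKSPACE",
    "MongoKVStorage": "MONGODB_WORKSPACE",
    "RedisKVStorage": "REDIS_WORKSPACE",
}


def resolve_workspace(storage_name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the workspace: storage-specific variable, then WORKSPACE, then ''."""
    env = os.environ if env is None else env
    specific_var = STORAGE_WORKSPACE_ENV.get(storage_name)
    if specific_var:
        value = env.get(specific_var, "").strip()
        if value:  # assumption: blank values count as unset
            return value
    # JsonKVStorage has no storage-specific variable, so it starts here.
    return env.get("WORKSPACE", "").strip()
```

For example, with WORKSPACE=space1 and POSTGRES_WORKSPACE=pg_space set, PGKVStorage resolves to pg_space while JsonKVStorage resolves to space1.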

Batch Deletion

  • Default batch size: 1,000 records per batch
  • Limits memory use and reduces the risk of connection timeouts
  • Each batch is processed independently
  • Failed batches are logged but don't stop cleanup
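
The batching behaviour described above can be sketched as a loop that isolates each batch in its own try/except. This is a minimal illustration, not the tool's implementation: delete_in_batches and the delete_batch callback are hypothetical, and the real tool works against async storage backends.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def delete_in_batches(
    keys: Sequence[str],
    delete_batch: Callable[[Sequence[str]], None],
    batch_size: int = 1000,
) -> Tuple[int, List[Dict[str, object]]]:
    """Delete keys batch by batch; record failures instead of aborting."""
    deleted = 0
    errors: List[Dict[str, object]] = []
    total_batches = (len(keys) + batch_size - 1) // batch_size  # ceiling division
    for index in range(total_batches):
        batch = keys[index * batch_size:(index + 1) * batch_size]
        try:
            delete_batch(batch)
            deleted += len(batch)
        except Exception as exc:
            # A failed batch is logged and the loop moves on to the next one.
            errors.append({
                "batch": index + 1,
                "type": type(exc).__name__,
                "message": str(exc),
                "records": len(batch),
            })
    return deleted, errors
```

With 7,605 keys and the default batch size this runs 8 batches, the last holding 605 keys. If batch 3 raises, the other seven still run and 6,605 keys are deleted.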

Storage-Specific Deletion Strategies

JsonKVStorage

  • Collects all matching keys first (snapshot approach)
  • Deletes in batches with lock protection
  • Fast in-memory operations

RedisKVStorage

  • Uses SCAN with pattern matching
  • Pipeline DELETE for batch operations
  • Cursor-based iteration for large datasets

PGKVStorage

  • Single DELETE query with OR conditions
  • Efficient server-side bulk deletion
  • Uses LIKE patterns for mode/type matching

MongoKVStorage

  • Multiple deleteMany operations (one per pattern)
  • Regex-based document matching
  • Returns exact deletion counts

Pattern Matching Implementation

JsonKVStorage:

# Direct key prefix matching (str.startswith accepts a tuple of prefixes)
if key.startswith(("mix:query:", "mix:keywords:")):
    ...  # key is a query cache entry; mark it for deletion

RedisKVStorage:

# SCAN with namespace-prefixed patterns
pattern = f"{namespace}:mix:query:*"
cursor, keys = await redis.scan(cursor, match=pattern)

PGKVStorage:

-- SQL LIKE conditions
WHERE id LIKE 'mix:query:%' OR id LIKE 'mix:keywords:%'

MongoKVStorage:

# Regex queries on _id field
{"_id": {"$regex": "^mix:query:"}}
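
The PostgreSQL and MongoDB patterns above can be generated from a list of key prefixes rather than written by hand. The sketch below is illustrative only: build_pg_where and build_mongo_filter are hypothetical helpers, the $n placeholder style is an assumption (this guide does not say whether the tool parameterizes its query), and table names and workspace conditions are left out.

```python
import re
from typing import Dict, List, Sequence, Tuple


def build_pg_where(prefixes: Sequence[str]) -> Tuple[str, List[str]]:
    """Return an OR-joined LIKE clause with $n placeholders and its parameters."""
    if not prefixes:
        raise ValueError("at least one prefix is required")
    clauses = [f"id LIKE ${n}" for n in range(1, len(prefixes) + 1)]
    # None of the cache prefixes contain the LIKE wildcards % or _,
    # so appending % is enough for a prefix match.
    params = [prefix + "%" for prefix in prefixes]
    return " OR ".join(clauses), params


def build_mongo_filter(prefix: str) -> Dict[str, Dict[str, str]]:
    """Return a filter that matches _id values starting with the prefix."""
    return {"_id": {"$regex": "^" + re.escape(prefix)}}
```

Anchoring the regex with ^ keeps the MongoDB match a true prefix match, in line with the SQL LIKE 'prefix%' form.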

Error Handling & Resilience

The tool implements comprehensive error tracking:

Batch-Level Error Tracking

  • Each batch is independently error-checked
  • Failed batches are logged with full details
  • Successful batches commit even if later batches fail
  • Real-time progress shows ✓ (success) or ✗ (failed)

Error Reporting

After cleanup completes, a detailed report includes:

  • Statistics: Total records, success/failure counts, success rate
  • Before/After Comparison: Net reduction in cache count
  • Error Summary: Grouped by error type with occurrence counts
  • Error Details: Batch number, error type, message, and records lost
  • Recommendations: Clear indication of success or need for review
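
The figures in the final report follow from simple arithmetic over the batch results. As a hedged sketch (summarize_cleanup and the shape of the error records are assumptions made for illustration):

```python
from collections import Counter
from typing import Dict, List


def summarize_cleanup(total: int, deleted: int, errors: List[Dict[str, str]]) -> Dict[str, object]:
    """Compute report statistics from totals and a list of error records."""
    failed = total - deleted
    # Treat an empty cleanup as fully successful to avoid dividing by zero.
    success_rate = 100.0 if total == 0 else deleted / total * 100
    return {
        "deleted": deleted,
        "failed": failed,
        "success_rate": f"{success_rate:.2f}%",
        # Group errors by exception type name, as in the Error Summary section.
        "errors_by_type": dict(Counter(err["type"] for err in errors)),
    }
```

With 6,605 of 7,605 records deleted and one ConnectionError, this gives a success rate of 86.85% and 1,000 failed records, matching the sample report with errors.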

Verification

  • Post-cleanup count verification
  • Before/after statistics comparison
  • Identifies partial cleanup scenarios

Important Notes

  1. Irreversible Operation

    • Deleted caches cannot be recovered
    • Always backup important data before cleanup
    • Test on non-production data first
  2. Performance Impact

    • Query performance may degrade temporarily after cleanup
    • Caches will rebuild on subsequent queries
    • Consider cleanup during off-peak hours
  3. Selective Cleanup

    • Choose cleanup scope carefully
    • Keywords caches may be valuable for future queries
    • Query caches rebuild faster than keywords caches
  4. Workspace Isolation

    • Cleanup only affects the selected workspace
    • Other workspaces remain untouched
    • Verify workspace before confirming cleanup
  5. Interrupt and Resume

    • Cleanup can be interrupted at any time (Ctrl+C)
    • Already deleted records cannot be recovered
    • No automatic resume - must run tool again

Storage Configuration

The tool supports multiple configuration methods with the following priority:

  1. Environment variables (highest priority)
  2. config.ini file (medium priority)
  3. Default values (lowest priority)

Environment Variable Configuration

Configure storage settings in your .env file:

Workspace Configuration (Optional)

# Generic workspace (shared by all storages)
WORKSPACE=space1

# Or configure independent workspace for specific storage
POSTGRES_WORKSPACE=pg_space
MONGODB_WORKSPACE=mongo_space
REDIS_WORKSPACE=redis_space

Workspace Priority: Storage-specific > Generic WORKSPACE > Empty string

JsonKVStorage

WORKING_DIR=./rag_storage

RedisKVStorage

REDIS_URI=redis://localhost:6379

PGKVStorage

POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_password
POSTGRES_DATABASE=your_database

MongoKVStorage

MONGO_URI=mongodb://root:root@localhost:27017/
MONGO_DATABASE=LightRAG

config.ini Configuration

Alternatively, create a config.ini file in the project root:

[redis]
uri = redis://localhost:6379

[postgres]
host = localhost
port = 5432
user = postgres
password = yourpassword
database = lightrag

[mongodb]
uri = mongodb://root:root@localhost:27017/
database = LightRAG

Note: Environment variables take precedence over config.ini settings.
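
The precedence between environment variables, config.ini, and defaults can be expressed as a small lookup. This is a sketch under assumptions: get_setting is a hypothetical helper, the pairing of each variable with a section and option (for example REDIS_URI with [redis] uri) is inferred from the examples above, and an empty environment variable is treated as unset.

```python
import configparser
import os
from typing import Mapping, Optional


def get_setting(
    env_var: str,
    section: str,
    option: str,
    default: str,
    env: Optional[Mapping[str, str]] = None,
    config: Optional[configparser.ConfigParser] = None,
) -> str:
    """Look up a setting: environment variable, then config.ini, then default."""
    env = os.environ if env is None else env
    value = env.get(env_var)
    if value:  # assumption: empty strings count as unset
        return value
    if config is not None and config.has_option(section, option):
        # raw=True skips % interpolation, which matters for passwords and URIs.
        return config.get(section, option, raw=True)
    return default
```

A typical call would load the file with configparser.ConfigParser().read("config.ini") and then ask for get_setting("REDIS_URI", "redis", "uri", "redis://localhost:6379", config=parser).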

Troubleshooting

Missing Environment Variables

⚠️  Warning: Missing environment variables: POSTGRES_USER, POSTGRES_PASSWORD

Solution: Add the missing variables to your .env file, or configure the equivalent settings in config.ini.

Connection Failed

✗ Initialization failed: Connection refused

Solutions:

  • Check if database service is running
  • Verify connection parameters (host, port, credentials)
  • Check firewall settings
  • Ensure network connectivity for remote databases

No Caches Found

⚠️  No query caches found in storage

Possible Reasons:

  • No queries have been run yet
  • Caches were already cleaned
  • Wrong workspace selected
  • Different storage type was used for queries

Partial Cleanup

⚠️  WARNING: Cleanup completed with errors!

Solutions:

  • Check error details in the report
  • Verify storage connection stability
  • Re-run tool to clean remaining caches
  • Check storage capacity and permissions

Use Cases

Use Case 1: Clean All Query Caches

Scenario: Free up storage space by removing all query caches

# Run tool
python -m lightrag.tools.clean_llm_query_cache

# Select: Storage type -> Option 1 (all) -> Confirm (y)

Result: All query and keywords caches deleted, maximum storage freed

Use Case 2: Refresh Query Caches Only

Scenario: Force query cache rebuild while keeping keywords

# Run tool
python -m lightrag.tools.clean_llm_query_cache

# Select: Storage type -> Option 2 (query only) -> Confirm (y)

Result: Query caches deleted, keywords preserved for faster rebuild

Use Case 3: Clean Stale Keywords

Scenario: Remove outdated keywords while keeping recent query results

# Run tool
python -m lightrag.tools.clean_llm_query_cache

# Select: Storage type -> Option 3 (keywords only) -> Confirm (y)

Result: Keywords deleted, query caches preserved

Use Case 4: Workspace-Specific Cleanup

Scenario: Clean caches for a specific workspace

# Configure workspace
export WORKSPACE=development

# Run tool
python -m lightrag.tools.clean_llm_query_cache

# Select: Storage type -> Cleanup option -> Confirm (y)

Result: Only development workspace caches cleaned

Best Practices

  1. Backup Before Cleanup

    • Always backup your storage before major cleanup
    • Test cleanup on non-production data first
    • Document cleanup decisions
  2. Monitor Performance

    • Watch storage metrics during cleanup
    • Monitor query performance after cleanup
    • Allow time for cache rebuild
  3. Scheduled Cleanup

    • Clean caches periodically (weekly/monthly)
    • Automate cleanup for development environments
    • Keep production cleanup manual for safety
  4. Selective Deletion

    • Consider cleanup scope based on needs
    • Keywords caches are harder to rebuild
    • Query caches rebuild automatically
  5. Storage Capacity

    • Monitor storage usage trends
    • Clean caches before reaching capacity limits
    • Archive old data if needed

Comparison with Migration Tool

| Feature     | Cleanup Tool                 | Migration Tool            |
|-------------|------------------------------|---------------------------|
| Purpose     | Delete query caches          | Migrate extraction caches |
| Modes       | mix, hybrid, local, global   | default                   |
| Cache Types | query, keywords              | extract, summary          |
| Operation   | Deletion                     | Copy between storages     |
| Reversible  | No                           | Yes (source unchanged)    |
| Use Case    | Free storage, refresh caches | Change storage backend    |

Limitations

  1. Single Storage Operation

    • Can only clean one storage type at a time
    • To clean multiple storages, run tool multiple times
  2. No Dry Run Mode

    • Deletion is immediate after confirmation
    • No preview-only mode available
    • Test on non-production first
  3. No Selective Mode Cleanup

    • Cannot clean only specific modes (e.g., only mix)
    • Cleanup applies to all modes for selected cache type
    • All-or-nothing per cache type
  4. No Scheduled Cleanup

    • Manual execution required
    • No built-in scheduling
    • Use cron/scheduler if automation needed
  5. Verification Limitations

    • Post-cleanup verification may fail in error scenarios
    • Manual verification recommended for critical operations

Future Enhancements

Potential improvements for future versions:

  • Selective mode cleanup (e.g., clean only mix mode)
  • Age-based cleanup (delete caches older than X days)
  • Size-based cleanup (delete largest caches first)
  • Dry run mode for safe preview
  • Automated scheduling support
  • Cache statistics export
  • Incremental cleanup with pause/resume

Support

For issues, questions, or feature requests:

  • Check the error details in the cleanup report
  • Review storage configuration
  • Verify workspace settings
  • Test with a small dataset first
  • Report bugs through the project issue tracker