LightRAG/Metadata_Filtering.md

13 KiB

Metadata Filtering in LightRAG

Overview

LightRAG supports metadata filtering during queries to retrieve only relevant chunks based on metadata criteria.

Important Limitations:

  • Metadata filtering is only supported for PostgreSQL (PGVectorStorage), with metadata insertion also visible on Neo4j
  • Only chunk-based queries support metadata filtering (Mix and Naive modes)
  • Metadata is stored in document status and propagated to chunks during extraction

Metadata Structure

Metadata is stored as a dictionary (dict[str, Any]) in:

  • Entity nodes (graph storage)
  • Relationship edges (graph storage)
  • Text chunks (KV storage)
  • Vector embeddings (vector storage)
metadata = {
    "author": "John Doe",
    "department": "Engineering",
    "document_type": "technical_spec",
    "version": "1.0"
}

Critical: Metadata Persistence in Document Status

Metadata is stored in DocProcessingStatus - This ensures metadata is not lost if the processing queue is stopped or interrupted.

How It Works

  1. Document Status Storage (lightrag/base.py - DocProcessingStatus)

    @dataclass
    class DocProcessingStatus:
        # ... other fields
        metadata: dict[str, Any] = field(default_factory=dict)
        """Additional metadata - PERSISTED across queue restarts"""
    
  2. Metadata Flow:

    • Metadata stored in DocProcessingStatus.metadata when document is enqueued
    • If queue stops, metadata persists in document status storage
    • When processing resumes, metadata is read from document status
    • Metadata is propagated to chunks during extraction
  3. Why This Matters:

    • Queue can be stopped/restarted without losing metadata
    • Metadata survives system crashes or interruptions
    • Ensures data consistency across processing pipeline

Metadata Filtering During Queries

MetadataFilter Class

from lightrag.types import MetadataFilter

# Simple filter
filter1 = MetadataFilter(
    operator="AND",
    operands=[{"department": "Engineering"}]
)

# Complex filter with OR
filter2 = MetadataFilter(
    operator="OR",
    operands=[
        {"author": "John Doe"},
        {"author": "Jane Smith"}
    ]
)

# Nested filter
filter3 = MetadataFilter(
    operator="AND",
    operands=[
        {"document_type": "technical_spec"},
        MetadataFilter(
            operator="OR",
            operands=[
                {"version": "1.0"},
                {"version": "2.0"}
            ]
        )
    ]
)

Supported Operators

  • AND: All conditions must be true
  • OR: At least one condition must be true
  • NOT: Negates the condition

Supported Query Modes

Filters vector chunks from both KG and direct vector search:

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"department": "Engineering"},
            {"status": "approved"}
        ]
    )
)

Naive Mode

Filters vector chunks directly:

query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"document_type": "manual"}]
    )
)

Implementation Details

Architecture Flow

  1. API Layer (lightrag/api/routers/query_routes.py)

    • REST endpoint receives metadata_filter as JSON dict
    • Converts JSON to MetadataFilter object using MetadataFilter.from_dict()
  2. QueryParam (lightrag/base.py)

    • MetadataFilter object is passed into QueryParam.metadata_filter
    • QueryParam carries the filter through the query pipeline
  3. Query Execution (lightrag/operate.py)

    • Only chunk-based queries use the filter:
      • Line 2749: chunks_vdb.query(..., metadata_filter=query_param.metadata_filter) (Mix/Naive modes)
  4. Storage Layer (lightrag/kg/postgres_impl.py)

    • PGVectorStorage: Converts filter to SQL WHERE clause with JSONB operators

Code Locations

Key files implementing metadata support:

  • lightrag/types.py: MetadataFilter class definition
  • lightrag/base.py: QueryParam with metadata_filter field, DocProcessingStatus with metadata persistence
  • lightrag/api/routers/query_routes.py: API endpoint that initializes MetadataFilter from JSON
  • lightrag/operate.py: Query functions that pass filter to storage (Line 2749)
  • lightrag/kg/postgres_impl.py: PostgreSQL JSONB filter implementation

Query Examples

Example 1: Filter by Department (Mix Mode)

from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"department": "Engineering"}]
    )
)

response = rag.query("What are the key projects?", param=query_param)

Example 2: Multi-tenant Filtering (Naive Mode)

query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"tenant_id": "tenant_a"},
            {"access_level": "admin"}
        ]
    )
)

response = rag.query("Show admin resources", param=query_param)

Example 3: Version Filtering (Mix Mode)

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"doc_type": "manual"},
            {"status": "current"}
        ]
    )
)

response = rag.query("How to configure?", param=query_param)

Storage Backend Support

Important: Metadata filtering is currently only supported for PostgreSQL vector storage.

Vector Storage

  • PGVectorStorage: Full support with JSONB filtering
  • NanoVectorDBStorage: Not supported
  • MilvusVectorDBStorage: Not supported
  • ChromaVectorDBStorage: Not supported
  • FaissVectorDBStorage: Not supported
  • QdrantVectorDBStorage: Not supported
  • MongoVectorDBStorage: Not supported

For metadata filtering support:

rag = LightRAG(
    working_dir="./storage",
    vector_storage="PGVectorStorage",
    # Graph storage can be any type
    # ... other config
)

Server API Examples

REST API Query with Metadata Filter

Simple Filter (Naive Mode)

curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key features?",
    "mode": "naive",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"department": "Engineering"},
        {"year": 2024}
      ]
    }
  }'

Complex Nested Filter (Mix Mode)

curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Show me technical documentation",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"document_type": "technical_spec"},
        {
          "operator": "OR",
          "operands": [
            {"version": "1.0"},
            {"version": "2.0"}
          ]
        }
      ]
    }
  }'

Multi-tenant Query (Mix Mode)

curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "List all projects",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"tenant_id": "tenant_a"},
        {"access_level": "admin"}
      ]
    },
    "top_k": 20
  }'

Python Client with Server

import requests
from lightrag.types import MetadataFilter

# Option 1: Use MetadataFilter class and convert to dict
metadata_filter = MetadataFilter(
    operator="AND",
    operands=[
        {"department": "Engineering"},
        {"status": "approved"}
    ]
)

response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "mix",  # Use mix or naive mode
        "metadata_filter": metadata_filter.to_dict(),
        "top_k": 10
    }
)

# Option 2: Send dict directly (API will convert to MetadataFilter)
response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "naive",  # Use mix or naive mode
        "metadata_filter": {
            "operator": "AND",
            "operands": [
                {"department": "Engineering"},
                {"status": "approved"}
            ]
        },
        "top_k": 10
    }
)

result = response.json()
print(result["response"])

How the API Processes Metadata Filters

When you send a query to the REST API:

  1. JSON Request → API receives metadata_filter as a dict
  2. API ConversionMetadataFilter.from_dict() creates MetadataFilter object
  3. QueryParam → MetadataFilter is set in QueryParam.metadata_filter
  4. Query Execution → QueryParam with filter is passed to kg_query() or naive_query()
  5. Storage Query → Filter is passed to vector storage query methods (chunks only)
  6. SQL → PGVectorStorage converts filter to JSONB WHERE clause

Best Practices

1. Consistent Metadata Schema

# Good - consistent schema
metadata1 = {"author": "John", "dept": "Eng", "year": 2024}
metadata2 = {"author": "Jane", "dept": "Sales", "year": 2024}

2. Simple Indexable Values

# Good - simple values
metadata = {
    "status": "approved",
    "priority": "high",
    "year": 2024
}

3. Use Appropriate Query Mode

  • Mix mode: Best for combining KG context with filtered chunks
  • Naive mode: Best for pure vector search with metadata filtering

4. Performance Considerations

  • Keep metadata fields minimal (Should be done automatically by the ORM)
  • For PostgreSQL: Create GIN indexes on JSONB metadata columns:
    CREATE INDEX idx_chunks_metadata ON chunks USING GIN (metadata);
    
  • Avoid overly complex nested filters

Troubleshooting

Filter Not Working

  1. Verify storage backend: Ensure you're using PGVectorStorage
  2. Verify query mode: Use "mix" or "naive" mode only
  3. Verify metadata exists in chunks
  4. Check metadata field names match exactly (case-sensitive)
  5. Check logs for filter parsing errors
  6. Test without filter first to ensure data exists

Performance Issues

  1. Reduce filter complexity
  2. Create GIN indexes on JSONB metadata columns in PostgreSQL
  3. Profile query execution time
  4. Consider caching frequently used filters

Unsupported Storage Backend

If you're using a storage backend that doesn't support metadata filtering:

  1. Migrate to PGVectorStorage
  2. Or implement post-filtering in application code
  3. Or contribute metadata filtering support for your backend

Metadata Not Persisting After Queue Restart

  • Metadata is stored in DocProcessingStatus.metadata
  • Check document status storage is properly configured
  • Verify metadata is set before document is enqueued

API Reference

MetadataFilter

class MetadataFilter(BaseModel):
    operator: str  # "AND", "OR", or "NOT"
    operands: List[Union[Dict[str, Any], 'MetadataFilter']]
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON serialization"""
        ...
    
    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'MetadataFilter':
        """Create MetadataFilter from dictionary (used by API)"""
        ...

QueryParam

@dataclass
class QueryParam:
    metadata_filter: MetadataFilter | None = None  # Filter passed to chunk queries
    mode: str = "mix"  # Only "mix" and "naive" support metadata filtering
    top_k: int = 60
    # ... other fields

DocProcessingStatus

@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""

Query Method

# Synchronous
response = rag.query(
    query: str,
    param: QueryParam  # QueryParam contains metadata_filter
)

# Asynchronous
response = await rag.aquery(
    query: str,
    param: QueryParam  # QueryParam contains metadata_filter
)

REST API Query Endpoint

# In lightrag/api/routers/query_routes.py
@router.post("/query")
async def query_endpoint(request: QueryRequest):
    # API receives metadata_filter as dict
    metadata_filter_dict = request.metadata_filter
    
    # Convert dict to MetadataFilter object
    metadata_filter = MetadataFilter.from_dict(metadata_filter_dict) if metadata_filter_dict else None
    
    # Create QueryParam with MetadataFilter
    query_param = QueryParam(
        mode=request.mode,  # Must be "mix" or "naive"
        metadata_filter=metadata_filter,
        top_k=request.top_k
    )
    
    # Execute query with QueryParam
    result = await rag.aquery(request.query, param=query_param)
    return result