Metadata Filtering in LightRAG
Overview
LightRAG supports metadata filtering during queries to retrieve only relevant chunks based on metadata criteria.
Important Limitations:
- Metadata filtering is supported only on PostgreSQL (PGVectorStorage); inserted metadata is also visible in Neo4j, but it is not used for filtering there
- Only chunk-based queries support metadata filtering (Mix and Naive modes)
- Metadata is stored in document status and propagated to chunks during extraction
Metadata Structure
Metadata is stored as a dictionary (dict[str, Any]) in:
- Entity nodes (graph storage)
- Relationship edges (graph storage)
- Text chunks (KV storage)
- Vector embeddings (vector storage)
metadata = {
"author": "John Doe",
"department": "Engineering",
"document_type": "technical_spec",
"version": "1.0"
}
Critical: Metadata Persistence in Document Status
Metadata is stored in DocProcessingStatus, so it is not lost if the processing queue is stopped or interrupted.
How It Works
Document Status Storage (lightrag/base.py - DocProcessingStatus):
@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""
Metadata Flow:
- Metadata stored in DocProcessingStatus.metadata when document is enqueued
- If queue stops, metadata persists in document status storage
- When processing resumes, metadata is read from document status
- Metadata is propagated to chunks during extraction
Why This Matters:
- Queue can be stopped/restarted without losing metadata
- Metadata survives system crashes or interruptions
- Ensures data consistency across processing pipeline
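The flow above can be illustrated with a small standalone sketch. It does not use LightRAG: DocStatus, enqueue, resume, and the JSON file are hypothetical stand-ins for DocProcessingStatus and the document status storage, chosen only to show why writing metadata at enqueue time lets it survive a restart.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DocStatus:
    """Hypothetical stand-in for DocProcessingStatus (two fields only)."""
    status: str = "pending"
    metadata: dict[str, Any] = field(default_factory=dict)


def enqueue(store: Path, doc_id: str, metadata: dict[str, Any]) -> None:
    """Write the status record, metadata included, before any processing."""
    records = json.loads(store.read_text()) if store.exists() else {}
    records[doc_id] = asdict(DocStatus(metadata=metadata))
    store.write_text(json.dumps(records))


def resume(store: Path, doc_id: str, chunks: list[str]) -> list[dict[str, Any]]:
    """Reload the record from disk and copy its metadata onto every chunk."""
    record = DocStatus(**json.loads(store.read_text())[doc_id])
    return [{"content": c, "metadata": dict(record.metadata)} for c in chunks]


store = Path(tempfile.mkdtemp()) / "doc_status.json"
enqueue(store, "doc-1", {"department": "Engineering", "version": "1.0"})
# The process could stop here: nothing about the metadata lives only in memory.
restored = resume(store, "doc-1", ["first chunk", "second chunk"])
```

Because resume reads only from the stored record, the chunks receive the same metadata whether or not the process was restarted in between.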
Metadata Filtering During Queries
MetadataFilter Class
from lightrag.types import MetadataFilter
# Simple filter
filter1 = MetadataFilter(
operator="AND",
operands=[{"department": "Engineering"}]
)
# Complex filter with OR
filter2 = MetadataFilter(
operator="OR",
operands=[
{"author": "John Doe"},
{"author": "Jane Smith"}
]
)
# Nested filter
filter3 = MetadataFilter(
operator="AND",
operands=[
{"document_type": "technical_spec"},
MetadataFilter(
operator="OR",
operands=[
{"version": "1.0"},
{"version": "2.0"}
]
)
]
)
Supported Operators
- AND: All conditions must be true
- OR: At least one condition must be true
- NOT: Negates the condition
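To make the operator semantics concrete, here is a minimal pure-Python sketch that evaluates a dict-form filter against a single metadata dict. The matches function is hypothetical and is not LightRAG's implementation. It assumes that leaf operands are compared by equality and that NOT with several operands means "none of them match"; the library may define these cases differently.

```python
from typing import Any


def matches(flt: dict[str, Any], metadata: dict[str, Any]) -> bool:
    """Evaluate a dict-form filter ({"operator": ..., "operands": [...]})."""

    def check(operand: dict[str, Any]) -> bool:
        if "operator" in operand:
            # Nested filter: recurse. (A metadata key literally named
            # "operator" would be ambiguous in this sketch.)
            return matches(operand, metadata)
        # Leaf operand: every key/value pair must equal the stored value.
        return all(metadata.get(k) == v for k, v in operand.items())

    results = [check(op) for op in flt["operands"]]
    operator = flt["operator"].upper()
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        # Assumption: NOT rejects the record if any operand matches.
        return not any(results)
    raise ValueError(f"unsupported operator: {operator}")


doc = {"document_type": "technical_spec", "version": "2.0"}
nested = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
not_draft = {"operator": "NOT", "operands": [{"status": "draft"}]}
```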
Supported Query Modes
Mix Mode (Recommended)
Filters vector chunks from both KG and direct vector search:
query_param = QueryParam(
mode="mix",
metadata_filter=MetadataFilter(
operator="AND",
operands=[
{"department": "Engineering"},
{"status": "approved"}
]
)
)
Naive Mode
Filters vector chunks directly:
query_param = QueryParam(
mode="naive",
metadata_filter=MetadataFilter(
operator="AND",
operands=[{"document_type": "manual"}]
)
)
Implementation Details
Architecture Flow
1. API Layer (lightrag/api/routers/query_routes.py)
   - REST endpoint receives metadata_filter as JSON dict
   - Converts JSON to MetadataFilter object using MetadataFilter.from_dict()
2. QueryParam (lightrag/base.py)
   - MetadataFilter object is passed into QueryParam.metadata_filter
   - QueryParam carries the filter through the query pipeline
3. Query Execution (lightrag/operate.py)
   - Only chunk-based queries use the filter:
     - Line 2749: chunks_vdb.query(..., metadata_filter=query_param.metadata_filter) (Mix/Naive modes)
4. Storage Layer (lightrag/kg/postgres_impl.py)
   - PGVectorStorage: Converts filter to SQL WHERE clause with JSONB operators
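As a rough illustration of the storage-layer step, the sketch below renders a dict-form filter as a parameterized WHERE fragment using the PostgreSQL JSONB containment operator (@>). This is not the code in postgres_impl.py: the to_where function, the metadata column name, the %s placeholder style, and the handling of NOT are all assumptions made for the example. Drivers such as asyncpg use $1, $2 placeholders instead.

```python
import json
from typing import Any


def to_where(flt: dict[str, Any], column: str = "metadata") -> tuple[str, list[str]]:
    """Render a dict-form filter as (WHERE fragment, positional parameters)."""
    params: list[str] = []

    def render(node: dict[str, Any]) -> str:
        if "operator" not in node:
            # Leaf operand: JSONB containment against the serialized dict.
            params.append(json.dumps(node, sort_keys=True))
            return f"{column} @> %s::jsonb"
        if not node.get("operands"):
            raise ValueError("a filter needs at least one operand")
        parts = [render(child) for child in node["operands"]]
        operator = node["operator"].upper()
        if operator == "NOT":
            # Assumption: NOT rejects rows that match any operand.
            return "NOT (" + " OR ".join(parts) + ")"
        if operator not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {operator}")
        return "(" + f" {operator} ".join(parts) + ")"

    return render(flt), params


nested = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
where, params = to_where(nested)
```

Values travel as parameters rather than being interpolated into the SQL text, so user-supplied metadata values cannot alter the statement.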
Code Locations
Key files implementing metadata support:
- lightrag/types.py: MetadataFilter class definition
- lightrag/base.py: QueryParam with metadata_filter field, DocProcessingStatus with metadata persistence
- lightrag/api/routers/query_routes.py: API endpoint that initializes MetadataFilter from JSON
- lightrag/operate.py: Query functions that pass filter to storage (Line 2749)
- lightrag/kg/postgres_impl.py: PostgreSQL JSONB filter implementation
Query Examples
Example 1: Filter by Department (Mix Mode)
from lightrag import QueryParam
from lightrag.types import MetadataFilter
query_param = QueryParam(
mode="mix",
metadata_filter=MetadataFilter(
operator="AND",
operands=[{"department": "Engineering"}]
)
)
response = rag.query("What are the key projects?", param=query_param)
Example 2: Multi-tenant Filtering (Naive Mode)
query_param = QueryParam(
mode="naive",
metadata_filter=MetadataFilter(
operator="AND",
operands=[
{"tenant_id": "tenant_a"},
{"access_level": "admin"}
]
)
)
response = rag.query("Show admin resources", param=query_param)
Example 3: Version Filtering (Mix Mode)
query_param = QueryParam(
mode="mix",
metadata_filter=MetadataFilter(
operator="AND",
operands=[
{"doc_type": "manual"},
{"status": "current"}
]
)
)
response = rag.query("How to configure?", param=query_param)
Storage Backend Support
Important: Metadata filtering is currently only supported for PostgreSQL vector storage.
Vector Storage
- PGVectorStorage: Full support with JSONB filtering
- NanoVectorDBStorage: Not supported
- MilvusVectorDBStorage: Not supported
- ChromaVectorDBStorage: Not supported
- FaissVectorDBStorage: Not supported
- QdrantVectorDBStorage: Not supported
- MongoVectorDBStorage: Not supported
Recommended Configuration
For metadata filtering support:
rag = LightRAG(
working_dir="./storage",
vector_storage="PGVectorStorage",
# Graph storage can be any type
# ... other config
)
Server API Examples
REST API Query with Metadata Filter
Simple Filter (Naive Mode)
curl -X POST http://localhost:9621/query \
-H "Content-Type: application/json" \
-d '{
"query": "What are the key features?",
"mode": "naive",
"metadata_filter": {
"operator": "AND",
"operands": [
{"department": "Engineering"},
{"year": 2024}
]
}
}'
Complex Nested Filter (Mix Mode)
curl -X POST http://localhost:9621/query \
-H "Content-Type: application/json" \
-d '{
"query": "Show me technical documentation",
"mode": "mix",
"metadata_filter": {
"operator": "AND",
"operands": [
{"document_type": "technical_spec"},
{
"operator": "OR",
"operands": [
{"version": "1.0"},
{"version": "2.0"}
]
}
]
}
}'
Multi-tenant Query (Mix Mode)
curl -X POST http://localhost:9621/query \
-H "Content-Type: application/json" \
-d '{
"query": "List all projects",
"mode": "mix",
"metadata_filter": {
"operator": "AND",
"operands": [
{"tenant_id": "tenant_a"},
{"access_level": "admin"}
]
},
"top_k": 20
}'
Python Client with Server
import requests
from lightrag.types import MetadataFilter
# Option 1: Use MetadataFilter class and convert to dict
metadata_filter = MetadataFilter(
operator="AND",
operands=[
{"department": "Engineering"},
{"status": "approved"}
]
)
response = requests.post(
"http://localhost:9621/query",
json={
"query": "What are the approved engineering documents?",
"mode": "mix", # Use mix or naive mode
"metadata_filter": metadata_filter.to_dict(),
"top_k": 10
}
)
# Option 2: Send dict directly (API will convert to MetadataFilter)
response = requests.post(
"http://localhost:9621/query",
json={
"query": "What are the approved engineering documents?",
"mode": "naive", # Use mix or naive mode
"metadata_filter": {
"operator": "AND",
"operands": [
{"department": "Engineering"},
{"status": "approved"}
]
},
"top_k": 10
}
)
result = response.json()
print(result["response"])
How the API Processes Metadata Filters
When you send a query to the REST API:
1. JSON Request → API receives metadata_filter as a dict
2. API Conversion → MetadataFilter.from_dict() creates MetadataFilter object
3. QueryParam → MetadataFilter is set in QueryParam.metadata_filter
4. Query Execution → QueryParam with filter is passed to kg_query() or naive_query()
5. Storage Query → Filter is passed to vector storage query methods (chunks only)
6. SQL → PGVectorStorage converts filter to JSONB WHERE clause
Best Practices
1. Consistent Metadata Schema
# Good - consistent schema
metadata1 = {"author": "John", "dept": "Eng", "year": 2024}
metadata2 = {"author": "Jane", "dept": "Sales", "year": 2024}
2. Simple Indexable Values
# Good - simple values
metadata = {
"status": "approved",
"priority": "high",
"year": 2024
}
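Both practices above can be checked before a document is enqueued. The helper below is a hypothetical sketch, not part of LightRAG: validate_metadata and its required parameter are illustrative names, and the set of allowed scalar types is an assumption.

```python
from typing import Any

# Assumption: these are the "simple" value types worth allowing.
SCALARS = (str, int, float, bool)


def validate_metadata(metadata: dict[str, Any], required: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the metadata is acceptable."""
    problems = [f"missing field: {key}" for key in sorted(required - set(metadata))]
    for key, value in metadata.items():
        if not isinstance(value, SCALARS):
            problems.append(f"non-scalar value for {key}: {type(value).__name__}")
    return problems


schema = {"author", "dept", "year"}
ok = validate_metadata({"author": "John", "dept": "Eng", "year": 2024}, schema)
bad = validate_metadata({"author": "Jane", "year": 2024, "tags": ["a", "b"]}, schema)
```

Rejecting or normalizing documents that return a non-empty list keeps later filters from silently missing chunks because of a misspelled or missing key.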
3. Use Appropriate Query Mode
- Mix mode: Best for combining KG context with filtered chunks
- Naive mode: Best for pure vector search with metadata filtering
4. Performance Considerations
- Keep metadata fields minimal (Should be done automatically by the ORM)
- For PostgreSQL: Create GIN indexes on JSONB metadata columns:
CREATE INDEX idx_chunks_metadata ON chunks USING GIN (metadata);
- Avoid overly complex nested filters
Troubleshooting
Filter Not Working
- Verify storage backend: Ensure you're using PGVectorStorage
- Verify query mode: Use "mix" or "naive" mode only
- Verify metadata exists in chunks
- Check metadata field names match exactly (case-sensitive)
- Check logs for filter parsing errors
- Test without filter first to ensure data exists
Performance Issues
- Reduce filter complexity
- Create GIN indexes on JSONB metadata columns in PostgreSQL
- Profile query execution time
- Consider caching frequently used filters
Unsupported Storage Backend
If you're using a storage backend that doesn't support metadata filtering:
- Migrate to PGVectorStorage
- Or implement post-filtering in application code
- Or contribute metadata filtering support for your backend
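For the post-filtering option, one possible shape is sketched below. It assumes that your application can obtain ranked chunks as dicts carrying a metadata key, which may not be true for every backend; post_filter, the chunk layout, and the over-fetch approach are illustrative assumptions, not a LightRAG API.

```python
from typing import Any, Callable

Chunk = dict[str, Any]


def post_filter(
    chunks: list[Chunk],
    keep: Callable[[dict[str, Any]], bool],
    top_k: int,
) -> list[Chunk]:
    """Drop chunks whose metadata fails `keep`, preserving the ranked order."""
    kept = [chunk for chunk in chunks if keep(chunk.get("metadata", {}))]
    return kept[:top_k]


# Hypothetical ranked candidates, fetched with a larger top_k than needed.
candidates = [
    {"content": "a", "metadata": {"tenant_id": "tenant_a"}},
    {"content": "b", "metadata": {"tenant_id": "tenant_b"}},
    {"content": "c", "metadata": {"tenant_id": "tenant_a"}},
    {"content": "d"},  # no metadata at all: treated as a non-match
    {"content": "e", "metadata": {"tenant_id": "tenant_a"}},
]
result = post_filter(candidates, lambda m: m.get("tenant_id") == "tenant_a", top_k=2)
```

Since filtering happens after retrieval, fewer than top_k chunks may remain; fetching more candidates than needed reduces that risk but does not remove it.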
Metadata Not Persisting After Queue Restart
- Metadata is stored in DocProcessingStatus.metadata
- Check document status storage is properly configured
- Verify metadata is set before document is enqueued
API Reference
MetadataFilter
class MetadataFilter(BaseModel):
    operator: str  # "AND", "OR", or "NOT"
    operands: List[Union[Dict[str, Any], 'MetadataFilter']]
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON serialization"""
        ...
    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'MetadataFilter':
        """Create MetadataFilter from dictionary (used by API)"""
        ...
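The method bodies are elided above. As a guess at what such a round trip could look like, here is a standalone sketch built on a stdlib dataclass. FilterSketch is a hypothetical class, not the real MetadataFilter, and the rule for recognizing a nested filter (a dict with both operator and operands keys) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class FilterSketch:
    """Hypothetical stand-in that mirrors the MetadataFilter interface."""
    operator: str
    operands: list[Union[dict[str, Any], "FilterSketch"]] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        """Serialize recursively so nested filters become plain dicts."""
        return {
            "operator": self.operator,
            "operands": [
                op.to_dict() if isinstance(op, FilterSketch) else op
                for op in self.operands
            ],
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FilterSketch":
        """Rebuild nested filters; leaf operands stay as plain dicts."""
        operands = [
            cls.from_dict(op)
            if isinstance(op, dict) and "operator" in op and "operands" in op
            else op
            for op in data.get("operands", [])
        ]
        return cls(operator=data["operator"], operands=operands)


payload = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
parsed = FilterSketch.from_dict(payload)
```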
QueryParam
@dataclass
class QueryParam:
    metadata_filter: MetadataFilter | None = None  # Filter passed to chunk queries
    mode: str = "mix"  # Only "mix" and "naive" support metadata filtering
    top_k: int = 60
    # ... other fields
DocProcessingStatus
@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""
Query Method
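# Synchronous: query is a str, param is a QueryParam
response = rag.query(
    "What are the key projects?",
    param=query_param  # QueryParam contains metadata_filter
)
# Asynchronous: same arguments, awaited inside a coroutine
response = await rag.aquery(
    "What are the key projects?",
    param=query_param  # QueryParam contains metadata_filter
)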
REST API Query Endpoint
# In lightrag/api/routers/query_routes.py
@router.post("/query")
async def query_endpoint(request: QueryRequest):
    # API receives metadata_filter as dict
    metadata_filter_dict = request.metadata_filter
    # Convert dict to MetadataFilter object
    metadata_filter = MetadataFilter.from_dict(metadata_filter_dict) if metadata_filter_dict else None
    # Create QueryParam with MetadataFilter
    query_param = QueryParam(
        mode=request.mode,  # Must be "mix" or "naive"
        metadata_filter=metadata_filter,
        top_k=request.top_k
    )
    # Execute query with QueryParam
    result = await rag.aquery(request.query, param=query_param)
    return result