docs (metadata): Added Metadata_Filtering.md with examples and explanations for the functionality

GGrassia 2025-10-31 10:35:54 +01:00
parent 166bdf7f99
commit cd664de057

Metadata_Filtering.md Normal file

@ -0,0 +1,481 @@
# Metadata Filtering in LightRAG
## Overview
LightRAG supports metadata filtering during queries to retrieve only relevant chunks based on metadata criteria.
**Important Limitations**:
- Metadata filtering is **only supported with PostgreSQL vector storage (`PGVectorStorage`)**. Inserted metadata is also visible in Neo4j, but it cannot be used for filtering there
- Only **chunk-based queries** support metadata filtering (Mix and Naive modes)
- Metadata is stored in document status and propagated to chunks during extraction
## Metadata Structure
Metadata is stored as a dictionary (`dict[str, Any]`) in:
- Entity nodes (graph storage)
- Relationship edges (graph storage)
- Text chunks (KV storage)
- Vector embeddings (vector storage)
```python
metadata = {
"author": "John Doe",
"department": "Engineering",
"document_type": "technical_spec",
"version": "1.0"
}
```
## Critical: Metadata Persistence in Document Status
**Metadata is stored in `DocProcessingStatus`.** This ensures that metadata is not lost if the processing queue is stopped or interrupted.
### How It Works
1. **Document Status Storage** (`lightrag/base.py` - `DocProcessingStatus`)
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""
```
2. **Metadata Flow**:
- Metadata is stored in `DocProcessingStatus.metadata` when a document is enqueued
- If the queue stops, metadata persists in the document status storage
- When processing resumes, metadata is read from document status
- Metadata is propagated to chunks during extraction
3. **Why This Matters**:
- Queue can be stopped/restarted without losing metadata
- Metadata survives system crashes or interruptions
- Ensures data consistency across processing pipeline
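
The flow above can be illustrated with a small stand-alone sketch. It does not use LightRAG: `DocStatusSketch`, `enqueue`, `resume`, and the JSON file are hypothetical stand-ins for `DocProcessingStatus` and the document status storage, chosen only to show why writing metadata at enqueue time makes it survive a restart.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DocStatusSketch:
    """Hypothetical stand-in for DocProcessingStatus (only the fields needed here)."""
    doc_id: str
    status: str = "pending"
    metadata: dict[str, Any] = field(default_factory=dict)


def enqueue(store: Path, doc_id: str, metadata: dict[str, Any]) -> None:
    """Persist status and metadata together, before any processing starts."""
    records = json.loads(store.read_text()) if store.exists() else {}
    records[doc_id] = asdict(DocStatusSketch(doc_id, metadata=metadata))
    store.write_text(json.dumps(records))


def resume(store: Path, chunks_by_doc: dict[str, list[str]]) -> list[dict[str, Any]]:
    """After a restart, re-read metadata from the store and copy it onto each chunk."""
    records = json.loads(store.read_text())
    chunks = []
    for doc_id, record in records.items():
        status = DocStatusSketch(**record)
        for text in chunks_by_doc.get(doc_id, []):
            chunks.append({"content": text, "metadata": dict(status.metadata)})
    return chunks


store = Path(tempfile.mkdtemp()) / "doc_status.json"
enqueue(store, "doc-1", {"department": "Engineering"})
# The process could stop here: nothing about the metadata lives only in memory.
chunks = resume(store, {"doc-1": ["chunk a", "chunk b"]})
```

Each chunk receives its own copy of the document's metadata, which is what later makes chunk-level filtering possible.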
## Metadata Filtering During Queries
### MetadataFilter Class
```python
from lightrag.types import MetadataFilter

# Simple filter
filter1 = MetadataFilter(
    operator="AND",
    operands=[{"department": "Engineering"}]
)

# Complex filter with OR
filter2 = MetadataFilter(
    operator="OR",
    operands=[
        {"author": "John Doe"},
        {"author": "Jane Smith"}
    ]
)

# Nested filter
filter3 = MetadataFilter(
    operator="AND",
    operands=[
        {"document_type": "technical_spec"},
        MetadataFilter(
            operator="OR",
            operands=[
                {"version": "1.0"},
                {"version": "2.0"}
            ]
        )
    ]
)
```
### Supported Operators
- **AND**: All conditions must be true
- **OR**: At least one condition must be true
- **NOT**: Negates the condition
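
To make the operator semantics concrete, here is a minimal pure-Python evaluator over the dict form of a filter (the same shape the REST API accepts). It is an illustration, not the code LightRAG runs: `matches` is a hypothetical helper, and treating `NOT` as "none of the operands match" is an assumption, since the text above does not say how `NOT` handles several operands.

```python
from typing import Any


def matches(flt: dict[str, Any], metadata: dict[str, Any]) -> bool:
    """Evaluate a filter dict against a single metadata dict."""

    def operand_ok(op: dict[str, Any]) -> bool:
        if "operator" in op and "operands" in op:  # nested filter
            return matches(op, metadata)
        # Plain condition: every key must be present with an equal value.
        return all(k in metadata and metadata[k] == v for k, v in op.items())

    results = [operand_ok(op) for op in flt["operands"]]
    operator = flt["operator"].upper()
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        # Assumption: NOT is true when none of its operands match.
        return not any(results)
    raise ValueError(f"unsupported operator: {operator}")


flt = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "NOT", "operands": [{"version": "1.0"}]},
    ],
}
is_v2_match = matches(flt, {"document_type": "technical_spec", "version": "2.0"})
is_v1_match = matches(flt, {"document_type": "technical_spec", "version": "1.0"})
```

A helper like this could also serve as an application-side post-filter when the vector backend does not support metadata filtering.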
## Supported Query Modes
### Mix Mode (Recommended)
Filters vector chunks from both KG and direct vector search:
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"department": "Engineering"},
            {"status": "approved"}
        ]
    )
)
```
### Naive Mode
Filters vector chunks directly:
```python
query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"document_type": "manual"}]
    )
)
```
## Implementation Details
### Architecture Flow
1. **API Layer** (`lightrag/api/routers/query_routes.py`)
- REST endpoint receives `metadata_filter` as JSON dict
- Converts JSON to `MetadataFilter` object using `MetadataFilter.from_dict()`
2. **QueryParam** (`lightrag/base.py`)
- `MetadataFilter` object is passed into `QueryParam.metadata_filter`
- QueryParam carries the filter through the query pipeline
3. **Query Execution** (`lightrag/operate.py`)
- Only chunk-based queries use the filter:
- Line 2749: `chunks_vdb.query(..., metadata_filter=query_param.metadata_filter)` (Mix/Naive modes)
4. **Storage Layer** (`lightrag/kg/postgres_impl.py`)
- PGVectorStorage: Converts filter to SQL WHERE clause with JSONB operators
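
As a rough picture of the last step, the sketch below renders a filter dict as a parameterized `WHERE` fragment using the PostgreSQL JSONB containment operator `@>`. This is not the code in `postgres_impl.py`: `to_where` is a hypothetical function, the `metadata` column name and the `$n` placeholder style (as used by asyncpg) are assumptions, and the real implementation may choose different operators.

```python
import json
from typing import Any


def to_where(flt: dict[str, Any], params: list[str], column: str = "metadata") -> str:
    """Render a filter dict as a WHERE fragment, appending JSON values to params.

    `column` must be a trusted identifier, never user input.
    """
    parts = []
    for op in flt["operands"]:
        if "operator" in op and "operands" in op:  # nested filter
            parts.append(to_where(op, params, column))
        else:
            params.append(json.dumps(op, sort_keys=True))
            # @> is JSONB containment: the row's metadata must contain this object.
            parts.append(f"{column} @> ${len(params)}::jsonb")
    if not parts:
        raise ValueError("filter has no operands")
    operator = flt["operator"].upper()
    if operator == "NOT":
        # Assumption: NOT excludes rows matching any of its operands.
        return "NOT (" + " OR ".join(parts) + ")"
    if operator not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(parts) + ")"


params: list[str] = []
where = to_where(
    {
        "operator": "AND",
        "operands": [
            {"document_type": "technical_spec"},
            {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
        ],
    },
    params,
)
```

Values travel as bound parameters rather than being spliced into the SQL string. Note that JSONB containment is type-sensitive: a stored `{"year": 2024}` does not contain `{"year": "2024"}`.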
### Code Locations
Key files implementing metadata support:
- `lightrag/types.py`: `MetadataFilter` class definition
- `lightrag/base.py`: `QueryParam` with `metadata_filter` field, `DocProcessingStatus` with metadata persistence
- `lightrag/api/routers/query_routes.py`: API endpoint that initializes MetadataFilter from JSON
- `lightrag/operate.py`: Query functions that pass filter to storage (Line 2749)
- `lightrag/kg/postgres_impl.py`: PostgreSQL JSONB filter implementation
## Query Examples
### Example 1: Filter by Department (Mix Mode)
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"department": "Engineering"}]
    )
)
# rag is an already initialized LightRAG instance
response = rag.query("What are the key projects?", param=query_param)
```
### Example 2: Multi-tenant Filtering (Naive Mode)
```python
query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"tenant_id": "tenant_a"},
            {"access_level": "admin"}
        ]
    )
)
response = rag.query("Show admin resources", param=query_param)
```
### Example 3: Version Filtering (Mix Mode)
```python
query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"doc_type": "manual"},
            {"status": "current"}
        ]
    )
)
response = rag.query("How to configure?", param=query_param)
```
## Storage Backend Support
**Important**: Metadata filtering is currently only supported for PostgreSQL vector storage.
### Vector Storage
- **PGVectorStorage**: Full support with JSONB filtering
- **NanoVectorDBStorage**: Not supported
- **MilvusVectorDBStorage**: Not supported
- **ChromaVectorDBStorage**: Not supported
- **FaissVectorDBStorage**: Not supported
- **QdrantVectorDBStorage**: Not supported
- **MongoVectorDBStorage**: Not supported
### Recommended Configuration
For metadata filtering support:
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./storage",
    vector_storage="PGVectorStorage",
    # Graph storage can be any type
    # ... other config
)
```
## Server API Examples
### REST API Query with Metadata Filter
#### Simple Filter (Naive Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key features?",
    "mode": "naive",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"department": "Engineering"},
        {"year": 2024}
      ]
    }
  }'
```
#### Complex Nested Filter (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Show me technical documentation",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"document_type": "technical_spec"},
        {
          "operator": "OR",
          "operands": [
            {"version": "1.0"},
            {"version": "2.0"}
          ]
        }
      ]
    }
  }'
```
#### Multi-tenant Query (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "List all projects",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"tenant_id": "tenant_a"},
        {"access_level": "admin"}
      ]
    },
    "top_k": 20
  }'
```
### Python Client with Server
```python
import requests
from lightrag.types import MetadataFilter

# Option 1: Use MetadataFilter class and convert to dict
metadata_filter = MetadataFilter(
    operator="AND",
    operands=[
        {"department": "Engineering"},
        {"status": "approved"}
    ]
)
response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "mix",  # Use mix or naive mode
        "metadata_filter": metadata_filter.to_dict(),
        "top_k": 10
    }
)

# Option 2: Send dict directly (API will convert to MetadataFilter)
response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "naive",  # Use mix or naive mode
        "metadata_filter": {
            "operator": "AND",
            "operands": [
                {"department": "Engineering"},
                {"status": "approved"}
            ]
        },
        "top_k": 10
    }
)
result = response.json()
print(result["response"])
```
### How the API Processes Metadata Filters
When you send a query to the REST API:
1. **JSON Request** → API receives `metadata_filter` as a dict
2. **API Conversion** → `MetadataFilter.from_dict()` creates a `MetadataFilter` object
3. **QueryParam** → MetadataFilter is set in `QueryParam.metadata_filter`
4. **Query Execution** → QueryParam with filter is passed to `kg_query()` or `naive_query()`
5. **Storage Query** → Filter is passed to vector storage query methods (chunks only)
6. **SQL** → PGVectorStorage converts filter to JSONB WHERE clause
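
The conversion in step 2 has to be recursive, because an operand can itself be a filter. The sketch below shows one way such a round trip could work. `FilterSketch` is a hypothetical stdlib stand-in written for this illustration. It is not the `MetadataFilter` class from `lightrag/types.py`, whose method bodies are not reproduced in this document.

```python
from dataclasses import dataclass, field
from typing import Any, Union


def _is_filter(op: Any) -> bool:
    """A dict with both keys is treated as a nested filter, not a plain condition."""
    return isinstance(op, dict) and "operator" in op and "operands" in op


@dataclass
class FilterSketch:
    """Hypothetical stand-in for MetadataFilter, showing the recursive conversion."""
    operator: str
    operands: list[Union[dict[str, Any], "FilterSketch"]] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        return {
            "operator": self.operator,
            "operands": [
                op.to_dict() if isinstance(op, FilterSketch) else op
                for op in self.operands
            ],
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FilterSketch":
        operands = [cls.from_dict(op) if _is_filter(op) else op for op in data["operands"]]
        return cls(operator=data["operator"], operands=operands)


payload = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
parsed = FilterSketch.from_dict(payload)
```

One consequence of this detection rule: a metadata condition that itself uses both `operator` and `operands` as field names would be mistaken for a nested filter, so those names are best avoided in metadata.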
## Best Practices
### 1. Consistent Metadata Schema
```python
# Good - consistent schema
metadata1 = {"author": "John", "dept": "Eng", "year": 2024}
metadata2 = {"author": "Jane", "dept": "Sales", "year": 2024}
```
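
A consistent schema is easier to keep if it is checked before documents are inserted. The following is a minimal sketch of such a check. `EXPECTED_SCHEMA` and `schema_problems` are hypothetical names invented for this example and are not part of LightRAG.

```python
from typing import Any

# Hypothetical schema for this example: field name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {"author": str, "dept": str, "year": int}


def schema_problems(
    metadata: dict[str, Any], schema: dict[str, type] = EXPECTED_SCHEMA
) -> list[str]:
    """Return a sorted list of human-readable deviations from the schema."""
    problems = []
    for key in schema.keys() - metadata.keys():
        problems.append(f"missing key: {key}")
    for key in metadata.keys() - schema.keys():
        problems.append(f"unexpected key: {key}")
    for key, expected in schema.items():
        if key in metadata and not isinstance(metadata[key], expected):
            actual = type(metadata[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {actual}")
    return sorted(problems)


good = schema_problems({"author": "John", "dept": "Eng", "year": 2024})
bad = schema_problems({"author": "Jane", "department": "Sales", "year": "2024"})
```

Here `good` is empty, while `bad` reports the renamed `department` key and the string-typed `year`, both of which would cause filters written against the schema to miss the document.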
### 2. Simple Indexable Values
```python
# Good - simple values
metadata = {
"status": "approved",
"priority": "high",
"year": 2024
}
```
### 3. Use Appropriate Query Mode
- **Mix mode**: Best for combining KG context with filtered chunks
- **Naive mode**: Best for pure vector search with metadata filtering
### 4. Performance Considerations
- Keep metadata fields minimal (this should be handled automatically by the ORM)
- For PostgreSQL: create a GIN index on the JSONB metadata column (adjust the table and column names to match your schema):
```sql
CREATE INDEX idx_chunks_metadata ON chunks USING GIN (metadata);
```
- Avoid overly complex nested filters
## Troubleshooting
### Filter Not Working
1. **Verify storage backend**: Ensure you're using PGVectorStorage
2. **Verify query mode**: Use "mix" or "naive" mode only
3. Verify metadata exists in chunks
4. Check metadata field names match exactly (case-sensitive)
5. Check logs for filter parsing errors
6. Test without filter first to ensure data exists
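
For step 4, a quick way to spot case mismatches is to compare the field names used in the filter with the field names actually stored. The helper below is a hypothetical diagnostic sketch, not a LightRAG utility. It assumes you have already collected the stored metadata keys yourself, for example from a sample of chunks.

```python
from typing import Any


def filter_keys(flt: dict[str, Any]) -> set[str]:
    """Collect every metadata field name used anywhere in a filter dict."""
    keys: set[str] = set()
    for op in flt["operands"]:
        if "operator" in op and "operands" in op:  # nested filter
            keys |= filter_keys(op)
        else:
            keys |= set(op)
    return keys


def near_misses(flt: dict[str, Any], stored_keys: set[str]) -> dict[str, str]:
    """Map filter keys missing from stored_keys to a stored key that differs only by case."""
    by_lower = {key.lower(): key for key in stored_keys}
    return {
        key: by_lower[key.lower()]
        for key in filter_keys(flt)
        if key not in stored_keys and key.lower() in by_lower
    }


suspects = near_misses(
    {"operator": "AND", "operands": [{"Department": "Engineering"}, {"status": "approved"}]},
    {"department", "status"},
)
```

In this example `suspects` maps `Department` to the stored `department`, which points at the filter as the source of an empty result.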
### Performance Issues
1. Reduce filter complexity
2. Create GIN indexes on JSONB metadata columns in PostgreSQL
3. Profile query execution time
4. Consider caching frequently used filters
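
If caching is taken to mean reusing results for repeated query and filter combinations, the cache needs a stable key. The sketch below derives one from the dict form of the filter. `filter_cache_key` is a hypothetical helper, and whether LightRAG's own caching takes the metadata filter into account is not stated in this document.

```python
import hashlib
import json
from typing import Any, Optional


def filter_cache_key(query: str, mode: str, flt: Optional[dict[str, Any]]) -> str:
    """Build a deterministic key; sort_keys makes dict key order irrelevant."""
    payload = json.dumps({"query": query, "mode": mode, "filter": flt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = filter_cache_key(
    "List all projects",
    "mix",
    {"operator": "AND", "operands": [{"department": "Engineering", "status": "approved"}]},
)
key_b = filter_cache_key(
    "List all projects",
    "mix",
    {"operands": [{"status": "approved", "department": "Engineering"}], "operator": "AND"},
)
key_unfiltered = filter_cache_key("List all projects", "mix", None)
```

`key_a` and `key_b` are equal because only the key order differs. The order of items in `operands` is preserved, so logically equivalent filters with reordered operands still produce different keys.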
### Unsupported Storage Backend
If you're using a storage backend that doesn't support metadata filtering:
1. Migrate to PGVectorStorage
2. Or implement post-filtering in application code
3. Or contribute metadata filtering support for your backend
### Metadata Not Persisting After Queue Restart
- Metadata is stored in `DocProcessingStatus.metadata`
- Check document status storage is properly configured
- Verify metadata is set before document is enqueued
## API Reference
### MetadataFilter
```python
from typing import Any, Dict, List, Union

from pydantic import BaseModel  # assuming BaseModel is Pydantic's

class MetadataFilter(BaseModel):
    operator: str  # "AND", "OR", or "NOT"
    operands: List[Union[Dict[str, Any], 'MetadataFilter']]

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON serialization"""
        ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'MetadataFilter':
        """Create MetadataFilter from dictionary (used by API)"""
        ...
```
### QueryParam
```python
from dataclasses import dataclass

from lightrag.types import MetadataFilter

@dataclass
class QueryParam:
    metadata_filter: MetadataFilter | None = None  # Filter passed to chunk queries
    mode: str = "mix"  # Only "mix" and "naive" support metadata filtering
    top_k: int = 60
    # ... other fields
```
### DocProcessingStatus
```python
@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""
```
### Query Method
```python
# Synchronous
response = rag.query(
    "What are the key projects?",  # query: str
    param=query_param,  # QueryParam contains metadata_filter
)
# Asynchronous (call from inside an async function)
response = await rag.aquery(
    "What are the key projects?",  # query: str
    param=query_param,  # QueryParam contains metadata_filter
)
```
### REST API Query Endpoint
```python
# In lightrag/api/routers/query_routes.py
@router.post("/query")
async def query_endpoint(request: QueryRequest):
    # API receives metadata_filter as dict
    metadata_filter_dict = request.metadata_filter
    # Convert dict to MetadataFilter object
    metadata_filter = MetadataFilter.from_dict(metadata_filter_dict) if metadata_filter_dict else None
    # Create QueryParam with MetadataFilter
    query_param = QueryParam(
        mode=request.mode,  # Must be "mix" or "naive"
        metadata_filter=metadata_filter,
        top_k=request.top_k
    )
    # Execute query with QueryParam
    result = await rag.aquery(request.query, param=query_param)
    return result
```