docs (metadata): Added Metadata_Filtering.md with examples and explanations for the functionality
This commit is contained in:
parent
166bdf7f99
commit
cd664de057
1 changed files with 481 additions and 0 deletions
481
Metadata_Filtering.md
Normal file
# Metadata Filtering in LightRAG

## Overview

LightRAG supports metadata filtering during queries, so that only chunks whose metadata matches the given criteria are retrieved.

**Important Limitations**:
- Metadata filtering is **only supported for PostgreSQL (`PGVectorStorage`)**. Metadata insertion is also visible in Neo4j, but filtering is not available there.
- Only **chunk-based queries** support metadata filtering (Mix and Naive modes)
- Metadata is stored in document status and propagated to chunks during extraction

## Metadata Structure

Metadata is stored as a dictionary (`dict[str, Any]`) in:
- Entity nodes (graph storage)
- Relationship edges (graph storage)
- Text chunks (KV storage)
- Vector embeddings (vector storage)

```python
metadata = {
    "author": "John Doe",
    "department": "Engineering",
    "document_type": "technical_spec",
    "version": "1.0"
}
```

## Critical: Metadata Persistence in Document Status

**Metadata is stored in DocProcessingStatus** - This ensures metadata is not lost if the processing queue is stopped or interrupted.

### How It Works

1. **Document Status Storage** (`lightrag/base.py` - `DocProcessingStatus`)
   ```python
   @dataclass
   class DocProcessingStatus:
       # ... other fields
       metadata: dict[str, Any] = field(default_factory=dict)
       """Additional metadata - PERSISTED across queue restarts"""
   ```

2. **Metadata Flow**:
   - Metadata stored in `DocProcessingStatus.metadata` when document is enqueued
   - If queue stops, metadata persists in document status storage
   - When processing resumes, metadata is read from document status
   - Metadata is propagated to chunks during extraction

3. **Why This Matters**:
   - Queue can be stopped/restarted without losing metadata
   - Metadata survives system crashes or interruptions
   - Ensures data consistency across processing pipeline

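The persistence flow above can be illustrated with a minimal, self-contained sketch. `DocStatusSketch`, `save_status`, `load_status`, `attach_metadata`, and the JSON file are hypothetical stand-ins for `DocProcessingStatus` and the configured document status storage; they are not LightRAG APIs. The sketch only shows the idea: metadata is saved at enqueue time, read back after a simulated restart, and then copied onto each chunk.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DocStatusSketch:
    """Hypothetical stand-in for DocProcessingStatus (not the real class)."""
    doc_id: str
    status: str = "pending"
    metadata: dict[str, Any] = field(default_factory=dict)


def save_status(path: Path, status: DocStatusSketch) -> None:
    # Persist the status, including metadata, when the document is enqueued.
    path.write_text(json.dumps(asdict(status)))


def load_status(path: Path) -> DocStatusSketch:
    # After a restart, rebuild the status (and its metadata) from storage.
    return DocStatusSketch(**json.loads(path.read_text()))


def attach_metadata(chunks: list[str], status: DocStatusSketch) -> list[dict[str, Any]]:
    # Propagate the document-level metadata to every chunk.
    return [{"content": c, "metadata": dict(status.metadata)} for c in chunks]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "doc_status.json"
    save_status(store, DocStatusSketch("doc-1", metadata={"department": "Engineering"}))
    restored = load_status(store)  # simulated queue restart
    chunk_records = attach_metadata(["chunk a", "chunk b"], restored)
```

Because the metadata travels with the status record rather than living only in memory, the restored object carries the same dictionary that was supplied at enqueue time.
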
## Metadata Filtering During Queries

### MetadataFilter Class

```python
from lightrag.types import MetadataFilter

# Simple filter
filter1 = MetadataFilter(
    operator="AND",
    operands=[{"department": "Engineering"}]
)

# Complex filter with OR
filter2 = MetadataFilter(
    operator="OR",
    operands=[
        {"author": "John Doe"},
        {"author": "Jane Smith"}
    ]
)

# Nested filter
filter3 = MetadataFilter(
    operator="AND",
    operands=[
        {"document_type": "technical_spec"},
        MetadataFilter(
            operator="OR",
            operands=[
                {"version": "1.0"},
                {"version": "2.0"}
            ]
        )
    ]
)
```

### Supported Operators

- **AND**: All conditions must be true
- **OR**: At least one condition must be true
- **NOT**: Negates the condition

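To make the operator semantics concrete, here is a small pure-Python evaluator. It works on the plain-dict form of a filter (the same shape as the JSON sent to the REST API). It assumes that a leaf operand such as `{"department": "Engineering"}` means exact equality on every key, and that `NOT` is satisfied when none of its operands match. Both are assumptions made for illustration, and `matches` is a hypothetical helper; the actual matching is performed in SQL by `PGVectorStorage`.

```python
from typing import Any


def matches(filter_dict: dict[str, Any], metadata: dict[str, Any]) -> bool:
    """Evaluate a filter in dict form against one chunk's metadata."""
    # A nested filter has "operator" and "operands"; anything else is a leaf.
    if "operator" not in filter_dict:
        return all(metadata.get(k) == v for k, v in filter_dict.items())

    results = [matches(op, metadata) for op in filter_dict["operands"]]
    operator = filter_dict["operator"].upper()
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        # Assumption: NOT is true when none of its operands match.
        return not any(results)
    raise ValueError(f"Unsupported operator: {operator}")


nested = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}

spec_v2 = {"document_type": "technical_spec", "version": "2.0"}
spec_v3 = {"document_type": "technical_spec", "version": "3.0"}
print(matches(nested, spec_v2))  # True
print(matches(nested, spec_v3))  # False
```
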
## Supported Query Modes

### Mix Mode (Recommended)
Filters vector chunks from both KG and direct vector search:
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"department": "Engineering"},
            {"status": "approved"}
        ]
    )
)
```

### Naive Mode
Filters vector chunks directly:
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"document_type": "manual"}]
    )
)
```

## Implementation Details

### Architecture Flow

1. **API Layer** (`lightrag/api/routers/query_routes.py`)
   - REST endpoint receives `metadata_filter` as JSON dict
   - Converts JSON to `MetadataFilter` object using `MetadataFilter.from_dict()`

2. **QueryParam** (`lightrag/base.py`)
   - `MetadataFilter` object is passed into `QueryParam.metadata_filter`
   - QueryParam carries the filter through the query pipeline

3. **Query Execution** (`lightrag/operate.py`)
   - Only chunk-based queries use the filter:
     - Line 2749: `chunks_vdb.query(..., metadata_filter=query_param.metadata_filter)` (Mix/Naive modes)

4. **Storage Layer** (`lightrag/kg/postgres_impl.py`)
   - PGVectorStorage: Converts filter to SQL WHERE clause with JSONB operators

### Code Locations

Key files implementing metadata support:
- `lightrag/types.py`: `MetadataFilter` class definition
- `lightrag/base.py`: `QueryParam` with `metadata_filter` field, `DocProcessingStatus` with metadata persistence
- `lightrag/api/routers/query_routes.py`: API endpoint that initializes MetadataFilter from JSON
- `lightrag/operate.py`: Query functions that pass filter to storage (Line 2749)
- `lightrag/kg/postgres_impl.py`: PostgreSQL JSONB filter implementation

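The storage-layer step can be pictured with a rough sketch of how a nested filter might become a parameterized WHERE clause. This is not the code in `lightrag/kg/postgres_impl.py`, which is not reproduced in this document. The helper name `to_where_clause`, the column name `metadata`, the use of the JSONB containment operator `@>`, and the `%s` placeholder style are all assumptions chosen for illustration.

```python
import json
from typing import Any


def to_where_clause(filter_dict: dict[str, Any], column: str = "metadata") -> tuple[str, list[str]]:
    """Translate a filter in dict form into a SQL fragment plus parameters.

    `column` is interpolated into the SQL text, so it must be a trusted name.
    """
    if "operator" not in filter_dict:
        # Leaf operand: JSONB containment, e.g. metadata @> '{"k": "v"}'.
        return f"{column} @> %s::jsonb", [json.dumps(filter_dict)]

    fragments: list[str] = []
    params: list[str] = []
    for operand in filter_dict["operands"]:
        sql, sub_params = to_where_clause(operand, column)
        fragments.append(sql)
        params.extend(sub_params)
    if not fragments:
        raise ValueError("A filter needs at least one operand")

    operator = filter_dict["operator"].upper()
    if operator in ("AND", "OR"):
        return "(" + f" {operator} ".join(fragments) + ")", params
    if operator == "NOT":
        # Assumption: NOT negates the disjunction of its operands.
        return "NOT (" + " OR ".join(fragments) + ")", params
    raise ValueError(f"Unsupported operator: {operator}")


where_sql, where_params = to_where_clause({
    "operator": "AND",
    "operands": [
        {"department": "Engineering"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
})
print(where_sql)
print(where_params)
```

Passing the values as parameters instead of formatting them into the SQL string keeps user-supplied metadata values out of the query text.
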
## Query Examples

### Example 1: Filter by Department (Mix Mode)
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"department": "Engineering"}]
    )
)

response = rag.query("What are the key projects?", param=query_param)
```

### Example 2: Multi-tenant Filtering (Naive Mode)
```python
query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"tenant_id": "tenant_a"},
            {"access_level": "admin"}
        ]
    )
)

response = rag.query("Show admin resources", param=query_param)
```

### Example 3: Version Filtering (Mix Mode)
```python
query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"doc_type": "manual"},
            {"status": "current"}
        ]
    )
)

response = rag.query("How to configure?", param=query_param)
```

## Storage Backend Support

**Important**: Metadata filtering is currently only supported for PostgreSQL vector storage.

### Vector Storage
- **PGVectorStorage**: Full support with JSONB filtering
- **NanoVectorDBStorage**: Not supported
- **MilvusVectorDBStorage**: Not supported
- **ChromaVectorDBStorage**: Not supported
- **FaissVectorDBStorage**: Not supported
- **QdrantVectorDBStorage**: Not supported
- **MongoVectorDBStorage**: Not supported

### Recommended Configuration

For metadata filtering support:
```python
rag = LightRAG(
    working_dir="./storage",
    vector_storage="PGVectorStorage",
    # Graph storage can be any type
    # ... other config
)
```

## Server API Examples

### REST API Query with Metadata Filter

#### Simple Filter (Naive Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key features?",
    "mode": "naive",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"department": "Engineering"},
        {"year": 2024}
      ]
    }
  }'
```

#### Complex Nested Filter (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Show me technical documentation",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"document_type": "technical_spec"},
        {
          "operator": "OR",
          "operands": [
            {"version": "1.0"},
            {"version": "2.0"}
          ]
        }
      ]
    }
  }'
```

#### Multi-tenant Query (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "List all projects",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"tenant_id": "tenant_a"},
        {"access_level": "admin"}
      ]
    },
    "top_k": 20
  }'
```

### Python Client with Server

```python
import requests
from lightrag.types import MetadataFilter

# Option 1: Use MetadataFilter class and convert to dict
metadata_filter = MetadataFilter(
    operator="AND",
    operands=[
        {"department": "Engineering"},
        {"status": "approved"}
    ]
)

response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "mix",  # Use mix or naive mode
        "metadata_filter": metadata_filter.to_dict(),
        "top_k": 10
    }
)

# Option 2: Send dict directly (API will convert to MetadataFilter)
response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "naive",  # Use mix or naive mode
        "metadata_filter": {
            "operator": "AND",
            "operands": [
                {"department": "Engineering"},
                {"status": "approved"}
            ]
        },
        "top_k": 10
    }
)

result = response.json()
print(result["response"])
```

### How the API Processes Metadata Filters

When you send a query to the REST API:

1. **JSON Request** → API receives `metadata_filter` as a dict
2. **API Conversion** → `MetadataFilter.from_dict()` creates MetadataFilter object
3. **QueryParam** → MetadataFilter is set in `QueryParam.metadata_filter`
4. **Query Execution** → QueryParam with filter is passed to `kg_query()` or `naive_query()`
5. **Storage Query** → Filter is passed to vector storage query methods (chunks only)
6. **SQL** → PGVectorStorage converts filter to JSONB WHERE clause

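Steps 1 to 3 hinge on converting between nested dicts and filter objects. The real `MetadataFilter.from_dict()` and `to_dict()` live in `lightrag/types.py` and are not reproduced here. The stand-in `FilterSketch` dataclass below is a hypothetical illustration of how such a recursive conversion could work, assuming that any operand carrying an `"operator"` key is itself a nested filter.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class FilterSketch:
    """Hypothetical stand-in for MetadataFilter (not the real class)."""
    operator: str
    operands: list[Union[dict[str, Any], "FilterSketch"]] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FilterSketch":
        # Operands with an "operator" key are nested filters; the rest are leaves.
        operands = [
            cls.from_dict(op) if "operator" in op else dict(op)
            for op in data.get("operands", [])
        ]
        return cls(operator=data["operator"], operands=operands)

    def to_dict(self) -> dict[str, Any]:
        # Mirror of from_dict: nested filters are serialized recursively.
        return {
            "operator": self.operator,
            "operands": [
                op.to_dict() if isinstance(op, FilterSketch) else dict(op)
                for op in self.operands
            ],
        }


payload = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
parsed = FilterSketch.from_dict(payload)
round_tripped = parsed.to_dict()
```

One consequence of this design is that a metadata field literally named `operator` could not be used as a leaf key, since it would be mistaken for a nested filter.
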
## Best Practices

### 1. Consistent Metadata Schema
```python
# Good - consistent schema
metadata1 = {"author": "John", "dept": "Eng", "year": 2024}
metadata2 = {"author": "Jane", "dept": "Sales", "year": 2024}
```

### 2. Simple Indexable Values
```python
# Good - simple values
metadata = {
    "status": "approved",
    "priority": "high",
    "year": 2024
}
```

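A lightweight check before insertion can help enforce both practices. The helper below is a hypothetical sketch and not part of LightRAG: it assumes a fixed set of expected keys and treats only strings, numbers, and booleans as simple values.

```python
from typing import Any

# Assumption: only these scalar types count as "simple" metadata values.
SIMPLE_TYPES = (str, int, float, bool)


def validate_metadata(metadata: dict[str, Any], expected_keys: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks fine."""
    problems = []
    missing = expected_keys - set(metadata)
    extra = set(metadata) - expected_keys
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    for key, value in metadata.items():
        if not isinstance(value, SIMPLE_TYPES):
            problems.append(f"{key!r} has non-simple value of type {type(value).__name__}")
    return problems


schema = {"author", "dept", "year"}
ok = validate_metadata({"author": "John", "dept": "Eng", "year": 2024}, schema)
bad = validate_metadata({"author": "Jane", "tags": ["a", "b"]}, schema)
print(ok)
print(bad)
```
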
### 3. Use Appropriate Query Mode
- **Mix mode**: Best for combining KG context with filtered chunks
- **Naive mode**: Best for pure vector search with metadata filtering

### 4. Performance Considerations
- Keep metadata fields minimal (this should be done automatically by the ORM)
- For PostgreSQL: Create GIN indexes on JSONB metadata columns:
  ```sql
  CREATE INDEX idx_chunks_metadata ON chunks USING GIN (metadata);
  ```
- Avoid overly complex nested filters

## Troubleshooting

### Filter Not Working
1. **Verify storage backend**: Ensure you're using PGVectorStorage
2. **Verify query mode**: Use "mix" or "naive" mode only
3. Verify metadata exists in chunks
4. Check metadata field names match exactly (case-sensitive)
5. Check logs for filter parsing errors
6. Test without filter first to ensure data exists

### Performance Issues
1. Reduce filter complexity
2. Create GIN indexes on JSONB metadata columns in PostgreSQL
3. Profile query execution time
4. Consider caching frequently used filters

### Unsupported Storage Backend
If you're using a storage backend that doesn't support metadata filtering:
1. Migrate to PGVectorStorage
2. Or implement post-filtering in application code
3. Or contribute metadata filtering support for your backend

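For the post-filtering option, a rough sketch follows. It assumes that retrieved chunks are available to the application as dicts with a `metadata` key, which is a hypothetical shape and depends on how results are obtained from the backend. `post_filter` is an illustrative helper, not a LightRAG function.

```python
from typing import Any, Callable


def post_filter(
    chunks: list[dict[str, Any]],
    predicate: Callable[[dict[str, Any]], bool],
    top_k: int,
) -> list[dict[str, Any]]:
    """Keep the first top_k chunks whose metadata satisfies the predicate."""
    kept = []
    for chunk in chunks:
        # Chunks without metadata are checked against an empty dict.
        if predicate(chunk.get("metadata", {})):
            kept.append(chunk)
            if len(kept) == top_k:
                break
    return kept


# Hypothetical retrieval results, already ordered by similarity.
retrieved = [
    {"content": "a", "metadata": {"tenant_id": "tenant_a"}},
    {"content": "b", "metadata": {"tenant_id": "tenant_b"}},
    {"content": "c", "metadata": {"tenant_id": "tenant_a"}},
    {"content": "d", "metadata": {"tenant_id": "tenant_a"}},
]
filtered = post_filter(retrieved, lambda m: m.get("tenant_id") == "tenant_a", top_k=2)
print([c["content"] for c in filtered])  # ['a', 'c']
```

Since filtering happens after retrieval, it usually makes sense to fetch more candidates than needed; otherwise the filtered result can end up much smaller than the requested `top_k`.
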
### Metadata Not Persisting After Queue Restart
- Metadata is stored in `DocProcessingStatus.metadata`
- Check document status storage is properly configured
- Verify metadata is set before document is enqueued

## API Reference

### MetadataFilter
```python
class MetadataFilter(BaseModel):
    operator: str  # "AND", "OR", or "NOT"
    operands: List[Union[Dict[str, Any], 'MetadataFilter']]

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON serialization"""
        ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'MetadataFilter':
        """Create MetadataFilter from dictionary (used by API)"""
        ...
```

### QueryParam
```python
@dataclass
class QueryParam:
    metadata_filter: MetadataFilter | None = None  # Filter passed to chunk queries
    mode: str = "mix"  # Only "mix" and "naive" support metadata filtering
    top_k: int = 60
    # ... other fields
```

### DocProcessingStatus
```python
@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""
```

### Query Method
```python
# Synchronous
response = rag.query(
    "your question",    # query: str
    param=query_param,  # param: QueryParam, which contains metadata_filter
)

# Asynchronous
response = await rag.aquery(
    "your question",    # query: str
    param=query_param,  # param: QueryParam, which contains metadata_filter
)
```

### REST API Query Endpoint
```python
# In lightrag/api/routers/query_routes.py
@router.post("/query")
async def query_endpoint(request: QueryRequest):
    # API receives metadata_filter as dict
    metadata_filter_dict = request.metadata_filter

    # Convert dict to MetadataFilter object
    metadata_filter = MetadataFilter.from_dict(metadata_filter_dict) if metadata_filter_dict else None

    # Create QueryParam with MetadataFilter
    query_param = QueryParam(
        mode=request.mode,  # Must be "mix" or "naive"
        metadata_filter=metadata_filter,
        top_k=request.top_k
    )

    # Execute query with QueryParam
    result = await rag.aquery(request.query, param=query_param)
    return result
```