docs (metadata): Added Metadata_Filtering.md with examples and explanations for the functionality
parent 166bdf7f99
commit cd664de057
1 changed file with 481 additions and 0 deletions

Metadata_Filtering.md (Normal file)
@@ -0,0 +1,481 @@

# Metadata Filtering in LightRAG

## Overview

LightRAG supports metadata filtering during queries, so that only chunks whose metadata matches the given criteria are retrieved.

**Important Limitations**:
- Metadata filtering is supported **only for PostgreSQL (`PGVectorStorage`)**; inserted metadata is also visible in Neo4j
- Only **chunk-based queries** support metadata filtering (Mix and Naive modes)
- Metadata is stored in document status and propagated to chunks during extraction

## Metadata Structure

Metadata is stored as a dictionary (`dict[str, Any]`) in:
- Entity nodes (graph storage)
- Relationship edges (graph storage)
- Text chunks (KV storage)
- Vector embeddings (vector storage)

```python
metadata = {
    "author": "John Doe",
    "department": "Engineering",
    "document_type": "technical_spec",
    "version": "1.0"
}
```

## Critical: Metadata Persistence in Document Status

**Metadata is stored in `DocProcessingStatus`** - This ensures metadata is not lost if the processing queue is stopped or interrupted.

### How It Works

1. **Document Status Storage** (`lightrag/base.py` - `DocProcessingStatus`)
   ```python
   @dataclass
   class DocProcessingStatus:
       # ... other fields
       metadata: dict[str, Any] = field(default_factory=dict)
       """Additional metadata - PERSISTED across queue restarts"""
   ```

2. **Metadata Flow**:
   - Metadata is stored in `DocProcessingStatus.metadata` when the document is enqueued
   - If the queue stops, metadata persists in document status storage
   - When processing resumes, metadata is read from document status
   - Metadata is propagated to chunks during extraction

3. **Why This Matters**:
   - The queue can be stopped and restarted without losing metadata
   - Metadata survives system crashes or interruptions
   - Ensures data consistency across the processing pipeline
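
The flow above can be pictured with a small stand-alone sketch. It does not use LightRAG's storage classes: `DocStatus`, `enqueue`, `resume_and_chunk`, the JSON file, and the fixed-size chunking are all hypothetical stand-ins, chosen only to show metadata being written before processing, surviving a simulated restart, and being copied onto each chunk.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DocStatus:
    """Hypothetical stand-in for DocProcessingStatus (only the fields used here)."""
    doc_id: str
    content: str
    status: str = "pending"
    metadata: dict[str, Any] = field(default_factory=dict)


def enqueue(store: Path, doc: DocStatus) -> None:
    """Persist the document status, including metadata, before any processing."""
    store.write_text(json.dumps(asdict(doc)))


def resume_and_chunk(store: Path, chunk_size: int = 20) -> list[dict[str, Any]]:
    """Reload the status from disk and copy its metadata onto every chunk."""
    doc = DocStatus(**json.loads(store.read_text()))
    pieces = [doc.content[i:i + chunk_size] for i in range(0, len(doc.content), chunk_size)]
    return [{"content": piece, "metadata": dict(doc.metadata)} for piece in pieces]


store_path = Path(tempfile.mkdtemp()) / "doc_status.json"
enqueue(
    store_path,
    DocStatus(
        doc_id="doc-1",
        content="LightRAG keeps metadata with each chunk.",
        metadata={"department": "Engineering"},
    ),
)

# Simulated restart: nothing is kept in memory, only the file on disk is read.
chunks = resume_and_chunk(store_path)
print(len(chunks), chunks[0]["metadata"])
```

Because the metadata lives in the persisted status record rather than in the in-memory queue, the second step needs nothing from the first except the stored record.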

## Metadata Filtering During Queries

### MetadataFilter Class

```python
from lightrag.types import MetadataFilter

# Simple filter
filter1 = MetadataFilter(
    operator="AND",
    operands=[{"department": "Engineering"}]
)

# Complex filter with OR
filter2 = MetadataFilter(
    operator="OR",
    operands=[
        {"author": "John Doe"},
        {"author": "Jane Smith"}
    ]
)

# Nested filter
filter3 = MetadataFilter(
    operator="AND",
    operands=[
        {"document_type": "technical_spec"},
        MetadataFilter(
            operator="OR",
            operands=[
                {"version": "1.0"},
                {"version": "2.0"}
            ]
        )
    ]
)
```

### Supported Operators

- **AND**: All conditions must be true
- **OR**: At least one condition must be true
- **NOT**: Negates the condition
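
To make the operator semantics concrete, here is a minimal pure-Python sketch that evaluates a filter, written in its JSON form, against a single metadata dictionary. The `matches` function is hypothetical and is not part of LightRAG; in particular, treating NOT as "none of the operands match" is an assumption, since the text above only says that NOT negates the condition.

```python
from typing import Any


def matches(filter_dict: dict[str, Any], metadata: dict[str, Any]) -> bool:
    """Evaluate a filter in its JSON form against one metadata dict."""
    results = []
    for operand in filter_dict["operands"]:
        if "operator" in operand and "operands" in operand:
            # Nested filter: evaluate it recursively.
            results.append(matches(operand, metadata))
        else:
            # Leaf condition: every key must be present with an equal value.
            results.append(all(k in metadata and metadata[k] == v for k, v in operand.items()))

    operator = filter_dict["operator"]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        # Assumption: NOT is true when none of its operands match.
        return not any(results)
    raise ValueError(f"Unsupported operator: {operator}")


chunk_metadata = {"document_type": "technical_spec", "version": "2.0"}
nested_filter = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}

print(matches(nested_filter, chunk_metadata))  # True
print(matches({"operator": "NOT", "operands": [{"version": "2.0"}]}, chunk_metadata))  # False
```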

## Supported Query Modes

### Mix Mode (Recommended)
Filters vector chunks from both KG and direct vector search:
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"department": "Engineering"},
            {"status": "approved"}
        ]
    )
)
```

### Naive Mode
Filters vector chunks directly:
```python
query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"document_type": "manual"}]
    )
)
```

## Implementation Details

### Architecture Flow

1. **API Layer** (`lightrag/api/routers/query_routes.py`)
   - REST endpoint receives `metadata_filter` as JSON dict
   - Converts JSON to `MetadataFilter` object using `MetadataFilter.from_dict()`

2. **QueryParam** (`lightrag/base.py`)
   - `MetadataFilter` object is passed into `QueryParam.metadata_filter`
   - QueryParam carries the filter through the query pipeline

3. **Query Execution** (`lightrag/operate.py`)
   - Only chunk-based queries use the filter:
     - Line 2749: `chunks_vdb.query(..., metadata_filter=query_param.metadata_filter)` (Mix/Naive modes)

4. **Storage Layer** (`lightrag/kg/postgres_impl.py`)
   - PGVectorStorage: Converts filter to SQL WHERE clause with JSONB operators
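
As an illustration of the last step, the sketch below turns a filter in its JSON form into a parameterized WHERE fragment built on the JSONB containment operator `@>`. This is not the code in `postgres_impl.py`: the function name `to_where_clause`, the `metadata` column name, the psycopg-style `%s` placeholders (other drivers use a different placeholder syntax), and the handling of NOT are all assumptions made for the example.

```python
import json
from typing import Any


def to_where_clause(filter_dict: dict[str, Any], column: str = "metadata") -> tuple[str, list[str]]:
    """Return a SQL fragment with %s placeholders and the JSON parameters for it."""
    if not filter_dict["operands"]:
        raise ValueError("A filter needs at least one operand")

    fragments: list[str] = []
    params: list[str] = []
    for operand in filter_dict["operands"]:
        if "operator" in operand and "operands" in operand:
            # Nested filter: build its fragment recursively and parenthesize it.
            sql, nested_params = to_where_clause(operand, column)
            fragments.append(f"({sql})")
            params.extend(nested_params)
        else:
            # Containment: the row's metadata must include these key/value pairs.
            fragments.append(f"{column} @> %s::jsonb")
            params.append(json.dumps(operand, sort_keys=True))

    operator = filter_dict["operator"]
    if operator in ("AND", "OR"):
        return f" {operator} ".join(fragments), params
    if operator == "NOT":
        # Assumption: NOT excludes rows matching any of its operands.
        return "NOT (" + " OR ".join(fragments) + ")", params
    raise ValueError(f"Unsupported operator: {operator}")


where_sql, where_params = to_where_clause({
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
})
print(where_sql)
# metadata @> %s::jsonb AND (metadata @> %s::jsonb OR metadata @> %s::jsonb)
print(where_params)
```

Passing the values as parameters instead of formatting them into the SQL string keeps user-supplied metadata values out of the query text. Containment checks of this form are also among the operations a default GIN index on a JSONB column can serve.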

### Code Locations

Key files implementing metadata support:
- `lightrag/types.py`: `MetadataFilter` class definition
- `lightrag/base.py`: `QueryParam` with `metadata_filter` field, `DocProcessingStatus` with metadata persistence
- `lightrag/api/routers/query_routes.py`: API endpoint that initializes MetadataFilter from JSON
- `lightrag/operate.py`: Query functions that pass filter to storage (Line 2749)
- `lightrag/kg/postgres_impl.py`: PostgreSQL JSONB filter implementation

## Query Examples

### Example 1: Filter by Department (Mix Mode)
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"department": "Engineering"}]
    )
)

response = rag.query("What are the key projects?", param=query_param)
```

### Example 2: Multi-tenant Filtering (Naive Mode)
```python
query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"tenant_id": "tenant_a"},
            {"access_level": "admin"}
        ]
    )
)

response = rag.query("Show admin resources", param=query_param)
```

### Example 3: Version Filtering (Mix Mode)
```python
query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"doc_type": "manual"},
            {"status": "current"}
        ]
    )
)

response = rag.query("How to configure?", param=query_param)
```

## Storage Backend Support

**Important**: Metadata filtering is currently only supported for PostgreSQL vector storage.

### Vector Storage
- **PGVectorStorage**: Full support with JSONB filtering
- **NanoVectorDBStorage**: Not supported
- **MilvusVectorDBStorage**: Not supported
- **ChromaVectorDBStorage**: Not supported
- **FaissVectorDBStorage**: Not supported
- **QdrantVectorDBStorage**: Not supported
- **MongoVectorDBStorage**: Not supported

### Recommended Configuration

For metadata filtering support:
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./storage",
    vector_storage="PGVectorStorage",
    # Graph storage can be any type
    # ... other config
)
```

## Server API Examples

### REST API Query with Metadata Filter

#### Simple Filter (Naive Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key features?",
    "mode": "naive",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"department": "Engineering"},
        {"year": 2024}
      ]
    }
  }'
```

#### Complex Nested Filter (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Show me technical documentation",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"document_type": "technical_spec"},
        {
          "operator": "OR",
          "operands": [
            {"version": "1.0"},
            {"version": "2.0"}
          ]
        }
      ]
    }
  }'
```

#### Multi-tenant Query (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "List all projects",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"tenant_id": "tenant_a"},
        {"access_level": "admin"}
      ]
    },
    "top_k": 20
  }'
```

### Python Client with Server

```python
import requests
from lightrag.types import MetadataFilter

# Option 1: Use MetadataFilter class and convert to dict
metadata_filter = MetadataFilter(
    operator="AND",
    operands=[
        {"department": "Engineering"},
        {"status": "approved"}
    ]
)

response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "mix",  # Use mix or naive mode
        "metadata_filter": metadata_filter.to_dict(),
        "top_k": 10
    }
)

# Option 2: Send dict directly (API will convert to MetadataFilter)
response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "naive",  # Use mix or naive mode
        "metadata_filter": {
            "operator": "AND",
            "operands": [
                {"department": "Engineering"},
                {"status": "approved"}
            ]
        },
        "top_k": 10
    }
)

# Fail early on HTTP errors before reading the body
response.raise_for_status()
result = response.json()
print(result["response"])
```

### How the API Processes Metadata Filters

When you send a query to the REST API:

1. **JSON Request** → API receives `metadata_filter` as a dict
2. **API Conversion** → `MetadataFilter.from_dict()` creates MetadataFilter object
3. **QueryParam** → MetadataFilter is set in `QueryParam.metadata_filter`
4. **Query Execution** → QueryParam with filter is passed to `kg_query()` or `naive_query()`
5. **Storage Query** → Filter is passed to vector storage query methods (chunks only)
6. **SQL** → PGVectorStorage converts filter to JSONB WHERE clause
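
Steps 1 and 2 amount to a recursive conversion between nested dicts and filter objects. The sketch below shows one way such a round trip could work; `FilterSketch` is a hypothetical stand-in written for this example, not LightRAG's `MetadataFilter`, and it assumes nested filters are recognized by the presence of both `operator` and `operands` keys.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class FilterSketch:
    """Hypothetical stand-in for MetadataFilter, limited to dict conversion."""
    operator: str
    operands: list[Union[dict[str, Any], "FilterSketch"]] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        """Serialize to plain dicts and lists, recursing into nested filters."""
        return {
            "operator": self.operator,
            "operands": [
                op.to_dict() if isinstance(op, FilterSketch) else op
                for op in self.operands
            ],
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FilterSketch":
        """Rebuild the object tree; dicts with operator/operands become nested filters."""
        operands = [
            cls.from_dict(op) if "operator" in op and "operands" in op else op
            for op in data["operands"]
        ]
        return cls(operator=data["operator"], operands=operands)


payload = {
    "operator": "AND",
    "operands": [
        {"department": "Engineering"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}

parsed = FilterSketch.from_dict(payload)
print(isinstance(parsed.operands[1], FilterSketch))  # True
print(parsed.to_dict() == payload)  # True
```

One consequence of this detection rule is that a metadata document cannot itself use `operator` and `operands` as field names without being mistaken for a nested filter.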

## Best Practices

### 1. Consistent Metadata Schema
```python
# Good - consistent schema
metadata1 = {"author": "John", "dept": "Eng", "year": 2024}
metadata2 = {"author": "Jane", "dept": "Sales", "year": 2024}
```

### 2. Simple Indexable Values
```python
# Good - simple values
metadata = {
    "status": "approved",
    "priority": "high",
    "year": 2024
}
```
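
Both practices can be checked before a document is inserted. The helper below is a hypothetical sketch, not a LightRAG function: it compares a metadata dict against an expected key set and flags values that are not plain scalars. Which types count as "simple" is an assumption made here.

```python
from typing import Any

# Assumption: these are the value types treated as simple and indexable.
SIMPLE_TYPES = (str, int, float, bool)


def find_metadata_problems(metadata: dict[str, Any], expected_keys: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the dict looks fine."""
    problems = []
    missing = expected_keys - set(metadata)
    extra = set(metadata) - expected_keys
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    for key, value in metadata.items():
        if not isinstance(value, SIMPLE_TYPES):
            problems.append(f"{key!r} has a non-scalar value of type {type(value).__name__}")
    return problems


schema = {"status", "priority", "year"}
good = {"status": "approved", "priority": "high", "year": 2024}
bad = {"status": "approved", "tags": ["a", "b"]}

print(find_metadata_problems(good, schema))  # []
for problem in find_metadata_problems(bad, schema):
    print(problem)
```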

### 3. Use Appropriate Query Mode
- **Mix mode**: Best for combining KG context with filtered chunks
- **Naive mode**: Best for pure vector search with metadata filtering

### 4. Performance Considerations
- Keep metadata fields minimal (this should be done automatically by the ORM)
- For PostgreSQL: Create GIN indexes on JSONB metadata columns:
  ```sql
  CREATE INDEX idx_chunks_metadata ON chunks USING GIN (metadata);
  ```
- Avoid overly complex nested filters

## Troubleshooting

### Filter Not Working
1. **Verify storage backend**: Ensure you're using PGVectorStorage
2. **Verify query mode**: Use "mix" or "naive" mode only
3. Verify metadata exists in chunks
4. Check metadata field names match exactly (case-sensitive)
5. Check logs for filter parsing errors
6. Test without filter first to ensure data exists

### Performance Issues
1. Reduce filter complexity
2. Create GIN indexes on JSONB metadata columns in PostgreSQL
3. Profile query execution time
4. Consider caching frequently used filters

### Unsupported Storage Backend
If you're using a storage backend that doesn't support metadata filtering:
1. Migrate to PGVectorStorage
2. Or implement post-filtering in application code
3. Or contribute metadata filtering support for your backend
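
For the post-filtering option, a minimal sketch could look like the following. It assumes the application already has the retrieved chunks as dicts with a `metadata` key; that shape, and the `post_filter` helper itself, are hypothetical and not something LightRAG returns or provides in this form.

```python
from typing import Any


def post_filter(
    chunks: list[dict[str, Any]],
    required: dict[str, Any],
    top_k: int,
) -> list[dict[str, Any]]:
    """Keep chunks whose metadata contains every required key/value, up to top_k."""
    kept = []
    for chunk in chunks:
        metadata = chunk.get("metadata", {})
        if all(k in metadata and metadata[k] == v for k, v in required.items()):
            kept.append(chunk)
    return kept[:top_k]


retrieved = [
    {"content": "Deployment guide", "metadata": {"tenant_id": "tenant_a", "access_level": "admin"}},
    {"content": "Sales deck", "metadata": {"tenant_id": "tenant_b", "access_level": "admin"}},
    {"content": "Runbook", "metadata": {"tenant_id": "tenant_a", "access_level": "admin"}},
]

filtered = post_filter(retrieved, {"tenant_id": "tenant_a", "access_level": "admin"}, top_k=2)
print([chunk["content"] for chunk in filtered])  # ['Deployment guide', 'Runbook']
```

Because filtering happens after retrieval, it usually makes sense to retrieve more candidates than needed and trim to `top_k` afterwards; otherwise a selective filter can leave too few results.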

### Metadata Not Persisting After Queue Restart
- Metadata is stored in `DocProcessingStatus.metadata`
- Check document status storage is properly configured
- Verify metadata is set before document is enqueued

## API Reference

### MetadataFilter
```python
class MetadataFilter(BaseModel):
    operator: str  # "AND", "OR", or "NOT"
    operands: List[Union[Dict[str, Any], 'MetadataFilter']]

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON serialization"""
        ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'MetadataFilter':
        """Create MetadataFilter from dictionary (used by API)"""
        ...
```

### QueryParam
```python
@dataclass
class QueryParam:
    metadata_filter: MetadataFilter | None = None  # Filter passed to chunk queries
    mode: str = "mix"  # Only "mix" and "naive" support metadata filtering
    top_k: int = 60
    # ... other fields
```

### DocProcessingStatus
```python
@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""
```

### Query Method
```python
# Synchronous
response = rag.query(
    query,              # str: the question to ask
    param=query_param,  # QueryParam: contains metadata_filter
)

# Asynchronous (call from inside an async function)
response = await rag.aquery(
    query,              # str: the question to ask
    param=query_param,  # QueryParam: contains metadata_filter
)
```

### REST API Query Endpoint
```python
# In lightrag/api/routers/query_routes.py
@router.post("/query")
async def query_endpoint(request: QueryRequest):
    # API receives metadata_filter as dict
    metadata_filter_dict = request.metadata_filter

    # Convert dict to MetadataFilter object
    metadata_filter = MetadataFilter.from_dict(metadata_filter_dict) if metadata_filter_dict else None

    # Create QueryParam with MetadataFilter
    query_param = QueryParam(
        mode=request.mode,  # Must be "mix" or "naive"
        metadata_filter=metadata_filter,
        top_k=request.top_k
    )

    # Execute query with QueryParam
    result = await rag.aquery(request.query, param=query_param)
    return result
```