docs (metadata): Added Metadata_Filtering.md with examples and explanations for the functionality

2025-10-31 10:35:54 +01:00 · 2025-10-31 10:35:54 +01:00 · cd664de057
commit cd664de057
parent 166bdf7f99
1 changed files with 481 additions and 0 deletions
--- a/Metadata_Filtering.md
+++ b/Metadata_Filtering.md
@@ -0,0 +1,481 @@
+# Metadata Filtering in LightRAG
+
+## Overview
+
+LightRAG supports metadata filtering at query time, so that only chunks whose metadata matches the given criteria are retrieved.
+
+**Important Limitations**:
+- Metadata filtering is **only supported with PostgreSQL (`PGVectorStorage`)**. Inserted metadata is also visible in Neo4j, but filtering is not available there.
+- Only **chunk-based queries** support metadata filtering (Mix and Naive modes)
+- Metadata is stored in document status and propagated to chunks during extraction
+
+## Metadata Structure
+
+Metadata is stored as a dictionary (`dict[str, Any]`) in:
+- Entity nodes (graph storage)
+- Relationship edges (graph storage)
+- Text chunks (KV storage)
+- Vector embeddings (vector storage)
+
+```python
+metadata = {
+    "author": "John Doe",
+    "department": "Engineering",
+    "document_type": "technical_spec",
+    "version": "1.0"
+}
+```
+
+## Critical: Metadata Persistence in Document Status
+
+**Metadata is stored in `DocProcessingStatus`**, which ensures it is not lost if the processing queue is stopped or interrupted.
+
+### How It Works
+
+1. **Document Status Storage** (`lightrag/base.py` - `DocProcessingStatus`)
+   ```python
+   @dataclass
+   class DocProcessingStatus:
+       # ... other fields
+       metadata: dict[str, Any] = field(default_factory=dict)
+       """Additional metadata - PERSISTED across queue restarts"""
+   ```
+
+2. **Metadata Flow**:
+   - Metadata is stored in `DocProcessingStatus.metadata` when the document is enqueued
+   - If the queue stops, the metadata persists in document status storage
+   - When processing resumes, the metadata is read back from the document status
+   - The metadata is propagated to chunks during extraction
+
+3. **Why This Matters**:
+   - Queue can be stopped/restarted without losing metadata
+   - Metadata survives system crashes or interruptions
+   - Ensures data consistency across processing pipeline
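
The flow above can be illustrated with a small, self-contained sketch. `DocStatus`, `enqueue`, `resume`, and the JSON file are hypothetical stand-ins, not LightRAG's actual classes or storage. The sketch only assumes that the status record, including its metadata, is written somewhere durable before processing starts.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DocStatus:
    """Hypothetical stand-in for a persisted document status record."""
    doc_id: str
    status: str = "pending"
    metadata: dict[str, Any] = field(default_factory=dict)


def enqueue(store: Path, doc: DocStatus) -> None:
    # Write the status record (including metadata) before any processing starts.
    records = json.loads(store.read_text()) if store.exists() else {}
    records[doc.doc_id] = asdict(doc)
    store.write_text(json.dumps(records))


def resume(store: Path, doc_id: str, chunks: list[str]) -> list[dict[str, Any]]:
    # After a restart, read the metadata back and copy it onto every chunk.
    record = json.loads(store.read_text())[doc_id]
    return [{"content": c, "metadata": dict(record["metadata"])} for c in chunks]


store = Path(tempfile.mkdtemp()) / "doc_status.json"
enqueue(store, DocStatus("doc-1", metadata={"department": "Engineering"}))
# ... the queue stops here; nothing is kept in memory ...
chunk_records = resume(store, "doc-1", ["first chunk", "second chunk"])
print(chunk_records[0]["metadata"])
```

Because `resume` reads only from the store, the metadata reaches the chunks even if the process that called `enqueue` no longer exists.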
+
+## Metadata Filtering During Queries
+
+### MetadataFilter Class
+
+```python
+from lightrag.types import MetadataFilter
+
+# Simple filter
+filter1 = MetadataFilter(
+    operator="AND",
+    operands=[{"department": "Engineering"}]
+)
+
+# Complex filter with OR
+filter2 = MetadataFilter(
+    operator="OR",
+    operands=[
+        {"author": "John Doe"},
+        {"author": "Jane Smith"}
+    ]
+)
+
+# Nested filter
+filter3 = MetadataFilter(
+    operator="AND",
+    operands=[
+        {"document_type": "technical_spec"},
+        MetadataFilter(
+            operator="OR",
+            operands=[
+                {"version": "1.0"},
+                {"version": "2.0"}
+            ]
+        )
+    ]
+)
+```
+
+### Supported Operators
+
+- **AND**: All conditions must be true
+- **OR**: At least one condition must be true
+- **NOT**: Negates the condition
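
To make the operator semantics concrete, here is a minimal in-memory evaluator that works on the dict form of a filter. It is a sketch, not LightRAG's implementation: the `matches` helper is hypothetical, leaf operands are assumed to mean exact equality, and `NOT` is assumed to reject a chunk when any of its operands matches.

```python
from typing import Any


def matches(metadata: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Evaluate a filter dict against one chunk's metadata (illustrative only)."""
    if "operator" not in flt:
        # Leaf operand: every key/value pair must match exactly.
        return all(metadata.get(k) == v for k, v in flt.items())
    results = [matches(metadata, operand) for operand in flt["operands"]]
    op = flt["operator"].upper()
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    if op == "NOT":
        # Assumption: NOT rejects a chunk if any of its operands matches.
        return not any(results)
    raise ValueError(f"Unsupported operator: {flt['operator']}")


chunk_meta = {"document_type": "technical_spec", "version": "2.0", "status": "draft"}

nested = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
not_draft = {"operator": "NOT", "operands": [{"status": "draft"}]}

print(matches(chunk_meta, nested))     # True
print(matches(chunk_meta, not_draft))  # False
```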
+
+## Supported Query Modes
+
+### Mix Mode (Recommended)
+Filters vector chunks from both KG and direct vector search:
+```python
+from lightrag import QueryParam
+from lightrag.types import MetadataFilter
+
+query_param = QueryParam(
+    mode="mix",
+    metadata_filter=MetadataFilter(
+        operator="AND",
+        operands=[
+            {"department": "Engineering"},
+            {"status": "approved"}
+        ]
+    )
+)
+```
+
+### Naive Mode
+Filters vector chunks directly:
+```python
+from lightrag import QueryParam
+from lightrag.types import MetadataFilter
+
+query_param = QueryParam(
+    mode="naive",
+    metadata_filter=MetadataFilter(
+        operator="AND",
+        operands=[{"document_type": "manual"}]
+    )
+)
+```
+
+## Implementation Details
+
+### Architecture Flow
+
+1. **API Layer** (`lightrag/api/routers/query_routes.py`)
+   - REST endpoint receives `metadata_filter` as JSON dict
+   - Converts JSON to `MetadataFilter` object using `MetadataFilter.from_dict()`
+   
+2. **QueryParam** (`lightrag/base.py`)
+   - `MetadataFilter` object is passed into `QueryParam.metadata_filter`
+   - QueryParam carries the filter through the query pipeline
+   
+3. **Query Execution** (`lightrag/operate.py`)
+   - Only chunk-based queries use the filter:
+     - Line 2749: `chunks_vdb.query(..., metadata_filter=query_param.metadata_filter)` (Mix/Naive modes)
+   
+4. **Storage Layer** (`lightrag/kg/postgres_impl.py`)
+   - PGVectorStorage: Converts filter to SQL WHERE clause with JSONB operators
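
As a rough illustration of the storage-layer step, the sketch below turns the dict form of a filter into a parameterized WHERE fragment. It is not the code in `postgres_impl.py`. The `to_where` helper, the `metadata` column name, the use of the JSONB containment operator `@>`, and the `$n` placeholder style are all assumptions made for the example.

```python
import json
from typing import Any


def to_where(flt: dict[str, Any], params: list[str]) -> str:
    """Translate a filter dict into a SQL fragment, appending bind values to params."""
    if "operator" not in flt:
        # Leaf operand: JSONB containment against a JSON-encoded bind value.
        params.append(json.dumps(flt, sort_keys=True))
        return f"metadata @> ${len(params)}::jsonb"
    if not flt["operands"]:
        raise ValueError("operands must not be empty")
    parts = [to_where(operand, params) for operand in flt["operands"]]
    op = flt["operator"].upper()
    if op == "NOT":
        # Assumption: NOT negates the disjunction of its operands.
        return "NOT (" + " OR ".join(parts) + ")"
    if op not in ("AND", "OR"):
        raise ValueError(f"Unsupported operator: {flt['operator']}")
    return "(" + f" {op} ".join(parts) + ")"


params: list[str] = []
where = to_where(
    {
        "operator": "AND",
        "operands": [
            {"document_type": "technical_spec"},
            {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
        ],
    },
    params,
)
print(where)
print(params)
```

Values are passed as bind parameters rather than interpolated into the SQL string, so user-supplied filter values never become part of the query text.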
+
+### Code Locations
+
+Key files implementing metadata support:
+- `lightrag/types.py`: `MetadataFilter` class definition
+- `lightrag/base.py`: `QueryParam` with `metadata_filter` field, `DocProcessingStatus` with metadata persistence
+- `lightrag/api/routers/query_routes.py`: API endpoint that initializes MetadataFilter from JSON
+- `lightrag/operate.py`: Query functions that pass filter to storage (Line 2749)
+- `lightrag/kg/postgres_impl.py`: PostgreSQL JSONB filter implementation
+
+## Query Examples
+
+### Example 1: Filter by Department (Mix Mode)
+```python
+from lightrag import QueryParam
+from lightrag.types import MetadataFilter
+
+query_param = QueryParam(
+    mode="mix",
+    metadata_filter=MetadataFilter(
+        operator="AND",
+        operands=[{"department": "Engineering"}]
+    )
+)
+
+response = rag.query("What are the key projects?", param=query_param)
+```
+
+### Example 2: Multi-tenant Filtering (Naive Mode)
+```python
+query_param = QueryParam(
+    mode="naive",
+    metadata_filter=MetadataFilter(
+        operator="AND",
+        operands=[
+            {"tenant_id": "tenant_a"},
+            {"access_level": "admin"}
+        ]
+    )
+)
+
+response = rag.query("Show admin resources", param=query_param)
+```
+
+### Example 3: Version Filtering (Mix Mode)
+```python
+query_param = QueryParam(
+    mode="mix",
+    metadata_filter=MetadataFilter(
+        operator="AND",
+        operands=[
+            {"doc_type": "manual"},
+            {"status": "current"}
+        ]
+    )
+)
+
+response = rag.query("How to configure?", param=query_param)
+```
+
+## Storage Backend Support
+
+**Important**: Metadata filtering is currently only supported for PostgreSQL vector storage.
+
+### Vector Storage
+- **PGVectorStorage**: Full support with JSONB filtering
+- **NanoVectorDBStorage**: Not supported
+- **MilvusVectorDBStorage**: Not supported
+- **ChromaVectorDBStorage**: Not supported
+- **FaissVectorDBStorage**: Not supported
+- **QdrantVectorDBStorage**: Not supported
+- **MongoVectorDBStorage**: Not supported
+
+### Recommended Configuration
+
+For metadata filtering support:
+```python
+from lightrag import LightRAG
+
+rag = LightRAG(
+    working_dir="./storage",
+    vector_storage="PGVectorStorage",
+    # Graph storage can be any type
+    # ... other config
+)
+```
+
+## Server API Examples
+
+### REST API Query with Metadata Filter
+
+#### Simple Filter (Naive Mode)
+```bash
+curl -X POST http://localhost:9621/query \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "What are the key features?",
+    "mode": "naive",
+    "metadata_filter": {
+      "operator": "AND",
+      "operands": [
+        {"department": "Engineering"},
+        {"year": 2024}
+      ]
+    }
+  }'
+```
+
+#### Complex Nested Filter (Mix Mode)
+```bash
+curl -X POST http://localhost:9621/query \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "Show me technical documentation",
+    "mode": "mix",
+    "metadata_filter": {
+      "operator": "AND",
+      "operands": [
+        {"document_type": "technical_spec"},
+        {
+          "operator": "OR",
+          "operands": [
+            {"version": "1.0"},
+            {"version": "2.0"}
+          ]
+        }
+      ]
+    }
+  }'
+```
+
+#### Multi-tenant Query (Mix Mode)
+```bash
+curl -X POST http://localhost:9621/query \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "List all projects",
+    "mode": "mix",
+    "metadata_filter": {
+      "operator": "AND",
+      "operands": [
+        {"tenant_id": "tenant_a"},
+        {"access_level": "admin"}
+      ]
+    },
+    "top_k": 20
+  }'
+```
+
+### Python Client with Server
+
+```python
+import requests
+from lightrag.types import MetadataFilter
+
+# Option 1: Use MetadataFilter class and convert to dict
+metadata_filter = MetadataFilter(
+    operator="AND",
+    operands=[
+        {"department": "Engineering"},
+        {"status": "approved"}
+    ]
+)
+
+response = requests.post(
+    "http://localhost:9621/query",
+    json={
+        "query": "What are the approved engineering documents?",
+        "mode": "mix",  # Use mix or naive mode
+        "metadata_filter": metadata_filter.to_dict(),
+        "top_k": 10
+    }
+)
+
+# Option 2: Send dict directly (API will convert to MetadataFilter)
+response = requests.post(
+    "http://localhost:9621/query",
+    json={
+        "query": "What are the approved engineering documents?",
+        "mode": "naive",  # Use mix or naive mode
+        "metadata_filter": {
+            "operator": "AND",
+            "operands": [
+                {"department": "Engineering"},
+                {"status": "approved"}
+            ]
+        },
+        "top_k": 10
+    }
+)
+
+result = response.json()
+print(result["response"])
+```
+
+### How the API Processes Metadata Filters
+
+When you send a query to the REST API:
+
+1. **JSON Request** → API receives `metadata_filter` as a dict
+2. **API Conversion** → `MetadataFilter.from_dict()` creates MetadataFilter object
+3. **QueryParam** → MetadataFilter is set in `QueryParam.metadata_filter`
+4. **Query Execution** → QueryParam with filter is passed to `kg_query()` or `naive_query()`
+5. **Storage Query** → Filter is passed to vector storage query methods (chunks only)
+6. **SQL** → PGVectorStorage converts filter to JSONB WHERE clause
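
Steps 1 and 2 depend on a recursive conversion between nested dicts and filter objects. The sketch below shows one way that conversion could look, using a hypothetical `FilterNode` dataclass as a stand-in for `MetadataFilter`. The real class may validate its input or represent operands differently.

```python
from dataclasses import dataclass, field
from typing import Any, Union


def _is_filter(obj: Any) -> bool:
    # A dict with both keys is treated as a nested filter; anything else is a leaf.
    return isinstance(obj, dict) and "operator" in obj and "operands" in obj


@dataclass
class FilterNode:
    """Hypothetical stand-in for MetadataFilter, used only to show the conversion."""
    operator: str
    operands: list[Union[dict[str, Any], "FilterNode"]] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FilterNode":
        operands = [cls.from_dict(o) if _is_filter(o) else o for o in data["operands"]]
        return cls(operator=data["operator"], operands=operands)

    def to_dict(self) -> dict[str, Any]:
        return {
            "operator": self.operator,
            "operands": [o.to_dict() if isinstance(o, FilterNode) else o for o in self.operands],
        }


payload = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
parsed = FilterNode.from_dict(payload)
print(type(parsed.operands[1]).__name__)  # FilterNode
print(parsed.to_dict() == payload)        # True
```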
+
+## Best Practices
+
+### 1. Consistent Metadata Schema
+```python
+# Good - consistent schema
+metadata1 = {"author": "John", "dept": "Eng", "year": 2024}
+metadata2 = {"author": "Jane", "dept": "Sales", "year": 2024}
+```
+
+### 2. Simple Indexable Values
+```python
+# Good - simple values
+metadata = {
+    "status": "approved",
+    "priority": "high",
+    "year": 2024
+}
+```
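
One way to enforce a consistent schema with simple values is to validate metadata before inserting a document. The `SCHEMA` mapping and `validate_metadata` helper below are hypothetical and not part of LightRAG. They only illustrate the kind of check that catches mismatched field names and non-scalar values early.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA: dict[str, type] = {"author": str, "dept": str, "year": int}


def validate_metadata(metadata: dict[str, Any], schema: dict[str, type] = SCHEMA) -> list[str]:
    """Return a sorted list of problems; an empty list means the metadata fits the schema."""
    problems = [f"missing field: {key}" for key in schema.keys() - metadata.keys()]
    for key, value in metadata.items():
        if key not in schema:
            problems.append(f"unexpected field: {key}")
        elif not isinstance(value, schema[key]):
            expected, actual = schema[key].__name__, type(value).__name__
            problems.append(f"{key}: expected {expected}, got {actual}")
    return sorted(problems)


print(validate_metadata({"author": "John", "dept": "Eng", "year": 2024}))  # []
print(validate_metadata({"author": "Jane", "Dept": "Sales", "year": "2024"}))
```

The second call reports the wrongly cased `Dept` key and the string-typed `year`, both of which would otherwise cause filters to miss those chunks without any error.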
+
+### 3. Use Appropriate Query Mode
+- **Mix mode**: Best for combining KG context with filtered chunks
+- **Naive mode**: Best for pure vector search with metadata filtering
+
+### 4. Performance Considerations
+- Keep metadata fields minimal (the ORM should handle this automatically)
+- For PostgreSQL: Create GIN indexes on JSONB metadata columns:
+  ```sql
+  CREATE INDEX idx_chunks_metadata ON chunks USING GIN (metadata);
+  ```
+- Avoid overly complex nested filters
+
+## Troubleshooting
+
+### Filter Not Working
+1. **Verify storage backend**: Ensure you're using PGVectorStorage
+2. **Verify query mode**: Use "mix" or "naive" mode only
+3. Verify metadata exists in chunks
+4. Check metadata field names match exactly (case-sensitive)
+5. Check logs for filter parsing errors
+6. Test without filter first to ensure data exists
+
+### Performance Issues
+1. Reduce filter complexity
+2. Create GIN indexes on JSONB metadata columns in PostgreSQL
+3. Profile query execution time
+4. Consider caching frequently used filters
+
+### Unsupported Storage Backend
+If you're using a storage backend that doesn't support metadata filtering:
+1. Migrate to PGVectorStorage
+2. Or implement post-filtering in application code
+3. Or contribute metadata filtering support for your backend
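
For option 2, post-filtering can be as simple as the sketch below. It assumes the retrieved chunks are available as dicts with a `metadata` key, which is an assumption about the result shape rather than a documented LightRAG return type. It handles only flat equality conditions combined with AND.

```python
from typing import Any


def post_filter(
    chunks: list[dict[str, Any]],
    required: dict[str, Any],
    top_k: int,
) -> list[dict[str, Any]]:
    """Keep chunks whose metadata contains every required key/value pair, then cut to top_k."""
    kept = [
        chunk for chunk in chunks
        if all(chunk.get("metadata", {}).get(k) == v for k, v in required.items())
    ]
    return kept[:top_k]


# Hypothetical retrieval results, already sorted by similarity.
retrieved = [
    {"id": "c1", "metadata": {"tenant_id": "tenant_a"}},
    {"id": "c2", "metadata": {"tenant_id": "tenant_b"}},
    {"id": "c3", "metadata": {"tenant_id": "tenant_a"}},
    {"id": "c4", "metadata": {}},
]
print([c["id"] for c in post_filter(retrieved, {"tenant_id": "tenant_a"}, top_k=2)])
```

Since filtering happens after retrieval, it helps to fetch more candidates than the final `top_k` so that enough chunks survive the filter.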
+
+### Metadata Not Persisting After Queue Restart
+- Metadata is stored in `DocProcessingStatus.metadata`
+- Check that document status storage is properly configured
+- Verify that metadata is set before the document is enqueued
+
+## API Reference
+
+### MetadataFilter
+```python
+class MetadataFilter(BaseModel):
+    operator: str  # "AND", "OR", or "NOT"
+    operands: List[Union[Dict[str, Any], 'MetadataFilter']]
+    
+    def to_dict(self) -> Dict[str, Any]:
+        """Convert to dictionary for JSON serialization"""
+        ...
+    
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> 'MetadataFilter':
+        """Create MetadataFilter from dictionary (used by API)"""
+        ...
+```
+
+### QueryParam
+```python
+@dataclass
+class QueryParam:
+    metadata_filter: MetadataFilter | None = None  # Filter passed to chunk queries
+    mode: str = "mix"  # Only "mix" and "naive" support metadata filtering
+    top_k: int = 60
+    # ... other fields
+```
+
+### DocProcessingStatus
+```python
+@dataclass
+class DocProcessingStatus:
+    # ... other fields
+    metadata: dict[str, Any] = field(default_factory=dict)
+    """Additional metadata - PERSISTED across queue restarts"""
+```
+
+### Query Method
+```python
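+# Synchronous
+response = rag.query(
+    "What are the key projects?",  # query: str
+    param=query_param,  # param: QueryParam, which contains metadata_filter
+)
+
+# Asynchronous (call from inside an async function)
+response = await rag.aquery(
+    "What are the key projects?",  # query: str
+    param=query_param,  # param: QueryParam, which contains metadata_filter
+)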
+```
+
+### REST API Query Endpoint
+```python
+# In lightrag/api/routers/query_routes.py
+@router.post("/query")
+async def query_endpoint(request: QueryRequest):
+    # API receives metadata_filter as dict
+    metadata_filter_dict = request.metadata_filter
+    
+    # Convert dict to MetadataFilter object
+    metadata_filter = MetadataFilter.from_dict(metadata_filter_dict) if metadata_filter_dict else None
+    
+    # Create QueryParam with MetadataFilter
+    query_param = QueryParam(
+        mode=request.mode,  # Must be "mix" or "naive"
+        metadata_filter=metadata_filter,
+        top_k=request.top_k
+    )
+    
+    # Execute query with QueryParam
+    result = await rag.aquery(request.query, param=query_param)
+    return result
+```