# Metadata Filtering in LightRAG

## Overview

LightRAG supports metadata filtering during queries to retrieve only relevant chunks based on metadata criteria.

**Important Limitations**:
- Metadata filtering is **only supported with PostgreSQL (PGVectorStorage)**. Metadata inserted with documents is also visible in Neo4j, but filtering is not available there
- Only **chunk-based queries** support metadata filtering (Mix and Naive modes)
- Metadata is stored in document status and propagated to chunks during extraction

## Metadata Structure

Metadata is stored as a dictionary (`dict[str, Any]`) in:
- Entity nodes (graph storage)
- Relationship edges (graph storage)
- Text chunks (KV storage)
- Vector embeddings (vector storage)

```python
metadata = {
    "author": "John Doe",
    "department": "Engineering",
    "document_type": "technical_spec",
    "version": "1.0"
}
```

## Critical: Metadata Persistence in Document Status

**Metadata is stored in `DocProcessingStatus`**, so it is not lost if the processing queue is stopped or interrupted.

### How It Works

1. **Document Status Storage** (`lightrag/base.py` - `DocProcessingStatus`)
   ```python
   @dataclass
   class DocProcessingStatus:
       # ... other fields
       metadata: dict[str, Any] = field(default_factory=dict)
       """Additional metadata - PERSISTED across queue restarts"""
   ```

2. **Metadata Flow**:
   - Metadata is stored in `DocProcessingStatus.metadata` when the document is enqueued
   - If the queue stops, the metadata persists in document status storage
   - When processing resumes, the metadata is read back from document status
   - The metadata is propagated to chunks during extraction

3. **Why This Matters**:
   - Queue can be stopped/restarted without losing metadata
   - Metadata survives system crashes or interruptions
   - Ensures data consistency across processing pipeline
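
The flow above can be sketched with a plain in-memory store. This is a minimal illustration, not LightRAG's implementation: `DocStatusRecord`, `doc_status_store`, `enqueue`, and `process` are hypothetical names, and the dict stands in for whatever persistent document status storage is configured.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DocStatusRecord:
    """Hypothetical stand-in for DocProcessingStatus (only the fields used here)."""
    doc_id: str
    status: str = "pending"
    metadata: dict[str, Any] = field(default_factory=dict)


# Stand-in for persistent document status storage (a plain dict in this sketch).
doc_status_store: dict[str, DocStatusRecord] = {}


def enqueue(doc_id: str, metadata: dict[str, Any]) -> None:
    """Record the metadata at enqueue time, before any processing starts."""
    doc_status_store[doc_id] = DocStatusRecord(doc_id=doc_id, metadata=dict(metadata))


def process(doc_id: str, text: str, chunk_size: int = 20) -> list[dict[str, Any]]:
    """Read metadata back from the status record and copy it onto every chunk."""
    record = doc_status_store[doc_id]
    chunks = [
        {"content": text[i:i + chunk_size], "metadata": dict(record.metadata)}
        for i in range(0, len(text), chunk_size)
    ]
    record.status = "processed"
    return chunks


enqueue("doc-1", {"department": "Engineering"})
# The queue could stop and restart here; the record is still in the store.
chunks = process("doc-1", "LightRAG propagates metadata to every chunk.")
```

Because `process` reads from the status record rather than from the original call arguments, a restart between the two steps does not lose the metadata.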

## Metadata Filtering During Queries

### MetadataFilter Class

```python
from lightrag.types import MetadataFilter

# Simple filter
filter1 = MetadataFilter(
    operator="AND",
    operands=[{"department": "Engineering"}]
)

# Complex filter with OR
filter2 = MetadataFilter(
    operator="OR",
    operands=[
        {"author": "John Doe"},
        {"author": "Jane Smith"}
    ]
)

# Nested filter
filter3 = MetadataFilter(
    operator="AND",
    operands=[
        {"document_type": "technical_spec"},
        MetadataFilter(
            operator="OR",
            operands=[
                {"version": "1.0"},
                {"version": "2.0"}
            ]
        )
    ]
)
```

### Supported Operators

- **AND**: All conditions must be true
- **OR**: At least one condition must be true
- **NOT**: Negates the condition
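
To make the operator semantics concrete, the sketch below evaluates a filter, written in the same dict form the REST API accepts, against a single metadata dict in plain Python. The `matches` helper is hypothetical and not part of LightRAG. It assumes that leaf operands match by exact equality and that `NOT` rejects a record when any of its operands matches; the PostgreSQL implementation may treat edge cases differently.

```python
from typing import Any


def matches(filter_spec: dict[str, Any], metadata: dict[str, Any]) -> bool:
    """Evaluate a filter dict ({"operator": ..., "operands": [...]}) against metadata."""
    if "operator" not in filter_spec:
        # Leaf condition: every key/value pair must be present and equal.
        return all(k in metadata and metadata[k] == v for k, v in filter_spec.items())

    results = [matches(operand, metadata) for operand in filter_spec["operands"]]
    operator = filter_spec["operator"].upper()
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        # Assumption: NOT is true only when none of its operands match.
        return not any(results)
    raise ValueError(f"Unsupported operator: {operator}")


doc_metadata = {"document_type": "technical_spec", "version": "2.0"}
nested = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
is_match = matches(nested, doc_metadata)
is_excluded = matches({"operator": "NOT", "operands": [{"version": "2.0"}]}, doc_metadata)
```

One limitation of this sketch: a metadata field literally named `operator` would be mistaken for a nested filter.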

## Supported Query Modes

### Mix Mode (Recommended)
Filters vector chunks from both KG and direct vector search:
```python
query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"department": "Engineering"},
            {"status": "approved"}
        ]
    )
)
```

### Naive Mode
Filters vector chunks directly:
```python
query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"document_type": "manual"}]
    )
)
```

## Implementation Details

### Architecture Flow

1. **API Layer** (`lightrag/api/routers/query_routes.py`)
   - REST endpoint receives `metadata_filter` as JSON dict
   - Converts JSON to `MetadataFilter` object using `MetadataFilter.from_dict()`
   
2. **QueryParam** (`lightrag/base.py`)
   - `MetadataFilter` object is passed into `QueryParam.metadata_filter`
   - QueryParam carries the filter through the query pipeline
   
3. **Query Execution** (`lightrag/operate.py`)
   - Only chunk-based queries use the filter:
     - Line 2749: `chunks_vdb.query(..., metadata_filter=query_param.metadata_filter)` (Mix/Naive modes)
   
4. **Storage Layer** (`lightrag/kg/postgres_impl.py`)
   - PGVectorStorage: Converts filter to SQL WHERE clause with JSONB operators
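
As a rough illustration of the storage-layer step, the sketch below turns a filter dict into a parameterized WHERE fragment using PostgreSQL's JSONB containment operator `@>`. It is not the code in `postgres_impl.py`: the `build_where` helper, the `metadata` column name, and the `$1`-style placeholders (as used by asyncpg) are assumptions made for the example.

```python
import json
from typing import Any


def build_where(filter_spec: dict[str, Any], params: list[str], column: str = "metadata") -> str:
    """Return a SQL boolean expression; JSON values are appended to `params` in order."""
    if "operator" not in filter_spec:
        # Leaf condition: the row's JSONB must contain these key/value pairs.
        params.append(json.dumps(filter_spec))
        return f"{column} @> ${len(params)}::jsonb"

    operands = filter_spec["operands"]
    if not operands:
        raise ValueError("A filter needs at least one operand")

    parts = [build_where(operand, params, column) for operand in operands]
    operator = filter_spec["operator"].upper()
    if operator == "NOT":
        # Assumption: NOT excludes rows matching any of its operands.
        return "NOT (" + " OR ".join(parts) + ")"
    if operator in ("AND", "OR"):
        return "(" + f" {operator} ".join(parts) + ")"
    raise ValueError(f"Unsupported operator: {operator}")


params: list[str] = []
clause = build_where(
    {
        "operator": "AND",
        "operands": [
            {"department": "Engineering"},
            {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
        ],
    },
    params,
)
```

Passing values as bound parameters instead of interpolating them into the SQL string keeps user-supplied filter values out of the query text.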

### Code Locations

Key files implementing metadata support:
- `lightrag/types.py`: `MetadataFilter` class definition
- `lightrag/base.py`: `QueryParam` with `metadata_filter` field, `DocProcessingStatus` with metadata persistence
- `lightrag/api/routers/query_routes.py`: API endpoint that initializes MetadataFilter from JSON
- `lightrag/operate.py`: Query functions that pass filter to storage (Line 2749)
- `lightrag/kg/postgres_impl.py`: PostgreSQL JSONB filter implementation

## Query Examples

### Example 1: Filter by Department (Mix Mode)
```python
from lightrag import QueryParam
from lightrag.types import MetadataFilter

query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[{"department": "Engineering"}]
    )
)

response = rag.query("What are the key projects?", param=query_param)
```

### Example 2: Multi-tenant Filtering (Naive Mode)
```python
query_param = QueryParam(
    mode="naive",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"tenant_id": "tenant_a"},
            {"access_level": "admin"}
        ]
    )
)

response = rag.query("Show admin resources", param=query_param)
```

### Example 3: Version Filtering (Mix Mode)
```python
query_param = QueryParam(
    mode="mix",
    metadata_filter=MetadataFilter(
        operator="AND",
        operands=[
            {"doc_type": "manual"},
            {"status": "current"}
        ]
    )
)

response = rag.query("How to configure?", param=query_param)
```

## Storage Backend Support

**Important**: Metadata filtering is currently only supported for PostgreSQL vector storage.

### Vector Storage
- **PGVectorStorage**: Full support with JSONB filtering
- **NanoVectorDBStorage**: Not supported
- **MilvusVectorDBStorage**: Not supported
- **ChromaVectorDBStorage**: Not supported
- **FaissVectorDBStorage**: Not supported
- **QdrantVectorDBStorage**: Not supported
- **MongoVectorDBStorage**: Not supported

### Recommended Configuration

For metadata filtering support:
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./storage",
    vector_storage="PGVectorStorage",
    # Graph storage can be any type
    # ... other config
)
```

## Server API Examples

### REST API Query with Metadata Filter

#### Simple Filter (Naive Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key features?",
    "mode": "naive",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"department": "Engineering"},
        {"year": 2024}
      ]
    }
  }'
```

#### Complex Nested Filter (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Show me technical documentation",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"document_type": "technical_spec"},
        {
          "operator": "OR",
          "operands": [
            {"version": "1.0"},
            {"version": "2.0"}
          ]
        }
      ]
    }
  }'
```

#### Multi-tenant Query (Mix Mode)
```bash
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "List all projects",
    "mode": "mix",
    "metadata_filter": {
      "operator": "AND",
      "operands": [
        {"tenant_id": "tenant_a"},
        {"access_level": "admin"}
      ]
    },
    "top_k": 20
  }'
```

### Python Client with Server

```python
import requests
from lightrag.types import MetadataFilter

# Option 1: Use MetadataFilter class and convert to dict
metadata_filter = MetadataFilter(
    operator="AND",
    operands=[
        {"department": "Engineering"},
        {"status": "approved"}
    ]
)

response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "mix",  # Use mix or naive mode
        "metadata_filter": metadata_filter.to_dict(),
        "top_k": 10
    }
)

# Option 2: Send dict directly (API will convert to MetadataFilter)
response = requests.post(
    "http://localhost:9621/query",
    json={
        "query": "What are the approved engineering documents?",
        "mode": "naive",  # Use mix or naive mode
        "metadata_filter": {
            "operator": "AND",
            "operands": [
                {"department": "Engineering"},
                {"status": "approved"}
            ]
        },
        "top_k": 10
    }
)

response.raise_for_status()  # Fail early on HTTP errors
result = response.json()
print(result["response"])
```

### How the API Processes Metadata Filters

When you send a query to the REST API:

1. **JSON Request** → API receives `metadata_filter` as a dict
2. **API Conversion** → `MetadataFilter.from_dict()` creates MetadataFilter object
3. **QueryParam** → MetadataFilter is set in `QueryParam.metadata_filter`
4. **Query Execution** → QueryParam with filter is passed to `kg_query()` or `naive_query()`
5. **Storage Query** → Filter is passed to vector storage query methods (chunks only)
6. **SQL** → PGVectorStorage converts filter to JSONB WHERE clause
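
The dict-to-object conversion in step 2 has to handle nested filters. The sketch below shows one way such a round trip could work, using a stand-in dataclass named `FilterNode` so that it runs without LightRAG installed. It is an assumption about the general shape, not the actual body of `MetadataFilter.from_dict()` or `to_dict()`.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class FilterNode:
    """Hypothetical stand-in for MetadataFilter, used only for this illustration."""
    operator: str
    operands: list[Union[dict[str, Any], "FilterNode"]] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FilterNode":
        # Operands that carry an "operator" key are nested filters; others are leaves.
        operands = [
            cls.from_dict(op) if isinstance(op, dict) and "operator" in op else op
            for op in data.get("operands", [])
        ]
        return cls(operator=data["operator"], operands=operands)

    def to_dict(self) -> dict[str, Any]:
        return {
            "operator": self.operator,
            "operands": [
                op.to_dict() if isinstance(op, FilterNode) else op
                for op in self.operands
            ],
        }


payload = {
    "operator": "AND",
    "operands": [
        {"document_type": "technical_spec"},
        {"operator": "OR", "operands": [{"version": "1.0"}, {"version": "2.0"}]},
    ],
}
node = FilterNode.from_dict(payload)
round_tripped = node.to_dict()
```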

## Best Practices

### 1. Consistent Metadata Schema
```python
# Good - consistent schema
metadata1 = {"author": "John", "dept": "Eng", "year": 2024}
metadata2 = {"author": "Jane", "dept": "Sales", "year": 2024}
```
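
A schema like this can be checked before documents are inserted. The helper below is a hypothetical sketch, not a LightRAG feature: it treats the first record as the reference and reports keys or value types that differ in later records.

```python
from typing import Any


def schema_mismatches(records: list[dict[str, Any]]) -> list[str]:
    """Compare every record's keys and value types against the first record."""
    if not records:
        return []
    expected = {key: type(value) for key, value in records[0].items()}
    problems: list[str] = []
    for index, record in enumerate(records[1:], start=1):
        for key in sorted(expected.keys() - record.keys()):
            problems.append(f"record {index}: missing key '{key}'")
        for key in sorted(record.keys() - expected.keys()):
            problems.append(f"record {index}: unexpected key '{key}'")
        for key in sorted(expected.keys() & record.keys()):
            if type(record[key]) is not expected[key]:
                problems.append(
                    f"record {index}: '{key}' has type {type(record[key]).__name__}, "
                    f"expected {expected[key].__name__}"
                )
    return problems


consistent = schema_mismatches([
    {"author": "John", "dept": "Eng", "year": 2024},
    {"author": "Jane", "dept": "Sales", "year": 2024},
])
inconsistent = schema_mismatches([
    {"author": "John", "dept": "Eng", "year": 2024},
    {"author": "Jane", "department": "Sales", "year": "2024"},
])
```

The second call flags both the renamed field and the year stored as a string, either of which would cause an equality filter such as `{"year": 2024}` to miss the record.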

### 2. Simple Indexable Values
```python
# Good - simple values
metadata = {
    "status": "approved",
    "priority": "high",
    "year": 2024
}
```
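
A small guard can catch nested or non-scalar values before they reach storage. The `non_scalar_fields` helper below is a hypothetical example, and its choice of allowed types (`str`, `int`, `float`, `bool`) is an assumption, not a rule enforced by LightRAG.

```python
from typing import Any

# Assumption for this sketch: only these types count as "simple" values.
SCALAR_TYPES = (str, int, float, bool)


def non_scalar_fields(metadata: dict[str, Any]) -> list[str]:
    """Return the names of fields whose values are not simple scalars."""
    return sorted(
        key for key, value in metadata.items() if not isinstance(value, SCALAR_TYPES)
    )


simple = non_scalar_fields({"status": "approved", "priority": "high", "year": 2024})
nested = non_scalar_fields({"status": "approved", "tags": ["a", "b"], "owner": {"name": "x"}})
```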

### 3. Use Appropriate Query Mode
- **Mix mode**: Best for combining KG context with filtered chunks
- **Naive mode**: Best for pure vector search with metadata filtering

### 4. Performance Considerations
- Keep metadata fields minimal (this should be done automatically by the ORM)
- For PostgreSQL: Create GIN indexes on JSONB metadata columns:
  ```sql
  CREATE INDEX idx_chunks_metadata ON chunks USING GIN (metadata);
  ```
- Avoid overly complex nested filters

## Troubleshooting

### Filter Not Working
1. **Verify storage backend**: Ensure you're using PGVectorStorage
2. **Verify query mode**: Use "mix" or "naive" mode only
3. Verify metadata exists in chunks
4. Check metadata field names match exactly (case-sensitive)
5. Check logs for filter parsing errors
6. Test without filter first to ensure data exists

### Performance Issues
1. Reduce filter complexity
2. Create GIN indexes on JSONB metadata columns in PostgreSQL
3. Profile query execution time
4. Consider caching frequently used filters

### Unsupported Storage Backend
If you're using a storage backend that doesn't support metadata filtering:
1. Migrate to PGVectorStorage
2. Or implement post-filtering in application code
3. Or contribute metadata filtering support for your backend
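
For option 2, post-filtering usually means fetching more results than needed and discarding those whose metadata does not match. The sketch below assumes each retrieved chunk is a dict with a `metadata` key; that shape and the `post_filter` helper are hypothetical, so adapt them to whatever your backend actually returns.

```python
from typing import Any


def post_filter(
    chunks: list[dict[str, Any]],
    required: dict[str, Any],
    limit: int,
) -> list[dict[str, Any]]:
    """Keep chunks whose metadata contains every required key/value pair."""
    kept = [
        chunk
        for chunk in chunks
        if all(chunk.get("metadata", {}).get(key) == value for key, value in required.items())
    ]
    return kept[:limit]


# Hypothetical over-fetched results from a backend without native filtering.
retrieved = [
    {"content": "Spec A", "metadata": {"tenant_id": "tenant_a", "access_level": "admin"}},
    {"content": "Spec B", "metadata": {"tenant_id": "tenant_b", "access_level": "admin"}},
    {"content": "Spec C", "metadata": {"tenant_id": "tenant_a", "access_level": "user"}},
    {"content": "Spec D"},  # No metadata at all
]
visible = post_filter(retrieved, {"tenant_id": "tenant_a", "access_level": "admin"}, limit=10)
```

Post-filtering can return fewer than `limit` results even when more matching chunks exist, because the filter runs after the similarity cutoff rather than inside the search.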

### Metadata Not Persisting After Queue Restart
- Metadata is stored in `DocProcessingStatus.metadata`
- Check document status storage is properly configured
- Verify metadata is set before document is enqueued

## API Reference

### MetadataFilter
```python
class MetadataFilter(BaseModel):
    operator: str  # "AND", "OR", or "NOT"
    operands: List[Union[Dict[str, Any], 'MetadataFilter']]
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON serialization"""
        ...
    
    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'MetadataFilter':
        """Create MetadataFilter from dictionary (used by API)"""
        ...
```

### QueryParam
```python
@dataclass
class QueryParam:
    metadata_filter: MetadataFilter | None = None  # Filter passed to chunk queries
    mode: str = "mix"  # Only "mix" and "naive" support metadata filtering
    top_k: int = 60
    # ... other fields
```

### DocProcessingStatus
```python
@dataclass
class DocProcessingStatus:
    # ... other fields
    metadata: dict[str, Any] = field(default_factory=dict)
    """Additional metadata - PERSISTED across queue restarts"""
```

### Query Method
```python
# Synchronous: query is a str, param is a QueryParam carrying metadata_filter
response = rag.query(
    "What are the key projects?",
    param=query_param,
)

# Asynchronous: same arguments, awaited inside an async function
response = await rag.aquery(
    "What are the key projects?",
    param=query_param,
)
```

### REST API Query Endpoint
```python
# In lightrag/api/routers/query_routes.py
@router.post("/query")
async def query_endpoint(request: QueryRequest):
    # API receives metadata_filter as dict
    metadata_filter_dict = request.metadata_filter
    
    # Convert dict to MetadataFilter object
    metadata_filter = MetadataFilter.from_dict(metadata_filter_dict) if metadata_filter_dict else None
    
    # Create QueryParam with MetadataFilter
    query_param = QueryParam(
        mode=request.mode,  # Must be "mix" or "naive"
        metadata_filter=metadata_filter,
        top_k=request.top_k
    )
    
    # Execute query with QueryParam
    result = await rag.aquery(request.query, param=query_param)
    return result
```