# LightRAG Storage Backends
> Complete guide to storage backend configuration and implementation
**Version**: 1.4.9.1 | **Last Updated**: December 2025
---
## Table of Contents
1. [Overview](#overview)
2. [Storage Types](#storage-types)
3. [Backend Comparison](#backend-comparison)
4. [PostgreSQL Backend](#postgresql-backend)
5. [MongoDB Backend](#mongodb-backend)
6. [Neo4j Backend](#neo4j-backend)
7. [Redis Backend](#redis-backend)
8. [File-Based Backends](#file-based-backends)
9. [Vector Databases](#vector-databases)
10. [Configuration Reference](#configuration-reference)
---
## Overview
LightRAG uses four types of storage, each with multiple backend options:
```
┌─────────────────────────────────────────────────────────────┐
│                    Storage Architecture                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     LightRAG Core                     │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              │                              │
│          ┌───────────────────┼───────────────────┐          │
│          │                   │                   │          │
│          ▼                   ▼                   ▼          │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │  KV Storage   │   │ Vector Store  │   │  Graph Store  │  │
│  │  (Documents,  │   │ (Embeddings)  │   │  (KG Nodes    │  │
│  │   Chunks,     │   │               │   │   & Edges)    │  │
│  │   Cache)      │   │               │   │               │  │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘  │
│          │                   │                   │          │
│          ▼                   ▼                   ▼          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                Backend Implementations                │  │
│  │                                                       │  │
│  │  PostgreSQL │ MongoDB │ Redis │ Neo4j │ Milvus │      │  │
│  │  Qdrant │ FAISS │ JSON/File │ NetworkX │ NanoVectorDB │  │
│  │  Memgraph │ ...                                       │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
---
## Storage Types
### 1. Key-Value Storage (`BaseKVStorage`)
Stores documents, chunks, and LLM cache.
| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `JsonKVStorage` | File-based JSON | Development, single-node |
| `PGKVStorage` | PostgreSQL tables | Production, multi-node |
| `MongoKVStorage` | MongoDB collections | Production, flexible schema |
| `RedisKVStorage` | Redis hash maps | High-performance caching |
### 2. Vector Storage (`BaseVectorStorage`)
Stores and queries embedding vectors.
| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `NanoVectorDBStorage` | In-memory, file-persisted | Development, small datasets |
| `PGVectorStorage` | PostgreSQL + pgvector | Production, unified DB |
| `MilvusVectorDBStorage` | Milvus vector DB | Large-scale production |
| `QdrantVectorDBStorage` | Qdrant vector DB | Cloud-native production |
| `FaissVectorDBStorage` | FAISS index | Local high-performance |
| `MongoVectorDBStorage` | MongoDB Atlas Vector | MongoDB ecosystem |
### 3. Graph Storage (`BaseGraphStorage`)
Stores knowledge graph nodes and edges.
| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `NetworkXStorage` | In-memory NetworkX | Development, small graphs |
| `PGGraphStorage` | PostgreSQL tables | Production, unified DB |
| `Neo4JStorage` | Native graph DB | Complex graph queries |
| `MemgraphStorage` | In-memory graph DB | Real-time analytics |
| `MongoGraphStorage` | MongoDB documents | Document-graph hybrid |
### 4. Document Status Storage (`DocStatusStorage`)
Tracks document processing status.
| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `JsonDocStatusStorage` | File-based JSON | Development |
| `PGDocStatusStorage` | PostgreSQL | Production |
| `MongoDocStatusStorage` | MongoDB | Production |
| `RedisDocStatusStorage` | Redis | Distributed |
---
## Backend Comparison
### Feature Matrix
| Feature | PG Full | Mongo | Neo4j | Mixed | File-Only |
|--------------------|---------|-------|-------|-------|-----------|
| KV Storage | ✅ | ✅ | ❌ | ✅ | ✅ |
| Vector Storage | ✅ | ✅ | ❌ | ✅ | ✅ |
| Graph Storage | ✅ | ✅ | ✅ | ✅ | ✅ |
| Doc Status | ✅ | ✅ | ❌ | ✅ | ✅ |
| Multi-tenant | ✅ | ✅ | ✅ | ✅ | ⚠️ |
| Horizontal Scale | ✅ | ✅ | ✅ | ✅ | ❌ |
| ACID Transactions | ✅ | ⚠️ | ✅ | ⚠️ | ❌ |
| Zero Dependencies | ❌ | ❌ | ❌ | ❌ | ✅ |
| Graph Queries | ⚠️ | ⚠️ | ✅ | ✅ | ⚠️ |
| Vector Search | ✅ | ✅ | ❌ | ✅ | ✅ |

Legend: ✅ Full support · ⚠️ Limited · ❌ Not supported
### Performance Characteristics
| Backend | Write Speed | Read Speed | Memory Usage | Disk Usage |
|---------|-------------|------------|--------------|------------|
| PostgreSQL Full | Fast | Fast | Medium | Compact |
| MongoDB Full | Fast | Fast | Medium | Medium |
| Neo4j + Vector | Slow | Fast (graph) | High | Medium |
| File-based | Slow | Medium | Low | Compact |
| Milvus/Qdrant | Fast | Very Fast | High | Large |
---
## PostgreSQL Backend
### Complete PostgreSQL Setup
PostgreSQL can handle ALL storage types (recommended for production):
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    # All PostgreSQL backends
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="PGGraphStorage",
    doc_status_storage="PGDocStatusStorage",
)
```
### Environment Variables
```bash
# Required
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password
POSTGRES_DATABASE=lightrag
# Optional
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer # disable|allow|prefer|require|verify-ca|verify-full
POSTGRES_SSL_CERT=/path/to/cert
POSTGRES_SSL_KEY=/path/to/key
POSTGRES_SSL_ROOT_CERT=/path/to/ca
# Vector index configuration
POSTGRES_VECTOR_INDEX_TYPE=hnsw # hnsw|ivfflat
POSTGRES_HNSW_M=16
POSTGRES_HNSW_EF=64
POSTGRES_IVFFLAT_LISTS=100
```
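These settings can be assembled into a libpq-style connection URL. A minimal sketch — the helper name and fallback defaults are illustrative, not part of LightRAG's API:

```python
import os

def postgres_dsn(env=None) -> str:
    """Build a PostgreSQL connection URL from the POSTGRES_* variables above.

    Illustrative helper: defaults mirror the sample values in this guide.
    """
    env = os.environ if env is None else env
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    user = env.get("POSTGRES_USER", "postgres")
    password = env.get("POSTGRES_PASSWORD", "")
    database = env.get("POSTGRES_DATABASE", "lightrag")
    sslmode = env.get("POSTGRES_SSL_MODE", "prefer")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}?sslmode={sslmode}"
```

For example, `postgres_dsn({"POSTGRES_PASSWORD": "secret"})` yields `postgresql://postgres:secret@localhost:5432/lightrag?sslmode=prefer`.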
### Schema Overview
```sql
-- Documents table
CREATE TABLE LIGHTRAG_DOC_FULL (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    doc_name VARCHAR(1024),
    content TEXT,
    meta JSONB,
    createtime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    updatetime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Chunks table
CREATE TABLE LIGHTRAG_DOC_CHUNKS (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    full_doc_id VARCHAR(255),
    chunk_order_index INT,
    tokens INT,
    content TEXT,
    content_summary TEXT,
    file_path VARCHAR(32768),
    create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Entity vectors (pgvector extension required)
CREATE TABLE LIGHTRAG_VDB_ENTITY (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    entity_name VARCHAR(1024),
    content TEXT,
    content_vector VECTOR(1024), -- Adjust dimension to match embedding model
    source_id TEXT,
    file_path TEXT,
    create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Graph nodes
CREATE TABLE LIGHTRAG_GRAPH_NODES (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    entity_type VARCHAR(255),
    description TEXT,
    source_id TEXT,
    file_path TEXT,
    created_at INT,
    PRIMARY KEY (workspace, id)
);

-- Graph edges
CREATE TABLE LIGHTRAG_GRAPH_EDGES (
    workspace VARCHAR(1024) NOT NULL,
    source_id VARCHAR(255) NOT NULL,
    target_id VARCHAR(255) NOT NULL,
    weight FLOAT,
    description TEXT,
    keywords TEXT,
    source_chunk_id TEXT,
    file_path TEXT,
    created_at INT,
    PRIMARY KEY (workspace, source_id, target_id)
);
```
### pgvector Index Types
```sql
-- HNSW index (recommended for accuracy)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY
    USING hnsw (content_vector vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- IVFFlat index (faster to build, less accurate)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY
    USING ivfflat (content_vector vector_cosine_ops)
    WITH (lists = 100);
```
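Both index types serve `ORDER BY ... <=>` queries, since pgvector's `<=>` operator computes cosine distance and the indexes above are built with `vector_cosine_ops`. A sketch of how a top-k similarity query against this schema might be composed (the helper is illustrative; table and column names come from the schema above, and the placeholders assume a psycopg-style driver):

```python
def topk_entity_query(k: int = 10) -> str:
    """Build a top-k cosine-similarity query over LIGHTRAG_VDB_ENTITY.

    Sketch only: execute with a driver that supports %(name)s parameters,
    binding `query_vec` (a pgvector literal) and `workspace`.
    """
    return (
        "SELECT id, entity_name, "
        "1 - (content_vector <=> %(query_vec)s::vector) AS similarity "
        "FROM LIGHTRAG_VDB_ENTITY "
        "WHERE workspace = %(workspace)s "
        "ORDER BY content_vector <=> %(query_vec)s::vector "
        f"LIMIT {int(k)}"
    )
```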
---
## MongoDB Backend
### Complete MongoDB Setup
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    # All MongoDB backends
    kv_storage="MongoKVStorage",
    vector_storage="MongoVectorDBStorage",
    graph_storage="MongoGraphStorage",
    doc_status_storage="MongoDocStatusStorage",
)
```
### Environment Variables
```bash
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag
# Atlas Vector Search (optional)
MONGO_ATLAS_CLUSTER=your-cluster
MONGO_ATLAS_API_KEY=your-api-key
```
### Collection Structure
```javascript
// Documents collection
db.lightrag_doc_full.insertOne({
  _id: "workspace:doc_id",
  workspace: "default",
  doc_id: "abc123",
  doc_name: "document.txt",
  content: "Full document text...",
  meta: { source: "upload" },
  created_at: ISODate(),
  updated_at: ISODate()
});

// Entities collection (with vector)
db.lightrag_entities.insertOne({
  _id: "workspace:entity_id",
  workspace: "default",
  entity_name: "Apple Inc.",
  entity_type: "organization",
  description: "Technology company...",
  content: "Apple Inc.\nTechnology company...",
  embedding: [0.1, 0.2, ...], // Vector embedding
  source_id: "chunk_001,chunk_002",
  file_path: "document.txt"
});

// Graph edges collection
db.lightrag_graph_edges.insertOne({
  _id: "workspace:source:target",
  workspace: "default",
  source: "Apple Inc.",
  target: "iPhone",
  weight: 3.5,
  description: "Produces the iPhone",
  keywords: "technology,smartphone"
});
```
### Vector Search Index (Atlas)
```javascript
// Create vector search index
db.lightrag_entities.createSearchIndex({
  name: "vector_index",
  definition: {
    mappings: {
      dynamic: true,
      fields: {
        embedding: {
          type: "knnVector",
          dimensions: 1024,
          similarity: "cosine"
        }
      }
    }
  }
});
```
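On clusters where the index is defined as an Atlas Vector Search index, queries go through the `$vectorSearch` aggregation stage. A sketch of building that stage in Python — the index name and field path match the definition above, `numCandidates` of roughly 10× `limit` is a common starting point, and filtering on `workspace` assumes it is indexed as a filter field:

```python
def vector_search_stage(query_vec, workspace, limit=10):
    """Build a $vectorSearch aggregation stage (illustrative sketch).

    Pass the result as the first stage of a pymongo aggregate() pipeline.
    """
    return {
        "$vectorSearch": {
            "index": "vector_index",       # matches the index defined above
            "path": "embedding",           # matches the vector field
            "queryVector": list(query_vec),
            "numCandidates": limit * 10,   # tuning knob, not a fixed rule
            "limit": limit,
            "filter": {"workspace": workspace},
        }
    }
```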
---
## Neo4j Backend
### Neo4j for Graph Storage
Neo4j provides native graph storage with Cypher queries:
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    # Neo4j for graph, other backends for KV/Vector
    kv_storage="PGKVStorage",          # or JsonKVStorage
    vector_storage="PGVectorStorage",  # or another vector DB
    graph_storage="Neo4JStorage",      # Neo4j graph
    doc_status_storage="PGDocStatusStorage",
)
```
### Environment Variables
```bash
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
# Optional
NEO4J_DATABASE=neo4j
NEO4J_ENCRYPTED=false
```
### Graph Schema
```cypher
// Entity node
CREATE (e:Entity {
  entity_id: "Apple Inc.",
  entity_type: "organization",
  description: "Technology company...",
  source_id: "chunk_001",
  workspace: "default"
})

// Relationship
MATCH (a:Entity {entity_id: "Apple Inc."})
MATCH (b:Entity {entity_id: "iPhone"})
CREATE (a)-[r:RELATED_TO {
  weight: 3.5,
  description: "Produces",
  keywords: "technology"
}]->(b)
```
### Cypher Queries Used
```cypher
// Get a node with its edges
MATCH (n:Entity {entity_id: $entity_id, workspace: $workspace})
OPTIONAL MATCH (n)-[r]-(m)
RETURN n, r, m

// Get knowledge graph (BFS, up to 3 hops)
MATCH path = (start:Entity {entity_id: $label})-[*1..3]-(connected)
WHERE start.workspace = $workspace
RETURN path
LIMIT $max_nodes

// Search nodes by name
MATCH (n:Entity)
WHERE n.workspace = $workspace
  AND toLower(n.entity_id) CONTAINS toLower($query)
RETURN n.entity_id
ORDER BY n.degree DESC
LIMIT $limit
```
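Running such a query from Python requires a live Neo4j instance, so a sketch with the official `neo4j` driver is shown but the call itself is left as a comment; the `ORDER BY n.degree` in the query above assumes a `degree` property is maintained on nodes:

```python
# Query text mirrors the node-search Cypher above (without the degree sort).
SEARCH_NODES = """
MATCH (n:Entity)
WHERE n.workspace = $workspace
  AND toLower(n.entity_id) CONTAINS toLower($query)
RETURN n.entity_id AS entity_id
LIMIT $limit
"""

def search_params(workspace: str, query: str, limit: int = 20) -> dict:
    """Bundle the query parameters expected by SEARCH_NODES."""
    return {"workspace": workspace, "query": query, "limit": limit}

# With a running database (assumes the neo4j Python driver, v5+):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(uri, auth=(user, password))
# records, summary, keys = driver.execute_query(
#     SEARCH_NODES, search_params("default", "apple")
# )
```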
---
## Redis Backend
### Redis for KV and Doc Status
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    kv_storage="RedisKVStorage",
    vector_storage="NanoVectorDBStorage",  # Redis doesn't provide vector storage
    graph_storage="NetworkXStorage",       # Redis doesn't provide graph storage
    doc_status_storage="RedisDocStatusStorage",
)
```
### Environment Variables
```bash
REDIS_URI=redis://localhost:6379
# or with auth
REDIS_URI=redis://user:password@localhost:6379/0
```
### Key Structure
```
# Document storage
lightrag:{workspace}:full_docs:{doc_id} -> JSON document
# Chunks storage
lightrag:{workspace}:text_chunks:{chunk_id} -> JSON chunk
# LLM cache
lightrag:{workspace}:llm_cache:{cache_key} -> JSON response
# Document status
lightrag:{workspace}:doc_status:{doc_id} -> JSON status
```
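The layout above can be captured in two small helpers — one to compose a key and one to build a `SCAN`/`KEYS` glob for a whole namespace (helper names are illustrative, not LightRAG API):

```python
def redis_key(workspace: str, namespace: str, item_id: str) -> str:
    """Compose a Redis key following the layout above."""
    return f"lightrag:{workspace}:{namespace}:{item_id}"

def scan_pattern(workspace: str, namespace: str) -> str:
    """Glob pattern for scanning all keys of one namespace."""
    return f"lightrag:{workspace}:{namespace}:*"
```

For example, `redis_key("default", "doc_status", "doc1")` gives `lightrag:default:doc_status:doc1`.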
---
## File-Based Backends
### Zero-Dependency Setup
Best for development and small-scale usage:
```python
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    # All file-based (default)
    kv_storage="JsonKVStorage",
    vector_storage="NanoVectorDBStorage",
    graph_storage="NetworkXStorage",
    doc_status_storage="JsonDocStatusStorage",
)
```
### File Structure
```
./rag_storage/
├── full_docs.json # Complete documents
├── text_chunks.json # Document chunks
├── llm_response_cache.json # LLM cache
├── full_entities.json # Entity metadata
├── full_relations.json # Relation metadata
├── vdb_entities.json # Entity vectors
├── vdb_relationships.json # Relation vectors
├── vdb_chunks.json # Chunk vectors
├── graph_chunk_entity_relation.graphml # Knowledge graph
└── doc_status.json # Processing status
```
### NanoVectorDB Format
```json
{
  "data": {
    "ent-abc123": {
      "__id__": "ent-abc123",
      "__vector__": [0.1, 0.2, 0.3, ...],
      "entity_name": "Apple Inc.",
      "content": "Apple Inc.\nTechnology company",
      "source_id": "chunk_001"
    }
  },
  "matrix": [[0.1, 0.2, ...], ...],
  "index_to_id": ["ent-abc123", ...]
}
```
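Conceptually, a query against this store is a cosine scan over `matrix`, with `index_to_id` mapping row positions back to entity ids. A minimal sketch of that ranking (pure Python, for illustration; NanoVectorDB's actual implementation is vectorized):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(store, query_vec, k=5):
    """Rank stored vectors against a query, NanoVectorDB-style.

    `store` mirrors the JSON layout above: `matrix` holds the vectors,
    `index_to_id` maps each row index to its entity id.
    """
    scored = [
        (cosine(row, query_vec), store["index_to_id"][i])
        for i, row in enumerate(store["matrix"])
    ]
    return sorted(scored, reverse=True)[:k]
```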
---
## Vector Databases
### Milvus
```python
rag = LightRAG(
    vector_storage="MilvusVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "host": "localhost",
        "port": 19530,
        "collection_name": "lightrag_vectors",
    },
)
```
```bash
# Environment variables
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_TOKEN=your_token # For Zilliz Cloud
```
### Qdrant
```python
rag = LightRAG(
    vector_storage="QdrantVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "collection_name": "lightrag",
    },
)
```
```bash
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_api_key # Optional
```
### FAISS
```python
rag = LightRAG(
    vector_storage="FaissVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "index_type": "IVF_FLAT",  # or HNSW
        "nlist": 100,
    },
)
```
---
## Configuration Reference
### Complete Environment Variables
```bash
# Storage Selection
KV_STORAGE=PGKVStorage
VECTOR_STORAGE=PGVectorStorage
GRAPH_STORAGE=PGGraphStorage
DOC_STATUS_STORAGE=PGDocStatusStorage
# PostgreSQL
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secret
POSTGRES_DATABASE=lightrag
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer
# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag
# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
# Redis
REDIS_URI=redis://localhost:6379
# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530
# Qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=
# Memgraph
MEMGRAPH_URI=bolt://localhost:7687
```
### Programmatic Configuration
```python
from lightrag import LightRAG

rag = LightRAG(
    # Working directory
    working_dir="./rag_storage",
    workspace="my_project",  # Multi-tenant namespace
    # Storage backends
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="PGGraphStorage",
    doc_status_storage="PGDocStatusStorage",
    # Vector DB options
    vector_db_storage_cls_kwargs={
        "cosine_better_than_threshold": 0.2,
        # Backend-specific options...
    },
    # Processing
    chunk_token_size=1200,
    chunk_overlap_token_size=100,
)
```
---
## Multi-Tenant Data Isolation
All storage backends support multi-tenant isolation:
```python
# Workspace creates an isolated namespace
rag = LightRAG(
    working_dir="./rag_storage",
    workspace="tenant_a:kb_prod",  # Composite namespace
)

# Or with an explicit tenant context
from lightrag.tenant_rag_manager import TenantRAGManager

manager = TenantRAGManager(
    base_working_dir="./rag_storage",
    tenant_service=tenant_service,
    template_rag=template_rag,
)

# Get a tenant-specific instance
rag = await manager.get_rag_instance(
    tenant_id="tenant_a",
    kb_id="kb_prod",
)
```
### Isolation Pattern
```
┌──────────────────────────────────────────────────────────────┐
│                 Multi-Tenant Data Isolation                  │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  PostgreSQL: WHERE workspace = 'tenant_a:kb_prod:default'    │
│                                                              │
│  MongoDB:    { workspace: "tenant_a:kb_prod:default" }       │
│                                                              │
│  Redis:      lightrag:tenant_a:kb_prod:default:{key}         │
│                                                              │
│  Neo4j:      MATCH (n {workspace: $workspace})               │
│                                                              │
│  File:       ./rag_storage/tenant_a:kb_prod/                 │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
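The composite namespace can be built with a tiny helper. A sketch, assuming the `tenant:kb:workspace` pattern shown above (the helper name is illustrative); since `:` is the delimiter, the individual parts must not contain it:

```python
def effective_workspace(tenant_id: str, kb_id: str, workspace: str = "default") -> str:
    """Compose the backend-level workspace value used for isolation."""
    for part in (tenant_id, kb_id, workspace):
        if ":" in part:
            raise ValueError(f"':' not allowed in namespace part: {part!r}")
    return f"{tenant_id}:{kb_id}:{workspace}"
```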
---
## Migration Between Backends
### Export/Import Pattern
```python
# Export from the source (file-based) instance
source_rag = LightRAG(
    kv_storage="JsonKVStorage",
    vector_storage="NanoVectorDBStorage",
    graph_storage="NetworkXStorage",
)
await source_rag.initialize_storages()

# Read all data
docs = await source_rag.full_docs.get_all()
chunks = await source_rag.text_chunks.get_all()
# ... export other data

# Import into the target (PostgreSQL) instance
target_rag = LightRAG(
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="PGGraphStorage",
)
await target_rag.initialize_storages()
await target_rag.full_docs.upsert(docs)
await target_rag.text_chunks.upsert(chunks)
# ... import other data
await target_rag.finalize_storages()
```
---
## Best Practices
### Production Recommendations
1. **Use PostgreSQL Full Stack** for simplicity and reliability
2. **Enable connection pooling** for high concurrency
3. **Create indexes** on frequently queried columns
4. **Monitor storage growth** and plan capacity
5. **Take regular backups** with point-in-time recovery enabled
6. **Use SSL/TLS** for database connections
### Performance Tuning
```bash
# PostgreSQL tuning
POSTGRES_MAX_CONNECTIONS=200
POSTGRES_VECTOR_INDEX_TYPE=hnsw
POSTGRES_HNSW_M=32
POSTGRES_HNSW_EF=128
# LightRAG tuning
MAX_PARALLEL_INSERT=4
EMBEDDING_BATCH_NUM=20
MAX_ASYNC=8
```
---
**Version**: 1.4.9.1 | **License**: MIT