LightRAG Storage Backends
Complete guide to storage backend configuration and implementation
Version: 1.4.9.1 | Last Updated: December 2025
Table of Contents
- Overview
- Storage Types
- Backend Comparison
- PostgreSQL Backend
- MongoDB Backend
- Neo4j Backend
- Redis Backend
- File-Based Backends
- Vector Databases
- Configuration Reference
Overview
LightRAG uses four types of storage, each with multiple backend options:
┌─────────────────────────────────────────────────────────────────────────┐
│ Storage Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LightRAG Core │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ KV Storage │ │ Vector Store │ │ Graph Store │ │
│ │ (Documents, │ │ (Embeddings) │ │ (KG Nodes │ │
│ │ Chunks, │ │ │ │ & Edges) │ │
│ │ Cache) │ │ │ │ │ │
│ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Backend Implementations │ │
│ │ │ │
│ │ PostgreSQL │ MongoDB │ Redis │ Neo4j │ Milvus │ Qdrant │ FAISS │ │
│ │ JSON/File │ NetworkX │ NanoVectorDB │ Memgraph │ ... │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Storage Types
1. Key-Value Storage (BaseKVStorage)
Stores documents, chunks, and LLM cache.
| Implementation | Description | Use Case |
|---|---|---|
| JsonKVStorage | File-based JSON | Development, single-node |
| PGKVStorage | PostgreSQL tables | Production, multi-node |
| MongoKVStorage | MongoDB collections | Production, flexible schema |
| RedisKVStorage | Redis hash maps | High-performance caching |
2. Vector Storage (BaseVectorStorage)
Stores and queries embedding vectors.
| Implementation | Description | Use Case |
|---|---|---|
| NanoVectorDBStorage | In-memory, file-persisted | Development, small datasets |
| PGVectorStorage | PostgreSQL + pgvector | Production, unified DB |
| MilvusVectorDBStorage | Milvus vector DB | Large-scale production |
| QdrantVectorDBStorage | Qdrant vector DB | Cloud-native production |
| FaissVectorDBStorage | FAISS index | Local high-performance |
| MongoVectorDBStorage | MongoDB Atlas Vector | MongoDB ecosystem |
3. Graph Storage (BaseGraphStorage)
Stores knowledge graph nodes and edges.
| Implementation | Description | Use Case |
|---|---|---|
| NetworkXStorage | In-memory NetworkX | Development, small graphs |
| PGGraphStorage | PostgreSQL tables | Production, unified DB |
| Neo4JStorage | Native graph DB | Complex graph queries |
| MemgraphStorage | In-memory graph DB | Real-time analytics |
| MongoGraphStorage | MongoDB documents | Document-graph hybrid |
4. Document Status Storage (DocStatusStorage)
Tracks document processing status.
| Implementation | Description | Use Case |
|---|---|---|
| JsonDocStatusStorage | File-based JSON | Development |
| PGDocStatusStorage | PostgreSQL | Production |
| MongoDocStatusStorage | MongoDB | Production |
| RedisDocStatusStorage | Redis | Distributed |
Backend Comparison
Feature Matrix
┌────────────────────┬─────────┬────────┬───────┬────────┬───────────┐
│ Feature │ PG Full │ Mongo │ Neo4j │ Mixed │ File-Only │
├────────────────────┼─────────┼────────┼───────┼────────┼───────────┤
│ KV Storage │ ✅ │ ✅ │ ❌ │ ✅ │ ✅ │
│ Vector Storage │ ✅ │ ✅ │ ❌ │ ✅ │ ✅ │
│ Graph Storage │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │
│ Doc Status │ ✅ │ ✅ │ ❌ │ ✅ │ ✅ │
│ Multi-tenant │ ✅ │ ✅ │ ✅ │ ✅ │ ⚠️ │
│ Horizontal Scale │ ✅ │ ✅ │ ✅ │ ✅ │ ❌ │
│ ACID Transactions │ ✅ │ ⚠️ │ ✅ │ ⚠️ │ ❌ │
│ Zero Dependencies │ ❌ │ ❌ │ ❌ │ ❌ │ ✅ │
│ Graph Queries │ ⚠️ │ ⚠️ │ ✅ │ ✅ │ ⚠️ │
│ Vector Search │ ✅ │ ✅ │ ❌ │ ✅ │ ✅ │
└────────────────────┴─────────┴────────┴───────┴────────┴───────────┘
Legend: ✅ Full support ⚠️ Limited ❌ Not supported
Performance Characteristics
| Backend | Write Speed | Read Speed | Memory Usage | Disk Usage |
|---|---|---|---|---|
| PostgreSQL Full | Fast | Fast | Medium | Compact |
| MongoDB Full | Fast | Fast | Medium | Medium |
| Neo4j + Vector | Slow | Fast (graph) | High | Medium |
| File-based | Slow | Medium | Low | Compact |
| Milvus/Qdrant | Fast | Very Fast | High | Large |
PostgreSQL Backend
Complete PostgreSQL Setup
PostgreSQL can serve as the backend for all four storage types, which is the recommended production setup:
from lightrag import LightRAG
rag = LightRAG(
working_dir="./rag_storage",
# All PostgreSQL backends
kv_storage="PGKVStorage",
vector_storage="PGVectorStorage",
graph_storage="PGGraphStorage",
doc_status_storage="PGDocStatusStorage",
)
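After construction, the storages must be initialized before any documents are inserted or queried. Below is a minimal end-to-end sketch; initialize_storages()/finalize_storages() are the same calls used in the Migration section later in this guide, while the ainsert/aquery calls assume LightRAG's standard async API (recent versions may also require pipeline-status initialization before the first insert).
import asyncio
from lightrag import LightRAG, QueryParam

async def main():
    rag = LightRAG(
        working_dir="./rag_storage",
        kv_storage="PGKVStorage",
        vector_storage="PGVectorStorage",
        graph_storage="PGGraphStorage",
        doc_status_storage="PGDocStatusStorage",
    )
    # Connects to PostgreSQL using the POSTGRES_* variables described below
    await rag.initialize_storages()
    try:
        await rag.ainsert("Apple Inc. designs and sells the iPhone.")
        answer = await rag.aquery("Who makes the iPhone?", param=QueryParam(mode="hybrid"))
        print(answer)
    finally:
        await rag.finalize_storages()

asyncio.run(main())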
Environment Variables
# Required
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password
POSTGRES_DATABASE=lightrag
# Optional
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer # disable|allow|prefer|require|verify-ca|verify-full
POSTGRES_SSL_CERT=/path/to/cert
POSTGRES_SSL_KEY=/path/to/key
POSTGRES_SSL_ROOT_CERT=/path/to/ca
# Vector index configuration
POSTGRES_VECTOR_INDEX_TYPE=hnsw # hnsw|ivfflat
POSTGRES_HNSW_M=16
POSTGRES_HNSW_EF=64
POSTGRES_IVFFLAT_LISTS=100
Schema Overview
-- Documents table
CREATE TABLE LIGHTRAG_DOC_FULL (
workspace VARCHAR(1024) NOT NULL,
id VARCHAR(255) NOT NULL,
doc_name VARCHAR(1024),
content TEXT,
meta JSONB,
createtime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
updatetime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (workspace, id)
);
-- Chunks table
CREATE TABLE LIGHTRAG_DOC_CHUNKS (
workspace VARCHAR(1024) NOT NULL,
id VARCHAR(255) NOT NULL,
full_doc_id VARCHAR(255),
chunk_order_index INT,
tokens INT,
content TEXT,
content_summary TEXT,
file_path VARCHAR(32768),
create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (workspace, id)
);
-- Entity vectors (pgvector extension required)
CREATE TABLE LIGHTRAG_VDB_ENTITY (
workspace VARCHAR(1024) NOT NULL,
id VARCHAR(255) NOT NULL,
entity_name VARCHAR(1024),
content TEXT,
content_vector VECTOR(1024), -- Adjust dimension to match embedding
source_id TEXT,
file_path TEXT,
create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (workspace, id)
);
-- Graph nodes
CREATE TABLE LIGHTRAG_GRAPH_NODES (
workspace VARCHAR(1024) NOT NULL,
id VARCHAR(255) NOT NULL,
entity_type VARCHAR(255),
description TEXT,
source_id TEXT,
file_path TEXT,
created_at INT,
PRIMARY KEY (workspace, id)
);
-- Graph edges
CREATE TABLE LIGHTRAG_GRAPH_EDGES (
workspace VARCHAR(1024) NOT NULL,
source_id VARCHAR(255) NOT NULL,
target_id VARCHAR(255) NOT NULL,
weight FLOAT,
description TEXT,
keywords TEXT,
source_chunk_id TEXT,
file_path TEXT,
created_at INT,
PRIMARY KEY (workspace, source_id, target_id)
);
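The LIGHTRAG_VDB_* tables depend on the pgvector extension. A quick connectivity check is sketched below, assuming the asyncpg driver and the POSTGRES_* values shown above (not part of LightRAG itself):
import asyncio
import asyncpg

async def check_pgvector():
    # Connect with the same credentials as the POSTGRES_* variables above
    conn = await asyncpg.connect(
        host="localhost", port=5432,
        user="postgres", password="your_password", database="lightrag",
    )
    try:
        row = await conn.fetchrow(
            "SELECT extversion FROM pg_extension WHERE extname = 'vector'"
        )
        print("pgvector:", row["extversion"] if row else "NOT INSTALLED")
    finally:
        await conn.close()

asyncio.run(check_pgvector())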
pgvector Index Types
-- HNSW index (recommended for accuracy)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY
USING hnsw (content_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- IVFFlat index (faster but less accurate)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY
USING ivfflat (content_vector vector_cosine_ops)
WITH (lists = 100);
MongoDB Backend
Complete MongoDB Setup
from lightrag import LightRAG
rag = LightRAG(
working_dir="./rag_storage",
# All MongoDB backends
kv_storage="MongoKVStorage",
vector_storage="MongoVectorDBStorage",
graph_storage="MongoGraphStorage",
doc_status_storage="MongoDocStatusStorage",
)
Environment Variables
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag
# Atlas Vector Search (optional)
MONGO_ATLAS_CLUSTER=your-cluster
MONGO_ATLAS_API_KEY=your-api-key
Collection Structure
// Documents collection
db.lightrag_doc_full.insertOne({
_id: "workspace:doc_id",
workspace: "default",
doc_id: "abc123",
doc_name: "document.txt",
content: "Full document text...",
meta: { source: "upload" },
created_at: ISODate(),
updated_at: ISODate()
});
// Entities collection (with vector)
db.lightrag_entities.insertOne({
_id: "workspace:entity_id",
workspace: "default",
entity_name: "Apple Inc.",
entity_type: "organization",
description: "Technology company...",
content: "Apple Inc.\nTechnology company...",
embedding: [0.1, 0.2, ...], // Vector embedding
source_id: "chunk_001,chunk_002",
file_path: "document.txt"
});
// Graph edges collection
db.lightrag_graph_edges.insertOne({
_id: "workspace:source:target",
workspace: "default",
source: "Apple Inc.",
target: "iPhone",
weight: 3.5,
description: "Produces the iPhone",
keywords: "technology,smartphone"
});
Vector Search Index (Atlas)
// Create vector search index
db.lightrag_entities.createSearchIndex({
name: "vector_index",
definition: {
mappings: {
dynamic: true,
fields: {
embedding: {
type: "knnVector",
dimensions: 1024,
similarity: "cosine"
}
}
}
}
});
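The same index can also be created from Python; a sketch assuming pymongo 4.5+ and an Atlas cluster (local MongoDB deployments do not support search indexes):
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb://localhost:27017")  # use your MONGO_URI
entities = client["lightrag"]["lightrag_entities"]

# Mirrors the knnVector mapping above; dimensions must match your embedding model
index = SearchIndexModel(
    name="vector_index",
    definition={
        "mappings": {
            "dynamic": True,
            "fields": {
                "embedding": {
                    "type": "knnVector",
                    "dimensions": 1024,
                    "similarity": "cosine",
                }
            },
        }
    },
)
entities.create_search_index(model=index)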
Neo4j Backend
Neo4j for Graph Storage
Neo4j provides native graph storage with Cypher queries:
from lightrag import LightRAG
rag = LightRAG(
working_dir="./rag_storage",
# Neo4j for graph, other backends for KV/Vector
kv_storage="PGKVStorage", # or JsonKVStorage
vector_storage="PGVectorStorage", # or other vector DB
graph_storage="Neo4JStorage", # Neo4j graph
doc_status_storage="PGDocStatusStorage",
)
Environment Variables
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
# Optional
NEO4J_DATABASE=neo4j
NEO4J_ENCRYPTED=false
Graph Schema
// Entity nodes
CREATE (e:Entity {
entity_id: "Apple Inc.",
entity_type: "organization",
description: "Technology company...",
source_id: "chunk_001",
workspace: "default"
})
// Relationship
MATCH (a:Entity {entity_id: "Apple Inc."})
MATCH (b:Entity {entity_id: "iPhone"})
CREATE (a)-[r:RELATED_TO {
weight: 3.5,
description: "Produces",
keywords: "technology"
}]->(b)
Cypher Queries Used
// Get node with edges
MATCH (n:Entity {entity_id: $entity_id, workspace: $workspace})
OPTIONAL MATCH (n)-[r]-(m)
RETURN n, r, m
// Get knowledge graph (BFS)
MATCH path = (start:Entity {entity_id: $label})-[*1..3]-(connected)
WHERE start.workspace = $workspace
RETURN path
LIMIT $max_nodes
// Search nodes
MATCH (n:Entity)
WHERE n.workspace = $workspace
AND toLower(n.entity_id) CONTAINS toLower($query)
RETURN n.entity_id
ORDER BY n.degree DESC
LIMIT $limit
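These queries can also be run directly with the official Neo4j Python driver, for example the node-search query above (a sketch using the NEO4J_* values from this section; the ORDER BY clause is omitted for brevity):
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

with driver.session(database="neo4j") as session:
    result = session.run(
        """
        MATCH (n:Entity)
        WHERE n.workspace = $workspace
          AND toLower(n.entity_id) CONTAINS toLower($query)
        RETURN n.entity_id AS entity_id
        LIMIT $limit
        """,
        workspace="default", query="apple", limit=10,
    )
    for record in result:
        print(record["entity_id"])

driver.close()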
Redis Backend
Redis for KV and Doc Status
from lightrag import LightRAG
rag = LightRAG(
working_dir="./rag_storage",
kv_storage="RedisKVStorage",
vector_storage="NanoVectorDBStorage", # no Redis vector storage implementation
graph_storage="NetworkXStorage", # no Redis graph storage implementation
doc_status_storage="RedisDocStatusStorage",
)
Environment Variables
REDIS_URI=redis://localhost:6379
# or with auth
REDIS_URI=redis://user:password@localhost:6379/0
Key Structure
# Document storage
lightrag:{workspace}:full_docs:{doc_id} -> JSON document
# Chunks storage
lightrag:{workspace}:text_chunks:{chunk_id} -> JSON chunk
# LLM cache
lightrag:{workspace}:llm_cache:{cache_key} -> JSON response
# Document status
lightrag:{workspace}:doc_status:{doc_id} -> JSON status
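The namespaced keys make it easy to inspect a workspace from the Redis CLI or from Python. A minimal sketch with redis-py (key names follow the pattern above; the stored values are backend-internal JSON):
import redis

# Same REDIS_URI as LightRAG
r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)

# List document-status keys for the "default" workspace
for key in r.scan_iter(match="lightrag:default:doc_status:*"):
    print(key)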
File-Based Backends
Zero-Dependency Setup
Best for development and small-scale usage:
from lightrag import LightRAG
rag = LightRAG(
working_dir="./rag_storage",
# All file-based (default)
kv_storage="JsonKVStorage",
vector_storage="NanoVectorDBStorage",
graph_storage="NetworkXStorage",
doc_status_storage="JsonDocStatusStorage",
)
File Structure
./rag_storage/
├── full_docs.json # Complete documents
├── text_chunks.json # Document chunks
├── llm_response_cache.json # LLM cache
├── full_entities.json # Entity metadata
├── full_relations.json # Relation metadata
├── vdb_entities.json # Entity vectors
├── vdb_relationships.json # Relation vectors
├── vdb_chunks.json # Chunk vectors
├── graph_chunk_entity_relation.graphml # Knowledge graph
└── doc_status.json # Processing status
NanoVectorDB Format
{
"data": {
"ent-abc123": {
"__id__": "ent-abc123",
"__vector__": [0.1, 0.2, 0.3, ...],
"entity_name": "Apple Inc.",
"content": "Apple Inc.\nTechnology company",
"source_id": "chunk_001"
}
},
"matrix": [[0.1, 0.2, ...], ...],
"index_to_id": ["ent-abc123", ...]
}
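Because everything is stored as plain files, the stores can be inspected with nothing but the standard library. A sketch assuming the vdb_entities.json layout shown above:
import json
from pathlib import Path

vdb_path = Path("./rag_storage") / "vdb_entities.json"
vdb = json.loads(vdb_path.read_text())

for entity_id, record in vdb["data"].items():
    name = record.get("entity_name", "?")
    dim = len(record.get("__vector__", []))
    print(f"{entity_id}: {name} (vector dim={dim})")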
Vector Databases
Milvus
rag = LightRAG(
vector_storage="MilvusVectorDBStorage",
vector_db_storage_cls_kwargs={
"host": "localhost",
"port": 19530,
"collection_name": "lightrag_vectors"
}
)
# Environment variables
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_TOKEN=your_token # For Zilliz Cloud
Qdrant
rag = LightRAG(
vector_storage="QdrantVectorDBStorage",
vector_db_storage_cls_kwargs={
"collection_name": "lightrag"
}
)
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_api_key # Optional
FAISS
rag = LightRAG(
vector_storage="FaissVectorDBStorage",
vector_db_storage_cls_kwargs={
"index_type": "IVF_FLAT", # or HNSW
"nlist": 100
}
)
Configuration Reference
Complete Environment Variables
# Storage Selection
KV_STORAGE=PGKVStorage
VECTOR_STORAGE=PGVectorStorage
GRAPH_STORAGE=PGGraphStorage
DOC_STATUS_STORAGE=PGDocStatusStorage
# PostgreSQL
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secret
POSTGRES_DATABASE=lightrag
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer
# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag
# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
# Redis
REDIS_URI=redis://localhost:6379
# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530
# Qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=
# Memgraph
MEMGRAPH_URI=bolt://localhost:7687
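In server deployments the storage-selection variables above are typically picked up from the environment; for direct library use, a small sketch that wires them into the constructor (defaults fall back to the file-based backends):
import os
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    kv_storage=os.getenv("KV_STORAGE", "JsonKVStorage"),
    vector_storage=os.getenv("VECTOR_STORAGE", "NanoVectorDBStorage"),
    graph_storage=os.getenv("GRAPH_STORAGE", "NetworkXStorage"),
    doc_status_storage=os.getenv("DOC_STATUS_STORAGE", "JsonDocStatusStorage"),
)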
Programmatic Configuration
from lightrag import LightRAG
rag = LightRAG(
# Working directory
working_dir="./rag_storage",
workspace="my_project", # Multi-tenant namespace
# Storage backends
kv_storage="PGKVStorage",
vector_storage="PGVectorStorage",
graph_storage="PGGraphStorage",
doc_status_storage="PGDocStatusStorage",
# Vector DB options
vector_db_storage_cls_kwargs={
"cosine_better_than_threshold": 0.2,
# Backend-specific options...
},
# Processing
chunk_token_size=1200,
chunk_overlap_token_size=100,
)
Multi-Tenant Data Isolation
All storage backends support multi-tenant isolation:
# Workspace creates isolated namespace
rag = LightRAG(
working_dir="./rag_storage",
workspace="tenant_a:kb_prod", # Composite namespace
)
# Or with explicit tenant context
from lightrag.tenant_rag_manager import TenantRAGManager
manager = TenantRAGManager(
base_working_dir="./rag_storage",
tenant_service=tenant_service,
template_rag=template_rag,
)
# Get tenant-specific instance
rag = await manager.get_rag_instance(
tenant_id="tenant_a",
kb_id="kb_prod"
)
Isolation Pattern
┌─────────────────────────────────────────────────────────────┐
│ Multi-Tenant Data Isolation │
├─────────────────────────────────────────────────────────────┤
│ │
│ PostgreSQL: WHERE workspace = 'tenant_a:kb_prod:default' │
│ │
│ MongoDB: { workspace: "tenant_a:kb_prod:default" } │
│ │
│ Redis: lightrag:tenant_a:kb_prod:default:{key} │
│ │
│ Neo4j: MATCH (n {workspace: $workspace}) │
│ │
│ File: ./rag_storage/tenant_a:kb_prod/ │
│ │
└─────────────────────────────────────────────────────────────┘
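Concretely, isolation comes down to the workspace string each instance is created with. A hypothetical sketch composing per-tenant workspaces the same way as the pattern above (workspace_for is not a LightRAG helper, only an illustration):
from lightrag import LightRAG

def workspace_for(tenant_id: str, kb_id: str) -> str:
    # Hypothetical helper: build the "tenant:kb" composite namespace shown above
    return f"{tenant_id}:{kb_id}"

# Each instance reads and writes only rows/keys/files tagged with its own workspace
rag_a = LightRAG(working_dir="./rag_storage", workspace=workspace_for("tenant_a", "kb_prod"))
rag_b = LightRAG(working_dir="./rag_storage", workspace=workspace_for("tenant_b", "kb_prod"))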
Migration Between Backends
Export/Import Pattern
from lightrag import LightRAG

# Export from source (run these calls inside an async context)
source_rag = LightRAG(
kv_storage="JsonKVStorage",
vector_storage="NanoVectorDBStorage",
graph_storage="NetworkXStorage",
)
# Initialize source
await source_rag.initialize_storages()
# Get all data
docs = await source_rag.full_docs.get_all()
chunks = await source_rag.text_chunks.get_all()
# ... export other data
# Import to target
target_rag = LightRAG(
kv_storage="PGKVStorage",
vector_storage="PGVectorStorage",
graph_storage="PGGraphStorage",
)
await target_rag.initialize_storages()
await target_rag.full_docs.upsert(docs)
await target_rag.text_chunks.upsert(chunks)
# ... import other data
await target_rag.finalize_storages()
Best Practices
Production Recommendations
- Use PostgreSQL Full Stack for simplicity and reliability
- Enable connection pooling for high concurrency
- Create indexes on frequently queried columns
- Monitor storage growth and plan capacity
- Regular backups with point-in-time recovery
- Use SSL/TLS for database connections
Performance Tuning
# PostgreSQL tuning
POSTGRES_MAX_CONNECTIONS=200
POSTGRES_VECTOR_INDEX_TYPE=hnsw
POSTGRES_HNSW_M=32
POSTGRES_HNSW_EF=128
# LightRAG tuning
MAX_PARALLEL_INSERT=4
EMBEDDING_BATCH_NUM=20
MAX_ASYNC=8
Version: 1.4.9.1 | License: MIT