LightRAG/docs/0004-storage-backends.md
Raphael MANSUY 2b292d4924
docs: Enterprise Edition & Multi-tenancy attribution (#5)
2025-12-04 18:09:15 +08:00

LightRAG Storage Backends

Complete guide to storage backend configuration and implementation

Version: 1.4.9.1 | Last Updated: December 2025


Table of Contents

  1. Overview
  2. Storage Types
  3. Backend Comparison
  4. PostgreSQL Backend
  5. MongoDB Backend
  6. Neo4j Backend
  7. Redis Backend
  8. File-Based Backends
  9. Vector Databases
  10. Configuration Reference
  11. Migration Between Backends
  12. Best Practices

Overview

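LightRAG uses four types of storage (key-value, vector, graph, and document status), each with multiple backend options. The diagram shows the three data stores; document status storage is described under Storage Types.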

┌─────────────────────────────────────────────────────────────────────────┐
│                      Storage Architecture                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    LightRAG Core                                 │   │
│  └───────────────────────────┬─────────────────────────────────────┘   │
│                              │                                          │
│          ┌───────────────────┼───────────────────┐                     │
│          │                   │                   │                     │
│          ▼                   ▼                   ▼                     │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐            │
│  │   KV Storage  │   │ Vector Store  │   │  Graph Store  │            │
│  │   (Documents, │   │  (Embeddings) │   │   (KG Nodes   │            │
│  │    Chunks,    │   │               │   │    & Edges)   │            │
│  │    Cache)     │   │               │   │               │            │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘            │
│          │                   │                   │                     │
│          ▼                   ▼                   ▼                     │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                   Backend Implementations                        │   │
│  │                                                                  │   │
│  │  PostgreSQL │ MongoDB │ Redis │ Neo4j │ Milvus │ Qdrant │ FAISS │   │
│  │  JSON/File  │ NetworkX │ NanoVectorDB │ Memgraph │ ...           │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Storage Types

1. Key-Value Storage (BaseKVStorage)

Stores documents, chunks, and LLM cache.

| Implementation | Description | Use Case |
|---|---|---|
| JsonKVStorage | File-based JSON | Development, single-node |
| PGKVStorage | PostgreSQL tables | Production, multi-node |
| MongoKVStorage | MongoDB collections | Production, flexible schema |
| RedisKVStorage | Redis hash maps | High-performance caching |

2. Vector Storage (BaseVectorStorage)

Stores and queries embedding vectors.

| Implementation | Description | Use Case |
|---|---|---|
| NanoVectorDBStorage | In-memory, file-persisted | Development, small datasets |
| PGVectorStorage | PostgreSQL + pgvector | Production, unified DB |
| MilvusVectorDBStorage | Milvus vector DB | Large-scale production |
| QdrantVectorDBStorage | Qdrant vector DB | Cloud-native production |
| FaissVectorDBStorage | FAISS index | Local high-performance |
| MongoVectorDBStorage | MongoDB Atlas Vector | MongoDB ecosystem |

3. Graph Storage (BaseGraphStorage)

Stores knowledge graph nodes and edges.

| Implementation | Description | Use Case |
|---|---|---|
| NetworkXStorage | In-memory NetworkX | Development, small graphs |
| PGGraphStorage | PostgreSQL tables | Production, unified DB |
| Neo4JStorage | Native graph DB | Complex graph queries |
| MemgraphStorage | In-memory graph DB | Real-time analytics |
| MongoGraphStorage | MongoDB documents | Document-graph hybrid |

4. Document Status Storage (DocStatusStorage)

Tracks document processing status.

| Implementation | Description | Use Case |
|---|---|---|
| JsonDocStatusStorage | File-based JSON | Development |
| PGDocStatusStorage | PostgreSQL | Production |
| MongoDocStatusStorage | MongoDB | Production |
| RedisDocStatusStorage | Redis | Distributed |
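
The four tables above can be read as a lookup from storage slot to the implementation names allowed in it. The sketch below is a hypothetical pre-flight check, not part of LightRAG: `BACKENDS` and `validate_backends` are names invented for illustration, and the slot names are the `LightRAG(...)` keyword arguments used in the setup examples later in this guide.

```python
# Implementation names copied from the four tables above.
BACKENDS = {
    "kv_storage": {
        "JsonKVStorage", "PGKVStorage", "MongoKVStorage", "RedisKVStorage",
    },
    "vector_storage": {
        "NanoVectorDBStorage", "PGVectorStorage", "MilvusVectorDBStorage",
        "QdrantVectorDBStorage", "FaissVectorDBStorage", "MongoVectorDBStorage",
    },
    "graph_storage": {
        "NetworkXStorage", "PGGraphStorage", "Neo4JStorage",
        "MemgraphStorage", "MongoGraphStorage",
    },
    "doc_status_storage": {
        "JsonDocStatusStorage", "PGDocStatusStorage",
        "MongoDocStatusStorage", "RedisDocStatusStorage",
    },
}


def validate_backends(**selected: str) -> None:
    """Raise ValueError if a slot or implementation name is not listed above."""
    for slot, name in selected.items():
        if slot not in BACKENDS:
            raise ValueError(f"unknown storage slot: {slot!r}")
        if name not in BACKENDS[slot]:
            choices = ", ".join(sorted(BACKENDS[slot]))
            raise ValueError(f"{name!r} is not a {slot} backend; choose one of: {choices}")


# A mixed setup: PostgreSQL for KV, Neo4j for the graph.
validate_backends(kv_storage="PGKVStorage", graph_storage="Neo4JStorage")
```

Running a check like this before constructing the instance catches a typo such as `graph_storage="Neo4jStorage"` (note the capital `J` in the real name) with a readable message.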

Backend Comparison

Feature Matrix

┌────────────────────┬─────────┬────────┬───────┬────────┬───────────┐
│     Feature        │ PG Full │ Mongo  │ Neo4j │ Mixed  │ File-Only │
├────────────────────┼─────────┼────────┼───────┼────────┼───────────┤
│ KV Storage         │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
│ Vector Storage     │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
│ Graph Storage      │    ✅   │   ✅   │   ✅  │   ✅   │    ✅     │
│ Doc Status         │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
│ Multi-tenant       │    ✅   │   ✅   │   ✅  │   ✅   │    ⚠️    │
│ Horizontal Scale   │    ✅   │   ✅   │   ✅  │   ✅   │    ❌     │
│ ACID Transactions  │    ✅   │   ⚠️   │   ✅  │   ⚠️   │    ❌     │
│ Zero Dependencies  │    ❌   │   ❌   │   ❌  │   ❌   │    ✅     │
│ Graph Queries      │    ⚠️   │   ⚠️   │   ✅  │   ✅   │    ⚠️    │
│ Vector Search      │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
└────────────────────┴─────────┴────────┴───────┴────────┴───────────┘

Legend: ✅ Full support  ⚠️ Limited  ❌ Not supported

Performance Characteristics

| Backend | Write Speed | Read Speed | Memory Usage | Disk Usage |
|---|---|---|---|---|
| PostgreSQL Full | Fast | Fast | Medium | Compact |
| MongoDB Full | Fast | Fast | Medium | Medium |
| Neo4j + Vector | Slow | Fast (graph) | High | Medium |
| File-based | Slow | Medium | Low | Compact |
| Milvus/Qdrant | Fast | Very Fast | High | Large |

PostgreSQL Backend

Complete PostgreSQL Setup

PostgreSQL can serve all four storage types from a single database, which makes it the recommended choice for production:

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # All PostgreSQL backends
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage", 
    graph_storage="PGGraphStorage",
    doc_status_storage="PGDocStatusStorage",
)

Environment Variables

# Required
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password
POSTGRES_DATABASE=lightrag

# Optional
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer           # disable|allow|prefer|require|verify-ca|verify-full
POSTGRES_SSL_CERT=/path/to/cert
POSTGRES_SSL_KEY=/path/to/key
POSTGRES_SSL_ROOT_CERT=/path/to/ca

# Vector index configuration
POSTGRES_VECTOR_INDEX_TYPE=hnsw    # hnsw|ivfflat
POSTGRES_HNSW_M=16
POSTGRES_HNSW_EF=64
POSTGRES_IVFFLAT_LISTS=100
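
When debugging connectivity outside LightRAG, it can help to turn the variables above into a connection URI for `psql` or another client. The following is a minimal sketch under that assumption: `postgres_dsn` is a hypothetical helper, not a LightRAG function, and it only covers the five required variables plus `POSTGRES_SSL_MODE`.

```python
import os
from urllib.parse import quote

REQUIRED = (
    "POSTGRES_HOST",
    "POSTGRES_PORT",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DATABASE",
)


def postgres_dsn(env=os.environ) -> str:
    """Build a postgresql:// URI from the variables listed above."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"missing required variables: {', '.join(missing)}")

    # Percent-encode credentials so characters like '@' or '/' stay unambiguous.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = quote(env["POSTGRES_DATABASE"], safe="")
    port = int(env["POSTGRES_PORT"])  # fails early on a non-numeric port
    sslmode = env.get("POSTGRES_SSL_MODE", "prefer")

    return (
        f"postgresql://{user}:{password}@{env['POSTGRES_HOST']}:{port}"
        f"/{database}?sslmode={sslmode}"
    )
```

The resulting string can be passed to `psql` directly, which gives a quick way to confirm that the values LightRAG will read actually reach the database.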

Schema Overview

-- Documents table
CREATE TABLE LIGHTRAG_DOC_FULL (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    doc_name VARCHAR(1024),
    content TEXT,
    meta JSONB,
    createtime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    updatetime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Chunks table  
CREATE TABLE LIGHTRAG_DOC_CHUNKS (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    full_doc_id VARCHAR(255),
    chunk_order_index INT,
    tokens INT,
    content TEXT,
    content_summary TEXT,
    file_path VARCHAR(32768),
    create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Entity vectors (pgvector extension required)
CREATE TABLE LIGHTRAG_VDB_ENTITY (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    entity_name VARCHAR(1024),
    content TEXT,
    content_vector VECTOR(1024),  -- Adjust dimension to match embedding
    source_id TEXT,
    file_path TEXT,
    create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Graph nodes
CREATE TABLE LIGHTRAG_GRAPH_NODES (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    entity_type VARCHAR(255),
    description TEXT,
    source_id TEXT,
    file_path TEXT,
    created_at INT,
    PRIMARY KEY (workspace, id)
);

-- Graph edges
CREATE TABLE LIGHTRAG_GRAPH_EDGES (
    workspace VARCHAR(1024) NOT NULL,
    source_id VARCHAR(255) NOT NULL,
    target_id VARCHAR(255) NOT NULL,
    weight FLOAT,
    description TEXT,
    keywords TEXT,
    source_chunk_id TEXT,
    file_path TEXT,
    created_at INT,
    PRIMARY KEY (workspace, source_id, target_id)
);

pgvector Index Types

-- HNSW index (recommended for accuracy)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY 
USING hnsw (content_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat index (faster to build and lighter on memory, but a weaker speed-recall tradeoff)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY 
USING ivfflat (content_vector vector_cosine_ops)
WITH (lists = 100);
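
The two statements above differ only in the access method and its `WITH` options, so they can be generated from the index settings. This is a sketch, not LightRAG's own migration code: `vector_index_ddl` is a hypothetical helper, and treating `POSTGRES_HNSW_EF` as pgvector's `ef_construction` is an assumption based on the matching default of 64.

```python
def vector_index_ddl(
    index_type: str = "hnsw",
    *,
    table: str = "LIGHTRAG_VDB_ENTITY",
    column: str = "content_vector",
    m: int = 16,
    ef_construction: int = 64,
    lists: int = 100,
) -> str:
    """Return a CREATE INDEX statement for a pgvector cosine index.

    `table` and `column` are interpolated as-is, so they must come from
    trusted configuration, never from user input.
    """
    if index_type == "hnsw":
        options = f"m = {int(m)}, ef_construction = {int(ef_construction)}"
    elif index_type == "ivfflat":
        options = f"lists = {int(lists)}"
    else:
        raise ValueError(f"unsupported index type: {index_type!r}")

    return (
        f"CREATE INDEX ON {table} "
        f"USING {index_type} ({column} vector_cosine_ops) "
        f"WITH ({options});"
    )


print(vector_index_ddl("hnsw", m=32, ef_construction=128))
```

Printing the statement before executing it makes it easy to review tuned values, such as the `m = 32` and `ef_construction = 128` pair used in the performance tuning section.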

MongoDB Backend

Complete MongoDB Setup

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # All MongoDB backends
    kv_storage="MongoKVStorage",
    vector_storage="MongoVectorDBStorage",
    graph_storage="MongoGraphStorage", 
    doc_status_storage="MongoDocStatusStorage",
)

Environment Variables

MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag

# Atlas Vector Search (optional)
MONGO_ATLAS_CLUSTER=your-cluster
MONGO_ATLAS_API_KEY=your-api-key

Collection Structure

// Documents collection
db.lightrag_doc_full.insertOne({
    _id: "workspace:doc_id",
    workspace: "default",
    doc_id: "abc123",
    doc_name: "document.txt",
    content: "Full document text...",
    meta: { source: "upload" },
    created_at: ISODate(),
    updated_at: ISODate()
});

// Entities collection (with vector)
db.lightrag_entities.insertOne({
    _id: "workspace:entity_id",
    workspace: "default",
    entity_name: "Apple Inc.",
    entity_type: "organization",
    description: "Technology company...",
    content: "Apple Inc.\nTechnology company...",
    embedding: [0.1, 0.2, ...],  // Vector embedding
    source_id: "chunk_001,chunk_002",
    file_path: "document.txt"
});

// Graph edges collection
db.lightrag_graph_edges.insertOne({
    _id: "workspace:source:target",
    workspace: "default",
    source: "Apple Inc.",
    target: "iPhone",
    weight: 3.5,
    description: "Produces the iPhone",
    keywords: "technology,smartphone"
});

Vector Search Index (Atlas)

// Create vector search index
db.lightrag_entities.createSearchIndex({
    name: "vector_index",
    definition: {
        mappings: {
            dynamic: true,
            fields: {
                embedding: {
                    type: "knnVector",
                    dimensions: 1024,
                    similarity: "cosine"
                }
            }
        }
    }
});

Neo4j Backend

Neo4j for Graph Storage

Neo4j provides native graph storage with Cypher queries:

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # Neo4j for graph, other backends for KV/Vector
    kv_storage="PGKVStorage",           # or JsonKVStorage
    vector_storage="PGVectorStorage",    # or other vector DB
    graph_storage="Neo4JStorage",        # Neo4j graph
    doc_status_storage="PGDocStatusStorage",
)

Environment Variables

NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password

# Optional
NEO4J_DATABASE=neo4j
NEO4J_ENCRYPTED=false

Graph Schema

// Entity nodes
CREATE (e:Entity {
    entity_id: "Apple Inc.",
    entity_type: "organization",
    description: "Technology company...",
    source_id: "chunk_001",
    workspace: "default"
})

// Relationship
MATCH (a:Entity {entity_id: "Apple Inc."})
MATCH (b:Entity {entity_id: "iPhone"})
CREATE (a)-[r:RELATED_TO {
    weight: 3.5,
    description: "Produces",
    keywords: "technology"
}]->(b)

Cypher Queries Used

// Get node with edges
MATCH (n:Entity {entity_id: $entity_id, workspace: $workspace})
OPTIONAL MATCH (n)-[r]-(m)
RETURN n, r, m

// Get knowledge graph (BFS)
MATCH path = (start:Entity {entity_id: $label})-[*1..3]-(connected)
WHERE start.workspace = $workspace
RETURN path
LIMIT $max_nodes

// Search nodes
MATCH (n:Entity)
WHERE n.workspace = $workspace 
  AND toLower(n.entity_id) CONTAINS toLower($query)
RETURN n.entity_id
ORDER BY n.degree DESC
LIMIT $limit
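
To show how the `$workspace`, `$query`, and `$limit` placeholders get bound, here is a sketch of running the search query from Python. It assumes a session object from the official `neo4j` driver, whose `Session.run` takes the query text as its first positional argument; for that reason the Cypher parameter named `query` is passed in a dictionary rather than as a keyword argument. `search_nodes` is a hypothetical helper, not LightRAG's implementation.

```python
SEARCH_NODES = """
MATCH (n:Entity)
WHERE n.workspace = $workspace
  AND toLower(n.entity_id) CONTAINS toLower($query)
RETURN n.entity_id AS entity_id
ORDER BY n.degree DESC
LIMIT $limit
"""


def search_nodes(session, workspace: str, query: str, limit: int = 10) -> list:
    """Return matching entity ids, scoped to one workspace.

    `session` is anything with a neo4j-style run(query, parameters) method.
    """
    parameters = {"workspace": workspace, "query": query, "limit": int(limit)}
    result = session.run(SEARCH_NODES, parameters)
    return [record["entity_id"] for record in result]
```

Passing values as parameters, instead of formatting them into the Cypher text, keeps the workspace filter intact even when the search string contains quotes.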

Redis Backend

Redis for KV and Doc Status

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    kv_storage="RedisKVStorage",
    vector_storage="NanoVectorDBStorage",  # LightRAG has no Redis vector backend
    graph_storage="NetworkXStorage",       # LightRAG has no Redis graph backend
    doc_status_storage="RedisDocStatusStorage",
)

Environment Variables

REDIS_URI=redis://localhost:6379
# or with auth
REDIS_URI=redis://user:password@localhost:6379/0

Key Structure

# Document storage
lightrag:{workspace}:full_docs:{doc_id} -> JSON document

# Chunks storage
lightrag:{workspace}:text_chunks:{chunk_id} -> JSON chunk

# LLM cache
lightrag:{workspace}:llm_cache:{cache_key} -> JSON response

# Document status
lightrag:{workspace}:doc_status:{doc_id} -> JSON status
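
The key layout above is a fixed prefix, the workspace, a namespace, and the item id, joined with colons. A minimal sketch of that composition follows; `redis_key` and `scan_pattern` are hypothetical helpers for inspecting or cleaning up keys, not functions exported by LightRAG.

```python
def redis_key(workspace: str, namespace: str, item_id: str) -> str:
    """Compose a key following the lightrag:{workspace}:{namespace}:{id} layout."""
    return f"lightrag:{workspace}:{namespace}:{item_id}"


def scan_pattern(workspace: str, namespace: str) -> str:
    """Glob pattern matching every key of one namespace in one workspace.

    Assumes the workspace contains no glob characters such as '*' or '?'.
    """
    return f"lightrag:{workspace}:{namespace}:*"


print(redis_key("default", "full_docs", "doc-abc123"))
print(scan_pattern("default", "llm_cache"))
```

A pattern like the second one can be given to `SCAN ... MATCH` to list one tenant's cache entries without touching other workspaces.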

File-Based Backends

Zero-Dependency Setup

Best for development and small-scale usage:

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # All file-based (default)
    kv_storage="JsonKVStorage",
    vector_storage="NanoVectorDBStorage",
    graph_storage="NetworkXStorage",
    doc_status_storage="JsonDocStatusStorage",
)

File Structure

./rag_storage/
├── full_docs.json              # Complete documents
├── text_chunks.json            # Document chunks
├── llm_response_cache.json     # LLM cache
├── full_entities.json          # Entity metadata
├── full_relations.json         # Relation metadata
├── vdb_entities.json           # Entity vectors
├── vdb_relationships.json      # Relation vectors
├── vdb_chunks.json             # Chunk vectors
├── graph_chunk_entity_relation.graphml  # Knowledge graph
└── doc_status.json             # Processing status

NanoVectorDB Format

{
  "data": {
    "ent-abc123": {
      "__id__": "ent-abc123",
      "__vector__": [0.1, 0.2, 0.3, ...],
      "entity_name": "Apple Inc.",
      "content": "Apple Inc.\nTechnology company",
      "source_id": "chunk_001"
    }
  },
  "matrix": [[0.1, 0.2, ...], ...],
  "index_to_id": ["ent-abc123", ...]
}
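
To make the role of `__vector__` concrete, the sketch below runs a brute-force cosine search over a structure shaped like the `data` section above. It illustrates the idea only and is not NanoVectorDB's actual search code; `cosine` and `top_k` are hypothetical names, and the `better_than` default mirrors the `cosine_better_than_threshold` option shown in the configuration reference.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(store: dict, query_vector, k: int = 5, better_than: float = 0.2) -> list:
    """Return up to k (id, score) pairs scoring at least `better_than`, best first."""
    scored = [
        (cosine(record["__vector__"], query_vector), record_id)
        for record_id, record in store["data"].items()
    ]
    hits = sorted((pair for pair in scored if pair[0] >= better_than), reverse=True)
    return [(record_id, score) for score, record_id in hits[:k]]


store = {
    "data": {
        "ent-a": {"__id__": "ent-a", "__vector__": [1.0, 0.0], "entity_name": "Apple Inc."},
        "ent-b": {"__id__": "ent-b", "__vector__": [0.0, 1.0], "entity_name": "iPhone"},
    }
}
print(top_k(store, [1.0, 0.0]))
```

A linear scan like this is fine for the small datasets the file-based backend targets; the dedicated vector databases in the next section exist because it does not scale.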

Vector Databases

Milvus

from lightrag import LightRAG

rag = LightRAG(
    vector_storage="MilvusVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "host": "localhost",
        "port": 19530,
        "collection_name": "lightrag_vectors"
    }
)

# Environment variables
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_TOKEN=your_token  # For Zilliz Cloud

Qdrant

from lightrag import LightRAG

rag = LightRAG(
    vector_storage="QdrantVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "collection_name": "lightrag"
    }
)

# Environment variables
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_api_key  # Optional

FAISS

from lightrag import LightRAG

rag = LightRAG(
    vector_storage="FaissVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "index_type": "IVF_FLAT",  # or HNSW
        "nlist": 100
    }
)

Configuration Reference

Complete Environment Variables

# Storage Selection
KV_STORAGE=PGKVStorage
VECTOR_STORAGE=PGVectorStorage
GRAPH_STORAGE=PGGraphStorage
DOC_STATUS_STORAGE=PGDocStatusStorage

# PostgreSQL
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secret
POSTGRES_DATABASE=lightrag
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer

# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag

# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password

# Redis
REDIS_URI=redis://localhost:6379

# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530

# Qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=

# Memgraph
MEMGRAPH_URI=bolt://localhost:7687

Programmatic Configuration

from lightrag import LightRAG

rag = LightRAG(
    # Working directory
    working_dir="./rag_storage",
    workspace="my_project",  # Multi-tenant namespace
    
    # Storage backends
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="PGGraphStorage",
    doc_status_storage="PGDocStatusStorage",
    
    # Vector DB options
    vector_db_storage_cls_kwargs={
        "cosine_better_than_threshold": 0.2,
        # Backend-specific options...
    },
    
    # Processing
    chunk_token_size=1200,
    chunk_overlap_token_size=100,
)
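
The two processing parameters describe a sliding window: each chunk holds up to `chunk_token_size` tokens, and consecutive chunks share `chunk_overlap_token_size` tokens. The sketch below shows that arithmetic on a plain list of tokens. It illustrates what the parameters mean and is not LightRAG's chunker; `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens, chunk_token_size=1200, chunk_overlap_token_size=100):
    """Split tokens into windows of `chunk_token_size` that overlap by
    `chunk_overlap_token_size` tokens."""
    if not 0 <= chunk_overlap_token_size < chunk_token_size:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")

    step = chunk_token_size - chunk_overlap_token_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_token_size])
        if start + chunk_token_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# A 2500-token document with the values above yields windows starting
# at 0, 1100, and 2200.
print([len(c) for c in chunk_tokens(list(range(2500)))])
```

With the defaults, each new chunk advances by 1100 tokens, so a larger overlap increases the number of chunks and therefore the number of embedding and extraction calls.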

Multi-Tenant Data Isolation

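All storage backends scope their data by workspace, which is what provides multi-tenant isolation (file-based backends offer it only in a limited form; see the feature matrix):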

# Workspace creates isolated namespace
rag = LightRAG(
    working_dir="./rag_storage",
    workspace="tenant_a:kb_prod",  # Composite namespace
)

# Or with explicit tenant context
from lightrag.tenant_rag_manager import TenantRAGManager

manager = TenantRAGManager(
    base_working_dir="./rag_storage",
    tenant_service=tenant_service,
    template_rag=template_rag,
)

# Get tenant-specific instance
rag = await manager.get_rag_instance(
    tenant_id="tenant_a",
    kb_id="kb_prod"
)

Isolation Pattern

┌─────────────────────────────────────────────────────────────┐
│              Multi-Tenant Data Isolation                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PostgreSQL: WHERE workspace = 'tenant_a:kb_prod:default'  │
│                                                             │
│  MongoDB: { workspace: "tenant_a:kb_prod:default" }        │
│                                                             │
│  Redis: lightrag:tenant_a:kb_prod:default:{key}            │
│                                                             │
│  Neo4j: MATCH (n {workspace: $workspace})                  │
│                                                             │
│  File: ./rag_storage/tenant_a:kb_prod/                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
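
Every pattern in the diagram depends on one composite workspace string, so it pays to build that string in a single place. The sketch below assumes the `tenant:kb:workspace` layout shown in the diagram; `compose_workspace` is a hypothetical helper, not the code used by `TenantRAGManager`.

```python
def compose_workspace(tenant_id: str, kb_id: str, workspace: str = "default") -> str:
    """Join tenant, knowledge base, and workspace into one namespace string.

    Rejects empty parts and parts containing ':' so that two different
    (tenant, kb) pairs can never produce the same composite value.
    """
    parts = (tenant_id, kb_id, workspace)
    for part in parts:
        if not part or ":" in part:
            raise ValueError(f"invalid namespace component: {part!r}")
    return ":".join(parts)


print(compose_workspace("tenant_a", "kb_prod"))
```

Rejecting the separator inside components matters for isolation: without it, tenant `a:b` with knowledge base `c` and tenant `a` with knowledge base `b:c` would share a workspace.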

Migration Between Backends

Export/Import Pattern

# Export from source
source_rag = LightRAG(
    kv_storage="JsonKVStorage",
    vector_storage="NanoVectorDBStorage",
    graph_storage="NetworkXStorage",
)

# Initialize source
await source_rag.initialize_storages()

# Get all data
docs = await source_rag.full_docs.get_all()
chunks = await source_rag.text_chunks.get_all()
# ... export other data

# Release the source storages once the data is in memory
await source_rag.finalize_storages()

# Import to target
target_rag = LightRAG(
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="PGGraphStorage",
)

await target_rag.initialize_storages()
await target_rag.full_docs.upsert(docs)
await target_rag.text_chunks.upsert(chunks)
# ... import other data
await target_rag.finalize_storages()
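
For large datasets, writing everything in a single `upsert` call can be heavy on memory and on the target database. The sketch below splits the import into batches. It assumes that the exported data is a dictionary mapping ids to records and that the target exposes an async `upsert(dict)` method, as in the pattern above; `batched` and `copy_kv` are hypothetical helpers.

```python
import asyncio
from itertools import islice


def batched(records: dict, size: int):
    """Yield successive sub-dictionaries with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records.items())
    while batch := dict(islice(iterator, size)):
        yield batch


async def copy_kv(records: dict, target, batch_size: int = 500) -> int:
    """Upsert `records` into `target` in batches; return the number written."""
    written = 0
    for batch in batched(records, batch_size):
        await target.upsert(batch)
        written += len(batch)
    return written
```

In the migration pattern above, `await copy_kv(docs, target_rag.full_docs)` would take the place of the single `upsert(docs)` call.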

Best Practices

Production Recommendations

  1. Use the full PostgreSQL stack ("PG Full" in the feature matrix) for simplicity and reliability
  2. Enable connection pooling for high concurrency
  3. Create indexes on frequently queried columns
  4. Monitor storage growth and plan capacity
  5. Schedule regular backups with point-in-time recovery
  6. Use SSL/TLS for database connections

Performance Tuning

# PostgreSQL tuning
POSTGRES_MAX_CONNECTIONS=200
POSTGRES_VECTOR_INDEX_TYPE=hnsw
POSTGRES_HNSW_M=32
POSTGRES_HNSW_EF=128

# LightRAG tuning
MAX_PARALLEL_INSERT=4
EMBEDDING_BATCH_NUM=20
MAX_ASYNC=8

Version: 1.4.9.1 | License: MIT