docs: Enterprise Edition & Multi-tenancy attribution (#5)

* Remove outdated documentation files: Quick Start Guide, Apache AGE Analysis, and Scratchpad.

* Add multi-tenant testing strategy and ADR index documentation

- Introduced ADR 008 detailing the multi-tenant testing strategy for the ./starter environment, covering compatibility and multi-tenant modes, testing scenarios, and implementation details.
- Created a comprehensive ADR index (README.md) summarizing all architecture decision records related to the multi-tenant implementation, including purpose, key sections, and reading paths for different roles.

* feat(docs): Add comprehensive multi-tenancy guide and README for LightRAG Enterprise

- Introduced `0008-multi-tenancy.md` detailing multi-tenancy architecture, key concepts, roles, permissions, configuration, and API endpoints.
- Created `README.md` as the main documentation index, outlining features, quick start, system overview, and deployment options.
- Documented the LightRAG architecture, storage backends, LLM integrations, and query modes.
- Established a task log (`2025-01-21-lightrag-documentation-log.md`) summarizing documentation creation actions, decisions, and insights.

2025-12-04 18:09:15 +08:00


LightRAG Storage Backends

Complete guide to storage backend configuration and implementation

Version: 1.4.9.1 | Last Updated: December 2025

Table of Contents

- Overview
- Storage Types
- Backend Comparison
- PostgreSQL Backend
- MongoDB Backend
- Neo4j Backend
- Redis Backend
- File-Based Backends
- Vector Databases
- Configuration Reference

Overview

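LightRAG uses four types of storage (key-value, vector, graph, and document status), each with multiple backend options. The diagram shows the three data stores; document status storage is described under Storage Types: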

┌─────────────────────────────────────────────────────────────────────────┐
│                      Storage Architecture                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    LightRAG Core                                 │   │
│  └───────────────────────────┬─────────────────────────────────────┘   │
│                              │                                          │
│          ┌───────────────────┼───────────────────┐                     │
│          │                   │                   │                     │
│          ▼                   ▼                   ▼                     │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐            │
│  │   KV Storage  │   │ Vector Store  │   │  Graph Store  │            │
│  │   (Documents, │   │  (Embeddings) │   │   (KG Nodes   │            │
│  │    Chunks,    │   │               │   │    & Edges)   │            │
│  │    Cache)     │   │               │   │               │            │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘            │
│          │                   │                   │                     │
│          ▼                   ▼                   ▼                     │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                   Backend Implementations                        │   │
│  │                                                                  │   │
│  │  PostgreSQL │ MongoDB │ Redis │ Neo4j │ Milvus │ Qdrant │ FAISS │   │
│  │  JSON/File  │ NetworkX │ NanoVectorDB │ Memgraph │ ...           │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Storage Types

1. Key-Value Storage (`BaseKVStorage`)

Stores documents, chunks, and LLM cache.

| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `JsonKVStorage` | File-based JSON | Development, single-node |
| `PGKVStorage` | PostgreSQL tables | Production, multi-node |
| `MongoKVStorage` | MongoDB collections | Production, flexible schema |
| `RedisKVStorage` | Redis hash maps | High-performance caching |

2. Vector Storage (`BaseVectorStorage`)

Stores and queries embedding vectors.

| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `NanoVectorDBStorage` | In-memory, file-persisted | Development, small datasets |
| `PGVectorStorage` | PostgreSQL + pgvector | Production, unified DB |
| `MilvusVectorDBStorage` | Milvus vector DB | Large-scale production |
| `QdrantVectorDBStorage` | Qdrant vector DB | Cloud-native production |
| `FaissVectorDBStorage` | FAISS index | Local high-performance |
| `MongoVectorDBStorage` | MongoDB Atlas Vector | MongoDB ecosystem |

3. Graph Storage (`BaseGraphStorage`)

Stores knowledge graph nodes and edges.

| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `NetworkXStorage` | In-memory NetworkX | Development, small graphs |
| `PGGraphStorage` | PostgreSQL tables | Production, unified DB |
| `Neo4JStorage` | Native graph DB | Complex graph queries |
| `MemgraphStorage` | In-memory graph DB | Real-time analytics |
| `MongoGraphStorage` | MongoDB documents | Document-graph hybrid |

4. Document Status Storage (`DocStatusStorage`)

Tracks document processing status.

| Implementation | Description | Use Case |
|----------------|-------------|----------|
| `JsonDocStatusStorage` | File-based JSON | Development |
| `PGDocStatusStorage` | PostgreSQL | Production |
| `MongoDocStatusStorage` | MongoDB | Production |
| `RedisDocStatusStorage` | Redis | Distributed |
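
Because each of the four storage slots accepts only its own family of implementations, a small check can catch a misconfigured combination before anything connects to a database. The sketch below is illustrative only: the implementation names are copied from the four tables above, while `STORAGE_IMPLEMENTATIONS` and `check_storage_selection` are hypothetical helpers, not part of LightRAG.

```python
# Implementation names copied from the four tables above.
# The helper names are hypothetical and not part of LightRAG.
STORAGE_IMPLEMENTATIONS = {
    "kv_storage": {
        "JsonKVStorage", "PGKVStorage", "MongoKVStorage", "RedisKVStorage",
    },
    "vector_storage": {
        "NanoVectorDBStorage", "PGVectorStorage", "MilvusVectorDBStorage",
        "QdrantVectorDBStorage", "FaissVectorDBStorage", "MongoVectorDBStorage",
    },
    "graph_storage": {
        "NetworkXStorage", "PGGraphStorage", "Neo4JStorage",
        "MemgraphStorage", "MongoGraphStorage",
    },
    "doc_status_storage": {
        "JsonDocStatusStorage", "PGDocStatusStorage",
        "MongoDocStatusStorage", "RedisDocStatusStorage",
    },
}


def check_storage_selection(selection: dict) -> list:
    """Return a list of problems; an empty list means every slot is valid."""
    problems = []
    for slot, allowed in STORAGE_IMPLEMENTATIONS.items():
        name = selection.get(slot)
        if name is None:
            problems.append(f"{slot}: not set")
        elif name not in allowed:
            problems.append(f"{slot}: {name!r} is not a known implementation")
    return problems
```

Passing a graph class such as `Neo4JStorage` in the `vector_storage` slot, for example, is reported as a problem instead of failing later at startup.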

Backend Comparison

Feature Matrix

┌────────────────────┬─────────┬────────┬───────┬────────┬───────────┐
│     Feature        │ PG Full │ Mongo  │ Neo4j │ Mixed  │ File-Only │
├────────────────────┼─────────┼────────┼───────┼────────┼───────────┤
│ KV Storage         │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
│ Vector Storage     │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
│ Graph Storage      │    ✅   │   ✅   │   ✅  │   ✅   │    ✅     │
│ Doc Status         │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
│ Multi-tenant       │    ✅   │   ✅   │   ✅  │   ✅   │    ⚠️    │
│ Horizontal Scale   │    ✅   │   ✅   │   ✅  │   ✅   │    ❌     │
│ ACID Transactions  │    ✅   │   ⚠️   │   ✅  │   ⚠️   │    ❌     │
│ Zero Dependencies  │    ❌   │   ❌   │   ❌  │   ❌   │    ✅     │
│ Graph Queries      │    ⚠️   │   ⚠️   │   ✅  │   ✅   │    ⚠️    │
│ Vector Search      │    ✅   │   ✅   │   ❌  │   ✅   │    ✅     │
└────────────────────┴─────────┴────────┴───────┴────────┴───────────┘

Legend: ✅ Full support  ⚠️ Limited  ❌ Not supported

Performance Characteristics

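| Backend | Write Speed | Read Speed | Memory Usage | Disk Usage |
|---------|-------------|------------|--------------|------------|
| PostgreSQL Full | Fast | Fast | Medium | Compact |
| MongoDB Full | Fast | Fast | Medium | Medium |
| Neo4j + Vector | Slow | Fast (graph) | High | Medium |
| File-based | Slow | Medium | Low | Compact |
| Milvus/Qdrant | Fast | Very Fast | High | Large |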

PostgreSQL Backend

Complete PostgreSQL Setup

PostgreSQL can serve all four storage types from a single database, which makes it the recommended production setup:

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # All PostgreSQL backends
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage", 
    graph_storage="PGGraphStorage",
    doc_status_storage="PGDocStatusStorage",
)

Environment Variables

# Required
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password
POSTGRES_DATABASE=lightrag

# Optional
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer           # disable|allow|prefer|require|verify-ca|verify-full
POSTGRES_SSL_CERT=/path/to/cert
POSTGRES_SSL_KEY=/path/to/key
POSTGRES_SSL_ROOT_CERT=/path/to/ca

# Vector index configuration
POSTGRES_VECTOR_INDEX_TYPE=hnsw    # hnsw|ivfflat
POSTGRES_HNSW_M=16
POSTGRES_HNSW_EF=64
POSTGRES_IVFFLAT_LISTS=100
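
To make the role of these variables concrete, here is a minimal sketch that assembles a PostgreSQL connection URI from them. `postgres_dsn` is a hypothetical helper, and the fallback values are simply the example values shown above, not confirmed LightRAG defaults. The password is percent-encoded so that characters such as `@` or `/` do not break the URI.

```python
import os
from urllib.parse import quote


def postgres_dsn(env=None) -> str:
    """Build a connection URI from the POSTGRES_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Fallbacks mirror the example values above; they are assumptions.
    user = quote(env.get("POSTGRES_USER", "postgres"), safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")  # required, no fallback
    host = env.get("POSTGRES_HOST", "localhost")
    port = int(env.get("POSTGRES_PORT", "5432"))
    database = quote(env.get("POSTGRES_DATABASE", "lightrag"), safe="")
    sslmode = env.get("POSTGRES_SSL_MODE", "prefer")
    return (
        f"postgresql://{user}:{password}@{host}:{port}/{database}"
        f"?sslmode={sslmode}"
    )
```

Reading from an explicit mapping rather than `os.environ` keeps the function easy to test and makes the required variable (`POSTGRES_PASSWORD`) fail loudly with a `KeyError` when it is missing.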

Schema Overview

-- Documents table
CREATE TABLE LIGHTRAG_DOC_FULL (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    doc_name VARCHAR(1024),
    content TEXT,
    meta JSONB,
    createtime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    updatetime TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Chunks table  
CREATE TABLE LIGHTRAG_DOC_CHUNKS (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    full_doc_id VARCHAR(255),
    chunk_order_index INT,
    tokens INT,
    content TEXT,
    content_summary TEXT,
    file_path VARCHAR(32768),
    create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Entity vectors (pgvector extension required)
CREATE TABLE LIGHTRAG_VDB_ENTITY (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    entity_name VARCHAR(1024),
    content TEXT,
    content_vector VECTOR(1024),  -- Adjust dimension to match embedding
    source_id TEXT,
    file_path TEXT,
    create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (workspace, id)
);

-- Graph nodes
CREATE TABLE LIGHTRAG_GRAPH_NODES (
    workspace VARCHAR(1024) NOT NULL,
    id VARCHAR(255) NOT NULL,
    entity_type VARCHAR(255),
    description TEXT,
    source_id TEXT,
    file_path TEXT,
    created_at INT,
    PRIMARY KEY (workspace, id)
);

-- Graph edges
CREATE TABLE LIGHTRAG_GRAPH_EDGES (
    workspace VARCHAR(1024) NOT NULL,
    source_id VARCHAR(255) NOT NULL,
    target_id VARCHAR(255) NOT NULL,
    weight FLOAT,
    description TEXT,
    keywords TEXT,
    source_chunk_id TEXT,
    file_path TEXT,
    created_at INT,
    PRIMARY KEY (workspace, source_id, target_id)
);

pgvector Index Types

-- HNSW index (recommended for accuracy)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY 
USING hnsw (content_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat index (faster to build and lighter on memory, but lower recall at comparable query speed)
CREATE INDEX ON LIGHTRAG_VDB_ENTITY 
USING ivfflat (content_vector vector_cosine_ops)
WITH (lists = 100);
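
The two statements above line up with the vector index variables from the Environment Variables section. The sketch below shows one plausible mapping from those variables to the `CREATE INDEX` statement. It is an assumption that `POSTGRES_HNSW_EF` corresponds to `ef_construction`, and `vector_index_sql` is a hypothetical helper rather than the code LightRAG runs.

```python
def vector_index_sql(env: dict, table: str = "LIGHTRAG_VDB_ENTITY") -> str:
    """Render a pgvector CREATE INDEX statement from the index variables."""
    index_type = env.get("POSTGRES_VECTOR_INDEX_TYPE", "hnsw").lower()
    if index_type == "hnsw":
        m = int(env.get("POSTGRES_HNSW_M", "16"))
        # Assumption: POSTGRES_HNSW_EF maps to pgvector's ef_construction.
        ef_construction = int(env.get("POSTGRES_HNSW_EF", "64"))
        options = f"m = {m}, ef_construction = {ef_construction}"
    elif index_type == "ivfflat":
        lists = int(env.get("POSTGRES_IVFFLAT_LISTS", "100"))
        options = f"lists = {lists}"
    else:
        raise ValueError(f"unsupported index type: {index_type!r}")
    return (
        f"CREATE INDEX ON {table} "
        f"USING {index_type} (content_vector vector_cosine_ops) "
        f"WITH ({options});"
    )
```

Converting each option with `int()` rejects malformed values early and keeps arbitrary text out of the generated SQL.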

MongoDB Backend

Complete MongoDB Setup

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # All MongoDB backends
    kv_storage="MongoKVStorage",
    vector_storage="MongoVectorDBStorage",
    graph_storage="MongoGraphStorage", 
    doc_status_storage="MongoDocStatusStorage",
)

Environment Variables

MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag

# Atlas Vector Search (optional)
MONGO_ATLAS_CLUSTER=your-cluster
MONGO_ATLAS_API_KEY=your-api-key

Collection Structure

// Documents collection
db.lightrag_doc_full.insertOne({
    _id: "workspace:doc_id",
    workspace: "default",
    doc_id: "abc123",
    doc_name: "document.txt",
    content: "Full document text...",
    meta: { source: "upload" },
    created_at: ISODate(),
    updated_at: ISODate()
});

// Entities collection (with vector)
db.lightrag_entities.insertOne({
    _id: "workspace:entity_id",
    workspace: "default",
    entity_name: "Apple Inc.",
    entity_type: "organization",
    description: "Technology company...",
    content: "Apple Inc.\nTechnology company...",
    embedding: [0.1, 0.2, ...],  // Vector embedding
    source_id: "chunk_001,chunk_002",
    file_path: "document.txt"
});

// Graph edges collection
db.lightrag_graph_edges.insertOne({
    _id: "workspace:source:target",
    workspace: "default",
    source: "Apple Inc.",
    target: "iPhone",
    weight: 3.5,
    description: "Produces the iPhone",
    keywords: "technology,smartphone"
});
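
The `_id` placeholders above (`"workspace:doc_id"`, `"workspace:source:target"`) read as colon-joined composite keys. Assuming that interpretation, the following sketch builds equivalent documents as plain Python dictionaries, ready to hand to a MongoDB driver. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def composite_id(*parts: str) -> str:
    """Join key parts with ':' as in the `_id` placeholders shown above."""
    return ":".join(parts)


def doc_full_record(workspace, doc_id, doc_name, content, meta=None):
    """Build a document shaped like the lightrag_doc_full example."""
    now = datetime.now(timezone.utc)
    return {
        "_id": composite_id(workspace, doc_id),
        "workspace": workspace,
        "doc_id": doc_id,
        "doc_name": doc_name,
        "content": content,
        "meta": meta or {},
        "created_at": now,
        "updated_at": now,
    }


def edge_record(workspace, source, target, weight, description, keywords):
    """Build a document shaped like the lightrag_graph_edges example."""
    return {
        "_id": composite_id(workspace, source, target),
        "workspace": workspace,
        "source": source,
        "target": target,
        "weight": weight,
        "description": description,
        "keywords": keywords,
    }
```

Prefixing `_id` with the workspace means the same `doc_id` can exist in two workspaces without colliding on MongoDB's unique `_id` index.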

Vector Search Index (Atlas)

// Create vector search index
db.lightrag_entities.createSearchIndex({
    name: "vector_index",
    definition: {
        mappings: {
            dynamic: true,
            fields: {
                embedding: {
                    type: "knnVector",
                    dimensions: 1024,
                    similarity: "cosine"
                }
            }
        }
    }
});

Neo4j Backend

Neo4j for Graph Storage

Neo4j provides native graph storage with Cypher queries:

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # Neo4j for graph, other backends for KV/Vector
    kv_storage="PGKVStorage",           # or JsonKVStorage
    vector_storage="PGVectorStorage",    # or other vector DB
    graph_storage="Neo4JStorage",        # Neo4j graph
    doc_status_storage="PGDocStatusStorage",
)

Environment Variables

NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password

# Optional
NEO4J_DATABASE=neo4j
NEO4J_ENCRYPTED=false

Graph Schema

// Entity nodes
CREATE (e:Entity {
    entity_id: "Apple Inc.",
    entity_type: "organization",
    description: "Technology company...",
    source_id: "chunk_001",
    workspace: "default"
})

// Relationship
MATCH (a:Entity {entity_id: "Apple Inc."})
MATCH (b:Entity {entity_id: "iPhone"})
CREATE (a)-[r:RELATED_TO {
    weight: 3.5,
    description: "Produces",
    keywords: "technology"
}]->(b)

Cypher Queries Used

// Get node with edges
MATCH (n:Entity {entity_id: $entity_id, workspace: $workspace})
OPTIONAL MATCH (n)-[r]-(m)
RETURN n, r, m

// Get knowledge graph (BFS)
MATCH path = (start:Entity {entity_id: $label})-[*1..3]-(connected)
WHERE start.workspace = $workspace
RETURN path
LIMIT $max_nodes

// Search nodes
MATCH (n:Entity)
WHERE n.workspace = $workspace 
  AND toLower(n.entity_id) CONTAINS toLower($query)
RETURN n.entity_id
ORDER BY n.degree DESC
LIMIT $limit
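
The knowledge-graph query above expands up to three hops from a start entity. For readers less familiar with Cypher, the same idea can be sketched as a plain breadth-first search over an adjacency dictionary. This is an illustration of the traversal, not LightRAG's implementation: `bfs_subgraph` is hypothetical, and unlike the Cypher `LIMIT`, which caps returned paths, this sketch caps the number of visited nodes.

```python
from collections import deque


def bfs_subgraph(adjacency, start, max_depth=3, max_nodes=100):
    """Return nodes reachable from `start` within `max_depth` hops, in BFS order."""
    if start not in adjacency:
        return []
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_nodes:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append((neighbor, depth + 1))
                if len(visited) >= max_nodes:
                    break
    return visited
```

With a small made-up graph, the hop limit and the node cap behave as follows:

```python
graph = {
    "Apple Inc.": ["iPhone", "Tim Cook"],
    "iPhone": ["Apple Inc.", "iOS"],
    "Tim Cook": ["Apple Inc."],
    "iOS": ["iPhone", "Swift"],
    "Swift": ["iOS", "LLVM"],
    "LLVM": ["Swift"],
}
print(bfs_subgraph(graph, "Apple Inc."))
# ['Apple Inc.', 'iPhone', 'Tim Cook', 'iOS', 'Swift']  ("LLVM" is four hops away)
print(bfs_subgraph(graph, "Apple Inc.", max_nodes=3))
# ['Apple Inc.', 'iPhone', 'Tim Cook']
```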

Redis Backend

Redis for KV and Doc Status

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    kv_storage="RedisKVStorage",
    vector_storage="NanoVectorDBStorage",  # no Redis vector implementation
    graph_storage="NetworkXStorage",       # no Redis graph implementation
    doc_status_storage="RedisDocStatusStorage",
)

Environment Variables

REDIS_URI=redis://localhost:6379
# or with auth
REDIS_URI=redis://user:password@localhost:6379/0

Key Structure

# Document storage
lightrag:{workspace}:full_docs:{doc_id} -> JSON document

# Chunks storage
lightrag:{workspace}:text_chunks:{chunk_id} -> JSON chunk

# LLM cache
lightrag:{workspace}:llm_cache:{cache_key} -> JSON response

# Document status
lightrag:{workspace}:doc_status:{doc_id} -> JSON status
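
The key layout above follows one pattern: a fixed `lightrag` prefix, the workspace, a namespace, and the item id. A minimal sketch of helpers that produce such keys is shown below. `redis_key` and `scan_pattern` are hypothetical names, and the namespace list is taken from the four examples above.

```python
# Namespaces taken from the key examples above.
NAMESPACES = {"full_docs", "text_chunks", "llm_cache", "doc_status"}


def redis_key(workspace: str, namespace: str, item_id: str) -> str:
    """Build a key of the form lightrag:{workspace}:{namespace}:{item_id}."""
    if namespace not in NAMESPACES:
        raise ValueError(f"unknown namespace: {namespace!r}")
    return f"lightrag:{workspace}:{namespace}:{item_id}"


def scan_pattern(workspace: str, namespace: str) -> str:
    """Build a glob pattern matching every key in one workspace namespace."""
    if namespace not in NAMESPACES:
        raise ValueError(f"unknown namespace: {namespace!r}")
    return f"lightrag:{workspace}:{namespace}:*"
```

A pattern such as `scan_pattern("default", "doc_status")` can be passed to an incremental `SCAN ... MATCH` to enumerate one workspace's entries without touching keys that belong to other workspaces.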

File-Based Backends

Zero-Dependency Setup

Best for development and small-scale usage:

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    
    # All file-based (default)
    kv_storage="JsonKVStorage",
    vector_storage="NanoVectorDBStorage",
    graph_storage="NetworkXStorage",
    doc_status_storage="JsonDocStatusStorage",
)

File Structure

./rag_storage/
├── full_docs.json              # Complete documents
├── text_chunks.json            # Document chunks
├── llm_response_cache.json     # LLM cache
├── full_entities.json          # Entity metadata
├── full_relations.json         # Relation metadata
├── vdb_entities.json           # Entity vectors
├── vdb_relationships.json      # Relation vectors
├── vdb_chunks.json             # Chunk vectors
├── graph_chunk_entity_relation.graphml  # Knowledge graph
└── doc_status.json             # Processing status
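
When debugging a file-based deployment it can help to see which of these files exist in a working directory. The sketch below lists the ones that are absent. The file names are copied from the tree above; `missing_storage_files` is a hypothetical helper, and a missing file may simply mean that nothing has been written to that store yet.

```python
from pathlib import Path

# File names copied from the directory tree above.
EXPECTED_FILES = [
    "full_docs.json",
    "text_chunks.json",
    "llm_response_cache.json",
    "full_entities.json",
    "full_relations.json",
    "vdb_entities.json",
    "vdb_relationships.json",
    "vdb_chunks.json",
    "graph_chunk_entity_relation.graphml",
    "doc_status.json",
]


def missing_storage_files(working_dir) -> list:
    """Return the expected storage files that are absent from `working_dir`."""
    root = Path(working_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```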

NanoVectorDB Format

{
  "data": {
    "ent-abc123": {
      "__id__": "ent-abc123",
      "__vector__": [0.1, 0.2, 0.3, ...],
      "entity_name": "Apple Inc.",
      "content": "Apple Inc.\nTechnology company",
      "source_id": "chunk_001"
    }
  },
  "matrix": [[0.1, 0.2, ...], ...],
  "index_to_id": ["ent-abc123", ...]
}
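
To show how records in this shape can be queried, here is a brute-force cosine-similarity search over the `data` mapping. It is a sketch of the general technique, not NanoVectorDB's actual code. `cosine` and `top_k` are hypothetical helpers, and the `0.2` threshold mirrors the `cosine_better_than_threshold` value used elsewhere in this guide.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(store, query, k=5, threshold=0.2):
    """Return up to k (id, score) pairs with score >= threshold, best first."""
    scored = [
        (cosine(record["__vector__"], query), record_id)
        for record_id, record in store["data"].items()
    ]
    hits = sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
    return [(record_id, round(score, 4)) for score, record_id in hits[:k]]
```

A tiny two-dimensional example:

```python
store = {
    "data": {
        "ent-a": {"__id__": "ent-a", "__vector__": [1.0, 0.0]},
        "ent-b": {"__id__": "ent-b", "__vector__": [0.0, 1.0]},
        "ent-c": {"__id__": "ent-c", "__vector__": [1.0, 1.0]},
    }
}
print(top_k(store, [1.0, 0.0]))
# [('ent-a', 1.0), ('ent-c', 0.7071)]  ("ent-b" scores 0.0 and is filtered out)
```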

Vector Databases

Milvus

from lightrag import LightRAG

rag = LightRAG(
    vector_storage="MilvusVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "host": "localhost",
        "port": 19530,
        "collection_name": "lightrag_vectors"
    }
)

# Environment variables
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_TOKEN=your_token  # For Zilliz Cloud

Qdrant

from lightrag import LightRAG

rag = LightRAG(
    vector_storage="QdrantVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "collection_name": "lightrag"
    }
)

# Environment variables
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_api_key  # Optional

FAISS

from lightrag import LightRAG

rag = LightRAG(
    vector_storage="FaissVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "index_type": "IVF_FLAT",  # or HNSW
        "nlist": 100
    }
)

Configuration Reference

Complete Environment Variables

# Storage Selection
KV_STORAGE=PGKVStorage
VECTOR_STORAGE=PGVectorStorage
GRAPH_STORAGE=PGGraphStorage
DOC_STATUS_STORAGE=PGDocStatusStorage

# PostgreSQL
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secret
POSTGRES_DATABASE=lightrag
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SSL_MODE=prefer

# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lightrag

# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password

# Redis
REDIS_URI=redis://localhost:6379

# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530

# Qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=

# Memgraph
MEMGRAPH_URI=bolt://localhost:7687
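
The four storage-selection variables correspond one-to-one to the `LightRAG` keyword arguments used throughout this guide. The sketch below reads them from the environment and falls back to the file-based classes, which the File-Based Backends section labels as the default. `storage_kwargs` is a hypothetical helper, not a LightRAG function.

```python
import os

# Fallbacks are the file-based classes described as the default in this guide.
DEFAULT_STORAGE = {
    "KV_STORAGE": "JsonKVStorage",
    "VECTOR_STORAGE": "NanoVectorDBStorage",
    "GRAPH_STORAGE": "NetworkXStorage",
    "DOC_STATUS_STORAGE": "JsonDocStatusStorage",
}


def storage_kwargs(env=None) -> dict:
    """Map the *_STORAGE variables to LightRAG-style keyword arguments."""
    env = os.environ if env is None else env
    return {
        variable.lower(): env.get(variable, default)
        for variable, default in DEFAULT_STORAGE.items()
    }
```

The result can be unpacked into the constructor, for example `LightRAG(working_dir="./rag_storage", **storage_kwargs())`.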

Programmatic Configuration

from lightrag import LightRAG

rag = LightRAG(
    # Working directory
    working_dir="./rag_storage",
    workspace="my_project",  # Multi-tenant namespace
    
    # Storage backends
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="PGGraphStorage",
    doc_status_storage="PGDocStatusStorage",
    
    # Vector DB options
    vector_db_storage_cls_kwargs={
        "cosine_better_than_threshold": 0.2,
        # Backend-specific options...
    },
    
    # Processing
    chunk_token_size=1200,
    chunk_overlap_token_size=100,
)
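
The two processing options interact: each chunk holds up to `chunk_token_size` tokens, and consecutive chunks share `chunk_overlap_token_size` tokens. A minimal sliding-window sketch of that behaviour is shown below, operating on an already tokenized list. This illustrates the general technique under that assumption; `chunk_tokens` is hypothetical and LightRAG's own chunker may differ in detail.

```python
def chunk_tokens(tokens, chunk_token_size=1200, chunk_overlap_token_size=100):
    """Split a token list into overlapping windows."""
    if not 0 <= chunk_overlap_token_size < chunk_token_size:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    # Each window starts this many tokens after the previous one.
    step = chunk_token_size - chunk_overlap_token_size
    return [
        tokens[start:start + chunk_token_size]
        for start in range(0, len(tokens), step)
    ]
```

With the values above, a 2,500-token document yields windows starting at tokens 0, 1100, and 2200, so every boundary is covered by 100 shared tokens.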

Multi-Tenant Data Isolation

Every storage backend isolates tenants by workspace; the file-based backends do so only in limited form (see the Feature Matrix):

from lightrag import LightRAG

# The workspace creates an isolated namespace
rag = LightRAG(
    working_dir="./rag_storage",
    workspace="tenant_a:kb_prod",  # Composite namespace
)

# Or with explicit tenant context
# (tenant_service and template_rag are assumed to be created elsewhere)
from lightrag.tenant_rag_manager import TenantRAGManager

manager = TenantRAGManager(
    base_working_dir="./rag_storage",
    tenant_service=tenant_service,
    template_rag=template_rag,
)

# Get tenant-specific instance (call from inside an async function)
rag = await manager.get_rag_instance(
    tenant_id="tenant_a",
    kb_id="kb_prod"
)

Isolation Pattern

┌─────────────────────────────────────────────────────────────┐
│              Multi-Tenant Data Isolation                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PostgreSQL: WHERE workspace = 'tenant_a:kb_prod:default'  │
│                                                             │
│  MongoDB: { workspace: "tenant_a:kb_prod:default" }        │
│                                                             │
│  Redis: lightrag:tenant_a:kb_prod:default:{key}            │
│                                                             │
│  Neo4j: MATCH (n {workspace: $workspace})                  │
│                                                             │
│  File: ./rag_storage/tenant_a:kb_prod/                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
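
Every pattern in the diagram reduces to the same rule: each record carries a workspace string, and each read filters on it. The sketch below shows that rule with in-memory records. `workspace_id` and `scoped` are hypothetical helpers, and the sketch uses the two-part `tenant:kb` form from the code example above rather than the three-part form shown in the diagram.

```python
def workspace_id(tenant_id: str, kb_id: str) -> str:
    """Compose a 'tenant:kb' workspace string, rejecting ambiguous parts."""
    for part in (tenant_id, kb_id):
        if not part or ":" in part:
            raise ValueError(f"invalid workspace component: {part!r}")
    return f"{tenant_id}:{kb_id}"


def scoped(records, workspace: str) -> list:
    """Return only the records that belong to `workspace`."""
    return [record for record in records if record["workspace"] == workspace]
```

Rejecting `:` inside a component matters because the separator is part of the key: without the check, tenant `a:b` with knowledge base `c` and tenant `a` with knowledge base `b:c` would map to the same workspace.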

Migration Between Backends

Export/Import Pattern

import asyncio

from lightrag import LightRAG


async def migrate():
    # Export from source (file-based backends)
    source_rag = LightRAG(
        kv_storage="JsonKVStorage",
        vector_storage="NanoVectorDBStorage",
        graph_storage="NetworkXStorage",
    )
    await source_rag.initialize_storages()

    # Get all data
    docs = await source_rag.full_docs.get_all()
    chunks = await source_rag.text_chunks.get_all()
    # ... export other data
    await source_rag.finalize_storages()

    # Import to target (PostgreSQL backends)
    target_rag = LightRAG(
        kv_storage="PGKVStorage",
        vector_storage="PGVectorStorage",
        graph_storage="PGGraphStorage",
    )
    await target_rag.initialize_storages()
    await target_rag.full_docs.upsert(docs)
    await target_rag.text_chunks.upsert(chunks)
    # ... import other data
    await target_rag.finalize_storages()


asyncio.run(migrate())
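
For large stores, writing everything in a single `upsert` call can produce very large requests. One option is to copy in batches. The sketch below demonstrates the idea against an in-memory stand-in so it can run without any database. `MemoryKV` and `copy_in_batches` are hypothetical, and the only storage methods assumed are the `get_all()` and `upsert()` calls used in the pattern above.

```python
import asyncio


class MemoryKV:
    """In-memory stand-in exposing only get_all() and upsert()."""

    def __init__(self, data=None):
        self._data = dict(data or {})

    async def get_all(self):
        return dict(self._data)

    async def upsert(self, records):
        self._data.update(records)


async def copy_in_batches(source, target, batch_size=100) -> int:
    """Copy every record from source to target, batch_size keys at a time."""
    records = await source.get_all()
    keys = list(records)
    for start in range(0, len(keys), batch_size):
        batch = {key: records[key] for key in keys[start:start + batch_size]}
        await target.upsert(batch)
    return len(keys)
```

Because the copy uses `upsert`, an interrupted migration can be rerun from the start without duplicating records.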

Best Practices

Production Recommendations

1. Use the PostgreSQL full stack for simplicity and reliability.
2. Enable connection pooling for high concurrency.
3. Create indexes on frequently queried columns.
4. Monitor storage growth and plan capacity.
5. Schedule regular backups with point-in-time recovery.
6. Use SSL/TLS for database connections.

Performance Tuning

# PostgreSQL tuning
POSTGRES_MAX_CONNECTIONS=200
POSTGRES_VECTOR_INDEX_TYPE=hnsw
POSTGRES_HNSW_M=32
POSTGRES_HNSW_EF=128

# LightRAG tuning
MAX_PARALLEL_INSERT=4
EMBEDDING_BATCH_NUM=20
MAX_ASYNC=8
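
As a rough mental model for the LightRAG settings, `EMBEDDING_BATCH_NUM` can be read as the number of texts sent per embedding call and `MAX_ASYNC` as a cap on concurrent calls. The sketch below shows how such limits are commonly combined using batching and a semaphore. How LightRAG applies these values internally is an assumption here, and `batched`, `embed_all`, and the `embed_batch` callback are hypothetical.

```python
import asyncio


def batched(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def embed_all(texts, embed_batch, batch_size=20, max_async=8):
    """Embed texts in batches with at most `max_async` calls in flight."""
    semaphore = asyncio.Semaphore(max_async)

    async def run(batch):
        async with semaphore:
            return await embed_batch(batch)

    # gather() preserves input order, so vectors line up with texts.
    results = await asyncio.gather(*(run(b) for b in batched(texts, batch_size)))
    return [vector for batch in results for vector in batch]
```

Raising `batch_size` reduces the number of calls, while raising `max_async` increases parallelism; both are bounded in practice by the rate limits of the embedding provider and by the database connection pool.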

Version: 1.4.9.1 | License: MIT
