Removes trailing whitespace and fixes minor formatting issues in Kubernetes deployment docs, storage report, and Helm chart files. Standardizes indentation and spacing in Docker Compose and deployment shell scripts to improve readability and maintainability. These edits improve documentation clarity and make deployment scripts more robust without altering functionality. Relates to MLO-469
# LightRAG Storage Stack Configurations Report

## Executive Summary

LightRAG supports a modular storage architecture with 4 distinct storage types that can be mixed and matched:
- **Graph Storage**: Knowledge graph relationships
- **Vector Storage**: Document embeddings
- **KV Storage**: Key-value pairs and metadata
- **Document Status Storage**: Document processing status

This report analyzes 25+ storage implementations across 8 database technologies to provide recommendations for different use cases.

## Storage Architecture Overview

### Storage Types & Available Implementations

| Storage Type | Implementations | Count |
|--------------|----------------|-------|
| **Graph Storage** | NetworkXStorage, Neo4JStorage, PGGraphStorage, AGEStorage¹, MongoGraphStorage¹ | 5 |
| **Vector Storage** | NanoVectorDBStorage, MilvusVectorDBStorage, ChromaVectorDBStorage, PGVectorStorage, FaissVectorDBStorage, QdrantVectorDBStorage, MongoVectorDBStorage | 7 |
| **KV Storage** | JsonKVStorage, RedisKVStorage, PGKVStorage, MongoKVStorage | 4 |
| **Doc Status Storage** | JsonDocStatusStorage, PGDocStatusStorage, MongoDocStatusStorage | 3 |

¹ *Currently commented out in production*

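
The catalogue above can be turned into a small sanity check for a storage configuration. The following is a minimal sketch, not LightRAG's own validation logic: the implementation names are copied from the table, the setting names are the ones used in this report's configuration examples, and `validate_storage_config` is a hypothetical helper introduced only for illustration.

```python
# Implementation names copied from the table above; the keys are the setting
# names used in this report's configuration examples.
STORAGE_IMPLEMENTATIONS = {
    "LIGHTRAG_GRAPH_STORAGE": {
        "NetworkXStorage", "Neo4JStorage", "PGGraphStorage",
        "AGEStorage", "MongoGraphStorage",
    },
    "LIGHTRAG_VECTOR_STORAGE": {
        "NanoVectorDBStorage", "MilvusVectorDBStorage", "ChromaVectorDBStorage",
        "PGVectorStorage", "FaissVectorDBStorage", "QdrantVectorDBStorage",
        "MongoVectorDBStorage",
    },
    "LIGHTRAG_KV_STORAGE": {
        "JsonKVStorage", "RedisKVStorage", "PGKVStorage", "MongoKVStorage",
    },
    "LIGHTRAG_DOC_STATUS_STORAGE": {
        "JsonDocStatusStorage", "PGDocStatusStorage", "MongoDocStatusStorage",
    },
}


def validate_storage_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config is consistent."""
    problems = []
    for setting, allowed in STORAGE_IMPLEMENTATIONS.items():
        value = config.get(setting)
        if value is None:
            problems.append(f"{setting} is not set")
        elif value not in allowed:
            problems.append(f"{setting}={value} is not a known implementation")
    # Flag settings that do not correspond to any of the four storage types.
    for setting in config:
        if setting not in STORAGE_IMPLEMENTATIONS:
            problems.append(f"unknown setting {setting}")
    return problems
```

Calling `validate_storage_config` with a KV class assigned to the graph setting, for example, reports that mismatch along with any settings that are missing.
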
## Database Technology Analysis

### 1. PostgreSQL + pgvector
**Implementations**: PGVectorStorage, PGKVStorage, PGGraphStorage, PGDocStatusStorage

**Strengths:**
- ✅ **Unified Database**: Single database for all storage types
- ✅ **ACID Compliance**: Full transactional support
- ✅ **Mature Ecosystem**: Well-established, enterprise-ready
- ✅ **Minimal Operations**: Only one database to deploy, monitor, and back up
- ✅ **pgvector Extension**: Native vector operations with good performance
- ✅ **SQL Familiarity**: Easy to query and debug

**Weaknesses:**
- ❌ **Graph Limitations**: Requires AGE extension for advanced graph operations
- ❌ **Vector Performance**: Good, but below that of a specialized vector database
- ❌ **Single Point of Failure**: All data in one database

**Configuration:**
```yaml
LIGHTRAG_KV_STORAGE: PGKVStorage
LIGHTRAG_VECTOR_STORAGE: PGVectorStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
LIGHTRAG_GRAPH_STORAGE: PGGraphStorage # Requires AGE extension
```

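
To make the "unified database" property concrete, the sketch below loads the four settings and checks whether every one of them points at a PostgreSQL-backed class. It rests on two assumptions that this report does not confirm: that the settings are supplied as environment variables, and that the `PG` name prefix reliably identifies PostgreSQL implementations. Both helper functions are hypothetical.

```python
import os
from typing import Mapping, Optional

# Setting names as they appear in the configuration block above.
STORAGE_SETTINGS = (
    "LIGHTRAG_KV_STORAGE",
    "LIGHTRAG_VECTOR_STORAGE",
    "LIGHTRAG_DOC_STATUS_STORAGE",
    "LIGHTRAG_GRAPH_STORAGE",
)


def load_storage_config(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the four storage settings (assumed to be environment variables)."""
    env = os.environ if environ is None else environ
    missing = [name for name in STORAGE_SETTINGS if name not in env]
    if missing:
        raise KeyError(f"missing storage settings: {', '.join(missing)}")
    return {name: env[name] for name in STORAGE_SETTINGS}


def is_postgres_only(config: Mapping[str, str]) -> bool:
    """True when every storage type uses a class whose name starts with 'PG'."""
    return all(impl.startswith("PG") for impl in config.values())
```

A deployment for which `is_postgres_only` returns `True` needs a single database service, which is the trade-off described in the strengths and weaknesses above.
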
### 2. Neo4j (Graph Specialist)
**Implementations**: Neo4JStorage

**Strengths:**
- ✅ **Graph Optimization**: Purpose-built for graph operations
- ✅ **Advanced Graph Analytics**: Complex graph algorithms built-in
- ✅ **Cypher Query Language**: Powerful graph query capabilities
- ✅ **Scalability**: Excellent for large, complex graphs
- ✅ **Visualization**: Rich graph visualization tools

**Weaknesses:**
- ❌ **Graph Only**: Requires additional databases for vectors/KV
- ❌ **Complexity**: More complex setup and maintenance
- ❌ **Cost**: Enterprise features require licensing
- ❌ **Memory Usage**: Can be memory-intensive

**Typical Configuration:**
```yaml
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_VECTOR_STORAGE: MilvusVectorDBStorage # Or Qdrant
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```

### 3. Milvus (Vector Specialist)
**Implementations**: MilvusVectorDBStorage

**Strengths:**
- ✅ **Vector Performance**: Optimized for high-performance vector search
- ✅ **Scalability**: Designed for billion-scale vector collections
- ✅ **Multiple Indexes**: Various indexing algorithms (IVF, HNSW, etc.)
- ✅ **GPU Support**: CUDA acceleration for vector operations
- ✅ **Cloud Native**: Kubernetes-ready architecture

**Weaknesses:**
- ❌ **Complexity**: Complex distributed architecture
- ❌ **Resource Usage**: High memory and compute requirements
- ❌ **Overkill**: May be excessive for smaller datasets
- ❌ **Dependencies**: Requires etcd and MinIO for full deployment

**Typical Configuration:**
```yaml
LIGHTRAG_VECTOR_STORAGE: MilvusVectorDBStorage
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: MongoDocStatusStorage
```

### 4. Qdrant (Vector Specialist)
**Implementations**: QdrantVectorDBStorage

**Strengths:**
- ✅ **Performance**: High-performance vector search with Rust backend
- ✅ **Simplicity**: Easier deployment than Milvus
- ✅ **Filtering**: Advanced payload filtering capabilities
- ✅ **API**: Rich REST and gRPC APIs
- ✅ **Memory Efficiency**: Lower memory footprint than Milvus

**Weaknesses:**
- ❌ **Ecosystem**: Smaller ecosystem compared to alternatives
- ❌ **Vector Only**: Requires additional databases for other storage types

### 5. MongoDB (Multi-Purpose)
**Implementations**: MongoKVStorage, MongoVectorDBStorage, MongoDocStatusStorage

**Strengths:**
- ✅ **Flexibility**: Schema-less document storage
- ✅ **Vector Search**: Native vector search capabilities (Atlas Vector Search)
- ✅ **Multi-Purpose**: Can handle KV, vectors, and document status
- ✅ **Scalability**: Horizontal scaling with sharding
- ✅ **Developer Friendly**: Easy to work with JSON documents

**Weaknesses:**
- ❌ **Graph Limitations**: Not optimized for graph operations
- ❌ **Vector Performance**: Vector search is less optimized than in the specialist vector databases
- ❌ **Memory Usage**: Can be memory-intensive for large datasets

### 6. Redis (KV Specialist)
**Implementations**: RedisKVStorage

**Strengths:**
- ✅ **Speed**: In-memory performance for KV operations
- ✅ **Simplicity**: Simple key-value operations
- ✅ **Data Structures**: Rich data structures (lists, sets, hashes)
- ✅ **Caching**: Excellent for caching and session storage

**Weaknesses:**
- ❌ **Memory Bound**: Limited by available RAM
- ❌ **KV Only**: Only suitable for key-value storage
- ❌ **Persistence**: Data persistence requires configuration

### 7. Local File Storage
**Implementations**: NetworkXStorage, JsonKVStorage, JsonDocStatusStorage, NanoVectorDBStorage, FaissVectorDBStorage

**Strengths:**
- ✅ **Simplicity**: No external dependencies
- ✅ **Development**: Perfect for development and testing
- ✅ **Portability**: Easy to backup and move
- ✅ **Cost**: No infrastructure costs

**Weaknesses:**
- ❌ **Scalability**: Limited by single machine resources
- ❌ **Concurrency**: No built-in concurrent access
- ❌ **Performance**: Limited performance for large datasets
- ❌ **Reliability**: Single point of failure

### 8. ChromaDB (Vector Specialist)
**Implementations**: ChromaVectorDBStorage

**Strengths:**
- ✅ **Simplicity**: Easy to deploy and use
- ✅ **Python Native**: Built for Python ML workflows
- ✅ **Metadata**: Rich metadata filtering capabilities
- ✅ **Local/Distributed**: Can run locally or distributed

**Weaknesses:**
- ❌ **Performance**: Slower than Milvus/Qdrant for large scales
- ❌ **Maturity**: Newer project with evolving feature set

## Recommended Stack Configurations

### 1. 🏆 **Production High-Performance Stack**
**Best for**: Large-scale production deployments, complex graph analytics

```yaml
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_VECTOR_STORAGE: MilvusVectorDBStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```

**Services Required:**
- Neo4j (Graph operations)
- Milvus + etcd + MinIO (Vector search)
- Redis (KV cache)
- PostgreSQL (Document status)

**Pros**: Maximum performance, specialized for each data type
**Cons**: High complexity, resource intensive, expensive

```mermaid
graph LR
    LightRAG_App["LightRAG Application"]
    Neo4j_Service["Neo4j Service"]
    Milvus_Cluster["Milvus Cluster (Milvus, etcd, MinIO)"]
    Redis_Service["Redis Service"]
    PostgreSQL_Service["PostgreSQL Service"]

    LightRAG_App --> |Graph Storage| Neo4j_Service
    LightRAG_App --> |Vector Storage| Milvus_Cluster
    LightRAG_App --> |KV Storage| Redis_Service
    LightRAG_App --> |Doc Status Storage| PostgreSQL_Service
```

### 2. 🎯 **Production Balanced Stack**
**Best for**: Production deployments prioritizing simplicity

```yaml
LIGHTRAG_GRAPH_STORAGE: NetworkXStorage
LIGHTRAG_VECTOR_STORAGE: QdrantVectorDBStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```

**Services Required:**
- Qdrant (Vector search)
- Redis (KV cache)
- PostgreSQL (Document status)
- File system (Graph storage)

**Pros**: Good performance, simpler than full specialist stack
**Cons**: Graph operations limited by file-based storage

```mermaid
graph LR
    subgraph "LightRAG Application Environment"
        LightRAG_App["LightRAG Application"]
        NetworkX["NetworkX Graph Storage (Local FS)"]
        LightRAG_App -.-> NetworkX
    end
    Qdrant_Service["Qdrant Service"]
    Redis_Service["Redis Service"]
    PostgreSQL_Service["PostgreSQL Service"]

    LightRAG_App --> |Vector Storage| Qdrant_Service
    LightRAG_App --> |KV Storage| Redis_Service
    LightRAG_App --> |Doc Status Storage| PostgreSQL_Service
```

### 3. 💰 **Production Minimal Stack**
**Best for**: Budget-conscious production deployments

```yaml
LIGHTRAG_GRAPH_STORAGE: NetworkXStorage
LIGHTRAG_VECTOR_STORAGE: PGVectorStorage
LIGHTRAG_KV_STORAGE: PGKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```

**Services Required:**
- PostgreSQL + pgvector (All storage except graph)
- File system (Graph storage)

**Pros**: Single database, low cost, good for medium scale
**Cons**: Not optimized for very large datasets or complex graphs

```mermaid
graph LR
    subgraph "LightRAG Application Environment"
        LightRAG_App["LightRAG Application"]
        NetworkX["NetworkX Graph Storage (Local FS)"]
        LightRAG_App -.-> NetworkX
    end
    PostgreSQL_Service["PostgreSQL Service (+pgvector)"]

    LightRAG_App --> |Vector Storage| PostgreSQL_Service
    LightRAG_App --> |KV Storage| PostgreSQL_Service
    LightRAG_App --> |Doc Status Storage| PostgreSQL_Service
```

### 4. 🚀 **Development & Testing Stack**
**Best for**: Local development, testing, small deployments

```yaml
LIGHTRAG_GRAPH_STORAGE: NetworkXStorage
LIGHTRAG_VECTOR_STORAGE: NanoVectorDBStorage
LIGHTRAG_KV_STORAGE: JsonKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: JsonDocStatusStorage
```

**Services Required:**
- None (all file-based)

**Pros**: Zero infrastructure, fast setup, portable
**Cons**: Limited scalability and performance

```mermaid
graph LR
    subgraph "LightRAG Application (Local Process)"
        LightRAG_App["LightRAG App"]
        NetworkX["NetworkX (File System)"]
        NanoVectorDB["NanoVectorDB (File System)"]
        JsonKV["JsonKVStorage (File System)"]
        JsonDocStatus["JsonDocStatusStorage (File System)"]

        LightRAG_App -.-> |Graph| NetworkX
        LightRAG_App -.-> |Vector| NanoVectorDB
        LightRAG_App -.-> |KV| JsonKV
        LightRAG_App -.-> |Doc Status| JsonDocStatus
    end
```

### 5. 🐳 **Docker All-in-One Stack**
**Best for**: Containerized deployments, cloud environments

```yaml
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_VECTOR_STORAGE: QdrantVectorDBStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: MongoDocStatusStorage
```

**Services Required:**
- Neo4j (Graph)
- Qdrant (Vector)
- Redis (KV)
- MongoDB (Document status)

**Pros**: Cloud-native, each service containerized
**Cons**: More services to manage

```mermaid
graph LR
    subgraph "Docker Environment (e.g., Docker Compose)"
        LightRAG_Container["LightRAG App (Container)"]
        Neo4j_Container["Neo4j (Container)"]
        Qdrant_Container["Qdrant (Container)"]
        Redis_Container["Redis (Container)"]
        MongoDB_Container["MongoDB (Container)"]
    end
    LightRAG_Container --> |Graph Storage| Neo4j_Container
    LightRAG_Container --> |Vector Storage| Qdrant_Container
    LightRAG_Container --> |KV Storage| Redis_Container
    LightRAG_Container --> |Doc Status Storage| MongoDB_Container
```

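
The "Services Required" lists for the five stacks follow mechanically from the chosen implementations, which the sketch below illustrates. The stack definitions are copied from the configurations above; the short keys (`graph`, `vector`, `kv`, `doc_status`), the stack identifiers, and the `required_services` helper are hypothetical names chosen for this example. File-based implementations map to no service.

```python
# Stack definitions copied from the five configurations above.
STACKS = {
    "high_performance": {"graph": "Neo4JStorage", "vector": "MilvusVectorDBStorage",
                         "kv": "RedisKVStorage", "doc_status": "PGDocStatusStorage"},
    "balanced": {"graph": "NetworkXStorage", "vector": "QdrantVectorDBStorage",
                 "kv": "RedisKVStorage", "doc_status": "PGDocStatusStorage"},
    "minimal": {"graph": "NetworkXStorage", "vector": "PGVectorStorage",
                "kv": "PGKVStorage", "doc_status": "PGDocStatusStorage"},
    "development": {"graph": "NetworkXStorage", "vector": "NanoVectorDBStorage",
                    "kv": "JsonKVStorage", "doc_status": "JsonDocStatusStorage"},
    "docker_all_in_one": {"graph": "Neo4JStorage", "vector": "QdrantVectorDBStorage",
                          "kv": "RedisKVStorage", "doc_status": "MongoDocStatusStorage"},
}

# External service behind each implementation. PGVectorStorage additionally
# needs the pgvector extension installed in PostgreSQL.
SERVICE_FOR_IMPLEMENTATION = {
    "Neo4JStorage": "Neo4j",
    "MilvusVectorDBStorage": "Milvus + etcd + MinIO",
    "QdrantVectorDBStorage": "Qdrant",
    "RedisKVStorage": "Redis",
    "PGVectorStorage": "PostgreSQL",
    "PGKVStorage": "PostgreSQL",
    "PGDocStatusStorage": "PostgreSQL",
    "MongoDocStatusStorage": "MongoDB",
}


def required_services(stack: dict) -> list:
    """Return the sorted, de-duplicated external services a stack depends on."""
    services = {
        SERVICE_FOR_IMPLEMENTATION[impl]
        for impl in stack.values()
        if impl in SERVICE_FOR_IMPLEMENTATION  # file-based storage needs none
    }
    return sorted(services)
```

For instance, `required_services(STACKS["minimal"])` collapses three PostgreSQL-backed settings into a single service, while the development stack yields an empty list.
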
## Performance Comparison

### Vector Search Performance (Approximate)
| Implementation | Small (1K docs) | Medium (100K docs) | Large (1M+ docs) | Memory Usage |
|---------------|-----------------|--------------------|-----------------|--------------|
| MilvusVectorDBStorage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | High |
| QdrantVectorDBStorage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Medium |
| PGVectorStorage | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Medium |
| ChromaVectorDBStorage | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Medium |
| FaissVectorDBStorage | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | Low |
| NanoVectorDBStorage | ⭐⭐⭐ | ⭐⭐ | ⭐ | Low |

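
One way to read the vector search table is as a lookup: pick the highest-rated implementation for a dataset size, subject to a memory budget. The sketch below encodes the approximate star ratings from the table as integers. The selection rule (highest rating first, then lower memory usage) and the `pick_vector_backend` helper are assumptions made for illustration, not part of LightRAG.

```python
# Star ratings (1-5) and memory usage copied from the table above.
VECTOR_RATINGS = {
    "MilvusVectorDBStorage": {"small": 5, "medium": 5, "large": 5, "memory": "High"},
    "QdrantVectorDBStorage": {"small": 5, "medium": 5, "large": 4, "memory": "Medium"},
    "PGVectorStorage": {"small": 4, "medium": 3, "large": 2, "memory": "Medium"},
    "ChromaVectorDBStorage": {"small": 4, "medium": 3, "large": 2, "memory": "Medium"},
    "FaissVectorDBStorage": {"small": 3, "medium": 3, "large": 3, "memory": "Low"},
    "NanoVectorDBStorage": {"small": 3, "medium": 2, "large": 1, "memory": "Low"},
}

MEMORY_LEVEL = {"Low": 0, "Medium": 1, "High": 2}


def pick_vector_backend(scale: str, max_memory: str = "High") -> str:
    """Pick the best-rated backend for `scale` within the memory budget.

    Ties on rating are broken in favour of lower memory usage.
    """
    budget = MEMORY_LEVEL[max_memory]
    candidates = [
        (row[scale], -MEMORY_LEVEL[row["memory"]], name)
        for name, row in VECTOR_RATINGS.items()
        if MEMORY_LEVEL[row["memory"]] <= budget
    ]
    return max(candidates)[2]
```

Because the ratings are approximate, a result from this lookup is a starting point for benchmarking rather than a final decision.
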
### Graph Operations Performance
| Implementation | Node Queries | Edge Traversal | Complex Analytics | Scalability |
|---------------|--------------|----------------|------------------|-------------|
| Neo4JStorage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| PGGraphStorage | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| NetworkXStorage | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |

### KV Operations Performance
| Implementation | Read Speed | Write Speed | Concurrency | Persistence |
|---------------|------------|-------------|-------------|-------------|
| RedisKVStorage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| PGKVStorage | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| MongoKVStorage | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| JsonKVStorage | ⭐⭐ | ⭐⭐ | ⭐ | ⭐⭐⭐⭐⭐ |

## Deployment Considerations

### Resource Requirements

| Configuration | CPU | Memory | Storage | Network |
|--------------|-----|--------|---------|---------|
| Development Stack | 2 cores | 4 GB | 10 GB | Minimal |
| Minimal Stack | 4 cores | 8 GB | 50 GB | Medium |
| Balanced Stack | 8 cores | 16 GB | 100 GB | High |
| High-Performance Stack | 16+ cores | 32+ GB | 500+ GB | Very High |

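
The resource table can also be read in reverse: given the hardware that is available, which is the largest stack it can host? The sketch below treats each row as a minimum requirement (so "16+ cores" becomes at least 16). The `Requirement` type and the `largest_stack_that_fits` helper are hypothetical, and the Network column is left out because the table gives it only qualitatively.

```python
from typing import NamedTuple, Optional


class Requirement(NamedTuple):
    name: str
    cpu_cores: int
    memory_gb: int
    storage_gb: int


# Rows copied from the table above, ordered from smallest to largest.
REQUIREMENTS = [
    Requirement("Development Stack", 2, 4, 10),
    Requirement("Minimal Stack", 4, 8, 50),
    Requirement("Balanced Stack", 8, 16, 100),
    Requirement("High-Performance Stack", 16, 32, 500),
]


def largest_stack_that_fits(cpu_cores: int, memory_gb: int,
                            storage_gb: int) -> Optional[str]:
    """Return the largest stack whose minimums are all met, or None."""
    fitting = [
        req.name
        for req in REQUIREMENTS
        if cpu_cores >= req.cpu_cores
        and memory_gb >= req.memory_gb
        and storage_gb >= req.storage_gb
    ]
    return fitting[-1] if fitting else None
```

A host with 8 cores and 16 GB of memory but only 60 GB of disk, for example, falls back to the Minimal Stack because the Balanced Stack's storage minimum is not met.
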
### Maintenance Complexity

| Stack Type | Setup Complexity | Operational Overhead | Monitoring | Backup Strategy |
|-----------|------------------|---------------------|------------|-----------------|
| Development | ⭐ | ⭐ | ⭐ | Simple |
| Minimal | ⭐⭐ | ⭐⭐ | ⭐⭐ | Medium |
| Balanced | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | Complex |
| High-Performance | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Very Complex |

## Migration Paths

### Development → Production
1. Start with Development Stack (all file-based)
2. Migrate to Minimal Stack (PostgreSQL-based)
3. Scale to Balanced Stack (add specialized vector DB)
4. Optimize with High-Performance Stack (full specialization)

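
Each step on this path changes only some of the four storage settings, and those are the ones whose data has to be migrated. The sketch below shows the idea for the first step, using the Development and Minimal configurations from this report. The helper names `next_stack` and `settings_to_migrate` are hypothetical.

```python
# The four steps listed above, in order.
MIGRATION_PATH = [
    "Development Stack",
    "Minimal Stack",
    "Balanced Stack",
    "High-Performance Stack",
]

# Configurations copied from the recommended stacks in this report.
DEVELOPMENT = {
    "LIGHTRAG_GRAPH_STORAGE": "NetworkXStorage",
    "LIGHTRAG_VECTOR_STORAGE": "NanoVectorDBStorage",
    "LIGHTRAG_KV_STORAGE": "JsonKVStorage",
    "LIGHTRAG_DOC_STATUS_STORAGE": "JsonDocStatusStorage",
}
MINIMAL = {
    "LIGHTRAG_GRAPH_STORAGE": "NetworkXStorage",
    "LIGHTRAG_VECTOR_STORAGE": "PGVectorStorage",
    "LIGHTRAG_KV_STORAGE": "PGKVStorage",
    "LIGHTRAG_DOC_STATUS_STORAGE": "PGDocStatusStorage",
}


def next_stack(current: str):
    """Return the next stack on the path, or None at the final step."""
    following = MIGRATION_PATH.index(current) + 1  # ValueError if unknown
    return MIGRATION_PATH[following] if following < len(MIGRATION_PATH) else None


def settings_to_migrate(old: dict, new: dict) -> dict:
    """Map each changed setting to its (old, new) implementation pair."""
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}
```

Moving from Development to Minimal changes the vector, KV, and document status settings, while graph storage stays on NetworkX and needs no migration.
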
### Data Migration Tools
- **Database-specific**: Use native tools (pg_dump, neo4j-admin, etc.)
- **LightRAG native**: Built-in export/import capabilities
- **Cross-platform**: JSON export for universal compatibility

## Recommendations by Use Case

### 📚 **Documentation/Knowledge Base**
- **Small (<10K docs)**: Development Stack
- **Medium (<100K docs)**: Minimal Stack
- **Large (>100K docs)**: Balanced Stack

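
These size thresholds translate directly into a selection rule. The sketch below is one possible reading: the report does not say which stack applies at exactly 100K documents, so assigning that boundary to the Balanced Stack is an assumption, and `recommend_stack_for_docs` is a hypothetical helper.

```python
def recommend_stack_for_docs(doc_count: int) -> str:
    """Map a document count to the stack suggested for knowledge bases."""
    if doc_count < 0:
        raise ValueError("doc_count must be non-negative")
    if doc_count < 10_000:
        return "Development Stack"
    if doc_count < 100_000:
        return "Minimal Stack"
    # Assumption: exactly 100K documents is treated as "Large".
    return "Balanced Stack"
```
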
### 🔬 **Research/Analytics**
- **Graph-heavy**: High-Performance Stack with Neo4j
- **Vector-heavy**: Balanced Stack with Milvus in place of Qdrant
- **Mixed workload**: Balanced Stack

### 💼 **Enterprise**
- **High Availability**: High-Performance Stack with clustering
- **Budget Conscious**: Minimal Stack with PostgreSQL
- **Regulatory**: On-premises with full control

### 🚀 **Startups/SMBs**
- **MVP**: Development Stack
- **Growing**: Minimal Stack
- **Scaling**: Balanced Stack

## Conclusion

The **Minimal Stack** (PostgreSQL + NetworkX) provides the best balance of performance, complexity, and cost for most use cases. It offers:

- ✅ Production-ready reliability
- ✅ Reasonable performance for medium-scale deployments
- ✅ Low operational overhead
- ✅ Clear upgrade path to specialized components

For specialized needs:
- **High graph complexity** → Add Neo4j
- **High vector performance** → Add Qdrant/Milvus
- **High concurrency KV** → Add Redis

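
The upgrade path can be sketched as a set of targeted overrides applied to the Minimal Stack. In the example below, the bottleneck labels and the `upgrade_stack` helper are hypothetical. Qdrant is used for the vector upgrade, but Milvus would be an equally valid choice according to this report.

```python
# Minimal Stack as defined in this report.
MINIMAL_STACK = {
    "LIGHTRAG_GRAPH_STORAGE": "NetworkXStorage",
    "LIGHTRAG_VECTOR_STORAGE": "PGVectorStorage",
    "LIGHTRAG_KV_STORAGE": "PGKVStorage",
    "LIGHTRAG_DOC_STATUS_STORAGE": "PGDocStatusStorage",
}

# Hypothetical bottleneck labels mapped to the setting they replace.
UPGRADES = {
    "graph_complexity": ("LIGHTRAG_GRAPH_STORAGE", "Neo4JStorage"),
    "vector_performance": ("LIGHTRAG_VECTOR_STORAGE", "QdrantVectorDBStorage"),
    "kv_concurrency": ("LIGHTRAG_KV_STORAGE", "RedisKVStorage"),
}


def upgrade_stack(base: dict, bottlenecks: list) -> dict:
    """Return a copy of `base` with one setting swapped per bottleneck."""
    upgraded = dict(base)  # leave the caller's configuration untouched
    for bottleneck in bottlenecks:
        setting, implementation = UPGRADES[bottleneck]
        upgraded[setting] = implementation
    return upgraded
```

Applying only `"kv_concurrency"` swaps in Redis for KV storage and leaves the other three settings on their Minimal Stack defaults, so each upgrade can be adopted independently.
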
The modular architecture allows gradual optimization driven by measured performance bottlenecks, rather than premature optimization.

---

*Report generated based on LightRAG v1.3.7 implementation analysis*