# LightRAG Storage Stack Configurations Report
## Executive Summary
LightRAG supports a modular storage architecture with 4 distinct storage types that can be mixed and matched:
- **Graph Storage**: Knowledge graph relationships
- **Vector Storage**: Document embeddings
- **KV Storage**: Key-value pairs and metadata
- **Document Status Storage**: Document processing status
This report analyzes the 19 storage implementations listed below, spanning 8 database technologies, and recommends stacks for different use cases.
## Storage Architecture Overview
### Storage Types & Available Implementations
| Storage Type | Implementations | Count |
|--------------|----------------|-------|
| **Graph Storage** | NetworkXStorage, Neo4JStorage, PGGraphStorage, AGEStorage¹, MongoGraphStorage¹ | 5 |
| **Vector Storage** | NanoVectorDBStorage, MilvusVectorDBStorage, ChromaVectorDBStorage, PGVectorStorage, FaissVectorDBStorage, QdrantVectorDBStorage, MongoVectorDBStorage | 7 |
| **KV Storage** | JsonKVStorage, RedisKVStorage, PGKVStorage, MongoKVStorage | 4 |
| **Doc Status Storage** | JsonDocStatusStorage, PGDocStatusStorage, MongoDocStatusStorage | 3 |

¹ *Currently commented out in production*
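
The mix-and-match rule can be illustrated with a small validation sketch. It only encodes the table above; the `validate_stack` helper and the short type keys (`graph`, `vector`, `kv`, `doc_status`) are hypothetical names chosen for this example, not part of LightRAG.

```python
# Implementation names copied from the table above, grouped by storage type.
IMPLEMENTATIONS = {
    "graph": {
        "NetworkXStorage", "Neo4JStorage", "PGGraphStorage",
        "AGEStorage", "MongoGraphStorage",
    },
    "vector": {
        "NanoVectorDBStorage", "MilvusVectorDBStorage", "ChromaVectorDBStorage",
        "PGVectorStorage", "FaissVectorDBStorage", "QdrantVectorDBStorage",
        "MongoVectorDBStorage",
    },
    "kv": {"JsonKVStorage", "RedisKVStorage", "PGKVStorage", "MongoKVStorage"},
    "doc_status": {
        "JsonDocStatusStorage", "PGDocStatusStorage", "MongoDocStatusStorage",
    },
}


def validate_stack(stack: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the selection is valid."""
    problems = []
    for storage_type, allowed in IMPLEMENTATIONS.items():
        choice = stack.get(storage_type)
        if choice is None:
            problems.append(f"missing {storage_type} storage")
        elif choice not in allowed:
            problems.append(f"{choice} is not a {storage_type} storage")
    return problems
```

Any combination that picks exactly one implementation per storage type passes, which is what allows the mixed stacks recommended later in this report.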
## Database Technology Analysis
### 1. PostgreSQL + pgvector
**Implementations**: PGVectorStorage, PGKVStorage, PGGraphStorage, PGDocStatusStorage
**Strengths:**
- **Unified Database**: Single database for all storage types
- **ACID Compliance**: Full transactional support
- **Mature Ecosystem**: Well-established, enterprise-ready
- **Minimal Operations**: Only one database to deploy, monitor, and back up
- **pgvector Extension**: Native vector operations with good performance
- **SQL Familiarity**: Easy to query and debug
**Weaknesses:**
- **Graph Limitations**: Requires AGE extension for advanced graph operations
- **Vector Performance**: Good, but below that of a specialized vector database
- **Single Point of Failure**: All data in one database
**Configuration:**
```yaml
LIGHTRAG_KV_STORAGE: PGKVStorage
LIGHTRAG_VECTOR_STORAGE: PGVectorStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
LIGHTRAG_GRAPH_STORAGE: PGGraphStorage # Requires AGE extension
```
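
As a rough illustration of how these four settings could be consumed, the sketch below reads them from an environment mapping. The variable names come from the snippet above; the `storage_from_env` helper and the fallback to the file-based implementations are assumptions made for this example, and the report does not show how LightRAG itself resolves them.

```python
import os

# Environment variable names as used in the configuration snippets of this report.
ENV_KEYS = {
    "graph": "LIGHTRAG_GRAPH_STORAGE",
    "vector": "LIGHTRAG_VECTOR_STORAGE",
    "kv": "LIGHTRAG_KV_STORAGE",
    "doc_status": "LIGHTRAG_DOC_STATUS_STORAGE",
}

# Assumption: fall back to the file-based implementations when a variable is unset.
DEFAULTS = {
    "graph": "NetworkXStorage",
    "vector": "NanoVectorDBStorage",
    "kv": "JsonKVStorage",
    "doc_status": "JsonDocStatusStorage",
}


def storage_from_env(env=None):
    """Resolve one implementation name per storage type from an env mapping."""
    env = os.environ if env is None else env
    return {
        storage_type: env.get(key, DEFAULTS[storage_type])
        for storage_type, key in ENV_KEYS.items()
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests, for example `storage_from_env({"LIGHTRAG_KV_STORAGE": "PGKVStorage"})`.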
### 2. Neo4j (Graph Specialist)
**Implementations**: Neo4JStorage
**Strengths:**
- **Graph Optimization**: Purpose-built for graph operations
- **Advanced Graph Analytics**: Complex graph algorithms built-in
- **Cypher Query Language**: Powerful graph query capabilities
- **Scalability**: Excellent for large, complex graphs
- **Visualization**: Rich graph visualization tools
**Weaknesses:**
- **Graph Only**: Requires additional databases for vectors/KV
- **Complexity**: More complex setup and maintenance
- **Cost**: Enterprise features require licensing
- **Memory Usage**: Can be memory-intensive
**Typical Configuration:**
```yaml
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_VECTOR_STORAGE: MilvusVectorDBStorage # Or Qdrant
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```
### 3. Milvus (Vector Specialist)
**Implementations**: MilvusVectorDBStorage
**Strengths:**
- **Vector Performance**: Optimized for high-performance vector search
- **Scalability**: Designed for billion-scale vector collections
- **Multiple Indexes**: Various indexing algorithms (IVF, HNSW, etc.)
- **GPU Support**: CUDA acceleration for vector operations
- **Cloud Native**: Kubernetes-ready architecture
**Weaknesses:**
- **Complexity**: Complex distributed architecture
- **Resource Usage**: High memory and compute requirements
- **Overkill**: May be excessive for smaller datasets
- **Dependencies**: Requires etcd and MinIO for full deployment
**Typical Configuration:**
```yaml
LIGHTRAG_VECTOR_STORAGE: MilvusVectorDBStorage
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: MongoDocStatusStorage
```
### 4. Qdrant (Vector Specialist)
**Implementations**: QdrantVectorDBStorage
**Strengths:**
- **Performance**: High-performance vector search with Rust backend
- **Simplicity**: Easier deployment than Milvus
- **Filtering**: Advanced payload filtering capabilities
- **API**: Rich REST and gRPC APIs
- **Memory Efficiency**: Lower memory footprint than Milvus
**Weaknesses:**
- **Ecosystem**: Smaller ecosystem compared to alternatives
- **Vector Only**: Requires additional databases for other storage types
### 5. MongoDB (Multi-Purpose)
**Implementations**: MongoKVStorage, MongoVectorDBStorage, MongoDocStatusStorage
**Strengths:**
- **Flexibility**: Schema-less document storage
- **Vector Search**: Native vector search capabilities (Atlas Vector Search)
- **Multi-Purpose**: Can handle KV, vectors, and document status
- **Scalability**: Horizontal scaling with sharding
- **Developer Friendly**: Easy to work with JSON documents
**Weaknesses:**
- **Graph Limitations**: Not optimized for graph operations
- **Vector Performance**: Vector search not as optimized as specialists
- **Memory Usage**: Can be memory-intensive for large datasets
### 6. Redis (KV Specialist)
**Implementations**: RedisKVStorage
**Strengths:**
- **Speed**: In-memory performance for KV operations
- **Simplicity**: Simple key-value operations
- **Data Structures**: Rich data structures (lists, sets, hashes)
- **Caching**: Excellent for caching and session storage
**Weaknesses:**
- **Memory Bound**: Limited by available RAM
- **KV Only**: Only suitable for key-value storage
- **Persistence**: Data persistence requires configuration
### 7. Local File Storage
**Implementations**: NetworkXStorage, JsonKVStorage, JsonDocStatusStorage, NanoVectorDBStorage, FaissVectorDBStorage
**Strengths:**
- **Simplicity**: No external dependencies
- **Development**: Perfect for development and testing
- **Portability**: Easy to backup and move
- **Cost**: No infrastructure costs
**Weaknesses:**
- **Scalability**: Limited by single machine resources
- **Concurrency**: No built-in concurrent access
- **Performance**: Limited performance for large datasets
- **Reliability**: Single point of failure
### 8. ChromaDB (Vector Specialist)
**Implementations**: ChromaVectorDBStorage
**Strengths:**
- **Simplicity**: Easy to deploy and use
- **Python Native**: Built for Python ML workflows
- **Metadata**: Rich metadata filtering capabilities
- **Local/Distributed**: Can run locally or distributed
**Weaknesses:**
- **Performance**: Slower than Milvus/Qdrant at large scale
- **Maturity**: Newer project with evolving feature set
## Recommended Stack Configurations
### 1. 🏆 **Production High-Performance Stack**
**Best for**: Large-scale production deployments, complex graph analytics
```yaml
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_VECTOR_STORAGE: MilvusVectorDBStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```
**Services Required:**
- Neo4j (Graph operations)
- Milvus + etcd + MinIO (Vector search)
- Redis (KV cache)
- PostgreSQL (Document status)
**Pros**: Maximum performance, specialized for each data type
**Cons**: High complexity, resource intensive, expensive
```mermaid
graph LR
LightRAG_App["LightRAG Application"]
Neo4j_Service["Neo4j Service"]
Milvus_Cluster["Milvus Cluster (Milvus, etcd, MinIO)"]
Redis_Service["Redis Service"]
PostgreSQL_Service["PostgreSQL Service"]
LightRAG_App --> |Graph Storage| Neo4j_Service
LightRAG_App --> |Vector Storage| Milvus_Cluster
LightRAG_App --> |KV Storage| Redis_Service
LightRAG_App --> |Doc Status Storage| PostgreSQL_Service
```
### 2. 🎯 **Production Balanced Stack**
**Best for**: Production deployments prioritizing simplicity
```yaml
LIGHTRAG_GRAPH_STORAGE: NetworkXStorage
LIGHTRAG_VECTOR_STORAGE: QdrantVectorDBStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```
**Services Required:**
- Qdrant (Vector search)
- Redis (KV cache)
- PostgreSQL (Document status)
- File system (Graph storage)
**Pros**: Good performance, simpler than full specialist stack
**Cons**: Graph operations limited by file-based storage
```mermaid
graph LR
subgraph "LightRAG Application Environment"
LightRAG_App["LightRAG Application"]
NetworkX["NetworkX Graph Storage (Local FS)"]
LightRAG_App -.-> NetworkX
end
Qdrant_Service["Qdrant Service"]
Redis_Service["Redis Service"]
PostgreSQL_Service["PostgreSQL Service"]
LightRAG_App --> |Vector Storage| Qdrant_Service
LightRAG_App --> |KV Storage| Redis_Service
LightRAG_App --> |Doc Status Storage| PostgreSQL_Service
```
### 3. 💰 **Production Minimal Stack**
**Best for**: Budget-conscious production deployments
```yaml
LIGHTRAG_GRAPH_STORAGE: NetworkXStorage
LIGHTRAG_VECTOR_STORAGE: PGVectorStorage
LIGHTRAG_KV_STORAGE: PGKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
```
**Services Required:**
- PostgreSQL + pgvector (All storage except graph)
- File system (Graph storage)
**Pros**: Single database, low cost, good for medium scale
**Cons**: Not optimized for very large datasets or complex graphs
```mermaid
graph LR
subgraph "LightRAG Application Environment"
LightRAG_App["LightRAG Application"]
NetworkX["NetworkX Graph Storage (Local FS)"]
LightRAG_App -.-> NetworkX
end
PostgreSQL_Service["PostgreSQL Service (+pgvector)"]
LightRAG_App --> |Vector Storage| PostgreSQL_Service
LightRAG_App --> |KV Storage| PostgreSQL_Service
LightRAG_App --> |Doc Status Storage| PostgreSQL_Service
```
### 4. 🚀 **Development & Testing Stack**
**Best for**: Local development, testing, small deployments
```yaml
LIGHTRAG_GRAPH_STORAGE: NetworkXStorage
LIGHTRAG_VECTOR_STORAGE: NanoVectorDBStorage
LIGHTRAG_KV_STORAGE: JsonKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: JsonDocStatusStorage
```
**Services Required:**
- None (all file-based)
**Pros**: Zero infrastructure, fast setup, portable
**Cons**: Limited scalability and performance
```mermaid
graph LR
subgraph "LightRAG Application (Local Process)"
LightRAG_App["LightRAG App"]
NetworkX["NetworkX (File System)"]
NanoVectorDB["NanoVectorDB (File System)"]
JsonKV["JsonKVStorage (File System)"]
JsonDocStatus["JsonDocStatusStorage (File System)"]
LightRAG_App -.-> |Graph| NetworkX
LightRAG_App -.-> |Vector| NanoVectorDB
LightRAG_App -.-> |KV| JsonKV
LightRAG_App -.-> |Doc Status| JsonDocStatus
end
```
### 5. 🐳 **Docker All-in-One Stack**
**Best for**: Containerized deployments, cloud environments
```yaml
LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
LIGHTRAG_VECTOR_STORAGE: QdrantVectorDBStorage
LIGHTRAG_KV_STORAGE: RedisKVStorage
LIGHTRAG_DOC_STATUS_STORAGE: MongoDocStatusStorage
```
**Services Required:**
- Neo4j (Graph)
- Qdrant (Vector)
- Redis (KV)
- MongoDB (Document status)
**Pros**: Cloud-native, each service containerized
**Cons**: More services to manage
```mermaid
graph LR
subgraph "Docker Environment (e.g., Docker Compose)"
LightRAG_Container["LightRAG App (Container)"]
Neo4j_Container["Neo4j (Container)"]
Qdrant_Container["Qdrant (Container)"]
Redis_Container["Redis (Container)"]
MongoDB_Container["MongoDB (Container)"]
end
LightRAG_Container --> |Graph Storage| Neo4j_Container
LightRAG_Container --> |Vector Storage| Qdrant_Container
LightRAG_Container --> |KV Storage| Redis_Container
LightRAG_Container --> |Doc Status Storage| MongoDB_Container
```
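
The "Services Required" lists above follow mechanically from the chosen implementations. A minimal sketch of that mapping is shown below; the `required_services` helper and both lookup tables are hypothetical and only restate the service pairings used in this report.

```python
# File-based implementations that need no external service (see "Local File Storage").
LOCAL_IMPLEMENTATIONS = {
    "NetworkXStorage", "JsonKVStorage", "JsonDocStatusStorage",
    "NanoVectorDBStorage", "FaissVectorDBStorage",
}

# Backing service for each remaining implementation named in this report.
EXTERNAL_SERVICE = {
    "Neo4JStorage": "Neo4j",
    "MilvusVectorDBStorage": "Milvus",  # a full deployment also needs etcd and MinIO
    "QdrantVectorDBStorage": "Qdrant",
    "ChromaVectorDBStorage": "ChromaDB",
    "RedisKVStorage": "Redis",
    "PGVectorStorage": "PostgreSQL",  # with the pgvector extension
    "PGKVStorage": "PostgreSQL",
    "PGDocStatusStorage": "PostgreSQL",
    "PGGraphStorage": "PostgreSQL",  # with the AGE extension
    "MongoKVStorage": "MongoDB",
    "MongoVectorDBStorage": "MongoDB",
    "MongoDocStatusStorage": "MongoDB",
}


def required_services(stack):
    """Return the sorted, de-duplicated external services a stack depends on."""
    services = set()
    for implementation in stack.values():
        if implementation in LOCAL_IMPLEMENTATIONS:
            continue  # stored on the local file system
        if implementation not in EXTERNAL_SERVICE:
            raise ValueError(f"unknown storage implementation: {implementation}")
        services.add(EXTERNAL_SERVICE[implementation])
    return sorted(services)
```

For the Production Minimal Stack this returns only `PostgreSQL`, while the Development & Testing Stack returns an empty list.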
## Performance Comparison
### Vector Search Performance (Approximate)
| Implementation | Small (1K docs) | Medium (100K docs) | Large (1M+ docs) | Memory Usage |
|---------------|-----------------|--------------------|-----------------|--------------|
| MilvusVectorDB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | High |
| QdrantVectorDB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Medium |
| PGVectorStorage | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Medium |
| ChromaVectorDB | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Medium |
| FaissVectorDB | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | Low |
| NanoVectorDB | ⭐⭐⭐ | ⭐⭐ | ⭐ | Low |
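
One way to use the table above is to shortlist vector stores that meet a minimum rating at a given scale and then prefer the lighter ones. The sketch below does that; the star counts and memory labels are transcribed from the table, while the `candidates` helper and its ordering rule (lowest memory first, then highest rating) are assumptions of this example.

```python
# (small, medium, large) star ratings and memory usage, transcribed from the table.
VECTOR_RATINGS = {
    "MilvusVectorDB": ((5, 5, 5), "High"),
    "QdrantVectorDB": ((5, 5, 4), "Medium"),
    "PGVectorStorage": ((4, 3, 2), "Medium"),
    "ChromaVectorDB": ((4, 3, 2), "Medium"),
    "FaissVectorDB": ((3, 3, 3), "Low"),
    "NanoVectorDB": ((3, 2, 1), "Low"),
}
SCALES = {"small": 0, "medium": 1, "large": 2}
MEMORY_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def candidates(scale, min_stars):
    """List implementations rated at least min_stars at the given scale."""
    idx = SCALES[scale]
    matches = [
        (name, stars[idx], memory)
        for name, (stars, memory) in VECTOR_RATINGS.items()
        if stars[idx] >= min_stars
    ]
    # Assumed preference: lowest memory usage first, then highest rating.
    matches.sort(key=lambda m: (MEMORY_ORDER[m[2]], -m[1]))
    return [name for name, _, _ in matches]
```

Because the ratings are approximate, the result is best read as a shortlist to benchmark rather than a final choice.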
### Graph Operations Performance
| Implementation | Node Queries | Edge Traversal | Complex Analytics | Scalability |
|---------------|--------------|----------------|------------------|-------------|
| Neo4JStorage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| PGGraphStorage | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| NetworkXStorage | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
### KV Operations Performance
| Implementation | Read Speed | Write Speed | Concurrency | Persistence |
|---------------|------------|-------------|-------------|-------------|
| RedisKVStorage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| PGKVStorage | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| MongoKVStorage | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| JsonKVStorage | ⭐⭐ | ⭐⭐ | ⭐ | ⭐⭐⭐⭐⭐ |
## Deployment Considerations
### Resource Requirements
| Configuration | CPU | Memory | Storage | Network |
|--------------|-----|--------|---------|---------|
| Development Stack | 2 cores | 4GB | 10GB | Minimal |
| Minimal Stack | 4 cores | 8GB | 50GB | Medium |
| Balanced Stack | 8 cores | 16GB | 100GB | High |
| High-Performance Stack | 16+ cores | 32GB+ | 500GB+ | Very High |
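
The table can also be read as a set of minimum requirements. The sketch below checks which stacks fit a given machine or node budget; it treats "16+", "32GB+", and "500GB+" as lower bounds and ignores the qualitative Network column. The `stacks_that_fit` helper is a hypothetical name used only here.

```python
# Minimum (CPU cores, memory in GB, storage in GB) per stack, from the table above.
# Ordered from smallest to largest footprint.
REQUIREMENTS = {
    "Development": (2, 4, 10),
    "Minimal": (4, 8, 50),
    "Balanced": (8, 16, 100),
    "High-Performance": (16, 32, 500),
}


def stacks_that_fit(cores, memory_gb, storage_gb):
    """Return the stacks whose minimum requirements are met by the budget."""
    return [
        name
        for name, (min_cores, min_memory, min_storage) in REQUIREMENTS.items()
        if cores >= min_cores and memory_gb >= min_memory and storage_gb >= min_storage
    ]
```

For example, `stacks_that_fit(8, 16, 100)` includes everything up to the Balanced Stack but not the High-Performance Stack.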
### Maintenance Complexity
| Stack Type | Setup Complexity | Operational Overhead | Monitoring | Backup Strategy |
|-----------|------------------|---------------------|------------|-----------------|
| Development | ⭐ | ⭐ | ⭐ | Simple |
| Minimal | ⭐⭐ | ⭐⭐ | ⭐⭐ | Medium |
| Balanced | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | Complex |
| High-Performance | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Very Complex |
## Migration Paths
### Development → Production
1. Start with Development Stack (all file-based)
2. Migrate to Minimal Stack (PostgreSQL-based)
3. Scale to Balanced Stack (add specialized vector DB)
4. Optimize with High-Performance Stack (full specialization)
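
To see what each step of this path actually changes, the stacks can be compared component by component. The sketch below uses the configurations from the recommended stacks earlier in this report; the `changed_components` helper and the short type keys are hypothetical.

```python
# Stack definitions copied from the recommended configurations in this report.
STACKS = {
    "Development": {
        "graph": "NetworkXStorage", "vector": "NanoVectorDBStorage",
        "kv": "JsonKVStorage", "doc_status": "JsonDocStatusStorage",
    },
    "Minimal": {
        "graph": "NetworkXStorage", "vector": "PGVectorStorage",
        "kv": "PGKVStorage", "doc_status": "PGDocStatusStorage",
    },
    "Balanced": {
        "graph": "NetworkXStorage", "vector": "QdrantVectorDBStorage",
        "kv": "RedisKVStorage", "doc_status": "PGDocStatusStorage",
    },
    "High-Performance": {
        "graph": "Neo4JStorage", "vector": "MilvusVectorDBStorage",
        "kv": "RedisKVStorage", "doc_status": "PGDocStatusStorage",
    },
}
MIGRATION_PATH = ["Development", "Minimal", "Balanced", "High-Performance"]


def changed_components(src, dst):
    """Map each storage type that differs to its (old, new) implementation."""
    return {
        storage_type: (STACKS[src][storage_type], STACKS[dst][storage_type])
        for storage_type in STACKS[src]
        if STACKS[src][storage_type] != STACKS[dst][storage_type]
    }


if __name__ == "__main__":
    # Print the data that has to be moved at each step of the path.
    for src, dst in zip(MIGRATION_PATH, MIGRATION_PATH[1:]):
        print(f"{src} -> {dst}: {changed_components(src, dst)}")
```

Under these definitions the graph storage stays on NetworkX until the final step, and the move from Minimal to Balanced replaces the KV storage as well as the vector storage.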
### Data Migration Tools
- **Database-specific**: Use native tools (pg_dump, neo4j-admin, etc.)
- **LightRAG native**: Built-in export/import capabilities
- **Cross-platform**: JSON export for universal compatibility
## Recommendations by Use Case
### 📚 **Documentation/Knowledge Base**
- **Small (<10K docs)**: Development Stack
- **Medium (<100K docs)**: Minimal Stack
- **Large (>100K docs)**: Balanced Stack
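
These size thresholds translate directly into a small lookup. The sketch below is one possible encoding; the `recommend_stack` helper is hypothetical, and because the list does not say where exactly 100K documents falls, treating it as "Large" is an assumption.

```python
def recommend_stack(doc_count):
    """Suggest a stack for a documentation or knowledge-base corpus by size."""
    if doc_count < 0:
        raise ValueError("doc_count must be non-negative")
    if doc_count < 10_000:
        return "Development Stack"
    if doc_count < 100_000:
        return "Minimal Stack"
    # Assumption: a corpus of exactly 100K documents is treated as large.
    return "Balanced Stack"
```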
### 🔬 **Research/Analytics**
- **Graph-heavy**: High-Performance Stack with Neo4j
- **Vector-heavy**: Balanced Stack with Milvus substituted for Qdrant
- **Mixed workload**: Balanced Stack
### 💼 **Enterprise**
- **High Availability**: High-Performance Stack with clustering
- **Budget Conscious**: Minimal Stack with PostgreSQL
- **Regulatory**: On-premises with full control
### 🚀 **Startups/SMBs**
- **MVP**: Development Stack
- **Growing**: Minimal Stack
- **Scaling**: Balanced Stack
## Conclusion
The **Minimal Stack** (PostgreSQL + NetworkX) provides the best balance of performance, complexity, and cost for most use cases. It offers:
- ✅ Production-ready reliability
- ✅ Reasonable performance for medium-scale deployments
- ✅ Low operational overhead
- ✅ Clear upgrade path to specialized components
For specialized needs:
- **High graph complexity** → Add Neo4j
- **High vector performance** → Add Qdrant/Milvus
- **High concurrency KV** → Add Redis
The modular architecture allows gradual optimization based on actual performance bottlenecks rather than premature optimization.
---
*Report generated based on LightRAG v1.3.7 implementation analysis*