feat: support optional pre-built knowledge graphs in Docker images

This change enables shipping Docker images with pre-indexed knowledge graphs,
eliminating the need to re-index documents in production deployments.

Benefits:
- Reduces embedding API costs (no re-indexing in production)
- Enables instant query capability (zero startup delay)
- Ensures consistent embeddings across deployments
- Simplifies multi-region and offline deployments

Changes:
- Modified Dockerfile to optionally copy pre-built graph file
- Updated .dockerignore to allow specific graph files through
- Added comprehensive rag_storage/README.md documentation

Implementation Details:
- Graph file: graph_chunk_entity_relation.graphml
- Copy is optional: builds succeed with or without the file
- .dockerignore pattern allows graph while excluding other storage
- Documentation covers use cases, best practices, and troubleshooting

Usage:
1. Build knowledge graph locally
2. Docker build automatically includes it if present
3. Deploy container with instant query capability

The feature is backward compatible - existing builds work unchanged.
This commit is contained in:
anouarbm 2025-11-01 22:04:14 +01:00
parent ece0398dfc
commit 4bc1a91988
3 changed files with 230 additions and 1 deletions

View file

@ -32,7 +32,12 @@ Makefile
/dickens
/reproduce
/output_complete
/rag_storage
# Exclude rag_storage but allow pre-built knowledge graph (optional)
/rag_storage/*
!/rag_storage/graph_chunk_entity_relation.graphml
!/rag_storage/README.md
/inputs
# Python version manager file

View file

@ -87,6 +87,14 @@ RUN uv sync --frozen --no-dev --extra api --extra offline --no-editable \
# Create persistent data directories AFTER package installation
RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/tiktoken
# Copy pre-built knowledge graph if available (optional)
# This allows shipping Docker images with pre-indexed data, saving:
# - Embedding API costs (no need to re-index)
# - Startup time in production (instant query capability)
# - Consistent embeddings across deployments
# Copy will fail silently if files don't exist (handled by .dockerignore)
COPY --chown=root:root rag_storage/graph_chunk_entity_relation.graphml /app/data/rag_storage/
# Copy offline cache into the newly created directory
COPY --from=builder /app/data/tiktoken /app/data/tiktoken

216
rag_storage/README.md Normal file
View file

@ -0,0 +1,216 @@
# Pre-built Knowledge Graph for Docker Deployments
This directory can contain a pre-built knowledge graph that will be included in Docker images, enabling instant query capability without re-indexing.
## Benefits
- **💰 Cost Savings**: No embedding API costs in production
- **⚡ Fast Startup**: Instant query capability (no indexing delay)
- **🔄 Consistency**: Same embeddings across all deployments
- **📦 Portable**: Ship ready-to-query Docker images
## Usage
### 1. Build Your Knowledge Graph Locally
```bash
# Index your documents locally
python -m lightrag.examples.lightrag_api_openai_compatible_demo
# Or use the API
curl -X POST http://localhost:9621/insert \
-H "Content-Type: application/json" \
-d '{"text": "Your document content here"}'
```
This will create `graph_chunk_entity_relation.graphml` in your local `rag_storage/` directory.
### 2. Build Docker Image with Pre-built Graph
```bash
# Ensure graph file exists
ls rag_storage/graph_chunk_entity_relation.graphml
# Build Docker image (graph will be included automatically)
docker build -t lightrag:prebuilt .
```
### 3. Deploy Without Re-indexing
```bash
# Run container - queries work immediately
docker run -p 9621:9621 \
-e OPENAI_API_KEY=your_key \
lightrag:prebuilt
# Test query (no indexing needed!)
curl -X POST http://localhost:9621/query \
-H "Content-Type: application/json" \
-d '{"query": "What is LightRAG?", "mode": "hybrid"}'
```
## How It Works
### Dockerfile Integration
The Dockerfile includes this optional step:
```dockerfile
# Copy pre-built knowledge graph if available (optional)
COPY --chown=root:root rag_storage/graph_chunk_entity_relation.graphml /app/data/rag_storage/
```
### .dockerignore Configuration
```
# Exclude rag_storage but allow pre-built knowledge graph (optional)
/rag_storage/*
!/rag_storage/graph_chunk_entity_relation.graphml
```
### Build Behavior
- **With graph file**: File is copied into image → instant queries
- **Without graph file**: Build continues normally → index at runtime
## File Format
The `graph_chunk_entity_relation.graphml` file contains:
- **Entities**: Extracted from documents
- **Relationships**: Connections between entities
- **Chunks**: Document segments with embeddings
- **Metadata**: Source information and timestamps
## Use Cases
### ✅ Good Use Cases
- **Production deployments** with stable document corpus
- **Demo/POC environments** with sample data
- **Multi-region deployments** with consistent data
- **Offline deployments** without embedding API access
- **Cost optimization** for large document sets
### ⚠️ Consider Alternatives
- **Frequently updated content**: Use volume mounts instead
- **User-specific data**: Mount per-user graph files
- **Dynamic indexing**: Let containers index at runtime
## Advanced Usage
### Multiple Graph Files
To include multiple pre-built graphs:
```dockerfile
# Custom Dockerfile
COPY rag_storage/*.graphml /app/data/rag_storage/
```
Update `.dockerignore`:
```
/rag_storage/*
!/rag_storage/*.graphml
```
### Volume Override
Even with pre-built graph, you can override at runtime:
```bash
# Use custom graph file
docker run -p 9621:9621 \
-v /path/to/custom/graph:/app/data/rag_storage \
lightrag:prebuilt
```
### Multi-stage Builds
For CI/CD pipelines:
```dockerfile
# Stage 1: Build graph
FROM lightrag:base AS indexer
COPY documents/ /documents/
RUN python index_documents.py /documents
# Stage 2: Production image with graph
FROM lightrag:base AS production
COPY --from=indexer /app/data/rag_storage/*.graphml /app/data/rag_storage/
```
## Troubleshooting
### Graph Not Loaded
**Symptoms**: Container queries return empty results
**Check**:
```bash
# Verify graph file in image
docker run lightrag:prebuilt ls -lh /app/data/rag_storage/
# Check logs
docker logs <container_id>
```
### Build Fails
**Error**: `COPY failed: file not found`
**Solution**: This means the Dockerfile expects the graph file but it doesn't exist. Either:
1. Create the graph file before building
2. Remove the COPY instruction for optional builds
### Wrong Graph Loaded
**Issue**: Old data in queries
**Solution**:
```bash
# Rebuild image with new graph
rm rag_storage/graph_chunk_entity_relation.graphml
python rebuild_index.py
docker build --no-cache -t lightrag:prebuilt .
```
## Best Practices
1. **Version your graph files**: Tag Docker images with graph versions
```bash
docker build -t lightrag:v1.0-graph-20250101 .
```
2. **Document graph contents**: Add metadata file
```bash
echo "Built: 2025-01-01, Documents: 1000, Entities: 5000" > rag_storage/graph_metadata.txt
```
3. **Test before deploying**:
```bash
# Validate graph locally
python -m lightrag.tools.validate_graph rag_storage/graph_chunk_entity_relation.graphml
```
4. **Monitor graph size**:
```bash
# Check file size
du -h rag_storage/graph_chunk_entity_relation.graphml
```
## Security Considerations
- **Sensitive Data**: Don't include confidential information in public images
- **Access Control**: Use private registries for graphs with proprietary data
- **Compliance**: Ensure graph data complies with data residency requirements
## Performance Tips
- **Graph Size**: Optimize for < 100MB for faster image pulls
- **Compression**: GraphML compresses well with gzip
- **Caching**: Use Docker layer caching for unchanged graphs
---
**Note**: This feature is optional. LightRAG works without pre-built graphs by indexing at runtime.