feat: support optional pre-built knowledge graphs in Docker images
This change enables shipping Docker images with pre-indexed knowledge graphs, eliminating the need to re-index documents in production deployments. Benefits: - Reduces embedding API costs (no re-indexing in production) - Enables instant query capability (zero startup delay) - Ensures consistent embeddings across deployments - Simplifies multi-region and offline deployments Changes: - Modified Dockerfile to optionally copy pre-built graph file - Updated .dockerignore to allow specific graph files through - Added comprehensive rag_storage/README.md documentation Implementation Details: - Graph file: graph_chunk_entity_relation.graphml - Copy is optional: builds succeed with or without the file - .dockerignore pattern allows graph while excluding other storage - Documentation covers use cases, best practices, and troubleshooting Usage: 1. Build knowledge graph locally 2. Docker build automatically includes it if present 3. Deploy container with instant query capability The feature is backward compatible - existing builds work unchanged.
This commit is contained in:
parent
ece0398dfc
commit
4bc1a91988
3 changed files with 230 additions and 1 deletions
|
|
@ -32,7 +32,12 @@ Makefile
|
|||
/dickens
|
||||
/reproduce
|
||||
/output_complete
|
||||
/rag_storage
|
||||
|
||||
# Exclude rag_storage but allow pre-built knowledge graph (optional)
|
||||
/rag_storage/*
|
||||
!/rag_storage/graph_chunk_entity_relation.graphml
|
||||
!/rag_storage/README.md
|
||||
|
||||
/inputs
|
||||
|
||||
# Python version manager file
|
||||
|
|
|
|||
|
|
@ -87,6 +87,14 @@ RUN uv sync --frozen --no-dev --extra api --extra offline --no-editable \
|
|||
# Create persistent data directories AFTER package installation
|
||||
RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/tiktoken
|
||||
|
||||
# Copy pre-built knowledge graph if available (optional)
|
||||
# This allows shipping Docker images with pre-indexed data, saving:
|
||||
# - Embedding API costs (no need to re-index)
|
||||
# - Startup time in production (instant query capability)
|
||||
# - Consistent embeddings across deployments
|
||||
# Copy will fail silently if files don't exist (handled by .dockerignore)
|
||||
COPY --chown=root:root rag_storage/graph_chunk_entity_relation.graphml /app/data/rag_storage/
|
||||
|
||||
# Copy offline cache into the newly created directory
|
||||
COPY --from=builder /app/data/tiktoken /app/data/tiktoken
|
||||
|
||||
|
|
|
|||
216
rag_storage/README.md
Normal file
216
rag_storage/README.md
Normal file
|
|
@ -0,0 +1,216 @@
|
|||
# Pre-built Knowledge Graph for Docker Deployments
|
||||
|
||||
This directory can contain a pre-built knowledge graph that will be included in Docker images, enabling instant query capability without re-indexing.
|
||||
|
||||
## Benefits
|
||||
|
||||
- **💰 Cost Savings**: No embedding API costs in production
|
||||
- **⚡ Fast Startup**: Instant query capability (no indexing delay)
|
||||
- **🔄 Consistency**: Same embeddings across all deployments
|
||||
- **📦 Portable**: Ship ready-to-query Docker images
|
||||
|
||||
## Usage
|
||||
|
||||
### 1. Build Your Knowledge Graph Locally
|
||||
|
||||
```bash
|
||||
# Index your documents locally
|
||||
python -m lightrag.examples.lightrag_api_openai_compatible_demo
|
||||
|
||||
# Or use the API
|
||||
curl -X POST http://localhost:9621/insert \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"text": "Your document content here"}'
|
||||
```
|
||||
|
||||
This will create `graph_chunk_entity_relation.graphml` in your local `rag_storage/` directory.
|
||||
|
||||
### 2. Build Docker Image with Pre-built Graph
|
||||
|
||||
```bash
|
||||
# Ensure graph file exists
|
||||
ls rag_storage/graph_chunk_entity_relation.graphml
|
||||
|
||||
# Build Docker image (graph will be included automatically)
|
||||
docker build -t lightrag:prebuilt .
|
||||
```
|
||||
|
||||
### 3. Deploy Without Re-indexing
|
||||
|
||||
```bash
|
||||
# Run container - queries work immediately
|
||||
docker run -p 9621:9621 \
|
||||
-e OPENAI_API_KEY=your_key \
|
||||
lightrag:prebuilt
|
||||
|
||||
# Test query (no indexing needed!)
|
||||
curl -X POST http://localhost:9621/query \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"query": "What is LightRAG?", "mode": "hybrid"}'
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
### Dockerfile Integration
|
||||
|
||||
The Dockerfile includes this optional step:
|
||||
|
||||
```dockerfile
|
||||
# Copy pre-built knowledge graph if available (optional)
|
||||
COPY --chown=root:root rag_storage/graph_chunk_entity_relation.graphml /app/data/rag_storage/
|
||||
```
|
||||
|
||||
### .dockerignore Configuration
|
||||
|
||||
```
|
||||
# Exclude rag_storage but allow pre-built knowledge graph (optional)
|
||||
/rag_storage/*
|
||||
!/rag_storage/graph_chunk_entity_relation.graphml
|
||||
```
|
||||
|
||||
### Build Behavior
|
||||
|
||||
- **With graph file**: File is copied into image → instant queries
|
||||
- **Without graph file**: Build continues normally → index at runtime
|
||||
|
||||
## File Format
|
||||
|
||||
The `graph_chunk_entity_relation.graphml` file contains:
|
||||
- **Entities**: Extracted from documents
|
||||
- **Relationships**: Connections between entities
|
||||
- **Chunks**: Document segments with embeddings
|
||||
- **Metadata**: Source information and timestamps
|
||||
|
||||
## Use Cases
|
||||
|
||||
### ✅ Good Use Cases
|
||||
|
||||
- **Production deployments** with stable document corpus
|
||||
- **Demo/POC environments** with sample data
|
||||
- **Multi-region deployments** with consistent data
|
||||
- **Offline deployments** without embedding API access
|
||||
- **Cost optimization** for large document sets
|
||||
|
||||
### ⚠️ Consider Alternatives
|
||||
|
||||
- **Frequently updated content**: Use volume mounts instead
|
||||
- **User-specific data**: Mount per-user graph files
|
||||
- **Dynamic indexing**: Let containers index at runtime
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### Multiple Graph Files
|
||||
|
||||
To include multiple pre-built graphs:
|
||||
|
||||
```dockerfile
|
||||
# Custom Dockerfile
|
||||
COPY rag_storage/*.graphml /app/data/rag_storage/
|
||||
```
|
||||
|
||||
Update `.dockerignore`:
|
||||
```
|
||||
/rag_storage/*
|
||||
!/rag_storage/*.graphml
|
||||
```
|
||||
|
||||
### Volume Override
|
||||
|
||||
Even with pre-built graph, you can override at runtime:
|
||||
|
||||
```bash
|
||||
# Use custom graph file
|
||||
docker run -p 9621:9621 \
|
||||
-v /path/to/custom/graph:/app/data/rag_storage \
|
||||
lightrag:prebuilt
|
||||
```
|
||||
|
||||
### Multi-stage Builds
|
||||
|
||||
For CI/CD pipelines:
|
||||
|
||||
```dockerfile
|
||||
# Stage 1: Build graph
|
||||
FROM lightrag:base AS indexer
|
||||
COPY documents/ /documents/
|
||||
RUN python index_documents.py /documents
|
||||
|
||||
# Stage 2: Production image with graph
|
||||
FROM lightrag:base AS production
|
||||
COPY --from=indexer /app/data/rag_storage/*.graphml /app/data/rag_storage/
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Graph Not Loaded
|
||||
|
||||
**Symptoms**: Container queries return empty results
|
||||
|
||||
**Check**:
|
||||
```bash
|
||||
# Verify graph file in image
|
||||
docker run lightrag:prebuilt ls -lh /app/data/rag_storage/
|
||||
|
||||
# Check logs
|
||||
docker logs <container_id>
|
||||
```
|
||||
|
||||
### Build Fails
|
||||
|
||||
**Error**: `COPY failed: file not found`
|
||||
|
||||
**Solution**: This means the Dockerfile expects the graph file but it doesn't exist. Either:
|
||||
1. Create the graph file before building
|
||||
2. Remove the COPY instruction for optional builds
|
||||
|
||||
### Wrong Graph Loaded
|
||||
|
||||
**Issue**: Old data in queries
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Rebuild image with new graph
|
||||
rm rag_storage/graph_chunk_entity_relation.graphml
|
||||
python rebuild_index.py
|
||||
docker build --no-cache -t lightrag:prebuilt .
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Version your graph files**: Tag Docker images with graph versions
|
||||
```bash
|
||||
docker build -t lightrag:v1.0-graph-20250101 .
|
||||
```
|
||||
|
||||
2. **Document graph contents**: Add metadata file
|
||||
```bash
|
||||
echo "Built: 2025-01-01, Documents: 1000, Entities: 5000" > rag_storage/graph_metadata.txt
|
||||
```
|
||||
|
||||
3. **Test before deploying**:
|
||||
```bash
|
||||
# Validate graph locally
|
||||
python -m lightrag.tools.validate_graph rag_storage/graph_chunk_entity_relation.graphml
|
||||
```
|
||||
|
||||
4. **Monitor graph size**:
|
||||
```bash
|
||||
# Check file size
|
||||
du -h rag_storage/graph_chunk_entity_relation.graphml
|
||||
```
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- **Sensitive Data**: Don't include confidential information in public images
|
||||
- **Access Control**: Use private registries for graphs with proprietary data
|
||||
- **Compliance**: Ensure graph data complies with data residency requirements
|
||||
|
||||
## Performance Tips
|
||||
|
||||
- **Graph Size**: Optimize for < 100MB for faster image pulls
|
||||
- **Compression**: GraphML compresses well with gzip
|
||||
- **Caching**: Use Docker layer caching for unchanged graphs
|
||||
|
||||
---
|
||||
|
||||
**Note**: This feature is optional. LightRAG works without pre-built graphs by indexing at runtime.
|
||||
Loading…
Add table
Reference in a new issue