📊 LightRAG Evaluation Framework
RAGAS-based offline evaluation of your LightRAG system.
What is RAGAS?
RAGAS (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs.
Instead of requiring human-annotated ground truth, RAGAS uses state-of-the-art evaluation metrics:
Core Metrics
| Metric | What It Measures | Good Score |
|---|---|---|
| Faithfulness | Is the answer factually accurate based on retrieved context? | > 0.80 |
| Answer Relevance | Is the answer relevant to the user's question? | > 0.80 |
| Context Recall | Was all relevant information retrieved from documents? | > 0.80 |
| Context Precision | Is retrieved context clean without irrelevant noise? | > 0.80 |
| RAGAS Score | Overall quality metric (average of above) | > 0.80 |
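For orientation, this is roughly how these metrics are computed with the `ragas` library itself. The metric names and dataset columns below follow recent `ragas` releases and may differ slightly in your installed version, so treat this as a sketch rather than the framework's exact internals:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation record: question, generated answer, retrieved context
# chunks, and the expected ground-truth answer.
data = {
    "question": ["What does LightRAG use for retrieval?"],
    "answer": ["LightRAG combines vector search with a knowledge graph for retrieval."],
    "contexts": [["LightRAG retrieves context using vector similarity and graph traversal."]],
    "ground_truth": ["LightRAG uses vector search together with a knowledge graph."],
}

dataset = Dataset.from_dict(data)

# evaluate() calls the configured judge LLM for each metric and
# returns per-metric scores for the dataset.
scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(scores)
```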
📁 Structure
```
lightrag/evaluation/
├── eval_rag_quality.py                # Main evaluation script
├── sample_dataset.json                # Generic LightRAG test cases (not personal data)
├── __init__.py                        # Package init
├── results/                           # Output directory
│   ├── results_YYYYMMDD_HHMMSS.json   # Raw metrics in JSON
│   └── results_YYYYMMDD_HHMMSS.csv    # Metrics in CSV format
└── README.md                          # This file
```
Note: sample_dataset.json contains generic test questions about LightRAG features (RAG systems, vector databases, deployment, etc.). This is not personal portfolio data - you can use these questions directly to test your own LightRAG installation.
🚀 Quick Start
1. Install Dependencies
pip install ragas datasets langfuse
Or use your project dependencies (already included in pyproject.toml):
pip install -e ".[offline-llm]"
2. Run Evaluation
cd /path/to/LightRAG
python -m lightrag.evaluation.eval_rag_quality
Or directly:
python lightrag/evaluation/eval_rag_quality.py
3. View Results
Results are saved automatically in lightrag/evaluation/results/:
```
results/
├── results_20241023_143022.json   ← Raw metrics in JSON format
└── results_20241023_143022.csv    ← Metrics in CSV format (for spreadsheets)
```
Results include:
- ✅ Overall RAGAS score
- 📊 Per-metric averages (Faithfulness, Answer Relevance, Context Recall, Context Precision)
- 📋 Individual test case results
- 📈 Performance breakdown by question
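If you want to inspect a run programmatically rather than in a spreadsheet, here is a minimal sketch, assuming each record in the JSON file carries the `question` and `ragas_score` fields used in the Python API example later in this README:

```python
import json
from pathlib import Path

results_dir = Path("lightrag/evaluation/results")

# Timestamps in the filenames sort lexically, so the last one is the newest run
latest = sorted(results_dir.glob("results_*.json"))[-1]

with latest.open() as f:
    results = json.load(f)

# Field names are assumed from the Python API example below
for record in results:
    print(f"{record['question']}: RAGAS {record['ragas_score']:.2%}")
```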
📝 Test Dataset
The included sample_dataset.json contains generic example questions about LightRAG (RAG systems, vector databases, deployment, etc.). This is NOT personal data - it's meant as a template.
Important: You should replace these with test questions based on YOUR data that you've injected into your RAG system.
Creating Your Own Test Cases
Edit sample_dataset.json with questions relevant to your indexed documents:
```json
{
  "test_cases": [
    {
      "question": "Question based on your documents",
      "ground_truth": "Expected answer from your data",
      "context": "topic_category"
    }
  ]
}
```
Example (for a technical portfolio):
```json
{
  "question": "Which projects use PyTorch?",
  "ground_truth": "The Neural ODE Project uses PyTorch with TorchODE library for continuous-time neural networks.",
  "context": "ml_projects"
}
```
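Before running the evaluation, a quick sanity check of your edited dataset catches missing fields early. This sketch assumes only the `test_cases` structure shown above:

```python
import json
from pathlib import Path

dataset_path = Path("lightrag/evaluation/sample_dataset.json")
required_keys = {"question", "ground_truth", "context"}

with dataset_path.open() as f:
    dataset = json.load(f)

# Every test case must provide the three fields used by the evaluator
for i, case in enumerate(dataset["test_cases"]):
    missing = required_keys - case.keys()
    if missing:
        raise ValueError(f"Test case {i} is missing keys: {missing}")

print(f"OK: {len(dataset['test_cases'])} test cases validated")
```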
🔧 Integration with Your RAG System
Currently, the evaluation script uses ground truth as mock responses. To evaluate your actual LightRAG:
Step 1: Update generate_rag_response()
In eval_rag_quality.py, replace the mock implementation:
```python
async def generate_rag_response(self, question: str, context: str = None) -> Dict[str, str]:
    """Generate RAG response using your LightRAG system"""
    from lightrag import LightRAG

    rag = LightRAG(
        working_dir="./rag_storage",
        llm_model_func=your_llm_function
    )
    response = await rag.aquery(question)
    return {
        "answer": response,
        "context": "context_from_kg"  # If available
    }
```
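If you also want the metrics to see the context that was actually retrieved (instead of the placeholder string above), one option is to ask LightRAG for the context only. The `QueryParam(mode=..., only_need_context=True)` usage below is an assumption about your installed LightRAG version, so verify it before relying on it:

```python
from lightrag import LightRAG, QueryParam

async def generate_rag_response(self, question: str, context: str = None) -> dict:
    rag = LightRAG(
        working_dir="./rag_storage",
        llm_model_func=your_llm_function,  # your configured LLM callable
    )

    # Full answer from the knowledge-graph-augmented query
    answer = await rag.aquery(question, param=QueryParam(mode="hybrid"))

    # Retrieved context only, so Context Recall/Precision score real retrieval
    retrieved_context = await rag.aquery(
        question, param=QueryParam(mode="hybrid", only_need_context=True)
    )

    return {"answer": answer, "context": retrieved_context}
```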
Step 2: Run Evaluation
python lightrag/evaluation/eval_rag_quality.py
📊 Interpreting Results
Score Ranges
- 0.80-1.00: ✅ Excellent (Production-ready)
- 0.60-0.80: ⚠️ Good (Room for improvement)
- 0.40-0.60: ❌ Poor (Needs optimization)
- 0.00-0.40: 🔴 Critical (Major issues)
What Low Scores Mean
| Metric | Low Score Indicates |
|---|---|
| Faithfulness | Responses contain hallucinations or incorrect information |
| Answer Relevance | Answers don't match what users asked |
| Context Recall | Missing important information in retrieval |
| Context Precision | Retrieved documents contain irrelevant noise |
Optimization Tips
- Low Faithfulness:
  - Improve entity extraction quality
  - Better document chunking
  - Tune retrieval temperature
- Low Answer Relevance:
  - Improve prompt engineering
  - Better query understanding
  - Check semantic similarity threshold
- Low Context Recall:
  - Increase retrieval `top_k` results
  - Improve embedding model
  - Better document preprocessing
- Low Context Precision:
  - Smaller, focused chunks
  - Better filtering
  - Improve chunking strategy (see the configuration sketch below)
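For the chunking-related tips, the relevant knobs are typically set when constructing `LightRAG`. The parameter names below (`chunk_token_size`, `chunk_overlap_token_size`) are assumptions based on common LightRAG configurations, so check them against your installed version:

```python
from lightrag import LightRAG

# Smaller chunks tend to raise Context Precision (less irrelevant text per chunk),
# while more overlap can help Context Recall (less information lost at chunk boundaries).
rag = LightRAG(
    working_dir="./rag_storage",
    llm_model_func=your_llm_function,  # your configured LLM callable
    chunk_token_size=800,
    chunk_overlap_token_size=100,
)
```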
📈 Usage Examples
Python API
```python
import asyncio
from lightrag.evaluation import RAGEvaluator

async def main():
    evaluator = RAGEvaluator()
    results = await evaluator.run()

    # Access results
    for result in results:
        print(f"Question: {result['question']}")
        print(f"RAGAS Score: {result['ragas_score']:.2%}")
        print(f"Metrics: {result['metrics']}")

asyncio.run(main())
```
Custom Dataset
```python
evaluator = RAGEvaluator(test_dataset_path="custom_tests.json")
results = await evaluator.run()
```
Batch Evaluation
```python
from pathlib import Path
import json

results_dir = Path("lightrag/evaluation/results")
results_dir.mkdir(exist_ok=True)

# Run multiple evaluations
for i in range(3):
    evaluator = RAGEvaluator()
    results = await evaluator.run()
```
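To turn repeated runs into a trend, you can average the per-question `ragas_score` values from each run (field name taken from the Python API example above). A sketch, continuing inside the same async context as the snippets above:

```python
from statistics import mean

run_averages = []
for _ in range(3):
    evaluator = RAGEvaluator()
    results = await evaluator.run()
    run_averages.append(mean(r["ragas_score"] for r in results))

print("Average RAGAS score per run:", [f"{score:.2%}" for score in run_averages])
```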
🎯 Using Evaluation Results
What the Metrics Tell You:
- ✅ Quality Metrics: Overall RAGAS score indicates system health
- ✅ Evaluation Framework: Automated quality assessment with RAGAS
- ✅ Best Practices: Offline evaluation pipeline for continuous improvement
- ✅ Production-Ready: Metrics-driven system optimization
Example Use Cases:
- Track RAG quality over time as you update your documents
- Compare different retrieval modes (local, global, hybrid, mix); see the sketch after this list
- Measure impact of chunking strategy changes
- Validate system performance before deployment
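To compare retrieval modes, one option is to query LightRAG directly with each mode and evaluate the answers against the same test cases. This sketch assumes the standard `QueryParam(mode=...)` options listed above:

```python
from lightrag import LightRAG, QueryParam

async def answers_per_mode(rag: LightRAG, question: str) -> dict:
    """Collect one answer per retrieval mode for side-by-side evaluation."""
    answers = {}
    for mode in ("local", "global", "hybrid", "mix"):
        answers[mode] = await rag.aquery(question, param=QueryParam(mode=mode))
    return answers
```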
🔗 Related Features
- LangFuse Integration: Real-time observability of production RAG calls
- LightRAG: Core RAG system with entity extraction and knowledge graphs
- Metrics: See `results/` for detailed evaluation metrics
📚 Resources
🐛 Troubleshooting
"ModuleNotFoundError: No module named 'ragas'"
pip install ragas datasets
"No sample_dataset.json found"
Make sure you're running from the project root:
cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
"LLM API errors during evaluation"
The evaluation uses your configured LLM (OpenAI by default). Ensure:
- API keys are set in `.env`
- You have sufficient API quota
- Your network connection is stable
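If the evaluation fails immediately, a quick pre-flight check for the API key can save time. The sketch below assumes the default OpenAI setup (`OPENAI_API_KEY`) and that `python-dotenv` is installed:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read variables from .env in the current working directory

if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set; add it to your .env file")
```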
Results showing 0 scores
The current implementation uses ground truth as mock responses, so scores should be near-perfect (the "generated answer" equals the ground truth). If you see 0 scores instead, the evaluation LLM calls have likely failed; check your API keys and quota (see the previous item).
To use actual RAG results:
- Implement the `generate_rag_response()` method
- Connect to your LightRAG instance
- Run evaluation again
📝 Next Steps
- ✅ Review the test dataset in `sample_dataset.json`
- ✅ Run `python lightrag/evaluation/eval_rag_quality.py`
- ✅ Review the JSON and CSV results in `results/`
- 🔄 Integrate with your actual LightRAG system
- 📊 Monitor metrics over time
- 🎯 Use insights for optimization
Happy Evaluating! 🚀