Add 5 markdown documents that users can index to reproduce evaluation results.

Changes:
- Add `sample_documents/` folder with 5 markdown files covering LightRAG features
- Update `sample_dataset.json` with 3 improved, specific test questions
- Shorten and correct the evaluation README (removed outdated info about mock responses)
- Add `sample_documents` reference with expected ~95% RAGAS score

Test results with the sample documents:
- Average RAGAS Score: 95.28%
- Faithfulness: 100%, Answer Relevance: 96.67%
- Context Recall: 88.89%, Context Precision: 95.56%
# 📊 LightRAG Evaluation Framework

RAGAS-based offline evaluation of your LightRAG system.

## What is RAGAS?

**RAGAS** (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs.

Instead of requiring human-annotated ground truth for every judgment, RAGAS scores your system with LLM-based evaluation metrics:

### Core Metrics

| Metric | What It Measures | Good Score |
|--------|------------------|------------|
| **Faithfulness** | Is the answer factually accurate based on the retrieved context? | > 0.80 |
| **Answer Relevance** | Is the answer relevant to the user's question? | > 0.80 |
| **Context Recall** | Was all relevant information retrieved from the documents? | > 0.80 |
| **Context Precision** | Is the retrieved context clean, without irrelevant noise? | > 0.80 |
| **RAGAS Score** | Overall quality metric (average of the above) | > 0.80 |

The snippet after this table sketches how these metrics are computed.
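
It is only an illustration, not the project's `eval_rag_quality.py`: it assumes the classic `ragas.evaluate` API with a Hugging Face `Dataset`, and metric names or dataset columns may differ across `ragas` versions.

```python
# Hedged sketch: score one question/answer pair with the four RAGAS metrics.
# Requires an LLM backend configured for ragas (an OpenAI key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

data = Dataset.from_dict({
    "question": ["What is LightRAG?"],
    "answer": ["LightRAG is a fast, graph-based retrieval-augmented generation framework."],
    "contexts": [[
        "LightRAG is a simple and fast RAG framework that combines a knowledge graph "
        "with vector retrieval."
    ]],
    "ground_truth": ["LightRAG is a simple and fast RAG framework."],
})

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)  # per-metric scores; averaging them gives an overall RAGAS score
```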
---

## 📁 Structure

```
lightrag/evaluation/
├── eval_rag_quality.py                # Main evaluation script
├── sample_dataset.json                # 3 test questions about LightRAG
├── sample_documents/                  # Matching markdown files for testing
│   ├── 01_lightrag_overview.md
│   ├── 02_rag_architecture.md
│   ├── 03_lightrag_improvements.md
│   ├── 04_supported_databases.md
│   ├── 05_evaluation_and_deployment.md
│   └── README.md
├── __init__.py                        # Package init
├── results/                           # Output directory
│   ├── results_YYYYMMDD_HHMMSS.json   # Raw metrics in JSON
│   └── results_YYYYMMDD_HHMMSS.csv    # Metrics in CSV format
└── README.md                          # This file
```

**Quick Test:** Index the files from `sample_documents/` into LightRAG, then run the evaluator to reproduce the results (~89-100% RAGAS score per question). A scripted way to do the indexing step is sketched below.
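
This is a minimal sketch that posts each sample document to a running LightRAG API server. The `POST /documents/text` route and the `http://localhost:9621` address are assumptions based on the default setup; check your server's `/docs` page for the exact route and payload.

```python
# Hedged sketch: bulk-index the sample documents through the LightRAG API.
from pathlib import Path

import requests

BASE_URL = "http://localhost:9621"  # default LightRAG API address (assumed)
DOCS_DIR = Path("lightrag/evaluation/sample_documents")

for md_file in sorted(DOCS_DIR.glob("0*.md")):  # the five numbered sample files
    resp = requests.post(
        f"{BASE_URL}/documents/text",            # assumed route -- verify via /docs
        json={"text": md_file.read_text(encoding="utf-8")},
        timeout=60,
    )
    resp.raise_for_status()
    print(f"Indexed {md_file.name}: HTTP {resp.status_code}")
```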
---
## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install ragas datasets langfuse
```

Or install them through the project's optional dependencies (already included in `pyproject.toml`):

```bash
pip install -e ".[offline-llm]"
```

### 2. Run Evaluation

```bash
cd /path/to/LightRAG
python -m lightrag.evaluation.eval_rag_quality
```

Or directly:

```bash
python lightrag/evaluation/eval_rag_quality.py
```

### 3. View Results

Results are saved automatically in `lightrag/evaluation/results/`:

```
results/
├── results_20241023_143022.json   ← Raw metrics in JSON format
└── results_20241023_143022.csv    ← Metrics in CSV format (for spreadsheets)
```

**Results include:**

- ✅ Overall RAGAS score
- 📊 Per-metric averages (Faithfulness, Answer Relevance, Context Recall, Context Precision)
- 📋 Individual test case results
- 📈 Performance breakdown by question

A quick way to tabulate the per-metric averages from the CSV is sketched below.
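
The metric column names used here are assumptions; match them to the header row of the CSV your run actually produces.

```python
# Hedged sketch: summarize a results CSV with pandas.
import pandas as pd

# Point this at the CSV produced by your run.
df = pd.read_csv("lightrag/evaluation/results/results_20241023_143022.csv")

# Assumed column names -- adjust to the actual CSV header.
metric_cols = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]
available = [col for col in metric_cols if col in df.columns]

per_metric = df[available].mean()
print(per_metric)                                 # per-metric averages
print("Overall RAGAS score:", per_metric.mean())  # simple average of the metrics
```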
---

## 📝 Test Dataset

`sample_dataset.json` contains 3 test questions about LightRAG that match the files in `sample_documents/`. Replace them with questions that match YOUR indexed documents.

**Custom Test Cases:**

```json
{
  "test_cases": [
    {
      "question": "Your question here",
      "ground_truth": "Expected answer from your data",
      "context": "topic"
    }
  ]
}
```
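
Before a long evaluation run, it can help to sanity-check an edited dataset. A minimal sketch, assuming the three keys shown above are the ones the evaluator expects:

```python
# Hedged sketch: validate sample_dataset.json before running the evaluator.
import json
from pathlib import Path

dataset_path = Path("lightrag/evaluation/sample_dataset.json")
data = json.loads(dataset_path.read_text(encoding="utf-8"))

required = {"question", "ground_truth", "context"}  # assumed required keys
for i, case in enumerate(data["test_cases"]):
    missing = required - case.keys()
    if missing:
        raise ValueError(f"Test case {i} is missing keys: {missing}")

print(f"{len(data['test_cases'])} test cases look well-formed.")
```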
---

## 📊 Interpreting Results

### Score Ranges

- **0.80-1.00**: ✅ Excellent (Production-ready)
- **0.60-0.80**: ⚠️ Good (Room for improvement)
- **0.40-0.60**: ❌ Poor (Needs optimization)
- **0.00-0.40**: 🔴 Critical (Major issues)

### What Low Scores Mean

| Metric | Low Score Indicates |
|--------|---------------------|
| **Faithfulness** | Responses contain hallucinations or incorrect information |
| **Answer Relevance** | Answers don't match what users asked |
| **Context Recall** | Important information is missing from retrieval |
| **Context Precision** | Retrieved documents contain irrelevant noise |

### Optimization Tips

1. **Low Faithfulness**:
   - Improve entity extraction quality
   - Improve document chunking
   - Lower the LLM temperature used for answer generation

2. **Low Answer Relevance**:
   - Improve prompt engineering
   - Improve query understanding
   - Check the semantic similarity threshold

3. **Low Context Recall**:
   - Increase the retrieval `top_k`
   - Use a stronger embedding model
   - Improve document preprocessing

4. **Low Context Precision**:
   - Use smaller, more focused chunks
   - Filter retrieved context more aggressively
   - Improve the chunking strategy

A sketch of experimenting with `top_k` at query time follows this list.
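
For example, if Context Recall is low you can try larger `top_k` values when querying. This sketch goes through the HTTP API and assumes `POST /query` accepts `query`, `mode`, and `top_k` fields and returns a `response` field; verify the parameters on your server's `/docs` page.

```python
# Hedged sketch: compare answers at different retrieval top_k values.
import requests

BASE_URL = "http://localhost:9621"
question = "What storage backends does LightRAG support?"  # example question

for top_k in (20, 40, 80):
    resp = requests.post(
        f"{BASE_URL}/query",
        json={"query": question, "mode": "hybrid", "top_k": top_k},  # assumed fields
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json().get("response", "")  # field name may differ per version
    print(f"top_k={top_k}: {answer[:120]}...")
```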
---
## 📚 Resources

- [RAGAS Documentation](https://docs.ragas.io/)
- [RAGAS GitHub](https://github.com/explodinggradients/ragas)
---
## 🐛 Troubleshooting

### "ModuleNotFoundError: No module named 'ragas'"

```bash
pip install ragas datasets
```

### "No sample_dataset.json found"

Make sure you're running from the project root:

```bash
cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
```

### "LLM API errors during evaluation"

The evaluation uses your configured LLM (OpenAI by default). Ensure that:

- API keys are set in `.env`
- You have sufficient API quota
- Your network connection is stable

A minimal pre-flight check for the API key is sketched below.
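
This assumes the default OpenAI backend; substitute whichever environment variable your LLM binding actually uses.

```python
# Hedged sketch: fail fast if the evaluation LLM credentials are missing.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

# OPENAI_API_KEY is assumed because OpenAI is the default backend.
if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set -- add it to .env before evaluating.")

print("LLM credentials found.")
```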
### Evaluation requires a running LightRAG API server

The evaluator queries a running LightRAG API server at `http://localhost:9621`. Make sure that:

1. The LightRAG API server is running (`python lightrag/api/lightrag_server.py`)
2. Documents are indexed in your LightRAG instance
3. The API is accessible at the configured URL

A quick connectivity check is sketched below.
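
The `/health` route here is an assumption; any route listed on the server's `/docs` page works for a reachability test.

```python
# Hedged sketch: verify the LightRAG API server is reachable.
import requests

BASE_URL = "http://localhost:9621"

try:
    resp = requests.get(f"{BASE_URL}/health", timeout=5)  # assumed route
    print("Server reachable:", resp.status_code)
except requests.ConnectionError:
    raise SystemExit(f"Cannot reach the LightRAG API at {BASE_URL} -- is the server running?")
```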
---
## 📝 Next Steps

1. Index documents into LightRAG (WebUI or API)
2. Start the LightRAG API server
3. Run `python lightrag/evaluation/eval_rag_quality.py`
4. Review results (JSON/CSV) in the `results/` folder
5. Adjust entity extraction prompts or retrieval settings based on scores

---

**Happy Evaluating! 🚀**