
# 📊 LightRAG Evaluation Framework

RAGAS-based offline evaluation of your LightRAG system.

## What is RAGAS?

RAGAS (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs.

Instead of requiring human-annotated ground truth, RAGAS uses state-of-the-art evaluation metrics:

### Core Metrics

| Metric | What It Measures | Good Score |
|---|---|---|
| Faithfulness | Is the answer factually accurate based on the retrieved context? | > 0.80 |
| Answer Relevance | Is the answer relevant to the user's question? | > 0.80 |
| Context Recall | Was all relevant information retrieved from the documents? | > 0.80 |
| Context Precision | Is the retrieved context clean, without irrelevant noise? | > 0.80 |
| RAGAS Score | Overall quality metric (average of the above) | > 0.80 |
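
For orientation, this is roughly how these metrics are computed with the `ragas` library. This is a minimal sketch, not the evaluation script itself: the dataset content is made up, column names can differ slightly between ragas versions, and `evaluate()` needs an LLM API key (OpenAI by default).

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per test case: the question, the RAG answer, the retrieved
# contexts, and a ground-truth reference (used by context_recall).
data = {
    "question": ["What storage backends does LightRAG support?"],
    "answer": ["LightRAG supports JSON file storage as well as database backends."],
    "contexts": [["LightRAG ships with several storage implementations, including JSON files."]],
    "ground_truth": ["LightRAG supports multiple storage backends, including JSON files."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)  # per-metric averages across all rows
```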

## 📁 Structure

```
lightrag/evaluation/
├── eval_rag_quality.py      # Main evaluation script
├── sample_dataset.json      # Generic LightRAG test cases (not personal data)
├── __init__.py              # Package init
├── results/                 # Output directory
│   ├── results_YYYYMMDD_HHMMSS.json    # Raw metrics in JSON
│   └── results_YYYYMMDD_HHMMSS.csv     # Metrics in CSV format
└── README.md                # This file
```

**Note:** `sample_dataset.json` contains generic test questions about LightRAG features (RAG systems, vector databases, deployment, etc.). It is not personal portfolio data; you can use these questions directly to test your own LightRAG installation.


## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install ragas datasets langfuse
```

Or use your project dependencies (already included in `pyproject.toml`):

```bash
pip install -e ".[offline-llm]"
```

### 2. Run Evaluation

```bash
cd /path/to/LightRAG
python -m lightrag.evaluation.eval_rag_quality
```

Or run it directly:

```bash
python lightrag/evaluation/eval_rag_quality.py
```

### 3. View Results

Results are saved automatically in `lightrag/evaluation/results/`:

```
results/
├── results_20241023_143022.json     ← Raw metrics in JSON format
└── results_20241023_143022.csv      ← Metrics in CSV format (for spreadsheets)
```

Results include:

- Overall RAGAS score
- 📊 Per-metric averages (Faithfulness, Answer Relevance, Context Recall, Context Precision)
- 📋 Individual test case results
- 📈 Performance breakdown by question
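
If you prefer working with the CSV programmatically, here is a minimal sketch using pandas (assumed to be installed separately; the exact column names depend on the script's output, so inspect them first):

```python
import pandas as pd

# Load one saved run; replace the filename with an actual results file.
df = pd.read_csv("lightrag/evaluation/results/results_20241023_143022.csv")

print(df.columns.tolist())          # see which metrics/columns were written
print(df.describe(include="all"))   # quick summary of the scores
```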

## 📝 Test Dataset

The included `sample_dataset.json` contains generic example questions about LightRAG (RAG systems, vector databases, deployment, etc.). This is NOT personal data; it is meant as a template.

**Important:** Replace these with test questions based on YOUR data, i.e. the documents you have actually inserted into your RAG system.

### Creating Your Own Test Cases

Edit `sample_dataset.json` with questions relevant to your indexed documents:

```json
{
  "test_cases": [
    {
      "question": "Question based on your documents",
      "ground_truth": "Expected answer from your data",
      "context": "topic_category"
    }
  ]
}
```

Example (for a technical portfolio):

```json
{
  "question": "Which projects use PyTorch?",
  "ground_truth": "The Neural ODE Project uses PyTorch with the TorchODE library for continuous-time neural networks.",
  "context": "ml_projects"
}
```

## 🔧 Integration with Your RAG System

Currently, the evaluation script uses ground truth as mock responses. To evaluate your actual LightRAG instance:

### Step 1: Update `generate_rag_response()`

In `eval_rag_quality.py`, replace the mock implementation:

```python
async def generate_rag_response(self, question: str, context: str = None) -> Dict[str, str]:
    """Generate RAG response using your LightRAG system"""
    from lightrag import LightRAG

    rag = LightRAG(
        working_dir="./rag_storage",
        llm_model_func=your_llm_function
    )

    response = await rag.aquery(question)

    return {
        "answer": response,
        "context": "context_from_kg"  # If available
    }
```
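
If you also want real retrieved context for the context-based metrics (instead of the `"context_from_kg"` placeholder), one option is to query LightRAG twice: once for the context only, once for the answer. A minimal sketch, assuming a LightRAG version whose `QueryParam` supports `only_need_context`, and a hypothetical `self.rag` attribute holding an instance you initialize once in the evaluator:

```python
from typing import Dict

from lightrag import LightRAG, QueryParam

async def generate_rag_response(self, question: str, context: str = None) -> Dict[str, str]:
    """Return both the generated answer and the context LightRAG retrieved for it."""
    rag: LightRAG = self.rag  # hypothetical: one instance reused across questions

    # Ask only for the context LightRAG would feed to the LLM (no generation step).
    retrieved_context = await rag.aquery(
        question, param=QueryParam(mode="mix", only_need_context=True)
    )

    # Generate the final answer with the same retrieval mode.
    answer = await rag.aquery(question, param=QueryParam(mode="mix"))

    return {"answer": answer, "context": retrieved_context}
```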

### Step 2: Run Evaluation

```bash
python lightrag/evaluation/eval_rag_quality.py
```

## 📊 Interpreting Results

### Score Ranges

- 0.80-1.00: Excellent (Production-ready)
- ⚠️ 0.60-0.80: Good (Room for improvement)
- 0.40-0.60: Poor (Needs optimization)
- 🔴 0.00-0.40: Critical (Major issues)

### What Low Scores Mean

| Metric | Low Score Indicates |
|---|---|
| Faithfulness | Responses contain hallucinations or incorrect information |
| Answer Relevance | Answers don't match what users asked |
| Context Recall | Missing important information in retrieval |
| Context Precision | Retrieved documents contain irrelevant noise |

### Optimization Tips

1. **Low Faithfulness:**
   - Improve entity extraction quality
   - Better document chunking
   - Tune retrieval temperature
2. **Low Answer Relevance:**
   - Improve prompt engineering
   - Better query understanding
   - Check semantic similarity threshold
3. **Low Context Recall:**
   - Increase retrieval top_k results (see the sketch after this list)
   - Improve embedding model
   - Better document preprocessing
4. **Low Context Precision:**
   - Smaller, focused chunks
   - Better filtering
   - Improve chunking strategy
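
For the top_k adjustment, a minimal sketch assuming the standard `QueryParam` options in current LightRAG releases and an already-initialized `rag` instance (adjust names to your installed version):

```python
from lightrag import QueryParam

async def query_with_larger_top_k(rag, question: str) -> str:
    # Retrieve more candidates per query to improve context recall; note that a
    # larger context can also add noise and lower context precision.
    param = QueryParam(mode="hybrid", top_k=100)  # larger than the default
    return await rag.aquery(question, param=param)
```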

## 📈 Usage Examples

### Python API

```python
import asyncio
from lightrag.evaluation import RAGEvaluator

async def main():
    evaluator = RAGEvaluator()
    results = await evaluator.run()

    # Access results
    for result in results:
        print(f"Question: {result['question']}")
        print(f"RAGAS Score: {result['ragas_score']:.2%}")
        print(f"Metrics: {result['metrics']}")

asyncio.run(main())
```

### Custom Dataset

```python
# Inside an async function (as in the Python API example above):
evaluator = RAGEvaluator(test_dataset_path="custom_tests.json")
results = await evaluator.run()
```

### Batch Evaluation

```python
import asyncio
from pathlib import Path

from lightrag.evaluation import RAGEvaluator

async def main():
    results_dir = Path("lightrag/evaluation/results")
    results_dir.mkdir(exist_ok=True)

    # Run multiple evaluations; each run writes its own timestamped results files
    for _ in range(3):
        evaluator = RAGEvaluator()
        results = await evaluator.run()

asyncio.run(main())
```
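
Afterwards you can fold the saved JSON files into a simple trend report. A minimal sketch; the exact structure inside each file depends on what `eval_rag_quality.py` writes, so inspect one file before relying on specific keys (the `ragas_score` key below mirrors the per-result dicts in the Python API example and is an assumption for the saved files):

```python
import json
from pathlib import Path

results_dir = Path("lightrag/evaluation/results")

# Filenames embed a timestamp, so lexical order is chronological order.
for path in sorted(results_dir.glob("results_*.json")):
    with path.open() as f:
        data = json.load(f)

    if isinstance(data, list):
        scores = [r["ragas_score"] for r in data if isinstance(r, dict) and "ragas_score" in r]
        if scores:
            print(f"{path.name}: mean RAGAS score {sum(scores) / len(scores):.2%}")
            continue

    print(f"{path.name}: loaded; inspect its structure before aggregating")
```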

## 🎯 Using Evaluation Results

What the Metrics Tell You:

1. **Quality Metrics**: Overall RAGAS score indicates system health
2. **Evaluation Framework**: Automated quality assessment with RAGAS
3. **Best Practices**: Offline evaluation pipeline for continuous improvement
4. **Production-Ready**: Metrics-driven system optimization

Example Use Cases:

- Track RAG quality over time as you update your documents
- Compare different retrieval modes (local, global, hybrid, mix); see the sketch below
- Measure impact of chunking strategy changes
- Validate system performance before deployment
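
For the retrieval-mode comparison, a minimal sketch (assuming an already-initialized LightRAG instance `rag`; scoring each mode with RAGAS would still go through the evaluator):

```python
from lightrag import QueryParam

async def compare_modes(rag, question: str) -> None:
    # Run the same question through each retrieval mode and preview the answers.
    for mode in ("local", "global", "hybrid", "mix"):
        answer = await rag.aquery(question, param=QueryParam(mode=mode))
        print(f"--- {mode} ---\n{answer[:300]}\n")
```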

## 📚 Resources

- **LangFuse Integration**: Real-time observability of production RAG calls
- **LightRAG**: Core RAG system with entity extraction and knowledge graphs
- **Metrics**: See `results/` for detailed evaluation metrics


## 🐛 Troubleshooting

### "ModuleNotFoundError: No module named 'ragas'"

```bash
pip install ragas datasets
```

### "No sample_dataset.json found"

Make sure you're running from the project root:

```bash
cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
```

### "LLM API errors during evaluation"

The evaluation uses your configured LLM (OpenAI by default). Ensure that:

- API keys are set in `.env`
- You have sufficient API quota
- Your network connection is stable

### Results showing perfect scores

The current implementation uses the ground truth as the mock response, so results show perfect scores: the "generated answer" is identical to the ground truth.

To evaluate actual RAG output:

1. Implement the `generate_rag_response()` method
2. Connect it to your LightRAG instance
3. Run the evaluation again

## 📝 Next Steps

1. Review the test dataset in `sample_dataset.json`
2. Run `python lightrag/evaluation/eval_rag_quality.py`
3. Review the JSON and CSV results in `results/`
4. 🔄 Integrate with your actual LightRAG system
5. 📊 Monitor metrics over time
6. 🎯 Use insights for optimization

Happy Evaluating! 🚀