# 📊 RAGAS-based Evaluation Framework

## What is RAGAS?

**RAGAS** (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs. RAGAS scores each response with a set of LLM-judged metrics:

### Core Metrics

| Metric | What It Measures | Good Score |
|--------|------------------|------------|
| **Faithfulness** | Is the answer factually accurate based on retrieved context? | > 0.80 |
| **Answer Relevance** | Is the answer relevant to the user's question? | > 0.80 |
| **Context Recall** | Was all relevant information retrieved from documents? | > 0.80 |
| **Context Precision** | Is retrieved context clean, without irrelevant noise? | > 0.80 |
| **RAGAS Score** | Overall quality metric (average of the above) | > 0.80 |

### 📁 LightRAG Evaluation Framework Directory Structure

```
lightrag/evaluation/
├── eval_rag_quality.py               # Main evaluation script
├── sample_dataset.json               # 3 test questions about LightRAG
├── sample_documents/                 # Matching markdown files for testing
│   ├── 01_lightrag_overview.md
│   ├── 02_rag_architecture.md
│   ├── 03_lightrag_improvements.md
│   ├── 04_supported_databases.md
│   ├── 05_evaluation_and_deployment.md
│   └── README.md
├── __init__.py                       # Package init
├── results/                          # Output directory
│   ├── results_YYYYMMDD_HHMMSS.json  # Raw metrics in JSON
│   └── results_YYYYMMDD_HHMMSS.csv   # Metrics in CSV format
└── README.md                         # This file
```

**Quick Test:** Index the files from `sample_documents/` into LightRAG, then run the evaluator to reproduce the reference results (~89-100% RAGAS score per question).
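Under the hood, `eval_rag_quality.py` feeds each question, the LightRAG answer, the retrieved contexts, and the expected answer into RAGAS. The snippet below is only a minimal sketch of that kind of call, not the script's actual code; the exact RAGAS imports and column names vary between versions:

```python
# Illustrative only: the RAGAS API surface differs between versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per test case: question, RAG answer, retrieved contexts, reference answer.
rows = {
    "question": ["How does LightRAG solve the hallucination problem?"],
    "answer": ["<answer returned by LightRAG>"],
    "contexts": [["<retrieved context chunk 1>", "<retrieved context chunk 2>"]],
    "ground_truth": ["<expected answer from the test dataset>"],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)  # per-metric scores; their average is reported here as the RAGAS score
```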
## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install ragas datasets langfuse
```

Or use your project dependencies (already included in pyproject.toml):

```bash
pip install -e ".[evaluation]"
```

### 2. Run Evaluation

**Basic usage (uses defaults):**

```bash
cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
```

**Specify custom dataset:**

```bash
python lightrag/evaluation/eval_rag_quality.py --dataset my_test.json
```

**Specify custom RAG endpoint:**

```bash
python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
```

**Specify both (short form):**

```bash
python lightrag/evaluation/eval_rag_quality.py -d my_test.json -r http://localhost:9621
```

**Get help:**

```bash
python lightrag/evaluation/eval_rag_quality.py --help
```

### 3. View Results

Results are saved automatically in `lightrag/evaluation/results/`:

```
results/
├── results_20241023_143022.json   ← Raw metrics in JSON format
└── results_20241023_143022.csv    ← Metrics in CSV format (for spreadsheets)
```

**Results include:**

- ✅ Overall RAGAS score
- 📊 Per-metric averages (Faithfulness, Answer Relevance, Context Recall, Context Precision)
- 📋 Individual test case results
- 📈 Performance breakdown by question

## 📋 Command-Line Arguments

The evaluation script supports command-line arguments for easy configuration:

| Argument | Short | Default | Description |
|----------|-------|---------|-------------|
| `--dataset` | `-d` | `sample_dataset.json` | Path to test dataset JSON file |
| `--ragendpoint` | `-r` | `http://localhost:9621` or `$LIGHTRAG_API_URL` | LightRAG API endpoint URL |

### Usage Examples

**Use default dataset and endpoint:**

```bash
python lightrag/evaluation/eval_rag_quality.py
```

**Custom dataset with default endpoint:**

```bash
python lightrag/evaluation/eval_rag_quality.py --dataset path/to/my_dataset.json
```

**Default dataset with custom endpoint:**

```bash
python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
```

**Custom dataset and endpoint:**

```bash
python lightrag/evaluation/eval_rag_quality.py -d my_dataset.json -r http://localhost:9621
```

**Absolute path to dataset:**

```bash
python lightrag/evaluation/eval_rag_quality.py -d /path/to/custom_dataset.json
```

**Show help message:**

```bash
python lightrag/evaluation/eval_rag_quality.py --help
```

## ⚙️ Configuration

### Environment Variables

The evaluation framework supports customization through environment variables:

**⚠️ IMPORTANT: Both LLM and Embedding endpoints MUST be OpenAI-compatible**

- The RAGAS framework requires OpenAI-compatible API interfaces
- Custom endpoints must implement the OpenAI API format (e.g., vLLM, SGLang, LocalAI)
- Non-compatible endpoints will cause evaluation failures

| Variable | Default | Description |
|----------|---------|-------------|
| **LLM Configuration** | | |
| `EVAL_LLM_MODEL` | `gpt-4o-mini` | LLM model used for RAGAS evaluation |
| `EVAL_LLM_BINDING_API_KEY` | falls back to `OPENAI_API_KEY` | API key for LLM evaluation |
| `EVAL_LLM_BINDING_HOST` | (optional) | Custom OpenAI-compatible endpoint URL for LLM |
| **Embedding Configuration** | | |
| `EVAL_EMBEDDING_MODEL` | `text-embedding-3-large` | Embedding model for evaluation |
| `EVAL_EMBEDDING_BINDING_API_KEY` | falls back to `EVAL_LLM_BINDING_API_KEY` → `OPENAI_API_KEY` | API key for embeddings |
| `EVAL_EMBEDDING_BINDING_HOST` | falls back to `EVAL_LLM_BINDING_HOST` | Custom OpenAI-compatible endpoint URL for embeddings |
| **Performance Tuning** | | |
| `EVAL_MAX_CONCURRENT` | 2 | Number of concurrent test case evaluations (1 = serial) |
| `EVAL_QUERY_TOP_K` | 10 | Number of documents to retrieve per query |
| `EVAL_LLM_MAX_RETRIES` | 5 | Maximum LLM request retries |
| `EVAL_LLM_TIMEOUT` | 180 | LLM request timeout in seconds |
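The fallback chain above boils down to a few environment lookups. As a rough illustration of the documented precedence (not the script's actual code):

```python
import os

# LLM settings: model, key, and optional OpenAI-compatible endpoint.
eval_llm_model = os.getenv("EVAL_LLM_MODEL", "gpt-4o-mini")
eval_llm_api_key = os.getenv("EVAL_LLM_BINDING_API_KEY") or os.getenv("OPENAI_API_KEY")
eval_llm_host = os.getenv("EVAL_LLM_BINDING_HOST")  # None -> OpenAI official API

# Embedding settings fall back to the LLM settings, then to OPENAI_API_KEY.
eval_embedding_model = os.getenv("EVAL_EMBEDDING_MODEL", "text-embedding-3-large")
eval_embedding_api_key = os.getenv("EVAL_EMBEDDING_BINDING_API_KEY") or eval_llm_api_key
eval_embedding_host = os.getenv("EVAL_EMBEDDING_BINDING_HOST") or eval_llm_host

# Performance tuning knobs.
max_concurrent = int(os.getenv("EVAL_MAX_CONCURRENT", "2"))
query_top_k = int(os.getenv("EVAL_QUERY_TOP_K", "10"))
llm_max_retries = int(os.getenv("EVAL_LLM_MAX_RETRIES", "5"))
llm_timeout = int(os.getenv("EVAL_LLM_TIMEOUT", "180"))
```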
### Usage Examples

**Example 1: Default Configuration (OpenAI Official API)**

```bash
export OPENAI_API_KEY=sk-xxx
python lightrag/evaluation/eval_rag_quality.py
```

Both LLM and embeddings use OpenAI's official API with default models.

**Example 2: Custom Models on OpenAI**

```bash
export OPENAI_API_KEY=sk-xxx
export EVAL_LLM_MODEL=gpt-4o-mini
export EVAL_EMBEDDING_MODEL=text-embedding-3-large
python lightrag/evaluation/eval_rag_quality.py
```

**Example 3: Same Custom OpenAI-Compatible Endpoint for Both**

```bash
# Both LLM and embeddings use the same custom endpoint
export EVAL_LLM_BINDING_API_KEY=your-custom-key
export EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
export EVAL_LLM_MODEL=qwen-plus
export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
python lightrag/evaluation/eval_rag_quality.py
```

Embeddings automatically inherit the LLM endpoint configuration.

**Example 4: Separate Endpoints (Cost Optimization)**

```bash
# Use OpenAI for the LLM (high quality)
export EVAL_LLM_BINDING_API_KEY=sk-openai-key
export EVAL_LLM_MODEL=gpt-4o-mini
# No EVAL_LLM_BINDING_HOST means use the OpenAI official API

# Use local vLLM for embeddings (cost-effective)
export EVAL_EMBEDDING_BINDING_API_KEY=local-key
export EVAL_EMBEDDING_BINDING_HOST=http://localhost:8001/v1
export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
python lightrag/evaluation/eval_rag_quality.py
```

The LLM uses the OpenAI official API; embeddings use a local custom endpoint.

**Example 5: Different Custom Endpoints for LLM and Embeddings**

```bash
# LLM on one OpenAI-compatible server
export EVAL_LLM_BINDING_API_KEY=key1
export EVAL_LLM_BINDING_HOST=http://llm-server:8000/v1
export EVAL_LLM_MODEL=custom-llm

# Embeddings on another OpenAI-compatible server
export EVAL_EMBEDDING_BINDING_API_KEY=key2
export EVAL_EMBEDDING_BINDING_HOST=http://embedding-server:8001/v1
export EVAL_EMBEDDING_MODEL=custom-embedding
python lightrag/evaluation/eval_rag_quality.py
```

LLM and embeddings use two different custom OpenAI-compatible endpoints.

**Example 6: Using Environment Variables from a .env File**

```bash
# Create .env file in project root
cat > .env << EOF
EVAL_LLM_BINDING_API_KEY=your-key
EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
EVAL_LLM_MODEL=qwen-plus
EVAL_EMBEDDING_MODEL=BAAI/bge-m3
EOF

# Run evaluation (automatically loads .env)
python lightrag/evaluation/eval_rag_quality.py
```

### Concurrency Control & Rate Limiting

The evaluation framework includes built-in concurrency control to prevent API rate limiting issues.

**Why Concurrency Control Matters:**

- RAGAS internally makes many concurrent LLM calls for each test case
- The Context Precision metric calls the LLM once per retrieved document
- Without control, this can easily exceed API rate limits

**Default Configuration (Conservative):**

```bash
EVAL_MAX_CONCURRENT=2    # Two test cases evaluated concurrently (set to 1 for serial)
EVAL_QUERY_TOP_K=10      # TOP_K query parameter of LightRAG
EVAL_LLM_MAX_RETRIES=5   # Retry failed requests 5 times
EVAL_LLM_TIMEOUT=180     # 3-minute timeout per request
```

**Common Issues and Solutions:**

| Issue | Solution |
|-------|----------|
| **Warning: "LM returned 1 generations instead of 3"** | Reduce `EVAL_MAX_CONCURRENT` to 1 or decrease `EVAL_QUERY_TOP_K` |
| **Context Precision returns NaN** | Lower `EVAL_QUERY_TOP_K` to reduce LLM calls per test case |
| **Rate limit errors (429)** | Increase `EVAL_LLM_MAX_RETRIES` and decrease `EVAL_MAX_CONCURRENT` |
| **Request timeouts** | Increase `EVAL_LLM_TIMEOUT` above the 180-second default |
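Conceptually, `EVAL_MAX_CONCURRENT` acts as a semaphore around per-test-case evaluation, which keeps the number of in-flight LLM calls bounded. A minimal sketch of that pattern (illustrative only, with a placeholder `evaluate_case` function; not the script's actual implementation):

```python
import asyncio
import os

MAX_CONCURRENT = int(os.getenv("EVAL_MAX_CONCURRENT", "2"))  # 1 = serial

async def evaluate_case(test_case: dict) -> dict:
    """Placeholder for querying LightRAG and scoring one test case with RAGAS."""
    await asyncio.sleep(0)  # stand-in for the real async work
    return {"question": test_case["question"], "ragas_score": None}

async def run_all(test_cases: list[dict]) -> list[dict]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(case: dict) -> dict:
        # Only MAX_CONCURRENT test cases are evaluated at the same time.
        async with semaphore:
            return await evaluate_case(case)

    return await asyncio.gather(*(bounded(c) for c in test_cases))

# results = asyncio.run(run_all(test_cases))
```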
## 📝 Test Dataset

`sample_dataset.json` contains 3 generic questions about LightRAG. Replace with questions matching YOUR indexed documents.

**Custom Test Cases:**

```json
{
  "test_cases": [
    {
      "question": "Your question here",
      "ground_truth": "Expected answer from your data",
      "project": "evaluation_project_name"
    }
  ]
}
```
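If you prefer to generate the dataset programmatically, any script that writes this structure will do. A small sketch (the file name and questions below are placeholders):

```python
import json

# Placeholder questions; replace with questions about YOUR indexed documents.
test_cases = [
    {
        "question": "What storage backends does my deployment use?",
        "ground_truth": "Expected answer taken from your own documents.",
        "project": "my_evaluation_project",
    },
]

with open("my_dataset.json", "w", encoding="utf-8") as f:
    json.dump({"test_cases": test_cases}, f, ensure_ascii=False, indent=2)

# Then run: python lightrag/evaluation/eval_rag_quality.py -d my_dataset.json
```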
---

## 📊 Interpreting Results

### Score Ranges

- **0.80-1.00**: ✅ Excellent (production-ready)
- **0.60-0.80**: ⚠️ Good (room for improvement)
- **0.40-0.60**: ❌ Poor (needs optimization)
- **0.00-0.40**: 🔴 Critical (major issues)

### What Low Scores Mean

| Metric | Low Score Indicates |
|--------|---------------------|
| **Faithfulness** | Responses contain hallucinations or incorrect information |
| **Answer Relevance** | Answers don't match what users asked |
| **Context Recall** | Important information is missing from retrieval |
| **Context Precision** | Retrieved documents contain irrelevant noise |

### Optimization Tips

1. **Low Faithfulness**:
   - Improve entity extraction quality
   - Improve document chunking
   - Tune the retrieval temperature
2. **Low Answer Relevance**:
   - Improve prompt engineering
   - Improve query understanding
   - Check the semantic similarity threshold
3. **Low Context Recall**:
   - Increase retrieval `top_k` results
   - Use a stronger embedding model
   - Improve document preprocessing
4. **Low Context Precision**:
   - Use smaller, focused chunks
   - Filter retrieved context more aggressively
   - Improve the chunking strategy

---

## 📚 Resources

- [RAGAS Documentation](https://docs.ragas.io/)
- [RAGAS GitHub](https://github.com/explodinggradients/ragas)

---

## 🐛 Troubleshooting

### "ModuleNotFoundError: No module named 'ragas'"

```bash
pip install ragas datasets
```

### "Warning: LM returned 1 generations instead of requested 3" or Context Precision NaN

**Cause**: This warning indicates API rate limiting or concurrent request overload:

- RAGAS makes multiple LLM calls per test case (faithfulness, relevancy, recall, precision)
- Context Precision calls the LLM once per retrieved document (with `EVAL_QUERY_TOP_K=10`, that's 10 calls)
- Concurrent evaluation multiplies these calls: `EVAL_MAX_CONCURRENT × LLM calls per test`

**Solutions** (in order of effectiveness):

1. **Serial evaluation**:

   ```bash
   export EVAL_MAX_CONCURRENT=1
   python lightrag/evaluation/eval_rag_quality.py
   ```

2. **Reduce retrieved documents**:

   ```bash
   export EVAL_QUERY_TOP_K=5  # Halves Context Precision LLM calls
   python lightrag/evaluation/eval_rag_quality.py
   ```

3. **Increase retries and timeout**:

   ```bash
   export EVAL_LLM_MAX_RETRIES=10
   export EVAL_LLM_TIMEOUT=300  # raise above the 180-second default
   python lightrag/evaluation/eval_rag_quality.py
   ```

4. **Use a higher-quota API** (if available):
   - Upgrade to OpenAI Tier 2+ for higher RPM limits
   - Use a self-hosted OpenAI-compatible service with no rate limits

### "AttributeError: 'InstructorLLM' object has no attribute 'agenerate_prompt'" or NaN results

This error occurs with RAGAS 0.3.x when the LLM and embeddings are not explicitly configured. The evaluation framework now handles this automatically by:

- Using environment variables to configure the evaluation models
- Creating proper LLM and Embeddings instances for RAGAS

**Solution**: Ensure you have set one of the following:

- `OPENAI_API_KEY` environment variable (default)
- `EVAL_LLM_BINDING_API_KEY` for a custom API key

The framework will then configure the evaluation models automatically.

### "No sample_dataset.json found"

Make sure you're running from the project root:

```bash
cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
```

### "LightRAG query API errors during evaluation"

The evaluation uses your configured LLM (OpenAI by default). Ensure:

- API keys are set in `.env`
- The network connection is stable

### Evaluation requires a running LightRAG API

The evaluator queries a running LightRAG API server at `http://localhost:9621`. Make sure:

1. The LightRAG API server is running (`python lightrag/api/lightrag_server.py`)
2. Documents are indexed in your LightRAG instance
3. The API is accessible at the configured URL

## 📝 Next Steps

1. Start the LightRAG API server
2. Upload the sample documents into LightRAG through the WebUI
3. Run `python lightrag/evaluation/eval_rag_quality.py`
4. Review the results (JSON/CSV) in the `results/` folder (see the loading sketch below)
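If you want a quick look at the results CSV without a spreadsheet, a few lines of Python are enough. This sketch assumes `pandas` is installed and makes no assumption about the exact column names; it simply averages whatever numeric columns the file contains:

```python
import glob

import pandas as pd

# Pick the newest results CSV (results_YYYYMMDD_HHMMSS.csv).
latest_csv = sorted(glob.glob("lightrag/evaluation/results/results_*.csv"))[-1]
df = pd.read_csv(latest_csv)

print(f"Loaded {len(df)} test cases from {latest_csv}")
# Column names vary by script version, so average every numeric column found.
print(df.select_dtypes(include="number").mean().round(4))
```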
Evaluation Result Sample:

```
INFO: ======================================================================
INFO: 🔍 RAGAS Evaluation - Using Real LightRAG API
INFO: ======================================================================
INFO: Evaluation Models:
INFO:   • LLM Model: gpt-4.1
INFO:   • Embedding Model: text-embedding-3-large
INFO:   • Endpoint: OpenAI Official API
INFO: Concurrency & Rate Limiting:
INFO:   • Query Top-K: 10 Entities/Relations
INFO:   • LLM Max Retries: 5
INFO:   • LLM Timeout: 180 seconds
INFO: Test Configuration:
INFO:   • Total Test Cases: 6
INFO:   • Test Dataset: sample_dataset.json
INFO:   • LightRAG API: http://localhost:9621
INFO:   • Results Directory: results
INFO: ======================================================================
INFO: 🚀 Starting RAGAS Evaluation of LightRAG System
INFO: 🔧 RAGAS Evaluation (Stage 2): 2 concurrent
INFO: ======================================================================
INFO:
INFO: ===================================================================================================================
INFO: 📊 EVALUATION RESULTS SUMMARY
INFO: ===================================================================================================================
INFO:  # | Question                                            | Faith  | AnswRel | CtxRec | CtxPrec | RAGAS  | Status
INFO: -------------------------------------------------------------------------------------------------------------------
INFO:  1 | How does LightRAG solve the hallucination probl...  | 1.0000 | 1.0000  | 1.0000 | 1.0000  | 1.0000 | ✓
INFO:  2 | What are the three main components required in ...  | 0.8500 | 0.5790  | 1.0000 | 1.0000  | 0.8573 | ✓
INFO:  3 | How does LightRAG's retrieval performance compa...  | 0.8056 | 1.0000  | 1.0000 | 1.0000  | 0.9514 | ✓
INFO:  4 | What vector databases does LightRAG support and...  | 0.8182 | 0.9807  | 1.0000 | 1.0000  | 0.9497 | ✓
INFO:  5 | What are the four key metrics for evaluating RA...  | 1.0000 | 0.7452  | 1.0000 | 1.0000  | 0.9363 | ✓
INFO:  6 | What are the core benefits of LightRAG and how ...   | 0.9583 | 0.8829  | 1.0000 | 1.0000  | 0.9603 | ✓
INFO: ===================================================================================================================
INFO:
INFO: ======================================================================
INFO: 📊 EVALUATION COMPLETE
INFO: ======================================================================
INFO:   Total Tests: 6
INFO:   Successful: 6
INFO:   Failed: 0
INFO:   Success Rate: 100.00%
INFO:   Elapsed Time: 161.10 seconds
INFO:   Avg Time/Test: 26.85 seconds
INFO:
INFO: ======================================================================
INFO: 📈 BENCHMARK RESULTS (Average)
INFO: ======================================================================
INFO:   Average Faithfulness: 0.9053
INFO:   Average Answer Relevance: 0.8646
INFO:   Average Context Recall: 1.0000
INFO:   Average Context Precision: 1.0000
INFO:   Average RAGAS Score: 0.9425
INFO: ----------------------------------------------------------------------
INFO:   Min RAGAS Score: 0.8573
INFO:   Max RAGAS Score: 1.0000
```

---

**Happy Evaluating! 🚀**