Refactor evaluation results display and logging format

yangdx 2025-11-05 10:08:17 +08:00
parent 06b91d00f8
commit a73314a4ba
2 changed files with 74 additions and 22 deletions


@@ -156,13 +156,13 @@ The evaluation framework supports customization through environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `EVAL_LLM_MODEL` | `gpt-4o-mini` | LLM model used for RAGAS evaluation |
-| `EVAL_EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model for evaluation |
+| `EVAL_EMBEDDING_MODEL` | `text-embedding-3-large` | Embedding model for evaluation |
-| `EVAL_LLM_BINDING_API_KEY` | (falls back to `OPENAI_API_KEY`) | API key for evaluation models |
+| `EVAL_LLM_BINDING_API_KEY` | falls back to `OPENAI_API_KEY` | API key for evaluation models |
| `EVAL_LLM_BINDING_HOST` | (optional) | Custom endpoint URL for OpenAI-compatible services |
-| `EVAL_MAX_CONCURRENT` | `1` | Number of concurrent test case evaluations (1=serial) |
+| `EVAL_MAX_CONCURRENT` | 2 | Number of concurrent test case evaluations (1=serial) |
-| `EVAL_QUERY_TOP_K` | `10` | Number of documents to retrieve per query |
+| `EVAL_QUERY_TOP_K` | 10 | Number of documents to retrieve per query |
-| `EVAL_LLM_MAX_RETRIES` | `5` | Maximum LLM request retries |
+| `EVAL_LLM_MAX_RETRIES` | 5 | Maximum LLM request retries |
-| `EVAL_LLM_TIMEOUT` | `120` | LLM request timeout in seconds |
+| `EVAL_LLM_TIMEOUT` | 180 | LLM request timeout in seconds |

### Usage Examples
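As a quick illustration of the variables above, they can be exported in the shell before an evaluation run; the values below simply mirror the documented defaults and are examples only, not recommendations:

```bash
# Illustrative environment overrides for the RAGAS evaluation (example values only)
export EVAL_LLM_MODEL="gpt-4o-mini"                   # LLM used for RAGAS evaluation
export EVAL_EMBEDDING_MODEL="text-embedding-3-large"  # embedding model for evaluation
export EVAL_LLM_BINDING_API_KEY="$OPENAI_API_KEY"     # falls back to OPENAI_API_KEY when unset
export EVAL_MAX_CONCURRENT=2                          # concurrent test case evaluations (1 = serial)
export EVAL_QUERY_TOP_K=10                            # documents retrieved per query
export EVAL_LLM_MAX_RETRIES=5                         # maximum LLM request retries
export EVAL_LLM_TIMEOUT=180                           # request timeout in seconds
```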
@@ -199,16 +199,10 @@ The evaluation framework includes built-in concurrency control to prevent API ra

**Default Configuration (Conservative):**
```bash
-EVAL_MAX_CONCURRENT=1 # Serial evaluation (one test at a time)
+EVAL_MAX_CONCURRENT=2 # Serial evaluation (one test at a time)
EVAL_QUERY_TOP_K=10 # OP_K query parameter of LightRAG
EVAL_LLM_MAX_RETRIES=5 # Retry failed requests 5 times
-EVAL_LLM_TIMEOUT=180 # 2-minute timeout per request
+EVAL_LLM_TIMEOUT=180 # 3-minute timeout per request
-```
-**If You Have Higher API Quotas:**
-```bash
-EVAL_MAX_CONCURRENT=2 # Evaluate 2 tests in parallel
-EVAL_QUERY_TOP_K=20 # OP_K query parameter of LightRAG
```

**Common Issues and Solutions:**
@@ -370,12 +364,72 @@ The evaluator queries a running LightRAG API server at `http://localhost:9621`.

## 📝 Next Steps

-1. Index documents into LightRAG (WebUI or API)
+1. Index sample documents into LightRAG (WebUI or API)
2. Start LightRAG API server
3. Run `python lightrag/evaluation/eval_rag_quality.py`
4. Review results (JSON/CSV) in `results/` folder
5. Adjust entity extraction prompts or retrieval settings based on scores
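A minimal shell sketch of steps 3 and 4, assuming the LightRAG API server from step 2 is already reachable at `http://localhost:9621` and that output lands in the `results/` folder (exact file names depend on the run):

```bash
# Step 3: run the evaluator against the running LightRAG API server
python lightrag/evaluation/eval_rag_quality.py

# Step 4: review the generated JSON/CSV results (file names vary per run)
ls -lt results/
head -n 5 results/*.csv
```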
Evaluation Result Sample:
```
INFO: ======================================================================
INFO: 🔍 RAGAS Evaluation - Using Real LightRAG API
INFO: ======================================================================
INFO: Evaluation Models:
INFO: • LLM Model: gpt-4.1
INFO: • Embedding Model: text-embedding-3-large
INFO: • Endpoint: OpenAI Official API
INFO: Concurrency & Rate Limiting:
INFO: • Query Top-K: 10 Entities/Relations
INFO: • LLM Max Retries: 5
INFO: • LLM Timeout: 180 seconds
INFO: Test Configuration:
INFO: • Total Test Cases: 6
INFO: • Test Dataset: sample_dataset.json
INFO: • LightRAG API: http://localhost:9621
INFO: • Results Directory: results
INFO: ======================================================================
INFO: 🚀 Starting RAGAS Evaluation of LightRAG System
INFO: 🔧 RAGAS Evaluation (Stage 2): 2 concurrent
INFO: ======================================================================
INFO:
INFO: ===================================================================================================================
INFO: 📊 EVALUATION RESULTS SUMMARY
INFO: ===================================================================================================================
INFO: # | Question | Faith | AnswRel | CtxRec | CtxPrec | RAGAS | Status
INFO: -------------------------------------------------------------------------------------------------------------------
INFO: 1 | How does LightRAG solve the hallucination probl... | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ✓
INFO: 2 | What are the three main components required in ... | 0.8500 | 0.5790 | 1.0000 | 1.0000 | 0.8573 | ✓
INFO: 3 | How does LightRAG's retrieval performance compa... | 0.8056 | 1.0000 | 1.0000 | 1.0000 | 0.9514 | ✓
INFO: 4 | What vector databases does LightRAG support and... | 0.8182 | 0.9807 | 1.0000 | 1.0000 | 0.9497 | ✓
INFO: 5 | What are the four key metrics for evaluating RA... | 1.0000 | 0.7452 | 1.0000 | 1.0000 | 0.9363 | ✓
INFO: 6 | What are the core benefits of LightRAG and how ... | 0.9583 | 0.8829 | 1.0000 | 1.0000 | 0.9603 | ✓
INFO: ===================================================================================================================
INFO:
INFO: ======================================================================
INFO: 📊 EVALUATION COMPLETE
INFO: ======================================================================
INFO: Total Tests: 6
INFO: Successful: 6
INFO: Failed: 0
INFO: Success Rate: 100.00%
INFO: Elapsed Time: 161.10 seconds
INFO: Avg Time/Test: 26.85 seconds
INFO:
INFO: ======================================================================
INFO: 📈 BENCHMARK RESULTS (Average)
INFO: ======================================================================
INFO: Average Faithfulness: 0.9053
INFO: Average Answer Relevance: 0.8646
INFO: Average Context Recall: 1.0000
INFO: Average Context Precision: 1.0000
INFO: Average RAGAS Score: 0.9425
INFO: ----------------------------------------------------------------------
INFO: Min RAGAS Score: 0.8573
INFO: Max RAGAS Score: 1.0000
```
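As a sanity check on the sample above, the per-test RAGAS column is consistent with a plain arithmetic mean of the four metric columns; this is only an observation about the printed numbers, and the aggregation actually implemented in `eval_rag_quality.py` is authoritative:

```bash
# Test case 3: Faith 0.8056, AnswRel 1.0000, CtxRec 1.0000, CtxPrec 1.0000
awk 'BEGIN { printf "%.4f\n", (0.8056 + 1.0000 + 1.0000 + 1.0000) / 4 }'   # prints 0.9514
# Test case 5: Faith 1.0000, AnswRel 0.7452, CtxRec 1.0000, CtxPrec 1.0000
awk 'BEGIN { printf "%.4f\n", (1.0000 + 0.7452 + 1.0000 + 1.0000) / 4 }'   # prints 0.9363
```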
---

**Happy Evaluating! 🚀**


@@ -657,6 +657,7 @@ class RAGEvaluator:

        Args:
            results: List of evaluation results
        """
+        logger.info("")
        logger.info("%s", "=" * 115)
        logger.info("📊 EVALUATION RESULTS SUMMARY")
        logger.info("%s", "=" * 115)
@@ -842,6 +843,9 @@ class RAGEvaluator:

            "results": results,
        }

+        # Display results table
+        self._display_results_table(results)

        # Save JSON results
        json_path = (
            self.results_dir
@@ -850,14 +854,8 @@ class RAGEvaluator:

        with open(json_path, "w") as f:
            json.dump(summary, f, indent=2)

-        # Display results table
-        self._display_results_table(results)
-        logger.info("✅ JSON results saved to: %s", json_path)

        # Export to CSV
        csv_path = self._export_to_csv(results)
-        logger.info("✅ CSV results saved to: %s", csv_path)

        # Print summary
        logger.info("")
@@ -882,7 +880,7 @@ class RAGEvaluator:

        logger.info("Average Context Recall: %.4f", avg["context_recall"])
        logger.info("Average Context Precision: %.4f", avg["context_precision"])
        logger.info("Average RAGAS Score: %.4f", avg["ragas_score"])
-        logger.info("")
+        logger.info("%s", "-" * 70)
        logger.info(
            "Min RAGAS Score: %.4f",
            benchmark_stats["min_ragas_score"],