Refactor evaluation results display and logging format

yangdx 2025-11-05 10:08:17 +08:00
parent 06b91d00f8
commit a73314a4ba
2 changed files with 74 additions and 22 deletions


@@ -156,13 +156,13 @@ The evaluation framework supports customization through environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `EVAL_LLM_MODEL` | `gpt-4o-mini` | LLM model used for RAGAS evaluation |
-| `EVAL_EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model for evaluation |
+| `EVAL_EMBEDDING_MODEL` | `text-embedding-3-large` | Embedding model for evaluation |
-| `EVAL_LLM_BINDING_API_KEY` | (falls back to `OPENAI_API_KEY`) | API key for evaluation models |
+| `EVAL_LLM_BINDING_API_KEY` | falls back to `OPENAI_API_KEY` | API key for evaluation models |
| `EVAL_LLM_BINDING_HOST` | (optional) | Custom endpoint URL for OpenAI-compatible services |
-| `EVAL_MAX_CONCURRENT` | `1` | Number of concurrent test case evaluations (1=serial) |
+| `EVAL_MAX_CONCURRENT` | 2 | Number of concurrent test case evaluations (1=serial) |
-| `EVAL_QUERY_TOP_K` | `10` | Number of documents to retrieve per query |
+| `EVAL_QUERY_TOP_K` | 10 | Number of documents to retrieve per query |
-| `EVAL_LLM_MAX_RETRIES` | `5` | Maximum LLM request retries |
+| `EVAL_LLM_MAX_RETRIES` | 5 | Maximum LLM request retries |
-| `EVAL_LLM_TIMEOUT` | `120` | LLM request timeout in seconds |
+| `EVAL_LLM_TIMEOUT` | 180 | LLM request timeout in seconds |

### Usage Examples
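As a quick illustration of the variables above, they can be exported in the shell before an evaluation run; the values below simply mirror the documented defaults and are examples only, not recommendations:

```bash
# Illustrative environment overrides for the RAGAS evaluation (example values only)
export EVAL_LLM_MODEL="gpt-4o-mini"                   # LLM used for RAGAS evaluation
export EVAL_EMBEDDING_MODEL="text-embedding-3-large"  # embedding model for evaluation
export EVAL_LLM_BINDING_API_KEY="$OPENAI_API_KEY"     # falls back to OPENAI_API_KEY when unset
export EVAL_MAX_CONCURRENT=2                          # concurrent test case evaluations (1 = serial)
export EVAL_QUERY_TOP_K=10                            # documents retrieved per query
export EVAL_LLM_MAX_RETRIES=5                         # maximum LLM request retries
export EVAL_LLM_TIMEOUT=180                           # request timeout in seconds
```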
@@ -199,16 +199,10 @@ The evaluation framework includes built-in concurrency control to prevent API ra

**Default Configuration (Conservative):**
```bash
-EVAL_MAX_CONCURRENT=1 # Serial evaluation (one test at a time)
+EVAL_MAX_CONCURRENT=2 # Serial evaluation (one test at a time)
EVAL_QUERY_TOP_K=10 # OP_K query parameter of LightRAG
EVAL_LLM_MAX_RETRIES=5 # Retry failed requests 5 times
-EVAL_LLM_TIMEOUT=180 # 2-minute timeout per request
+EVAL_LLM_TIMEOUT=180 # 3-minute timeout per request
-```
-**If You Have Higher API Quotas:**
-```bash
-EVAL_MAX_CONCURRENT=2 # Evaluate 2 tests in parallel
-EVAL_QUERY_TOP_K=20 # OP_K query parameter of LightRAG
```

**Common Issues and Solutions:**
@@ -370,12 +364,72 @@ The evaluator queries a running LightRAG API server at `http://localhost:9621`.

## 📝 Next Steps

-1. Index documents into LightRAG (WebUI or API)
+1. Index sample documents into LightRAG (WebUI or API)
2. Start LightRAG API server
3. Run `python lightrag/evaluation/eval_rag_quality.py`
4. Review results (JSON/CSV) in `results/` folder
5. Adjust entity extraction prompts or retrieval settings based on scores
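A minimal shell sketch of steps 3 and 4, assuming the LightRAG API server from step 2 is already reachable at `http://localhost:9621` and that output lands in the `results/` folder (exact file names depend on the run):

```bash
# Step 3: run the evaluator against the running LightRAG API server
python lightrag/evaluation/eval_rag_quality.py

# Step 4: review the generated JSON/CSV results (file names vary per run)
ls -lt results/
head -n 5 results/*.csv
```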
Evaluation Result Sample:
```
INFO: ======================================================================
INFO: 🔍 RAGAS Evaluation - Using Real LightRAG API
INFO: ======================================================================
INFO: Evaluation Models:
INFO: • LLM Model: gpt-4.1
INFO: • Embedding Model: text-embedding-3-large
INFO: • Endpoint: OpenAI Official API
INFO: Concurrency & Rate Limiting:
INFO: • Query Top-K: 10 Entities/Relations
INFO: • LLM Max Retries: 5
INFO: • LLM Timeout: 180 seconds
INFO: Test Configuration:
INFO: • Total Test Cases: 6
INFO: • Test Dataset: sample_dataset.json
INFO: • LightRAG API: http://localhost:9621
INFO: • Results Directory: results
INFO: ======================================================================
INFO: 🚀 Starting RAGAS Evaluation of LightRAG System
INFO: 🔧 RAGAS Evaluation (Stage 2): 2 concurrent
INFO: ======================================================================
INFO:
INFO: ===================================================================================================================
INFO: 📊 EVALUATION RESULTS SUMMARY
INFO: ===================================================================================================================
INFO: # | Question | Faith | AnswRel | CtxRec | CtxPrec | RAGAS | Status
INFO: -------------------------------------------------------------------------------------------------------------------
INFO: 1 | How does LightRAG solve the hallucination probl... | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ✓
INFO: 2 | What are the three main components required in ... | 0.8500 | 0.5790 | 1.0000 | 1.0000 | 0.8573 | ✓
INFO: 3 | How does LightRAG's retrieval performance compa... | 0.8056 | 1.0000 | 1.0000 | 1.0000 | 0.9514 | ✓
INFO: 4 | What vector databases does LightRAG support and... | 0.8182 | 0.9807 | 1.0000 | 1.0000 | 0.9497 | ✓
INFO: 5 | What are the four key metrics for evaluating RA... | 1.0000 | 0.7452 | 1.0000 | 1.0000 | 0.9363 | ✓
INFO: 6 | What are the core benefits of LightRAG and how ... | 0.9583 | 0.8829 | 1.0000 | 1.0000 | 0.9603 | ✓
INFO: ===================================================================================================================
INFO:
INFO: ======================================================================
INFO: 📊 EVALUATION COMPLETE
INFO: ======================================================================
INFO: Total Tests: 6
INFO: Successful: 6
INFO: Failed: 0
INFO: Success Rate: 100.00%
INFO: Elapsed Time: 161.10 seconds
INFO: Avg Time/Test: 26.85 seconds
INFO:
INFO: ======================================================================
INFO: 📈 BENCHMARK RESULTS (Average)
INFO: ======================================================================
INFO: Average Faithfulness: 0.9053
INFO: Average Answer Relevance: 0.8646
INFO: Average Context Recall: 1.0000
INFO: Average Context Precision: 1.0000
INFO: Average RAGAS Score: 0.9425
INFO: ----------------------------------------------------------------------
INFO: Min RAGAS Score: 0.8573
INFO: Max RAGAS Score: 1.0000
```
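As a sanity check on the sample above, the per-test RAGAS column is consistent with a plain arithmetic mean of the four metric columns; this is only an observation about the printed numbers, and the aggregation actually implemented in `eval_rag_quality.py` is authoritative:

```bash
# Test case 3: Faith 0.8056, AnswRel 1.0000, CtxRec 1.0000, CtxPrec 1.0000
awk 'BEGIN { printf "%.4f\n", (0.8056 + 1.0000 + 1.0000 + 1.0000) / 4 }'   # prints 0.9514
# Test case 5: Faith 1.0000, AnswRel 0.7452, CtxRec 1.0000, CtxPrec 1.0000
awk 'BEGIN { printf "%.4f\n", (1.0000 + 0.7452 + 1.0000 + 1.0000) / 4 }'   # prints 0.9363
```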
---

**Happy Evaluating! 🚀**


@@ -657,6 +657,7 @@ class RAGEvaluator:

        Args:
            results: List of evaluation results
        """
+        logger.info("")
        logger.info("%s", "=" * 115)
        logger.info("📊 EVALUATION RESULTS SUMMARY")
        logger.info("%s", "=" * 115)
@@ -842,6 +843,9 @@ class RAGEvaluator:

            "results": results,
        }

+        # Display results table
+        self._display_results_table(results)

        # Save JSON results
        json_path = (
            self.results_dir
@@ -850,14 +854,8 @@ class RAGEvaluator:

        with open(json_path, "w") as f:
            json.dump(summary, f, indent=2)

-        # Display results table
-        self._display_results_table(results)
-        logger.info("✅ JSON results saved to: %s", json_path)

        # Export to CSV
        csv_path = self._export_to_csv(results)
-        logger.info("✅ CSV results saved to: %s", csv_path)

        # Print summary
        logger.info("")
@@ -882,7 +880,7 @@ class RAGEvaluator:

        logger.info("Average Context Recall: %.4f", avg["context_recall"])
        logger.info("Average Context Precision: %.4f", avg["context_precision"])
        logger.info("Average RAGAS Score: %.4f", avg["ragas_score"])
-        logger.info("")
+        logger.info("%s", "-" * 70)
        logger.info(
            "Min RAGAS Score: %.4f",
            benchmark_stats["min_ragas_score"],