cherry-pick 9c057060

2025-12-04 19:14:25 +08:00 · 56b8806256
commit 56b8806256
parent 89f8048df5
3 changed files with 980 additions and 242 deletions
--- a/env.example
+++ b/env.example
@ -50,6 +50,8 @@ OLLAMA_EMULATING_MODEL_TAG=latest
 # JWT_ALGORITHM=HS256
 ### API-Key to access LightRAG Server API
 ### Use this key in HTTP requests with the 'X-API-Key' header
 ### Example: curl -H "X-API-Key: your-secure-api-key-here" http://localhost:9621/query
 # LIGHTRAG_API_KEY=your-secure-api-key-here
 # WHITELIST_PATHS=/health,/api/*
@ -73,16 +75,6 @@ ENABLE_LLM_CACHE=true
 # MAX_RELATION_TOKENS=8000
 ### control the maximum tokens sent to LLM (including entities, relations and chunks)
 # MAX_TOTAL_TOKENS=30000
 ### control the maximum chunk_ids stored in vector and graph db
 # MAX_SOURCE_IDS_PER_ENTITY=300
 # MAX_SOURCE_IDS_PER_RELATION=300
 ### control chunk_ids limitation method: KEEP, FIFO (KEEP: Ignore new chunks, FIFO: New chunks replace old chunks)
 # SOURCE_IDS_LIMIT_METHOD=KEEP
 ### maximum number of related chunks per source entity or relation
 ###     The chunk picker uses this value to determine the total number of chunks selected from KG(knowledge graph)
 ###     Higher values increase re-ranking time
 # RELATED_CHUNK_NUMBER=5
 ### chunk selection strategies
 ###     VECTOR: Pick KG chunks by vector similarity; chunks delivered to the LLM align more closely with naive retrieval
@ -110,9 +102,6 @@ RERANK_BINDING=null
 # RERANK_MODEL=rerank-v3.5
 # RERANK_BINDING_HOST=https://api.cohere.com/v2/rerank
 # RERANK_BINDING_API_KEY=your_rerank_api_key_here
 ### Cohere rerank chunking configuration (useful for models with token limits like ColBERT)
 # RERANK_ENABLE_CHUNKING=true
 # RERANK_MAX_TOKENS_PER_DOC=480
 ### Default value for Jina AI
 # RERANK_MODEL=jina-reranker-v2-base-multilingual
@ -132,6 +121,9 @@ ENABLE_LLM_CACHE_FOR_EXTRACT=true
 ### Document processing output language: English, Chinese, French, German ...
 SUMMARY_LANGUAGE=English
 ### PDF decryption password for protected PDF files
 # PDF_DECRYPT_PASSWORD=your_pdf_password_here
 ### Entity types that the LLM will attempt to recognize
 # ENTITY_TYPES='["Person", "Creature", "Organization", "Location", "Event", "Concept", "Method", "Content", "Data", "Artifact", "NaturalObject"]'
@ -148,6 +140,22 @@ SUMMARY_LANGUAGE=English
 ### Maximum context size sent to LLM for description summary
 # SUMMARY_CONTEXT_SIZE=12000
 ### control the maximum chunk_ids stored in vector and graph db
 # MAX_SOURCE_IDS_PER_ENTITY=300
 # MAX_SOURCE_IDS_PER_RELATION=300
 ### control chunk_ids limitation method: FIFO, KEEP
 ###    FIFO: First in first out
 ###    KEEP: Keep oldest (less merge action and faster)
 # SOURCE_IDS_LIMIT_METHOD=FIFO
 ### Maximum number of file paths stored in entity/relation file_path field (for display only, does not affect query performance)
 # MAX_FILE_PATHS=100
 ### maximum number of related chunks per source entity or relation
 ###     The chunk picker uses this value to determine the total number of chunks selected from KG(knowledge graph)
 ###     Higher values increase re-ranking time
 # RELATED_CHUNK_NUMBER=5
 ###############################
 ### Concurrency Configuration
 ###############################
@ -386,3 +394,35 @@ MEMGRAPH_USERNAME=
 MEMGRAPH_PASSWORD=
 MEMGRAPH_DATABASE=memgraph
 # MEMGRAPH_WORKSPACE=forced_workspace_name
 ############################
 ### Evaluation Configuration
 ############################
 ### RAGAS evaluation models (used for RAG quality assessment)
 ### ⚠️ IMPORTANT: Both LLM and Embedding endpoints MUST be OpenAI-compatible
 ### Default uses OpenAI models for evaluation
 ### LLM Configuration for Evaluation
 # EVAL_LLM_MODEL=gpt-4o-mini
 ### API key for LLM evaluation (fallback to OPENAI_API_KEY if not set)
 # EVAL_LLM_BINDING_API_KEY=your_api_key
 ### Custom OpenAI-compatible endpoint for LLM evaluation (optional)
 # EVAL_LLM_BINDING_HOST=https://api.openai.com/v1
 ### Embedding Configuration for Evaluation
 # EVAL_EMBEDDING_MODEL=text-embedding-3-large
 ### API key for embeddings (fallback: EVAL_LLM_BINDING_API_KEY -> OPENAI_API_KEY)
 # EVAL_EMBEDDING_BINDING_API_KEY=your_embedding_api_key
 ### Custom OpenAI-compatible endpoint for embeddings (fallback: EVAL_LLM_BINDING_HOST)
 # EVAL_EMBEDDING_BINDING_HOST=https://api.openai.com/v1
 ### Performance Tuning
 ### Number of concurrent test case evaluations
 ### Lower values reduce API rate limit issues but increase evaluation time
 # EVAL_MAX_CONCURRENT=2
 ### TOP_K query parameter of LightRAG (default: 10)
 ### Number of entities or relations retrieved from KG
 # EVAL_QUERY_TOP_K=10
 ### LLM request retry and timeout settings for evaluation
 # EVAL_LLM_MAX_RETRIES=5
 # EVAL_LLM_TIMEOUT=180
--- a/lightrag/evaluation/README.md
+++ b/lightrag/evaluation/README.md
@ -1,12 +1,8 @@
-# 📊 LightRAG Evaluation Framework
+# 📊 RAGAS-based Evaluation Framework
 RAGAS-based offline evaluation of your LightRAG system.
 ## What is RAGAS?
-**RAGAS** (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs.
+**RAGAS** (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs. RAGAS uses state-of-the-art evaluation metrics:
 Instead of requiring human-annotated ground truth, RAGAS uses state-of-the-art evaluation metrics:
 ### Core Metrics
@ -18,9 +14,7 @@ Instead of requiring human-annotated ground truth, RAGAS uses state-of-the-art e
 | **Context Precision** | Is retrieved context clean without irrelevant noise? | > 0.80 |
 | **RAGAS Score** | Overall quality metric (average of above) | > 0.80 |
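As a rough illustration of the last row, the sketch below assumes the RAGAS Score is the plain unweighted mean of the four metric scores. The function name `ragas_score` is hypothetical and not part of the evaluator; the input values are taken from row 3 of the sample output at the end of this README.

```python
from statistics import fmean


def ragas_score(faithfulness, answer_relevance, context_recall, context_precision):
    """Unweighted mean of the four metric scores (assumed aggregation)."""
    return fmean([faithfulness, answer_relevance, context_recall, context_precision])


# Scores for question 3 in the sample output: Faith, AnswRel, CtxRec, CtxPrec
score = ragas_score(0.8056, 1.0, 1.0, 1.0)
print(f"{score:.4f}")
```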
---
+### 📁 LightRAG Evaluation Framework Directory Structure
 ## 📁 Structure
 ```
 lightrag/evaluation/
@ -42,7 +36,7 @@ lightrag/evaluation/
 **Quick Test:** Index files from `sample_documents/` into LightRAG, then run the evaluator to reproduce results (~89-100% RAGAS score per question).
---
+
 ## 🚀 Quick Start
@ -55,20 +49,35 @@ pip install ragas datasets langfuse
 Or use your project dependencies (already included in pyproject.toml):
 ```bash
-pip install -e ".[offline-llm]"
+pip install -e ".[evaluation]"
 ```
 ### 2. Run Evaluation
 **Basic usage (uses defaults):**
 ```bash
 cd /path/to/LightRAG
-python -m lightrag.evaluation.eval_rag_quality
+python lightrag/evaluation/eval_rag_quality.py
 ```
-Or directly:
+**Specify custom dataset:**
 ```bash
-python lightrag/evaluation/eval_rag_quality.py
+python lightrag/evaluation/eval_rag_quality.py --dataset my_test.json
 ```
 **Specify custom RAG endpoint:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
 ```
 **Specify both (short form):**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py -d my_test.json -r http://localhost:9621
 ```
 **Get help:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py --help
 ```
 ### 3. View Results
@ -87,7 +96,179 @@ results/
 - 📋 Individual test case results
 - 📈 Performance breakdown by question
---
+
 ## 📋 Command-Line Arguments
 The evaluation script supports command-line arguments for easy configuration:
 | Argument | Short | Default | Description |
 |----------|-------|---------|-------------|
 | `--dataset` | `-d` | `sample_dataset.json` | Path to test dataset JSON file |
 | `--ragendpoint` | `-r` | `http://localhost:9621` or `$LIGHTRAG_API_URL` | LightRAG API endpoint URL |
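The table above could be wired up with `argparse` roughly as follows. This is a minimal sketch, not the actual code in `eval_rag_quality.py`; the helper name `build_parser` and the help strings are assumptions, while the flags and defaults mirror the table.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="RAGAS evaluation (sketch)")
    parser.add_argument(
        "--dataset", "-d",
        default="sample_dataset.json",
        help="Path to test dataset JSON file",
    )
    parser.add_argument(
        "--ragendpoint", "-r",
        # $LIGHTRAG_API_URL wins over the built-in default when it is set
        default=os.getenv("LIGHTRAG_API_URL", "http://localhost:9621"),
        help="LightRAG API endpoint URL",
    )
    return parser


args = build_parser().parse_args(["-d", "my_test.json"])
print(args.dataset, args.ragendpoint)
```

Reading the environment variable inside `default=` keeps the precedence simple: an explicit `-r` flag overrides `$LIGHTRAG_API_URL`, which in turn overrides the hard-coded URL.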
 ### Usage Examples
 **Use default dataset and endpoint:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py
 ```
 **Custom dataset with default endpoint:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py --dataset path/to/my_dataset.json
 ```
 **Default dataset with custom endpoint:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
 ```
 **Custom dataset and endpoint:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py -d my_dataset.json -r http://localhost:9621
 ```
 **Absolute path to dataset:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py -d /path/to/custom_dataset.json
 ```
 **Show help message:**
 ```bash
 python lightrag/evaluation/eval_rag_quality.py --help
 ```
 ## ⚙️ Configuration
 ### Environment Variables
 The evaluation framework supports customization through environment variables:
 **⚠️ IMPORTANT: Both LLM and Embedding endpoints MUST be OpenAI-compatible**
 - The RAGAS framework requires OpenAI-compatible API interfaces
 - Custom endpoints must implement the OpenAI API format (e.g., vLLM, SGLang, LocalAI)
 - Non-compatible endpoints will cause evaluation failures
 | Variable | Default | Description |
 |----------|---------|-------------|
 | **LLM Configuration** | | |
 | `EVAL_LLM_MODEL` | `gpt-4o-mini` | LLM model used for RAGAS evaluation |
 | `EVAL_LLM_BINDING_API_KEY` | falls back to `OPENAI_API_KEY` | API key for LLM evaluation |
 | `EVAL_LLM_BINDING_HOST` | (optional) | Custom OpenAI-compatible endpoint URL for LLM |
 | **Embedding Configuration** | | |
 | `EVAL_EMBEDDING_MODEL` | `text-embedding-3-large` | Embedding model for evaluation |
 | `EVAL_EMBEDDING_BINDING_API_KEY` | falls back to `EVAL_LLM_BINDING_API_KEY` → `OPENAI_API_KEY` | API key for embeddings |
 | `EVAL_EMBEDDING_BINDING_HOST` | falls back to `EVAL_LLM_BINDING_HOST` | Custom OpenAI-compatible endpoint URL for embeddings |
 | **Performance Tuning** | | |
 | `EVAL_MAX_CONCURRENT` | 2 | Number of concurrent test case evaluations (1=serial) |
 | `EVAL_QUERY_TOP_K` | 10 | TOP_K query parameter of LightRAG: number of entities or relations retrieved from the KG per query |
 | `EVAL_LLM_MAX_RETRIES` | 5 | Maximum LLM request retries |
 | `EVAL_LLM_TIMEOUT` | 180 | LLM request timeout in seconds |
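The fallback chains in the table can be read as "use the first variable that is set". The sketch below shows one way to express that; `first_set` and `resolve_eval_settings` are hypothetical helper names, and the real evaluator may resolve these values differently.

```python
import os


def first_set(*names, env=None):
    """Return the value of the first variable in `names` that is set and non-empty."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def resolve_eval_settings(env=None):
    """Apply the documented defaults and fallback order (illustrative only)."""
    return {
        "llm_model": first_set("EVAL_LLM_MODEL", env=env) or "gpt-4o-mini",
        "llm_api_key": first_set("EVAL_LLM_BINDING_API_KEY", "OPENAI_API_KEY", env=env),
        "llm_host": first_set("EVAL_LLM_BINDING_HOST", env=env),
        "embedding_model": first_set("EVAL_EMBEDDING_MODEL", env=env)
        or "text-embedding-3-large",
        "embedding_api_key": first_set(
            "EVAL_EMBEDDING_BINDING_API_KEY",
            "EVAL_LLM_BINDING_API_KEY",
            "OPENAI_API_KEY",
            env=env,
        ),
        "embedding_host": first_set(
            "EVAL_EMBEDDING_BINDING_HOST", "EVAL_LLM_BINDING_HOST", env=env
        ),
    }


# Only the LLM endpoint is configured, so the embeddings inherit it.
demo_env = {
    "EVAL_LLM_BINDING_API_KEY": "your-custom-key",
    "EVAL_LLM_BINDING_HOST": "http://localhost:8000/v1",
}
print(resolve_eval_settings(demo_env))
```

Passing a plain dictionary as `env` makes the resolution easy to check without touching the real process environment.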
 ### Usage Examples
 **Example 1: Default Configuration (OpenAI Official API)**
 ```bash
 export OPENAI_API_KEY=sk-xxx
 python lightrag/evaluation/eval_rag_quality.py
 ```
 Both LLM and embeddings use OpenAI's official API with default models.
 **Example 2: Custom Models on OpenAI**
 ```bash
 export OPENAI_API_KEY=sk-xxx
 export EVAL_LLM_MODEL=gpt-4o-mini
 export EVAL_EMBEDDING_MODEL=text-embedding-3-large
 python lightrag/evaluation/eval_rag_quality.py
 ```
 **Example 3: Same Custom OpenAI-Compatible Endpoint for Both**
 ```bash
 # Both LLM and embeddings use the same custom endpoint
 export EVAL_LLM_BINDING_API_KEY=your-custom-key
 export EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
 export EVAL_LLM_MODEL=qwen-plus
 export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
 python lightrag/evaluation/eval_rag_quality.py
 ```
 Embeddings automatically inherit LLM endpoint configuration.
 **Example 4: Separate Endpoints (Cost Optimization)**
 ```bash
 # Use OpenAI for LLM (high quality)
 export EVAL_LLM_BINDING_API_KEY=sk-openai-key
 export EVAL_LLM_MODEL=gpt-4o-mini
 # No EVAL_LLM_BINDING_HOST means use OpenAI official API
 # Use local vLLM for embeddings (cost-effective)
 export EVAL_EMBEDDING_BINDING_API_KEY=local-key
 export EVAL_EMBEDDING_BINDING_HOST=http://localhost:8001/v1
 export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
 python lightrag/evaluation/eval_rag_quality.py
 ```
 LLM uses OpenAI official API, embeddings use local custom endpoint.
 **Example 5: Different Custom Endpoints for LLM and Embeddings**
 ```bash
 # LLM on one OpenAI-compatible server
 export EVAL_LLM_BINDING_API_KEY=key1
 export EVAL_LLM_BINDING_HOST=http://llm-server:8000/v1
 export EVAL_LLM_MODEL=custom-llm
 # Embeddings on another OpenAI-compatible server
 export EVAL_EMBEDDING_BINDING_API_KEY=key2
 export EVAL_EMBEDDING_BINDING_HOST=http://embedding-server:8001/v1
 export EVAL_EMBEDDING_MODEL=custom-embedding
 python lightrag/evaluation/eval_rag_quality.py
 ```
 Both use different custom OpenAI-compatible endpoints.
 **Example 6: Using Environment Variables from .env File**
 ```bash
 # Create .env file in project root
 cat > .env << EOF
 EVAL_LLM_BINDING_API_KEY=your-key
 EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
 EVAL_LLM_MODEL=qwen-plus
 EVAL_EMBEDDING_MODEL=BAAI/bge-m3
 EOF
 # Run evaluation (automatically loads .env)
 python lightrag/evaluation/eval_rag_quality.py
 ```
 ### Concurrency Control & Rate Limiting
 The evaluation framework includes built-in concurrency control to prevent API rate limiting issues:
 **Why Concurrency Control Matters:**
 - RAGAS internally makes many concurrent LLM calls for each test case
 - Context Precision metric calls LLM once per retrieved document
 - Without control, this can easily exceed API rate limits
 **Default Configuration (Conservative):**
 ```bash
 EVAL_MAX_CONCURRENT=2    # Two test cases evaluated concurrently (set to 1 for serial)
 EVAL_QUERY_TOP_K=10      # TOP_K query parameter of LightRAG
 EVAL_LLM_MAX_RETRIES=5   # Retry failed requests 5 times
 EVAL_LLM_TIMEOUT=180     # 3-minute timeout per request
 ```
 **Common Issues and Solutions:**
 | Issue | Solution |
 |-------|----------|
 | **Warning: "LLM returned 1 generations instead of requested 3"** | Reduce `EVAL_MAX_CONCURRENT` to 1 or decrease `EVAL_QUERY_TOP_K` |
 | **Context Precision returns NaN** | Lower `EVAL_QUERY_TOP_K` to reduce LLM calls per test case |
 | **Rate limit errors (429)** | Increase `EVAL_LLM_MAX_RETRIES` and decrease `EVAL_MAX_CONCURRENT` |
 | **Request timeouts** | Increase `EVAL_LLM_TIMEOUT` above the default of 180 seconds |
 ## 📝 Test Dataset
@ -101,7 +282,7 @@ results/
    {
      "question": "Your question here",
      "ground_truth": "Expected answer from your data",
-      "context": "topic"
+      "project": "evaluation_project_name"
    }
  ]
 }
@ -166,6 +347,50 @@ results/
 pip install ragas datasets
 ```
 ### "Warning: LLM returned 1 generations instead of requested 3" or Context Precision NaN
 **Cause**: This warning indicates API rate limiting or concurrent request overload:
 - RAGAS makes multiple LLM calls per test case (faithfulness, relevancy, recall, precision)
 - Context Precision calls LLM once per retrieved document (with `EVAL_QUERY_TOP_K=10`, that's 10 calls)
 - Concurrent evaluation multiplies these calls: `EVAL_MAX_CONCURRENT × LLM calls per test`
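The multiplication above can be turned into a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that each of the three other metrics makes a single LLM call per test case; real RAGAS call counts vary by metric and version, so treat the result as a rough lower bound. The function name is hypothetical.

```python
def estimate_in_flight_llm_calls(max_concurrent, top_k, other_metric_calls=3):
    """Rough estimate of LLM calls that may be in flight at the same time.

    top_k              -- Context Precision: one call per retrieved item
    other_metric_calls -- ASSUMED one call each for faithfulness,
                          answer relevance and context recall
    """
    calls_per_test = top_k + other_metric_calls
    return max_concurrent * calls_per_test


# Defaults: EVAL_MAX_CONCURRENT=2, EVAL_QUERY_TOP_K=10
print(estimate_in_flight_llm_calls(2, 10))
# Serial evaluation with a smaller top-k
print(estimate_in_flight_llm_calls(1, 5))
```

Under these assumptions, lowering either setting reduces the burst of requests the API has to absorb.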
 **Solutions** (in order of effectiveness):
 1. **Serial Evaluation**:
   ```bash
   export EVAL_MAX_CONCURRENT=1
   python lightrag/evaluation/eval_rag_quality.py
   ```
 2. **Reduce Retrieved Documents**:
   ```bash
   export EVAL_QUERY_TOP_K=5  # Halves Context Precision LLM calls
   python lightrag/evaluation/eval_rag_quality.py
   ```
 3. **Increase Retry & Timeout**:
   ```bash
   export EVAL_LLM_MAX_RETRIES=10
   export EVAL_LLM_TIMEOUT=300  # Example value above the 180-second default
   python lightrag/evaluation/eval_rag_quality.py
   ```
 4. **Use Higher Quota API** (if available):
   - Upgrade to OpenAI Tier 2+ for higher RPM limits
   - Use self-hosted OpenAI-compatible service with no rate limits
 ### "AttributeError: 'InstructorLLM' object has no attribute 'agenerate_prompt'" or NaN results
 This error occurs with RAGAS 0.3.x when LLM and Embeddings are not explicitly configured. The evaluation framework now handles this automatically by:
 - Using environment variables to configure evaluation models
 - Creating proper LLM and Embeddings instances for RAGAS
 **Solution**: Ensure you have set one of the following:
 - `OPENAI_API_KEY` environment variable (default)
 - `EVAL_LLM_BINDING_API_KEY` for custom API key
 The framework will automatically configure the evaluation models.
 ### "No sample_dataset.json found"
 Make sure you're running from the project root:
@ -175,11 +400,10 @@ cd /path/to/LightRAG
 python lightrag/evaluation/eval_rag_quality.py
 ```
-### "LLM API errors during evaluation"
+### "LightRAG query API errors during evaluation"
 The evaluation uses your configured LLM (OpenAI by default). Ensure:
 - API keys are set in `.env`
 - API quota is sufficient
 - Network connection is stable
 ### Evaluation requires running LightRAG API
@ -189,15 +413,74 @@ The evaluator queries a running LightRAG API server at `http://localhost:9621`.
 2. Documents are indexed in your LightRAG instance
 3. API is accessible at the configured URL
---
+
 ## 📝 Next Steps
-1. Index documents into LightRAG (WebUI or API)
+1. Start LightRAG API server
-2. Start LightRAG API server
+2. Upload sample documents into LightRAG through the WebUI
 3. Run `python lightrag/evaluation/eval_rag_quality.py`
 4. Review results (JSON/CSV) in `results/` folder
-5. Adjust entity extraction prompts or retrieval settings based on scores
+
 Evaluation Result Sample:
 ```
 INFO: ======================================================================
 INFO: 🔍 RAGAS Evaluation - Using Real LightRAG API
 INFO: ======================================================================
 INFO: Evaluation Models:
 INFO:   • LLM Model:            gpt-4.1
 INFO:   • Embedding Model:      text-embedding-3-large
 INFO:   • Endpoint:             OpenAI Official API
 INFO: Concurrency & Rate Limiting:
 INFO:   • Query Top-K:          10 Entities/Relations
 INFO:   • LLM Max Retries:      5
 INFO:   • LLM Timeout:          180 seconds
 INFO: Test Configuration:
 INFO:   • Total Test Cases:     6
 INFO:   • Test Dataset:         sample_dataset.json
 INFO:   • LightRAG API:         http://localhost:9621
 INFO:   • Results Directory:    results
 INFO: ======================================================================
 INFO: 🚀 Starting RAGAS Evaluation of LightRAG System
 INFO: 🔧 RAGAS Evaluation (Stage 2): 2 concurrent
 INFO: ======================================================================
 INFO:
 INFO: ===================================================================================================================
 INFO: 📊 EVALUATION RESULTS SUMMARY
 INFO: ===================================================================================================================
 INFO: #    | Question                                           |  Faith | AnswRel | CtxRec | CtxPrec |  RAGAS | Status
 INFO: -------------------------------------------------------------------------------------------------------------------
 INFO: 1    | How does LightRAG solve the hallucination probl... | 1.0000 |  1.0000 | 1.0000 |  1.0000 | 1.0000 |      ✓
 INFO: 2    | What are the three main components required in ... | 0.8500 |  0.5790 | 1.0000 |  1.0000 | 0.8573 |      ✓
 INFO: 3    | How does LightRAG's retrieval performance compa... | 0.8056 |  1.0000 | 1.0000 |  1.0000 | 0.9514 |      ✓
 INFO: 4    | What vector databases does LightRAG support and... | 0.8182 |  0.9807 | 1.0000 |  1.0000 | 0.9497 |      ✓
 INFO: 5    | What are the four key metrics for evaluating RA... | 1.0000 |  0.7452 | 1.0000 |  1.0000 | 0.9363 |      ✓
 INFO: 6    | What are the core benefits of LightRAG and how ... | 0.9583 |  0.8829 | 1.0000 |  1.0000 | 0.9603 |      ✓
 INFO: ===================================================================================================================
 INFO:
 INFO: ======================================================================
 INFO: 📊 EVALUATION COMPLETE
 INFO: ======================================================================
 INFO: Total Tests:    6
 INFO: Successful:     6
 INFO: Failed:         0
 INFO: Success Rate:   100.00%
 INFO: Elapsed Time:   161.10 seconds
 INFO: Avg Time/Test:  26.85 seconds
 INFO:
 INFO: ======================================================================
 INFO: 📈 BENCHMARK RESULTS (Average)
 INFO: ======================================================================
 INFO: Average Faithfulness:      0.9053
 INFO: Average Answer Relevance:  0.8646
 INFO: Average Context Recall:    1.0000
 INFO: Average Context Precision: 1.0000
 INFO: Average RAGAS Score:       0.9425
 INFO: ----------------------------------------------------------------------
 INFO: Min RAGAS Score:           0.8573
 INFO: Max RAGAS Score:           1.0000
 ```
 ---
--- a/lightrag/evaluation/eval_rag_quality.py
+++ b/lightrag/evaluation/eval_rag_quality.py