This commit is contained in:
Raphaël MANSUY 2025-12-04 19:14:25 +08:00
parent 89f8048df5
commit 56b8806256
3 changed files with 980 additions and 242 deletions

@ -50,6 +50,8 @@ OLLAMA_EMULATING_MODEL_TAG=latest
# JWT_ALGORITHM=HS256
### API-Key to access LightRAG Server API
### Use this key in HTTP requests with the 'X-API-Key' header
### Example: curl -H "X-API-Key: your-secure-api-key-here" http://localhost:9621/query
# LIGHTRAG_API_KEY=your-secure-api-key-here
# WHITELIST_PATHS=/health,/api/*
@ -73,16 +75,6 @@ ENABLE_LLM_CACHE=true
# MAX_RELATION_TOKENS=8000
### control the maximum tokens sent to LLM (includes entities, relations and chunks)
# MAX_TOTAL_TOKENS=30000
### control the maximum chunk_ids stored in vector and graph db
# MAX_SOURCE_IDS_PER_ENTITY=300
# MAX_SOURCE_IDS_PER_RELATION=300
### control chunk_ids limitation method: KEEP, FIFO (KEEP: Ignore new chunks, FIFO: New chunks replace old chunks)
# SOURCE_IDS_LIMIT_METHOD=KEEP
### maximum number of related chunks per source entity or relation
### The chunk picker uses this value to determine the total number of chunks selected from KG(knowledge graph)
### Higher values increase re-ranking time
# RELATED_CHUNK_NUMBER=5
### chunk selection strategies
### VECTOR: Pick KG chunks by vector similarity, delivering chunks to the LLM that align more closely with naive retrieval
@ -110,9 +102,6 @@ RERANK_BINDING=null
# RERANK_MODEL=rerank-v3.5
# RERANK_BINDING_HOST=https://api.cohere.com/v2/rerank
# RERANK_BINDING_API_KEY=your_rerank_api_key_here
### Cohere rerank chunking configuration (useful for models with token limits like ColBERT)
# RERANK_ENABLE_CHUNKING=true
# RERANK_MAX_TOKENS_PER_DOC=480
### Default value for Jina AI
# RERANK_MODEL=jina-reranker-v2-base-multilingual
@ -132,6 +121,9 @@ ENABLE_LLM_CACHE_FOR_EXTRACT=true
### Document processing output language: English, Chinese, French, German ...
SUMMARY_LANGUAGE=English
### PDF decryption password for protected PDF files
# PDF_DECRYPT_PASSWORD=your_pdf_password_here
### Entity types that the LLM will attempt to recognize
# ENTITY_TYPES='["Person", "Creature", "Organization", "Location", "Event", "Concept", "Method", "Content", "Data", "Artifact", "NaturalObject"]'
@ -148,6 +140,22 @@ SUMMARY_LANGUAGE=English
### Maximum context size sent to LLM for description summary
# SUMMARY_CONTEXT_SIZE=12000
### control the maximum chunk_ids stored in vector and graph db
# MAX_SOURCE_IDS_PER_ENTITY=300
# MAX_SOURCE_IDS_PER_RELATION=300
### control chunk_ids limitation method: FIFO, KEEP
### FIFO: First in, first out (new chunks replace the oldest)
### KEEP: Keep oldest chunks (ignore new chunks; fewer merge operations, faster)
# SOURCE_IDS_LIMIT_METHOD=FIFO
### Maximum number of file paths stored in entity/relation file_path field (for display only, does not affect query performance)
# MAX_FILE_PATHS=100
### maximum number of related chunks per source entity or relation
### The chunk picker uses this value to determine the total number of chunks selected from KG(knowledge graph)
### Higher values increase re-ranking time
# RELATED_CHUNK_NUMBER=5
###############################
### Concurrency Configuration
###############################
@ -386,3 +394,35 @@ MEMGRAPH_USERNAME=
MEMGRAPH_PASSWORD=
MEMGRAPH_DATABASE=memgraph
# MEMGRAPH_WORKSPACE=forced_workspace_name
############################
### Evaluation Configuration
############################
### RAGAS evaluation models (used for RAG quality assessment)
### ⚠️ IMPORTANT: Both LLM and Embedding endpoints MUST be OpenAI-compatible
### Default uses OpenAI models for evaluation
### LLM Configuration for Evaluation
# EVAL_LLM_MODEL=gpt-4o-mini
### API key for LLM evaluation (fallback to OPENAI_API_KEY if not set)
# EVAL_LLM_BINDING_API_KEY=your_api_key
### Custom OpenAI-compatible endpoint for LLM evaluation (optional)
# EVAL_LLM_BINDING_HOST=https://api.openai.com/v1
### Embedding Configuration for Evaluation
# EVAL_EMBEDDING_MODEL=text-embedding-3-large
### API key for embeddings (fallback: EVAL_LLM_BINDING_API_KEY -> OPENAI_API_KEY)
# EVAL_EMBEDDING_BINDING_API_KEY=your_embedding_api_key
### Custom OpenAI-compatible endpoint for embeddings (fallback: EVAL_LLM_BINDING_HOST)
# EVAL_EMBEDDING_BINDING_HOST=https://api.openai.com/v1
### Performance Tuning
### Number of concurrent test case evaluations
### Lower values reduce API rate limit issues but increase evaluation time
# EVAL_MAX_CONCURRENT=2
### TOP_K query parameter of LightRAG (default: 10)
### Number of entities or relations retrieved from KG
# EVAL_QUERY_TOP_K=10
### LLM request retry and timeout settings for evaluation
# EVAL_LLM_MAX_RETRIES=5
# EVAL_LLM_TIMEOUT=180

@ -1,12 +1,8 @@
# 📊 LightRAG Evaluation Framework
RAGAS-based offline evaluation of your LightRAG system.
# 📊 RAGAS-based Evaluation Framework
## What is RAGAS?
**RAGAS** (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs.
Instead of requiring human-annotated ground truth, RAGAS uses state-of-the-art evaluation metrics:
**RAGAS** (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs. RAGAS uses state-of-the-art evaluation metrics:
### Core Metrics
@ -18,9 +14,7 @@ Instead of requiring human-annotated ground truth, RAGAS uses state-of-the-art e
| **Context Precision** | Is retrieved context clean without irrelevant noise? | > 0.80 |
| **RAGAS Score** | Overall quality metric (average of above) | > 0.80 |
---
## 📁 Structure
### 📁 LightRAG Evaluation Framework Directory Structure
```
lightrag/evaluation/
@ -42,7 +36,7 @@ lightrag/evaluation/
**Quick Test:** Index files from `sample_documents/` into LightRAG, then run the evaluator to reproduce results (~89-100% RAGAS score per question).
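A minimal sketch of that quick test (assumptions: the API server is already running on the default port, the sample files sit under `lightrag/evaluation/sample_documents/`, and the server's standard `/documents/upload` route is enabled; add an `X-API-Key` header if `LIGHTRAG_API_KEY` is set):

```bash
# Sketch only: index every sample document, then run the evaluator.
for f in lightrag/evaluation/sample_documents/*; do
  curl -X POST http://localhost:9621/documents/upload -F "file=@${f}"
done

# After indexing completes, evaluate against the same server
python lightrag/evaluation/eval_rag_quality.py
```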
---
## 🚀 Quick Start
@ -55,20 +49,35 @@ pip install ragas datasets langfuse
Or use your project dependencies (already included in pyproject.toml):
```bash
pip install -e ".[offline-llm]"
pip install -e ".[evaluation]"
```
### 2. Run Evaluation
**Basic usage (uses defaults):**
```bash
cd /path/to/LightRAG
python -m lightrag.evaluation.eval_rag_quality
python lightrag/evaluation/eval_rag_quality.py
```
Or directly:
**Specify custom dataset:**
```bash
python lightrag/evaluation/eval_rag_quality.py
python lightrag/evaluation/eval_rag_quality.py --dataset my_test.json
```
**Specify custom RAG endpoint:**
```bash
python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
```
**Specify both (short form):**
```bash
python lightrag/evaluation/eval_rag_quality.py -d my_test.json -r http://localhost:9621
```
**Get help:**
```bash
python lightrag/evaluation/eval_rag_quality.py --help
```
### 3. View Results
@ -87,7 +96,179 @@ results/
- 📋 Individual test case results
- 📈 Performance breakdown by question
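To peek at the newest run from the shell (file names below are illustrative; the evaluator writes timestamped files into the results directory):

```bash
# List the most recent evaluation outputs
ls -lt results/ | head

# Pretty-print the newest JSON report using only the Python standard library
python -m json.tool "$(ls -t results/*.json | head -1)"
```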
---
## 📋 Command-Line Arguments
The evaluation script supports command-line arguments for easy configuration:
| Argument | Short | Default | Description |
|----------|-------|---------|-------------|
| `--dataset` | `-d` | `sample_dataset.json` | Path to test dataset JSON file |
| `--ragendpoint` | `-r` | `http://localhost:9621` or `$LIGHTRAG_API_URL` | LightRAG API endpoint URL |
### Usage Examples
**Use default dataset and endpoint:**
```bash
python lightrag/evaluation/eval_rag_quality.py
```
**Custom dataset with default endpoint:**
```bash
python lightrag/evaluation/eval_rag_quality.py --dataset path/to/my_dataset.json
```
**Default dataset with custom endpoint:**
```bash
python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
```
**Custom dataset and endpoint:**
```bash
python lightrag/evaluation/eval_rag_quality.py -d my_dataset.json -r http://localhost:9621
```
**Absolute path to dataset:**
```bash
python lightrag/evaluation/eval_rag_quality.py -d /path/to/custom_dataset.json
```
**Show help message:**
```bash
python lightrag/evaluation/eval_rag_quality.py --help
```
## ⚙️ Configuration
### Environment Variables
The evaluation framework supports customization through environment variables:
**⚠️ IMPORTANT: Both LLM and Embedding endpoints MUST be OpenAI-compatible**
- The RAGAS framework requires OpenAI-compatible API interfaces
- Custom endpoints must implement the OpenAI API format (e.g., vLLM, SGLang, LocalAI)
- Non-compatible endpoints will cause evaluation failures
| Variable | Default | Description |
|----------|---------|-------------|
| **LLM Configuration** | | |
| `EVAL_LLM_MODEL` | `gpt-4o-mini` | LLM model used for RAGAS evaluation |
| `EVAL_LLM_BINDING_API_KEY` | falls back to `OPENAI_API_KEY` | API key for LLM evaluation |
| `EVAL_LLM_BINDING_HOST` | (optional) | Custom OpenAI-compatible endpoint URL for LLM |
| **Embedding Configuration** | | |
| `EVAL_EMBEDDING_MODEL` | `text-embedding-3-large` | Embedding model for evaluation |
| `EVAL_EMBEDDING_BINDING_API_KEY` | falls back to `EVAL_LLM_BINDING_API_KEY`, then `OPENAI_API_KEY` | API key for embeddings |
| `EVAL_EMBEDDING_BINDING_HOST` | falls back to `EVAL_LLM_BINDING_HOST` | Custom OpenAI-compatible endpoint URL for embeddings |
| **Performance Tuning** | | |
| `EVAL_MAX_CONCURRENT` | 2 | Number of concurrent test case evaluations (1=serial) |
| `EVAL_QUERY_TOP_K` | 10 | TOP_K query parameter of LightRAG: number of entities/relations retrieved from the KG per query |
| `EVAL_LLM_MAX_RETRIES` | 5 | Maximum LLM request retries |
| `EVAL_LLM_TIMEOUT` | 180 | LLM request timeout in seconds |
### Usage Examples
**Example 1: Default Configuration (OpenAI Official API)**
```bash
export OPENAI_API_KEY=sk-xxx
python lightrag/evaluation/eval_rag_quality.py
```
Both LLM and embeddings use OpenAI's official API with default models.
**Example 2: Custom Models on OpenAI**
```bash
export OPENAI_API_KEY=sk-xxx
export EVAL_LLM_MODEL=gpt-4o-mini
export EVAL_EMBEDDING_MODEL=text-embedding-3-large
python lightrag/evaluation/eval_rag_quality.py
```
**Example 3: Same Custom OpenAI-Compatible Endpoint for Both**
```bash
# Both LLM and embeddings use the same custom endpoint
export EVAL_LLM_BINDING_API_KEY=your-custom-key
export EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
export EVAL_LLM_MODEL=qwen-plus
export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
python lightrag/evaluation/eval_rag_quality.py
```
Embeddings automatically inherit LLM endpoint configuration.
**Example 4: Separate Endpoints (Cost Optimization)**
```bash
# Use OpenAI for LLM (high quality)
export EVAL_LLM_BINDING_API_KEY=sk-openai-key
export EVAL_LLM_MODEL=gpt-4o-mini
# No EVAL_LLM_BINDING_HOST means use OpenAI official API
# Use local vLLM for embeddings (cost-effective)
export EVAL_EMBEDDING_BINDING_API_KEY=local-key
export EVAL_EMBEDDING_BINDING_HOST=http://localhost:8001/v1
export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
python lightrag/evaluation/eval_rag_quality.py
```
LLM uses OpenAI official API, embeddings use local custom endpoint.
**Example 5: Different Custom Endpoints for LLM and Embeddings**
```bash
# LLM on one OpenAI-compatible server
export EVAL_LLM_BINDING_API_KEY=key1
export EVAL_LLM_BINDING_HOST=http://llm-server:8000/v1
export EVAL_LLM_MODEL=custom-llm
# Embeddings on another OpenAI-compatible server
export EVAL_EMBEDDING_BINDING_API_KEY=key2
export EVAL_EMBEDDING_BINDING_HOST=http://embedding-server:8001/v1
export EVAL_EMBEDDING_MODEL=custom-embedding
python lightrag/evaluation/eval_rag_quality.py
```
Both use different custom OpenAI-compatible endpoints.
**Example 6: Using Environment Variables from .env File**
```bash
# Create .env file in project root
cat > .env << EOF
EVAL_LLM_BINDING_API_KEY=your-key
EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
EVAL_LLM_MODEL=qwen-plus
EVAL_EMBEDDING_MODEL=BAAI/bge-m3
EOF
# Run evaluation (automatically loads .env)
python lightrag/evaluation/eval_rag_quality.py
```
### Concurrency Control & Rate Limiting
The evaluation framework includes built-in concurrency control to prevent API rate limiting issues:
**Why Concurrency Control Matters:**
- RAGAS internally makes many concurrent LLM calls for each test case
- Context Precision metric calls LLM once per retrieved document
- Without control, this can easily exceed API rate limits
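As a rough worked example: with the defaults below (`EVAL_MAX_CONCURRENT=2`, `EVAL_QUERY_TOP_K=10`), the Context Precision metric alone can put roughly 2 × 10 = 20 LLM calls in flight at once, before counting the faithfulness, relevancy and recall calls made for each test case.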
**Default Configuration (Conservative):**
```bash
EVAL_MAX_CONCURRENT=2   # Two concurrent test case evaluations (set to 1 for serial)
EVAL_QUERY_TOP_K=10     # TOP_K query parameter of LightRAG
EVAL_LLM_MAX_RETRIES=5 # Retry failed requests 5 times
EVAL_LLM_TIMEOUT=180 # 3-minute timeout per request
```
**Common Issues and Solutions:**
| Issue | Solution |
|-------|----------|
| **Warning: "LM returned 1 generations instead of 3"** | Reduce `EVAL_MAX_CONCURRENT` to 1 or decrease `EVAL_QUERY_TOP_K` |
| **Context Precision returns NaN** | Lower `EVAL_QUERY_TOP_K` to reduce LLM calls per test case |
| **Rate limit errors (429)** | Increase `EVAL_LLM_MAX_RETRIES` and decrease `EVAL_MAX_CONCURRENT` |
| **Request timeouts** | Increase `EVAL_LLM_TIMEOUT` above the default of 180 seconds |
## 📝 Test Dataset
@ -101,7 +282,7 @@ results/
{
"question": "Your question here",
"ground_truth": "Expected answer from your data",
"context": "topic"
"project": "evaluation_project_name"
}
]
}
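A hedged sketch of writing your own dataset in that shape (the wrapper key below is an assumption; mirror the exact structure of `sample_dataset.json`):

```bash
# Illustrative only: create a one-question dataset and run it.
# The top-level "test_cases" key is assumed -- check sample_dataset.json.
cat > my_test.json << 'EOF'
{
  "test_cases": [
    {
      "question": "What retrieval modes does LightRAG support?",
      "ground_truth": "LightRAG supports naive, local, global, hybrid and mix retrieval modes.",
      "project": "my_evaluation_project"
    }
  ]
}
EOF

python lightrag/evaluation/eval_rag_quality.py --dataset my_test.json
```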
@ -166,6 +347,50 @@ results/
pip install ragas datasets
```
### "Warning: LM returned 1 generations instead of requested 3" or Context Precision NaN
**Cause**: This warning indicates API rate limiting or concurrent request overload:
- RAGAS makes multiple LLM calls per test case (faithfulness, relevancy, recall, precision)
- Context Precision calls LLM once per retrieved document (with `EVAL_QUERY_TOP_K=10`, that's 10 calls)
- Concurrent evaluation multiplies these calls: `EVAL_MAX_CONCURRENT × LLM calls per test`
**Solutions** (in order of effectiveness):
1. **Serial Evaluation** (Default):
```bash
export EVAL_MAX_CONCURRENT=1
python lightrag/evaluation/eval_rag_quality.py
```
2. **Reduce Retrieved Documents**:
```bash
export EVAL_QUERY_TOP_K=5 # Halves Context Precision LLM calls
python lightrag/evaluation/eval_rag_quality.py
```
3. **Increase Retry & Timeout**:
```bash
export EVAL_LLM_MAX_RETRIES=10
export EVAL_LLM_TIMEOUT=300  # above the 180-second default
python lightrag/evaluation/eval_rag_quality.py
```
4. **Use Higher Quota API** (if available):
- Upgrade to OpenAI Tier 2+ for higher RPM limits
- Use self-hosted OpenAI-compatible service with no rate limits
### "AttributeError: 'InstructorLLM' object has no attribute 'agenerate_prompt'" or NaN results
This error occurs with RAGAS 0.3.x when LLM and Embeddings are not explicitly configured. The evaluation framework now handles this automatically by:
- Using environment variables to configure evaluation models
- Creating proper LLM and Embeddings instances for RAGAS
**Solution**: Ensure you have set one of the following:
- `OPENAI_API_KEY` environment variable (default)
- `EVAL_LLM_BINDING_API_KEY` for custom API key
The framework will automatically configure the evaluation models.
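For example (keys are placeholders):

```bash
# Option 1: rely on the default OpenAI configuration
export OPENAI_API_KEY=sk-xxx

# Option 2: use a dedicated evaluation key (optionally with a custom OpenAI-compatible host)
export EVAL_LLM_BINDING_API_KEY=your-eval-key
# export EVAL_LLM_BINDING_HOST=http://localhost:8000/v1

python lightrag/evaluation/eval_rag_quality.py
```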
### "No sample_dataset.json found"
Make sure you're running from the project root:
@ -175,11 +400,10 @@ cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
```
### "LLM API errors during evaluation"
### "LightRAG query API errors during evaluation"
The evaluation uses your configured LLM (OpenAI by default). Ensure:
- API keys are set in `.env`
- You have sufficient API quota
- Your network connection is stable
### Evaluation requires running LightRAG API
@ -189,15 +413,74 @@ The evaluator queries a running LightRAG API server at `http://localhost:9621`.
2. Documents are indexed in your LightRAG instance
3. API is accessible at the configured URL
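A quick smoke test before launching the evaluation (a sketch assuming the default port and the server's `/health` and `/query` routes; add an `X-API-Key` header if `LIGHTRAG_API_KEY` is set):

```bash
# 1. Confirm the server is up
curl http://localhost:9621/health

# 2. Confirm documents are indexed and answerable with a single query
curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is LightRAG?", "mode": "hybrid"}'
```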
---
## 📝 Next Steps
1. Index documents into LightRAG (WebUI or API)
2. Start LightRAG API server
1. Start LightRAG API server
2. Upload sample documents into LightRAG through the WebUI
3. Run `python lightrag/evaluation/eval_rag_quality.py`
4. Review results (JSON/CSV) in `results/` folder
5. Adjust entity extraction prompts or retrieval settings based on scores
Evaluation Result Sample:
```
INFO: ======================================================================
INFO: 🔍 RAGAS Evaluation - Using Real LightRAG API
INFO: ======================================================================
INFO: Evaluation Models:
INFO: • LLM Model: gpt-4.1
INFO: • Embedding Model: text-embedding-3-large
INFO: • Endpoint: OpenAI Official API
INFO: Concurrency & Rate Limiting:
INFO: • Query Top-K: 10 Entities/Relations
INFO: • LLM Max Retries: 5
INFO: • LLM Timeout: 180 seconds
INFO: Test Configuration:
INFO: • Total Test Cases: 6
INFO: • Test Dataset: sample_dataset.json
INFO: • LightRAG API: http://localhost:9621
INFO: • Results Directory: results
INFO: ======================================================================
INFO: 🚀 Starting RAGAS Evaluation of LightRAG System
INFO: 🔧 RAGAS Evaluation (Stage 2): 2 concurrent
INFO: ======================================================================
INFO:
INFO: ===================================================================================================================
INFO: 📊 EVALUATION RESULTS SUMMARY
INFO: ===================================================================================================================
INFO: # | Question | Faith | AnswRel | CtxRec | CtxPrec | RAGAS | Status
INFO: -------------------------------------------------------------------------------------------------------------------
INFO: 1 | How does LightRAG solve the hallucination probl... | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ✓
INFO: 2 | What are the three main components required in ... | 0.8500 | 0.5790 | 1.0000 | 1.0000 | 0.8573 | ✓
INFO: 3 | How does LightRAG's retrieval performance compa... | 0.8056 | 1.0000 | 1.0000 | 1.0000 | 0.9514 | ✓
INFO: 4 | What vector databases does LightRAG support and... | 0.8182 | 0.9807 | 1.0000 | 1.0000 | 0.9497 | ✓
INFO: 5 | What are the four key metrics for evaluating RA... | 1.0000 | 0.7452 | 1.0000 | 1.0000 | 0.9363 | ✓
INFO: 6 | What are the core benefits of LightRAG and how ... | 0.9583 | 0.8829 | 1.0000 | 1.0000 | 0.9603 | ✓
INFO: ===================================================================================================================
INFO:
INFO: ======================================================================
INFO: 📊 EVALUATION COMPLETE
INFO: ======================================================================
INFO: Total Tests: 6
INFO: Successful: 6
INFO: Failed: 0
INFO: Success Rate: 100.00%
INFO: Elapsed Time: 161.10 seconds
INFO: Avg Time/Test: 26.85 seconds
INFO:
INFO: ======================================================================
INFO: 📈 BENCHMARK RESULTS (Average)
INFO: ======================================================================
INFO: Average Faithfulness: 0.9053
INFO: Average Answer Relevance: 0.8646
INFO: Average Context Recall: 1.0000
INFO: Average Context Precision: 1.0000
INFO: Average RAGAS Score: 0.9425
INFO: ----------------------------------------------------------------------
INFO: Min RAGAS Score: 0.8573
INFO: Max RAGAS Score: 1.0000
```
---

File diff suppressed because it is too large