feat(evaluation): Add sample documents for reproducible RAGAS testing
Add 5 markdown documents that users can index to reproduce evaluation results.

Changes:
- Add sample_documents/ folder with 5 markdown files covering LightRAG features
- Update sample_dataset.json with 3 improved, specific test questions
- Shorten and correct evaluation README (removed outdated info about mock responses)
- Add sample_documents reference with expected ~95% RAGAS score

Test results with sample documents:
- Average RAGAS Score: 95.28%
- Faithfulness: 100%, Answer Relevance: 96.67%
- Context Recall: 88.89%, Context Precision: 95.56%
Parent: 36694eb9f2
Commit: a172cf893d
8 changed files with 193 additions and 172 deletions
@@ -25,7 +25,14 @@ Instead of requiring human-annotated ground truth, RAGAS uses state-of-the-art e

```
lightrag/evaluation/
├── eval_rag_quality.py              # Main evaluation script
├── sample_dataset.json              # Generic LightRAG test cases (not personal data)
├── sample_dataset.json              # 3 test questions about LightRAG
├── sample_documents/                # Matching markdown files for testing
│   ├── 01_lightrag_overview.md
│   ├── 02_rag_architecture.md
│   ├── 03_lightrag_improvements.md
│   ├── 04_supported_databases.md
│   ├── 05_evaluation_and_deployment.md
│   └── README.md
├── __init__.py                      # Package init
├── results/                         # Output directory
│   ├── results_YYYYMMDD_HHMMSS.json # Raw metrics in JSON

@@ -33,7 +40,7 @@ lightrag/evaluation/
└── README.md                        # This file
```

**Note:** `sample_dataset.json` contains **generic test questions** about LightRAG features (RAG systems, vector databases, deployment, etc.). This is **not personal portfolio data** - you can use these questions directly to test your own LightRAG installation.
**Quick Test:** Index files from `sample_documents/` into LightRAG, then run the evaluator to reproduce results (~89-100% RAGAS score per question).

---
@@ -84,70 +91,22 @@ results/

## 📝 Test Dataset

The included `sample_dataset.json` contains **generic example questions** about LightRAG (RAG systems, vector databases, deployment, etc.). **This is NOT personal data** - it's meant as a template.
`sample_dataset.json` contains 3 generic questions about LightRAG. Replace with questions matching YOUR indexed documents.

**Important:** You should **replace these with test questions based on YOUR data** that you've injected into your RAG system.

### Creating Your Own Test Cases

Edit `sample_dataset.json` with questions relevant to your indexed documents:
**Custom Test Cases:**

```json
{
  "test_cases": [
    {
      "question": "Question based on your documents",
      "question": "Your question here",
      "ground_truth": "Expected answer from your data",
      "context": "topic_category"
      "context": "topic"
    }
  ]
}
```

**Example (for a technical portfolio):**

```json
{
  "question": "Which projects use PyTorch?",
  "ground_truth": "The Neural ODE Project uses PyTorch with TorchODE library for continuous-time neural networks.",
  "context": "ml_projects"
}
```
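
Before running the evaluator against a custom dataset, a quick structural check can catch missing or misspelled keys. A minimal sketch - the path and the required keys simply follow the schema shown above, nothing here is LightRAG-specific:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"question", "ground_truth", "context"}

def validate_dataset(path: str = "lightrag/evaluation/sample_dataset.json") -> None:
    """Check that every test case carries the keys the evaluator expects."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    cases = data.get("test_cases", [])
    if not cases:
        raise ValueError(f"No test_cases found in {path}")
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"Test case {i} is missing keys: {sorted(missing)}")
    print(f"{len(cases)} test cases look structurally valid.")

if __name__ == "__main__":
    validate_dataset()
```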

---

## 🔧 Integration with Your RAG System

Currently, the evaluation script uses **ground truth as mock responses**. To evaluate your actual LightRAG:

### Step 1: Update `generate_rag_response()`

In `eval_rag_quality.py`, replace the mock implementation:

```python
async def generate_rag_response(self, question: str, context: str = None) -> Dict[str, str]:
    """Generate RAG response using your LightRAG system"""
    from lightrag import LightRAG

    rag = LightRAG(
        working_dir="./rag_storage",
        llm_model_func=your_llm_function
    )

    response = await rag.aquery(question)

    return {
        "answer": response,
        "context": "context_from_kg"  # If available
    }
```

### Step 2: Run Evaluation

```bash
python lightrag/evaluation/eval_rag_quality.py
```

---

## 📊 Interpreting Results

@@ -192,82 +151,10 @@ python lightrag/evaluation/eval_rag_quality.py

---

## 📈 Usage Examples

### Python API

```python
import asyncio
from lightrag.evaluation import RAGEvaluator

async def main():
    evaluator = RAGEvaluator()
    results = await evaluator.run()

    # Access results
    for result in results:
        print(f"Question: {result['question']}")
        print(f"RAGAS Score: {result['ragas_score']:.2%}")
        print(f"Metrics: {result['metrics']}")

asyncio.run(main())
```

### Custom Dataset

```python
evaluator = RAGEvaluator(test_dataset_path="custom_tests.json")
results = await evaluator.run()
```

### Batch Evaluation

```python
from pathlib import Path
import json

results_dir = Path("lightrag/evaluation/results")
results_dir.mkdir(exist_ok=True)

# Run multiple evaluations
for i in range(3):
    evaluator = RAGEvaluator()
    results = await evaluator.run()
```
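
To compare runs over time, the saved result files can be aggregated. A rough sketch - it assumes each `results_*.json` file holds a list of per-question records with a `ragas_score` field, mirroring the dicts returned by the Python API above; adjust the keys to the actual file layout:

```python
import json
from pathlib import Path
from statistics import mean

results_dir = Path("lightrag/evaluation/results")

# Assumed structure: a list of per-question dicts, each with "ragas_score".
scores = []
for results_file in sorted(results_dir.glob("results_*.json")):
    for entry in json.loads(results_file.read_text(encoding="utf-8")):
        scores.append(entry["ragas_score"])

if scores:
    print(f"{len(scores)} evaluated questions, average RAGAS score: {mean(scores):.2%}")
```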

---

## 🎯 Using Evaluation Results

**What the Metrics Tell You:**

1. ✅ **Quality Metrics**: Overall RAGAS score indicates system health
2. ✅ **Evaluation Framework**: Automated quality assessment with RAGAS
3. ✅ **Best Practices**: Offline evaluation pipeline for continuous improvement
4. ✅ **Production-Ready**: Metrics-driven system optimization

**Example Use Cases:**

- Track RAG quality over time as you update your documents
- Compare different retrieval modes (local, global, hybrid, mix)
- Measure impact of chunking strategy changes
- Validate system performance before deployment

---

## 🔗 Related Features

- **LangFuse Integration**: Real-time observability of production RAG calls
- **LightRAG**: Core RAG system with entity extraction and knowledge graphs
- **Metrics**: See `results/` for detailed evaluation metrics

---

## 📚 Resources

- [RAGAS Documentation](https://docs.ragas.io/)
- [RAGAS GitHub](https://github.com/explodinggradients/ragas)
- [LangFuse + RAGAS Guide](https://langfuse.com/guides/cookbook/evaluation_of_rag_with_ragas)

---

@@ -295,25 +182,22 @@ The evaluation uses your configured LLM (OpenAI by default). Ensure:

- Have sufficient API quota
- Network connection is stable

### Results showing 0 scores
### Evaluation requires a running LightRAG API

Current implementation uses ground truth as mock responses. Results will show perfect scores because the "generated answer" equals the ground truth.

**To use actual RAG results:**
1. Implement the `generate_rag_response()` method
2. Connect to your LightRAG instance
3. Run evaluation again
The evaluator queries a running LightRAG API server at `http://localhost:9621`. Make sure:
1. The LightRAG API server is running (`python lightrag/api/lightrag_server.py`)
2. Documents are indexed in your LightRAG instance
3. The API is accessible at the configured URL
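
A quick way to confirm points 1-3 before launching the evaluator - a sketch using `requests`; the `/health` and `/query` endpoints and the request body reflect the default LightRAG API server and are assumptions to adjust if your deployment differs:

```python
import requests

BASE_URL = "http://localhost:9621"

# 1) Is the API server up?
health = requests.get(f"{BASE_URL}/health", timeout=10)
health.raise_for_status()
print("Server is up:", health.json())

# 2) + 3) Can it answer a question covered by the indexed sample documents?
resp = requests.post(
    f"{BASE_URL}/query",
    json={"query": "What are the three main components required in a RAG system?", "mode": "hybrid"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```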

---

## 📝 Next Steps

1. ✅ Review test dataset in `sample_dataset.json`
2. ✅ Run `python lightrag/evaluation/eval_rag_quality.py`
3. ✅ Open the HTML report in browser
4. 🔄 Integrate with actual LightRAG system
5. 📊 Monitor metrics over time
6. 🎯 Use insights for optimization
1. Index documents into LightRAG (WebUI or API - see the sketch below)
2. Start the LightRAG API server
3. Run `python lightrag/evaluation/eval_rag_quality.py`
4. Review results (JSON/CSV) in the `results/` folder
5. Adjust entity extraction prompts or retrieval settings based on scores
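
For step 1, the sample documents can be pushed to a running server over HTTP. A sketch with `requests` - the `/documents/text` endpoint and its `text` field are assumptions based on the default LightRAG API server; use the WebUI or adapt the endpoint if yours differs:

```python
from pathlib import Path
import requests

BASE_URL = "http://localhost:9621"
docs_dir = Path("lightrag/evaluation/sample_documents")

# Send each numbered sample document (skips the folder's README.md).
for md_file in sorted(docs_dir.glob("0*.md")):
    payload = {"text": md_file.read_text(encoding="utf-8")}
    r = requests.post(f"{BASE_URL}/documents/text", json=payload, timeout=120)
    r.raise_for_status()
    print(f"Indexed {md_file.name}")
```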

---

@@ -1,44 +1,19 @@
{
  "test_cases": [
    {
      "question": "What is LightRAG and what problem does it solve?",
      "ground_truth": "LightRAG is a Simple and Fast Retrieval-Augmented Generation framework developed by HKUDS. It solves the problem of efficiently combining large language models with external knowledge retrieval to provide accurate, contextual responses while reducing hallucinations.",
      "context": "general_rag_knowledge"
      "question": "How does LightRAG solve the hallucination problem in large language models?",
      "ground_truth": "LightRAG solves the hallucination problem by combining large language models with external knowledge retrieval. The framework ensures accurate responses by grounding LLM outputs in actual documents. LightRAG provides contextual responses that reduce hallucinations significantly.",
      "context": "lightrag_overview"
    },
    {
      "question": "What are the main components of a RAG system?",
      "ground_truth": "A RAG system consists of three main components: 1) A retrieval system (vector database or search engine) to find relevant documents, 2) An embedding model to convert text into vector representations, and 3) A large language model (LLM) to generate responses based on retrieved context.",
      "question": "What are the three main components required in a RAG system?",
      "ground_truth": "A RAG system requires three main components: a retrieval system (vector database or search engine) to find relevant documents, an embedding model to convert text into vector representations for similarity search, and a large language model (LLM) to generate responses based on retrieved context.",
      "context": "rag_architecture"
    },
    {
      "question": "How does LightRAG improve upon traditional RAG approaches?",
      "ground_truth": "LightRAG improves upon traditional RAG by offering a simpler API, faster retrieval performance, better integration with various vector databases, and optimized prompting strategies. It focuses on ease of use while maintaining high quality results.",
      "context": "lightrag_features"
    },
    {
      "question": "What vector databases does LightRAG support?",
      "ground_truth": "LightRAG supports multiple vector databases including ChromaDB, Neo4j, Milvus, Qdrant, MongoDB Atlas Vector Search, and Redis. It also includes a built-in nano-vectordb for simple deployments.",
      "context": "supported_storage"
    },
    {
      "question": "What are the key metrics for evaluating RAG system quality?",
      "ground_truth": "Key RAG evaluation metrics include: 1) Faithfulness - whether answers are factually grounded in retrieved context, 2) Answer Relevance - how well answers address the question, 3) Context Recall - completeness of retrieval, and 4) Context Precision - quality and relevance of retrieved documents.",
      "context": "rag_evaluation"
    },
    {
      "question": "How can you deploy LightRAG in production?",
      "ground_truth": "LightRAG can be deployed in production using Docker containers, as a REST API server with FastAPI, or integrated directly into Python applications. It supports environment-based configuration, multiple LLM providers, and can scale horizontally.",
      "context": "deployment_options"
    },
    {
      "question": "What LLM providers does LightRAG support?",
      "ground_truth": "LightRAG supports multiple LLM providers including OpenAI (GPT-3.5, GPT-4), Anthropic Claude, Ollama for local models, Azure OpenAI, AWS Bedrock, and any OpenAI-compatible API endpoint.",
      "context": "llm_integration"
    },
    {
      "question": "What is the purpose of graph-based retrieval in RAG systems?",
      "ground_truth": "Graph-based retrieval in RAG systems enables relationship-aware context retrieval. It stores entities and their relationships as a knowledge graph, allowing the system to understand connections between concepts and retrieve more contextually relevant information beyond simple semantic similarity.",
      "context": "knowledge_graph_rag"
      "question": "How does LightRAG's retrieval performance compare to traditional RAG approaches?",
      "ground_truth": "LightRAG delivers faster retrieval performance than traditional RAG approaches. The framework optimizes document retrieval operations for speed, while traditional RAG systems often suffer from slow query response times. LightRAG achieves high quality results with improved performance.",
      "context": "lightrag_improvements"
    }
  ]
}

17 lightrag/evaluation/sample_documents/01_lightrag_overview.md Normal file

@@ -0,0 +1,17 @@
# LightRAG Framework Overview

## What is LightRAG?

**LightRAG** is a Simple and Fast Retrieval-Augmented Generation framework. LightRAG was developed by HKUDS (Hong Kong University Data Science Lab). The framework provides developers with tools to build RAG applications efficiently.

## Problem Statement

Large language models face several limitations. LLMs have a knowledge cutoff date that prevents them from accessing recent information. Large language models generate hallucinations when providing responses without factual grounding. LLMs lack domain-specific expertise in specialized fields.

## How LightRAG Solves These Problems

LightRAG solves the hallucination problem by combining large language models with external knowledge retrieval. The framework ensures accurate responses by grounding LLM outputs in actual documents. LightRAG provides contextual responses that reduce hallucinations significantly. The system enables efficient retrieval from external knowledge bases to supplement LLM capabilities.

## Core Benefits

LightRAG offers accuracy through document-grounded responses. The framework provides up-to-date information without model retraining. LightRAG enables domain expertise through specialized document collections. The system delivers cost-effectiveness by avoiding expensive model fine-tuning. LightRAG ensures transparency by showing source documents for each response.

21 lightrag/evaluation/sample_documents/02_rag_architecture.md Normal file

@@ -0,0 +1,21 @@
# RAG System Architecture

## Main Components of RAG Systems

A RAG system consists of three main components that work together to provide intelligent responses.

### Component 1: Retrieval System

The retrieval system is the first component of a RAG system. A retrieval system finds relevant documents from large document collections. Vector databases serve as the primary storage for the retrieval system. Search engines can also function as retrieval systems in RAG architectures.

### Component 2: Embedding Model

The embedding model is the second component of a RAG system. An embedding model converts text into vector representations for similarity search. The embedding model transforms documents and queries into numerical vectors. These vector representations enable semantic similarity matching between queries and documents.

### Component 3: Large Language Model

The large language model is the third component of a RAG system. An LLM generates responses based on retrieved context from documents. The large language model synthesizes information from multiple sources into coherent answers. LLMs provide natural language generation capabilities for the RAG system.

## How Components Work Together

The retrieval system fetches relevant documents for a user query. The embedding model enables similarity matching between query and documents. The LLM generates the final response using retrieved context. These three components collaborate to provide accurate, contextual responses.
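
The retrieve-then-generate flow described in this sample document can be sketched in a few lines. The embedding, retrieval, and generation functions below are deliberate toy placeholders (word-overlap scoring and a canned answer), not LightRAG's actual implementation:

```python
from collections import Counter

documents = [
    "LightRAG grounds LLM answers in retrieved documents to reduce hallucinations.",
    "A RAG system combines a retrieval system, an embedding model, and an LLM.",
]

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts stand in for a real embedding model.
    return Counter(text.lower().split())

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    # Toy retrieval system: rank documents by word overlap with the query.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: sum((embed(d) & q).values()), reverse=True)
    return ranked[:top_k]

def generate(query: str, context: list[str]) -> str:
    # Placeholder for the LLM: a real system would prompt a model with the context.
    return f"Answer to '{query}' based on: {' '.join(context)}"

question = "What are the components of a RAG system?"
print(generate(question, retrieve(question, documents)))
```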

25 lightrag/evaluation/sample_documents/03_lightrag_improvements.md Normal file

@@ -0,0 +1,25 @@
# LightRAG Improvements Over Traditional RAG

## Key Improvements

LightRAG improves upon traditional RAG approaches in several significant ways.

### Simpler API Design

LightRAG offers a simpler API compared to traditional RAG frameworks. The framework provides intuitive interfaces for developers. Traditional RAG systems often require complex configuration and setup. LightRAG focuses on ease of use while maintaining functionality.

### Faster Retrieval Performance

LightRAG delivers faster retrieval performance than traditional RAG approaches. The framework optimizes document retrieval operations for speed. Traditional RAG systems often suffer from slow query response times. LightRAG achieves high quality results with improved performance.

### Better Vector Database Integration

LightRAG provides better integration with various vector databases. The framework supports multiple vector database backends seamlessly. Traditional RAG approaches typically lock developers into specific database choices. LightRAG enables flexible storage backend selection.

### Optimized Prompting Strategies

LightRAG implements optimized prompting strategies for better results. The framework uses refined prompt templates for accurate responses. Traditional RAG systems often use generic prompting approaches. LightRAG balances simplicity with high quality output.

## Design Philosophy

LightRAG prioritizes ease of use without sacrificing quality. The framework combines speed with accuracy in retrieval operations. LightRAG maintains flexibility in database and model selection.

37 lightrag/evaluation/sample_documents/04_supported_databases.md Normal file

@@ -0,0 +1,37 @@
# LightRAG Vector Database Support

## Supported Vector Databases

LightRAG supports multiple vector databases for flexible deployment options.

### ChromaDB

ChromaDB is a vector database supported by LightRAG. ChromaDB provides simple deployment for development environments. The database offers efficient vector similarity search capabilities.

### Neo4j

Neo4j is a graph database supported by LightRAG. Neo4j enables graph-based knowledge representation alongside vector search. The database combines relationship modeling with vector capabilities.

### Milvus

Milvus is a vector database supported by LightRAG. Milvus provides high-performance vector search at scale. The database handles large-scale vector collections efficiently.

### Qdrant

Qdrant is a vector database supported by LightRAG. Qdrant offers fast similarity search with filtering capabilities. The database provides production-ready vector search infrastructure.

### MongoDB Atlas Vector Search

MongoDB Atlas Vector Search is supported by LightRAG. MongoDB Atlas combines document storage with vector search capabilities. The database enables unified data management for RAG applications.

### Redis

Redis is supported by LightRAG for vector search operations. Redis provides in-memory vector search with low latency. The database offers fast retrieval for real-time applications.

### Built-in Nano-VectorDB

LightRAG includes a built-in nano-vectordb for simple deployments. Nano-vectordb eliminates external database dependencies for small projects. The built-in database provides basic vector search functionality without additional setup.

## Database Selection Benefits

The multiple database support enables developers to choose appropriate storage backends. LightRAG adapts to different deployment scenarios from development to production. Users can select databases based on scale, performance, and infrastructure requirements.
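
How this backend flexibility might look in code: a sketch of selecting storage implementations when constructing `LightRAG`. The `vector_storage` and `graph_storage` parameters and the backend class names are assumptions based on common LightRAG configurations - check the storage classes and connection settings shipped with your LightRAG version:

```python
from lightrag import LightRAG

# Default setup: the built-in nano-vectordb, no external services needed.
# (Add the llm_model_func / embedding_func your LightRAG version requires.)
rag_dev = LightRAG(working_dir="./rag_storage_dev")

# Production-style setup: swap in external backends by name.
# Backend class names and required env vars/credentials are assumptions -
# the Milvus and Neo4j services must be running and configured separately.
rag_prod = LightRAG(
    working_dir="./rag_storage_prod",
    vector_storage="MilvusVectorDBStorage",  # vectors stored in Milvus
    graph_storage="Neo4JStorage",            # knowledge graph stored in Neo4j
)
```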

41 lightrag/evaluation/sample_documents/05_evaluation_and_deployment.md Normal file

@@ -0,0 +1,41 @@
# RAG Evaluation Metrics and Deployment

## Key RAG Evaluation Metrics

RAG system quality is measured through four key metrics.

### Faithfulness Metric

Faithfulness measures whether answers are factually grounded in retrieved context. The faithfulness metric detects hallucinations in LLM responses. High faithfulness scores indicate answers based on actual document content. The metric evaluates factual accuracy of generated responses.

### Answer Relevance Metric

Answer Relevance measures how well answers address the user question. The answer relevance metric evaluates response quality and appropriateness. High answer relevance scores show responses that directly answer user queries. The metric assesses the connection between questions and generated answers.

### Context Recall Metric

Context Recall measures completeness of retrieval from documents. The context recall metric evaluates whether all relevant information was retrieved. High context recall scores indicate comprehensive document retrieval. The metric assesses retrieval system effectiveness.

### Context Precision Metric

Context Precision measures quality and relevance of retrieved documents. The context precision metric evaluates retrieval accuracy without noise. High context precision scores show clean retrieval without irrelevant content. The metric measures retrieval system selectivity.

## LightRAG Deployment Options

LightRAG can be deployed in production through multiple approaches.

### Docker Container Deployment

Docker containers enable consistent LightRAG deployment across environments. Docker provides isolated runtime environments for the framework. Container deployment simplifies dependency management and scaling.

### REST API Server with FastAPI

FastAPI serves as the REST API framework for LightRAG deployment. The FastAPI server exposes LightRAG functionality through HTTP endpoints. REST API deployment enables client-server architecture for RAG applications.

### Direct Python Integration

Direct Python integration embeds LightRAG into Python applications. Python integration provides programmatic access to RAG capabilities. Direct integration supports custom application workflows and pipelines.

### Deployment Features

LightRAG supports environment-based configuration for different deployment scenarios. The framework integrates with multiple LLM providers for flexibility. LightRAG enables horizontal scaling for production workloads.
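
As a sketch of the FastAPI deployment option described above: a thin HTTP wrapper around a LightRAG instance. The initialization details are placeholders, and LightRAG also ships its own, more complete API server (`python lightrag/api/lightrag_server.py`), which is what the evaluator targets:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from lightrag import LightRAG, QueryParam

app = FastAPI(title="Minimal LightRAG query service")

# Placeholder initialization - supply the llm_model_func / embedding_func
# required by your LightRAG version and point working_dir at your index.
rag = LightRAG(working_dir="./rag_storage")

class QueryRequest(BaseModel):
    query: str
    mode: str = "hybrid"  # local, global, hybrid, or mix

@app.post("/query")
async def query(req: QueryRequest):
    answer = await rag.aquery(req.query, param=QueryParam(mode=req.mode))
    return {"response": answer}
```

Run it with something like `uvicorn app_module:app --port 9621` (the module name is hypothetical) so the evaluator's default URL still resolves.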

21 lightrag/evaluation/sample_documents/README.md Normal file

@@ -0,0 +1,21 @@
# Sample Documents for Evaluation

These markdown files correspond to test questions in `../sample_dataset.json`.

## Usage

1. **Index documents** into LightRAG (via WebUI, API, or Python - see the sketch below)
2. **Run evaluation**: `python lightrag/evaluation/eval_rag_quality.py`
3. **Expected results**: ~91-100% RAGAS score per question
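
A sketch of the "Python" route for step 1, using the core `LightRAG.insert()` API; the model and embedding setup is elided and depends on your configuration:

```python
from pathlib import Path
from lightrag import LightRAG

# Placeholder initialization - add the llm_model_func / embedding_func
# your LightRAG version requires before inserting documents.
rag = LightRAG(working_dir="./rag_storage")

docs_dir = Path("lightrag/evaluation/sample_documents")
for md_file in sorted(docs_dir.glob("0*.md")):  # skips this README
    rag.insert(md_file.read_text(encoding="utf-8"))
    print(f"Indexed {md_file.name}")
```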

## Files

- `01_lightrag_overview.md` - LightRAG framework and hallucination problem
- `02_rag_architecture.md` - RAG system components
- `03_lightrag_improvements.md` - LightRAG vs traditional RAG
- `04_supported_databases.md` - Vector database support
- `05_evaluation_and_deployment.md` - Metrics and deployment

## Note

Documents use clear entity-relationship patterns for LightRAG's default entity extraction prompts. For better results with your data, customize `lightrag/prompt.py`.