Response to @KevinHuSh's review question about mocks.
Added:
- Integration tests (10 tests) with real CheckpointService and database
- Documentation explaining unit tests vs integration tests
- Real-world resume scenario test
- Comments in unit tests explaining mock usage
Integration tests cover:
- Actual database operations
- Complete checkpoint lifecycle
- Resume from crash scenario
- Retry logic with real state
- Progress calculation with persistence
Unit tests (mocked) remain for:
- Fast CI/CD feedback (0.04s)
- Interface validation
- No database dependencies
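The two suites complement each other: the mocked unit tests give fast feedback on interfaces, while the integration tests verify behavior against a real database.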
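
To illustrate the distinction, here is a minimal, self-contained sketch of the two styles. It is not the code in this PR: `DemoCheckpointStore` and `process` are hypothetical stand-ins invented for the example, and a temporary SQLite file stands in for the real database.

```python
import os
import sqlite3
import tempfile
from unittest.mock import MagicMock


class DemoCheckpointStore:
    """Hypothetical stand-in for a checkpoint service (NOT the PR's CheckpointService)."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoint "
            "(task_id TEXT PRIMARY KEY, done INTEGER)"
        )

    def save(self, task_id, done):
        # Persist how many items have been completed for this task.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoint (task_id, done) VALUES (?, ?)",
            (task_id, done),
        )
        self.conn.commit()

    def load(self, task_id):
        row = self.conn.execute(
            "SELECT done FROM checkpoint WHERE task_id = ?", (task_id,)
        ).fetchone()
        return row[0] if row else 0


def process(store, task_id, items, fail_at=None):
    """Hypothetical worker: resume from the last checkpoint, optionally crash."""
    start = store.load(task_id)
    for index in range(start, len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated crash")
        store.save(task_id, index + 1)
    return start


def test_unit_with_mock():
    # Unit style: the store is mocked, so only the interface is exercised.
    store = MagicMock()
    store.load.return_value = 2
    assert process(store, "t1", ["a", "b", "c", "d"]) == 2
    assert store.save.call_count == 2
    store.save.assert_called_with("t1", 4)


def test_integration_resume_after_crash():
    # Integration style: state must survive a crash and a fresh connection.
    items = ["a", "b", "c", "d"]
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "checkpoints.db")

        conn = sqlite3.connect(db_path)
        try:
            process(DemoCheckpointStore(conn), "t1", items, fail_at=2)
        except RuntimeError:
            pass  # the simulated crash
        finally:
            conn.close()

        conn = sqlite3.connect(db_path)
        try:
            store = DemoCheckpointStore(conn)
            assert process(store, "t1", items) == 2  # resumed, did not restart
            assert store.load("t1") == 4
        finally:
            conn.close()
```

The mocked test runs without any database and checks only which calls are made. The integration test fails if checkpoints are not actually persisted, which a mock cannot detect.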
### What problem does this PR solve?
Feature: This PR implements a comprehensive RAG evaluation framework to
address issue #11656.
**Problem**: Developers using RAGFlow lack systematic ways to measure
RAG accuracy and quality. They cannot objectively answer:
1. Are RAG results truly accurate?
2. How should configurations be adjusted to improve quality?
3. How can RAG performance be maintained and improved over time?
**Solution**: This PR adds a complete evaluation system with:
- **Dataset & test case management** - Create ground truth datasets with
questions and expected answers
- **Automated evaluation** - Run RAG pipeline on test cases and compute
metrics
- **Comprehensive metrics** - Precision, recall, F1 score, MRR, hit rate
for retrieval quality
- **Smart recommendations** - Analyze results and suggest specific
configuration improvements (e.g., "increase top_k", "enable reranking")
- **20+ REST API endpoints** - Full CRUD operations for datasets, test
cases, and evaluation runs
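
For reviewers less familiar with the retrieval metrics listed above, the sketch below shows how they are conventionally computed for one query and then averaged. It uses the standard definitions and is not necessarily the implementation in this PR. The function names and the chunk-ID representation are assumptions made for the example, and retrieved IDs are assumed to be unique.

```python
from typing import Dict, List, Set


def retrieval_metrics(retrieved: List[str], relevant: Set[str]) -> Dict[str, float]:
    """Per-query metrics for a ranked list of retrieved chunk IDs (hypothetical helper)."""
    hits = [chunk_id in relevant for chunk_id in retrieved]
    num_hits = sum(hits)

    precision = num_hits / len(retrieved) if retrieved else 0.0
    recall = num_hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    # Reciprocal rank of the first relevant result; its mean over queries is MRR.
    reciprocal_rank = 0.0
    for rank, is_hit in enumerate(hits, start=1):
        if is_hit:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
        # 1.0 if at least one relevant chunk was retrieved; its mean is the hit rate.
        "hit": 1.0 if num_hits else 0.0,
    }


def average_metrics(per_query: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-query metrics over a test set (mean reciprocal rank gives MRR)."""
    if not per_query:
        return {}
    return {
        key: sum(m[key] for m in per_query) / len(per_query)
        for key in per_query[0]
    }


# Example: two relevant chunks out of four retrieved, first hit at rank 2.
example = retrieval_metrics(["c3", "c1", "c7", "c9"], {"c1", "c9", "c4"})
```

In this example, precision is 2/4, recall is 2/3, and the reciprocal rank is 1/2 because the first relevant chunk appears at rank 2.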
**Impact**: Enables developers to objectively measure RAG quality,
identify issues, and systematically improve their RAG systems through
data-driven configuration tuning.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)