Add comprehensive documentation covering 6 modules:

- 01-API-LAYER: Authentication, routing, SSE streaming
- 02-SERVICE-LAYER: Dialog, Task, LLM service analysis
- 03-RAG-ENGINE: Hybrid search, embedding, reranking
- 04-AGENT-SYSTEM: Canvas engine, components, tools
- 05-DOCUMENT-PROCESSING: Task executor, PDF parsing
- 06-ALGORITHMS: BM25, fusion, RAPTOR

Total 28 documentation files with code analysis, diagrams, and formulas.

# RAGFlow Backend Architecture - Comprehensive Analysis

## Overview

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine with deep document understanding. This document summarizes a detailed analysis of its backend architecture.

## Version

- RAGFlow v0.22.1
- Analysis Date: 2025-01

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        RAGFlow BACKEND ARCHITECTURE                         │
└─────────────────────────────────────────────────────────────────────────────┘

                     ┌─────────────────────────────────┐
                     │         NGINX / Gateway         │
                     └─────────────────┬───────────────┘
                                       │
                     ┌─────────────────▼───────────────┐
                     │          01-API-LAYER           │
                     │     Flask/Quart Blueprints      │
                     │  - Authentication (JWT/OAuth)   │
                     │  - Request Routing              │
                     │  - SSE Streaming                │
                     └─────────────────┬───────────────┘
                                       │
                     ┌─────────────────▼───────────────┐
                     │        02-SERVICE-LAYER         │
                     │         Business Logic          │
                     │  - DialogService (Chat)         │
                     │  - DocumentService              │
                     │  - TaskService (Queue)          │
                     │  - LLMBundle (Model Wrapper)    │
                     └─────────────────┬───────────────┘
                                       │
         ┌─────────────────────────────┼─────────────────────────────┐
         │                             │                             │
 ┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
 │ 03-RAG-ENGINE │         │    04-AGENT-SYSTEM    │     │ 05-DOC-PROCESSING │
 │               │         │                       │     │                   │
 │ - Hybrid      │         │  - Canvas Engine      │     │ - PDF Parser      │
 │   Search      │         │  - Components         │     │ - OCR             │
 │ - Embedding   │         │  - Tools              │     │ - Layout          │
 │ - Reranking   │         │  - ReAct Agent        │     │ - TSR             │
 └───────────────┘         └───────────────────────┘     └───────────────────┘
         │                             │                             │
         └─────────────────────────────┼─────────────────────────────┘
                                       │
                     ┌─────────────────▼───────────────┐
                     │          06-ALGORITHMS          │
                     │  - BM25 Scoring                 │
                     │  - Vector Cosine Similarity     │
                     │  - Hybrid Score Fusion          │
                     │  - TF-IDF Weighting             │
                     │  - RAPTOR                       │
                     │  - GraphRAG                     │
                     └─────────────────────────────────┘
                                       │
         ┌─────────────────────────────┼─────────────────────────────┐
         │                             │                             │
 ┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
 │     MySQL     │         │    Elasticsearch/     │     │       MinIO       │
 │  (Metadata)   │         │  Infinity (Vectors)   │     │  (File Storage)   │
 └───────────────┘         └───────────────────────┘     └───────────────────┘
```

## Module Summary

### 01-API-LAYER
The API gateway handles authentication, routing, and SSE streaming.

| File | Purpose |
|------|---------|
| `document_app_analysis.md` | Document upload/processing API |
| `conversation_app_analysis.md` | Chat API with SSE |
| `canvas_app_analysis.md` | Agent workflow API |
| `authentication_flow.md` | JWT/OAuth authentication |
| `request_lifecycle.md` | Request processing pipeline |

**Key Technologies:**
- Flask/Quart (async ASGI)
- Blueprint routing
- JWT + API Token authentication
- SSE streaming

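
As a rough illustration of the SSE streaming listed above, the sketch below turns a sequence of answer fragments into Server-Sent Events frames, where each event is a `data:` line followed by a blank line. The function names `sse_frame` and `stream_answer`, the payload keys `answer` and `done`, and the final end-of-stream frame are assumptions made for this example; they are not taken from the RAGFlow source.

```python
import json
from typing import Iterable, Iterator


def sse_frame(payload: dict) -> str:
    """Serialize one payload as a Server-Sent Events frame."""
    # An SSE event consists of "data:" lines terminated by a blank line.
    return "data: " + json.dumps(payload, ensure_ascii=False) + "\n\n"


def stream_answer(fragments: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per fragment, each carrying the answer so far."""
    answer = ""
    for fragment in fragments:
        answer += fragment
        yield sse_frame({"answer": answer, "done": False})
    # Hypothetical end-of-stream frame so a client knows when to stop reading.
    yield sse_frame({"answer": answer, "done": True})


frames = list(stream_answer(["Hello", ", ", "world"]))
```

A web framework would return such a generator in a response with the `text/event-stream` content type, so the client can render the answer as it grows.
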
### 02-SERVICE-LAYER
Business logic layer built on the service pattern.

| File | Purpose |
|------|---------|
| `dialog_service_analysis.md` | RAG chat pipeline |
| `task_service_analysis.md` | Background task queue |
| `llm_service_analysis.md` | 60+ LLM provider abstraction |

**Key Technologies:**
- Peewee ORM
- Redis task queue
- LLMBundle wrapper
- Langfuse observability

### 03-RAG-ENGINE
Core RAG implementation with hybrid search.

| File | Purpose |
|------|---------|
| `hybrid_search_algorithm.md` | Vector + BM25 fusion |
| `embedding_generation.md` | 30+ embedding models |
| `rerank_algorithm.md` | Cross-encoder reranking |
| `chunking_strategies.md` | Document chunking |
| `query_processing.md` | TF-IDF query weighting |

**Key Algorithms:**
- Hybrid Score: 95% Vector + 5% BM25
- Cosine similarity
- Cross-encoder reranking
- Token-based chunking

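
To make the 95% / 5% split concrete, here is a minimal sketch of cosine similarity and a weighted-sum fusion. It assumes the BM25 score has already been normalized to the range [0, 1] so the two terms are comparable; the function names and that normalization step are assumptions for illustration, not the RAGFlow implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # A zero vector has no direction; treat it as dissimilar.
    return dot / (norm_a * norm_b)


def hybrid_score(vector_sim: float, bm25_norm: float,
                 vector_weight: float = 0.95) -> float:
    """Weighted sum: vector_weight on the vector term, the rest on BM25."""
    # Assumption: bm25_norm is a BM25 score already scaled into [0, 1].
    return vector_weight * vector_sim + (1.0 - vector_weight) * bm25_norm


score = hybrid_score(vector_sim=0.8, bm25_norm=0.4)  # 0.95*0.8 + 0.05*0.4
```

With these weights the keyword score mostly acts as a tie-breaker between chunks whose vector similarities are close.
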
### 04-AGENT-SYSTEM
Agentic workflows with a visual canvas.

| File | Purpose |
|------|---------|
| `canvas_execution_engine.md` | Workflow orchestration |
| `component_architecture.md` | Component lifecycle |
| `tool_integration.md` | Tool framework |

**Key Features:**
- DSL-based workflows
- 15+ component types
- 10+ tool integrations
- ReAct agent pattern

### 05-DOCUMENT-PROCESSING
Document parsing pipeline.

| File | Purpose |
|------|---------|
| `task_executor_analysis.md` | Async task processing |
| `pdf_parsing.md` | PDF parsing with OCR + layout analysis |

**Key Technologies:**
- PaddleOCR
- Detectron2 layout detection
- TableTransformer (TSR)
- XGBoost text merging

### 06-ALGORITHMS
Core algorithms and the underlying math.

| File | Purpose |
|------|---------|
| `bm25_scoring.md` | BM25 ranking |
| `hybrid_score_fusion.md` | Score combination |
| `raptor_algorithm.md` | Hierarchical summarization |

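
For readers unfamiliar with BM25, the following is a small self-contained sketch of the textbook Okapi BM25 formula over pre-tokenized documents. The parameter defaults `k1=1.2` and `b=0.75` and the smoothed IDF form are common conventions assumed here; in the architecture described in this document, BM25 scoring is delegated to the search engine, so this is only an illustration of the formula.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF, always positive.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [["rag", "engine", "search"], ["pdf", "parser"], ["search", "search", "rerank"]]
scores = bm25_scores(["search"], docs)
```

The third document ranks highest because it repeats the query term, while the second scores zero because it does not contain the term at all.
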
## Tech Stack Summary

### Backend Framework
```
Python 3.10+
├── Flask/Quart     - Web framework
├── Peewee          - ORM
├── Trio            - Async concurrency
└── Celery-like     - Task queue (Redis-based)
```

### Data Stores
```
MySQL                   - Metadata, users, configs
Elasticsearch/Infinity  - Vector search + BM25
Redis                   - Task queue, caching, sessions
MinIO                   - Object storage (documents, images)
```

### ML/AI
```
LLM Providers (60+)
├── OpenAI, Azure, Claude, Gemini
├── Qwen, DeepSeek, Groq
├── Ollama (local)
└── LiteLLM (unified interface)

Embedding Models (30+)
├── OpenAI text-embedding-3
├── BGE, Jina, Cohere
└── HuggingFace TEI

Vision Models
├── PaddleOCR
├── Detectron2
└── TableTransformer
```

### Search & Retrieval
```
Hybrid Search
├── BM25 (Elasticsearch native)
├── Vector (cosine similarity)
└── Fusion (weighted sum)

Reranking
├── Jina Reranker
├── Cohere Rerank
└── BGE Reranker
```

## Key Flows

### 1. Document Upload Flow
```
Upload → MinIO → Task Queue → Parser → Chunking → Embedding → Elasticsearch
```

### 2. Chat/Query Flow
```
Query → TF-IDF Weight → Hybrid Search → Rerank → Context Building → LLM → SSE Stream
```

### 3. Agent Workflow Flow
```
User Input → Canvas Engine → Component Execution → Tool Calls → LLM → Output
```

## Performance Metrics

| Operation | Typical Latency |
|-----------|-----------------|
| Vector Search | < 100ms |
| BM25 Search | < 50ms |
| Reranking | 200-500ms |
| Total Retrieval | < 1s |
| Embedding (batch 16) | 1-5s |
| PDF Parsing (10 pages) | 30-60s |

## Configuration Highlights

### Search Config
```python
{
    "vector_similarity_weight": 0.95,  # 95% vector
    "similarity_threshold": 0.2,       # Min similarity
    "top_k": 1024,                     # Initial candidates
    "top_n": 6,                        # Final results
}
```

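
One way to read these settings together: `top_k` caps the initial candidate pool, `similarity_threshold` discards weak matches, and `top_n` limits what reaches the LLM context. The sketch below applies them in that order to a list of `(chunk_id, score)` pairs. The helper `select_chunks` and the input shape are hypothetical and exist only to illustrate how the three values interact.

```python
from typing import Dict, List, Tuple

SEARCH_CONFIG: Dict[str, float] = {
    "similarity_threshold": 0.2,  # Min similarity
    "top_k": 1024,                # Initial candidates
    "top_n": 6,                   # Final results
}


def select_chunks(scored: List[Tuple[str, float]],
                  config: Dict[str, float] = SEARCH_CONFIG) -> List[Tuple[str, float]]:
    """Keep the best-scoring chunks allowed by the search settings."""
    # Rank by score, highest first, and cap the candidate pool at top_k.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    candidates = ranked[: int(config["top_k"])]
    # Drop candidates below the minimum similarity.
    kept = [item for item in candidates if item[1] >= config["similarity_threshold"]]
    # Return at most top_n final results.
    return kept[: int(config["top_n"])]


scored = [("c%d" % i, i / 10) for i in range(10)]  # scores 0.0 .. 0.9
selected = select_chunks(scored)
```
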
### Chunking Config
```python
{
    "chunk_token_num": 512,    # Tokens per chunk
    "delimiter": "\n。;!?",    # Split chars
    "overlapped_percent": 0,   # Overlap %
}
```

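
The sketch below shows one plausible reading of these settings: split the text after any delimiter character, then pack the resulting sentences greedily into chunks of at most `chunk_token_num` tokens, with no overlap (matching `overlapped_percent` of 0). Counting whitespace-separated words as tokens is an assumption made to keep the example self-contained; a real pipeline would use the model tokenizer, and `chunk_text` is a hypothetical helper.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def chunk_text(text: str, chunk_token_num: int = 512,
               delimiter: str = "\n。;!?") -> List[str]:
    """Split on delimiter characters, then pack sentences up to the budget."""
    # The capturing group keeps each delimiter so it can be re-attached.
    parts = re.split("([" + re.escape(delimiter) + "])", text)
    sentences = ["".join(pair) for pair in zip(parts[0::2], parts[1::2] + [""])]

    chunks: List[str] = []
    current, current_tokens = "", 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if n == 0:
            continue  # Skip pieces that are only whitespace.
        # Start a new chunk when adding this sentence would exceed the budget.
        # A single sentence longer than the budget becomes its own chunk.
        if current and current_tokens + n > chunk_token_num:
            chunks.append(current.strip())
            current, current_tokens = "", 0
        current += sentence
        current_tokens += n
    if current.strip():
        chunks.append(current.strip())
    return chunks


small = chunk_text("a b c! d e f? g h", chunk_token_num=4)
```
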
### Agent Config
```python
{
    "max_rounds": 5,       # Max tool rounds
    "temperature": 0.7,    # LLM temperature
    "max_tokens": 2048,    # Response limit
}
```

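
To illustrate what `max_rounds` bounds, here is a minimal sketch of a ReAct-style loop: on each round a model step either requests a tool call or returns a final answer, and the loop gives up once the round budget is spent. Everything here is hypothetical (the `run_agent` signature, the tuple protocol for a step, the scripted stand-in for the LLM); `temperature` and `max_tokens` would be passed to the real model call and are not modeled.

```python
from typing import Callable, Dict, List, Tuple

# A step is ("tool", tool_name, argument) or ("final", answer, "").
Step = Tuple[str, str, str]


def run_agent(question: str,
              llm_step: Callable[[str, List[str]], Step],
              tools: Dict[str, Callable[[str], str]],
              max_rounds: int = 5) -> str:
    """Alternate model steps and tool calls for at most max_rounds rounds."""
    observations: List[str] = []
    for _ in range(max_rounds):
        kind, name_or_answer, argument = llm_step(question, observations)
        if kind == "final":
            return name_or_answer
        tool = tools.get(name_or_answer)
        if tool is None:
            observations.append("unknown tool: " + name_or_answer)
        else:
            observations.append(tool(argument))
    # Round budget exhausted without a final answer.
    return "Stopped after %d tool rounds without a final answer." % max_rounds


def multiply(expression: str) -> str:
    """Toy tool: evaluate 'a*b' for two integers."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


def scripted_llm(question: str, observations: List[str]) -> Step:
    """Stand-in for an LLM: call the tool once, then answer."""
    if not observations:
        return ("tool", "calculator", "6*7")
    return ("final", "The answer is " + observations[-1], "")


result = run_agent("What is 6*7?", scripted_llm, {"calculator": multiply})
```
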
## Directory Structure

```
personal_analyze/
├── 00-OVERVIEW.md                  # This file
├── 01-API-LAYER/
│   ├── README.md
│   ├── document_app_analysis.md
│   ├── conversation_app_analysis.md
│   ├── canvas_app_analysis.md
│   ├── authentication_flow.md
│   └── request_lifecycle.md
├── 02-SERVICE-LAYER/
│   ├── README.md
│   ├── dialog_service_analysis.md
│   ├── task_service_analysis.md
│   └── llm_service_analysis.md
├── 03-RAG-ENGINE/
│   ├── README.md
│   ├── hybrid_search_algorithm.md
│   ├── embedding_generation.md
│   ├── rerank_algorithm.md
│   ├── chunking_strategies.md
│   └── query_processing.md
├── 04-AGENT-SYSTEM/
│   ├── README.md
│   ├── canvas_execution_engine.md
│   ├── component_architecture.md
│   └── tool_integration.md
├── 05-DOCUMENT-PROCESSING/
│   ├── README.md
│   ├── task_executor_analysis.md
│   └── pdf_parsing.md
└── 06-ALGORITHMS/
    ├── README.md
    ├── bm25_scoring.md
    ├── hybrid_score_fusion.md
    └── raptor_algorithm.md
```

## Key Source Files

| Module | Key File | Description |
|--------|----------|-------------|
| API | `/api/apps/dialog_app.py` | Chat API endpoints |
| Service | `/api/db/services/dialog_service.py` | RAG chat logic |
| RAG | `/rag/nlp/search.py` | Hybrid search |
| Agent | `/agent/canvas.py` | Workflow engine |
| Parser | `/deepdoc/parser/pdf_parser.py` | PDF parsing |
| Algorithms | `/rag/raptor.py` | RAPTOR algorithm |

## Conclusion

RAGFlow is a comprehensive RAG system with:
- **Multi-provider LLM support** (60+ providers)
- **Advanced document understanding** (OCR, layout, tables)
- **Hybrid search** (Vector + BM25)
- **Agentic workflows** (visual canvas)
- **Production-ready** (multi-tenant, scalable)

See the detailed files in each module for a deeper understanding of the implementation.