ragflow/personal_analyze/00-OVERVIEW.md
Claude a6ee18476d
docs: Add detailed backend module analysis documentation
Add comprehensive documentation covering 6 modules:
- 01-API-LAYER: Authentication, routing, SSE streaming
- 02-SERVICE-LAYER: Dialog, Task, LLM service analysis
- 03-RAG-ENGINE: Hybrid search, embedding, reranking
- 04-AGENT-SYSTEM: Canvas engine, components, tools
- 05-DOCUMENT-PROCESSING: Task executor, PDF parsing
- 06-ALGORITHMS: BM25, fusion, RAPTOR

Total 28 documentation files with code analysis, diagrams, and formulas.
2025-11-26 11:10:54 +00:00


# RAGFlow Backend Architecture - Comprehensive Analysis
## Overview
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine with deep document understanding. This document summarizes a detailed analysis of its backend architecture.
## Version
- RAGFlow v0.22.1
- Analysis Date: 2025-11
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        RAGFlow BACKEND ARCHITECTURE                         │
└─────────────────────────────────────────────────────────────────────────────┘
                     ┌─────────────────────────────────┐
                     │         NGINX / Gateway         │
                     └─────────────────┬───────────────┘
                     ┌─────────────────▼───────────────┐
                     │          01-API-LAYER           │
                     │     Flask/Quart Blueprints      │
                     │  - Authentication (JWT/OAuth)   │
                     │  - Request Routing              │
                     │  - SSE Streaming                │
                     └─────────────────┬───────────────┘
                     ┌─────────────────▼───────────────┐
                     │        02-SERVICE-LAYER         │
                     │         Business Logic          │
                     │  - DialogService (Chat)         │
                     │  - DocumentService              │
                     │  - TaskService (Queue)          │
                     │  - LLMBundle (Model Wrapper)    │
                     └─────────────────┬───────────────┘
         ┌─────────────────────────────┼─────────────────────────────┐
         │                             │                             │
 ┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
 │ 03-RAG-ENGINE │         │    04-AGENT-SYSTEM    │     │ 05-DOC-PROCESSING │
 │               │         │                       │     │                   │
 │  - Hybrid     │         │  - Canvas Engine      │     │  - PDF Parser     │
 │    Search     │         │  - Components         │     │  - OCR            │
 │  - Embedding  │         │  - Tools              │     │  - Layout         │
 │  - Reranking  │         │  - ReAct Agent        │     │  - TSR            │
 └───────────────┘         └───────────────────────┘     └───────────────────┘
         │                             │                             │
         └─────────────────────────────┼─────────────────────────────┘
                     ┌─────────────────▼───────────────┐
                     │          06-ALGORITHMS          │
                     │  - BM25 Scoring                 │
                     │  - Vector Cosine Similarity     │
                     │  - Hybrid Score Fusion          │
                     │  - TF-IDF Weighting             │
                     │  - RAPTOR                       │
                     │  - GraphRAG                     │
                     └─────────────────────────────────┘
         ┌─────────────────────────────┼─────────────────────────────┐
         │                             │                             │
 ┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
 │     MySQL     │         │    Elasticsearch/     │     │       MinIO       │
 │  (Metadata)   │         │  Infinity (Vectors)   │     │  (File Storage)   │
 └───────────────┘         └───────────────────────┘     └───────────────────┘
```
## Module Summary
### 01-API-LAYER
API gateway that handles authentication, routing, and SSE streaming.
| File | Purpose |
|------|---------|
| `document_app_analysis.md` | Document upload/processing API |
| `conversation_app_analysis.md` | Chat API with SSE |
| `canvas_app_analysis.md` | Agent workflow API |
| `authentication_flow.md` | JWT/OAuth authentication |
| `request_lifecycle.md` | Request processing pipeline |
**Key Technologies:**
- Flask/Quart (async ASGI)
- Blueprint routing
- JWT + API Token authentication
- SSE streaming
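To make the SSE streaming item concrete, here is a minimal sketch of how answer fragments can be framed as Server-Sent Events. The function name `sse_events`, the `{"answer": ...}` payload shape, and the `[DONE]` end marker are illustrative assumptions, not RAGFlow's actual wire format.

```python
import json
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap answer fragments as SSE frames (hypothetical payload shape)."""
    for chunk in chunks:
        # Each SSE frame is a "data:" line terminated by a blank line.
        payload = json.dumps({"answer": chunk}, ensure_ascii=False)
        yield f"data: {payload}\n\n"
    # Assumed end-of-stream marker; a real API may signal completion differently.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for frame in sse_events(["Hel", "lo"]):
        print(frame, end="")
```

A web framework would return this generator as the response body with the `text/event-stream` content type.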
### 02-SERVICE-LAYER
Business logic layer built on the service pattern.
| File | Purpose |
|------|---------|
| `dialog_service_analysis.md` | RAG chat pipeline |
| `task_service_analysis.md` | Background task queue |
| `llm_service_analysis.md` | 60+ LLM provider abstraction |
**Key Technologies:**
- Peewee ORM
- Redis task queue
- LLMBundle wrapper
- Langfuse observability
### 03-RAG-ENGINE
Core RAG implementation with hybrid search.
| File | Purpose |
|------|---------|
| `hybrid_search_algorithm.md` | Vector + BM25 fusion |
| `embedding_generation.md` | 30+ embedding models |
| `rerank_algorithm.md` | Cross-encoder reranking |
| `chunking_strategies.md` | Document chunking |
| `query_processing.md` | TF-IDF query weighting |
**Key Algorithms:**
- Hybrid Score: 95% Vector + 5% BM25
- Cosine similarity
- Cross-encoder reranking
- Token-based chunking
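The 95/5 weighting above can be sketched as a weighted sum of two similarity scores. This is a minimal illustration that assumes the BM25 score has already been normalized to the range [0, 1]; the function names are hypothetical and do not mirror RAGFlow's code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def hybrid_score(vector_sim: float, bm25_sim: float,
                 vector_weight: float = 0.95) -> float:
    """Weighted sum: vector_weight * vector + (1 - vector_weight) * BM25."""
    # Assumption: bm25_sim is already normalized to [0, 1].
    return vector_weight * vector_sim + (1.0 - vector_weight) * bm25_sim


if __name__ == "__main__":
    sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
    print(hybrid_score(sim, 0.4))
```

With these weights, the keyword score mostly acts as a tie-breaker between chunks whose vector similarity is close.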
### 04-AGENT-SYSTEM
Agentic workflows with a visual canvas.
| File | Purpose |
|------|---------|
| `canvas_execution_engine.md` | Workflow orchestration |
| `component_architecture.md` | Component lifecycle |
| `tool_integration.md` | Tool framework |
**Key Features:**
- DSL-based workflows
- 15+ component types
- 10+ tool integrations
- ReAct agent pattern
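As a rough illustration of a DSL-based workflow, the sketch below walks a small component graph and passes each component's output to its downstream nodes. The DSL shape, the handler names, and `run_canvas` are simplified assumptions made for this example; the real canvas DSL is richer.

```python
from collections import deque

# Hypothetical, simplified DSL: each component names a handler and its downstream nodes.
DSL = {
    "begin": {"handler": "passthrough", "downstream": ["retrieve"]},
    "retrieve": {"handler": "retrieve", "downstream": ["answer"]},
    "answer": {"handler": "answer", "downstream": []},
}

# Stub handlers standing in for real components.
HANDLERS = {
    "passthrough": lambda value: value,
    "retrieve": lambda value: f"{value} + retrieved context",
    "answer": lambda value: f"answer based on ({value})",
}


def run_canvas(dsl, handlers, user_input, start="begin"):
    """Breadth-first execution; assumes the graph is acyclic."""
    queue = deque([(start, user_input)])
    outputs = {}
    while queue:
        name, value = queue.popleft()
        result = handlers[dsl[name]["handler"]](value)
        outputs[name] = result
        for child in dsl[name]["downstream"]:
            queue.append((child, result))
    return outputs


if __name__ == "__main__":
    print(run_canvas(DSL, HANDLERS, "hi")["answer"])
```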
### 05-DOCUMENT-PROCESSING
Document parsing pipeline.
| File | Purpose |
|------|---------|
| `task_executor_analysis.md` | Async task processing |
| `pdf_parsing.md` | PDF parsing with OCR + layout analysis |
**Key Technologies:**
- PaddleOCR
- Detectron2 layout detection
- TableTransformer (TSR)
- XGBoost text merging
### 06-ALGORITHMS
Core algorithms and math.
| File | Purpose |
|------|---------|
| `bm25_scoring.md` | BM25 ranking |
| `hybrid_score_fusion.md` | Score combination |
| `raptor_algorithm.md` | Hierarchical summarization |
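For orientation, BM25 ranking can be sketched in a few lines. This is a textbook Okapi BM25 with the common defaults `k1=1.2` and `b=0.75` and the Lucene-style IDF, operating on pre-tokenized documents; it is an illustrative assumption and not the scoring code used by RAGFlow or its search backend.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query_terms: Sequence[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes long documents.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [["rag", "search"], ["vector", "search", "search"], ["pdf", "parser"]]
    print(bm25_scores(["search"], corpus))
```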
## Tech Stack Summary
### Backend Framework
```
Python 3.10+
├── Flask/Quart - Web framework
├── Peewee - ORM
├── Trio - Async concurrency
└── Celery-like - Task queue (Redis-based)
```
### Data Stores
```
MySQL - Metadata, users, configs
Elasticsearch/Infinity - Vector search + BM25
Redis - Task queue, caching, sessions
MinIO - Object storage (documents, images)
```
### ML/AI
```
LLM Providers (60+)
├── OpenAI, Azure, Claude, Gemini
├── Qwen, DeepSeek, Groq
├── Ollama (local)
└── LiteLLM (unified interface)
Embedding Models (30+)
├── OpenAI text-embedding-3
├── BGE, Jina, Cohere
└── HuggingFace TEI
Vision Models
├── PaddleOCR
├── Detectron2
└── TableTransformer
```
### Search & Retrieval
```
Hybrid Search
├── BM25 (Elasticsearch native)
├── Vector (cosine similarity)
└── Fusion (weighted sum)
Reranking
├── Jina Reranker
├── Cohere Rerank
└── BGE Reranker
```
## Key Flows
### 1. Document Upload Flow
```
Upload → MinIO → Task Queue → Parser → Chunking → Embedding → Elasticsearch
```
### 2. Chat/Query Flow
```
Query → TF-IDF Weight → Hybrid Search → Rerank → Context Building → LLM → SSE Stream
```
### 3. Agent Workflow Flow
```
User Input → Canvas Engine → Component Execution → Tool Calls → LLM → Output
```
## Performance Metrics
| Operation | Typical Latency |
|-----------|-----------------|
| Vector Search | < 100ms |
| BM25 Search | < 50ms |
| Reranking | 200-500ms |
| Total Retrieval | < 1s |
| Embedding (batch 16) | 1-5s |
| PDF Parsing (10 pages) | 30-60s |
## Configuration Highlights
### Search Config
```python
{
"vector_similarity_weight": 0.95, # 95% vector
"similarity_threshold": 0.2, # Min similarity
"top_k": 1024, # Initial candidates
"top_n": 6, # Final results
}
```
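A minimal sketch of how these settings could be applied to candidates that already carry a fused score. The function name `select_chunks` and the `(chunk, score)` tuple shape are illustrative assumptions, not RAGFlow's actual API.

```python
from typing import List, Tuple


def select_chunks(scored_chunks: List[Tuple[str, float]],
                  similarity_threshold: float = 0.2,
                  top_k: int = 1024,
                  top_n: int = 6) -> List[Tuple[str, float]]:
    """Keep the best top_k candidates, drop weak ones, return top_n."""
    # Sort by score, highest first, and cap the candidate pool at top_k.
    candidates = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:top_k]
    # Discard anything below the minimum similarity.
    kept = [c for c in candidates if c[1] >= similarity_threshold]
    return kept[:top_n]


if __name__ == "__main__":
    print(select_chunks([("a", 0.9), ("b", 0.1), ("c", 0.5)]))
```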
### Chunking Config
```python
{
"chunk_token_num": 512, # Tokens per chunk
"delimiter": "\n。;!?", # Split chars
"overlapped_percent": 0, # Overlap %
}
```
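The sketch below shows one way `chunk_token_num` and `delimiter` could drive chunking: split on the delimiter characters, then greedily pack sentences until the token budget is reached. Token counting by whitespace is a stand-in assumption for a real tokenizer, `overlapped_percent` is ignored (treated as 0), and the function names are hypothetical.

```python
import re
from typing import Callable, List, Optional


def split_sentences(text: str, delimiter: str = "\n。;!?") -> List[str]:
    """Split after any delimiter character, keeping the delimiter attached."""
    pattern = "([" + re.escape(delimiter) + "])"
    parts = re.split(pattern, text)  # text, delim, text, delim, ..., text
    sentences = []
    for i in range(0, len(parts), 2):
        piece = parts[i] + (parts[i + 1] if i + 1 < len(parts) else "")
        if piece.strip():
            sentences.append(piece)
    return sentences


def chunk_text(text: str, chunk_token_num: int = 512,
               delimiter: str = "\n。;!?",
               count_tokens: Optional[Callable[[str], int]] = None) -> List[str]:
    """Greedily merge sentences into chunks of at most chunk_token_num tokens."""
    # Assumption: whitespace splitting approximates a real tokenizer.
    count = count_tokens or (lambda s: len(s.split()))
    chunks, current, current_tokens = [], "", 0
    for sentence in split_sentences(text, delimiter):
        n = count(sentence)
        if current and current_tokens + n > chunk_token_num:
            chunks.append(current)
            current, current_tokens = "", 0
        current += sentence
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_text("one two three! four five? six seven eight nine", chunk_token_num=4))
```

A single sentence longer than the budget is emitted as its own oversized chunk in this sketch rather than being split further.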
### Agent Config
```python
{
"max_rounds": 5, # Max tool rounds
"temperature": 0.7, # LLM temperature
"max_tokens": 2048, # Response limit
}
```
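To show how `max_rounds` bounds tool use, here is a minimal ReAct-style loop. The `llm` callable, its return shape (`{"tool": ..., "input": ...}` or `{"answer": ...}`), and the `tools` mapping are hypothetical stand-ins for illustration only.

```python
from typing import Any, Callable, Dict, List, Tuple

History = List[Tuple[str, str]]


def run_agent(question: str,
              llm: Callable[[History], Dict[str, Any]],
              tools: Dict[str, Callable[[Any], Any]],
              max_rounds: int = 5) -> str:
    """Alternate between model decisions and tool calls, up to max_rounds."""
    history: History = [("user", question)]
    for _ in range(max_rounds):
        action = llm(history)  # Assumed to return a tool request or a final answer.
        if "answer" in action:
            return action["answer"]
        observation = tools[action["tool"]](action["input"])
        history.append(("observation", str(observation)))
    return "Stopped: reached max_rounds without a final answer."


if __name__ == "__main__":
    script = iter([{"tool": "search", "input": "RAG"}, {"answer": "done"}])
    print(run_agent("What is RAG?", lambda h: next(script),
                    {"search": lambda q: f"results for {q}"}))
```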
## Directory Structure
```
personal_analyze/
├── 00-OVERVIEW.md # This file
├── 01-API-LAYER/
│   ├── README.md
│   ├── document_app_analysis.md
│   ├── conversation_app_analysis.md
│   ├── canvas_app_analysis.md
│   ├── authentication_flow.md
│   └── request_lifecycle.md
├── 02-SERVICE-LAYER/
│   ├── README.md
│   ├── dialog_service_analysis.md
│   ├── task_service_analysis.md
│   └── llm_service_analysis.md
├── 03-RAG-ENGINE/
│   ├── README.md
│   ├── hybrid_search_algorithm.md
│   ├── embedding_generation.md
│   ├── rerank_algorithm.md
│   ├── chunking_strategies.md
│   └── query_processing.md
├── 04-AGENT-SYSTEM/
│   ├── README.md
│   ├── canvas_execution_engine.md
│   ├── component_architecture.md
│   └── tool_integration.md
├── 05-DOCUMENT-PROCESSING/
│   ├── README.md
│   ├── task_executor_analysis.md
│   └── pdf_parsing.md
└── 06-ALGORITHMS/
    ├── README.md
    ├── bm25_scoring.md
    ├── hybrid_score_fusion.md
    └── raptor_algorithm.md
```
## Key Source Files
| Module | Key File | Description |
|--------|----------|-------------|
| API | `/api/apps/dialog_app.py` | Chat API endpoints |
| Service | `/api/db/services/dialog_service.py` | RAG chat logic |
| RAG | `/rag/nlp/search.py` | Hybrid search |
| Agent | `/agent/canvas.py` | Workflow engine |
| Parser | `/deepdoc/parser/pdf_parser.py` | PDF parsing |
| Algorithms | `/rag/raptor.py` | RAPTOR algorithm |
## Conclusion
RAGFlow is a comprehensive RAG system with:
- **Multi-provider LLM support** (60+ providers)
- **Advanced document understanding** (OCR, layout, tables)
- **Hybrid search** (Vector + BM25)
- **Agentic workflows** (visual canvas)
- **Production-ready** (multi-tenant, scalable)
See the detailed files in each module for a deeper understanding of the implementation.