Add comprehensive documentation covering 6 modules:
- 01-API-LAYER: Authentication, routing, SSE streaming
- 02-SERVICE-LAYER: Dialog, Task, LLM service analysis
- 03-RAG-ENGINE: Hybrid search, embedding, reranking
- 04-AGENT-SYSTEM: Canvas engine, components, tools
- 05-DOCUMENT-PROCESSING: Task executor, PDF parsing
- 06-ALGORITHMS: BM25, fusion, RAPTOR
Total: 28 documentation files with code analysis, diagrams, and formulas.
RAGFlow Backend Architecture - Comprehensive Analysis
Overview
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine with deep document understanding. This document summarizes a detailed analysis of its backend architecture.
Version
- RAGFlow v0.22.1
- Analysis Date: 2025-01
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAGFlow BACKEND ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────┐
│ NGINX / Gateway │
└─────────────────┬───────────────┘
│
┌─────────────────▼───────────────┐
│ 01-API-LAYER │
│ Flask/Quart Blueprints │
│ - Authentication (JWT/OAuth) │
│ - Request Routing │
│ - SSE Streaming │
└─────────────────┬───────────────┘
│
┌─────────────────▼───────────────┐
│ 02-SERVICE-LAYER │
│ Business Logic │
│ - DialogService (Chat) │
│ - DocumentService │
│ - TaskService (Queue) │
│ - LLMBundle (Model Wrapper) │
└─────────────────┬───────────────┘
│
┌─────────────────────────────┼─────────────────────────────┐
│ │ │
┌───────▼───────┐ ┌───────────▼───────────┐ ┌───────────▼───────┐
│ 03-RAG-ENGINE │ │ 04-AGENT-SYSTEM │ │05-DOC-PROCESSING │
│ │ │ │ │ │
│ - Hybrid │ │ - Canvas Engine │ │ - PDF Parser │
│ Search │ │ - Components │ │ - OCR │
│ - Embedding │ │ - Tools │ │ - Layout │
│ - Reranking │ │ - ReAct Agent │ │ - TSR │
└───────────────┘ └───────────────────────┘ └───────────────────┘
│ │ │
└─────────────────────────────┼─────────────────────────────┘
│
┌─────────────────▼───────────────┐
│ 06-ALGORITHMS │
│ - BM25 Scoring │
│ - Vector Cosine Similarity │
│ - Hybrid Score Fusion │
│ - TF-IDF Weighting │
│ - RAPTOR │
│ - GraphRAG │
└─────────────────────────────────┘
│
┌─────────────────────────────┼─────────────────────────────┐
│ │ │
┌───────▼───────┐ ┌───────────▼───────────┐ ┌───────────▼───────┐
│ MySQL │ │ Elasticsearch/ │ │ MinIO │
│ (Metadata) │ │ Infinity (Vectors) │ │ (File Storage) │
└───────────────┘ └───────────────────────┘ └───────────────────┘
Module Summary
01-API-LAYER
API gateway that handles authentication, routing, and SSE streaming.
| File | Purpose |
|---|---|
| document_app_analysis.md | Document upload/processing API |
| conversation_app_analysis.md | Chat API with SSE |
| canvas_app_analysis.md | Agent workflow API |
| authentication_flow.md | JWT/OAuth authentication |
| request_lifecycle.md | Request processing pipeline |
Key Technologies:
- Flask/Quart (async ASGI)
- Blueprint routing
- JWT + API Token authentication
- SSE streaming
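SSE streaming, listed above, delivers a response as a sequence of `data:` frames, each terminated by a blank line. The following is a minimal, framework-independent sketch of that framing. The function names `sse_event` and `stream_answer` and the `answer`/`done` payload fields are hypothetical and chosen for illustration only; they are not taken from RAGFlow's code.

```python
import json
from typing import Iterable, Iterator


def sse_event(payload: dict) -> str:
    """Serialize one payload as a Server-Sent Events frame."""
    # An SSE frame is a "data: <text>" line followed by a blank line.
    return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"


def stream_answer(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one frame per partial answer, then a final frame marked done.

    The payload shape is an assumption made for this sketch.
    """
    answer = ""
    for token in tokens:
        answer += token
        yield sse_event({"answer": answer, "done": False})
    yield sse_event({"answer": answer, "done": True})
```

A web framework would return this generator as the response body with the `text/event-stream` content type, so the client can render the answer as it grows.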
02-SERVICE-LAYER
Business logic layer built on the service pattern.
| File | Purpose |
|---|---|
| dialog_service_analysis.md | RAG chat pipeline |
| task_service_analysis.md | Background task queue |
| llm_service_analysis.md | Abstraction over 60+ LLM providers |
Key Technologies:
- Peewee ORM
- Redis task queue
- LLMBundle wrapper
- Langfuse observability
03-RAG-ENGINE
Core RAG implementation with hybrid search.
| File | Purpose |
|---|---|
| hybrid_search_algorithm.md | Vector + BM25 fusion |
| embedding_generation.md | 30+ embedding models |
| rerank_algorithm.md | Cross-encoder reranking |
| chunking_strategies.md | Document chunking |
| query_processing.md | TF-IDF query weighting |
Key Algorithms:
- Hybrid Score: 95% Vector + 5% BM25
- Cosine similarity
- Cross-encoder reranking
- Token-based chunking
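The first two items in the list above can be sketched as a weighted sum of a cosine similarity and a keyword score. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, and the BM25 score is assumed to be already normalized to the range [0, 1] (for example by dividing by the highest score in the candidate set) so that it is comparable with cosine similarity.

```python
import math
from typing import Sequence

VECTOR_WEIGHT = 0.95  # the "95% Vector + 5% BM25" split listed above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def hybrid_score(vector_sim: float, bm25_norm: float,
                 vector_weight: float = VECTOR_WEIGHT) -> float:
    """Weighted sum of vector similarity and a normalized BM25 score.

    Assumption: bm25_norm is already scaled to [0, 1].
    """
    return vector_weight * vector_sim + (1.0 - vector_weight) * bm25_norm
```

With these weights, a chunk with vector similarity 0.5 and a perfect keyword match scores 0.95 × 0.5 + 0.05 × 1.0 = 0.525, so the keyword signal acts mainly as a tie-breaker.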
04-AGENT-SYSTEM
Agentic workflows built on a visual canvas.
| File | Purpose |
|---|---|
| canvas_execution_engine.md | Workflow orchestration |
| component_architecture.md | Component lifecycle |
| tool_integration.md | Tool framework |
Key Features:
- DSL-based workflows
- 15+ component types
- 10+ tool integrations
- ReAct agent pattern
05-DOCUMENT-PROCESSING
Document parsing pipeline.
| File | Purpose |
|---|---|
| task_executor_analysis.md | Async task processing |
| pdf_parsing.md | PDF parsing with OCR and layout analysis |
Key Technologies:
- PaddleOCR
- Detectron2 layout detection
- TableTransformer (TSR)
- XGBoost text merging
06-ALGORITHMS
Core algorithms and the math behind them.
| File | Purpose |
|---|---|
| bm25_scoring.md | BM25 ranking |
| hybrid_score_fusion.md | Score combination |
| raptor_algorithm.md | Hierarchical summarization |
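For orientation, BM25 ranking can be sketched with the textbook Okapi BM25 formula. The code below is an illustrative implementation, not the scoring code that Elasticsearch or RAGFlow runs: the function name is hypothetical, the defaults `k1 = 1.2` and `b = 0.75` are common choices assumed here, and documents are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # absent terms contribute nothing
            # Smoothed IDF: rarer terms weigh more, never negative.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Term frequency saturates through `k1`, and `b` controls how strongly document length is penalized.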
Tech Stack Summary
Backend Framework
Python 3.10+
├── Flask/Quart - Web framework
├── Peewee - ORM
├── Trio - Async concurrency
└── Celery-like - Task queue (Redis-based)
Data Stores
MySQL - Metadata, users, configs
Elasticsearch/Infinity - Vector search + BM25
Redis - Task queue, caching, sessions
MinIO - Object storage (documents, images)
ML/AI
LLM Providers (60+)
├── OpenAI, Azure, Claude, Gemini
├── Qwen, DeepSeek, Groq
├── Ollama (local)
└── LiteLLM (unified interface)
Embedding Models (30+)
├── OpenAI text-embedding-3
├── BGE, Jina, Cohere
└── HuggingFace TEI
Vision Models
├── PaddleOCR
├── Detectron2
└── TableTransformer
Search & Retrieval
Hybrid Search
├── BM25 (Elasticsearch native)
├── Vector (cosine similarity)
└── Fusion (weighted sum)
Reranking
├── Jina Reranker
├── Cohere Rerank
└── BGE Reranker
Key Flows
1. Document Upload Flow
Upload → MinIO → Task Queue → Parser → Chunking → Embedding → Elasticsearch
2. Chat/Query Flow
Query → TF-IDF Weight → Hybrid Search → Rerank → Context Building → LLM → SSE Stream
3. Agent Workflow Flow
User Input → Canvas Engine → Component Execution → Tool Calls → LLM → Output
Performance Metrics
| Operation | Typical Latency |
|---|---|
| Vector Search | < 100ms |
| BM25 Search | < 50ms |
| Reranking | 200-500ms |
| Total Retrieval | < 1s |
| Embedding (batch 16) | 1-5s |
| PDF Parsing (10 pages) | 30-60s |
Configuration Highlights
Search Config
{
"vector_similarity_weight": 0.95, # 95% vector
"similarity_threshold": 0.2, # Min similarity
"top_k": 1024, # Initial candidates
"top_n": 6, # Final results
}
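One plausible reading of how these search settings interact is sketched below: keep the best `top_k` candidates, drop those below `similarity_threshold`, and return the first `top_n`. The function name `select_chunks`, the `(chunk_id, score)` input shape, and the order of the steps are assumptions made for illustration; the real pipeline may apply them at different stages (for example, reranking between the two cuts).

```python
from typing import Dict, List, Tuple

SEARCH_CONFIG: Dict[str, float] = {
    "vector_similarity_weight": 0.95,  # 95% vector
    "similarity_threshold": 0.2,       # min similarity
    "top_k": 1024,                     # initial candidates
    "top_n": 6,                        # final results
}


def select_chunks(scored: List[Tuple[str, float]],
                  config: Dict[str, float] = SEARCH_CONFIG) -> List[Tuple[str, float]]:
    """Filter (chunk_id, score) pairs down to the final result list."""
    # Highest score first.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    # Keep only the initial candidate pool.
    candidates = ranked[: int(config["top_k"])]
    # Drop candidates that are not similar enough to be useful.
    kept = [item for item in candidates
            if item[1] >= config["similarity_threshold"]]
    # Return the final results handed to context building.
    return kept[: int(config["top_n"])]
```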
Chunking Config
{
"chunk_token_num": 512, # Tokens per chunk
"delimiter": "\n。;!?", # Split chars
"overlapped_percent": 0, # Overlap %
}
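The chunking settings above suggest a split-then-pack strategy: cut the text after each delimiter character, then pack the pieces into chunks of at most `chunk_token_num` tokens. The sketch below illustrates that idea only. The function names are hypothetical, overlap is not implemented (matching `overlapped_percent` of 0), and the whitespace token counter is a stand-in for a real tokenizer, which matters especially for languages written without spaces.

```python
import re
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter; a real system would use a model tokenizer."""
    return len(text.split())


def chunk_text(text: str,
               chunk_token_num: int = 512,
               delimiter: str = "\n。;!?",
               count_tokens: Callable[[str], int] = whitespace_token_count) -> List[str]:
    """Split after delimiter characters, then pack pieces into chunks."""
    # Zero-width split: each delimiter stays attached to its sentence.
    pattern = f"(?<=[{re.escape(delimiter)}])"
    pieces = [p for p in re.split(pattern, text) if p.strip()]
    chunks: List[str] = []
    current, current_tokens = "", 0
    for piece in pieces:
        n = count_tokens(piece)
        # Start a new chunk when adding this piece would exceed the budget.
        if current and current_tokens + n > chunk_token_num:
            chunks.append(current)
            current, current_tokens = "", 0
        current += piece
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```

A single piece longer than the budget still becomes its own chunk in this sketch; a production chunker would need a fallback split for that case.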
Agent Config
{
"max_rounds": 5, # Max tool rounds
"temperature": 0.7, # LLM temperature
"max_tokens": 2048, # Response limit
}
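The `max_rounds` setting bounds how many tool calls an agent may make before it must answer. A minimal sketch of such a bounded ReAct-style loop follows. Everything here is hypothetical scaffolding for illustration: `run_agent`, the `decide` callback (which stands in for an LLM call), and the tuple-based decision format are not part of RAGFlow's API.

```python
from typing import Callable, Dict, List, Tuple

AGENT_CONFIG = {"max_rounds": 5, "temperature": 0.7, "max_tokens": 2048}

# A decision is ("tool", tool_name, argument) or ("final", answer).
Decision = Tuple[str, ...]


def run_agent(question: str,
              decide: Callable[[str, List[str]], Decision],
              tools: Dict[str, Callable[[str], str]],
              max_rounds: int = AGENT_CONFIG["max_rounds"]) -> str:
    """Alternate between deciding and acting until an answer or the limit."""
    observations: List[str] = []
    for _ in range(max_rounds):
        # "Reason": the decide callback stands in for the LLM.
        decision = decide(question, observations)
        if decision[0] == "final":
            return decision[1]
        # "Act": call the chosen tool and record what it returned.
        _, tool_name, argument = decision
        observations.append(tools[tool_name](argument))
    # Round budget exhausted: fall back to the last observation, if any.
    return observations[-1] if observations else ""
```

The hard cap keeps a model that never settles on a final answer from looping indefinitely and consuming tokens.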
Directory Structure
personal_analyze/
├── 00-OVERVIEW.md # This file
├── 01-API-LAYER/
│ ├── README.md
│ ├── document_app_analysis.md
│ ├── conversation_app_analysis.md
│ ├── canvas_app_analysis.md
│ ├── authentication_flow.md
│ └── request_lifecycle.md
├── 02-SERVICE-LAYER/
│ ├── README.md
│ ├── dialog_service_analysis.md
│ ├── task_service_analysis.md
│ └── llm_service_analysis.md
├── 03-RAG-ENGINE/
│ ├── README.md
│ ├── hybrid_search_algorithm.md
│ ├── embedding_generation.md
│ ├── rerank_algorithm.md
│ ├── chunking_strategies.md
│ └── query_processing.md
├── 04-AGENT-SYSTEM/
│ ├── README.md
│ ├── canvas_execution_engine.md
│ ├── component_architecture.md
│ └── tool_integration.md
├── 05-DOCUMENT-PROCESSING/
│ ├── README.md
│ ├── task_executor_analysis.md
│ └── pdf_parsing.md
└── 06-ALGORITHMS/
├── README.md
├── bm25_scoring.md
├── hybrid_score_fusion.md
└── raptor_algorithm.md
Key Source Files
| Module | Key File | Description |
|---|---|---|
| API | /api/apps/dialog_app.py | Chat API endpoints |
| Service | /api/db/services/dialog_service.py | RAG chat logic |
| RAG | /rag/nlp/search.py | Hybrid search |
| Agent | /agent/canvas.py | Workflow engine |
| Parser | /deepdoc/parser/pdf_parser.py | PDF parsing |
| Algorithms | /rag/raptor.py | RAPTOR algorithm |
Conclusion
RAGFlow is a comprehensive RAG system with:
- Multi-provider LLM support (60+ providers)
- Advanced document understanding (OCR, layout, tables)
- Hybrid search (Vector + BM25)
- Agentic workflows (visual canvas)
- Production-ready (multi-tenant, scalable)
See the detailed files in each module for a deeper look at the implementation.