ragflow/personal_analyze/00-OVERVIEW.md

# RAGFlow Backend Architecture - Comprehensive Analysis

## Tong Quan

RAGFlow là open-source RAG (Retrieval-Augmented Generation) engine với deep document understanding. Document này tổng hợp phân tích chi tiết kiến trúc backend.

## Version
- RAGFlow v0.22.1
- Analysis Date: 2025-01

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RAGFlow BACKEND ARCHITECTURE                        │
└─────────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────────────────┐
                    │         NGINX / Gateway          │
                    └─────────────────┬───────────────┘
                                      │
                    ┌─────────────────▼───────────────┐
                    │      01-API-LAYER               │
                    │  Flask/Quart Blueprints         │
                    │  - Authentication (JWT/OAuth)   │
                    │  - Request Routing              │
                    │  - SSE Streaming                │
                    └─────────────────┬───────────────┘
                                      │
                    ┌─────────────────▼───────────────┐
                    │      02-SERVICE-LAYER           │
                    │  Business Logic                 │
                    │  - DialogService (Chat)         │
                    │  - DocumentService              │
                    │  - TaskService (Queue)          │
                    │  - LLMBundle (Model Wrapper)    │
                    └─────────────────┬───────────────┘
                                      │
        ┌─────────────────────────────┼─────────────────────────────┐
        │                             │                             │
┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
│ 03-RAG-ENGINE │         │  04-AGENT-SYSTEM      │     │05-DOC-PROCESSING │
│               │         │                       │     │                   │
│ - Hybrid      │         │ - Canvas Engine       │     │ - PDF Parser      │
│   Search      │         │ - Components          │     │ - OCR             │
│ - Embedding   │         │ - Tools               │     │ - Layout          │
│ - Reranking   │         │ - ReAct Agent         │     │ - TSR             │
└───────────────┘         └───────────────────────┘     └───────────────────┘
        │                             │                             │
        └─────────────────────────────┼─────────────────────────────┘
                                      │
                    ┌─────────────────▼───────────────┐
                    │      06-ALGORITHMS              │
                    │  - BM25 Scoring                 │
                    │  - Vector Cosine Similarity     │
                    │  - Hybrid Score Fusion          │
                    │  - TF-IDF Weighting             │
                    │  - RAPTOR                       │
                    │  - GraphRAG                     │
                    └─────────────────────────────────┘
                                      │
        ┌─────────────────────────────┼─────────────────────────────┐
        │                             │                             │
┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
│    MySQL      │         │   Elasticsearch/      │     │      MinIO        │
│  (Metadata)   │         │   Infinity (Vectors)  │     │   (File Storage)  │
└───────────────┘         └───────────────────────┘     └───────────────────┘
```

## Module Summary

### 01-API-LAYER
API Gateway xử lý authentication, routing, và SSE streaming.

| File | Purpose |
|------|---------|
| `document_app_analysis.md` | Document upload/processing API |
| `conversation_app_analysis.md` | Chat API với SSE |
| `canvas_app_analysis.md` | Agent workflow API |
| `authentication_flow.md` | JWT/OAuth authentication |
| `request_lifecycle.md` | Request processing pipeline |

**Key Technologies:**
- Flask/Quart (async ASGI)
- Blueprint routing
- JWT + API Token authentication
- SSE streaming

### 02-SERVICE-LAYER
Business logic layer với service pattern.

| File | Purpose |
|------|---------|
| `dialog_service_analysis.md` | RAG chat pipeline |
| `task_service_analysis.md` | Background task queue |
| `llm_service_analysis.md` | 60+ LLM provider abstraction |

**Key Technologies:**
- Peewee ORM
- Redis task queue
- LLMBundle wrapper
- Langfuse observability

### 03-RAG-ENGINE
Core RAG implementation với hybrid search.

| File | Purpose |
|------|---------|
| `hybrid_search_algorithm.md` | Vector + BM25 fusion |
| `embedding_generation.md` | 30+ embedding models |
| `rerank_algorithm.md` | Cross-encoder reranking |
| `chunking_strategies.md` | Document chunking |
| `query_processing.md` | TF-IDF query weighting |

**Key Algorithms:**
- Hybrid Score: 95% Vector + 5% BM25
- Cosine similarity
- Cross-encoder reranking
- Token-based chunking

### 04-AGENT-SYSTEM
Agentic workflows với visual canvas.

| File | Purpose |
|------|---------|
| `canvas_execution_engine.md` | Workflow orchestration |
| `component_architecture.md` | Component lifecycle |
| `tool_integration.md` | Tool framework |

**Key Features:**
- DSL-based workflows
- 15+ component types
- 10+ tool integrations
- ReAct agent pattern

### 05-DOCUMENT-PROCESSING
Document parsing pipeline.

| File | Purpose |
|------|---------|
| `task_executor_analysis.md` | Async task processing |
| `pdf_parsing.md` | PDF với OCR + layout |

**Key Technologies:**
- PaddleOCR
- Detectron2 layout detection
- TableTransformer (TSR)
- XGBoost text merging

### 06-ALGORITHMS
Core algorithms và math.

| File | Purpose |
|------|---------|
| `bm25_scoring.md` | BM25 ranking |
| `hybrid_score_fusion.md` | Score combination |
| `raptor_algorithm.md` | Hierarchical summarization |

## Tech Stack Summary

### Backend Framework
```
Python 3.10+
├── Flask/Quart - Web framework
├── Peewee - ORM
├── Trio - Async concurrency
└── Celery-like - Task queue (Redis-based)
```

### Data Stores
```
MySQL - Metadata, users, configs
Elasticsearch/Infinity - Vector search + BM25
Redis - Task queue, caching, sessions
MinIO - Object storage (documents, images)
```

### ML/AI
```
LLM Providers (60+)
├── OpenAI, Azure, Claude, Gemini
├── Qwen, DeepSeek, Groq
├── Ollama (local)
└── LiteLLM (unified interface)

Embedding Models (30+)
├── OpenAI text-embedding-3
├── BGE, Jina, Cohere
└── HuggingFace TEI

Vision Models
├── PaddleOCR
├── Detectron2
└── TableTransformer
```

### Search & Retrieval
```
Hybrid Search
├── BM25 (Elasticsearch native)
├── Vector (cosine similarity)
└── Fusion (weighted sum)

Reranking
├── Jina Reranker
├── Cohere Rerank
└── BGE Reranker
```

## Key Flows

### 1. Document Upload Flow
```
Upload → MinIO → Task Queue → Parser → Chunking → Embedding → Elasticsearch
```

### 2. Chat/Query Flow
```
Query → TF-IDF Weight → Hybrid Search → Rerank → Context Building → LLM → SSE Stream
```

### 3. Agent Workflow Flow
```
User Input → Canvas Engine → Component Execution → Tool Calls → LLM → Output
```

## Performance Metrics

| Operation | Typical Latency |
|-----------|-----------------|
| Vector Search | < 100ms |
| BM25 Search | < 50ms |
| Reranking | 200-500ms |
| Total Retrieval | < 1s |
| Embedding (batch 16) | 1-5s |
| PDF Parsing (10 pages) | 30-60s |

## Configuration Highlights

### Search Config
```python
{
    "vector_similarity_weight": 0.95,   # 95% vector
    "similarity_threshold": 0.2,        # Min similarity
    "top_k": 1024,                       # Initial candidates
    "top_n": 6,                          # Final results
}
```

### Chunking Config
```python
{
    "chunk_token_num": 512,             # Tokens per chunk
    "delimiter": "\n。；！？",          # Split chars
    "overlapped_percent": 0,            # Overlap %
}
```

### Agent Config
```python
{
    "max_rounds": 5,                    # Max tool rounds
    "temperature": 0.7,                 # LLM temperature
    "max_tokens": 2048,                 # Response limit
}
```

## Directory Structure

```
personal_analyze/
├── 00-OVERVIEW.md                  # This file
├── 01-API-LAYER/
│   ├── README.md
│   ├── document_app_analysis.md
│   ├── conversation_app_analysis.md
│   ├── canvas_app_analysis.md
│   ├── authentication_flow.md
│   └── request_lifecycle.md
├── 02-SERVICE-LAYER/
│   ├── README.md
│   ├── dialog_service_analysis.md
│   ├── task_service_analysis.md
│   └── llm_service_analysis.md
├── 03-RAG-ENGINE/
│   ├── README.md
│   ├── hybrid_search_algorithm.md
│   ├── embedding_generation.md
│   ├── rerank_algorithm.md
│   ├── chunking_strategies.md
│   └── query_processing.md
├── 04-AGENT-SYSTEM/
│   ├── README.md
│   ├── canvas_execution_engine.md
│   ├── component_architecture.md
│   └── tool_integration.md
├── 05-DOCUMENT-PROCESSING/
│   ├── README.md
│   ├── task_executor_analysis.md
│   └── pdf_parsing.md
└── 06-ALGORITHMS/
    ├── README.md
    ├── bm25_scoring.md
    ├── hybrid_score_fusion.md
    └── raptor_algorithm.md
```

## Key Source Files

| Module | Key File | Description |
|--------|----------|-------------|
| API | `/api/apps/dialog_app.py` | Chat API endpoints |
| Service | `/api/db/services/dialog_service.py` | RAG chat logic |
| RAG | `/rag/nlp/search.py` | Hybrid search |
| Agent | `/agent/canvas.py` | Workflow engine |
| Parser | `/deepdoc/parser/pdf_parser.py` | PDF parsing |
| Algorithms | `/rag/raptor.py` | RAPTOR algorithm |

## Conclusion

RAGFlow là một comprehensive RAG system với:
- **Multi-provider LLM support** (60+ providers)
- **Advanced document understanding** (OCR, layout, tables)
- **Hybrid search** (Vector + BM25)
- **Agentic workflows** (visual canvas)
- **Production-ready** (multi-tenant, scalable)

Tham khảo các file chi tiết trong từng module để hiểu sâu hơn về implementation.