ragflow/personal_analyze/00-OVERVIEW.md
Claude a6ee18476d
docs: Add detailed backend module analysis documentation
Add comprehensive documentation covering 6 modules:
- 01-API-LAYER: Authentication, routing, SSE streaming
- 02-SERVICE-LAYER: Dialog, Task, LLM service analysis
- 03-RAG-ENGINE: Hybrid search, embedding, reranking
- 04-AGENT-SYSTEM: Canvas engine, components, tools
- 05-DOCUMENT-PROCESSING: Task executor, PDF parsing
- 06-ALGORITHMS: BM25, fusion, RAPTOR

Total 28 documentation files with code analysis, diagrams, and formulas.
2025-11-26 11:10:54 +00:00

12 KiB

RAGFlow Backend Architecture - Comprehensive Analysis

Tong Quan

RAGFlow là open-source RAG (Retrieval-Augmented Generation) engine với deep document understanding. Document này tổng hợp phân tích chi tiết kiến trúc backend.

Version

  • RAGFlow v0.22.1
  • Analysis Date: 2025-01

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                          RAGFlow BACKEND ARCHITECTURE                        │
└─────────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────────────────┐
                    │         NGINX / Gateway          │
                    └─────────────────┬───────────────┘
                                      │
                    ┌─────────────────▼───────────────┐
                    │      01-API-LAYER               │
                    │  Flask/Quart Blueprints         │
                    │  - Authentication (JWT/OAuth)   │
                    │  - Request Routing              │
                    │  - SSE Streaming                │
                    └─────────────────┬───────────────┘
                                      │
                    ┌─────────────────▼───────────────┐
                    │      02-SERVICE-LAYER           │
                    │  Business Logic                 │
                    │  - DialogService (Chat)         │
                    │  - DocumentService              │
                    │  - TaskService (Queue)          │
                    │  - LLMBundle (Model Wrapper)    │
                    └─────────────────┬───────────────┘
                                      │
        ┌─────────────────────────────┼─────────────────────────────┐
        │                             │                             │
┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
│ 03-RAG-ENGINE │         │  04-AGENT-SYSTEM      │     │05-DOC-PROCESSING │
│               │         │                       │     │                   │
│ - Hybrid      │         │ - Canvas Engine       │     │ - PDF Parser      │
│   Search      │         │ - Components          │     │ - OCR             │
│ - Embedding   │         │ - Tools               │     │ - Layout          │
│ - Reranking   │         │ - ReAct Agent         │     │ - TSR             │
└───────────────┘         └───────────────────────┘     └───────────────────┘
        │                             │                             │
        └─────────────────────────────┼─────────────────────────────┘
                                      │
                    ┌─────────────────▼───────────────┐
                    │      06-ALGORITHMS              │
                    │  - BM25 Scoring                 │
                    │  - Vector Cosine Similarity     │
                    │  - Hybrid Score Fusion          │
                    │  - TF-IDF Weighting             │
                    │  - RAPTOR                       │
                    │  - GraphRAG                     │
                    └─────────────────────────────────┘
                                      │
        ┌─────────────────────────────┼─────────────────────────────┐
        │                             │                             │
┌───────▼───────┐         ┌───────────▼───────────┐     ┌───────────▼───────┐
│    MySQL      │         │   Elasticsearch/      │     │      MinIO        │
│  (Metadata)   │         │   Infinity (Vectors)  │     │   (File Storage)  │
└───────────────┘         └───────────────────────┘     └───────────────────┘

Module Summary

01-API-LAYER

API Gateway xử lý authentication, routing, và SSE streaming.

File Purpose
document_app_analysis.md Document upload/processing API
conversation_app_analysis.md Chat API với SSE
canvas_app_analysis.md Agent workflow API
authentication_flow.md JWT/OAuth authentication
request_lifecycle.md Request processing pipeline

Key Technologies:

  • Flask/Quart (async ASGI)
  • Blueprint routing
  • JWT + API Token authentication
  • SSE streaming

02-SERVICE-LAYER

Business logic layer với service pattern.

File Purpose
dialog_service_analysis.md RAG chat pipeline
task_service_analysis.md Background task queue
llm_service_analysis.md 60+ LLM provider abstraction

Key Technologies:

  • Peewee ORM
  • Redis task queue
  • LLMBundle wrapper
  • Langfuse observability

03-RAG-ENGINE

Core RAG implementation với hybrid search.

File Purpose
hybrid_search_algorithm.md Vector + BM25 fusion
embedding_generation.md 30+ embedding models
rerank_algorithm.md Cross-encoder reranking
chunking_strategies.md Document chunking
query_processing.md TF-IDF query weighting

Key Algorithms:

  • Hybrid Score: 95% Vector + 5% BM25
  • Cosine similarity
  • Cross-encoder reranking
  • Token-based chunking

04-AGENT-SYSTEM

Agentic workflows với visual canvas.

File Purpose
canvas_execution_engine.md Workflow orchestration
component_architecture.md Component lifecycle
tool_integration.md Tool framework

Key Features:

  • DSL-based workflows
  • 15+ component types
  • 10+ tool integrations
  • ReAct agent pattern

05-DOCUMENT-PROCESSING

Document parsing pipeline.

File Purpose
task_executor_analysis.md Async task processing
pdf_parsing.md PDF với OCR + layout

Key Technologies:

  • PaddleOCR
  • Detectron2 layout detection
  • TableTransformer (TSR)
  • XGBoost text merging

06-ALGORITHMS

Core algorithms và math.

File Purpose
bm25_scoring.md BM25 ranking
hybrid_score_fusion.md Score combination
raptor_algorithm.md Hierarchical summarization

Tech Stack Summary

Backend Framework

Python 3.10+
├── Flask/Quart - Web framework
├── Peewee - ORM
├── Trio - Async concurrency
└── Celery-like - Task queue (Redis-based)

Data Stores

MySQL - Metadata, users, configs
Elasticsearch/Infinity - Vector search + BM25
Redis - Task queue, caching, sessions
MinIO - Object storage (documents, images)

ML/AI

LLM Providers (60+)
├── OpenAI, Azure, Claude, Gemini
├── Qwen, DeepSeek, Groq
├── Ollama (local)
└── LiteLLM (unified interface)

Embedding Models (30+)
├── OpenAI text-embedding-3
├── BGE, Jina, Cohere
└── HuggingFace TEI

Vision Models
├── PaddleOCR
├── Detectron2
└── TableTransformer

Search & Retrieval

Hybrid Search
├── BM25 (Elasticsearch native)
├── Vector (cosine similarity)
└── Fusion (weighted sum)

Reranking
├── Jina Reranker
├── Cohere Rerank
└── BGE Reranker

Key Flows

1. Document Upload Flow

Upload → MinIO → Task Queue → Parser → Chunking → Embedding → Elasticsearch

2. Chat/Query Flow

Query → TF-IDF Weight → Hybrid Search → Rerank → Context Building → LLM → SSE Stream

3. Agent Workflow Flow

User Input → Canvas Engine → Component Execution → Tool Calls → LLM → Output

Performance Metrics

Operation Typical Latency
Vector Search < 100ms
BM25 Search < 50ms
Reranking 200-500ms
Total Retrieval < 1s
Embedding (batch 16) 1-5s
PDF Parsing (10 pages) 30-60s

Configuration Highlights

Search Config

{
    "vector_similarity_weight": 0.95,   # 95% vector
    "similarity_threshold": 0.2,        # Min similarity
    "top_k": 1024,                       # Initial candidates
    "top_n": 6,                          # Final results
}

Chunking Config

{
    "chunk_token_num": 512,             # Tokens per chunk
    "delimiter": "\n。;!?",          # Split chars
    "overlapped_percent": 0,            # Overlap %
}

Agent Config

{
    "max_rounds": 5,                    # Max tool rounds
    "temperature": 0.7,                 # LLM temperature
    "max_tokens": 2048,                 # Response limit
}

Directory Structure

personal_analyze/
├── 00-OVERVIEW.md                  # This file
├── 01-API-LAYER/
│   ├── README.md
│   ├── document_app_analysis.md
│   ├── conversation_app_analysis.md
│   ├── canvas_app_analysis.md
│   ├── authentication_flow.md
│   └── request_lifecycle.md
├── 02-SERVICE-LAYER/
│   ├── README.md
│   ├── dialog_service_analysis.md
│   ├── task_service_analysis.md
│   └── llm_service_analysis.md
├── 03-RAG-ENGINE/
│   ├── README.md
│   ├── hybrid_search_algorithm.md
│   ├── embedding_generation.md
│   ├── rerank_algorithm.md
│   ├── chunking_strategies.md
│   └── query_processing.md
├── 04-AGENT-SYSTEM/
│   ├── README.md
│   ├── canvas_execution_engine.md
│   ├── component_architecture.md
│   └── tool_integration.md
├── 05-DOCUMENT-PROCESSING/
│   ├── README.md
│   ├── task_executor_analysis.md
│   └── pdf_parsing.md
└── 06-ALGORITHMS/
    ├── README.md
    ├── bm25_scoring.md
    ├── hybrid_score_fusion.md
    └── raptor_algorithm.md

Key Source Files

Module Key File Description
API /api/apps/dialog_app.py Chat API endpoints
Service /api/db/services/dialog_service.py RAG chat logic
RAG /rag/nlp/search.py Hybrid search
Agent /agent/canvas.py Workflow engine
Parser /deepdoc/parser/pdf_parser.py PDF parsing
Algorithms /rag/raptor.py RAPTOR algorithm

Conclusion

RAGFlow là một comprehensive RAG system với:

  • Multi-provider LLM support (60+ providers)
  • Advanced document understanding (OCR, layout, tables)
  • Hybrid search (Vector + BM25)
  • Agentic workflows (visual canvas)
  • Production-ready (multi-tenant, scalable)

Tham khảo các file chi tiết trong từng module để hiểu sâu hơn về implementation.