docs: Add comprehensive RAGFlow analysis documentation

- Add directory structure analysis (01_directory_structure.md)
- Add system architecture with diagrams (02_system_architecture.md)
- Add sequence diagrams for main flows (03_sequence_diagrams.md)
- Add detailed modules analysis (04_modules_analysis.md)
- Add tech stack documentation (05_tech_stack.md)
- Add source code analysis (06_source_code_analysis.md)
- Add README summary for personal_analyze folder

This documentation provides:
- Complete codebase structure overview
- System architecture diagrams (ASCII art)
- Sequence diagrams for authentication, RAG, chat, agent flows
- Detailed analysis of API, RAG, DeepDoc, Agent, GraphRAG modules
- Full tech stack with 150+ dependencies analyzed
- Source code patterns and best practices analysis
This commit is contained in:
Claude 2025-11-26 10:20:05 +00:00
parent 2fd5ac1031
commit c7cecf9a1f
No known key found for this signature in database
7 changed files with 4841 additions and 0 deletions

View file

@ -0,0 +1,348 @@
# RAGFlow - Cấu Trúc Thư Mục
## Tổng Quan
RAGFlow (v0.22.1) là một RAG (Retrieval-Augmented Generation) engine mã nguồn mở dựa trên deep document understanding. Dự án được xây dựng với kiến trúc full-stack bao gồm Python backend và React/TypeScript frontend.
## Cấu Trúc Thư Mục Chi Tiết
```
ragflow/
├── api/ # [BACKEND] Flask API Server
│ ├── ragflow_server.py # Entry point chính
│ ├── settings.py # Cấu hình server
│ ├── constants.py # Hằng số API (API_VERSION = "v1")
│ ├── validation.py # Request validation
│ │
│ ├── apps/ # Flask Blueprints - API endpoints
│ │ ├── kb_app.py # Knowledge Base management
│ │ ├── document_app.py # Document processing
│ │ ├── dialog_app.py # Chat/Dialog handling
│ │ ├── canvas_app.py # Agent workflow canvas
│ │ ├── file_app.py # File upload/management
│ │ ├── chunk_app.py # Document chunking
│ │ ├── conversation_app.py # Conversation management
│ │ ├── search_app.py # Search functionality
│ │ ├── system_app.py # System configuration
│ │ ├── llm_app.py # LLM model management
│ │ ├── connector_app.py # Data source connectors
│ │ ├── mcp_server_app.py # MCP server integration
│ │ ├── langfuse_app.py # Langfuse observability
│ │ ├── api_app.py # API key management
│ │ ├── plugin_app.py # Plugin management
│ │ ├── tenant_app.py # Multi-tenancy
│ │ ├── user_app.py # User management
│ │ │
│ │ ├── auth/ # Authentication modules
│ │ │ ├── oauth.py # OAuth base
│ │ │ ├── github.py # GitHub OAuth
│ │ │ └── oidc.py # OpenID Connect
│ │ │
│ │ └── sdk/ # SDK REST API endpoints
│ │ ├── dataset.py # Dataset API
│ │ ├── doc.py # Document API
│ │ ├── chat.py # Chat API
│ │ ├── session.py # Session API
│ │ ├── files.py # File API
│ │ ├── agents.py # Agent API
│ │ └── dify_retrieval.py # Dify integration
│ │
│ ├── db/ # Database layer
│ │ ├── db_models.py # SQLAlchemy/Peewee models (54KB)
│ │ ├── db_utils.py # Database utilities
│ │ ├── init_data.py # Initial data seeding
│ │ ├── runtime_config.py # Runtime configuration
│ │ │
│ │ ├── services/ # Business logic services
│ │ │ ├── user_service.py # User operations
│ │ │ ├── dialog_service.py # Dialog logic (37KB)
│ │ │ ├── document_service.py # Document processing (39KB)
│ │ │ ├── file_service.py # File handling (22KB)
│ │ │ ├── knowledgebase_service.py # KB management (21KB)
│ │ │ ├── task_service.py # Task queue (20KB)
│ │ │ ├── canvas_service.py # Canvas logic (12KB)
│ │ │ ├── conversation_service.py # Conversation handling
│ │ │ ├── connector_service.py # Connector management
│ │ │ ├── llm_service.py # LLM operations
│ │ │ ├── search_service.py # Search operations
│ │ │ └── api_service.py # API token service
│ │ │
│ │ └── joint_services/ # Cross-service operations
│ │
│ └── utils/ # API utilities
│ ├── api_utils.py # API helpers
│ ├── file_utils.py # File utilities
│ ├── crypt.py # Encryption
│ └── log_utils.py # Logging
├── rag/ # [CORE] RAG Processing Engine
│ ├── settings.py # RAG configuration
│ ├── raptor.py # RAPTOR algorithm
│ ├── benchmark.py # Performance testing
│ │
│ ├── llm/ # LLM Model Abstractions
│ │ ├── chat_model.py # Chat LLM interface
│ │ ├── embedding_model.py # Embedding models
│ │ ├── rerank_model.py # Reranking models
│ │ ├── cv_model.py # Computer vision
│ │ ├── tts_model.py # Text-to-speech
│ │ └── sequence2txt_model.py # Sequence to text
│ │
│ ├── flow/ # RAG Pipeline
│ │ ├── pipeline.py # Main pipeline
│ │ ├── file.py # File processing
│ │ │
│ │ ├── parser/ # Document parsing
│ │ │ ├── parser.py
│ │ │ └── schema.py
│ │ │
│ │ ├── extractor/ # Information extraction
│ │ │ ├── extractor.py
│ │ │ └── schema.py
│ │ │
│ │ ├── tokenizer/ # Text tokenization
│ │ │ ├── tokenizer.py
│ │ │ └── schema.py
│ │ │
│ │ ├── splitter/ # Document chunking
│ │ │ ├── splitter.py
│ │ │ └── schema.py
│ │ │
│ │ └── hierarchical_merger/ # Hierarchical merging
│ │ ├── hierarchical_merger.py
│ │ └── schema.py
│ │
│ ├── app/ # RAG application logic
│ ├── nlp/ # NLP utilities
│ ├── utils/ # RAG utilities
│ └── prompts/ # LLM prompt templates
├── deepdoc/ # [DOCUMENT] Deep Document Understanding
│ ├── parser/ # Multi-format parsers
│ │ ├── pdf_parser.py # PDF with layout analysis
│ │ ├── docx_parser.py # Word documents
│ │ ├── ppt_parser.py # PowerPoint
│ │ ├── excel_parser.py # Excel spreadsheets
│ │ ├── html_parser.py # HTML pages
│ │ ├── markdown_parser.py # Markdown files
│ │ ├── json_parser.py # JSON data
│ │ ├── txt_parser.py # Plain text
│ │ ├── figure_parser.py # Image/figure extraction
│ │ │
│ │ └── resume/ # Resume parsing
│ │ ├── step_one.py
│ │ └── step_two.py
│ │
│ └── vision/ # Computer vision modules
├── agent/ # [AGENT] Agentic Workflow System
│ ├── canvas.py # Canvas orchestration (25KB)
│ ├── settings.py # Agent configuration
│ │
│ ├── component/ # Workflow components
│ │ ├── begin.py # Workflow start
│ │ ├── llm.py # LLM invocation
│ │ ├── agent_with_tools.py # Agent with tools
│ │ ├── retrieval.py # Document retrieval
│ │ ├── categorize.py # Message categorization
│ │ ├── message.py # Message handling
│ │ ├── webhook.py # Webhook triggers
│ │ ├── iteration.py # Loop iteration
│ │ └── variable_assigner.py # Variable assignment
│ │
│ ├── tools/ # External tool integrations
│ │ ├── tavily.py # Web search
│ │ ├── arxiv.py # Academic papers
│ │ ├── github.py # GitHub API
│ │ ├── google.py # Google Search
│ │ ├── wikipedia.py # Wikipedia
│ │ ├── email.py # Email sending
│ │ ├── code_exec.py # Code execution
│ │ └── yahoofinance.py # Financial data
│ │
│ └── templates/ # Pre-built workflows
├── graphrag/ # [GRAPH] Knowledge Graph RAG
│ ├── entity_resolution.py # Entity linking (12KB)
│ ├── search.py # Graph search (14KB)
│ ├── utils.py # Graph utilities (23KB)
│ ├── general/ # General graph operations
│ └── light/ # Lightweight implementations
├── web/ # [FRONTEND] React/TypeScript
│ ├── package.json # NPM dependencies (172 packages)
│ ├── .umirc.ts # UmiJS configuration
│ ├── tailwind.config.js # Tailwind CSS config
│ │
│ └── src/
│ ├── pages/ # UmiJS page routes
│ │ ├── admin/ # Admin dashboard
│ │ ├── dataset/ # Knowledge base management
│ │ ├── datasets/ # Datasets list
│ │ ├── knowledge/ # Knowledge management
│ │ ├── next-chats/ # Chat interface
│ │ ├── next-searches/ # Search interface
│ │ ├── document-viewer/ # Document preview
│ │ ├── login/ # Authentication
│ │ └── register/ # User registration
│ │
│ ├── components/ # React components
│ │ ├── file-upload-modal/
│ │ ├── pdf-drawer/
│ │ ├── prompt-editor/
│ │ ├── document-preview/
│ │ └── ui/ # Shadcn/UI components
│ │
│ ├── services/ # API client services
│ ├── hooks/ # React hooks
│ ├── interfaces/ # TypeScript interfaces
│ ├── utils/ # Utility functions
│ ├── constants/ # Constants
│ └── locales/ # i18n translations
├── common/ # [SHARED] Common Utilities
│ ├── settings.py # Main configuration (11KB)
│ ├── config_utils.py # Config utilities
│ ├── connection_utils.py # Database connections
│ ├── constants.py # Global constants
│ ├── exceptions.py # Exception definitions
│ │
│ ├── Utilities:
│ │ ├── log_utils.py # Logging setup
│ │ ├── file_utils.py # File operations
│ │ ├── string_utils.py # String utilities
│ │ ├── token_utils.py # Token operations
│ │ └── time_utils.py # Time utilities
│ │
│ └── data_source/ # Data source connectors
│ ├── confluence_connector.py (81KB)
│ ├── notion_connector.py (25KB)
│ ├── slack_connector.py (22KB)
│ ├── gmail_connector.py
│ ├── discord_connector.py
│ ├── sharepoint_connector.py
│ ├── dropbox_connector.py
│ └── google_drive/
├── sdk/ # [SDK] Python Client Library
│ └── python/
│ └── ragflow_sdk/ # SDK implementation
├── mcp/ # [MCP] Model Context Protocol
│ ├── server/ # MCP server
│ │ └── server.py
│ └── client/ # MCP client
│ └── client.py
├── admin/ # [ADMIN] Admin Interface
│ ├── server/ # Admin backend
│ └── client/ # Admin frontend
├── plugin/ # [PLUGIN] Plugin System
│ ├── plugin_manager.py # Plugin management
│ ├── llm_tool_plugin.py # LLM tool plugins
│ └── embedded_plugins/ # Built-in plugins
├── docker/ # [DEPLOYMENT] Docker Configuration
│ ├── docker-compose.yml # Main compose file
│ ├── docker-compose-base.yml # Base services
│ ├── .env # Environment variables
│ ├── entrypoint.sh # Container entry
│ ├── service_conf.yaml.template # Service config
│ ├── nginx/ # Nginx configuration
│ │ └── nginx.conf
│ └── init.sql # Database init
├── conf/ # [CONFIG] Configuration Files
│ ├── llm_factories.json # LLM providers
│ ├── mapping.json # Field mappings
│ ├── service_conf.yaml # Service configuration
│ ├── private.pem # RSA private key
│ └── public.pem # RSA public key
├── test/ # [TEST] Testing Suite
│ ├── unit_test/ # Unit tests
│ │ └── common/ # Common utilities tests
│ │
│ └── testcases/ # Integration tests
│ ├── test_http_api/ # HTTP API tests
│ ├── test_sdk_api/ # SDK tests
│ └── test_web_api/ # Web API tests
├── example/ # [EXAMPLES] Usage Examples
│ ├── http/ # HTTP API examples
│ └── sdk/ # SDK examples
├── intergrations/ # [INTEGRATIONS] Third-party
│ ├── chatgpt-on-wechat/ # WeChat integration
│ ├── extension_chrome/ # Chrome extension
│ └── firecrawl/ # Web scraping
├── agentic_reasoning/ # [REASONING] Advanced reasoning
├── sandbox/ # [SANDBOX] Code execution
├── helm/ # [K8S] Kubernetes Helm charts
├── docs/ # [DOCS] Documentation
├── pyproject.toml # Python project config
├── CLAUDE.md # Development guidelines
└── README.md # Project overview
```
## Mô Tả Chi Tiết Các Thư Mục Chính
### 1. `/api/` - Backend API Server
- **Vai trò**: Xử lý tất cả HTTP requests, authentication, và business logic
- **Framework**: Flask/Quart (async ASGI)
- **Port mặc định**: 9380
- **Entry point**: `ragflow_server.py`
### 2. `/rag/` - RAG Processing Engine
- **Vai trò**: Xử lý pipeline RAG từ document parsing đến retrieval
- **Chức năng chính**:
- Document parsing và extraction
- Text tokenization
- Semantic chunking
- Embedding generation
- Reranking
### 3. `/deepdoc/` - Document Understanding
- **Vai trò**: Deep document parsing với layout analysis
- **Hỗ trợ formats**: PDF, Word, PPT, Excel, HTML, Markdown, JSON, TXT
- **Đặc biệt**: OCR và layout analysis cho PDF
### 4. `/agent/` - Agentic Workflow
- **Vai trò**: Hệ thống workflow agent với visual canvas
- **Components**: LLM, Retrieval, Categorize, Webhook, Iteration...
- **Tools**: Tavily, Google, Wikipedia, GitHub, Email...
### 5. `/graphrag/` - Knowledge Graph
- **Vai trò**: Xây dựng và query knowledge graph
- **Chức năng**: Entity resolution, graph search, relationship extraction
### 6. `/web/` - Frontend
- **Framework**: React + TypeScript + UmiJS
- **UI**: Ant Design + Shadcn/UI + Tailwind CSS
- **State**: Zustand
- **Port**: 80/443 (qua Nginx)
### 7. `/common/` - Shared Utilities
- **Vai trò**: Utilities và connectors dùng chung
- **Data sources**: Confluence, Notion, Slack, Gmail, SharePoint...
### 8. `/docker/` - Deployment
- **Services**: MySQL, Elasticsearch/Infinity, Redis, MinIO, Nginx
- **Modes**: CPU/GPU, single/cluster
## Tóm Tắt Thống Kê
| Thư mục | Số files | Mô tả |
|---------|----------|-------|
| api/ | ~100+ | Backend API |
| rag/ | ~50+ | RAG engine |
| deepdoc/ | ~30+ | Document parsers |
| agent/ | ~40+ | Agent system |
| graphrag/ | ~20+ | Knowledge graph |
| web/src/ | ~200+ | Frontend |
| common/ | ~50+ | Shared utilities |
| test/ | ~80+ | Test suite |

View file

@ -0,0 +1,567 @@
# RAGFlow - Kiến Trúc Hệ Thống
## 1. Tổng Quan Kiến Trúc
RAGFlow sử dụng kiến trúc **Microservices** với các thành phần được container hóa bằng Docker. Hệ thống được thiết kế theo mô hình **3-tier architecture** kết hợp với **event-driven architecture** cho xử lý bất đồng bộ.
## 2. Sơ Đồ Kiến Trúc Tổng Quan
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Web App │ │ Mobile App │ │ Python SDK │ │ REST API │ │
│ │ (React/TS) │ │ (Future) │ │ Client │ │ Client │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │ │
└─────────┼─────────────────┼─────────────────┼─────────────────┼──────────────────┘
│ │ │ │
└─────────────────┴────────┬────────┴─────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ GATEWAY LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ NGINX Reverse Proxy │ │
│ │ (Load Balancing, SSL Termination) │ │
│ │ Port: 80/443 │ │
│ └─────────────────────────────────────┬───────────────────────────────────┘ │
│ │ │
└────────────────────────────────────────┼─────────────────────────────────────────┘
┌──────────────────────────────┼──────────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────┐ │
│ │ RAGFlow Server │ │ Admin Server │ │ MCP Server │ │
│ │ (Flask/Quart) │ │ (Flask) │ │ (MCP Protocol) │ │
│ │ Port: 9380 │ │ Port: 9381 │ │ Port: 9382 │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌─────────────┐ │ │
│ │ │ API Blueprints │ │ │ │ Admin APIs │ │ │ │ MCP Handler │ │ │
│ │ │ - kb_app │ │ │ │ - User Mgmt │ │ │ │ - Tools │ │ │
│ │ │ - document_app │ │ │ │ - System Cfg │ │ │ │ - Resources │ │ │
│ │ │ - dialog_app │ │ │ │ - Monitoring │ │ │ └─────────────┘ │ │
│ │ │ - canvas_app │ │ │ └─────────────────┘ │ │ │ │
│ │ │ - search_app │ │ │ │ │ │ │
│ │ │ - file_app │ │ │ │ │ │ │
│ │ └─────────────────┘ │ │ │ │ │ │
│ └───────────┬───────────┘ └───────────────────────┘ └───────────────────┘ │
│ │ │
└──────────────┼───────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ SERVICE LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Business Logic │ │ RAG Pipeline │ │ Agent System │ │
│ │ Services │ │ Engine │ │ Engine │ │
│ │ │ │ │ │ │ │
│ │ - UserService │ │ - Parser │ │ - Canvas │ │
│ │ - DialogService │ │ - Tokenizer │ │ - Components │ │
│ │ - DocService │ │ - Splitter │ │ - Tools │ │
│ │ - KBService │ │ - Embedder │ │ - Workflows │ │
│ │ - TaskService │ │ - Reranker │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ │ │
│ ┌─────────────────────────────┼─────────────────────────────────────────────┐ │
│ │ DeepDoc Processing Engine │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ PDF │ │ DOCX │ │ PPT │ │ Excel │ │ HTML │ │ │
│ │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ Vision/OCR Processing (Layout Analysis) │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ MySQL │ │ Redis/Valkey │ │ MinIO │ │
│ │ (Primary DB) │ │ (Cache) │ │ (Object Store) │ │
│ │ Port: 5455 │ │ Port: 6379 │ │ Port: 9000/9001 │ │
│ │ │ │ │ │ │ │
│ │ - Users │ │ - Sessions │ │ - Documents │ │
│ │ - Tenants │ │ - Cache │ │ - Files │ │
│ │ - Knowledgebase │ │ - Rate Limit │ │ - Chunks │ │
│ │ - Documents │ │ - Task Queue │ │ - Images │ │
│ │ - Dialogs │ │ │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Vector Database Layer │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Elasticsearch │ │ Infinity │ │ OpenSearch │ │ │
│ │ │ (Default) │ │ (Alternative) │ │ (Alternative) │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - Vector Search │ │ - Hybrid Search │ │ - Vector Search │ │ │
│ │ │ - Full-text │ │ - Full-text │ │ - Full-text │ │ │
│ │ │ - BM25 │ │ - BM25 │ │ - BM25 │ │ │
│ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ EXTERNAL SERVICES │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ LLM Providers │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ OpenAI │ │ Claude │ │ Gemini │ │ Qwen │ │ Groq │ │ Ollama │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Data Source Connectors │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │Confluence│ │ Notion │ │ Slack │ │ Gmail │ │SharePoint│ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Agent Tools & APIs │ │
│ │ ┌────────┐ ┌─────────┐ ┌────────┐ ┌────────┐ ┌─────────┐ ┌────────┐ │ │
│ │ │ Tavily │ │ Google │ │ ArXiv │ │ GitHub │ │Wikipedia│ │ Weather│ │ │
│ │ └────────┘ └─────────┘ └────────┘ └────────┘ └─────────┘ └────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
```
## 3. Kiến Trúc Chi Tiết Các Thành Phần
### 3.1 API Server Architecture
```
┌──────────────────────────────────────────────────────────────┐
│ RAGFlow API Server │
│ (ragflow_server.py) │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Flask/Quart Application │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ │
│ │ │ CORS │ │ Session │ │ JWT Auth │ │ │
│ │ │ Middleware │ │ Middleware │ │ Middleware │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────────┐
│ │ API Blueprints │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ │ kb_app │ │ doc_app │ │dialog_app│ │canvas_app│ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ │file_app │ │search_app│ │ llm_app │ │ user_app │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ └───────┼────────────┼────────────┼────────────┼───────────┘
│ │ │ │ │ │
│ ┌───────┴────────────┴────────────┴────────────┴───────────┐
│ │ Service Layer │
│ │ ┌────────────────┐ ┌────────────────┐ │
│ │ │ UserService │ │ DialogService │ │
│ │ │ - register() │ │ - chat() │ │
│ │ │ - login() │ │ - stream() │ │
│ │ │ - get_user() │ │ - completion() │ │
│ │ └────────────────┘ └────────────────┘ │
│ │ │
│ │ ┌────────────────┐ ┌────────────────┐ │
│ │ │ DocumentService│ │ KBService │ │
│ │ │ - upload() │ │ - create() │ │
│ │ │ - parse() │ │ - list() │ │
│ │ │ - chunk() │ │ - delete() │ │
│ │ └────────────────┘ └────────────────┘ │
│ └──────────────────────────────────────────────────────────┘
│ │ │
│ ┌───────────────────────────┴───────────────────────────────┐
│ │ Database Layer (Peewee ORM) │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ │ User │ │ Tenant │ │Document │ │ Dialog │ │
│ │ │ Model │ │ Model │ │ Model │ │ Model │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ └──────────────────────────────────────────────────────────┘
│ │
└──────────────────────────────────────────────────────────────┘
```
### 3.2 RAG Pipeline Architecture
```
┌──────────────────────────────────────────────────────────────────────────┐
│ RAG Processing Pipeline │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ INGESTION PIPELINE │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ File │───▶│ Parser │───▶│Tokenizer │───▶│ Splitter │ │ │
│ │ │ Upload │ │ │ │ │ │ (Chunker)│ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └────┬─────┘ │ │
│ │ │ │ │
│ │ ┌──────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Embedding│───▶│ Index │───▶│ Store │ │ │
│ │ │ Model │ │ Creation │ │ (ES/Inf) │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RETRIEVAL PIPELINE │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Query │───▶│ Query │───▶│ Embedding│───▶│ Hybrid │ │ │
│ │ │ Input │ │ Analysis │ │ Query │ │ Search │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └────┬─────┘ │ │
│ │ │ │ │
│ │ ┌──────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Candidate│───▶│ Reranker │───▶│ Context │───▶│ LLM │ │ │
│ │ │ Chunks │ │ │ │ Building │ │ Response │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
### 3.3 Agent Workflow Architecture
```
┌──────────────────────────────────────────────────────────────────────────┐
│ Agent Canvas Architecture │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Canvas Orchestrator │ │
│ │ (canvas.py) │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ BEGIN │─────────────▶│ LLM │─────────────▶│RETRIEVAL│ │
│ │Component│ │Component│ │Component│ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │
│ │ ┌───────────────────┼───────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────┐ ┌─────────────┐ │
│ │ CATEGORIZE │ │ MESSAGE │ │ WEBHOOK │ │
│ │ Component │ │Component│ │ Component │ │
│ └─────────────┘ └─────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────────────┼────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TOOLS INTEGRATION │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ Tavily │ │ ArXiv │ │ GitHub │ │ Email │ │Code │ │ │
│ │ │ Search │ │ Search │ │ API │ │ Send │ │Executor│ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────┘
```
## 4. Data Flow Architecture
### 4.1 Document Ingestion Flow
```
┌────────────────────────────────────────────────────────────────────────────┐
│ Document Ingestion Flow │
├────────────────────────────────────────────────────────────────────────────┤
│ │
│ User Upload │
│ │ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ API │────▶│ File │────▶│ MinIO │────▶│ Task │ │
│ │ Endpoint │ │ Service │ │ Storage │ │ Queue │ │
│ └──────────┘ └──────────┘ └──────────┘ └────┬─────┘ │
│ │ │
│ ┌────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Background Task Processor │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Parser │───▶│Extractor │───▶│ Chunker │───▶│ Embedder │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ - PDF │ │ - Text │ │ - Token │ │ - OpenAI │ │ │
│ │ │ - DOCX │ │ - Table │ │ - Sent │ │ - BGE │ │ │
│ │ │ - HTML │ │ - Image │ │ - Page │ │ - Cohere │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └────┬─────┘ │ │
│ │ │ │ │
│ └────────────────────────────────────────────────────────┼────────┘ │
│ │ │
│ ┌─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ MySQL │ │ Elasticsearch│ │ MinIO │ │ │
│ │ │ (Metadata) │ │ (Vectors) │ │ (Files) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────┘
```
### 4.2 Query Processing Flow
```
┌────────────────────────────────────────────────────────────────────────────┐
│ Query Processing Flow │
├────────────────────────────────────────────────────────────────────────────┤
│ │
│ User Query: "What is the revenue for Q3 2024?" │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 1. QUERY UNDERSTANDING │ │
│ │ ┌──────────────┐ │ │
│ │ │ Query Parser │──▶ Extract: entities, intent, keywords │ │
│ │ └──────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 2. RETRIEVAL │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Embedding │───▶│ Hybrid │───▶│ Candidate │ │ │
│ │ │ Query │ │ Search │ │ Chunks │ │ │
│ │ └────────────┘ │ │ │ (Top 100) │ │ │
│ │ │ Vector+BM25│ └────────────┘ │ │
│ │ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 3. RERANKING │ │
│ │ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Reranker │───▶│ Top-K │ │ │
│ │ │ Model │ │ Chunks │ │ │
│ │ │ │ │ (Top 5) │ │ │
│ │ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 4. GENERATION │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Prompt │───▶│ LLM │───▶│ Response │ │ │
│ │ │ Builder │ │ (GPT-4) │ │ + Sources │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Response: "The revenue for Q3 2024 was $X million... [source: doc.pdf]" │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## 5. Deployment Architecture
### 5.1 Docker Compose Deployment
```
┌──────────────────────────────────────────────────────────────────────────────┐
│ Docker Compose Deployment │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Docker Network │ │
│ │ (ragflow-network) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────┼───────────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ nginx │ ◀──────────────▶│ ragflow- │◀──────────────────▶│ ragflow- │ │
│ │ :80/443 │ │ server │ │ admin │ │
│ └──────────┘ │ :9380 │ │ :9381 │ │
│ │ └────┬─────┘ └──────────┘ │
│ │ │ │
│ │ ┌──────────────────┼──────────────────────┐ │
│ │ │ │ │ │
│ │ ▼ ▼ ▼ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ │ mysql │ │ redis │ │elasticsearch│ │
│ │ │ :5455 │ │ :6379 │ │ :9200 │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ │ minio │ │ sandbox │ │ tei │ │
│ │ │:9000/9001│ │ :9385 │ │ :6380 │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ ┌────┴─────────────────────────────────────────────────────────────────┐ │
│ │ Volumes │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │mysql_data │ │ es_data │ │minio_data │ │ redis_data │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────────┘
```
## 6. Security Architecture
```
┌──────────────────────────────────────────────────────────────────────────────┐
│ Security Architecture │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Authentication Layer │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ JWT │ │ OAuth │ │ API │ │ │
│ │ │ Tokens │ │ (GitHub, │ │ Tokens │ │ │
│ │ │ │ │ OIDC) │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Authorization Layer │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Multi-Tenancy Model │ │ │
│ │ │ │ │ │
│ │ │ Tenant A Tenant B Tenant C │ │ │
│ │ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │ │
│ │ │ │Users │ │Users │ │Users │ │ │ │
│ │ │ │KBs │ │KBs │ │KBs │ │ │ │
│ │ │ │Docs │ │Docs │ │Docs │ │ │ │
│ │ │ └──────┘ └──────┘ └──────┘ │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Role-Based │ │ Team │ │ Resource │ │ │
│ │ │ Access │ │ Permissions│ │ Ownership │ │ │
│ │ │ Control │ │ │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Encryption Layer │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ RSA │ │ HTTPS │ │ Password │ │ │
│ │ │ Key Pair │ │ (TLS) │ │ Bcrypt │ │ │
│ │ │ (conf/*.pem)│ │ │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────────┘
```
## 7. Scalability Architecture
```
┌──────────────────────────────────────────────────────────────────────────────┐
│ Scalability Architecture │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Horizontal Scaling │ │
│ │ │ │
│ │ Load Balancer (Nginx) │ │
│ │ │ │ │
│ │ ┌──────────────────┼──────────────────┐ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Server #1 │ │ Server #2 │ │ Server #N │ │ │
│ │ │ (Instance) │ │ (Instance) │ │ (Instance) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Database Scaling │ │
│ │ │ │
│ │ MySQL: Elasticsearch: Redis: │ │
│ │ - Read Replicas - Cluster Mode - Sentinel │ │
│ │ - Connection Pool - Sharding - Cluster Mode │ │
│ │ - Index Partitioning │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Async Processing │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Task │───▶│ Redis │───▶│ Worker │ │ │
│ │ │ Producer │ │ Queue │ │ Consumer │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ Tasks: Document parsing, Embedding, Indexing │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────────┘
```
## 8. Tóm Tắt Kiến Trúc
| Layer | Components | Technology |
|-------|------------|------------|
| Client | Web App, SDK, API | React, Python SDK, REST |
| Gateway | Reverse Proxy | Nginx |
| Application | API Server, Admin | Flask/Quart |
| Service | Business Logic | Python Services |
| Processing | RAG, DeepDoc, Agent | Python, ML Models |
| Data | Storage, Cache, Vector | MySQL, Redis, ES, MinIO |
| External | LLM, Connectors, Tools | OpenAI, Claude, APIs |
### Đặc Điểm Nổi Bật
1. **Microservices**: Các service độc lập, dễ scale
2. **Event-Driven**: Xử lý async cho document processing
3. **Multi-Tenant**: Hỗ trợ nhiều tenants với data isolation
4. **Hybrid Search**: Kết hợp vector search và full-text search
5. **Pluggable**: Hỗ trợ multiple LLM providers và vector stores
6. **Containerized**: Full Docker deployment với orchestration

View file

@ -0,0 +1,700 @@
# RAGFlow - Sequence Diagrams
Tài liệu này mô tả các luồng xử lý chính trong hệ thống RAGFlow thông qua sequence diagrams.
## 1. User Authentication Flow
### 1.1 User Registration
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant DB as MySQL
participant R as Redis
U->>W: Click Register
W->>W: Show registration form
U->>W: Enter email, password, nickname
W->>A: POST /api/v1/user/register
A->>A: Validate input data
A->>DB: Check if email exists
alt Email exists
DB-->>A: User found
A-->>W: 400 - Email already registered
W-->>U: Show error message
else Email not exists
DB-->>A: No user found
A->>A: Hash password (bcrypt)
A->>A: Generate user ID
A->>DB: INSERT User
A->>DB: CREATE Tenant for user
A->>DB: CREATE UserTenant association
DB-->>A: Success
A->>A: Generate JWT token
A->>R: Store session
A-->>W: 200 - Registration success + token
W->>W: Store token in localStorage
W-->>U: Redirect to dashboard
end
```
### 1.2 User Login
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant DB as MySQL
participant R as Redis
U->>W: Enter email/password
W->>A: POST /api/v1/user/login
A->>DB: SELECT User WHERE email
alt User not found
DB-->>A: No user
A-->>W: 401 - Invalid credentials
W-->>U: Show error
else User found
DB-->>A: User record
A->>A: Verify password (bcrypt)
alt Password invalid
A-->>W: 401 - Invalid credentials
W-->>U: Show error
else Password valid
A->>A: Generate JWT (access_token)
A->>A: Generate refresh_token
A->>R: Store session data
A->>DB: Update last_login_time
A-->>W: 200 - Login success
Note over A,W: Response: {access_token, refresh_token, user_info}
W->>W: Store tokens
W-->>U: Redirect to dashboard
end
end
```
## 2. Knowledge Base Management
### 2.1 Create Knowledge Base
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant DB as MySQL
participant ES as Elasticsearch
U->>W: Click "Create Knowledge Base"
W->>W: Show KB creation modal
U->>W: Enter name, description, settings
W->>A: POST /api/v1/kb/create
Note over W,A: Headers: Authorization: Bearer {token}
A->>A: Validate JWT token
A->>A: Extract tenant_id from token
A->>DB: Check KB name uniqueness in tenant
alt Name exists
A-->>W: 400 - Name already exists
W-->>U: Show error
else Name unique
A->>A: Generate KB ID
A->>DB: INSERT Knowledgebase
Note over A,DB: {id, name, tenant_id, embd_id, parser_id, ...}
A->>ES: CREATE Index for KB
Note over A,ES: Index: ragflow_{kb_id}
ES-->>A: Index created
DB-->>A: KB record saved
A-->>W: 200 - KB created
Note over A,W: {kb_id, name, created_at}
W-->>U: Show success, refresh KB list
end
```
### 2.2 List Knowledge Bases
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant DB as MySQL
U->>W: Open Knowledge Base page
W->>A: GET /api/v1/kb/list?page=1&size=10
A->>A: Validate JWT, extract tenant_id
A->>DB: SELECT * FROM knowledgebase WHERE tenant_id
A->>DB: COUNT total KBs
DB-->>A: KB list + count
loop For each KB
A->>DB: COUNT documents in KB
A->>DB: SUM chunk_num for KB
end
A->>A: Build response with stats
A-->>W: 200 - KB list with pagination
Note over A,W: {data: [...], total, page, size}
W->>W: Render KB cards
W-->>U: Display knowledge bases
```
## 3. Document Upload & Processing
### 3.1 Document Upload Flow
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant M as MinIO
participant DB as MySQL
participant Q as Task Queue (Redis)
U->>W: Select files to upload
W->>W: Validate file types/sizes
loop For each file
W->>A: POST /api/v1/document/upload
Note over W,A: multipart/form-data: file, kb_id
A->>A: Validate file type
A->>A: Generate file_id, doc_id
A->>M: Upload file to bucket
Note over A,M: Bucket: ragflow, Key: {tenant_id}/{kb_id}/{file_id}
M-->>A: Upload success, file_key
A->>DB: INSERT File record
Note over A,DB: {id, name, size, location, tenant_id}
A->>DB: INSERT Document record
Note over A,DB: {id, kb_id, name, status: 'UNSTART'}
A->>Q: PUSH parsing task
Note over A,Q: {doc_id, file_location, parser_config}
A-->>W: 200 - Upload success
Note over A,W: {doc_id, file_id, status}
end
W-->>U: Show upload progress/success
```
### 3.2 Document Parsing Flow (Background Task)
```mermaid
sequenceDiagram
participant Q as Task Queue
participant W as Worker
participant M as MinIO
participant P as Parser (DeepDoc)
participant E as Embedding Model
participant ES as Elasticsearch
participant DB as MySQL
Q->>W: POP task from queue
W->>DB: UPDATE doc status = 'RUNNING'
W->>M: Download file
M-->>W: File content
W->>P: Parse document
Note over W,P: Based on file type (PDF, DOCX, etc.)
P->>P: Extract text content
P->>P: Extract tables
P->>P: Extract images (if any)
P->>P: Layout analysis (for PDF)
P-->>W: Parsed content
W->>W: Apply chunking strategy
Note over W: Token-based, sentence-based, or page-based
W->>W: Generate chunks
loop For each chunk batch
W->>E: Generate embeddings
Note over W,E: batch_size typically 32
E-->>W: Vector embeddings [1536 dim]
W->>ES: Bulk index chunks
Note over W,ES: {chunk_id, content, embedding, doc_id, kb_id}
ES-->>W: Index success
W->>DB: INSERT Chunk records
end
W->>DB: UPDATE Document
Note over W,DB: status='FINISHED', chunk_num, token_num
W->>DB: UPDATE Task status = 'SUCCESS'
```
## 4. Chat/Dialog Flow
### 4.1 Create Chat Session
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant DB as MySQL
U->>W: Click "New Chat"
W->>A: POST /api/v1/dialog/create
Note over W,A: {name, kb_ids[], llm_id, prompt_config}
A->>A: Validate KB access
A->>DB: INSERT Dialog record
Note over A,DB: {id, name, tenant_id, kb_ids, llm_id, ...}
DB-->>A: Dialog created
A-->>W: 200 - Dialog created
Note over A,W: {dialog_id, name, created_at}
W-->>U: Open chat interface
```
### 4.2 Chat Message Flow (RAG)
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant ES as Elasticsearch
participant RR as Reranker
participant LLM as LLM Provider
participant DB as MySQL
U->>W: Type question
W->>A: POST /api/v1/dialog/chat (SSE)
Note over W,A: {dialog_id, conversation_id, question}
A->>DB: Load dialog config
Note over A,DB: Get kb_ids, llm_config, prompt
A->>DB: Load conversation history
rect rgb(200, 220, 240)
Note over A,ES: RETRIEVAL PHASE
A->>A: Query understanding
A->>A: Generate query embedding
A->>ES: Hybrid search
Note over A,ES: Vector similarity + BM25 full-text
ES-->>A: Top 100 candidates
A->>RR: Rerank candidates
Note over A,RR: Cross-encoder scoring
RR-->>A: Top K chunks (typically 5-10)
end
rect rgb(220, 240, 200)
Note over A,LLM: GENERATION PHASE
A->>A: Build prompt with context
Note over A: System prompt + Retrieved chunks + Question
A->>LLM: Stream completion request
loop Streaming response
LLM-->>A: Token chunk
A-->>W: SSE: data chunk
W-->>U: Display token
end
LLM-->>A: [DONE]
end
A->>DB: Save conversation message
Note over A,DB: {role, content, doc_ids[], conversation_id}
A-->>W: SSE: [DONE] + sources
W-->>U: Show sources/citations
```
### 4.3 Streaming Response Detail
```mermaid
sequenceDiagram
participant W as Web Frontend
participant A as API Server
participant LLM as LLM Provider
W->>A: POST /api/v1/dialog/chat
Note over W,A: Accept: text/event-stream
A->>A: Process retrieval...
A->>LLM: POST /v1/chat/completions
Note over A,LLM: stream: true
loop Until complete
LLM-->>A: data: {"choices":[{"delta":{"content":"..."}}]}
A->>A: Extract content
A-->>W: data: {"answer": "...", "reference": {...}}
W->>W: Append to display
end
LLM-->>A: data: [DONE]
A-->>W: data: {"answer": "", "reference": {...}, "done": true}
W->>W: Show final state
```
## 5. Agent Workflow Execution
### 5.1 Canvas Workflow Execution
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant C as Canvas Engine
participant Comp as Components
participant LLM as LLM Provider
participant Tools as External Tools
U->>W: Run workflow
W->>A: POST /api/v1/canvas/run
Note over W,A: {canvas_id, input_data}
A->>C: Initialize canvas execution
C->>C: Parse workflow DSL
C->>C: Build execution graph
rect rgb(240, 220, 200)
Note over C,Comp: BEGIN Component
C->>Comp: Execute BEGIN
Comp->>Comp: Initialize variables
Comp-->>C: {user_input: "..."}
end
rect rgb(200, 220, 240)
Note over C,Comp: RETRIEVAL Component
C->>Comp: Execute RETRIEVAL
Comp->>A: Search knowledge bases
A-->>Comp: Retrieved chunks
Comp-->>C: {context: [...]}
end
rect rgb(220, 240, 200)
Note over C,LLM: LLM Component
C->>Comp: Execute LLM
Comp->>Comp: Build prompt with variables
Comp->>LLM: Chat completion
LLM-->>Comp: Response
Comp-->>C: {llm_output: "..."}
end
rect rgb(240, 240, 200)
Note over C,Tools: TOOL Component (optional)
C->>Comp: Execute TOOL (e.g., Tavily)
Comp->>Tools: API call
Tools-->>Comp: Tool result
Comp-->>C: {tool_output: {...}}
end
rect rgb(220, 220, 240)
Note over C,Comp: CATEGORIZE Component
C->>Comp: Execute CATEGORIZE
Comp->>Comp: Evaluate conditions
Comp-->>C: {next_node: "node_id"}
end
C->>C: Continue to next component...
C-->>A: Workflow complete
A-->>W: SSE: Final output
W-->>U: Display result
```
### 5.2 Agent with Tools Flow
```mermaid
sequenceDiagram
participant U as User
participant A as Agent Engine
participant LLM as LLM Provider
participant T1 as Tavily Search
participant T2 as Wikipedia
participant T3 as Code Executor
U->>A: Question requiring tools
A->>LLM: Initial prompt + available tools
Note over A,LLM: Tools: [tavily_search, wikipedia, code_exec]
loop ReAct Loop
LLM-->>A: Thought + Action
Note over LLM,A: Action: {"tool": "tavily_search", "input": "..."}
alt Tool: tavily_search
A->>T1: Search query
T1-->>A: Search results
else Tool: wikipedia
A->>T2: Page lookup
T2-->>A: Wikipedia content
else Tool: code_exec
A->>T3: Execute code
T3-->>A: Execution result
end
A->>LLM: Observation from tool
alt LLM decides more tools needed
LLM-->>A: Another Action
else LLM ready to answer
LLM-->>A: Final Answer
end
end
A-->>U: Final response with sources
```
## 6. GraphRAG Flow
### 6.1 Knowledge Graph Construction
```mermaid
sequenceDiagram
participant D as Document
participant E as Entity Extractor
participant LLM as LLM Provider
participant ER as Entity Resolution
participant G as Graph Store
D->>E: Document chunks
loop For each chunk
E->>LLM: Extract entities prompt
Note over E,LLM: "Extract entities and relationships..."
LLM-->>E: Entities + Relations
Note over LLM,E: [{entity, type, properties}, {src, rel, dst}]
end
E->>ER: All extracted entities
ER->>ER: Cluster similar entities
ER->>LLM: Entity resolution prompt
Note over ER,LLM: "Are these the same entity?"
LLM-->>ER: Resolution decisions
ER->>ER: Merge duplicate entities
ER-->>G: Resolved entities + relations
G->>G: Build graph structure
G->>G: Create entity embeddings
G->>G: Index for search
```
### 6.2 GraphRAG Query Flow
```mermaid
sequenceDiagram
participant U as User
participant Q as Query Analyzer
participant G as Graph Store
participant V as Vector Search
participant LLM as LLM Provider
U->>Q: Natural language query
Q->>LLM: Analyze query
Note over Q,LLM: Extract entities, intent, constraints
LLM-->>Q: Query analysis
par Graph Search
Q->>G: Find related entities
G->>G: Traverse relationships
G-->>Q: Subgraph context
and Vector Search
Q->>V: Semantic search
V-->>Q: Relevant chunks
end
Q->>Q: Merge graph + vector results
Q->>Q: Build unified context
Q->>LLM: Generate with context
Note over Q,LLM: Context includes entity relations
LLM-->>Q: Response with graph insights
Q-->>U: Answer + entity graph visualization
```
## 7. File Operations
### 7.1 File Download Flow
```mermaid
sequenceDiagram
participant U as User
participant W as Web Frontend
participant A as API Server
participant M as MinIO
participant DB as MySQL
U->>W: Click download
W->>A: GET /api/v1/file/download/{file_id}
A->>A: Validate JWT
A->>DB: Get file record
A->>A: Check user permission
alt No permission
A-->>W: 403 Forbidden
else Has permission
A->>M: Get file from storage
M-->>A: File stream
A-->>W: File stream with headers
Note over A,W: Content-Disposition: attachment
W-->>U: Download starts
end
```
## 8. Search Operations
### 8.1 Hybrid Search Flow
```mermaid
sequenceDiagram
participant U as User
participant A as API Server
participant E as Embedding Model
participant ES as Elasticsearch
U->>A: Search query
A->>E: Embed query text
E-->>A: Query vector [1536]
A->>ES: Hybrid query
Note over A,ES: script_score (vector) + bool (BM25)
ES->>ES: Vector similarity search
Note over ES: cosine_similarity on dense_vector
ES->>ES: BM25 full-text search
Note over ES: match on content field
ES->>ES: Combine scores
Note over ES: final = vector_score * weight + bm25_score * weight
ES-->>A: Ranked results
A->>A: Post-process results
A->>A: Add highlights
A->>A: Group by document
A-->>U: Search results with snippets
```
## 9. Multi-Tenancy Flow
### 9.1 Tenant Data Isolation
```mermaid
sequenceDiagram
participant U1 as User (Tenant A)
participant U2 as User (Tenant B)
participant A as API Server
participant DB as MySQL
U1->>A: GET /api/v1/kb/list
A->>A: Extract tenant_id from JWT
Note over A: tenant_id = "tenant_a"
A->>DB: SELECT * FROM kb WHERE tenant_id = 'tenant_a'
DB-->>A: Tenant A's KBs only
A-->>U1: KBs for Tenant A
U2->>A: GET /api/v1/kb/list
A->>A: Extract tenant_id from JWT
Note over A: tenant_id = "tenant_b"
A->>DB: SELECT * FROM kb WHERE tenant_id = 'tenant_b'
DB-->>A: Tenant B's KBs only
A-->>U2: KBs for Tenant B
Note over U1,U2: Data is completely isolated
```
## 10. Connector Integration Flow
### 10.1 Confluence Connector Sync
```mermaid
sequenceDiagram
participant U as User
participant A as API Server
participant C as Confluence Connector
participant CF as Confluence API
participant DB as MySQL
participant Q as Task Queue
U->>A: Setup Confluence connector
Note over U,A: {url, username, api_token, space_key}
A->>C: Initialize connector
C->>CF: Authenticate
CF-->>C: Auth success
A->>DB: Save connector config
A-->>U: Connector created
U->>A: Start sync
A->>Q: Queue sync task
Q->>C: Execute sync
C->>CF: GET /wiki/rest/api/content
CF-->>C: Content list
loop For each page
C->>CF: GET page content
CF-->>C: Page HTML
C->>C: Convert to markdown
C->>A: Create document
A->>Q: Queue parsing task
end
C->>DB: Update sync status
C-->>A: Sync complete
A-->>U: Show sync results
```
## Tóm Tắt
| Flow | Thành phần chính | Mô tả |
|------|-----------------|-------|
| Authentication | User, API, DB, Redis | Đăng ký, đăng nhập với JWT |
| Knowledge Base | API, MySQL, ES | CRUD knowledge bases |
| Document Upload | API, MinIO, Queue, ES | Upload và index documents |
| Chat/Dialog | API, ES, Reranker, LLM | RAG-based chat với streaming |
| Agent Workflow | Canvas Engine, Components, LLM, Tools | Visual workflow execution |
| GraphRAG | Entity Extractor, Graph Store, LLM | Knowledge graph queries |
| Search | Embedding, ES | Hybrid vector + BM25 search |
| Connectors | Connector, External API | Sync external data sources |
### Các Pattern Thiết Kế Sử Dụng
1. **Event-Driven**: Task queue cho background processing
2. **Streaming**: SSE cho real-time chat responses
3. **Hybrid Search**: Kết hợp vector và text search
4. **ReAct Pattern**: Agent reasoning với tool use
5. **Multi-Tenancy**: Data isolation per tenant

View file

@ -0,0 +1,949 @@
# RAGFlow - Phân Tích Chi Tiết Các Module
## 1. Module API (`/api/`)
### 1.1 Tổng Quan
Module API là trung tâm xử lý tất cả HTTP requests của hệ thống. Được xây dựng trên Flask/Quart framework với kiến trúc Blueprint.
### 1.2 Cấu Trúc
```
api/
├── ragflow_server.py # Entry point - Khởi tạo Flask app
├── settings.py # Cấu hình server
├── constants.py # API_VERSION = "v1"
├── validation.py # Request validation
├── apps/ # API Blueprints
├── db/ # Database layer
└── utils/ # Utilities
```
### 1.3 Chi Tiết Các Blueprint (API Apps)
#### 1.3.1 `kb_app.py` - Knowledge Base Management
**Chức năng**: Quản lý Knowledge Base (tạo, xóa, sửa, liệt kê)
**Endpoints chính**:
| Method | Endpoint | Mô tả |
|--------|----------|-------|
| POST | `/api/v1/kb/create` | Tạo KB mới |
| GET | `/api/v1/kb/list` | Liệt kê KBs |
| PUT | `/api/v1/kb/update` | Cập nhật KB |
| DELETE | `/api/v1/kb/delete` | Xóa KB |
| GET | `/api/v1/kb/{id}` | Chi tiết KB |
**Logic chính**:
- Validation tenant permissions
- Tạo Elasticsearch index cho mỗi KB
- Quản lý embedding model settings
- Quản lý parser configurations
#### 1.3.2 `document_app.py` - Document Management
**Chức năng**: Upload, parsing, và quản lý documents
**Endpoints chính**:
| Method | Endpoint | Mô tả |
|--------|----------|-------|
| POST | `/api/v1/document/upload` | Upload file |
| POST | `/api/v1/document/run` | Trigger parsing |
| GET | `/api/v1/document/list` | Liệt kê docs |
| DELETE | `/api/v1/document/delete` | Xóa document |
| GET | `/api/v1/document/{id}/chunks` | Lấy chunks |
**Logic chính**:
- File type validation
- MinIO storage integration
- Background task queuing
- Parsing status tracking
#### 1.3.3 `dialog_app.py` - Chat/Dialog Management
**Chức năng**: Xử lý chat conversations với RAG
**Endpoints chính**:
| Method | Endpoint | Mô tả |
|--------|----------|-------|
| POST | `/api/v1/dialog/create` | Tạo dialog |
| POST | `/api/v1/dialog/chat` | Chat (SSE streaming) |
| POST | `/api/v1/dialog/completion` | Non-streaming chat |
| GET | `/api/v1/dialog/list` | Liệt kê dialogs |
**Logic chính**:
- RAG pipeline orchestration
- Streaming response (SSE)
- Conversation history management
- Multi-KB retrieval
#### 1.3.4 `canvas_app.py` - Agent Workflow
**Chức năng**: Visual workflow builder cho AI agents
**Endpoints chính**:
| Method | Endpoint | Mô tả |
|--------|----------|-------|
| POST | `/api/v1/canvas/create` | Tạo workflow |
| POST | `/api/v1/canvas/run` | Execute workflow |
| PUT | `/api/v1/canvas/update` | Cập nhật |
| GET | `/api/v1/canvas/list` | Liệt kê |
**Logic chính**:
- DSL parsing và validation
- Component orchestration
- Tool integration
- Variable passing between nodes
#### 1.3.5 `file_app.py` - File Management
**Chức năng**: Upload, download, quản lý files
**Endpoints chính**:
| Method | Endpoint | Mô tả |
|--------|----------|-------|
| POST | `/api/v1/file/upload` | Upload file |
| GET | `/api/v1/file/download/{id}` | Download |
| GET | `/api/v1/file/list` | Liệt kê files |
| DELETE | `/api/v1/file/delete` | Xóa file |
#### 1.3.6 `search_app.py` - Search Operations
**Chức năng**: Full-text và semantic search
**Endpoints chính**:
| Method | Endpoint | Mô tả |
|--------|----------|-------|
| POST | `/api/v1/search` | Hybrid search |
| GET | `/api/v1/search/history` | Search history |
### 1.4 Database Services (`/api/db/services/`)
#### `dialog_service.py` (37KB - Service phức tạp nhất)
```python
class DialogService:
def chat(dialog_id, question, stream=True):
"""
Main RAG chat function
1. Load dialog configuration
2. Get relevant documents (retrieval)
3. Rerank results
4. Build prompt with context
5. Call LLM (streaming)
6. Save conversation
"""
def retrieval(dialog, question):
"""
Hybrid retrieval from Elasticsearch
- Vector similarity search
- BM25 full-text search
- Score combination
"""
def rerank(chunks, question):
"""
Cross-encoder reranking
- Score each chunk against question
- Return top-k
"""
```
#### `document_service.py` (39KB)
```python
class DocumentService:
def upload(file, kb_id):
"""Upload file to MinIO, create DB record"""
def parse(doc_id):
"""Queue document for background parsing"""
def chunk(doc_id, chunks):
"""Save parsed chunks to ES and DB"""
def delete(doc_id):
"""Remove doc, chunks, and file"""
```
#### `knowledgebase_service.py` (21KB)
```python
class KnowledgebaseService:
def create(name, embedding_model, parser_id):
"""Create KB with ES index"""
def update_parser_config(kb_id, config):
"""Update chunking/parsing settings"""
def get_statistics(kb_id):
"""Get doc count, chunk count, etc."""
```
### 1.5 Database Models (`/api/db/db_models.py`)
**25+ Models quan trọng**:
```python
# User & Tenant
class User(BaseModel):
id, email, password, nickname, avatar, status, login_channel
class Tenant(BaseModel):
id, name, public_key, llm_id, embd_id, parser_id, credit
class UserTenant(BaseModel):
user_id, tenant_id, role # owner, admin, member
# Knowledge Management
class Knowledgebase(BaseModel):
id, tenant_id, name, description, embd_id, parser_id,
similarity_threshold, vector_similarity_weight, ...
class Document(BaseModel):
id, kb_id, name, location, size, type, parser_id,
status, progress, chunk_num, token_num, process_duation
class File(BaseModel):
id, tenant_id, name, size, location, type, source_type
# Chat & Dialog
class Dialog(BaseModel):
id, tenant_id, name, description, kb_ids, llm_id,
prompt_config, similarity_threshold, top_n, top_k
class Conversation(BaseModel):
id, dialog_id, name, message # JSON array of messages
# Workflow
class UserCanvas(BaseModel):
id, tenant_id, name, dsl, avatar # DSL is workflow definition
class CanvasTemplate(BaseModel):
id, name, dsl, avatar # Pre-built templates
# Integration
class APIToken(BaseModel):
id, tenant_id, token, dialog_id # For external API access
class MCPServer(BaseModel):
id, tenant_id, name, host, tools # MCP server config
```
---
## 2. Module RAG (`/rag/`)
### 2.1 Tổng Quan
Core RAG processing engine - xử lý từ document parsing đến retrieval.
### 2.2 LLM Abstractions (`/rag/llm/`)
#### `chat_model.py` - Chat LLM Interface
```python
class Base:
"""Abstract base for all chat models"""
def chat(messages, stream=True, **kwargs):
"""Generate chat completion"""
class OpenAIChat(Base):
"""OpenAI GPT models"""
class ClaudeChat(Base):
"""Anthropic Claude models"""
class QwenChat(Base):
"""Alibaba Qwen models"""
class OllamaChat(Base):
"""Local Ollama models"""
# Factory function
def get_chat_model(model_name, api_key, base_url):
"""Return appropriate chat model instance"""
```
**Supported Providers** (20+):
- OpenAI (GPT-3.5, GPT-4, GPT-4V)
- Anthropic (Claude 3)
- Google (Gemini)
- Alibaba (Qwen, Qwen-VL)
- Groq
- Mistral
- Cohere
- DeepSeek
- Zhipu (GLM)
- Moonshot
- Ollama (local)
- NVIDIA
- Bedrock (AWS)
- Azure OpenAI
- Hugging Face
- ...
#### `embedding_model.py` - Embedding Interface
```python
class Base:
"""Abstract base for embeddings"""
def encode(texts: List[str]) -> List[List[float]]:
"""Generate embeddings for texts"""
class OpenAIEmbed(Base):
"""text-embedding-ada-002, text-embedding-3-*"""
class BGEEmbed(Base):
"""BAAI BGE models"""
class JinaEmbed(Base):
"""Jina AI embeddings"""
# Supported embedding models:
# - OpenAI: ada-002, embedding-3-small, embedding-3-large
# - BGE: bge-base, bge-large, bge-m3
# - Jina: jina-embeddings-v2
# - Cohere: embed-english-v3
# - HuggingFace: sentence-transformers
# - Local: Ollama embeddings
```
#### `rerank_model.py` - Reranking Interface
```python
class Base:
"""Abstract base for rerankers"""
def rerank(query: str, documents: List[str]) -> List[float]:
"""Score documents against query"""
class CohereRerank(Base):
"""Cohere rerank models"""
class JinaRerank(Base):
"""Jina AI reranker"""
class BGERerank(Base):
"""BAAI BGE reranker"""
```
### 2.3 RAG Pipeline (`/rag/flow/`)
#### Pipeline Architecture
```
Document → Parser → Tokenizer → Splitter → Embedder → Index
```
#### `parser/parser.py`
```python
def parse(file_path, parser_config):
"""
Parse document based on file type
Returns: List of text segments with metadata
"""
# Supported parsers:
# - naive: Simple text extraction
# - paper: Academic paper structure
# - book: Book chapter detection
# - laws: Legal document parsing
# - presentation: PPT parsing
# - qa: Q&A format extraction
# - table: Table extraction
# - picture: Image description
# - one: Single chunk per doc
# - audio: Audio transcription
# - email: Email thread parsing
```
#### `splitter/splitter.py`
```python
class Splitter:
"""Document chunking strategies"""
def split_by_tokens(text, chunk_size=512, overlap=128):
"""Token-based splitting"""
def split_by_sentences(text, max_sentences=10):
"""Sentence-based splitting"""
def split_by_delimiter(text, delimiter='\n\n'):
"""Delimiter-based splitting"""
def split_semantic(text, threshold=0.5):
"""Semantic similarity based splitting"""
```
#### `tokenizer/tokenizer.py`
```python
class Tokenizer:
"""Text tokenization"""
def tokenize(text):
"""Convert text to tokens"""
def count_tokens(text):
"""Count tokens in text"""
# Uses tiktoken for OpenAI models
# Uses model-specific tokenizers for others
```
### 2.4 RAPTOR (`/rag/raptor.py`)
**RAPTOR** = Recursive Abstractive Processing for Tree-Organized Retrieval
```python
class RAPTOR:
"""
Hierarchical document representation
- Clusters similar chunks
- Creates summaries of clusters
- Builds tree structure for retrieval
"""
def build_tree(chunks):
"""Build RAPTOR tree from chunks"""
def retrieve(query, tree):
"""Retrieve from tree structure"""
```
---
## 3. Module DeepDoc (`/deepdoc/`)
### 3.1 Tổng Quan
Deep document understanding với layout analysis và OCR.
### 3.2 Document Parsers (`/deepdoc/parser/`)
#### `pdf_parser.py` - PDF Processing
```python
class PdfParser:
"""
Advanced PDF parsing with:
- OCR for scanned pages
- Layout analysis (tables, figures, headers)
- Multi-column detection
- Image extraction
"""
def __call__(file_path):
"""Parse PDF file"""
# 1. Extract text with PyMuPDF
# 2. Apply OCR if needed (Tesseract)
# 3. Analyze layout (detectron2/layoutlm)
# 4. Extract tables (camelot/tabula)
# 5. Extract images
# Return structured content
```
#### `docx_parser.py` - Word Documents
```python
class DocxParser:
"""
Parse .docx files
- Text extraction
- Table extraction
- Image extraction
- Style preservation
"""
```
#### `excel_parser.py` - Spreadsheets
```python
class ExcelParser:
"""
Parse .xlsx/.xls files
- Sheet-by-sheet processing
- Table structure preservation
- Formula evaluation
"""
```
#### `html_parser.py` - Web Pages
```python
class HtmlParser:
"""
Parse HTML content
- Clean HTML
- Extract main content
- Handle tables
- Remove scripts/styles
"""
```
### 3.3 Vision Module (`/deepdoc/vision/`)
```python
class LayoutAnalyzer:
"""
Document layout analysis using ML
- Detectron2 for object detection
- LayoutLM for document understanding
"""
def analyze(image):
"""
Detect document regions:
- Title
- Paragraph
- Table
- Figure
- Header/Footer
- List
"""
```
---
## 4. Module Agent (`/agent/`)
### 4.1 Tổng Quan
Agentic workflow system với visual canvas builder.
### 4.2 Canvas Engine (`/agent/canvas.py`)
```python
class Canvas:
"""
Main workflow orchestrator
- Parse DSL definition
- Execute components in order
- Handle branching logic
- Manage variables
"""
def __init__(self, dsl):
"""Initialize from DSL"""
self.components = self._parse_dsl(dsl)
self.graph = self._build_graph()
def run(self, input_data):
"""Execute workflow"""
context = {"input": input_data}
for component in self._topological_sort():
result = component.execute(context)
context.update(result)
return context["output"]
```
### 4.3 Components (`/agent/component/`)
#### `begin.py` - Workflow Start
```python
class BeginComponent:
"""
Entry point of workflow
- Initialize variables
- Receive user input
"""
def execute(self, context):
return {"user_input": context["input"]}
```
#### `llm.py` - LLM Component
```python
class LLMComponent:
"""
Call LLM with configured prompt
- Template variable substitution
- Streaming support
- Output parsing
"""
def execute(self, context):
prompt = self.template.format(**context)
response = self.llm.chat(prompt)
return {"llm_output": response}
```
#### `retrieval.py` - Retrieval Component
```python
class RetrievalComponent:
"""
Search knowledge bases
- Multi-KB search
- Configurable top_k
- Score threshold
"""
def execute(self, context):
query = context["user_input"]
results = self.search(query, self.kb_ids)
return {"retrieved_docs": results}
```
#### `categorize.py` - Conditional Branching
```python
class CategorizeComponent:
"""
Route to different paths based on conditions
- LLM-based classification
- Rule-based matching
"""
def execute(self, context):
category = self._classify(context)
return {"next_node": self.routes[category]}
```
#### `agent_with_tools.py` - Tool-Using Agent
```python
class AgentWithToolsComponent:
"""
ReAct pattern agent
- Tool selection
- Iterative reasoning
- Observation handling
"""
def execute(self, context):
while not done:
action = self.llm.decide_action(context)
if action.type == "tool":
result = self.tools[action.tool].run(action.input)
context["observation"] = result
else:
return {"output": action.response}
```
### 4.4 Tools (`/agent/tools/`)
#### External Tool Integrations
| Tool | File | Chức năng |
|------|------|-----------|
| Tavily | `tavily.py` | Web search API |
| ArXiv | `arxiv.py` | Academic paper search |
| Google | `google.py` | Google search |
| Wikipedia | `wikipedia.py` | Wikipedia lookup |
| GitHub | `github.py` | GitHub API |
| Email | `email.py` | Send emails |
| Code Exec | `code_exec.py` | Execute Python code |
| DeepL | `deepl.py` | Translation |
| Jin10 | `jin10.py` | Financial news |
| TuShare | `tushare.py` | Chinese stock data |
| Yahoo Finance | `yahoofinance.py` | Stock data |
| QWeather | `qweather.py` | Weather data |
```python
class BaseTool:
"""Base class for all tools"""
name: str
description: str
def run(self, input: str) -> str:
"""Execute tool and return result"""
class TavilySearch(BaseTool):
name = "tavily_search"
description = "Search the web for current information"
def run(self, query):
response = tavily.search(query)
return format_results(response)
```
---
## 5. Module GraphRAG (`/graphrag/`)
### 5.1 Tổng Quan
Knowledge graph construction và querying.
### 5.2 Entity Resolution (`/graphrag/entity_resolution.py`)
```python
class EntityResolution:
"""
Entity extraction và linking
- Extract entities from text
- Cluster similar entities
- Resolve duplicates
"""
def extract_entities(text):
"""Extract named entities using LLM"""
prompt = f"Extract entities from: {text}"
return llm.chat(prompt)
def resolve_entities(entities):
"""Merge duplicate entities"""
clusters = self._cluster_similar(entities)
return self._merge_clusters(clusters)
```
### 5.3 Graph Search (`/graphrag/search.py`)
```python
class GraphSearch:
"""
Query knowledge graph
- Entity-based search
- Relationship traversal
- Subgraph extraction
"""
def search(query):
"""Find relevant subgraph for query"""
# 1. Extract query entities
# 2. Find matching graph entities
# 3. Traverse relationships
# 4. Return context subgraph
```
---
## 6. Module Frontend (`/web/`)
### 6.1 Tổng Quan
React/TypeScript SPA với UmiJS framework.
### 6.2 Pages (`/web/src/pages/`)
| Page | Chức năng |
|------|-----------|
| `/dataset` | Knowledge base management |
| `/datasets` | Dataset list view |
| `/next-chats` | Chat interface |
| `/next-searches` | Search interface |
| `/document-viewer` | Document preview |
| `/admin` | Admin dashboard |
| `/login` | Authentication |
| `/register` | User registration |
### 6.3 Components (`/web/src/components/`)
**Core Components**:
- `file-upload-modal/` - File upload UI
- `pdf-drawer/` - PDF preview drawer
- `prompt-editor/` - Prompt template editor
- `document-preview/` - Document viewer
- `llm-setting-items/` - LLM configuration UI
- `ui/` - Shadcn/UI base components
### 6.4 State Management
```typescript
// Using Zustand for state
import { create } from 'zustand';
interface KnowledgebaseStore {
knowledgebases: Knowledgebase[];
currentKb: Knowledgebase | null;
fetchKnowledgebases: () => Promise<void>;
createKnowledgebase: (data: CreateKbRequest) => Promise<void>;
}
export const useKnowledgebaseStore = create<KnowledgebaseStore>((set) => ({
knowledgebases: [],
currentKb: null,
fetchKnowledgebases: async () => {
const data = await api.get('/kb/list');
set({ knowledgebases: data });
},
// ...
}));
```
### 6.5 API Services (`/web/src/services/`)
```typescript
// API client using Axios
import { request } from 'umi';
export async function createKnowledgebase(data: CreateKbRequest) {
return request('/api/v1/kb/create', {
method: 'POST',
data,
});
}
export async function chat(dialogId: string, question: string) {
return request('/api/v1/dialog/chat', {
method: 'POST',
data: { dialog_id: dialogId, question },
responseType: 'stream',
});
}
```
---
## 7. Module Common (`/common/`)
### 7.1 Configuration (`/common/settings.py`)
```python
# Main configuration file
class Settings:
# Database
MYSQL_HOST = os.getenv('MYSQL_HOST', 'localhost')
MYSQL_PORT = int(os.getenv('MYSQL_PORT', 5455))
MYSQL_USER = os.getenv('MYSQL_USER', 'root')
MYSQL_PASSWORD = os.getenv('MYSQL_PASSWORD', 'infini_rag_flow')
MYSQL_DATABASE = os.getenv('MYSQL_DATABASE', 'ragflow')
# Elasticsearch
ES_HOSTS = os.getenv('ES_HOSTS', 'http://localhost:9200').split(',')
# Redis
REDIS_HOST = os.getenv('REDIS_HOST', 'localhost')
REDIS_PORT = int(os.getenv('REDIS_PORT', 6379))
# MinIO
MINIO_HOST = os.getenv('MINIO_HOST', 'localhost:9000')
MINIO_ACCESS_KEY = os.getenv('MINIO_USER', 'rag_flow')
MINIO_SECRET_KEY = os.getenv('MINIO_PASSWORD', 'infini_rag_flow')
# Document Engine
DOC_ENGINE = os.getenv('DOC_ENGINE', 'elasticsearch') # or 'infinity'
```
### 7.2 Data Source Connectors (`/common/data_source/`)
**Supported Connectors**:
| Connector | File | Chức năng |
|-----------|------|-----------|
| Confluence | `confluence_connector.py` (81KB) | Atlassian Confluence wiki |
| Notion | `notion_connector.py` (25KB) | Notion databases |
| Slack | `slack_connector.py` (22KB) | Slack messages |
| Gmail | `gmail_connector.py` | Gmail emails |
| Discord | `discord_connector.py` | Discord channels |
| SharePoint | `sharepoint_connector.py` | Microsoft SharePoint |
| Teams | `teams_connector.py` | Microsoft Teams |
| Dropbox | `dropbox_connector.py` | Dropbox files |
| Google Drive | `google_drive/` | Google Drive |
| WebDAV | `webdav_connector.py` | WebDAV servers |
| Moodle | `moodle_connector.py` | Moodle LMS |
```python
class BaseConnector:
"""Abstract base for connectors"""
def authenticate(credentials):
"""Authenticate with external service"""
def list_items():
"""List available items"""
def sync():
"""Sync data to RAGFlow"""
class ConfluenceConnector(BaseConnector):
"""Confluence integration"""
def __init__(self, url, username, api_token):
self.client = Confluence(url, username, api_token)
def sync_space(space_key):
"""Sync all pages from a space"""
pages = self.client.get_all_pages(space_key)
for page in pages:
content = self._convert_to_markdown(page.body)
yield Document(content=content, metadata=page.metadata)
```
---
## 8. Module SDK (`/sdk/python/`)
### 8.1 Python SDK
```python
from ragflow import RAGFlow
# Initialize client
client = RAGFlow(
api_key="your-api-key",
base_url="http://localhost:9380"
)
# Create knowledge base
kb = client.create_knowledgebase(
name="My KB",
embedding_model="text-embedding-3-small"
)
# Upload document
doc = kb.upload_document("path/to/document.pdf")
# Wait for parsing
doc.wait_for_ready()
# Create chat
chat = client.create_chat(
name="My Chat",
knowledgebase_ids=[kb.id]
)
# Send message
response = chat.send_message("What is this document about?")
print(response.answer)
```
---
## 9. Tóm Tắt Module Dependencies
```
┌─────────────────────────────────────────────────────────────────┐
│ Frontend (web/) │
└─────────────────────────────┬───────────────────────────────────┘
│ HTTP/SSE
┌─────────────────────────────────────────────────────────────────┐
│ API (api/) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ kb_app │ │doc_app │ │dialog_ │ │canvas_ │ │
│ │ │ │ │ │app │ │app │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ └────────────┴───────────┴────────────┘ │
│ │ │
│ ┌──────────────────────────┴──────────────────────────┐ │
│ │ Services Layer │ │
│ │ DialogService │ DocumentService │ KBService │ │
│ └───────────────────────────┬─────────────────────────┘ │
└───────────────────────────────┼─────────────────────────────────┘
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ RAG (rag/) │ │ Agent (agent/) │ │GraphRAG(graphrag)│
│ │ │ │ │ │
│ - LLM Models │ │ - Canvas Engine │ │ - Entity Res. │
│ - Pipeline │ │ - Components │ │ - Graph Search │
│ - Embeddings │ │ - Tools │ │ │
└───────┬───────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└─────────────────────┼───────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DeepDoc (deepdoc/) │
│ │
│ PDF Parser │ DOCX Parser │ HTML Parser │ Vision/OCR │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Common (common/) │
│ │
│ Settings │ Utilities │ Data Source Connectors │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Data Stores │
│ │
│ MySQL │ Elasticsearch/Infinity │ Redis │ MinIO │
└─────────────────────────────────────────────────────────────────┘
```
## 10. Kích Thước Code Ước Tính
| Module | Lines of Code | Complexity |
|--------|--------------|------------|
| api/ | ~15,000 | High |
| rag/ | ~8,000 | High |
| deepdoc/ | ~5,000 | Medium |
| agent/ | ~6,000 | High |
| graphrag/ | ~3,000 | Medium |
| web/src/ | ~20,000 | High |
| common/ | ~5,000 | Medium |
| **Total** | **~62,000** | - |

View file

@ -0,0 +1,634 @@
# RAGFlow - Tech Stack Analysis
## 1. Tổng Quan Tech Stack
RAGFlow sử dụng một tech stack hiện đại, được thiết kế để xử lý các workload AI/ML nặng với khả năng scale tốt.
```
┌─────────────────────────────────────────────────────────────────────────┐
│ TECH STACK OVERVIEW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ FRONTEND │ │
│ │ React 18 │ TypeScript │ UmiJS │ Ant Design │ Tailwind CSS │ │
│ │ Zustand │ TanStack Query │ XYFlow │ Monaco Editor │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ BACKEND │ │
│ │ Python 3.10-3.12 │ Flask/Quart │ Peewee ORM │ Celery │ │
│ │ AsyncIO │ JWT │ SSE Streaming │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ AI/ML │ │
│ │ LangChain │ OpenAI │ Sentence Transformers │ Hugging Face │ │
│ │ PyTorch │ Detectron2 │ Tesseract OCR │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ MySQL 8 │ Elasticsearch 8 │ Redis │ MinIO │ Infinity │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ INFRASTRUCTURE │ │
│ │ Docker │ Docker Compose │ Kubernetes │ Nginx │ Helm │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
---
## 2. Frontend Technologies
### 2.1 Core Framework
| Technology | Version | Mục đích |
|------------|---------|----------|
| **React** | 18.x | UI library chính |
| **TypeScript** | 5.x | Type-safe JavaScript |
| **UmiJS** | 4.x | React framework (Ant Design ecosystem) |
| **Vite** | 5.x | Build tool (nhanh hơn Webpack) |
### 2.2 UI Libraries
| Library | Version | Mục đích |
|---------|---------|----------|
| **Ant Design** | 5.x | Primary UI component library |
| **Shadcn/UI** | Latest | Modern, customizable components |
| **Radix UI** | Latest | Headless UI primitives |
| **Tailwind CSS** | 3.x | Utility-first CSS framework |
| **LESS** | 4.x | CSS preprocessor (legacy) |
### 2.3 State Management & Data Fetching
| Library | Mục đích |
|---------|----------|
| **Zustand** | Lightweight state management |
| **TanStack React Query** | Server state & caching |
| **Axios** | HTTP client |
### 2.4 Specialized Libraries
| Library | Mục đích |
|---------|----------|
| **XYFlow (React Flow)** | Workflow/canvas visualization |
| **Monaco Editor** | Code editor (VS Code core) |
| **AntV G2/G6** | Data visualization & graphs |
| **Recharts** | Charts and analytics |
| **Lexical** | Rich text editor (Facebook) |
| **React Markdown** | Markdown rendering |
| **i18next** | Internationalization |
| **React Hook Form** | Form handling |
| **Zod** | Schema validation |
### 2.5 Package.json Dependencies (172 packages)
```json
{
"dependencies": {
"react": "^18.2.0",
"react-dom": "^18.2.0",
"umi": "^4.0.0",
"antd": "^5.0.0",
"@tanstack/react-query": "^5.0.0",
"zustand": "^4.0.0",
"axios": "^1.0.0",
"tailwindcss": "^3.0.0",
"@xyflow/react": "^12.0.0",
"@monaco-editor/react": "^4.0.0",
"lexical": "^0.12.0",
"react-markdown": "^9.0.0",
"i18next": "^23.0.0",
"react-hook-form": "^7.0.0",
"zod": "^3.0.0",
"@radix-ui/react-*": "latest",
"@ant-design/icons": "^5.0.0",
"@antv/g2": "^5.0.0",
"@antv/g6": "^5.0.0"
}
}
```
---
## 3. Backend Technologies
### 3.1 Core Framework
| Technology | Version | Mục đích |
|------------|---------|----------|
| **Python** | 3.10-3.12 | Programming language |
| **Flask** | 3.x | Web framework |
| **Quart** | 0.19.x | Async Flask (ASGI) |
| **Hypercorn** | Latest | ASGI server |
### 3.2 Database & ORM
| Technology | Mục đích |
|------------|----------|
| **Peewee** | Lightweight ORM (primary) |
| **SQLAlchemy** | Advanced ORM operations |
| **PyMySQL** | MySQL driver |
### 3.3 Authentication & Security
| Library | Mục đích |
|---------|----------|
| **PyJWT** | JWT token handling |
| **bcrypt** | Password hashing |
| **python-jose** | JOSE implementation |
| **Authlib** | OAuth integration |
### 3.4 Async & Background Tasks
| Library | Mục đích |
|---------|----------|
| **asyncio** | Async I/O |
| **aiohttp** | Async HTTP client |
| **Redis/Valkey** | Task queue & caching |
| **APScheduler** | Job scheduling |
### 3.5 API & Documentation
| Library | Mục đích |
|---------|----------|
| **Flasgger** | Swagger/OpenAPI docs |
| **Flask-CORS** | CORS handling |
| **Werkzeug** | WSGI utilities |
### 3.6 pyproject.toml Dependencies (150+ packages)
```toml
[project]
name = "ragflow"
version = "0.22.1"
requires-python = ">=3.10,<3.13"
dependencies = [
# Web Framework
"flask>=3.0.0",
"quart>=0.19.0",
"hypercorn>=0.17.0",
"flask-cors>=4.0.0",
"flasgger>=0.9.0",
# Database
"peewee>=3.17.0",
"pymysql>=1.1.0",
# Authentication
"pyjwt>=2.8.0",
"bcrypt>=4.1.0",
# Async
"aiohttp>=3.9.0",
"httpx>=0.27.0",
# Data Processing
"pandas>=2.0.0",
"numpy>=1.26.0",
# AI/ML (see section 4)
...
]
```
---
## 4. AI/ML Technologies
### 4.1 LLM Integration
| Provider | Library | Models Supported |
|----------|---------|-----------------|
| **OpenAI** | `openai>=1.0` | GPT-3.5, GPT-4, GPT-4V |
| **Anthropic** | `anthropic>=0.20` | Claude 3 family |
| **Google** | `google-generativeai` | Gemini Pro |
| **Cohere** | `cohere>=5.0` | Command, Embed, Rerank |
| **Groq** | `groq>=0.4` | LLaMA, Mixtral |
| **Mistral** | `mistralai>=0.1` | Mistral 7B, Mixtral |
| **Ollama** | `ollama>=0.1` | Local models |
| **HuggingFace** | `huggingface_hub` | Open source models |
### 4.2 Embedding Models
| Library | Models |
|---------|--------|
| **Sentence Transformers** | all-MiniLM, all-mpnet, etc. |
| **OpenAI Embeddings** | text-embedding-3-small/large |
| **BGE** | bge-base, bge-large, bge-m3 |
| **Jina** | jina-embeddings-v2 |
| **Cohere** | embed-english-v3 |
```python
# Embedding configuration
EMBEDDING_MODELS = {
"openai": {
"text-embedding-3-small": {"dim": 1536, "max_tokens": 8191},
"text-embedding-3-large": {"dim": 3072, "max_tokens": 8191},
},
"bge": {
"bge-base-en-v1.5": {"dim": 768, "max_tokens": 512},
"bge-large-en-v1.5": {"dim": 1024, "max_tokens": 512},
"bge-m3": {"dim": 1024, "max_tokens": 8192},
},
"sentence-transformers": {
"all-MiniLM-L6-v2": {"dim": 384, "max_tokens": 256},
"all-mpnet-base-v2": {"dim": 768, "max_tokens": 384},
}
}
```
### 4.3 Document Processing
| Library | Mục đích |
|---------|----------|
| **PyMuPDF (fitz)** | PDF text extraction |
| **pdf2image** | PDF to image conversion |
| **Tesseract (pytesseract)** | OCR |
| **python-docx** | Word document parsing |
| **openpyxl** | Excel parsing |
| **python-pptx** | PowerPoint parsing |
| **BeautifulSoup4** | HTML parsing |
| **markdown** | Markdown processing |
| **camelot-py** | Table extraction from PDF |
| **tabula-py** | Alternative table extraction |
### 4.4 Computer Vision
| Library | Mục đích |
|---------|----------|
| **Detectron2** | Layout analysis |
| **LayoutLM** | Document understanding |
| **OpenCV** | Image processing |
| **Pillow** | Image manipulation |
| **YOLO** | Object detection |
### 4.5 NLP & Text Processing
| Library | Mục đích |
|---------|----------|
| **tiktoken** | OpenAI tokenization |
| **nltk** | Natural language toolkit |
| **spaCy** | NLP pipeline |
| **regex** | Advanced regex |
| **chardet** | Character encoding detection |
### 4.6 Vector Operations
| Library | Mục đích |
|---------|----------|
| **NumPy** | Numerical operations |
| **SciPy** | Scientific computing |
| **scikit-learn** | ML utilities, clustering |
| **faiss-cpu/gpu** | Vector similarity search |
---
## 5. Data Storage Technologies
### 5.1 Relational Database
| Technology | Mục đích | Configuration |
|------------|----------|---------------|
| **MySQL 8.0** | Primary database | Port 5455 |
| **PostgreSQL** | Alternative (supported) | - |
**MySQL Schema Design**:
- InnoDB engine
- UTF8MB4 character set
- JSON columns for flexible data
- Foreign keys for integrity
### 5.2 Vector/Search Database
| Technology | Mục đích | Configuration |
|------------|----------|---------------|
| **Elasticsearch 8.12** | Default vector store | Port 9200 |
| **Infinity** | Alternative (in-house) | Port 23817 |
| **OpenSearch** | Alternative | Port 9200 |
| **OceanBase** | Alternative (distributed) | - |
**Elasticsearch Configuration**:
```json
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"ik_smart": { "type": "ik_smart" },
"ik_max_word": { "type": "ik_max_word" }
}
}
},
"mappings": {
"properties": {
"content": { "type": "text", "analyzer": "ik_smart" },
"embedding": {
"type": "dense_vector",
"dims": 1536,
"index": true,
"similarity": "cosine"
}
}
}
}
```
### 5.3 Cache & Message Queue
| Technology | Mục đích | Configuration |
|------------|----------|---------------|
| **Redis 7.x** | Cache, sessions, queue | Port 6379 |
| **Valkey** | Redis alternative | Port 6379 |
**Redis Usage**:
- Session storage
- Rate limiting
- Task queue (custom implementation)
- Cache layer
### 5.4 Object Storage
| Technology | Mục đích | Configuration |
|------------|----------|---------------|
| **MinIO** | S3-compatible storage | Port 9000/9001 |
| **AWS S3** | Cloud storage option | - |
| **Azure Blob** | Cloud storage option | - |
**MinIO Structure**:
```
ragflow/ # Bucket
├── {tenant_id}/
│ ├── {kb_id}/
│ │ ├── {file_id} # Original files
│ │ └── chunks/ # Processed chunks
│ └── temp/ # Temporary files
└── system/ # System files
```
---
## 6. Infrastructure Technologies
### 6.1 Containerization
| Technology | Mục đích |
|------------|----------|
| **Docker** | Container runtime |
| **Docker Compose** | Multi-container orchestration |
| **BuildKit** | Efficient image building |
**Docker Images**:
```yaml
services:
ragflow-server:
image: infiniflow/ragflow:latest
# or: ragflow:nightly for development
mysql:
image: mysql:8.0
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
redis:
image: redis:7-alpine
minio:
image: minio/minio:latest
```
### 6.2 Web Server & Proxy
| Technology | Mục đích | Configuration |
|------------|----------|---------------|
| **Nginx** | Reverse proxy, static files | Port 80/443 |
| **Hypercorn** | ASGI server | Port 9380 |
**Nginx Configuration**:
```nginx
upstream ragflow {
server ragflow-server:9380;
}
server {
listen 80;
location /api/ {
proxy_pass http://ragflow;
proxy_http_version 1.1;
proxy_set_header Connection "";
}
location / {
root /usr/share/nginx/html;
try_files $uri $uri/ /index.html;
}
}
```
### 6.3 Kubernetes Deployment
| Technology | Mục đích |
|------------|----------|
| **Kubernetes** | Container orchestration |
| **Helm** | K8s package manager |
**Helm Chart Structure**:
```
helm/
├── Chart.yaml
├── values.yaml
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── configmap.yaml
│ └── ingress.yaml
```
---
## 7. Development Tools
### 7.1 Python Development
| Tool | Mục đích |
|------|----------|
| **uv** | Package manager (fast) |
| **pip** | Traditional package manager |
| **pre-commit** | Git hooks |
| **ruff** | Linter & formatter |
| **pytest** | Testing framework |
| **mypy** | Type checking |
### 7.2 Frontend Development
| Tool | Mục đích |
|------|----------|
| **npm/pnpm** | Package manager |
| **ESLint** | Linting |
| **Prettier** | Code formatting |
| **Jest** | Testing |
| **Storybook** | Component development |
| **Husky** | Git hooks |
### 7.3 Version Control & CI/CD
| Tool | Mục đích |
|------|----------|
| **Git** | Version control |
| **GitHub Actions** | CI/CD |
| **Docker Hub** | Image registry |
---
## 8. Monitoring & Observability
### 8.1 Logging
| Library | Mục đích |
|---------|----------|
| **Python logging** | Standard logging |
| **structlog** | Structured logging |
### 8.2 Tracing
| Integration | Mục đích |
|-------------|----------|
| **Langfuse** | LLM observability |
| **OpenTelemetry** | Distributed tracing |
### 8.3 Metrics
| Tool | Mục đích |
|------|----------|
| **Prometheus** | Metrics collection |
| **Grafana** | Visualization |
---
## 9. Third-party Integrations
### 9.1 LLM Providers
```
┌─────────────────────────────────────────────────────────────┐
│ LLM Provider Support │
├─────────────────────────────────────────────────────────────┤
│ │
│ Commercial APIs: │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │OpenAI │ │Claude │ │Gemini │ │Cohere │ │ Groq │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ │
│ │
│ China Providers: │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │ Qwen │ │Zhipu │ │Baichuan│ │Spark │ │ERNIE │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ │
│ │
│ Self-hosted: │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │Ollama │ │ vLLM │ │LocalAI│ │
│ └───────┘ └───────┘ └───────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
### 9.2 Data Source Connectors
| Category | Services |
|----------|----------|
| **Enterprise Wiki** | Confluence, Notion, SharePoint |
| **Communication** | Slack, Discord, Gmail, Teams |
| **Cloud Storage** | Google Drive, Dropbox, S3, WebDAV |
| **Development** | GitHub, Jira |
| **Education** | Moodle |
| **Finance** | TuShare, AkShare, Yahoo Finance |
### 9.3 Search APIs
| Service | Mục đích |
|---------|----------|
| **Tavily** | AI-optimized web search |
| **Google Search** | Web search |
| **Google Scholar** | Academic search |
| **SearXNG** | Meta search |
| **ArXiv** | Academic papers |
| **Wikipedia** | Knowledge lookup |
---
## 10. System Requirements
### 10.1 Minimum Requirements
| Resource | Minimum | Recommended |
|----------|---------|-------------|
| **CPU** | 4 cores | 8+ cores |
| **RAM** | 16 GB | 32+ GB |
| **Disk** | 50 GB | 200+ GB SSD |
| **GPU** | - | NVIDIA 8GB+ VRAM |
### 10.2 Software Requirements
| Software | Version |
|----------|---------|
| **Docker** | 20.10+ |
| **Docker Compose** | 2.0+ |
| **Python** | 3.10-3.12 |
| **Node.js** | 18.20.4+ |
### 10.3 Port Requirements
| Port | Service |
|------|---------|
| 80/443 | Nginx (HTTP/HTTPS) |
| 9380 | RAGFlow API |
| 9381 | Admin Server |
| 9200 | Elasticsearch |
| 5455 | MySQL |
| 6379 | Redis |
| 9000/9001 | MinIO |
---
## 11. Tóm Tắt Tech Stack
### Production Stack
```
Frontend: React 18 + TypeScript + UmiJS + Ant Design + Tailwind
Backend: Python 3.11 + Flask/Quart + Peewee
AI/ML: OpenAI + Sentence Transformers + Detectron2
Database: MySQL 8 + Elasticsearch 8
Cache: Redis 7
Storage: MinIO
Proxy: Nginx
Container: Docker + Docker Compose
Orchestration: Kubernetes + Helm
```
### Development Stack
```
Package Mgmt: uv (Python), npm (Node.js)
Linting: ruff (Python), ESLint (JS/TS)
Testing: pytest (Python), Jest (JS/TS)
CI/CD: GitHub Actions
Version Ctrl: Git
```
### Key Architectural Choices
1. **Async-first**: Quart ASGI cho high concurrency
2. **Hybrid Search**: Vector + BM25 trong Elasticsearch
3. **Multi-tenant**: Data isolation per tenant
4. **Pluggable LLMs**: Abstract interface cho nhiều providers
5. **Containerized**: Full Docker deployment
6. **Event-driven**: Background processing với Redis queue

File diff suppressed because it is too large Load diff

134
personal_analyze/README.md Normal file
View file

@ -0,0 +1,134 @@
# RAGFlow Analysis Documentation
Tài liệu phân tích chi tiết về RAGFlow - Open-source RAG Engine.
## Tổng Quan RAGFlow
**RAGFlow** (v0.22.1) là một **Retrieval-Augmented Generation (RAG) engine** mã nguồn mở, được xây dựng dựa trên **deep document understanding**. Đây là một ứng dụng full-stack với:
- **Backend**: Python (Flask/Quart)
- **Frontend**: React/TypeScript (UmiJS)
- **Kiến trúc**: Microservices với Docker
- **Data Stores**: MySQL, Elasticsearch/Infinity, Redis, MinIO
## Danh Sách Tài Liệu
| File | Nội dung |
|------|----------|
| [01_directory_structure.md](./01_directory_structure.md) | Cấu trúc cây thư mục chi tiết |
| [02_system_architecture.md](./02_system_architecture.md) | Kiến trúc hệ thống với diagrams |
| [03_sequence_diagrams.md](./03_sequence_diagrams.md) | Sequence diagrams cho các flows chính |
| [04_modules_analysis.md](./04_modules_analysis.md) | Phân tích chi tiết từng module |
| [05_tech_stack.md](./05_tech_stack.md) | Tech stack và dependencies |
| [06_source_code_analysis.md](./06_source_code_analysis.md) | Phân tích source code chi tiết |
## Tóm Tắt Chức Năng Chính
### 1. Document Processing
- Upload và parse nhiều định dạng (PDF, Word, Excel, PPT, HTML...)
- OCR và layout analysis cho PDF
- Intelligent chunking strategies
### 2. RAG Pipeline
- Hybrid search (Vector + BM25)
- Multiple embedding models support
- Reranking với cross-encoder
### 3. Chat/Dialog
- Streaming responses (SSE)
- Multi-knowledge base retrieval
- Conversation history
### 4. Agent Workflows
- Visual canvas builder
- 15+ built-in components
- 20+ external tool integrations
### 5. Knowledge Graph (GraphRAG)
- Entity extraction và resolution
- Graph-based retrieval
- Relationship visualization
## Kiến Trúc High-Level
```
┌─────────────────────────────────────────────────────────────────┐
│ CLIENTS │
│ Web App │ Mobile │ Python SDK │ REST API │
└────────────────────────────┬────────────────────────────────────┘
┌────────────────────────────┼────────────────────────────────────┐
│ NGINX (Gateway) │
└────────────────────────────┬────────────────────────────────────┘
┌────────────────────────────┼────────────────────────────────────┐
│ APPLICATION LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │RAGFlow Server│ │ Admin Server │ │ MCP Server │ │
│ │ (Port 9380) │ │ (Port 9381) │ │ (Port 9382) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
┌────────────────────────────┼────────────────────────────────────┐
│ SERVICE LAYER │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ RAG │ │DeepDoc │ │ Agent │ │GraphRAG│ │Services│ │
│ │Pipeline│ │Parsers │ │ Canvas │ │ Engine │ │ Layer │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │
└────────────────────────────┬────────────────────────────────────┘
┌────────────────────────────┼────────────────────────────────────┐
│ DATA LAYER │
│ ┌────────┐ ┌────────────┐ ┌────────┐ ┌────────┐ │
│ │ MySQL │ │Elasticsearch│ │ Redis │ │ MinIO │ │
│ │(5455) │ │ (9200) │ │ (6379) │ │ (9000) │ │
│ └────────┘ └────────────┘ └────────┘ └────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
## Tech Stack Summary
| Layer | Technologies |
|-------|-------------|
| **Frontend** | React 18, TypeScript, UmiJS, Ant Design, Tailwind CSS |
| **Backend** | Python 3.10-3.12, Flask/Quart, Peewee ORM |
| **AI/ML** | OpenAI, Sentence Transformers, Detectron2, PyTorch |
| **Database** | MySQL 8, Elasticsearch 8, Redis 7 |
| **Storage** | MinIO (S3-compatible) |
| **Infrastructure** | Docker, Nginx, Kubernetes/Helm |
## LLM Providers Supported
- OpenAI (GPT-3.5, GPT-4, GPT-4V)
- Anthropic (Claude 3)
- Google (Gemini)
- Alibaba (Qwen)
- Groq, Mistral, Cohere
- Ollama (local models)
- 20+ more providers
## Data Connectors
- Enterprise: Confluence, Notion, SharePoint, Jira
- Communication: Slack, Discord, Gmail, Teams
- Storage: Google Drive, Dropbox, S3, WebDAV
## Quick Stats
| Metric | Value |
|--------|-------|
| Total LOC | ~62,000+ |
| Python Files | ~300+ |
| TS/JS Files | ~400+ |
| Database Models | 25+ |
| API Endpoints | ~50+ |
| LLM Providers | 20+ |
| Data Connectors | 15+ |
## License
RAGFlow is open-source under Apache 2.0 license.
---
*Documentation generated: 2025-11-26*