ragflow/personal_analyze/05_tech_stack.md
Claude c7cecf9a1f
docs: Add comprehensive RAGFlow analysis documentation
- Add directory structure analysis (01_directory_structure.md)
- Add system architecture with diagrams (02_system_architecture.md)
- Add sequence diagrams for main flows (03_sequence_diagrams.md)
- Add detailed modules analysis (04_modules_analysis.md)
- Add tech stack documentation (05_tech_stack.md)
- Add source code analysis (06_source_code_analysis.md)
- Add README summary for personal_analyze folder

This documentation provides:
- Complete codebase structure overview
- System architecture diagrams (ASCII art)
- Sequence diagrams for authentication, RAG, chat, agent flows
- Detailed analysis of API, RAG, DeepDoc, Agent, GraphRAG modules
- Full tech stack with 150+ dependencies analyzed
- Source code patterns and best practices analysis
2025-11-26 10:20:05 +00:00

# RAGFlow - Tech Stack Analysis
## 1. Tech Stack Overview
RAGFlow uses a modern tech stack designed to handle heavy AI/ML workloads and to scale well.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                           TECH STACK OVERVIEW                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │                              FRONTEND                               │ │
│ │  React 18 │ TypeScript │ UmiJS │ Ant Design │ Tailwind CSS          │ │
│ │  Zustand │ TanStack Query │ XYFlow │ Monaco Editor                  │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │                               BACKEND                               │ │
│ │  Python 3.10-3.12 │ Flask/Quart │ Peewee ORM │ Celery               │ │
│ │  AsyncIO │ JWT │ SSE Streaming                                      │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │                                AI/ML                                │ │
│ │  LangChain │ OpenAI │ Sentence Transformers │ Hugging Face          │ │
│ │  PyTorch │ Detectron2 │ Tesseract OCR                               │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │                             DATA LAYER                              │ │
│ │  MySQL 8 │ Elasticsearch 8 │ Redis │ MinIO │ Infinity               │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │                           INFRASTRUCTURE                            │ │
│ │  Docker │ Docker Compose │ Kubernetes │ Nginx │ Helm                │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
---
## 2. Frontend Technologies
### 2.1 Core Framework
| Technology | Version | Purpose |
|------------|---------|----------|
| **React** | 18.x | UI library chính |
| **TypeScript** | 5.x | Type-safe JavaScript |
| **UmiJS** | 4.x | React framework (Ant Design ecosystem) |
| **Vite** | 5.x | Build tool (faster than Webpack) |
### 2.2 UI Libraries
| Library | Version | Purpose |
|---------|---------|----------|
| **Ant Design** | 5.x | Primary UI component library |
| **Shadcn/UI** | Latest | Modern, customizable components |
| **Radix UI** | Latest | Headless UI primitives |
| **Tailwind CSS** | 3.x | Utility-first CSS framework |
| **LESS** | 4.x | CSS preprocessor (legacy) |
### 2.3 State Management & Data Fetching
| Library | Purpose |
|---------|----------|
| **Zustand** | Lightweight state management |
| **TanStack React Query** | Server state & caching |
| **Axios** | HTTP client |
### 2.4 Specialized Libraries
| Library | Purpose |
|---------|----------|
| **XYFlow (React Flow)** | Workflow/canvas visualization |
| **Monaco Editor** | Code editor (VS Code core) |
| **AntV G2/G6** | Data visualization & graphs |
| **Recharts** | Charts and analytics |
| **Lexical** | Rich text editor (Facebook) |
| **React Markdown** | Markdown rendering |
| **i18next** | Internationalization |
| **React Hook Form** | Form handling |
| **Zod** | Schema validation |
### 2.5 Package.json Dependencies (172 packages)
```json
{
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "umi": "^4.0.0",
    "antd": "^5.0.0",
    "@tanstack/react-query": "^5.0.0",
    "zustand": "^4.0.0",
    "axios": "^1.0.0",
    "tailwindcss": "^3.0.0",
    "@xyflow/react": "^12.0.0",
    "@monaco-editor/react": "^4.0.0",
    "lexical": "^0.12.0",
    "react-markdown": "^9.0.0",
    "i18next": "^23.0.0",
    "react-hook-form": "^7.0.0",
    "zod": "^3.0.0",
    "@radix-ui/react-*": "latest",
    "@ant-design/icons": "^5.0.0",
    "@antv/g2": "^5.0.0",
    "@antv/g6": "^5.0.0"
  }
}
```
---
## 3. Backend Technologies
### 3.1 Core Framework
| Technology | Version | Purpose |
|------------|---------|----------|
| **Python** | 3.10-3.12 | Programming language |
| **Flask** | 3.x | Web framework |
| **Quart** | 0.19.x | Async Flask (ASGI) |
| **Hypercorn** | Latest | ASGI server |
### 3.2 Database & ORM
| Technology | Purpose |
|------------|----------|
| **Peewee** | Lightweight ORM (primary) |
| **SQLAlchemy** | Advanced ORM operations |
| **PyMySQL** | MySQL driver |
### 3.3 Authentication & Security
| Library | Purpose |
|---------|----------|
| **PyJWT** | JWT token handling |
| **bcrypt** | Password hashing |
| **python-jose** | JOSE implementation |
| **Authlib** | OAuth integration |
### 3.4 Async & Background Tasks
| Library | Purpose |
|---------|----------|
| **asyncio** | Async I/O |
| **aiohttp** | Async HTTP client |
| **Redis/Valkey** | Task queue & caching |
| **APScheduler** | Job scheduling |
### 3.5 API & Documentation
| Library | Purpose |
|---------|----------|
| **Flasgger** | Swagger/OpenAPI docs |
| **Flask-CORS** | CORS handling |
| **Werkzeug** | WSGI utilities |
### 3.6 pyproject.toml Dependencies (150+ packages)
```toml
[project]
name = "ragflow"
version = "0.22.1"
requires-python = ">=3.10,<3.13"
dependencies = [
    # Web Framework
    "flask>=3.0.0",
    "quart>=0.19.0",
    "hypercorn>=0.17.0",
    "flask-cors>=4.0.0",
    "flasgger>=0.9.0",
    # Database
    "peewee>=3.17.0",
    "pymysql>=1.1.0",
    # Authentication
    "pyjwt>=2.8.0",
    "bcrypt>=4.1.0",
    # Async
    "aiohttp>=3.9.0",
    "httpx>=0.27.0",
    # Data Processing
    "pandas>=2.0.0",
    "numpy>=1.26.0",
    # AI/ML (see section 4)
    # ...
]
```
---
## 4. AI/ML Technologies
### 4.1 LLM Integration
| Provider | Library | Models Supported |
|----------|---------|-----------------|
| **OpenAI** | `openai>=1.0` | GPT-3.5, GPT-4, GPT-4V |
| **Anthropic** | `anthropic>=0.20` | Claude 3 family |
| **Google** | `google-generativeai` | Gemini Pro |
| **Cohere** | `cohere>=5.0` | Command, Embed, Rerank |
| **Groq** | `groq>=0.4` | LLaMA, Mixtral |
| **Mistral** | `mistralai>=0.1` | Mistral 7B, Mixtral |
| **Ollama** | `ollama>=0.1` | Local models |
| **HuggingFace** | `huggingface_hub` | Open source models |
### 4.2 Embedding Models
| Library | Models |
|---------|--------|
| **Sentence Transformers** | all-MiniLM, all-mpnet, etc. |
| **OpenAI Embeddings** | text-embedding-3-small/large |
| **BGE** | bge-base, bge-large, bge-m3 |
| **Jina** | jina-embeddings-v2 |
| **Cohere** | embed-english-v3 |
```python
# Embedding configuration: output dimension and maximum input length per model
EMBEDDING_MODELS = {
    "openai": {
        "text-embedding-3-small": {"dim": 1536, "max_tokens": 8191},
        "text-embedding-3-large": {"dim": 3072, "max_tokens": 8191},
    },
    "bge": {
        "bge-base-en-v1.5": {"dim": 768, "max_tokens": 512},
        "bge-large-en-v1.5": {"dim": 1024, "max_tokens": 512},
        "bge-m3": {"dim": 1024, "max_tokens": 8192},
    },
    "sentence-transformers": {
        "all-MiniLM-L6-v2": {"dim": 384, "max_tokens": 256},
        "all-mpnet-base-v2": {"dim": 768, "max_tokens": 384},
    },
}
```
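A table like this is typically consulted before indexing, so that a vector of the wrong size is rejected early. The following is a minimal sketch of that check; `MODEL_SPECS` and `check_vector` are hypothetical names introduced for illustration and are not taken from the RAGFlow codebase.

```python
# Hypothetical subset of the configuration above, repeated so the sketch runs on its own.
MODEL_SPECS = {
    "bge-m3": {"dim": 1024, "max_tokens": 8192},
    "all-MiniLM-L6-v2": {"dim": 384, "max_tokens": 256},
}


def check_vector(model: str, vector: list[float]) -> None:
    """Raise ValueError if the vector length does not match the model's dimension."""
    spec = MODEL_SPECS.get(model)
    if spec is None:
        raise ValueError(f"unknown embedding model: {model}")
    if len(vector) != spec["dim"]:
        raise ValueError(
            f"{model} expects {spec['dim']} dimensions, got {len(vector)}"
        )


check_vector("all-MiniLM-L6-v2", [0.0] * 384)  # passes silently
```

Failing fast here matters because the `dims` value of a `dense_vector` field is fixed when the index is created.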
### 4.3 Document Processing
| Library | Purpose |
|---------|----------|
| **PyMuPDF (fitz)** | PDF text extraction |
| **pdf2image** | PDF to image conversion |
| **Tesseract (pytesseract)** | OCR |
| **python-docx** | Word document parsing |
| **openpyxl** | Excel parsing |
| **python-pptx** | PowerPoint parsing |
| **BeautifulSoup4** | HTML parsing |
| **markdown** | Markdown processing |
| **camelot-py** | Table extraction from PDF |
| **tabula-py** | Alternative table extraction |
### 4.4 Computer Vision
| Library | Purpose |
|---------|----------|
| **Detectron2** | Layout analysis |
| **LayoutLM** | Document understanding |
| **OpenCV** | Image processing |
| **Pillow** | Image manipulation |
| **YOLO** | Object detection |
### 4.5 NLP & Text Processing
| Library | Purpose |
|---------|----------|
| **tiktoken** | OpenAI tokenization |
| **nltk** | Natural language toolkit |
| **spaCy** | NLP pipeline |
| **regex** | Advanced regex |
| **chardet** | Character encoding detection |
### 4.6 Vector Operations
| Library | Purpose |
|---------|----------|
| **NumPy** | Numerical operations |
| **SciPy** | Scientific computing |
| **scikit-learn** | ML utilities, clustering |
| **faiss-cpu/gpu** | Vector similarity search |
---
## 5. Data Storage Technologies
### 5.1 Relational Database
| Technology | Purpose | Configuration |
|------------|----------|---------------|
| **MySQL 8.0** | Primary database | Port 5455 |
| **PostgreSQL** | Alternative (supported) | - |
**MySQL Schema Design**:
- InnoDB engine
- UTF8MB4 character set
- JSON columns for flexible data
- Foreign keys for integrity
### 5.2 Vector/Search Database
| Technology | Purpose | Configuration |
|------------|----------|---------------|
| **Elasticsearch 8.12** | Default vector store | Port 9200 |
| **Infinity** | Alternative (in-house) | Port 23817 |
| **OpenSearch** | Alternative | Port 9200 |
| **OceanBase** | Alternative (distributed) | - |
**Elasticsearch Configuration**:
```json
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "ik_smart": { "type": "ik_smart" },
        "ik_max_word": { "type": "ik_max_word" }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "ik_smart" },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```
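With a mapping like the one above, a hybrid request can pair a BM25 `match` on `content` with an approximate kNN clause on `embedding`. The sketch below only builds the request body as a plain dictionary. It assumes the Elasticsearch 8.x search API, which accepts a top-level `knn` section next to `query`; `build_hybrid_query` is a hypothetical helper, not necessarily how RAGFlow constructs its queries.

```python
def build_hybrid_query(text: str, vector: list[float], k: int = 10) -> dict:
    """Build a search body combining a BM25 match with an approximate kNN clause."""
    return {
        "size": k,
        # Lexical (BM25) part, analyzed with the analyzer configured on the field.
        "query": {"match": {"content": {"query": text}}},
        # Vector part; a larger num_candidates trades speed for recall.
        "knn": {
            "field": "embedding",
            "query_vector": vector,
            "k": k,
            "num_candidates": max(100, k * 10),
        },
    }


body = build_hybrid_query("vector database", [0.0] * 1536, k=5)
```

The resulting dictionary can be passed as the body of a search request; the length of `query_vector` has to equal the `dims` value in the mapping.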
### 5.3 Cache & Message Queue
| Technology | Purpose | Configuration |
|------------|----------|---------------|
| **Redis 7.x** | Cache, sessions, queue | Port 6379 |
| **Valkey** | Redis alternative | Port 6379 |
**Redis Usage**:
- Session storage
- Rate limiting
- Task queue (custom implementation)
- Cache layer
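
As an illustration of the rate-limiting use, a fixed-window counter is commonly built on Redis `INCR` plus `EXPIRE`: one counter per client per time window. The sketch below mirrors that pattern with an in-memory dictionary standing in for Redis so that it runs without a server. `FixedWindowLimiter` is a hypothetical class, and the source does not state which limiting scheme RAGFlow actually uses.

```python
import time


class FixedWindowLimiter:
    """In-memory stand-in for the Redis INCR + EXPIRE rate-limiting pattern."""

    def __init__(self, limit: int, window_seconds: int, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # With Redis, each entry would be a key that expires after the window;
        # this stand-in never evicts old counters.
        self.counters: dict[tuple[str, int], int] = {}

    def allow(self, client_id: str) -> bool:
        # One counter per client per window, like a key "rate:{client}:{window}".
        bucket = int(self.clock() // self.window)
        key = (client_id, bucket)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key] <= self.limit


limiter = FixedWindowLimiter(limit=2, window_seconds=60)
```

Injecting `clock` keeps the class testable; in a Redis-backed version the server-side key expiry would replace the manual bucket arithmetic.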
### 5.4 Object Storage
| Technology | Purpose | Configuration |
|------------|----------|---------------|
| **MinIO** | S3-compatible storage | Port 9000/9001 |
| **AWS S3** | Cloud storage option | - |
| **Azure Blob** | Cloud storage option | - |
**MinIO Structure**:
```
ragflow/                      # Bucket
├── {tenant_id}/
│   ├── {kb_id}/
│   │   ├── {file_id}         # Original files
│   │   └── chunks/           # Processed chunks
│   └── temp/                 # Temporary files
└── system/                   # System files
```
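Under this layout, the object key of an original file is simply the three identifiers joined by slashes. A minimal sketch, assuming the bucket name is supplied separately to the storage client; `object_key` is a hypothetical helper introduced for illustration.

```python
def object_key(tenant_id: str, kb_id: str, file_id: str) -> str:
    """Return the key of an original file under the layout shown above."""
    for part in (tenant_id, kb_id, file_id):
        # Reject empty parts and embedded slashes so a key cannot escape its tenant prefix.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"{tenant_id}/{kb_id}/{file_id}"


print(object_key("tenant-1", "kb-42", "file-7"))  # tenant-1/kb-42/file-7
```

Keeping the tenant ID as the leading path segment means a single prefix listing returns everything that belongs to one tenant.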
---
## 6. Infrastructure Technologies
### 6.1 Containerization
| Technology | Purpose |
|------------|----------|
| **Docker** | Container runtime |
| **Docker Compose** | Multi-container orchestration |
| **BuildKit** | Efficient image building |
**Docker Images**:
```yaml
services:
  ragflow-server:
    image: infiniflow/ragflow:latest
    # or: ragflow:nightly for development
  mysql:
    image: mysql:8.0
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
  redis:
    image: redis:7-alpine
  minio:
    image: minio/minio:latest
```
### 6.2 Web Server & Proxy
| Technology | Purpose | Configuration |
|------------|----------|---------------|
| **Nginx** | Reverse proxy, static files | Port 80/443 |
| **Hypercorn** | ASGI server | Port 9380 |
**Nginx Configuration**:
```nginx
upstream ragflow {
    server ragflow-server:9380;
}
server {
    listen 80;
    location /api/ {
        proxy_pass http://ragflow;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    location / {
        root /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }
}
```
### 6.3 Kubernetes Deployment
| Technology | Purpose |
|------------|----------|
| **Kubernetes** | Container orchestration |
| **Helm** | K8s package manager |
**Helm Chart Structure**:
```
helm/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── configmap.yaml
    └── ingress.yaml
```
---
## 7. Development Tools
### 7.1 Python Development
| Tool | Purpose |
|------|----------|
| **uv** | Package manager (fast) |
| **pip** | Traditional package manager |
| **pre-commit** | Git hooks |
| **ruff** | Linter & formatter |
| **pytest** | Testing framework |
| **mypy** | Type checking |
### 7.2 Frontend Development
| Tool | Purpose |
|------|----------|
| **npm/pnpm** | Package manager |
| **ESLint** | Linting |
| **Prettier** | Code formatting |
| **Jest** | Testing |
| **Storybook** | Component development |
| **Husky** | Git hooks |
### 7.3 Version Control & CI/CD
| Tool | Purpose |
|------|----------|
| **Git** | Version control |
| **GitHub Actions** | CI/CD |
| **Docker Hub** | Image registry |
---
## 8. Monitoring & Observability
### 8.1 Logging
| Library | Purpose |
|---------|----------|
| **Python logging** | Standard logging |
| **structlog** | Structured logging |
### 8.2 Tracing
| Integration | Purpose |
|-------------|----------|
| **Langfuse** | LLM observability |
| **OpenTelemetry** | Distributed tracing |
### 8.3 Metrics
| Tool | Purpose |
|------|----------|
| **Prometheus** | Metrics collection |
| **Grafana** | Visualization |
---
## 9. Third-party Integrations
### 9.1 LLM Providers
```
┌─────────────────────────────────────────────────────────────┐
│                    LLM Provider Support                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Commercial APIs:                                           │
│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐     │
│  │ OpenAI │ │ Claude │ │ Gemini │ │ Cohere │ │  Groq  │     │
│  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘     │
│                                                             │
│  China Providers:                                           │
│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐     │
│  │  Qwen  │ │ Zhipu  │ │Baichuan│ │ Spark  │ │ ERNIE  │     │
│  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘     │
│                                                             │
│  Self-hosted:                                               │
│  ┌────────┐ ┌────────┐ ┌────────┐                           │
│  │ Ollama │ │  vLLM  │ │LocalAI │                           │
│  └────────┘ └────────┘ └────────┘                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
### 9.2 Data Source Connectors
| Category | Services |
|----------|----------|
| **Enterprise Wiki** | Confluence, Notion, SharePoint |
| **Communication** | Slack, Discord, Gmail, Teams |
| **Cloud Storage** | Google Drive, Dropbox, S3, WebDAV |
| **Development** | GitHub, Jira |
| **Education** | Moodle |
| **Finance** | TuShare, AkShare, Yahoo Finance |
### 9.3 Search APIs
| Service | Purpose |
|---------|----------|
| **Tavily** | AI-optimized web search |
| **Google Search** | Web search |
| **Google Scholar** | Academic search |
| **SearXNG** | Meta search |
| **ArXiv** | Academic papers |
| **Wikipedia** | Knowledge lookup |
---
## 10. System Requirements
### 10.1 Minimum Requirements
| Resource | Minimum | Recommended |
|----------|---------|-------------|
| **CPU** | 4 cores | 8+ cores |
| **RAM** | 16 GB | 32+ GB |
| **Disk** | 50 GB | 200+ GB SSD |
| **GPU** | - | NVIDIA 8GB+ VRAM |
### 10.2 Software Requirements
| Software | Version |
|----------|---------|
| **Docker** | 20.10+ |
| **Docker Compose** | 2.0+ |
| **Python** | 3.10-3.12 |
| **Node.js** | 18.20.4+ |
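
The Python range above corresponds to the `requires-python = ">=3.10,<3.13"` constraint. A minimal sketch of checking an interpreter against it before installing; `is_supported_python` is a hypothetical helper, not part of RAGFlow.

```python
import sys


def is_supported_python(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """True when (major, minor) satisfies >=3.10,<3.13."""
    return (3, 10) <= version < (3, 13)


if not is_supported_python():
    print(f"Unsupported Python {sys.version_info.major}.{sys.version_info.minor}")
```

Comparing tuples avoids the pitfall of comparing version strings, where `"3.9"` sorts after `"3.10"`.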
### 10.3 Port Requirements
| Port | Service |
|------|---------|
| 80/443 | Nginx (HTTP/HTTPS) |
| 9380 | RAGFlow API |
| 9381 | Admin Server |
| 9200 | Elasticsearch |
| 5455 | MySQL |
| 6379 | Redis |
| 9000/9001 | MinIO |
---
## 11. Tech Stack Summary
### Production Stack
```
Frontend: React 18 + TypeScript + UmiJS + Ant Design + Tailwind
Backend: Python 3.11 + Flask/Quart + Peewee
AI/ML: OpenAI + Sentence Transformers + Detectron2
Database: MySQL 8 + Elasticsearch 8
Cache: Redis 7
Storage: MinIO
Proxy: Nginx
Container: Docker + Docker Compose
Orchestration: Kubernetes + Helm
```
### Development Stack
```
Package Mgmt: uv (Python), npm (Node.js)
Linting: ruff (Python), ESLint (JS/TS)
Testing: pytest (Python), Jest (JS/TS)
CI/CD: GitHub Actions
Version Ctrl: Git
```
### Key Architectural Choices
1. **Async-first**: Quart (ASGI) for high concurrency
2. **Hybrid Search**: Vector + BM25 in Elasticsearch
3. **Multi-tenant**: Data isolation per tenant
4. **Pluggable LLMs**: Abstract interface for multiple providers
5. **Containerized**: Full Docker deployment
6. **Event-driven**: Background processing with a Redis queue
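
The "Pluggable LLMs" choice can be sketched as an abstract base class plus a registry keyed by provider name. Everything below (`ChatModel`, `register`, `get_model`, and the toy `EchoModel`) is hypothetical and only shows the shape of such an interface; RAGFlow's actual class names and method signatures may differ.

```python
from abc import ABC, abstractmethod


class ChatModel(ABC):
    """Hypothetical common interface that every provider adapter implements."""

    @abstractmethod
    def chat(self, prompt: str) -> str: ...


_REGISTRY: dict[str, type[ChatModel]] = {}


def register(name: str):
    """Class decorator that records an adapter under a provider name."""
    def wrap(cls: type[ChatModel]) -> type[ChatModel]:
        _REGISTRY[name] = cls
        return cls
    return wrap


@register("echo")
class EchoModel(ChatModel):
    """Toy provider used only for illustration; it makes no network calls."""

    def chat(self, prompt: str) -> str:
        return f"echo: {prompt}"


def get_model(name: str) -> ChatModel:
    """Instantiate the adapter registered under the given provider name."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown provider: {name}") from None


print(get_model("echo").chat("hi"))  # echo: hi
```

Callers depend only on `ChatModel`, so adding a provider means adding one decorated class rather than touching call sites.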