ragflow/personal_analyze/05_tech_stack.md
Claude c7cecf9a1f
docs: Add comprehensive RAGFlow analysis documentation
- Add directory structure analysis (01_directory_structure.md)
- Add system architecture with diagrams (02_system_architecture.md)
- Add sequence diagrams for main flows (03_sequence_diagrams.md)
- Add detailed modules analysis (04_modules_analysis.md)
- Add tech stack documentation (05_tech_stack.md)
- Add source code analysis (06_source_code_analysis.md)
- Add README summary for personal_analyze folder

This documentation provides:
- Complete codebase structure overview
- System architecture diagrams (ASCII art)
- Sequence diagrams for authentication, RAG, chat, agent flows
- Detailed analysis of API, RAG, DeepDoc, Agent, GraphRAG modules
- Full tech stack with 150+ dependencies analyzed
- Source code patterns and best practices analysis
2025-11-26 10:20:05 +00:00

20 KiB

RAGFlow - Tech Stack Analysis

1. Tổng Quan Tech Stack

RAGFlow sử dụng một tech stack hiện đại, được thiết kế để xử lý các workload AI/ML nặng với khả năng scale tốt.

┌─────────────────────────────────────────────────────────────────────────┐
│                           TECH STACK OVERVIEW                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                        FRONTEND                                     │ │
│  │  React 18 │ TypeScript │ UmiJS │ Ant Design │ Tailwind CSS         │ │
│  │  Zustand │ TanStack Query │ XYFlow │ Monaco Editor                 │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                        BACKEND                                      │ │
│  │  Python 3.10-3.12 │ Flask/Quart │ Peewee ORM │ Celery              │ │
│  │  AsyncIO │ JWT │ SSE Streaming                                      │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                        AI/ML                                        │ │
│  │  LangChain │ OpenAI │ Sentence Transformers │ Hugging Face         │ │
│  │  PyTorch │ Detectron2 │ Tesseract OCR                              │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                        DATA LAYER                                   │ │
│  │  MySQL 8 │ Elasticsearch 8 │ Redis │ MinIO │ Infinity              │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                        INFRASTRUCTURE                               │ │
│  │  Docker │ Docker Compose │ Kubernetes │ Nginx │ Helm               │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

2. Frontend Technologies

2.1 Core Framework

Technology Version Mục đích
React 18.x UI library chính
TypeScript 5.x Type-safe JavaScript
UmiJS 4.x React framework (Ant Design ecosystem)
Vite 5.x Build tool (nhanh hơn Webpack)

2.2 UI Libraries

Library Version Mục đích
Ant Design 5.x Primary UI component library
Shadcn/UI Latest Modern, customizable components
Radix UI Latest Headless UI primitives
Tailwind CSS 3.x Utility-first CSS framework
LESS 4.x CSS preprocessor (legacy)

2.3 State Management & Data Fetching

Library Mục đích
Zustand Lightweight state management
TanStack React Query Server state & caching
Axios HTTP client

2.4 Specialized Libraries

Library Mục đích
XYFlow (React Flow) Workflow/canvas visualization
Monaco Editor Code editor (VS Code core)
AntV G2/G6 Data visualization & graphs
Recharts Charts and analytics
Lexical Rich text editor (Facebook)
React Markdown Markdown rendering
i18next Internationalization
React Hook Form Form handling
Zod Schema validation

2.5 Package.json Dependencies (172 packages)

{
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "umi": "^4.0.0",
    "antd": "^5.0.0",
    "@tanstack/react-query": "^5.0.0",
    "zustand": "^4.0.0",
    "axios": "^1.0.0",
    "tailwindcss": "^3.0.0",
    "@xyflow/react": "^12.0.0",
    "@monaco-editor/react": "^4.0.0",
    "lexical": "^0.12.0",
    "react-markdown": "^9.0.0",
    "i18next": "^23.0.0",
    "react-hook-form": "^7.0.0",
    "zod": "^3.0.0",
    "@radix-ui/react-*": "latest",
    "@ant-design/icons": "^5.0.0",
    "@antv/g2": "^5.0.0",
    "@antv/g6": "^5.0.0"
  }
}

3. Backend Technologies

3.1 Core Framework

Technology Version Mục đích
Python 3.10-3.12 Programming language
Flask 3.x Web framework
Quart 0.19.x Async Flask (ASGI)
Hypercorn Latest ASGI server

3.2 Database & ORM

Technology Mục đích
Peewee Lightweight ORM (primary)
SQLAlchemy Advanced ORM operations
PyMySQL MySQL driver

3.3 Authentication & Security

Library Mục đích
PyJWT JWT token handling
bcrypt Password hashing
python-jose JOSE implementation
Authlib OAuth integration

3.4 Async & Background Tasks

Library Mục đích
asyncio Async I/O
aiohttp Async HTTP client
Redis/Valkey Task queue & caching
APScheduler Job scheduling

3.5 API & Documentation

Library Mục đích
Flasgger Swagger/OpenAPI docs
Flask-CORS CORS handling
Werkzeug WSGI utilities

3.6 pyproject.toml Dependencies (150+ packages)

[project]
name = "ragflow"
version = "0.22.1"
requires-python = ">=3.10,<3.13"

dependencies = [
    # Web Framework
    "flask>=3.0.0",
    "quart>=0.19.0",
    "hypercorn>=0.17.0",
    "flask-cors>=4.0.0",
    "flasgger>=0.9.0",

    # Database
    "peewee>=3.17.0",
    "pymysql>=1.1.0",

    # Authentication
    "pyjwt>=2.8.0",
    "bcrypt>=4.1.0",

    # Async
    "aiohttp>=3.9.0",
    "httpx>=0.27.0",

    # Data Processing
    "pandas>=2.0.0",
    "numpy>=1.26.0",

    # AI/ML (see section 4)
    ...
]

4. AI/ML Technologies

4.1 LLM Integration

Provider Library Models Supported
OpenAI openai>=1.0 GPT-3.5, GPT-4, GPT-4V
Anthropic anthropic>=0.20 Claude 3 family
Google google-generativeai Gemini Pro
Cohere cohere>=5.0 Command, Embed, Rerank
Groq groq>=0.4 LLaMA, Mixtral
Mistral mistralai>=0.1 Mistral 7B, Mixtral
Ollama ollama>=0.1 Local models
HuggingFace huggingface_hub Open source models

4.2 Embedding Models

Library Models
Sentence Transformers all-MiniLM, all-mpnet, etc.
OpenAI Embeddings text-embedding-3-small/large
BGE bge-base, bge-large, bge-m3
Jina jina-embeddings-v2
Cohere embed-english-v3
# Embedding configuration
EMBEDDING_MODELS = {
    "openai": {
        "text-embedding-3-small": {"dim": 1536, "max_tokens": 8191},
        "text-embedding-3-large": {"dim": 3072, "max_tokens": 8191},
    },
    "bge": {
        "bge-base-en-v1.5": {"dim": 768, "max_tokens": 512},
        "bge-large-en-v1.5": {"dim": 1024, "max_tokens": 512},
        "bge-m3": {"dim": 1024, "max_tokens": 8192},
    },
    "sentence-transformers": {
        "all-MiniLM-L6-v2": {"dim": 384, "max_tokens": 256},
        "all-mpnet-base-v2": {"dim": 768, "max_tokens": 384},
    }
}

4.3 Document Processing

Library Mục đích
PyMuPDF (fitz) PDF text extraction
pdf2image PDF to image conversion
Tesseract (pytesseract) OCR
python-docx Word document parsing
openpyxl Excel parsing
python-pptx PowerPoint parsing
BeautifulSoup4 HTML parsing
markdown Markdown processing
camelot-py Table extraction from PDF
tabula-py Alternative table extraction

4.4 Computer Vision

Library Mục đích
Detectron2 Layout analysis
LayoutLM Document understanding
OpenCV Image processing
Pillow Image manipulation
YOLO Object detection

4.5 NLP & Text Processing

Library Mục đích
tiktoken OpenAI tokenization
nltk Natural language toolkit
spaCy NLP pipeline
regex Advanced regex
chardet Character encoding detection

4.6 Vector Operations

Library Mục đích
NumPy Numerical operations
SciPy Scientific computing
scikit-learn ML utilities, clustering
faiss-cpu/gpu Vector similarity search

5. Data Storage Technologies

5.1 Relational Database

Technology Mục đích Configuration
MySQL 8.0 Primary database Port 5455
PostgreSQL Alternative (supported) -

MySQL Schema Design:

  • InnoDB engine
  • UTF8MB4 character set
  • JSON columns for flexible data
  • Foreign keys for integrity

5.2 Vector/Search Database

Technology Mục đích Configuration
Elasticsearch 8.12 Default vector store Port 9200
Infinity Alternative (in-house) Port 23817
OpenSearch Alternative Port 9200
OceanBase Alternative (distributed) -

Elasticsearch Configuration:

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "ik_smart": { "type": "ik_smart" },
        "ik_max_word": { "type": "ik_max_word" }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "ik_smart" },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

5.3 Cache & Message Queue

Technology Mục đích Configuration
Redis 7.x Cache, sessions, queue Port 6379
Valkey Redis alternative Port 6379

Redis Usage:

  • Session storage
  • Rate limiting
  • Task queue (custom implementation)
  • Cache layer

5.4 Object Storage

Technology Mục đích Configuration
MinIO S3-compatible storage Port 9000/9001
AWS S3 Cloud storage option -
Azure Blob Cloud storage option -

MinIO Structure:

ragflow/                    # Bucket
├── {tenant_id}/
│   ├── {kb_id}/
│   │   ├── {file_id}      # Original files
│   │   └── chunks/        # Processed chunks
│   └── temp/              # Temporary files
└── system/                # System files

6. Infrastructure Technologies

6.1 Containerization

Technology Mục đích
Docker Container runtime
Docker Compose Multi-container orchestration
BuildKit Efficient image building

Docker Images:

services:
  ragflow-server:
    image: infiniflow/ragflow:latest
    # or: ragflow:nightly for development

  mysql:
    image: mysql:8.0

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0

  redis:
    image: redis:7-alpine

  minio:
    image: minio/minio:latest

6.2 Web Server & Proxy

Technology Mục đích Configuration
Nginx Reverse proxy, static files Port 80/443
Hypercorn ASGI server Port 9380

Nginx Configuration:

upstream ragflow {
    server ragflow-server:9380;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://ragflow;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    location / {
        root /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }
}

6.3 Kubernetes Deployment

Technology Mục đích
Kubernetes Container orchestration
Helm K8s package manager

Helm Chart Structure:

helm/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   └── ingress.yaml

7. Development Tools

7.1 Python Development

Tool Mục đích
uv Package manager (fast)
pip Traditional package manager
pre-commit Git hooks
ruff Linter & formatter
pytest Testing framework
mypy Type checking

7.2 Frontend Development

Tool Mục đích
npm/pnpm Package manager
ESLint Linting
Prettier Code formatting
Jest Testing
Storybook Component development
Husky Git hooks

7.3 Version Control & CI/CD

Tool Mục đích
Git Version control
GitHub Actions CI/CD
Docker Hub Image registry

8. Monitoring & Observability

8.1 Logging

Library Mục đích
Python logging Standard logging
structlog Structured logging

8.2 Tracing

Integration Mục đích
Langfuse LLM observability
OpenTelemetry Distributed tracing

8.3 Metrics

Tool Mục đích
Prometheus Metrics collection
Grafana Visualization

9. Third-party Integrations

9.1 LLM Providers

┌─────────────────────────────────────────────────────────────┐
│                    LLM Provider Support                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Commercial APIs:                                            │
│  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐        │
│  │OpenAI │ │Claude │ │Gemini │ │Cohere │ │ Groq  │        │
│  └───────┘ └───────┘ └───────┘ └───────┘ └───────┘        │
│                                                              │
│  China Providers:                                            │
│  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐        │
│  │ Qwen  │ │Zhipu  │ │Baichuan│ │Spark  │ │ERNIE  │        │
│  └───────┘ └───────┘ └───────┘ └───────┘ └───────┘        │
│                                                              │
│  Self-hosted:                                                │
│  ┌───────┐ ┌───────┐ ┌───────┐                             │
│  │Ollama │ │ vLLM  │ │LocalAI│                             │
│  └───────┘ └───────┘ └───────┘                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘

9.2 Data Source Connectors

Category Services
Enterprise Wiki Confluence, Notion, SharePoint
Communication Slack, Discord, Gmail, Teams
Cloud Storage Google Drive, Dropbox, S3, WebDAV
Development GitHub, Jira
Education Moodle
Finance TuShare, AkShare, Yahoo Finance

9.3 Search APIs

Service Mục đích
Tavily AI-optimized web search
Google Search Web search
Google Scholar Academic search
SearXNG Meta search
ArXiv Academic papers
Wikipedia Knowledge lookup

10. System Requirements

10.1 Minimum Requirements

Resource Minimum Recommended
CPU 4 cores 8+ cores
RAM 16 GB 32+ GB
Disk 50 GB 200+ GB SSD
GPU - NVIDIA 8GB+ VRAM

10.2 Software Requirements

Software Version
Docker 20.10+
Docker Compose 2.0+
Python 3.10-3.12
Node.js 18.20.4+

10.3 Port Requirements

Port Service
80/443 Nginx (HTTP/HTTPS)
9380 RAGFlow API
9381 Admin Server
9200 Elasticsearch
5455 MySQL
6379 Redis
9000/9001 MinIO

11. Tóm Tắt Tech Stack

Production Stack

Frontend:     React 18 + TypeScript + UmiJS + Ant Design + Tailwind
Backend:      Python 3.11 + Flask/Quart + Peewee
AI/ML:        OpenAI + Sentence Transformers + Detectron2
Database:     MySQL 8 + Elasticsearch 8
Cache:        Redis 7
Storage:      MinIO
Proxy:        Nginx
Container:    Docker + Docker Compose
Orchestration: Kubernetes + Helm

Development Stack

Package Mgmt: uv (Python), npm (Node.js)
Linting:      ruff (Python), ESLint (JS/TS)
Testing:      pytest (Python), Jest (JS/TS)
CI/CD:        GitHub Actions
Version Ctrl: Git

Key Architectural Choices

  1. Async-first: Quart ASGI cho high concurrency
  2. Hybrid Search: Vector + BM25 trong Elasticsearch
  3. Multi-tenant: Data isolation per tenant
  4. Pluggable LLMs: Abstract interface cho nhiều providers
  5. Containerized: Full Docker deployment
  6. Event-driven: Background processing với Redis queue