ragflow/personal_analyze/01_directory_structure.md
Claude c7cecf9a1f
docs: Add comprehensive RAGFlow analysis documentation
- Add directory structure analysis (01_directory_structure.md)
- Add system architecture with diagrams (02_system_architecture.md)
- Add sequence diagrams for main flows (03_sequence_diagrams.md)
- Add detailed modules analysis (04_modules_analysis.md)
- Add tech stack documentation (05_tech_stack.md)
- Add source code analysis (06_source_code_analysis.md)
- Add README summary for personal_analyze folder

This documentation provides:
- Complete codebase structure overview
- System architecture diagrams (ASCII art)
- Sequence diagrams for authentication, RAG, chat, agent flows
- Detailed analysis of API, RAG, DeepDoc, Agent, GraphRAG modules
- Full tech stack with 150+ dependencies analyzed
- Source code patterns and best practices analysis
2025-11-26 10:20:05 +00:00

RAGFlow - Directory Structure

Overview

RAGFlow (v0.22.1) is an open-source RAG (Retrieval-Augmented Generation) engine built on deep document understanding. The project uses a full-stack architecture with a Python backend and a React/TypeScript frontend.

Detailed Directory Structure

ragflow/
│
├── api/                          # [BACKEND] Flask API Server
│   ├── ragflow_server.py         # Main entry point
│   ├── settings.py               # Server configuration
│   ├── constants.py              # API constants (API_VERSION = "v1")
│   ├── validation.py             # Request validation
│   │
│   ├── apps/                     # Flask Blueprints - API endpoints
│   │   ├── kb_app.py             # Knowledge Base management
│   │   ├── document_app.py       # Document processing
│   │   ├── dialog_app.py         # Chat/Dialog handling
│   │   ├── canvas_app.py         # Agent workflow canvas
│   │   ├── file_app.py           # File upload/management
│   │   ├── chunk_app.py          # Document chunking
│   │   ├── conversation_app.py   # Conversation management
│   │   ├── search_app.py         # Search functionality
│   │   ├── system_app.py         # System configuration
│   │   ├── llm_app.py            # LLM model management
│   │   ├── connector_app.py      # Data source connectors
│   │   ├── mcp_server_app.py     # MCP server integration
│   │   ├── langfuse_app.py       # Langfuse observability
│   │   ├── api_app.py            # API key management
│   │   ├── plugin_app.py         # Plugin management
│   │   ├── tenant_app.py         # Multi-tenancy
│   │   ├── user_app.py           # User management
│   │   │
│   │   ├── auth/                 # Authentication modules
│   │   │   ├── oauth.py          # OAuth base
│   │   │   ├── github.py         # GitHub OAuth
│   │   │   └── oidc.py           # OpenID Connect
│   │   │
│   │   └── sdk/                  # SDK REST API endpoints
│   │       ├── dataset.py        # Dataset API
│   │       ├── doc.py            # Document API
│   │       ├── chat.py           # Chat API
│   │       ├── session.py        # Session API
│   │       ├── files.py          # File API
│   │       ├── agents.py         # Agent API
│   │       └── dify_retrieval.py # Dify integration
│   │
│   ├── db/                       # Database layer
│   │   ├── db_models.py          # Peewee ORM models (54KB)
│   │   ├── db_utils.py           # Database utilities
│   │   ├── init_data.py          # Initial data seeding
│   │   ├── runtime_config.py     # Runtime configuration
│   │   │
│   │   ├── services/             # Business logic services
│   │   │   ├── user_service.py           # User operations
│   │   │   ├── dialog_service.py         # Dialog logic (37KB)
│   │   │   ├── document_service.py       # Document processing (39KB)
│   │   │   ├── file_service.py           # File handling (22KB)
│   │   │   ├── knowledgebase_service.py  # KB management (21KB)
│   │   │   ├── task_service.py           # Task queue (20KB)
│   │   │   ├── canvas_service.py         # Canvas logic (12KB)
│   │   │   ├── conversation_service.py   # Conversation handling
│   │   │   ├── connector_service.py      # Connector management
│   │   │   ├── llm_service.py            # LLM operations
│   │   │   ├── search_service.py         # Search operations
│   │   │   └── api_service.py            # API token service
│   │   │
│   │   └── joint_services/       # Cross-service operations
│   │
│   └── utils/                    # API utilities
│       ├── api_utils.py          # API helpers
│       ├── file_utils.py         # File utilities
│       ├── crypt.py              # Encryption
│       └── log_utils.py          # Logging
│
├── rag/                          # [CORE] RAG Processing Engine
│   ├── settings.py               # RAG configuration
│   ├── raptor.py                 # RAPTOR algorithm
│   ├── benchmark.py              # Performance testing
│   │
│   ├── llm/                      # LLM Model Abstractions
│   │   ├── chat_model.py         # Chat LLM interface
│   │   ├── embedding_model.py    # Embedding models
│   │   ├── rerank_model.py       # Reranking models
│   │   ├── cv_model.py           # Computer vision
│   │   ├── tts_model.py          # Text-to-speech
│   │   └── sequence2txt_model.py # Sequence to text
│   │
│   ├── flow/                     # RAG Pipeline
│   │   ├── pipeline.py           # Main pipeline
│   │   ├── file.py               # File processing
│   │   │
│   │   ├── parser/               # Document parsing
│   │   │   ├── parser.py
│   │   │   └── schema.py
│   │   │
│   │   ├── extractor/            # Information extraction
│   │   │   ├── extractor.py
│   │   │   └── schema.py
│   │   │
│   │   ├── tokenizer/            # Text tokenization
│   │   │   ├── tokenizer.py
│   │   │   └── schema.py
│   │   │
│   │   ├── splitter/             # Document chunking
│   │   │   ├── splitter.py
│   │   │   └── schema.py
│   │   │
│   │   └── hierarchical_merger/  # Hierarchical merging
│   │       ├── hierarchical_merger.py
│   │       └── schema.py
│   │
│   ├── app/                      # RAG application logic
│   ├── nlp/                      # NLP utilities
│   ├── utils/                    # RAG utilities
│   └── prompts/                  # LLM prompt templates
│
├── deepdoc/                      # [DOCUMENT] Deep Document Understanding
│   ├── parser/                   # Multi-format parsers
│   │   ├── pdf_parser.py         # PDF with layout analysis
│   │   ├── docx_parser.py        # Word documents
│   │   ├── ppt_parser.py         # PowerPoint
│   │   ├── excel_parser.py       # Excel spreadsheets
│   │   ├── html_parser.py        # HTML pages
│   │   ├── markdown_parser.py    # Markdown files
│   │   ├── json_parser.py        # JSON data
│   │   ├── txt_parser.py         # Plain text
│   │   ├── figure_parser.py      # Image/figure extraction
│   │   │
│   │   └── resume/               # Resume parsing
│   │       ├── step_one.py
│   │       └── step_two.py
│   │
│   └── vision/                   # Computer vision modules
│
├── agent/                        # [AGENT] Agentic Workflow System
│   ├── canvas.py                 # Canvas orchestration (25KB)
│   ├── settings.py               # Agent configuration
│   │
│   ├── component/                # Workflow components
│   │   ├── begin.py              # Workflow start
│   │   ├── llm.py                # LLM invocation
│   │   ├── agent_with_tools.py   # Agent with tools
│   │   ├── retrieval.py          # Document retrieval
│   │   ├── categorize.py         # Message categorization
│   │   ├── message.py            # Message handling
│   │   ├── webhook.py            # Webhook triggers
│   │   ├── iteration.py          # Loop iteration
│   │   └── variable_assigner.py  # Variable assignment
│   │
│   ├── tools/                    # External tool integrations
│   │   ├── tavily.py             # Web search
│   │   ├── arxiv.py              # Academic papers
│   │   ├── github.py             # GitHub API
│   │   ├── google.py             # Google Search
│   │   ├── wikipedia.py          # Wikipedia
│   │   ├── email.py              # Email sending
│   │   ├── code_exec.py          # Code execution
│   │   └── yahoofinance.py       # Financial data
│   │
│   └── templates/                # Pre-built workflows
│
├── graphrag/                     # [GRAPH] Knowledge Graph RAG
│   ├── entity_resolution.py      # Entity linking (12KB)
│   ├── search.py                 # Graph search (14KB)
│   ├── utils.py                  # Graph utilities (23KB)
│   ├── general/                  # General graph operations
│   └── light/                    # Lightweight implementations
│
├── web/                          # [FRONTEND] React/TypeScript
│   ├── package.json              # NPM dependencies (172 packages)
│   ├── .umirc.ts                 # UmiJS configuration
│   ├── tailwind.config.js        # Tailwind CSS config
│   │
│   └── src/
│       ├── pages/                # UmiJS page routes
│       │   ├── admin/            # Admin dashboard
│       │   ├── dataset/          # Knowledge base management
│       │   ├── datasets/         # Datasets list
│       │   ├── knowledge/        # Knowledge management
│       │   ├── next-chats/       # Chat interface
│       │   ├── next-searches/    # Search interface
│       │   ├── document-viewer/  # Document preview
│       │   ├── login/            # Authentication
│       │   └── register/         # User registration
│       │
│       ├── components/           # React components
│       │   ├── file-upload-modal/
│       │   ├── pdf-drawer/
│       │   ├── prompt-editor/
│       │   ├── document-preview/
│       │   └── ui/               # Shadcn/UI components
│       │
│       ├── services/             # API client services
│       ├── hooks/                # React hooks
│       ├── interfaces/           # TypeScript interfaces
│       ├── utils/                # Utility functions
│       ├── constants/            # Constants
│       └── locales/              # i18n translations
│
├── common/                       # [SHARED] Common Utilities
│   ├── settings.py               # Main configuration (11KB)
│   ├── config_utils.py           # Config utilities
│   ├── connection_utils.py       # Database connections
│   ├── constants.py              # Global constants
│   ├── exceptions.py             # Exception definitions
│   │
│   │                             # Utility modules
│   ├── log_utils.py              # Logging setup
│   ├── file_utils.py             # File operations
│   ├── string_utils.py           # String utilities
│   ├── token_utils.py            # Token operations
│   ├── time_utils.py             # Time utilities
│   │
│   └── data_source/              # Data source connectors
│       ├── confluence_connector.py (81KB)
│       ├── notion_connector.py (25KB)
│       ├── slack_connector.py (22KB)
│       ├── gmail_connector.py
│       ├── discord_connector.py
│       ├── sharepoint_connector.py
│       ├── dropbox_connector.py
│       └── google_drive/
│
├── sdk/                          # [SDK] Python Client Library
│   └── python/
│       └── ragflow_sdk/          # SDK implementation
│
├── mcp/                          # [MCP] Model Context Protocol
│   ├── server/                   # MCP server
│   │   └── server.py
│   └── client/                   # MCP client
│       └── client.py
│
├── admin/                        # [ADMIN] Admin Interface
│   ├── server/                   # Admin backend
│   └── client/                   # Admin frontend
│
├── plugin/                       # [PLUGIN] Plugin System
│   ├── plugin_manager.py         # Plugin management
│   ├── llm_tool_plugin.py        # LLM tool plugins
│   └── embedded_plugins/         # Built-in plugins
│
├── docker/                       # [DEPLOYMENT] Docker Configuration
│   ├── docker-compose.yml        # Main compose file
│   ├── docker-compose-base.yml   # Base services
│   ├── .env                      # Environment variables
│   ├── entrypoint.sh             # Container entry
│   ├── service_conf.yaml.template # Service config
│   ├── nginx/                    # Nginx configuration
│   │   └── nginx.conf
│   └── init.sql                  # Database init
│
├── conf/                         # [CONFIG] Configuration Files
│   ├── llm_factories.json        # LLM providers
│   ├── mapping.json              # Field mappings
│   ├── service_conf.yaml         # Service configuration
│   ├── private.pem               # RSA private key
│   └── public.pem                # RSA public key
│
├── test/                         # [TEST] Testing Suite
│   ├── unit_test/                # Unit tests
│   │   └── common/               # Common utilities tests
│   │
│   └── testcases/                # Integration tests
│       ├── test_http_api/        # HTTP API tests
│       ├── test_sdk_api/         # SDK tests
│       └── test_web_api/         # Web API tests
│
├── example/                      # [EXAMPLES] Usage Examples
│   ├── http/                     # HTTP API examples
│   └── sdk/                      # SDK examples
│
├── intergrations/                # [INTEGRATIONS] Third-party
│   ├── chatgpt-on-wechat/        # WeChat integration
│   ├── extension_chrome/         # Chrome extension
│   └── firecrawl/                # Web scraping
│
├── agentic_reasoning/            # [REASONING] Advanced reasoning
├── sandbox/                      # [SANDBOX] Code execution
├── helm/                         # [K8S] Kubernetes Helm charts
├── docs/                         # [DOCS] Documentation
│
├── pyproject.toml                # Python project config
├── CLAUDE.md                     # Development guidelines
└── README.md                     # Project overview

Detailed Description of the Main Directories

1. /api/ - Backend API Server

  • Role: Handles all HTTP requests, authentication, and business logic
  • Framework: Flask/Quart (async ASGI)
  • Default port: 9380
  • Entry point: ragflow_server.py
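
As a rough illustration of how the per-file apps under `api/apps/` could map onto URL prefixes, the sketch below derives a prefix from a module path and the `API_VERSION` constant noted in the tree. The `_app` suffix stripping, the shared `/api/v1` prefix for SDK modules, and the helper name `url_prefix_for` are assumptions made for illustration, not confirmed behavior of `ragflow_server.py`.

```python
from pathlib import PurePosixPath

API_VERSION = "v1"  # value noted for api/constants.py in the tree above


def url_prefix_for(module_path: str) -> str:
    """Derive a URL prefix from a module path (hypothetical helper)."""
    path = PurePosixPath(module_path)
    # Assumption: modules under an sdk/ directory share one public prefix.
    if "sdk" in path.parts:
        return f"/api/{API_VERSION}"
    # Assumption: kb_app.py -> "kb", document_app.py -> "document", ...
    page_name = path.stem.removesuffix("_app")
    return f"/{API_VERSION}/{page_name}"


for module in ("api/apps/kb_app.py", "api/apps/sdk/dataset.py"):
    print(module, "->", url_prefix_for(module))
```

Deriving the prefix from the filename keeps one naming convention for both the module and its routes, so adding an endpoint group would only require adding a file.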

2. /rag/ - RAG Processing Engine

  • Role: Runs the RAG pipeline, from document parsing through retrieval
  • Main functions:
    • Document parsing và extraction
    • Text tokenization
    • Semantic chunking
    • Embedding generation
    • Reranking
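
The stages listed above can be sketched end to end with toy stand-ins. This is a minimal sketch, not RAGFlow's code: the fixed-size sliding window replaces semantic chunking, bag-of-words counts replace a real embedding model, and cosine similarity replaces a reranking model. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer (toy stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(tokens: list[str], size: int = 8, overlap: int = 2) -> list[list[str]]:
    """Fixed-size sliding-window chunking; requires size > overlap."""
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]


def embed(tokens: list[str]) -> Counter:
    """Bag-of-words counts as a toy 'embedding'."""
    return Counter(tokens)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(document: str, query: str, top_k: int = 1) -> list[str]:
    """Tokenize -> chunk -> embed -> rank chunks against the query."""
    chunks = chunk(tokenize(document))
    q = embed(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return [" ".join(c) for c in ranked[:top_k]]
```

The overlap between neighboring chunks is there so that a sentence straddling a chunk boundary still appears whole in at least one chunk.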

3. /deepdoc/ - Document Understanding

  • Role: Deep document parsing with layout analysis
  • Supported formats: PDF, Word, PPT, Excel, HTML, Markdown, JSON, TXT
  • Notable: OCR and layout analysis for PDF
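
One way to picture the multi-format support is a lookup from file extension to the parser modules named in the tree. The module names come from `deepdoc/parser/`; the extension lists and the `parser_for` helper are assumptions for illustration and may not match how RAGFlow actually selects a parser.

```python
from pathlib import PurePath

# Extension -> parser module, using the file names listed under deepdoc/parser/.
# The extension lists themselves are assumptions for illustration.
PARSER_BY_EXTENSION = {
    ".pdf": "pdf_parser",
    ".docx": "docx_parser",
    ".ppt": "ppt_parser",
    ".pptx": "ppt_parser",
    ".xls": "excel_parser",
    ".xlsx": "excel_parser",
    ".htm": "html_parser",
    ".html": "html_parser",
    ".md": "markdown_parser",
    ".json": "json_parser",
    ".txt": "txt_parser",
}


def parser_for(filename: str) -> str:
    """Return the parser module name for a file (hypothetical helper)."""
    ext = PurePath(filename).suffix.lower()
    try:
        return PARSER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {filename!r}") from None


print(parser_for("report.PDF"))
```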

4. /agent/ - Agentic Workflow

  • Role: Agent workflow system with a visual canvas
  • Components: LLM, Retrieval, Categorize, Webhook, Iteration...
  • Tools: Tavily, Google, Wikipedia, GitHub, Email...
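
Conceptually, a canvas chains components so that each one receives the state produced by the previous one. The sketch below is a simplified, hypothetical model of that idea. The `CANVAS` layout, the state dictionary, and the stand-in `retrieval` and `llm` functions are assumptions, and the real `canvas.py` is considerably more involved.

```python
from typing import Callable

# Each component takes the running state dict and returns an updated one.
Component = Callable[[dict], dict]


def begin(state: dict) -> dict:
    return {**state, "trace": ["begin"]}


def retrieval(state: dict) -> dict:
    # Stand-in for a knowledge-base lookup: case-insensitive substring match.
    docs = [d for d in state["corpus"] if state["query"].lower() in d.lower()]
    return {**state, "docs": docs, "trace": state["trace"] + ["retrieval"]}


def llm(state: dict) -> dict:
    # Stand-in for a model call: just reports how many documents were found.
    answer = f"{len(state['docs'])} document(s) match '{state['query']}'"
    return {**state, "answer": answer, "trace": state["trace"] + ["llm"]}


# Hypothetical canvas: component id -> (callable, downstream component ids).
CANVAS: dict[str, tuple[Component, list[str]]] = {
    "begin": (begin, ["retrieval"]),
    "retrieval": (retrieval, ["llm"]),
    "llm": (llm, []),
}


def run(canvas: dict, state: dict, start: str = "begin") -> dict:
    """Walk a linear chain of components, threading state through each."""
    current = start
    while current is not None:
        component, downstream = canvas[current]
        state = component(state)
        current = downstream[0] if downstream else None
    return state


result = run(CANVAS, {"query": "pdf", "corpus": ["PDF parsing", "Chat API"]})
print(result["answer"])
```

Branching components such as Categorize or Iteration would need more than the single-successor walk shown here.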

5. /graphrag/ - Knowledge Graph

  • Role: Builds and queries the knowledge graph
  • Functions: Entity resolution, graph search, relationship extraction
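
To make entity resolution and graph search concrete, here is a toy sketch under strong simplifying assumptions. Entities are "resolved" by case and whitespace normalization only, and search is a plain breadth-first traversal. Neither is claimed to be what `entity_resolution.py` or `search.py` does, and all names are hypothetical.

```python
from collections import defaultdict, deque


def resolve(name: str) -> str:
    """Toy entity resolution: case- and whitespace-insensitive matching."""
    return " ".join(name.lower().split())


def build_graph(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples."""
    graph = defaultdict(set)
    for head, _relation, tail in triples:
        h, t = resolve(head), resolve(tail)
        graph[h].add(t)
        graph[t].add(h)
    return graph


def neighbors_within(graph, start: str, max_hops: int) -> set[str]:
    """Breadth-first search for entities reachable within max_hops edges."""
    start = resolve(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen - {start}
```

Because `"RAGFlow"` and `"ragflow "` normalize to the same key, triples mentioning either spelling attach to a single node.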

6. /web/ - Frontend

  • Framework: React + TypeScript + UmiJS
  • UI: Ant Design + Shadcn/UI + Tailwind CSS
  • State: Zustand
  • Port: 80/443 (via Nginx)

7. /common/ - Shared Utilities

  • Role: Shared utilities and connectors
  • Data sources: Confluence, Notion, Slack, Gmail, SharePoint...

8. /docker/ - Deployment

  • Services: MySQL, Elasticsearch/Infinity, Redis, MinIO, Nginx
  • Modes: CPU/GPU, single/cluster

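Summary Statistics

| Directory | File count | Description |
|-----------|------------|-------------|
| api/ | ~100+ | Backend API |
| rag/ | ~50+ | RAG engine |
| deepdoc/ | ~30+ | Document parsers |
| agent/ | ~40+ | Agent system |
| graphrag/ | ~20+ | Knowledge graph |
| web/src/ | ~200+ | Frontend |
| common/ | ~50+ | Shared utilities |
| test/ | ~80+ | Test suite |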
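
Approximate per-directory file counts like these could be reproduced with a short script. The sketch below is one possible approach using only the standard library. The `count_files` helper and its `skip` list are assumptions, and it counts every regular file, so its numbers may differ from the estimates above depending on what is excluded.

```python
from collections import Counter
from pathlib import Path


def count_files(root: str, skip=(".git", "node_modules")) -> Counter:
    """Count regular files under each top-level directory of root."""
    counts = Counter()
    root_path = Path(root)
    for path in root_path.rglob("*"):
        rel = path.relative_to(root_path)
        if not path.is_file() or len(rel.parts) < 2:
            continue  # skip directories and files sitting directly in root
        if any(part in skip for part in rel.parts):
            continue  # skip version-control and dependency directories
        counts[rel.parts[0]] += 1
    return counts
```

Calling `count_files` on a local checkout returns a mapping such as `{"api": ..., "rag": ...}`; note that it groups by top-level directory, so `web/src/` would be reported under `web`.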