RAGFlow - Detailed Module Analysis
1. Module API (/api/)
1.1 Overview
The API module is the central point that handles every HTTP request in the system. It is built on the Flask/Quart framework using a Blueprint architecture.
1.2 Structure
api/
├── ragflow_server.py # Entry point - initializes the Flask app
├── settings.py # Server configuration
├── constants.py # API_VERSION = "v1"
├── validation.py # Request validation
│
├── apps/ # API Blueprints
├── db/ # Database layer
└── utils/ # Utilities
1.3 Blueprint Details (API Apps)
1.3.1 kb_app.py - Knowledge Base Management
Function: Manage Knowledge Bases (create, delete, update, list)
Main endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/kb/create | Create a new KB |
| GET | /api/v1/kb/list | List KBs |
| PUT | /api/v1/kb/update | Update a KB |
| DELETE | /api/v1/kb/delete | Delete a KB |
| GET | /api/v1/kb/{id} | KB details |
Main logic:
- Validate tenant permissions
- Create an Elasticsearch index for each KB
- Manage embedding model settings
- Manage parser configurations
1.3.2 document_app.py - Document Management
Function: Upload, parse, and manage documents
Main endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/document/upload | Upload a file |
| POST | /api/v1/document/run | Trigger parsing |
| GET | /api/v1/document/list | List documents |
| DELETE | /api/v1/document/delete | Delete a document |
| GET | /api/v1/document/{id}/chunks | Get chunks |
Main logic:
- File type validation
- MinIO storage integration
- Background task queuing
- Parsing status tracking
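The file type validation step listed above could look roughly like the following. This is a minimal sketch, not RAGFlow's actual code: `ALLOWED_EXTENSIONS`, `validate_upload`, and the 10 MiB limit are hypothetical names and values chosen only for illustration.

```python
from pathlib import Path

# Hypothetical allow-list; the real set of accepted types is not given in this analysis.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".html", ".txt", ".md"}


def validate_upload(filename: str, size_bytes: int, max_bytes: int = 10 * 1024 * 1024) -> str:
    """Return the normalized extension, or raise ValueError if the file is rejected."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > max_bytes:
        raise ValueError(f"file too large: {size_bytes} > {max_bytes} bytes")
    return suffix
```

Checking the extension before the file is written to object storage keeps rejected uploads from ever reaching the background parsing queue.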
1.3.3 dialog_app.py - Chat/Dialog Management
Function: Handle chat conversations with RAG
Main endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/dialog/create | Create a dialog |
| POST | /api/v1/dialog/chat | Chat (SSE streaming) |
| POST | /api/v1/dialog/completion | Non-streaming chat |
| GET | /api/v1/dialog/list | List dialogs |
Main logic:
- RAG pipeline orchestration
- Streaming response (SSE)
- Conversation history management
- Multi-KB retrieval
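To make the SSE streaming item concrete, here is a minimal sketch of how incremental LLM output can be framed as Server-Sent Events. The payload shape (`answer`, `done`) and the function name `sse_events` are assumptions for illustration; the actual wire format of `/api/v1/dialog/chat` is not documented in this analysis.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame incremental LLM output as Server-Sent Events."""
    answer = ""
    for token in tokens:
        answer += token
        # Each SSE message is a "data:" line terminated by a blank line.
        yield "data: " + json.dumps({"answer": answer}, ensure_ascii=False) + "\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop reading.
    yield "data: " + json.dumps({"done": True}) + "\n\n"
```

A Flask or Quart view would typically return such a generator with the `text/event-stream` content type.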
1.3.4 canvas_app.py - Agent Workflow
Function: Visual workflow builder for AI agents
Main endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/canvas/create | Create a workflow |
| POST | /api/v1/canvas/run | Execute a workflow |
| PUT | /api/v1/canvas/update | Update a workflow |
| GET | /api/v1/canvas/list | List workflows |
Main logic:
- DSL parsing and validation
- Component orchestration
- Tool integration
- Variable passing between nodes
1.3.5 file_app.py - File Management
Function: Upload, download, and manage files
Main endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/file/upload | Upload a file |
| GET | /api/v1/file/download/{id} | Download a file |
| GET | /api/v1/file/list | List files |
| DELETE | /api/v1/file/delete | Delete a file |
1.3.6 search_app.py - Search Operations
Function: Full-text and semantic search
Main endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/search | Hybrid search |
| GET | /api/v1/search/history | Search history |
1.4 Database Services (/api/db/services/)
dialog_service.py (37KB - the most complex service)
class DialogService:
    def chat(self, dialog_id, question, stream=True):
        """
        Main RAG chat function
        1. Load dialog configuration
        2. Get relevant documents (retrieval)
        3. Rerank results
        4. Build prompt with context
        5. Call LLM (streaming)
        6. Save conversation
        """

    def retrieval(self, dialog, question):
        """
        Hybrid retrieval from Elasticsearch
        - Vector similarity search
        - BM25 full-text search
        - Score combination
        """

    def rerank(self, chunks, question):
        """
        Cross-encoder reranking
        - Score each chunk against question
        - Return top-k
        """
document_service.py (39KB)
class DocumentService:
    def upload(self, file, kb_id):
        """Upload file to MinIO, create DB record"""

    def parse(self, doc_id):
        """Queue document for background parsing"""

    def chunk(self, doc_id, chunks):
        """Save parsed chunks to ES and DB"""

    def delete(self, doc_id):
        """Remove doc, chunks, and file"""
knowledgebase_service.py (21KB)
class KnowledgebaseService:
    def create(self, name, embedding_model, parser_id):
        """Create KB with ES index"""

    def update_parser_config(self, kb_id, config):
        """Update chunking/parsing settings"""

    def get_statistics(self, kb_id):
        """Get doc count, chunk count, etc."""
1.5 Database Models (/api/db/db_models.py)
25+ key models (fields listed schematically):
# User & Tenant
class User(BaseModel):
    id, email, password, nickname, avatar, status, login_channel

class Tenant(BaseModel):
    id, name, public_key, llm_id, embd_id, parser_id, credit

class UserTenant(BaseModel):
    user_id, tenant_id, role  # owner, admin, member

# Knowledge Management
class Knowledgebase(BaseModel):
    id, tenant_id, name, description, embd_id, parser_id,
    similarity_threshold, vector_similarity_weight, ...

class Document(BaseModel):
    id, kb_id, name, location, size, type, parser_id,
    status, progress, chunk_num, token_num, process_duation

class File(BaseModel):
    id, tenant_id, name, size, location, type, source_type

# Chat & Dialog
class Dialog(BaseModel):
    id, tenant_id, name, description, kb_ids, llm_id,
    prompt_config, similarity_threshold, top_n, top_k

class Conversation(BaseModel):
    id, dialog_id, name, message  # JSON array of messages

# Workflow
class UserCanvas(BaseModel):
    id, tenant_id, name, dsl, avatar  # DSL is workflow definition

class CanvasTemplate(BaseModel):
    id, name, dsl, avatar  # Pre-built templates

# Integration
class APIToken(BaseModel):
    id, tenant_id, token, dialog_id  # For external API access

class MCPServer(BaseModel):
    id, tenant_id, name, host, tools  # MCP server config
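The `vector_similarity_weight` and `similarity_threshold` fields on the Knowledgebase and Dialog models suggest that vector and full-text scores are blended with a configurable weight, as in the "score combination" step of the hybrid retrieval in section 1.4. The sketch below shows one common way to do that (min-max normalization followed by a weighted sum). It is an assumption about the general technique, not a copy of RAGFlow's scoring code, and `combine_scores` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize scores to [0, 1] so the two score scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def combine_scores(
    vector_scores: Dict[str, float],
    bm25_scores: Dict[str, float],
    vector_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Blend per-chunk vector and BM25 scores and return chunks ranked best first."""
    vec = _normalize(vector_scores)
    bm25 = _normalize(bm25_scores)
    combined = {
        chunk_id: vector_weight * vec.get(chunk_id, 0.0)
        + (1.0 - vector_weight) * bm25.get(chunk_id, 0.0)
        for chunk_id in set(vec) | set(bm25)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

A chunk found by only one of the two searches still receives a score, because the missing side defaults to 0.0.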
2. Module RAG (/rag/)
2.1 Overview
Core RAG processing engine, covering everything from document parsing to retrieval.
2.2 LLM Abstractions (/rag/llm/)
chat_model.py - Chat LLM Interface
class Base:
    """Abstract base for all chat models"""
    def chat(self, messages, stream=True, **kwargs):
        """Generate chat completion"""

class OpenAIChat(Base):
    """OpenAI GPT models"""

class ClaudeChat(Base):
    """Anthropic Claude models"""

class QwenChat(Base):
    """Alibaba Qwen models"""

class OllamaChat(Base):
    """Local Ollama models"""

# Factory function
def get_chat_model(model_name, api_key, base_url):
    """Return appropriate chat model instance"""
Supported Providers (20+):
- OpenAI (GPT-3.5, GPT-4, GPT-4V)
- Anthropic (Claude 3)
- Google (Gemini)
- Alibaba (Qwen, Qwen-VL)
- Groq
- Mistral
- Cohere
- DeepSeek
- Zhipu (GLM)
- Moonshot
- Ollama (local)
- NVIDIA
- Bedrock (AWS)
- Azure OpenAI
- Hugging Face
- ...
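With this many providers behind one `chat` interface, a factory such as `get_chat_model` is often implemented as a registry lookup. The following is a minimal sketch under stated assumptions: the `provider/model` naming scheme, the `CHAT_MODELS` registry, and the `EchoChat` stand-in class are all hypothetical and exist only so the example runs without network access.

```python
from typing import Dict, List, Optional, Type


class Base:
    """Common interface that every provider class implements."""

    def __init__(self, model_name: str, api_key: str, base_url: Optional[str] = None):
        self.model_name = model_name
        self.api_key = api_key
        self.base_url = base_url

    def chat(self, messages: List[dict], stream: bool = False, **kwargs) -> str:
        raise NotImplementedError


class EchoChat(Base):
    """Hypothetical stand-in provider that echoes the last message."""

    def chat(self, messages, stream=False, **kwargs):
        return messages[-1]["content"]


# Hypothetical registry keyed by provider prefix.
CHAT_MODELS: Dict[str, Type[Base]] = {"echo": EchoChat}


def get_chat_model(model_name: str, api_key: str, base_url: Optional[str] = None) -> Base:
    """Resolve "provider/model" to an instance of the registered provider class."""
    provider, _, name = model_name.partition("/")
    try:
        model_cls = CHAT_MODELS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None
    return model_cls(name, api_key, base_url)
```

Adding a provider then only requires registering one more class, and callers never branch on the provider name themselves.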
embedding_model.py - Embedding Interface
from typing import List

class Base:
    """Abstract base for embeddings"""
    def encode(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for texts"""

class OpenAIEmbed(Base):
    """text-embedding-ada-002, text-embedding-3-*"""

class BGEEmbed(Base):
    """BAAI BGE models"""

class JinaEmbed(Base):
    """Jina AI embeddings"""

# Supported embedding models:
# - OpenAI: ada-002, embedding-3-small, embedding-3-large
# - BGE: bge-base, bge-large, bge-m3
# - Jina: jina-embeddings-v2
# - Cohere: embed-english-v3
# - HuggingFace: sentence-transformers
# - Local: Ollama embeddings
rerank_model.py - Reranking Interface
class Base:
    """Abstract base for rerankers"""
    def rerank(self, query: str, documents: List[str]) -> List[float]:
        """Score documents against query"""

class CohereRerank(Base):
    """Cohere rerank models"""

class JinaRerank(Base):
    """Jina AI reranker"""

class BGERerank(Base):
    """BAAI BGE reranker"""
2.3 RAG Pipeline (/rag/flow/)
Pipeline Architecture
Document → Parser → Tokenizer → Splitter → Embedder → Index
parser/parser.py
def parse(file_path, parser_config):
    """
    Parse document based on file type
    Returns: List of text segments with metadata
    """
# Supported parsers:
# - naive: Simple text extraction
# - paper: Academic paper structure
# - book: Book chapter detection
# - laws: Legal document parsing
# - presentation: PPT parsing
# - qa: Q&A format extraction
# - table: Table extraction
# - picture: Image description
# - one: Single chunk per doc
# - audio: Audio transcription
# - email: Email thread parsing
splitter/splitter.py
class Splitter:
    """Document chunking strategies"""
    def split_by_tokens(self, text, chunk_size=512, overlap=128):
        """Token-based splitting"""

    def split_by_sentences(self, text, max_sentences=10):
        """Sentence-based splitting"""

    def split_by_delimiter(self, text, delimiter='\n\n'):
        """Delimiter-based splitting"""

    def split_semantic(self, text, threshold=0.5):
        """Semantic similarity based splitting"""
tokenizer/tokenizer.py
class Tokenizer:
    """Text tokenization"""
    def tokenize(self, text):
        """Convert text to tokens"""

    def count_tokens(self, text):
        """Count tokens in text"""
# Uses tiktoken for OpenAI models
# Uses model-specific tokenizers for others
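The tokenizer and the token-based splitter work together: text is first turned into tokens, then cut into windows of `chunk_size` tokens that overlap by `overlap` tokens. The sketch below illustrates that sliding window on an already tokenized list. It is an assumption about the general technique rather than the project's implementation, and it takes any token list so that no specific tokenizer library is required.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_by_tokens(tokens: Sequence[T], chunk_size: int = 512, overlap: int = 128) -> List[List[T]]:
    """Cut a token sequence into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the sequence
    return chunks
```

For a quick experiment, `"some text to split".split()` can stand in for a real tokenizer. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.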
2.4 RAPTOR (/rag/raptor.py)
RAPTOR = Recursive Abstractive Processing for Tree-Organized Retrieval
class RAPTOR:
    """
    Hierarchical document representation
    - Clusters similar chunks
    - Creates summaries of clusters
    - Builds tree structure for retrieval
    """
    def build_tree(self, chunks):
        """Build RAPTOR tree from chunks"""

    def retrieve(self, query, tree):
        """Retrieve from tree structure"""
3. Module DeepDoc (/deepdoc/)
3.1 Overview
Deep document understanding with layout analysis and OCR.
3.2 Document Parsers (/deepdoc/parser/)
pdf_parser.py - PDF Processing
class PdfParser:
    """
    Advanced PDF parsing with:
    - OCR for scanned pages
    - Layout analysis (tables, figures, headers)
    - Multi-column detection
    - Image extraction
    """
    def __call__(self, file_path):
        """Parse PDF file"""
        # 1. Extract text with PyMuPDF
        # 2. Apply OCR if needed (Tesseract)
        # 3. Analyze layout (detectron2/layoutlm)
        # 4. Extract tables (camelot/tabula)
        # 5. Extract images
        # Return structured content
docx_parser.py - Word Documents
class DocxParser:
    """
    Parse .docx files
    - Text extraction
    - Table extraction
    - Image extraction
    - Style preservation
    """
excel_parser.py - Spreadsheets
class ExcelParser:
    """
    Parse .xlsx/.xls files
    - Sheet-by-sheet processing
    - Table structure preservation
    - Formula evaluation
    """
html_parser.py - Web Pages
class HtmlParser:
    """
    Parse HTML content
    - Clean HTML
    - Extract main content
    - Handle tables
    - Remove scripts/styles
    """
3.3 Vision Module (/deepdoc/vision/)
class LayoutAnalyzer:
    """
    Document layout analysis using ML
    - Detectron2 for object detection
    - LayoutLM for document understanding
    """
    def analyze(self, image):
        """
        Detect document regions:
        - Title
        - Paragraph
        - Table
        - Figure
        - Header/Footer
        - List
        """
4. Module Agent (/agent/)
4.1 Overview
Agentic workflow system with a visual canvas builder.
4.2 Canvas Engine (/agent/canvas.py)
class Canvas:
    """
    Main workflow orchestrator
    - Parse DSL definition
    - Execute components in order
    - Handle branching logic
    - Manage variables
    """
    def __init__(self, dsl):
        """Initialize from DSL"""
        self.components = self._parse_dsl(dsl)
        self.graph = self._build_graph()

    def run(self, input_data):
        """Execute workflow"""
        context = {"input": input_data}
        for component in self._topological_sort():
            result = component.execute(context)
            context.update(result)
        return context["output"]
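The `run` loop above depends on a topological ordering of the component graph. As a hedged illustration of that idea, the sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+). Components are plain callables here, and `run_workflow` and the `dependencies` mapping are hypothetical names, not the engine's real API.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set

Component = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_workflow(
    components: Dict[str, Component],
    dependencies: Dict[str, Set[str]],
    input_data: Any,
) -> Dict[str, Any]:
    """Run every component after the components it depends on."""
    context: Dict[str, Any] = {"input": input_data}
    # static_order() yields each node only after all of its predecessors;
    # it raises graphlib.CycleError if the graph contains a cycle.
    for name in TopologicalSorter(dependencies).static_order():
        context.update(components[name](context))
    return context
```

A two-node workflow would pass `dependencies={"begin": set(), "llm": {"begin"}}`, which guarantees that `begin` runs before `llm`.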
4.3 Components (/agent/component/)
begin.py - Workflow Start
class BeginComponent:
    """
    Entry point of workflow
    - Initialize variables
    - Receive user input
    """
    def execute(self, context):
        return {"user_input": context["input"]}
llm.py - LLM Component
class LLMComponent:
    """
    Call LLM with configured prompt
    - Template variable substitution
    - Streaming support
    - Output parsing
    """
    def execute(self, context):
        prompt = self.template.format(**context)
        response = self.llm.chat(prompt)
        return {"llm_output": response}
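One caveat with `str.format(**context)` is that it raises `KeyError` for any placeholder missing from the context and also trips over literal braces, such as a JSON example inside the prompt. The sketch below shows a more forgiving substitution. It is an illustrative alternative, not the component's actual implementation, and `render_prompt` is a hypothetical helper name.

```python
import re
from typing import Any, Dict

# Matches {name} where name is a plain identifier-like word.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_prompt(template: str, context: Dict[str, Any]) -> str:
    """Replace {name} placeholders found in context and leave everything else untouched."""

    def replace(match: "re.Match[str]") -> str:
        key = match.group(1)
        return str(context[key]) if key in context else match.group(0)

    return _PLACEHOLDER.sub(replace, template)
```

Unknown placeholders stay visible in the output, which makes a misconfigured node easier to spot than a raised exception deep inside a workflow run.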
retrieval.py - Retrieval Component
class RetrievalComponent:
    """
    Search knowledge bases
    - Multi-KB search
    - Configurable top_k
    - Score threshold
    """
    def execute(self, context):
        query = context["user_input"]
        results = self.search(query, self.kb_ids)
        return {"retrieved_docs": results}
categorize.py - Conditional Branching
class CategorizeComponent:
    """
    Route to different paths based on conditions
    - LLM-based classification
    - Rule-based matching
    """
    def execute(self, context):
        category = self._classify(context)
        return {"next_node": self.routes[category]}
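The rule-based matching branch can be illustrated with a small keyword classifier. This is a sketch under stated assumptions: the rule format (category mapped to a keyword list), the `default` fallback, and the function name `classify` are hypothetical, and the LLM-based path is not shown.

```python
from typing import Dict, List


def classify(text: str, rules: Dict[str, List[str]], default: str = "other") -> str:
    """Return the first category whose keywords appear in the text (case-insensitive)."""
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return category
    return default
```

Because dictionaries preserve insertion order, rules listed first take priority when several categories match.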
agent_with_tools.py - Tool-Using Agent
class AgentWithToolsComponent:
    """
    ReAct pattern agent
    - Tool selection
    - Iterative reasoning
    - Observation handling
    """
    def execute(self, context):
        # Loop until the LLM returns a final response instead of a tool call
        while True:
            action = self.llm.decide_action(context)
            if action.type == "tool":
                result = self.tools[action.tool].run(action.input)
                context["observation"] = result
            else:
                return {"output": action.response}
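A ReAct-style loop normally needs an upper bound on iterations so a model that keeps requesting tools cannot spin forever. The sketch below adds that bound. The `Action` dataclass, the `decide` callback, and `max_steps` are assumptions made for illustration and stand in for the LLM call in the component above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Action:
    type: str  # "tool" to call a tool, anything else means a final answer
    tool: Optional[str] = None
    input: str = ""
    response: str = ""


def run_agent(
    decide: Callable[[Dict[str, Any]], Action],
    tools: Dict[str, Callable[[str], str]],
    context: Dict[str, Any],
    max_steps: int = 5,
) -> Dict[str, Any]:
    """Alternate between deciding and acting until a final answer or the step limit."""
    for _ in range(max_steps):
        action = decide(context)
        if action.type == "tool":
            # Feed the tool result back so the next decision can use it.
            context["observation"] = tools[action.tool](action.input)
        else:
            return {"output": action.response}
    return {"output": "", "error": "max_steps reached"}
```

Returning an explicit error instead of raising lets the surrounding workflow decide how to handle an agent that never converges.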
4.4 Tools (/agent/tools/)
External Tool Integrations
| Tool | File | Function |
|---|---|---|
| Tavily | tavily.py | Web search API |
| ArXiv | arxiv.py | Academic paper search |
| Google | google.py | Google search |
| Wikipedia | wikipedia.py | Wikipedia lookup |
| GitHub | github.py | GitHub API |
| Email | email.py | Send emails |
| Code Exec | code_exec.py | Execute Python code |
| DeepL | deepl.py | Translation |
| Jin10 | jin10.py | Financial news |
| TuShare | tushare.py | Chinese stock data |
| Yahoo Finance | yahoofinance.py | Stock data |
| QWeather | qweather.py | Weather data |
class BaseTool:
    """Base class for all tools"""
    name: str
    description: str

    def run(self, input: str) -> str:
        """Execute tool and return result"""

class TavilySearch(BaseTool):
    name = "tavily_search"
    description = "Search the web for current information"

    def run(self, query):
        response = tavily.search(query)
        return format_results(response)
5. Module GraphRAG (/graphrag/)
5.1 Overview
Knowledge graph construction and querying.
5.2 Entity Resolution (/graphrag/entity_resolution.py)
class EntityResolution:
    """
    Entity extraction and linking
    - Extract entities from text
    - Cluster similar entities
    - Resolve duplicates
    """
    def extract_entities(self, text):
        """Extract named entities using LLM"""
        prompt = f"Extract entities from: {text}"
        return llm.chat(prompt)

    def resolve_entities(self, entities):
        """Merge duplicate entities"""
        clusters = self._cluster_similar(entities)
        return self._merge_clusters(clusters)
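To show what "cluster similar, then merge" can mean in the simplest case, the sketch below groups entity names that differ only in case or punctuation. This deliberately swaps in plain string normalization for whatever similarity measure the module really uses (the analysis does not say), so treat `normalize` and this `resolve_entities` as hypothetical.

```python
import re
from typing import Dict, List


def normalize(name: str) -> str:
    """Lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def resolve_entities(entities: List[str]) -> Dict[str, List[str]]:
    """Group surface forms that normalize to the same key."""
    clusters: Dict[str, List[str]] = {}
    for entity in entities:
        clusters.setdefault(normalize(entity), []).append(entity)
    # Use the first-seen surface form as the canonical name of each cluster.
    return {names[0]: names for names in clusters.values()}
```

Embedding similarity or an LLM judgment would catch aliases this misses (for example abbreviations), at a higher cost per comparison.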
5.3 Graph Search (/graphrag/search.py)
class GraphSearch:
    """
    Query knowledge graph
    - Entity-based search
    - Relationship traversal
    - Subgraph extraction
    """
    def search(self, query):
        """Find relevant subgraph for query"""
        # 1. Extract query entities
        # 2. Find matching graph entities
        # 3. Traverse relationships
        # 4. Return context subgraph
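Steps 3 and 4 above amount to a bounded graph traversal from the matched entities. The sketch below does this with a breadth-first search over an adjacency-list graph. The graph representation, the `max_hops` parameter, and the name `extract_subgraph` are assumptions for illustration and do not reflect how the module stores its graph.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Set, Tuple


def extract_subgraph(
    graph: Dict[str, List[str]],
    seeds: Iterable[str],
    max_hops: int = 1,
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Collect nodes and edges reachable from the seed entities within max_hops."""
    visited: Set[str] = {seed for seed in seeds if seed in graph}
    queue: Deque[Tuple[str, int]] = deque((seed, 0) for seed in visited)
    edges: Set[Tuple[str, str]] = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            edges.add((node, neighbor))
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited, edges
```

Limiting the hop count keeps the returned context small enough to fit into an LLM prompt.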
6. Module Frontend (/web/)
6.1 Overview
React/TypeScript SPA built with the UmiJS framework.
6.2 Pages (/web/src/pages/)
| Page | Function |
|---|---|
| /dataset | Knowledge base management |
| /datasets | Dataset list view |
| /next-chats | Chat interface |
| /next-searches | Search interface |
| /document-viewer | Document preview |
| /admin | Admin dashboard |
| /login | Authentication |
| /register | User registration |
6.3 Components (/web/src/components/)
Core Components:
- file-upload-modal/ - File upload UI
- pdf-drawer/ - PDF preview drawer
- prompt-editor/ - Prompt template editor
- document-preview/ - Document viewer
- llm-setting-items/ - LLM configuration UI
- ui/ - Shadcn/UI base components
6.4 State Management
// Using Zustand for state
import { create } from 'zustand';

interface KnowledgebaseStore {
  knowledgebases: Knowledgebase[];
  currentKb: Knowledgebase | null;
  fetchKnowledgebases: () => Promise<void>;
  createKnowledgebase: (data: CreateKbRequest) => Promise<void>;
}

export const useKnowledgebaseStore = create<KnowledgebaseStore>((set) => ({
  knowledgebases: [],
  currentKb: null,
  fetchKnowledgebases: async () => {
    const data = await api.get('/kb/list');
    set({ knowledgebases: data });
  },
  // ...
}));
6.5 API Services (/web/src/services/)
// API client using Axios
import { request } from 'umi';

export async function createKnowledgebase(data: CreateKbRequest) {
  return request('/api/v1/kb/create', {
    method: 'POST',
    data,
  });
}

export async function chat(dialogId: string, question: string) {
  return request('/api/v1/dialog/chat', {
    method: 'POST',
    data: { dialog_id: dialogId, question },
    responseType: 'stream',
  });
}
7. Module Common (/common/)
7.1 Configuration (/common/settings.py)
# Main configuration file
import os

class Settings:
    # Database
    MYSQL_HOST = os.getenv('MYSQL_HOST', 'localhost')
    MYSQL_PORT = int(os.getenv('MYSQL_PORT', 5455))
    MYSQL_USER = os.getenv('MYSQL_USER', 'root')
    MYSQL_PASSWORD = os.getenv('MYSQL_PASSWORD', 'infini_rag_flow')
    MYSQL_DATABASE = os.getenv('MYSQL_DATABASE', 'ragflow')
    # Elasticsearch
    ES_HOSTS = os.getenv('ES_HOSTS', 'http://localhost:9200').split(',')
    # Redis
    REDIS_HOST = os.getenv('REDIS_HOST', 'localhost')
    REDIS_PORT = int(os.getenv('REDIS_PORT', 6379))
    # MinIO
    MINIO_HOST = os.getenv('MINIO_HOST', 'localhost:9000')
    MINIO_ACCESS_KEY = os.getenv('MINIO_USER', 'rag_flow')
    MINIO_SECRET_KEY = os.getenv('MINIO_PASSWORD', 'infini_rag_flow')
    # Document Engine
    DOC_ENGINE = os.getenv('DOC_ENGINE', 'elasticsearch')  # or 'infinity'
7.2 Data Source Connectors (/common/data_source/)
Supported Connectors:
| Connector | File | Function |
|---|---|---|
| Confluence | confluence_connector.py (81KB) | Atlassian Confluence wiki |
| Notion | notion_connector.py (25KB) | Notion databases |
| Slack | slack_connector.py (22KB) | Slack messages |
| Gmail | gmail_connector.py | Gmail emails |
| Discord | discord_connector.py | Discord channels |
| SharePoint | sharepoint_connector.py | Microsoft SharePoint |
| Teams | teams_connector.py | Microsoft Teams |
| Dropbox | dropbox_connector.py | Dropbox files |
| Google Drive | google_drive/ | Google Drive |
| WebDAV | webdav_connector.py | WebDAV servers |
| Moodle | moodle_connector.py | Moodle LMS |
class BaseConnector:
    """Abstract base for connectors"""
    def authenticate(self, credentials):
        """Authenticate with external service"""

    def list_items(self):
        """List available items"""

    def sync(self):
        """Sync data to RAGFlow"""

class ConfluenceConnector(BaseConnector):
    """Confluence integration"""
    def __init__(self, url, username, api_token):
        self.client = Confluence(url, username, api_token)

    def sync_space(self, space_key):
        """Sync all pages from a space"""
        pages = self.client.get_all_pages(space_key)
        for page in pages:
            content = self._convert_to_markdown(page.body)
            yield Document(content=content, metadata=page.metadata)
8. Module SDK (/sdk/python/)
8.1 Python SDK
from ragflow import RAGFlow

# Initialize client
client = RAGFlow(
    api_key="your-api-key",
    base_url="http://localhost:9380"
)

# Create knowledge base
kb = client.create_knowledgebase(
    name="My KB",
    embedding_model="text-embedding-3-small"
)

# Upload document
doc = kb.upload_document("path/to/document.pdf")

# Wait for parsing
doc.wait_for_ready()

# Create chat
chat = client.create_chat(
    name="My Chat",
    knowledgebase_ids=[kb.id]
)

# Send message
response = chat.send_message("What is this document about?")
print(response.answer)
9. Module Dependency Summary
┌─────────────────────────────────────────────────────────────────┐
│                         Frontend (web/)                         │
└────────────────────────────────┬────────────────────────────────┘
                                 │ HTTP/SSE
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           API (api/)                            │
│       ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│       │ kb_app  │  │ doc_app │  │ dialog_ │  │ canvas_ │        │
│       │         │  │         │  │ app     │  │ app     │        │
│       └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘        │
│            └────────────┴──────┬─────┴────────────┘             │
│     ┌──────────────────────────┴──────────────────────────┐     │
│     │                   Services Layer                    │     │
│     │     DialogService │ DocumentService │ KBService     │     │
│     └──────────────────────────┬──────────────────────────┘     │
└────────────────────────────────┼────────────────────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          │                      │                      │
          ▼                      ▼                      ▼
┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐
│ RAG (rag/)        │  │ Agent (agent/)    │  │GraphRAG(graphrag/)│
│                   │  │                   │  │                   │
│ - LLM Models      │  │ - Canvas Engine   │  │ - Entity Res.     │
│ - Pipeline        │  │ - Components      │  │ - Graph Search    │
│ - Embeddings      │  │ - Tools           │  │                   │
└─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       DeepDoc (deepdoc/)                        │
│                                                                 │
│       PDF Parser │ DOCX Parser │ HTML Parser │ Vision/OCR       │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Common (common/)                         │
│                                                                 │
│          Settings │ Utilities │ Data Source Connectors          │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Data Stores                           │
│                                                                 │
│         MySQL │ Elasticsearch/Infinity │ Redis │ MinIO          │
└─────────────────────────────────────────────────────────────────┘
10. Estimated Code Size
| Module | Lines of Code | Complexity |
|---|---|---|
| api/ | ~15,000 | High |
| rag/ | ~8,000 | High |
| deepdoc/ | ~5,000 | Medium |
| agent/ | ~6,000 | High |
| graphrag/ | ~3,000 | Medium |
| web/src/ | ~20,000 | High |
| common/ | ~5,000 | Medium |
| Total | ~62,000 | - |