Merge pull request #1 from Learnheart/claude/add-data-fabric-layer-0183ybbNvwdPu6Psi3N1hGNu

docs: Add database architecture analysis for RAGFlow

commit 3b7123f176

1 changed file with 273 additions and 0 deletions

personal_analyze/database_saved_types.md (new file)

# RAGFlow Database Architecture

## Overview

RAGFlow uses four main types of databases to store and process data:

| Database | Type | Main purpose |
|----------|------|--------------|
| MySQL | Relational | Metadata, user data, configs |
| Elasticsearch/Infinity | Vector + Search | Chunks, embeddings, full-text search |
| Redis | In-memory | Task queue, caching, distributed locks |
| MinIO | Object Storage | Raw files (PDF, DOCX, images) |

---

## Data Flow Diagram

```
┌──────────────────────────────────────────────────────────────────────────┐
│                             USER UPLOAD FILE                             │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  STEP 1: MinIO (Object Storage)                                          │
│ ──────────────────────────────────────────────────────────────────────── │
│  Action: Store raw file                                                  │
│  Path: bucket={kb_id}, location={filename}                               │
│  Data: Binary content of the original file                               │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  STEP 2: MySQL (Metadata)                                                │
│ ──────────────────────────────────────────────────────────────────────── │
│  Tables affected:                                                        │
│  • File: {id, parent_id, tenant_id, name, location, size, type}          │
│  • Document: {id, kb_id, name, location, size, parser_id, progress=0}    │
│  • File2Document: {file_id, document_id}                                 │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  STEP 3: Redis (Task Queue)                                              │
│ ──────────────────────────────────────────────────────────────────────── │
│  Action: Push task message to stream                                     │
│  Queue: "rag_flow_svr_queue"                                             │
│  Message: {                                                              │
│    "id": "task_xxx",                                                     │
│    "doc_id": "doc_xxx",                                                  │
│    "kb_id": "kb_xxx",                                                    │
│    "tenant_id": "tenant_xxx",                                            │
│    "parser_id": "naive|paper|book|...",                                  │
│    "task_type": "parse"                                                  │
│  }                                                                       │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  STEP 4: Task Executor (Worker Process)                                  │
│ ──────────────────────────────────────────────────────────────────────── │
│  Actions:                                                                │
│  1. Consume task from Redis queue                                        │
│  2. Fetch raw file from MinIO                                            │
│  3. Parse & chunk document                                               │
│  4. Generate embeddings                                                  │
│  5. Update progress in MySQL                                             │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  STEP 5: Elasticsearch/Infinity (Vector Store)                           │
│ ──────────────────────────────────────────────────────────────────────── │
│  Action: Insert chunks with embeddings                                   │
│  Index: {tenant_id}_ragflow                                              │
│  Document: {                                                             │
│    "id": "xxhash(content + doc_id)",                                     │
│    "doc_id": "doc_xxx",                                                  │
│    "kb_id": ["kb_xxx"],                                                  │
│    "content_with_weight": "chunk text...",                               │
│    "q_1024_vec": [0.1, 0.2, ...],                                        │
│    "important_kwd": ["keyword1", "keyword2"],                            │
│    "question_kwd": ["What is...?"],                                      │
│    "page_num_int": [1, 2],                                               │
│    "create_time": "2024-01-01 12:00:00"                                  │
│  }                                                                       │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  STEP 6: MySQL (Update Status)                                           │
│ ──────────────────────────────────────────────────────────────────────── │
│  Table: Document                                                         │
│  Update: {                                                               │
│    "chunk_num": 42,                                                      │
│    "token_num": 15000,                                                   │
│    "progress": 1.0,                                                      │
│    "process_duration": 12.5                                              │
│  }                                                                       │
└──────────────────────────────────────────────────────────────────────────┘
```
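
To make the producer side of this flow concrete, here is a minimal sketch of Steps 1-3 in Python. It is illustrative rather than RAGFlow's actual code: the endpoints, credentials, ID scheme, and the `save_document_rows` helper are assumptions; only the bucket/location convention, the queue name `rag_flow_svr_queue`, and the message fields follow the diagram above.

```python
import io
import json
import uuid

import redis
from minio import Minio

# Illustrative clients; endpoints and credentials are placeholders.
minio_client = Minio("localhost:9000", access_key="minioadmin",
                     secret_key="minioadmin", secure=False)
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)


def save_document_rows(**fields) -> None:
    # Hypothetical stand-in for the MySQL inserts (File, Document,
    # File2Document); RAGFlow does this through its ORM layer.
    print("would insert:", fields)


def ingest_file(kb_id: str, tenant_id: str, filename: str, blob: bytes,
                parser_id: str = "naive") -> str:
    """Steps 1-3: raw bytes to MinIO, metadata to MySQL, task to Redis."""
    doc_id = uuid.uuid4().hex  # stand-in for RAGFlow's real ID scheme

    # STEP 1: bucket is the KB id, object name is the file name.
    if not minio_client.bucket_exists(kb_id):
        minio_client.make_bucket(kb_id)
    minio_client.put_object(kb_id, filename, io.BytesIO(blob), length=len(blob))

    # STEP 2: record the metadata rows with progress=0.
    save_document_rows(doc_id=doc_id, kb_id=kb_id, tenant_id=tenant_id,
                       name=filename, location=filename, size=len(blob),
                       parser_id=parser_id, progress=0)

    # STEP 3: enqueue a parse task on the stream that task executors consume.
    task = {
        "id": uuid.uuid4().hex,
        "doc_id": doc_id,
        "kb_id": kb_id,
        "tenant_id": tenant_id,
        "parser_id": parser_id,
        "task_type": "parse",
    }
    redis_client.xadd("rag_flow_svr_queue", {"message": json.dumps(task)})
    return doc_id
```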

---

## Database Storage Details

### 1. MySQL Tables

#### User & Tenant Management

| Table | Fields | Description |
|-------|--------|-------------|
| `user` | id, email, password, nickname, avatar, language, timezone, last_login_time, is_superuser | User accounts |
| `tenant` | id, name, llm_id, embd_id, rerank_id, asr_id, img2txt_id, tts_id, parser_ids, credit | Tenant configuration |
| `user_tenant` | id, user_id, tenant_id, role, invited_by | User-Tenant mapping |
| `invitation_code` | id, code, user_id, tenant_id, visit_time | Invitation codes |
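
The `user_tenant` junction table is what makes user-to-tenant membership many-to-many, with the role carried on the mapping row. A minimal sketch of that resolution using Peewee-style models; the field subset is illustrative and SQLite stands in for MySQL, so this is not RAGFlow's actual model layer.

```python
from peewee import CharField, Model, SqliteDatabase

# In-memory DB for the sketch; RAGFlow itself targets MySQL.
db = SqliteDatabase(":memory:")


class BaseModel(Model):
    class Meta:
        database = db


class User(BaseModel):
    id = CharField(primary_key=True)
    email = CharField()


class Tenant(BaseModel):
    id = CharField(primary_key=True)
    name = CharField()


class UserTenant(BaseModel):
    user_id = CharField()
    tenant_id = CharField()
    role = CharField()  # the junction table carries the membership role


db.create_tables([User, Tenant, UserTenant])


def tenants_of(user_id: str):
    """Resolve a user's tenants through the user_tenant junction table."""
    return (Tenant
            .select(Tenant, UserTenant.role)
            .join(UserTenant, on=(Tenant.id == UserTenant.tenant_id))
            .where(UserTenant.user_id == user_id))
```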

#### LLM & Model Configuration

| Table | Fields | Description |
|-------|--------|-------------|
| `llm_factories` | name, logo, tags, rank | LLM provider registry |
| `llm` | llm_name, model_type, fid, max_tokens, tags, is_tools | Model definitions |
| `tenant_llm` | tenant_id, llm_factory, llm_name, api_key, api_base, max_tokens, used_tokens | Tenant API keys |
| `tenant_langfuse` | tenant_id, secret_key, public_key, host | Observability config |

#### Knowledge Base & Documents

| Table | Fields | Description |
|-------|--------|-------------|
| `knowledgebase` | id, tenant_id, name, embd_id, parser_id, doc_num, chunk_num, token_num, similarity_threshold, vector_similarity_weight | KB metadata |
| `document` | id, kb_id, name, location, size, parser_id, parser_config, progress, chunk_num, token_num, meta_fields | Document metadata |
| `file` | id, parent_id, tenant_id, name, location, size, type, source_type | File system structure |
| `file2document` | id, file_id, document_id | File to Document mapping |
| `task` | id, doc_id, from_page, to_page, task_type, priority, progress, retry_count, chunk_ids | Processing tasks |
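
The `document` row is the pivot of this group: its `kb_id` plus `location` give the MinIO address of the raw bytes (Step 1 of the diagram), while `file2document` ties it back to the file-manager view. A sketch with an illustrative subset of columns, again Peewee-style rather than RAGFlow's actual definitions; note the single scalar `kb_id`, which is the limitation discussed under Key Observations.

```python
from peewee import CharField, FloatField, IntegerField, Model, SqliteDatabase

db = SqliteDatabase(":memory:")  # stand-in for MySQL


class Document(Model):
    id = CharField(primary_key=True)
    kb_id = CharField()      # a single KB per document (see Key Observations)
    location = CharField()   # object name inside the KB's MinIO bucket
    progress = FloatField(default=0)
    chunk_num = IntegerField(default=0)

    class Meta:
        database = db


class File2Document(Model):
    file_id = CharField()
    document_id = CharField()

    class Meta:
        database = db


db.create_tables([Document, File2Document])


def storage_address(doc_id: str) -> tuple[str, str]:
    """Where a document's raw bytes live: (bucket, object) = (kb_id, location)."""
    doc = Document.get_by_id(doc_id)
    return doc.kb_id, doc.location
```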

#### Chat & Conversation

| Table | Fields | Description |
|-------|--------|-------------|
| `dialog` | id, tenant_id, name, kb_ids, llm_id, llm_setting, prompt_config, similarity_threshold, top_n, rerank_id | Chat app config |
| `conversation` | id, dialog_id, user_id, name, message (JSON), reference | Internal chat history |
| `api_4_conversation` | id, dialog_id, user_id, message, reference, tokens, duration, thumb_up, errors | API chat history |
| `api_token` | tenant_id, token, dialog_id, source | API authentication |

#### Agent & Canvas

| Table | Fields | Description |
|-------|--------|-------------|
| `user_canvas` | id, user_id, title, description, canvas_type, canvas_category, dsl (JSON) | Agent workflows |
| `canvas_template` | id, title, description, canvas_type, dsl | Workflow templates |
| `user_canvas_version` | id, user_canvas_id, title, dsl | Version history |
| `mcp_server` | id, tenant_id, name, url, server_type, variables, headers | MCP integrations |

#### Data Connectors

| Table | Fields | Description |
|-------|--------|-------------|
| `connector` | id, tenant_id, name, source, input_type, config, refresh_freq | External data sources |
| `connector2kb` | id, connector_id, kb_id, auto_parse | Connector-KB mapping |
| `sync_logs` | id, connector_id, kb_id, status, new_docs_indexed, error_msg | Sync history |

#### Operations

| Table | Fields | Description |
|-------|--------|-------------|
| `pipeline_operation_log` | id, document_id, pipeline_id, parser_id, progress, dsl | Pipeline logs |
| `search` | id, tenant_id, name, search_config | Search configurations |

---

### 2. Elasticsearch/Infinity Chunk Schema

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Chunk ID = xxhash(content + doc_id) |
| `doc_id` | string | Reference to source document |
| `kb_id` | string[] | Knowledge base IDs (list format) |
| `content_with_weight` | text | Chunk content |
| `content_ltks` | text | Tokenized content (for search) |
| `content_sm_ltks` | text | Fine-grained tokenized content |
| `q_{size}_vec` | float[] | Dense vector embeddings |
| `docnm_kwd` | keyword | Document filename |
| `title_tks` | text | Tokenized title |
| `important_kwd` | keyword[] | Extracted keywords |
| `question_kwd` | keyword[] | Generated questions |
| `tag_fea_kwd` | keyword[] | Content tags |
| `page_num_int` | int[] | Page numbers |
| `top_int` | int[] | Vertical position in page |
| `position_int` | int[] | Position coordinates |
| `image_id` | string | Reference to extracted image |
| `create_time` | string | Creation timestamp |
| `create_timestamp_flt` | float | Unix timestamp |
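
Retrieval combines full-text matching over the tokenized fields with kNN over the dense vector. A hedged sketch against the Elasticsearch backend (the Infinity backend has its own API) using the elasticsearch-py 8.x client; the index name follows the `{tenant_id}_ragflow` pattern above, but the query shape and weighting are assumptions, not RAGFlow's actual ranking logic.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint


def hybrid_search(tenant_id: str, kb_id: str, query_text: str,
                  query_vec: list[float], k: int = 10):
    """Full-text + vector retrieval over the chunk index, scoped to one KB."""
    return es.search(
        index=f"{tenant_id}_ragflow",
        query={
            "bool": {
                "must": [{"match": {"content_ltks": query_text}}],
                "filter": [{"term": {"kb_id": kb_id}}],  # kb_id is a list field
            }
        },
        knn={
            "field": "q_1024_vec",   # field name encodes the embedding size
            "query_vector": query_vec,
            "k": k,
            "num_candidates": 5 * k,
        },
        size=k,
    )
```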

---

### 3. Redis Data Structures

| Key Pattern | Type | Description |
|-------------|------|-------------|
| `rag_flow_svr_queue` | Stream | Main task queue for document processing |
| `{lock_name}` | String | Distributed locks |
| `{cache_key}` | String/Hash | LLM response cache |
| `{session_id}` | String | User session data |
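
The distributed locks in the table above are typically the standard Redis `SET NX EX` pattern. A minimal sketch with redis-py; the key name and TTL are illustrative, and the token check on release keeps one worker from freeing another's lock.

```python
import uuid

import redis

r = redis.Redis()


def acquire_lock(name: str, ttl_s: int = 10) -> str | None:
    """Try to take a lock; returns a token on success, None if already held."""
    token = uuid.uuid4().hex
    if r.set(name, token, nx=True, ex=ttl_s):  # SET NX EX in one call
        return token
    return None


def release_lock(name: str, token: str) -> None:
    # Only delete if we still own the lock (best-effort; a Lua script
    # would make the check-and-delete atomic).
    if r.get(name) == token.encode():
        r.delete(name)
```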

#### Task Message Schema

```json
{
  "id": "task_xxx",
  "doc_id": "document_id",
  "kb_id": "knowledgebase_id",
  "tenant_id": "tenant_id",
  "parser_id": "naive|paper|book|qa|table|...",
  "parser_config": {},
  "from_page": 0,
  "to_page": 100000,
  "name": "filename.pdf",
  "location": "storage_path",
  "language": "English|Chinese",
  "task_type": "parse|raptor|graphrag"
}
```
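
On the consuming side (Step 4 of the flow), a task executor reads this message from the stream via a consumer group and acknowledges it only after processing. A sketch under assumptions: the group and consumer names are invented, and the `message` field key matches the producer sketch earlier rather than RAGFlow's exact wire format.

```python
import json

import redis

r = redis.Redis(decode_responses=True)
QUEUE = "rag_flow_svr_queue"
GROUP = "task_executors"  # illustrative consumer-group name

try:
    r.xgroup_create(QUEUE, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists


def consume_one(consumer: str = "executor-0") -> None:
    """Pop one task from the stream, dispatch it, then acknowledge."""
    entries = r.xreadgroup(GROUP, consumer, {QUEUE: ">"}, count=1, block=5000)
    if not entries:
        return
    _stream, messages = entries[0]
    msg_id, fields = messages[0]
    task = json.loads(fields["message"])
    if task["task_type"] == "parse":
        ...  # fetch from MinIO, chunk, embed, index (Steps 4-5)
    r.xack(QUEUE, GROUP, msg_id)  # acknowledge only after success
```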

---

### 4. MinIO Object Storage

| Bucket | Object Path | Content |
|--------|-------------|---------|
| `{kb_id}` | `{filename}` | Raw document files |
| `{kb_id}` | `{filename}_` | Duplicate files (auto-renamed) |
| `{tenant_id}` | `{chunk_id}` | Extracted images from chunks |
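
The second row encodes a convention worth noting: duplicate uploads are kept by appending `_` to the object name. A hedged sketch of that rule with the MinIO Python client (placeholder endpoint and credentials; `unique_object_name` is an illustrative helper, not a RAGFlow function).

```python
from minio import Minio
from minio.error import S3Error

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)  # placeholders


def object_exists(bucket: str, name: str) -> bool:
    try:
        client.stat_object(bucket, name)
        return True
    except S3Error:
        return False  # object (or bucket) not found


def unique_object_name(kb_id: str, filename: str) -> str:
    """Mirror the '{filename}_' auto-rename rule from the table above."""
    name = filename
    while object_exists(kb_id, name):
        name += "_"
    return name
```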

---

## Data Lineage

```
Raw File (MinIO)
    │
    ├── location: "{kb_id}/{filename}"
    │
    ▼
Document (MySQL)
    │
    ├── id: "doc_xxx"
    ├── kb_id: "kb_xxx"
    ├── location: "{filename}"
    │
    ▼
Chunks (Elasticsearch/Infinity)
    │
    ├── doc_id: "doc_xxx"             ← Link back to Document
    ├── kb_id: ["kb_xxx"]             ← Link to Knowledge Base
    └── id: xxhash(content + doc_id)
```
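
This lineage is what makes citations possible at query time: a retrieved chunk carries `doc_id`, which resolves through MySQL to the raw object in MinIO. A sketch of that walk-back; `fetch_document_row` is a hypothetical stand-in for the MySQL lookup.

```python
def fetch_document_row(doc_id: str) -> dict:
    # Hypothetical stand-in for: SELECT * FROM document WHERE id = %s
    return {"name": "example.pdf", "kb_id": "kb_xxx", "location": "example.pdf"}


def trace_chunk_to_source(hit: dict) -> dict:
    """Walk a retrieved chunk back to its raw file (ES -> MySQL -> MinIO)."""
    src = hit["_source"]
    doc = fetch_document_row(src["doc_id"])
    return {
        "chunk_id": src["id"],
        "document": doc["name"],
        "pages": src.get("page_num_int", []),
        "bucket": doc["kb_id"],        # MinIO bucket, per the lineage above
        "object": doc["location"],
    }
```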

---

## Key Observations

### Current Limitations

1. **No Data Fabric Layer**: each Document (`doc_id`) is bound to exactly one Knowledge Base (`kb_id`)
2. **Duplication Required**: using the same file in multiple KBs means re-uploading and re-processing it
3. **No Cross-KB Sharing**: chunks cannot be shared across Knowledge Bases

### Potential Improvements

1. Separate a `RawDocument` table from `Document`
2. Allow `Document.kb_id` to be a list, or use a junction table (see the sketch below)
3. Enable chunk sharing with multi-KB tagging
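
Improvement 2 is the conventional fix: replace the scalar `Document.kb_id` with a junction table so one parsed document (and its chunks) can be attached to many KBs. A hedged sketch of what such a table could look like, following the existing `file2document` naming style; this table does not exist in RAGFlow today, and the chunk side already half-supports the idea since `kb_id` in the chunk schema is a list.

```python
from peewee import CharField, Model, SqliteDatabase

db = SqliteDatabase(":memory:")  # stand-in for MySQL


class Document2KB(Model):
    """Hypothetical junction table: one document, many knowledge bases."""
    document_id = CharField(index=True)
    kb_id = CharField(index=True)

    class Meta:
        database = db
        indexes = ((("document_id", "kb_id"), True),)  # unique pair


db.create_tables([Document2KB])

# Attaching an already-parsed document to a second KB becomes one insert;
# on the chunk side, the new KB id would be appended to each chunk's
# kb_id array instead of re-parsing and re-embedding the file.
Document2KB.create(document_id="doc_xxx", kb_id="kb_yyy")
```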