Knowledgebase Service Analysis - Dataset Management & Access Control
Overview
knowledgebase_service.py (566 lines) manages Datasets (Knowledgebases), the unit that organizes documents in RAGFlow. It covers CRUD operations, access control, parser configuration, and document association tracking.
File Location
/api/db/services/knowledgebase_service.py
Class Definition
class KnowledgebaseService(CommonService):
    model = Knowledgebase  # Line 49
Inherits from CommonService, which provides the basic methods: query(), get_by_id(), save(), update_by_id(), delete_by_id()
Knowledgebase Model Structure
# From db_models.py (Lines 734-753)
class Knowledgebase(DataBaseModel):
    id = CharField(max_length=32, primary_key=True)
    avatar = TextField(null=True)                     # KB avatar (base64)
    tenant_id = CharField(max_length=32, index=True)  # Owner tenant
    name = CharField(max_length=128, index=True)      # KB name
    language = CharField(max_length=32)               # "English"|"Chinese"
    description = TextField(null=True)                # KB description
    embd_id = CharField(max_length=128)               # Embedding model ID
    permission = CharField(max_length=16)             # "me"|"team"
    created_by = CharField(max_length=32)             # Creator user ID

    # Statistics
    doc_num = IntegerField(default=0)                 # Document count
    token_num = IntegerField(default=0)               # Total tokens
    chunk_num = IntegerField(default=0)               # Total chunks

    # Search config
    similarity_threshold = FloatField(default=0.2)
    vector_similarity_weight = FloatField(default=0.3)

    # Parser config
    parser_id = CharField(default="naive")            # Default parser
    pipeline_id = CharField(null=True)                # Pipeline workflow ID
    parser_config = JSONField(default={"pages": [[1, 1000000]]})
    pagerank = IntegerField(default=0)
Permission Model
Dual-Level Access Control
┌─────────────────────────────────────────────────────────────────────────────┐
│ PERMISSION MODEL │
└─────────────────────────────────────────────────────────────────────────────┘
Level 1: Knowledgebase.permission
│
├─── "me" ───► Only owner (created_by) can access
│
└─── "team" ──► All users in owner's tenant can access
Level 2: UserTenant relationship
│
└─── User must belong to KB's tenant to access
Combined Check (get_by_tenant_ids):
┌──────────────────────────────────────────────────────────────────┐
│ ((tenant_id IN joined_tenants) AND (permission == "team")) │
│ OR │
│ (tenant_id == user_id) │
└──────────────────────────────────────────────────────────────────┘
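The combined check can be read as a plain boolean predicate. The sketch below restates it in Python over in-memory values; the function name `can_list_kb` and its arguments are illustrative and are not part of the service.

```python
def can_list_kb(kb_tenant_id: str, kb_permission: str,
                user_id: str, joined_tenant_ids: set[str]) -> bool:
    """Mirror of the combined check: team KBs of joined tenants, or the user's own KBs."""
    shared_with_team = kb_tenant_id in joined_tenant_ids and kb_permission == "team"
    owned_by_user = kb_tenant_id == user_id
    return shared_with_team or owned_by_user


joined = {"tenant_a", "tenant_b"}
print(can_list_kb("tenant_a", "team", "user_1", joined))  # True
print(can_list_kb("tenant_a", "me", "user_1", joined))    # False
print(can_list_kb("user_1", "me", "user_1", joined))      # True
```

Note that the second branch compares `tenant_id` with the user ID, so a private KB is visible only when the user is the owning tenant.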
Permission Methods
| Method | Lines | Purpose |
|---|---|---|
| accessible | 471-486 | Check if user can VIEW KB |
| accessible4deletion | 53-83 | Check if user can DELETE KB |
| get_by_tenant_ids | 134-197 | Get KBs with permission filter |
| get_kb_by_id | 488-500 | Get KB by ID + user permission |
Core Methods
1. Create Knowledgebase
Lines: 374-429
@classmethod
@DB.connection_context()
def create_with_name(
cls,
*,
name: str,
tenant_id: str,
parser_id: str | None = None,
**kwargs
):
"""
Create a dataset with validation and defaults.
Validation Steps:
1. Name must be string
2. Name cannot be empty
3. Name cannot exceed DATASET_NAME_LIMIT bytes (UTF-8)
4. Deduplicate name within tenant (append _1, _2, etc.)
5. Verify tenant exists
Returns:
(True, payload_dict) on success
(False, error_result) on failure
"""
Creation Flow:
create_with_name(name, tenant_id, ...)
│
├──► Validate name type
│ └── Must be string
│
├──► Validate name content
│ ├── Strip whitespace
│ ├── Check not empty
│ └── Check UTF-8 byte length
│
├──► duplicate_name(query, name, tenant_id, status)
│ └── Returns unique name: "name", "name_1", "name_2"...
│
├──► TenantService.get_by_id(tenant_id)
│ └── Verify tenant exists
│
└──► Build payload dict
├── id: get_uuid()
├── name: deduplicated_name
├── tenant_id: tenant_id
├── created_by: tenant_id
├── parser_id: parser_id or "naive"
└── parser_config: get_parser_config(parser_id, config)
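The validation and deduplication steps of the flow above can be sketched as two small functions. This is a simplified stand-in under stated assumptions: `validate_name` and `dedupe_name` are hypothetical helpers, the error messages are invented for illustration, and `dedupe_name` only imitates the `_1`, `_2` suffixing described here rather than reproducing the real `duplicate_name`.

```python
DATASET_NAME_LIMIT = 128  # bytes (UTF-8), as listed under Key Constants


def validate_name(name):
    """Return (True, cleaned_name) or (False, error_message)."""
    if not isinstance(name, str):
        return False, "Dataset name must be a string."
    name = name.strip()
    if not name:
        return False, "Dataset name can't be empty."
    # The limit applies to the encoded byte length, not the character count.
    if len(name.encode("utf-8")) > DATASET_NAME_LIMIT:
        return False, f"Dataset name exceeds {DATASET_NAME_LIMIT} bytes."
    return True, name


def dedupe_name(name, existing):
    """Append _1, _2, ... until the name is not in `existing` (hypothetical helper)."""
    if name not in existing:
        return name
    suffix = 1
    while f"{name}_{suffix}" in existing:
        suffix += 1
    return f"{name}_{suffix}"


ok, cleaned = validate_name("  reports  ")
print(ok, cleaned)                                     # True reports
print(dedupe_name(cleaned, {"reports", "reports_1"}))  # reports_2
```

Checking the byte length matters for non-ASCII names: 65 two-byte characters already exceed a 128-byte limit.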
2. Get Knowledgebases by Tenant
Lines: 134-197
@classmethod
@DB.connection_context()
def get_by_tenant_ids(cls, joined_tenant_ids, user_id,
page_number, items_per_page,
orderby, desc, keywords,
parser_id=None):
"""
Get knowledge bases accessible to user with pagination.
Permission Logic:
- Include team KBs from joined tenants
- Include private KBs owned by user
Filters:
- keywords: Case-insensitive name search
- parser_id: Filter by parser type
Joins:
- User: Get owner nickname and avatar
Returns:
(list[dict], total_count)
"""
Query Structure:
SELECT
kb.id, kb.avatar, kb.name, kb.language, kb.description,
kb.tenant_id, kb.permission, kb.doc_num, kb.token_num,
kb.chunk_num, kb.parser_id, kb.embd_id,
user.nickname, user.avatar AS tenant_avatar,
kb.update_time
FROM knowledgebase kb
JOIN user ON kb.tenant_id = user.id
WHERE
((kb.tenant_id IN (?, ?, ...) AND kb.permission = 'team')
OR kb.tenant_id = ?)
AND kb.status = '1'
AND LOWER(kb.name) LIKE '%keyword%' -- if keywords
AND kb.parser_id = ? -- if parser_id
ORDER BY kb.{orderby} DESC/ASC
LIMIT ? OFFSET ?
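To see the permission filter and pagination working together, the query can be tried against an in-memory SQLite database. This is only a sketch: the reduced schema, the sample rows, and the helper `list_kbs` are assumptions made for the demonstration, and the real service builds the query through its ORM rather than raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id TEXT PRIMARY KEY, nickname TEXT);
CREATE TABLE knowledgebase (
    id TEXT PRIMARY KEY, name TEXT, tenant_id TEXT,
    permission TEXT, status TEXT, update_time INTEGER
);
INSERT INTO user VALUES ('u1', 'alice'), ('u2', 'bob');
INSERT INTO knowledgebase VALUES
    ('kb1', 'Alice private', 'u1', 'me',   '1', 10),
    ('kb2', 'Bob team',      'u2', 'team', '1', 20),
    ('kb3', 'Bob private',   'u2', 'me',   '1', 30),
    ('kb4', 'Bob deleted',   'u2', 'team', '0', 40);
""")


def list_kbs(user_id, joined_tenant_ids, page_number=1, items_per_page=10):
    # One placeholder per joined tenant keeps the IN clause parameterized.
    placeholders = ", ".join("?" for _ in joined_tenant_ids)
    sql = f"""
        SELECT kb.id, kb.name, user.nickname
        FROM knowledgebase kb
        JOIN user ON kb.tenant_id = user.id
        WHERE ((kb.tenant_id IN ({placeholders}) AND kb.permission = 'team')
               OR kb.tenant_id = ?)
          AND kb.status = '1'
        ORDER BY kb.update_time DESC
        LIMIT ? OFFSET ?
    """
    offset = (page_number - 1) * items_per_page
    params = [*joined_tenant_ids, user_id, items_per_page, offset]
    return conn.execute(sql, params).fetchall()


print(list_kbs("u1", ["u1", "u2"]))
# [('kb2', 'Bob team', 'bob'), ('kb1', 'Alice private', 'alice')]
```

Bob's private KB and the soft-deleted KB are filtered out, while Alice's own private KB is kept by the `tenant_id = ?` branch.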
3. Get Knowledgebase Detail
Lines: 250-292
@classmethod
@DB.connection_context()
def get_detail(cls, kb_id):
"""
Get comprehensive KB information including pipeline details.
Joins:
- UserCanvas (LEFT): Get pipeline name and avatar
Fields included:
- Basic: id, avatar, name, language, description
- Config: parser_id, parser_config, embd_id
- Stats: doc_num, token_num, chunk_num
- Pipeline: pipeline_id, pipeline_name, pipeline_avatar
- GraphRAG: graphrag_task_id, graphrag_task_finish_at
- RAPTOR: raptor_task_id, raptor_task_finish_at
- MindMap: mindmap_task_id, mindmap_task_finish_at
- Timestamps: create_time, update_time
Returns:
dict or None if not found
"""
4. Check Parsing Status
Lines: 85-117
@classmethod
@DB.connection_context()
def is_parsed_done(cls, kb_id):
"""
Verify all documents in KB are ready for chat.
Validation Rules:
1. KB must exist
2. No documents in RUNNING/CANCEL/FAIL state
3. No documents UNSTART with zero chunks
Returns:
(True, None) - All parsed
(False, error_message) - Not ready
Used by:
Chat creation validation
"""
Status Check Flow:
is_parsed_done(kb_id)
│
├──► cls.query(id=kb_id)
│ └── Get KB info
│
├──► DocumentService.get_by_kb_id(kb_id, ...)
│ └── Get all documents (up to 1000)
│
└──► For each document:
│
├─── run == RUNNING ───► Return (False, "still being parsed")
├─── run == CANCEL ───► Return (False, "still being parsed")
├─── run == FAIL ───► Return (False, "still being parsed")
└─── run == UNSTART
└── chunk_num == 0 ──► Return (False, "has not been parsed")
└──► Return (True, None)
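The per-document rules in this flow reduce to a short loop. The sketch below works on plain dictionaries instead of database rows; `check_parsed` is a hypothetical name, the messages are shortened, and the run-state values are assumptions for illustration (the real ones come from the project's task status enum).

```python
# Run-state values are assumed for this sketch.
UNSTART, RUNNING, CANCEL, DONE, FAIL = "0", "1", "2", "3", "4"


def check_parsed(docs):
    """Return (True, None) when every document is ready, else (False, message)."""
    for doc in docs:
        # RUNNING, CANCEL and FAIL all block chat creation with the same message.
        if doc["run"] in (RUNNING, CANCEL, FAIL):
            return False, f"Document '{doc['name']}' is still being parsed."
        # An unstarted document is acceptable only if it already has chunks.
        if doc["run"] == UNSTART and doc["chunk_num"] == 0:
            return False, f"Document '{doc['name']}' has not been parsed."
    return True, None


docs = [
    {"name": "a.pdf", "run": DONE, "chunk_num": 12},
    {"name": "b.pdf", "run": UNSTART, "chunk_num": 0},
]
print(check_parsed(docs))      # (False, "Document 'b.pdf' has not been parsed.")
print(check_parsed(docs[:1]))  # (True, None)
```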
5. Parser Configuration Management
Lines: 294-345
@classmethod
@DB.connection_context()
def update_parser_config(cls, id, config):
"""
Deep merge parser configuration.
Algorithm (dfs_update):
- For dict values: recursively merge
- For list values: union (set merge)
- For scalar values: replace
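Example (illustrative keys; list items must be hashable because lists are
merged through set(), and the merged order is not guaranteed):
old = {"tags": ["a", "b"], "ocr": True}
new = {"tags": ["b", "c"], "language": "en"}
result = {"tags": ["a", "b", "c"], "ocr": True, "language": "en"}
Note: nested lists such as "pages": [[1, 100]] are unhashable, so the
set-based union would raise TypeError for them.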
"""
@classmethod
@DB.connection_context()
def delete_field_map(cls, id):
"""Remove field_map key from parser_config."""
@classmethod
@DB.connection_context()
def get_field_map(cls, ids):
"""
Aggregate field mappings from multiple KBs.
Used by: TABLE parser for column mapping
"""
Deep Merge Algorithm:
def dfs_update(old, new):
    for k, v in new.items():
        if k not in old:
            old[k] = v                      # Add new key
        elif isinstance(v, dict):
            dfs_update(old[k], v)           # Recursive merge
        elif isinstance(v, list):
            old[k] = list(set(old[k] + v))  # Union lists
        else:
            old[k] = v                      # Replace value
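A short run of the merge makes its three branches concrete. The function below repeats the algorithm as listed here; the `tags` key is invented for the demonstration, and the last part shows a limitation of the set-based union that is easy to miss.

```python
def dfs_update(old, new):
    for k, v in new.items():
        if k not in old:
            old[k] = v                      # add new key
        elif isinstance(v, dict):
            dfs_update(old[k], v)           # recursive merge
        elif isinstance(v, list):
            old[k] = list(set(old[k] + v))  # union; order is not preserved
        else:
            old[k] = v                      # replace scalar


config = {"chunk_token_num": 128, "raptor": {"enabled": False}, "tags": ["a", "b"]}
dfs_update(config, {"chunk_token_num": 256, "raptor": {"enabled": True}, "tags": ["b", "c"]})
print(config["chunk_token_num"], config["raptor"], sorted(config["tags"]))
# 256 {'enabled': True} ['a', 'b', 'c']

# Nested lists are unhashable, so the set-based union rejects them.
try:
    dfs_update({"pages": [[1, 100]]}, {"pages": [[101, 200]]})
except TypeError as exc:
    print("TypeError:", exc)
```

If list-of-list values such as `pages` need merging, a caller would have to replace them wholesale or merge them with a different strategy; which option the service takes is not stated here.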
6. Document Statistics Management
Lines: 516-565
@classmethod
@DB.connection_context()
def atomic_increase_doc_num_by_id(cls, kb_id):
"""
Atomically increment doc_num by 1.
Called when: DocumentService.insert()
SQL: UPDATE knowledgebase SET doc_num = doc_num + 1 WHERE id = ?
"""
@classmethod
@DB.connection_context()
def decrease_document_num_in_delete(cls, kb_id, doc_num_info: dict):
"""
Decrease statistics when documents are deleted.
doc_num_info = {
'doc_num': number of docs deleted,
'chunk_num': total chunks deleted,
'token_num': total tokens deleted
}
SQL:
UPDATE knowledgebase SET
doc_num = doc_num - ?,
chunk_num = chunk_num - ?,
token_num = token_num - ?,
update_time = ?
WHERE id = ?
"""
Statistics Flow:
┌─────────────────────────────────────────────────────────────────────────────┐
│ STATISTICS TRACKING │
└─────────────────────────────────────────────────────────────────────────────┘
[Document Insert]
│
└──► KnowledgebaseService.atomic_increase_doc_num_by_id(kb_id)
└── kb.doc_num += 1
[Chunk Processing]
│
└──► DocumentService.increment_chunk_num(doc_id, kb_id, tokens, chunks, ...)
├── doc.chunk_num += chunks
├── doc.token_num += tokens
├── kb.chunk_num += chunks
└── kb.token_num += tokens
[Document Delete]
│
└──► KnowledgebaseService.decrease_document_num_in_delete(kb_id, info)
├── kb.doc_num -= info['doc_num']
├── kb.chunk_num -= info['chunk_num']
└── kb.token_num -= info['token_num']
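The counter updates can be reproduced with the standard `sqlite3` module to show why the arithmetic is pushed into the UPDATE statement. The reduced table and the helper names `increase_doc_num` and `decrease_on_delete` are assumptions for this sketch; the service itself issues these updates through its ORM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE knowledgebase (id TEXT PRIMARY KEY, doc_num INTEGER DEFAULT 0, "
    "chunk_num INTEGER DEFAULT 0, token_num INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO knowledgebase (id) VALUES ('kb1')")


def increase_doc_num(kb_id):
    # The increment happens inside the UPDATE, so there is no read-modify-write in Python.
    conn.execute("UPDATE knowledgebase SET doc_num = doc_num + 1 WHERE id = ?", (kb_id,))


def decrease_on_delete(kb_id, info):
    cur = conn.execute(
        "UPDATE knowledgebase SET doc_num = doc_num - ?, chunk_num = chunk_num - ?, "
        "token_num = token_num - ? WHERE id = ?",
        (info["doc_num"], info["chunk_num"], info["token_num"], kb_id),
    )
    if cur.rowcount == 0:  # no row matched the given id
        raise RuntimeError("Knowledgebase not found")


increase_doc_num("kb1")
increase_doc_num("kb1")
conn.execute("UPDATE knowledgebase SET chunk_num = 40, token_num = 900 WHERE id = 'kb1'")
decrease_on_delete("kb1", {"doc_num": 1, "chunk_num": 15, "token_num": 300})
print(conn.execute("SELECT doc_num, chunk_num, token_num FROM knowledgebase").fetchone())
# (1, 25, 600)
```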
7. Access Control Methods
Lines: 471-514
@classmethod
@DB.connection_context()
def accessible(cls, kb_id, user_id):
"""
Check if user can access (view) KB.
Logic: User must belong to KB's tenant via UserTenant table.
SQL:
SELECT kb.id
FROM knowledgebase kb
JOIN user_tenant ON user_tenant.tenant_id = kb.tenant_id
WHERE kb.id = ? AND user_tenant.user_id = ?
"""
@classmethod
@DB.connection_context()
def accessible4deletion(cls, kb_id, user_id):
"""
Check if user can delete KB.
Logic: User must be the CREATOR of the KB.
SQL:
SELECT kb.id
FROM knowledgebase kb
WHERE kb.id = ? AND kb.created_by = ?
"""
Access Control Diagram:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ACCESS CONTROL CHECKS │
└─────────────────────────────────────────────────────────────────────────────┘
VIEW Access (accessible):
┌─────────────────────────────────────────────────────────────────┐
│ User │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ ▼ ▼ │
│ UserTenant Knowledgebase │
│ (user_id) (tenant_id) │
│ │ │ │
│ └──────────┬──────────────────┘ │
│ ▼ │
│ tenant_id MATCH? │
│ │ │
│ ┌────────┴────────┐ │
│ Yes No │
│ │ │ │
│ ALLOWED DENIED │
└─────────────────────────────────────────────────────────────────┘
DELETE Access (accessible4deletion):
┌─────────────────────────────────────────────────────────────────┐
│ User │
│ (user_id) │
│ │ │
│ ▼ │
│ Knowledgebase │
│ (created_by) │
│ │ │
│ user_id == created_by? │
│ │ │
│ ┌────────────┴────────────┐ │
│ Yes No │
│ │ │ │
│ ALLOWED DENIED │
└─────────────────────────────────────────────────────────────────┘
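The difference between the two checks is easiest to see side by side. The sketch below replaces the tables with in-memory collections; the sample IDs are made up, and the functions only imitate the logic of the SQL shown earlier in this section.

```python
# (user_id, tenant_id) rows standing in for the UserTenant table
user_tenant = {("u1", "t1"), ("u2", "t1"), ("u3", "t2")}
# kb_id -> row standing in for the Knowledgebase table
knowledgebases = {"kb1": {"tenant_id": "t1", "created_by": "u1"}}


def accessible(kb_id, user_id):
    """View access: the user belongs to the KB's tenant."""
    kb = knowledgebases.get(kb_id)
    return kb is not None and (user_id, kb["tenant_id"]) in user_tenant


def accessible4deletion(kb_id, user_id):
    """Delete access: the user created the KB."""
    kb = knowledgebases.get(kb_id)
    return kb is not None and kb["created_by"] == user_id


print(accessible("kb1", "u2"), accessible4deletion("kb1", "u2"))  # True False
print(accessible("kb1", "u1"), accessible4deletion("kb1", "u1"))  # True True
print(accessible("kb1", "u3"))                                    # False
```

A tenant member who did not create the KB can therefore view it but not delete it.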
Service Interactions
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVICE INTERACTION DIAGRAM │
└─────────────────────────────────────────────────────────────────────────────┘
KnowledgebaseService
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Document │ │ Tenant │ │ User │
│ Service │ │ Service │ │ Service │
│ │ │ │ │ │
│ • get_by_kb_id│ │ • get_by_id │ │ • get profile │
│ • insert │ │ (validate │ │ info for │
│ (→ atomic │ │ tenant) │ │ joins │
│ _increase) │ │ │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
API Layer Callers:
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ kb_app.py │ │ dialog_app.py │ │ RESTful API │
│ │ │ │ │ │
│ • create │ │ • is_parsed │ │ • create_ │
│ • update │ │ _done check │ │ with_name │
│ • delete │ │ before chat │ │ │
│ • list │ │ │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
Key Method Reference Table
| Category | Method | Lines | Purpose |
|---|---|---|---|
| Query | get_by_tenant_ids | 134-197 | Paginated list with permissions |
| Query | get_all_kb_by_tenant_ids | 199-233 | Get all KBs (batch pagination) |
| Query | get_kb_ids | 235-248 | Get KB IDs for tenant |
| Query | get_detail | 250-292 | Comprehensive KB info |
| Query | get_by_name | 347-363 | Get by name + tenant |
| Query | get_list | 432-469 | Filtered paginated list |
| Query | get_all_ids | 365-371 | Get all KB IDs |
| Create | create_with_name | 374-429 | Validated creation |
| Access | accessible | 471-486 | View permission check |
| Access | accessible4deletion | 53-83 | Delete permission check |
| Access | get_kb_by_id | 488-500 | Get with user permission |
| Access | get_kb_by_name | 502-514 | Get by name with permission |
| Config | update_parser_config | 294-321 | Deep merge config |
| Config | delete_field_map | 323-331 | Remove field map |
| Config | get_field_map | 333-345 | Get field mappings |
| Status | is_parsed_done | 85-117 | Check parsing complete |
| Stats | atomic_increase_doc_num_by_id | 516-524 | Increment doc count |
| Stats | decrease_document_num_in_delete | 552-565 | Decrease on delete |
| Stats | update_document_number_in_init | 526-550 | Init doc count |
| Docs | list_documents_by_ids | 119-132 | Get doc IDs for KBs |
API Endpoints Mapping
| HTTP Method | Endpoint | Service Method |
|---|---|---|
| POST | /v1/kb/create | create_with_name() |
| GET | /v1/kb/list | get_by_tenant_ids() |
| GET | /v1/kb/detail | get_detail() |
| PUT | /v1/kb/{kb_id} | update_by_id() (inherited) |
| DELETE | /v1/kb/{kb_id} | delete_by_id() + cleanup |
| PUT | /v1/kb/{kb_id}/config | update_parser_config() |
Parser Configuration Schema
parser_config = {
    # Page range for PDF parsing
    "pages": [[1, 1000000]],  # Default: all pages

    # OCR settings
    "ocr": True,
    "ocr_model": "tesseract",  # or "paddleocr"

    # Layout settings
    "layout_recognize": True,

    # Chunking settings
    "chunk_token_num": 128,
    "delimiter": "\n!?。;!?",

    # For TABLE parser
    "field_map": {
        "column_name": "mapped_field_name"
    },

    # For specific parsers
    "raptor": {"enabled": False},
    "graphrag": {"enabled": False}
}
Error Handling
| Location | Error Type | Handling |
|---|---|---|
| create_with_name() | Invalid name | Return (False, error_result) |
| create_with_name() | Tenant not found | Return (False, error_result) |
| update_parser_config() | KB not found | Raise LookupError |
| delete_field_map() | KB not found | Raise LookupError |
| decrease_document_num_in_delete() | KB not found | Raise RuntimeError |
| update_document_number_in_init() | ValueError "no data to save" | Pass (ignore) |
Database Patterns
Atomic Updates
# Atomic increment using SQL expression (Line 522-523)
data = {}
data["doc_num"] = cls.model.doc_num + 1  # Peewee generates: doc_num + 1
cls.model.update(data).where(cls.model.id == kb_id).execute()
Batch Pagination Pattern
# Avoid deep pagination performance issues (Lines 224-232)
offset, limit = 0, 50
res = []
while True:
    kb_batch = kbs.offset(offset).limit(limit)
    _temp = list(kb_batch.dicts())
    if not _temp:
        break
    res.extend(_temp)
    offset += limit
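The same loop can be isolated from the ORM to make its termination condition easy to test. In this sketch `fetch_all_in_batches` and the `fetch_page` callback are hypothetical names; a plain list stands in for the query.

```python
def fetch_all_in_batches(fetch_page, limit=50):
    """Collect every row by calling fetch_page(offset, limit) until it returns nothing."""
    offset, rows = 0, []
    while True:
        batch = fetch_page(offset, limit)
        if not batch:  # an empty batch means the previous one was the last
            break
        rows.extend(batch)
        offset += limit
    return rows


data = list(range(120))
result = fetch_all_in_batches(lambda offset, limit: data[offset:offset + limit])
print(len(result))  # 120
```

With 120 rows and a limit of 50 the callback runs four times: three batches with data and one empty batch that ends the loop.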
Selective Field Save
# Save only dirty fields without updating timestamps (Lines 537-545)
dirty_fields = kb.dirty_fields
if cls.model._meta.combined.get("update_time") in dirty_fields:
    dirty_fields.remove(cls.model._meta.combined["update_time"])
kb.save(only=dirty_fields)
Key Constants & Imports
# Permission types (from api/db/__init__.py)
class TenantPermission(Enum):
    ME = "me"      # Private to creator
    TEAM = "team"  # Shared with tenant

# Status (from common/constants.py)
class StatusEnum(Enum):
    WASTED = "0"  # Soft deleted
    VALID = "1"   # Active

# Dataset name limit (from api/constants.py)
DATASET_NAME_LIMIT = 128  # bytes (UTF-8)

# Default parser
ParserType.NAIVE = "naive"
Performance Considerations
- Batch Pagination: get_all_kb_by_tenant_ids() uses offset-limit pagination to avoid memory issues
- Selective Joins: Queries only join necessary tables (User, UserTenant, UserCanvas)
- Index Usage: All filter/sort fields are indexed (tenant_id, name, permission, status, parser_id)
- Atomic Operations: Statistics updates use SQL expressions for atomicity without explicit transactions
- Lazy Loading: Document details fetched separately from KB list queries