docs: Add document and knowledgebase service analysis documentation

- Add document_service_analysis.md: comprehensive analysis of document
  lifecycle management including insert, remove, parse, progress tracking
- Add knowledgebase_service_analysis.md: dataset management and access
  control analysis with permission model, parser configuration

2025-11-27 09:54:39 +00:00


Knowledgebase Service Analysis - Dataset Management & Access Control

Overview

knowledgebase_service.py (566 lines) manages the Dataset (Knowledgebase), the unit that organizes documents in RAGFlow. It covers CRUD operations, access control, parser configuration, and document association tracking.

File Location

/api/db/services/knowledgebase_service.py

Class Definition

class KnowledgebaseService(CommonService):
    model = Knowledgebase  # Line 49

Inherits from CommonService, which provides the basic methods: query(), get_by_id(), save(), update_by_id(), delete_by_id().

Knowledgebase Model Structure

# From db_models.py (Lines 734-753)

class Knowledgebase(DataBaseModel):
    id = CharField(max_length=32, primary_key=True)
    avatar = TextField(null=True)                    # KB avatar (base64)
    tenant_id = CharField(max_length=32, index=True) # Owner tenant
    name = CharField(max_length=128, index=True)     # KB name
    language = CharField(max_length=32)              # "English"|"Chinese"
    description = TextField(null=True)               # KB description
    embd_id = CharField(max_length=128)              # Embedding model ID
    permission = CharField(max_length=16)            # "me"|"team"
    created_by = CharField(max_length=32)            # Creator user ID

    # Statistics
    doc_num = IntegerField(default=0)                # Document count
    token_num = IntegerField(default=0)              # Total tokens
    chunk_num = IntegerField(default=0)              # Total chunks

    # Search config
    similarity_threshold = FloatField(default=0.2)
    vector_similarity_weight = FloatField(default=0.3)

    # Parser config
    parser_id = CharField(default="naive")           # Default parser
    pipeline_id = CharField(null=True)               # Pipeline workflow ID
    parser_config = JSONField(default={"pages": [[1, 1000000]]})
    pagerank = IntegerField(default=0)

Permission Model

Dual-Level Access Control

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PERMISSION MODEL                                     │
└─────────────────────────────────────────────────────────────────────────────┘

Level 1: Knowledgebase.permission
    │
    ├─── "me"  ───► Only owner (created_by) can access
    │
    └─── "team" ──► All users in owner's tenant can access

Level 2: UserTenant relationship
    │
    └─── User must belong to KB's tenant to access

Combined Check (get_by_tenant_ids):
┌──────────────────────────────────────────────────────────────────┐
│  ((tenant_id IN joined_tenants) AND (permission == "team"))      │
│                          OR                                       │
│  (tenant_id == user_id)                                          │
└──────────────────────────────────────────────────────────────────┘
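
The combined check can be read as a plain boolean predicate. The sketch below is illustrative only: `can_list_kb` and its parameter names are hypothetical, not part of KnowledgebaseService, and it mirrors just the expression in the box above.

```python
def can_list_kb(kb_tenant_id: str, kb_permission: str,
                user_id: str, joined_tenant_ids: set[str]) -> bool:
    """Hypothetical helper mirroring the combined check above."""
    # Branch 1: the KB is shared with the team and the user joined that tenant.
    shared_with_team = kb_tenant_id in joined_tenant_ids and kb_permission == "team"
    # Branch 2: the KB belongs to the user's own tenant.
    owned_by_user = kb_tenant_id == user_id
    return shared_with_team or owned_by_user


# A "team" KB in a joined tenant is listed; a "me" KB in that tenant is not.
print(can_list_kb("tenant_a", "team", "user_1", {"tenant_a"}))  # True
print(can_list_kb("tenant_a", "me", "user_1", {"tenant_a"}))    # False
# KBs in the user's own tenant are listed whatever the permission value.
print(can_list_kb("user_1", "me", "user_1", set()))             # True
```

Note that the second branch does not look at `permission` at all, which is why a user always sees every KB whose tenant_id equals their own user ID.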

Permission Methods

Method	Lines	Purpose
`accessible`	471-486	Check if user can VIEW KB
`accessible4deletion`	53-83	Check if user can DELETE KB
`get_by_tenant_ids`	134-197	Get KBs with permission filter
`get_kb_by_id`	488-500	Get KB by ID + user permission

Core Methods

1. Create Knowledgebase

Lines: 374-429

@classmethod
@DB.connection_context()
def create_with_name(
    cls,
    *,
    name: str,
    tenant_id: str,
    parser_id: str | None = None,
    **kwargs
):
    """
    Create a dataset with validation and defaults.

    Validation Steps:
    1. Name must be string
    2. Name cannot be empty
    3. Name cannot exceed DATASET_NAME_LIMIT bytes (UTF-8)
    4. Deduplicate name within tenant (append _1, _2, etc.)
    5. Verify tenant exists

    Returns:
        (True, payload_dict) on success
        (False, error_result) on failure
    """

Creation Flow:

create_with_name(name, tenant_id, ...)
         │
         ├──► Validate name type
         │       └── Must be string
         │
         ├──► Validate name content
         │       ├── Strip whitespace
         │       ├── Check not empty
         │       └── Check UTF-8 byte length
         │
         ├──► duplicate_name(query, name, tenant_id, status)
         │       └── Returns unique name: "name", "name_1", "name_2"...
         │
         ├──► TenantService.get_by_id(tenant_id)
         │       └── Verify tenant exists
         │
         └──► Build payload dict
                 ├── id: get_uuid()
                 ├── name: deduplicated_name
                 ├── tenant_id: tenant_id
                 ├── created_by: tenant_id
                 ├── parser_id: parser_id or "naive"
                 └── parser_config: get_parser_config(parser_id, config)
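
The validation and deduplication steps of the flow above can be sketched in plain Python. This is a minimal sketch, not the service code: `validate_name` and `dedupe_name` are hypothetical helpers, the error messages are invented for illustration, and the `_1`, `_2` suffix scheme simply follows the description in this document.

```python
DATASET_NAME_LIMIT = 128  # bytes (UTF-8), as listed under "Key Constants & Imports"


def validate_name(name: object) -> tuple[bool, str]:
    """Hypothetical stand-in for validation steps 1-3; returns (ok, name_or_error)."""
    if not isinstance(name, str):
        return False, "Dataset name must be a string."
    name = name.strip()
    if not name:
        return False, "Dataset name can't be empty."
    # The limit is counted in encoded bytes, not characters.
    if len(name.encode("utf-8")) > DATASET_NAME_LIMIT:
        return False, f"Dataset name exceeds {DATASET_NAME_LIMIT} bytes."
    return True, name


def dedupe_name(name: str, existing: set[str]) -> str:
    """Hypothetical stand-in for duplicate_name(): append _1, _2, ... until unique."""
    if name not in existing:
        return name
    suffix = 1
    while f"{name}_{suffix}" in existing:
        suffix += 1
    return f"{name}_{suffix}"


print(validate_name("  reports  "))                       # (True, 'reports')
print(validate_name("知" * 43)[0])                         # False: 43 chars, 129 bytes
print(dedupe_name("reports", {"reports", "reports_1"}))   # reports_2
```

The second call shows why the byte-length check matters: 43 CJK characters fit easily within 128 characters but encode to 129 bytes in UTF-8.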

2. Get Knowledgebases by Tenant

Lines: 134-197

@classmethod
@DB.connection_context()
def get_by_tenant_ids(cls, joined_tenant_ids, user_id,
                      page_number, items_per_page,
                      orderby, desc, keywords,
                      parser_id=None):
    """
    Get knowledge bases accessible to user with pagination.

    Permission Logic:
    - Include team KBs from joined tenants
    - Include private KBs owned by user

    Filters:
    - keywords: Case-insensitive name search
    - parser_id: Filter by parser type

    Joins:
    - User: Get owner nickname and avatar

    Returns:
        (list[dict], total_count)
    """

Query Structure:

SELECT
    kb.id, kb.avatar, kb.name, kb.language, kb.description,
    kb.tenant_id, kb.permission, kb.doc_num, kb.token_num,
    kb.chunk_num, kb.parser_id, kb.embd_id,
    user.nickname, user.avatar AS tenant_avatar,
    kb.update_time
FROM knowledgebase kb
JOIN user ON kb.tenant_id = user.id
WHERE
    ((kb.tenant_id IN (?, ?, ...) AND kb.permission = 'team')
     OR kb.tenant_id = ?)
    AND kb.status = '1'
    AND LOWER(kb.name) LIKE '%keyword%'  -- if keywords
    AND kb.parser_id = ?                  -- if parser_id
ORDER BY kb.{orderby} DESC/ASC
LIMIT ? OFFSET ?
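
To see the permission filter, the status filter, and paging work together, the query can be tried against an in-memory SQLite database. This is a reduced sketch under stated assumptions: the tables carry only the columns needed here, the sample rows are invented, `list_kbs` is a hypothetical function, and `page_number` is assumed to be 1-based.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id TEXT PRIMARY KEY, nickname TEXT);
CREATE TABLE knowledgebase (
    id TEXT PRIMARY KEY, tenant_id TEXT, name TEXT,
    permission TEXT, status TEXT, update_time INTEGER
);
INSERT INTO user VALUES ('u1', 'alice'), ('u2', 'bob');
INSERT INTO knowledgebase VALUES
    ('kb1', 'u1', 'Alice private', 'me',   '1', 30),
    ('kb2', 'u2', 'Bob team',      'team', '1', 20),
    ('kb3', 'u2', 'Bob private',   'me',   '1', 10),
    ('kb4', 'u2', 'Bob deleted',   'team', '0', 40);
""")


def list_kbs(user_id, joined_tenant_ids, page_number, items_per_page):
    """Toy version of the query above: permission filter, status filter, paging."""
    placeholders = ", ".join("?" for _ in joined_tenant_ids)
    sql = f"""
        SELECT kb.id, kb.name, user.nickname
        FROM knowledgebase kb
        JOIN user ON kb.tenant_id = user.id
        WHERE ((kb.tenant_id IN ({placeholders}) AND kb.permission = 'team')
               OR kb.tenant_id = ?)
          AND kb.status = '1'
        ORDER BY kb.update_time DESC
        LIMIT ? OFFSET ?
    """
    offset = (page_number - 1) * items_per_page  # assumes 1-based page numbers
    params = [*joined_tenant_ids, user_id, items_per_page, offset]
    return conn.execute(sql, params).fetchall()


rows = list_kbs("u1", ["u2"], page_number=1, items_per_page=10)
print(rows)  # [('kb1', 'Alice private', 'alice'), ('kb2', 'Bob team', 'bob')]
```

User u1 sees their own private KB and Bob's team KB. Bob's private KB is excluded by the permission branch, and the soft-deleted KB is excluded by the status filter.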

3. Get Knowledgebase Detail

Lines: 250-292

@classmethod
@DB.connection_context()
def get_detail(cls, kb_id):
    """
    Get comprehensive KB information including pipeline details.

    Joins:
    - UserCanvas (LEFT): Get pipeline name and avatar

    Fields included:
    - Basic: id, avatar, name, language, description
    - Config: parser_id, parser_config, embd_id
    - Stats: doc_num, token_num, chunk_num
    - Pipeline: pipeline_id, pipeline_name, pipeline_avatar
    - GraphRAG: graphrag_task_id, graphrag_task_finish_at
    - RAPTOR: raptor_task_id, raptor_task_finish_at
    - MindMap: mindmap_task_id, mindmap_task_finish_at
    - Timestamps: create_time, update_time

    Returns:
        dict or None if not found
    """

4. Check Parsing Status

Lines: 85-117

@classmethod
@DB.connection_context()
def is_parsed_done(cls, kb_id):
    """
    Verify all documents in KB are ready for chat.

    Validation Rules:
    1. KB must exist
    2. No documents in RUNNING/CANCEL/FAIL state
    3. No documents UNSTART with zero chunks

    Returns:
        (True, None) - All parsed
        (False, error_message) - Not ready

    Used by:
        Chat creation validation
    """

Status Check Flow:

is_parsed_done(kb_id)
         │
         ├──► cls.query(id=kb_id)
         │       └── Get KB info
         │
         ├──► DocumentService.get_by_kb_id(kb_id, ...)
         │       └── Get all documents (up to 1000)
         │
         ├──► For each document:
         │       │
         │       ├─── run == RUNNING ───► Return (False, "still being parsed")
         │       ├─── run == CANCEL  ───► Return (False, "still being parsed")
         │       ├─── run == FAIL    ───► Return (False, "still being parsed")
         │       └─── run == UNSTART
         │              └── chunk_num == 0 ──► Return (False, "has not been parsed")
         │
         └──► Return (True, None)
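
The per-document rules in this flow reduce to a short loop. The following is a hedged sketch over plain dictionaries rather than database rows: `check_parsed` is a hypothetical function, the status codes are assumed values chosen for illustration (the real ones come from the project's status enum), and the messages are paraphrased.

```python
# Assumed status codes, for illustration only.
UNSTART, RUNNING, CANCEL, DONE, FAIL = "0", "1", "2", "3", "4"


def check_parsed(kb_name, docs):
    """Hypothetical pure-Python mirror of the flow above; docs is a list of dicts."""
    for doc in docs:
        # Any document that is running, cancelled, or failed blocks the KB.
        if doc["run"] in (RUNNING, CANCEL, FAIL):
            return False, f"Document '{doc['name']}' in dataset '{kb_name}' is still being parsed."
        # A document that never started only blocks the KB if it has no chunks yet.
        if doc["run"] == UNSTART and doc["chunk_num"] == 0:
            return False, f"Document '{doc['name']}' in dataset '{kb_name}' has not been parsed."
    return True, None


docs = [
    {"name": "a.pdf", "run": DONE, "chunk_num": 12},
    {"name": "b.pdf", "run": UNSTART, "chunk_num": 3},
]
print(check_parsed("demo", docs))  # (True, None)
docs.append({"name": "c.pdf", "run": FAIL, "chunk_num": 0})
print(check_parsed("demo", docs)[0])  # False
```

An UNSTART document that already has chunks passes the check, which matches rule 3 in the docstring.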

5. Parser Configuration Management

Lines: 294-345

@classmethod
@DB.connection_context()
def update_parser_config(cls, id, config):
    """
    Deep merge parser configuration.

    Algorithm (dfs_update):
    - For dict values: recursively merge
    - For list values: union (set merge)
    - For scalar values: replace

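    Example ("tags" is an illustrative key; list items must be hashable
    because the union goes through set(), so a nested list such as
    "pages": [[1, 100]] would raise TypeError):
        old = {"tags": ["a", "b"], "ocr": True}
        new = {"tags": ["b", "c"], "language": "en"}
        result = {"tags": ["a", "b", "c"], "ocr": True, "language": "en"}
        # element order after the union is not guaranteed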
    """

@classmethod
@DB.connection_context()
def delete_field_map(cls, id):
    """Remove field_map key from parser_config."""

@classmethod
@DB.connection_context()
def get_field_map(cls, ids):
    """
    Aggregate field mappings from multiple KBs.

    Used by: TABLE parser for column mapping
    """

Deep Merge Algorithm:

def dfs_update(old, new):
    for k, v in new.items():
        if k not in old:
            old[k] = v           # Add new key
        elif isinstance(v, dict):
            dfs_update(old[k], v) # Recursive merge
        elif isinstance(v, list):
            old[k] = list(set(old[k] + v))  # Union lists (items must be hashable; order not preserved)
        else:
            old[k] = v           # Replace value
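
The merge rules are easiest to check by running them. The sketch below repeats the function so the block is self-contained, then merges a small config. The `tags` key is a made-up example key, not a documented parser_config field. The final part shows that lists of lists cannot pass through the set-based union.

```python
def dfs_update(old: dict, new: dict) -> None:
    """Same merge rules as the listing above, repeated so this block runs on its own."""
    for k, v in new.items():
        if k not in old:
            old[k] = v
        elif isinstance(v, dict):
            dfs_update(old[k], v)
        elif isinstance(v, list):
            old[k] = list(set(old[k] + v))
        else:
            old[k] = v


config = {"chunk_token_num": 128, "raptor": {"enabled": False}, "tags": ["a", "b"]}
dfs_update(config, {"chunk_token_num": 256, "raptor": {"enabled": True}, "tags": ["b", "c"]})
# sorted() is used because set() does not preserve list order.
print(config["chunk_token_num"], config["raptor"], sorted(config["tags"]))
# 256 {'enabled': True} ['a', 'b', 'c']

# Lists of lists are not hashable, so the set-based union rejects them.
try:
    dfs_update({"pages": [[1, 100]]}, {"pages": [[101, 200]]})
except TypeError as exc:
    print("TypeError:", exc)
```

Because of that TypeError, merging a page-range list would need a different strategy (for example, replacing the list outright) if both the old and new configs contain the key.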

6. Document Statistics Management

Lines: 516-565

@classmethod
@DB.connection_context()
def atomic_increase_doc_num_by_id(cls, kb_id):
    """
    Atomically increment doc_num by 1.
    Called when: DocumentService.insert()

    SQL: UPDATE knowledgebase SET doc_num = doc_num + 1 WHERE id = ?
    """

@classmethod
@DB.connection_context()
def decrease_document_num_in_delete(cls, kb_id, doc_num_info: dict):
    """
    Decrease statistics when documents are deleted.

    doc_num_info = {
        'doc_num': number of docs deleted,
        'chunk_num': total chunks deleted,
        'token_num': total tokens deleted
    }

    SQL:
        UPDATE knowledgebase SET
            doc_num = doc_num - ?,
            chunk_num = chunk_num - ?,
            token_num = token_num - ?,
            update_time = ?
        WHERE id = ?
    """

Statistics Flow:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STATISTICS TRACKING                                       │
└─────────────────────────────────────────────────────────────────────────────┘

[Document Insert]
    │
    └──► KnowledgebaseService.atomic_increase_doc_num_by_id(kb_id)
            └── kb.doc_num += 1

[Chunk Processing]
    │
    └──► DocumentService.increment_chunk_num(doc_id, kb_id, tokens, chunks, ...)
            ├── doc.chunk_num += chunks
            ├── doc.token_num += tokens
            ├── kb.chunk_num += chunks
            └── kb.token_num += tokens

[Document Delete]
    │
    └──► KnowledgebaseService.decrease_document_num_in_delete(kb_id, info)
            ├── kb.doc_num -= info['doc_num']
            ├── kb.chunk_num -= info['chunk_num']
            └── kb.token_num -= info['token_num']
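
The three counter updates can be replayed against an in-memory SQLite table to confirm that the numbers stay consistent. This is a sketch under stated assumptions: the function names are hypothetical simplifications, the table holds only the counter columns, and using the affected-row count to detect a missing KB is an assumption about how the RuntimeError might be triggered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knowledgebase (id TEXT PRIMARY KEY, doc_num INTEGER, "
             "chunk_num INTEGER, token_num INTEGER)")
conn.execute("INSERT INTO knowledgebase VALUES ('kb1', 0, 0, 0)")


def increase_doc_num(kb_id):
    # The arithmetic happens inside the UPDATE, so no value is read into Python first.
    conn.execute("UPDATE knowledgebase SET doc_num = doc_num + 1 WHERE id = ?", (kb_id,))


def add_chunks(kb_id, tokens, chunks):
    conn.execute("UPDATE knowledgebase SET chunk_num = chunk_num + ?, "
                 "token_num = token_num + ? WHERE id = ?", (chunks, tokens, kb_id))


def decrease_on_delete(kb_id, info):
    cur = conn.execute(
        "UPDATE knowledgebase SET doc_num = doc_num - ?, chunk_num = chunk_num - ?, "
        "token_num = token_num - ? WHERE id = ?",
        (info["doc_num"], info["chunk_num"], info["token_num"], kb_id))
    if cur.rowcount == 0:  # assumed way of detecting a missing KB
        raise RuntimeError(f"knowledge base {kb_id} not found")


increase_doc_num("kb1")
increase_doc_num("kb1")
add_chunks("kb1", tokens=900, chunks=12)
decrease_on_delete("kb1", {"doc_num": 1, "chunk_num": 5, "token_num": 400})
stats = conn.execute("SELECT doc_num, chunk_num, token_num FROM knowledgebase").fetchone()
print(stats)  # (1, 7, 500)
```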

7. Access Control Methods

Lines: 53-83 (accessible4deletion), 471-514 (accessible, get_kb_by_id, get_kb_by_name)

@classmethod
@DB.connection_context()
def accessible(cls, kb_id, user_id):
    """
    Check if user can access (view) KB.

    Logic: User must belong to KB's tenant via UserTenant table.

    SQL:
        SELECT kb.id
        FROM knowledgebase kb
        JOIN user_tenant ON user_tenant.tenant_id = kb.tenant_id
        WHERE kb.id = ? AND user_tenant.user_id = ?
    """

@classmethod
@DB.connection_context()
def accessible4deletion(cls, kb_id, user_id):
    """
    Check if user can delete KB.

    Logic: User must be the CREATOR of the KB.

    SQL:
        SELECT kb.id
        FROM knowledgebase kb
        WHERE kb.id = ? AND kb.created_by = ?
    """

Access Control Diagram:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ACCESS CONTROL CHECKS                                     │
└─────────────────────────────────────────────────────────────────────────────┘

VIEW Access (accessible):
┌─────────────────────────────────────────────────────────────────┐
│                        User                                      │
│                          │                                       │
│           ┌──────────────┴──────────────┐                       │
│           ▼                             ▼                       │
│     UserTenant                    Knowledgebase                 │
│     (user_id)                     (tenant_id)                   │
│           │                             │                       │
│           └──────────┬──────────────────┘                       │
│                      ▼                                          │
│              tenant_id MATCH?                                   │
│                      │                                          │
│            ┌────────┴────────┐                                  │
│            Yes               No                                 │
│            │                 │                                  │
│         ALLOWED          DENIED                                 │
└─────────────────────────────────────────────────────────────────┘

DELETE Access (accessible4deletion):
┌─────────────────────────────────────────────────────────────────┐
│                        User                                      │
│                      (user_id)                                   │
│                          │                                       │
│                          ▼                                       │
│                   Knowledgebase                                  │
│                   (created_by)                                   │
│                          │                                       │
│              user_id == created_by?                              │
│                          │                                       │
│            ┌────────────┴────────────┐                          │
│            Yes                        No                         │
│            │                          │                         │
│         ALLOWED                    DENIED                       │
└─────────────────────────────────────────────────────────────────┘
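
The difference between the two checks shows up clearly with a few sample users. The sketch below replaces the SQL queries with in-memory lookups. The data and the module-level names `USER_TENANTS` and `KNOWLEDGEBASES` are invented for illustration, and the functions follow only the logic described in the docstrings above.

```python
# Toy in-memory tables standing in for UserTenant and Knowledgebase.
USER_TENANTS = {("user_1", "tenant_a"), ("user_2", "tenant_a"), ("user_3", "tenant_b")}
KNOWLEDGEBASES = {"kb1": {"tenant_id": "tenant_a", "created_by": "user_1"}}


def accessible(kb_id, user_id):
    """View check: the user must belong to the KB's tenant."""
    kb = KNOWLEDGEBASES.get(kb_id)
    return kb is not None and (user_id, kb["tenant_id"]) in USER_TENANTS


def accessible4deletion(kb_id, user_id):
    """Delete check: the user must be the creator of the KB."""
    kb = KNOWLEDGEBASES.get(kb_id)
    return kb is not None and kb["created_by"] == user_id


for user in ("user_1", "user_2", "user_3"):
    print(user, accessible("kb1", user), accessible4deletion("kb1", user))
# user_1 True True    (creator and tenant member)
# user_2 True False   (tenant member only)
# user_3 False False  (different tenant)
```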

Service Interactions

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SERVICE INTERACTION DIAGRAM                               │
└─────────────────────────────────────────────────────────────────────────────┘

                        KnowledgebaseService
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
        ▼                       ▼                       ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Document    │      │    Tenant     │      │     User      │
│   Service     │      │   Service     │      │   Service     │
│               │      │               │      │               │
│ • get_by_kb_id│      │ • get_by_id   │      │ • get profile │
│ • insert      │      │   (validate   │      │   info for    │
│   (→ atomic   │      │    tenant)    │      │   joins       │
│   _increase)  │      │               │      │               │
└───────────────┘      └───────────────┘      └───────────────┘

API Layer Callers:
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   kb_app.py   │      │ dialog_app.py │      │  RESTful API  │
│               │      │               │      │               │
│ • create      │      │ • is_parsed   │      │ • create_     │
│ • update      │      │   _done check │      │   with_name   │
│ • delete      │      │   before chat │      │               │
│ • list        │      │               │      │               │
└───────────────┘      └───────────────┘      └───────────────┘

Key Method Reference Table

Category	Method	Lines	Purpose
Query	`get_by_tenant_ids`	134-197	Paginated list with permissions
Query	`get_all_kb_by_tenant_ids`	199-233	Get all KBs (batch pagination)
Query	`get_kb_ids`	235-248	Get KB IDs for tenant
Query	`get_detail`	250-292	Comprehensive KB info
Query	`get_by_name`	347-363	Get by name + tenant
Query	`get_list`	432-469	Filtered paginated list
Query	`get_all_ids`	365-371	Get all KB IDs
Create	`create_with_name`	374-429	Validated creation
Access	`accessible`	471-486	View permission check
Access	`accessible4deletion`	53-83	Delete permission check
Access	`get_kb_by_id`	488-500	Get with user permission
Access	`get_kb_by_name`	502-514	Get by name with permission
Config	`update_parser_config`	294-321	Deep merge config
Config	`delete_field_map`	323-331	Remove field map
Config	`get_field_map`	333-345	Get field mappings
Status	`is_parsed_done`	85-117	Check parsing complete
Stats	`atomic_increase_doc_num_by_id`	516-524	Increment doc count
Stats	`decrease_document_num_in_delete`	552-565	Decrease on delete
Stats	`update_document_number_in_init`	526-550	Init doc count
Docs	`list_documents_by_ids`	119-132	Get doc IDs for KBs

API Endpoints Mapping

HTTP Method	Endpoint	Service Method
`POST`	`/v1/kb/create`	`create_with_name()`
`GET`	`/v1/kb/list`	`get_by_tenant_ids()`
`GET`	`/v1/kb/detail`	`get_detail()`
`PUT`	`/v1/kb/{kb_id}`	`update_by_id()` (inherited)
`DELETE`	`/v1/kb/{kb_id}`	`delete_by_id()` + cleanup
`PUT`	`/v1/kb/{kb_id}/config`	`update_parser_config()`

Parser Configuration Schema

parser_config = {
    # Page range for PDF parsing
    "pages": [[1, 1000000]],  # Default: all pages

    # OCR settings
    "ocr": True,
    "ocr_model": "tesseract",  # or "paddleocr"

    # Layout settings
    "layout_recognize": True,

    # Chunking settings
    "chunk_token_num": 128,
    "delimiter": "\n!?。；！？",

    # For TABLE parser
    "field_map": {
        "column_name": "mapped_field_name"
    },

    # For specific parsers
    "raptor": {"enabled": False},
    "graphrag": {"enabled": False}
}

Error Handling

Location	Error Type	Handling
`create_with_name()`	Invalid name	Return `(False, error_result)`
`create_with_name()`	Tenant not found	Return `(False, error_result)`
`update_parser_config()`	KB not found	Raise `LookupError`
`delete_field_map()`	KB not found	Raise `LookupError`
`decrease_document_num_in_delete()`	KB not found	Raise `RuntimeError`
`update_document_number_in_init()`	ValueError "no data to save"	Pass (ignore)

Database Patterns

Atomic Updates

# Atomic increment using SQL expression (Lines 522-523)
data["doc_num"] = cls.model.doc_num + 1  # Peewee generates: doc_num + 1
cls.model.update(data).where(cls.model.id == kb_id).execute()

Batch Pagination Pattern

# Fetch rows in fixed-size batches instead of one large result set (Lines 224-232)
offset, limit = 0, 50
res = []
while True:
    kb_batch = kbs.offset(offset).limit(limit)
    _temp = list(kb_batch.dicts())
    if not _temp:
        break
    res.extend(_temp)
    offset += limit
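
The same loop can be written independently of the ORM, which makes its termination condition easy to test. This is a generic sketch, not the service code: `fetch_all_in_batches` and `fake_fetch` are hypothetical names, and a Python list stands in for the query.

```python
from typing import Callable


def fetch_all_in_batches(fetch_page: Callable[[int, int], list], limit: int = 50) -> list:
    """Collect every row by calling fetch_page(offset, limit) until it returns nothing."""
    offset, rows = 0, []
    while True:
        batch = fetch_page(offset, limit)
        if not batch:
            break
        rows.extend(batch)
        offset += limit
    return rows


data = list(range(120))  # stand-in for the rows of a query
calls = []


def fake_fetch(offset, limit):
    calls.append((offset, limit))
    return data[offset:offset + limit]


result = fetch_all_in_batches(fake_fetch)
print(len(result), calls)
# 120 [(0, 50), (50, 50), (100, 50), (150, 50)]
```

The loop always issues one extra query that returns no rows. Stopping as soon as a batch is shorter than `limit` would save that round trip, at the cost of a slightly more complex condition.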

Selective Field Save

# Save only dirty fields without updating timestamps (Lines 537-545)
dirty_fields = kb.dirty_fields
if cls.model._meta.combined.get("update_time") in dirty_fields:
    dirty_fields.remove(cls.model._meta.combined["update_time"])
kb.save(only=dirty_fields)

Key Constants & Imports

# Permission types (from api/db/__init__.py)
class TenantPermission(Enum):
    ME = "me"      # Private to creator
    TEAM = "team"  # Shared with tenant

# Status (from common/constants.py)
class StatusEnum(Enum):
    WASTED = "0"  # Soft deleted
    VALID = "1"   # Active

# Dataset name limit (from api/constants.py)
DATASET_NAME_LIMIT = 128  # bytes (UTF-8)

# Default parser
ParserType.NAIVE = "naive"

Performance Considerations

Batch Pagination: get_all_kb_by_tenant_ids() reads rows in batches of 50 via OFFSET/LIMIT, which bounds the size of each query result; the accumulated list still holds every row in memory.
Selective Joins: Queries only join necessary tables (User, UserTenant, UserCanvas)
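Index Usage: The filter and sort fields (tenant_id, name, permission, status, parser_id) are described as indexed. The model excerpt above shows index=True only for tenant_id and name, so confirm the remaining fields in db_models.py.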
Atomic Operations: Each statistics update is a single UPDATE with a SQL expression (for example doc_num = doc_num + 1), so it is atomic without an explicit transaction.
Lazy Loading: Document details are fetched separately from KB list queries.
