Document App Analysis
Overview
document_app.py (708 lines) is the blueprint that handles all document management operations: upload, parsing, status tracking, and deletion.
File Location
/api/apps/document_app.py
API Endpoints
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/upload` | POST | Required | Upload files to knowledge base |
| `/web_crawl` | POST | Required | Crawl and add web content |
| `/create` | POST | Required | Create virtual document |
| `/list` | POST | Required | List documents with filters |
| `/filter` | POST | Required | Get filterable metadata |
| `/run` | POST | Required | Execute document processing |
| `/rm` | POST | Required | Delete documents |
| `/change_status` | POST | Required | Enable/disable documents |
| `/change_parser` | POST | Required | Switch parsing engine |
| `/rename` | POST | Required | Rename documents |
| `/set_meta` | POST | Required | Set metadata (JSON) |
| `/get/<doc_id>` | GET | Optional | Retrieve raw document |
| `/image/<image_id>` | GET | Optional | Get document thumbnail |
Core Flow: Document Upload
┌─────────────────────────────────────────────────────────────────────────┐
│                          DOCUMENT UPLOAD FLOW                           │
└─────────────────────────────────────────────────────────────────────────┘
Client Request              API Layer                     Storage/DB
│                               │                              │
│ POST /upload                  │                              │
│ multipart/form-data           │                              │
│   - kb_id                     │                              │
│   - file(s)                   │                              │
├──────────────────────────────►│                              │
│                               │                              │
│                    ┌──────────┴──────────┐                   │
│                    │ @login_required     │                   │
│                    │ @validate_request   │                   │
│                    │   ("kb_id")         │                   │
│                    └──────────┬──────────┘                   │
│                               │                              │
│                    ┌──────────┴──────────┐                   │
│                    │ KnowledgebaseService│                   │
│                    │   .get_by_id(kb_id) │                   │
│                    └──────────┬──────────┘                   │
│                               │                              │
│                    ┌──────────┴──────────┐                   │
│                    │ check_kb_team_      │                   │
│                    │ permission(kb,      │                   │
│                    │ current_user.id)    │                   │
│                    └──────────┬──────────┘                   │
│                               │                              │
│                    ┌──────────┴──────────┐                   │
│                    │ FileService.        │                   │
│                    │ upload_document()   │                   │
│                    └──────────┬──────────┘                   │
│                               │                              │
│                               │ Store binary                 │
│                               ├─────────────────────────────►│ MinIO
│                               │                              │
│                               │ Create File record           │
│                               ├─────────────────────────────►│ MySQL
│                               │                              │
│                               │ Create Document record       │
│                               ├─────────────────────────────►│ MySQL
│                               │                              │
│                               │ Create File2Document         │
│                               ├─────────────────────────────►│ MySQL
│                               │                              │
│ 200 OK                        │                              │
│ {code: 0, data: [...]}        │                              │
│◄──────────────────────────────┤                              │
│                               │                              │
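The four storage steps in the diagram can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: `InMemoryBackend`, the record fields, and the `kb_id/filename` location scheme are hypothetical and are not taken from `FileService.upload_document()`.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryBackend:
    """Hypothetical stand-in for MinIO (blobs) and MySQL (record tables)."""
    blobs: dict = field(default_factory=dict)
    files: dict = field(default_factory=dict)
    documents: dict = field(default_factory=dict)
    file2document: list = field(default_factory=list)


def upload_document(backend, kb_id, filename, payload, user_id):
    """Mirror the four storage steps of the diagram for a single file."""
    file_id, doc_id = uuid.uuid4().hex, uuid.uuid4().hex
    location = f"{kb_id}/{filename}"  # assumed naming scheme

    # 1. Store the binary (MinIO in the diagram).
    backend.blobs[location] = payload
    # 2. Create the File record.
    backend.files[file_id] = {
        "name": filename, "location": location, "created_by": user_id,
    }
    # 3. Create the Document record.
    backend.documents[doc_id] = {
        "kb_id": kb_id, "name": filename, "size": len(payload),
    }
    # 4. Link the two with a File2Document record.
    backend.file2document.append({"file_id": file_id, "document_id": doc_id})

    return {"id": doc_id, "name": filename, "kb_id": kb_id}


backend = InMemoryBackend()
doc = upload_document(backend, "kb1", "report.pdf", b"%PDF-1.7", "user1")
print(doc["name"], len(backend.file2document))
```

Keeping the binary and the three records separate means a file can, in principle, be linked to more than one document record, which is presumably why the diagram shows a dedicated File2Document step.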
Code Analysis
Upload Endpoint
@manager.route("/upload", methods=["POST"])
@login_required
@validate_request("kb_id")
async def upload():
    """
    Upload documents to a knowledge base.

    Request:
    - Content-Type: multipart/form-data
    - kb_id: Knowledge base ID
    - file: Binary file data (multiple allowed)

    Response:
    - code: 0 (success) or error code
    - data: List of uploaded document info
    """
    form = await request.form
    kb_id = form.get("kb_id")

    # 1. Get knowledge base
    e, kb = KnowledgebaseService.get_by_id(kb_id)
    if not e:
        raise LookupError("Can't find this knowledgebase!")

    # 2. Authorization check
    if not check_kb_team_permission(kb, current_user.id):
        return get_json_result(
            data=False,
            message="No authorization.",
            code=RetCode.AUTHENTICATION_ERROR
        )

    # 3. Get uploaded files
    file_objs = await request.files

    # 4. Process upload through FileService
    err, files = FileService.upload_document(kb, file_objs, current_user.id)
    if err:
        return get_json_result(
            data=files,
            message="\n".join(err),
            code=RetCode.SERVER_ERROR
        )
    return get_json_result(data=files)
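On the client side, the `{code, message, data}` envelope returned by this endpoint can be unwrapped with a small helper. The sketch below is an assumption about how a caller might consume the response: `ApiError` and `unwrap` are hypothetical names, and the numeric code `500` is only a placeholder. Because the error branch above still passes `data=files`, the helper keeps `data` on the exception so that partial results are not lost.

```python
class ApiError(Exception):
    """Hypothetical client-side error carrying the full response envelope."""

    def __init__(self, code, message, data=None):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message
        self.data = data  # may hold partial results on a failed upload


def unwrap(envelope):
    """Return envelope['data'] when code == 0, otherwise raise ApiError."""
    code = envelope.get("code", -1)
    if code != 0:
        raise ApiError(code, envelope.get("message", ""), envelope.get("data"))
    return envelope.get("data")


print(unwrap({"code": 0, "data": [{"name": "report.pdf"}]}))

try:
    # 500 is a placeholder value, not a confirmed RetCode constant.
    unwrap({"code": 500, "message": "a.exe: unsupported", "data": []})
except ApiError as exc:
    print("upload failed:", exc.message)
```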
Document Processing (Run)
@manager.route("/run", methods=["POST"])
@login_required
@validate_request("doc_ids", "run")
async def run():
    """
    Trigger document processing pipeline.

    Request:
    - doc_ids: List of document IDs
    - run: 1 (start) or 0 (cancel)

    Flow:
    1. Validate documents exist
    2. Check authorization
    3. Queue tasks for processing
    4. Return immediately (async processing)
    """
    req = await request.json
    doc_ids = req["doc_ids"]
    run_flag = req["run"]

    for doc_id in doc_ids:
        # Get document info
        info = {"run": str(run_flag), "progress": 0}
        info["progress_msg"] = "" if run_flag == 1 else "Task is cancelled."
        info["chunk_num"] = 0
        info["token_num"] = 0

        # Update document status
        DocumentService.update_by_id(doc_id, info)

        if run_flag == 1:
            # Queue for processing
            e, doc = DocumentService.get_by_id(doc_id)
            tenant_id = DocumentService.get_tenant_id(doc_id)

            # Reset chunks if re-running
            if doc.progress == 0:
                DocumentService.clear_chunk_num(doc_id)

            # Queue task
            TaskService.queue_tasks(doc, tenant_id)

    return get_json_result(data=True)
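The reset-then-queue behaviour of this handler can be exercised in isolation with a dictionary and a `deque`. This is a simplified sketch, not the service code: `run_documents`, the `documents` dict, and `task_queue` are hypothetical stand-ins for `DocumentService` and `TaskService`.

```python
from collections import deque


def run_documents(documents, task_queue, doc_ids, run_flag):
    """Reset per-document counters, then enqueue only when run_flag == 1."""
    for doc_id in doc_ids:
        doc = documents[doc_id]
        # Same fields the handler writes through DocumentService.update_by_id.
        doc.update({
            "run": str(run_flag),
            "progress": 0,
            "progress_msg": "" if run_flag == 1 else "Task is cancelled.",
            "chunk_num": 0,
            "token_num": 0,
        })
        if run_flag == 1:
            task_queue.append(doc_id)  # stand-in for TaskService.queue_tasks
    return True


documents = {"d1": {"chunk_num": 12}, "d2": {"chunk_num": 3}}
task_queue = deque()

run_documents(documents, task_queue, ["d1"], run_flag=1)  # start d1
run_documents(documents, task_queue, ["d2"], run_flag=0)  # cancel d2
print(list(task_queue), documents["d2"]["progress_msg"])
```

Note that both start and cancel reset `chunk_num` and `token_num`; only the start path puts work on the queue.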
List Documents with Filtering
@manager.route("/list", methods=["POST"])
@login_required
async def list_docs():
    """
    List documents with pagination and filtering.

    Request:
    - kb_id: Knowledge base ID
    - keywords: Search keywords (optional)
    - page: Page number (default 1)
    - page_size: Items per page (default 15)
    - orderby: Sort field
    - desc: Sort descending (default True)
    - status: Filter by status

    Response:
    - docs: List of document objects
    - total: Total count
    """
    req = await request.json
    kb_id = req.get("kb_id")

    # Build query conditions
    conditions = {
        "kb_id": kb_id,
        "status": StatusEnum.VALID.value
    }
    if req.get("status"):
        conditions["run"] = req["status"]

    # Execute paginated query
    docs, total = DocumentService.get_list(
        conditions,
        page=req.get("page", 1),
        page_size=req.get("page_size", 15),
        orderby=req.get("orderby", "create_time"),
        desc=req.get("desc", True),
        keywords=req.get("keywords", "")
    )
    return get_json_result(data={"docs": docs, "total": total})
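To make the pagination parameters concrete, here is a minimal in-memory sketch of filter, sort, and slice. It assumes a plain list of dicts and a case-insensitive substring match on the document name; how `DocumentService.get_list` actually matches keywords is not shown in this analysis, so `list_documents` below is a hypothetical equivalent.

```python
def list_documents(docs, kb_id, keywords="", run_status=None,
                   page=1, page_size=15, orderby="create_time", desc=True):
    """Filter by knowledge base, run status and keyword, then sort and page."""
    matched = [
        d for d in docs
        if d["kb_id"] == kb_id
        and (run_status is None or d["run"] == run_status)
        and keywords.lower() in d["name"].lower()  # assumed matching rule
    ]
    matched.sort(key=lambda d: d[orderby], reverse=desc)

    # Pages are 1-based, so page 1 starts at offset 0.
    start = (page - 1) * page_size
    return matched[start:start + page_size], len(matched)


docs = [
    {"kb_id": "kb1", "name": f"doc{i}.pdf", "run": "1", "create_time": i}
    for i in range(20)
]
page_docs, total = list_documents(docs, "kb1", page=2)
print(len(page_docs), total)
```

The `total` returned is the number of matches before slicing, which is what a client needs to compute the page count.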
Authorization Pattern
# Pattern 1: Team Permission Check
if not check_kb_team_permission(kb, current_user.id):
    return get_json_result(
        data=False,
        message="No authorization.",
        code=RetCode.AUTHENTICATION_ERROR
    )

# Pattern 2: KB Accessibility Check
if not KnowledgebaseService.accessible(kb_id, current_user.id):
    return get_json_result(
        data=False,
        message='No authorization.',
        code=RetCode.AUTHENTICATION_ERROR
    )

# Pattern 3: Ownership Check
if not KnowledgebaseService.query(created_by=current_user.id, id=kb_id):
    return get_json_result(
        data=False,
        message='Only owner authorized for this operation.',
        code=RetCode.OPERATING_ERROR
    )
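The difference between a team-level check and an owner-only check can be illustrated with a toy permission model. Everything here is an assumption made for illustration: the `permission` and `tenant_id` fields, the `memberships` mapping, and the rule inside `has_team_access` are hypothetical, since the internals of `check_kb_team_permission` and `KnowledgebaseService.accessible` are not shown above. Error codes are written as symbolic strings rather than real `RetCode` values.

```python
def is_owner(kb, user_id):
    """Ownership check: only the creator passes."""
    return kb["created_by"] == user_id


def has_team_access(kb, user_id, memberships):
    """Assumed rule: owner always passes; others need a team-shared KB
    and membership in the KB's tenant."""
    if is_owner(kb, user_id):
        return True
    shared = kb.get("permission") == "team"
    return shared and kb["tenant_id"] in memberships.get(user_id, set())


def authorize(kb, user_id, memberships, owner_only=False):
    """Return None when allowed, otherwise an error envelope."""
    if owner_only and not is_owner(kb, user_id):
        return {"data": False, "code": "OPERATING_ERROR",
                "message": "Only owner authorized for this operation."}
    if not has_team_access(kb, user_id, memberships):
        return {"data": False, "code": "AUTHENTICATION_ERROR",
                "message": "No authorization."}
    return None


kb = {"id": "kb1", "created_by": "alice", "tenant_id": "t1", "permission": "team"}
memberships = {"bob": {"t1"}}

print(authorize(kb, "bob", memberships))                   # teammate may read
print(authorize(kb, "bob", memberships, owner_only=True))  # but not owner-only ops
```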
Error Handling
# Global exception handler in __init__.py
def server_error_response(e):
    logging.error("Unhandled exception", exc_info=(type(e), e, e.__traceback__))
    msg = repr(e).lower()

    # Authorization errors
    if getattr(e, "code", None) == 401 or "unauthorized" in msg:
        return get_json_result(code=RetCode.UNAUTHORIZED, message=repr(e))

    # Document store errors
    if "index_not_found_exception" in repr(e):
        return get_json_result(
            code=RetCode.EXCEPTION_ERROR,
            message="No chunk found, please upload file and parse it."
        )

    return get_json_result(code=RetCode.EXCEPTION_ERROR, message=repr(e))
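The branch order of the handler can be checked with plain exceptions. The sketch below reproduces only the classification logic; `classify_error` is a hypothetical helper that returns a symbolic code name and a message instead of calling `get_json_result`.

```python
def classify_error(exc):
    """Map an exception to (code_name, message) using the same branch order."""
    text = repr(exc)

    # Authorization errors: explicit 401 attribute or a telltale message.
    if getattr(exc, "code", None) == 401 or "unauthorized" in text.lower():
        return "UNAUTHORIZED", text

    # Document store errors get a user-facing hint instead of the raw repr.
    if "index_not_found_exception" in text:
        return "EXCEPTION_ERROR", "No chunk found, please upload file and parse it."

    # Everything else falls through with the raw repr as the message.
    return "EXCEPTION_ERROR", text


print(classify_error(PermissionError("Unauthorized access")))
print(classify_error(RuntimeError("index_not_found_exception: no such index")))
print(classify_error(ValueError("boom")))
```

Because matching is done on `repr(e)`, any exception whose text happens to contain "unauthorized" is reported as an authorization failure, which is worth keeping in mind when reading logs.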
Sequence Diagram: Complete Upload & Parse Flow
sequenceDiagram
participant C as Client
participant A as API (document_app)
participant FS as FileService
participant DS as DocumentService
participant TS as TaskService
participant M as MinIO
participant DB as MySQL
participant Q as Redis Queue
participant W as Worker
C->>A: POST /upload (multipart)
A->>A: Validate JWT token
A->>A: Validate kb_id parameter
A->>DB: Get knowledge base
A->>A: Check team permission
A->>FS: upload_document(kb, files, user_id)
FS->>M: Store file binary
M-->>FS: File location
FS->>DB: INSERT File record
FS->>DB: INSERT Document record
FS->>DB: INSERT File2Document
FS-->>A: (errors, documents)
A-->>C: 200 OK {documents}
Note over C,W: Later: User triggers parsing
C->>A: POST /run {doc_ids, run: 1}
A->>DS: Update status = RUNNING
A->>TS: queue_tasks(doc, tenant_id)
TS->>DB: INSERT Task records
TS->>Q: PUSH task to queue
A-->>C: 200 OK
Note over Q,W: Background Processing
W->>Q: POP task
W->>M: Download file
W->>W: Parse document
W->>W: Generate chunks
W->>W: Create embeddings
W->>ES: Store chunks (Elasticsearch)
W->>DB: UPDATE Document progress
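The background half of the sequence (pop task, download, parse, chunk, embed, store, update progress) can be sketched as a single loop. This is a toy model under stated assumptions: the fixed-size `chunk` splitter, the two-number `embed` function, and the `run_worker` signature are hypothetical placeholders and do not reflect the parsers or embedding models used by the real task executor.

```python
from collections import deque


def parse(raw):
    """Placeholder parser: treat the blob as UTF-8 text."""
    return raw.decode("utf-8")


def chunk(text, size=20):
    """Placeholder splitter: fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(piece):
    """Placeholder embedding: two cheap numeric features, not a real model."""
    return [len(piece), sum(map(ord, piece)) % 97]


def run_worker(queue, blob_store, chunk_index, documents):
    """Drain the queue, processing one task at a time."""
    while queue:
        task = queue.popleft()                      # POP task
        raw = blob_store[task["location"]]          # Download file
        pieces = chunk(parse(raw))                  # Parse + generate chunks
        for piece in pieces:
            chunk_index.append({                    # Store chunks
                "doc_id": task["doc_id"],
                "text": piece,
                "vector": embed(piece),             # Create embeddings
            })
        documents[task["doc_id"]].update(           # UPDATE Document progress
            progress=1.0, chunk_num=len(pieces)
        )


queue = deque([{"doc_id": "d1", "location": "kb1/a.txt"}])
blob_store = {"kb1/a.txt": ("x" * 50).encode("utf-8")}
chunk_index, documents = [], {"d1": {"progress": 0, "chunk_num": 0}}

run_worker(queue, blob_store, chunk_index, documents)
print(documents["d1"], len(chunk_index))
```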
Performance Considerations
- Async File Handling: Uses Quart's async file handling for large uploads
- Chunked Upload: Supports streaming for large files (up to 1GB)
- Background Processing: Document parsing happens asynchronously
- Progress Tracking: Real-time progress via polling or WebSocket
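Progress tracking by polling can be sketched as a loop around any callable that reports the current progress. The conventions used here are assumptions for illustration: progress is a float where `1.0` means finished and a negative value means failure, and `fetch_progress` is a hypothetical callable (for example, one that reads the `progress` field from a `/list` response).

```python
import time


def wait_until_parsed(fetch_progress, interval=1.0, max_polls=50, sleep=time.sleep):
    """Poll fetch_progress() until it reports completion, failure, or timeout."""
    for _ in range(max_polls):
        progress = fetch_progress()
        if progress >= 1.0:
            return progress
        if progress < 0:  # assumed failure convention
            raise RuntimeError("document parsing failed")
        sleep(interval)
    raise TimeoutError(f"not finished after {max_polls} polls")


# Simulated progress source; a real client would call the API here.
readings = iter([0.0, 0.4, 1.0])
print(wait_until_parsed(lambda: next(readings), interval=0))
```

Injecting `sleep` keeps the helper easy to test and lets a caller swap in a different waiting strategy, such as exponential backoff.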
Related Files
- /api/db/services/document_service.py - Business logic
- /api/db/services/file_service.py - File operations
- /api/db/services/task_service.py - Task queue management
- /rag/svr/task_executor.py - Background task execution