Claude a6ee18476d docs: Add detailed backend module analysis documentation Add comprehensive documentation covering 6 modules: - 01-API-LAYER: Authentication, routing, SSE streaming - 02-SERVICE-LAYER: Dialog, Task, LLM service analysis - 03-RAG-ENGINE: Hybrid search, embedding, reranking - 04-AGENT-SYSTEM: Canvas engine, components, tools - 05-DOCUMENT-PROCESSING: Task executor, PDF parsing - 06-ALGORITHMS: BM25, fusion, RAPTOR Total 28 documentation files with code analysis, diagrams, and formulas.		2025-11-26 11:10:54 +00:00
README.md

05-DOCUMENT-PROCESSING - Document Parsing Pipeline

Overview

The document processing pipeline converts raw documents into searchable chunks using layout analysis, OCR, and intelligent chunking.

Document Processing Architecture

┌─────────────────────────────────────────────────────────────────┐
│                 DOCUMENT PROCESSING PIPELINE                     │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   File Upload   │────▶│  Task Creation  │────▶│  Task Queue     │
│   (MinIO)       │     │   (MySQL)       │     │   (Redis)       │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                     TASK EXECUTOR                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  1. Download file from MinIO                             │   │
│  │  2. Select parser based on file type                     │   │
│  │  3. Execute parsing pipeline                             │   │
│  │  4. Generate embeddings                                  │   │
│  │  5. Index in Elasticsearch                               │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   PDF Parser    │  │  Office Parser  │  │  Text Parser    │
│                 │  │                 │  │                 │
│ - Layout detect │  │ - DOCX/XLSX     │  │ - TXT/MD/CSV    │
│ - OCR           │  │ - Table extract │  │ - Direct chunk  │
│ - Table struct  │  │ - Image embed   │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘

Directory Structure

/rag/
├── svr/
│   └── task_executor.py    # Main task executor ⭐
├── app/
│   ├── naive.py           # Document parsing logic
│   ├── paper.py           # Academic paper parser
│   ├── qa.py              # Q&A document parser
│   └── table.py           # Structured table parser
├── flow/
│   ├── parser/            # Document parsers
│   ├── splitter/          # Chunking logic
│   └── tokenizer/         # Tokenization
└── nlp/
    └── __init__.py        # naive_merge() chunking

/deepdoc/
├── parser/
│   └── pdf_parser.py      # RAGFlow PDF parser ⭐
├── vision/
│   ├── ocr.py            # PaddleOCR integration
│   ├── layout_recognizer.py  # Detectron2 layout
│   └── table_structure_recognizer.py  # TSR
└── images/
    └── ...               # Image processing

Files in This Module

File	Description
task_executor_analysis.md	Task execution pipeline
pdf_parsing.md	PDF parsing with layout analysis
ocr_pipeline.md	OCR with PaddleOCR
layout_detection.md	Detectron2 layout recognition
table_extraction.md	Table structure recognition
file_type_handlers.md	Handlers for each file type

Processing Flow

┌─────────────────────────────────────────────────────────────────┐
│                    PDF PROCESSING PIPELINE                       │
└─────────────────────────────────────────────────────────────────┘

                    PDF Binary Input
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  1. IMAGE EXTRACTION (0-40%)                                     │
│     pdfplumber → PIL Images (3x zoom)                           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  2. OCR DETECTION (40-63%)                                       │
│     PaddleOCR → Bounding boxes + Text                           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  3. LAYOUT RECOGNITION (63-83%)                                  │
│     Detectron2 → Layout types (Text, Title, Table, Figure)      │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  4. TABLE STRUCTURE (TSR)                                        │
│     TableTransformer → Rows, Columns, Cells                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  5. TEXT MERGING                                                 │
│     ML-based vertical merge (XGBoost)                           │
│     Column detection (KMeans)                                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  6. CHUNKING                                                     │
│     naive_merge() → Token-based chunks                          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  7. EMBEDDING + INDEXING                                         │
│     Vector generation → Elasticsearch                           │
└─────────────────────────────────────────────────────────────────┘
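
The percentage ranges in the diagram suggest that each stage reports its own completion, which is then mapped into a slice of the overall progress bar. The following is a minimal sketch of that mapping, not the executor's actual code. The names `STAGE_RANGES` and `overall_progress` are hypothetical, and only the three stages that carry a range in the diagram are included.

```python
# Hypothetical mapping from a stage's local completion (0.0-1.0) to the
# overall progress ranges shown in the diagram.
STAGE_RANGES = {
    "image_extraction": (0.00, 0.40),
    "ocr_detection": (0.40, 0.63),
    "layout_recognition": (0.63, 0.83),
}


def overall_progress(stage: str, fraction: float) -> float:
    """Scale a per-stage fraction into the stage's slice of overall progress."""
    start, end = STAGE_RANGES[stage]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range reports
    return start + (end - start) * fraction
```

For example, an OCR stage that is halfway through its pages would report roughly 51.5% overall under this mapping.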

Supported File Types

Category	Extensions	Parser
PDF	.pdf	RAGFlowPdfParser, PlainParser, VisionParser
Office	.docx, .xlsx, .pptx	python-docx, openpyxl
Text	.txt, .md, .csv	Direct reading
Images	.jpg, .png, .tiff	Vision LLM
Email	.eml	Email parser
Web	.html	Beautiful Soup
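
Step 2 of the task executor ("Select parser based on file type") amounts to a lookup keyed by file extension. The sketch below illustrates that idea using the categories from the table. `CATEGORY_BY_EXTENSION` and `select_category` are hypothetical names for illustration and are not taken from the codebase.

```python
from pathlib import Path

# Extension -> category, copied from the table above.
# The dictionary and function names are illustrative assumptions.
CATEGORY_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Office", ".xlsx": "Office", ".pptx": "Office",
    ".txt": "Text", ".md": "Text", ".csv": "Text",
    ".jpg": "Images", ".png": "Images", ".tiff": "Images",
    ".eml": "Email",
    ".html": "Web",
}


def select_category(filename: str) -> str:
    """Return the parser category for a file, based on its extension."""
    extension = Path(filename).suffix.lower()
    try:
        return CATEGORY_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"Unsupported file type: {extension or filename!r}")
```

Lower-casing the suffix means `Report.PDF` and `report.pdf` resolve to the same category, and unknown extensions fail loudly rather than falling through to a default parser.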

Layout Types Detected

Type	Description
Text	Regular body text
Title	Section/document titles
Figure	Images and diagrams
Figure caption	Figure descriptions
Table	Data tables
Table caption	Table descriptions
Header	Page headers
Footer	Page footers
Reference	Bibliography
Equation	Mathematical equations

Key Algorithms

Text Merging (XGBoost)

Features:
- Y-distance normalized by char height
- Same layout number
- Ending punctuation patterns
- Beginning character patterns
- Chinese numbering patterns

Output: Merge probability → threshold decision
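
The feature list above can be sketched as a function that turns a pair of vertically adjacent text boxes into a numeric vector, followed by a threshold on the model's probability. This is an illustration under assumptions, not the parser's implementation: the `Box` fields, the regular expressions, the 0.5 threshold, and the hand-weighted `toy_scorer` (a stand-in for the trained XGBoost model) are all hypothetical.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    top: float
    bottom: float
    layout_no: int


# Assumed patterns; the real feature set may use different rules.
ENDS_WITH_PUNCT = re.compile(r"[。？！；.?!;]$")
STARTS_LOWERCASE = re.compile(r"^[a-z]")
CHINESE_NUMBERING = re.compile(r"^[一二三四五六七八九十]+[、.．]")


def merge_features(up: Box, down: Box) -> list[float]:
    """Build the feature vector for deciding whether `down` continues `up`."""
    char_height = max(up.bottom - up.top, 1e-6)
    return [
        (down.top - up.bottom) / char_height,            # normalized Y-distance
        float(up.layout_no == down.layout_no),            # same layout number
        float(bool(ENDS_WITH_PUNCT.search(up.text.strip()))),
        float(bool(STARTS_LOWERCASE.match(down.text.strip()))),
        float(bool(CHINESE_NUMBERING.match(down.text.strip()))),
    ]


def toy_scorer(features: list[float]) -> float:
    """Hand-weighted logistic stand-in for the trained model (assumption)."""
    weights = [-1.5, 2.0, -2.0, 1.0, -2.0]
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def should_merge(up: Box, down: Box, scorer=toy_scorer, threshold: float = 0.5) -> bool:
    """Merge when the predicted probability exceeds the threshold."""
    return scorer(merge_features(up, down)) > threshold
```

Passing the scorer as a parameter keeps the feature extraction independent of the model, so a trained classifier's probability function could be substituted for `toy_scorer` without touching the rest.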

Column Detection (KMeans)

Input: X-coordinates of text boxes
Output: Optimal column assignments

Algorithm:
1. For k = 1 to max_columns:
   - Fit KMeans(k)
   - Calculate silhouette_score
2. Select k with best score
3. Assign boxes to columns
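
The selection loop above can be sketched without external dependencies as a small 1-D k-means plus a silhouette score. This is an illustrative sketch, not the parser's code. Two assumptions are worth flagging: the silhouette score is undefined for a single cluster, so the sketch assigns k = 1 a score of 0, and `max_columns` defaults to 4. In practice, scikit-learn's `KMeans` and `silhouette_score` provide the same building blocks.

```python
def kmeans_1d(xs, k, max_iter=100):
    """Cluster 1-D values with Lloyd's algorithm; returns one label per value."""
    ordered = sorted(xs)
    n = len(ordered)
    # Deterministic start: spread the k centres evenly over the sorted values.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: abs(x - centres[c])) for x in xs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels


def silhouette_1d(xs, labels):
    """Mean silhouette coefficient; 0.0 when fewer than two clusters exist."""
    clusters = {}
    for x, lab in zip(xs, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        return 0.0  # assumption: undefined case treated as a neutral score
    total = 0.0
    for x, lab in zip(xs, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        b = min(sum(abs(x - y) for y in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(xs)


def detect_columns(xs, max_columns=4):
    """Try k = 1..max_columns and keep the k with the best silhouette score."""
    best_k, best_score, best_labels = 1, float("-inf"), [0] * len(xs)
    for k in range(1, min(max_columns, len(xs)) + 1):
        labels = kmeans_1d(xs, k)
        score = silhouette_1d(xs, labels)
        if score > best_score:  # strict comparison keeps the smaller k on ties
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

Called with the left X-coordinates of text boxes from a two-column page, for instance `detect_columns([50, 52, 51, 300, 302, 301])`, the sketch picks two columns and labels the left-hand boxes 0 and the right-hand boxes 1.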

Configuration

parser_config = {
    "chunk_token_num": 512,           # Tokens per chunk
    "delimiter": "\n。；！？",         # Chunk boundaries
    "layout_recognize": "DeepDOC",    # Layout method
    "task_page_size": 12,             # Pages per task
}

# Task executor config
MAX_CONCURRENT_TASKS = 5
EMBEDDING_BATCH_SIZE = 16
DOC_BULK_SIZE = 64
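
To show how `chunk_token_num` and `delimiter` interact, here is a minimal sketch of delimiter-aware, token-budgeted chunking. It is not the actual `naive_merge()` implementation. The function names are hypothetical, and the default token counter (whitespace-separated words) is a crude stand-in for a real tokenizer, so it will undercount text in languages written without spaces.

```python
import re


def split_on_delimiters(text, delimiter="\n。；！？"):
    """Split text after any delimiter character, keeping the delimiter."""
    pattern = "([" + re.escape(delimiter) + "])"
    parts = re.split(pattern, text)  # [segment, delim, segment, delim, ...]
    segments = []
    for i in range(0, len(parts), 2):
        segment = parts[i] + (parts[i + 1] if i + 1 < len(parts) else "")
        if segment.strip():
            segments.append(segment)
    return segments


def merge_into_chunks(text, chunk_token_num=512, delimiter="\n。；！？",
                      count_tokens=lambda s: len(s.split())):
    """Greedily pack segments into chunks of at most chunk_token_num tokens.

    A single segment longer than the budget still becomes its own chunk.
    """
    chunks, current, used = [], "", 0
    for segment in split_on_delimiters(text, delimiter):
        n = count_tokens(segment)
        if current and used + n > chunk_token_num:
            chunks.append(current)
            current, used = "", 0
        current += segment
        used += n
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only break at delimiter characters, sentences are never cut in half, at the cost of chunk sizes that vary below the configured budget.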

Related Files

/rag/svr/task_executor.py - Main executor
/deepdoc/parser/pdf_parser.py - PDF parsing
/deepdoc/vision/ocr.py - OCR engine
/rag/nlp/__init__.py - Chunking algorithms
