ragflow/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md
Claude 6d4dbbfe2c
docs: Add comprehensive DeepDoc deep guide documentation
Created in-depth documentation for understanding the deepdoc module:

- README.md: Complete deep guide with:
  - Big picture explanation (what problem deepdoc solves)
  - Data flow diagrams (Input → Processing → Output)
  - Detailed code analysis with line numbers
  - Technical explanations (ONNX, CTC, NMS, etc.)
  - Design reasoning (why certain technologies chosen)
  - Difficult terms glossary
  - Extension examples

- ocr_deep_dive.md: Deep dive into OCR subsystem
  - DBNet text detection architecture
  - CRNN text recognition
  - CTC decoding algorithm
  - Rotation handling
  - Performance optimization

- layout_table_deep_dive.md: Deep dive into layout/table recognition
  - YOLOv10 layout detection
  - Table structure recognition
  - Grid construction algorithm
  - Spanning cell handling
  - HTML/descriptive output generation
2025-11-27 03:46:14 +00:00

1286 lines
56 KiB
Markdown

# DeepDoc Module - Hướng Dẫn Đọc Hiểu Chuyên Sâu
## Mục Lục
1. [Bức Tranh Lớn](#1-bức-tranh-lớn)
2. [Luồng Dữ Liệu](#2-luồng-dữ-liệu)
3. [Phân Tích Chi Tiết Code](#3-phân-tích-chi-tiết-code)
4. [Giải Thích Kỹ Thuật](#4-giải-thích-kỹ-thuật)
5. [Lý Do Thiết Kế](#5-lý-do-thiết-kế)
6. [Thuật Ngữ Khó](#6-thuật-ngữ-khó)
7. [Mở Rộng Từ Code](#7-mở-rộng-từ-code)
---
## 1. Bức Tranh Lớn
### 1.1 DeepDoc Giải Quyết Vấn Đề Gì?
**Vấn đề cốt lõi**: Khi xây dựng hệ thống RAG (Retrieval-Augmented Generation), bạn cần chuyển đổi tài liệu (PDF, Word, Excel...) thành dạng text có cấu trúc để:
- Tìm kiếm semantic (vector search)
- Chia nhỏ (chunking) hợp lý
- Giữ nguyên ngữ cảnh của bảng, hình ảnh
**DeepDoc là gì?**: Một module Python chuyên biệt để:
```
Document Files → Structured Text + Tables + Figures
(PDF, DOCX...) (Có position, layout type, reading order)
```
### 1.2 Kiến Trúc Tổng Quan
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEEPDOC MODULE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PARSER LAYER │ │
│ │ Chuyển đổi các định dạng file thành text có cấu trúc │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ PDF │ │ DOCX │ │ Excel │ │ HTML │ │ Markdown │ │ │
│ │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ │ │ │ │ │ │ │
│ └───────┼────────────┼────────────┼────────────┼────────────┼─────────┘ │
│ │ │ │ │ │ │
│ │ └────────────┴────────────┴────────────┘ │
│ │ │ │
│ │ Text-based parsing │
│ │ (pdfplumber, python-docx, openpyxl...) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ VISION LAYER │ │
│ │ Computer Vision cho PDF phức tạp (scanned, multi-column) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────────┐ ┌────────────────────┐ │ │
│ │ │ OCR │ │ Layout Recognizer│ │ Table Structure │ │ │
│ │ │ Detection + │ │ (YOLOv10) │ │ Recognizer │ │ │
│ │ │ Recognition │ │ │ │ │ │ │
│ │ └──────────────┘ └──────────────────┘ └────────────────────┘ │ │
│ │ │ │ │ │ │
│ │ └───────────────────┴──────────────────────┘ │ │
│ │ │ │ │
│ │ ONNX Runtime Inference │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### 1.3 Các Thành Phần Chính
| Thành Phần | File | Mục Đích |
|------------|------|----------|
| **PDF Parser** | `parser/pdf_parser.py` | Parser phức tạp nhất - xử lý PDF với OCR + layout |
| **Office Parsers** | `parser/docx_parser.py`, `excel_parser.py`, `ppt_parser.py` | Xử lý file Microsoft Office |
| **Web Parsers** | `parser/html_parser.py`, `markdown_parser.py`, `json_parser.py` | Xử lý file web/markup |
| **OCR Engine** | `vision/ocr.py` | Text detection + recognition |
| **Layout Detector** | `vision/layout_recognizer.py` | Phân loại vùng (text, table, figure...) |
| **Table Detector** | `vision/table_structure_recognizer.py` | Nhận dạng cấu trúc bảng |
| **Operators** | `vision/operators.py` | Image preprocessing pipeline |
### 1.4 Tại Sao Cần DeepDoc?
**Không có DeepDoc** (naive approach):
```python
# Chỉ extract raw text từ PDF
text = pdfplumber.open("doc.pdf").pages[0].extract_text()
# Kết quả: "Header Footer Table content mixed together..."
# ❌ Mất cấu trúc, table thành text xáo trộn
```
**Với DeepDoc**:
```python
parser = RAGFlowPdfParser()
docs, tables = parser("doc.pdf")
# docs: [("Paragraph 1", "page_0_pos_100_200"), ("Paragraph 2", "page_0_pos_300_400")]
# tables: [{"html": "<table>...</table>", "bbox": [...]}]
# ✅ Giữ nguyên cấu trúc, table được parse riêng
```
---
## 2. Luồng Dữ Liệu
### 2.1 Luồng Chính: PDF Processing
```
┌────────────────────────────────────────────────────────────────────────────┐
│ PDF PROCESSING PIPELINE │
└────────────────────────────────────────────────────────────────────────────┘
Input: PDF File (path hoặc bytes)
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 1: IMAGE EXTRACTION │
│ File: pdf_parser.py, __images__() (lines 1042-1159) │
│ │
│ • Convert PDF pages → numpy images (using pdfplumber) │
│ • Extract native PDF characters (text layer) │
│ • Zoom factor: 3x (default) for OCR accuracy │
│ │
│ Output: page_images[], page_chars[] │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 2: OCR DETECTION & RECOGNITION │
│ File: vision/ocr.py, OCR.__call__() (lines 708-751) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ TextDetector │ → │ Crop & │ → │TextRecognizer│ │
│ │ (DBNet) │ │ Rotate │ │ (CRNN) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ • Detect text regions → bounding boxes │
│ • Crop each region, auto-rotate if needed │
│ • Recognize text in each region │
│ │
│ Output: boxes[] with {text, confidence, coordinates} │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 3: LAYOUT RECOGNITION │
│ File: vision/layout_recognizer.py, __call__() (lines 63-157) │
│ │
│ • Run YOLOv10 model on page image │
│ • Detect 10 layout types: Text, Title, Table, Figure, etc. │
│ • Match OCR boxes to layout regions │
│ │
│ Output: boxes[] with added {layout_type, layoutno} │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 4: COLUMN DETECTION │
│ File: pdf_parser.py, _assign_column() (lines 355-440) │
│ │
│ • K-Means clustering on X coordinates │
│ • Silhouette score to find optimal k (1-4 columns) │
│ • Assign col_id to each text box │
│ │
│ Output: boxes[] with added {col_id} │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 5: TABLE STRUCTURE RECOGNITION │
│ File: vision/table_structure_recognizer.py, __call__() (lines 67-111) │
│ │
│ • Detect rows, columns, headers, spanning cells │
│ • Match text boxes to table cells │
│ • Build 2D table matrix │
│ │
│ Output: table_components[] with grid structure │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 6: TEXT MERGING │
│ File: pdf_parser.py, _text_merge() (lines 442-478) │
│ _naive_vertical_merge() (lines 480-556) │
│ │
│ • Horizontal merge: same line, same column, same layout │
│ • Vertical merge: adjacent paragraphs with semantic checks │
│ • Respect sentence boundaries (。?!) │
│ │
│ Output: merged_boxes[] (fewer, larger text blocks) │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 7: FILTERING & CLEANUP │
│ File: pdf_parser.py, _filter_forpages() (lines 685-729) │
│ __filterout_scraps() (lines 971-1029) │
│ │
│ • Remove headers/footers (top/bottom 10% of page) │
│ • Remove table of contents │
│ • Filter low-quality OCR results │
│ │
│ Output: clean_boxes[] │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 8: EXTRACT TABLES & FIGURES │
│ File: pdf_parser.py, _extract_table_figure() (lines 757-930) │
│ │
│ • Convert table boxes to HTML/descriptive text │
│ • Extract figure images with captions │
│ • Handle spanning cells (colspan, rowspan) │
│ │
│ Output: tables[], figures[] │
└──────────────────────────────────┬──────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ FINAL OUTPUT │
│ │
│ documents: [(text, position_tag), ...] │
│ tables: [{"html": "...", "bbox": [...], "image": ...}, ...] │
│ │
│ position_tag format: "page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}" │
└─────────────────────────────────────────────────────────────────────────────┘
```
### 2.2 Luồng OCR Chi Tiết
```
Input Image (H, W, 3)
┌─────────────────────────────────────────────────────────────────────────────┐
│ TEXT DETECTION (DBNet) │
│ File: vision/ocr.py, TextDetector.__call__() (lines 503-530) │
└─────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────┼────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Preprocess │ │ ONNX │ │ Postprocess │
│ │ │ Inference │ │ │
│ • Resize │ → │ │ → │ • Threshold │
│ • Normalize │ │ DBNet │ │ • Contours │
│ • Transpose │ │ Model │ │ • Unclip │
└─────────────┘ └─────────────┘ └─────────────┘
Text Region Polygons
[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
┌─────────────────────────────────────────────────────────────────────────────┐
│ TEXT RECOGNITION (CRNN) │
│ File: vision/ocr.py, TextRecognizer.__call__() (lines 363-408) │
└─────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────┼────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Crop │ │ ONNX │ │ CTC Decode │
│ Rotate │ │ Inference │ │ │
│ │ → │ │ → │ • Argmax │
│ Perspective │ │ CRNN │ │ • Dedup │
│ Transform │ │ Model │ │ • Remove ε │
└─────────────┘ └─────────────┘ └─────────────┘
Output: [(box, (text, confidence)), ...]
```
### 2.3 Luồng Layout Recognition
```
Input: Page Image + OCR Results
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYOUT DETECTION (YOLOv10) │
│ File: vision/layout_recognizer.py (lines 163-237) │
└─────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────┼────────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Preprocess │ │ ONNX │ │ Postprocess │
│ │ │ Inference │ │ │
│ • Resize │ → │ │ → │ • NMS │
│ (640x640) │ │ YOLOv10 │ │ • Filter │
│ • Pad │ │ Model │ │ • Scale │
│ • Normalize │ │ │ │ back │
└─────────────┘ └─────────────┘ └─────────────┘
Layout Detections:
[{"type": "Table", "bbox": [...], "score": 0.95}]
┌─────────────────────────────────────────────────────────────────────────────┐
│ OCR-LAYOUT ASSOCIATION │
│ File: vision/layout_recognizer.py (lines 98-147) │
│ │
│ For each OCR box: │
│ • Find overlapping layout region (threshold: 40%) │
│ • Assign layout_type to OCR box │
│ • Filter garbage (headers/footers/page numbers) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Output: OCR boxes with layout_type attribute
[{"text": "...", "layout_type": "Text", "layoutno": 1}]
```
### 2.4 Data Flow Summary
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ PDF File │ → │ Images │ → │ OCR Boxes │ → │ Merged │
│ │ │ + Chars │ │ + Layout │ │ Documents │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
┌─────────────┐
│ Tables │
│ (HTML/Desc)│
└─────────────┘
Input Format:
- File path: str (e.g., "/path/to/doc.pdf")
- Or bytes: bytes (raw PDF content)
Output Format:
- documents: List[Tuple[str, str]]
- text: Extracted text content
- position_tag: "page_0_x0_100_y0_200_x1_500_y1_250"
- tables: List[Dict]
- html: "<table>...</table>"
- bbox: [x0, y0, x1, y1]
- image: numpy array (optional)
```
---
## 3. Phân Tích Chi Tiết Code
### 3.1 RAGFlowPdfParser Class
**File**: `/deepdoc/parser/pdf_parser.py`
**Lines**: 52-1479
#### 3.1.1 Constructor (__init__)
```python
# Line 52-104
class RAGFlowPdfParser:
def __init__(self, **kwargs):
# Load OCR model
self.ocr = OCR() # vision/ocr.py
# Load Layout Recognizer (YOLOv10)
self.layout_recognizer = LayoutRecognizer() # vision/layout_recognizer.py
# Load Table Structure Recognizer
self.tsr = TableStructureRecognizer() # vision/table_structure_recognizer.py
# Load XGBoost model for text concatenation
try:
self.updown_cnt_mdl = xgb.Booster()
model_path = os.path.join(get_project_base_directory(),
"rag/res/deepdoc/updown_concat_xgb.model")
self.updown_cnt_mdl.load_model(model_path)
except Exception as e:
self.updown_cnt_mdl = None
```
**Giải thích**:
- Constructor khởi tạo 4 models:
1. **OCR**: Text detection + recognition
2. **LayoutRecognizer**: Phân loại vùng layout (YOLOv10)
3. **TableStructureRecognizer**: Nhận dạng cấu trúc bảng
4. **XGBoost**: Quyết định merge text blocks (31 features)
#### 3.1.2 Main Entry Point (__call__)
```python
# Lines 1160-1168
def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
"""
Main entry point for PDF parsing.
Args:
fnm: File path or bytes
need_image: Whether to extract images
zoomin: Zoom factor for OCR (default 3x)
return_html: Return HTML tables instead of descriptive text
Returns:
(documents, tables) tuple
"""
self.__images__(fnm, zoomin) # Step 1: Load images
self._layouts_rec(zoomin) # Step 2-3: OCR + Layout
self._table_transformer_job(zoomin) # Step 4: Table structure
self._text_merge(zoomin) # Step 5: Merge text
self._filter_forpages() # Step 6: Filter
tbls = self._extract_table_figure(...) # Step 7: Extract tables
return self._final_result(), tbls # Final output
```
**Tại sao zoomin=3?**
- OCR accuracy tăng đáng kể khi image lớn hơn
- 3x là balance giữa accuracy và memory/speed
- Quá lớn (5x+) → memory issues, quá nhỏ (1x) → OCR errors
#### 3.1.3 Image Loading (__images__)
```python
# Lines 1042-1159
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
"""
Load PDF pages as images and extract native characters.
"""
self.page_images = []
self.page_chars = []
# Open PDF with pdfplumber
with pdfplumber.open(fnm) as pdf:
for i, page in enumerate(pdf.pages[page_from:page_to]):
# Convert page to image
img = page.to_image(resolution=72 * zoomin)
img = np.array(img.original)
self.page_images.append(img)
# Extract native PDF characters
chars = page.chars
self.page_chars.append(chars)
```
**Tại sao dùng pdfplumber?**
- Hỗ trợ cả text extraction và image conversion
- Giữ được character-level coordinates
- Xử lý tốt các PDF phức tạp
#### 3.1.4 Column Detection (_assign_column)
```python
# Lines 355-440
def _assign_column(self, boxes, zoomin=3):
"""
Detect columns using K-Means clustering on X coordinates.
"""
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Extract X coordinates
x_coords = np.array([[b["x0"]] for b in boxes])
best_k = 1
best_score = -1
# Try k from 1 to 4
for k in range(1, min(5, len(boxes))):
km = KMeans(n_clusters=k, random_state=42, n_init="auto")
labels = km.fit_predict(x_coords)
if k > 1:
score = silhouette_score(x_coords, labels)
if score > best_score:
best_score = score
best_k = k
# Final clustering with best k
km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
labels = km.fit_predict(x_coords)
# Assign column IDs
for i, box in enumerate(boxes):
box["col_id"] = labels[i]
```
**Tại sao K-Means?**
- Unsupervised: không cần training data
- Fast: O(n * k * iterations)
- Silhouette score tự động chọn số cột
### 3.2 OCR Class
**File**: `/deepdoc/vision/ocr.py`
**Lines**: 536-752
#### 3.2.1 Text Detection (TextDetector)
```python
# Lines 414-534
class TextDetector:
def __init__(self, model_dir, device_id=None):
# Preprocessing pipeline
self.preprocess_op = [
DetResizeForTest(limit_side_len=960, limit_type='max'),
NormalizeImage(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
ToCHWImage(),
KeepKeys(keep_keys=['image', 'shape'])
]
# Postprocessing
self.postprocess_op = DBPostProcess(
thresh=0.3, # Binary threshold
box_thresh=0.5, # Box confidence threshold
max_candidates=1000, # Max text regions
unclip_ratio=1.5 # Box expansion ratio
)
# Load ONNX model
self.ort_sess, self.run_opts = load_model(model_dir, "det", device_id)
```
**DBNet (Differentiable Binarization)**:
- Input: Image → Probability map (text regions)
- Thresholding: prob > 0.3 → foreground
- Unclipping: Expand boxes by 1.5x để capture full text
#### 3.2.2 Text Recognition (TextRecognizer)
```python
# Lines 133-412
class TextRecognizer:
def __init__(self, model_dir, device_id=None):
self.rec_image_shape = [3, 48, 320] # C, H, W
self.batch_size = 16
# Load CRNN model
self.ort_sess, self.run_opts = load_model(model_dir, "rec", device_id)
# CTC decoder
self.postprocess_op = CTCLabelDecode(character_dict_path=dict_path)
def __call__(self, img_list):
# Sort by aspect ratio for efficient batching
indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list])
results = []
for batch in chunks(indices, self.batch_size):
# Normalize images
norm_imgs = [self.resize_norm_img(img_list[i]) for i in batch]
# Run inference
preds = self.ort_sess.run(None, {"input": np.stack(norm_imgs)})
# CTC decode
texts = self.postprocess_op(preds[0])
results.extend(texts)
return results
```
**CRNN + CTC**:
- CNN: Extract visual features
- RNN: Sequence modeling
- CTC: Alignment-free decoding (handles variable-length text)
#### 3.2.3 Rotation Handling
```python
# Lines 584-638
def get_rotate_crop_image(self, img, points):
"""
Crop text region with auto-rotation detection.
"""
# Get perspective transform
rect = self.order_points_clockwise(points)
M = cv2.getPerspectiveTransform(rect, dst_pts)
warped = cv2.warpPerspective(img, M, (width, height))
# Check if text is vertical (height > 1.5 * width)
if warped.shape[0] / warped.shape[1] >= 1.5:
# Try 3 orientations
scores = []
for angle in [0, 90, -90]:
rotated = self.rotate(warped, angle)
_, conf = self.recognizer([rotated])[0]
scores.append(conf)
# Use orientation with highest confidence
best_angle = [0, 90, -90][np.argmax(scores)]
warped = self.rotate(warped, best_angle)
return warped
```
**Tại sao cần auto-rotation?**
- PDF có thể chứa text xoay 90°
- OCR model trained on horizontal text
- Auto-detect giúp nhận dạng text dọc chính xác
### 3.3 Layout Recognizer
**File**: `/deepdoc/vision/layout_recognizer.py`
**Lines**: 33-237
#### 3.3.1 YOLOv10 Preprocessing
```python
# Lines 186-209
def preprocess(self, image_list):
"""
Preprocess images for YOLOv10 inference.
"""
processed = []
for img in image_list:
h, w = img.shape[:2]
# Calculate scale (preserve aspect ratio)
r = min(640/h, 640/w)
new_h, new_w = int(h*r), int(w*r)
# Resize
resized = cv2.resize(img, (new_w, new_h))
# Pad to 640x640 (center padding, gray color)
padded = np.full((640, 640, 3), 114, dtype=np.uint8)
pad_top = (640 - new_h) // 2
pad_left = (640 - new_w) // 2
padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized
# Normalize and transpose
padded = padded.astype(np.float32) / 255.0
padded = padded.transpose(2, 0, 1) # HWC → CHW
processed.append(padded)
return np.stack(processed)
```
**Tại sao 640x640?**
- YOLOv10 standard input size
- Balance accuracy vs speed
- 32-stride alignment (640 = 20 * 32)
#### 3.3.2 Layout Types
```python
# Lines 34-46
labels = [
"_background_", # 0: Background (ignored)
"Text", # 1: Body text paragraphs
"Title", # 2: Section/document titles
"Figure", # 3: Images, diagrams, charts
"Figure caption", # 4: Text describing figures
"Table", # 5: Data tables
"Table caption", # 6: Text describing tables
"Header", # 7: Page headers
"Footer", # 8: Page footers
"Reference", # 9: Bibliography, citations
"Equation", # 10: Mathematical equations
]
```
### 3.4 Table Structure Recognizer
**File**: `/deepdoc/vision/table_structure_recognizer.py`
**Lines**: 30-613
#### 3.4.1 Table Grid Construction
```python
# Lines 172-349
@staticmethod
def construct_table(boxes, is_english=False, html=True, **kwargs):
"""
Construct 2D table from detected components.
"""
# Step 1: Sort by row
boxes = Recognizer.sort_R_firstly(boxes, rowh/2)
# Step 2: Group into rows
rows = []
current_row = [boxes[0]]
for box in boxes[1:]:
if box["top"] - current_row[-1]["bottom"] > rowh/2:
rows.append(current_row)
current_row = [box]
else:
current_row.append(box)
rows.append(current_row)
# Step 3: Sort each row by column
for row in rows:
row.sort(key=lambda x: x["x0"])
# Step 4: Build 2D matrix
n_cols = max(len(row) for row in rows)
table = [[None] * n_cols for _ in range(len(rows))]
for i, row in enumerate(rows):
for j, cell in enumerate(row):
table[i][j] = cell["text"]
# Step 5: Generate output
if html:
return generate_html_table(table)
else:
return generate_descriptive_text(table)
```
#### 3.4.2 Spanning Cell Handling
```python
# Lines 496-575
def __cal_spans(self, boxes):
"""
Calculate colspan and rowspan for merged cells.
"""
for box in boxes:
if "SP" not in box: # Not a spanning cell
continue
# Find which rows this cell spans
box["rowspan"] = []
for i, row_box in enumerate(self.rows):
if self.overlapped_area(box, row_box) > 0.3:
box["rowspan"].append(i)
# Find which columns this cell spans
box["colspan"] = []
for j, col_box in enumerate(self.cols):
if self.overlapped_area(box, col_box) > 0.3:
box["colspan"].append(j)
```
---
## 4. Giải Thích Kỹ Thuật
### 4.1 ONNX Runtime
**ONNX là gì?**
- Open Neural Network Exchange
- Format chuẩn cho deep learning models
- Chạy trên nhiều hardware (CPU, GPU, NPU)
**Tại sao dùng ONNX?**
```python
# Không cần PyTorch/TensorFlow runtime
# Lightweight inference
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
output = session.run(None, {"input": input_data})
```
**Cấu hình trong DeepDoc**:
```python
# vision/ocr.py, lines 96-127
options = ort.SessionOptions()
options.enable_cpu_mem_arena = False # Giảm memory fragmentation
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.intra_op_num_threads = 2 # Threads per operator
options.inter_op_num_threads = 2 # Parallel operators
# GPU configuration
if torch.cuda.is_available():
providers = [
('CUDAExecutionProvider', {
'device_id': device_id,
'gpu_mem_limit': 2 * 1024 * 1024 * 1024, # 2GB
})
]
```
### 4.2 CTC Decoding
**CTC (Connectionist Temporal Classification)**:
- Giải quyết alignment problem trong sequence-to-sequence
- Không cần biết vị trí chính xác của từng ký tự
**Ví dụ**:
```
OCR Model Output (time steps):
[a, a, a, -, l, l, -, p, p, h, h, a, -]
CTC Decoding:
1. Merge consecutive duplicates: [a, -, l, -, p, h, a, -]
2. Remove blank tokens (-): [a, l, p, h, a]
3. Result: "alpha"
```
**Implementation**:
```python
# vision/postprocess.py, lines 355-366
def __call__(self, preds, label=None):
# Get most probable character at each position
preds_idx = preds.argmax(axis=2) # Shape: (batch, time)
preds_prob = preds.max(axis=2) # Confidence scores
# Decode with deduplication
text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
return text
```
### 4.3 Non-Maximum Suppression (NMS)
**NMS là gì?**
- Loại bỏ duplicate detections
- Giữ lại box có confidence cao nhất
**Algorithm**:
```
1. Sort boxes by confidence (descending)
2. Pick box with highest score → add to results
3. Remove boxes with IoU > threshold (e.g., 0.5)
4. Repeat until no boxes remain
```
**Implementation**:
```python
# vision/operators.py, lines 702-725
def nms(bboxes, scores, iou_thresh):
indices = []
index = scores.argsort()[::-1] # Sort descending
while index.size > 0:
i = index[0]
indices.append(i)
# Compute IoU with remaining boxes
ious = compute_iou(bboxes[i], bboxes[index[1:]])
# Keep only boxes with IoU <= threshold
mask = ious <= iou_thresh
index = index[1:][mask]
return indices
```
### 4.4 DBNet (Differentiable Binarization)
**DBNet là gì?**
- Text detection network
- Tạo probability map + threshold map
- Differentiable binarization cho end-to-end training
**Pipeline**:
```
Image → CNN Backbone → Feature Map →
├→ Probability Map (text regions)
└→ Threshold Map (adaptive threshold)
Final = Probability > Threshold (pixel-wise)
```
**Post-processing**:
```python
# vision/postprocess.py, DBPostProcess
def __call__(self, outs_dict, shape_list):
pred = outs_dict["maps"]
# Binary thresholding
bitmap = pred > self.thresh # 0.3
# Find contours
contours = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
# Unclip (expand) boxes
for contour in contours:
box = self.unclip(contour, self.unclip_ratio) # 1.5x expansion
boxes.append(box)
```
### 4.5 K-Means cho Column Detection
**Tại sao K-Means?**
- Text boxes trong cùng cột có X coordinate tương tự
- K-Means cluster các X values
- Silhouette score chọn số cột tối ưu
**Silhouette Score**:
```
s(i) = (b(i) - a(i)) / max(a(i), b(i))
- a(i): Average distance to same cluster
- b(i): Average distance to nearest other cluster
- Range: [-1, 1], higher = better clustering
```
**Ví dụ**:
```
Page with 2 columns:
Left column boxes: x0 = [50, 52, 48, 51, ...]
Right column boxes: x0 = [400, 398, 402, 399, ...]
K-Means (k=2):
- Cluster 0: x0 ≈ 50 (left column)
- Cluster 1: x0 ≈ 400 (right column)
Silhouette score ≈ 0.95 (high, good separation)
```
---
## 5. Lý Do Thiết Kế
### 5.1 Tại Sao Dùng Multiple Models?
**Vấn đề**: Một model không thể handle tất cả tasks
| Task | Model Type | Lý Do |
|------|------------|-------|
| Text Detection | DBNet | Specialized cho text regions |
| Text Recognition | CRNN | Sequential text với CTC |
| Layout Detection | YOLOv10 | Object detection tốt nhất |
| Table Structure | YOLOv10 variant | Fine-tuned cho table elements |
**Trade-off**:
- Pros: Mỗi model optimized cho task riêng
- Cons: Nhiều models → nhiều memory, complexity
### 5.2 Tại Sao Dùng XGBoost cho Text Merging?
**Vấn đề**: Merge text blocks là decision phức tạp
**Rule-based approach** (naive):
```python
# Simple heuristics
if y_distance < threshold and same_column:
merge()
# ❌ Không handle edge cases tốt
```
**ML approach** (XGBoost):
```python
# 31 features capturing various signals
features = [
y_distance / char_height, # Distance feature
ends_with_punctuation, # Text pattern
same_layout_type, # Layout feature
font_size_ratio, # Typography
...
]
# ✅ Learns complex patterns from data
```
**Tại sao XGBoost?**
- Fast inference (tree-based)
- Handles mixed feature types well
- Pre-trained model included
### 5.3 Tại Sao ONNX thay vì PyTorch/TensorFlow?
| Aspect | ONNX Runtime | PyTorch |
|--------|--------------|---------|
| Size | ~50MB | ~500MB+ |
| Memory | Lower | Higher |
| Startup | Fast | Slow (JIT) |
| Dependencies | Minimal | Many |
| Multi-platform | Yes | Limited |
**DeepDoc choice**: ONNX cho production deployment
- Không cần PyTorch runtime
- Lighter memory footprint
- Faster cold start
### 5.4 Tại Sao Zoomin = 3?
**Experiment results**:
```
zoomin=1: OCR accuracy ~70%, fast
zoomin=2: OCR accuracy ~85%, moderate
zoomin=3: OCR accuracy ~95%, acceptable speed ← chosen
zoomin=4: OCR accuracy ~97%, slow
zoomin=5: OCR accuracy ~98%, very slow, memory issues
```
**Balance**: 3x là sweet spot giữa accuracy và resource usage
### 5.5 Tại Sao Hybrid Text Extraction?
**Native PDF text** (pdfplumber):
- Pros: Accurate, fast, preserves fonts
- Cons: Không có cho scanned PDFs
**OCR text**:
- Pros: Works on any image
- Cons: Slower, potential errors
**Hybrid approach**:
```python
# Prefer native text, fallback to OCR
for box in ocr_boxes:
# Try to match with native characters
matched_chars = find_overlapping_chars(box, native_chars)
if matched_chars:
box["text"] = "".join(matched_chars) # Use native
else:
box["text"] = ocr_result # Use OCR
```
### 5.6 Pipeline vs End-to-End Model
**End-to-End** (e.g., Donut, Pix2Struct):
- Single model: Image → Structured output
- Pros: Simple, unified
- Cons: Less accurate on specific tasks, hard to debug
**Pipeline** (DeepDoc's choice):
- Multiple specialized models
- Pros:
- Each model optimized for task
- Easy to debug/improve individual components
- Mix and match different models
- Cons:
- More complexity
- Potential error accumulation
**DeepDoc's rationale**: Pipeline cho flexibility và accuracy
---
## 6. Thuật Ngữ Khó
### 6.1 Computer Vision Terms
| Term | Definition | Ví Dụ trong DeepDoc |
|------|------------|---------------------|
| **Bounding Box** | Hình chữ nhật bao quanh object | `[x0, y0, x1, y1]` coordinates |
| **IoU** | Intersection over Union - đo overlap | NMS threshold 0.5 |
| **NMS** | Non-Maximum Suppression | Loại duplicate detections |
| **Anchor** | Predefined box sizes | YOLOv10 anchors |
| **Stride** | Downsampling factor | 32 trong YOLOv10 |
| **FPN** | Feature Pyramid Network | Multi-scale detection |
### 6.2 OCR Terms
| Term | Definition | Ví Dụ trong DeepDoc |
|------|------------|---------------------|
| **CTC** | Connectionist Temporal Classification | CRNN output decoding |
| **CRNN** | CNN + RNN | Text recognition model |
| **DBNet** | Differentiable Binarization | Text detection model |
| **Unclip** | Expand polygon boundary | 1.5x expansion ratio |
### 6.3 ML Terms
| Term | Definition | Ví Dụ trong DeepDoc |
|------|------------|---------------------|
| **ONNX** | Open Neural Network Exchange | Model format |
| **Inference** | Running model on input | `session.run()` |
| **Batch** | Multiple inputs processed together | batch_size=16 |
| **Confidence** | Model's certainty score | 0.0 - 1.0 |
### 6.4 Document Processing Terms
| Term | Definition | Ví Dụ trong DeepDoc |
|------|------------|---------------------|
| **Layout** | Document structure | Text, Table, Figure |
| **TSR** | Table Structure Recognition | Row, Column detection |
| **Spanning Cell** | Merged table cell | colspan, rowspan |
| **Reading Order** | Text flow sequence | Top-to-bottom, left-to-right |
---
## 7. Mở Rộng Từ Code
### 7.1 Thêm Parser Mới
**Ví dụ**: Add RTF parser
```python
# deepdoc/parser/rtf_parser.py
from striprtf.striprtf import rtf_to_text
class RAGFlowRtfParser:
def __call__(self, fnm, binary=None, chunk_token_num=128):
if binary:
content = binary.decode('utf-8')
else:
with open(fnm, 'r') as f:
content = f.read()
text = rtf_to_text(content)
# Chunk text
chunks = self._chunk(text, chunk_token_num)
return [(chunk, f"rtf_chunk_{i}") for i, chunk in enumerate(chunks)]
```
### 7.2 Thêm Layout Type Mới
**Ví dụ**: Add "Code Block" layout
```python
# vision/layout_recognizer.py
labels = [
"_background_",
"Text",
"Title",
...
"Code Block", # New label (index 11)
]
# Train new YOLOv10 model with "Code Block" annotations
# Update model file
```
### 7.3 Custom Text Merging Logic
```python
# Override default merging behavior
class CustomPdfParser(RAGFlowPdfParser):
def _should_merge(self, box1, box2):
"""Custom merge logic"""
# Don't merge code blocks
if box1.get("layout_type") == "Code Block":
return False
# Use default logic otherwise
return super()._should_merge(box1, box2)
```
### 7.4 Thêm Output Format
```python
# Add Markdown output format
def to_markdown(self, documents, tables):
md_parts = []
for text, pos_tag in documents:
# Detect if title
if self._is_title(text):
md_parts.append(f"## {text}\n")
else:
md_parts.append(f"{text}\n\n")
# Convert tables to markdown
for table in tables:
md_table = html_to_markdown(table["html"])
md_parts.append(md_table)
return "\n".join(md_parts)
```
### 7.5 Optimize Performance
**GPU Batching**:
```python
# Process multiple pages in parallel
def _parallel_ocr(self, images, batch_size=4):
with ThreadPoolExecutor(max_workers=4) as executor:
futures = []
for batch in chunks(images, batch_size):
future = executor.submit(self.ocr, batch)
futures.append(future)
results = [f.result() for f in futures]
return results
```
**Caching**:
```python
# Cache model instances
_model_cache = {}
def get_ocr_model(model_dir, device_id):
key = f"{model_dir}_{device_id}"
if key not in _model_cache:
_model_cache[key] = OCR(model_dir, device_id)
return _model_cache[key]
```
### 7.6 Integration với RAG Pipeline
```python
# rag/app/pdf.py (example integration)
from deepdoc.parser import RAGFlowPdfParser
def process_pdf_for_rag(file_path, chunk_size=512):
parser = RAGFlowPdfParser()
# Parse PDF
documents, tables = parser(file_path)
# Chunk documents
chunks = []
for text, pos_tag in documents:
for chunk in chunk_text(text, chunk_size):
chunks.append({
"text": chunk,
"metadata": {"position": pos_tag}
})
# Add tables as separate chunks
for table in tables:
chunks.append({
"text": table["html"],
"metadata": {"type": "table", "bbox": table["bbox"]}
})
return chunks
```
---
## 8. Tổng Kết
### 8.1 Key Takeaways
1. **DeepDoc = Parser Layer + Vision Layer**
- Parser: Format-specific handling (PDF, DOCX, etc.)
- Vision: OCR + Layout + Table recognition
2. **Pipeline Architecture**
- Multiple specialized models
- Easy to debug and improve
3. **ONNX Runtime**
- Lightweight inference
- Cross-platform compatibility
4. **Hybrid Text Extraction**
- Native PDF text khi available
- OCR fallback cho scanned documents
### 8.2 Diagram Tổng Hợp
```
┌──────────────────────────────────────────────────────────────────────────────┐
│ DEEPDOC SUMMARY │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT PROCESSING OUTPUT │
│ ───── ────────── ────── │
│ │
│ ┌─────────┐ ┌────────────────────────────┐ ┌─────────────────┐ │
│ │ PDF │────▶│ 1. Image Extraction │─────▶│ Documents │ │
│ │ DOCX │ │ 2. OCR (DBNet + CRNN) │ │ [(text, pos)] │ │
│ │ Excel │ │ 3. Layout (YOLOv10) │ │ │ │
│ │ HTML │ │ 4. Column Detection │ │ Tables │ │
│ │ ... │ │ 5. Table Structure │ │ [html, bbox] │ │
│ └─────────┘ │ 6. Text Merging │ │ │ │
│ │ 7. Quality Filtering │ │ Figures │ │
│ └────────────────────────────┘ │ [image, cap] │ │
│ └─────────────────┘ │
│ │
│ MODELS USED: │
│ ──────────── │
│ • DBNet (Text Detection) - ONNX, ~30MB │
│ • CRNN (Text Recognition) - ONNX, ~20MB │
│ • YOLOv10 (Layout Detection) - ONNX, ~50MB │
│ • YOLOv10 (Table Structure) - ONNX, ~50MB │
│ • XGBoost (Text Merging) - Binary, ~5MB │
│ │
│ KEY ALGORITHMS: │
│ ─────────────── │
│ • CTC Decoding (text recognition) │
│ • NMS (duplicate removal) │
│ • K-Means (column detection) │
│ • IoU (overlap calculation) │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
```
### 8.3 Files Reference
| File | Lines | Description |
|------|-------|-------------|
| `parser/pdf_parser.py` | 1479 | Main PDF parser |
| `vision/ocr.py` | 752 | OCR detection + recognition |
| `vision/layout_recognizer.py` | 457 | Layout detection |
| `vision/table_structure_recognizer.py` | 613 | Table structure |
| `vision/recognizer.py` | 443 | Base recognizer class |
| `vision/operators.py` | 726 | Image preprocessing |
| `vision/postprocess.py` | 371 | Post-processing utilities |
---
*Document created for RAGFlow v0.22.1 analysis*