diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md
new file mode 100644
index 000000000..45c4d3fcb
--- /dev/null
+++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md
@@ -0,0 +1,1286 @@
+# DeepDoc Module - In-Depth Reading Guide
+
+## Table of Contents
+
+1. [The Big Picture](#1-the-big-picture)
+2. [Data Flow](#2-data-flow)
+3. [Detailed Code Analysis](#3-detailed-code-analysis)
+4. [Technical Explanations](#4-technical-explanations)
+5. [Design Rationale](#5-design-rationale)
+6. [Glossary](#6-glossary)
+7. [Extending the Code](#7-extending-the-code)
+8. [Summary](#8-summary)
+
+---
+
+## 1. The Big Picture
+
+### 1.1 What Problem Does DeepDoc Solve?
+
+**Core problem**: when building a RAG (Retrieval-Augmented Generation) system, you need to convert documents (PDF, Word, Excel...) into structured text so that you can:
+- Run semantic search (vector search)
+- Chunk the content sensibly
+- Preserve the context of tables and figures
+
+**What is DeepDoc?** A Python module specialized in exactly this conversion:
+```
+Document Files → Structured Text + Tables + Figures
+(PDF, DOCX...)     (with position, layout type, reading order)
+```
+
+### 1.2 Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ DEEPDOC MODULE │
+├─────────────────────────────────────────────────────────────────────────────┤
+│ │
+│ ┌─────────────────────────────────────────────────────────────────────┐ │
+│ │ PARSER LAYER │ │
+│  │          Converts different file formats into structured text          │   │
+│ │ │ │
+│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
+│ │ │ PDF │ │ DOCX │ │ Excel │ │ HTML │ │ Markdown │ │ │
+│ │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ │
+│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
+│ │ │ │ │ │ │ │ │
+│ └───────┼────────────┼────────────┼────────────┼────────────┼─────────┘ │
+│ │ │ │ │ │ │
+│ │ └────────────┴────────────┴────────────┘ │
+│ │ │ │
+│ │ Text-based parsing │
+│ │ (pdfplumber, python-docx, openpyxl...) │
+│ │ │
+│ ▼ │
+│ ┌─────────────────────────────────────────────────────────────────────┐ │
+│ │ VISION LAYER │ │
+│  │        Computer vision for complex PDFs (scanned, multi-column)        │   │
+│ │ │ │
+│ │ ┌──────────────┐ ┌──────────────────┐ ┌────────────────────┐ │ │
+│ │ │ OCR │ │ Layout Recognizer│ │ Table Structure │ │ │
+│ │ │ Detection + │ │ (YOLOv10) │ │ Recognizer │ │ │
+│ │ │ Recognition │ │ │ │ │ │ │
+│ │ └──────────────┘ └──────────────────┘ └────────────────────┘ │ │
+│ │ │ │ │ │ │
+│ │ └───────────────────┴──────────────────────┘ │ │
+│ │ │ │ │
+│ │ ONNX Runtime Inference │ │
+│ │ │ │
+│ └─────────────────────────────────────────────────────────────────────┘ │
+│ │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+### 1.3 Main Components
+
+| Component | File | Purpose |
+|------------|------|----------|
+| **PDF Parser** | `parser/pdf_parser.py` | The most complex parser - handles PDFs with OCR + layout |
+| **Office Parsers** | `parser/docx_parser.py`, `excel_parser.py`, `ppt_parser.py` | Handle Microsoft Office files |
+| **Web Parsers** | `parser/html_parser.py`, `markdown_parser.py`, `json_parser.py` | Handle web/markup files |
+| **OCR Engine** | `vision/ocr.py` | Text detection + recognition |
+| **Layout Detector** | `vision/layout_recognizer.py` | Classifies regions (text, table, figure...) |
+| **Table Detector** | `vision/table_structure_recognizer.py` | Recognizes table structure |
+| **Operators** | `vision/operators.py` | Image preprocessing pipeline |
+
+### 1.4 Why Do We Need DeepDoc?
+
+**Without DeepDoc** (naive approach):
+```python
+# Only extract raw text from the PDF
+text = pdfplumber.open("doc.pdf").pages[0].extract_text()
+# Result: "Header Footer Table content mixed together..."
+# ❌ Structure is lost and tables turn into scrambled text
+```
+
+**With DeepDoc**:
+```python
+parser = RAGFlowPdfParser()
+docs, tables = parser("doc.pdf")
+# docs: [("Paragraph 1", "page_0_pos_100_200"), ("Paragraph 2", "page_0_pos_300_400")]
+# tables: [{"html": "<table>...</table>", "bbox": [...]}]
+# ✅ Structure is preserved and tables are parsed separately
+```
+
+---
+
+## 2. Data Flow
+
+### 2.1 Main Flow: PDF Processing
+
+```
+┌────────────────────────────────────────────────────────────────────────────┐
+│ PDF PROCESSING PIPELINE │
+└────────────────────────────────────────────────────────────────────────────┘
+
+Input: PDF File (path or bytes)
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 1: IMAGE EXTRACTION │
+│ File: pdf_parser.py, __images__() (lines 1042-1159) │
+│ │
+│ • Convert PDF pages → numpy images (using pdfplumber) │
+│ • Extract native PDF characters (text layer) │
+│ • Zoom factor: 3x (default) for OCR accuracy │
+│ │
+│ Output: page_images[], page_chars[] │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 2: OCR DETECTION & RECOGNITION │
+│ File: vision/ocr.py, OCR.__call__() (lines 708-751) │
+│ │
+│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
+│ │ TextDetector │ → │ Crop & │ → │TextRecognizer│ │
+│ │ (DBNet) │ │ Rotate │ │ (CRNN) │ │
+│ └──────────────┘ └──────────────┘ └──────────────┘ │
+│ │
+│ • Detect text regions → bounding boxes │
+│ • Crop each region, auto-rotate if needed │
+│ • Recognize text in each region │
+│ │
+│ Output: boxes[] with {text, confidence, coordinates} │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 3: LAYOUT RECOGNITION │
+│ File: vision/layout_recognizer.py, __call__() (lines 63-157) │
+│ │
+│ • Run YOLOv10 model on page image │
+│ • Detect 10 layout types: Text, Title, Table, Figure, etc. │
+│ • Match OCR boxes to layout regions │
+│ │
+│ Output: boxes[] with added {layout_type, layoutno} │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 4: COLUMN DETECTION │
+│ File: pdf_parser.py, _assign_column() (lines 355-440) │
+│ │
+│ • K-Means clustering on X coordinates │
+│ • Silhouette score to find optimal k (1-4 columns) │
+│ • Assign col_id to each text box │
+│ │
+│ Output: boxes[] with added {col_id} │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 5: TABLE STRUCTURE RECOGNITION │
+│ File: vision/table_structure_recognizer.py, __call__() (lines 67-111) │
+│ │
+│ • Detect rows, columns, headers, spanning cells │
+│ • Match text boxes to table cells │
+│ • Build 2D table matrix │
+│ │
+│ Output: table_components[] with grid structure │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 6: TEXT MERGING │
+│ File: pdf_parser.py, _text_merge() (lines 442-478) │
+│ _naive_vertical_merge() (lines 480-556) │
+│ │
+│ • Horizontal merge: same line, same column, same layout │
+│ • Vertical merge: adjacent paragraphs with semantic checks │
+│ • Respect sentence boundaries (。?!) │
+│ │
+│ Output: merged_boxes[] (fewer, larger text blocks) │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 7: FILTERING & CLEANUP │
+│ File: pdf_parser.py, _filter_forpages() (lines 685-729) │
+│ __filterout_scraps() (lines 971-1029) │
+│ │
+│ • Remove headers/footers (top/bottom 10% of page) │
+│ • Remove table of contents │
+│ • Filter low-quality OCR results │
+│ │
+│ Output: clean_boxes[] │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ STEP 8: EXTRACT TABLES & FIGURES │
+│ File: pdf_parser.py, _extract_table_figure() (lines 757-930) │
+│ │
+│ • Convert table boxes to HTML/descriptive text │
+│ • Extract figure images with captions │
+│ • Handle spanning cells (colspan, rowspan) │
+│ │
+│ Output: tables[], figures[] │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ FINAL OUTPUT │
+│ │
+│ documents: [(text, position_tag), ...] │
+│ tables: [{"html": "...", "bbox": [...], "image": ...}, ...] │
+│ │
+│ position_tag format: "page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}" │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+### 2.2 Detailed OCR Flow
+
+```
+ Input Image (H, W, 3)
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ TEXT DETECTION (DBNet) │
+│ File: vision/ocr.py, TextDetector.__call__() (lines 503-530) │
+└─────────────────────────────────────────────────────────────────────────────┘
+ │
+ ┌────────────────────────┼────────────────────────┐
+ │ │ │
+ ▼ ▼ ▼
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+ │ Preprocess │ │ ONNX │ │ Postprocess │
+ │ │ │ Inference │ │ │
+ │ • Resize │ → │ │ → │ • Threshold │
+ │ • Normalize │ │ DBNet │ │ • Contours │
+ │ • Transpose │ │ Model │ │ • Unclip │
+ └─────────────┘ └─────────────┘ └─────────────┘
+ │
+ ▼
+ Text Region Polygons
+ [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ TEXT RECOGNITION (CRNN) │
+│ File: vision/ocr.py, TextRecognizer.__call__() (lines 363-408) │
+└─────────────────────────────────────────────────────────────────────────────┘
+ │
+ ┌────────────────────────┼────────────────────────┐
+ │ │ │
+ ▼ ▼ ▼
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+ │ Crop │ │ ONNX │ │ CTC Decode │
+ │ Rotate │ │ Inference │ │ │
+ │ │ → │ │ → │ • Argmax │
+ │ Perspective │ │ CRNN │ │ • Dedup │
+ │ Transform │ │ Model │ │ • Remove ε │
+ └─────────────┘ └─────────────┘ └─────────────┘
+ │
+ ▼
+ Output: [(box, (text, confidence)), ...]
+```
+
+### 2.3 Layout Recognition Flow
+
+```
+ Input: Page Image + OCR Results
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ LAYOUT DETECTION (YOLOv10) │
+│ File: vision/layout_recognizer.py (lines 163-237) │
+└─────────────────────────────────────────────────────────────────────────────┘
+ │
+ ┌────────────────────────────┼────────────────────────────┐
+ │ │ │
+ ▼ ▼ ▼
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+ │ Preprocess │ │ ONNX │ │ Postprocess │
+ │ │ │ Inference │ │ │
+ │ • Resize │ → │ │ → │ • NMS │
+ │ (640x640) │ │ YOLOv10 │ │ • Filter │
+ │ • Pad │ │ Model │ │ • Scale │
+ │ • Normalize │ │ │ │ back │
+ └─────────────┘ └─────────────┘ └─────────────┘
+ │
+ ▼
+ Layout Detections:
+ [{"type": "Table", "bbox": [...], "score": 0.95}]
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ OCR-LAYOUT ASSOCIATION │
+│ File: vision/layout_recognizer.py (lines 98-147) │
+│ │
+│ For each OCR box: │
+│ • Find overlapping layout region (threshold: 40%) │
+│ • Assign layout_type to OCR box │
+│ • Filter garbage (headers/footers/page numbers) │
+│ │
+└─────────────────────────────────────────────────────────────────────────────┘
+ │
+ ▼
+ Output: OCR boxes with layout_type attribute
+ [{"text": "...", "layout_type": "Text", "layoutno": 1}]
+```
+
+### 2.4 Data Flow Summary
+
+```
+┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+│ PDF File │ → │ Images │ → │ OCR Boxes │ → │ Merged │
+│ │ │ + Chars │ │ + Layout │ │ Documents │
+└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
+ │
+ ▼
+ ┌─────────────┐
+ │ Tables │
+ │ (HTML/Desc)│
+ └─────────────┘
+
+Input Format:
+- File path: str (e.g., "/path/to/doc.pdf")
+- Or bytes: bytes (raw PDF content)
+
+Output Format:
+- documents: List[Tuple[str, str]]
+ - text: Extracted text content
+ - position_tag: "page_0_x0_100_y0_200_x1_500_y1_250"
+
+- tables: List[Dict]
+  - html: "<table>...</table>"
+ - bbox: [x0, y0, x1, y1]
+ - image: numpy array (optional)
+```
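+
+A minimal consumption sketch of this output, assuming exactly the `documents`/`tables` shapes and position-tag format listed above (the tag-splitting code here is illustrative, not part of DeepDoc):
+
+```python
+from deepdoc.parser import RAGFlowPdfParser
+
+parser = RAGFlowPdfParser()
+documents, tables = parser("doc.pdf")
+
+for text, pos_tag in documents:
+    # "page_0_x0_100_y0_200_x1_500_y1_250" -> {"page": 0, "x0": 100, ...}
+    parts = pos_tag.split("_")
+    fields = {parts[i]: int(parts[i + 1]) for i in range(0, len(parts), 2)}
+    print(fields["page"], text[:60])
+
+for tbl in tables:
+    print("table at", tbl["bbox"])
+```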
+
+---
+
+## 3. Detailed Code Analysis
+
+### 3.1 RAGFlowPdfParser Class
+
+**File**: `/deepdoc/parser/pdf_parser.py`
+**Lines**: 52-1479
+
+#### 3.1.1 Constructor (__init__)
+
+```python
+# Line 52-104
+class RAGFlowPdfParser:
+ def __init__(self, **kwargs):
+ # Load OCR model
+ self.ocr = OCR() # vision/ocr.py
+
+ # Load Layout Recognizer (YOLOv10)
+ self.layout_recognizer = LayoutRecognizer() # vision/layout_recognizer.py
+
+ # Load Table Structure Recognizer
+ self.tsr = TableStructureRecognizer() # vision/table_structure_recognizer.py
+
+ # Load XGBoost model for text concatenation
+ try:
+ self.updown_cnt_mdl = xgb.Booster()
+ model_path = os.path.join(get_project_base_directory(),
+ "rag/res/deepdoc/updown_concat_xgb.model")
+ self.updown_cnt_mdl.load_model(model_path)
+ except Exception as e:
+ self.updown_cnt_mdl = None
+```
+
+**Explanation**:
+- The constructor initializes 4 models:
+  1. **OCR**: Text detection + recognition
+  2. **LayoutRecognizer**: Classifies layout regions (YOLOv10)
+  3. **TableStructureRecognizer**: Recognizes table structure
+  4. **XGBoost**: Decides whether to merge text blocks (31 features)
+
+#### 3.1.2 Main Entry Point (__call__)
+
+```python
+# Lines 1160-1168
+def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
+ """
+ Main entry point for PDF parsing.
+
+ Args:
+ fnm: File path or bytes
+ need_image: Whether to extract images
+ zoomin: Zoom factor for OCR (default 3x)
+ return_html: Return HTML tables instead of descriptive text
+
+ Returns:
+ (documents, tables) tuple
+ """
+ self.__images__(fnm, zoomin) # Step 1: Load images
+ self._layouts_rec(zoomin) # Step 2-3: OCR + Layout
+ self._table_transformer_job(zoomin) # Step 4: Table structure
+ self._text_merge(zoomin) # Step 5: Merge text
+ self._filter_forpages() # Step 6: Filter
+ tbls = self._extract_table_figure(...) # Step 7: Extract tables
+ return self._final_result(), tbls # Final output
+```
+
+**Why zoomin=3?**
+- OCR accuracy improves significantly on larger images
+- 3x balances accuracy against memory/speed
+- Too large (5x+) → memory issues; too small (1x) → OCR errors
+
+#### 3.1.3 Image Loading (__images__)
+
+```python
+# Lines 1042-1159
+def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
+ """
+ Load PDF pages as images and extract native characters.
+ """
+ self.page_images = []
+ self.page_chars = []
+
+ # Open PDF with pdfplumber
+ with pdfplumber.open(fnm) as pdf:
+ for i, page in enumerate(pdf.pages[page_from:page_to]):
+ # Convert page to image
+ img = page.to_image(resolution=72 * zoomin)
+ img = np.array(img.original)
+ self.page_images.append(img)
+
+ # Extract native PDF characters
+ chars = page.chars
+ self.page_chars.append(chars)
+```
+
+**Why pdfplumber?**
+- Supports both text extraction and image conversion
+- Preserves character-level coordinates
+- Handles complex PDFs well
+
+#### 3.1.4 Column Detection (_assign_column)
+
+```python
+# Lines 355-440
+def _assign_column(self, boxes, zoomin=3):
+ """
+ Detect columns using K-Means clustering on X coordinates.
+ """
+ from sklearn.cluster import KMeans
+ from sklearn.metrics import silhouette_score
+
+ # Extract X coordinates
+ x_coords = np.array([[b["x0"]] for b in boxes])
+
+ best_k = 1
+ best_score = -1
+
+ # Try k from 1 to 4
+ for k in range(1, min(5, len(boxes))):
+ km = KMeans(n_clusters=k, random_state=42, n_init="auto")
+ labels = km.fit_predict(x_coords)
+
+ if k > 1:
+ score = silhouette_score(x_coords, labels)
+ if score > best_score:
+ best_score = score
+ best_k = k
+
+ # Final clustering with best k
+ km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
+ labels = km.fit_predict(x_coords)
+
+ # Assign column IDs
+ for i, box in enumerate(boxes):
+ box["col_id"] = labels[i]
+```
+
+**Why K-Means?**
+- Unsupervised: no training data required
+- Fast: O(n * k * iterations)
+- The silhouette score automatically picks the number of columns
+
+### 3.2 OCR Class
+
+**File**: `/deepdoc/vision/ocr.py`
+**Lines**: 536-752
+
+#### 3.2.1 Text Detection (TextDetector)
+
+```python
+# Lines 414-534
+class TextDetector:
+ def __init__(self, model_dir, device_id=None):
+ # Preprocessing pipeline
+ self.preprocess_op = [
+ DetResizeForTest(limit_side_len=960, limit_type='max'),
+ NormalizeImage(mean=[0.485, 0.456, 0.406],
+ std=[0.229, 0.224, 0.225]),
+ ToCHWImage(),
+ KeepKeys(keep_keys=['image', 'shape'])
+ ]
+
+ # Postprocessing
+ self.postprocess_op = DBPostProcess(
+ thresh=0.3, # Binary threshold
+ box_thresh=0.5, # Box confidence threshold
+ max_candidates=1000, # Max text regions
+ unclip_ratio=1.5 # Box expansion ratio
+ )
+
+ # Load ONNX model
+ self.ort_sess, self.run_opts = load_model(model_dir, "det", device_id)
+```
+
+**DBNet (Differentiable Binarization)**:
+- Input: Image → Probability map (text regions)
+- Thresholding: prob > 0.3 → foreground
+- Unclipping: Expand boxes by 1.5x to capture the full text
+
+#### 3.2.2 Text Recognition (TextRecognizer)
+
+```python
+# Lines 133-412
+class TextRecognizer:
+ def __init__(self, model_dir, device_id=None):
+ self.rec_image_shape = [3, 48, 320] # C, H, W
+ self.batch_size = 16
+
+ # Load CRNN model
+ self.ort_sess, self.run_opts = load_model(model_dir, "rec", device_id)
+
+ # CTC decoder
+ self.postprocess_op = CTCLabelDecode(character_dict_path=dict_path)
+
+ def __call__(self, img_list):
+ # Sort by aspect ratio for efficient batching
+ indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list])
+
+ results = []
+ for batch in chunks(indices, self.batch_size):
+ # Normalize images
+ norm_imgs = [self.resize_norm_img(img_list[i]) for i in batch]
+
+ # Run inference
+ preds = self.ort_sess.run(None, {"input": np.stack(norm_imgs)})
+
+ # CTC decode
+ texts = self.postprocess_op(preds[0])
+ results.extend(texts)
+
+ return results
+```
+
+**CRNN + CTC**:
+- CNN: Extract visual features
+- RNN: Sequence modeling
+- CTC: Alignment-free decoding (handles variable-length text)
+
+#### 3.2.3 Rotation Handling
+
+```python
+# Lines 584-638
+def get_rotate_crop_image(self, img, points):
+ """
+ Crop text region with auto-rotation detection.
+ """
+ # Get perspective transform
+ rect = self.order_points_clockwise(points)
+ M = cv2.getPerspectiveTransform(rect, dst_pts)
+ warped = cv2.warpPerspective(img, M, (width, height))
+
+ # Check if text is vertical (height > 1.5 * width)
+ if warped.shape[0] / warped.shape[1] >= 1.5:
+ # Try 3 orientations
+ scores = []
+ for angle in [0, 90, -90]:
+ rotated = self.rotate(warped, angle)
+ _, conf = self.recognizer([rotated])[0]
+ scores.append(conf)
+
+ # Use orientation with highest confidence
+ best_angle = [0, 90, -90][np.argmax(scores)]
+ warped = self.rotate(warped, best_angle)
+
+ return warped
+```
+
+**Why is auto-rotation needed?**
+- A PDF can contain text rotated by 90°
+- The OCR model is trained on horizontal text
+- Auto-detection lets vertical text be recognized accurately
+
+### 3.3 Layout Recognizer
+
+**File**: `/deepdoc/vision/layout_recognizer.py`
+**Lines**: 33-237
+
+#### 3.3.1 YOLOv10 Preprocessing
+
+```python
+# Lines 186-209
+def preprocess(self, image_list):
+ """
+ Preprocess images for YOLOv10 inference.
+ """
+ processed = []
+ for img in image_list:
+ h, w = img.shape[:2]
+
+ # Calculate scale (preserve aspect ratio)
+ r = min(640/h, 640/w)
+ new_h, new_w = int(h*r), int(w*r)
+
+ # Resize
+ resized = cv2.resize(img, (new_w, new_h))
+
+ # Pad to 640x640 (center padding, gray color)
+ padded = np.full((640, 640, 3), 114, dtype=np.uint8)
+ pad_top = (640 - new_h) // 2
+ pad_left = (640 - new_w) // 2
+ padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized
+
+ # Normalize and transpose
+ padded = padded.astype(np.float32) / 255.0
+ padded = padded.transpose(2, 0, 1) # HWC → CHW
+
+ processed.append(padded)
+
+ return np.stack(processed)
+```
+
+**Why 640x640?**
+- Standard YOLOv10 input size
+- Balances accuracy vs speed
+- 32-stride alignment (640 = 20 * 32)
+
+#### 3.3.2 Layout Types
+
+```python
+# Lines 34-46
+labels = [
+ "_background_", # 0: Background (ignored)
+ "Text", # 1: Body text paragraphs
+ "Title", # 2: Section/document titles
+ "Figure", # 3: Images, diagrams, charts
+ "Figure caption", # 4: Text describing figures
+ "Table", # 5: Data tables
+ "Table caption", # 6: Text describing tables
+ "Header", # 7: Page headers
+ "Footer", # 8: Page footers
+ "Reference", # 9: Bibliography, citations
+ "Equation", # 10: Mathematical equations
+]
+```
+
+### 3.4 Table Structure Recognizer
+
+**File**: `/deepdoc/vision/table_structure_recognizer.py`
+**Lines**: 30-613
+
+#### 3.4.1 Table Grid Construction
+
+```python
+# Lines 172-349
+@staticmethod
+def construct_table(boxes, is_english=False, html=True, **kwargs):
+ """
+ Construct 2D table from detected components.
+ """
+    # Step 1: Sort by row (rowh = median box height, used as row tolerance)
+    rowh = np.median([b["bottom"] - b["top"] for b in boxes])
+    boxes = Recognizer.sort_R_firstly(boxes, rowh / 2)
+
+ # Step 2: Group into rows
+ rows = []
+ current_row = [boxes[0]]
+ for box in boxes[1:]:
+ if box["top"] - current_row[-1]["bottom"] > rowh/2:
+ rows.append(current_row)
+ current_row = [box]
+ else:
+ current_row.append(box)
+ rows.append(current_row)
+
+ # Step 3: Sort each row by column
+ for row in rows:
+ row.sort(key=lambda x: x["x0"])
+
+ # Step 4: Build 2D matrix
+ n_cols = max(len(row) for row in rows)
+ table = [[None] * n_cols for _ in range(len(rows))]
+
+ for i, row in enumerate(rows):
+ for j, cell in enumerate(row):
+ table[i][j] = cell["text"]
+
+ # Step 5: Generate output
+ if html:
+ return generate_html_table(table)
+ else:
+ return generate_descriptive_text(table)
+```
+
+#### 3.4.2 Spanning Cell Handling
+
+```python
+# Lines 496-575
+def __cal_spans(self, boxes):
+ """
+ Calculate colspan and rowspan for merged cells.
+ """
+ for box in boxes:
+ if "SP" not in box: # Not a spanning cell
+ continue
+
+ # Find which rows this cell spans
+ box["rowspan"] = []
+ for i, row_box in enumerate(self.rows):
+ if self.overlapped_area(box, row_box) > 0.3:
+ box["rowspan"].append(i)
+
+ # Find which columns this cell spans
+ box["colspan"] = []
+ for j, col_box in enumerate(self.cols):
+ if self.overlapped_area(box, col_box) > 0.3:
+ box["colspan"].append(j)
+```
+
+---
+
+## 4. Technical Explanations
+
+### 4.1 ONNX Runtime
+
+**What is ONNX?**
+- Open Neural Network Exchange
+- A standard format for deep learning models
+- Runs on many hardware targets (CPU, GPU, NPU)
+
+**Why ONNX?**
+```python
+# No PyTorch/TensorFlow runtime needed
+# Lightweight inference
+import onnxruntime as ort
+
+session = ort.InferenceSession("model.onnx")
+output = session.run(None, {"input": input_data})
+```
+
+**Configuration in DeepDoc**:
+```python
+# vision/ocr.py, lines 96-127
+options = ort.SessionOptions()
+options.enable_cpu_mem_arena = False # Reduce memory fragmentation
+options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+options.intra_op_num_threads = 2 # Threads per operator
+options.inter_op_num_threads = 2 # Parallel operators
+
+# GPU configuration
+if torch.cuda.is_available():
+ providers = [
+ ('CUDAExecutionProvider', {
+ 'device_id': device_id,
+ 'gpu_mem_limit': 2 * 1024 * 1024 * 1024, # 2GB
+ })
+ ]
+```
+
+### 4.2 CTC Decoding
+
+**CTC (Connectionist Temporal Classification)**:
+- Solves the alignment problem in sequence-to-sequence decoding
+- Does not need to know the exact position of each character
+
+**Example**:
+```
+OCR Model Output (time steps):
+[a, a, a, -, l, l, -, p, p, h, h, a, -]
+
+CTC Decoding:
+1. Merge consecutive duplicates: [a, -, l, -, p, h, a, -]
+2. Remove blank tokens (-): [a, l, p, h, a]
+3. Result: "alpha"
+```
+
+**Implementation**:
+```python
+# vision/postprocess.py, lines 355-366
+def __call__(self, preds, label=None):
+ # Get most probable character at each position
+ preds_idx = preds.argmax(axis=2) # Shape: (batch, time)
+ preds_prob = preds.max(axis=2) # Confidence scores
+
+ # Decode with deduplication
+ text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
+
+ return text
+```
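+
+To make the decoding rule concrete, here is a minimal standalone sketch of greedy CTC decoding (the same merge-repeats-then-drop-blanks idea as above; it is not the repo's `CTCLabelDecode` class):
+
+```python
+import numpy as np
+
+def greedy_ctc_decode(logits, charset, blank_idx=0):
+    """Greedy CTC: argmax per time step, merge repeats, drop blanks."""
+    best_path = logits.argmax(axis=1)         # best class index per step
+    decoded, prev = [], None
+    for idx in best_path:
+        if idx != prev and idx != blank_idx:  # skip repeats and blank tokens
+            decoded.append(charset[idx])
+        prev = idx
+    return "".join(decoded)
+
+# charset[0] is the blank token; the step sequence spells the example above
+charset = ["-", "a", "l", "p", "h"]
+steps = [1, 1, 1, 0, 2, 2, 0, 3, 3, 4, 4, 1, 0]   # a a a - l l - p p h h a -
+logits = np.eye(len(charset))[steps]              # one-hot rows as dummy scores
+print(greedy_ctc_decode(logits, charset))         # -> "alpha"
+```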
+
+### 4.3 Non-Maximum Suppression (NMS)
+
+**What is NMS?**
+- Removes duplicate detections
+- Keeps the box with the highest confidence
+
+**Algorithm**:
+```
+1. Sort boxes by confidence (descending)
+2. Pick box with highest score → add to results
+3. Remove boxes with IoU > threshold (e.g., 0.5)
+4. Repeat until no boxes remain
+```
+
+**Implementation**:
+```python
+# vision/operators.py, lines 702-725
+def nms(bboxes, scores, iou_thresh):
+ indices = []
+ index = scores.argsort()[::-1] # Sort descending
+
+ while index.size > 0:
+ i = index[0]
+ indices.append(i)
+
+ # Compute IoU with remaining boxes
+ ious = compute_iou(bboxes[i], bboxes[index[1:]])
+
+ # Keep only boxes with IoU <= threshold
+ mask = ious <= iou_thresh
+ index = index[1:][mask]
+
+ return indices
+```
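+
+The snippet above relies on a `compute_iou` helper. A minimal vectorized version (illustrative; the actual helper in `operators.py` may differ) could look like this:
+
+```python
+import numpy as np
+
+def compute_iou(box, boxes):
+    """IoU of one [x0, y0, x1, y1] box against an (N, 4) array of boxes."""
+    x0 = np.maximum(box[0], boxes[:, 0])
+    y0 = np.maximum(box[1], boxes[:, 1])
+    x1 = np.minimum(box[2], boxes[:, 2])
+    y1 = np.minimum(box[3], boxes[:, 3])
+
+    inter = np.maximum(0, x1 - x0) * np.maximum(0, y1 - y0)
+    area_a = (box[2] - box[0]) * (box[3] - box[1])
+    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
+    return inter / (area_a + area_b - inter + 1e-6)
+```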
+
+### 4.4 DBNet (Differentiable Binarization)
+
+**What is DBNet?**
+- A text detection network
+- Produces a probability map + a threshold map
+- Differentiable binarization enables end-to-end training
+
+**Pipeline**:
+```
+Image → CNN Backbone → Feature Map →
+ ├→ Probability Map (text regions)
+ └→ Threshold Map (adaptive threshold)
+
+Final = Probability > Threshold (pixel-wise)
+```
+
+**Post-processing**:
+```python
+# vision/postprocess.py, DBPostProcess
+def __call__(self, outs_dict, shape_list):
+ pred = outs_dict["maps"]
+
+ # Binary thresholding
+ bitmap = pred > self.thresh # 0.3
+
+ # Find contours
+ contours = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
+
+ # Unclip (expand) boxes
+ for contour in contours:
+ box = self.unclip(contour, self.unclip_ratio) # 1.5x expansion
+ boxes.append(box)
+```
+
+### 4.5 K-Means for Column Detection
+
+**Why K-Means?**
+- Text boxes in the same column have similar X coordinates
+- K-Means clusters the X values
+- The silhouette score picks the optimal number of columns
+
+**Silhouette Score**:
+```
+s(i) = (b(i) - a(i)) / max(a(i), b(i))
+
+- a(i): Average distance to same cluster
+- b(i): Average distance to nearest other cluster
+- Range: [-1, 1], higher = better clustering
+```
+
+**Example**:
+```
+Page with 2 columns:
+Left column boxes: x0 = [50, 52, 48, 51, ...]
+Right column boxes: x0 = [400, 398, 402, 399, ...]
+
+K-Means (k=2):
+- Cluster 0: x0 ≈ 50 (left column)
+- Cluster 1: x0 ≈ 400 (right column)
+
+Silhouette score ≈ 0.95 (high, good separation)
+```
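+
+A quick way to sanity-check this behaviour (assuming scikit-learn is available, as the column-detection code already requires) is to cluster synthetic x0 values like those above and compare silhouette scores for different k:
+
+```python
+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score
+
+# Synthetic x0 values for a two-column page, as in the example above
+x0 = np.array([[50], [52], [48], [51], [400], [398], [402], [399]], dtype=float)
+
+for k in (2, 3, 4):
+    labels = KMeans(n_clusters=k, random_state=42, n_init="auto").fit_predict(x0)
+    print(k, round(silhouette_score(x0, labels), 3))
+# k=2 should score highest (close to 1.0), so two columns would be chosen
+```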
+
+---
+
+## 5. Design Rationale
+
+### 5.1 Why Multiple Models?
+
+**Problem**: a single model cannot handle every task well
+
+| Task | Model Type | Reason |
+|------|------------|-------|
+| Text Detection | DBNet | Specialized for text regions |
+| Text Recognition | CRNN | Sequential text with CTC |
+| Layout Detection | YOLOv10 | Strong general object detection |
+| Table Structure | YOLOv10 variant | Fine-tuned for table elements |
+
+**Trade-off**:
+- Pros: each model is optimized for its own task
+- Cons: more models mean more memory and complexity
+
+### 5.2 Why XGBoost for Text Merging?
+
+**Problem**: deciding whether to merge two text blocks is a complex decision
+
+**Rule-based approach** (naive):
+```python
+# Simple heuristics
+if y_distance < threshold and same_column:
+    merge()
+# ❌ Does not handle edge cases well
+```
+
+**ML approach** (XGBoost):
+```python
+# 31 features capturing various signals
+features = [
+ y_distance / char_height, # Distance feature
+ ends_with_punctuation, # Text pattern
+ same_layout_type, # Layout feature
+ font_size_ratio, # Typography
+ ...
+]
+# ✅ Learns complex patterns from data
+```
+
+**Why XGBoost?**
+- Fast inference (tree-based)
+- Handles mixed feature types well
+- A pre-trained model is bundled with DeepDoc
+
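+A hedged sketch of how one candidate merge could be scored with such a model at inference time. The feature names and count below are purely illustrative; the real model expects the 31 features built in `pdf_parser.py`, so this only shows the calling pattern, not the actual feature set:
+
+```python
+import numpy as np
+import xgboost as xgb
+
+def should_merge(mdl, box_up, box_down, char_height):
+    """Illustrative only: score one up/down merge candidate with XGBoost."""
+    feats = np.array([[
+        (box_down["top"] - box_up["bottom"]) / max(char_height, 1e-6),   # vertical gap
+        float(box_up["text"].rstrip().endswith(("。", "?", "!", "."))),   # sentence end
+        float(box_up.get("layout_type") == box_down.get("layout_type")), # same layout
+    ]], dtype=np.float32)
+    return mdl.predict(xgb.DMatrix(feats))[0] > 0.5
+
+# mdl = xgb.Booster()
+# mdl.load_model("rag/res/deepdoc/updown_concat_xgb.model")  # real model needs 31 features
+```
+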
+### 5.3 Why ONNX Instead of PyTorch/TensorFlow?
+
+| Aspect | ONNX Runtime | PyTorch |
+|--------|--------------|---------|
+| Size | ~50MB | ~500MB+ |
+| Memory | Lower | Higher |
+| Startup | Fast | Slow (JIT) |
+| Dependencies | Minimal | Many |
+| Multi-platform | Yes | Limited |
+
+**DeepDoc's choice**: ONNX for production deployment
+- No PyTorch runtime needed
+- Lighter memory footprint
+- Faster cold start
+
+### 5.4 Why zoomin = 3?
+
+**Experiment results**:
+```
+zoomin=1: OCR accuracy ~70%, fast
+zoomin=2: OCR accuracy ~85%, moderate
+zoomin=3: OCR accuracy ~95%, acceptable speed ← chosen
+zoomin=4: OCR accuracy ~97%, slow
+zoomin=5: OCR accuracy ~98%, very slow, memory issues
+```
+
+**Balance**: 3x is the sweet spot between accuracy and resource usage
+
+### 5.5 Why Hybrid Text Extraction?
+
+**Native PDF text** (pdfplumber):
+- Pros: Accurate, fast, preserves fonts
+- Cons: Not available for scanned PDFs
+
+**OCR text**:
+- Pros: Works on any image
+- Cons: Slower, potential errors
+
+**Hybrid approach**:
+```python
+# Prefer native text, fallback to OCR
+for box in ocr_boxes:
+ # Try to match with native characters
+ matched_chars = find_overlapping_chars(box, native_chars)
+
+ if matched_chars:
+ box["text"] = "".join(matched_chars) # Use native
+ else:
+ box["text"] = ocr_result # Use OCR
+```
+
+### 5.6 Pipeline vs End-to-End Model
+
+**End-to-End** (e.g., Donut, Pix2Struct):
+- Single model: Image → Structured output
+- Pros: Simple, unified
+- Cons: Less accurate on specific tasks, hard to debug
+
+**Pipeline** (DeepDoc's choice):
+- Multiple specialized models
+- Pros:
+ - Each model optimized for task
+ - Easy to debug/improve individual components
+ - Mix and match different models
+- Cons:
+ - More complexity
+ - Potential error accumulation
+
+**DeepDoc's rationale**: a pipeline gives flexibility and accuracy
+
+---
+
+## 6. Glossary
+
+### 6.1 Computer Vision Terms
+
+| Term | Definition | Example in DeepDoc |
+|------|------------|---------------------|
+| **Bounding Box** | Rectangle enclosing an object | `[x0, y0, x1, y1]` coordinates |
+| **IoU** | Intersection over Union - measures overlap | NMS threshold 0.5 |
+| **NMS** | Non-Maximum Suppression | Removes duplicate detections |
+| **Anchor** | Predefined box sizes | Used by earlier YOLO versions (YOLOv10 is anchor-free) |
+| **Stride** | Downsampling factor | 32 in YOLOv10 |
+| **FPN** | Feature Pyramid Network | Multi-scale detection |
+
+### 6.2 OCR Terms
+
+| Term | Definition | Example in DeepDoc |
+|------|------------|---------------------|
+| **CTC** | Connectionist Temporal Classification | CRNN output decoding |
+| **CRNN** | CNN + RNN | Text recognition model |
+| **DBNet** | Differentiable Binarization | Text detection model |
+| **Unclip** | Expand polygon boundary | 1.5x expansion ratio |
+
+### 6.3 ML Terms
+
+| Term | Definition | Example in DeepDoc |
+|------|------------|---------------------|
+| **ONNX** | Open Neural Network Exchange | Model format |
+| **Inference** | Running model on input | `session.run()` |
+| **Batch** | Multiple inputs processed together | batch_size=16 |
+| **Confidence** | Model's certainty score | 0.0 - 1.0 |
+
+### 6.4 Document Processing Terms
+
+| Term | Definition | Example in DeepDoc |
+|------|------------|---------------------|
+| **Layout** | Document structure | Text, Table, Figure |
+| **TSR** | Table Structure Recognition | Row, Column detection |
+| **Spanning Cell** | Merged table cell | colspan, rowspan |
+| **Reading Order** | Text flow sequence | Top-to-bottom, left-to-right |
+
+---
+
+## 7. Extending the Code
+
+### 7.1 Adding a New Parser
+
+**Example**: add an RTF parser
+
+```python
+# deepdoc/parser/rtf_parser.py
+from striprtf.striprtf import rtf_to_text
+
+class RAGFlowRtfParser:
+    def __call__(self, fnm, binary=None, chunk_token_num=128):
+        # Accept either raw bytes or a file path
+        if binary:
+            content = binary.decode('utf-8')
+        else:
+            with open(fnm, 'r') as f:
+                content = f.read()
+
+        text = rtf_to_text(content)
+
+        # Chunk text (simple whitespace-token chunking; a real implementation
+        # would reuse the project's tokenizer for token counting)
+        tokens = text.split()
+        chunks = [" ".join(tokens[i:i + chunk_token_num])
+                  for i in range(0, len(tokens), chunk_token_num)]
+
+        return [(chunk, f"rtf_chunk_{i}") for i, chunk in enumerate(chunks)]
+```
+
+### 7.2 Adding a New Layout Type
+
+**Example**: add a "Code Block" layout
+
+```python
+# vision/layout_recognizer.py
+labels = [
+ "_background_",
+ "Text",
+ "Title",
+ ...
+ "Code Block", # New label (index 11)
+]
+
+# Train new YOLOv10 model with "Code Block" annotations
+# Update model file
+```
+
+### 7.3 Custom Text Merging Logic
+
+```python
+# Override default merging behavior
+class CustomPdfParser(RAGFlowPdfParser):
+ def _should_merge(self, box1, box2):
+ """Custom merge logic"""
+ # Don't merge code blocks
+ if box1.get("layout_type") == "Code Block":
+ return False
+
+ # Use default logic otherwise
+ return super()._should_merge(box1, box2)
+```
+
+### 7.4 Adding an Output Format
+
+```python
+# Add Markdown output format
+def to_markdown(self, documents, tables):
+ md_parts = []
+
+ for text, pos_tag in documents:
+ # Detect if title
+ if self._is_title(text):
+ md_parts.append(f"## {text}\n")
+ else:
+ md_parts.append(f"{text}\n\n")
+
+ # Convert tables to markdown
+ for table in tables:
+ md_table = html_to_markdown(table["html"])
+ md_parts.append(md_table)
+
+ return "\n".join(md_parts)
+```
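+
+The `html_to_markdown` helper used above is not part of DeepDoc. One minimal way to implement it (assuming `pandas` with an HTML parser such as `lxml`, plus `tabulate` for `to_markdown`) is:
+
+```python
+import pandas as pd
+
+def html_to_markdown(html: str) -> str:
+    """Convert the first HTML <table> found in the string to a Markdown table."""
+    df = pd.read_html(html)[0]      # read_html returns one DataFrame per <table>
+    return df.to_markdown(index=False)
+```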
+
+### 7.5 Optimize Performance
+
+**GPU Batching**:
+```python
+# Process multiple pages in parallel
+def _parallel_ocr(self, images, batch_size=4):
+ with ThreadPoolExecutor(max_workers=4) as executor:
+ futures = []
+ for batch in chunks(images, batch_size):
+ future = executor.submit(self.ocr, batch)
+ futures.append(future)
+
+ results = [f.result() for f in futures]
+ return results
+```
+
+**Caching**:
+```python
+# Cache model instances
+_model_cache = {}
+
+def get_ocr_model(model_dir, device_id):
+ key = f"{model_dir}_{device_id}"
+ if key not in _model_cache:
+ _model_cache[key] = OCR(model_dir, device_id)
+ return _model_cache[key]
+```
+
+### 7.6 Integration with the RAG Pipeline
+
+```python
+# rag/app/pdf.py (example integration)
+from deepdoc.parser import RAGFlowPdfParser
+
+def process_pdf_for_rag(file_path, chunk_size=512):
+ parser = RAGFlowPdfParser()
+
+ # Parse PDF
+ documents, tables = parser(file_path)
+
+ # Chunk documents
+ chunks = []
+ for text, pos_tag in documents:
+ for chunk in chunk_text(text, chunk_size):
+ chunks.append({
+ "text": chunk,
+ "metadata": {"position": pos_tag}
+ })
+
+ # Add tables as separate chunks
+ for table in tables:
+ chunks.append({
+ "text": table["html"],
+ "metadata": {"type": "table", "bbox": table["bbox"]}
+ })
+
+ return chunks
+```
+
+---
+
+## 8. Summary
+
+### 8.1 Key Takeaways
+
+1. **DeepDoc = Parser Layer + Vision Layer**
+ - Parser: Format-specific handling (PDF, DOCX, etc.)
+ - Vision: OCR + Layout + Table recognition
+
+2. **Pipeline Architecture**
+ - Multiple specialized models
+ - Easy to debug and improve
+
+3. **ONNX Runtime**
+ - Lightweight inference
+ - Cross-platform compatibility
+
+4. **Hybrid Text Extraction**
+   - Native PDF text when available
+   - OCR fallback for scanned documents
+
+### 8.2 Summary Diagram
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│ DEEPDOC SUMMARY │
+├──────────────────────────────────────────────────────────────────────────────┤
+│ │
+│ INPUT PROCESSING OUTPUT │
+│ ───── ────────── ────── │
+│ │
+│ ┌─────────┐ ┌────────────────────────────┐ ┌─────────────────┐ │
+│ │ PDF │────▶│ 1. Image Extraction │─────▶│ Documents │ │
+│ │ DOCX │ │ 2. OCR (DBNet + CRNN) │ │ [(text, pos)] │ │
+│ │ Excel │ │ 3. Layout (YOLOv10) │ │ │ │
+│ │ HTML │ │ 4. Column Detection │ │ Tables │ │
+│ │ ... │ │ 5. Table Structure │ │ [html, bbox] │ │
+│ └─────────┘ │ 6. Text Merging │ │ │ │
+│ │ 7. Quality Filtering │ │ Figures │ │
+│ └────────────────────────────┘ │ [image, cap] │ │
+│ └─────────────────┘ │
+│ │
+│ MODELS USED: │
+│ ──────────── │
+│ • DBNet (Text Detection) - ONNX, ~30MB │
+│ • CRNN (Text Recognition) - ONNX, ~20MB │
+│ • YOLOv10 (Layout Detection) - ONNX, ~50MB │
+│ • YOLOv10 (Table Structure) - ONNX, ~50MB │
+│ • XGBoost (Text Merging) - Binary, ~5MB │
+│ │
+│ KEY ALGORITHMS: │
+│ ─────────────── │
+│ • CTC Decoding (text recognition) │
+│ • NMS (duplicate removal) │
+│ • K-Means (column detection) │
+│ • IoU (overlap calculation) │
+│ │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+### 8.3 Files Reference
+
+| File | Lines | Description |
+|------|-------|-------------|
+| `parser/pdf_parser.py` | 1479 | Main PDF parser |
+| `vision/ocr.py` | 752 | OCR detection + recognition |
+| `vision/layout_recognizer.py` | 457 | Layout detection |
+| `vision/table_structure_recognizer.py` | 613 | Table structure |
+| `vision/recognizer.py` | 443 | Base recognizer class |
+| `vision/operators.py` | 726 | Image preprocessing |
+| `vision/postprocess.py` | 371 | Post-processing utilities |
+
+---
+
+*Document created for RAGFlow v0.22.1 analysis*
diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md
new file mode 100644
index 000000000..063acd4be
--- /dev/null
+++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md
@@ -0,0 +1,926 @@
+# Layout & Table Recognition Deep Dive
+
+## Overview
+
+After OCR has extracted the text boxes, DeepDoc needs to:
+1. **Layout Recognition**: classify each region (Text, Title, Table, Figure...)
+2. **Table Structure Recognition**: recognize table structure (rows, columns, cells)
+
+## File Structure
+
+```
+deepdoc/vision/
+├── layout_recognizer.py # Layout detection (457 lines)
+├── table_structure_recognizer.py # Table structure (613 lines)
+└── recognizer.py # Base class (443 lines)
+```
+
+---
+
+## 1. Layout Recognition (YOLOv10)
+
+### 1.1 Layout Categories
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 34-46
+
+labels = [
+ "_background_", # 0: Background (ignored)
+ "Text", # 1: Body text paragraphs
+ "Title", # 2: Section/document titles
+ "Figure", # 3: Images, diagrams, charts
+ "Figure caption", # 4: Text describing figures
+ "Table", # 5: Data tables
+ "Table caption", # 6: Text describing tables
+ "Header", # 7: Page headers
+ "Footer", # 8: Page footers
+ "Reference", # 9: Bibliography, citations
+ "Equation", # 10: Mathematical equations
+]
+```
+
+### 1.2 YOLOv10 Architecture
+
+```
+YOLOv10 for Document Layout:
+
+Input Image (640, 640, 3)
+ │
+ ▼
+┌─────────────────────────────────────┐
+│ CSPDarknet Backbone │
+│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐│
+│ │ P1 │→ │ P2 │→ │ P3 │→ │ P4 ││
+│ │/2 │ │/4 │ │/8 │ │/16 ││
+│ └─────┘ └─────┘ └─────┘ └─────┘│
+└─────────────────────────────────────┘
+ │ │ │ │
+ ▼ ▼ ▼ ▼
+┌─────────────────────────────────────┐
+│ PANet Neck │
+│ FPN (top-down) + PAN (bottom-up) │
+│ Multi-scale feature fusion │
+└─────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────┐
+│ Detection Heads (3 scales) │
+│ Small (80x80) → tiny objects │
+│ Medium (40x40) → normal objects │
+│ Large (20x20) → big objects │
+└─────────────────────────────────────┘
+ │
+ ▼
+ Raw Predictions:
+ [x_center, y_center, width, height, confidence, class_probs...]
+```
+
+### 1.3 Preprocessing (LayoutRecognizer4YOLOv10)
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 186-209
+
+def preprocess(self, image_list):
+ """
+ Preprocess images for YOLOv10.
+
+ Key steps:
+ 1. Resize maintaining aspect ratio
+ 2. Pad to 640x640 (gray borders)
+ 3. Normalize [0,255] → [0,1]
+ 4. Transpose HWC → CHW
+ """
+ processed = []
+ scale_factors = []
+
+ for img in image_list:
+ h, w = img.shape[:2]
+
+ # Calculate scale (preserve aspect ratio)
+ r = min(640/h, 640/w)
+ new_h, new_w = int(h*r), int(w*r)
+
+ # Resize
+ resized = cv2.resize(img, (new_w, new_h))
+
+ # Calculate padding
+ pad_top = (640 - new_h) // 2
+ pad_left = (640 - new_w) // 2
+
+ # Pad to 640x640 (gray: 114)
+ padded = np.full((640, 640, 3), 114, dtype=np.uint8)
+ padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized
+
+ # Normalize and transpose
+ padded = padded.astype(np.float32) / 255.0
+ padded = padded.transpose(2, 0, 1) # HWC → CHW
+
+ processed.append(padded)
+ scale_factors.append([1/r, 1/r, pad_left, pad_top])
+
+ return np.stack(processed), scale_factors
+```
+
+**Visualization**:
+```
+Original image (1000x800):
+┌────────────────────────────────────────┐
+│ │
+│ Document Content │
+│ │
+└────────────────────────────────────────┘
+
+After resize (scale=0.64) to (640x512):
+┌────────────────────────────────────────┐
+│ │
+│ Document Content │
+│ │
+└────────────────────────────────────────┘
+
+After padding to (640x640):
+┌────────────────────────────────────────┐
+│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
+├────────────────────────────────────────┤
+│ │
+│ Document Content │
+│ │
+├────────────────────────────────────────┤
+│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
+└────────────────────────────────────────┘
+```
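+
+With the numbers used in this visualization, the letterbox parameters and the mapping back to original-image coordinates work out as follows (a standalone sketch mirroring the preprocess/`scale_factors` logic above; the helper names are illustrative):
+
+```python
+def letterbox_params(h, w, size=640):
+    """Scale factor and padding used to letterbox an h x w image to size x size."""
+    r = min(size / h, size / w)
+    new_h, new_w = int(h * r), int(w * r)
+    return r, (size - new_h) // 2, (size - new_w) // 2   # r, pad_top, pad_left
+
+def to_original(x, y, r, pad_top, pad_left):
+    """Map a point from the padded 640x640 space back to the original image."""
+    return (x - pad_left) / r, (y - pad_top) / r
+
+r, pad_top, pad_left = letterbox_params(h=800, w=1000)
+print(r, pad_top, pad_left)                          # 0.64 64 0
+print(to_original(320, 320, r, pad_top, pad_left))   # (500.0, 400.0): original center
+```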
+
+### 1.4 NMS Postprocessing
+
+```python
+# deepdoc/vision/recognizer.py, lines 330-407
+
+def postprocess(self, boxes, inputs, thr):
+ """
+ YOLOv10 postprocessing with per-class NMS.
+ """
+ results = []
+
+ for batch_idx, batch_boxes in enumerate(boxes):
+ scale_factor = inputs["scale_factor"][batch_idx]
+
+ # Filter by confidence threshold
+ mask = batch_boxes[:, 4] > thr # confidence > 0.2
+ filtered = batch_boxes[mask]
+
+ if len(filtered) == 0:
+ results.append([])
+ continue
+
+ # Convert xywh → xyxy
+ xyxy = self.xywh2xyxy(filtered[:, :4])
+
+ # Remove padding offset
+ xyxy[:, [0, 2]] -= scale_factor[2] # pad_left
+ xyxy[:, [1, 3]] -= scale_factor[3] # pad_top
+
+ # Scale back to original size
+ xyxy[:, [0, 2]] *= scale_factor[0] # scale_x
+ xyxy[:, [1, 3]] *= scale_factor[1] # scale_y
+
+ # Per-class NMS
+ class_ids = filtered[:, 5].astype(int)
+ scores = filtered[:, 4]
+
+ keep_indices = []
+ for cls in np.unique(class_ids):
+ cls_mask = class_ids == cls
+ cls_boxes = xyxy[cls_mask]
+ cls_scores = scores[cls_mask]
+
+ # NMS within class
+ keep = self.iou_filter(cls_boxes, cls_scores, iou_thresh=0.45)
+ keep_indices.extend(np.where(cls_mask)[0][keep])
+
+ # Build result
+ batch_results = []
+ for idx in keep_indices:
+ batch_results.append({
+ "type": self.labels[int(filtered[idx, 5])],
+ "bbox": xyxy[idx].tolist(),
+ "score": float(filtered[idx, 4])
+ })
+
+ results.append(batch_results)
+
+ return results
+```
+
+### 1.5 OCR-Layout Association
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 98-147
+
+def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
+ """
+ Detect layouts and associate with OCR results.
+ """
+ # Step 1: Run layout detection
+ page_layouts = super().__call__(image_list, thr, batch_size)
+
+ # Step 2: Clean up overlapping layouts
+ for i, layouts in enumerate(page_layouts):
+ page_layouts[i] = self.layouts_cleanup(layouts, thr=0.7)
+
+ # Step 3: Associate OCR boxes with layouts
+ for page_idx, (ocr_boxes, layouts) in enumerate(zip(ocr_res, page_layouts)):
+ # Sort layouts by priority: Footer → Header → Reference → Caption → Others
+ layouts_by_priority = self._sort_by_priority(layouts)
+
+ for ocr_box in ocr_boxes:
+ # Find overlapping layout
+ matched_layout = self.find_overlapped_with_threshold(
+ ocr_box,
+ layouts_by_priority,
+ thr=0.4 # 40% overlap threshold
+ )
+
+ if matched_layout:
+ ocr_box["layout_type"] = matched_layout["type"]
+ ocr_box["layoutno"] = matched_layout.get("layoutno", 0)
+ else:
+ ocr_box["layout_type"] = "Text" # Default to Text
+
+ # Step 4: Filter garbage (headers, footers, page numbers)
+ if drop:
+ self._filter_garbage(ocr_res, page_layouts)
+
+ return ocr_res, page_layouts
+```
+
+### 1.6 Garbage Detection
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 64-66
+
+# Patterns to filter out
+garbage_patterns = [
+ r"^•+$", # Bullet points only
+ r"^[0-9]{1,2} / ?[0-9]{1,2}$", # Page numbers (3/10, 3 / 10)
+ r"^[0-9]{1,2} of [0-9]{1,2}$", # Page numbers (3 of 10)
+ r"^http://[^ ]{12,}", # Long URLs
+ r"\(cid *: *[0-9]+ *\)", # PDF character IDs
+]
+
+def is_garbage(text, layout_type, page_position):
+ """
+ Determine if text should be filtered out.
+
+ Rules:
+ - Headers at top 10% of page → keep
+ - Footers at bottom 10% of page → keep
+ - Headers/footers elsewhere → garbage
+ - Page numbers → garbage
+ - URLs → garbage
+ """
+ for pattern in garbage_patterns:
+ if re.match(pattern, text):
+ return True
+
+ # Position-based filtering
+ if layout_type == "Header" and page_position > 0.1:
+ return True # Header not at top
+ if layout_type == "Footer" and page_position < 0.9:
+ return True # Footer not at bottom
+
+ return False
+```
+
+---
+
+## 2. Table Structure Recognition
+
+### 2.1 Table Components
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 31-38
+
+labels = [
+ "table", # 0: Whole table boundary
+ "table column", # 1: Column separators
+ "table row", # 2: Row separators
+ "table column header", # 3: Header rows
+ "table projected row header", # 4: Row labels
+ "table spanning cell", # 5: Merged cells
+]
+```
+
+### 2.2 Detection to Grid Construction
+
+```
+Detection Output → Table Grid:
+
+┌─────────────────────────────────────────────────────────────────┐
+│ Raw Detections │
+│ ┌──────────────────────────────────────────────────────────┐ │
+│ │ table: [0, 0, 500, 300] │ │
+│ │ table row: [0, 0, 500, 50], [0, 50, 500, 100], ... │ │
+│ │ table column: [0, 0, 150, 300], [150, 0, 300, 300], ... │ │
+│ │ table spanning cell: [0, 100, 300, 150] │ │
+│ └──────────────────────────────────────────────────────────┘ │
+│ │ │
+│ ▼ │
+│ ┌──────────────────────────────────────────────────────────┐ │
+│ │ Alignment │ │
+│ │ • Align row boundaries (left/right edges) │ │
+│ │ • Align column boundaries (top/bottom edges) │ │
+│ └──────────────────────────────────────────────────────────┘ │
+│ │ │
+│ ▼ │
+│ ┌──────────────────────────────────────────────────────────┐ │
+│ │ Grid Construction │ │
+│ │ │ │
+│ │ ┌──────────┬──────────┬──────────┐ │ │
+│ │ │ Header 1 │ Header 2 │ Header 3 │ ← Row 0 (header) │ │
+│ │ ├──────────┴──────────┼──────────┤ │ │
+│ │ │ Spanning Cell │ Cell 3 │ ← Row 1 │ │
+│ │ ├──────────┬──────────┼──────────┤ │ │
+│ │ │ Cell 4 │ Cell 5 │ Cell 6 │ ← Row 2 │ │
+│ │ └──────────┴──────────┴──────────┘ │ │
+│ │ │ │
+│ └──────────────────────────────────────────────────────────┘ │
+│ │ │
+│ ▼ │
+│ HTML or Descriptive Output │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### 2.3 Alignment Algorithm
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 67-111
+
+def __call__(self, images, thr=0.2):
+ """
+ Detect and align table structure.
+ """
+ # Run detection
+ detections = super().__call__(images, thr)
+
+ for page_dets in detections:
+ rows = [d for d in page_dets if d["label"] == "table row"]
+ cols = [d for d in page_dets if d["label"] == "table column"]
+
+ if len(rows) > 4:
+ # Align row X coordinates (left edges)
+ x0_values = [r["x0"] for r in rows]
+ mean_x0 = np.mean(x0_values)
+ min_x0 = np.min(x0_values)
+ aligned_x0 = min(mean_x0, min_x0 + 0.05 * (max(x0_values) - min_x0))
+
+ for r in rows:
+ r["x0"] = aligned_x0
+
+ # Align row X coordinates (right edges)
+ x1_values = [r["x1"] for r in rows]
+ mean_x1 = np.mean(x1_values)
+ max_x1 = np.max(x1_values)
+ aligned_x1 = max(mean_x1, max_x1 - 0.05 * (max_x1 - min(x1_values)))
+
+ for r in rows:
+ r["x1"] = aligned_x1
+
+ if len(cols) > 4:
+ # Similar alignment for column Y coordinates
+ # ...
+```
+
+**Why is alignment needed?**
+
+The detection model can produce boundaries that are not perfectly aligned:
+```
+Before alignment:
+Row 1: x0=10, x1=490
+Row 2: x0=12, x1=488
+Row 3: x0=8, x1=492
+
+After alignment:
+Row 1: x0=10, x1=490
+Row 2: x0=10, x1=490
+Row 3: x0=10, x1=490
+```
+
+### 2.4 Grid Construction
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 172-349
+
+@staticmethod
+def construct_table(boxes, is_english=False, html=True, **kwargs):
+ """
+ Construct 2D table from detected components.
+
+ Args:
+ boxes: OCR boxes with R (row), C (column), SP (spanning) attributes
+ is_english: Language hint
+ html: Output format (HTML or descriptive text)
+
+ Returns:
+ HTML table string or descriptive text
+ """
+ # Step 1: Extract caption
+ caption = ""
+ for box in boxes[:]:
+ if is_caption(box):
+ caption = box["text"]
+ boxes.remove(box)
+
+ # Step 2: Sort by row position (R attribute)
+ rowh = np.median([b["bottom"] - b["top"] for b in boxes])
+ boxes = Recognizer.sort_R_firstly(boxes, rowh / 2)
+
+ # Step 3: Group into rows
+ rows = []
+ current_row = [boxes[0]]
+
+ for box in boxes[1:]:
+ # Same row if Y difference < row_height/2
+ if abs(box["R"] - current_row[-1]["R"]) < rowh / 2:
+ current_row.append(box)
+ else:
+ rows.append(current_row)
+ current_row = [box]
+ rows.append(current_row)
+
+ # Step 4: Sort each row by column position (C attribute)
+ for row in rows:
+ row.sort(key=lambda x: x["C"])
+
+ # Step 5: Build 2D table matrix
+ n_rows = len(rows)
+ n_cols = max(len(row) for row in rows)
+
+ table = [[None] * n_cols for _ in range(n_rows)]
+
+ for i, row in enumerate(rows):
+ for j, cell in enumerate(row):
+ table[i][j] = cell
+
+ # Step 6: Handle spanning cells
+ table = handle_spanning_cells(table, boxes)
+
+ # Step 7: Generate output
+ if html:
+ return generate_html_table(table, caption)
+ else:
+ return generate_descriptive_text(table, caption)
+```
+
+### 2.5 Spanning Cell Handling
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 496-575
+
+def __cal_spans(self, boxes, rows, cols):
+ """
+ Calculate colspan and rowspan for merged cells.
+
+ Spanning cell detection:
+ - "SP" attribute indicates merged cell
+ - Calculate which rows/cols it covers
+ """
+ for box in boxes:
+ if "SP" not in box:
+ continue
+
+ # Find rows this cell spans
+ box["rowspan"] = []
+ for i, row in enumerate(rows):
+ overlap = self.overlapped_area(box, row)
+ if overlap > 0.3: # 30% overlap
+ box["rowspan"].append(i)
+
+ # Find columns this cell spans
+ box["colspan"] = []
+ for j, col in enumerate(cols):
+ overlap = self.overlapped_area(box, col)
+ if overlap > 0.3:
+ box["colspan"].append(j)
+
+ return boxes
+```
+
+**Example**:
+```
+Spanning cell detection:
+
+┌──────────┬──────────┬──────────┐
+│ Header 1 │ Header 2 │ Header 3 │
+├──────────┴──────────┼──────────┤
+│ Merged Cell │ Cell 3 │ ← SP cell spans columns 0-1
+│ (colspan=2) │ │
+├──────────┬──────────┼──────────┤
+│ Cell 4 │ Cell 5 │ Cell 6 │
+└──────────┴──────────┴──────────┘
+
+Detection:
+- SP cell bbox: [0, 50, 300, 100]
+- Column 0: [0, 0, 150, 200] → overlap 0.5 ✓
+- Column 1: [150, 0, 300, 200] → overlap 0.5 ✓
+- Column 2: [300, 0, 450, 200] → overlap 0.0 ✗
+→ colspan = [0, 1]
+```
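+
+The overlap figures in this example can be reproduced with a small helper (illustrative; the repo's `overlapped_area` has its own definition), where the ratio is the intersection area divided by the spanning cell's own area:
+
+```python
+def overlap_ratio(cell, other):
+    """Intersection area / area of `cell`, boxes given as [x0, y0, x1, y1]."""
+    ix = max(0, min(cell[2], other[2]) - max(cell[0], other[0]))
+    iy = max(0, min(cell[3], other[3]) - max(cell[1], other[1]))
+    cell_area = (cell[2] - cell[0]) * (cell[3] - cell[1])
+    return (ix * iy) / cell_area if cell_area else 0.0
+
+sp_cell = [0, 50, 300, 100]
+columns = [[0, 0, 150, 200], [150, 0, 300, 200], [300, 0, 450, 200]]
+colspan = [j for j, col in enumerate(columns) if overlap_ratio(sp_cell, col) > 0.3]
+print(colspan)   # [0, 1], matching the example above
+```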
+
+### 2.6 HTML Output Generation
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 352-393
+
+def __html_table(table, header_rows, caption):
+ """
+ Generate HTML table from 2D matrix.
+ """
+    html_parts = ["<table>"]
+
+    # Add caption if exists
+    if caption:
+        html_parts.append(f"<caption>{caption}</caption>")
+
+    for i, row in enumerate(table):
+        html_parts.append("<tr>")
+
+        for j, cell in enumerate(row):
+            if cell is None:
+                continue  # Skip cells covered by spanning
+
+            # Determine tag (th for header, td for data)
+            tag = "th" if i in header_rows else "td"
+
+            # Add colspan/rowspan attributes
+            attrs = []
+            if cell.get("colspan") and len(cell["colspan"]) > 1:
+                attrs.append(f'colspan="{len(cell["colspan"])}"')
+            if cell.get("rowspan") and len(cell["rowspan"]) > 1:
+                attrs.append(f'rowspan="{len(cell["rowspan"])}"')
+
+            attr_str = " " + " ".join(attrs) if attrs else ""
+
+            # Add cell content
+            html_parts.append(f"<{tag}{attr_str}>{cell['text']}</{tag}>")
+
+        html_parts.append("</tr>")
+
+    html_parts.append("</table>")
+
+ return "\n".join(html_parts)
+```
+
+**Output Example**:
+```html
+<table>
+  <caption>Table 1: Sales Data</caption>
+  <tr>
+    <th>Region</th>
+    <th>Q1</th>
+    <th>Q2</th>
+  </tr>
+  <tr>
+    <td>North America</td>
+    <td colspan="2">$150K</td>
+  </tr>
+  <tr>
+    <td>Europe</td>
+    <td>$100K</td>
+    <td>$120K</td>
+  </tr>
+</table>
+```
+
+### 2.7 Descriptive Text Output
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 396-493
+
+def __desc_table(table, header_rows, caption):
+ """
+ Generate natural language description of table.
+
+ For RAG, sometimes descriptive text is better than HTML.
+ """
+ descriptions = []
+
+ # Get headers
+ headers = [cell["text"] for cell in table[0]] if header_rows else []
+
+ # Process each data row
+ for i, row in enumerate(table):
+ if i in header_rows:
+ continue
+
+ row_desc = []
+ for j, cell in enumerate(row):
+ if cell is None:
+ continue
+
+ if headers and j < len(headers):
+ # "Column Name: Value" format
+ row_desc.append(f"{headers[j]}: {cell['text']}")
+ else:
+ row_desc.append(cell['text'])
+
+ if row_desc:
+ descriptions.append("; ".join(row_desc))
+
+ # Add source reference
+ if caption:
+ descriptions.append(f'(from "{caption}")')
+
+ return "\n".join(descriptions)
+```
+
+**Output Example**:
+```
+Region: North America; Q1: $100K; Q2: $150K
+Region: Europe; Q1: $80K; Q2: $120K
+(from "Table 1: Sales Data")
+```
+
+---
+
+## 3. Cell Content Classification
+
+### 3.1 Block Type Detection
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 121-149
+
+@staticmethod
+def blockType(text):
+ """
+ Classify cell content type.
+
+ Used for:
+ - Header detection (non-numeric cells likely headers)
+ - Data validation
+ - Smart formatting
+ """
+ patterns = {
+ "Dt": r"(^[0-9]{4}[-/][0-9]{1,2}|[0-9]{1,2}[-/][0-9]{1,2}[-/][0-9]{2,4}|"
+ r"[0-9]{1,2}月|[Q][1-4]|[一二三四]季度)", # Date
+ "Nu": r"^[-+]?[0-9.,%%¥$€£¥]+$", # Number
+ "Ca": r"^[A-Z0-9]{4,}$", # Code
+ "En": r"^[a-zA-Z\s]+$", # English
+ }
+
+ for type_name, pattern in patterns.items():
+ if re.search(pattern, text):
+ return type_name
+
+ # Classify by length
+ tokens = text.split()
+ if len(tokens) == 1:
+ return "Sg" # Single
+ elif len(tokens) <= 3:
+ return "Tx" # Short text
+ elif len(tokens) <= 12:
+ return "Lx" # Long text
+ else:
+ return "Ot" # Other
+
+# Examples:
+# "2023-01-15" → "Dt" (Date)
+# "$1,234.56" → "Nu" (Number)
+# "ABC123" → "Ca" (Code)
+# "Total Revenue" → "En" (English)
+# "北京市" → "Sg" (single token, no pattern match)
+```
+
+### 3.2 Header Detection
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 332-344
+
+def detect_headers(table):
+ """
+ Detect which rows are headers based on content type.
+
+ Heuristic: If >50% of cells in a row are non-numeric,
+ it's likely a header row.
+ """
+ header_rows = set()
+
+ for i, row in enumerate(table):
+ non_numeric = 0
+ total = 0
+
+ for cell in row:
+ if cell is None:
+ continue
+ total += 1
+ if blockType(cell["text"]) != "Nu":
+ non_numeric += 1
+
+ if total > 0 and non_numeric / total > 0.5:
+ header_rows.add(i)
+
+ return header_rows
+```
+
+---
+
+## 4. Integration với PDF Parser
+
+### 4.1 Table Detection in PDF Pipeline
+
+```python
+# deepdoc/parser/pdf_parser.py, lines 196-281
+
+def _table_transformer_job(self, zoomin=3):
+ """
+ Detect and structure tables using TableStructureRecognizer.
+ """
+ # Find table layouts
+ table_layouts = [
+ layout for layout in self.page_layout
+ if layout["type"] == "Table"
+ ]
+
+ if not table_layouts:
+ return
+
+ # Crop table images
+ table_images = []
+ for layout in table_layouts:
+ x0, y0, x1, y1 = layout["bbox"]
+ img = self.page_images[layout["page"]][
+ int(y0*zoomin):int(y1*zoomin),
+ int(x0*zoomin):int(x1*zoomin)
+ ]
+ table_images.append(img)
+
+ # Run TSR
+ table_structures = self.tsr(table_images)
+
+ # Match OCR boxes to table structure
+ for layout, structure in zip(table_layouts, table_structures):
+ # Get OCR boxes within table region
+ table_boxes = [
+ box for box in self.boxes
+ if self._box_in_region(box, layout["bbox"])
+ ]
+
+ # Assign R, C, SP attributes
+ for box in table_boxes:
+ box["R"] = self._find_row(box, structure["rows"])
+ box["C"] = self._find_column(box, structure["columns"])
+ if self._is_spanning(box, structure["spanning_cells"]):
+ box["SP"] = True
+
+ # Store for later extraction
+ self.tb_cpns[layout["id"]] = {
+ "boxes": table_boxes,
+ "structure": structure
+ }
+```
+
+### 4.2 Table Extraction
+
+```python
+# deepdoc/parser/pdf_parser.py, lines 757-930
+
+def _extract_table_figure(self, need_image, ZM, return_html, need_position):
+ """
+ Extract tables and figures from detected layouts.
+ """
+ tables = []
+
+ for layout_id, table_data in self.tb_cpns.items():
+ boxes = table_data["boxes"]
+
+ # Construct table (HTML or descriptive)
+ if return_html:
+ content = TableStructureRecognizer.construct_table(
+ boxes, html=True
+ )
+ else:
+ content = TableStructureRecognizer.construct_table(
+ boxes, html=False
+ )
+
+ table = {
+ "content": content,
+ "bbox": table_data["bbox"],
+ }
+
+ if need_image:
+ table["image"] = self._crop_region(table_data["bbox"])
+
+ tables.append(table)
+
+ return tables
+```
+
+---
+
+## 5. Performance Considerations
+
+### 5.1 Batch Processing
+
+```python
+# deepdoc/vision/recognizer.py, lines 415-437
+
+def __call__(self, image_list, thr=0.7, batch_size=16):
+ """
+ Batch inference for efficiency.
+
+ Why batch_size=16?
+ - GPU memory optimization
+ - Balance throughput vs latency
+ - Typical document has 10-50 elements
+ """
+ results = []
+
+ for i in range(0, len(image_list), batch_size):
+ batch = image_list[i:i+batch_size]
+
+ # Preprocess
+ inputs = self.preprocess(batch)
+
+ # Inference
+ outputs = self.ort_sess.run(None, inputs)
+
+ # Postprocess
+ batch_results = self.postprocess(outputs, inputs, thr)
+ results.extend(batch_results)
+
+ return results
+```
+
+### 5.2 Model Caching
+
+```python
+# deepdoc/vision/ocr.py, lines 36-73
+
+# Global model cache
+loaded_models = {}
+
+def load_model(model_dir, nm, device_id=None):
+ """
+ Load ONNX model with caching.
+
+ Cache key: model_path + device_id
+ """
+ model_path = os.path.join(model_dir, f"{nm}.onnx")
+ cache_key = f"{model_path}_{device_id}"
+
+ if cache_key in loaded_models:
+ return loaded_models[cache_key]
+
+ # Load model...
+ session = ort.InferenceSession(model_path, ...)
+
+ loaded_models[cache_key] = (session, run_opts)
+ return session, run_opts
+```
+
+---
+
+## 6. Troubleshooting
+
+### 6.1 Common Issues
+
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| Missing table | Low confidence | Lower threshold (0.1-0.2) |
+| Wrong colspan | Misaligned detection | Check row/column alignment |
+| Merged cells wrong | Overlap threshold | Adjust SP detection threshold |
+| Headers not detected | All numeric | Manual header specification |
+| Layout overlap | NMS threshold | Increase NMS IoU threshold |
+
+### 6.2 Debugging
+
+```python
+# Visualize layout detection
+from deepdoc.vision.seeit import draw_boxes
+
+# Draw layout boxes on image
+layout_vis = draw_boxes(
+ page_image,
+ [(l["bbox"], l["type"]) for l in page_layouts],
+ colors={
+ "Text": (0, 255, 0),
+ "Table": (255, 0, 0),
+ "Figure": (0, 0, 255),
+ }
+)
+cv2.imwrite("layout_debug.png", layout_vis)
+
+# Check table structure
+for box in table_boxes:
+ print(f"Text: {box['text']}")
+ print(f" Row: {box.get('R', 'N/A')}")
+ print(f" Col: {box.get('C', 'N/A')}")
+ print(f" Spanning: {box.get('SP', False)}")
+```
+
+---
+
+## 7. References
+
+- YOLOv10 Paper: [YOLOv10: Real-Time End-to-End Object Detection](https://arxiv.org/abs/2405.14458)
+- Table Transformer: [PubTables-1M: Towards comprehensive table extraction](https://arxiv.org/abs/2110.00061)
+- Document Layout Analysis: [A Survey](https://arxiv.org/abs/2012.15005)
diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md
new file mode 100644
index 000000000..1885b37f3
--- /dev/null
+++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md
@@ -0,0 +1,678 @@
+# OCR Deep Dive
+
+## Overview
+
+The OCR module in DeepDoc performs two main tasks:
+1. **Text Detection**: find the regions of an image that contain text
+2. **Text Recognition**: recognize the text inside each detected region
+
+## File Structure
+
+```
+deepdoc/vision/
+├── ocr.py # Main OCR class (752 lines)
+├── postprocess.py # CTC decoder, DBNet postprocess (371 lines)
+└── operators.py # Image preprocessing (726 lines)
+```
+
+---
+
+## 1. Text Detection (DBNet)
+
+### 1.1 Model Architecture
+
+```
+DBNet (Differentiable Binarization Network):
+
+Input Image (H, W, 3)
+ │
+ ▼
+┌─────────────────────────────────────┐
+│ ResNet-18 Backbone │
+│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐│
+│ │ C1 │→ │ C2 │→ │ C3 │→ │ C4 ││
+│ │64ch │ │128ch│ │256ch│ │512ch││
+│ └─────┘ └─────┘ └─────┘ └─────┘│
+└─────────────────────────────────────┘
+ │ │ │ │
+ ▼ ▼ ▼ ▼
+┌─────────────────────────────────────┐
+│ Feature Pyramid Network │
+│ Upsample + Concatenate all levels │
+│ Output: 256 channels │
+└─────────────────────────────────────┘
+ │
+ ├─────────────────┐
+ ▼ ▼
+┌─────────────────┐ ┌─────────────────┐
+│ Probability │ │ Threshold │
+│ Head │ │ Head │
+│ Conv → Sigmoid │ │ Conv → Sigmoid │
+└────────┬────────┘ └────────┬────────┘
+ │ │
+ ▼ ▼
+ Prob Map (H, W) Thresh Map (H, W)
+ │ │
+ └─────────┬─────────┘
+ ▼
+┌─────────────────────────────────────┐
+│ Differentiable Binarization │
+│ B = sigmoid((P - T) * k) │
+│ k = 50 (amplification factor) │
+└─────────────────────────────────────┘
+ │
+ ▼
+ Binary Map (H, W)
+```
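+
+The binarization block at the bottom of the diagram is a single element-wise formula; a tiny NumPy sketch (the map values are made up for illustration):
+
+```python
+import numpy as np
+
+def differentiable_binarization(prob_map, thresh_map, k=50):
+    # B = sigmoid((P - T) * k); k amplifies the gap so the output is
+    # nearly binary while staying differentiable during training
+    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
+
+P = np.array([[0.10, 0.55, 0.90]])   # probability head output
+T = np.array([[0.30, 0.30, 0.30]])   # threshold head output
+print(differentiable_binarization(P, T).round(3))   # ≈ [[0. 1. 1.]]
+```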
+
+### 1.2 DBNet Post-processing
+
+```python
+# deepdoc/vision/postprocess.py, lines 41-259
+
+class DBPostProcess:
+    def __init__(self,
+                 thresh=0.3,             # Binary threshold
+                 box_thresh=0.5,         # Box confidence threshold
+                 max_candidates=1000,    # Maximum text regions
+                 unclip_ratio=1.5,       # Polygon expansion ratio
+                 use_dilation=False,     # Morphological dilation
+                 score_mode="fast"):     # "fast" or "slow" scoring
+        self.thresh = thresh
+        self.box_thresh = box_thresh
+        self.max_candidates = max_candidates
+        self.unclip_ratio = unclip_ratio
+        self.use_dilation = use_dilation
+        self.score_mode = score_mode
+
+ def __call__(self, outs_dict, shape_list):
+ """
+ Post-process DBNet output.
+
+ Args:
+ outs_dict: {"maps": probability_map}
+ shape_list: Original image shapes
+
+ Returns:
+ List of detected text boxes
+ """
+ pred = outs_dict["maps"] # (N, 1, H, W)
+
+ # Step 1: Binary thresholding
+ bitmap = pred > self.thresh # 0.3
+
+ # Step 2: Optional dilation
+ if self.use_dilation:
+ kernel = np.ones((2, 2))
+ bitmap = cv2.dilate(bitmap, kernel)
+
+ # Step 3: Find contours
+        contours, _ = cv2.findContours(
+            bitmap.astype(np.uint8),
+            cv2.RETR_LIST,
+            cv2.CHAIN_APPROX_SIMPLE
+        )
+
+ # Step 4: Process each contour
+ boxes = []
+ for contour in contours[:self.max_candidates]:
+ # Simplify polygon
+ epsilon = 0.002 * cv2.arcLength(contour, True)
+ approx = cv2.approxPolyDP(contour, epsilon, True)
+
+ if len(approx) < 4:
+ continue
+
+ # Calculate confidence score
+ score = self.box_score_fast(pred, approx)
+ if score < self.box_thresh:
+ continue
+
+ # Unclip (expand) polygon
+ box = self.unclip(approx, self.unclip_ratio)
+ boxes.append(box)
+
+ return boxes
+```
+
+### 1.3 Unclipping Algorithm
+
+**Problem**: DBNet tends to predict tight boundaries → edge characters get cut off
+
+**Solution**: expand each detected polygon by `unclip_ratio`
+
+```python
+# deepdoc/vision/postprocess.py, lines 163-169
+
+def unclip(self, box, unclip_ratio):
+    """
+    Expand a polygon using the Clipper library
+    (uses shapely's Polygon and pyclipper, imported at module level).
+
+    Formula:
+        distance = Area * unclip_ratio / Perimeter
+
+    With unclip_ratio = 1.5 the offset distance scales with the polygon,
+    so larger regions get a proportionally larger absolute expansion.
+    """
+ poly = Polygon(box)
+ distance = poly.area * unclip_ratio / poly.length
+
+ offset = pyclipper.PyclipperOffset()
+ offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
+
+ expanded = offset.Execute(distance)
+ return np.array(expanded[0])
+```
+
+**Visualization**:
+```
+Original detection: After unclip (1.5x):
+┌──────────────┐ ┌────────────────────┐
+│ Hello │ → │ Hello │
+└──────────────┘ └────────────────────┘
+ (expanded boundaries)
+```
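+
+For a sense of scale: a 100 × 20 px detection has area 2000 and perimeter 240, so with `unclip_ratio = 1.5` the offset distance is 2000 × 1.5 / 240 ≈ 12.5 px, which is usually enough to recover characters clipped at the edges.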
+
+---
+
+## 2. Text Recognition (CRNN)
+
+### 2.1 Model Architecture
+
+```
+CRNN (Convolutional Recurrent Neural Network):
+
+Input: Cropped text image (3, 48, W)
+ │
+ ▼
+┌─────────────────────────────────────┐
+│ CNN Backbone │
+│ VGG-style convolutions │
+│ 7 conv layers + 4 max pooling │
+│ Output: (512, 1, W/4) │
+└────────────────┬────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────┐
+│ Sequence Reshaping │
+│ Collapse height dimension │
+│ Output: (W/4, 512) │
+└────────────────┬────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────┐
+│ Bidirectional LSTM │
+│ 2 layers, 256 hidden units │
+│ Output: (W/4, 512) │
+└────────────────┬────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────┐
+│ Classification Head │
+│ Linear(512 → num_classes) │
+│ Output: (W/4, num_classes) │
+└────────────────┬────────────────────┘
+ │
+ ▼
+ Probability Matrix (T, C)
+ T = time steps, C = characters
+```
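+
+To make the shapes concrete: a crop normalized to 48 × 320 produces 320 / 4 = 80 time steps, each a probability distribution over the character set plus the CTC blank; the decoder below collapses those 80 frames into the final string.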
+
+### 2.2 CTC Decoding
+
+```python
+# deepdoc/vision/postprocess.py, lines 347-370
+
+class CTCLabelDecode(BaseRecLabelDecode):
+ """
+ CTC (Connectionist Temporal Classification) Decoder.
+
+    CTC solves the alignment problem:
+    - The model output has T time steps
+    - The ground truth has N characters
+    - T > N (several frames map to one character)
+    - The exact alignment is unknown
+
+    CTC adds a special "blank" token (ε):
+ - Represents "no output"
+ - Allows alignment without explicit segmentation
+ """
+
+ def __init__(self, character_dict_path, use_space_char=False):
+ super().__init__(character_dict_path, use_space_char)
+ # Prepend blank token at index 0
+ self.character = ['blank'] + self.character
+
+ def __call__(self, preds, label=None):
+ """
+ Decode CTC output.
+
+ Args:
+ preds: (batch, time, num_classes) probability matrix
+
+ Returns:
+ [(text, confidence), ...]
+ """
+ # Get most probable character at each time step
+ preds_idx = preds.argmax(axis=2) # (batch, time)
+ preds_prob = preds.max(axis=2) # (batch, time)
+
+ # Decode with deduplication
+ result = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
+
+ return result
+
+ def decode(self, text_index, text_prob, is_remove_duplicate=True):
+ """
+ CTC decoding algorithm.
+
+ Example:
+ Raw output: [a, a, ε, l, l, ε, p, h, a]
+ After dedup: [a, ε, l, ε, p, h, a]
+ Remove blank: [a, l, p, h, a]
+ Final: "alpha"
+ """
+ result = []
+
+ for batch_idx in range(len(text_index)):
+ char_list = []
+ conf_list = []
+
+ for idx in range(len(text_index[batch_idx])):
+ char_idx = text_index[batch_idx][idx]
+
+ # Skip blank token (index 0)
+ if char_idx == 0:
+ continue
+
+ # Skip consecutive duplicates
+ if is_remove_duplicate:
+ if idx > 0 and char_idx == text_index[batch_idx][idx-1]:
+ continue
+
+ char_list.append(self.character[char_idx])
+ conf_list.append(text_prob[batch_idx][idx])
+
+ text = ''.join(char_list)
+ conf = np.mean(conf_list) if conf_list else 0.0
+
+ result.append((text, conf))
+
+ return result
+```
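+
+A toy greedy decode that walks through the dedup-then-drop-blank steps from the docstring above (the index sequence and alphabet are made up):
+
+```python
+characters = ['blank', 'a', 'h', 'l', 'p']
+
+raw = [1, 1, 0, 3, 3, 0, 4, 2, 1]          # a a ε l l ε p h a
+
+decoded = []
+for t, idx in enumerate(raw):
+    if idx == 0:                            # skip the blank token
+        continue
+    if t > 0 and idx == raw[t - 1]:         # skip consecutive duplicates
+        continue
+    decoded.append(characters[idx])
+
+print(''.join(decoded))                     # alpha
+```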
+
+### 2.3 Aspect Ratio Handling
+
+```python
+# deepdoc/vision/ocr.py, lines 146-170
+
+def resize_norm_img(self, img, max_wh_ratio):
+ """
+ Resize image maintaining aspect ratio.
+
+    Problem: text crops come in very different widths
+    - "Hi" → narrow
+    - "Hello World" → wide
+
+    Solution: resize by aspect ratio, then pad the right side
+ """
+ imgC, imgH, imgW = self.rec_image_shape # [3, 48, 320]
+
+ # Calculate target width from aspect ratio
+ max_width = int(imgH * max_wh_ratio)
+ max_width = min(max_width, imgW) # Cap at 320
+
+ h, w = img.shape[:2]
+ ratio = w / float(h)
+
+ # Resize maintaining aspect ratio
+ if ratio * imgH > max_width:
+ resized_w = max_width
+ else:
+ resized_w = int(ratio * imgH)
+
+ resized_img = cv2.resize(img, (resized_w, imgH))
+
+ # Pad right side to max_width
+ padded = np.zeros((imgH, max_width, 3), dtype=np.float32)
+ padded[:, :resized_w, :] = resized_img
+
+ # Normalize: [0, 255] → [-1, 1]
+ padded = (padded / 255.0 - 0.5) / 0.5
+
+ # Transpose: HWC → CHW
+ padded = padded.transpose(2, 0, 1)
+
+ return padded
+```
+
+**Visualization**:
+```
+Original images:
+┌──────┐ ┌────────────────┐ ┌──────────────────────┐
+│ Hi │ │ Hello │ │ Hello World │
+└──────┘ └────────────────┘ └──────────────────────┘
+ narrow medium wide
+
+After resize + pad (to width 320):
+┌──────────────────────────────────────────────────────┐
+│ Hi │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
+├──────────────────────────────────────────────────────┤
+│ Hello │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
+├──────────────────────────────────────────────────────┤
+│ Hello World │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
+└──────────────────────────────────────────────────────┘
+(░ = zero padding)
+```
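+
+Worked example: a "Hello" crop of 120 × 24 px has ratio 5, so it is resized to 48 × (5 × 48) = 48 × 240; if the widest image in the batch pins `max_width` at 320, the remaining 80 columns stay zero-padded, and pixel values are mapped from [0, 255] to [-1, 1].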
+
+---
+
+## 3. Full OCR Pipeline
+
+### 3.1 OCR Class
+
+```python
+# deepdoc/vision/ocr.py, lines 536-752
+
+class OCR:
+ """
+ End-to-end OCR pipeline.
+
+ Usage:
+ ocr = OCR()
+ results = ocr(image)
+ # results: [(box_points, (text, confidence)), ...]
+ """
+
+ def __init__(self, model_dir=None):
+ # Auto-download models if not found
+ if model_dir is None:
+ model_dir = self._get_model_dir()
+
+ # Initialize detector and recognizer
+ self.text_detector = TextDetector(model_dir)
+ self.text_recognizer = TextRecognizer(model_dir)
+
+ def __call__(self, img, device_id=0, cls=True):
+ """
+ Full OCR pipeline.
+
+ Args:
+ img: numpy array (H, W, 3) in BGR
+ device_id: GPU device ID
+ cls: Whether to check text orientation
+
+ Returns:
+ [(box_4pts, (text, confidence)), ...]
+ """
+ # Step 1: Detect text regions
+ dt_boxes, det_time = self.text_detector(img)
+
+ if dt_boxes is None or len(dt_boxes) == 0:
+ return []
+
+ # Step 2: Sort boxes by reading order
+ dt_boxes = self.sorted_boxes(dt_boxes)
+
+ # Step 3: Crop and rotate each text region
+ img_crop_list = []
+ for box in dt_boxes:
+ tmp_box = self.get_rotate_crop_image(img, box)
+ img_crop_list.append(tmp_box)
+
+ # Step 4: Recognize text
+ rec_res, rec_time = self.text_recognizer(img_crop_list)
+
+ # Step 5: Filter by confidence
+ results = []
+ for box, rec in zip(dt_boxes, rec_res):
+ text, score = rec
+ if score >= 0.5: # drop_score threshold
+ results.append((box, (text, score)))
+
+ return results
+```
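+
+Putting it together, a minimal end-to-end call based on the class above (the file name is hypothetical):
+
+```python
+import cv2
+from deepdoc.vision.ocr import OCR
+
+ocr = OCR()                      # loads detector + recognizer models
+img = cv2.imread("page_1.png")   # BGR image, as expected by __call__
+
+for box, (text, conf) in ocr(img):
+    print(f"{conf:.2f}  {text}")
+```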
+
+### 3.2 Rotation Detection
+
+```python
+# deepdoc/vision/ocr.py, lines 584-638
+
+def get_rotate_crop_image(self, img, points):
+ """
+ Crop text region with automatic rotation detection.
+
+    Problem: the text may be rotated 90° or 270°
+    Solution: try several orientations and keep the one with the best recognition score
+ """
+ # Order points: top-left → top-right → bottom-right → bottom-left
+ rect = self.order_points_clockwise(points)
+
+ # Perspective transform to get rectangular crop
+ width = int(max(
+ np.linalg.norm(rect[0] - rect[1]),
+ np.linalg.norm(rect[2] - rect[3])
+ ))
+ height = int(max(
+ np.linalg.norm(rect[0] - rect[3]),
+ np.linalg.norm(rect[1] - rect[2])
+ ))
+
+ dst = np.array([
+ [0, 0],
+ [width, 0],
+ [width, height],
+ [0, height]
+ ], dtype=np.float32)
+
+ M = cv2.getPerspectiveTransform(rect, dst)
+ warped = cv2.warpPerspective(img, M, (width, height))
+
+ # Check if text is vertical (need rotation)
+ if warped.shape[0] / warped.shape[1] >= 1.5:
+ # Try 3 orientations
+ orientations = [
+ (warped, 0), # Original
+ (cv2.rotate(warped, cv2.ROTATE_90_CLOCKWISE), 90),
+ (cv2.rotate(warped, cv2.ROTATE_90_COUNTERCLOCKWISE), -90)
+ ]
+
+ best_score = -1
+ best_img = warped
+
+ for rot_img, angle in orientations:
+ # Quick recognition to get confidence
+            _, score = self.text_recognizer([rot_img])[0][0]
+ if score > best_score:
+ best_score = score
+ best_img = rot_img
+
+ warped = best_img
+
+ return warped
+```
+
+### 3.3 Reading Order Sorting
+
+```python
+# deepdoc/vision/ocr.py, lines 640-661
+
+def sorted_boxes(self, dt_boxes):
+ """
+ Sort boxes by reading order (top-to-bottom, left-to-right).
+
+ Algorithm:
+ 1. Sort by Y coordinate (top of box)
+ 2. Within same "row" (Y within 10px), sort by X coordinate
+ """
+ num_boxes = len(dt_boxes)
+ sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
+
+ # Group into rows and sort each row
+ _boxes = list(sorted_boxes)
+
+ for i in range(num_boxes - 1):
+ for j in range(i, -1, -1):
+ # If boxes are on same row (Y difference < 10)
+ if abs(_boxes[j+1][0][1] - _boxes[j][0][1]) < 10:
+ # Sort by X coordinate
+ if _boxes[j+1][0][0] < _boxes[j][0][0]:
+ _boxes[j], _boxes[j+1] = _boxes[j+1], _boxes[j]
+ else:
+ break
+
+ return _boxes
+```
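+
+For example, a box whose top-left corner is at (x=200, y=12) initially sorts ahead of one at (x=20, y=15) because 12 < 15; the second pass treats them as the same row (ΔY = 3 < 10 px) and swaps them, so the left-hand box is read first.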
+
+---
+
+## 4. Performance Optimization
+
+### 4.1 GPU Memory Management
+
+```python
+# deepdoc/vision/ocr.py, lines 96-127
+
+def load_model(model_dir, nm, device_id=None):
+ """
+ Load ONNX model with optimized settings.
+    """
+    model_path = os.path.join(model_dir, f"{nm}.onnx")
+
+    options = ort.SessionOptions()
+
+ # Reduce memory fragmentation
+ options.enable_cpu_mem_arena = False
+
+ # Sequential execution (more predictable memory)
+ options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+
+ # Limit thread usage
+ options.intra_op_num_threads = 2
+ options.inter_op_num_threads = 2
+
+ # GPU configuration
+ if torch.cuda.is_available() and device_id is not None:
+ providers = [
+ ('CUDAExecutionProvider', {
+ 'device_id': device_id,
+ # Limit GPU memory to 2GB
+ 'gpu_mem_limit': int(os.getenv('OCR_GPU_MEM_LIMIT_MB', 2048)) * 1024 * 1024,
+ # Memory allocation strategy
+ 'arena_extend_strategy': os.getenv('OCR_ARENA_EXTEND_STRATEGY', 'kNextPowerOfTwo'),
+ })
+ ]
+ else:
+ providers = ['CPUExecutionProvider']
+
+ session = ort.InferenceSession(model_path, options, providers)
+
+ # Run options for memory cleanup after each run
+ run_opts = ort.RunOptions()
+ run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
+
+ return session, run_opts
+```
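+
+Because both values are read from environment variables in the snippet above, they can be tuned without touching the code; a sketch for a larger GPU (numbers illustrative):
+
+```python
+import os
+
+# Set before the OCR models are loaded
+os.environ["OCR_GPU_MEM_LIMIT_MB"] = "4096"                   # allow up to 4 GB
+os.environ["OCR_ARENA_EXTEND_STRATEGY"] = "kSameAsRequested"  # less over-allocation
+```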
+
+### 4.2 Batch Processing Optimization
+
+```python
+# deepdoc/vision/ocr.py, lines 363-408
+
+def __call__(self, img_list):
+ """
+ Optimized batch recognition.
+ """
+ # Sort images by aspect ratio for efficient batching
+ # Similar widths → less padding waste
+ indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list])
+
+ results = [None] * len(img_list)
+
+ for batch_start in range(0, len(indices), self.batch_size):
+ batch_indices = indices[batch_start:batch_start + self.batch_size]
+
+ # Get max width in batch for padding
+ max_wh_ratio = max(img_list[i].shape[1]/img_list[i].shape[0]
+ for i in batch_indices)
+
+ # Normalize all images to same width
+ norm_imgs = []
+ for i in batch_indices:
+ norm_img = self.resize_norm_img(img_list[i], max_wh_ratio)
+ norm_imgs.append(norm_img)
+
+ # Stack into batch
+ batch = np.stack(norm_imgs)
+
+ # Run inference
+ preds = self.ort_sess.run(None, {"input": batch})
+
+ # Decode results
+ texts = self.postprocess_op(preds[0])
+
+ # Map back to original indices
+ for j, idx in enumerate(batch_indices):
+ results[idx] = texts[j]
+
+ return results
+```
+
+### 4.3 Multi-GPU Parallel Processing
+
+```python
+# deepdoc/vision/ocr.py, lines 556-579
+
+class OCR:
+ def __init__(self, model_dir=None):
+ if settings.PARALLEL_DEVICES > 0:
+ # Create per-GPU instances
+ self.text_detector = [
+ TextDetector(model_dir, device_id)
+ for device_id in range(settings.PARALLEL_DEVICES)
+ ]
+ self.text_recognizer = [
+ TextRecognizer(model_dir, device_id)
+ for device_id in range(settings.PARALLEL_DEVICES)
+ ]
+ else:
+ # Single instance for CPU/single GPU
+ self.text_detector = TextDetector(model_dir)
+ self.text_recognizer = TextRecognizer(model_dir)
+```
+
+---
+
+## 5. Troubleshooting
+
+### 5.1 Common Issues
+
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| Low accuracy | Low resolution input | Increase zoomin factor (3-5) |
+| Slow inference | Large images | Resize to max 960px |
+| Memory error | Too many candidates | Reduce max_candidates |
+| Missing text | Tight boundaries | Increase unclip_ratio |
+| Wrong orientation | Vertical text | Enable rotation detection |
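+
+Several of these fixes map directly to the `DBPostProcess` parameters from section 1.2; a recall-oriented configuration might look like this (values are illustrative, not tuned defaults):
+
+```python
+from deepdoc.vision.postprocess import DBPostProcess
+
+post = DBPostProcess(
+    thresh=0.2,          # keep fainter text in the binary map
+    box_thresh=0.4,      # accept lower-confidence boxes
+    unclip_ratio=2.0,    # expand boxes further -> fewer clipped edge characters
+    max_candidates=500,  # cap the number of regions if memory is tight
+)
+```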
+
+### 5.2 Debugging Tips
+
+```python
+# Enable verbose logging
+import logging
+logging.basicConfig(level=logging.DEBUG)
+
+# Visualize detections
+from deepdoc.vision.seeit import draw_boxes
+
+img_with_boxes = draw_boxes(img, dt_boxes)
+cv2.imwrite("debug_detection.png", img_with_boxes)
+
+# Check confidence scores
+for box, (text, conf) in results:
+ print(f"Text: {text}, Confidence: {conf:.2f}")
+ if conf < 0.5:
+ print(" ⚠️ Low confidence!")
+```
+
+---
+
+## 6. References
+
+- DBNet Paper: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
+- CRNN Paper: [An End-to-End Trainable Neural Network for Image-based Sequence Recognition](https://arxiv.org/abs/1507.05717)
+- CTC Paper: [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
+- PaddleOCR: [GitHub](https://github.com/PaddlePaddle/PaddleOCR)