From 6d4dbbfe2c0eb2732cabf85f17bff1f00f40bd27 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 27 Nov 2025 03:46:14 +0000
Subject: [PATCH] docs: Add comprehensive DeepDoc deep guide documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Created in-depth documentation for understanding the deepdoc module:

- README.md: Complete deep guide with:
  - Big picture explanation (what problem deepdoc solves)
  - Data flow diagrams (Input → Processing → Output)
  - Detailed code analysis with line numbers
  - Technical explanations (ONNX, CTC, NMS, etc.)
  - Design reasoning (why certain technologies chosen)
  - Difficult terms glossary
  - Extension examples

- ocr_deep_dive.md: Deep dive into OCR subsystem
  - DBNet text detection architecture
  - CRNN text recognition
  - CTC decoding algorithm
  - Rotation handling
  - Performance optimization

- layout_table_deep_dive.md: Deep dive into layout/table recognition
  - YOLOv10 layout detection
  - Table structure recognition
  - Grid construction algorithm
  - Spanning cell handling
  - HTML/descriptive output generation
---
 .../07-DEEPDOC-DEEP-GUIDE/README.md           | 1286 +++++++++++++++++
 .../layout_table_deep_dive.md                 |  926 ++++++++++++
 .../07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md    |  678 +++++++++
 3 files changed, 2890 insertions(+)
 create mode 100644 personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md
 create mode 100644 personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md
 create mode 100644 personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md

diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md
new file mode 100644
index 000000000..45c4d3fcb
--- /dev/null
+++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md
@@ -0,0 +1,1286 @@
+# DeepDoc Module - Hướng Dẫn Đọc Hiểu Chuyên Sâu
+
+## Mục Lục
+
+1. [Bức Tranh Lớn](#1-bức-tranh-lớn)
+2. [Luồng Dữ Liệu](#2-luồng-dữ-liệu)
+3. [Phân Tích Chi Tiết Code](#3-phân-tích-chi-tiết-code)
+4. [Giải Thích Kỹ Thuật](#4-giải-thích-kỹ-thuật)
+5. [Lý Do Thiết Kế](#5-lý-do-thiết-kế)
+6. [Thuật Ngữ Khó](#6-thuật-ngữ-khó)
+7. [Mở Rộng Từ Code](#7-mở-rộng-từ-code)
+
+---
+
+## 1. Bức Tranh Lớn
+
+### 1.1 DeepDoc Giải Quyết Vấn Đề Gì?
+
+**Vấn đề cốt lõi**: Khi xây dựng hệ thống RAG (Retrieval-Augmented Generation), bạn cần chuyển đổi tài liệu (PDF, Word, Excel...) thành dạng text có cấu trúc để:
+- Tìm kiếm semantic (vector search)
+- Chia nhỏ (chunking) hợp lý
+- Giữ nguyên ngữ cảnh của bảng, hình ảnh
+
+**DeepDoc là gì?**: Một module Python chuyên biệt để:
+```
+Document Files → Structured Text + Tables + Figures
+(PDF, DOCX...)   (Có position, layout type, reading order)
+```
+
+### 1.2 Kiến Trúc Tổng Quan
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                              DEEPDOC MODULE                                  │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                              │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │                         PARSER LAYER                                 │   │
+│  │  Chuyển đổi các định dạng file thành text có cấu trúc               │   │
+│  │                                                                      │   │
+│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │   │
+│  │  │   PDF    │ │   DOCX   │ │  Excel   │ │   HTML   │ │ Markdown │  │   │
+│  │  │  Parser  │ │  Parser  │ │  Parser  │ │  Parser  │ │  Parser  │  │   │
+│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘  │   │
+│  │       │            │            │            │            │         │   │
+│  └───────┼────────────┼────────────┼────────────┼────────────┼─────────┘   │
+│          │            │            │            │            │              │
+│          │            └────────────┴────────────┴────────────┘              │
+│          │                         │                                        │
+│          │              Text-based parsing                                  │
+│          │              (pdfplumber, python-docx, openpyxl...)             │
+│          │                                                                  │
+│          ▼                                                                  │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │                         VISION LAYER                                 │   │
+│  │  Computer Vision cho PDF phức tạp (scanned, multi-column)           │   │
+│  │                                                                      │   │
+│  │  ┌──────────────┐  ┌──────────────────┐  ┌────────────────────┐    │   │
+│  │  │     OCR      │  │ Layout Recognizer│  │ Table Structure    │    │   │
+│  │  │  Detection + │  │    (YOLOv10)     │  │   Recognizer       │    │   │
+│  │  │  Recognition │  │                  │  │                    │    │   │
+│  │  └──────────────┘  └──────────────────┘  └────────────────────┘    │   │
+│  │         │                   │                      │                │   │
+│  │         └───────────────────┴──────────────────────┘                │   │
+│  │                             │                                        │   │
+│  │                    ONNX Runtime Inference                            │   │
+│  │                                                                      │   │
+│  └─────────────────────────────────────────────────────────────────────┘   │
+│                                                                              │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+### 1.3 Các Thành Phần Chính
+
+| Thành Phần | File | Mục Đích |
+|------------|------|----------|
+| **PDF Parser** | `parser/pdf_parser.py` | Parser phức tạp nhất - xử lý PDF với OCR + layout |
+| **Office Parsers** | `parser/docx_parser.py`, `excel_parser.py`, `ppt_parser.py` | Xử lý file Microsoft Office |
+| **Web Parsers** | `parser/html_parser.py`, `markdown_parser.py`, `json_parser.py` | Xử lý file web/markup |
+| **OCR Engine** | `vision/ocr.py` | Text detection + recognition |
+| **Layout Detector** | `vision/layout_recognizer.py` | Phân loại vùng (text, table, figure...) |
+| **Table Detector** | `vision/table_structure_recognizer.py` | Nhận dạng cấu trúc bảng |
+| **Operators** | `vision/operators.py` | Image preprocessing pipeline |
+
+### 1.4 Tại Sao Cần DeepDoc?
+
+**Không có DeepDoc** (naive approach):
+```python
+# Chỉ extract raw text từ PDF
+text = pdfplumber.open("doc.pdf").pages[0].extract_text()
+# Kết quả: "Header Footer Table content mixed together..."
+# ❌ Mất cấu trúc, table thành text xáo trộn
+```
+
+**Với DeepDoc**:
+```python
+parser = RAGFlowPdfParser()
+docs, tables = parser("doc.pdf")
+# docs: [("Paragraph 1", "page_0_pos_100_200"), ("Paragraph 2", "page_0_pos_300_400")]
+# tables: [{"html": "<table>...</table>", "bbox": [...]}]
+# ✅ Giữ nguyên cấu trúc, table được parse riêng
+```
+
+---
+
+## 2. Luồng Dữ Liệu
+
+### 2.1 Luồng Chính: PDF Processing
+
+```
+┌────────────────────────────────────────────────────────────────────────────┐
+│                        PDF PROCESSING PIPELINE                              │
+└────────────────────────────────────────────────────────────────────────────┘
+
+Input: PDF File (path hoặc bytes)
+         │
+         ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 1: IMAGE EXTRACTION                                                    │
+│  File: pdf_parser.py, __images__() (lines 1042-1159)                        │
+│                                                                              │
+│  • Convert PDF pages → numpy images (using pdfplumber)                      │
+│  • Extract native PDF characters (text layer)                               │
+│  • Zoom factor: 3x (default) for OCR accuracy                               │
+│                                                                              │
+│  Output: page_images[], page_chars[]                                        │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 2: OCR DETECTION & RECOGNITION                                         │
+│  File: vision/ocr.py, OCR.__call__() (lines 708-751)                        │
+│                                                                              │
+│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
+│  │ TextDetector │ →  │   Crop &     │ →  │TextRecognizer│                   │
+│  │   (DBNet)    │    │   Rotate     │    │   (CRNN)     │                   │
+│  └──────────────┘    └──────────────┘    └──────────────┘                   │
+│                                                                              │
+│  • Detect text regions → bounding boxes                                     │
+│  • Crop each region, auto-rotate if needed                                  │
+│  • Recognize text in each region                                            │
+│                                                                              │
+│  Output: boxes[] with {text, confidence, coordinates}                       │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 3: LAYOUT RECOGNITION                                                  │
+│  File: vision/layout_recognizer.py, __call__() (lines 63-157)               │
+│                                                                              │
+│  • Run YOLOv10 model on page image                                          │
+│  • Detect 10 layout types: Text, Title, Table, Figure, etc.                 │
+│  • Match OCR boxes to layout regions                                        │
+│                                                                              │
+│  Output: boxes[] with added {layout_type, layoutno}                         │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 4: COLUMN DETECTION                                                    │
+│  File: pdf_parser.py, _assign_column() (lines 355-440)                      │
+│                                                                              │
+│  • K-Means clustering on X coordinates                                      │
+│  • Silhouette score to find optimal k (1-4 columns)                         │
+│  • Assign col_id to each text box                                           │
+│                                                                              │
+│  Output: boxes[] with added {col_id}                                        │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 5: TABLE STRUCTURE RECOGNITION                                         │
+│  File: vision/table_structure_recognizer.py, __call__() (lines 67-111)      │
+│                                                                              │
+│  • Detect rows, columns, headers, spanning cells                            │
+│  • Match text boxes to table cells                                          │
+│  • Build 2D table matrix                                                    │
+│                                                                              │
+│  Output: table_components[] with grid structure                             │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 6: TEXT MERGING                                                        │
+│  File: pdf_parser.py, _text_merge() (lines 442-478)                         │
+│                           _naive_vertical_merge() (lines 480-556)           │
+│                                                                              │
+│  • Horizontal merge: same line, same column, same layout                    │
+│  • Vertical merge: adjacent paragraphs with semantic checks                 │
+│  • Respect sentence boundaries (。？！)                                      │
+│                                                                              │
+│  Output: merged_boxes[] (fewer, larger text blocks)                         │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 7: FILTERING & CLEANUP                                                 │
+│  File: pdf_parser.py, _filter_forpages() (lines 685-729)                    │
+│                        __filterout_scraps() (lines 971-1029)                │
+│                                                                              │
+│  • Remove headers/footers (top/bottom 10% of page)                          │
+│  • Remove table of contents                                                 │
+│  • Filter low-quality OCR results                                           │
+│                                                                              │
+│  Output: clean_boxes[]                                                      │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  STEP 8: EXTRACT TABLES & FIGURES                                            │
+│  File: pdf_parser.py, _extract_table_figure() (lines 757-930)               │
+│                                                                              │
+│  • Convert table boxes to HTML/descriptive text                             │
+│  • Extract figure images with captions                                      │
+│  • Handle spanning cells (colspan, rowspan)                                 │
+│                                                                              │
+│  Output: tables[], figures[]                                                │
+└──────────────────────────────────┬──────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  FINAL OUTPUT                                                                │
+│                                                                              │
+│  documents: [(text, position_tag), ...]                                     │
+│  tables: [{"html": "...", "bbox": [...], "image": ...}, ...]               │
+│                                                                              │
+│  position_tag format: "page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}"        │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+### 2.2 Luồng OCR Chi Tiết
+
+```
+                           Input Image (H, W, 3)
+                                    │
+                                    ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                        TEXT DETECTION (DBNet)                                │
+│  File: vision/ocr.py, TextDetector.__call__() (lines 503-530)               │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                    │
+           ┌────────────────────────┼────────────────────────┐
+           │                        │                        │
+           ▼                        ▼                        ▼
+    ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
+    │ Preprocess  │          │    ONNX     │          │ Postprocess │
+    │             │          │  Inference  │          │             │
+    │ • Resize    │    →     │             │    →     │ • Threshold │
+    │ • Normalize │          │  DBNet      │          │ • Contours  │
+    │ • Transpose │          │  Model      │          │ • Unclip    │
+    └─────────────┘          └─────────────┘          └─────────────┘
+                                    │
+                                    ▼
+                         Text Region Polygons
+                         [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
+                                    │
+                                    ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                      TEXT RECOGNITION (CRNN)                                 │
+│  File: vision/ocr.py, TextRecognizer.__call__() (lines 363-408)             │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                    │
+           ┌────────────────────────┼────────────────────────┐
+           │                        │                        │
+           ▼                        ▼                        ▼
+    ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
+    │    Crop     │          │    ONNX     │          │ CTC Decode  │
+    │   Rotate    │          │  Inference  │          │             │
+    │             │    →     │             │    →     │ • Argmax    │
+    │ Perspective │          │   CRNN      │          │ • Dedup     │
+    │ Transform   │          │   Model     │          │ • Remove ε  │
+    └─────────────┘          └─────────────┘          └─────────────┘
+                                    │
+                                    ▼
+                    Output: [(box, (text, confidence)), ...]
+```
+
+### 2.3 Luồng Layout Recognition
+
+```
+                           Input: Page Image + OCR Results
+                                        │
+                                        ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                     LAYOUT DETECTION (YOLOv10)                               │
+│  File: vision/layout_recognizer.py (lines 163-237)                          │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                        │
+           ┌────────────────────────────┼────────────────────────────┐
+           │                            │                            │
+           ▼                            ▼                            ▼
+    ┌─────────────┐              ┌─────────────┐              ┌─────────────┐
+    │ Preprocess  │              │    ONNX     │              │ Postprocess │
+    │             │              │  Inference  │              │             │
+    │ • Resize    │      →       │             │      →       │ • NMS       │
+    │   (640x640) │              │  YOLOv10    │              │ • Filter    │
+    │ • Pad       │              │   Model     │              │ • Scale     │
+    │ • Normalize │              │             │              │   back      │
+    └─────────────┘              └─────────────┘              └─────────────┘
+                                        │
+                                        ▼
+                              Layout Detections:
+                              [{"type": "Table", "bbox": [...], "score": 0.95}]
+                                        │
+                                        ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                     OCR-LAYOUT ASSOCIATION                                   │
+│  File: vision/layout_recognizer.py (lines 98-147)                           │
+│                                                                              │
+│  For each OCR box:                                                          │
+│    • Find overlapping layout region (threshold: 40%)                        │
+│    • Assign layout_type to OCR box                                          │
+│    • Filter garbage (headers/footers/page numbers)                          │
+│                                                                              │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                        │
+                                        ▼
+                    Output: OCR boxes with layout_type attribute
+                    [{"text": "...", "layout_type": "Text", "layoutno": 1}]
+```
+
+### 2.4 Data Flow Summary
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│  PDF File   │ →   │   Images    │ →   │ OCR Boxes   │ →   │  Merged     │
+│             │     │ + Chars     │     │ + Layout    │     │  Documents  │
+└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
+                                              │
+                                              ▼
+                                        ┌─────────────┐
+                                        │   Tables    │
+                                        │  (HTML/Desc)│
+                                        └─────────────┘
+
+Input Format:
+- File path: str (e.g., "/path/to/doc.pdf")
+- Or bytes: bytes (raw PDF content)
+
+Output Format:
+- documents: List[Tuple[str, str]]
+  - text: Extracted text content
+  - position_tag: "page_0_x0_100_y0_200_x1_500_y1_250"
+
+- tables: List[Dict]
+  - html: "<table>...</table>"
+  - bbox: [x0, y0, x1, y1]
+  - image: numpy array (optional)
+```
+
+---
+
+## 3. Phân Tích Chi Tiết Code
+
+### 3.1 RAGFlowPdfParser Class
+
+**File**: `/deepdoc/parser/pdf_parser.py`
+**Lines**: 52-1479
+
+#### 3.1.1 Constructor (__init__)
+
+```python
+# Line 52-104
+class RAGFlowPdfParser:
+    def __init__(self, **kwargs):
+        # Load OCR model
+        self.ocr = OCR()  # vision/ocr.py
+
+        # Load Layout Recognizer (YOLOv10)
+        self.layout_recognizer = LayoutRecognizer()  # vision/layout_recognizer.py
+
+        # Load Table Structure Recognizer
+        self.tsr = TableStructureRecognizer()  # vision/table_structure_recognizer.py
+
+        # Load XGBoost model for text concatenation
+        try:
+            self.updown_cnt_mdl = xgb.Booster()
+            model_path = os.path.join(get_project_base_directory(),
+                                      "rag/res/deepdoc/updown_concat_xgb.model")
+            self.updown_cnt_mdl.load_model(model_path)
+        except Exception as e:
+            self.updown_cnt_mdl = None
+```
+
+**Giải thích**:
+- Constructor khởi tạo 4 models:
+  1. **OCR**: Text detection + recognition
+  2. **LayoutRecognizer**: Phân loại vùng layout (YOLOv10)
+  3. **TableStructureRecognizer**: Nhận dạng cấu trúc bảng
+  4. **XGBoost**: Quyết định merge text blocks (31 features)
+
+#### 3.1.2 Main Entry Point (__call__)
+
+```python
+# Lines 1160-1168
+def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
+    """
+    Main entry point for PDF parsing.
+
+    Args:
+        fnm: File path or bytes
+        need_image: Whether to extract images
+        zoomin: Zoom factor for OCR (default 3x)
+        return_html: Return HTML tables instead of descriptive text
+
+    Returns:
+        (documents, tables) tuple
+    """
+    self.__images__(fnm, zoomin)           # Step 1: Load images
+    self._layouts_rec(zoomin)              # Step 2-3: OCR + Layout
+    self._table_transformer_job(zoomin)    # Step 4: Table structure
+    self._text_merge(zoomin)               # Step 5: Merge text
+    self._filter_forpages()                # Step 6: Filter
+    tbls = self._extract_table_figure(...) # Step 7: Extract tables
+    return self._final_result(), tbls      # Final output
+```
+
+**Tại sao zoomin=3?**
+- OCR accuracy tăng đáng kể khi image lớn hơn
+- 3x là balance giữa accuracy và memory/speed
+- Quá lớn (5x+) → memory issues, quá nhỏ (1x) → OCR errors
+
+#### 3.1.3 Image Loading (__images__)
+
+```python
+# Lines 1042-1159
+def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
+    """
+    Load PDF pages as images and extract native characters.
+    """
+    self.page_images = []
+    self.page_chars = []
+
+    # Open PDF with pdfplumber
+    with pdfplumber.open(fnm) as pdf:
+        for i, page in enumerate(pdf.pages[page_from:page_to]):
+            # Convert page to image
+            img = page.to_image(resolution=72 * zoomin)
+            img = np.array(img.original)
+            self.page_images.append(img)
+
+            # Extract native PDF characters
+            chars = page.chars
+            self.page_chars.append(chars)
+```
+
+**Tại sao dùng pdfplumber?**
+- Hỗ trợ cả text extraction và image conversion
+- Giữ được character-level coordinates
+- Xử lý tốt các PDF phức tạp
+
+#### 3.1.4 Column Detection (_assign_column)
+
+```python
+# Lines 355-440
+def _assign_column(self, boxes, zoomin=3):
+    """
+    Detect columns using K-Means clustering on X coordinates.
+    """
+    from sklearn.cluster import KMeans
+    from sklearn.metrics import silhouette_score
+
+    # Extract X coordinates
+    x_coords = np.array([[b["x0"]] for b in boxes])
+
+    best_k = 1
+    best_score = -1
+
+    # Try k from 1 to 4
+    for k in range(1, min(5, len(boxes))):
+        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
+        labels = km.fit_predict(x_coords)
+
+        if k > 1:
+            score = silhouette_score(x_coords, labels)
+            if score > best_score:
+                best_score = score
+                best_k = k
+
+    # Final clustering with best k
+    km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
+    labels = km.fit_predict(x_coords)
+
+    # Assign column IDs
+    for i, box in enumerate(boxes):
+        box["col_id"] = labels[i]
+```
+
+**Tại sao K-Means?**
+- Unsupervised: không cần training data
+- Fast: O(n * k * iterations)
+- Silhouette score tự động chọn số cột
+
+### 3.2 OCR Class
+
+**File**: `/deepdoc/vision/ocr.py`
+**Lines**: 536-752
+
+#### 3.2.1 Text Detection (TextDetector)
+
+```python
+# Lines 414-534
+class TextDetector:
+    def __init__(self, model_dir, device_id=None):
+        # Preprocessing pipeline
+        self.preprocess_op = [
+            DetResizeForTest(limit_side_len=960, limit_type='max'),
+            NormalizeImage(mean=[0.485, 0.456, 0.406],
+                          std=[0.229, 0.224, 0.225]),
+            ToCHWImage(),
+            KeepKeys(keep_keys=['image', 'shape'])
+        ]
+
+        # Postprocessing
+        self.postprocess_op = DBPostProcess(
+            thresh=0.3,           # Binary threshold
+            box_thresh=0.5,       # Box confidence threshold
+            max_candidates=1000,  # Max text regions
+            unclip_ratio=1.5      # Box expansion ratio
+        )
+
+        # Load ONNX model
+        self.ort_sess, self.run_opts = load_model(model_dir, "det", device_id)
+```
+
+**DBNet (Differentiable Binarization)**:
+- Input: Image → Probability map (text regions)
+- Thresholding: prob > 0.3 → foreground
+- Unclipping: Expand boxes by 1.5x để capture full text
+
+#### 3.2.2 Text Recognition (TextRecognizer)
+
+```python
+# Lines 133-412
+class TextRecognizer:
+    def __init__(self, model_dir, device_id=None):
+        self.rec_image_shape = [3, 48, 320]  # C, H, W
+        self.batch_size = 16
+
+        # Load CRNN model
+        self.ort_sess, self.run_opts = load_model(model_dir, "rec", device_id)
+
+        # CTC decoder
+        self.postprocess_op = CTCLabelDecode(character_dict_path=dict_path)
+
+    def __call__(self, img_list):
+        # Sort by aspect ratio for efficient batching
+        indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list])
+
+        results = []
+        for batch in chunks(indices, self.batch_size):
+            # Normalize images
+            norm_imgs = [self.resize_norm_img(img_list[i]) for i in batch]
+
+            # Run inference
+            preds = self.ort_sess.run(None, {"input": np.stack(norm_imgs)})
+
+            # CTC decode
+            texts = self.postprocess_op(preds[0])
+            results.extend(texts)
+
+        return results
+```
+
+**CRNN + CTC**:
+- CNN: Extract visual features
+- RNN: Sequence modeling
+- CTC: Alignment-free decoding (handles variable-length text)
+
+#### 3.2.3 Rotation Handling
+
+```python
+# Lines 584-638
+def get_rotate_crop_image(self, img, points):
+    """
+    Crop text region with auto-rotation detection.
+    """
+    # Get perspective transform
+    rect = self.order_points_clockwise(points)
+    M = cv2.getPerspectiveTransform(rect, dst_pts)
+    warped = cv2.warpPerspective(img, M, (width, height))
+
+    # Check if text is vertical (height > 1.5 * width)
+    if warped.shape[0] / warped.shape[1] >= 1.5:
+        # Try 3 orientations
+        scores = []
+        for angle in [0, 90, -90]:
+            rotated = self.rotate(warped, angle)
+            _, conf = self.recognizer([rotated])[0]
+            scores.append(conf)
+
+        # Use orientation with highest confidence
+        best_angle = [0, 90, -90][np.argmax(scores)]
+        warped = self.rotate(warped, best_angle)
+
+    return warped
+```
+
+**Tại sao cần auto-rotation?**
+- PDF có thể chứa text xoay 90°
+- OCR model trained on horizontal text
+- Auto-detect giúp nhận dạng text dọc chính xác
+
+### 3.3 Layout Recognizer
+
+**File**: `/deepdoc/vision/layout_recognizer.py`
+**Lines**: 33-237
+
+#### 3.3.1 YOLOv10 Preprocessing
+
+```python
+# Lines 186-209
+def preprocess(self, image_list):
+    """
+    Preprocess images for YOLOv10 inference.
+    """
+    processed = []
+    for img in image_list:
+        h, w = img.shape[:2]
+
+        # Calculate scale (preserve aspect ratio)
+        r = min(640/h, 640/w)
+        new_h, new_w = int(h*r), int(w*r)
+
+        # Resize
+        resized = cv2.resize(img, (new_w, new_h))
+
+        # Pad to 640x640 (center padding, gray color)
+        padded = np.full((640, 640, 3), 114, dtype=np.uint8)
+        pad_top = (640 - new_h) // 2
+        pad_left = (640 - new_w) // 2
+        padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized
+
+        # Normalize and transpose
+        padded = padded.astype(np.float32) / 255.0
+        padded = padded.transpose(2, 0, 1)  # HWC → CHW
+
+        processed.append(padded)
+
+    return np.stack(processed)
+```
+
+**Tại sao 640x640?**
+- YOLOv10 standard input size
+- Balance accuracy vs speed
+- 32-stride alignment (640 = 20 * 32)
+
+#### 3.3.2 Layout Types
+
+```python
+# Lines 34-46
+labels = [
+    "_background_",    # 0: Background (ignored)
+    "Text",            # 1: Body text paragraphs
+    "Title",           # 2: Section/document titles
+    "Figure",          # 3: Images, diagrams, charts
+    "Figure caption",  # 4: Text describing figures
+    "Table",           # 5: Data tables
+    "Table caption",   # 6: Text describing tables
+    "Header",          # 7: Page headers
+    "Footer",          # 8: Page footers
+    "Reference",       # 9: Bibliography, citations
+    "Equation",        # 10: Mathematical equations
+]
+```
+
+### 3.4 Table Structure Recognizer
+
+**File**: `/deepdoc/vision/table_structure_recognizer.py`
+**Lines**: 30-613
+
+#### 3.4.1 Table Grid Construction
+
+```python
+# Lines 172-349
+@staticmethod
+def construct_table(boxes, is_english=False, html=True, **kwargs):
+    """
+    Construct 2D table from detected components.
+    """
+    # Step 1: Sort by row
+    boxes = Recognizer.sort_R_firstly(boxes, rowh/2)
+
+    # Step 2: Group into rows
+    rows = []
+    current_row = [boxes[0]]
+    for box in boxes[1:]:
+        if box["top"] - current_row[-1]["bottom"] > rowh/2:
+            rows.append(current_row)
+            current_row = [box]
+        else:
+            current_row.append(box)
+    rows.append(current_row)
+
+    # Step 3: Sort each row by column
+    for row in rows:
+        row.sort(key=lambda x: x["x0"])
+
+    # Step 4: Build 2D matrix
+    n_cols = max(len(row) for row in rows)
+    table = [[None] * n_cols for _ in range(len(rows))]
+
+    for i, row in enumerate(rows):
+        for j, cell in enumerate(row):
+            table[i][j] = cell["text"]
+
+    # Step 5: Generate output
+    if html:
+        return generate_html_table(table)
+    else:
+        return generate_descriptive_text(table)
+```
+
+#### 3.4.2 Spanning Cell Handling
+
+```python
+# Lines 496-575
+def __cal_spans(self, boxes):
+    """
+    Calculate colspan and rowspan for merged cells.
+    """
+    for box in boxes:
+        if "SP" not in box:  # Not a spanning cell
+            continue
+
+        # Find which rows this cell spans
+        box["rowspan"] = []
+        for i, row_box in enumerate(self.rows):
+            if self.overlapped_area(box, row_box) > 0.3:
+                box["rowspan"].append(i)
+
+        # Find which columns this cell spans
+        box["colspan"] = []
+        for j, col_box in enumerate(self.cols):
+            if self.overlapped_area(box, col_box) > 0.3:
+                box["colspan"].append(j)
+```
+
+---
+
+## 4. Giải Thích Kỹ Thuật
+
+### 4.1 ONNX Runtime
+
+**ONNX là gì?**
+- Open Neural Network Exchange
+- Format chuẩn cho deep learning models
+- Chạy trên nhiều hardware (CPU, GPU, NPU)
+
+**Tại sao dùng ONNX?**
+```python
+# Không cần PyTorch/TensorFlow runtime
+# Lightweight inference
+import onnxruntime as ort
+
+session = ort.InferenceSession("model.onnx")
+output = session.run(None, {"input": input_data})
+```
+
+**Cấu hình trong DeepDoc**:
+```python
+# vision/ocr.py, lines 96-127
+options = ort.SessionOptions()
+options.enable_cpu_mem_arena = False     # Giảm memory fragmentation
+options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+options.intra_op_num_threads = 2         # Threads per operator
+options.inter_op_num_threads = 2         # Parallel operators
+
+# GPU configuration
+if torch.cuda.is_available():
+    providers = [
+        ('CUDAExecutionProvider', {
+            'device_id': device_id,
+            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
+        })
+    ]
+```
+
+### 4.2 CTC Decoding
+
+**CTC (Connectionist Temporal Classification)**:
+- Giải quyết alignment problem trong sequence-to-sequence
+- Không cần biết vị trí chính xác của từng ký tự
+
+**Ví dụ**:
+```
+OCR Model Output (time steps):
+[a, a, a, -, l, l, -, p, p, h, h, a, -]
+
+CTC Decoding:
+1. Merge consecutive duplicates: [a, -, l, -, p, h, a, -]
+2. Remove blank tokens (-): [a, l, p, h, a]
+3. Result: "alpha"
+```
+
+**Implementation**:
+```python
+# vision/postprocess.py, lines 355-366
+def __call__(self, preds, label=None):
+    # Get most probable character at each position
+    preds_idx = preds.argmax(axis=2)  # Shape: (batch, time)
+    preds_prob = preds.max(axis=2)     # Confidence scores
+
+    # Decode with deduplication
+    text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
+
+    return text
+```
+
+### 4.3 Non-Maximum Suppression (NMS)
+
+**NMS là gì?**
+- Loại bỏ duplicate detections
+- Giữ lại box có confidence cao nhất
+
+**Algorithm**:
+```
+1. Sort boxes by confidence (descending)
+2. Pick box with highest score → add to results
+3. Remove boxes with IoU > threshold (e.g., 0.5)
+4. Repeat until no boxes remain
+```
+
+**Implementation**:
+```python
+# vision/operators.py, lines 702-725
+def nms(bboxes, scores, iou_thresh):
+    indices = []
+    index = scores.argsort()[::-1]  # Sort descending
+
+    while index.size > 0:
+        i = index[0]
+        indices.append(i)
+
+        # Compute IoU with remaining boxes
+        ious = compute_iou(bboxes[i], bboxes[index[1:]])
+
+        # Keep only boxes with IoU <= threshold
+        mask = ious <= iou_thresh
+        index = index[1:][mask]
+
+    return indices
+```
+
+### 4.4 DBNet (Differentiable Binarization)
+
+**DBNet là gì?**
+- Text detection network
+- Tạo probability map + threshold map
+- Differentiable binarization cho end-to-end training
+
+**Pipeline**:
+```
+Image → CNN Backbone → Feature Map →
+                                    ├→ Probability Map (text regions)
+                                    └→ Threshold Map (adaptive threshold)
+
+Final = Probability > Threshold (pixel-wise)
+```
+
+**Post-processing**:
+```python
+# vision/postprocess.py, DBPostProcess
+def __call__(self, outs_dict, shape_list):
+    pred = outs_dict["maps"]
+
+    # Binary thresholding
+    bitmap = pred > self.thresh  # 0.3
+
+    # Find contours
+    contours = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
+
+    # Unclip (expand) boxes
+    for contour in contours:
+        box = self.unclip(contour, self.unclip_ratio)  # 1.5x expansion
+        boxes.append(box)
+```
+
+### 4.5 K-Means cho Column Detection
+
+**Tại sao K-Means?**
+- Text boxes trong cùng cột có X coordinate tương tự
+- K-Means cluster các X values
+- Silhouette score chọn số cột tối ưu
+
+**Silhouette Score**:
+```
+s(i) = (b(i) - a(i)) / max(a(i), b(i))
+
+- a(i): Average distance to same cluster
+- b(i): Average distance to nearest other cluster
+- Range: [-1, 1], higher = better clustering
+```
+
+**Ví dụ**:
+```
+Page with 2 columns:
+Left column boxes: x0 = [50, 52, 48, 51, ...]
+Right column boxes: x0 = [400, 398, 402, 399, ...]
+
+K-Means (k=2):
+- Cluster 0: x0 ≈ 50 (left column)
+- Cluster 1: x0 ≈ 400 (right column)
+
+Silhouette score ≈ 0.95 (high, good separation)
+```
+
+---
+
+## 5. Lý Do Thiết Kế
+
+### 5.1 Tại Sao Dùng Multiple Models?
+
+**Vấn đề**: Một model không thể handle tất cả tasks
+
+| Task | Model Type | Lý Do |
+|------|------------|-------|
+| Text Detection | DBNet | Specialized cho text regions |
+| Text Recognition | CRNN | Sequential text với CTC |
+| Layout Detection | YOLOv10 | Object detection tốt nhất |
+| Table Structure | YOLOv10 variant | Fine-tuned cho table elements |
+
+**Trade-off**:
+- Pros: Mỗi model optimized cho task riêng
+- Cons: Nhiều models → nhiều memory, complexity
+
+### 5.2 Tại Sao Dùng XGBoost cho Text Merging?
+
+**Vấn đề**: Merge text blocks là decision phức tạp
+
+**Rule-based approach** (naive):
+```python
+# Simple heuristics
+if y_distance < threshold and same_column:
+    merge()
+# ❌ Không handle edge cases tốt
+```
+
+**ML approach** (XGBoost):
+```python
+# 31 features capturing various signals
+features = [
+    y_distance / char_height,      # Distance feature
+    ends_with_punctuation,          # Text pattern
+    same_layout_type,               # Layout feature
+    font_size_ratio,                # Typography
+    ...
+]
+# ✅ Learns complex patterns from data
+```
+
+**Tại sao XGBoost?**
+- Fast inference (tree-based)
+- Handles mixed feature types well
+- Pre-trained model included
+
+### 5.3 Tại Sao ONNX thay vì PyTorch/TensorFlow?
+
+| Aspect | ONNX Runtime | PyTorch |
+|--------|--------------|---------|
+| Size | ~50MB | ~500MB+ |
+| Memory | Lower | Higher |
+| Startup | Fast | Slow (JIT) |
+| Dependencies | Minimal | Many |
+| Multi-platform | Yes | Limited |
+
+**DeepDoc choice**: ONNX cho production deployment
+- Không cần PyTorch runtime
+- Lighter memory footprint
+- Faster cold start
+
+### 5.4 Tại Sao Zoomin = 3?
+
+**Experiment results**:
+```
+zoomin=1: OCR accuracy ~70%, fast
+zoomin=2: OCR accuracy ~85%, moderate
+zoomin=3: OCR accuracy ~95%, acceptable speed ← chosen
+zoomin=4: OCR accuracy ~97%, slow
+zoomin=5: OCR accuracy ~98%, very slow, memory issues
+```
+
+**Balance**: 3x là sweet spot giữa accuracy và resource usage
+
+### 5.5 Tại Sao Hybrid Text Extraction?
+
+**Native PDF text** (pdfplumber):
+- Pros: Accurate, fast, preserves fonts
+- Cons: Không có cho scanned PDFs
+
+**OCR text**:
+- Pros: Works on any image
+- Cons: Slower, potential errors
+
+**Hybrid approach**:
+```python
+# Prefer native text, fallback to OCR
+for box in ocr_boxes:
+    # Try to match with native characters
+    matched_chars = find_overlapping_chars(box, native_chars)
+
+    if matched_chars:
+        box["text"] = "".join(matched_chars)  # Use native
+    else:
+        box["text"] = ocr_result  # Use OCR
+```
+
+### 5.6 Pipeline vs End-to-End Model
+
+**End-to-End** (e.g., Donut, Pix2Struct):
+- Single model: Image → Structured output
+- Pros: Simple, unified
+- Cons: Less accurate on specific tasks, hard to debug
+
+**Pipeline** (DeepDoc's choice):
+- Multiple specialized models
+- Pros:
+  - Each model optimized for task
+  - Easy to debug/improve individual components
+  - Mix and match different models
+- Cons:
+  - More complexity
+  - Potential error accumulation
+
+**DeepDoc's rationale**: Pipeline cho flexibility và accuracy
+
+---
+
+## 6. Thuật Ngữ Khó
+
+### 6.1 Computer Vision Terms
+
+| Term | Definition | Ví Dụ trong DeepDoc |
+|------|------------|---------------------|
+| **Bounding Box** | Hình chữ nhật bao quanh object | `[x0, y0, x1, y1]` coordinates |
+| **IoU** | Intersection over Union - đo overlap | NMS threshold 0.5 |
+| **NMS** | Non-Maximum Suppression | Loại duplicate detections |
+| **Anchor** | Predefined box sizes | YOLOv10 anchors |
+| **Stride** | Downsampling factor | 32 trong YOLOv10 |
+| **FPN** | Feature Pyramid Network | Multi-scale detection |
+
+### 6.2 OCR Terms
+
+| Term | Definition | Ví Dụ trong DeepDoc |
+|------|------------|---------------------|
+| **CTC** | Connectionist Temporal Classification | CRNN output decoding |
+| **CRNN** | CNN + RNN | Text recognition model |
+| **DBNet** | Differentiable Binarization | Text detection model |
+| **Unclip** | Expand polygon boundary | 1.5x expansion ratio |
+
+### 6.3 ML Terms
+
+| Term | Definition | Ví Dụ trong DeepDoc |
+|------|------------|---------------------|
+| **ONNX** | Open Neural Network Exchange | Model format |
+| **Inference** | Running model on input | `session.run()` |
+| **Batch** | Multiple inputs processed together | batch_size=16 |
+| **Confidence** | Model's certainty score | 0.0 - 1.0 |
+
+### 6.4 Document Processing Terms
+
+| Term | Definition | Ví Dụ trong DeepDoc |
+|------|------------|---------------------|
+| **Layout** | Document structure | Text, Table, Figure |
+| **TSR** | Table Structure Recognition | Row, Column detection |
+| **Spanning Cell** | Merged table cell | colspan, rowspan |
+| **Reading Order** | Text flow sequence | Top-to-bottom, left-to-right |
+
+---
+
+## 7. Mở Rộng Từ Code
+
+### 7.1 Thêm Parser Mới
+
+**Ví dụ**: Add RTF parser
+
+```python
+# deepdoc/parser/rtf_parser.py
+from striprtf.striprtf import rtf_to_text
+
+class RAGFlowRtfParser:
+    def __call__(self, fnm, binary=None, chunk_token_num=128):
+        if binary:
+            content = binary.decode('utf-8')
+        else:
+            with open(fnm, 'r') as f:
+                content = f.read()
+
+        text = rtf_to_text(content)
+
+        # Chunk text
+        chunks = self._chunk(text, chunk_token_num)
+
+        return [(chunk, f"rtf_chunk_{i}") for i, chunk in enumerate(chunks)]
+```
+
+### 7.2 Thêm Layout Type Mới
+
+**Ví dụ**: Add "Code Block" layout
+
+```python
+# vision/layout_recognizer.py
+labels = [
+    "_background_",
+    "Text",
+    "Title",
+    ...
+    "Code Block",  # New label (index 11)
+]
+
+# Train new YOLOv10 model with "Code Block" annotations
+# Update model file
+```
+
+### 7.3 Custom Text Merging Logic
+
+```python
+# Override default merging behavior
+class CustomPdfParser(RAGFlowPdfParser):
+    def _should_merge(self, box1, box2):
+        """Custom merge logic"""
+        # Don't merge code blocks
+        if box1.get("layout_type") == "Code Block":
+            return False
+
+        # Use default logic otherwise
+        return super()._should_merge(box1, box2)
+```
+
+### 7.4 Thêm Output Format
+
+```python
+# Add Markdown output format
+def to_markdown(self, documents, tables):
+    md_parts = []
+
+    for text, pos_tag in documents:
+        # Detect if title
+        if self._is_title(text):
+            md_parts.append(f"## {text}\n")
+        else:
+            md_parts.append(f"{text}\n\n")
+
+    # Convert tables to markdown
+    for table in tables:
+        md_table = html_to_markdown(table["html"])
+        md_parts.append(md_table)
+
+    return "\n".join(md_parts)
+```
+
+### 7.5 Optimize Performance
+
+**GPU Batching**:
+```python
+# Process multiple pages in parallel
+def _parallel_ocr(self, images, batch_size=4):
+    with ThreadPoolExecutor(max_workers=4) as executor:
+        futures = []
+        for batch in chunks(images, batch_size):
+            future = executor.submit(self.ocr, batch)
+            futures.append(future)
+
+        results = [f.result() for f in futures]
+    return results
+```
+
+**Caching**:
+```python
+# Cache model instances
+_model_cache = {}
+
+def get_ocr_model(model_dir, device_id):
+    key = f"{model_dir}_{device_id}"
+    if key not in _model_cache:
+        _model_cache[key] = OCR(model_dir, device_id)
+    return _model_cache[key]
+```
+
+### 7.6 Integration với RAG Pipeline
+
+```python
+# rag/app/pdf.py (example integration)
+from deepdoc.parser import RAGFlowPdfParser
+
+def process_pdf_for_rag(file_path, chunk_size=512):
+    parser = RAGFlowPdfParser()
+
+    # Parse PDF
+    documents, tables = parser(file_path)
+
+    # Chunk documents
+    chunks = []
+    for text, pos_tag in documents:
+        for chunk in chunk_text(text, chunk_size):
+            chunks.append({
+                "text": chunk,
+                "metadata": {"position": pos_tag}
+            })
+
+    # Add tables as separate chunks
+    for table in tables:
+        chunks.append({
+            "text": table["html"],
+            "metadata": {"type": "table", "bbox": table["bbox"]}
+        })
+
+    return chunks
+```
+
+---
+
+## 8. Tổng Kết
+
+### 8.1 Key Takeaways
+
+1. **DeepDoc = Parser Layer + Vision Layer**
+   - Parser: Format-specific handling (PDF, DOCX, etc.)
+   - Vision: OCR + Layout + Table recognition
+
+2. **Pipeline Architecture**
+   - Multiple specialized models
+   - Easy to debug and improve
+
+3. **ONNX Runtime**
+   - Lightweight inference
+   - Cross-platform compatibility
+
+4. **Hybrid Text Extraction**
+   - Native PDF text khi available
+   - OCR fallback cho scanned documents
+
+### 8.2 Diagram Tổng Hợp
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│                            DEEPDOC SUMMARY                                    │
+├──────────────────────────────────────────────────────────────────────────────┤
+│                                                                               │
+│  INPUT                   PROCESSING                         OUTPUT            │
+│  ─────                   ──────────                         ──────            │
+│                                                                               │
+│  ┌─────────┐     ┌────────────────────────────┐      ┌─────────────────┐    │
+│  │  PDF    │────▶│  1. Image Extraction       │─────▶│  Documents      │    │
+│  │  DOCX   │     │  2. OCR (DBNet + CRNN)     │      │  [(text, pos)]  │    │
+│  │  Excel  │     │  3. Layout (YOLOv10)       │      │                 │    │
+│  │  HTML   │     │  4. Column Detection       │      │  Tables         │    │
+│  │  ...    │     │  5. Table Structure        │      │  [html, bbox]   │    │
+│  └─────────┘     │  6. Text Merging           │      │                 │    │
+│                  │  7. Quality Filtering      │      │  Figures        │    │
+│                  └────────────────────────────┘      │  [image, cap]   │    │
+│                                                       └─────────────────┘    │
+│                                                                               │
+│  MODELS USED:                                                                 │
+│  ────────────                                                                 │
+│  • DBNet (Text Detection)          - ONNX, ~30MB                             │
+│  • CRNN (Text Recognition)         - ONNX, ~20MB                             │
+│  • YOLOv10 (Layout Detection)      - ONNX, ~50MB                             │
+│  • YOLOv10 (Table Structure)       - ONNX, ~50MB                             │
+│  • XGBoost (Text Merging)          - Binary, ~5MB                            │
+│                                                                               │
+│  KEY ALGORITHMS:                                                              │
+│  ───────────────                                                              │
+│  • CTC Decoding (text recognition)                                           │
+│  • NMS (duplicate removal)                                                   │
+│  • K-Means (column detection)                                                │
+│  • IoU (overlap calculation)                                                 │
+│                                                                               │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+### 8.3 Files Reference
+
+| File | Lines | Description |
+|------|-------|-------------|
+| `parser/pdf_parser.py` | 1479 | Main PDF parser |
+| `vision/ocr.py` | 752 | OCR detection + recognition |
+| `vision/layout_recognizer.py` | 457 | Layout detection |
+| `vision/table_structure_recognizer.py` | 613 | Table structure |
+| `vision/recognizer.py` | 443 | Base recognizer class |
+| `vision/operators.py` | 726 | Image preprocessing |
+| `vision/postprocess.py` | 371 | Post-processing utilities |
+
+---
+
+*Document created for RAGFlow v0.22.1 analysis*
diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md
new file mode 100644
index 000000000..063acd4be
--- /dev/null
+++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md
@@ -0,0 +1,926 @@
+# Layout & Table Recognition Deep Dive
+
+## Tổng Quan
+
+Sau khi OCR extract được text boxes, DeepDoc cần:
+1. **Layout Recognition**: Phân loại vùng (Text, Title, Table, Figure...)
+2. **Table Structure Recognition**: Nhận dạng cấu trúc bảng (rows, columns, cells)
+
+## File Structure
+
+```
+deepdoc/vision/
+├── layout_recognizer.py              # Layout detection (457 lines)
+├── table_structure_recognizer.py     # Table structure (613 lines)
+└── recognizer.py                     # Base class (443 lines)
+```
+
+---
+
+## 1. Layout Recognition (YOLOv10)
+
+### 1.1 Layout Categories
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 34-46
+
+labels = [
+    "_background_",     # 0: Background (ignored)
+    "Text",             # 1: Body text paragraphs
+    "Title",            # 2: Section/document titles
+    "Figure",           # 3: Images, diagrams, charts
+    "Figure caption",   # 4: Text describing figures
+    "Table",            # 5: Data tables
+    "Table caption",    # 6: Text describing tables
+    "Header",           # 7: Page headers
+    "Footer",           # 8: Page footers
+    "Reference",        # 9: Bibliography, citations
+    "Equation",         # 10: Mathematical equations
+]
+```
+
+### 1.2 YOLOv10 Architecture
+
+```
+YOLOv10 for Document Layout:
+
+Input Image (640, 640, 3)
+         │
+         ▼
+┌─────────────────────────────────────┐
+│        CSPDarknet Backbone          │
+│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐│
+│  │ P1  │→ │ P2  │→ │ P3  │→ │ P4  ││
+│  │/2   │  │/4   │  │/8   │  │/16  ││
+│  └─────┘  └─────┘  └─────┘  └─────┘│
+└─────────────────────────────────────┘
+         │      │      │      │
+         ▼      ▼      ▼      ▼
+┌─────────────────────────────────────┐
+│           PANet Neck                 │
+│  FPN (top-down) + PAN (bottom-up)   │
+│  Multi-scale feature fusion         │
+└─────────────────────────────────────┘
+         │
+         ▼
+┌─────────────────────────────────────┐
+│       Detection Heads (3 scales)     │
+│  Small (80x80) → tiny objects       │
+│  Medium (40x40) → normal objects    │
+│  Large (20x20) → big objects        │
+└─────────────────────────────────────┘
+         │
+         ▼
+    Raw Predictions:
+    [x_center, y_center, width, height, confidence, class_probs...]
+```
+
+### 1.3 Preprocessing (LayoutRecognizer4YOLOv10)
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 186-209
+
+def preprocess(self, image_list):
+    """
+    Preprocess images for YOLOv10.
+
+    Key steps:
+    1. Resize maintaining aspect ratio
+    2. Pad to 640x640 (gray borders)
+    3. Normalize [0,255] → [0,1]
+    4. Transpose HWC → CHW
+    """
+    processed = []
+    scale_factors = []
+
+    for img in image_list:
+        h, w = img.shape[:2]
+
+        # Calculate scale (preserve aspect ratio)
+        r = min(640/h, 640/w)
+        new_h, new_w = int(h*r), int(w*r)
+
+        # Resize
+        resized = cv2.resize(img, (new_w, new_h))
+
+        # Calculate padding
+        pad_top = (640 - new_h) // 2
+        pad_left = (640 - new_w) // 2
+
+        # Pad to 640x640 (gray: 114)
+        padded = np.full((640, 640, 3), 114, dtype=np.uint8)
+        padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized
+
+        # Normalize and transpose
+        padded = padded.astype(np.float32) / 255.0
+        padded = padded.transpose(2, 0, 1)  # HWC → CHW
+
+        processed.append(padded)
+        scale_factors.append([1/r, 1/r, pad_left, pad_top])
+
+    return np.stack(processed), scale_factors
+```
+
+**Visualization**:
+```
+Original image (1000x800):
+┌────────────────────────────────────────┐
+│                                        │
+│         Document Content               │
+│                                        │
+└────────────────────────────────────────┘
+
+After resize (scale=0.64) to (640x512):
+┌────────────────────────────────────────┐
+│                                        │
+│         Document Content               │
+│                                        │
+└────────────────────────────────────────┘
+
+After padding to (640x640):
+┌────────────────────────────────────────┐
+│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
+├────────────────────────────────────────┤
+│                                        │
+│         Document Content               │
+│                                        │
+├────────────────────────────────────────┤
+│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
+└────────────────────────────────────────┘
+```
+
+### 1.4 NMS Postprocessing
+
+```python
+# deepdoc/vision/recognizer.py, lines 330-407
+
+def postprocess(self, boxes, inputs, thr):
+    """
+    YOLOv10 postprocessing with per-class NMS.
+    """
+    results = []
+
+    for batch_idx, batch_boxes in enumerate(boxes):
+        scale_factor = inputs["scale_factor"][batch_idx]
+
+        # Filter by confidence threshold
+        mask = batch_boxes[:, 4] > thr  # confidence > 0.2
+        filtered = batch_boxes[mask]
+
+        if len(filtered) == 0:
+            results.append([])
+            continue
+
+        # Convert xywh → xyxy
+        xyxy = self.xywh2xyxy(filtered[:, :4])
+
+        # Remove padding offset
+        xyxy[:, [0, 2]] -= scale_factor[2]  # pad_left
+        xyxy[:, [1, 3]] -= scale_factor[3]  # pad_top
+
+        # Scale back to original size
+        xyxy[:, [0, 2]] *= scale_factor[0]  # scale_x
+        xyxy[:, [1, 3]] *= scale_factor[1]  # scale_y
+
+        # Per-class NMS
+        class_ids = filtered[:, 5].astype(int)
+        scores = filtered[:, 4]
+
+        keep_indices = []
+        for cls in np.unique(class_ids):
+            cls_mask = class_ids == cls
+            cls_boxes = xyxy[cls_mask]
+            cls_scores = scores[cls_mask]
+
+            # NMS within class
+            keep = self.iou_filter(cls_boxes, cls_scores, iou_thresh=0.45)
+            keep_indices.extend(np.where(cls_mask)[0][keep])
+
+        # Build result
+        batch_results = []
+        for idx in keep_indices:
+            batch_results.append({
+                "type": self.labels[int(filtered[idx, 5])],
+                "bbox": xyxy[idx].tolist(),
+                "score": float(filtered[idx, 4])
+            })
+
+        results.append(batch_results)
+
+    return results
+```
+
+### 1.5 OCR-Layout Association
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 98-147
+
+def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
+    """
+    Detect layouts and associate with OCR results.
+    """
+    # Step 1: Run layout detection
+    page_layouts = super().__call__(image_list, thr, batch_size)
+
+    # Step 2: Clean up overlapping layouts
+    for i, layouts in enumerate(page_layouts):
+        page_layouts[i] = self.layouts_cleanup(layouts, thr=0.7)
+
+    # Step 3: Associate OCR boxes with layouts
+    for page_idx, (ocr_boxes, layouts) in enumerate(zip(ocr_res, page_layouts)):
+        # Sort layouts by priority: Footer → Header → Reference → Caption → Others
+        layouts_by_priority = self._sort_by_priority(layouts)
+
+        for ocr_box in ocr_boxes:
+            # Find overlapping layout
+            matched_layout = self.find_overlapped_with_threshold(
+                ocr_box,
+                layouts_by_priority,
+                thr=0.4  # 40% overlap threshold
+            )
+
+            if matched_layout:
+                ocr_box["layout_type"] = matched_layout["type"]
+                ocr_box["layoutno"] = matched_layout.get("layoutno", 0)
+            else:
+                ocr_box["layout_type"] = "Text"  # Default to Text
+
+    # Step 4: Filter garbage (headers, footers, page numbers)
+    if drop:
+        self._filter_garbage(ocr_res, page_layouts)
+
+    return ocr_res, page_layouts
+```
+
+### 1.6 Garbage Detection
+
+```python
+# deepdoc/vision/layout_recognizer.py, lines 64-66
+
+# Patterns to filter out
+garbage_patterns = [
+    r"^•+$",                        # Bullet points only
+    r"^[0-9]{1,2} / ?[0-9]{1,2}$",  # Page numbers (3/10, 3 / 10)
+    r"^[0-9]{1,2} of [0-9]{1,2}$",  # Page numbers (3 of 10)
+    r"^http://[^ ]{12,}",           # Long URLs
+    r"\(cid *: *[0-9]+ *\)",        # PDF character IDs
+]
+
+def is_garbage(text, layout_type, page_position):
+    """
+    Determine if text should be filtered out.
+
+    Rules:
+    - Headers at top 10% of page → keep
+    - Footers at bottom 10% of page → keep
+    - Headers/footers elsewhere → garbage
+    - Page numbers → garbage
+    - URLs → garbage
+    """
+    for pattern in garbage_patterns:
+        if re.match(pattern, text):
+            return True
+
+    # Position-based filtering
+    if layout_type == "Header" and page_position > 0.1:
+        return True  # Header not at top
+    if layout_type == "Footer" and page_position < 0.9:
+        return True  # Footer not at bottom
+
+    return False
+```
+
+---
+
+## 2. Table Structure Recognition
+
+### 2.1 Table Components
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 31-38
+
+labels = [
+    "table",                      # 0: Whole table boundary
+    "table column",               # 1: Column separators
+    "table row",                  # 2: Row separators
+    "table column header",        # 3: Header rows
+    "table projected row header", # 4: Row labels
+    "table spanning cell",        # 5: Merged cells
+]
+```
+
+### 2.2 Detection to Grid Construction
+
+```
+Detection Output → Table Grid:
+
+┌─────────────────────────────────────────────────────────────────┐
+│                        Raw Detections                            │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │ table: [0, 0, 500, 300]                                  │   │
+│  │ table row: [0, 0, 500, 50], [0, 50, 500, 100], ...       │   │
+│  │ table column: [0, 0, 150, 300], [150, 0, 300, 300], ...  │   │
+│  │ table spanning cell: [0, 100, 300, 150]                  │   │
+│  └──────────────────────────────────────────────────────────┘   │
+│                              │                                   │
+│                              ▼                                   │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │                    Alignment                              │   │
+│  │  • Align row boundaries (left/right edges)               │   │
+│  │  • Align column boundaries (top/bottom edges)            │   │
+│  └──────────────────────────────────────────────────────────┘   │
+│                              │                                   │
+│                              ▼                                   │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │                  Grid Construction                        │   │
+│  │                                                           │   │
+│  │  ┌──────────┬──────────┬──────────┐                      │   │
+│  │  │ Header 1 │ Header 2 │ Header 3 │  ← Row 0 (header)    │   │
+│  │  ├──────────┴──────────┼──────────┤                      │   │
+│  │  │   Spanning Cell     │  Cell 3  │  ← Row 1             │   │
+│  │  ├──────────┬──────────┼──────────┤                      │   │
+│  │  │  Cell 4  │  Cell 5  │  Cell 6  │  ← Row 2             │   │
+│  │  └──────────┴──────────┴──────────┘                      │   │
+│  │                                                           │   │
+│  └──────────────────────────────────────────────────────────┘   │
+│                              │                                   │
+│                              ▼                                   │
+│                   HTML or Descriptive Output                     │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### 2.3 Alignment Algorithm
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 67-111
+
+def __call__(self, images, thr=0.2):
+    """
+    Detect and align table structure.
+    """
+    # Run detection
+    detections = super().__call__(images, thr)
+
+    for page_dets in detections:
+        rows = [d for d in page_dets if d["label"] == "table row"]
+        cols = [d for d in page_dets if d["label"] == "table column"]
+
+        if len(rows) > 4:
+            # Align row X coordinates (left edges)
+            x0_values = [r["x0"] for r in rows]
+            mean_x0 = np.mean(x0_values)
+            min_x0 = np.min(x0_values)
+            aligned_x0 = min(mean_x0, min_x0 + 0.05 * (max(x0_values) - min_x0))
+
+            for r in rows:
+                r["x0"] = aligned_x0
+
+            # Align row X coordinates (right edges)
+            x1_values = [r["x1"] for r in rows]
+            mean_x1 = np.mean(x1_values)
+            max_x1 = np.max(x1_values)
+            aligned_x1 = max(mean_x1, max_x1 - 0.05 * (max_x1 - min(x1_values)))
+
+            for r in rows:
+                r["x1"] = aligned_x1
+
+        if len(cols) > 4:
+            # Similar alignment for column Y coordinates
+            # ...
+```
+
+**Tại sao cần alignment?**
+
+Detection model có thể cho ra boundaries không perfectly aligned:
+```
+Before alignment:
+Row 1: x0=10, x1=490
+Row 2: x0=12, x1=488
+Row 3: x0=8, x1=492
+
+After alignment:
+Row 1: x0=10, x1=490
+Row 2: x0=10, x1=490
+Row 3: x0=10, x1=490
+```
+
+### 2.4 Grid Construction
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 172-349
+
+@staticmethod
+def construct_table(boxes, is_english=False, html=True, **kwargs):
+    """
+    Construct 2D table from detected components.
+
+    Args:
+        boxes: OCR boxes with R (row), C (column), SP (spanning) attributes
+        is_english: Language hint
+        html: Output format (HTML or descriptive text)
+
+    Returns:
+        HTML table string or descriptive text
+    """
+    # Step 1: Extract caption
+    caption = ""
+    for box in boxes[:]:
+        if is_caption(box):
+            caption = box["text"]
+            boxes.remove(box)
+
+    # Step 2: Sort by row position (R attribute)
+    rowh = np.median([b["bottom"] - b["top"] for b in boxes])
+    boxes = Recognizer.sort_R_firstly(boxes, rowh / 2)
+
+    # Step 3: Group into rows
+    rows = []
+    current_row = [boxes[0]]
+
+    for box in boxes[1:]:
+        # Same row if Y difference < row_height/2
+        if abs(box["R"] - current_row[-1]["R"]) < rowh / 2:
+            current_row.append(box)
+        else:
+            rows.append(current_row)
+            current_row = [box]
+    rows.append(current_row)
+
+    # Step 4: Sort each row by column position (C attribute)
+    for row in rows:
+        row.sort(key=lambda x: x["C"])
+
+    # Step 5: Build 2D table matrix
+    n_rows = len(rows)
+    n_cols = max(len(row) for row in rows)
+
+    table = [[None] * n_cols for _ in range(n_rows)]
+
+    for i, row in enumerate(rows):
+        for j, cell in enumerate(row):
+            table[i][j] = cell
+
+    # Step 6: Handle spanning cells
+    table = handle_spanning_cells(table, boxes)
+
+    # Step 7: Generate output
+    if html:
+        return generate_html_table(table, caption)
+    else:
+        return generate_descriptive_text(table, caption)
+```
+
+### 2.5 Spanning Cell Handling
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 496-575
+
+def __cal_spans(self, boxes, rows, cols):
+    """
+    Calculate colspan and rowspan for merged cells.
+
+    Spanning cell detection:
+    - "SP" attribute indicates merged cell
+    - Calculate which rows/cols it covers
+    """
+    for box in boxes:
+        if "SP" not in box:
+            continue
+
+        # Find rows this cell spans
+        box["rowspan"] = []
+        for i, row in enumerate(rows):
+            overlap = self.overlapped_area(box, row)
+            if overlap > 0.3:  # 30% overlap
+                box["rowspan"].append(i)
+
+        # Find columns this cell spans
+        box["colspan"] = []
+        for j, col in enumerate(cols):
+            overlap = self.overlapped_area(box, col)
+            if overlap > 0.3:
+                box["colspan"].append(j)
+
+    return boxes
+```
+
+**Example**:
+```
+Spanning cell detection:
+
+┌──────────┬──────────┬──────────┐
+│ Header 1 │ Header 2 │ Header 3 │
+├──────────┴──────────┼──────────┤
+│   Merged Cell       │  Cell 3  │  ← SP cell spans columns 0-1
+│   (colspan=2)       │          │
+├──────────┬──────────┼──────────┤
+│  Cell 4  │  Cell 5  │  Cell 6  │
+└──────────┴──────────┴──────────┘
+
+Detection:
+- SP cell bbox: [0, 50, 300, 100]
+- Column 0: [0, 0, 150, 200]  → overlap 0.5 ✓
+- Column 1: [150, 0, 300, 200] → overlap 0.5 ✓
+- Column 2: [300, 0, 450, 200] → overlap 0.0 ✗
+→ colspan = [0, 1]
+```
+
+### 2.6 HTML Output Generation
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 352-393
+
+def __html_table(table, header_rows, caption):
+    """
+    Generate HTML table from 2D matrix.
+    """
+    html_parts = ["<table>"]
+
+    # Add caption if exists
+    if caption:
+        html_parts.append(f"<caption>{caption}</caption>")
+
+    for i, row in enumerate(table):
+        html_parts.append("<tr>")
+
+        for j, cell in enumerate(row):
+            if cell is None:
+                continue  # Skip cells covered by spanning
+
+            # Determine tag (th for header, td for data)
+            tag = "th" if i in header_rows else "td"
+
+            # Add colspan/rowspan attributes
+            attrs = []
+            if cell.get("colspan") and len(cell["colspan"]) > 1:
+                attrs.append(f'colspan="{len(cell["colspan"])}"')
+            if cell.get("rowspan") and len(cell["rowspan"]) > 1:
+                attrs.append(f'rowspan="{len(cell["rowspan"])}"')
+
+            attr_str = " " + " ".join(attrs) if attrs else ""
+
+            # Add cell content
+            html_parts.append(f"<{tag}{attr_str}>{cell['text']}</{tag}>")
+
+        html_parts.append("</tr>")
+
+    html_parts.append("</table>")
+
+    return "\n".join(html_parts)
+```
+
+**Output Example**:
+```html
+<table>
+  <caption>Table 1: Sales Data</caption>
+  <tr>
+    <th>Region</th>
+    <th>Q1</th>
+    <th>Q2</th>
+  </tr>
+  <tr>
+    <td colspan="2">North America</td>
+    <td>$150K</td>
+  </tr>
+  <tr>
+    <td>Europe</td>
+    <td>$100K</td>
+    <td>$120K</td>
+  </tr>
+</table>
+```
+
+### 2.7 Descriptive Text Output
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 396-493
+
+def __desc_table(table, header_rows, caption):
+    """
+    Generate natural language description of table.
+
+    For RAG, sometimes descriptive text is better than HTML.
+    """
+    descriptions = []
+
+    # Get headers
+    headers = [cell["text"] for cell in table[0]] if header_rows else []
+
+    # Process each data row
+    for i, row in enumerate(table):
+        if i in header_rows:
+            continue
+
+        row_desc = []
+        for j, cell in enumerate(row):
+            if cell is None:
+                continue
+
+            if headers and j < len(headers):
+                # "Column Name: Value" format
+                row_desc.append(f"{headers[j]}: {cell['text']}")
+            else:
+                row_desc.append(cell['text'])
+
+        if row_desc:
+            descriptions.append("; ".join(row_desc))
+
+    # Add source reference
+    if caption:
+        descriptions.append(f'(from "{caption}")')
+
+    return "\n".join(descriptions)
+```
+
+**Output Example**:
+```
+Region: North America; Q1: $100K; Q2: $150K
+Region: Europe; Q1: $80K; Q2: $120K
+(from "Table 1: Sales Data")
+```
+
+---
+
+## 3. Cell Content Classification
+
+### 3.1 Block Type Detection
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 121-149
+
+@staticmethod
+def blockType(text):
+    """
+    Classify cell content type.
+
+    Used for:
+    - Header detection (non-numeric cells likely headers)
+    - Data validation
+    - Smart formatting
+    """
+    patterns = {
+        "Dt": r"(^[0-9]{4}[-/][0-9]{1,2}|[0-9]{1,2}[-/][0-9]{1,2}[-/][0-9]{2,4}|"
+              r"[0-9]{1,2}月|[Q][1-4]|[一二三四]季度)",  # Date
+        "Nu": r"^[-+]?[0-9.,%%￥$€£¥]+$",  # Number
+        "Ca": r"^[A-Z0-9]{4,}$",  # Code
+        "En": r"^[a-zA-Z\s]+$",  # English
+    }
+
+    for type_name, pattern in patterns.items():
+        if re.search(pattern, text):
+            return type_name
+
+    # Classify by length
+    tokens = text.split()
+    if len(tokens) == 1:
+        return "Sg"  # Single
+    elif len(tokens) <= 3:
+        return "Tx"  # Short text
+    elif len(tokens) <= 12:
+        return "Lx"  # Long text
+    else:
+        return "Ot"  # Other
+
+# Examples:
+# "2023-01-15" → "Dt" (Date)
+# "$1,234.56" → "Nu" (Number)
+# "ABC123" → "Ca" (Code)
+# "Total Revenue" → "En" (English)
+# "北京市" → "Tx" (Text)
+```
+
+### 3.2 Header Detection
+
+```python
+# deepdoc/vision/table_structure_recognizer.py, lines 332-344
+
+def detect_headers(table):
+    """
+    Detect which rows are headers based on content type.
+
+    Heuristic: If >50% of cells in a row are non-numeric,
+    it's likely a header row.
+    """
+    header_rows = set()
+
+    for i, row in enumerate(table):
+        non_numeric = 0
+        total = 0
+
+        for cell in row:
+            if cell is None:
+                continue
+            total += 1
+            if blockType(cell["text"]) != "Nu":
+                non_numeric += 1
+
+        if total > 0 and non_numeric / total > 0.5:
+            header_rows.add(i)
+
+    return header_rows
+```
+
+---
+
+## 4. Integration với PDF Parser
+
+### 4.1 Table Detection in PDF Pipeline
+
+```python
+# deepdoc/parser/pdf_parser.py, lines 196-281
+
+def _table_transformer_job(self, zoomin=3):
+    """
+    Detect and structure tables using TableStructureRecognizer.
+    """
+    # Find table layouts
+    table_layouts = [
+        layout for layout in self.page_layout
+        if layout["type"] == "Table"
+    ]
+
+    if not table_layouts:
+        return
+
+    # Crop table images
+    table_images = []
+    for layout in table_layouts:
+        x0, y0, x1, y1 = layout["bbox"]
+        img = self.page_images[layout["page"]][
+            int(y0*zoomin):int(y1*zoomin),
+            int(x0*zoomin):int(x1*zoomin)
+        ]
+        table_images.append(img)
+
+    # Run TSR
+    table_structures = self.tsr(table_images)
+
+    # Match OCR boxes to table structure
+    for layout, structure in zip(table_layouts, table_structures):
+        # Get OCR boxes within table region
+        table_boxes = [
+            box for box in self.boxes
+            if self._box_in_region(box, layout["bbox"])
+        ]
+
+        # Assign R, C, SP attributes
+        for box in table_boxes:
+            box["R"] = self._find_row(box, structure["rows"])
+            box["C"] = self._find_column(box, structure["columns"])
+            if self._is_spanning(box, structure["spanning_cells"]):
+                box["SP"] = True
+
+        # Store for later extraction
+        self.tb_cpns[layout["id"]] = {
+            "boxes": table_boxes,
+            "structure": structure
+        }
+```
+
+### 4.2 Table Extraction
+
+```python
+# deepdoc/parser/pdf_parser.py, lines 757-930
+
+def _extract_table_figure(self, need_image, ZM, return_html, need_position):
+    """
+    Extract tables and figures from detected layouts.
+    """
+    tables = []
+
+    for layout_id, table_data in self.tb_cpns.items():
+        boxes = table_data["boxes"]
+
+        # Construct table (HTML or descriptive)
+        if return_html:
+            content = TableStructureRecognizer.construct_table(
+                boxes, html=True
+            )
+        else:
+            content = TableStructureRecognizer.construct_table(
+                boxes, html=False
+            )
+
+        table = {
+            "content": content,
+            "bbox": table_data["bbox"],
+        }
+
+        if need_image:
+            table["image"] = self._crop_region(table_data["bbox"])
+
+        tables.append(table)
+
+    return tables
+```
+
+---
+
+## 5. Performance Considerations
+
+### 5.1 Batch Processing
+
+```python
+# deepdoc/vision/recognizer.py, lines 415-437
+
+def __call__(self, image_list, thr=0.7, batch_size=16):
+    """
+    Batch inference for efficiency.
+
+    Why batch_size=16?
+    - GPU memory optimization
+    - Balance throughput vs latency
+    - Typical document has 10-50 elements
+    """
+    results = []
+
+    for i in range(0, len(image_list), batch_size):
+        batch = image_list[i:i+batch_size]
+
+        # Preprocess
+        inputs = self.preprocess(batch)
+
+        # Inference
+        outputs = self.ort_sess.run(None, inputs)
+
+        # Postprocess
+        batch_results = self.postprocess(outputs, inputs, thr)
+        results.extend(batch_results)
+
+    return results
+```
+
+### 5.2 Model Caching
+
+```python
+# deepdoc/vision/ocr.py, lines 36-73
+
+# Global model cache
+loaded_models = {}
+
+def load_model(model_dir, nm, device_id=None):
+    """
+    Load ONNX model with caching.
+
+    Cache key: model_path + device_id
+    """
+    model_path = os.path.join(model_dir, f"{nm}.onnx")
+    cache_key = f"{model_path}_{device_id}"
+
+    if cache_key in loaded_models:
+        return loaded_models[cache_key]
+
+    # Load model...
+    session = ort.InferenceSession(model_path, ...)
+
+    loaded_models[cache_key] = (session, run_opts)
+    return session, run_opts
+```
+
+---
+
+## 6. Troubleshooting
+
+### 6.1 Common Issues
+
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| Missing table | Low confidence | Lower threshold (0.1-0.2) |
+| Wrong colspan | Misaligned detection | Check row/column alignment |
+| Merged cells wrong | Overlap threshold | Adjust SP detection threshold |
+| Headers not detected | All numeric | Manual header specification |
+| Layout overlap | NMS threshold | Increase NMS IoU threshold |
+
+### 6.2 Debugging
+
+```python
+# Visualize layout detection
+from deepdoc.vision.seeit import draw_boxes
+
+# Draw layout boxes on image
+layout_vis = draw_boxes(
+    page_image,
+    [(l["bbox"], l["type"]) for l in page_layouts],
+    colors={
+        "Text": (0, 255, 0),
+        "Table": (255, 0, 0),
+        "Figure": (0, 0, 255),
+    }
+)
+cv2.imwrite("layout_debug.png", layout_vis)
+
+# Check table structure
+for box in table_boxes:
+    print(f"Text: {box['text']}")
+    print(f"  Row: {box.get('R', 'N/A')}")
+    print(f"  Col: {box.get('C', 'N/A')}")
+    print(f"  Spanning: {box.get('SP', False)}")
+```
+
+---
+
+## 7. References
+
+- YOLOv10 Paper: [YOLOv10: Real-Time End-to-End Object Detection](https://arxiv.org/abs/2405.14458)
+- Table Transformer: [PubTables-1M: Towards comprehensive table extraction](https://arxiv.org/abs/2110.00061)
+- Document Layout Analysis: [A Survey](https://arxiv.org/abs/2012.15005)
diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md
new file mode 100644
index 000000000..1885b37f3
--- /dev/null
+++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md
@@ -0,0 +1,678 @@
+# OCR Deep Dive
+
+## Tổng Quan
+
+Module OCR trong DeepDoc thực hiện 2 task chính:
+1. **Text Detection**: Phát hiện vùng chứa text trong image
+2. **Text Recognition**: Nhận dạng text trong các vùng đã phát hiện
+
+## File Structure
+
+```
+deepdoc/vision/
+├── ocr.py                 # Main OCR class (752 lines)
+├── postprocess.py         # CTC decoder, DBNet postprocess (371 lines)
+└── operators.py           # Image preprocessing (726 lines)
+```
+
+---
+
+## 1. Text Detection (DBNet)
+
+### 1.1 Model Architecture
+
+```
+DBNet (Differentiable Binarization Network):
+
+Input Image (H, W, 3)
+         │
+         ▼
+┌─────────────────────────────────────┐
+│        ResNet-18 Backbone           │
+│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐│
+│  │ C1  │→ │ C2  │→ │ C3  │→ │ C4  ││
+│  │64ch │  │128ch│  │256ch│  │512ch││
+│  └─────┘  └─────┘  └─────┘  └─────┘│
+└─────────────────────────────────────┘
+         │      │      │      │
+         ▼      ▼      ▼      ▼
+┌─────────────────────────────────────┐
+│        Feature Pyramid Network       │
+│  Upsample + Concatenate all levels  │
+│  Output: 256 channels               │
+└─────────────────────────────────────┘
+         │
+         ├─────────────────┐
+         ▼                 ▼
+┌─────────────────┐ ┌─────────────────┐
+│  Probability    │ │   Threshold     │
+│     Head        │ │     Head        │
+│  Conv → Sigmoid │ │  Conv → Sigmoid │
+└────────┬────────┘ └────────┬────────┘
+         │                   │
+         ▼                   ▼
+    Prob Map (H, W)    Thresh Map (H, W)
+         │                   │
+         └─────────┬─────────┘
+                   ▼
+┌─────────────────────────────────────┐
+│    Differentiable Binarization      │
+│    B = sigmoid((P - T) * k)         │
+│    k = 50 (amplification factor)    │
+└─────────────────────────────────────┘
+                   │
+                   ▼
+            Binary Map (H, W)
+```
+
+### 1.2 DBNet Post-processing
+
+```python
+# deepdoc/vision/postprocess.py, lines 41-259
+
+class DBPostProcess:
+    def __init__(self,
+                 thresh=0.3,           # Binary threshold
+                 box_thresh=0.5,       # Box confidence threshold
+                 max_candidates=1000,  # Maximum text regions
+                 unclip_ratio=1.5,     # Polygon expansion ratio
+                 use_dilation=False,   # Morphological dilation
+                 score_mode="fast"):   # fast or slow scoring
+
+    def __call__(self, outs_dict, shape_list):
+        """
+        Post-process DBNet output.
+
+        Args:
+            outs_dict: {"maps": probability_map}
+            shape_list: Original image shapes
+
+        Returns:
+            List of detected text boxes
+        """
+        pred = outs_dict["maps"]  # (N, 1, H, W)
+
+        # Step 1: Binary thresholding
+        bitmap = pred > self.thresh  # 0.3
+
+        # Step 2: Optional dilation
+        if self.use_dilation:
+            kernel = np.ones((2, 2))
+            bitmap = cv2.dilate(bitmap, kernel)
+
+        # Step 3: Find contours
+        contours = cv2.findContours(
+            bitmap.astype(np.uint8),
+            cv2.RETR_LIST,
+            cv2.CHAIN_APPROX_SIMPLE
+        )
+
+        # Step 4: Process each contour
+        boxes = []
+        for contour in contours[:self.max_candidates]:
+            # Simplify polygon
+            epsilon = 0.002 * cv2.arcLength(contour, True)
+            approx = cv2.approxPolyDP(contour, epsilon, True)
+
+            if len(approx) < 4:
+                continue
+
+            # Calculate confidence score
+            score = self.box_score_fast(pred, approx)
+            if score < self.box_thresh:
+                continue
+
+            # Unclip (expand) polygon
+            box = self.unclip(approx, self.unclip_ratio)
+            boxes.append(box)
+
+        return boxes
+```
+
+### 1.3 Unclipping Algorithm
+
+**Vấn đề**: DBNet tends to predict tight boundaries → misses edge characters
+
+**Giải pháp**: Expand detected polygon by unclip_ratio
+
+```python
+# deepdoc/vision/postprocess.py, lines 163-169
+
+def unclip(self, box, unclip_ratio):
+    """
+    Expand polygon using Clipper library.
+
+    Công thức:
+    distance = Area * unclip_ratio / Perimeter
+
+    Với unclip_ratio = 1.5:
+    - Nhỏ polygon → expand nhiều hơn
+    - Lớn polygon → expand ít hơn (proportional)
+    """
+    poly = Polygon(box)
+    distance = poly.area * unclip_ratio / poly.length
+
+    offset = pyclipper.PyclipperOffset()
+    offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
+
+    expanded = offset.Execute(distance)
+    return np.array(expanded[0])
+```
+
+**Visualization**:
+```
+Original detection:     After unclip (1.5x):
+┌──────────────┐        ┌────────────────────┐
+│   Hello      │   →    │      Hello         │
+└──────────────┘        └────────────────────┘
+                        (expanded boundaries)
+```
+
+---
+
+## 2. Text Recognition (CRNN)
+
+### 2.1 Model Architecture
+
+```
+CRNN (Convolutional Recurrent Neural Network):
+
+Input: Cropped text image (3, 48, W)
+                │
+                ▼
+┌─────────────────────────────────────┐
+│            CNN Backbone              │
+│  VGG-style convolutions             │
+│  7 conv layers + 4 max pooling      │
+│  Output: (512, 1, W/4)              │
+└────────────────┬────────────────────┘
+                 │
+                 ▼
+┌─────────────────────────────────────┐
+│         Sequence Reshaping          │
+│  Collapse height dimension          │
+│  Output: (W/4, 512)                 │
+└────────────────┬────────────────────┘
+                 │
+                 ▼
+┌─────────────────────────────────────┐
+│         Bidirectional LSTM          │
+│  2 layers, 256 hidden units         │
+│  Output: (W/4, 512)                 │
+└────────────────┬────────────────────┘
+                 │
+                 ▼
+┌─────────────────────────────────────┐
+│          Classification Head         │
+│  Linear(512 → num_classes)          │
+│  Output: (W/4, num_classes)         │
+└────────────────┬────────────────────┘
+                 │
+                 ▼
+        Probability Matrix (T, C)
+        T = time steps, C = characters
+```
+
+### 2.2 CTC Decoding
+
+```python
+# deepdoc/vision/postprocess.py, lines 347-370
+
+class CTCLabelDecode(BaseRecLabelDecode):
+    """
+    CTC (Connectionist Temporal Classification) Decoder.
+
+    CTC giải quyết vấn đề:
+    - Model output có T time steps
+    - Ground truth có N characters
+    - T > N (nhiều frame cho 1 ký tự)
+    - Không biết alignment chính xác
+
+    CTC thêm special "blank" token (ε):
+    - Represents "no output"
+    - Allows alignment without explicit segmentation
+    """
+
+    def __init__(self, character_dict_path, use_space_char=False):
+        super().__init__(character_dict_path, use_space_char)
+        # Prepend blank token at index 0
+        self.character = ['blank'] + self.character
+
+    def __call__(self, preds, label=None):
+        """
+        Decode CTC output.
+
+        Args:
+            preds: (batch, time, num_classes) probability matrix
+
+        Returns:
+            [(text, confidence), ...]
+        """
+        # Get most probable character at each time step
+        preds_idx = preds.argmax(axis=2)   # (batch, time)
+        preds_prob = preds.max(axis=2)      # (batch, time)
+
+        # Decode with deduplication
+        result = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
+
+        return result
+
+    def decode(self, text_index, text_prob, is_remove_duplicate=True):
+        """
+        CTC decoding algorithm.
+
+        Example:
+        Raw output:  [a, a, ε, l, l, ε, p, h, a]
+        After dedup: [a, ε, l, ε, p, h, a]
+        Remove blank: [a, l, p, h, a]
+        Final: "alpha"
+        """
+        result = []
+
+        for batch_idx in range(len(text_index)):
+            char_list = []
+            conf_list = []
+
+            for idx in range(len(text_index[batch_idx])):
+                char_idx = text_index[batch_idx][idx]
+
+                # Skip blank token (index 0)
+                if char_idx == 0:
+                    continue
+
+                # Skip consecutive duplicates
+                if is_remove_duplicate:
+                    if idx > 0 and char_idx == text_index[batch_idx][idx-1]:
+                        continue
+
+                char_list.append(self.character[char_idx])
+                conf_list.append(text_prob[batch_idx][idx])
+
+            text = ''.join(char_list)
+            conf = np.mean(conf_list) if conf_list else 0.0
+
+            result.append((text, conf))
+
+        return result
+```
+
+### 2.3 Aspect Ratio Handling
+
+```python
+# deepdoc/vision/ocr.py, lines 146-170
+
+def resize_norm_img(self, img, max_wh_ratio):
+    """
+    Resize image maintaining aspect ratio.
+
+    Vấn đề: Text images có width khác nhau
+    - "Hi" → narrow
+    - "Hello World" → wide
+
+    Giải pháp: Resize theo aspect ratio, pad right side
+    """
+    imgC, imgH, imgW = self.rec_image_shape  # [3, 48, 320]
+
+    # Calculate target width from aspect ratio
+    max_width = int(imgH * max_wh_ratio)
+    max_width = min(max_width, imgW)  # Cap at 320
+
+    h, w = img.shape[:2]
+    ratio = w / float(h)
+
+    # Resize maintaining aspect ratio
+    if ratio * imgH > max_width:
+        resized_w = max_width
+    else:
+        resized_w = int(ratio * imgH)
+
+    resized_img = cv2.resize(img, (resized_w, imgH))
+
+    # Pad right side to max_width
+    padded = np.zeros((imgH, max_width, 3), dtype=np.float32)
+    padded[:, :resized_w, :] = resized_img
+
+    # Normalize: [0, 255] → [-1, 1]
+    padded = (padded / 255.0 - 0.5) / 0.5
+
+    # Transpose: HWC → CHW
+    padded = padded.transpose(2, 0, 1)
+
+    return padded
+```
+
+**Visualization**:
+```
+Original images:
+┌──────┐  ┌────────────────┐  ┌──────────────────────┐
+│ Hi   │  │ Hello          │  │ Hello World          │
+└──────┘  └────────────────┘  └──────────────────────┘
+ narrow        medium               wide
+
+After resize + pad (to width 320):
+┌──────────────────────────────────────────────────────┐
+│ Hi   │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
+├──────────────────────────────────────────────────────┤
+│ Hello          │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
+├──────────────────────────────────────────────────────┤
+│ Hello World          │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
+└──────────────────────────────────────────────────────┘
+(░ = zero padding)
+```
+
+---
+
+## 3. Full OCR Pipeline
+
+### 3.1 OCR Class
+
+```python
+# deepdoc/vision/ocr.py, lines 536-752
+
+class OCR:
+    """
+    End-to-end OCR pipeline.
+
+    Usage:
+        ocr = OCR()
+        results = ocr(image)
+        # results: [(box_points, (text, confidence)), ...]
+    """
+
+    def __init__(self, model_dir=None):
+        # Auto-download models if not found
+        if model_dir is None:
+            model_dir = self._get_model_dir()
+
+        # Initialize detector and recognizer
+        self.text_detector = TextDetector(model_dir)
+        self.text_recognizer = TextRecognizer(model_dir)
+
+    def __call__(self, img, device_id=0, cls=True):
+        """
+        Full OCR pipeline.
+
+        Args:
+            img: numpy array (H, W, 3) in BGR
+            device_id: GPU device ID
+            cls: Whether to check text orientation
+
+        Returns:
+            [(box_4pts, (text, confidence)), ...]
+        """
+        # Step 1: Detect text regions
+        dt_boxes, det_time = self.text_detector(img)
+
+        if dt_boxes is None or len(dt_boxes) == 0:
+            return []
+
+        # Step 2: Sort boxes by reading order
+        dt_boxes = self.sorted_boxes(dt_boxes)
+
+        # Step 3: Crop and rotate each text region
+        img_crop_list = []
+        for box in dt_boxes:
+            tmp_box = self.get_rotate_crop_image(img, box)
+            img_crop_list.append(tmp_box)
+
+        # Step 4: Recognize text
+        rec_res, rec_time = self.text_recognizer(img_crop_list)
+
+        # Step 5: Filter by confidence
+        results = []
+        for box, rec in zip(dt_boxes, rec_res):
+            text, score = rec
+            if score >= 0.5:  # drop_score threshold
+                results.append((box, (text, score)))
+
+        return results
+```
+
+### 3.2 Rotation Detection
+
+```python
+# deepdoc/vision/ocr.py, lines 584-638
+
+def get_rotate_crop_image(self, img, points):
+    """
+    Crop text region with automatic rotation detection.
+
+    Vấn đề: Text có thể xoay 90° hoặc 270°
+    Giải pháp: Try multiple orientations, pick best CTC score
+    """
+    # Order points: top-left → top-right → bottom-right → bottom-left
+    rect = self.order_points_clockwise(points)
+
+    # Perspective transform to get rectangular crop
+    width = int(max(
+        np.linalg.norm(rect[0] - rect[1]),
+        np.linalg.norm(rect[2] - rect[3])
+    ))
+    height = int(max(
+        np.linalg.norm(rect[0] - rect[3]),
+        np.linalg.norm(rect[1] - rect[2])
+    ))
+
+    dst = np.array([
+        [0, 0],
+        [width, 0],
+        [width, height],
+        [0, height]
+    ], dtype=np.float32)
+
+    M = cv2.getPerspectiveTransform(rect, dst)
+    warped = cv2.warpPerspective(img, M, (width, height))
+
+    # Check if text is vertical (need rotation)
+    if warped.shape[0] / warped.shape[1] >= 1.5:
+        # Try 3 orientations
+        orientations = [
+            (warped, 0),                              # Original
+            (cv2.rotate(warped, cv2.ROTATE_90_CLOCKWISE), 90),
+            (cv2.rotate(warped, cv2.ROTATE_90_COUNTERCLOCKWISE), -90)
+        ]
+
+        best_score = -1
+        best_img = warped
+
+        for rot_img, angle in orientations:
+            # Quick recognition to get confidence
+            _, score = self.text_recognizer([rot_img])[0]
+            if score > best_score:
+                best_score = score
+                best_img = rot_img
+
+        warped = best_img
+
+    return warped
+```
+
+### 3.3 Reading Order Sorting
+
+```python
+# deepdoc/vision/ocr.py, lines 640-661
+
+def sorted_boxes(self, dt_boxes):
+    """
+    Sort boxes by reading order (top-to-bottom, left-to-right).
+
+    Algorithm:
+    1. Sort by Y coordinate (top of box)
+    2. Within same "row" (Y within 10px), sort by X coordinate
+    """
+    num_boxes = len(dt_boxes)
+    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
+
+    # Group into rows and sort each row
+    _boxes = list(sorted_boxes)
+
+    for i in range(num_boxes - 1):
+        for j in range(i, -1, -1):
+            # If boxes are on same row (Y difference < 10)
+            if abs(_boxes[j+1][0][1] - _boxes[j][0][1]) < 10:
+                # Sort by X coordinate
+                if _boxes[j+1][0][0] < _boxes[j][0][0]:
+                    _boxes[j], _boxes[j+1] = _boxes[j+1], _boxes[j]
+            else:
+                break
+
+    return _boxes
+```
+
+---
+
+## 4. Performance Optimization
+
+### 4.1 GPU Memory Management
+
+```python
+# deepdoc/vision/ocr.py, lines 96-127
+
+def load_model(model_dir, nm, device_id=None):
+    """
+    Load ONNX model with optimized settings.
+    """
+    options = ort.SessionOptions()
+
+    # Reduce memory fragmentation
+    options.enable_cpu_mem_arena = False
+
+    # Sequential execution (more predictable memory)
+    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+
+    # Limit thread usage
+    options.intra_op_num_threads = 2
+    options.inter_op_num_threads = 2
+
+    # GPU configuration
+    if torch.cuda.is_available() and device_id is not None:
+        providers = [
+            ('CUDAExecutionProvider', {
+                'device_id': device_id,
+                # Limit GPU memory to 2GB
+                'gpu_mem_limit': int(os.getenv('OCR_GPU_MEM_LIMIT_MB', 2048)) * 1024 * 1024,
+                # Memory allocation strategy
+                'arena_extend_strategy': os.getenv('OCR_ARENA_EXTEND_STRATEGY', 'kNextPowerOfTwo'),
+            })
+        ]
+    else:
+        providers = ['CPUExecutionProvider']
+
+    session = ort.InferenceSession(model_path, options, providers)
+
+    # Run options for memory cleanup after each run
+    run_opts = ort.RunOptions()
+    run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
+
+    return session, run_opts
+```
+
+### 4.2 Batch Processing Optimization
+
+```python
+# deepdoc/vision/ocr.py, lines 363-408
+
+def __call__(self, img_list):
+    """
+    Optimized batch recognition.
+    """
+    # Sort images by aspect ratio for efficient batching
+    # Similar widths → less padding waste
+    indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list])
+
+    results = [None] * len(img_list)
+
+    for batch_start in range(0, len(indices), self.batch_size):
+        batch_indices = indices[batch_start:batch_start + self.batch_size]
+
+        # Get max width in batch for padding
+        max_wh_ratio = max(img_list[i].shape[1]/img_list[i].shape[0]
+                          for i in batch_indices)
+
+        # Normalize all images to same width
+        norm_imgs = []
+        for i in batch_indices:
+            norm_img = self.resize_norm_img(img_list[i], max_wh_ratio)
+            norm_imgs.append(norm_img)
+
+        # Stack into batch
+        batch = np.stack(norm_imgs)
+
+        # Run inference
+        preds = self.ort_sess.run(None, {"input": batch})
+
+        # Decode results
+        texts = self.postprocess_op(preds[0])
+
+        # Map back to original indices
+        for j, idx in enumerate(batch_indices):
+            results[idx] = texts[j]
+
+    return results
+```
+
+### 4.3 Multi-GPU Parallel Processing
+
+```python
+# deepdoc/vision/ocr.py, lines 556-579
+
+class OCR:
+    def __init__(self, model_dir=None):
+        if settings.PARALLEL_DEVICES > 0:
+            # Create per-GPU instances
+            self.text_detector = [
+                TextDetector(model_dir, device_id)
+                for device_id in range(settings.PARALLEL_DEVICES)
+            ]
+            self.text_recognizer = [
+                TextRecognizer(model_dir, device_id)
+                for device_id in range(settings.PARALLEL_DEVICES)
+            ]
+        else:
+            # Single instance for CPU/single GPU
+            self.text_detector = TextDetector(model_dir)
+            self.text_recognizer = TextRecognizer(model_dir)
+```
+
+---
+
+## 5. Troubleshooting
+
+### 5.1 Common Issues
+
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| Low accuracy | Low resolution input | Increase zoomin factor (3-5) |
+| Slow inference | Large images | Resize to max 960px |
+| Memory error | Too many candidates | Reduce max_candidates |
+| Missing text | Tight boundaries | Increase unclip_ratio |
+| Wrong orientation | Vertical text | Enable rotation detection |
+
+### 5.2 Debugging Tips
+
+```python
+# Enable verbose logging
+import logging
+logging.basicConfig(level=logging.DEBUG)
+
+# Visualize detections
+from deepdoc.vision.seeit import draw_boxes
+
+img_with_boxes = draw_boxes(img, dt_boxes)
+cv2.imwrite("debug_detection.png", img_with_boxes)
+
+# Check confidence scores
+for box, (text, conf) in results:
+    print(f"Text: {text}, Confidence: {conf:.2f}")
+    if conf < 0.5:
+        print("  ⚠️ Low confidence!")
+```
+
+---
+
+## 6. References
+
+- DBNet Paper: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
+- CRNN Paper: [An End-to-End Trainable Neural Network for Image-based Sequence Recognition](https://arxiv.org/abs/1507.05717)
+- CTC Paper: [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
+- PaddleOCR: [GitHub](https://github.com/PaddlePaddle/PaddleOCR)