From 6d4dbbfe2c0eb2732cabf85f17bff1f00f40bd27 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 27 Nov 2025 03:46:14 +0000 Subject: [PATCH] docs: Add comprehensive DeepDoc deep guide documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Created in-depth documentation for understanding the deepdoc module: - README.md: Complete deep guide with: - Big picture explanation (what problem deepdoc solves) - Data flow diagrams (Input → Processing → Output) - Detailed code analysis with line numbers - Technical explanations (ONNX, CTC, NMS, etc.) - Design reasoning (why certain technologies chosen) - Difficult terms glossary - Extension examples - ocr_deep_dive.md: Deep dive into OCR subsystem - DBNet text detection architecture - CRNN text recognition - CTC decoding algorithm - Rotation handling - Performance optimization - layout_table_deep_dive.md: Deep dive into layout/table recognition - YOLOv10 layout detection - Table structure recognition - Grid construction algorithm - Spanning cell handling - HTML/descriptive output generation --- .../07-DEEPDOC-DEEP-GUIDE/README.md | 1286 +++++++++++++++++ .../layout_table_deep_dive.md | 926 ++++++++++++ .../07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md | 678 +++++++++ 3 files changed, 2890 insertions(+) create mode 100644 personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md create mode 100644 personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md create mode 100644 personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md new file mode 100644 index 000000000..45c4d3fcb --- /dev/null +++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md @@ -0,0 +1,1286 @@ +# DeepDoc Module - Hướng Dẫn Đọc Hiểu Chuyên Sâu + +## Mục Lục + +1. [Bức Tranh Lớn](#1-bức-tranh-lớn) +2. [Luồng Dữ Liệu](#2-luồng-dữ-liệu) +3. [Phân Tích Chi Tiết Code](#3-phân-tích-chi-tiết-code) +4. [Giải Thích Kỹ Thuật](#4-giải-thích-kỹ-thuật) +5. [Lý Do Thiết Kế](#5-lý-do-thiết-kế) +6. [Thuật Ngữ Khó](#6-thuật-ngữ-khó) +7. [Mở Rộng Từ Code](#7-mở-rộng-từ-code) + +--- + +## 1. Bức Tranh Lớn + +### 1.1 DeepDoc Giải Quyết Vấn Đề Gì? + +**Vấn đề cốt lõi**: Khi xây dựng hệ thống RAG (Retrieval-Augmented Generation), bạn cần chuyển đổi tài liệu (PDF, Word, Excel...) thành dạng text có cấu trúc để: +- Tìm kiếm semantic (vector search) +- Chia nhỏ (chunking) hợp lý +- Giữ nguyên ngữ cảnh của bảng, hình ảnh + +**DeepDoc là gì?**: Một module Python chuyên biệt để: +``` +Document Files → Structured Text + Tables + Figures +(PDF, DOCX...) (Có position, layout type, reading order) +``` + +### 1.2 Kiến Trúc Tổng Quan + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ DEEPDOC MODULE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ PARSER LAYER │ │ +│ │ Chuyển đổi các định dạng file thành text có cấu trúc │ │ +│ │ │ │ +│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ +│ │ │ PDF │ │ DOCX │ │ Excel │ │ HTML │ │ Markdown │ │ │ +│ │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ Parser │ │ │ +│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │ +│ │ │ │ │ │ │ │ │ +│ └───────┼────────────┼────────────┼────────────┼────────────┼─────────┘ │ +│ │ │ │ │ │ │ +│ │ └────────────┴────────────┴────────────┘ │ +│ │ │ │ +│ │ Text-based parsing │ +│ │ (pdfplumber, python-docx, openpyxl...) │ +│ │ │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ VISION LAYER │ │ +│ │ Computer Vision cho PDF phức tạp (scanned, multi-column) │ │ +│ │ │ │ +│ │ ┌──────────────┐ ┌──────────────────┐ ┌────────────────────┐ │ │ +│ │ │ OCR │ │ Layout Recognizer│ │ Table Structure │ │ │ +│ │ │ Detection + │ │ (YOLOv10) │ │ Recognizer │ │ │ +│ │ │ Recognition │ │ │ │ │ │ │ +│ │ └──────────────┘ └──────────────────┘ └────────────────────┘ │ │ +│ │ │ │ │ │ │ +│ │ └───────────────────┴──────────────────────┘ │ │ +│ │ │ │ │ +│ │ ONNX Runtime Inference │ │ +│ │ │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### 1.3 Các Thành Phần Chính + +| Thành Phần | File | Mục Đích | +|------------|------|----------| +| **PDF Parser** | `parser/pdf_parser.py` | Parser phức tạp nhất - xử lý PDF với OCR + layout | +| **Office Parsers** | `parser/docx_parser.py`, `excel_parser.py`, `ppt_parser.py` | Xử lý file Microsoft Office | +| **Web Parsers** | `parser/html_parser.py`, `markdown_parser.py`, `json_parser.py` | Xử lý file web/markup | +| **OCR Engine** | `vision/ocr.py` | Text detection + recognition | +| **Layout Detector** | `vision/layout_recognizer.py` | Phân loại vùng (text, table, figure...) | +| **Table Detector** | `vision/table_structure_recognizer.py` | Nhận dạng cấu trúc bảng | +| **Operators** | `vision/operators.py` | Image preprocessing pipeline | + +### 1.4 Tại Sao Cần DeepDoc? + +**Không có DeepDoc** (naive approach): +```python +# Chỉ extract raw text từ PDF +text = pdfplumber.open("doc.pdf").pages[0].extract_text() +# Kết quả: "Header Footer Table content mixed together..." +# ❌ Mất cấu trúc, table thành text xáo trộn +``` + +**Với DeepDoc**: +```python +parser = RAGFlowPdfParser() +docs, tables = parser("doc.pdf") +# docs: [("Paragraph 1", "page_0_pos_100_200"), ("Paragraph 2", "page_0_pos_300_400")] +# tables: [{"html": "...
", "bbox": [...]}] +# ✅ Giữ nguyên cấu trúc, table được parse riêng +``` + +--- + +## 2. Luồng Dữ Liệu + +### 2.1 Luồng Chính: PDF Processing + +``` +┌────────────────────────────────────────────────────────────────────────────┐ +│ PDF PROCESSING PIPELINE │ +└────────────────────────────────────────────────────────────────────────────┘ + +Input: PDF File (path hoặc bytes) + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 1: IMAGE EXTRACTION │ +│ File: pdf_parser.py, __images__() (lines 1042-1159) │ +│ │ +│ • Convert PDF pages → numpy images (using pdfplumber) │ +│ • Extract native PDF characters (text layer) │ +│ • Zoom factor: 3x (default) for OCR accuracy │ +│ │ +│ Output: page_images[], page_chars[] │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 2: OCR DETECTION & RECOGNITION │ +│ File: vision/ocr.py, OCR.__call__() (lines 708-751) │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ TextDetector │ → │ Crop & │ → │TextRecognizer│ │ +│ │ (DBNet) │ │ Rotate │ │ (CRNN) │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ │ +│ • Detect text regions → bounding boxes │ +│ • Crop each region, auto-rotate if needed │ +│ • Recognize text in each region │ +│ │ +│ Output: boxes[] with {text, confidence, coordinates} │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 3: LAYOUT RECOGNITION │ +│ File: vision/layout_recognizer.py, __call__() (lines 63-157) │ +│ │ +│ • Run YOLOv10 model on page image │ +│ • Detect 10 layout types: Text, Title, Table, Figure, etc. │ +│ • Match OCR boxes to layout regions │ +│ │ +│ Output: boxes[] with added {layout_type, layoutno} │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 4: COLUMN DETECTION │ +│ File: pdf_parser.py, _assign_column() (lines 355-440) │ +│ │ +│ • K-Means clustering on X coordinates │ +│ • Silhouette score to find optimal k (1-4 columns) │ +│ • Assign col_id to each text box │ +│ │ +│ Output: boxes[] with added {col_id} │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 5: TABLE STRUCTURE RECOGNITION │ +│ File: vision/table_structure_recognizer.py, __call__() (lines 67-111) │ +│ │ +│ • Detect rows, columns, headers, spanning cells │ +│ • Match text boxes to table cells │ +│ • Build 2D table matrix │ +│ │ +│ Output: table_components[] with grid structure │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 6: TEXT MERGING │ +│ File: pdf_parser.py, _text_merge() (lines 442-478) │ +│ _naive_vertical_merge() (lines 480-556) │ +│ │ +│ • Horizontal merge: same line, same column, same layout │ +│ • Vertical merge: adjacent paragraphs with semantic checks │ +│ • Respect sentence boundaries (。?!) │ +│ │ +│ Output: merged_boxes[] (fewer, larger text blocks) │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 7: FILTERING & CLEANUP │ +│ File: pdf_parser.py, _filter_forpages() (lines 685-729) │ +│ __filterout_scraps() (lines 971-1029) │ +│ │ +│ • Remove headers/footers (top/bottom 10% of page) │ +│ • Remove table of contents │ +│ • Filter low-quality OCR results │ +│ │ +│ Output: clean_boxes[] │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STEP 8: EXTRACT TABLES & FIGURES │ +│ File: pdf_parser.py, _extract_table_figure() (lines 757-930) │ +│ │ +│ • Convert table boxes to HTML/descriptive text │ +│ • Extract figure images with captions │ +│ • Handle spanning cells (colspan, rowspan) │ +│ │ +│ Output: tables[], figures[] │ +└──────────────────────────────────┬──────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ FINAL OUTPUT │ +│ │ +│ documents: [(text, position_tag), ...] │ +│ tables: [{"html": "...", "bbox": [...], "image": ...}, ...] │ +│ │ +│ position_tag format: "page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}" │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### 2.2 Luồng OCR Chi Tiết + +``` + Input Image (H, W, 3) + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ TEXT DETECTION (DBNet) │ +│ File: vision/ocr.py, TextDetector.__call__() (lines 503-530) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ┌────────────────────────┼────────────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ + │ Preprocess │ │ ONNX │ │ Postprocess │ + │ │ │ Inference │ │ │ + │ • Resize │ → │ │ → │ • Threshold │ + │ • Normalize │ │ DBNet │ │ • Contours │ + │ • Transpose │ │ Model │ │ • Unclip │ + └─────────────┘ └─────────────┘ └─────────────┘ + │ + ▼ + Text Region Polygons + [[x0,y0], [x1,y1], [x2,y2], [x3,y3]] + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ TEXT RECOGNITION (CRNN) │ +│ File: vision/ocr.py, TextRecognizer.__call__() (lines 363-408) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ┌────────────────────────┼────────────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ + │ Crop │ │ ONNX │ │ CTC Decode │ + │ Rotate │ │ Inference │ │ │ + │ │ → │ │ → │ • Argmax │ + │ Perspective │ │ CRNN │ │ • Dedup │ + │ Transform │ │ Model │ │ • Remove ε │ + └─────────────┘ └─────────────┘ └─────────────┘ + │ + ▼ + Output: [(box, (text, confidence)), ...] +``` + +### 2.3 Luồng Layout Recognition + +``` + Input: Page Image + OCR Results + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LAYOUT DETECTION (YOLOv10) │ +│ File: vision/layout_recognizer.py (lines 163-237) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ┌────────────────────────────┼────────────────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ + │ Preprocess │ │ ONNX │ │ Postprocess │ + │ │ │ Inference │ │ │ + │ • Resize │ → │ │ → │ • NMS │ + │ (640x640) │ │ YOLOv10 │ │ • Filter │ + │ • Pad │ │ Model │ │ • Scale │ + │ • Normalize │ │ │ │ back │ + └─────────────┘ └─────────────┘ └─────────────┘ + │ + ▼ + Layout Detections: + [{"type": "Table", "bbox": [...], "score": 0.95}] + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ OCR-LAYOUT ASSOCIATION │ +│ File: vision/layout_recognizer.py (lines 98-147) │ +│ │ +│ For each OCR box: │ +│ • Find overlapping layout region (threshold: 40%) │ +│ • Assign layout_type to OCR box │ +│ • Filter garbage (headers/footers/page numbers) │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + Output: OCR boxes with layout_type attribute + [{"text": "...", "layout_type": "Text", "layoutno": 1}] +``` + +### 2.4 Data Flow Summary + +``` +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ PDF File │ → │ Images │ → │ OCR Boxes │ → │ Merged │ +│ │ │ + Chars │ │ + Layout │ │ Documents │ +└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ + │ + ▼ + ┌─────────────┐ + │ Tables │ + │ (HTML/Desc)│ + └─────────────┘ + +Input Format: +- File path: str (e.g., "/path/to/doc.pdf") +- Or bytes: bytes (raw PDF content) + +Output Format: +- documents: List[Tuple[str, str]] + - text: Extracted text content + - position_tag: "page_0_x0_100_y0_200_x1_500_y1_250" + +- tables: List[Dict] + - html: "...
" + - bbox: [x0, y0, x1, y1] + - image: numpy array (optional) +``` + +--- + +## 3. Phân Tích Chi Tiết Code + +### 3.1 RAGFlowPdfParser Class + +**File**: `/deepdoc/parser/pdf_parser.py` +**Lines**: 52-1479 + +#### 3.1.1 Constructor (__init__) + +```python +# Line 52-104 +class RAGFlowPdfParser: + def __init__(self, **kwargs): + # Load OCR model + self.ocr = OCR() # vision/ocr.py + + # Load Layout Recognizer (YOLOv10) + self.layout_recognizer = LayoutRecognizer() # vision/layout_recognizer.py + + # Load Table Structure Recognizer + self.tsr = TableStructureRecognizer() # vision/table_structure_recognizer.py + + # Load XGBoost model for text concatenation + try: + self.updown_cnt_mdl = xgb.Booster() + model_path = os.path.join(get_project_base_directory(), + "rag/res/deepdoc/updown_concat_xgb.model") + self.updown_cnt_mdl.load_model(model_path) + except Exception as e: + self.updown_cnt_mdl = None +``` + +**Giải thích**: +- Constructor khởi tạo 4 models: + 1. **OCR**: Text detection + recognition + 2. **LayoutRecognizer**: Phân loại vùng layout (YOLOv10) + 3. **TableStructureRecognizer**: Nhận dạng cấu trúc bảng + 4. **XGBoost**: Quyết định merge text blocks (31 features) + +#### 3.1.2 Main Entry Point (__call__) + +```python +# Lines 1160-1168 +def __call__(self, fnm, need_image=True, zoomin=3, return_html=False): + """ + Main entry point for PDF parsing. + + Args: + fnm: File path or bytes + need_image: Whether to extract images + zoomin: Zoom factor for OCR (default 3x) + return_html: Return HTML tables instead of descriptive text + + Returns: + (documents, tables) tuple + """ + self.__images__(fnm, zoomin) # Step 1: Load images + self._layouts_rec(zoomin) # Step 2-3: OCR + Layout + self._table_transformer_job(zoomin) # Step 4: Table structure + self._text_merge(zoomin) # Step 5: Merge text + self._filter_forpages() # Step 6: Filter + tbls = self._extract_table_figure(...) # Step 7: Extract tables + return self._final_result(), tbls # Final output +``` + +**Tại sao zoomin=3?** +- OCR accuracy tăng đáng kể khi image lớn hơn +- 3x là balance giữa accuracy và memory/speed +- Quá lớn (5x+) → memory issues, quá nhỏ (1x) → OCR errors + +#### 3.1.3 Image Loading (__images__) + +```python +# Lines 1042-1159 +def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): + """ + Load PDF pages as images and extract native characters. + """ + self.page_images = [] + self.page_chars = [] + + # Open PDF with pdfplumber + with pdfplumber.open(fnm) as pdf: + for i, page in enumerate(pdf.pages[page_from:page_to]): + # Convert page to image + img = page.to_image(resolution=72 * zoomin) + img = np.array(img.original) + self.page_images.append(img) + + # Extract native PDF characters + chars = page.chars + self.page_chars.append(chars) +``` + +**Tại sao dùng pdfplumber?** +- Hỗ trợ cả text extraction và image conversion +- Giữ được character-level coordinates +- Xử lý tốt các PDF phức tạp + +#### 3.1.4 Column Detection (_assign_column) + +```python +# Lines 355-440 +def _assign_column(self, boxes, zoomin=3): + """ + Detect columns using K-Means clustering on X coordinates. + """ + from sklearn.cluster import KMeans + from sklearn.metrics import silhouette_score + + # Extract X coordinates + x_coords = np.array([[b["x0"]] for b in boxes]) + + best_k = 1 + best_score = -1 + + # Try k from 1 to 4 + for k in range(1, min(5, len(boxes))): + km = KMeans(n_clusters=k, random_state=42, n_init="auto") + labels = km.fit_predict(x_coords) + + if k > 1: + score = silhouette_score(x_coords, labels) + if score > best_score: + best_score = score + best_k = k + + # Final clustering with best k + km = KMeans(n_clusters=best_k, random_state=42, n_init="auto") + labels = km.fit_predict(x_coords) + + # Assign column IDs + for i, box in enumerate(boxes): + box["col_id"] = labels[i] +``` + +**Tại sao K-Means?** +- Unsupervised: không cần training data +- Fast: O(n * k * iterations) +- Silhouette score tự động chọn số cột + +### 3.2 OCR Class + +**File**: `/deepdoc/vision/ocr.py` +**Lines**: 536-752 + +#### 3.2.1 Text Detection (TextDetector) + +```python +# Lines 414-534 +class TextDetector: + def __init__(self, model_dir, device_id=None): + # Preprocessing pipeline + self.preprocess_op = [ + DetResizeForTest(limit_side_len=960, limit_type='max'), + NormalizeImage(mean=[0.485, 0.456, 0.406], + std=[0.229, 0.224, 0.225]), + ToCHWImage(), + KeepKeys(keep_keys=['image', 'shape']) + ] + + # Postprocessing + self.postprocess_op = DBPostProcess( + thresh=0.3, # Binary threshold + box_thresh=0.5, # Box confidence threshold + max_candidates=1000, # Max text regions + unclip_ratio=1.5 # Box expansion ratio + ) + + # Load ONNX model + self.ort_sess, self.run_opts = load_model(model_dir, "det", device_id) +``` + +**DBNet (Differentiable Binarization)**: +- Input: Image → Probability map (text regions) +- Thresholding: prob > 0.3 → foreground +- Unclipping: Expand boxes by 1.5x để capture full text + +#### 3.2.2 Text Recognition (TextRecognizer) + +```python +# Lines 133-412 +class TextRecognizer: + def __init__(self, model_dir, device_id=None): + self.rec_image_shape = [3, 48, 320] # C, H, W + self.batch_size = 16 + + # Load CRNN model + self.ort_sess, self.run_opts = load_model(model_dir, "rec", device_id) + + # CTC decoder + self.postprocess_op = CTCLabelDecode(character_dict_path=dict_path) + + def __call__(self, img_list): + # Sort by aspect ratio for efficient batching + indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list]) + + results = [] + for batch in chunks(indices, self.batch_size): + # Normalize images + norm_imgs = [self.resize_norm_img(img_list[i]) for i in batch] + + # Run inference + preds = self.ort_sess.run(None, {"input": np.stack(norm_imgs)}) + + # CTC decode + texts = self.postprocess_op(preds[0]) + results.extend(texts) + + return results +``` + +**CRNN + CTC**: +- CNN: Extract visual features +- RNN: Sequence modeling +- CTC: Alignment-free decoding (handles variable-length text) + +#### 3.2.3 Rotation Handling + +```python +# Lines 584-638 +def get_rotate_crop_image(self, img, points): + """ + Crop text region with auto-rotation detection. + """ + # Get perspective transform + rect = self.order_points_clockwise(points) + M = cv2.getPerspectiveTransform(rect, dst_pts) + warped = cv2.warpPerspective(img, M, (width, height)) + + # Check if text is vertical (height > 1.5 * width) + if warped.shape[0] / warped.shape[1] >= 1.5: + # Try 3 orientations + scores = [] + for angle in [0, 90, -90]: + rotated = self.rotate(warped, angle) + _, conf = self.recognizer([rotated])[0] + scores.append(conf) + + # Use orientation with highest confidence + best_angle = [0, 90, -90][np.argmax(scores)] + warped = self.rotate(warped, best_angle) + + return warped +``` + +**Tại sao cần auto-rotation?** +- PDF có thể chứa text xoay 90° +- OCR model trained on horizontal text +- Auto-detect giúp nhận dạng text dọc chính xác + +### 3.3 Layout Recognizer + +**File**: `/deepdoc/vision/layout_recognizer.py` +**Lines**: 33-237 + +#### 3.3.1 YOLOv10 Preprocessing + +```python +# Lines 186-209 +def preprocess(self, image_list): + """ + Preprocess images for YOLOv10 inference. + """ + processed = [] + for img in image_list: + h, w = img.shape[:2] + + # Calculate scale (preserve aspect ratio) + r = min(640/h, 640/w) + new_h, new_w = int(h*r), int(w*r) + + # Resize + resized = cv2.resize(img, (new_w, new_h)) + + # Pad to 640x640 (center padding, gray color) + padded = np.full((640, 640, 3), 114, dtype=np.uint8) + pad_top = (640 - new_h) // 2 + pad_left = (640 - new_w) // 2 + padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized + + # Normalize and transpose + padded = padded.astype(np.float32) / 255.0 + padded = padded.transpose(2, 0, 1) # HWC → CHW + + processed.append(padded) + + return np.stack(processed) +``` + +**Tại sao 640x640?** +- YOLOv10 standard input size +- Balance accuracy vs speed +- 32-stride alignment (640 = 20 * 32) + +#### 3.3.2 Layout Types + +```python +# Lines 34-46 +labels = [ + "_background_", # 0: Background (ignored) + "Text", # 1: Body text paragraphs + "Title", # 2: Section/document titles + "Figure", # 3: Images, diagrams, charts + "Figure caption", # 4: Text describing figures + "Table", # 5: Data tables + "Table caption", # 6: Text describing tables + "Header", # 7: Page headers + "Footer", # 8: Page footers + "Reference", # 9: Bibliography, citations + "Equation", # 10: Mathematical equations +] +``` + +### 3.4 Table Structure Recognizer + +**File**: `/deepdoc/vision/table_structure_recognizer.py` +**Lines**: 30-613 + +#### 3.4.1 Table Grid Construction + +```python +# Lines 172-349 +@staticmethod +def construct_table(boxes, is_english=False, html=True, **kwargs): + """ + Construct 2D table from detected components. + """ + # Step 1: Sort by row + boxes = Recognizer.sort_R_firstly(boxes, rowh/2) + + # Step 2: Group into rows + rows = [] + current_row = [boxes[0]] + for box in boxes[1:]: + if box["top"] - current_row[-1]["bottom"] > rowh/2: + rows.append(current_row) + current_row = [box] + else: + current_row.append(box) + rows.append(current_row) + + # Step 3: Sort each row by column + for row in rows: + row.sort(key=lambda x: x["x0"]) + + # Step 4: Build 2D matrix + n_cols = max(len(row) for row in rows) + table = [[None] * n_cols for _ in range(len(rows))] + + for i, row in enumerate(rows): + for j, cell in enumerate(row): + table[i][j] = cell["text"] + + # Step 5: Generate output + if html: + return generate_html_table(table) + else: + return generate_descriptive_text(table) +``` + +#### 3.4.2 Spanning Cell Handling + +```python +# Lines 496-575 +def __cal_spans(self, boxes): + """ + Calculate colspan and rowspan for merged cells. + """ + for box in boxes: + if "SP" not in box: # Not a spanning cell + continue + + # Find which rows this cell spans + box["rowspan"] = [] + for i, row_box in enumerate(self.rows): + if self.overlapped_area(box, row_box) > 0.3: + box["rowspan"].append(i) + + # Find which columns this cell spans + box["colspan"] = [] + for j, col_box in enumerate(self.cols): + if self.overlapped_area(box, col_box) > 0.3: + box["colspan"].append(j) +``` + +--- + +## 4. Giải Thích Kỹ Thuật + +### 4.1 ONNX Runtime + +**ONNX là gì?** +- Open Neural Network Exchange +- Format chuẩn cho deep learning models +- Chạy trên nhiều hardware (CPU, GPU, NPU) + +**Tại sao dùng ONNX?** +```python +# Không cần PyTorch/TensorFlow runtime +# Lightweight inference +import onnxruntime as ort + +session = ort.InferenceSession("model.onnx") +output = session.run(None, {"input": input_data}) +``` + +**Cấu hình trong DeepDoc**: +```python +# vision/ocr.py, lines 96-127 +options = ort.SessionOptions() +options.enable_cpu_mem_arena = False # Giảm memory fragmentation +options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL +options.intra_op_num_threads = 2 # Threads per operator +options.inter_op_num_threads = 2 # Parallel operators + +# GPU configuration +if torch.cuda.is_available(): + providers = [ + ('CUDAExecutionProvider', { + 'device_id': device_id, + 'gpu_mem_limit': 2 * 1024 * 1024 * 1024, # 2GB + }) + ] +``` + +### 4.2 CTC Decoding + +**CTC (Connectionist Temporal Classification)**: +- Giải quyết alignment problem trong sequence-to-sequence +- Không cần biết vị trí chính xác của từng ký tự + +**Ví dụ**: +``` +OCR Model Output (time steps): +[a, a, a, -, l, l, -, p, p, h, h, a, -] + +CTC Decoding: +1. Merge consecutive duplicates: [a, -, l, -, p, h, a, -] +2. Remove blank tokens (-): [a, l, p, h, a] +3. Result: "alpha" +``` + +**Implementation**: +```python +# vision/postprocess.py, lines 355-366 +def __call__(self, preds, label=None): + # Get most probable character at each position + preds_idx = preds.argmax(axis=2) # Shape: (batch, time) + preds_prob = preds.max(axis=2) # Confidence scores + + # Decode with deduplication + text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True) + + return text +``` + +### 4.3 Non-Maximum Suppression (NMS) + +**NMS là gì?** +- Loại bỏ duplicate detections +- Giữ lại box có confidence cao nhất + +**Algorithm**: +``` +1. Sort boxes by confidence (descending) +2. Pick box with highest score → add to results +3. Remove boxes with IoU > threshold (e.g., 0.5) +4. Repeat until no boxes remain +``` + +**Implementation**: +```python +# vision/operators.py, lines 702-725 +def nms(bboxes, scores, iou_thresh): + indices = [] + index = scores.argsort()[::-1] # Sort descending + + while index.size > 0: + i = index[0] + indices.append(i) + + # Compute IoU with remaining boxes + ious = compute_iou(bboxes[i], bboxes[index[1:]]) + + # Keep only boxes with IoU <= threshold + mask = ious <= iou_thresh + index = index[1:][mask] + + return indices +``` + +### 4.4 DBNet (Differentiable Binarization) + +**DBNet là gì?** +- Text detection network +- Tạo probability map + threshold map +- Differentiable binarization cho end-to-end training + +**Pipeline**: +``` +Image → CNN Backbone → Feature Map → + ├→ Probability Map (text regions) + └→ Threshold Map (adaptive threshold) + +Final = Probability > Threshold (pixel-wise) +``` + +**Post-processing**: +```python +# vision/postprocess.py, DBPostProcess +def __call__(self, outs_dict, shape_list): + pred = outs_dict["maps"] + + # Binary thresholding + bitmap = pred > self.thresh # 0.3 + + # Find contours + contours = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE) + + # Unclip (expand) boxes + for contour in contours: + box = self.unclip(contour, self.unclip_ratio) # 1.5x expansion + boxes.append(box) +``` + +### 4.5 K-Means cho Column Detection + +**Tại sao K-Means?** +- Text boxes trong cùng cột có X coordinate tương tự +- K-Means cluster các X values +- Silhouette score chọn số cột tối ưu + +**Silhouette Score**: +``` +s(i) = (b(i) - a(i)) / max(a(i), b(i)) + +- a(i): Average distance to same cluster +- b(i): Average distance to nearest other cluster +- Range: [-1, 1], higher = better clustering +``` + +**Ví dụ**: +``` +Page with 2 columns: +Left column boxes: x0 = [50, 52, 48, 51, ...] +Right column boxes: x0 = [400, 398, 402, 399, ...] + +K-Means (k=2): +- Cluster 0: x0 ≈ 50 (left column) +- Cluster 1: x0 ≈ 400 (right column) + +Silhouette score ≈ 0.95 (high, good separation) +``` + +--- + +## 5. Lý Do Thiết Kế + +### 5.1 Tại Sao Dùng Multiple Models? + +**Vấn đề**: Một model không thể handle tất cả tasks + +| Task | Model Type | Lý Do | +|------|------------|-------| +| Text Detection | DBNet | Specialized cho text regions | +| Text Recognition | CRNN | Sequential text với CTC | +| Layout Detection | YOLOv10 | Object detection tốt nhất | +| Table Structure | YOLOv10 variant | Fine-tuned cho table elements | + +**Trade-off**: +- Pros: Mỗi model optimized cho task riêng +- Cons: Nhiều models → nhiều memory, complexity + +### 5.2 Tại Sao Dùng XGBoost cho Text Merging? + +**Vấn đề**: Merge text blocks là decision phức tạp + +**Rule-based approach** (naive): +```python +# Simple heuristics +if y_distance < threshold and same_column: + merge() +# ❌ Không handle edge cases tốt +``` + +**ML approach** (XGBoost): +```python +# 31 features capturing various signals +features = [ + y_distance / char_height, # Distance feature + ends_with_punctuation, # Text pattern + same_layout_type, # Layout feature + font_size_ratio, # Typography + ... +] +# ✅ Learns complex patterns from data +``` + +**Tại sao XGBoost?** +- Fast inference (tree-based) +- Handles mixed feature types well +- Pre-trained model included + +### 5.3 Tại Sao ONNX thay vì PyTorch/TensorFlow? + +| Aspect | ONNX Runtime | PyTorch | +|--------|--------------|---------| +| Size | ~50MB | ~500MB+ | +| Memory | Lower | Higher | +| Startup | Fast | Slow (JIT) | +| Dependencies | Minimal | Many | +| Multi-platform | Yes | Limited | + +**DeepDoc choice**: ONNX cho production deployment +- Không cần PyTorch runtime +- Lighter memory footprint +- Faster cold start + +### 5.4 Tại Sao Zoomin = 3? + +**Experiment results**: +``` +zoomin=1: OCR accuracy ~70%, fast +zoomin=2: OCR accuracy ~85%, moderate +zoomin=3: OCR accuracy ~95%, acceptable speed ← chosen +zoomin=4: OCR accuracy ~97%, slow +zoomin=5: OCR accuracy ~98%, very slow, memory issues +``` + +**Balance**: 3x là sweet spot giữa accuracy và resource usage + +### 5.5 Tại Sao Hybrid Text Extraction? + +**Native PDF text** (pdfplumber): +- Pros: Accurate, fast, preserves fonts +- Cons: Không có cho scanned PDFs + +**OCR text**: +- Pros: Works on any image +- Cons: Slower, potential errors + +**Hybrid approach**: +```python +# Prefer native text, fallback to OCR +for box in ocr_boxes: + # Try to match with native characters + matched_chars = find_overlapping_chars(box, native_chars) + + if matched_chars: + box["text"] = "".join(matched_chars) # Use native + else: + box["text"] = ocr_result # Use OCR +``` + +### 5.6 Pipeline vs End-to-End Model + +**End-to-End** (e.g., Donut, Pix2Struct): +- Single model: Image → Structured output +- Pros: Simple, unified +- Cons: Less accurate on specific tasks, hard to debug + +**Pipeline** (DeepDoc's choice): +- Multiple specialized models +- Pros: + - Each model optimized for task + - Easy to debug/improve individual components + - Mix and match different models +- Cons: + - More complexity + - Potential error accumulation + +**DeepDoc's rationale**: Pipeline cho flexibility và accuracy + +--- + +## 6. Thuật Ngữ Khó + +### 6.1 Computer Vision Terms + +| Term | Definition | Ví Dụ trong DeepDoc | +|------|------------|---------------------| +| **Bounding Box** | Hình chữ nhật bao quanh object | `[x0, y0, x1, y1]` coordinates | +| **IoU** | Intersection over Union - đo overlap | NMS threshold 0.5 | +| **NMS** | Non-Maximum Suppression | Loại duplicate detections | +| **Anchor** | Predefined box sizes | YOLOv10 anchors | +| **Stride** | Downsampling factor | 32 trong YOLOv10 | +| **FPN** | Feature Pyramid Network | Multi-scale detection | + +### 6.2 OCR Terms + +| Term | Definition | Ví Dụ trong DeepDoc | +|------|------------|---------------------| +| **CTC** | Connectionist Temporal Classification | CRNN output decoding | +| **CRNN** | CNN + RNN | Text recognition model | +| **DBNet** | Differentiable Binarization | Text detection model | +| **Unclip** | Expand polygon boundary | 1.5x expansion ratio | + +### 6.3 ML Terms + +| Term | Definition | Ví Dụ trong DeepDoc | +|------|------------|---------------------| +| **ONNX** | Open Neural Network Exchange | Model format | +| **Inference** | Running model on input | `session.run()` | +| **Batch** | Multiple inputs processed together | batch_size=16 | +| **Confidence** | Model's certainty score | 0.0 - 1.0 | + +### 6.4 Document Processing Terms + +| Term | Definition | Ví Dụ trong DeepDoc | +|------|------------|---------------------| +| **Layout** | Document structure | Text, Table, Figure | +| **TSR** | Table Structure Recognition | Row, Column detection | +| **Spanning Cell** | Merged table cell | colspan, rowspan | +| **Reading Order** | Text flow sequence | Top-to-bottom, left-to-right | + +--- + +## 7. Mở Rộng Từ Code + +### 7.1 Thêm Parser Mới + +**Ví dụ**: Add RTF parser + +```python +# deepdoc/parser/rtf_parser.py +from striprtf.striprtf import rtf_to_text + +class RAGFlowRtfParser: + def __call__(self, fnm, binary=None, chunk_token_num=128): + if binary: + content = binary.decode('utf-8') + else: + with open(fnm, 'r') as f: + content = f.read() + + text = rtf_to_text(content) + + # Chunk text + chunks = self._chunk(text, chunk_token_num) + + return [(chunk, f"rtf_chunk_{i}") for i, chunk in enumerate(chunks)] +``` + +### 7.2 Thêm Layout Type Mới + +**Ví dụ**: Add "Code Block" layout + +```python +# vision/layout_recognizer.py +labels = [ + "_background_", + "Text", + "Title", + ... + "Code Block", # New label (index 11) +] + +# Train new YOLOv10 model with "Code Block" annotations +# Update model file +``` + +### 7.3 Custom Text Merging Logic + +```python +# Override default merging behavior +class CustomPdfParser(RAGFlowPdfParser): + def _should_merge(self, box1, box2): + """Custom merge logic""" + # Don't merge code blocks + if box1.get("layout_type") == "Code Block": + return False + + # Use default logic otherwise + return super()._should_merge(box1, box2) +``` + +### 7.4 Thêm Output Format + +```python +# Add Markdown output format +def to_markdown(self, documents, tables): + md_parts = [] + + for text, pos_tag in documents: + # Detect if title + if self._is_title(text): + md_parts.append(f"## {text}\n") + else: + md_parts.append(f"{text}\n\n") + + # Convert tables to markdown + for table in tables: + md_table = html_to_markdown(table["html"]) + md_parts.append(md_table) + + return "\n".join(md_parts) +``` + +### 7.5 Optimize Performance + +**GPU Batching**: +```python +# Process multiple pages in parallel +def _parallel_ocr(self, images, batch_size=4): + with ThreadPoolExecutor(max_workers=4) as executor: + futures = [] + for batch in chunks(images, batch_size): + future = executor.submit(self.ocr, batch) + futures.append(future) + + results = [f.result() for f in futures] + return results +``` + +**Caching**: +```python +# Cache model instances +_model_cache = {} + +def get_ocr_model(model_dir, device_id): + key = f"{model_dir}_{device_id}" + if key not in _model_cache: + _model_cache[key] = OCR(model_dir, device_id) + return _model_cache[key] +``` + +### 7.6 Integration với RAG Pipeline + +```python +# rag/app/pdf.py (example integration) +from deepdoc.parser import RAGFlowPdfParser + +def process_pdf_for_rag(file_path, chunk_size=512): + parser = RAGFlowPdfParser() + + # Parse PDF + documents, tables = parser(file_path) + + # Chunk documents + chunks = [] + for text, pos_tag in documents: + for chunk in chunk_text(text, chunk_size): + chunks.append({ + "text": chunk, + "metadata": {"position": pos_tag} + }) + + # Add tables as separate chunks + for table in tables: + chunks.append({ + "text": table["html"], + "metadata": {"type": "table", "bbox": table["bbox"]} + }) + + return chunks +``` + +--- + +## 8. Tổng Kết + +### 8.1 Key Takeaways + +1. **DeepDoc = Parser Layer + Vision Layer** + - Parser: Format-specific handling (PDF, DOCX, etc.) + - Vision: OCR + Layout + Table recognition + +2. **Pipeline Architecture** + - Multiple specialized models + - Easy to debug and improve + +3. **ONNX Runtime** + - Lightweight inference + - Cross-platform compatibility + +4. **Hybrid Text Extraction** + - Native PDF text khi available + - OCR fallback cho scanned documents + +### 8.2 Diagram Tổng Hợp + +``` +┌──────────────────────────────────────────────────────────────────────────────┐ +│ DEEPDOC SUMMARY │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ INPUT PROCESSING OUTPUT │ +│ ───── ────────── ────── │ +│ │ +│ ┌─────────┐ ┌────────────────────────────┐ ┌─────────────────┐ │ +│ │ PDF │────▶│ 1. Image Extraction │─────▶│ Documents │ │ +│ │ DOCX │ │ 2. OCR (DBNet + CRNN) │ │ [(text, pos)] │ │ +│ │ Excel │ │ 3. Layout (YOLOv10) │ │ │ │ +│ │ HTML │ │ 4. Column Detection │ │ Tables │ │ +│ │ ... │ │ 5. Table Structure │ │ [html, bbox] │ │ +│ └─────────┘ │ 6. Text Merging │ │ │ │ +│ │ 7. Quality Filtering │ │ Figures │ │ +│ └────────────────────────────┘ │ [image, cap] │ │ +│ └─────────────────┘ │ +│ │ +│ MODELS USED: │ +│ ──────────── │ +│ • DBNet (Text Detection) - ONNX, ~30MB │ +│ • CRNN (Text Recognition) - ONNX, ~20MB │ +│ • YOLOv10 (Layout Detection) - ONNX, ~50MB │ +│ • YOLOv10 (Table Structure) - ONNX, ~50MB │ +│ • XGBoost (Text Merging) - Binary, ~5MB │ +│ │ +│ KEY ALGORITHMS: │ +│ ─────────────── │ +│ • CTC Decoding (text recognition) │ +│ • NMS (duplicate removal) │ +│ • K-Means (column detection) │ +│ • IoU (overlap calculation) │ +│ │ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +### 8.3 Files Reference + +| File | Lines | Description | +|------|-------|-------------| +| `parser/pdf_parser.py` | 1479 | Main PDF parser | +| `vision/ocr.py` | 752 | OCR detection + recognition | +| `vision/layout_recognizer.py` | 457 | Layout detection | +| `vision/table_structure_recognizer.py` | 613 | Table structure | +| `vision/recognizer.py` | 443 | Base recognizer class | +| `vision/operators.py` | 726 | Image preprocessing | +| `vision/postprocess.py` | 371 | Post-processing utilities | + +--- + +*Document created for RAGFlow v0.22.1 analysis* diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md new file mode 100644 index 000000000..063acd4be --- /dev/null +++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md @@ -0,0 +1,926 @@ +# Layout & Table Recognition Deep Dive + +## Tổng Quan + +Sau khi OCR extract được text boxes, DeepDoc cần: +1. **Layout Recognition**: Phân loại vùng (Text, Title, Table, Figure...) +2. **Table Structure Recognition**: Nhận dạng cấu trúc bảng (rows, columns, cells) + +## File Structure + +``` +deepdoc/vision/ +├── layout_recognizer.py # Layout detection (457 lines) +├── table_structure_recognizer.py # Table structure (613 lines) +└── recognizer.py # Base class (443 lines) +``` + +--- + +## 1. Layout Recognition (YOLOv10) + +### 1.1 Layout Categories + +```python +# deepdoc/vision/layout_recognizer.py, lines 34-46 + +labels = [ + "_background_", # 0: Background (ignored) + "Text", # 1: Body text paragraphs + "Title", # 2: Section/document titles + "Figure", # 3: Images, diagrams, charts + "Figure caption", # 4: Text describing figures + "Table", # 5: Data tables + "Table caption", # 6: Text describing tables + "Header", # 7: Page headers + "Footer", # 8: Page footers + "Reference", # 9: Bibliography, citations + "Equation", # 10: Mathematical equations +] +``` + +### 1.2 YOLOv10 Architecture + +``` +YOLOv10 for Document Layout: + +Input Image (640, 640, 3) + │ + ▼ +┌─────────────────────────────────────┐ +│ CSPDarknet Backbone │ +│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐│ +│ │ P1 │→ │ P2 │→ │ P3 │→ │ P4 ││ +│ │/2 │ │/4 │ │/8 │ │/16 ││ +│ └─────┘ └─────┘ └─────┘ └─────┘│ +└─────────────────────────────────────┘ + │ │ │ │ + ▼ ▼ ▼ ▼ +┌─────────────────────────────────────┐ +│ PANet Neck │ +│ FPN (top-down) + PAN (bottom-up) │ +│ Multi-scale feature fusion │ +└─────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────┐ +│ Detection Heads (3 scales) │ +│ Small (80x80) → tiny objects │ +│ Medium (40x40) → normal objects │ +│ Large (20x20) → big objects │ +└─────────────────────────────────────┘ + │ + ▼ + Raw Predictions: + [x_center, y_center, width, height, confidence, class_probs...] +``` + +### 1.3 Preprocessing (LayoutRecognizer4YOLOv10) + +```python +# deepdoc/vision/layout_recognizer.py, lines 186-209 + +def preprocess(self, image_list): + """ + Preprocess images for YOLOv10. + + Key steps: + 1. Resize maintaining aspect ratio + 2. Pad to 640x640 (gray borders) + 3. Normalize [0,255] → [0,1] + 4. Transpose HWC → CHW + """ + processed = [] + scale_factors = [] + + for img in image_list: + h, w = img.shape[:2] + + # Calculate scale (preserve aspect ratio) + r = min(640/h, 640/w) + new_h, new_w = int(h*r), int(w*r) + + # Resize + resized = cv2.resize(img, (new_w, new_h)) + + # Calculate padding + pad_top = (640 - new_h) // 2 + pad_left = (640 - new_w) // 2 + + # Pad to 640x640 (gray: 114) + padded = np.full((640, 640, 3), 114, dtype=np.uint8) + padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized + + # Normalize and transpose + padded = padded.astype(np.float32) / 255.0 + padded = padded.transpose(2, 0, 1) # HWC → CHW + + processed.append(padded) + scale_factors.append([1/r, 1/r, pad_left, pad_top]) + + return np.stack(processed), scale_factors +``` + +**Visualization**: +``` +Original image (1000x800): +┌────────────────────────────────────────┐ +│ │ +│ Document Content │ +│ │ +└────────────────────────────────────────┘ + +After resize (scale=0.64) to (640x512): +┌────────────────────────────────────────┐ +│ │ +│ Document Content │ +│ │ +└────────────────────────────────────────┘ + +After padding to (640x640): +┌────────────────────────────────────────┐ +│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding +├────────────────────────────────────────┤ +│ │ +│ Document Content │ +│ │ +├────────────────────────────────────────┤ +│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding +└────────────────────────────────────────┘ +``` + +### 1.4 NMS Postprocessing + +```python +# deepdoc/vision/recognizer.py, lines 330-407 + +def postprocess(self, boxes, inputs, thr): + """ + YOLOv10 postprocessing with per-class NMS. + """ + results = [] + + for batch_idx, batch_boxes in enumerate(boxes): + scale_factor = inputs["scale_factor"][batch_idx] + + # Filter by confidence threshold + mask = batch_boxes[:, 4] > thr # confidence > 0.2 + filtered = batch_boxes[mask] + + if len(filtered) == 0: + results.append([]) + continue + + # Convert xywh → xyxy + xyxy = self.xywh2xyxy(filtered[:, :4]) + + # Remove padding offset + xyxy[:, [0, 2]] -= scale_factor[2] # pad_left + xyxy[:, [1, 3]] -= scale_factor[3] # pad_top + + # Scale back to original size + xyxy[:, [0, 2]] *= scale_factor[0] # scale_x + xyxy[:, [1, 3]] *= scale_factor[1] # scale_y + + # Per-class NMS + class_ids = filtered[:, 5].astype(int) + scores = filtered[:, 4] + + keep_indices = [] + for cls in np.unique(class_ids): + cls_mask = class_ids == cls + cls_boxes = xyxy[cls_mask] + cls_scores = scores[cls_mask] + + # NMS within class + keep = self.iou_filter(cls_boxes, cls_scores, iou_thresh=0.45) + keep_indices.extend(np.where(cls_mask)[0][keep]) + + # Build result + batch_results = [] + for idx in keep_indices: + batch_results.append({ + "type": self.labels[int(filtered[idx, 5])], + "bbox": xyxy[idx].tolist(), + "score": float(filtered[idx, 4]) + }) + + results.append(batch_results) + + return results +``` + +### 1.5 OCR-Layout Association + +```python +# deepdoc/vision/layout_recognizer.py, lines 98-147 + +def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True): + """ + Detect layouts and associate with OCR results. + """ + # Step 1: Run layout detection + page_layouts = super().__call__(image_list, thr, batch_size) + + # Step 2: Clean up overlapping layouts + for i, layouts in enumerate(page_layouts): + page_layouts[i] = self.layouts_cleanup(layouts, thr=0.7) + + # Step 3: Associate OCR boxes with layouts + for page_idx, (ocr_boxes, layouts) in enumerate(zip(ocr_res, page_layouts)): + # Sort layouts by priority: Footer → Header → Reference → Caption → Others + layouts_by_priority = self._sort_by_priority(layouts) + + for ocr_box in ocr_boxes: + # Find overlapping layout + matched_layout = self.find_overlapped_with_threshold( + ocr_box, + layouts_by_priority, + thr=0.4 # 40% overlap threshold + ) + + if matched_layout: + ocr_box["layout_type"] = matched_layout["type"] + ocr_box["layoutno"] = matched_layout.get("layoutno", 0) + else: + ocr_box["layout_type"] = "Text" # Default to Text + + # Step 4: Filter garbage (headers, footers, page numbers) + if drop: + self._filter_garbage(ocr_res, page_layouts) + + return ocr_res, page_layouts +``` + +### 1.6 Garbage Detection + +```python +# deepdoc/vision/layout_recognizer.py, lines 64-66 + +# Patterns to filter out +garbage_patterns = [ + r"^•+$", # Bullet points only + r"^[0-9]{1,2} / ?[0-9]{1,2}$", # Page numbers (3/10, 3 / 10) + r"^[0-9]{1,2} of [0-9]{1,2}$", # Page numbers (3 of 10) + r"^http://[^ ]{12,}", # Long URLs + r"\(cid *: *[0-9]+ *\)", # PDF character IDs +] + +def is_garbage(text, layout_type, page_position): + """ + Determine if text should be filtered out. + + Rules: + - Headers at top 10% of page → keep + - Footers at bottom 10% of page → keep + - Headers/footers elsewhere → garbage + - Page numbers → garbage + - URLs → garbage + """ + for pattern in garbage_patterns: + if re.match(pattern, text): + return True + + # Position-based filtering + if layout_type == "Header" and page_position > 0.1: + return True # Header not at top + if layout_type == "Footer" and page_position < 0.9: + return True # Footer not at bottom + + return False +``` + +--- + +## 2. Table Structure Recognition + +### 2.1 Table Components + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 31-38 + +labels = [ + "table", # 0: Whole table boundary + "table column", # 1: Column separators + "table row", # 2: Row separators + "table column header", # 3: Header rows + "table projected row header", # 4: Row labels + "table spanning cell", # 5: Merged cells +] +``` + +### 2.2 Detection to Grid Construction + +``` +Detection Output → Table Grid: + +┌─────────────────────────────────────────────────────────────────┐ +│ Raw Detections │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ table: [0, 0, 500, 300] │ │ +│ │ table row: [0, 0, 500, 50], [0, 50, 500, 100], ... │ │ +│ │ table column: [0, 0, 150, 300], [150, 0, 300, 300], ... │ │ +│ │ table spanning cell: [0, 100, 300, 150] │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ Alignment │ │ +│ │ • Align row boundaries (left/right edges) │ │ +│ │ • Align column boundaries (top/bottom edges) │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ Grid Construction │ │ +│ │ │ │ +│ │ ┌──────────┬──────────┬──────────┐ │ │ +│ │ │ Header 1 │ Header 2 │ Header 3 │ ← Row 0 (header) │ │ +│ │ ├──────────┴──────────┼──────────┤ │ │ +│ │ │ Spanning Cell │ Cell 3 │ ← Row 1 │ │ +│ │ ├──────────┬──────────┼──────────┤ │ │ +│ │ │ Cell 4 │ Cell 5 │ Cell 6 │ ← Row 2 │ │ +│ │ └──────────┴──────────┴──────────┘ │ │ +│ │ │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ HTML or Descriptive Output │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### 2.3 Alignment Algorithm + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 67-111 + +def __call__(self, images, thr=0.2): + """ + Detect and align table structure. + """ + # Run detection + detections = super().__call__(images, thr) + + for page_dets in detections: + rows = [d for d in page_dets if d["label"] == "table row"] + cols = [d for d in page_dets if d["label"] == "table column"] + + if len(rows) > 4: + # Align row X coordinates (left edges) + x0_values = [r["x0"] for r in rows] + mean_x0 = np.mean(x0_values) + min_x0 = np.min(x0_values) + aligned_x0 = min(mean_x0, min_x0 + 0.05 * (max(x0_values) - min_x0)) + + for r in rows: + r["x0"] = aligned_x0 + + # Align row X coordinates (right edges) + x1_values = [r["x1"] for r in rows] + mean_x1 = np.mean(x1_values) + max_x1 = np.max(x1_values) + aligned_x1 = max(mean_x1, max_x1 - 0.05 * (max_x1 - min(x1_values))) + + for r in rows: + r["x1"] = aligned_x1 + + if len(cols) > 4: + # Similar alignment for column Y coordinates + # ... +``` + +**Tại sao cần alignment?** + +Detection model có thể cho ra boundaries không perfectly aligned: +``` +Before alignment: +Row 1: x0=10, x1=490 +Row 2: x0=12, x1=488 +Row 3: x0=8, x1=492 + +After alignment: +Row 1: x0=10, x1=490 +Row 2: x0=10, x1=490 +Row 3: x0=10, x1=490 +``` + +### 2.4 Grid Construction + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 172-349 + +@staticmethod +def construct_table(boxes, is_english=False, html=True, **kwargs): + """ + Construct 2D table from detected components. + + Args: + boxes: OCR boxes with R (row), C (column), SP (spanning) attributes + is_english: Language hint + html: Output format (HTML or descriptive text) + + Returns: + HTML table string or descriptive text + """ + # Step 1: Extract caption + caption = "" + for box in boxes[:]: + if is_caption(box): + caption = box["text"] + boxes.remove(box) + + # Step 2: Sort by row position (R attribute) + rowh = np.median([b["bottom"] - b["top"] for b in boxes]) + boxes = Recognizer.sort_R_firstly(boxes, rowh / 2) + + # Step 3: Group into rows + rows = [] + current_row = [boxes[0]] + + for box in boxes[1:]: + # Same row if Y difference < row_height/2 + if abs(box["R"] - current_row[-1]["R"]) < rowh / 2: + current_row.append(box) + else: + rows.append(current_row) + current_row = [box] + rows.append(current_row) + + # Step 4: Sort each row by column position (C attribute) + for row in rows: + row.sort(key=lambda x: x["C"]) + + # Step 5: Build 2D table matrix + n_rows = len(rows) + n_cols = max(len(row) for row in rows) + + table = [[None] * n_cols for _ in range(n_rows)] + + for i, row in enumerate(rows): + for j, cell in enumerate(row): + table[i][j] = cell + + # Step 6: Handle spanning cells + table = handle_spanning_cells(table, boxes) + + # Step 7: Generate output + if html: + return generate_html_table(table, caption) + else: + return generate_descriptive_text(table, caption) +``` + +### 2.5 Spanning Cell Handling + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 496-575 + +def __cal_spans(self, boxes, rows, cols): + """ + Calculate colspan and rowspan for merged cells. + + Spanning cell detection: + - "SP" attribute indicates merged cell + - Calculate which rows/cols it covers + """ + for box in boxes: + if "SP" not in box: + continue + + # Find rows this cell spans + box["rowspan"] = [] + for i, row in enumerate(rows): + overlap = self.overlapped_area(box, row) + if overlap > 0.3: # 30% overlap + box["rowspan"].append(i) + + # Find columns this cell spans + box["colspan"] = [] + for j, col in enumerate(cols): + overlap = self.overlapped_area(box, col) + if overlap > 0.3: + box["colspan"].append(j) + + return boxes +``` + +**Example**: +``` +Spanning cell detection: + +┌──────────┬──────────┬──────────┐ +│ Header 1 │ Header 2 │ Header 3 │ +├──────────┴──────────┼──────────┤ +│ Merged Cell │ Cell 3 │ ← SP cell spans columns 0-1 +│ (colspan=2) │ │ +├──────────┬──────────┼──────────┤ +│ Cell 4 │ Cell 5 │ Cell 6 │ +└──────────┴──────────┴──────────┘ + +Detection: +- SP cell bbox: [0, 50, 300, 100] +- Column 0: [0, 0, 150, 200] → overlap 0.5 ✓ +- Column 1: [150, 0, 300, 200] → overlap 0.5 ✓ +- Column 2: [300, 0, 450, 200] → overlap 0.0 ✗ +→ colspan = [0, 1] +``` + +### 2.6 HTML Output Generation + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 352-393 + +def __html_table(table, header_rows, caption): + """ + Generate HTML table from 2D matrix. + """ + html_parts = [""] + + # Add caption if exists + if caption: + html_parts.append(f"") + + for i, row in enumerate(table): + html_parts.append("") + + for j, cell in enumerate(row): + if cell is None: + continue # Skip cells covered by spanning + + # Determine tag (th for header, td for data) + tag = "th" if i in header_rows else "td" + + # Add colspan/rowspan attributes + attrs = [] + if cell.get("colspan") and len(cell["colspan"]) > 1: + attrs.append(f'colspan="{len(cell["colspan"])}"') + if cell.get("rowspan") and len(cell["rowspan"]) > 1: + attrs.append(f'rowspan="{len(cell["rowspan"])}"') + + attr_str = " " + " ".join(attrs) if attrs else "" + + # Add cell content + html_parts.append(f"<{tag}{attr_str}>{cell['text']}") + + html_parts.append("") + + html_parts.append("
{caption}
") + + return "\n".join(html_parts) +``` + +**Output Example**: +```html + + + + + + + + + + + + + + + + +
Table 1: Sales Data
RegionQ1Q2
North America$150K
Europe$100K$120K
+``` + +### 2.7 Descriptive Text Output + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 396-493 + +def __desc_table(table, header_rows, caption): + """ + Generate natural language description of table. + + For RAG, sometimes descriptive text is better than HTML. + """ + descriptions = [] + + # Get headers + headers = [cell["text"] for cell in table[0]] if header_rows else [] + + # Process each data row + for i, row in enumerate(table): + if i in header_rows: + continue + + row_desc = [] + for j, cell in enumerate(row): + if cell is None: + continue + + if headers and j < len(headers): + # "Column Name: Value" format + row_desc.append(f"{headers[j]}: {cell['text']}") + else: + row_desc.append(cell['text']) + + if row_desc: + descriptions.append("; ".join(row_desc)) + + # Add source reference + if caption: + descriptions.append(f'(from "{caption}")') + + return "\n".join(descriptions) +``` + +**Output Example**: +``` +Region: North America; Q1: $100K; Q2: $150K +Region: Europe; Q1: $80K; Q2: $120K +(from "Table 1: Sales Data") +``` + +--- + +## 3. Cell Content Classification + +### 3.1 Block Type Detection + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 121-149 + +@staticmethod +def blockType(text): + """ + Classify cell content type. + + Used for: + - Header detection (non-numeric cells likely headers) + - Data validation + - Smart formatting + """ + patterns = { + "Dt": r"(^[0-9]{4}[-/][0-9]{1,2}|[0-9]{1,2}[-/][0-9]{1,2}[-/][0-9]{2,4}|" + r"[0-9]{1,2}月|[Q][1-4]|[一二三四]季度)", # Date + "Nu": r"^[-+]?[0-9.,%%¥$€£¥]+$", # Number + "Ca": r"^[A-Z0-9]{4,}$", # Code + "En": r"^[a-zA-Z\s]+$", # English + } + + for type_name, pattern in patterns.items(): + if re.search(pattern, text): + return type_name + + # Classify by length + tokens = text.split() + if len(tokens) == 1: + return "Sg" # Single + elif len(tokens) <= 3: + return "Tx" # Short text + elif len(tokens) <= 12: + return "Lx" # Long text + else: + return "Ot" # Other + +# Examples: +# "2023-01-15" → "Dt" (Date) +# "$1,234.56" → "Nu" (Number) +# "ABC123" → "Ca" (Code) +# "Total Revenue" → "En" (English) +# "北京市" → "Tx" (Text) +``` + +### 3.2 Header Detection + +```python +# deepdoc/vision/table_structure_recognizer.py, lines 332-344 + +def detect_headers(table): + """ + Detect which rows are headers based on content type. + + Heuristic: If >50% of cells in a row are non-numeric, + it's likely a header row. + """ + header_rows = set() + + for i, row in enumerate(table): + non_numeric = 0 + total = 0 + + for cell in row: + if cell is None: + continue + total += 1 + if blockType(cell["text"]) != "Nu": + non_numeric += 1 + + if total > 0 and non_numeric / total > 0.5: + header_rows.add(i) + + return header_rows +``` + +--- + +## 4. Integration với PDF Parser + +### 4.1 Table Detection in PDF Pipeline + +```python +# deepdoc/parser/pdf_parser.py, lines 196-281 + +def _table_transformer_job(self, zoomin=3): + """ + Detect and structure tables using TableStructureRecognizer. + """ + # Find table layouts + table_layouts = [ + layout for layout in self.page_layout + if layout["type"] == "Table" + ] + + if not table_layouts: + return + + # Crop table images + table_images = [] + for layout in table_layouts: + x0, y0, x1, y1 = layout["bbox"] + img = self.page_images[layout["page"]][ + int(y0*zoomin):int(y1*zoomin), + int(x0*zoomin):int(x1*zoomin) + ] + table_images.append(img) + + # Run TSR + table_structures = self.tsr(table_images) + + # Match OCR boxes to table structure + for layout, structure in zip(table_layouts, table_structures): + # Get OCR boxes within table region + table_boxes = [ + box for box in self.boxes + if self._box_in_region(box, layout["bbox"]) + ] + + # Assign R, C, SP attributes + for box in table_boxes: + box["R"] = self._find_row(box, structure["rows"]) + box["C"] = self._find_column(box, structure["columns"]) + if self._is_spanning(box, structure["spanning_cells"]): + box["SP"] = True + + # Store for later extraction + self.tb_cpns[layout["id"]] = { + "boxes": table_boxes, + "structure": structure + } +``` + +### 4.2 Table Extraction + +```python +# deepdoc/parser/pdf_parser.py, lines 757-930 + +def _extract_table_figure(self, need_image, ZM, return_html, need_position): + """ + Extract tables and figures from detected layouts. + """ + tables = [] + + for layout_id, table_data in self.tb_cpns.items(): + boxes = table_data["boxes"] + + # Construct table (HTML or descriptive) + if return_html: + content = TableStructureRecognizer.construct_table( + boxes, html=True + ) + else: + content = TableStructureRecognizer.construct_table( + boxes, html=False + ) + + table = { + "content": content, + "bbox": table_data["bbox"], + } + + if need_image: + table["image"] = self._crop_region(table_data["bbox"]) + + tables.append(table) + + return tables +``` + +--- + +## 5. Performance Considerations + +### 5.1 Batch Processing + +```python +# deepdoc/vision/recognizer.py, lines 415-437 + +def __call__(self, image_list, thr=0.7, batch_size=16): + """ + Batch inference for efficiency. + + Why batch_size=16? + - GPU memory optimization + - Balance throughput vs latency + - Typical document has 10-50 elements + """ + results = [] + + for i in range(0, len(image_list), batch_size): + batch = image_list[i:i+batch_size] + + # Preprocess + inputs = self.preprocess(batch) + + # Inference + outputs = self.ort_sess.run(None, inputs) + + # Postprocess + batch_results = self.postprocess(outputs, inputs, thr) + results.extend(batch_results) + + return results +``` + +### 5.2 Model Caching + +```python +# deepdoc/vision/ocr.py, lines 36-73 + +# Global model cache +loaded_models = {} + +def load_model(model_dir, nm, device_id=None): + """ + Load ONNX model with caching. + + Cache key: model_path + device_id + """ + model_path = os.path.join(model_dir, f"{nm}.onnx") + cache_key = f"{model_path}_{device_id}" + + if cache_key in loaded_models: + return loaded_models[cache_key] + + # Load model... + session = ort.InferenceSession(model_path, ...) + + loaded_models[cache_key] = (session, run_opts) + return session, run_opts +``` + +--- + +## 6. Troubleshooting + +### 6.1 Common Issues + +| Issue | Cause | Solution | +|-------|-------|----------| +| Missing table | Low confidence | Lower threshold (0.1-0.2) | +| Wrong colspan | Misaligned detection | Check row/column alignment | +| Merged cells wrong | Overlap threshold | Adjust SP detection threshold | +| Headers not detected | All numeric | Manual header specification | +| Layout overlap | NMS threshold | Increase NMS IoU threshold | + +### 6.2 Debugging + +```python +# Visualize layout detection +from deepdoc.vision.seeit import draw_boxes + +# Draw layout boxes on image +layout_vis = draw_boxes( + page_image, + [(l["bbox"], l["type"]) for l in page_layouts], + colors={ + "Text": (0, 255, 0), + "Table": (255, 0, 0), + "Figure": (0, 0, 255), + } +) +cv2.imwrite("layout_debug.png", layout_vis) + +# Check table structure +for box in table_boxes: + print(f"Text: {box['text']}") + print(f" Row: {box.get('R', 'N/A')}") + print(f" Col: {box.get('C', 'N/A')}") + print(f" Spanning: {box.get('SP', False)}") +``` + +--- + +## 7. References + +- YOLOv10 Paper: [YOLOv10: Real-Time End-to-End Object Detection](https://arxiv.org/abs/2405.14458) +- Table Transformer: [PubTables-1M: Towards comprehensive table extraction](https://arxiv.org/abs/2110.00061) +- Document Layout Analysis: [A Survey](https://arxiv.org/abs/2012.15005) diff --git a/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md new file mode 100644 index 000000000..1885b37f3 --- /dev/null +++ b/personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md @@ -0,0 +1,678 @@ +# OCR Deep Dive + +## Tổng Quan + +Module OCR trong DeepDoc thực hiện 2 task chính: +1. **Text Detection**: Phát hiện vùng chứa text trong image +2. **Text Recognition**: Nhận dạng text trong các vùng đã phát hiện + +## File Structure + +``` +deepdoc/vision/ +├── ocr.py # Main OCR class (752 lines) +├── postprocess.py # CTC decoder, DBNet postprocess (371 lines) +└── operators.py # Image preprocessing (726 lines) +``` + +--- + +## 1. Text Detection (DBNet) + +### 1.1 Model Architecture + +``` +DBNet (Differentiable Binarization Network): + +Input Image (H, W, 3) + │ + ▼ +┌─────────────────────────────────────┐ +│ ResNet-18 Backbone │ +│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐│ +│ │ C1 │→ │ C2 │→ │ C3 │→ │ C4 ││ +│ │64ch │ │128ch│ │256ch│ │512ch││ +│ └─────┘ └─────┘ └─────┘ └─────┘│ +└─────────────────────────────────────┘ + │ │ │ │ + ▼ ▼ ▼ ▼ +┌─────────────────────────────────────┐ +│ Feature Pyramid Network │ +│ Upsample + Concatenate all levels │ +│ Output: 256 channels │ +└─────────────────────────────────────┘ + │ + ├─────────────────┐ + ▼ ▼ +┌─────────────────┐ ┌─────────────────┐ +│ Probability │ │ Threshold │ +│ Head │ │ Head │ +│ Conv → Sigmoid │ │ Conv → Sigmoid │ +└────────┬────────┘ └────────┬────────┘ + │ │ + ▼ ▼ + Prob Map (H, W) Thresh Map (H, W) + │ │ + └─────────┬─────────┘ + ▼ +┌─────────────────────────────────────┐ +│ Differentiable Binarization │ +│ B = sigmoid((P - T) * k) │ +│ k = 50 (amplification factor) │ +└─────────────────────────────────────┘ + │ + ▼ + Binary Map (H, W) +``` + +### 1.2 DBNet Post-processing + +```python +# deepdoc/vision/postprocess.py, lines 41-259 + +class DBPostProcess: + def __init__(self, + thresh=0.3, # Binary threshold + box_thresh=0.5, # Box confidence threshold + max_candidates=1000, # Maximum text regions + unclip_ratio=1.5, # Polygon expansion ratio + use_dilation=False, # Morphological dilation + score_mode="fast"): # fast or slow scoring + + def __call__(self, outs_dict, shape_list): + """ + Post-process DBNet output. + + Args: + outs_dict: {"maps": probability_map} + shape_list: Original image shapes + + Returns: + List of detected text boxes + """ + pred = outs_dict["maps"] # (N, 1, H, W) + + # Step 1: Binary thresholding + bitmap = pred > self.thresh # 0.3 + + # Step 2: Optional dilation + if self.use_dilation: + kernel = np.ones((2, 2)) + bitmap = cv2.dilate(bitmap, kernel) + + # Step 3: Find contours + contours = cv2.findContours( + bitmap.astype(np.uint8), + cv2.RETR_LIST, + cv2.CHAIN_APPROX_SIMPLE + ) + + # Step 4: Process each contour + boxes = [] + for contour in contours[:self.max_candidates]: + # Simplify polygon + epsilon = 0.002 * cv2.arcLength(contour, True) + approx = cv2.approxPolyDP(contour, epsilon, True) + + if len(approx) < 4: + continue + + # Calculate confidence score + score = self.box_score_fast(pred, approx) + if score < self.box_thresh: + continue + + # Unclip (expand) polygon + box = self.unclip(approx, self.unclip_ratio) + boxes.append(box) + + return boxes +``` + +### 1.3 Unclipping Algorithm + +**Vấn đề**: DBNet tends to predict tight boundaries → misses edge characters + +**Giải pháp**: Expand detected polygon by unclip_ratio + +```python +# deepdoc/vision/postprocess.py, lines 163-169 + +def unclip(self, box, unclip_ratio): + """ + Expand polygon using Clipper library. + + Công thức: + distance = Area * unclip_ratio / Perimeter + + Với unclip_ratio = 1.5: + - Nhỏ polygon → expand nhiều hơn + - Lớn polygon → expand ít hơn (proportional) + """ + poly = Polygon(box) + distance = poly.area * unclip_ratio / poly.length + + offset = pyclipper.PyclipperOffset() + offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON) + + expanded = offset.Execute(distance) + return np.array(expanded[0]) +``` + +**Visualization**: +``` +Original detection: After unclip (1.5x): +┌──────────────┐ ┌────────────────────┐ +│ Hello │ → │ Hello │ +└──────────────┘ └────────────────────┘ + (expanded boundaries) +``` + +--- + +## 2. Text Recognition (CRNN) + +### 2.1 Model Architecture + +``` +CRNN (Convolutional Recurrent Neural Network): + +Input: Cropped text image (3, 48, W) + │ + ▼ +┌─────────────────────────────────────┐ +│ CNN Backbone │ +│ VGG-style convolutions │ +│ 7 conv layers + 4 max pooling │ +│ Output: (512, 1, W/4) │ +└────────────────┬────────────────────┘ + │ + ▼ +┌─────────────────────────────────────┐ +│ Sequence Reshaping │ +│ Collapse height dimension │ +│ Output: (W/4, 512) │ +└────────────────┬────────────────────┘ + │ + ▼ +┌─────────────────────────────────────┐ +│ Bidirectional LSTM │ +│ 2 layers, 256 hidden units │ +│ Output: (W/4, 512) │ +└────────────────┬────────────────────┘ + │ + ▼ +┌─────────────────────────────────────┐ +│ Classification Head │ +│ Linear(512 → num_classes) │ +│ Output: (W/4, num_classes) │ +└────────────────┬────────────────────┘ + │ + ▼ + Probability Matrix (T, C) + T = time steps, C = characters +``` + +### 2.2 CTC Decoding + +```python +# deepdoc/vision/postprocess.py, lines 347-370 + +class CTCLabelDecode(BaseRecLabelDecode): + """ + CTC (Connectionist Temporal Classification) Decoder. + + CTC giải quyết vấn đề: + - Model output có T time steps + - Ground truth có N characters + - T > N (nhiều frame cho 1 ký tự) + - Không biết alignment chính xác + + CTC thêm special "blank" token (ε): + - Represents "no output" + - Allows alignment without explicit segmentation + """ + + def __init__(self, character_dict_path, use_space_char=False): + super().__init__(character_dict_path, use_space_char) + # Prepend blank token at index 0 + self.character = ['blank'] + self.character + + def __call__(self, preds, label=None): + """ + Decode CTC output. + + Args: + preds: (batch, time, num_classes) probability matrix + + Returns: + [(text, confidence), ...] + """ + # Get most probable character at each time step + preds_idx = preds.argmax(axis=2) # (batch, time) + preds_prob = preds.max(axis=2) # (batch, time) + + # Decode with deduplication + result = self.decode(preds_idx, preds_prob, is_remove_duplicate=True) + + return result + + def decode(self, text_index, text_prob, is_remove_duplicate=True): + """ + CTC decoding algorithm. + + Example: + Raw output: [a, a, ε, l, l, ε, p, h, a] + After dedup: [a, ε, l, ε, p, h, a] + Remove blank: [a, l, p, h, a] + Final: "alpha" + """ + result = [] + + for batch_idx in range(len(text_index)): + char_list = [] + conf_list = [] + + for idx in range(len(text_index[batch_idx])): + char_idx = text_index[batch_idx][idx] + + # Skip blank token (index 0) + if char_idx == 0: + continue + + # Skip consecutive duplicates + if is_remove_duplicate: + if idx > 0 and char_idx == text_index[batch_idx][idx-1]: + continue + + char_list.append(self.character[char_idx]) + conf_list.append(text_prob[batch_idx][idx]) + + text = ''.join(char_list) + conf = np.mean(conf_list) if conf_list else 0.0 + + result.append((text, conf)) + + return result +``` + +### 2.3 Aspect Ratio Handling + +```python +# deepdoc/vision/ocr.py, lines 146-170 + +def resize_norm_img(self, img, max_wh_ratio): + """ + Resize image maintaining aspect ratio. + + Vấn đề: Text images có width khác nhau + - "Hi" → narrow + - "Hello World" → wide + + Giải pháp: Resize theo aspect ratio, pad right side + """ + imgC, imgH, imgW = self.rec_image_shape # [3, 48, 320] + + # Calculate target width from aspect ratio + max_width = int(imgH * max_wh_ratio) + max_width = min(max_width, imgW) # Cap at 320 + + h, w = img.shape[:2] + ratio = w / float(h) + + # Resize maintaining aspect ratio + if ratio * imgH > max_width: + resized_w = max_width + else: + resized_w = int(ratio * imgH) + + resized_img = cv2.resize(img, (resized_w, imgH)) + + # Pad right side to max_width + padded = np.zeros((imgH, max_width, 3), dtype=np.float32) + padded[:, :resized_w, :] = resized_img + + # Normalize: [0, 255] → [-1, 1] + padded = (padded / 255.0 - 0.5) / 0.5 + + # Transpose: HWC → CHW + padded = padded.transpose(2, 0, 1) + + return padded +``` + +**Visualization**: +``` +Original images: +┌──────┐ ┌────────────────┐ ┌──────────────────────┐ +│ Hi │ │ Hello │ │ Hello World │ +└──────┘ └────────────────┘ └──────────────────────┘ + narrow medium wide + +After resize + pad (to width 320): +┌──────────────────────────────────────────────────────┐ +│ Hi │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ +├──────────────────────────────────────────────────────┤ +│ Hello │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ +├──────────────────────────────────────────────────────┤ +│ Hello World │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ +└──────────────────────────────────────────────────────┘ +(░ = zero padding) +``` + +--- + +## 3. Full OCR Pipeline + +### 3.1 OCR Class + +```python +# deepdoc/vision/ocr.py, lines 536-752 + +class OCR: + """ + End-to-end OCR pipeline. + + Usage: + ocr = OCR() + results = ocr(image) + # results: [(box_points, (text, confidence)), ...] + """ + + def __init__(self, model_dir=None): + # Auto-download models if not found + if model_dir is None: + model_dir = self._get_model_dir() + + # Initialize detector and recognizer + self.text_detector = TextDetector(model_dir) + self.text_recognizer = TextRecognizer(model_dir) + + def __call__(self, img, device_id=0, cls=True): + """ + Full OCR pipeline. + + Args: + img: numpy array (H, W, 3) in BGR + device_id: GPU device ID + cls: Whether to check text orientation + + Returns: + [(box_4pts, (text, confidence)), ...] + """ + # Step 1: Detect text regions + dt_boxes, det_time = self.text_detector(img) + + if dt_boxes is None or len(dt_boxes) == 0: + return [] + + # Step 2: Sort boxes by reading order + dt_boxes = self.sorted_boxes(dt_boxes) + + # Step 3: Crop and rotate each text region + img_crop_list = [] + for box in dt_boxes: + tmp_box = self.get_rotate_crop_image(img, box) + img_crop_list.append(tmp_box) + + # Step 4: Recognize text + rec_res, rec_time = self.text_recognizer(img_crop_list) + + # Step 5: Filter by confidence + results = [] + for box, rec in zip(dt_boxes, rec_res): + text, score = rec + if score >= 0.5: # drop_score threshold + results.append((box, (text, score))) + + return results +``` + +### 3.2 Rotation Detection + +```python +# deepdoc/vision/ocr.py, lines 584-638 + +def get_rotate_crop_image(self, img, points): + """ + Crop text region with automatic rotation detection. + + Vấn đề: Text có thể xoay 90° hoặc 270° + Giải pháp: Try multiple orientations, pick best CTC score + """ + # Order points: top-left → top-right → bottom-right → bottom-left + rect = self.order_points_clockwise(points) + + # Perspective transform to get rectangular crop + width = int(max( + np.linalg.norm(rect[0] - rect[1]), + np.linalg.norm(rect[2] - rect[3]) + )) + height = int(max( + np.linalg.norm(rect[0] - rect[3]), + np.linalg.norm(rect[1] - rect[2]) + )) + + dst = np.array([ + [0, 0], + [width, 0], + [width, height], + [0, height] + ], dtype=np.float32) + + M = cv2.getPerspectiveTransform(rect, dst) + warped = cv2.warpPerspective(img, M, (width, height)) + + # Check if text is vertical (need rotation) + if warped.shape[0] / warped.shape[1] >= 1.5: + # Try 3 orientations + orientations = [ + (warped, 0), # Original + (cv2.rotate(warped, cv2.ROTATE_90_CLOCKWISE), 90), + (cv2.rotate(warped, cv2.ROTATE_90_COUNTERCLOCKWISE), -90) + ] + + best_score = -1 + best_img = warped + + for rot_img, angle in orientations: + # Quick recognition to get confidence + _, score = self.text_recognizer([rot_img])[0] + if score > best_score: + best_score = score + best_img = rot_img + + warped = best_img + + return warped +``` + +### 3.3 Reading Order Sorting + +```python +# deepdoc/vision/ocr.py, lines 640-661 + +def sorted_boxes(self, dt_boxes): + """ + Sort boxes by reading order (top-to-bottom, left-to-right). + + Algorithm: + 1. Sort by Y coordinate (top of box) + 2. Within same "row" (Y within 10px), sort by X coordinate + """ + num_boxes = len(dt_boxes) + sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0])) + + # Group into rows and sort each row + _boxes = list(sorted_boxes) + + for i in range(num_boxes - 1): + for j in range(i, -1, -1): + # If boxes are on same row (Y difference < 10) + if abs(_boxes[j+1][0][1] - _boxes[j][0][1]) < 10: + # Sort by X coordinate + if _boxes[j+1][0][0] < _boxes[j][0][0]: + _boxes[j], _boxes[j+1] = _boxes[j+1], _boxes[j] + else: + break + + return _boxes +``` + +--- + +## 4. Performance Optimization + +### 4.1 GPU Memory Management + +```python +# deepdoc/vision/ocr.py, lines 96-127 + +def load_model(model_dir, nm, device_id=None): + """ + Load ONNX model with optimized settings. + """ + options = ort.SessionOptions() + + # Reduce memory fragmentation + options.enable_cpu_mem_arena = False + + # Sequential execution (more predictable memory) + options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL + + # Limit thread usage + options.intra_op_num_threads = 2 + options.inter_op_num_threads = 2 + + # GPU configuration + if torch.cuda.is_available() and device_id is not None: + providers = [ + ('CUDAExecutionProvider', { + 'device_id': device_id, + # Limit GPU memory to 2GB + 'gpu_mem_limit': int(os.getenv('OCR_GPU_MEM_LIMIT_MB', 2048)) * 1024 * 1024, + # Memory allocation strategy + 'arena_extend_strategy': os.getenv('OCR_ARENA_EXTEND_STRATEGY', 'kNextPowerOfTwo'), + }) + ] + else: + providers = ['CPUExecutionProvider'] + + session = ort.InferenceSession(model_path, options, providers) + + # Run options for memory cleanup after each run + run_opts = ort.RunOptions() + run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0") + + return session, run_opts +``` + +### 4.2 Batch Processing Optimization + +```python +# deepdoc/vision/ocr.py, lines 363-408 + +def __call__(self, img_list): + """ + Optimized batch recognition. + """ + # Sort images by aspect ratio for efficient batching + # Similar widths → less padding waste + indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list]) + + results = [None] * len(img_list) + + for batch_start in range(0, len(indices), self.batch_size): + batch_indices = indices[batch_start:batch_start + self.batch_size] + + # Get max width in batch for padding + max_wh_ratio = max(img_list[i].shape[1]/img_list[i].shape[0] + for i in batch_indices) + + # Normalize all images to same width + norm_imgs = [] + for i in batch_indices: + norm_img = self.resize_norm_img(img_list[i], max_wh_ratio) + norm_imgs.append(norm_img) + + # Stack into batch + batch = np.stack(norm_imgs) + + # Run inference + preds = self.ort_sess.run(None, {"input": batch}) + + # Decode results + texts = self.postprocess_op(preds[0]) + + # Map back to original indices + for j, idx in enumerate(batch_indices): + results[idx] = texts[j] + + return results +``` + +### 4.3 Multi-GPU Parallel Processing + +```python +# deepdoc/vision/ocr.py, lines 556-579 + +class OCR: + def __init__(self, model_dir=None): + if settings.PARALLEL_DEVICES > 0: + # Create per-GPU instances + self.text_detector = [ + TextDetector(model_dir, device_id) + for device_id in range(settings.PARALLEL_DEVICES) + ] + self.text_recognizer = [ + TextRecognizer(model_dir, device_id) + for device_id in range(settings.PARALLEL_DEVICES) + ] + else: + # Single instance for CPU/single GPU + self.text_detector = TextDetector(model_dir) + self.text_recognizer = TextRecognizer(model_dir) +``` + +--- + +## 5. Troubleshooting + +### 5.1 Common Issues + +| Issue | Cause | Solution | +|-------|-------|----------| +| Low accuracy | Low resolution input | Increase zoomin factor (3-5) | +| Slow inference | Large images | Resize to max 960px | +| Memory error | Too many candidates | Reduce max_candidates | +| Missing text | Tight boundaries | Increase unclip_ratio | +| Wrong orientation | Vertical text | Enable rotation detection | + +### 5.2 Debugging Tips + +```python +# Enable verbose logging +import logging +logging.basicConfig(level=logging.DEBUG) + +# Visualize detections +from deepdoc.vision.seeit import draw_boxes + +img_with_boxes = draw_boxes(img, dt_boxes) +cv2.imwrite("debug_detection.png", img_with_boxes) + +# Check confidence scores +for box, (text, conf) in results: + print(f"Text: {text}, Confidence: {conf:.2f}") + if conf < 0.5: + print(" ⚠️ Low confidence!") +``` + +--- + +## 6. References + +- DBNet Paper: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947) +- CRNN Paper: [An End-to-End Trainable Neural Network for Image-based Sequence Recognition](https://arxiv.org/abs/1507.05717) +- CTC Paper: [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf) +- PaddleOCR: [GitHub](https://github.com/PaddlePaddle/PaddleOCR)