DeepDoc Module - An In-Depth Reading Guide

Table of Contents

  1. The Big Picture
  2. Data Flow
  3. Detailed Code Analysis
  4. Technical Explanations
  5. Design Rationale
  6. Difficult Terminology
  7. Extending From the Code

1. The Big Picture

1.1 What Problem Does DeepDoc Solve?

The core problem: when building a RAG (Retrieval-Augmented Generation) system, you need to convert documents (PDF, Word, Excel...) into structured text in order to:

  • Run semantic search (vector search)
  • Chunk the content sensibly
  • Preserve the context of tables and figures

What is DeepDoc? A specialized Python module that performs:

Document Files → Structured Text + Tables + Figures
(PDF, DOCX...)   (with position, layout type, reading order)

1.2 Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              DEEPDOC MODULE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                         PARSER LAYER                                 │   │
│  │  Converts file formats into structured text                          │   │
│  │                                                                      │   │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │   │
│  │  │   PDF    │ │   DOCX   │ │  Excel   │ │   HTML   │ │ Markdown │  │   │
│  │  │  Parser  │ │  Parser  │ │  Parser  │ │  Parser  │ │  Parser  │  │   │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘  │   │
│  │       │            │            │            │            │         │   │
│  └───────┼────────────┼────────────┼────────────┼────────────┼─────────┘   │
│          │            │            │            │            │              │
│          │            └────────────┴────────────┴────────────┘              │
│          │                         │                                        │
│          │              Text-based parsing                                  │
│          │              (pdfplumber, python-docx, openpyxl...)             │
│          │                                                                  │
│          ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                         VISION LAYER                                 │   │
│  │  Computer vision for complex PDFs (scanned, multi-column)            │   │
│  │                                                                      │   │
│  │  ┌──────────────┐  ┌──────────────────┐  ┌────────────────────┐    │   │
│  │  │     OCR      │  │ Layout Recognizer│  │ Table Structure    │    │   │
│  │  │  Detection + │  │    (YOLOv10)     │  │   Recognizer       │    │   │
│  │  │  Recognition │  │                  │  │                    │    │   │
│  │  └──────────────┘  └──────────────────┘  └────────────────────┘    │   │
│  │         │                   │                      │                │   │
│  │         └───────────────────┴──────────────────────┘                │   │
│  │                             │                                        │   │
│  │                    ONNX Runtime Inference                            │   │
│  │                                                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

1.3 Main Components

Component        File                                                       Purpose
PDF Parser       parser/pdf_parser.py                                       Most complex parser - handles PDFs with OCR + layout
Office Parsers   parser/docx_parser.py, excel_parser.py, ppt_parser.py      Handles Microsoft Office files
Web Parsers      parser/html_parser.py, markdown_parser.py, json_parser.py  Handles web/markup files
OCR Engine       vision/ocr.py                                              Text detection + recognition
Layout Detector  vision/layout_recognizer.py                                Classifies regions (text, table, figure...)
Table Detector   vision/table_structure_recognizer.py                       Recognizes table structure
Operators        vision/operators.py                                        Image preprocessing pipeline

1.4 Why DeepDoc?

Without DeepDoc (naive approach):

# Only extract raw text from the PDF
text = pdfplumber.open("doc.pdf").pages[0].extract_text()
# Result: "Header Footer Table content mixed together..."
# ❌ Structure lost, tables turn into scrambled text

With DeepDoc:

parser = RAGFlowPdfParser()
docs, tables = parser("doc.pdf")
# docs: [("Paragraph 1", "page_0_pos_100_200"), ("Paragraph 2", "page_0_pos_300_400")]
# tables: [{"html": "<table>...</table>", "bbox": [...]}]
# ✅ Structure preserved, tables parsed separately

2. Data Flow

2.1 Main Flow: PDF Processing

┌────────────────────────────────────────────────────────────────────────────┐
│                        PDF PROCESSING PIPELINE                              │
└────────────────────────────────────────────────────────────────────────────┘

Input: PDF File (path or bytes)
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 1: IMAGE EXTRACTION                                                    │
│  File: pdf_parser.py, __images__() (lines 1042-1159)                        │
│                                                                              │
│  • Convert PDF pages → numpy images (using pdfplumber)                      │
│  • Extract native PDF characters (text layer)                               │
│  • Zoom factor: 3x (default) for OCR accuracy                               │
│                                                                              │
│  Output: page_images[], page_chars[]                                        │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 2: OCR DETECTION & RECOGNITION                                         │
│  File: vision/ocr.py, OCR.__call__() (lines 708-751)                        │
│                                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │ TextDetector │ →  │   Crop &     │ →  │TextRecognizer│                   │
│  │   (DBNet)    │    │   Rotate     │    │   (CRNN)     │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                              │
│  • Detect text regions → bounding boxes                                     │
│  • Crop each region, auto-rotate if needed                                  │
│  • Recognize text in each region                                            │
│                                                                              │
│  Output: boxes[] with {text, confidence, coordinates}                       │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 3: LAYOUT RECOGNITION                                                  │
│  File: vision/layout_recognizer.py, __call__() (lines 63-157)               │
│                                                                              │
│  • Run YOLOv10 model on page image                                          │
│  • Detect 10 layout types: Text, Title, Table, Figure, etc.                 │
│  • Match OCR boxes to layout regions                                        │
│                                                                              │
│  Output: boxes[] with added {layout_type, layoutno}                         │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 4: COLUMN DETECTION                                                    │
│  File: pdf_parser.py, _assign_column() (lines 355-440)                      │
│                                                                              │
│  • K-Means clustering on X coordinates                                      │
│  • Silhouette score to find optimal k (1-4 columns)                         │
│  • Assign col_id to each text box                                           │
│                                                                              │
│  Output: boxes[] with added {col_id}                                        │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 5: TABLE STRUCTURE RECOGNITION                                         │
│  File: vision/table_structure_recognizer.py, __call__() (lines 67-111)      │
│                                                                              │
│  • Detect rows, columns, headers, spanning cells                            │
│  • Match text boxes to table cells                                          │
│  • Build 2D table matrix                                                    │
│                                                                              │
│  Output: table_components[] with grid structure                             │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 6: TEXT MERGING                                                        │
│  File: pdf_parser.py, _text_merge() (lines 442-478)                         │
│                           _naive_vertical_merge() (lines 480-556)           │
│                                                                              │
│  • Horizontal merge: same line, same column, same layout                    │
│  • Vertical merge: adjacent paragraphs with semantic checks                 │
│  • Respect sentence boundaries (。?!)                                      │
│                                                                              │
│  Output: merged_boxes[] (fewer, larger text blocks)                         │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 7: FILTERING & CLEANUP                                                 │
│  File: pdf_parser.py, _filter_forpages() (lines 685-729)                    │
│                        __filterout_scraps() (lines 971-1029)                │
│                                                                              │
│  • Remove headers/footers (top/bottom 10% of page)                          │
│  • Remove table of contents                                                 │
│  • Filter low-quality OCR results                                           │
│                                                                              │
│  Output: clean_boxes[]                                                      │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 8: EXTRACT TABLES & FIGURES                                            │
│  File: pdf_parser.py, _extract_table_figure() (lines 757-930)               │
│                                                                              │
│  • Convert table boxes to HTML/descriptive text                             │
│  • Extract figure images with captions                                      │
│  • Handle spanning cells (colspan, rowspan)                                 │
│                                                                              │
│  Output: tables[], figures[]                                                │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  FINAL OUTPUT                                                                │
│                                                                              │
│  documents: [(text, position_tag), ...]                                     │
│  tables: [{"html": "...", "bbox": [...], "image": ...}, ...]               │
│                                                                              │
│  position_tag format: "page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}"        │
└─────────────────────────────────────────────────────────────────────────────┘

2.2 OCR Flow in Detail

                           Input Image (H, W, 3)
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        TEXT DETECTION (DBNet)                                │
│  File: vision/ocr.py, TextDetector.__call__() (lines 503-530)               │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
           ┌────────────────────────┼────────────────────────┐
           │                        │                        │
           ▼                        ▼                        ▼
    ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
    │ Preprocess  │          │    ONNX     │          │ Postprocess │
    │             │          │  Inference  │          │             │
    │ • Resize    │    →     │             │    →     │ • Threshold │
    │ • Normalize │          │  DBNet      │          │ • Contours  │
    │ • Transpose │          │  Model      │          │ • Unclip    │
    └─────────────┘          └─────────────┘          └─────────────┘
                                    │
                                    ▼
                         Text Region Polygons
                         [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      TEXT RECOGNITION (CRNN)                                 │
│  File: vision/ocr.py, TextRecognizer.__call__() (lines 363-408)             │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
           ┌────────────────────────┼────────────────────────┐
           │                        │                        │
           ▼                        ▼                        ▼
    ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
    │    Crop     │          │    ONNX     │          │ CTC Decode  │
    │   Rotate    │          │  Inference  │          │             │
    │             │    →     │             │    →     │ • Argmax    │
    │ Perspective │          │   CRNN      │          │ • Dedup     │
    │ Transform   │          │   Model     │          │ • Remove ε  │
    └─────────────┘          └─────────────┘          └─────────────┘
                                    │
                                    ▼
                    Output: [(box, (text, confidence)), ...]

2.3 Layout Recognition Flow

                           Input: Page Image + OCR Results
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                     LAYOUT DETECTION (YOLOv10)                               │
│  File: vision/layout_recognizer.py (lines 163-237)                          │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
           ┌────────────────────────────┼────────────────────────────┐
           │                            │                            │
           ▼                            ▼                            ▼
    ┌─────────────┐              ┌─────────────┐              ┌─────────────┐
    │ Preprocess  │              │    ONNX     │              │ Postprocess │
    │             │              │  Inference  │              │             │
    │ • Resize    │      →       │             │      →       │ • NMS       │
    │   (640x640) │              │  YOLOv10    │              │ • Filter    │
    │ • Pad       │              │   Model     │              │ • Scale     │
    │ • Normalize │              │             │              │   back      │
    └─────────────┘              └─────────────┘              └─────────────┘
                                        │
                                        ▼
                              Layout Detections:
                              [{"type": "Table", "bbox": [...], "score": 0.95}]
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                     OCR-LAYOUT ASSOCIATION                                   │
│  File: vision/layout_recognizer.py (lines 98-147)                           │
│                                                                              │
│  For each OCR box:                                                          │
│    • Find overlapping layout region (threshold: 40%)                        │
│    • Assign layout_type to OCR box                                          │
│    • Filter garbage (headers/footers/page numbers)                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
                    Output: OCR boxes with layout_type attribute
                    [{"text": "...", "layout_type": "Text", "layoutno": 1}]
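The association step above can be sketched in a few lines: each OCR box is assigned the layout region that covers at least 40% of its area. The box/region dict format and helper names here are illustrative, not the actual RAGFlow API.

```python
def overlap_ratio(ocr, region):
    """Fraction of the OCR box's area covered by the layout region."""
    ix = max(0, min(ocr["x1"], region["x1"]) - max(ocr["x0"], region["x0"]))
    iy = max(0, min(ocr["y1"], region["y1"]) - max(ocr["y0"], region["y0"]))
    area = (ocr["x1"] - ocr["x0"]) * (ocr["y1"] - ocr["y0"])
    return (ix * iy) / area if area > 0 else 0.0

def assign_layout(ocr_boxes, layout_regions, threshold=0.4):
    for ocr in ocr_boxes:
        # Pick the region with the largest overlap; require 40% coverage
        best = max(layout_regions, key=lambda r: overlap_ratio(ocr, r), default=None)
        if best and overlap_ratio(ocr, best) >= threshold:
            ocr["layout_type"] = best["type"]
        else:
            ocr["layout_type"] = ""  # unclaimed box: garbage candidate
    return ocr_boxes
```

Boxes that no region claims (headers, footers, stray page numbers) end up with an empty layout type and can be filtered downstream.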

2.4 Data Flow Summary

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  PDF File   │ →   │   Images    │ →   │ OCR Boxes   │ →   │  Merged     │
│             │     │ + Chars     │     │ + Layout    │     │  Documents  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                              │
                                              ▼
                                        ┌─────────────┐
                                        │   Tables    │
                                        │  (HTML/Desc)│
                                        └─────────────┘

Input Format:
- File path: str (e.g., "/path/to/doc.pdf")
- Or bytes: bytes (raw PDF content)

Output Format:
- documents: List[Tuple[str, str]]
  - text: Extracted text content
  - position_tag: "page_0_x0_100_y0_200_x1_500_y1_250"

- tables: List[Dict]
  - html: "<table>...</table>"
  - bbox: [x0, y0, x1, y1]
  - image: numpy array (optional)
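The position tag is just a flat encoding of a (page, bbox) pair. A small helper sketch for the round trip; the exact tag layout in RAGFlow may differ slightly from this illustration:

```python
def make_position_tag(page, x0, y0, x1, y1):
    # "page_0_x0_100_y0_200_x1_500_y1_250"
    return f"page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}"

def parse_position_tag(tag):
    parts = tag.split("_")
    # Every other token is a value: ["page", "0", "x0", "100", ...]
    values = [int(v) for v in parts[1::2]]
    return dict(zip(["page", "x0", "y0", "x1", "y1"], values))
```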

3. Detailed Code Analysis

3.1 RAGFlowPdfParser Class

File: /deepdoc/parser/pdf_parser.py Lines: 52-1479

3.1.1 Constructor (__init__)

# Line 52-104
class RAGFlowPdfParser:
    def __init__(self, **kwargs):
        # Load OCR model
        self.ocr = OCR()  # vision/ocr.py

        # Load Layout Recognizer (YOLOv10)
        self.layout_recognizer = LayoutRecognizer()  # vision/layout_recognizer.py

        # Load Table Structure Recognizer
        self.tsr = TableStructureRecognizer()  # vision/table_structure_recognizer.py

        # Load XGBoost model for text concatenation
        try:
            self.updown_cnt_mdl = xgb.Booster()
            model_path = os.path.join(get_project_base_directory(),
                                      "rag/res/deepdoc/updown_concat_xgb.model")
            self.updown_cnt_mdl.load_model(model_path)
        except Exception:
            # Model is optional: fall back to rule-based merge decisions
            self.updown_cnt_mdl = None

Explanation:

  • The constructor initializes 4 models:
    1. OCR: text detection + recognition
    2. LayoutRecognizer: classifies layout regions (YOLOv10)
    3. TableStructureRecognizer: recognizes table structure
    4. XGBoost: decides whether to merge text blocks (31 features)

3.1.2 Main Entry Point (__call__)

# Lines 1160-1168
def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
    """
    Main entry point for PDF parsing.

    Args:
        fnm: File path or bytes
        need_image: Whether to extract images
        zoomin: Zoom factor for OCR (default 3x)
        return_html: Return HTML tables instead of descriptive text

    Returns:
        (documents, tables) tuple
    """
    self.__images__(fnm, zoomin)           # Step 1: Load images
    self._layouts_rec(zoomin)              # Step 2-3: OCR + Layout
    self._table_transformer_job(zoomin)    # Step 4: Table structure
    self._text_merge(zoomin)               # Step 5: Merge text
    self._filter_forpages()                # Step 6: Filter
    tbls = self._extract_table_figure(...) # Step 7: Extract tables
    return self._final_result(), tbls      # Final output

Why zoomin=3?

  • OCR accuracy improves significantly on larger images
  • 3x balances accuracy against memory/speed
  • Too large (5x+) → memory issues; too small (1x) → OCR errors

3.1.3 Image Loading (__images__)

# Lines 1042-1159
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    """
    Load PDF pages as images and extract native characters.
    """
    self.page_images = []
    self.page_chars = []

    # Open PDF with pdfplumber
    with pdfplumber.open(fnm) as pdf:
        for i, page in enumerate(pdf.pages[page_from:page_to]):
            # Convert page to image
            img = page.to_image(resolution=72 * zoomin)
            img = np.array(img.original)
            self.page_images.append(img)

            # Extract native PDF characters
            chars = page.chars
            self.page_chars.append(chars)

Why pdfplumber?

  • Supports both text extraction and image conversion
  • Preserves character-level coordinates
  • Handles complex PDFs well

3.1.4 Column Detection (_assign_column)

# Lines 355-440
def _assign_column(self, boxes, zoomin=3):
    """
    Detect columns using K-Means clustering on X coordinates.
    """
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Extract X coordinates
    x_coords = np.array([[b["x0"]] for b in boxes])

    best_k = 1
    best_score = -1

    # Try k from 1 to 4
    for k in range(1, min(5, len(boxes))):
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        labels = km.fit_predict(x_coords)

        if k > 1:
            score = silhouette_score(x_coords, labels)
            if score > best_score:
                best_score = score
                best_k = k

    # Final clustering with best k
    km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
    labels = km.fit_predict(x_coords)

    # Assign column IDs
    for i, box in enumerate(boxes):
        box["col_id"] = labels[i]

Why K-Means?

  • Unsupervised: no training data required
  • Fast: O(n * k * iterations)
  • The silhouette score picks the number of columns automatically
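The core idea is easy to see without sklearn: cluster the x-coordinates of the boxes in 1-D. Below is a dependency-free sketch using a tiny Lloyd's iteration with a fixed k=2 (the real code uses sklearn's KMeans plus a silhouette score to choose k); it shows how two text columns separate cleanly.

```python
def kmeans_1d(xs, k, iters=20):
    # Initialize centers spread across the value range
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)] if k > 1 else [lo]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        # Recompute each center as the mean of its points
        for c in range(k):
            pts = [x for x, lbl in zip(xs, labels) if lbl == c]
            if pts:
                centers[c] = sum(pts) / len(pts)
    return labels, centers

# Left-column boxes start near x=50, right-column boxes near x=400
xs = [48, 52, 50, 398, 402, 400]
labels, centers = kmeans_1d(xs, k=2)
```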

3.2 OCR Class

File: /deepdoc/vision/ocr.py Lines: 536-752

3.2.1 Text Detection (TextDetector)

# Lines 414-534
class TextDetector:
    def __init__(self, model_dir, device_id=None):
        # Preprocessing pipeline
        self.preprocess_op = [
            DetResizeForTest(limit_side_len=960, limit_type='max'),
            NormalizeImage(mean=[0.485, 0.456, 0.406],
                          std=[0.229, 0.224, 0.225]),
            ToCHWImage(),
            KeepKeys(keep_keys=['image', 'shape'])
        ]

        # Postprocessing
        self.postprocess_op = DBPostProcess(
            thresh=0.3,           # Binary threshold
            box_thresh=0.5,       # Box confidence threshold
            max_candidates=1000,  # Max text regions
            unclip_ratio=1.5      # Box expansion ratio
        )

        # Load ONNX model
        self.ort_sess, self.run_opts = load_model(model_dir, "det", device_id)

DBNet (Differentiable Binarization):

  • Input: Image → Probability map (text regions)
  • Thresholding: prob > 0.3 → foreground
  • Unclipping: expand boxes by 1.5x to capture the full text
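The "unclip" step exists because DB's probability map marks a shrunken text kernel; the detected polygon must be expanded to cover the full glyphs. The DB paper computes the offset distance as d = area * unclip_ratio / perimeter (applied to arbitrary polygons with pyclipper in the real code). For an axis-aligned rectangle the effect is easy to see:

```python
def unclip_rect(x0, y0, x1, y1, unclip_ratio=1.5):
    w, h = x1 - x0, y1 - y0
    area, perimeter = w * h, 2 * (w + h)
    d = area * unclip_ratio / perimeter  # expansion distance per side
    return (x0 - d, y0 - d, x1 + d, y1 + d)

box = unclip_rect(100, 100, 200, 120)  # a 100x20 text kernel
```

Here d = 2000 * 1.5 / 240 = 12.5 pixels, so a thin kernel grows enough to cover ascenders and descenders.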

3.2.2 Text Recognition (TextRecognizer)

# Lines 133-412
class TextRecognizer:
    def __init__(self, model_dir, device_id=None):
        self.rec_image_shape = [3, 48, 320]  # C, H, W
        self.batch_size = 16

        # Load CRNN model
        self.ort_sess, self.run_opts = load_model(model_dir, "rec", device_id)

        # CTC decoder
        self.postprocess_op = CTCLabelDecode(character_dict_path=dict_path)

    def __call__(self, img_list):
        # Sort by aspect ratio for efficient batching
        indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list])

        results = []
        for start in range(0, len(indices), self.batch_size):
            batch = indices[start:start + self.batch_size]
            # Normalize images
            norm_imgs = [self.resize_norm_img(img_list[i]) for i in batch]

            # Run inference
            preds = self.ort_sess.run(None, {"input": np.stack(norm_imgs)})

            # CTC decode
            texts = self.postprocess_op(preds[0])
            results.extend(texts)

        return results

CRNN + CTC:

  • CNN: Extract visual features
  • RNN: Sequence modeling
  • CTC: Alignment-free decoding (handles variable-length text)

3.2.3 Rotation Handling

# Lines 584-638
def get_rotate_crop_image(self, img, points):
    """
    Crop text region with auto-rotation detection.
    """
    # Compute the target size from the quadrilateral, then warp it flat
    rect = self.order_points_clockwise(points).astype(np.float32)
    width = int(max(np.linalg.norm(rect[0] - rect[1]), np.linalg.norm(rect[2] - rect[3])))
    height = int(max(np.linalg.norm(rect[0] - rect[3]), np.linalg.norm(rect[1] - rect[2])))
    dst_pts = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
    M = cv2.getPerspectiveTransform(rect, dst_pts)
    warped = cv2.warpPerspective(img, M, (width, height))

    # Check if text is vertical (height > 1.5 * width)
    if warped.shape[0] / warped.shape[1] >= 1.5:
        # Try 3 orientations
        scores = []
        for angle in [0, 90, -90]:
            rotated = self.rotate(warped, angle)
            _, conf = self.recognizer([rotated])[0]
            scores.append(conf)

        # Use orientation with highest confidence
        best_angle = [0, 90, -90][np.argmax(scores)]
        warped = self.rotate(warped, best_angle)

    return warped

Why auto-rotation?

  • PDFs may contain text rotated 90°
  • The OCR model is trained on horizontal text
  • Auto-detection makes vertical text recognition accurate
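The orientation vote reduces to a small decision rule: a crop that is much taller than wide is recognized at 0°, 90° and -90°, and the angle with the highest recognition confidence wins. A sketch with a stand-in recognizer returning (text, confidence); the real code calls the CRNN recognizer:

```python
def pick_orientation(crop_shape, recognize):
    h, w = crop_shape
    if h / w < 1.5:      # clearly horizontal: keep as-is
        return 0
    angles = [0, 90, -90]
    scores = [recognize(a)[1] for a in angles]
    return angles[scores.index(max(scores))]

# Mock recognizer: pretend the text reads best when rotated by 90°
fake = {0: ("??", 0.2), 90: ("hello", 0.95), -90: ("??", 0.1)}
angle = pick_orientation((120, 40), lambda a: fake[a])
```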

3.3 Layout Recognizer

File: /deepdoc/vision/layout_recognizer.py Lines: 33-237

3.3.1 YOLOv10 Preprocessing

# Lines 186-209
def preprocess(self, image_list):
    """
    Preprocess images for YOLOv10 inference.
    """
    processed = []
    for img in image_list:
        h, w = img.shape[:2]

        # Calculate scale (preserve aspect ratio)
        r = min(640/h, 640/w)
        new_h, new_w = int(h*r), int(w*r)

        # Resize
        resized = cv2.resize(img, (new_w, new_h))

        # Pad to 640x640 (center padding, gray color)
        padded = np.full((640, 640, 3), 114, dtype=np.uint8)
        pad_top = (640 - new_h) // 2
        pad_left = (640 - new_w) // 2
        padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized

        # Normalize and transpose
        padded = padded.astype(np.float32) / 255.0
        padded = padded.transpose(2, 0, 1)  # HWC → CHW

        processed.append(padded)

    return np.stack(processed)

Why 640x640?

  • YOLOv10 standard input size
  • Balance accuracy vs speed
  • 32-stride alignment (640 = 20 * 32)
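The letterbox arithmetic from preprocess() on a concrete page size: a 1000x800 page scaled to fit 640x640 with aspect ratio preserved, then center-padded (the gray value 114 fills the borders).

```python
h, w = 1000, 800
r = min(640 / h, 640 / w)               # 0.64: height is the limiting side
new_h, new_w = int(h * r), int(w * r)   # 640 x 512 after resize
pad_top = (640 - new_h) // 2            # 0: no vertical padding needed
pad_left = (640 - new_w) // 2           # 64 gray pixels on each side
```

The same r, pad_top and pad_left are reused after inference to scale the detected boxes back to page coordinates.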

3.3.2 Layout Types

# Lines 34-46
labels = [
    "_background_",    # 0: Background (ignored)
    "Text",            # 1: Body text paragraphs
    "Title",           # 2: Section/document titles
    "Figure",          # 3: Images, diagrams, charts
    "Figure caption",  # 4: Text describing figures
    "Table",           # 5: Data tables
    "Table caption",   # 6: Text describing tables
    "Header",          # 7: Page headers
    "Footer",          # 8: Page footers
    "Reference",       # 9: Bibliography, citations
    "Equation",        # 10: Mathematical equations
]

3.4 Table Structure Recognizer

File: /deepdoc/vision/table_structure_recognizer.py Lines: 30-613

3.4.1 Table Grid Construction

# Lines 172-349
@staticmethod
def construct_table(boxes, is_english=False, html=True, **kwargs):
    """
    Construct 2D table from detected components.
    """
    # Step 1: Sort by row
    boxes = Recognizer.sort_R_firstly(boxes, rowh/2)

    # Step 2: Group into rows
    rows = []
    current_row = [boxes[0]]
    for box in boxes[1:]:
        if box["top"] - current_row[-1]["bottom"] > rowh/2:
            rows.append(current_row)
            current_row = [box]
        else:
            current_row.append(box)
    rows.append(current_row)

    # Step 3: Sort each row by column
    for row in rows:
        row.sort(key=lambda x: x["x0"])

    # Step 4: Build 2D matrix
    n_cols = max(len(row) for row in rows)
    table = [[None] * n_cols for _ in range(len(rows))]

    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            table[i][j] = cell["text"]

    # Step 5: Generate output
    if html:
        return generate_html_table(table)
    else:
        return generate_descriptive_text(table)
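The excerpt above leans on internal helpers (rowh, generate_html_table). A runnable, simplified version of the same row-grouping idea: boxes sorted top-to-bottom are split into rows whenever the vertical gap exceeds half a row height, then each row is sorted left-to-right and emitted as HTML. Names here are illustrative, not the library's API.

```python
def boxes_to_html(boxes, rowh):
    boxes = sorted(boxes, key=lambda b: (b["top"], b["x0"]))
    rows, current = [], [boxes[0]]
    for b in boxes[1:]:
        # A gap larger than half a row height starts a new row
        if b["top"] - current[-1]["bottom"] > rowh / 2:
            rows.append(current)
            current = [b]
        else:
            current.append(b)
    rows.append(current)
    body = "".join(
        "<tr>" + "".join(f"<td>{c['text']}</td>"
                         for c in sorted(r, key=lambda c: c["x0"])) + "</tr>"
        for r in rows
    )
    return f"<table>{body}</table>"

boxes = [
    {"text": "A", "top": 0,  "bottom": 10, "x0": 0},
    {"text": "B", "top": 1,  "bottom": 11, "x0": 50},
    {"text": "C", "top": 30, "bottom": 40, "x0": 0},
]
html = boxes_to_html(boxes, rowh=10)
```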

3.4.2 Spanning Cell Handling

# Lines 496-575
def __cal_spans(self, boxes):
    """
    Calculate colspan and rowspan for merged cells.
    """
    for box in boxes:
        if "SP" not in box:  # Not a spanning cell
            continue

        # Find which rows this cell spans
        box["rowspan"] = []
        for i, row_box in enumerate(self.rows):
            if self.overlapped_area(box, row_box) > 0.3:
                box["rowspan"].append(i)

        # Find which columns this cell spans
        box["colspan"] = []
        for j, col_box in enumerate(self.cols):
            if self.overlapped_area(box, col_box) > 0.3:
                box["colspan"].append(j)

4. Technical Explanations

4.1 ONNX Runtime

What is ONNX?

  • Open Neural Network Exchange
  • A standard format for deep learning models
  • Runs on many hardware targets (CPU, GPU, NPU)

Why use ONNX?

# No PyTorch/TensorFlow runtime required
# Lightweight inference
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
output = session.run(None, {"input": input_data})

Configuration in DeepDoc:

# vision/ocr.py, lines 96-127
options = ort.SessionOptions()
options.enable_cpu_mem_arena = False     # Reduce memory fragmentation
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.intra_op_num_threads = 2         # Threads per operator
options.inter_op_num_threads = 2         # Parallel operators

# GPU configuration
if torch.cuda.is_available():
    providers = [
        ('CUDAExecutionProvider', {
            'device_id': device_id,
            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
        })
    ]

4.2 CTC Decoding

CTC (Connectionist Temporal Classification):

  • Solves the alignment problem in sequence-to-sequence recognition
  • No need to know the exact position of each character

Example:

OCR Model Output (time steps):
[a, a, a, -, l, l, -, p, p, h, h, a, -]

CTC Decoding:
1. Merge consecutive duplicates: [a, -, l, -, p, h, a, -]
2. Remove blank tokens (-): [a, l, p, h, a]
3. Result: "alpha"

Implementation:

# vision/postprocess.py, lines 355-366
def __call__(self, preds, label=None):
    # Get most probable character at each position
    preds_idx = preds.argmax(axis=2)  # Shape: (batch, time)
    preds_prob = preds.max(axis=2)     # Confidence scores

    # Decode with deduplication
    text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)

    return text
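
Putting the two pieces together, a self-contained greedy CTC decoder (a simplification of the actual decode path) behaves like this:

```python
import numpy as np

def ctc_greedy_decode(logits, charset, blank=0):
    """logits: (time, num_classes). Pick argmax per step, collapse
    consecutive repeats, then drop blank tokens."""
    ids = logits.argmax(axis=1)
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank:
            out.append(charset[i])
        prev = i
    return "".join(out)

# charset: index 0 is the CTC blank "-"
charset = ["-", "a", "h", "l", "p"]
# time steps spelling "alpha": a a - l - p h a
steps = [1, 1, 0, 3, 0, 4, 2, 1]
logits = np.eye(len(charset))[steps]  # one-hot prediction per time step
print(ctc_greedy_decode(logits, charset))  # → alpha
```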

4.3 Non-Maximum Suppression (NMS)

What is NMS?

  • Removes duplicate detections
  • Keeps the box with the highest confidence

Algorithm:

1. Sort boxes by confidence (descending)
2. Pick box with highest score → add to results
3. Remove boxes with IoU > threshold (e.g., 0.5)
4. Repeat until no boxes remain

Implementation:

# vision/operators.py, lines 702-725 (simplified; compute_iou is a vectorized IoU helper)
def nms(bboxes, scores, iou_thresh):
    indices = []
    index = scores.argsort()[::-1]  # Sort descending

    while index.size > 0:
        i = index[0]
        indices.append(i)

        # Compute IoU with remaining boxes
        ious = compute_iou(bboxes[i], bboxes[index[1:]])

        # Keep only boxes with IoU <= threshold
        mask = ious <= iou_thresh
        index = index[1:][mask]

    return indices
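
A fully runnable version of the same algorithm, with a vectorized IoU helper filled in (a sketch; the library code differs in details):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all [x0, y0, x1, y1]."""
    x0 = np.maximum(box[0], boxes[:, 0]); y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2]); y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def nms(bboxes, scores, iou_thresh=0.5):
    keep = []
    order = scores.argsort()[::-1]          # sort by confidence, descending
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        mask = iou(bboxes[i], bboxes[order[1:]]) <= iou_thresh
        order = order[1:][mask]             # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: box 1 overlaps box 0 too much
```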

4.4 DBNet (Differentiable Binarization)

What is DBNet?

  • Text detection network
  • Produces a probability map + a threshold map
  • Differentiable binarization enables end-to-end training

Pipeline:

Image → CNN Backbone → Feature Map →
                                    ├→ Probability Map (text regions)
                                    └→ Threshold Map (adaptive threshold)

Final = Probability > Threshold (pixel-wise)

Post-processing:

# vision/postprocess.py, DBPostProcess (simplified)
def __call__(self, outs_dict, shape_list):
    pred = outs_dict["maps"]

    # Binary thresholding (fixed threshold at inference)
    bitmap = (pred > self.thresh).astype("uint8")  # self.thresh = 0.3

    # Find contours of candidate text regions
    contours, _ = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    # Unclip (expand) each contour into a box
    boxes = []
    for contour in contours:
        box = self.unclip(contour, self.unclip_ratio)  # 1.5x expansion
        boxes.append(box)
    return boxes
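
The pixel-wise binarization step itself needs no OpenCV; on a toy 1-D "map" it looks like this (contour finding and unclipping omitted):

```python
import numpy as np

# Toy per-pixel outputs of the DBNet head
prob_map = np.array([0.9, 0.6, 0.2, 0.7])     # P(text) per pixel
thresh_map = np.array([0.5, 0.7, 0.5, 0.3])   # adaptive threshold per pixel

# Text wherever probability exceeds the (adaptive) threshold
bitmap = prob_map > thresh_map
print(bitmap.tolist())  # → [True, False, False, True]
```

At inference the post-processor above uses a fixed threshold (0.3); the learned threshold map matters mostly during training.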

4.5 K-Means cho Column Detection

Why K-Means?

  • Text boxes in the same column have similar X coordinates
  • K-Means clusters those X values
  • The silhouette score selects the optimal number of columns

Silhouette Score:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

- a(i): Average distance to same cluster
- b(i): Average distance to nearest other cluster
- Range: [-1, 1], higher = better clustering

Example:

Page with 2 columns:
Left column boxes: x0 = [50, 52, 48, 51, ...]
Right column boxes: x0 = [400, 398, 402, 399, ...]

K-Means (k=2):
- Cluster 0: x0 ≈ 50 (left column)
- Cluster 1: x0 ≈ 400 (right column)

Silhouette score ≈ 0.95 (high, good separation)
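
Assuming scikit-learn is available, the column-count selection can be sketched end-to-end; the data below mimics the two-column example (the actual parser does this inside `_assign_column`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# x0 of text boxes on a two-column page
x0 = np.array([50, 52, 48, 51, 400, 398, 402, 399], dtype=float).reshape(-1, 1)

# Try several column counts, keep the one with the best silhouette
best_k, best_score = 1, -1.0
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x0)
    score = silhouette_score(x0, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k)  # → 2 (two clearly separated columns)
```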

5. Design Rationale

5.1 Why Multiple Models?

Problem: a single model cannot handle all of these tasks well.

| Task | Model Type | Reason |
|------|------------|--------|
| Text Detection | DBNet | Specialized for text regions |
| Text Recognition | CRNN | Sequential text with CTC |
| Layout Detection | YOLOv10 | Strong object-detection performance |
| Table Structure | YOLOv10 variant | Fine-tuned for table elements |

Trade-off:

  • Pros: each model is optimized for its own task
  • Cons: multiple models → more memory and complexity

5.2 Why XGBoost for Text Merging?

Problem: deciding whether to merge two text blocks is complex.

Rule-based approach (naive):

# Simple heuristics
if y_distance < threshold and same_column:
    merge()
# ❌ Does not handle edge cases well

ML approach (XGBoost):

# 31 features capturing various signals
features = [
    y_distance / char_height,      # Distance feature
    ends_with_punctuation,          # Text pattern
    same_layout_type,               # Layout feature
    font_size_ratio,                # Typography
    ...
]
# ✅ Learns complex patterns from data

Why XGBoost?

  • Fast inference (tree-based)
  • Handles mixed feature types well
  • Pre-trained model included
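
For illustration, a hand-rolled subset of such a feature vector for one vertically adjacent pair. The feature names and dict keys here are illustrative, not the parser's exact 31 features:

```python
import re

def pair_features(up, down, char_height):
    """Illustrative merge features for two vertically adjacent boxes."""
    return [
        (down["top"] - up["bottom"]) / char_height,             # normalized gap
        1.0 if re.search(r"[.!?。!?]$", up["text"]) else 0.0,  # sentence end
        1.0 if up.get("layout_type") == down.get("layout_type") else 0.0,
        abs(up["x0"] - down["x0"]) / char_height,               # edge alignment
    ]

up = {"text": "The results show", "top": 0, "bottom": 10,
      "x0": 50, "layout_type": "text"}
down = {"text": "a clear trend.", "top": 12, "bottom": 22,
        "x0": 50, "layout_type": "text"}
feats = pair_features(up, down, char_height=10)
print(feats)  # → [0.2, 0.0, 1.0, 0.0]
```

The trained model then scores this vector; a high score means "merge".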

5.3 Why ONNX Instead of PyTorch/TensorFlow?

| Aspect | ONNX Runtime | PyTorch |
|--------|--------------|---------|
| Size | ~50MB | ~500MB+ |
| Memory | Lower | Higher |
| Startup | Fast | Slow (JIT) |
| Dependencies | Minimal | Many |
| Multi-platform | Yes | Limited |

DeepDoc's choice: ONNX for production deployment

  • No PyTorch runtime required
  • Lighter memory footprint
  • Faster cold start

5.4 Why Zoomin = 3?

Experiment results:

zoomin=1: OCR accuracy ~70%, fast
zoomin=2: OCR accuracy ~85%, moderate
zoomin=3: OCR accuracy ~95%, acceptable speed ← chosen
zoomin=4: OCR accuracy ~97%, slow
zoomin=5: OCR accuracy ~98%, very slow, memory issues

Balance: 3x is the sweet spot between accuracy and resource usage

5.5 Why Hybrid Text Extraction?

Native PDF text (pdfplumber):

  • Pros: Accurate, fast, preserves fonts
  • Cons: Unavailable for scanned PDFs

OCR text:

  • Pros: Works on any image
  • Cons: Slower, potential errors

Hybrid approach:

# Prefer native text, fallback to OCR
for box in ocr_boxes:
    # Try to match with native characters
    matched_chars = find_overlapping_chars(box, native_chars)

    if matched_chars:
        box["text"] = "".join(matched_chars)  # Use native
    else:
        box["text"] = ocr_result  # Use OCR
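
`find_overlapping_chars` above is shorthand; one plausible overlap test (horizontal only, for brevity) might look like:

```python
def chars_in_box(box, chars, min_overlap=0.5):
    """Return native PDF chars whose horizontal extent overlaps the
    OCR box by at least min_overlap of the char width."""
    out = []
    for ch in chars:
        w = ch["x1"] - ch["x0"]
        inter = min(box["x1"], ch["x1"]) - max(box["x0"], ch["x0"])
        if w > 0 and inter / w >= min_overlap:
            out.append(ch["text"])
    return "".join(out)

box = {"x0": 0, "x1": 30}
chars = [
    {"x0": 0,  "x1": 10, "text": "H"},
    {"x0": 10, "x1": 20, "text": "i"},
    {"x0": 40, "x1": 50, "text": "!"},   # outside the box: ignored
]
print(chars_in_box(box, chars))  # → Hi
```

A real implementation would also check vertical overlap and sort matched chars by x0.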

5.6 Pipeline vs End-to-End Model

End-to-End (e.g., Donut, Pix2Struct):

  • Single model: Image → Structured output
  • Pros: Simple, unified
  • Cons: Less accurate on specific tasks, hard to debug

Pipeline (DeepDoc's choice):

  • Multiple specialized models
  • Pros:
    • Each model optimized for task
    • Easy to debug/improve individual components
    • Mix and match different models
  • Cons:
    • More complexity
    • Potential error accumulation

DeepDoc's rationale: the pipeline approach gives flexibility and accuracy


6. Glossary

6.1 Computer Vision Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| Bounding Box | Rectangle enclosing an object | [x0, y0, x1, y1] coordinates |
| IoU | Intersection over Union: measures overlap | NMS threshold 0.5 |
| NMS | Non-Maximum Suppression | Removes duplicate detections |
| Anchor | Predefined box sizes | YOLOv10 anchors |
| Stride | Downsampling factor | 32 in YOLOv10 |
| FPN | Feature Pyramid Network | Multi-scale detection |

6.2 OCR Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| CTC | Connectionist Temporal Classification | CRNN output decoding |
| CRNN | CNN + RNN | Text recognition model |
| DBNet | Differentiable Binarization | Text detection model |
| Unclip | Expand polygon boundary | 1.5x expansion ratio |

6.3 ML Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| ONNX | Open Neural Network Exchange | Model format |
| Inference | Running a model on input | session.run() |
| Batch | Multiple inputs processed together | batch_size=16 |
| Confidence | Model's certainty score | 0.0 - 1.0 |

6.4 Document Processing Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| Layout | Document structure | Text, Table, Figure |
| TSR | Table Structure Recognition | Row, column detection |
| Spanning Cell | Merged table cell | colspan, rowspan |
| Reading Order | Text flow sequence | Top-to-bottom, left-to-right |

7. Extending the Code

7.1 Adding a New Parser

Example: adding an RTF parser

# deepdoc/parser/rtf_parser.py
from striprtf.striprtf import rtf_to_text

class RAGFlowRtfParser:
    def __call__(self, fnm, binary=None, chunk_token_num=128):
        if binary:
            content = binary.decode('utf-8')
        else:
            with open(fnm, 'r') as f:
                content = f.read()

        text = rtf_to_text(content)

        # Chunk text (_chunk: a token-based splitter you would implement)
        chunks = self._chunk(text, chunk_token_num)

        return [(chunk, f"rtf_chunk_{i}") for i, chunk in enumerate(chunks)]

7.2 Adding a New Layout Type

Example: adding a "Code Block" layout

# vision/layout_recognizer.py
labels = [
    "_background_",
    "Text",
    "Title",
    ...
    "Code Block",  # New label (index 11)
]

# Train new YOLOv10 model with "Code Block" annotations
# Update model file

7.3 Custom Text Merging Logic

# Override default merging behavior
# (illustrative: assumes a _should_merge hook; adapt to the actual merge entry point)
class CustomPdfParser(RAGFlowPdfParser):
    def _should_merge(self, box1, box2):
        """Custom merge logic"""
        # Don't merge code blocks
        if box1.get("layout_type") == "Code Block":
            return False

        # Use default logic otherwise
        return super()._should_merge(box1, box2)

7.4 Adding an Output Format

# Add a Markdown output format (_is_title and html_to_markdown are helpers to implement)
def to_markdown(self, documents, tables):
    md_parts = []

    for text, pos_tag in documents:
        # Detect if title
        if self._is_title(text):
            md_parts.append(f"## {text}\n")
        else:
            md_parts.append(f"{text}\n\n")

    # Convert tables to markdown
    for table in tables:
        md_table = html_to_markdown(table["html"])
        md_parts.append(md_table)

    return "\n".join(md_parts)

7.5 Optimizing Performance

GPU Batching:

# Process multiple pages in parallel
from concurrent.futures import ThreadPoolExecutor

def chunks(seq, n):
    """Split seq into batches of size n."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

def _parallel_ocr(self, images, batch_size=4):
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(self.ocr, batch)
                   for batch in chunks(images, batch_size)]
        results = [f.result() for f in futures]
    return results

Caching:

# Cache model instances
_model_cache = {}

def get_ocr_model(model_dir, device_id):
    key = f"{model_dir}_{device_id}"
    if key not in _model_cache:
        _model_cache[key] = OCR(model_dir, device_id)
    return _model_cache[key]

7.6 Integration with the RAG Pipeline

# rag/app/pdf.py (example integration)
from deepdoc.parser import RAGFlowPdfParser

def process_pdf_for_rag(file_path, chunk_size=512):
    parser = RAGFlowPdfParser()

    # Parse PDF
    documents, tables = parser(file_path)

    # Chunk documents
    chunks = []
    for text, pos_tag in documents:
        for chunk in chunk_text(text, chunk_size):
            chunks.append({
                "text": chunk,
                "metadata": {"position": pos_tag}
            })

    # Add tables as separate chunks
    for table in tables:
        chunks.append({
            "text": table["html"],
            "metadata": {"type": "table", "bbox": table["bbox"]}
        })

    return chunks
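
`chunk_text` is left undefined above; a naive whitespace-token splitter can stand in for RAGFlow's real token-aware chunking:

```python
def chunk_text(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

print(chunk_text("one two three four five", 2))
# → ['one two', 'three four', 'five']
```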

8. Summary

8.1 Key Takeaways

  1. DeepDoc = Parser Layer + Vision Layer

    • Parser: Format-specific handling (PDF, DOCX, etc.)
    • Vision: OCR + Layout + Table recognition
  2. Pipeline Architecture

    • Multiple specialized models
    • Easy to debug and improve
  3. ONNX Runtime

    • Lightweight inference
    • Cross-platform compatibility
  4. Hybrid Text Extraction

    • Native PDF text when available
    • OCR fallback for scanned documents

8.2 Summary Diagram

┌──────────────────────────────────────────────────────────────────────────────┐
│                            DEEPDOC SUMMARY                                    │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  INPUT                   PROCESSING                         OUTPUT            │
│  ─────                   ──────────                         ──────            │
│                                                                               │
│  ┌─────────┐     ┌────────────────────────────┐      ┌─────────────────┐    │
│  │  PDF    │────▶│  1. Image Extraction       │─────▶│  Documents      │    │
│  │  DOCX   │     │  2. OCR (DBNet + CRNN)     │      │  [(text, pos)]  │    │
│  │  Excel  │     │  3. Layout (YOLOv10)       │      │                 │    │
│  │  HTML   │     │  4. Column Detection       │      │  Tables         │    │
│  │  ...    │     │  5. Table Structure        │      │  [html, bbox]   │    │
│  └─────────┘     │  6. Text Merging           │      │                 │    │
│                  │  7. Quality Filtering      │      │  Figures        │    │
│                  └────────────────────────────┘      │  [image, cap]   │    │
│                                                       └─────────────────┘    │
│                                                                               │
│  MODELS USED:                                                                 │
│  ────────────                                                                 │
│  • DBNet (Text Detection)          - ONNX, ~30MB                             │
│  • CRNN (Text Recognition)         - ONNX, ~20MB                             │
│  • YOLOv10 (Layout Detection)      - ONNX, ~50MB                             │
│  • YOLOv10 (Table Structure)       - ONNX, ~50MB                             │
│  • XGBoost (Text Merging)          - Binary, ~5MB                            │
│                                                                               │
│  KEY ALGORITHMS:                                                              │
│  ───────────────                                                              │
│  • CTC Decoding (text recognition)                                           │
│  • NMS (duplicate removal)                                                   │
│  • K-Means (column detection)                                                │
│  • IoU (overlap calculation)                                                 │
│                                                                               │
└──────────────────────────────────────────────────────────────────────────────┘

8.3 Files Reference

| File | Lines | Description |
|------|-------|-------------|
| parser/pdf_parser.py | 1479 | Main PDF parser |
| vision/ocr.py | 752 | OCR detection + recognition |
| vision/layout_recognizer.py | 457 | Layout detection |
| vision/table_structure_recognizer.py | 613 | Table structure |
| vision/recognizer.py | 443 | Base recognizer class |
| vision/operators.py | 726 | Image preprocessing |
| vision/postprocess.py | 371 | Post-processing utilities |

Document created for RAGFlow v0.22.1 analysis