# DeepDoc Module - In-Depth Reading Guide

## Table of Contents

1. [The Big Picture](#1-the-big-picture)
2. [Data Flow](#2-data-flow)
3. [Detailed Code Analysis](#3-detailed-code-analysis)
4. [Technical Explanations](#4-technical-explanations)
5. [Design Rationale](#5-design-rationale)
6. [Glossary of Difficult Terms](#6-glossary-of-difficult-terms)
7. [Extending the Code](#7-extending-the-code)

---
## 1. The Big Picture

### 1.1 What Problem Does DeepDoc Solve?

**The core problem**: when building a RAG (Retrieval-Augmented Generation) system, you need to convert documents (PDF, Word, Excel...) into structured text so that you can:

- Run semantic (vector) search over it
- Chunk it sensibly
- Preserve the context of tables and figures

**What is DeepDoc?** A Python module dedicated to:

```
Document Files  →  Structured Text + Tables + Figures
(PDF, DOCX...)     (with position, layout type, reading order)
```
### 1.2 Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                       DEEPDOC MODULE                        │
└─────────────────────────────────────────────────────────────┘

PARSER LAYER
Converts each file format into structured text

  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
  │   PDF    │ │   DOCX   │ │  Excel   │ │   HTML   │ │ Markdown │
  │  Parser  │ │  Parser  │ │  Parser  │ │  Parser  │ │  Parser  │
  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
       └────────────┴────────────┼────────────┴────────────┘
                                 │
                   Text-based parsing
                   (pdfplumber, python-docx, openpyxl...)
                                 │
                                 ▼
VISION LAYER
Computer vision for complex PDFs (scanned, multi-column)

  ┌──────────────┐  ┌──────────────────┐  ┌────────────────────┐
  │     OCR      │  │ Layout Recognizer│  │  Table Structure   │
  │  Detection + │  │    (YOLOv10)     │  │     Recognizer     │
  │  Recognition │  │                  │  │                    │
  └──────┬───────┘  └────────┬─────────┘  └─────────┬──────────┘
         └───────────────────┴──────────────────────┘
                             │
                  ONNX Runtime Inference
```
### 1.3 Main Components

| Component | File | Purpose |
|-----------|------|---------|
| **PDF Parser** | `parser/pdf_parser.py` | The most complex parser - handles PDFs with OCR + layout |
| **Office Parsers** | `parser/docx_parser.py`, `excel_parser.py`, `ppt_parser.py` | Handle Microsoft Office files |
| **Web Parsers** | `parser/html_parser.py`, `markdown_parser.py`, `json_parser.py` | Handle web/markup files |
| **OCR Engine** | `vision/ocr.py` | Text detection + recognition |
| **Layout Detector** | `vision/layout_recognizer.py` | Classify regions (text, table, figure...) |
| **Table Detector** | `vision/table_structure_recognizer.py` | Recognize table structure |
| **Operators** | `vision/operators.py` | Image preprocessing pipeline |
### 1.4 Why Is DeepDoc Needed?

**Without DeepDoc** (naive approach):
```python
# Just extract raw text from the PDF
text = pdfplumber.open("doc.pdf").pages[0].extract_text()
# Result: "Header Footer Table content mixed together..."
# ❌ Structure is lost; tables become scrambled text
```

**With DeepDoc**:
```python
parser = RAGFlowPdfParser()
docs, tables = parser("doc.pdf")
# docs: [("Paragraph 1", "page_0_pos_100_200"), ("Paragraph 2", "page_0_pos_300_400")]
# tables: [{"html": "<table>...</table>", "bbox": [...]}]
# ✅ Structure is preserved; tables are parsed separately
```

---
## 2. Data Flow

### 2.1 Main Flow: PDF Processing

```
                        PDF PROCESSING PIPELINE

Input: PDF file (path or bytes)
        │
        ▼
STEP 1: IMAGE EXTRACTION
File: pdf_parser.py, __images__() (lines 1042-1159)
  • Convert PDF pages → numpy images (using pdfplumber)
  • Extract native PDF characters (text layer)
  • Zoom factor: 3x (default) for OCR accuracy
  Output: page_images[], page_chars[]
        │
        ▼
STEP 2: OCR DETECTION & RECOGNITION
File: vision/ocr.py, OCR.__call__() (lines 708-751)
  TextDetector (DBNet) → Crop & Rotate → TextRecognizer (CRNN)
  • Detect text regions → bounding boxes
  • Crop each region, auto-rotate if needed
  • Recognize text in each region
  Output: boxes[] with {text, confidence, coordinates}
        │
        ▼
STEP 3: LAYOUT RECOGNITION
File: vision/layout_recognizer.py, __call__() (lines 63-157)
  • Run YOLOv10 model on page image
  • Detect 10 layout types: Text, Title, Table, Figure, etc.
  • Match OCR boxes to layout regions
  Output: boxes[] with added {layout_type, layoutno}
        │
        ▼
STEP 4: COLUMN DETECTION
File: pdf_parser.py, _assign_column() (lines 355-440)
  • K-Means clustering on X coordinates
  • Silhouette score to find optimal k (1-4 columns)
  • Assign col_id to each text box
  Output: boxes[] with added {col_id}
        │
        ▼
STEP 5: TABLE STRUCTURE RECOGNITION
File: vision/table_structure_recognizer.py, __call__() (lines 67-111)
  • Detect rows, columns, headers, spanning cells
  • Match text boxes to table cells
  • Build 2D table matrix
  Output: table_components[] with grid structure
        │
        ▼
STEP 6: TEXT MERGING
File: pdf_parser.py, _text_merge() (lines 442-478)
      _naive_vertical_merge() (lines 480-556)
  • Horizontal merge: same line, same column, same layout
  • Vertical merge: adjacent paragraphs with semantic checks
  • Respect sentence boundaries (。?!)
  Output: merged_boxes[] (fewer, larger text blocks)
        │
        ▼
STEP 7: FILTERING & CLEANUP
File: pdf_parser.py, _filter_forpages() (lines 685-729)
      __filterout_scraps() (lines 971-1029)
  • Remove headers/footers (top/bottom 10% of page)
  • Remove table of contents
  • Filter low-quality OCR results
  Output: clean_boxes[]
        │
        ▼
STEP 8: EXTRACT TABLES & FIGURES
File: pdf_parser.py, _extract_table_figure() (lines 757-930)
  • Convert table boxes to HTML/descriptive text
  • Extract figure images with captions
  • Handle spanning cells (colspan, rowspan)
  Output: tables[], figures[]
        │
        ▼
FINAL OUTPUT
  documents: [(text, position_tag), ...]
  tables:    [{"html": "...", "bbox": [...], "image": ...}, ...]

  position_tag format: "page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}"
```
### 2.2 OCR Flow in Detail

```
Input Image (H, W, 3)
        │
        ▼
TEXT DETECTION (DBNet)
File: vision/ocr.py, TextDetector.__call__() (lines 503-530)

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ Preprocess  │      │    ONNX     │      │ Postprocess │
│             │      │  Inference  │      │             │
│ • Resize    │  →   │             │  →   │ • Threshold │
│ • Normalize │      │   DBNet     │      │ • Contours  │
│ • Transpose │      │   Model     │      │ • Unclip    │
└─────────────┘      └─────────────┘      └─────────────┘
        │
        ▼
Text Region Polygons
[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
        │
        ▼
TEXT RECOGNITION (CRNN)
File: vision/ocr.py, TextRecognizer.__call__() (lines 363-408)

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│    Crop     │      │    ONNX     │      │ CTC Decode  │
│   Rotate    │      │  Inference  │      │             │
│             │  →   │             │  →   │ • Argmax    │
│ Perspective │      │    CRNN     │      │ • Dedup     │
│  Transform  │      │    Model    │      │ • Remove ε  │
└─────────────┘      └─────────────┘      └─────────────┘
        │
        ▼
Output: [(box, (text, confidence)), ...]
```
### 2.3 Layout Recognition Flow

```
Input: Page Image + OCR Results
        │
        ▼
LAYOUT DETECTION (YOLOv10)
File: vision/layout_recognizer.py (lines 163-237)

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ Preprocess  │      │    ONNX     │      │ Postprocess │
│             │      │  Inference  │      │             │
│ • Resize    │  →   │             │  →   │ • NMS       │
│   (640x640) │      │  YOLOv10    │      │ • Filter    │
│ • Pad       │      │   Model     │      │ • Scale     │
│ • Normalize │      │             │      │   back      │
└─────────────┘      └─────────────┘      └─────────────┘
        │
        ▼
Layout Detections:
[{"type": "Table", "bbox": [...], "score": 0.95}]
        │
        ▼
OCR-LAYOUT ASSOCIATION
File: vision/layout_recognizer.py (lines 98-147)

For each OCR box:
  • Find overlapping layout region (threshold: 40%)
  • Assign layout_type to OCR box
  • Filter garbage (headers/footers/page numbers)
        │
        ▼
Output: OCR boxes with layout_type attribute
[{"text": "...", "layout_type": "Text", "layoutno": 1}]
```
### 2.4 Data Flow Summary

```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  PDF File   │  →   │   Images    │  →   │  OCR Boxes  │  →   │   Merged    │
│             │      │  + Chars    │      │  + Layout   │      │  Documents  │
└─────────────┘      └─────────────┘      └──────┬──────┘      └─────────────┘
                                                 │
                                                 ▼
                                          ┌─────────────┐
                                          │   Tables    │
                                          │ (HTML/Desc) │
                                          └─────────────┘

Input Format:
- File path: str (e.g., "/path/to/doc.pdf")
- Or bytes: bytes (raw PDF content)

Output Format:
- documents: List[Tuple[str, str]]
  - text: Extracted text content
  - position_tag: "page_0_x0_100_y0_200_x1_500_y1_250"

- tables: List[Dict]
  - html: "<table>...</table>"
  - bbox: [x0, y0, x1, y1]
  - image: numpy array (optional)
```
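The position tag can be built and parsed in a couple of lines. A minimal sketch, assuming integer coordinates as in the example above; the helper names are made up, not from the source:

```python
import re

def make_position_tag(page, x0, y0, x1, y1):
    """Serialize a box into the position-tag string format shown above."""
    return f"page_{page}_x0_{x0}_y0_{y0}_x1_{x1}_y1_{y1}"

def parse_position_tag(tag):
    """Recover (page, x0, y0, x1, y1) from a position tag."""
    m = re.fullmatch(r"page_(\d+)_x0_(\d+)_y0_(\d+)_x1_(\d+)_y1_(\d+)", tag)
    return tuple(int(g) for g in m.groups())

tag = make_position_tag(0, 100, 200, 500, 250)
print(tag)                      # page_0_x0_100_y0_200_x1_500_y1_250
print(parse_position_tag(tag))  # (0, 100, 200, 500, 250)
```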
---

## 3. Detailed Code Analysis

### 3.1 RAGFlowPdfParser Class

**File**: `/deepdoc/parser/pdf_parser.py`
**Lines**: 52-1479
#### 3.1.1 Constructor (__init__)

```python
# Line 52-104
class RAGFlowPdfParser:
    def __init__(self, **kwargs):
        # Load OCR model
        self.ocr = OCR()  # vision/ocr.py

        # Load Layout Recognizer (YOLOv10)
        self.layout_recognizer = LayoutRecognizer()  # vision/layout_recognizer.py

        # Load Table Structure Recognizer
        self.tsr = TableStructureRecognizer()  # vision/table_structure_recognizer.py

        # Load XGBoost model for text concatenation
        try:
            self.updown_cnt_mdl = xgb.Booster()
            model_path = os.path.join(get_project_base_directory(),
                                      "rag/res/deepdoc/updown_concat_xgb.model")
            self.updown_cnt_mdl.load_model(model_path)
        except Exception:
            # Fall back gracefully if the model file is missing
            self.updown_cnt_mdl = None
```

**Explanation**:
- The constructor initializes four models:
  1. **OCR**: text detection + recognition
  2. **LayoutRecognizer**: classifies layout regions (YOLOv10)
  3. **TableStructureRecognizer**: recognizes table structure
  4. **XGBoost**: decides whether to merge text blocks (31 features)
#### 3.1.2 Main Entry Point (__call__)

```python
# Lines 1160-1168
def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
    """
    Main entry point for PDF parsing.

    Args:
        fnm: File path or bytes
        need_image: Whether to extract images
        zoomin: Zoom factor for OCR (default 3x)
        return_html: Return HTML tables instead of descriptive text

    Returns:
        (documents, tables) tuple
    """
    self.__images__(fnm, zoomin)            # Step 1: Load images
    self._layouts_rec(zoomin)               # Steps 2-3: OCR + layout
    self._table_transformer_job(zoomin)     # Step 4: Table structure
    self._text_merge(zoomin)                # Step 5: Merge text
    self._filter_forpages()                 # Step 6: Filter
    tbls = self._extract_table_figure(...)  # Step 7: Extract tables
    return self._final_result(), tbls       # Final output
```

**Why zoomin=3?**
- OCR accuracy improves substantially on larger images
- 3x balances accuracy against memory/speed
- Too large (5x+) → memory issues; too small (1x) → OCR errors
#### 3.1.3 Image Loading (__images__)

```python
# Lines 1042-1159
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    """
    Load PDF pages as images and extract native characters.
    """
    self.page_images = []
    self.page_chars = []

    # Open PDF with pdfplumber
    with pdfplumber.open(fnm) as pdf:
        for i, page in enumerate(pdf.pages[page_from:page_to]):
            # Convert page to image
            img = page.to_image(resolution=72 * zoomin)
            img = np.array(img.original)
            self.page_images.append(img)

            # Extract native PDF characters
            chars = page.chars
            self.page_chars.append(chars)
```

**Why pdfplumber?**
- Supports both text extraction and image conversion
- Preserves character-level coordinates
- Handles complex PDFs well
#### 3.1.4 Column Detection (_assign_column)

```python
# Lines 355-440
def _assign_column(self, boxes, zoomin=3):
    """
    Detect columns using K-Means clustering on X coordinates.
    """
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Extract X coordinates
    x_coords = np.array([[b["x0"]] for b in boxes])

    best_k = 1
    best_score = -1

    # Try k from 1 to 4
    for k in range(1, min(5, len(boxes))):
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        labels = km.fit_predict(x_coords)

        if k > 1:
            score = silhouette_score(x_coords, labels)
            if score > best_score:
                best_score = score
                best_k = k

    # Final clustering with best k
    km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
    labels = km.fit_predict(x_coords)

    # Assign column IDs
    for i, box in enumerate(boxes):
        box["col_id"] = labels[i]
```

**Why K-Means?**
- Unsupervised: no training data needed
- Fast: O(n * k * iterations)
- The silhouette score picks the column count automatically
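The clustering idea can be illustrated without sklearn. A toy, dependency-free 1-D k-means with a fixed k=2 (the `x0` values and the crude seeding are made up for illustration, not the real implementation):

```python
def kmeans_1d(xs, k=2, iters=50):
    """Toy 1-D k-means: cluster x-coordinates into k column centers."""
    centers = sorted(xs)[::max(1, len(xs) // k)][:k]  # crude init: spread-out seeds
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        # Recompute each center as the mean of its assigned points
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

# Hypothetical x0 values from a two-column page
boxes = [{"x0": x} for x in [50, 52, 48, 51, 400, 398, 402, 399]]
labels, centers = kmeans_1d([b["x0"] for b in boxes], k=2)
for b, lab in zip(boxes, labels):
    b["col_id"] = lab
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1]
print(centers)  # [50.25, 399.75]
```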
### 3.2 OCR Class

**File**: `/deepdoc/vision/ocr.py`
**Lines**: 536-752
#### 3.2.1 Text Detection (TextDetector)

```python
# Lines 414-534
class TextDetector:
    def __init__(self, model_dir, device_id=None):
        # Preprocessing pipeline
        self.preprocess_op = [
            DetResizeForTest(limit_side_len=960, limit_type='max'),
            NormalizeImage(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225]),
            ToCHWImage(),
            KeepKeys(keep_keys=['image', 'shape'])
        ]

        # Postprocessing
        self.postprocess_op = DBPostProcess(
            thresh=0.3,           # Binary threshold
            box_thresh=0.5,       # Box confidence threshold
            max_candidates=1000,  # Max text regions
            unclip_ratio=1.5      # Box expansion ratio
        )

        # Load ONNX model
        self.ort_sess, self.run_opts = load_model(model_dir, "det", device_id)
```

**DBNet (Differentiable Binarization)**:
- Input: image → probability map (text regions)
- Thresholding: prob > 0.3 → foreground
- Unclipping: expand boxes by 1.5x to capture the full text
#### 3.2.2 Text Recognition (TextRecognizer)

```python
# Lines 133-412
class TextRecognizer:
    def __init__(self, model_dir, device_id=None):
        self.rec_image_shape = [3, 48, 320]  # C, H, W
        self.batch_size = 16

        # Load CRNN model
        self.ort_sess, self.run_opts = load_model(model_dir, "rec", device_id)

        # CTC decoder
        self.postprocess_op = CTCLabelDecode(character_dict_path=dict_path)

    def __call__(self, img_list):
        # Sort by aspect ratio for efficient batching
        indices = np.argsort([img.shape[1] / img.shape[0] for img in img_list])

        results = []
        for batch in chunks(indices, self.batch_size):
            # Normalize images
            norm_imgs = [self.resize_norm_img(img_list[i]) for i in batch]

            # Run inference
            preds = self.ort_sess.run(None, {"input": np.stack(norm_imgs)})

            # CTC decode
            texts = self.postprocess_op(preds[0])
            results.extend(texts)

        return results
```

**CRNN + CTC**:
- CNN: extracts visual features
- RNN: models the character sequence
- CTC: alignment-free decoding (handles variable-length text)
#### 3.2.3 Rotation Handling

```python
# Lines 584-638
def get_rotate_crop_image(self, img, points):
    """
    Crop a text region with auto-rotation detection.
    """
    # Target rectangle size derived from the quadrilateral
    rect = self.order_points_clockwise(points)
    width = int(max(np.linalg.norm(rect[0] - rect[1]), np.linalg.norm(rect[2] - rect[3])))
    height = int(max(np.linalg.norm(rect[0] - rect[3]), np.linalg.norm(rect[1] - rect[2])))
    dst_pts = np.float32([[0, 0], [width, 0], [width, height], [0, height]])

    # Perspective transform to a straight crop
    M = cv2.getPerspectiveTransform(rect, dst_pts)
    warped = cv2.warpPerspective(img, M, (width, height))

    # Check if text is vertical (height > 1.5 * width)
    if warped.shape[0] / warped.shape[1] >= 1.5:
        # Try 3 orientations
        scores = []
        for angle in [0, 90, -90]:
            rotated = self.rotate(warped, angle)
            _, conf = self.recognizer([rotated])[0]
            scores.append(conf)

        # Use orientation with highest confidence
        best_angle = [0, 90, -90][np.argmax(scores)]
        warped = self.rotate(warped, best_angle)

    return warped
```

**Why auto-rotation?**
- PDFs may contain text rotated 90°
- The OCR model is trained on horizontal text
- Auto-detection makes vertical text recognizable
### 3.3 Layout Recognizer

**File**: `/deepdoc/vision/layout_recognizer.py`
**Lines**: 33-237
#### 3.3.1 YOLOv10 Preprocessing

```python
# Lines 186-209
def preprocess(self, image_list):
    """
    Preprocess images for YOLOv10 inference.
    """
    processed = []
    for img in image_list:
        h, w = img.shape[:2]

        # Calculate scale (preserve aspect ratio)
        r = min(640/h, 640/w)
        new_h, new_w = int(h*r), int(w*r)

        # Resize
        resized = cv2.resize(img, (new_w, new_h))

        # Pad to 640x640 (center padding, gray color)
        padded = np.full((640, 640, 3), 114, dtype=np.uint8)
        pad_top = (640 - new_h) // 2
        pad_left = (640 - new_w) // 2
        padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized

        # Normalize and transpose
        padded = padded.astype(np.float32) / 255.0
        padded = padded.transpose(2, 0, 1)  # HWC → CHW

        processed.append(padded)

    return np.stack(processed)
```

**Why 640x640?**
- YOLOv10's standard input size
- Balances accuracy vs speed
- 32-stride alignment (640 = 20 * 32)
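Going the other way — mapping detections from the padded 640x640 space back to original image coordinates — is the "Scale back" step in the flow diagram. A minimal sketch of that inverse transform; the function name and box format are illustrative, not from the source:

```python
def scale_back(box, orig_h, orig_w, size=640):
    """Map [x0, y0, x1, y1] from padded 640x640 space to original image space."""
    r = min(size / orig_h, size / orig_w)  # same scale used when resizing
    new_h, new_w = int(orig_h * r), int(orig_w * r)
    pad_top = (size - new_h) // 2          # same center-padding offsets
    pad_left = (size - new_w) // 2
    x0, y0, x1, y1 = box
    return [(x0 - pad_left) / r, (y0 - pad_top) / r,
            (x1 - pad_left) / r, (y1 - pad_top) / r]

# A 1000x800 page is letterboxed to 640x640 with 64px side padding;
# a full-frame detection maps back to the full page.
print(scale_back([64, 0, 576, 640], orig_h=1000, orig_w=800))
```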
#### 3.3.2 Layout Types

```python
# Lines 34-46
labels = [
    "_background_",    # 0: Background (ignored)
    "Text",            # 1: Body text paragraphs
    "Title",           # 2: Section/document titles
    "Figure",          # 3: Images, diagrams, charts
    "Figure caption",  # 4: Text describing figures
    "Table",           # 5: Data tables
    "Table caption",   # 6: Text describing tables
    "Header",          # 7: Page headers
    "Footer",          # 8: Page footers
    "Reference",       # 9: Bibliography, citations
    "Equation",        # 10: Mathematical equations
]
```
### 3.4 Table Structure Recognizer

**File**: `/deepdoc/vision/table_structure_recognizer.py`
**Lines**: 30-613
#### 3.4.1 Table Grid Construction

```python
# Lines 172-349
@staticmethod
def construct_table(boxes, is_english=False, html=True, **kwargs):
    """
    Construct a 2D table from detected components.
    """
    # rowh: median row height, computed earlier in the real code

    # Step 1: Sort by row
    boxes = Recognizer.sort_R_firstly(boxes, rowh / 2)

    # Step 2: Group into rows
    rows = []
    current_row = [boxes[0]]
    for box in boxes[1:]:
        if box["top"] - current_row[-1]["bottom"] > rowh / 2:
            rows.append(current_row)
            current_row = [box]
        else:
            current_row.append(box)
    rows.append(current_row)

    # Step 3: Sort each row by column
    for row in rows:
        row.sort(key=lambda x: x["x0"])

    # Step 4: Build 2D matrix
    n_cols = max(len(row) for row in rows)
    table = [[None] * n_cols for _ in range(len(rows))]

    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            table[i][j] = cell["text"]

    # Step 5: Generate output
    if html:
        return generate_html_table(table)
    else:
        return generate_descriptive_text(table)
```
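The `generate_html_table` helper above is paraphrased. A minimal sketch of what such a helper could look like; the real implementation in table_structure_recognizer.py is more involved (headers, colspan/rowspan):

```python
import html

def generate_html_table(table):
    """Render a 2D matrix (rows of cell texts, None for empty) as an HTML table."""
    parts = ["<table>"]
    for row in table:
        cells = "".join(
            "<td>%s</td>" % html.escape(cell if cell is not None else "")
            for cell in row
        )
        parts.append("<tr>%s</tr>" % cells)
    parts.append("</table>")
    return "".join(parts)

# Hypothetical 2x2 grid
print(generate_html_table([["Name", "Age"], ["Ann", "30"]]))
# → <table><tr><td>Name</td><td>Age</td></tr><tr><td>Ann</td><td>30</td></tr></table>
```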
#### 3.4.2 Spanning Cell Handling

```python
# Lines 496-575
def __cal_spans(self, boxes):
    """
    Calculate colspan and rowspan for merged cells.
    """
    for box in boxes:
        if "SP" not in box:  # Not a spanning cell
            continue

        # Find which rows this cell spans
        box["rowspan"] = []
        for i, row_box in enumerate(self.rows):
            if self.overlapped_area(box, row_box) > 0.3:
                box["rowspan"].append(i)

        # Find which columns this cell spans
        box["colspan"] = []
        for j, col_box in enumerate(self.cols):
            if self.overlapped_area(box, col_box) > 0.3:
                box["colspan"].append(j)
```
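`overlapped_area` is used above as a ratio test against a 0.3 threshold. A minimal sketch of such an overlap-ratio helper for axis-aligned boxes — the exact normalization in the real code may differ; the dict keys follow the box format used throughout this guide:

```python
def overlapped_area(a, b):
    """Fraction of box a's area covered by box b (boxes: x0, top, x1, bottom)."""
    ix = max(0.0, min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]))
    iy = max(0.0, min(a["bottom"], b["bottom"]) - max(a["top"], b["top"]))
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    return (ix * iy) / area_a if area_a > 0 else 0.0

cell = {"x0": 0, "top": 0, "x1": 10, "bottom": 10}
row = {"x0": 0, "top": 5, "x1": 100, "bottom": 15}
print(overlapped_area(cell, row))  # 0.5: the cell's lower half lies inside the row band
```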
---

## 4. Technical Explanations
### 4.1 ONNX Runtime

**What is ONNX?**
- Open Neural Network Exchange
- A standard format for deep learning models
- Runs on many hardware targets (CPU, GPU, NPU)

**Why ONNX?**
```python
# No PyTorch/TensorFlow runtime needed
# Lightweight inference
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
output = session.run(None, {"input": input_data})
```

**Configuration in DeepDoc**:
```python
# vision/ocr.py, lines 96-127
options = ort.SessionOptions()
options.enable_cpu_mem_arena = False  # Reduce memory fragmentation
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.intra_op_num_threads = 2      # Threads per operator
options.inter_op_num_threads = 2      # Parallel operators

# GPU configuration
if torch.cuda.is_available():
    providers = [
        ('CUDAExecutionProvider', {
            'device_id': device_id,
            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
        })
    ]
```
### 4.2 CTC Decoding

**CTC (Connectionist Temporal Classification)**:
- Solves the alignment problem in sequence-to-sequence models
- No need to know the exact position of each character

**Example**:
```
OCR Model Output (time steps):
[a, a, a, -, l, l, -, p, p, h, h, a, -]

CTC Decoding:
1. Merge consecutive duplicates: [a, -, l, -, p, h, a, -]
2. Remove blank tokens (-):      [a, l, p, h, a]
3. Result: "alpha"
```

**Implementation**:
```python
# vision/postprocess.py, lines 355-366
def __call__(self, preds, label=None):
    # Get most probable character at each position
    preds_idx = preds.argmax(axis=2)  # Shape: (batch, time)
    preds_prob = preds.max(axis=2)    # Confidence scores

    # Decode with deduplication
    text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)

    return text
```
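The two-step collapse can be written as a tiny self-contained greedy decoder. The alphabet and blank index below are illustrative; the real decoder reads its character dictionary from a file:

```python
def ctc_greedy_decode(ids, charset, blank=0):
    """Collapse repeats, then drop blanks: standard greedy CTC decoding."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank:  # skip repeats and blank tokens
            out.append(charset[i])
        prev = i
    return "".join(out)

# Index 0 is the blank token; the rest is a toy alphabet
charset = ["-", "a", "h", "l", "p"]
ids = [1, 1, 1, 0, 3, 3, 0, 4, 4, 2, 2, 1, 0]  # a a a - l l - p p h h a -
print(ctc_greedy_decode(ids, charset))  # alpha
```

Note that a blank between two identical indices keeps both characters (e.g. `[h, -, h]` decodes to "hh"), which is exactly why CTC can emit doubled letters.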
### 4.3 Non-Maximum Suppression (NMS)

**What is NMS?**
- Removes duplicate detections
- Keeps the box with the highest confidence

**Algorithm**:
```
1. Sort boxes by confidence (descending)
2. Pick box with highest score → add to results
3. Remove boxes with IoU > threshold (e.g., 0.5)
4. Repeat until no boxes remain
```

**Implementation**:
```python
# vision/operators.py, lines 702-725
def nms(bboxes, scores, iou_thresh):
    indices = []
    index = scores.argsort()[::-1]  # Sort descending

    while index.size > 0:
        i = index[0]
        indices.append(i)

        # Compute IoU with remaining boxes
        ious = compute_iou(bboxes[i], bboxes[index[1:]])

        # Keep only boxes with IoU <= threshold
        mask = ious <= iou_thresh
        index = index[1:][mask]

    return indices
```
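A self-contained version with the IoU computation filled in — a sketch; the `compute_iou` used in operators.py may differ in details:

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one [x0, y0, x1, y1] box and an array of boxes."""
    x0 = np.maximum(box[0], boxes[:, 0])
    y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2])
    y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def nms(bboxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep highest-scoring boxes, drop heavy overlaps."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        ious = iou_one_to_many(bboxes[i], bboxes[order[1:]])
        order = order[1:][ious <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much and is dropped
```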
### 4.4 DBNet (Differentiable Binarization)

**What is DBNet?**
- A text detection network
- Produces a probability map + a threshold map
- Differentiable binarization enables end-to-end training

**Pipeline**:
```
Image → CNN Backbone → Feature Map →
        ├→ Probability Map (text regions)
        └→ Threshold Map (adaptive threshold)

Final = Probability > Threshold (pixel-wise)
```

**Post-processing**:
```python
# vision/postprocess.py, DBPostProcess
def __call__(self, outs_dict, shape_list):
    pred = outs_dict["maps"]

    # Binary thresholding
    bitmap = pred > self.thresh  # 0.3

    # Find contours
    contours = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    # Unclip (expand) boxes
    boxes = []
    for contour in contours:
        box = self.unclip(contour, self.unclip_ratio)  # 1.5x expansion
        boxes.append(box)
```
### 4.5 K-Means for Column Detection

**Why K-Means?**
- Text boxes in the same column have similar X coordinates
- K-Means clusters the X values
- The silhouette score picks the optimal column count

**Silhouette Score**:
```
s(i) = (b(i) - a(i)) / max(a(i), b(i))

- a(i): Average distance to same cluster
- b(i): Average distance to nearest other cluster
- Range: [-1, 1], higher = better clustering
```

**Example**:
```
Page with 2 columns:
Left column boxes:  x0 = [50, 52, 48, 51, ...]
Right column boxes: x0 = [400, 398, 402, 399, ...]

K-Means (k=2):
- Cluster 0: x0 ≈ 50  (left column)
- Cluster 1: x0 ≈ 400 (right column)

Silhouette score is close to 1 (good separation)
```
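The formula can be checked numerically with a small dependency-free sketch; the `x0` values are the toy two-column example above:

```python
def silhouette(xs, labels):
    """Mean silhouette score for labeled 1-D points (two clusters)."""
    n = len(xs)
    scores = []
    for i in range(n):
        same = [xs[j] for j in range(n) if labels[j] == labels[i] and j != i]
        other = [xs[j] for j in range(n) if labels[j] != labels[i]]
        a = sum(abs(xs[i] - g) for g in same) / len(same)    # cohesion
        b = sum(abs(xs[i] - g) for g in other) / len(other)  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

xs = [50, 52, 48, 51, 400, 398, 402, 399]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(round(silhouette(xs, labels), 2))  # ≈ 0.99: well-separated columns
```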
---

## 5. Design Rationale
### 5.1 Why Multiple Models?

**The problem**: no single model can handle every task well.

| Task | Model Type | Reason |
|------|------------|--------|
| Text Detection | DBNet | Specialized for text regions |
| Text Recognition | CRNN | Sequential text with CTC |
| Layout Detection | YOLOv10 | Strong general object detection |
| Table Structure | YOLOv10 variant | Fine-tuned for table elements |

**Trade-off**:
- Pros: each model is optimized for its own task
- Cons: more models → more memory and complexity
### 5.2 Why XGBoost for Text Merging?

**The problem**: deciding whether to merge two text blocks is complex.

**Rule-based approach** (naive):
```python
# Simple heuristics
if y_distance < threshold and same_column:
    merge()
# ❌ Does not handle edge cases well
```

**ML approach** (XGBoost):
```python
# 31 features capturing various signals
features = [
    y_distance / char_height,  # Distance feature
    ends_with_punctuation,     # Text pattern
    same_layout_type,          # Layout feature
    font_size_ratio,           # Typography
    ...
]
# ✅ Learns complex patterns from data
```

**Why XGBoost?**
- Fast inference (tree-based)
- Handles mixed feature types well
- A pre-trained model ships with the module
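A minimal sketch of how such a feature vector might be assembled from two candidate boxes. The feature choices and box format here are illustrative; the real 31 features live in pdf_parser.py:

```python
def merge_features(up, down):
    """Toy feature vector for 'should these two boxes be merged vertically?'."""
    char_h = up["bottom"] - up["top"]
    return [
        (down["top"] - up["bottom"]) / max(char_h, 1e-6),         # height-normalized vertical gap
        1.0 if up["text"].rstrip()[-1:] in "。?!.?!" else 0.0,    # sentence already ended?
        1.0 if up["layout_type"] == down["layout_type"] else 0.0,  # same layout region?
    ]

up = {"top": 100, "bottom": 120, "text": "continued on the next line", "layout_type": "Text"}
down = {"top": 124, "bottom": 144, "text": "and finishes here.", "layout_type": "Text"}
print(merge_features(up, down))  # [0.2, 0.0, 1.0]
```

The resulting vector would be fed to `updown_cnt_mdl.predict()`; a small gap, no sentence-final punctuation, and a shared layout type all push the decision toward merging.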
### 5.3 Why ONNX Instead of PyTorch/TensorFlow?

| Aspect | ONNX Runtime | PyTorch |
|--------|--------------|---------|
| Size | ~50MB | ~500MB+ |
| Memory | Lower | Higher |
| Startup | Fast | Slow (JIT) |
| Dependencies | Minimal | Many |
| Multi-platform | Yes | Limited |

**DeepDoc's choice**: ONNX for production deployment
- No PyTorch runtime required
- Lighter memory footprint
- Faster cold start
### 5.4 Why Zoomin = 3?

**Experiment results**:
```
zoomin=1: OCR accuracy ~70%, fast
zoomin=2: OCR accuracy ~85%, moderate
zoomin=3: OCR accuracy ~95%, acceptable speed ← chosen
zoomin=4: OCR accuracy ~97%, slow
zoomin=5: OCR accuracy ~98%, very slow, memory issues
```

**Balance**: 3x is the sweet spot between accuracy and resource usage.
### 5.5 Why Hybrid Text Extraction?

**Native PDF text** (pdfplumber):
- Pros: accurate, fast, preserves fonts
- Cons: not available for scanned PDFs

**OCR text**:
- Pros: works on any image
- Cons: slower, potential errors

**Hybrid approach**:
```python
# Prefer native text, fall back to OCR
for box in ocr_boxes:
    # Try to match with native characters
    matched_chars = find_overlapping_chars(box, native_chars)

    if matched_chars:
        box["text"] = "".join(matched_chars)  # Use native
    else:
        box["text"] = ocr_result  # Use OCR
```
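A minimal sketch of what `find_overlapping_chars` might look like, assuming pdfplumber-style character dicts with `x0`/`x1`/`top`/`bottom`/`text` keys and a simple center-in-box matching rule (the real matching logic may be more forgiving):

```python
def find_overlapping_chars(box, native_chars):
    # Keep native characters whose center point falls inside the OCR box.
    x0, y0, x1, y1 = box["bbox"]
    matched = []
    for ch in native_chars:
        cx = (ch["x0"] + ch["x1"]) / 2
        cy = (ch["top"] + ch["bottom"]) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            matched.append(ch["text"])
    return matched
```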
### 5.6 Pipeline vs End-to-End Model

**End-to-end** (e.g., Donut, Pix2Struct):
- Single model: Image → Structured output
- Pros: simple, unified
- Cons: less accurate on specific tasks, hard to debug

**Pipeline** (DeepDoc's choice):
- Multiple specialized models
- Pros:
  - Each model is optimized for its task
  - Easy to debug/improve individual components
  - Mix and match different models
- Cons:
  - More complexity
  - Potential error accumulation

**DeepDoc's rationale**: the pipeline trades extra complexity for flexibility and accuracy.

---
## 6. Glossary of Difficult Terms

### 6.1 Computer Vision Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| **Bounding Box** | Rectangle enclosing an object | `[x0, y0, x1, y1]` coordinates |
| **IoU** | Intersection over Union, a measure of overlap | NMS threshold 0.5 |
| **NMS** | Non-Maximum Suppression | Removes duplicate detections |
| **Anchor** | Predefined box sizes | YOLOv10 anchors |
| **Stride** | Downsampling factor | 32 in YOLOv10 |
| **FPN** | Feature Pyramid Network | Multi-scale detection |
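IoU and greedy NMS from the table above fit in a few lines of plain Python. This is a sketch of the standard algorithms, not DeepDoc's actual implementation:

```python
def iou(a, b):
    # Intersection over Union of two [x0, y0, x1, y1] boxes
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```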
### 6.2 OCR Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| **CTC** | Connectionist Temporal Classification | CRNN output decoding |
| **CRNN** | CNN + RNN | Text recognition model |
| **DBNet** | Differentiable Binarization | Text detection model |
| **Unclip** | Expand polygon boundary | 1.5x expansion ratio |
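Greedy CTC decoding, which turns the recognizer's per-frame label sequence into text, can be sketched as follows (a minimal version, assuming index 0 is the blank label):

```python
def ctc_greedy_decode(labels, charset, blank=0):
    # Collapse repeated labels, then drop blanks: e.g. "aa-b-b" -> "abb"
    out, prev = [], None
    for lab in labels:
        if lab != prev and lab != blank:
            out.append(charset[lab])
        prev = lab
    return "".join(out)
```

The blank label is what lets CTC emit genuine double letters: a repeat separated by a blank counts twice, while an uninterrupted repeat collapses to one.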
### 6.3 ML Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| **ONNX** | Open Neural Network Exchange | Model format |
| **Inference** | Running a model on input | `session.run()` |
| **Batch** | Multiple inputs processed together | batch_size=16 |
| **Confidence** | Model's certainty score | 0.0 - 1.0 |
### 6.4 Document Processing Terms

| Term | Definition | Example in DeepDoc |
|------|------------|--------------------|
| **Layout** | Document structure | Text, Table, Figure |
| **TSR** | Table Structure Recognition | Row, column detection |
| **Spanning Cell** | Merged table cell | colspan, rowspan |
| **Reading Order** | Text flow sequence | Top-to-bottom, left-to-right |
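The reading-order heuristic from the table (top-to-bottom, then left-to-right) can be sketched for simple single-column pages. This is an illustrative assumption, not DeepDoc's full ordering logic: boxes are `(x0, y0)` pairs and a tolerance band decides when two boxes sit on the "same line":

```python
def reading_order(boxes, line_tol=5):
    # Bucket boxes into horizontal bands of ~line_tol height, then sort
    # top-to-bottom by band and left-to-right within each band.
    return sorted(boxes, key=lambda b: (round(b[1] / line_tol), b[0]))
```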
---
## 7. Extending the Code

### 7.1 Adding a New Parser

**Example**: add an RTF parser

```python
# deepdoc/parser/rtf_parser.py
from striprtf.striprtf import rtf_to_text  # third-party: pip install striprtf


class RAGFlowRtfParser:
    def __call__(self, fnm, binary=None, chunk_token_num=128):
        if binary:
            content = binary.decode("utf-8")
        else:
            with open(fnm, "r", encoding="utf-8") as f:
                content = f.read()

        text = rtf_to_text(content)

        # Chunk text
        chunks = self._chunk(text, chunk_token_num)

        return [(chunk, f"rtf_chunk_{i}") for i, chunk in enumerate(chunks)]

    def _chunk(self, text, chunk_token_num):
        # Naive whitespace-based chunking; swap in a real tokenizer if needed
        words = text.split()
        return [" ".join(words[i:i + chunk_token_num])
                for i in range(0, len(words), chunk_token_num)]
```
### 7.2 Adding a New Layout Type

**Example**: add a "Code Block" layout

```python
# vision/layout_recognizer.py
labels = [
    "_background_",
    "Text",
    "Title",
    ...
    "Code Block",  # New label (index 11)
]

# Train a new YOLOv10 model with "Code Block" annotations
# Update the model file
```
### 7.3 Custom Text Merging Logic

```python
# Override the default merging behavior
class CustomPdfParser(RAGFlowPdfParser):
    def _should_merge(self, box1, box2):
        """Custom merge logic"""
        # Don't merge code blocks
        if box1.get("layout_type") == "Code Block":
            return False

        # Use the default logic otherwise
        return super()._should_merge(box1, box2)
```
### 7.4 Adding an Output Format

```python
# Add a Markdown output format; `self._is_title` and `html_to_markdown`
# are helper functions you would supply.
def to_markdown(self, documents, tables):
    md_parts = []

    for text, pos_tag in documents:
        # Detect if the block is a title
        if self._is_title(text):
            md_parts.append(f"## {text}\n")
        else:
            md_parts.append(f"{text}\n\n")

    # Convert tables to markdown
    for table in tables:
        md_table = html_to_markdown(table["html"])
        md_parts.append(md_table)

    return "\n".join(md_parts)
```
### 7.5 Optimizing Performance

**Batched OCR with a thread pool**:
```python
from concurrent.futures import ThreadPoolExecutor


# Process multiple pages in parallel
def _parallel_ocr(self, images, batch_size=4):
    def chunks(seq, size):
        # Split a list into batches of `size`
        return [seq[i:i + size] for i in range(0, len(seq), size)]

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(self.ocr, batch)
                   for batch in chunks(images, batch_size)]
        return [f.result() for f in futures]
```

**Caching**:
```python
# Cache model instances keyed by model dir and device
_model_cache = {}


def get_ocr_model(model_dir, device_id):
    key = f"{model_dir}_{device_id}"
    if key not in _model_cache:
        _model_cache[key] = OCR(model_dir, device_id)
    return _model_cache[key]
```
### 7.6 Integration with the RAG Pipeline

```python
# rag/app/pdf.py (example integration); `chunk_text` is a helper
# you would supply.
from deepdoc.parser import RAGFlowPdfParser


def process_pdf_for_rag(file_path, chunk_size=512):
    parser = RAGFlowPdfParser()

    # Parse PDF
    documents, tables = parser(file_path)

    # Chunk documents
    chunks = []
    for text, pos_tag in documents:
        for chunk in chunk_text(text, chunk_size):
            chunks.append({
                "text": chunk,
                "metadata": {"position": pos_tag}
            })

    # Add tables as separate chunks
    for table in tables:
        chunks.append({
            "text": table["html"],
            "metadata": {"type": "table", "bbox": table["bbox"]}
        })

    return chunks
```

---
## 8. Summary

### 8.1 Key Takeaways

1. **DeepDoc = Parser Layer + Vision Layer**
   - Parser: format-specific handling (PDF, DOCX, etc.)
   - Vision: OCR + layout + table recognition

2. **Pipeline architecture**
   - Multiple specialized models
   - Easy to debug and improve

3. **ONNX Runtime**
   - Lightweight inference
   - Cross-platform compatibility

4. **Hybrid text extraction**
   - Native PDF text when available
   - OCR fallback for scanned documents
### 8.2 Summary Diagram

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                               DEEPDOC SUMMARY                                │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  INPUT            PROCESSING                           OUTPUT                │
│  ─────            ──────────                           ──────                │
│                                                                              │
│  ┌─────────┐     ┌────────────────────────────┐      ┌─────────────────┐     │
│  │ PDF     │────▶│ 1. Image Extraction        │─────▶│ Documents       │     │
│  │ DOCX    │     │ 2. OCR (DBNet + CRNN)      │      │ [(text, pos)]   │     │
│  │ Excel   │     │ 3. Layout (YOLOv10)        │      │                 │     │
│  │ HTML    │     │ 4. Column Detection        │      │ Tables          │     │
│  │ ...     │     │ 5. Table Structure         │      │ [html, bbox]    │     │
│  └─────────┘     │ 6. Text Merging            │      │                 │     │
│                  │ 7. Quality Filtering       │      │ Figures         │     │
│                  └────────────────────────────┘      │ [image, cap]    │     │
│                                                      └─────────────────┘     │
│                                                                              │
│  MODELS USED:                                                                │
│  ────────────                                                                │
│  • DBNet   (Text Detection)    - ONNX, ~30MB                                 │
│  • CRNN    (Text Recognition)  - ONNX, ~20MB                                 │
│  • YOLOv10 (Layout Detection)  - ONNX, ~50MB                                 │
│  • YOLOv10 (Table Structure)   - ONNX, ~50MB                                 │
│  • XGBoost (Text Merging)      - Binary, ~5MB                                │
│                                                                              │
│  KEY ALGORITHMS:                                                             │
│  ───────────────                                                             │
│  • CTC Decoding (text recognition)                                           │
│  • NMS (duplicate removal)                                                   │
│  • K-Means (column detection)                                                │
│  • IoU (overlap calculation)                                                 │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```
### 8.3 Files Reference

| File | Lines | Description |
|------|-------|-------------|
| `parser/pdf_parser.py` | 1479 | Main PDF parser |
| `vision/ocr.py` | 752 | OCR detection + recognition |
| `vision/layout_recognizer.py` | 457 | Layout detection |
| `vision/table_structure_recognizer.py` | 613 | Table structure |
| `vision/recognizer.py` | 443 | Base recognizer class |
| `vision/operators.py` | 726 | Image preprocessing |
| `vision/postprocess.py` | 371 | Post-processing utilities |

---

*Document created for RAGFlow v0.22.1 analysis*