docs: Add comprehensive DeepDoc deep guide documentation
Created in-depth documentation for understanding the deepdoc module:

- README.md: complete deep guide with:
  - Big-picture explanation (what problem deepdoc solves)
  - Data-flow diagrams (Input → Processing → Output)
  - Detailed code analysis with line numbers
  - Technical explanations (ONNX, CTC, NMS, etc.)
  - Design reasoning (why certain technologies were chosen)
  - Glossary of difficult terms
  - Extension examples
- ocr_deep_dive.md: deep dive into the OCR subsystem
  - DBNet text detection architecture
  - CRNN text recognition
  - CTC decoding algorithm
  - Rotation handling
  - Performance optimization
- layout_table_deep_dive.md: deep dive into layout/table recognition
  - YOLOv10 layout detection
  - Table structure recognition
  - Grid construction algorithm
  - Spanning cell handling
  - HTML/descriptive output generation
parent 566bce428b, commit 6d4dbbfe2c

3 changed files with 2890 additions and 0 deletions:

- personal_analyze/07-DEEPDOC-DEEP-GUIDE/README.md (new file, 1286 lines; diff suppressed because it is too large)
- personal_analyze/07-DEEPDOC-DEEP-GUIDE/layout_table_deep_dive.md (new file, 926 lines)
# Layout & Table Recognition Deep Dive

## Overview

After OCR has extracted the text boxes, DeepDoc still needs two more steps:

1. **Layout Recognition**: classify page regions (Text, Title, Table, Figure, ...)
2. **Table Structure Recognition**: recognize table structure (rows, columns, cells)

A minimal sketch of how these stages chain together is shown below.
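The sketch chains the three stages in the order just described. It is a minimal sketch, assuming the `deepdoc.vision` import path and simple constructors (`OCR()`, `LayoutRecognizer("layout")`, `TableStructureRecognizer()`); the exact signatures and box formats vary between RAGFlow versions, so treat it as orientation, not reference.

```python
# Hedged sketch only: import path and constructor arguments are assumptions,
# and the box formats differ slightly between the real classes.
import cv2
from deepdoc.vision import OCR, LayoutRecognizer, TableStructureRecognizer

page = cv2.imread("page.png")

ocr = OCR()                              # see the OCR deep dive
ocr_boxes = ocr(page)                    # [(box_points, (text, conf)), ...]

layouter = LayoutRecognizer("layout")    # Section 1 below
ocr_res, layouts = layouter([page], [ocr_boxes])

tsr = TableStructureRecognizer()         # Section 2 below
structures = tsr([page])                 # rows / columns / spanning cells
```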
## File Structure

```
deepdoc/vision/
├── layout_recognizer.py            # Layout detection (457 lines)
├── table_structure_recognizer.py   # Table structure (613 lines)
└── recognizer.py                   # Base class (443 lines)
```

---
## 1. Layout Recognition (YOLOv10)

### 1.1 Layout Categories

```python
# deepdoc/vision/layout_recognizer.py, lines 34-46

labels = [
    "_background_",     # 0: Background (ignored)
    "Text",             # 1: Body text paragraphs
    "Title",            # 2: Section/document titles
    "Figure",           # 3: Images, diagrams, charts
    "Figure caption",   # 4: Text describing figures
    "Table",            # 5: Data tables
    "Table caption",    # 6: Text describing tables
    "Header",           # 7: Page headers
    "Footer",           # 8: Page footers
    "Reference",        # 9: Bibliography, citations
    "Equation",         # 10: Mathematical equations
]
```
### 1.2 YOLOv10 Architecture

```
YOLOv10 for Document Layout:

Input Image (640, 640, 3)
        │
        ▼
┌─────────────────────────────────────┐
│        CSPDarknet Backbone          │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐ │
│  │ P1  │→ │ P2  │→ │ P3  │→ │ P4  │ │
│  │ /2  │  │ /4  │  │ /8  │  │ /16 │ │
│  └─────┘  └─────┘  └─────┘  └─────┘ │
└─────────────────────────────────────┘
     │        │        │        │
     ▼        ▼        ▼        ▼
┌─────────────────────────────────────┐
│            PANet Neck               │
│  FPN (top-down) + PAN (bottom-up)   │
│  Multi-scale feature fusion         │
└─────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────┐
│    Detection Heads (3 scales)       │
│  Small  (80x80)  → tiny objects     │
│  Medium (40x40)  → normal objects   │
│  Large  (20x20)  → big objects      │
└─────────────────────────────────────┘
        │
        ▼
Raw Predictions:
[x_center, y_center, width, height, confidence, class_probs...]
```
### 1.3 Preprocessing (LayoutRecognizer4YOLOv10)

```python
# deepdoc/vision/layout_recognizer.py, lines 186-209

def preprocess(self, image_list):
    """
    Preprocess images for YOLOv10.

    Key steps:
    1. Resize maintaining aspect ratio
    2. Pad to 640x640 (gray borders)
    3. Normalize [0,255] → [0,1]
    4. Transpose HWC → CHW
    """
    processed = []
    scale_factors = []

    for img in image_list:
        h, w = img.shape[:2]

        # Calculate scale (preserve aspect ratio)
        r = min(640/h, 640/w)
        new_h, new_w = int(h*r), int(w*r)

        # Resize
        resized = cv2.resize(img, (new_w, new_h))

        # Calculate padding
        pad_top = (640 - new_h) // 2
        pad_left = (640 - new_w) // 2

        # Pad to 640x640 (gray: 114)
        padded = np.full((640, 640, 3), 114, dtype=np.uint8)
        padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized

        # Normalize and transpose
        padded = padded.astype(np.float32) / 255.0
        padded = padded.transpose(2, 0, 1)  # HWC → CHW

        processed.append(padded)
        scale_factors.append([1/r, 1/r, pad_left, pad_top])

    return np.stack(processed), scale_factors
```
**Visualization**:
```
Original image (1000x800):
┌────────────────────────────────────────┐
│                                        │
│           Document Content             │
│                                        │
└────────────────────────────────────────┘

After resize (scale=0.64) to (640x512):
┌────────────────────────────────────────┐
│                                        │
│           Document Content             │
│                                        │
└────────────────────────────────────────┘

After padding to (640x640):
┌────────────────────────────────────────┐
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
├────────────────────────────────────────┤
│                                        │
│           Document Content             │
│                                        │
├────────────────────────────────────────┤
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
└────────────────────────────────────────┘
```
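As a worked example of undoing this letterbox transform, the snippet below (illustrative numbers only, matching the visualization above) applies the `[1/r, 1/r, pad_left, pad_top]` scale factors exactly the way the postprocessing step in the next subsection does.

```python
# Illustrative numbers from the visualization above:
# a 1000x800 page, r = 0.64, pad_left = 0, pad_top = 64.
r, pad_left, pad_top = 0.64, 0, 64
scale = [1 / r, 1 / r, pad_left, pad_top]

# A detection in padded 640x640 space...
x0, y0, x1, y1 = 100.0, 164.0, 300.0, 264.0

# ...mapped back to original page coordinates (mirrors postprocess below):
x0_orig = (x0 - scale[2]) * scale[0]  # 156.25
y0_orig = (y0 - scale[3]) * scale[1]  # 156.25
x1_orig = (x1 - scale[2]) * scale[0]  # 468.75
y1_orig = (y1 - scale[3]) * scale[1]  # 312.5
```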
### 1.4 NMS Postprocessing

```python
# deepdoc/vision/recognizer.py, lines 330-407

def postprocess(self, boxes, inputs, thr):
    """
    YOLOv10 postprocessing with per-class NMS.
    """
    results = []

    for batch_idx, batch_boxes in enumerate(boxes):
        scale_factor = inputs["scale_factor"][batch_idx]

        # Filter by confidence threshold
        mask = batch_boxes[:, 4] > thr  # confidence > 0.2
        filtered = batch_boxes[mask]

        if len(filtered) == 0:
            results.append([])
            continue

        # Convert xywh → xyxy
        xyxy = self.xywh2xyxy(filtered[:, :4])

        # Remove padding offset
        xyxy[:, [0, 2]] -= scale_factor[2]  # pad_left
        xyxy[:, [1, 3]] -= scale_factor[3]  # pad_top

        # Scale back to original size
        xyxy[:, [0, 2]] *= scale_factor[0]  # scale_x
        xyxy[:, [1, 3]] *= scale_factor[1]  # scale_y

        # Per-class NMS
        class_ids = filtered[:, 5].astype(int)
        scores = filtered[:, 4]

        keep_indices = []
        for cls in np.unique(class_ids):
            cls_mask = class_ids == cls
            cls_boxes = xyxy[cls_mask]
            cls_scores = scores[cls_mask]

            # NMS within class
            keep = self.iou_filter(cls_boxes, cls_scores, iou_thresh=0.45)
            keep_indices.extend(np.where(cls_mask)[0][keep])

        # Build result
        batch_results = []
        for idx in keep_indices:
            batch_results.append({
                "type": self.labels[int(filtered[idx, 5])],
                "bbox": xyxy[idx].tolist(),
                "score": float(filtered[idx, 4])
            })

        results.append(batch_results)

    return results
```
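`iou_filter` itself is not quoted in this guide. The sketch below is a minimal greedy NMS in plain NumPy showing what such a filter plausibly does: keep the highest-scoring box, drop anything overlapping it beyond the IoU threshold, repeat.

```python
# Minimal greedy NMS sketch (an assumption about iou_filter, not its source).
import numpy as np

def iou_filter(boxes, scores, iou_thresh=0.45):
    """boxes: (N, 4) xyxy array; scores: (N,); returns indices to keep."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        # Intersection of the current best box with the remaining boxes
        xx0 = np.maximum(boxes[i, 0], rest[:, 0])
        yy0 = np.maximum(boxes[i, 1], rest[:, 1])
        xx1 = np.minimum(boxes[i, 2], rest[:, 2])
        yy1 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(xx1 - xx0, 0, None) * np.clip(yy1 - yy0, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop overlapping boxes
    return keep
```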
### 1.5 OCR-Layout Association

```python
# deepdoc/vision/layout_recognizer.py, lines 98-147

def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
    """
    Detect layouts and associate with OCR results.
    """
    # Step 1: Run layout detection
    page_layouts = super().__call__(image_list, thr, batch_size)

    # Step 2: Clean up overlapping layouts
    for i, layouts in enumerate(page_layouts):
        page_layouts[i] = self.layouts_cleanup(layouts, thr=0.7)

    # Step 3: Associate OCR boxes with layouts
    for page_idx, (ocr_boxes, layouts) in enumerate(zip(ocr_res, page_layouts)):
        # Sort layouts by priority: Footer → Header → Reference → Caption → Others
        layouts_by_priority = self._sort_by_priority(layouts)

        for ocr_box in ocr_boxes:
            # Find overlapping layout
            matched_layout = self.find_overlapped_with_threshold(
                ocr_box,
                layouts_by_priority,
                thr=0.4  # 40% overlap threshold
            )

            if matched_layout:
                ocr_box["layout_type"] = matched_layout["type"]
                ocr_box["layoutno"] = matched_layout.get("layoutno", 0)
            else:
                ocr_box["layout_type"] = "Text"  # Default to Text

    # Step 4: Filter garbage (headers, footers, page numbers)
    if drop:
        self._filter_garbage(ocr_res, page_layouts)

    return ocr_res, page_layouts
```
### 1.6 Garbage Detection

```python
# deepdoc/vision/layout_recognizer.py, lines 64-66

# Patterns to filter out
garbage_patterns = [
    r"^•+$",                        # Bullet points only
    r"^[0-9]{1,2} / ?[0-9]{1,2}$",  # Page numbers (3/10, 3 / 10)
    r"^[0-9]{1,2} of [0-9]{1,2}$",  # Page numbers (3 of 10)
    r"^http://[^ ]{12,}",           # Long URLs
    r"\(cid *: *[0-9]+ *\)",        # PDF character IDs
]

def is_garbage(text, layout_type, page_position):
    """
    Determine if text should be filtered out.

    Rules:
    - Headers at top 10% of page → keep
    - Footers at bottom 10% of page → keep
    - Headers/footers elsewhere → garbage
    - Page numbers → garbage
    - URLs → garbage
    """
    for pattern in garbage_patterns:
        if re.match(pattern, text):
            return True

    # Position-based filtering
    if layout_type == "Header" and page_position > 0.1:
        return True  # Header not at top
    if layout_type == "Footer" and page_position < 0.9:
        return True  # Footer not at bottom

    return False
```
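A few sanity checks against these rules, using the simplified `is_garbage` shown here (the strings and page positions are made up for illustration):

```python
# Checks against the simplified is_garbage above (hypothetical inputs).
assert is_garbage("3 / 10", "Text", page_position=0.5)           # page number
assert is_garbage("•••", "Text", page_position=0.5)              # bullets only
assert not is_garbage("Chapter 2", "Header", page_position=0.05) # header at top: keep
assert is_garbage("Chapter 2", "Header", page_position=0.5)      # header mid-page: drop
```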
---

## 2. Table Structure Recognition

### 2.1 Table Components

```python
# deepdoc/vision/table_structure_recognizer.py, lines 31-38

labels = [
    "table",                        # 0: Whole table boundary
    "table column",                 # 1: Column separators
    "table row",                    # 2: Row separators
    "table column header",          # 3: Header rows
    "table projected row header",   # 4: Row labels
    "table spanning cell",          # 5: Merged cells
]
```
### 2.2 Detection to Grid Construction

```
Detection Output → Table Grid:

┌─────────────────────────────────────────────────────────────────┐
│                        Raw Detections                           │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ table:          [0, 0, 500, 300]                         │  │
│  │ table row:      [0, 0, 500, 50], [0, 50, 500, 100], ...  │  │
│  │ table column:   [0, 0, 150, 300], [150, 0, 300, 300], ...│  │
│  │ table spanning cell: [0, 100, 300, 150]                  │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                       Alignment                          │  │
│  │  • Align row boundaries (left/right edges)               │  │
│  │  • Align column boundaries (top/bottom edges)            │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                   Grid Construction                      │  │
│  │                                                          │  │
│  │   ┌──────────┬──────────┬──────────┐                     │  │
│  │   │ Header 1 │ Header 2 │ Header 3 │ ← Row 0 (header)    │  │
│  │   ├──────────┴──────────┼──────────┤                     │  │
│  │   │   Spanning Cell     │  Cell 3  │ ← Row 1             │  │
│  │   ├──────────┬──────────┼──────────┤                     │  │
│  │   │  Cell 4  │  Cell 5  │  Cell 6  │ ← Row 2             │  │
│  │   └──────────┴──────────┴──────────┘                     │  │
│  │                                                          │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│                 HTML or Descriptive Output                      │
└─────────────────────────────────────────────────────────────────┘
```
### 2.3 Alignment Algorithm

```python
# deepdoc/vision/table_structure_recognizer.py, lines 67-111

def __call__(self, images, thr=0.2):
    """
    Detect and align table structure.
    """
    # Run detection
    detections = super().__call__(images, thr)

    for page_dets in detections:
        rows = [d for d in page_dets if d["label"] == "table row"]
        cols = [d for d in page_dets if d["label"] == "table column"]

        if len(rows) > 4:
            # Align row X coordinates (left edges)
            x0_values = [r["x0"] for r in rows]
            mean_x0 = np.mean(x0_values)
            min_x0 = np.min(x0_values)
            aligned_x0 = min(mean_x0, min_x0 + 0.05 * (max(x0_values) - min_x0))

            for r in rows:
                r["x0"] = aligned_x0

            # Align row X coordinates (right edges)
            x1_values = [r["x1"] for r in rows]
            mean_x1 = np.mean(x1_values)
            max_x1 = np.max(x1_values)
            aligned_x1 = max(mean_x1, max_x1 - 0.05 * (max_x1 - min(x1_values)))

            for r in rows:
                r["x1"] = aligned_x1

        if len(cols) > 4:
            # Similar alignment for column Y coordinates
            # ...
```
**Why is alignment needed?**

The detection model can produce boundaries that are not perfectly aligned:
```
Before alignment:
Row 1: x0=10, x1=490
Row 2: x0=12, x1=488
Row 3: x0=8,  x1=492

After alignment:
Row 1: x0=10, x1=490
Row 2: x0=10, x1=490
Row 3: x0=10, x1=490
```
### 2.4 Grid Construction

```python
# deepdoc/vision/table_structure_recognizer.py, lines 172-349

@staticmethod
def construct_table(boxes, is_english=False, html=True, **kwargs):
    """
    Construct 2D table from detected components.

    Args:
        boxes: OCR boxes with R (row), C (column), SP (spanning) attributes
        is_english: Language hint
        html: Output format (HTML or descriptive text)

    Returns:
        HTML table string or descriptive text
    """
    # Step 1: Extract caption
    caption = ""
    for box in boxes[:]:
        if is_caption(box):
            caption = box["text"]
            boxes.remove(box)

    # Step 2: Sort by row position (R attribute)
    rowh = np.median([b["bottom"] - b["top"] for b in boxes])
    boxes = Recognizer.sort_R_firstly(boxes, rowh / 2)

    # Step 3: Group into rows
    rows = []
    current_row = [boxes[0]]

    for box in boxes[1:]:
        # Same row if R difference < row_height/2
        if abs(box["R"] - current_row[-1]["R"]) < rowh / 2:
            current_row.append(box)
        else:
            rows.append(current_row)
            current_row = [box]
    rows.append(current_row)

    # Step 4: Sort each row by column position (C attribute)
    for row in rows:
        row.sort(key=lambda x: x["C"])

    # Step 5: Build 2D table matrix
    n_rows = len(rows)
    n_cols = max(len(row) for row in rows)

    table = [[None] * n_cols for _ in range(n_rows)]

    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            table[i][j] = cell

    # Step 6: Handle spanning cells
    table = handle_spanning_cells(table, boxes)

    # Step 7: Generate output
    if html:
        return generate_html_table(table, caption)
    else:
        return generate_descriptive_text(table, caption)
```
### 2.5 Spanning Cell Handling

```python
# deepdoc/vision/table_structure_recognizer.py, lines 496-575

def __cal_spans(self, boxes, rows, cols):
    """
    Calculate colspan and rowspan for merged cells.

    Spanning cell detection:
    - "SP" attribute indicates a merged cell
    - Calculate which rows/cols it covers
    """
    for box in boxes:
        if "SP" not in box:
            continue

        # Find rows this cell spans
        box["rowspan"] = []
        for i, row in enumerate(rows):
            overlap = self.overlapped_area(box, row)
            if overlap > 0.3:  # 30% overlap
                box["rowspan"].append(i)

        # Find columns this cell spans
        box["colspan"] = []
        for j, col in enumerate(cols):
            overlap = self.overlapped_area(box, col)
            if overlap > 0.3:
                box["colspan"].append(j)

    return boxes
```
**Example**:
```
Spanning cell detection:

┌──────────┬──────────┬──────────┐
│ Header 1 │ Header 2 │ Header 3 │
├──────────┴──────────┼──────────┤
│    Merged Cell      │  Cell 3  │  ← SP cell spans columns 0-1
│    (colspan=2)      │          │
├──────────┬──────────┼──────────┤
│  Cell 4  │  Cell 5  │  Cell 6  │
└──────────┴──────────┴──────────┘

Detection:
- SP cell bbox: [0, 50, 300, 100]
- Column 0: [0, 0, 150, 200]   → overlap 0.5 ✓
- Column 1: [150, 0, 300, 200] → overlap 0.5 ✓
- Column 2: [300, 0, 450, 200] → overlap 0.0 ✗
→ colspan = [0, 1]
```
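`overlapped_area` is not quoted above. A plausible reading, which reproduces the numbers in this example, is the fraction of the cell's own area covered by the row/column band:

```python
# Sketch of overlapped_area (an assumption, not the module's source):
# intersection area divided by the cell box's own area.
def overlapped_area(box, band):
    """box, band: (x0, y0, x1, y1) tuples; returns intersection / box area."""
    x0 = max(box[0], band[0]); y0 = max(box[1], band[1])
    x1 = min(box[2], band[2]); y1 = min(box[3], band[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / box_area if box_area else 0.0

sp_cell = (0, 50, 300, 100)
cols = [(0, 0, 150, 200), (150, 0, 300, 200), (300, 0, 450, 200)]
colspan = [j for j, c in enumerate(cols) if overlapped_area(sp_cell, c) > 0.3]
# colspan == [0, 1], matching the example above (overlaps 0.5, 0.5, 0.0)
```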
### 2.6 HTML Output Generation

```python
# deepdoc/vision/table_structure_recognizer.py, lines 352-393

def __html_table(table, header_rows, caption):
    """
    Generate HTML table from 2D matrix.
    """
    html_parts = ["<table>"]

    # Add caption if it exists
    if caption:
        html_parts.append(f"<caption>{caption}</caption>")

    for i, row in enumerate(table):
        html_parts.append("<tr>")

        for j, cell in enumerate(row):
            if cell is None:
                continue  # Skip cells covered by spanning

            # Determine tag (th for header, td for data)
            tag = "th" if i in header_rows else "td"

            # Add colspan/rowspan attributes
            attrs = []
            if cell.get("colspan") and len(cell["colspan"]) > 1:
                attrs.append(f'colspan="{len(cell["colspan"])}"')
            if cell.get("rowspan") and len(cell["rowspan"]) > 1:
                attrs.append(f'rowspan="{len(cell["rowspan"])}"')

            attr_str = " " + " ".join(attrs) if attrs else ""

            # Add cell content
            html_parts.append(f"<{tag}{attr_str}>{cell['text']}</{tag}>")

        html_parts.append("</tr>")

    html_parts.append("</table>")

    return "\n".join(html_parts)
```
**Output Example**:
```html
<table>
  <caption>Table 1: Sales Data</caption>
  <tr>
    <th>Region</th>
    <th>Q1</th>
    <th>Q2</th>
  </tr>
  <tr>
    <td colspan="2">North America</td>
    <td>$150K</td>
  </tr>
  <tr>
    <td>Europe</td>
    <td>$100K</td>
    <td>$120K</td>
  </tr>
</table>
```
### 2.7 Descriptive Text Output

```python
# deepdoc/vision/table_structure_recognizer.py, lines 396-493

def __desc_table(table, header_rows, caption):
    """
    Generate a natural-language description of the table.

    For RAG, descriptive text is sometimes better than HTML.
    """
    descriptions = []

    # Get headers
    headers = [cell["text"] for cell in table[0]] if header_rows else []

    # Process each data row
    for i, row in enumerate(table):
        if i in header_rows:
            continue

        row_desc = []
        for j, cell in enumerate(row):
            if cell is None:
                continue

            if headers and j < len(headers):
                # "Column Name: Value" format
                row_desc.append(f"{headers[j]}: {cell['text']}")
            else:
                row_desc.append(cell['text'])

        if row_desc:
            descriptions.append("; ".join(row_desc))

    # Add source reference
    if caption:
        descriptions.append(f'(from "{caption}")')

    return "\n".join(descriptions)
```
**Output Example**:
```
Region: North America; Q1: $100K; Q2: $150K
Region: Europe; Q1: $80K; Q2: $120K
(from "Table 1: Sales Data")
```
---

## 3. Cell Content Classification

### 3.1 Block Type Detection

```python
# deepdoc/vision/table_structure_recognizer.py, lines 121-149

@staticmethod
def blockType(text):
    """
    Classify cell content type.

    Used for:
    - Header detection (non-numeric cells are likely headers)
    - Data validation
    - Smart formatting
    """
    patterns = {
        "Dt": r"(^[0-9]{4}[-/][0-9]{1,2}|[0-9]{1,2}[-/][0-9]{1,2}[-/][0-9]{2,4}|"
              r"[0-9]{1,2}月|[Q][1-4]|[一二三四]季度)",  # Date
        "Nu": r"^[-+]?[0-9.,%¥$€£]+$",  # Number
        "Ca": r"^[A-Z0-9]{4,}$",        # Code
        "En": r"^[a-zA-Z\s]+$",         # English
    }

    for type_name, pattern in patterns.items():
        if re.search(pattern, text):
            return type_name

    # Classify by length
    tokens = text.split()
    if len(tokens) == 1:
        return "Sg"   # Single token
    elif len(tokens) <= 3:
        return "Tx"   # Short text
    elif len(tokens) <= 12:
        return "Lx"   # Long text
    else:
        return "Ot"   # Other

# Examples:
# "2023-01-15"    → "Dt" (Date)
# "$1,234.56"     → "Nu" (Number)
# "ABC123"        → "Ca" (Code)
# "Total Revenue" → "En" (English)
# "北京市"         → "Sg" (single token, no pattern match)
```
### 3.2 Header Detection

```python
# deepdoc/vision/table_structure_recognizer.py, lines 332-344

def detect_headers(table):
    """
    Detect which rows are headers based on content type.

    Heuristic: if >50% of cells in a row are non-numeric,
    it is likely a header row.
    """
    header_rows = set()

    for i, row in enumerate(table):
        non_numeric = 0
        total = 0

        for cell in row:
            if cell is None:
                continue
            total += 1
            if blockType(cell["text"]) != "Nu":
                non_numeric += 1

        if total > 0 and non_numeric / total > 0.5:
            header_rows.add(i)

    return header_rows
```
---

## 4. Integration with the PDF Parser

### 4.1 Table Detection in the PDF Pipeline

```python
# deepdoc/parser/pdf_parser.py, lines 196-281

def _table_transformer_job(self, zoomin=3):
    """
    Detect and structure tables using TableStructureRecognizer.
    """
    # Find table layouts
    table_layouts = [
        layout for layout in self.page_layout
        if layout["type"] == "Table"
    ]

    if not table_layouts:
        return

    # Crop table images
    table_images = []
    for layout in table_layouts:
        x0, y0, x1, y1 = layout["bbox"]
        img = self.page_images[layout["page"]][
            int(y0*zoomin):int(y1*zoomin),
            int(x0*zoomin):int(x1*zoomin)
        ]
        table_images.append(img)

    # Run TSR
    table_structures = self.tsr(table_images)

    # Match OCR boxes to table structure
    for layout, structure in zip(table_layouts, table_structures):
        # Get OCR boxes within the table region
        table_boxes = [
            box for box in self.boxes
            if self._box_in_region(box, layout["bbox"])
        ]

        # Assign R, C, SP attributes
        for box in table_boxes:
            box["R"] = self._find_row(box, structure["rows"])
            box["C"] = self._find_column(box, structure["columns"])
            if self._is_spanning(box, structure["spanning_cells"]):
                box["SP"] = True

        # Store for later extraction (bbox kept for cropping in 4.2)
        self.tb_cpns[layout["id"]] = {
            "boxes": table_boxes,
            "bbox": layout["bbox"],
            "structure": structure
        }
```
### 4.2 Table Extraction

```python
# deepdoc/parser/pdf_parser.py, lines 757-930

def _extract_table_figure(self, need_image, ZM, return_html, need_position):
    """
    Extract tables and figures from detected layouts.
    """
    tables = []

    for layout_id, table_data in self.tb_cpns.items():
        boxes = table_data["boxes"]

        # Construct table (HTML or descriptive, depending on return_html)
        content = TableStructureRecognizer.construct_table(
            boxes, html=return_html
        )

        table = {
            "content": content,
            "bbox": table_data["bbox"],
        }

        if need_image:
            table["image"] = self._crop_region(table_data["bbox"])

        tables.append(table)

    return tables
```
---

## 5. Performance Considerations

### 5.1 Batch Processing

```python
# deepdoc/vision/recognizer.py, lines 415-437

def __call__(self, image_list, thr=0.7, batch_size=16):
    """
    Batch inference for efficiency.

    Why batch_size=16?
    - GPU memory optimization
    - Balance of throughput vs. latency
    - A typical document has 10-50 elements
    """
    results = []

    for i in range(0, len(image_list), batch_size):
        batch = image_list[i:i+batch_size]

        # Preprocess
        inputs = self.preprocess(batch)

        # Inference
        outputs = self.ort_sess.run(None, inputs)

        # Postprocess
        batch_results = self.postprocess(outputs, inputs, thr)
        results.extend(batch_results)

    return results
```
### 5.2 Model Caching

```python
# deepdoc/vision/ocr.py, lines 36-73

# Global model cache
loaded_models = {}

def load_model(model_dir, nm, device_id=None):
    """
    Load ONNX model with caching.

    Cache key: model_path + device_id
    """
    model_path = os.path.join(model_dir, f"{nm}.onnx")
    cache_key = f"{model_path}_{device_id}"

    if cache_key in loaded_models:
        return loaded_models[cache_key]

    # Load model...
    session = ort.InferenceSession(model_path, ...)

    loaded_models[cache_key] = (session, run_opts)
    return session, run_opts
```
---

## 6. Troubleshooting

### 6.1 Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Missing table | Low confidence | Lower threshold (0.1-0.2) |
| Wrong colspan | Misaligned detection | Check row/column alignment |
| Merged cells wrong | Overlap threshold | Adjust SP detection threshold |
| Headers not detected | All numeric | Manual header specification |
| Layout overlap | NMS threshold | Increase NMS IoU threshold |
### 6.2 Debugging

```python
# Visualize layout detection
from deepdoc.vision.seeit import draw_boxes

# Draw layout boxes on the image
layout_vis = draw_boxes(
    page_image,
    [(l["bbox"], l["type"]) for l in page_layouts],
    colors={
        "Text": (0, 255, 0),
        "Table": (255, 0, 0),
        "Figure": (0, 0, 255),
    }
)
cv2.imwrite("layout_debug.png", layout_vis)

# Check table structure
for box in table_boxes:
    print(f"Text: {box['text']}")
    print(f"  Row: {box.get('R', 'N/A')}")
    print(f"  Col: {box.get('C', 'N/A')}")
    print(f"  Spanning: {box.get('SP', False)}")
```
---

## 7. References

- YOLOv10: [YOLOv10: Real-Time End-to-End Object Detection](https://arxiv.org/abs/2405.14458)
- Table Transformer: [PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents](https://arxiv.org/abs/2110.00061)
- Document Layout Analysis: [A Survey](https://arxiv.org/abs/2012.15005)
personal_analyze/07-DEEPDOC-DEEP-GUIDE/ocr_deep_dive.md (new file, 678 lines)
# OCR Deep Dive

## Overview

The OCR module in DeepDoc performs two main tasks:

1. **Text Detection**: detect the regions of an image that contain text
2. **Text Recognition**: recognize the text inside the detected regions
## File Structure

```
deepdoc/vision/
├── ocr.py          # Main OCR class (752 lines)
├── postprocess.py  # CTC decoder, DBNet postprocess (371 lines)
└── operators.py    # Image preprocessing (726 lines)
```

---
## 1. Text Detection (DBNet)

### 1.1 Model Architecture

```
DBNet (Differentiable Binarization Network):

Input Image (H, W, 3)
        │
        ▼
┌─────────────────────────────────────┐
│        ResNet-18 Backbone           │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐ │
│  │ C1  │→ │ C2  │→ │ C3  │→ │ C4  │ │
│  │64ch │  │128ch│  │256ch│  │512ch│ │
│  └─────┘  └─────┘  └─────┘  └─────┘ │
└─────────────────────────────────────┘
     │        │        │        │
     ▼        ▼        ▼        ▼
┌─────────────────────────────────────┐
│      Feature Pyramid Network        │
│  Upsample + concatenate all levels  │
│  Output: 256 channels               │
└─────────────────────────────────────┘
        │
        ├─────────────────┐
        ▼                 ▼
┌─────────────────┐ ┌─────────────────┐
│   Probability   │ │    Threshold    │
│      Head       │ │      Head       │
│  Conv → Sigmoid │ │  Conv → Sigmoid │
└────────┬────────┘ └────────┬────────┘
         │                   │
         ▼                   ▼
   Prob Map (H, W)     Thresh Map (H, W)
         │                   │
         └─────────┬─────────┘
                   ▼
┌─────────────────────────────────────┐
│   Differentiable Binarization       │
│   B = sigmoid((P - T) * k)          │
│   k = 50 (amplification factor)     │
└─────────────────────────────────────┘
        │
        ▼
Binary Map (H, W)
```
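To make the binarization step concrete, here is a minimal NumPy sketch of the formula from the diagram, `B = sigmoid((P - T) * k)` with `k = 50`, on toy values rather than real trained maps:

```python
# Toy illustration of differentiable binarization; values are made up.
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50):
    """Approximates the hard step P > T, but stays differentiable."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

P = np.array([[0.9, 0.4],
              [0.2, 0.7]])
T = np.full_like(P, 0.5)
B = differentiable_binarization(P, T)
# B ≈ [[1.0, 0.007], [0.0, 1.0]]: close to the hard binary map,
# but with smooth gradients near the threshold.
```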
### 1.2 DBNet Post-processing

```python
# deepdoc/vision/postprocess.py, lines 41-259

class DBPostProcess:
    def __init__(self,
                 thresh=0.3,           # Binary threshold
                 box_thresh=0.5,       # Box confidence threshold
                 max_candidates=1000,  # Maximum text regions
                 unclip_ratio=1.5,     # Polygon expansion ratio
                 use_dilation=False,   # Morphological dilation
                 score_mode="fast"):   # fast or slow scoring
        self.thresh = thresh
        self.box_thresh = box_thresh
        self.max_candidates = max_candidates
        self.unclip_ratio = unclip_ratio
        self.use_dilation = use_dilation
        self.score_mode = score_mode

    def __call__(self, outs_dict, shape_list):
        """
        Post-process DBNet output.

        Args:
            outs_dict: {"maps": probability_map}
            shape_list: Original image shapes

        Returns:
            List of detected text boxes
        """
        pred = outs_dict["maps"]  # (N, 1, H, W)

        # Step 1: Binary thresholding
        bitmap = pred > self.thresh  # 0.3

        # Step 2: Optional dilation
        if self.use_dilation:
            kernel = np.ones((2, 2))
            bitmap = cv2.dilate(bitmap.astype(np.uint8), kernel)

        # Step 3: Find contours (OpenCV 4 returns (contours, hierarchy))
        contours, _ = cv2.findContours(
            bitmap.astype(np.uint8),
            cv2.RETR_LIST,
            cv2.CHAIN_APPROX_SIMPLE
        )

        # Step 4: Process each contour
        boxes = []
        for contour in contours[:self.max_candidates]:
            # Simplify polygon
            epsilon = 0.002 * cv2.arcLength(contour, True)
            approx = cv2.approxPolyDP(contour, epsilon, True)

            if len(approx) < 4:
                continue

            # Calculate confidence score
            score = self.box_score_fast(pred, approx)
            if score < self.box_thresh:
                continue

            # Unclip (expand) polygon
            box = self.unclip(approx, self.unclip_ratio)
            boxes.append(box)

        return boxes
```
### 1.3 Unclipping Algorithm

**Problem**: DBNet tends to predict tight boundaries, so characters at the edges get cut off.

**Solution**: Expand the detected polygon by `unclip_ratio`.

```python
# deepdoc/vision/postprocess.py, lines 163-169

def unclip(self, box, unclip_ratio):
    """
    Expand polygon using the Clipper library.

    Formula:
        distance = area * unclip_ratio / perimeter

    With unclip_ratio = 1.5:
    - smaller polygons get a relatively larger expansion
    - larger polygons a relatively smaller one (expansion stays
      proportional to polygon size)
    """
    poly = Polygon(box)
    distance = poly.area * unclip_ratio / poly.length

    offset = pyclipper.PyclipperOffset()
    offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)

    expanded = offset.Execute(distance)
    return np.array(expanded[0])
```
**Visualization**:
```
Original detection:          After unclip (1.5x):
┌──────────────┐             ┌────────────────────┐
│    Hello     │      →      │       Hello        │
└──────────────┘             └────────────────────┘
                             (expanded boundaries)
```
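Plugging in numbers: for a 100x20 px detected box (toy values, not from the module), the offset distance works out as follows.

```python
# Worked example of the unclip distance for a 100x20 px box.
area = 100 * 20              # 2000
perimeter = 2 * (100 + 20)   # 240
unclip_ratio = 1.5
distance = area * unclip_ratio / perimeter  # 2000 * 1.5 / 240 = 12.5 px
# Each edge of the polygon is pushed outward by about 12.5 px.
```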
---

## 2. Text Recognition (CRNN)

### 2.1 Model Architecture

```
CRNN (Convolutional Recurrent Neural Network):

Input: Cropped text image (3, 48, W)
        │
        ▼
┌─────────────────────────────────────┐
│          CNN Backbone               │
│  VGG-style convolutions             │
│  7 conv layers + 4 max pooling      │
│  Output: (512, 1, W/4)              │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│       Sequence Reshaping            │
│  Collapse height dimension          │
│  Output: (W/4, 512)                 │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│      Bidirectional LSTM             │
│  2 layers, 256 hidden units         │
│  Output: (W/4, 512)                 │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│      Classification Head            │
│  Linear(512 → num_classes)          │
│  Output: (W/4, num_classes)         │
└────────────────┬────────────────────┘
                 │
                 ▼
Probability Matrix (T, C)
T = time steps, C = characters
```
### 2.2 CTC Decoding

```python
# deepdoc/vision/postprocess.py, lines 347-370

class CTCLabelDecode(BaseRecLabelDecode):
    """
    CTC (Connectionist Temporal Classification) Decoder.

    CTC solves the following problem:
    - the model output has T time steps
    - the ground truth has N characters
    - T > N (several frames per character)
    - the exact alignment is unknown

    CTC adds a special "blank" token (ε):
    - represents "no output"
    - allows alignment without explicit segmentation
    """

    def __init__(self, character_dict_path, use_space_char=False):
        super().__init__(character_dict_path, use_space_char)
        # Prepend blank token at index 0
        self.character = ['blank'] + self.character

    def __call__(self, preds, label=None):
        """
        Decode CTC output.

        Args:
            preds: (batch, time, num_classes) probability matrix

        Returns:
            [(text, confidence), ...]
        """
        # Get the most probable character at each time step
        preds_idx = preds.argmax(axis=2)   # (batch, time)
        preds_prob = preds.max(axis=2)     # (batch, time)

        # Decode with deduplication
        result = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)

        return result

    def decode(self, text_index, text_prob, is_remove_duplicate=True):
        """
        CTC decoding algorithm.

        Example:
            Raw output:   [a, a, ε, l, l, ε, p, h, a]
            After dedup:  [a, ε, l, ε, p, h, a]
            Remove blank: [a, l, p, h, a]
            Final:        "alpha"
        """
        result = []

        for batch_idx in range(len(text_index)):
            char_list = []
            conf_list = []

            for idx in range(len(text_index[batch_idx])):
                char_idx = text_index[batch_idx][idx]

                # Skip blank token (index 0)
                if char_idx == 0:
                    continue

                # Skip consecutive duplicates
                if is_remove_duplicate:
                    if idx > 0 and char_idx == text_index[batch_idx][idx-1]:
                        continue

                char_list.append(self.character[char_idx])
                conf_list.append(text_prob[batch_idx][idx])

            text = ''.join(char_list)
            conf = np.mean(conf_list) if conf_list else 0.0

            result.append((text, conf))

        return result
```
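A self-contained toy run of the same greedy collapse, reproducing the "alpha" example from the docstring (the alphabet here is made up for the demo):

```python
# Toy greedy CTC collapse; blank = index 0, alphabet is hypothetical.
characters = ['blank', 'a', 'h', 'l', 'p']
path = [1, 1, 0, 3, 3, 0, 4, 2, 1]   # a a ε l l ε p h a

decoded, prev = [], None
for idx in path:
    if idx != 0 and idx != prev:      # skip blanks and frame-level repeats
        decoded.append(characters[idx])
    prev = idx

print(''.join(decoded))  # alpha
```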
### 2.3 Aspect Ratio Handling

```python
# deepdoc/vision/ocr.py, lines 146-170

def resize_norm_img(self, img, max_wh_ratio):
    """
    Resize image maintaining aspect ratio.

    Problem: text crops have varying widths
    - "Hi"          → narrow
    - "Hello World" → wide

    Solution: resize by aspect ratio, pad the right side
    """
    imgC, imgH, imgW = self.rec_image_shape  # [3, 48, 320]

    # Calculate target width from aspect ratio
    max_width = int(imgH * max_wh_ratio)
    max_width = min(max_width, imgW)  # Cap at 320

    h, w = img.shape[:2]
    ratio = w / float(h)

    # Resize maintaining aspect ratio
    if ratio * imgH > max_width:
        resized_w = max_width
    else:
        resized_w = int(ratio * imgH)

    resized_img = cv2.resize(img, (resized_w, imgH))

    # Pad right side to max_width
    padded = np.zeros((imgH, max_width, 3), dtype=np.float32)
    padded[:, :resized_w, :] = resized_img

    # Normalize: [0, 255] → [-1, 1]
    padded = (padded / 255.0 - 0.5) / 0.5

    # Transpose: HWC → CHW
    padded = padded.transpose(2, 0, 1)

    return padded
```
**Visualization**:
```
Original images:
┌──────┐  ┌────────────────┐  ┌──────────────────────┐
│  Hi  │  │     Hello      │  │     Hello World      │
└──────┘  └────────────────┘  └──────────────────────┘
 narrow        medium                  wide

After resize + pad (to width 320):
┌──────────────────────────────────────────────────────┐
│ Hi      │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
├──────────────────────────────────────────────────────┤
│ Hello             │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
├──────────────────────────────────────────────────────┤
│ Hello World             │░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
└──────────────────────────────────────────────────────┘
(░ = zero padding)
```
---

## 3. Full OCR Pipeline

### 3.1 OCR Class

```python
# deepdoc/vision/ocr.py, lines 536-752

class OCR:
    """
    End-to-end OCR pipeline.

    Usage:
        ocr = OCR()
        results = ocr(image)
        # results: [(box_points, (text, confidence)), ...]
    """

    def __init__(self, model_dir=None):
        # Auto-download models if not found
        if model_dir is None:
            model_dir = self._get_model_dir()

        # Initialize detector and recognizer
        self.text_detector = TextDetector(model_dir)
        self.text_recognizer = TextRecognizer(model_dir)

    def __call__(self, img, device_id=0, cls=True):
        """
        Full OCR pipeline.

        Args:
            img: numpy array (H, W, 3) in BGR
            device_id: GPU device ID
            cls: Whether to check text orientation

        Returns:
            [(box_4pts, (text, confidence)), ...]
        """
        # Step 1: Detect text regions
        dt_boxes, det_time = self.text_detector(img)

        if dt_boxes is None or len(dt_boxes) == 0:
            return []

        # Step 2: Sort boxes by reading order
        dt_boxes = self.sorted_boxes(dt_boxes)

        # Step 3: Crop and rotate each text region
        img_crop_list = []
        for box in dt_boxes:
            tmp_box = self.get_rotate_crop_image(img, box)
            img_crop_list.append(tmp_box)

        # Step 4: Recognize text
        rec_res, rec_time = self.text_recognizer(img_crop_list)

        # Step 5: Filter by confidence
        results = []
        for box, rec in zip(dt_boxes, rec_res):
            text, score = rec
            if score >= 0.5:  # drop_score threshold
                results.append((box, (text, score)))

        return results
```
### 3.2 Rotation Detection

```python
# deepdoc/vision/ocr.py, lines 584-638

def get_rotate_crop_image(self, img, points):
    """
    Crop text region with automatic rotation detection.

    Problem: text may be rotated 90° or 270°
    Solution: try multiple orientations, keep the best recognition score
    """
    # Order points: top-left → top-right → bottom-right → bottom-left
    rect = self.order_points_clockwise(points)

    # Perspective transform to get a rectangular crop
    width = int(max(
        np.linalg.norm(rect[0] - rect[1]),
        np.linalg.norm(rect[2] - rect[3])
    ))
    height = int(max(
        np.linalg.norm(rect[0] - rect[3]),
        np.linalg.norm(rect[1] - rect[2])
    ))

    dst = np.array([
        [0, 0],
        [width, 0],
        [width, height],
        [0, height]
    ], dtype=np.float32)

    M = cv2.getPerspectiveTransform(rect, dst)
    warped = cv2.warpPerspective(img, M, (width, height))

    # Check if text is vertical (needs rotation)
    if warped.shape[0] / warped.shape[1] >= 1.5:
        # Try 3 orientations
        orientations = [
            (warped, 0),  # Original
            (cv2.rotate(warped, cv2.ROTATE_90_CLOCKWISE), 90),
            (cv2.rotate(warped, cv2.ROTATE_90_COUNTERCLOCKWISE), -90)
        ]

        best_score = -1
        best_img = warped

        for rot_img, angle in orientations:
            # Quick recognition to get confidence
            _, score = self.text_recognizer([rot_img])[0]
            if score > best_score:
                best_score = score
                best_img = rot_img

        warped = best_img

    return warped
```
### 3.3 Reading Order Sorting

```python
# deepdoc/vision/ocr.py, lines 640-661

def sorted_boxes(self, dt_boxes):
    """
    Sort boxes by reading order (top-to-bottom, left-to-right).

    Algorithm:
    1. Sort by Y coordinate (top of box)
    2. Within the same "row" (Y within 10 px), sort by X coordinate
    """
    num_boxes = len(dt_boxes)
    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))

    # Group into rows and sort each row
    _boxes = list(sorted_boxes)

    for i in range(num_boxes - 1):
        for j in range(i, -1, -1):
            # If boxes are on the same row (Y difference < 10)
            if abs(_boxes[j+1][0][1] - _boxes[j][0][1]) < 10:
                # Sort by X coordinate
                if _boxes[j+1][0][0] < _boxes[j][0][0]:
                    _boxes[j], _boxes[j+1] = _boxes[j+1], _boxes[j]
            else:
                break

    return _boxes
```
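To see the same-row fix-up in action, here is a standalone toy run mirroring the logic above (box coordinates are made up):

```python
# Toy run: box B starts 5 px lower than box A but further left,
# so the same-row pass swaps them into left-to-right order.
import numpy as np

A = np.array([[300, 100], [400, 100], [400, 130], [300, 130]])  # right box
B = np.array([[50, 105], [200, 105], [200, 135], [50, 135]])    # left box

boxes = sorted([A, B], key=lambda x: (x[0][1], x[0][0]))  # [A, B]: A has smaller Y

# Same-row fix-up: |105 - 100| < 10 and B is left of A, so swap.
if abs(boxes[1][0][1] - boxes[0][0][1]) < 10 and boxes[1][0][0] < boxes[0][0][0]:
    boxes[0], boxes[1] = boxes[1], boxes[0]
# Reading order is now [B, A]: the left box comes first.
```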
---

## 4. Performance Optimization

### 4.1 GPU Memory Management

```python
# deepdoc/vision/ocr.py, lines 96-127

def load_model(model_dir, nm, device_id=None):
    """
    Load ONNX model with optimized settings.
    """
    model_path = os.path.join(model_dir, f"{nm}.onnx")

    options = ort.SessionOptions()

    # Reduce memory fragmentation
    options.enable_cpu_mem_arena = False

    # Sequential execution (more predictable memory)
    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    # Limit thread usage
    options.intra_op_num_threads = 2
    options.inter_op_num_threads = 2

    # GPU configuration
    if torch.cuda.is_available() and device_id is not None:
        providers = [
            ('CUDAExecutionProvider', {
                'device_id': device_id,
                # Limit GPU memory (2GB by default)
                'gpu_mem_limit': int(os.getenv('OCR_GPU_MEM_LIMIT_MB', 2048)) * 1024 * 1024,
                # Memory allocation strategy
                'arena_extend_strategy': os.getenv('OCR_ARENA_EXTEND_STRATEGY', 'kNextPowerOfTwo'),
            })
        ]
    else:
        providers = ['CPUExecutionProvider']

    session = ort.InferenceSession(model_path, options, providers=providers)

    # Run options for memory cleanup after each run
    run_opts = ort.RunOptions()
    run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

    return session, run_opts
```
### 4.2 Batch Processing Optimization

```python
# deepdoc/vision/ocr.py, lines 363-408

def __call__(self, img_list):
    """
    Optimized batch recognition.
    """
    # Sort images by aspect ratio for efficient batching:
    # similar widths → less padding waste
    indices = np.argsort([img.shape[1]/img.shape[0] for img in img_list])

    results = [None] * len(img_list)

    for batch_start in range(0, len(indices), self.batch_size):
        batch_indices = indices[batch_start:batch_start + self.batch_size]

        # Get the max aspect ratio in the batch for padding
        max_wh_ratio = max(img_list[i].shape[1]/img_list[i].shape[0]
                           for i in batch_indices)

        # Normalize all images to the same width
        norm_imgs = []
        for i in batch_indices:
            norm_img = self.resize_norm_img(img_list[i], max_wh_ratio)
            norm_imgs.append(norm_img)

        # Stack into a batch
        batch = np.stack(norm_imgs)

        # Run inference
        preds = self.ort_sess.run(None, {"input": batch})

        # Decode results
        texts = self.postprocess_op(preds[0])

        # Map back to original indices
        for j, idx in enumerate(batch_indices):
            results[idx] = texts[j]

    return results
```
### 4.3 Multi-GPU Parallel Processing

```python
# deepdoc/vision/ocr.py, lines 556-579

class OCR:
    def __init__(self, model_dir=None):
        if settings.PARALLEL_DEVICES > 0:
            # Create per-GPU instances
            self.text_detector = [
                TextDetector(model_dir, device_id)
                for device_id in range(settings.PARALLEL_DEVICES)
            ]
            self.text_recognizer = [
                TextRecognizer(model_dir, device_id)
                for device_id in range(settings.PARALLEL_DEVICES)
            ]
        else:
            # Single instance for CPU/single GPU
            self.text_detector = TextDetector(model_dir)
            self.text_recognizer = TextRecognizer(model_dir)
```
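How work is then spread across these per-device instances is not shown here. A hedged sketch of one plausible scheme (round-robin by page index, with a hypothetical `ocr_pages` helper) looks like this:

```python
# Hypothetical dispatch sketch, not the verbatim RAGFlow code:
# fan pages out across the per-GPU detector instances created above.
from concurrent.futures import ThreadPoolExecutor

def ocr_pages(ocr, page_images, n_devices):
    def run(args):
        page_idx, img = args
        dev = page_idx % n_devices            # round-robin device pick
        return ocr.text_detector[dev](img)    # (dt_boxes, det_time)
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        return list(pool.map(run, enumerate(page_images)))
```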
---

## 5. Troubleshooting

### 5.1 Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Low accuracy | Low-resolution input | Increase zoomin factor (3-5) |
| Slow inference | Large images | Resize to max 960 px |
| Memory error | Too many candidates | Reduce max_candidates |
| Missing text | Tight boundaries | Increase unclip_ratio |
| Wrong orientation | Vertical text | Enable rotation detection |
### 5.2 Debugging Tips

```python
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Visualize detections
from deepdoc.vision.seeit import draw_boxes

img_with_boxes = draw_boxes(img, dt_boxes)
cv2.imwrite("debug_detection.png", img_with_boxes)

# Check confidence scores
for box, (text, conf) in results:
    print(f"Text: {text}, Confidence: {conf:.2f}")
    if conf < 0.5:
        print("  ⚠️ Low confidence!")
```
---

## 6. References

- DBNet: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
- CRNN: [An End-to-End Trainable Neural Network for Image-based Sequence Recognition](https://arxiv.org/abs/1507.05717)
- CTC: [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- PaddleOCR: [GitHub](https://github.com/PaddlePaddle/PaddleOCR)