docs: Add comprehensive DeepDoc deep guide documentation

Created in-depth documentation for understanding the deepdoc module:

- README.md: Complete deep guide with:
  - Big picture explanation (what problem deepdoc solves)
  - Data flow diagrams (Input → Processing → Output)
  - Detailed code analysis with line numbers
  - Technical explanations (ONNX, CTC, NMS, etc.)
  - Design reasoning (why certain technologies chosen)
  - Difficult terms glossary
  - Extension examples

- ocr_deep_dive.md: Deep dive into OCR subsystem
  - DBNet text detection architecture
  - CRNN text recognition
  - CTC decoding algorithm
  - Rotation handling
  - Performance optimization

- layout_table_deep_dive.md: Deep dive into layout/table recognition
  - YOLOv10 layout detection
  - Table structure recognition
  - Grid construction algorithm
  - Spanning cell handling
  - HTML/descriptive output generation
# Layout & Table Recognition Deep Dive
## Overview
After OCR has extracted the text boxes, DeepDoc still needs to:
1. **Layout Recognition**: classify page regions (Text, Title, Table, Figure, ...)
2. **Table Structure Recognition**: recognize table structure (rows, columns, cells)
## File Structure
```
deepdoc/vision/
├── layout_recognizer.py # Layout detection (457 lines)
├── table_structure_recognizer.py # Table structure (613 lines)
└── recognizer.py # Base class (443 lines)
```
---
## 1. Layout Recognition (YOLOv10)
### 1.1 Layout Categories
```python
# deepdoc/vision/layout_recognizer.py, lines 34-46
labels = [
    "_background_",     # 0: Background (ignored)
    "Text",             # 1: Body text paragraphs
    "Title",            # 2: Section/document titles
    "Figure",           # 3: Images, diagrams, charts
    "Figure caption",   # 4: Text describing figures
    "Table",            # 5: Data tables
    "Table caption",    # 6: Text describing tables
    "Header",           # 7: Page headers
    "Footer",           # 8: Page footers
    "Reference",        # 9: Bibliography, citations
    "Equation",         # 10: Mathematical equations
]
```
### 1.2 YOLOv10 Architecture
```
YOLOv10 for Document Layout:
Input Image (640, 640, 3)
┌─────────────────────────────────────┐
│ CSPDarknet Backbone │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐│
│ │ P1 │→ │ P2 │→ │ P3 │→ │ P4 ││
│ │/2 │ │/4 │ │/8 │ │/16 ││
│ └─────┘ └─────┘ └─────┘ └─────┘│
└─────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────┐
│ PANet Neck │
│ FPN (top-down) + PAN (bottom-up) │
│ Multi-scale feature fusion │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Detection Heads (3 scales) │
│ Small (80x80) → tiny objects │
│ Medium (40x40) → normal objects │
│ Large (20x20) → big objects │
└─────────────────────────────────────┘
Raw Predictions:
[x_center, y_center, width, height, confidence, class_probs...]
```
### 1.3 Preprocessing (LayoutRecognizer4YOLOv10)
```python
# deepdoc/vision/layout_recognizer.py, lines 186-209
def preprocess(self, image_list):
    """
    Preprocess images for YOLOv10.
    Key steps:
    1. Resize maintaining aspect ratio
    2. Pad to 640x640 (gray borders)
    3. Normalize [0,255] → [0,1]
    4. Transpose HWC → CHW
    """
    processed = []
    scale_factors = []
    for img in image_list:
        h, w = img.shape[:2]
        # Calculate scale (preserve aspect ratio)
        r = min(640/h, 640/w)
        new_h, new_w = int(h*r), int(w*r)
        # Resize
        resized = cv2.resize(img, (new_w, new_h))
        # Calculate padding
        pad_top = (640 - new_h) // 2
        pad_left = (640 - new_w) // 2
        # Pad to 640x640 (gray: 114)
        padded = np.full((640, 640, 3), 114, dtype=np.uint8)
        padded[pad_top:pad_top+new_h, pad_left:pad_left+new_w] = resized
        # Normalize and transpose
        padded = padded.astype(np.float32) / 255.0
        padded = padded.transpose(2, 0, 1)  # HWC → CHW
        processed.append(padded)
        scale_factors.append([1/r, 1/r, pad_left, pad_top])
    return np.stack(processed), scale_factors
```
**Visualization**:
```
Original image (1000x800):
┌────────────────────────────────────────┐
│ │
│ Document Content │
│ │
└────────────────────────────────────────┘
After resize (scale=0.64) to (640x512):
┌────────────────────────────────────────┐
│ │
│ Document Content │
│ │
└────────────────────────────────────────┘
After padding to (640x640):
┌────────────────────────────────────────┐
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
├────────────────────────────────────────┤
│ │
│ Document Content │
│ │
├────────────────────────────────────────┤
│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ ← 64px gray padding
└────────────────────────────────────────┘
```
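To make the `scale_factors` bookkeeping concrete, here is a minimal sketch (not repository code; `box_to_original` is a hypothetical helper) that maps a detection from padded 640x640 coordinates back to the original page, in the same order of operations the NMS postprocessing below uses:
```python
# Hypothetical helper: undo the letterbox transform for one detected box,
# using the [1/r, 1/r, pad_left, pad_top] factors returned by preprocess().
def box_to_original(xyxy_640, scale_factor):
    inv_r_x, inv_r_y, pad_left, pad_top = scale_factor
    x0, y0, x1, y1 = xyxy_640
    # 1. remove the gray padding offset
    x0, x1 = x0 - pad_left, x1 - pad_left
    y0, y1 = y0 - pad_top, y1 - pad_top
    # 2. scale back to the original resolution
    return [x0 * inv_r_x, y0 * inv_r_y, x1 * inv_r_x, y1 * inv_r_y]

# With the 1000x800 page above: r = 0.64, pad_left = 0, pad_top = 64
print(box_to_original([100, 164, 300, 264], [1/0.64, 1/0.64, 0, 64]))
# → [156.25, 156.25, 468.75, 312.5] in original-image pixels
```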
### 1.4 NMS Postprocessing
```python
# deepdoc/vision/recognizer.py, lines 330-407
def postprocess(self, boxes, inputs, thr):
    """
    YOLOv10 postprocessing with per-class NMS.
    """
    results = []
    for batch_idx, batch_boxes in enumerate(boxes):
        scale_factor = inputs["scale_factor"][batch_idx]
        # Filter by confidence threshold
        mask = batch_boxes[:, 4] > thr  # confidence > 0.2
        filtered = batch_boxes[mask]
        if len(filtered) == 0:
            results.append([])
            continue
        # Convert xywh → xyxy
        xyxy = self.xywh2xyxy(filtered[:, :4])
        # Remove padding offset
        xyxy[:, [0, 2]] -= scale_factor[2]  # pad_left
        xyxy[:, [1, 3]] -= scale_factor[3]  # pad_top
        # Scale back to original size
        xyxy[:, [0, 2]] *= scale_factor[0]  # scale_x
        xyxy[:, [1, 3]] *= scale_factor[1]  # scale_y
        # Per-class NMS
        class_ids = filtered[:, 5].astype(int)
        scores = filtered[:, 4]
        keep_indices = []
        for cls in np.unique(class_ids):
            cls_mask = class_ids == cls
            cls_boxes = xyxy[cls_mask]
            cls_scores = scores[cls_mask]
            # NMS within class
            keep = self.iou_filter(cls_boxes, cls_scores, iou_thresh=0.45)
            keep_indices.extend(np.where(cls_mask)[0][keep])
        # Build result
        batch_results = []
        for idx in keep_indices:
            batch_results.append({
                "type": self.labels[int(filtered[idx, 5])],
                "bbox": xyxy[idx].tolist(),
                "score": float(filtered[idx, 4])
            })
        results.append(batch_results)
    return results
```
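`xywh2xyxy` and `iou_filter` are called above but not shown in the excerpt. A hedged sketch of what they plausibly do (the real implementations in `recognizer.py` may differ in detail):
```python
import numpy as np

def xywh2xyxy(xywh):
    """Convert [x_center, y_center, w, h] rows to [x0, y0, x1, y1]."""
    xyxy = xywh.copy()
    xyxy[:, 0] = xywh[:, 0] - xywh[:, 2] / 2
    xyxy[:, 1] = xywh[:, 1] - xywh[:, 3] / 2
    xyxy[:, 2] = xywh[:, 0] + xywh[:, 2] / 2
    xyxy[:, 3] = xywh[:, 1] + xywh[:, 3] / 2
    return xyxy

def iou_filter(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop others with IoU above the threshold."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of box i with the remaining boxes
        x0 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y0 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x1 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y1 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x1 - x0) * np.maximum(0, y1 - y0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-6)
        order = order[1:][iou <= iou_thresh]
    return keep
```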
### 1.5 OCR-Layout Association
```python
# deepdoc/vision/layout_recognizer.py, lines 98-147
def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
    """
    Detect layouts and associate with OCR results.
    """
    # Step 1: Run layout detection
    page_layouts = super().__call__(image_list, thr, batch_size)
    # Step 2: Clean up overlapping layouts
    for i, layouts in enumerate(page_layouts):
        page_layouts[i] = self.layouts_cleanup(layouts, thr=0.7)
    # Step 3: Associate OCR boxes with layouts
    for page_idx, (ocr_boxes, layouts) in enumerate(zip(ocr_res, page_layouts)):
        # Sort layouts by priority: Footer → Header → Reference → Caption → Others
        layouts_by_priority = self._sort_by_priority(layouts)
        for ocr_box in ocr_boxes:
            # Find overlapping layout
            matched_layout = self.find_overlapped_with_threshold(
                ocr_box,
                layouts_by_priority,
                thr=0.4  # 40% overlap threshold
            )
            if matched_layout:
                ocr_box["layout_type"] = matched_layout["type"]
                ocr_box["layoutno"] = matched_layout.get("layoutno", 0)
            else:
                ocr_box["layout_type"] = "Text"  # Default to Text
    # Step 4: Filter garbage (headers, footers, page numbers)
    if drop:
        self._filter_garbage(ocr_res, page_layouts)
    return ocr_res, page_layouts
```
### 1.6 Garbage Detection
```python
# deepdoc/vision/layout_recognizer.py, lines 64-66
# Patterns to filter out
garbage_patterns = [
    r"^•+$",                           # Bullet points only
    r"^[0-9]{1,2} ?/ ?[0-9]{1,2}$",    # Page numbers (3/10, 3 / 10)
    r"^[0-9]{1,2} of [0-9]{1,2}$",     # Page numbers (3 of 10)
    r"^http://[^ ]{12,}",              # Long URLs
    r"\(cid *: *[0-9]+ *\)",           # PDF character IDs
]

def is_garbage(text, layout_type, page_position):
    """
    Determine if text should be filtered out.
    Rules:
    - Headers at top 10% of page → keep
    - Footers at bottom 10% of page → keep
    - Headers/footers elsewhere → garbage
    - Page numbers → garbage
    - URLs → garbage
    """
    for pattern in garbage_patterns:
        if re.match(pattern, text):
            return True
    # Position-based filtering
    if layout_type == "Header" and page_position > 0.1:
        return True  # Header not at top
    if layout_type == "Footer" and page_position < 0.9:
        return True  # Footer not at bottom
    return False
```
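A quick usage sketch of the rules above (the sample strings are made up):
```python
# Hypothetical examples; is_garbage() and garbage_patterns as defined above.
print(is_garbage("3 / 10", "Footer", page_position=0.95))            # True  – page number
print(is_garbage("Annual Report 2023", "Header", 0.05))              # False – header at top of page
print(is_garbage("Annual Report 2023", "Header", 0.50))              # True  – "header" in the middle of the page
print(is_garbage("Revenue grew strongly this year.", "Text", 0.40))  # False – body text
```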
---
## 2. Table Structure Recognition
### 2.1 Table Components
```python
# deepdoc/vision/table_structure_recognizer.py, lines 31-38
labels = [
    "table",                       # 0: Whole table boundary
    "table column",                # 1: Column separators
    "table row",                   # 2: Row separators
    "table column header",         # 3: Header rows
    "table projected row header",  # 4: Row labels
    "table spanning cell",         # 5: Merged cells
]
```
### 2.2 Detection to Grid Construction
```
Detection Output → Table Grid:
┌─────────────────────────────────────────────────────────────────┐
│ Raw Detections │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ table: [0, 0, 500, 300] │ │
│ │ table row: [0, 0, 500, 50], [0, 50, 500, 100], ... │ │
│ │ table column: [0, 0, 150, 300], [150, 0, 300, 300], ... │ │
│ │ table spanning cell: [0, 100, 300, 150] │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Alignment │ │
│ │ • Align row boundaries (left/right edges) │ │
│ │ • Align column boundaries (top/bottom edges) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Grid Construction │ │
│ │ │ │
│ │ ┌──────────┬──────────┬──────────┐ │ │
│ │ │ Header 1 │ Header 2 │ Header 3 │ ← Row 0 (header) │ │
│ │ ├──────────┴──────────┼──────────┤ │ │
│ │ │ Spanning Cell │ Cell 3 │ ← Row 1 │ │
│ │ ├──────────┬──────────┼──────────┤ │ │
│ │ │ Cell 4 │ Cell 5 │ Cell 6 │ ← Row 2 │ │
│ │ └──────────┴──────────┴──────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ HTML or Descriptive Output │
└─────────────────────────────────────────────────────────────────┘
```
### 2.3 Alignment Algorithm
```python
# deepdoc/vision/table_structure_recognizer.py, lines 67-111
def __call__(self, images, thr=0.2):
    """
    Detect and align table structure.
    """
    # Run detection
    detections = super().__call__(images, thr)
    for page_dets in detections:
        rows = [d for d in page_dets if d["label"] == "table row"]
        cols = [d for d in page_dets if d["label"] == "table column"]
        if len(rows) > 4:
            # Align row X coordinates (left edges)
            x0_values = [r["x0"] for r in rows]
            mean_x0 = np.mean(x0_values)
            min_x0 = np.min(x0_values)
            aligned_x0 = min(mean_x0, min_x0 + 0.05 * (max(x0_values) - min_x0))
            for r in rows:
                r["x0"] = aligned_x0
            # Align row X coordinates (right edges)
            x1_values = [r["x1"] for r in rows]
            mean_x1 = np.mean(x1_values)
            max_x1 = np.max(x1_values)
            aligned_x1 = max(mean_x1, max_x1 - 0.05 * (max_x1 - min(x1_values)))
            for r in rows:
                r["x1"] = aligned_x1
        if len(cols) > 4:
            # Similar alignment for column Y coordinates (top/bottom edges)
            ...
```
**Why is alignment needed?**
The detection model can produce row/column boundaries that are not perfectly aligned:
```
Before alignment:
Row 1: x0=10, x1=490
Row 2: x0=12, x1=488
Row 3: x0=8, x1=492
After alignment:
Row 1: x0=10, x1=490
Row 2: x0=10, x1=490
Row 3: x0=10, x1=490
```
### 2.4 Grid Construction
```python
# deepdoc/vision/table_structure_recognizer.py, lines 172-349
@staticmethod
def construct_table(boxes, is_english=False, html=True, **kwargs):
    """
    Construct 2D table from detected components.
    Args:
        boxes: OCR boxes with R (row), C (column), SP (spanning) attributes
        is_english: Language hint
        html: Output format (HTML or descriptive text)
    Returns:
        HTML table string or descriptive text
    """
    # Step 1: Extract caption
    caption = ""
    for box in boxes[:]:
        if is_caption(box):
            caption = box["text"]
            boxes.remove(box)
    # Step 2: Sort by row position (R attribute)
    rowh = np.median([b["bottom"] - b["top"] for b in boxes])
    boxes = Recognizer.sort_R_firstly(boxes, rowh / 2)
    # Step 3: Group into rows
    rows = []
    current_row = [boxes[0]]
    for box in boxes[1:]:
        # Same row if the R values differ by less than half a row height
        if abs(box["R"] - current_row[-1]["R"]) < rowh / 2:
            current_row.append(box)
        else:
            rows.append(current_row)
            current_row = [box]
    rows.append(current_row)
    # Step 4: Sort each row by column position (C attribute)
    for row in rows:
        row.sort(key=lambda x: x["C"])
    # Step 5: Build 2D table matrix
    n_rows = len(rows)
    n_cols = max(len(row) for row in rows)
    table = [[None] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            table[i][j] = cell
    # Step 6: Handle spanning cells
    table = handle_spanning_cells(table, boxes)
    # Step 7: Generate output
    if html:
        return generate_html_table(table, caption)
    else:
        return generate_descriptive_text(table, caption)
```
### 2.5 Spanning Cell Handling
```python
# deepdoc/vision/table_structure_recognizer.py, lines 496-575
def __cal_spans(self, boxes, rows, cols):
    """
    Calculate colspan and rowspan for merged cells.
    Spanning cell detection:
    - "SP" attribute indicates merged cell
    - Calculate which rows/cols it covers
    """
    for box in boxes:
        if "SP" not in box:
            continue
        # Find rows this cell spans
        box["rowspan"] = []
        for i, row in enumerate(rows):
            overlap = self.overlapped_area(box, row)
            if overlap > 0.3:  # 30% overlap
                box["rowspan"].append(i)
        # Find columns this cell spans
        box["colspan"] = []
        for j, col in enumerate(cols):
            overlap = self.overlapped_area(box, col)
            if overlap > 0.3:
                box["colspan"].append(j)
    return boxes
```
**Example**:
```
Spanning cell detection:
┌──────────┬──────────┬──────────┐
│ Header 1 │ Header 2 │ Header 3 │
├──────────┴──────────┼──────────┤
│ Merged Cell │ Cell 3 │ ← SP cell spans columns 0-1
│ (colspan=2) │ │
├──────────┬──────────┼──────────┤
│ Cell 4 │ Cell 5 │ Cell 6 │
└──────────┴──────────┴──────────┘
Detection:
- SP cell bbox: [0, 50, 300, 100]
- Column 0: [0, 0, 150, 200] → overlap 0.5 ✓
- Column 1: [150, 0, 300, 200] → overlap 0.5 ✓
- Column 2: [300, 0, 450, 200] → overlap 0.0 ✗
→ colspan = [0, 1]
```
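The `overlapped_area` helper is not shown above. A sketch consistent with the numbers in this example (an assumption, not the repository implementation): the overlap is normalized by the spanning cell's own area, and boxes carry `x0`/`x1`/`top`/`bottom` keys as elsewhere in these docs.
```python
def overlapped_area(cell, band):
    """Fraction of the spanning cell's area covered by a row/column band (sketch)."""
    x0 = max(cell["x0"], band["x0"]); x1 = min(cell["x1"], band["x1"])
    y0 = max(cell["top"], band["top"]); y1 = min(cell["bottom"], band["bottom"])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    cell_area = (cell["x1"] - cell["x0"]) * (cell["bottom"] - cell["top"])
    return inter / cell_area if cell_area else 0.0

# SP cell [0, 50, 300, 100] vs column 0 [0, 0, 150, 200]:
# intersection = 150 * 50 = 7500, cell area = 300 * 50 = 15000 → 0.5, as in the example above.
```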
### 2.6 HTML Output Generation
```python
# deepdoc/vision/table_structure_recognizer.py, lines 352-393
def __html_table(table, header_rows, caption):
    """
    Generate HTML table from 2D matrix.
    """
    html_parts = ["<table>"]
    # Add caption if exists
    if caption:
        html_parts.append(f"<caption>{caption}</caption>")
    for i, row in enumerate(table):
        html_parts.append("<tr>")
        for j, cell in enumerate(row):
            if cell is None:
                continue  # Skip cells covered by spanning
            # Determine tag (th for header, td for data)
            tag = "th" if i in header_rows else "td"
            # Add colspan/rowspan attributes
            attrs = []
            if cell.get("colspan") and len(cell["colspan"]) > 1:
                attrs.append(f'colspan="{len(cell["colspan"])}"')
            if cell.get("rowspan") and len(cell["rowspan"]) > 1:
                attrs.append(f'rowspan="{len(cell["rowspan"])}"')
            attr_str = " " + " ".join(attrs) if attrs else ""
            # Add cell content
            html_parts.append(f"<{tag}{attr_str}>{cell['text']}</{tag}>")
        html_parts.append("</tr>")
    html_parts.append("</table>")
    return "\n".join(html_parts)
```
**Output Example**:
```html
<table>
<caption>Table 1: Sales Data</caption>
<tr>
<th>Region</th>
<th>Q1</th>
<th>Q2</th>
</tr>
<tr>
<td colspan="2">North America</td>
<td>$150K</td>
</tr>
<tr>
<td>Europe</td>
<td>$100K</td>
<td>$120K</td>
</tr>
</table>
```
### 2.7 Descriptive Text Output
```python
# deepdoc/vision/table_structure_recognizer.py, lines 396-493
def __desc_table(table, header_rows, caption):
    """
    Generate natural language description of table.
    For RAG, sometimes descriptive text is better than HTML.
    """
    descriptions = []
    # Get headers
    headers = [cell["text"] for cell in table[0]] if header_rows else []
    # Process each data row
    for i, row in enumerate(table):
        if i in header_rows:
            continue
        row_desc = []
        for j, cell in enumerate(row):
            if cell is None:
                continue
            if headers and j < len(headers):
                # "Column Name: Value" format
                row_desc.append(f"{headers[j]}: {cell['text']}")
            else:
                row_desc.append(cell['text'])
        if row_desc:
            descriptions.append("; ".join(row_desc))
    # Add source reference
    if caption:
        descriptions.append(f'(from "{caption}")')
    return "\n".join(descriptions)
```
**Output Example**:
```
Region: North America; Q1: $100K; Q2: $150K
Region: Europe; Q1: $80K; Q2: $120K
(from "Table 1: Sales Data")
```
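A minimal usage sketch (assuming `table_boxes` holds the OCR boxes of one table with `R`, `C` and optional `SP` attributes already assigned by the PDF pipeline):
```python
from deepdoc.vision.table_structure_recognizer import TableStructureRecognizer as TSR

html_out = TSR.construct_table(table_boxes, html=True)   # "<table>...</table>"
text_out = TSR.construct_table(table_boxes, html=False)  # "Region: ...; Q1: ..." lines
```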
---
## 3. Cell Content Classification
### 3.1 Block Type Detection
```python
# deepdoc/vision/table_structure_recognizer.py, lines 121-149
@staticmethod
def blockType(text):
    """
    Classify cell content type.
    Used for:
    - Header detection (non-numeric cells likely headers)
    - Data validation
    - Smart formatting
    """
    patterns = {
        "Dt": r"(^[0-9]{4}[-/][0-9]{1,2}|[0-9]{1,2}[-/][0-9]{1,2}[-/][0-9]{2,4}|"
              r"[0-9]{1,2}月|[Q][1-4]|[一二三四]季度)",  # Date
        "Nu": r"^[-+]?[0-9.,%¥$€£]+$",                    # Number
        "Ca": r"^[A-Z0-9]{4,}$",                          # Code
        "En": r"^[a-zA-Z\s]+$",                           # English
    }
    for type_name, pattern in patterns.items():
        if re.search(pattern, text):
            return type_name
    # Classify by length
    tokens = text.split()
    if len(tokens) == 1:
        return "Sg"   # Single token
    elif len(tokens) <= 3:
        return "Tx"   # Short text
    elif len(tokens) <= 12:
        return "Lx"   # Long text
    else:
        return "Ot"   # Other

# Examples:
# "2023-01-15"    → "Dt" (Date)
# "$1,234.56"     → "Nu" (Number)
# "ABC123"        → "Ca" (Code)
# "Total Revenue" → "En" (English)
# "北京市"         → "Sg" (single token, no pattern match)
```
### 3.2 Header Detection
```python
# deepdoc/vision/table_structure_recognizer.py, lines 332-344
def detect_headers(table):
    """
    Detect which rows are headers based on content type.
    Heuristic: If >50% of cells in a row are non-numeric,
    it's likely a header row.
    """
    header_rows = set()
    for i, row in enumerate(table):
        non_numeric = 0
        total = 0
        for cell in row:
            if cell is None:
                continue
            total += 1
            if blockType(cell["text"]) != "Nu":
                non_numeric += 1
        if total > 0 and non_numeric / total > 0.5:
            header_rows.add(i)
    return header_rows
```
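A toy run of the heuristic, assuming `blockType` and `detect_headers` as sketched above (cell contents are made up):
```python
# Row 0 is mostly non-numeric text; the other rows are mostly numbers.
toy_table = [
    [{"text": "Region"}, {"text": "Q1"},  {"text": "Q2"}],
    [{"text": "Europe"}, {"text": "100"}, {"text": "120"}],
    [{"text": "Asia"},   {"text": "90"},  {"text": "140"}],
]
print(detect_headers(toy_table))  # {0}
```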
---
## 4. Integration with the PDF Parser
### 4.1 Table Detection in PDF Pipeline
```python
# deepdoc/parser/pdf_parser.py, lines 196-281
def _table_transformer_job(self, zoomin=3):
    """
    Detect and structure tables using TableStructureRecognizer.
    """
    # Find table layouts
    table_layouts = [
        layout for layout in self.page_layout
        if layout["type"] == "Table"
    ]
    if not table_layouts:
        return
    # Crop table images
    table_images = []
    for layout in table_layouts:
        x0, y0, x1, y1 = layout["bbox"]
        img = self.page_images[layout["page"]][
            int(y0*zoomin):int(y1*zoomin),
            int(x0*zoomin):int(x1*zoomin)
        ]
        table_images.append(img)
    # Run TSR
    table_structures = self.tsr(table_images)
    # Match OCR boxes to table structure
    for layout, structure in zip(table_layouts, table_structures):
        # Get OCR boxes within table region
        table_boxes = [
            box for box in self.boxes
            if self._box_in_region(box, layout["bbox"])
        ]
        # Assign R, C, SP attributes
        for box in table_boxes:
            box["R"] = self._find_row(box, structure["rows"])
            box["C"] = self._find_column(box, structure["columns"])
            if self._is_spanning(box, structure["spanning_cells"]):
                box["SP"] = True
        # Store for later extraction
        self.tb_cpns[layout["id"]] = {
            "boxes": table_boxes,
            "structure": structure
        }
```
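`_find_row`, `_find_column` and `_is_spanning` are not shown in the excerpt. A plausible sketch of `_find_row` (an assumption, not the repository code) picks the detected row with the largest vertical overlap; `_find_column` would do the same on the X axis:
```python
def _find_row(box, rows):
    """Index of the detected table row that best overlaps an OCR box (sketch)."""
    best, best_overlap = 0, 0.0
    for i, row in enumerate(rows):
        top = max(box["top"], row["top"])
        bottom = min(box["bottom"], row["bottom"])
        overlap = max(0.0, bottom - top)   # vertical overlap in pixels
        if overlap > best_overlap:
            best, best_overlap = i, overlap
    return best
```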
### 4.2 Table Extraction
```python
# deepdoc/parser/pdf_parser.py, lines 757-930
def _extract_table_figure(self, need_image, ZM, return_html, need_position):
    """
    Extract tables and figures from detected layouts.
    """
    tables = []
    for layout_id, table_data in self.tb_cpns.items():
        boxes = table_data["boxes"]
        # Construct table (HTML or descriptive text)
        content = TableStructureRecognizer.construct_table(
            boxes, html=return_html
        )
        table = {
            "content": content,
            "bbox": table_data["bbox"],
        }
        if need_image:
            table["image"] = self._crop_region(table_data["bbox"])
        tables.append(table)
    return tables
```
---
## 5. Performance Considerations
### 5.1 Batch Processing
```python
# deepdoc/vision/recognizer.py, lines 415-437
def __call__(self, image_list, thr=0.7, batch_size=16):
    """
    Batch inference for efficiency.
    Why batch_size=16?
    - GPU memory optimization
    - Balance throughput vs latency
    - Typical document has 10-50 elements
    """
    results = []
    for i in range(0, len(image_list), batch_size):
        batch = image_list[i:i+batch_size]
        # Preprocess
        inputs = self.preprocess(batch)
        # Inference
        outputs = self.ort_sess.run(None, inputs)
        # Postprocess
        batch_results = self.postprocess(outputs, inputs, thr)
        results.extend(batch_results)
    return results
```
### 5.2 Model Caching
```python
# deepdoc/vision/ocr.py, lines 36-73
# Global model cache
loaded_models = {}

def load_model(model_dir, nm, device_id=None):
    """
    Load ONNX model with caching.
    Cache key: model_path + device_id
    """
    model_path = os.path.join(model_dir, f"{nm}.onnx")
    cache_key = f"{model_path}_{device_id}"
    if cache_key in loaded_models:
        return loaded_models[cache_key]
    # Load model (session options elided here; see the OCR deep dive)
    session = ort.InferenceSession(model_path, ...)
    run_opts = ort.RunOptions()
    loaded_models[cache_key] = (session, run_opts)
    return session, run_opts
```
---
## 6. Troubleshooting
### 6.1 Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Missing table | Low confidence | Lower threshold (0.1-0.2) |
| Wrong colspan | Misaligned detection | Check row/column alignment |
| Merged cells wrong | Overlap threshold | Adjust SP detection threshold |
| Headers not detected | All numeric | Manual header specification |
| Layout overlap | NMS threshold | Increase NMS IoU threshold |
### 6.2 Debugging
```python
# Visualize layout detection
from deepdoc.vision.seeit import draw_boxes
# Draw layout boxes on image
layout_vis = draw_boxes(
    page_image,
    [(l["bbox"], l["type"]) for l in page_layouts],
    colors={
        "Text": (0, 255, 0),
        "Table": (255, 0, 0),
        "Figure": (0, 0, 255),
    }
)
cv2.imwrite("layout_debug.png", layout_vis)

# Check table structure
for box in table_boxes:
    print(f"Text: {box['text']}")
    print(f"  Row: {box.get('R', 'N/A')}")
    print(f"  Col: {box.get('C', 'N/A')}")
    print(f"  Spanning: {box.get('SP', False)}")
```
---
## 7. References
- YOLOv10 Paper: [YOLOv10: Real-Time End-to-End Object Detection](https://arxiv.org/abs/2405.14458)
- Table Transformer: [PubTables-1M: Towards comprehensive table extraction](https://arxiv.org/abs/2110.00061)
- Document Layout Analysis: [A Survey](https://arxiv.org/abs/2012.15005)

# OCR Deep Dive
## Overview
The OCR module in DeepDoc performs two main tasks:
1. **Text Detection**: detect the regions of an image that contain text
2. **Text Recognition**: recognize the text inside each detected region
## File Structure
```
deepdoc/vision/
├── ocr.py # Main OCR class (752 lines)
├── postprocess.py # CTC decoder, DBNet postprocess (371 lines)
└── operators.py # Image preprocessing (726 lines)
```
---
## 1. Text Detection (DBNet)
### 1.1 Model Architecture
```
DBNet (Differentiable Binarization Network):
Input Image (H, W, 3)
┌─────────────────────────────────────┐
│ ResNet-18 Backbone │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐│
│ │ C1 │→ │ C2 │→ │ C3 │→ │ C4 ││
│ │64ch │ │128ch│ │256ch│ │512ch││
│ └─────┘ └─────┘ └─────┘ └─────┘│
└─────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────┐
│ Feature Pyramid Network │
│ Upsample + Concatenate all levels │
│ Output: 256 channels │
└─────────────────────────────────────┘
├─────────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Probability │ │ Threshold │
│ Head │ │ Head │
│ Conv → Sigmoid │ │ Conv → Sigmoid │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
Prob Map (H, W) Thresh Map (H, W)
│ │
└─────────┬─────────┘
┌─────────────────────────────────────┐
│ Differentiable Binarization │
│ B = sigmoid((P - T) * k) │
│ k = 50 (amplification factor) │
└─────────────────────────────────────┘
Binary Map (H, W)
```
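A tiny NumPy sketch of the differentiable binarization step from the diagram, on toy values (at inference time the simpler hard threshold `P > 0.3` is used instead, see the post-processing below):
```python
import numpy as np

def db_binarize(prob_map, thresh_map, k=50):
    """B = sigmoid(k * (P - T)) with amplification factor k = 50."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

P = np.array([0.9, 0.45, 0.1])   # probability map samples
T = np.array([0.4, 0.4, 0.4])    # threshold map samples
print(db_binarize(P, T))          # ≈ [1.0, 0.92, 0.0] – near-binary output
```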
### 1.2 DBNet Post-processing
```python
# deepdoc/vision/postprocess.py, lines 41-259
class DBPostProcess:
    def __init__(self,
                 thresh=0.3,           # Binary threshold
                 box_thresh=0.5,       # Box confidence threshold
                 max_candidates=1000,  # Maximum text regions
                 unclip_ratio=1.5,     # Polygon expansion ratio
                 use_dilation=False,   # Morphological dilation
                 score_mode="fast"):   # fast or slow scoring
        self.thresh = thresh
        self.box_thresh = box_thresh
        self.max_candidates = max_candidates
        self.unclip_ratio = unclip_ratio
        self.use_dilation = use_dilation
        self.score_mode = score_mode

    def __call__(self, outs_dict, shape_list):
        """
        Post-process DBNet output.
        Args:
            outs_dict: {"maps": probability_map}
            shape_list: Original image shapes
        Returns:
            List of detected text boxes
        """
        pred = outs_dict["maps"][0, 0]  # maps are (N, 1, H, W); take the first (H, W) map
        # Step 1: Binary thresholding
        bitmap = pred > self.thresh  # 0.3
        # Step 2: Optional dilation
        if self.use_dilation:
            kernel = np.ones((2, 2))
            bitmap = cv2.dilate(bitmap.astype(np.uint8), kernel)
        # Step 3: Find contours
        contours, _ = cv2.findContours(
            bitmap.astype(np.uint8),
            cv2.RETR_LIST,
            cv2.CHAIN_APPROX_SIMPLE
        )
        # Step 4: Process each contour
        boxes = []
        for contour in contours[:self.max_candidates]:
            # Simplify polygon
            epsilon = 0.002 * cv2.arcLength(contour, True)
            approx = cv2.approxPolyDP(contour, epsilon, True)
            if len(approx) < 4:
                continue
            # Calculate confidence score
            score = self.box_score_fast(pred, approx)
            if score < self.box_thresh:
                continue
            # Unclip (expand) polygon
            box = self.unclip(approx, self.unclip_ratio)
            boxes.append(box)
        return boxes
```
### 1.3 Unclipping Algorithm
**Problem**: DBNet tends to predict tight boundaries, which can cut off the characters at the edges of a text line.
**Solution**: Expand each detected polygon by `unclip_ratio`.
```python
# deepdoc/vision/postprocess.py, lines 163-169
def unclip(self, box, unclip_ratio):
    """
    Expand a polygon using the Clipper library (pyclipper).
    Formula:
        distance = area * unclip_ratio / perimeter
    With unclip_ratio = 1.5 the offset distance scales with the polygon's
    area-to-perimeter ratio, so the expansion is proportional to the size
    of the detected region rather than a fixed number of pixels.
    """
    poly = Polygon(box)
    distance = poly.area * unclip_ratio / poly.length
    offset = pyclipper.PyclipperOffset()
    offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    expanded = offset.Execute(distance)
    return np.array(expanded[0])
```
**Visualization**:
```
Original detection: After unclip (1.5x):
┌──────────────┐ ┌────────────────────┐
│ Hello │ → │ Hello │
└──────────────┘ └────────────────────┘
(expanded boundaries)
```
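For a typical text line the expansion works out to roughly two-thirds of the line height. A quick worked example with a made-up 200x20 px box:
```python
# A 200 x 20 px text box, unclip_ratio = 1.5:
area, perimeter = 200 * 20, 2 * (200 + 20)
distance = area * 1.5 / perimeter
print(distance)  # ≈ 13.6 px added around the polygon (~0.68 of the line height)
```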
---
## 2. Text Recognition (CRNN)
### 2.1 Model Architecture
```
CRNN (Convolutional Recurrent Neural Network):
Input: Cropped text image (3, 48, W)
┌─────────────────────────────────────┐
│ CNN Backbone │
│ VGG-style convolutions │
│ 7 conv layers + 4 max pooling │
│ Output: (512, 1, W/4) │
└────────────────┬────────────────────┘
┌─────────────────────────────────────┐
│ Sequence Reshaping │
│ Collapse height dimension │
│ Output: (W/4, 512) │
└────────────────┬────────────────────┘
┌─────────────────────────────────────┐
│ Bidirectional LSTM │
│ 2 layers, 256 hidden units │
│ Output: (W/4, 512) │
└────────────────┬────────────────────┘
┌─────────────────────────────────────┐
│ Classification Head │
│ Linear(512 → num_classes) │
│ Output: (W/4, num_classes) │
└────────────────┬────────────────────┘
Probability Matrix (T, C)
T = time steps, C = characters
```
### 2.2 CTC Decoding
```python
# deepdoc/vision/postprocess.py, lines 347-370
class CTCLabelDecode(BaseRecLabelDecode):
    """
    CTC (Connectionist Temporal Classification) decoder.
    CTC solves the alignment problem:
    - The model output has T time steps
    - The ground truth has N characters
    - T > N (several frames per character)
    - The exact alignment between frames and characters is unknown
    CTC adds a special "blank" token (ε):
    - Represents "no output"
    - Allows alignment without explicit segmentation
    """
    def __init__(self, character_dict_path, use_space_char=False):
        super().__init__(character_dict_path, use_space_char)
        # Prepend blank token at index 0
        self.character = ['blank'] + self.character

    def __call__(self, preds, label=None):
        """
        Decode CTC output.
        Args:
            preds: (batch, time, num_classes) probability matrix
        Returns:
            [(text, confidence), ...]
        """
        # Get most probable character at each time step
        preds_idx = preds.argmax(axis=2)   # (batch, time)
        preds_prob = preds.max(axis=2)     # (batch, time)
        # Decode with deduplication
        result = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
        return result

    def decode(self, text_index, text_prob, is_remove_duplicate=True):
        """
        CTC decoding algorithm.
        Example:
            Raw output:   [a, a, ε, l, l, ε, p, h, a]
            After dedup:  [a, ε, l, ε, p, h, a]
            Remove blank: [a, l, p, h, a]
            Final:        "alpha"
        """
        result = []
        for batch_idx in range(len(text_index)):
            char_list = []
            conf_list = []
            for idx in range(len(text_index[batch_idx])):
                char_idx = text_index[batch_idx][idx]
                # Skip blank token (index 0)
                if char_idx == 0:
                    continue
                # Skip consecutive duplicates
                if is_remove_duplicate:
                    if idx > 0 and char_idx == text_index[batch_idx][idx-1]:
                        continue
                char_list.append(self.character[char_idx])
                conf_list.append(text_prob[batch_idx][idx])
            text = ''.join(char_list)
            conf = np.mean(conf_list) if conf_list else 0.0
            result.append((text, conf))
        return result
```
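A toy run of the greedy (best-path) decoding above, assuming index 0 is the blank token as in the class (the alphabet here is made up):
```python
# Toy alphabet: 0 = blank, 1 = 'a', 2 = 'l', 3 = 'p', 4 = 'h'
alphabet = ['blank', 'a', 'l', 'p', 'h']
best_path = [1, 1, 0, 2, 2, 0, 3, 4, 1]   # argmax index per time step

chars, prev = [], None
for idx in best_path:
    if idx != 0 and idx != prev:          # drop blanks and consecutive repeats
        chars.append(alphabet[idx])
    prev = idx
print(''.join(chars))                      # → "alpha"
```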
### 2.3 Aspect Ratio Handling
```python
# deepdoc/vision/ocr.py, lines 146-170
def resize_norm_img(self, img, max_wh_ratio):
    """
    Resize image maintaining aspect ratio.
    Problem: text images have very different widths
    - "Hi"          → narrow
    - "Hello World" → wide
    Solution: resize by aspect ratio, then pad the right side.
    """
    imgC, imgH, imgW = self.rec_image_shape  # [3, 48, 320]
    # Calculate target width from aspect ratio
    max_width = int(imgH * max_wh_ratio)
    max_width = min(max_width, imgW)  # Cap at 320
    h, w = img.shape[:2]
    ratio = w / float(h)
    # Resize maintaining aspect ratio
    if ratio * imgH > max_width:
        resized_w = max_width
    else:
        resized_w = int(ratio * imgH)
    resized_img = cv2.resize(img, (resized_w, imgH))
    # Pad right side to max_width
    padded = np.zeros((imgH, max_width, 3), dtype=np.float32)
    padded[:, :resized_w, :] = resized_img
    # Normalize: [0, 255] → [-1, 1]
    padded = (padded / 255.0 - 0.5) / 0.5
    # Transpose: HWC → CHW
    padded = padded.transpose(2, 0, 1)
    return padded
```
**Visualization**:
```
Original images:
┌──────┐ ┌────────────────┐ ┌──────────────────────┐
│ Hi │ │ Hello │ │ Hello World │
└──────┘ └────────────────┘ └──────────────────────┘
narrow medium wide
After resize + pad (to width 320):
┌──────────────────────────────────────────────────────┐
│ Hi │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
├──────────────────────────────────────────────────────┤
│ Hello │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
├──────────────────────────────────────────────────────┤
│ Hello World │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
└──────────────────────────────────────────────────────┘
(░ = zero padding)
```
---
## 3. Full OCR Pipeline
### 3.1 OCR Class
```python
# deepdoc/vision/ocr.py, lines 536-752
class OCR:
    """
    End-to-end OCR pipeline.
    Usage:
        ocr = OCR()
        results = ocr(image)
        # results: [(box_points, (text, confidence)), ...]
    """
    def __init__(self, model_dir=None):
        # Auto-download models if not found
        if model_dir is None:
            model_dir = self._get_model_dir()
        # Initialize detector and recognizer
        self.text_detector = TextDetector(model_dir)
        self.text_recognizer = TextRecognizer(model_dir)

    def __call__(self, img, device_id=0, cls=True):
        """
        Full OCR pipeline.
        Args:
            img: numpy array (H, W, 3) in BGR
            device_id: GPU device ID
            cls: Whether to check text orientation
        Returns:
            [(box_4pts, (text, confidence)), ...]
        """
        # Step 1: Detect text regions
        dt_boxes, det_time = self.text_detector(img)
        if dt_boxes is None or len(dt_boxes) == 0:
            return []
        # Step 2: Sort boxes by reading order
        dt_boxes = self.sorted_boxes(dt_boxes)
        # Step 3: Crop and rotate each text region
        img_crop_list = []
        for box in dt_boxes:
            tmp_box = self.get_rotate_crop_image(img, box)
            img_crop_list.append(tmp_box)
        # Step 4: Recognize text
        rec_res, rec_time = self.text_recognizer(img_crop_list)
        # Step 5: Filter by confidence
        results = []
        for box, rec in zip(dt_boxes, rec_res):
            text, score = rec
            if score >= 0.5:  # drop_score threshold
                results.append((box, (text, score)))
        return results
```
### 3.2 Rotation Detection
```python
# deepdoc/vision/ocr.py, lines 584-638
def get_rotate_crop_image(self, img, points):
    """
    Crop text region with automatic rotation detection.
    Problem: a text line may be rotated by 90° or 270°.
    Solution: try several orientations and keep the one with the best recognition score.
    """
    # Order points: top-left → top-right → bottom-right → bottom-left
    rect = self.order_points_clockwise(points)
    # Perspective transform to get rectangular crop
    width = int(max(
        np.linalg.norm(rect[0] - rect[1]),
        np.linalg.norm(rect[2] - rect[3])
    ))
    height = int(max(
        np.linalg.norm(rect[0] - rect[3]),
        np.linalg.norm(rect[1] - rect[2])
    ))
    dst = np.array([
        [0, 0],
        [width, 0],
        [width, height],
        [0, height]
    ], dtype=np.float32)
    M = cv2.getPerspectiveTransform(rect, dst)
    warped = cv2.warpPerspective(img, M, (width, height))
    # Check if text is vertical (needs rotation)
    if warped.shape[0] / warped.shape[1] >= 1.5:
        # Try 3 orientations
        orientations = [
            (warped, 0),  # Original
            (cv2.rotate(warped, cv2.ROTATE_90_CLOCKWISE), 90),
            (cv2.rotate(warped, cv2.ROTATE_90_COUNTERCLOCKWISE), -90)
        ]
        best_score = -1
        best_img = warped
        for rot_img, angle in orientations:
            # Quick recognition to get confidence
            _, score = self.text_recognizer([rot_img])[0]
            if score > best_score:
                best_score = score
                best_img = rot_img
        warped = best_img
    return warped
```
### 3.3 Reading Order Sorting
```python
# deepdoc/vision/ocr.py, lines 640-661
def sorted_boxes(self, dt_boxes):
    """
    Sort boxes by reading order (top-to-bottom, left-to-right).
    Algorithm:
    1. Sort by Y coordinate (top of box)
    2. Within the same "row" (Y within 10 px), sort by X coordinate
    """
    num_boxes = len(dt_boxes)
    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
    # Group into rows and sort each row
    _boxes = list(sorted_boxes)
    for i in range(num_boxes - 1):
        for j in range(i, -1, -1):
            # If boxes are on the same row (Y difference < 10)
            if abs(_boxes[j+1][0][1] - _boxes[j][0][1]) < 10:
                # Sort by X coordinate
                if _boxes[j+1][0][0] < _boxes[j][0][0]:
                    _boxes[j], _boxes[j+1] = _boxes[j+1], _boxes[j]
            else:
                break
    return _boxes
```
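A toy usage sketch (coordinates are made up; `ocr` is an `OCR()` instance as constructed in section 3.1). Each box is four (x, y) points with the top-left corner first, as produced by the detector:
```python
boxes = [
    [[300,  12], [400,  12], [400,  40], [300,  40]],   # line 1, right box
    [[ 20,  15], [120,  15], [120,  43], [ 20,  43]],   # line 1, left box (Y within 10 px)
    [[ 20, 100], [150, 100], [150, 130], [ 20, 130]],   # line 2
]
ordered = ocr.sorted_boxes(boxes)
# reading order: line-1 left box, line-1 right box, then the line-2 box
```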
---
## 4. Performance Optimization
### 4.1 GPU Memory Management
```python
# deepdoc/vision/ocr.py, lines 96-127
def load_model(model_dir, nm, device_id=None):
    """
    Load ONNX model with optimized settings.
    """
    model_path = os.path.join(model_dir, f"{nm}.onnx")
    options = ort.SessionOptions()
    # Reduce memory fragmentation
    options.enable_cpu_mem_arena = False
    # Sequential execution (more predictable memory)
    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    # Limit thread usage
    options.intra_op_num_threads = 2
    options.inter_op_num_threads = 2
    # GPU configuration
    if torch.cuda.is_available() and device_id is not None:
        providers = [
            ('CUDAExecutionProvider', {
                'device_id': device_id,
                # Limit GPU memory (default 2 GB)
                'gpu_mem_limit': int(os.getenv('OCR_GPU_MEM_LIMIT_MB', 2048)) * 1024 * 1024,
                # Memory allocation strategy
                'arena_extend_strategy': os.getenv('OCR_ARENA_EXTEND_STRATEGY', 'kNextPowerOfTwo'),
            })
        ]
    else:
        providers = ['CPUExecutionProvider']
    session = ort.InferenceSession(model_path, options, providers=providers)
    # Run options for memory cleanup after each run
    run_opts = ort.RunOptions()
    run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
    return session, run_opts
```
### 4.2 Batch Processing Optimization
```python
# deepdoc/vision/ocr.py, lines 363-408
def __call__(self, img_list):
    """
    Optimized batch recognition.
    """
    # Sort images by aspect ratio for efficient batching:
    # similar widths → less padding waste
    indices = np.argsort([img.shape[1] / img.shape[0] for img in img_list])
    results = [None] * len(img_list)
    for batch_start in range(0, len(indices), self.batch_size):
        batch_indices = indices[batch_start:batch_start + self.batch_size]
        # Get max aspect ratio in the batch for padding
        max_wh_ratio = max(img_list[i].shape[1] / img_list[i].shape[0]
                           for i in batch_indices)
        # Normalize all images to the same width
        norm_imgs = []
        for i in batch_indices:
            norm_img = self.resize_norm_img(img_list[i], max_wh_ratio)
            norm_imgs.append(norm_img)
        # Stack into batch
        batch = np.stack(norm_imgs)
        # Run inference
        preds = self.ort_sess.run(None, {"input": batch})
        # Decode results
        texts = self.postprocess_op(preds[0])
        # Map back to original indices
        for j, idx in enumerate(batch_indices):
            results[idx] = texts[j]
    return results
```
### 4.3 Multi-GPU Parallel Processing
```python
# deepdoc/vision/ocr.py, lines 556-579
class OCR:
    def __init__(self, model_dir=None):
        if settings.PARALLEL_DEVICES > 0:
            # Create per-GPU instances
            self.text_detector = [
                TextDetector(model_dir, device_id)
                for device_id in range(settings.PARALLEL_DEVICES)
            ]
            self.text_recognizer = [
                TextRecognizer(model_dir, device_id)
                for device_id in range(settings.PARALLEL_DEVICES)
            ]
        else:
            # Single instance for CPU/single GPU
            self.text_detector = TextDetector(model_dir)
            self.text_recognizer = TextRecognizer(model_dir)
```
---
## 5. Troubleshooting
### 5.1 Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Low accuracy | Low resolution input | Increase zoomin factor (3-5) |
| Slow inference | Large images | Resize to max 960px |
| Memory error | Too many candidates | Reduce max_candidates |
| Missing text | Tight boundaries | Increase unclip_ratio |
| Wrong orientation | Vertical text | Enable rotation detection |
### 5.2 Debugging Tips
```python
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Visualize detections
from deepdoc.vision.seeit import draw_boxes
img_with_boxes = draw_boxes(img, dt_boxes)
cv2.imwrite("debug_detection.png", img_with_boxes)
# Check confidence scores
for box, (text, conf) in results:
    print(f"Text: {text}, Confidence: {conf:.2f}")
    if conf < 0.5:
        print("  ⚠️ Low confidence!")
```
---
## 6. References
- DBNet Paper: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
- CRNN Paper: [An End-to-End Trainable Neural Network for Image-based Sequence Recognition](https://arxiv.org/abs/1507.05717)
- CTC Paper: [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- PaddleOCR: [GitHub](https://github.com/PaddlePaddle/PaddleOCR)