# OCR Deep Dive

## Overview

The OCR module in DeepDoc performs two main tasks:

1. **Text Detection**: locate the regions of an image that contain text
2. **Text Recognition**: recognize the text inside each detected region

## File Structure

```
deepdoc/vision/
├── ocr.py           # Main OCR class (752 lines)
├── postprocess.py   # CTC decoder, DBNet postprocess (371 lines)
└── operators.py     # Image preprocessing (726 lines)
```

---

## 1. Text Detection (DBNet)

### 1.1 Model Architecture

```
DBNet (Differentiable Binarization Network):

Input Image (H, W, 3)
         │
         ▼
┌─────────────────────────────────────┐
│         ResNet-18 Backbone          │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐ │
│  │ C1  │→ │ C2  │→ │ C3  │→ │ C4  │ │
│  │64ch │  │128ch│  │256ch│  │512ch│ │
│  └─────┘  └─────┘  └─────┘  └─────┘ │
└─────────────────────────────────────┘
      │        │        │        │
      ▼        ▼        ▼        ▼
┌─────────────────────────────────────┐
│       Feature Pyramid Network       │
│  Upsample + Concatenate all levels  │
│        Output: 256 channels         │
└─────────────────────────────────────┘
         │
         ├──────────────────┐
         ▼                  ▼
┌─────────────────┐  ┌─────────────────┐
│   Probability   │  │    Threshold    │
│      Head       │  │      Head       │
│ Conv → Sigmoid  │  │ Conv → Sigmoid  │
└────────┬────────┘  └────────┬────────┘
         │                    │
         ▼                    ▼
  Prob Map (H, W)      Thresh Map (H, W)
         │                    │
         └─────────┬──────────┘
                   ▼
┌─────────────────────────────────────┐
│     Differentiable Binarization     │
│       B = sigmoid((P - T) * k)      │
│     k = 50 (amplification factor)   │
└─────────────────────────────────────┘
                   │
                   ▼
          Binary Map (H, W)
```

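The binarization formula in the diagram is easy to verify numerically. A minimal sketch (the `db_binarize` helper and its input values are illustrative, not repo code):

```python
import math

def db_binarize(p, t, k=50):
    """Differentiable binarization: B = sigmoid((P - T) * k).
    k amplifies the gap between probability P and threshold T, so
    the output saturates to ~0 or ~1 almost everywhere."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))

# Pixel well above its threshold → essentially 1
print(db_binarize(0.7, 0.3))
# Pixel well below its threshold → essentially 0
print(db_binarize(0.1, 0.3))
# Pixel exactly at its threshold → 0.5
print(db_binarize(0.3, 0.3))
```

Because the function is smooth (unlike a hard step), gradients flow through it during training, which is the whole point of the "differentiable" in DBNet.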
### 1.2 DBNet Post-processing

```python
# deepdoc/vision/postprocess.py, lines 41-259

class DBPostProcess:
    def __init__(self,
                 thresh=0.3,           # Binary threshold
                 box_thresh=0.5,       # Box confidence threshold
                 max_candidates=1000,  # Maximum text regions
                 unclip_ratio=1.5,     # Polygon expansion ratio
                 use_dilation=False,   # Morphological dilation
                 score_mode="fast"):   # fast or slow scoring
        ...

    def __call__(self, outs_dict, shape_list):
        """
        Post-process DBNet output.

        Args:
            outs_dict: {"maps": probability_map}
            shape_list: Original image shapes

        Returns:
            List of detected text boxes
        """
        pred = outs_dict["maps"]  # (N, 1, H, W)

        # Step 1: Binary thresholding (thresh = 0.3)
        bitmap = (pred > self.thresh).astype(np.uint8)

        # Step 2: Optional dilation
        if self.use_dilation:
            kernel = np.ones((2, 2), dtype=np.uint8)
            bitmap = cv2.dilate(bitmap, kernel)

        # Step 3: Find contours (OpenCV 4.x returns (contours, hierarchy))
        contours, _ = cv2.findContours(
            bitmap,
            cv2.RETR_LIST,
            cv2.CHAIN_APPROX_SIMPLE
        )

        # Step 4: Process each contour
        boxes = []
        for contour in contours[:self.max_candidates]:
            # Simplify polygon
            epsilon = 0.002 * cv2.arcLength(contour, True)
            approx = cv2.approxPolyDP(contour, epsilon, True)

            if len(approx) < 4:
                continue

            # Calculate confidence score
            score = self.box_score_fast(pred, approx)
            if score < self.box_thresh:
                continue

            # Unclip (expand) polygon
            box = self.unclip(approx, self.unclip_ratio)
            boxes.append(box)

        return boxes
```

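The `score < self.box_thresh` filter above depends on `box_score_fast`. A simplified numpy sketch of the idea (the repo version masks the exact polygon; here the mean is taken over the box's axis-aligned bounding rectangle, and the toy probability map is made up):

```python
import numpy as np

def box_score_fast(prob_map, box):
    """Mean probability inside the axis-aligned bounding rect of `box`.
    A cheap proxy for mean-inside-polygon (the 'fast' score mode)."""
    xs, ys = box[:, 0], box[:, 1]
    xmin, xmax = xs.min(), xs.max()
    ymin, ymax = ys.min(), ys.max()
    return float(prob_map[ymin:ymax + 1, xmin:xmax + 1].mean())

prob = np.zeros((10, 10))
prob[2:5, 2:8] = 0.9                        # a bright "text" region
box = np.array([[2, 2], [7, 2], [7, 4], [2, 4]])
print(box_score_fast(prob, box))            # high score → box is kept
```

Boxes whose interior probabilities are weak (e.g. texture false positives) average below `box_thresh = 0.5` and are discarded before unclipping.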
### 1.3 Unclipping Algorithm

**Problem**: DBNet tends to predict tight boundaries → misses edge characters

**Solution**: expand each detected polygon by `unclip_ratio`

```python
# deepdoc/vision/postprocess.py, lines 163-169

def unclip(self, box, unclip_ratio):
    """
    Expand polygon using the Clipper library.

    Formula:
        distance = Area * unclip_ratio / Perimeter

    With unclip_ratio = 1.5:
    - Small polygons are expanded relatively more
    - Large polygons are expanded relatively less (proportional)
    """
    poly = Polygon(box)
    distance = poly.area * unclip_ratio / poly.length

    offset = pyclipper.PyclipperOffset()
    offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)

    expanded = offset.Execute(distance)
    return np.array(expanded[0])
```

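The offset distance is easy to sanity-check by hand. A sketch for axis-aligned rectangles (the `unclip_distance` helper is illustrative, not the library code):

```python
def unclip_distance(w, h, unclip_ratio=1.5):
    """distance = Area * unclip_ratio / Perimeter, for a w×h rectangle."""
    area = w * h
    perimeter = 2 * (w + h)
    return area * unclip_ratio / perimeter

# Two text boxes of the same height, different lengths:
print(unclip_distance(40, 16))    # short word box
print(unclip_distance(200, 16))   # long line box
```

Note that for a fixed height `h` the offset is bounded by `h * unclip_ratio / 2` as width grows, so expansion stays proportional to the text height rather than blowing up for long lines.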
**Visualization**:

```
Original detection:          After unclip (1.5x):
┌──────────────┐             ┌────────────────────┐
│    Hello     │      →      │      Hello         │
└──────────────┘             └────────────────────┘
                             (expanded boundaries)
```

---

## 2. Text Recognition (CRNN)

### 2.1 Model Architecture

```
CRNN (Convolutional Recurrent Neural Network):

Input: Cropped text image (3, 48, W)
                 │
                 ▼
┌─────────────────────────────────────┐
│            CNN Backbone             │
│       VGG-style convolutions        │
│    7 conv layers + 4 max pooling    │
│        Output: (512, 1, W/4)        │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│         Sequence Reshaping          │
│      Collapse height dimension      │
│         Output: (W/4, 512)          │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│         Bidirectional LSTM          │
│      2 layers, 256 hidden units     │
│         Output: (W/4, 512)          │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│         Classification Head         │
│      Linear(512 → num_classes)      │
│      Output: (W/4, num_classes)     │
└────────────────┬────────────────────┘
                 │
                 ▼
       Probability Matrix (T, C)
    T = time steps, C = characters
```

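The shape annotations in the diagram can be traced with a small helper (a sketch: the ÷4 width reduction and the 48-pixel height come from the diagram above, and `num_classes=6625` is just an illustrative dictionary size):

```python
def crnn_shapes(width, height=48, num_classes=6625):
    """Trace tensor shapes through the CRNN stages for one crop
    of size (3, height, width)."""
    t = width // 4                       # CNN downsamples width by 4
    return {
        "input": (3, height, width),
        "cnn": (512, 1, t),              # height collapsed to 1
        "sequence": (t, 512),            # one 512-dim vector per column
        "bilstm": (t, 512),              # 2 × 256 hidden units concatenated
        "logits": (t, num_classes),      # one distribution per time step
    }

shapes = crnn_shapes(320)
print(shapes["logits"])   # → (80, 6625): 80 time steps for a 320px crop
```

This makes the CTC setup concrete: a 320-pixel-wide crop yields T = 80 time steps, far more than the number of characters it contains, which is exactly the mismatch CTC decoding resolves below.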
### 2.2 CTC Decoding

```python
# deepdoc/vision/postprocess.py, lines 347-370

class CTCLabelDecode(BaseRecLabelDecode):
    """
    CTC (Connectionist Temporal Classification) Decoder.

    CTC solves the alignment problem:
    - The model output has T time steps
    - The ground truth has N characters
    - T > N (multiple frames per character)
    - The exact alignment is unknown

    CTC adds a special "blank" token (ε):
    - Represents "no output"
    - Allows alignment without explicit segmentation
    """

    def __init__(self, character_dict_path, use_space_char=False):
        super().__init__(character_dict_path, use_space_char)
        # Prepend blank token at index 0
        self.character = ['blank'] + self.character

    def __call__(self, preds, label=None):
        """
        Decode CTC output.

        Args:
            preds: (batch, time, num_classes) probability matrix

        Returns:
            [(text, confidence), ...]
        """
        # Get most probable character at each time step
        preds_idx = preds.argmax(axis=2)   # (batch, time)
        preds_prob = preds.max(axis=2)     # (batch, time)

        # Decode with deduplication
        result = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)

        return result

    def decode(self, text_index, text_prob, is_remove_duplicate=True):
        """
        CTC decoding algorithm.

        Example:
            Raw output:   [a, a, ε, l, l, ε, p, h, a]
            After dedup:  [a, ε, l, ε, p, h, a]
            Remove blank: [a, l, p, h, a]
            Final:        "alpha"
        """
        result = []

        for batch_idx in range(len(text_index)):
            char_list = []
            conf_list = []

            for idx in range(len(text_index[batch_idx])):
                char_idx = text_index[batch_idx][idx]

                # Skip blank token (index 0)
                if char_idx == 0:
                    continue

                # Skip consecutive duplicates
                if is_remove_duplicate:
                    if idx > 0 and char_idx == text_index[batch_idx][idx - 1]:
                        continue

                char_list.append(self.character[char_idx])
                conf_list.append(text_prob[batch_idx][idx])

            text = ''.join(char_list)
            conf = np.mean(conf_list) if conf_list else 0.0

            result.append((text, conf))

        return result
```

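The dedup-then-drop-blank walk-through in the docstring can be reproduced standalone (pure Python; the toy alphabet and index sequence are made up):

```python
def ctc_greedy_decode(indices, alphabet, blank=0):
    """Collapse consecutive repeats, then drop blanks — the same
    two rules applied by CTCLabelDecode.decode above."""
    out = []
    prev = None
    for idx in indices:
        if idx != blank and idx != prev:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# Index 0 is the blank ε; the rest map to characters
alphabet = ["ε", "a", "l", "p", "h"]
raw = [1, 1, 0, 2, 2, 0, 3, 4, 1]   # [a, a, ε, l, l, ε, p, h, a]
print(ctc_greedy_decode(raw, alphabet))  # → "alpha"
```

Note how the blank makes genuine double letters expressible: `[1, 0, 1]` decodes to `"aa"`, whereas `[1, 1]` collapses to `"a"`.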
### 2.3 Aspect Ratio Handling

```python
# deepdoc/vision/ocr.py, lines 146-170

def resize_norm_img(self, img, max_wh_ratio):
    """
    Resize image while maintaining aspect ratio.

    Problem: cropped text images have different widths
    - "Hi" → narrow
    - "Hello World" → wide

    Solution: resize by aspect ratio, pad the right side
    """
    imgC, imgH, imgW = self.rec_image_shape  # [3, 48, 320]

    # Calculate target width from aspect ratio
    max_width = int(imgH * max_wh_ratio)
    max_width = min(max_width, imgW)  # Cap at 320

    h, w = img.shape[:2]
    ratio = w / float(h)

    # Resize while maintaining aspect ratio
    if ratio * imgH > max_width:
        resized_w = max_width
    else:
        resized_w = int(ratio * imgH)

    resized_img = cv2.resize(img, (resized_w, imgH))

    # Normalize: [0, 255] → [-1, 1]
    resized_img = (resized_img.astype(np.float32) / 255.0 - 0.5) / 0.5

    # Pad the right side to max_width with zeros
    padded = np.zeros((imgH, max_width, 3), dtype=np.float32)
    padded[:, :resized_w, :] = resized_img

    # Transpose: HWC → CHW
    padded = padded.transpose(2, 0, 1)

    return padded
```

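The width arithmetic above can be checked without any image data. A sketch that mirrors the computation in `resize_norm_img`, using the `[3, 48, 320]` shape from the code (the `target_width` helper is illustrative):

```python
def target_width(w, h, max_wh_ratio, imgH=48, imgW=320):
    """Reproduce the resized-width computation from resize_norm_img.
    Returns (resized_w, padded_w)."""
    max_width = min(int(imgH * max_wh_ratio), imgW)
    ratio = w / float(h)
    if ratio * imgH > max_width:
        return max_width, max_width          # clipped to the cap
    return int(ratio * imgH), max_width      # aspect ratio preserved

# A 60×30 crop in a batch whose widest crop has ratio 4.0:
# cap = min(48 * 4, 320) = 192; this crop gets 96 px of text + 96 px pad
print(target_width(60, 30, 4.0))    # → (96, 192)
```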
**Visualization**:

```
Original images:
┌──────┐   ┌────────────────┐   ┌──────────────────────┐
│  Hi  │   │     Hello      │   │     Hello World      │
└──────┘   └────────────────┘   └──────────────────────┘
 narrow         medium                   wide

After resize + pad (to width 320):
┌──────────────────────────────────────────────────────┐
│ Hi      │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
├──────────────────────────────────────────────────────┤
│ Hello             │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
├──────────────────────────────────────────────────────┤
│ Hello World             │░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
└──────────────────────────────────────────────────────┘
(░ = zero padding)
```

---

## 3. Full OCR Pipeline

### 3.1 OCR Class

```python
# deepdoc/vision/ocr.py, lines 536-752

class OCR:
    """
    End-to-end OCR pipeline.

    Usage:
        ocr = OCR()
        results = ocr(image)
        # results: [(box_points, (text, confidence)), ...]
    """

    def __init__(self, model_dir=None):
        # Auto-download models if not found
        if model_dir is None:
            model_dir = self._get_model_dir()

        # Initialize detector and recognizer
        self.text_detector = TextDetector(model_dir)
        self.text_recognizer = TextRecognizer(model_dir)

    def __call__(self, img, device_id=0, cls=True):
        """
        Full OCR pipeline.

        Args:
            img: numpy array (H, W, 3) in BGR
            device_id: GPU device ID
            cls: Whether to check text orientation

        Returns:
            [(box_4pts, (text, confidence)), ...]
        """
        # Step 1: Detect text regions
        dt_boxes, det_time = self.text_detector(img)

        if dt_boxes is None or len(dt_boxes) == 0:
            return []

        # Step 2: Sort boxes by reading order
        dt_boxes = self.sorted_boxes(dt_boxes)

        # Step 3: Crop and rotate each text region
        img_crop_list = []
        for box in dt_boxes:
            tmp_box = self.get_rotate_crop_image(img, box)
            img_crop_list.append(tmp_box)

        # Step 4: Recognize text
        rec_res, rec_time = self.text_recognizer(img_crop_list)

        # Step 5: Filter by confidence
        results = []
        for box, rec in zip(dt_boxes, rec_res):
            text, score = rec
            if score >= 0.5:  # drop_score threshold
                results.append((box, (text, score)))

        return results
```

### 3.2 Rotation Detection

```python
# deepdoc/vision/ocr.py, lines 584-638

def get_rotate_crop_image(self, img, points):
    """
    Crop text region with automatic rotation detection.

    Problem: text may be rotated 90° or 270°
    Solution: try multiple orientations, keep the best recognition score
    """
    # Order points: top-left → top-right → bottom-right → bottom-left
    rect = self.order_points_clockwise(points)

    # Perspective transform to get a rectangular crop
    width = int(max(
        np.linalg.norm(rect[0] - rect[1]),
        np.linalg.norm(rect[2] - rect[3])
    ))
    height = int(max(
        np.linalg.norm(rect[0] - rect[3]),
        np.linalg.norm(rect[1] - rect[2])
    ))

    dst = np.array([
        [0, 0],
        [width, 0],
        [width, height],
        [0, height]
    ], dtype=np.float32)

    M = cv2.getPerspectiveTransform(rect, dst)
    warped = cv2.warpPerspective(img, M, (width, height))

    # Check if text is vertical (needs rotation)
    if warped.shape[0] / warped.shape[1] >= 1.5:
        # Try 3 orientations
        orientations = [
            (warped, 0),  # Original
            (cv2.rotate(warped, cv2.ROTATE_90_CLOCKWISE), 90),
            (cv2.rotate(warped, cv2.ROTATE_90_COUNTERCLOCKWISE), -90)
        ]

        best_score = -1
        best_img = warped

        for rot_img, angle in orientations:
            # Quick recognition pass to get a confidence score
            rec_res, _ = self.text_recognizer([rot_img])
            _, score = rec_res[0]
            if score > best_score:
                best_score = score
                best_img = rot_img

        warped = best_img

    return warped
```

### 3.3 Reading Order Sorting

```python
# deepdoc/vision/ocr.py, lines 640-661

def sorted_boxes(self, dt_boxes):
    """
    Sort boxes into reading order (top-to-bottom, left-to-right).

    Algorithm:
    1. Sort by Y coordinate (top of box)
    2. Within the same "row" (Y within 10 px), sort by X coordinate
    """
    num_boxes = len(dt_boxes)
    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))

    # Bubble boxes left-to-right within each row
    _boxes = list(sorted_boxes)

    for i in range(num_boxes - 1):
        for j in range(i, -1, -1):
            # If boxes are on the same row (Y difference < 10)
            if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10:
                # Sort by X coordinate
                if _boxes[j + 1][0][0] < _boxes[j][0][0]:
                    _boxes[j], _boxes[j + 1] = _boxes[j + 1], _boxes[j]
            else:
                break

    return _boxes
```

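The two-pass ordering is easiest to see on toy data. A standalone sketch of the same idea (boxes represented as lists of corner points, comparing only the top-left corner as above):

```python
def reading_order(boxes, row_tol=10):
    """Sort by top-left (Y, X), then bubble boxes left-to-right
    within rows whose top edges are within row_tol pixels."""
    boxes = sorted(boxes, key=lambda b: (b[0][1], b[0][0]))
    for i in range(len(boxes) - 1):
        for j in range(i, -1, -1):
            if abs(boxes[j + 1][0][1] - boxes[j][0][1]) < row_tol:
                if boxes[j + 1][0][0] < boxes[j][0][0]:
                    boxes[j], boxes[j + 1] = boxes[j + 1], boxes[j]
            else:
                break
    return boxes

# a and b share a row (tops within 10 px) despite a sitting 3 px lower;
# c sits on the row below
a = [[5, 3]]; b = [[120, 0]]; c = [[0, 40]]
print([box[0] for box in reading_order([c, b, a])])
# → [[5, 3], [120, 0], [0, 40]]
```

A plain sort by `(y, x)` would put `b` before `a` because `b`'s top is 3 px higher; the row-tolerance pass is what restores left-to-right order within a visually single line.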
---
## 4. Performance Optimization

### 4.1 GPU Memory Management

```python
# deepdoc/vision/ocr.py, lines 96-127

def load_model(model_dir, nm, device_id=None):
    """
    Load ONNX model with optimized settings.
    """
    # Resolve the model file from its name (assumed "<nm>.onnx" naming)
    model_path = os.path.join(model_dir, nm + ".onnx")

    options = ort.SessionOptions()

    # Reduce memory fragmentation
    options.enable_cpu_mem_arena = False

    # Sequential execution (more predictable memory usage)
    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    # Limit thread usage
    options.intra_op_num_threads = 2
    options.inter_op_num_threads = 2

    # GPU configuration
    if torch.cuda.is_available() and device_id is not None:
        providers = [
            ('CUDAExecutionProvider', {
                'device_id': device_id,
                # Limit GPU memory (default 2 GB, overridable via env var)
                'gpu_mem_limit': int(os.getenv('OCR_GPU_MEM_LIMIT_MB', 2048)) * 1024 * 1024,
                # Memory allocation strategy
                'arena_extend_strategy': os.getenv('OCR_ARENA_EXTEND_STRATEGY', 'kNextPowerOfTwo'),
            })
        ]
    else:
        providers = ['CPUExecutionProvider']

    session = ort.InferenceSession(model_path, options, providers=providers)

    # Run options: shrink the GPU memory arena after each run
    run_opts = ort.RunOptions()
    run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

    return session, run_opts
```

### 4.2 Batch Processing Optimization

```python
# deepdoc/vision/ocr.py, lines 363-408

def __call__(self, img_list):
    """
    Optimized batch recognition.
    """
    # Sort images by aspect ratio for efficient batching:
    # similar widths in a batch → less padding waste
    indices = np.argsort([img.shape[1] / img.shape[0] for img in img_list])

    results = [None] * len(img_list)

    for batch_start in range(0, len(indices), self.batch_size):
        batch_indices = indices[batch_start:batch_start + self.batch_size]

        # Largest aspect ratio in the batch determines the padded width
        max_wh_ratio = max(img_list[i].shape[1] / img_list[i].shape[0]
                           for i in batch_indices)

        # Normalize all images in the batch to the same width
        norm_imgs = []
        for i in batch_indices:
            norm_img = self.resize_norm_img(img_list[i], max_wh_ratio)
            norm_imgs.append(norm_img)

        # Stack into a batch
        batch = np.stack(norm_imgs)

        # Run inference
        preds = self.ort_sess.run(None, {"input": batch})

        # Decode results
        texts = self.postprocess_op(preds[0])

        # Map back to original indices
        for j, idx in enumerate(batch_indices):
            results[idx] = texts[j]

    return results
```

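Why sort by aspect ratio first? Each image in a batch is padded up to the widest member, so grouping similar ratios cuts wasted padding. A quick sketch with made-up ratios (`padding_waste` is illustrative, not repo code):

```python
def padding_waste(ratios, batch_size):
    """Total padded-but-empty width (in aspect-ratio units) across
    batches: each image is padded up to the max ratio in its batch."""
    waste = 0.0
    for i in range(0, len(ratios), batch_size):
        batch = ratios[i:i + batch_size]
        waste += sum(max(batch) - r for r in batch)
    return waste

ratios = [1.0, 8.0, 1.2, 7.5]                    # mixed narrow and wide crops
unsorted_waste = padding_waste(ratios, 2)        # batches [1.0,8.0], [1.2,7.5]
sorted_waste = padding_waste(sorted(ratios), 2)  # batches [1.0,1.2], [7.5,8.0]
print(unsorted_waste, sorted_waste)              # sorted order wastes far less
```

Less padding means fewer wasted time steps in the recognizer, which is pure throughput gain since the blank-heavy padded columns still cost inference time.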
### 4.3 Multi-GPU Parallel Processing

```python
# deepdoc/vision/ocr.py, lines 556-579

class OCR:
    def __init__(self, model_dir=None):
        if settings.PARALLEL_DEVICES > 0:
            # Create per-GPU instances
            self.text_detector = [
                TextDetector(model_dir, device_id)
                for device_id in range(settings.PARALLEL_DEVICES)
            ]
            self.text_recognizer = [
                TextRecognizer(model_dir, device_id)
                for device_id in range(settings.PARALLEL_DEVICES)
            ]
        else:
            # Single instance for CPU / single GPU
            self.text_detector = TextDetector(model_dir)
            self.text_recognizer = TextRecognizer(model_dir)
```

---

## 5. Troubleshooting

### 5.1 Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Low accuracy | Low resolution input | Increase zoomin factor (3-5) |
| Slow inference | Large images | Resize to max 960px |
| Memory error | Too many candidates | Reduce max_candidates |
| Missing text | Tight boundaries | Increase unclip_ratio |
| Wrong orientation | Vertical text | Enable rotation detection |

### 5.2 Debugging Tips

```python
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Visualize detections
from deepdoc.vision.seeit import draw_boxes

img_with_boxes = draw_boxes(img, dt_boxes)
cv2.imwrite("debug_detection.png", img_with_boxes)

# Check confidence scores
for box, (text, conf) in results:
    print(f"Text: {text}, Confidence: {conf:.2f}")
    if conf < 0.5:
        print("  ⚠️ Low confidence!")
```

---

## 6. References

- DBNet Paper: [Real-time Scene Text Detection with Differentiable Binarization](https://arxiv.org/abs/1911.08947)
- CRNN Paper: [An End-to-End Trainable Neural Network for Image-based Sequence Recognition](https://arxiv.org/abs/1507.05717)
- CTC Paper: [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- PaddleOCR: [GitHub](https://github.com/PaddlePaddle/PaddleOCR)