# Vision Algorithms

## Overview

RAGFlow uses computer vision algorithms for document understanding, OCR, and layout analysis.

## 1. OCR (Optical Character Recognition)

### File Location

```
/deepdoc/vision/ocr.py (lines 30-120)
```

### Purpose

Text detection and recognition from document images.

### Implementation

```python
import onnxruntime as ort


class OCR:
    def __init__(self):
        # Load ONNX models
        self.det_model = ort.InferenceSession("ocr_det.onnx")
        self.rec_model = ort.InferenceSession("ocr_rec.onnx")

    def detect(self, image, device_id=0):
        """
        Detect text regions in image.

        Returns:
            List of bounding boxes with confidence scores
        """
        # Preprocess
        img = self._preprocess_det(image)

        # Run detection
        outputs = self.det_model.run(None, {"input": img})

        # Post-process to get boxes
        boxes = self._postprocess_det(outputs[0])

        return boxes

    def recognize(self, image, boxes):
        """
        Recognize text in detected regions.

        Returns:
            List of (text, confidence) tuples
        """
        results = []

        for box in boxes:
            # Crop region
            crop = self._crop_region(image, box)

            # Preprocess
            img = self._preprocess_rec(crop)

            # Run recognition
            outputs = self.rec_model.run(None, {"input": img})

            # Decode to text
            text, conf = self._decode_ctc(outputs[0])
            results.append((text, conf))

        return results
```

### OCR Pipeline

```
OCR Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│                          Input Image                            │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Detection Model (ONNX)                      │
│  - DB (Differentiable Binarization) based                       │
│  - Output: Text region polygons                                 │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Post-processing                          │
│  - Polygon to bounding box                                      │
│  - Filter by confidence                                         │
│  - NMS for overlapping boxes                                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Recognition Model (ONNX)                     │
│  - CRNN (CNN + RNN) based                                       │
│  - CTC decoding                                                 │
│  - Output: Character sequence                                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│              Output: [(text, confidence, box), ...]             │
└─────────────────────────────────────────────────────────────────┘
```

### CTC Decoding

```
CTC (Connectionist Temporal Classification):

Input: Probability matrix P (T × C)
  T = time steps, C = character classes

Algorithm:
1. For each time step, take the most probable character
2. Merge consecutive duplicates
3. Remove blank tokens

Example:
  Raw output:          [a, a, -, b, b, b, -, c]
  After merge:         [a, -, b, -, c]
  After blank removal: [a, b, c]
  Final: "abc"
```
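
The three steps above amount to a greedy decoder. A minimal sketch (not RAGFlow's actual `_decode_ctc`; the charset layout and blank index are illustrative assumptions):

```python
import numpy as np


def ctc_greedy_decode(probs, charset, blank=0):
    """Greedy CTC decoding: argmax per step, merge repeats, drop blanks."""
    best = probs.argmax(axis=1)  # most probable class at each time step
    decoded, confs = [], []
    prev = blank
    for t, c in enumerate(best):
        # Skip blanks and repeats of the immediately preceding character
        if c != blank and c != prev:
            decoded.append(charset[c - 1])  # charset indexed after the blank
            confs.append(probs[t, c])
        prev = c
    conf = float(np.mean(confs)) if confs else 0.0
    return "".join(decoded), conf
```

Feeding a probability matrix whose per-step argmax is `[a, a, -, b, b, b, -, c]` (with `-` as the blank at index 0) reproduces the "abc" example above.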

---

## 2. Layout Recognition (YOLOv10)

### File Location

```
/deepdoc/vision/layout_recognizer.py (lines 33-100)
```

### Purpose

Detect document layout elements (text, title, table, figure, etc.).

### Implementation

```python
import onnxruntime as ort


class LayoutRecognizer:
    LABELS = [
        "text", "title", "figure", "figure caption",
        "table", "table caption", "header", "footer",
        "reference", "equation"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("layout_yolov10.onnx")

    def detect(self, image):
        """
        Detect layout elements in document image.
        """
        # Preprocess (resize, normalize)
        img = self._preprocess(image)

        # Run inference
        outputs = self.model.run(None, {"images": img})

        # Post-process
        boxes, labels, scores = self._postprocess(outputs[0])

        # Filter by confidence
        results = []
        for box, label, score in zip(boxes, labels, scores):
            if score > 0.4:  # Confidence threshold
                results.append({
                    "box": box,
                    "type": self.LABELS[label],
                    "confidence": score
                })

        return results
```

### Layout Types

```
Document Layout Categories:
┌──────────────────┬────────────────────────────────────┐
│ Type             │ Description                        │
├──────────────────┼────────────────────────────────────┤
│ text             │ Body text paragraphs               │
│ title            │ Section/document titles            │
│ figure           │ Images, diagrams, charts           │
│ figure caption   │ Text describing figures            │
│ table            │ Data tables                        │
│ table caption    │ Text describing tables             │
│ header           │ Page headers                       │
│ footer           │ Page footers                       │
│ reference        │ Bibliography, citations            │
│ equation         │ Mathematical equations             │
└──────────────────┴────────────────────────────────────┘
```

### YOLO Detection

```
YOLOv10 Detection:

1. Backbone: Feature extraction (CSPDarknet)
2. Neck: Feature pyramid (PANet)
3. Head: Prediction heads for different scales

Output format:
[x_center, y_center, width, height, confidence, class_probs...]

Post-processing:
1. Apply sigmoid to confidence
2. Multiply conf × class_prob for class scores
3. Filter by score threshold
4. Apply NMS
```
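
The decode steps can be sketched as follows. This is an illustrative post-processor, not the project's actual `_postprocess`: it assumes one row per candidate with an objectness score followed by class probabilities, and that sigmoid has already been applied.

```python
import numpy as np


def decode_yolo(preds, score_thr=0.4):
    """Decode raw [x_c, y_c, w, h, obj, class_probs...] rows into
    (corner_box, class_id, score) tuples, filtered by score."""
    results = []
    for row in preds:
        xc, yc, w, h, obj = row[:5]
        class_probs = row[5:]
        cls = int(np.argmax(class_probs))
        score = float(obj * class_probs[cls])  # conf × class_prob
        if score > score_thr:
            # Center format → corner format [x1, y1, x2, y2]
            box = [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]
            results.append((box, cls, score))
    return results
```

NMS would then run on the surviving corner-format boxes, as in section 4 below.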

---

## 3. Table Structure Recognition (TSR)

### File Location

```
/deepdoc/vision/table_structure_recognizer.py (lines 30-100)
```

### Purpose

Detect table structure (rows, columns, cells, headers).

### Implementation

```python
import onnxruntime as ort


class TableStructureRecognizer:
    LABELS = [
        "table", "table column", "table row",
        "table column header", "projected row header",
        "spanning cell"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("table_structure.onnx")

    def recognize(self, table_image):
        """
        Recognize structure of a table image.
        """
        # Preprocess
        img = self._preprocess(table_image)

        # Run inference
        outputs = self.model.run(None, {"input": img})

        # Parse structure
        structure = self._parse_structure(outputs)

        return structure

    def _parse_structure(self, outputs):
        """
        Parse model output into table structure.
        """
        rows = []
        columns = []
        cells = []

        for detection in outputs:
            label = self.LABELS[detection["class"]]

            if label == "table row":
                rows.append(detection["box"])
            elif label == "table column":
                columns.append(detection["box"])
            elif label == "spanning cell":
                cells.append({
                    "box": detection["box"],
                    "colspan": self._estimate_colspan(detection, columns),
                    "rowspan": self._estimate_rowspan(detection, rows)
                })

        return {
            "rows": sorted(rows, key=lambda x: x[1]),       # Sort by Y
            "columns": sorted(columns, key=lambda x: x[0]),  # Sort by X
            "cells": cells
        }
```

### TSR Output

```
Table Structure Output:

{
  "rows": [
    {"y": 10, "height": 30},   # Row 1
    {"y": 40, "height": 30},   # Row 2
    ...
  ],
  "columns": [
    {"x": 0, "width": 100},    # Col 1
    {"x": 100, "width": 150},  # Col 2
    ...
  ],
  "cells": [
    {"row": 0, "col": 0, "text": "Header 1"},
    {"row": 0, "col": 1, "text": "Header 2"},
    {"row": 1, "col": 0, "text": "Data 1", "colspan": 2},
    ...
  ]
}
```
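
One way to turn the detected row/column bands into grid coordinates is center-point containment. A hypothetical helper (not part of RAGFlow), assuming rows and columns are corner-format `[x1, y1, x2, y2]` boxes already sorted as in `_parse_structure`:

```python
def locate_cell(cell_box, rows, columns):
    """Assign a cell to (row, col) indices via the index of the sorted
    row/column band that contains the cell's center point."""
    cx = (cell_box[0] + cell_box[2]) / 2
    cy = (cell_box[1] + cell_box[3]) / 2

    def band_index(bands, lo, hi, center):
        # Return the first band whose [lo, hi] span contains the center
        for i, b in enumerate(bands):
            if b[lo] <= center <= b[hi]:
                return i
        return None

    # Rows are vertical bands (compare y), columns horizontal (compare x)
    return (band_index(rows, 1, 3, cy), band_index(columns, 0, 2, cx))
```

A spanning cell can then be expanded from this anchor index using its estimated `rowspan`/`colspan`.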

---

## 4. Non-Maximum Suppression (NMS)

### File Location

```
/deepdoc/vision/operators.py (lines 702-725)
```

### Purpose

Filter overlapping bounding boxes in object detection.

### Implementation

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression algorithm.

    Args:
        boxes: Array of [x1, y1, x2, y2]
        scores: Confidence scores
        iou_threshold: IoU threshold for suppression

    Returns:
        Indices of kept boxes
    """
    # Sort by score (descending)
    indices = np.argsort(scores)[::-1]

    keep = []
    while len(indices) > 0:
        # Keep highest scoring box
        current = indices[0]
        keep.append(current)

        if len(indices) == 1:
            break

        # Compute IoU with remaining boxes (compute_iou is assumed to
        # broadcast over an array of boxes here)
        remaining = indices[1:]
        ious = compute_iou(boxes[current], boxes[remaining])

        # Keep boxes with IoU below threshold
        indices = remaining[ious < iou_threshold]

    return keep
```

### NMS Algorithm

```
NMS (Non-Maximum Suppression):

Input: Boxes B, Scores S, Threshold θ
Output: Filtered boxes

Algorithm:
1. Sort boxes by score (descending)
2. Select box with highest score → add to results
3. Remove boxes with IoU > θ with selected box
4. Repeat until no boxes remain

Example (θ = 0.5):
  Boxes: [A(0.9), B(0.8), C(0.7)]
  IoU(A,B) = 0.7 > 0.5 → Remove B
  IoU(A,C) = 0.3 < 0.5 → Keep C
  Result: [A, C]
```
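
The walk-through can be reproduced end to end. A self-contained sketch with a pairwise IoU helper; the box coordinates are illustrative, chosen so IoU(A,B) ≈ 0.68 (suppressed) and IoU(A,C) = 0 (kept):

```python
import numpy as np


def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        current = order[0]
        keep.append(int(current))
        # Survivors: remaining boxes whose overlap with the kept box is low
        order = np.array([i for i in order[1:]
                          if iou(boxes[current], boxes[i]) < iou_threshold])
    return keep
```

Running it on `A = [0, 0, 10, 10]`, `B = [1, 1, 11, 11]`, `C = [20, 20, 30, 30]` with scores `[0.9, 0.8, 0.7]` keeps A and C, matching the example.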

---

## 5. Intersection over Union (IoU)

### File Location

```
/deepdoc/vision/operators.py (lines 702-725)
/deepdoc/vision/recognizer.py (lines 339-357)
```

### Purpose

Measure overlap between bounding boxes.

### Implementation

```python
def compute_iou(box1, box2):
    """
    Compute Intersection over Union.

    Args:
        box1, box2: [x1, y1, x2, y2] format

    Returns:
        IoU value in [0, 1]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    # IoU
    if union == 0:
        return 0

    return intersection / union
```

### IoU Formula

```
IoU (Intersection over Union):

IoU = Area(A ∩ B) / Area(A ∪ B)
    = Area(A ∩ B) / (Area(A) + Area(B) - Area(A ∩ B))

Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap

Threshold Usage:
- Detection: IoU > 0.5 → Same object
- NMS: IoU > 0.5 → Suppress duplicate
```
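
A quick numeric check of the formula, restating `compute_iou` so the snippet runs standalone: two 10×10 boxes offset by 5 overlap in a 5×5 square, so IoU = 25 / (100 + 100 − 25) = 25/175 ≈ 0.143.

```python
def compute_iou(box1, box2):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # Clamp to zero so disjoint boxes get zero intersection
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union else 0.0


# intersection = 5 × 5 = 25, union = 100 + 100 - 25 = 175 → IoU ≈ 0.143
value = compute_iou([0, 0, 10, 10], [5, 5, 15, 15])
```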

---

## 6. Image Preprocessing

### File Location

```
/deepdoc/vision/operators.py
```

### Purpose

Prepare images for neural network input.

### Implementation

```python
import cv2
import numpy as np


class StandardizeImage:
    """Normalize image to [0, 1] range."""

    def __call__(self, image):
        return image.astype(np.float32) / 255.0


class NormalizeImage:
    """Apply mean/std normalization."""

    def __init__(self, mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]):
        self.mean = np.array(mean)
        self.std = np.array(std)

    def __call__(self, image):
        return (image - self.mean) / self.std


class ToCHWImage:
    """Convert HWC to CHW format."""

    def __call__(self, image):
        return image.transpose((2, 0, 1))


class LinearResize:
    """Resize image maintaining aspect ratio."""

    def __init__(self, target_size):
        self.target = target_size

    def __call__(self, image):
        h, w = image.shape[:2]
        scale = self.target / max(h, w)
        new_h, new_w = int(h * scale), int(w * scale)
        return cv2.resize(image, (new_w, new_h),
                          interpolation=cv2.INTER_CUBIC)
```

### Preprocessing Pipeline

```
Image Preprocessing Pipeline:

1. Resize (maintain aspect ratio)
   - Target: 640 or 1280 depending on model

2. Standardize (0-255 → 0-1)
   - image = image / 255.0

3. Normalize (ImageNet stats)
   - image = (image - mean) / std
   - mean = [0.485, 0.456, 0.406]
   - std = [0.229, 0.224, 0.225]

4. Transpose (HWC → CHW)
   - PyTorch format: (C, H, W)

5. Pad (to square)
   - Pad with zeros to square shape
```
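
Steps 2-5 can be chained in a few lines. A numpy-only sketch (the resize step is omitted here to avoid the cv2 dependency; the 640 target and ImageNet stats follow the pipeline above):

```python
import numpy as np


def preprocess(image, target=640,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Steps 2-5 on an HWC uint8 image (assumed already resized so that
    max(h, w) <= target): standardize, normalize, transpose, zero-pad."""
    img = image.astype(np.float32) / 255.0        # 0-255 → 0-1
    img = (img - np.array(mean)) / np.array(std)  # ImageNet normalization
    img = img.transpose((2, 0, 1))                # HWC → CHW
    c, h, w = img.shape
    padded = np.zeros((c, target, target), dtype=np.float32)
    padded[:, :h, :w] = img                       # zero-pad bottom/right
    return padded
```

A 480×640 input therefore comes out as a (3, 640, 640) tensor with zeros below row 480.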

---

## 7. XGBoost Text Concatenation

### File Location

```
/deepdoc/parser/pdf_parser.py (lines 88-101, 131-170)
```

### Purpose

Predict whether adjacent text boxes should be merged.

### Implementation

```python
import numpy as np
import xgboost as xgb


class PDFParser:
    def __init__(self):
        # Load pre-trained XGBoost model
        self.concat_model = xgb.Booster()
        self.concat_model.load_model("updown_concat_xgb.model")

    def should_concat(self, box1, box2):
        """
        Predict if two text boxes should be concatenated.
        """
        # Extract features
        features = self._extract_concat_features(box1, box2)

        # Create DMatrix (expects a 2D array: one row per sample)
        dmatrix = xgb.DMatrix(np.array([features], dtype=float))

        # Predict probability
        prob = self.concat_model.predict(dmatrix)[0]

        return prob > 0.5

    def _extract_concat_features(self, box1, box2):
        """
        Extract 20+ features for concatenation decision.
        """
        features = []

        # Distance features
        y_dist = box2["top"] - box1["bottom"]
        char_height = box1["bottom"] - box1["top"]
        features.append(y_dist / max(char_height, 1))

        # Alignment features
        x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
        features.append(x_overlap / max(box1["x1"] - box1["x0"], 1))

        # Text pattern features
        text1, text2 = box1["text"], box2["text"]
        features.append(1 if text1.endswith((".", "。", "!", "?")) else 0)
        features.append(1 if text2 and text2[0].isupper() else 0)

        # Layout features
        features.append(1 if box1.get("layout_num") == box2.get("layout_num") else 0)

        # ... more features

        return features
```

### Feature List

```
XGBoost Concatenation Features:

1. Spatial Features:
   - Y-distance / char_height
   - X-alignment overlap ratio
   - Same page flag

2. Text Pattern Features:
   - Ends with sentence punctuation
   - Ends with continuation punctuation
   - Next starts with uppercase
   - Next starts with number
   - Chinese numbering pattern

3. Layout Features:
   - Same layout_type
   - Same layout_num
   - Same column

4. Tokenization Features:
   - Token count ratio
   - Last/first token match

Total: 20+ features
```
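
For concreteness, here is a toy computation of four of the features above for two hypothetical adjacent boxes (an illustrative subset, not RAGFlow's actual 20+ feature extractor):

```python
def concat_features(box1, box2):
    """Toy subset of the concatenation features: vertical gap ratio,
    horizontal overlap ratio, and two text-pattern flags."""
    y_dist = box2["top"] - box1["bottom"]
    char_height = box1["bottom"] - box1["top"]

    x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
    width1 = box1["x1"] - box1["x0"]

    return [
        y_dist / max(char_height, 1),             # gap relative to line height
        x_overlap / max(width1, 1),               # horizontal alignment
        1 if box1["text"].endswith((".", "。", "!", "?")) else 0,
        1 if box2["text"][:1].isupper() else 0,   # next line starts a sentence?
    ]
```

A small gap (0.25 line heights), strong alignment, no terminal punctuation, and a lowercase continuation all point toward merging the two boxes.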

---

## Summary

| Algorithm | Purpose | Model Type |
|-----------|---------|------------|
| OCR | Text detection + recognition | ONNX (DB + CRNN) |
| Layout Recognition | Element detection | ONNX (YOLOv10) |
| TSR | Table structure | ONNX |
| NMS | Box filtering | Classical |
| IoU | Overlap measure | Classical |
| XGBoost | Text concatenation | Gradient Boosting |

## Related Files

- `/deepdoc/vision/ocr.py` - OCR models
- `/deepdoc/vision/layout_recognizer.py` - Layout detection
- `/deepdoc/vision/table_structure_recognizer.py` - TSR
- `/deepdoc/vision/operators.py` - Image processing
- `/deepdoc/parser/pdf_parser.py` - XGBoost integration