Vision Algorithms

Overview

RAGFlow uses computer vision algorithms for document understanding, OCR, and layout analysis.

1. OCR (Optical Character Recognition)

File Location

/deepdoc/vision/ocr.py (lines 30-120)

Purpose

Text detection and recognition from document images.

Implementation

import onnxruntime as ort

class OCR:
    def __init__(self):
        # Load ONNX models
        self.det_model = ort.InferenceSession("ocr_det.onnx")
        self.rec_model = ort.InferenceSession("ocr_rec.onnx")

    def detect(self, image, device_id=0):
        """
        Detect text regions in image.

        Returns:
            List of bounding boxes with confidence scores
        """
        # Preprocess
        img = self._preprocess_det(image)

        # Run detection
        outputs = self.det_model.run(None, {"input": img})

        # Post-process to get boxes
        boxes = self._postprocess_det(outputs[0])

        return boxes

    def recognize(self, image, boxes):
        """
        Recognize text in detected regions.

        Returns:
            List of (text, confidence) tuples
        """
        results = []

        for box in boxes:
            # Crop region
            crop = self._crop_region(image, box)

            # Preprocess
            img = self._preprocess_rec(crop)

            # Run recognition
            outputs = self.rec_model.run(None, {"input": img})

            # Decode to text
            text, conf = self._decode_ctc(outputs[0])
            results.append((text, conf))

        return results
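
A minimal end-to-end usage sketch of the two-stage pipeline above (the image path and exact method names follow the simplified class, not necessarily the full ocr.py API):

import cv2

# Load a page image and run detection followed by recognition
image = cv2.imread("page_001.png")   # illustrative path

ocr = OCR()
boxes = ocr.detect(image)               # text region boxes
texts = ocr.recognize(image, boxes)     # [(text, confidence), ...]

for box, (text, conf) in zip(boxes, texts):
    print(f"{conf:.2f}  {text}  @ {box}")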

OCR Pipeline

OCR Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│  Input Image                                                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  Detection Model (ONNX)                                          │
│  - DB (Differentiable Binarization) based                       │
│  - Output: Text region polygons                                 │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  Post-processing                                                 │
│  - Polygon to bounding box                                      │
│  - Filter by confidence                                         │
│  - NMS for overlapping boxes                                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  Recognition Model (ONNX)                                        │
│  - CRNN (CNN + RNN) based                                       │
│  - CTC decoding                                                 │
│  - Output: Character sequence                                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  Output: [(text, confidence, box), ...]                         │
└─────────────────────────────────────────────────────────────────┘

CTC Decoding

CTC (Connectionist Temporal Classification):

Input: Probability matrix P (T × C)
       T = time steps, C = character classes

Algorithm:
1. For each time step, get most probable character
2. Merge consecutive duplicates
3. Remove blank tokens

Example:
Raw output: [a, a, -, b, b, b, -, c]
After merge: [a, -, b, -, c]
After blank removal: [a, b, c]
Final: "abc"
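
A minimal greedy CTC decoder implementing the three steps above (a sketch; the blank index and character set are assumptions, not the values used by the actual recognition model):

import numpy as np

def greedy_ctc_decode(probs, charset, blank_idx=0):
    """
    Greedy CTC decoding.

    probs:   (T, C) probability matrix from the recognition head
    charset: characters for non-blank classes (class id 1 -> charset[0])
    """
    # 1. Most probable class per time step
    best_path = np.argmax(probs, axis=1)

    # 2. Merge consecutive duplicates, 3. drop blank tokens
    chars, confs = [], []
    prev = -1
    for t, idx in enumerate(best_path):
        if idx != prev and idx != blank_idx:
            chars.append(charset[idx - 1])
            confs.append(probs[t, idx])
        prev = idx

    text = "".join(chars)
    confidence = float(np.mean(confs)) if confs else 0.0
    return text, confidence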

2. Layout Recognition (YOLOv10)

File Location

/deepdoc/vision/layout_recognizer.py (lines 33-100)

Purpose

Detect document layout elements (text, title, table, figure, etc.).

Implementation

import onnxruntime as ort

class LayoutRecognizer:
    LABELS = [
        "text", "title", "figure", "figure caption",
        "table", "table caption", "header", "footer",
        "reference", "equation"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("layout_yolov10.onnx")

    def detect(self, image):
        """
        Detect layout elements in document image.
        """
        # Preprocess (resize, normalize)
        img = self._preprocess(image)

        # Run inference
        outputs = self.model.run(None, {"images": img})

        # Post-process
        boxes, labels, scores = self._postprocess(outputs[0])

        # Filter by confidence
        results = []
        for box, label, score in zip(boxes, labels, scores):
            if score > 0.4:  # Confidence threshold
                results.append({
                    "box": box,
                    "type": self.LABELS[label],
                    "confidence": score
                })

        return results

Layout Types

Document Layout Categories:
┌──────────────────┬────────────────────────────────────┐
│ Type             │ Description                        │
├──────────────────┼────────────────────────────────────┤
│ text             │ Body text paragraphs               │
│ title            │ Section/document titles            │
│ figure           │ Images, diagrams, charts           │
│ figure caption   │ Text describing figures            │
│ table            │ Data tables                        │
│ table caption    │ Text describing tables             │
│ header           │ Page headers                       │
│ footer           │ Page footers                       │
│ reference        │ Bibliography, citations            │
│ equation         │ Mathematical equations             │
└──────────────────┴────────────────────────────────────┘
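
Downstream parsing typically groups these detections by type; a short usage sketch against the simplified detect() output above (the image path is illustrative):

import cv2
from collections import defaultdict

page_image = cv2.imread("page_001.png")   # illustrative path

recognizer = LayoutRecognizer()
detections = recognizer.detect(page_image)

# Group detections by layout type for downstream handling
by_type = defaultdict(list)
for det in detections:
    by_type[det["type"]].append(det)

# e.g. collect every detected table region for table structure recognition (section 3)
table_boxes = [det["box"] for det in by_type["table"]]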

YOLO Detection

YOLOv10 Detection:

1. Backbone: Feature extraction (CSPDarknet)
2. Neck: Feature pyramid (PANet)
3. Head: Prediction heads for different scales

Output format:
[x_center, y_center, width, height, confidence, class_probs...]

Post-processing:
1. Apply sigmoid to confidence
2. Multiply conf × class_prob for class scores
3. Filter by score threshold
4. Apply NMS
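
A simplified NumPy sketch of this post-processing recipe for a prediction tensor in the format above (it assumes the exported model already applies sigmoid to its outputs and reuses the nms() helper from section 4; real export shapes vary):

import numpy as np

def postprocess_yolo(pred, score_thr=0.4, iou_thr=0.5):
    """
    pred: (N, 4 + 1 + num_classes) array of
          [x_center, y_center, w, h, objectness, class_probs...]
    """
    # Convert center format to corner format [x1, y1, x2, y2]
    xy, wh = pred[:, :2], pred[:, 2:4]
    boxes = np.concatenate([xy - wh / 2, xy + wh / 2], axis=1)

    # Combined score: objectness x best class probability
    obj = pred[:, 4]
    cls_probs = pred[:, 5:]
    labels = np.argmax(cls_probs, axis=1)
    scores = obj * cls_probs[np.arange(len(pred)), labels]

    # Filter by score, then suppress overlapping boxes
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    kept = nms(boxes, scores, iou_threshold=iou_thr)   # see section 4
    return boxes[kept], labels[kept], scores[kept]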

3. Table Structure Recognition (TSR)

File Location

/deepdoc/vision/table_structure_recognizer.py (lines 30-100)

Purpose

Detect table structure (rows, columns, cells, headers).

Implementation

import onnxruntime as ort

class TableStructureRecognizer:
    LABELS = [
        "table", "table column", "table row",
        "table column header", "projected row header",
        "spanning cell"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("table_structure.onnx")

    def recognize(self, table_image):
        """
        Recognize structure of a table image.
        """
        # Preprocess
        img = self._preprocess(table_image)

        # Run inference
        outputs = self.model.run(None, {"input": img})

        # Parse structure
        structure = self._parse_structure(outputs)

        return structure

    def _parse_structure(self, outputs):
        """
        Parse model output into table structure.
        """
        rows = []
        columns = []
        cells = []

        for detection in outputs:
            label = self.LABELS[detection["class"]]

            if label == "table row":
                rows.append(detection["box"])
            elif label == "table column":
                columns.append(detection["box"])
            elif label == "spanning cell":
                cells.append({
                    "box": detection["box"],
                    "colspan": self._estimate_colspan(detection, columns),
                    "rowspan": self._estimate_rowspan(detection, rows)
                })

        return {
            "rows": sorted(rows, key=lambda x: x[1]),  # Sort by Y
            "columns": sorted(columns, key=lambda x: x[0]),  # Sort by X
            "cells": cells
        }

TSR Output

Table Structure Output:

{
    "rows": [
        {"y": 10, "height": 30},   # Row 1
        {"y": 40, "height": 30},   # Row 2
        ...
    ],
    "columns": [
        {"x": 0, "width": 100},    # Col 1
        {"x": 100, "width": 150},  # Col 2
        ...
    ],
    "cells": [
        {"row": 0, "col": 0, "text": "Header 1"},
        {"row": 0, "col": 1, "text": "Header 2"},
        {"row": 1, "col": 0, "text": "Data 1", "colspan": 2},
        ...
    ]
}
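
Mapping OCR text into this grid is typically done by intersecting each text box with the detected row and column boxes; a hedged sketch (assign_cell and its box format are illustrative, not the actual parser code):

def assign_cell(text_box, rows, columns):
    """
    Map a text bounding box [x0, y0, x1, y1] to (row_index, col_index)
    by picking the row/column box with the largest 1-D overlap.
    """
    def overlap_1d(a0, a1, b0, b1):
        return max(0, min(a1, b1) - max(a0, b0))

    row_idx = max(range(len(rows)),
                  key=lambda i: overlap_1d(text_box[1], text_box[3],
                                           rows[i][1], rows[i][3]))
    col_idx = max(range(len(columns)),
                  key=lambda i: overlap_1d(text_box[0], text_box[2],
                                           columns[i][0], columns[i][2]))
    return row_idx, col_idx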

4. Non-Maximum Suppression (NMS)

File Location

/deepdoc/vision/operators.py (lines 702-725)

Purpose

Filter overlapping bounding boxes in object detection.

Implementation

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression algorithm.

    Args:
        boxes: np.ndarray of [x1, y1, x2, y2] boxes
        scores: np.ndarray of confidence scores
        iou_threshold: IoU threshold for suppression

    Returns:
        Indices of kept boxes
    """
    # Sort by score (descending)
    indices = np.argsort(scores)[::-1]

    keep = []
    while len(indices) > 0:
        # Keep highest scoring box
        current = indices[0]
        keep.append(current)

        if len(indices) == 1:
            break

        # Compute IoU with remaining boxes
        remaining = indices[1:]
        ious = compute_iou(boxes[current], boxes[remaining])

        # Keep boxes with IoU below threshold
        indices = remaining[ious < iou_threshold]

    return keep
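
Note that the call above expects compute_iou to broadcast one box against many. A vectorized NumPy variant of the pairwise function shown in section 5 (a sketch, not the exact operators.py code):

import numpy as np

def compute_iou_vec(box, boxes):
    """IoU of one box [x1, y1, x2, y2] against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])

    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area1 = (box[2] - box[0]) * (box[3] - box[1])
    area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area1 + area2 - inter

    # Guard against zero-area unions
    return inter / np.maximum(union, 1e-6)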

NMS Algorithm

NMS (Non-Maximum Suppression):

Input: Boxes B, Scores S, Threshold θ
Output: Filtered boxes

Algorithm:
1. Sort boxes by score (descending)
2. Select box with highest score → add to results
3. Remove boxes with IoU > θ with selected box
4. Repeat until no boxes remain

Example:
Boxes: [A(0.9), B(0.8), C(0.7)]
IoU(A,B) = 0.7 > 0.5 → Remove B
IoU(A,C) = 0.3 < 0.5 → Keep C
Result: [A, C]

5. Intersection over Union (IoU)

File Location

/deepdoc/vision/operators.py (lines 702-725)
/deepdoc/vision/recognizer.py (lines 339-357)

Purpose

Measure overlap between bounding boxes.

Implementation

def compute_iou(box1, box2):
    """
    Compute Intersection over Union.

    Args:
        box1, box2: [x1, y1, x2, y2] format

    Returns:
        IoU value in [0, 1]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    # IoU
    if union == 0:
        return 0

    return intersection / union

IoU Formula

IoU (Intersection over Union):

IoU = Area(A ∩ B) / Area(A ∪ B)

     = Area(A ∩ B) / (Area(A) + Area(B) - Area(A ∩ B))

Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap

Threshold Usage:
- Detection: IoU > 0.5 → Same object
- NMS: IoU > 0.5 → Suppress duplicate
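
A quick worked example using compute_iou from above:

# Two 100x100 boxes offset by 50 px in each direction:
# intersection = 50 * 50 = 2500, union = 10000 + 10000 - 2500 = 17500
iou = compute_iou([0, 0, 100, 100], [50, 50, 150, 150])
print(round(iou, 4))   # 0.1429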

6. Image Preprocessing

File Location

/deepdoc/vision/operators.py

Purpose

Prepare images for neural network input.

Implementation

import cv2
import numpy as np

class StandardizeImage:
    """Normalize image to [0, 1] range."""

    def __call__(self, image):
        return image.astype(np.float32) / 255.0

class NormalizeImage:
    """Apply mean/std normalization."""

    def __init__(self, mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]):
        self.mean = np.array(mean)
        self.std = np.array(std)

    def __call__(self, image):
        return (image - self.mean) / self.std

class ToCHWImage:
    """Convert HWC to CHW format."""

    def __call__(self, image):
        return image.transpose((2, 0, 1))

class LinearResize:
    """Resize image maintaining aspect ratio."""

    def __init__(self, target_size):
        self.target = target_size

    def __call__(self, image):
        h, w = image.shape[:2]
        scale = self.target / max(h, w)
        new_h, new_w = int(h * scale), int(w * scale)
        return cv2.resize(image, (new_w, new_h),
                         interpolation=cv2.INTER_CUBIC)

Preprocessing Pipeline

Image Preprocessing Pipeline:

1. Resize (maintain aspect ratio)
   - Target: 640 or 1280 depending on model

2. Standardize (0-255 → 0-1)
   - image = image / 255.0

3. Normalize (ImageNet stats)
   - image = (image - mean) / std
   - mean = [0.485, 0.456, 0.406]
   - std = [0.229, 0.224, 0.225]

4. Transpose (HWC → CHW)
   - PyTorch format: (C, H, W)

5. Pad (to square)
   - Pad with zeros to square shape
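
Chaining the operators above into a single callable pipeline (a sketch; the real preprocessing configuration is model-specific):

import numpy as np

def preprocess(image, target_size=640):
    ops = [
        LinearResize(target_size),
        StandardizeImage(),
        NormalizeImage(),
        ToCHWImage(),
    ]
    for op in ops:
        image = op(image)

    # Pad to a square (target_size x target_size) with zeros
    c, h, w = image.shape
    padded = np.zeros((c, target_size, target_size), dtype=np.float32)
    padded[:, :h, :w] = image

    # Add batch dimension: (1, C, H, W)
    return padded[np.newaxis, ...]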

7. XGBoost Text Concatenation

File Location

/deepdoc/parser/pdf_parser.py (lines 88-101, 131-170)

Purpose

Predict whether adjacent text boxes should be merged.

Implementation

import xgboost as xgb

class PDFParser:
    def __init__(self):
        # Load pre-trained XGBoost model
        self.concat_model = xgb.Booster()
        self.concat_model.load_model("updown_concat_xgb.model")

    def should_concat(self, box1, box2):
        """
        Predict if two text boxes should be concatenated.
        """
        # Extract features
        features = self._extract_concat_features(box1, box2)

        # Create DMatrix
        dmatrix = xgb.DMatrix([features])

        # Predict probability
        prob = self.concat_model.predict(dmatrix)[0]

        return prob > 0.5

    def _extract_concat_features(self, box1, box2):
        """
        Extract 20+ features for concatenation decision.
        """
        features = []

        # Distance features
        y_dist = box2["top"] - box1["bottom"]
        char_height = box1["bottom"] - box1["top"]
        features.append(y_dist / max(char_height, 1))

        # Alignment features
        x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
        features.append(x_overlap / max(box1["x1"] - box1["x0"], 1))

        # Text pattern features
        text1, text2 = box1["text"], box2["text"]
        features.append(1 if text1.endswith((".", "。", "!", "?")) else 0)
        features.append(1 if text2[:1].isupper() else 0)  # guard against empty text

        # Layout features
        features.append(1 if box1.get("layout_num") == box2.get("layout_num") else 0)

        # ... more features

        return features

Feature List

XGBoost Concatenation Features:

1. Spatial Features:
   - Y-distance / char_height
   - X-alignment overlap ratio
   - Same page flag

2. Text Pattern Features:
   - Ends with sentence punctuation
   - Ends with continuation punctuation
   - Next starts with uppercase
   - Next starts with number
   - Chinese numbering pattern

3. Layout Features:
   - Same layout_type
   - Same layout_num
   - Same column

4. Tokenization Features:
   - Token count ratio
   - Last/first token match

Total: 20+ features
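
Applied over a page, the prediction drives a simple sequential merge of text boxes into blocks; a hedged sketch using should_concat from above (box field names are illustrative):

def merge_boxes(boxes, parser):
    """
    Merge vertically adjacent text boxes into blocks whenever the
    XGBoost model predicts they belong to the same paragraph.
    """
    if not boxes:
        return []

    blocks = [dict(boxes[0])]
    for box in boxes[1:]:
        prev = blocks[-1]
        if parser.should_concat(prev, box):
            prev["text"] = prev["text"] + " " + box["text"]
            prev["bottom"] = box["bottom"]          # extend downwards
            prev["x0"] = min(prev["x0"], box["x0"])
            prev["x1"] = max(prev["x1"], box["x1"])
        else:
            blocks.append(dict(box))
    return blocks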

Summary

┌─────────────────────┬───────────────────────────────┬──────────────────────┐
│ Algorithm           │ Purpose                       │ Model Type           │
├─────────────────────┼───────────────────────────────┼──────────────────────┤
│ OCR                 │ Text detection + recognition  │ ONNX (DB + CRNN)     │
│ Layout Recognition  │ Element detection             │ ONNX (YOLOv10)       │
│ TSR                 │ Table structure               │ ONNX                 │
│ NMS                 │ Box filtering                 │ Classical            │
│ IoU                 │ Overlap measure               │ Classical            │
│ XGBoost             │ Text concatenation            │ Gradient Boosting    │
└─────────────────────┴───────────────────────────────┴──────────────────────┘
  • /deepdoc/vision/ocr.py - OCR models
  • /deepdoc/vision/layout_recognizer.py - Layout detection
  • /deepdoc/vision/table_structure_recognizer.py - TSR
  • /deepdoc/vision/operators.py - Image processing
  • /deepdoc/parser/pdf_parser.py - XGBoost integration