# Vision Algorithms

## Overview

RAGFlow uses computer vision algorithms for document understanding, OCR, and layout analysis.

## 1. OCR (Optical Character Recognition)

### File Location

`/deepdoc/vision/ocr.py` (lines 30-120)

### Purpose

Text detection and recognition from document images.
### Implementation

```python
import onnxruntime as ort

class OCR:
    def __init__(self):
        # Load ONNX models
        self.det_model = ort.InferenceSession("ocr_det.onnx")
        self.rec_model = ort.InferenceSession("ocr_rec.onnx")

    def detect(self, image, device_id=0):
        """
        Detect text regions in image.

        Returns:
            List of bounding boxes with confidence scores
        """
        # Preprocess
        img = self._preprocess_det(image)
        # Run detection
        outputs = self.det_model.run(None, {"input": img})
        # Post-process to get boxes
        boxes = self._postprocess_det(outputs[0])
        return boxes

    def recognize(self, image, boxes):
        """
        Recognize text in detected regions.

        Returns:
            List of (text, confidence) tuples
        """
        results = []
        for box in boxes:
            # Crop region
            crop = self._crop_region(image, box)
            # Preprocess
            img = self._preprocess_rec(crop)
            # Run recognition
            outputs = self.rec_model.run(None, {"input": img})
            # Decode to text
            text, conf = self._decode_ctc(outputs[0])
            results.append((text, conf))
        return results
```
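Typical usage might look like the following sketch. The model paths and the internal helpers (`_preprocess_det`, `_decode_ctc`, etc.) are RAGFlow implementation details, so treat this as illustrative only:

```python
import cv2

# Hypothetical end-to-end call: detect text regions, then recognize each one
image = cv2.imread("page.png")
ocr = OCR()
boxes = ocr.detect(image)                  # text region bounding boxes
for text, conf in ocr.recognize(image, boxes):
    print(f"{conf:.2f}  {text}")
```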
### OCR Pipeline

```
┌─────────────────────────────────────────────────────┐
│                    Input Image                      │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│              Detection Model (ONNX)                 │
│  - DB (Differentiable Binarization) based           │
│  - Output: Text region polygons                     │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│                  Post-processing                    │
│  - Polygon to bounding box                          │
│  - Filter by confidence                             │
│  - NMS for overlapping boxes                        │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│             Recognition Model (ONNX)                │
│  - CRNN (CNN + RNN) based                           │
│  - CTC decoding                                     │
│  - Output: Character sequence                       │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│       Output: [(text, confidence, box), ...]        │
└─────────────────────────────────────────────────────┘
```
### CTC Decoding

```
CTC (Connectionist Temporal Classification):

Input: probability matrix P (T × C)
  T = time steps, C = character classes

Algorithm:
  1. For each time step, take the most probable character
  2. Merge consecutive duplicates
  3. Remove blank tokens

Example:
  Raw output:          [a, a, -, b, b, b, -, c]
  After merge:         [a, -, b, -, c]
  After blank removal: [a, b, c]
  Final: "abc"
```
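The three steps map directly onto a few lines of NumPy. Below is a minimal greedy-decoder sketch; the function name, the toy charset, and the blank-at-index-0 convention are assumptions for illustration, not RAGFlow's actual `_decode_ctc`:

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Greedy CTC decode of a (T, C) probability matrix -> (text, confidence)."""
    # 1. Most probable class at each time step
    best = probs.argmax(axis=1)
    # 2. Merge consecutive duplicates
    merged = [best[0]] + [c for prev, c in zip(best, best[1:]) if c != prev]
    # 3. Remove blank tokens
    chars = [charset[int(c)] for c in merged if c != blank]
    # One common confidence convention: mean of the per-step max probabilities
    conf = float(probs.max(axis=1).mean())
    return "".join(chars), conf

# Toy run reproducing the example above (index 0 is the blank "-")
charset = ["-", "a", "b", "c"]
probs = np.full((8, 4), 0.01)
for t, c in enumerate([1, 1, 0, 2, 2, 2, 0, 3]):   # a a - b b b - c
    probs[t, c] = 0.97
print(ctc_greedy_decode(probs, charset))            # ('abc', 0.97)
```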
## 2. Layout Recognition (YOLOv10)

### File Location

`/deepdoc/vision/layout_recognizer.py` (lines 33-100)

### Purpose

Detect document layout elements (text, title, table, figure, etc.).
### Implementation

```python
class LayoutRecognizer:
    LABELS = [
        "text", "title", "figure", "figure caption",
        "table", "table caption", "header", "footer",
        "reference", "equation"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("layout_yolov10.onnx")

    def detect(self, image):
        """
        Detect layout elements in document image.
        """
        # Preprocess (resize, normalize)
        img = self._preprocess(image)
        # Run inference
        outputs = self.model.run(None, {"images": img})
        # Post-process
        boxes, labels, scores = self._postprocess(outputs[0])
        # Filter by confidence
        results = []
        for box, label, score in zip(boxes, labels, scores):
            if score > 0.4:  # Confidence threshold
                results.append({
                    "box": box,
                    "type": self.LABELS[label],
                    "confidence": score
                })
        return results
```
### Layout Types

| Type | Description |
|---|---|
| text | Body text paragraphs |
| title | Section/document titles |
| figure | Images, diagrams, charts |
| figure caption | Text describing figures |
| table | Data tables |
| table caption | Text describing tables |
| header | Page headers |
| footer | Page footers |
| reference | Bibliography, citations |
| equation | Mathematical equations |
### YOLO Detection

YOLOv10 runs in three stages:

1. Backbone: feature extraction (CSPDarknet)
2. Neck: feature pyramid (PANet)
3. Head: prediction heads for different scales

Each raw detection has the format `[x_center, y_center, width, height, confidence, class_probs...]`.

Post-processing (see the sketch below):

1. Apply sigmoid to confidence
2. Multiply confidence × class probability to get class scores
3. Filter by score threshold
4. Apply NMS
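A minimal post-processing sketch under stated assumptions: raw output rows of shape `(N, 5 + num_classes)` holding unactivated logits, and the 0.4 threshold used by `LayoutRecognizer.detect`. RAGFlow's real `_postprocess` may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical decoder for raw rows [cx, cy, w, h, obj_logit, class_logits...]
def decode_yolo(raw, score_thresh=0.4):
    boxes, labels, scores = [], [], []
    for row in raw:
        cx, cy, w, h = row[:4]
        obj = sigmoid(row[4])                    # 1. sigmoid on confidence
        cls_probs = sigmoid(row[5:])
        label = int(cls_probs.argmax())
        score = float(obj * cls_probs[label])    # 2. conf × class_prob
        if score < score_thresh:                 # 3. score threshold
            continue
        # Convert center format to corner format [x1, y1, x2, y2]
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        labels.append(label)
        scores.append(score)
    # 4. NMS (see section 4) would then be applied to boxes/scores
    return np.array(boxes), np.array(labels), np.array(scores)
```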
## 3. Table Structure Recognition (TSR)

### File Location

`/deepdoc/vision/table_structure_recognizer.py` (lines 30-100)

### Purpose

Detect table structure (rows, columns, cells, headers).
### Implementation

```python
class TableStructureRecognizer:
    LABELS = [
        "table", "table column", "table row",
        "table column header", "projected row header",
        "spanning cell"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("table_structure.onnx")

    def recognize(self, table_image):
        """
        Recognize structure of a table image.
        """
        # Preprocess
        img = self._preprocess(table_image)
        # Run inference
        outputs = self.model.run(None, {"input": img})
        # Parse structure
        structure = self._parse_structure(outputs)
        return structure

    def _parse_structure(self, outputs):
        """
        Parse model output into table structure.
        """
        rows = []
        columns = []
        cells = []
        for detection in outputs:
            label = self.LABELS[detection["class"]]
            if label == "table row":
                rows.append(detection["box"])
            elif label == "table column":
                columns.append(detection["box"])
            elif label == "spanning cell":
                cells.append({
                    "box": detection["box"],
                    "colspan": self._estimate_colspan(detection, columns),
                    "rowspan": self._estimate_rowspan(detection, rows)
                })
        return {
            "rows": sorted(rows, key=lambda x: x[1]),        # Sort by Y
            "columns": sorted(columns, key=lambda x: x[0]),  # Sort by X
            "cells": cells
        }
```
### TSR Output

```
{
  "rows": [
    {"y": 10, "height": 30},    # Row 1
    {"y": 40, "height": 30},    # Row 2
    ...
  ],
  "columns": [
    {"x": 0, "width": 100},     # Col 1
    {"x": 100, "width": 150},   # Col 2
    ...
  ],
  "cells": [
    {"row": 0, "col": 0, "text": "Header 1"},
    {"row": 0, "col": 1, "text": "Header 2"},
    {"row": 1, "col": 0, "text": "Data 1", "colspan": 2},
    ...
  ]
}
```
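Turning detected rows and columns into cells like the ones above requires assigning each OCR text box to a `(row, col)` slot. Below is a hypothetical helper, assuming rows and columns are `[x1, y1, x2, y2]` boxes as collected by `_parse_structure`:

```python
# Hypothetical helper: assign an OCR text box to its (row, col) cell
# by picking the row/column with maximum overlap along each axis.
def locate_cell(box, rows, columns):
    x1, y1, x2, y2 = box

    def overlap(a1, a2, b1, b2):
        # Length of the overlap of intervals [a1, a2] and [b1, b2]
        return max(0, min(a2, b2) - max(a1, b1))

    row = max(range(len(rows)),
              key=lambda i: overlap(y1, y2, rows[i][1], rows[i][3]))
    col = max(range(len(columns)),
              key=lambda j: overlap(x1, x2, columns[j][0], columns[j][2]))
    return row, col
```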
## 4. Non-Maximum Suppression (NMS)

### File Location

`/deepdoc/vision/operators.py` (lines 702-725)

### Purpose

Filter overlapping bounding boxes in object detection.
### Implementation

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression algorithm.

    Args:
        boxes: np.ndarray of [x1, y1, x2, y2] rows
        scores: Confidence scores
        iou_threshold: IoU threshold for suppression

    Returns:
        Indices of kept boxes
    """
    # Sort by score (descending)
    indices = np.argsort(scores)[::-1]
    keep = []
    while len(indices) > 0:
        # Keep highest scoring box
        current = indices[0]
        keep.append(current)
        if len(indices) == 1:
            break
        # Compute IoU of the kept box against all remaining boxes (vectorized)
        remaining = indices[1:]
        ious = compute_iou(boxes[current], boxes[remaining])
        # Keep only boxes with IoU below the threshold
        indices = remaining[ious < iou_threshold]
    return keep
```
### NMS Algorithm

```
NMS (Non-Maximum Suppression):

Input:  boxes B, scores S, threshold θ
Output: filtered boxes

Algorithm:
  1. Sort boxes by score (descending)
  2. Select the box with the highest score → add to results
  3. Remove remaining boxes with IoU > θ against the selected box
  4. Repeat until no boxes remain

Example (θ = 0.5):
  Boxes: [A(0.9), B(0.8), C(0.7)]
  IoU(A, B) = 0.7 > 0.5 → remove B
  IoU(A, C) = 0.3 < 0.5 → keep C
  Result: [A, C]
```
## 5. Intersection over Union (IoU)

### File Location

- `/deepdoc/vision/operators.py` (lines 702-725)
- `/deepdoc/vision/recognizer.py` (lines 339-357)

### Purpose

Measure overlap between bounding boxes.
### Implementation

```python
def compute_iou(box1, box2):
    """
    Compute Intersection over Union.

    Args:
        box1, box2: [x1, y1, x2, y2] format

    Returns:
        IoU value in [0, 1]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # Intersection area (zero if the boxes do not overlap)
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    # IoU (guard against degenerate zero-area boxes)
    if union == 0:
        return 0
    return intersection / union
```
### IoU Formula

```
IoU = Area(A ∩ B) / Area(A ∪ B)
    = Area(A ∩ B) / (Area(A) + Area(B) - Area(A ∩ B))

Range: [0, 1]
  - IoU = 0: no overlap
  - IoU = 1: perfect overlap

Threshold usage:
  - Detection: IoU > 0.5 → same object
  - NMS:       IoU > 0.5 → suppress duplicate
```
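The `nms` sketch in section 4 calls `compute_iou` with one box against many, so it needs a vectorized variant. The version below is an assumption for illustration, not RAGFlow's exact code; the toy boxes mirror the A/B/C example from the NMS section:

```python
import numpy as np

def compute_iou_vec(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area + areas - inter
    # Epsilon guards against division by zero for degenerate boxes
    return inter / np.maximum(union, 1e-9)

boxes = np.array([[0, 0, 10, 10],      # A
                  [1, 1, 11, 11],      # B: heavily overlaps A (IoU ≈ 0.68)
                  [20, 20, 30, 30]])   # C: disjoint from A (IoU = 0)
scores = np.array([0.9, 0.8, 0.7])
print(compute_iou_vec(boxes[0], boxes[1:]))  # -> [0.6806..., 0.0]
```

With `compute_iou` bound to this vectorized variant, `nms(boxes, scores)` keeps A and C and suppresses B, matching the worked example in section 4.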
## 6. Image Preprocessing

### File Location

`/deepdoc/vision/operators.py`

### Purpose

Prepare images for neural network input.
### Implementation

```python
import cv2
import numpy as np

class StandardizeImage:
    """Normalize image to [0, 1] range."""
    def __call__(self, image):
        return image.astype(np.float32) / 255.0

class NormalizeImage:
    """Apply mean/std normalization."""
    def __init__(self, mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]):
        self.mean = np.array(mean)
        self.std = np.array(std)

    def __call__(self, image):
        return (image - self.mean) / self.std

class ToCHWImage:
    """Convert HWC to CHW format."""
    def __call__(self, image):
        return image.transpose((2, 0, 1))

class LinearResize:
    """Resize image maintaining aspect ratio."""
    def __init__(self, target_size):
        self.target = target_size

    def __call__(self, image):
        h, w = image.shape[:2]
        scale = self.target / max(h, w)
        new_h, new_w = int(h * scale), int(w * scale)
        return cv2.resize(image, (new_w, new_h),
                          interpolation=cv2.INTER_CUBIC)
```
### Preprocessing Pipeline

1. Resize (maintain aspect ratio): target 640 or 1280 depending on the model
2. Standardize (0-255 → 0-1): `image = image / 255.0`
3. Normalize (ImageNet stats): `image = (image - mean) / std`, with mean `[0.485, 0.456, 0.406]` and std `[0.229, 0.224, 0.225]`
4. Transpose (HWC → CHW): PyTorch/ONNX layout `(C, H, W)`
5. Pad with zeros to a square shape

A composed pipeline is sketched below.
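One way to chain the operator classes above into a single model-input function; the composition, the zero-padding step, and the batch dimension are assumptions, since `operators.py` wires these together internally:

```python
import numpy as np

# Hypothetical composition of the operator classes defined above
def preprocess(image, target_size=640):
    ops = [
        LinearResize(target_size),   # 1. resize, keeping aspect ratio
        StandardizeImage(),          # 2. 0-255 -> 0-1
        NormalizeImage(),            # 3. ImageNet mean/std
        ToCHWImage(),                # 4. HWC -> CHW
    ]
    for op in ops:
        image = op(image)
    # 5. Pad with zeros to a square (C, target_size, target_size)
    c, h, w = image.shape
    padded = np.zeros((c, target_size, target_size), dtype=image.dtype)
    padded[:, :h, :w] = image
    # Add a batch dimension for ONNX Runtime: (1, C, H, W)
    return padded[np.newaxis].astype(np.float32)
```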
## 7. XGBoost Text Concatenation

### File Location

`/deepdoc/parser/pdf_parser.py` (lines 88-101, 131-170)

### Purpose

Predict whether adjacent text boxes should be merged.
### Implementation

```python
import xgboost as xgb

class PDFParser:
    def __init__(self):
        # Load pre-trained XGBoost model
        self.concat_model = xgb.Booster()
        self.concat_model.load_model("updown_concat_xgb.model")

    def should_concat(self, box1, box2):
        """
        Predict if two text boxes should be concatenated.
        """
        # Extract features
        features = self._extract_concat_features(box1, box2)
        # Create DMatrix
        dmatrix = xgb.DMatrix([features])
        # Predict probability
        prob = self.concat_model.predict(dmatrix)[0]
        return prob > 0.5

    def _extract_concat_features(self, box1, box2):
        """
        Extract 20+ features for the concatenation decision.
        """
        features = []
        # Distance features: vertical gap relative to character height
        y_dist = box2["top"] - box1["bottom"]
        char_height = box1["bottom"] - box1["top"]
        features.append(y_dist / max(char_height, 1))
        # Alignment features: horizontal overlap ratio
        x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
        features.append(x_overlap / max(box1["x1"] - box1["x0"], 1))
        # Text pattern features
        text1, text2 = box1["text"], box2["text"]
        features.append(1 if text1.endswith((".", "。", "!", "?")) else 0)
        features.append(1 if text2[:1].isupper() else 0)  # [:1] avoids IndexError on empty text
        # Layout features
        features.append(1 if box1.get("layout_num") == box2.get("layout_num") else 0)
        # ... more features
        return features
```
### Feature List

XGBoost concatenation features (a toy invocation follows the list):

1. Spatial features:
   - Y-distance / char_height
   - X-alignment overlap ratio
   - Same-page flag
2. Text pattern features:
   - Ends with sentence punctuation
   - Ends with continuation punctuation
   - Next starts with uppercase
   - Next starts with number
   - Chinese numbering pattern
3. Layout features:
   - Same layout_type
   - Same layout_num
   - Same column
4. Tokenization features:
   - Token count ratio
   - Last/first token match

Total: 20+ features
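As a toy illustration, here is how `should_concat` might be invoked, using the box keys assumed by `_extract_concat_features` above; the coordinate values and the presence of the model file on disk are assumptions:

```python
# Two consecutive lines from the same layout block: a small vertical gap,
# left-aligned, and the first line ends mid-sentence
line1 = {"x0": 50, "x1": 500, "top": 100, "bottom": 120,
         "text": "The quick brown fox jumps over the", "layout_num": 3}
line2 = {"x0": 50, "x1": 320, "top": 124, "bottom": 144,
         "text": "lazy dog.", "layout_num": 3}

parser = PDFParser()                       # loads updown_concat_xgb.model
if parser.should_concat(line1, line2):
    line1["text"] += " " + line2["text"]
```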
## Summary
| Algorithm | Purpose | Model Type |
|---|---|---|
| OCR | Text detection + recognition | ONNX (DB + CRNN) |
| Layout Recognition | Element detection | ONNX (YOLOv10) |
| TSR | Table structure | ONNX |
| NMS | Box filtering | Classical |
| IoU | Overlap measure | Classical |
| XGBoost | Text concatenation | Gradient Boosting |
## Related Files

- `/deepdoc/vision/ocr.py` - OCR models
- `/deepdoc/vision/layout_recognizer.py` - Layout detection
- `/deepdoc/vision/table_structure_recognizer.py` - TSR
- `/deepdoc/vision/operators.py` - Image processing
- `/deepdoc/parser/pdf_parser.py` - XGBoost integration