# Vision Algorithms

## Overview

RAGFlow uses computer vision algorithms for document understanding, OCR, and layout analysis.

## 1. OCR (Optical Character Recognition)

### File Location

```
/deepdoc/vision/ocr.py (lines 30-120)
```

### Purpose

Text detection and recognition from document images.

### Implementation

```python
import onnxruntime as ort


class OCR:
    def __init__(self):
        # Load ONNX models
        self.det_model = ort.InferenceSession("ocr_det.onnx")
        self.rec_model = ort.InferenceSession("ocr_rec.onnx")

    def detect(self, image, device_id=0):
        """
        Detect text regions in image.

        Returns:
            List of bounding boxes with confidence scores
        """
        # Preprocess
        img = self._preprocess_det(image)

        # Run detection
        outputs = self.det_model.run(None, {"input": img})

        # Post-process to get boxes
        boxes = self._postprocess_det(outputs[0])

        return boxes

    def recognize(self, image, boxes):
        """
        Recognize text in detected regions.

        Returns:
            List of (text, confidence) tuples
        """
        results = []

        for box in boxes:
            # Crop region
            crop = self._crop_region(image, box)

            # Preprocess
            img = self._preprocess_rec(crop)

            # Run recognition
            outputs = self.rec_model.run(None, {"input": img})

            # Decode to text
            text, conf = self._decode_ctc(outputs[0])
            results.append((text, conf))

        return results
```

### OCR Pipeline

```
OCR Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│                          Input Image                            │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Detection Model (ONNX)                      │
│  - DB (Differentiable Binarization) based                       │
│  - Output: Text region polygons                                 │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Post-processing                          │
│  - Polygon to bounding box                                      │
│  - Filter by confidence                                         │
│  - NMS for overlapping boxes                                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Recognition Model (ONNX)                     │
│  - CRNN (CNN + RNN) based                                       │
│  - CTC decoding                                                 │
│  - Output: Character sequence                                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│              Output: [(text, confidence, box), ...]             │
└─────────────────────────────────────────────────────────────────┘
```

### CTC Decoding

```
CTC (Connectionist Temporal Classification):

Input: Probability matrix P (T × C)
  T = time steps, C = character classes

Algorithm:
1. For each time step, take the most probable character
2. Merge consecutive duplicates
3. Remove blank tokens

Example:
  Raw output:          [a, a, -, b, b, b, -, c]
  After merge:         [a, -, b, -, c]
  After blank removal: [a, b, c]
  Final: "abc"
```
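
The three steps above amount to a greedy decoder. A minimal sketch (not RAGFlow's actual `_decode_ctc`; the charset layout and blank index are illustrative assumptions):

```python
import numpy as np


def ctc_greedy_decode(probs, charset, blank=0):
    """Greedy CTC decoding: argmax per step, merge repeats, drop blanks."""
    best = probs.argmax(axis=1)  # most probable class at each time step
    decoded, confs = [], []
    prev = blank
    for t, c in enumerate(best):
        # Skip blanks and repeats of the immediately preceding character
        if c != blank and c != prev:
            decoded.append(charset[c - 1])  # charset indexed after the blank
            confs.append(probs[t, c])
        prev = c
    conf = float(np.mean(confs)) if confs else 0.0
    return "".join(decoded), conf
```

Feeding a probability matrix whose per-step argmax is `[a, a, -, b, b, b, -, c]` (with `-` as the blank at index 0) reproduces the "abc" example above.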

---

## 2. Layout Recognition (YOLOv10)

### File Location

```
/deepdoc/vision/layout_recognizer.py (lines 33-100)
```

### Purpose

Detect document layout elements (text, title, table, figure, etc.).

### Implementation

```python
import onnxruntime as ort


class LayoutRecognizer:
    LABELS = [
        "text", "title", "figure", "figure caption",
        "table", "table caption", "header", "footer",
        "reference", "equation"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("layout_yolov10.onnx")

    def detect(self, image):
        """
        Detect layout elements in document image.
        """
        # Preprocess (resize, normalize)
        img = self._preprocess(image)

        # Run inference
        outputs = self.model.run(None, {"images": img})

        # Post-process
        boxes, labels, scores = self._postprocess(outputs[0])

        # Filter by confidence
        results = []
        for box, label, score in zip(boxes, labels, scores):
            if score > 0.4:  # Confidence threshold
                results.append({
                    "box": box,
                    "type": self.LABELS[label],
                    "confidence": score
                })

        return results
```

### Layout Types

```
Document Layout Categories:
┌──────────────────┬────────────────────────────────────┐
│ Type             │ Description                        │
├──────────────────┼────────────────────────────────────┤
│ text             │ Body text paragraphs               │
│ title            │ Section/document titles            │
│ figure           │ Images, diagrams, charts           │
│ figure caption   │ Text describing figures            │
│ table            │ Data tables                        │
│ table caption    │ Text describing tables             │
│ header           │ Page headers                       │
│ footer           │ Page footers                       │
│ reference        │ Bibliography, citations            │
│ equation         │ Mathematical equations             │
└──────────────────┴────────────────────────────────────┘
```

### YOLO Detection

```
YOLOv10 Detection:

1. Backbone: Feature extraction (CSPDarknet)
2. Neck: Feature pyramid (PANet)
3. Head: Prediction heads for different scales

Output format:
[x_center, y_center, width, height, confidence, class_probs...]

Post-processing:
1. Apply sigmoid to confidence
2. Multiply conf × class_prob for class scores
3. Filter by score threshold
4. Apply NMS
```
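
The decode steps can be sketched as follows. This is an illustrative post-processor, not the project's actual `_postprocess`: it assumes one row per candidate with an objectness score followed by class probabilities, and that sigmoid has already been applied.

```python
import numpy as np


def decode_yolo(preds, score_thr=0.4):
    """Decode raw [x_c, y_c, w, h, obj, class_probs...] rows into
    (corner_box, class_id, score) tuples, filtered by score."""
    results = []
    for row in preds:
        xc, yc, w, h, obj = row[:5]
        class_probs = row[5:]
        cls = int(np.argmax(class_probs))
        score = float(obj * class_probs[cls])  # conf × class_prob
        if score > score_thr:
            # Center format → corner format [x1, y1, x2, y2]
            box = [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]
            results.append((box, cls, score))
    return results
```

NMS would then run on the surviving corner-format boxes, as in section 4 below.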

---

## 3. Table Structure Recognition (TSR)

### File Location

```
/deepdoc/vision/table_structure_recognizer.py (lines 30-100)
```

### Purpose

Detect table structure (rows, columns, cells, headers).

### Implementation

```python
import onnxruntime as ort


class TableStructureRecognizer:
    LABELS = [
        "table", "table column", "table row",
        "table column header", "projected row header",
        "spanning cell"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("table_structure.onnx")

    def recognize(self, table_image):
        """
        Recognize structure of a table image.
        """
        # Preprocess
        img = self._preprocess(table_image)

        # Run inference
        outputs = self.model.run(None, {"input": img})

        # Parse structure
        structure = self._parse_structure(outputs)

        return structure

    def _parse_structure(self, outputs):
        """
        Parse model output into table structure.
        """
        rows = []
        columns = []
        cells = []

        for detection in outputs:
            label = self.LABELS[detection["class"]]

            if label == "table row":
                rows.append(detection["box"])
            elif label == "table column":
                columns.append(detection["box"])
            elif label == "spanning cell":
                cells.append({
                    "box": detection["box"],
                    "colspan": self._estimate_colspan(detection, columns),
                    "rowspan": self._estimate_rowspan(detection, rows)
                })

        return {
            "rows": sorted(rows, key=lambda x: x[1]),       # Sort by Y
            "columns": sorted(columns, key=lambda x: x[0]),  # Sort by X
            "cells": cells
        }
```

### TSR Output

```
Table Structure Output:

{
  "rows": [
    {"y": 10, "height": 30},   # Row 1
    {"y": 40, "height": 30},   # Row 2
    ...
  ],
  "columns": [
    {"x": 0, "width": 100},    # Col 1
    {"x": 100, "width": 150},  # Col 2
    ...
  ],
  "cells": [
    {"row": 0, "col": 0, "text": "Header 1"},
    {"row": 0, "col": 1, "text": "Header 2"},
    {"row": 1, "col": 0, "text": "Data 1", "colspan": 2},
    ...
  ]
}
```
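
One way to turn the detected row/column bands into grid coordinates is center-point containment. A hypothetical helper (not part of RAGFlow), assuming rows and columns are corner-format `[x1, y1, x2, y2]` boxes already sorted as in `_parse_structure`:

```python
def locate_cell(cell_box, rows, columns):
    """Assign a cell to (row, col) indices via the index of the sorted
    row/column band that contains the cell's center point."""
    cx = (cell_box[0] + cell_box[2]) / 2
    cy = (cell_box[1] + cell_box[3]) / 2

    def band_index(bands, lo, hi, center):
        # Return the first band whose [lo, hi] span contains the center
        for i, b in enumerate(bands):
            if b[lo] <= center <= b[hi]:
                return i
        return None

    # Rows are vertical bands (compare y), columns horizontal (compare x)
    return (band_index(rows, 1, 3, cy), band_index(columns, 0, 2, cx))
```

A spanning cell can then be expanded from this anchor index using its estimated `rowspan`/`colspan`.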

---

## 4. Non-Maximum Suppression (NMS)

### File Location

```
/deepdoc/vision/operators.py (lines 702-725)
```

### Purpose

Filter overlapping bounding boxes in object detection.

### Implementation

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression algorithm.

    Args:
        boxes: Array of [x1, y1, x2, y2]
        scores: Confidence scores
        iou_threshold: IoU threshold for suppression

    Returns:
        Indices of kept boxes
    """
    # Sort by score (descending)
    indices = np.argsort(scores)[::-1]

    keep = []
    while len(indices) > 0:
        # Keep highest scoring box
        current = indices[0]
        keep.append(current)

        if len(indices) == 1:
            break

        # Compute IoU with remaining boxes (compute_iou is assumed to
        # broadcast over an array of boxes here)
        remaining = indices[1:]
        ious = compute_iou(boxes[current], boxes[remaining])

        # Keep boxes with IoU below threshold
        indices = remaining[ious < iou_threshold]

    return keep
```

### NMS Algorithm

```
NMS (Non-Maximum Suppression):

Input: Boxes B, Scores S, Threshold θ
Output: Filtered boxes

Algorithm:
1. Sort boxes by score (descending)
2. Select box with highest score → add to results
3. Remove boxes with IoU > θ with selected box
4. Repeat until no boxes remain

Example (θ = 0.5):
  Boxes: [A(0.9), B(0.8), C(0.7)]
  IoU(A,B) = 0.7 > 0.5 → Remove B
  IoU(A,C) = 0.3 < 0.5 → Keep C
  Result: [A, C]
```
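
The walk-through can be reproduced end to end. A self-contained sketch with a pairwise IoU helper; the box coordinates are illustrative, chosen so IoU(A,B) ≈ 0.68 (suppressed) and IoU(A,C) = 0 (kept):

```python
import numpy as np


def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        current = order[0]
        keep.append(int(current))
        # Survivors: remaining boxes whose overlap with the kept box is low
        order = np.array([i for i in order[1:]
                          if iou(boxes[current], boxes[i]) < iou_threshold])
    return keep
```

Running it on `A = [0, 0, 10, 10]`, `B = [1, 1, 11, 11]`, `C = [20, 20, 30, 30]` with scores `[0.9, 0.8, 0.7]` keeps A and C, matching the example.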

---

## 5. Intersection over Union (IoU)

### File Location

```
/deepdoc/vision/operators.py (lines 702-725)
/deepdoc/vision/recognizer.py (lines 339-357)
```

### Purpose

Measure overlap between bounding boxes.

### Implementation

```python
def compute_iou(box1, box2):
    """
    Compute Intersection over Union.

    Args:
        box1, box2: [x1, y1, x2, y2] format

    Returns:
        IoU value in [0, 1]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    # IoU
    if union == 0:
        return 0

    return intersection / union
```

### IoU Formula

```
IoU (Intersection over Union):

IoU = Area(A ∩ B) / Area(A ∪ B)
    = Area(A ∩ B) / (Area(A) + Area(B) - Area(A ∩ B))

Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap

Threshold Usage:
- Detection: IoU > 0.5 → Same object
- NMS: IoU > 0.5 → Suppress duplicate
```
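
A quick numeric check of the formula, restating `compute_iou` so the snippet runs standalone: two 10×10 boxes offset by 5 overlap in a 5×5 square, so IoU = 25 / (100 + 100 − 25) = 25/175 ≈ 0.143.

```python
def compute_iou(box1, box2):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # Clamp to zero so disjoint boxes get zero intersection
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union else 0.0


# intersection = 5 × 5 = 25, union = 100 + 100 - 25 = 175 → IoU ≈ 0.143
value = compute_iou([0, 0, 10, 10], [5, 5, 15, 15])
```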

---

## 6. Image Preprocessing

### File Location

```
/deepdoc/vision/operators.py
```

### Purpose

Prepare images for neural network input.

### Implementation

```python
import cv2
import numpy as np


class StandardizeImage:
    """Normalize image to [0, 1] range."""

    def __call__(self, image):
        return image.astype(np.float32) / 255.0


class NormalizeImage:
    """Apply mean/std normalization."""

    def __init__(self, mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]):
        self.mean = np.array(mean)
        self.std = np.array(std)

    def __call__(self, image):
        return (image - self.mean) / self.std


class ToCHWImage:
    """Convert HWC to CHW format."""

    def __call__(self, image):
        return image.transpose((2, 0, 1))


class LinearResize:
    """Resize image maintaining aspect ratio."""

    def __init__(self, target_size):
        self.target = target_size

    def __call__(self, image):
        h, w = image.shape[:2]
        scale = self.target / max(h, w)
        new_h, new_w = int(h * scale), int(w * scale)
        return cv2.resize(image, (new_w, new_h),
                          interpolation=cv2.INTER_CUBIC)
```

### Preprocessing Pipeline

```
Image Preprocessing Pipeline:

1. Resize (maintain aspect ratio)
   - Target: 640 or 1280 depending on model

2. Standardize (0-255 → 0-1)
   - image = image / 255.0

3. Normalize (ImageNet stats)
   - image = (image - mean) / std
   - mean = [0.485, 0.456, 0.406]
   - std = [0.229, 0.224, 0.225]

4. Transpose (HWC → CHW)
   - PyTorch format: (C, H, W)

5. Pad (to square)
   - Pad with zeros to square shape
```
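
Steps 2-5 can be chained in a few lines. A numpy-only sketch (the resize step is omitted here to avoid the cv2 dependency; the 640 target and ImageNet stats follow the pipeline above):

```python
import numpy as np


def preprocess(image, target=640,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Steps 2-5 on an HWC uint8 image (assumed already resized so that
    max(h, w) <= target): standardize, normalize, transpose, zero-pad."""
    img = image.astype(np.float32) / 255.0        # 0-255 → 0-1
    img = (img - np.array(mean)) / np.array(std)  # ImageNet normalization
    img = img.transpose((2, 0, 1))                # HWC → CHW
    c, h, w = img.shape
    padded = np.zeros((c, target, target), dtype=np.float32)
    padded[:, :h, :w] = img                       # zero-pad bottom/right
    return padded
```

A 480×640 input therefore comes out as a (3, 640, 640) tensor with zeros below row 480.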

---

## 7. XGBoost Text Concatenation

### File Location

```
/deepdoc/parser/pdf_parser.py (lines 88-101, 131-170)
```

### Purpose

Predict whether adjacent text boxes should be merged.

### Implementation

```python
import numpy as np
import xgboost as xgb


class PDFParser:
    def __init__(self):
        # Load pre-trained XGBoost model
        self.concat_model = xgb.Booster()
        self.concat_model.load_model("updown_concat_xgb.model")

    def should_concat(self, box1, box2):
        """
        Predict if two text boxes should be concatenated.
        """
        # Extract features
        features = self._extract_concat_features(box1, box2)

        # Create DMatrix (expects a 2D array: one row per sample)
        dmatrix = xgb.DMatrix(np.array([features], dtype=float))

        # Predict probability
        prob = self.concat_model.predict(dmatrix)[0]

        return prob > 0.5

    def _extract_concat_features(self, box1, box2):
        """
        Extract 20+ features for concatenation decision.
        """
        features = []

        # Distance features
        y_dist = box2["top"] - box1["bottom"]
        char_height = box1["bottom"] - box1["top"]
        features.append(y_dist / max(char_height, 1))

        # Alignment features
        x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
        features.append(x_overlap / max(box1["x1"] - box1["x0"], 1))

        # Text pattern features
        text1, text2 = box1["text"], box2["text"]
        features.append(1 if text1.endswith((".", "。", "!", "?")) else 0)
        features.append(1 if text2 and text2[0].isupper() else 0)

        # Layout features
        features.append(1 if box1.get("layout_num") == box2.get("layout_num") else 0)

        # ... more features

        return features
```

### Feature List

```
XGBoost Concatenation Features:

1. Spatial Features:
   - Y-distance / char_height
   - X-alignment overlap ratio
   - Same page flag

2. Text Pattern Features:
   - Ends with sentence punctuation
   - Ends with continuation punctuation
   - Next starts with uppercase
   - Next starts with number
   - Chinese numbering pattern

3. Layout Features:
   - Same layout_type
   - Same layout_num
   - Same column

4. Tokenization Features:
   - Token count ratio
   - Last/first token match

Total: 20+ features
```
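
For concreteness, here is a toy computation of four of the features above for two hypothetical adjacent boxes (an illustrative subset, not RAGFlow's actual 20+ feature extractor):

```python
def concat_features(box1, box2):
    """Toy subset of the concatenation features: vertical gap ratio,
    horizontal overlap ratio, and two text-pattern flags."""
    y_dist = box2["top"] - box1["bottom"]
    char_height = box1["bottom"] - box1["top"]

    x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
    width1 = box1["x1"] - box1["x0"]

    return [
        y_dist / max(char_height, 1),             # gap relative to line height
        x_overlap / max(width1, 1),               # horizontal alignment
        1 if box1["text"].endswith((".", "。", "!", "?")) else 0,
        1 if box2["text"][:1].isupper() else 0,   # next line starts a sentence?
    ]
```

A small gap (0.25 line heights), strong alignment, no terminal punctuation, and a lowercase continuation all point toward merging the two boxes.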

---

## Summary

| Algorithm | Purpose | Model Type |
|-----------|---------|------------|
| OCR | Text detection + recognition | ONNX (DB + CRNN) |
| Layout Recognition | Element detection | ONNX (YOLOv10) |
| TSR | Table structure | ONNX |
| NMS | Box filtering | Classical |
| IoU | Overlap measure | Classical |
| XGBoost | Text concatenation | Gradient Boosting |

## Related Files

- `/deepdoc/vision/ocr.py` - OCR models
- `/deepdoc/vision/layout_recognizer.py` - Layout detection
- `/deepdoc/vision/table_structure_recognizer.py` - TSR
- `/deepdoc/vision/operators.py` - Image processing
- `/deepdoc/parser/pdf_parser.py` - XGBoost integration