docs: Add comprehensive algorithm documentation (50+ algorithms)

- Updated README.md with complete algorithm map across 12 categories
- Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec)
- Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution)
- Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym)
- Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost)
- Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
This commit is contained in:
Claude 2025-11-27 03:34:49 +00:00
parent a6ee18476d
commit 566bce428b
No known key found for this signature in database
6 changed files with 2654 additions and 94 deletions

View file

@ -2,36 +2,65 @@
## Tong Quan
Module này chứa các thuật toán core của RAGFlow bao gồm scoring, similarity, chunking, và advanced RAG techniques.
Module này chứa TẤT CẢ các thuật toán được sử dụng trong RAGFlow, bao gồm 50+ algorithms chia thành 12 categories.
## Kien Truc Algorithms
## Algorithm Categories
```
┌─────────────────────────────────────────────────────────────────┐
CORE ALGORITHMS
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────
RAGFLOW ALGORITHM MAP
└─────────────────────────────────────────────────────────────────────────────
┌─────────────────────────────────────────────────────────────────┐
RETRIEVAL ALGORITHMS
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ BM25 Scoring │ │ Vector Cosine │ │ Hybrid Fusion │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────
1. CLUSTERING │ 2. DIMENSIONALITY REDUCTION
├── K-Means │ ├── UMAP
├── Gaussian Mixture Model (GMM) │ └── Node2Vec Embedding
│ └── Silhouette Score │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ ADVANCED RAG │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ RAPTOR │ │ GraphRAG │ │ Cross-Encoder │ │
│ │ (Hierarchical) │ │ (Knowledge G) │ │ (Reranking) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. GRAPH ALGORITHMS │ 4. NLP/TEXT PROCESSING │
│ ├── PageRank │ ├── Trie-based Tokenization │
│ ├── Leiden Community Detection │ ├── Max-Forward/Backward Algorithm │
│ ├── Entity Extraction (LLM) │ ├── DFS with Memoization │
│ ├── Relation Extraction (LLM) │ ├── TF-IDF Term Weighting │
│ ├── Entity Resolution │ ├── Named Entity Recognition (NER) │
│ └── Largest Connected Component │ ├── Part-of-Speech Tagging (POS) │
│ │ ├── Synonym Detection (WordNet) │
│ │ └── Query Expansion │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TEXT PROCESSING │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ TF-IDF Weight │ │ Tokenization │ │ Query Expand │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 5. SIMILARITY/DISTANCE │ 6. INFORMATION RETRIEVAL │
│ ├── Cosine Similarity │ ├── BM25 Scoring │
│ ├── Edit Distance (Levenshtein) │ ├── Hybrid Score Fusion │
│ ├── IoU (Intersection over Union)│ ├── Cross-Encoder Reranking │
│ ├── Token Similarity │ └── Weighted Sum Fusion │
│ └── Hybrid Similarity │ │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 7. CHUNKING/MERGING │ 8. MACHINE LEARNING MODELS │
│ ├── Naive Merge (Token-based) │ ├── XGBoost (Text Concatenation) │
│ ├── Hierarchical Merge │ ├── ONNX Models (Vision) │
│ ├── Tree-based Merge │ └── Reranking Models │
│ └── Binary Search Merge │ │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 9. VISION/IMAGE PROCESSING │ 10. ADVANCED RAG │
│ ├── OCR (ONNX) │ ├── RAPTOR (Hierarchical Summarization)│
│ ├── Layout Recognition (YOLOv10) │ ├── GraphRAG │
│ ├── Table Structure Recognition │ └── Community Reports │
│ └── Non-Maximum Suppression (NMS)│ │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 11. OPTIMIZATION │ 12. DATA STRUCTURES │
│ ├── BIC (Bayesian Info Criterion)│ ├── Trie Tree │
│ └── Silhouette Score │ ├── Hierarchical Tree │
│ │ └── NetworkX Graph │
└───────────────────────────────────┴─────────────────────────────────────────┘
```
## Files Trong Module Nay
@ -39,90 +68,122 @@ Module này chứa các thuật toán core của RAGFlow bao gồm scoring, simi
| File | Mo Ta |
|------|-------|
| [bm25_scoring.md](./bm25_scoring.md) | BM25 ranking algorithm |
| [vector_similarity.md](./vector_similarity.md) | Cosine similarity calculations |
| [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination strategies |
| [tfidf_weighting.md](./tfidf_weighting.md) | TF-IDF term weighting |
| [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination |
| [raptor_algorithm.md](./raptor_algorithm.md) | Hierarchical summarization |
| [graphrag_implementation.md](./graphrag_implementation.md) | Knowledge graph RAG |
| [clustering_algorithms.md](./clustering_algorithms.md) | KMeans, GMM, UMAP |
| [graph_algorithms.md](./graph_algorithms.md) | PageRank, Leiden, Entity Resolution |
| [nlp_algorithms.md](./nlp_algorithms.md) | Tokenization, TF-IDF, NER, POS |
| [vision_algorithms.md](./vision_algorithms.md) | OCR, Layout, NMS |
| [similarity_metrics.md](./similarity_metrics.md) | Cosine, Edit Distance, IoU |
## Algorithm Formulas
## Complete Algorithm Reference
### BM25 Scoring
```
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
### 1. CLUSTERING ALGORITHMS
where:
f(qi, D) = term frequency of qi in document D
|D| = document length
avgdl = average document length
k1 = 1.2 (term frequency saturation)
b = 0.75 (length normalization)
```
| Algorithm | File | Description |
|-----------|------|-------------|
| K-Means | `/deepdoc/parser/pdf_parser.py:36` | Column detection in PDF layout |
| GMM | `/rag/raptor.py:22` | RAPTOR cluster selection |
| Silhouette Score | `/deepdoc/parser/pdf_parser.py:37` | Cluster validation |
### Cosine Similarity
```
cos(θ) = (A · B) / (||A|| × ||B||)
### 2. DIMENSIONALITY REDUCTION
where:
A, B = embedding vectors
A · B = dot product
||A|| = L2 norm
```
| Algorithm | File | Description |
|-----------|------|-------------|
| UMAP | `/rag/raptor.py:21` | Pre-clustering dimension reduction |
| Node2Vec | `/graphrag/general/entity_embedding.py:24` | Graph node embedding |
### Hybrid Score Fusion
```
Hybrid_Score = α × BM25_Score + (1-α) × Vector_Score
### 3. GRAPH ALGORITHMS
Default: α = 0.05 (5% BM25, 95% Vector)
```
| Algorithm | File | Description |
|-----------|------|-------------|
| PageRank | `/graphrag/entity_resolution.py:150` | Entity importance scoring |
| Leiden | `/graphrag/general/leiden.py:72` | Hierarchical community detection |
| Entity Extraction | `/graphrag/general/extractor.py` | LLM-based entity extraction |
| Relation Extraction | `/graphrag/general/extractor.py` | LLM-based relation extraction |
| Entity Resolution | `/graphrag/entity_resolution.py` | Entity deduplication |
| LCC | `/graphrag/general/leiden.py:67` | Largest connected component |
### TF-IDF Weighting
```
IDF(term) = log10(10 + (N - df(term) + 0.5) / (df(term) + 0.5))
Weight = (0.3 × IDF1 + 0.7 × IDF2) × NER × PoS
```
### 4. NLP/TEXT PROCESSING
### Cross-Encoder Reranking
```
Final_Rank = α × Token_Sim + β × Vector_Sim + γ × Rank_Features
| Algorithm | File | Description |
|-----------|------|-------------|
| Trie Tokenization | `/rag/nlp/rag_tokenizer.py:72` | Chinese word segmentation |
| Max-Forward | `/rag/nlp/rag_tokenizer.py:250` | Forward tokenization |
| Max-Backward | `/rag/nlp/rag_tokenizer.py:273` | Backward tokenization |
| DFS + Memo | `/rag/nlp/rag_tokenizer.py:120` | Disambiguation |
| TF-IDF | `/rag/nlp/term_weight.py:223` | Term weighting |
| NER | `/rag/nlp/term_weight.py:84` | Named entity recognition |
| POS Tagging | `/rag/nlp/term_weight.py:179` | Part-of-speech tagging |
| Synonym | `/rag/nlp/synonym.py:71` | Synonym lookup |
| Query Expansion | `/rag/nlp/query.py:85` | Query rewriting |
| Porter Stemmer | `/rag/nlp/rag_tokenizer.py:27` | English stemming |
| WordNet Lemmatizer | `/rag/nlp/rag_tokenizer.py:27` | Lemmatization |
where:
α = 0.3 (token weight)
β = 0.7 (vector weight)
γ = variable (PageRank, tag boost)
```
### 5. SIMILARITY/DISTANCE METRICS
## Algorithm Parameters
| Algorithm | File | Formula |
|-----------|------|---------|
| Cosine Similarity | `/rag/nlp/query.py:221` | `cos(θ) = A·B / (‖A‖×‖B‖)` |
| Edit Distance | `/graphrag/entity_resolution.py:28` | Levenshtein distance |
| IoU | `/deepdoc/vision/operators.py:702` | `intersection / union` |
| Token Similarity | `/rag/nlp/query.py:230` | Weighted token overlap |
| Hybrid Similarity | `/rag/nlp/query.py:220` | `α×token + β×vector` |
| Algorithm | Parameter | Default | Range |
|-----------|-----------|---------|-------|
| **BM25** | k1 | 1.2 | 0-2.0 |
| | b | 0.75 | 0-1.0 |
| **Hybrid** | vector_weight | 0.95 | 0-1.0 |
| | text_weight | 0.05 | 0-1.0 |
| **TF-IDF** | IDF1 weight | 0.3 | - |
| | IDF2 weight | 0.7 | - |
| **Chunking** | chunk_size | 512 | 128-2048 |
| | overlap | 0-10% | 0-100% |
| **RAPTOR** | max_clusters | 10-50 | - |
| | GMM threshold | 0.1 | - |
| **GraphRAG** | entity_topN | 6 | 1-100 |
| | similarity_threshold | 0.3 | 0-1.0 |
### 6. INFORMATION RETRIEVAL
## Key Implementation Files
| Algorithm | File | Formula |
|-----------|------|---------|
| BM25 | `/rag/nlp/search.py` | ES native BM25 |
| Hybrid Fusion | `/rag/nlp/search.py:126` | `0.05×BM25 + 0.95×Vector` |
| Reranking | `/rag/nlp/search.py:330` | Cross-encoder scoring |
| Argsort Ranking | `/rag/nlp/search.py:429` | Score-based sorting |
- `/rag/nlp/search.py` - Search algorithms
- `/rag/nlp/term_weight.py` - TF-IDF implementation
- `/rag/nlp/query.py` - Query processing
- `/rag/raptor.py` - RAPTOR algorithm
- `/graphrag/search.py` - GraphRAG search
- `/rag/nlp/__init__.py` - Chunking algorithms
### 7. CHUNKING/MERGING
## Performance Metrics
| Algorithm | File | Description |
|-----------|------|-------------|
| Naive Merge | `/rag/nlp/__init__.py:582` | Token-based chunking |
| Naive Merge + Images | `/rag/nlp/__init__.py:645` | With image tracking |
| Hierarchical Merge | `/rag/nlp/__init__.py:487` | Tree-based merging |
| Binary Search | `/rag/nlp/__init__.py:512` | Efficient section lookup |
| DFS Tree Traversal | `/rag/flow/hierarchical_merger/` | Document hierarchy |
| Metric | Typical Value |
|--------|---------------|
| Vector Search Latency | < 100ms |
| BM25 Search Latency | < 50ms |
| Reranking Latency | 200-500ms |
| Total Retrieval | < 1s |
### 8. MACHINE LEARNING MODELS
| Model | File | Purpose |
|-------|------|---------|
| XGBoost | `/deepdoc/parser/pdf_parser.py:88` | Text concatenation |
| ONNX OCR | `/deepdoc/vision/ocr.py:32` | Text recognition |
| ONNX Layout | `/deepdoc/vision/layout_recognizer.py` | Layout detection |
| ONNX TSR | `/deepdoc/vision/table_structure_recognizer.py` | Table structure |
| YOLOv10 | `/deepdoc/vision/layout_recognizer.py` | Object detection |
### 9. VISION/IMAGE PROCESSING
| Algorithm | File | Description |
|-----------|------|-------------|
| NMS | `/deepdoc/vision/operators.py:702` | Box filtering |
| IoU Filtering | `/deepdoc/vision/recognizer.py:359` | Overlap detection |
| Bounding Box Overlap | `/deepdoc/vision/layout_recognizer.py:94` | Spatial analysis |
### 10. ADVANCED RAG
| Algorithm | File | Description |
|-----------|------|-------------|
| RAPTOR | `/rag/raptor.py:37` | Hierarchical summarization |
| GraphRAG | `/graphrag/` | Knowledge graph RAG |
| Community Reports | `/graphrag/general/community_reports_extractor.py` | Graph summaries |
### 11. OPTIMIZATION CRITERIA
| Algorithm | File | Formula |
|-----------|------|---------|
| BIC | `/rag/raptor.py:92` | `k×log(n) - 2×log(L)` |
| Silhouette | `/deepdoc/parser/pdf_parser.py:400` | `(b-a) / max(a,b)` |
## Statistics
- **Total Algorithms**: 50+
- **Categories**: 12
- **Key Libraries**: sklearn, UMAP, XGBoost, NetworkX, graspologic, ONNX

View file

@ -0,0 +1,365 @@
# Clustering Algorithms
## Tong Quan
RAGFlow sử dụng clustering algorithms cho PDF layout analysis và RAPTOR hierarchical summarization.
## 1. K-Means Clustering
### File Location
```
/deepdoc/parser/pdf_parser.py (lines 36, 394, 425, 1047-1055)
```
### Purpose
Phát hiện cột (columns) trong PDF layout bằng cách clustering text boxes theo X-coordinate.
### Implementation
```python
from sklearn.cluster import KMeans
def _assign_column(self):
"""
Detect columns using KMeans clustering on X coordinates.
"""
# Get X coordinates of text boxes
x_coords = np.array([[b["x0"]] for b in self.bxs])
best_k = 1
best_score = -1
# Find optimal number of columns (1-5)
for k in range(1, min(5, len(self.bxs))):
if k >= len(self.bxs):
break
km = KMeans(n_clusters=k, random_state=42, n_init="auto")
labels = km.fit_predict(x_coords)
if k > 1:
score = silhouette_score(x_coords, labels)
if score > best_score:
best_score = score
best_k = k
# Assign columns with optimal k
km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
labels = km.fit_predict(x_coords)
for i, bx in enumerate(self.bxs):
bx["col_id"] = labels[i]
```
### Algorithm
```
K-Means Algorithm:
1. Initialize k centroids randomly
2. Repeat until convergence:
a. Assign each point to nearest centroid
b. Update centroids as mean of assigned points
3. Return cluster assignments
Objective: minimize Σ ||xi - μci||²
where μci is centroid of cluster containing xi
```
### Parameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| n_clusters | 1-5 | Number of columns to detect |
| n_init | "auto" | Initialization runs |
| random_state | 42 | Reproducibility |
---
## 2. Gaussian Mixture Model (GMM)
### File Location
```
/rag/raptor.py (lines 22, 102-106, 195-199)
```
### Purpose
RAPTOR algorithm sử dụng GMM để cluster document chunks trước khi summarization.
### Implementation
```python
from sklearn.mixture import GaussianMixture
def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int):
"""
Find optimal number of clusters using BIC criterion.
"""
max_clusters = min(self._max_cluster, len(embeddings))
n_clusters = np.arange(1, max_clusters)
bics = []
for n in n_clusters:
gm = GaussianMixture(
n_components=n,
random_state=random_state,
covariance_type='full'
)
gm.fit(embeddings)
bics.append(gm.bic(embeddings))
# Select cluster count with minimum BIC
optimal_clusters = n_clusters[np.argmin(bics)]
return optimal_clusters
def _cluster_chunks(self, chunks, embeddings):
"""
Cluster chunks using GMM with soft assignments.
"""
# Reduce dimensions first
reduced = self._reduce_dimensions(embeddings)
# Find optimal k
n_clusters = self._get_optimal_clusters(reduced, random_state=42)
# Fit GMM
gm = GaussianMixture(n_components=n_clusters, random_state=42)
gm.fit(reduced)
# Get soft assignments (probabilities)
probs = gm.predict_proba(reduced)
# Assign chunks to clusters with probability > threshold
clusters = [[] for _ in range(n_clusters)]
for i, prob in enumerate(probs):
for j, p in enumerate(prob):
if p > 0.1: # Threshold
clusters[j].append(i)
return clusters
```
### GMM Formula
```
GMM Probability Density:
p(x) = Σ πk × N(x | μk, Σk)
where:
- πk = mixture weight for component k
- N(x | μk, Σk) = Gaussian distribution with mean μk and covariance Σk
BIC (Bayesian Information Criterion):
BIC = k × log(n) - 2 × log(L̂)
where:
- k = number of parameters
- n = number of samples
- L̂ = maximum likelihood
```
### Soft Assignment
GMM cho phép soft assignment (một chunk có thể thuộc nhiều clusters):
```
Chunk i belongs to Cluster j if P(j|xi) > threshold (0.1)
```
---
## 3. UMAP (Dimensionality Reduction)
### File Location
```
/rag/raptor.py (lines 21, 186-190)
```
### Purpose
Giảm số chiều của embeddings trước khi clustering để improve cluster quality.
### Implementation
```python
import umap
def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray:
"""
Reduce embedding dimensions using UMAP.
"""
n_samples = len(embeddings)
# Calculate neighbors based on sample size
n_neighbors = int((n_samples - 1) ** 0.8)
# Target dimensions
n_components = min(12, n_samples - 2)
reducer = umap.UMAP(
n_neighbors=max(2, n_neighbors),
n_components=n_components,
metric="cosine",
random_state=42
)
return reducer.fit_transform(embeddings)
```
### UMAP Algorithm
```
UMAP (Uniform Manifold Approximation and Projection):
1. Build high-dimensional graph:
- Compute k-nearest neighbors
- Create weighted edges based on distance
2. Build low-dimensional representation:
- Initialize randomly
- Optimize layout using cross-entropy loss
- Preserve local structure (neighbors stay neighbors)
Key idea: Preserve topological structure, not absolute distances
```
### Parameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| n_neighbors | (n-1)^0.8 | Local neighborhood size |
| n_components | min(12, n-2) | Output dimensions |
| metric | cosine | Distance metric |
---
## 4. Silhouette Score
### File Location
```
/deepdoc/parser/pdf_parser.py (lines 37, 400, 1052)
```
### Purpose
Đánh giá cluster quality để chọn optimal k cho K-Means.
### Formula
```
Silhouette Score:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where:
- a(i) = average distance to points in same cluster
- b(i) = average distance to points in nearest other cluster
Range: [-1, 1]
- s ≈ 1: Point well-clustered
- s ≈ 0: Point on boundary
- s < 0: Point may be misclassified
Overall score = mean(s(i)) for all points
```
### Usage
```python
from sklearn.metrics import silhouette_score
# Find optimal k
best_k = 1
best_score = -1
for k in range(2, max_clusters):
km = KMeans(n_clusters=k)
labels = km.fit_predict(data)
score = silhouette_score(data, labels)
if score > best_score:
best_score = score
best_k = k
```
---
## 5. Node2Vec (Graph Embedding)
### File Location
```
/graphrag/general/entity_embedding.py (lines 24-44)
```
### Purpose
Generate embeddings cho graph nodes trong knowledge graph.
### Implementation
```python
from graspologic.embed import node2vec_embed
def embed_node2vec(graph, dimensions=1536, num_walks=10,
walk_length=40, window_size=2, iterations=3):
"""
Generate node embeddings using Node2Vec algorithm.
"""
lcc_tensors, embedding = node2vec_embed(
graph=graph,
dimensions=dimensions,
num_walks=num_walks,
walk_length=walk_length,
window_size=window_size,
iterations=iterations,
random_seed=86
)
return embedding
```
### Node2Vec Algorithm
```
Node2Vec Algorithm:
1. Random Walk Generation:
- For each node, perform biased random walks
- Walk strategy controlled by p (return) and q (in-out)
2. Skip-gram Training:
- Treat walks as sentences
- Train Word2Vec Skip-gram model
- Node → Embedding vector
Walk probabilities:
- p: Return parameter (go back to previous node)
- q: In-out parameter (explore vs exploit)
Low p, high q → BFS-like (local structure)
High p, low q → DFS-like (global structure)
```
### Parameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| dimensions | 1536 | Embedding size |
| num_walks | 10 | Walks per node |
| walk_length | 40 | Steps per walk |
| window_size | 2 | Skip-gram window |
| iterations | 3 | Training iterations |
---
## Summary
| Algorithm | Purpose | Library |
|-----------|---------|---------|
| K-Means | PDF column detection | sklearn |
| GMM | RAPTOR clustering | sklearn |
| UMAP | Dimension reduction | umap-learn |
| Silhouette | Cluster validation | sklearn |
| Node2Vec | Graph embedding | graspologic |
## Related Files
- `/deepdoc/parser/pdf_parser.py` - K-Means, Silhouette
- `/rag/raptor.py` - GMM, UMAP
- `/graphrag/general/entity_embedding.py` - Node2Vec

View file

@ -0,0 +1,471 @@
# Graph Algorithms
## Tong Quan
RAGFlow sử dụng graph algorithms cho knowledge graph construction và GraphRAG retrieval.
## 1. PageRank Algorithm
### File Location
```
/graphrag/entity_resolution.py (line 150)
/graphrag/general/index.py (line 460)
/graphrag/search.py (line 83)
```
### Purpose
Tính importance score cho entities trong knowledge graph.
### Implementation
```python
import networkx as nx
def compute_pagerank(graph):
"""
Compute PageRank scores for all nodes.
"""
pagerank = nx.pagerank(graph)
return pagerank
# Usage in search ranking
def rank_entities(entities, pagerank_scores):
"""
Rank entities by similarity * pagerank.
"""
ranked = sorted(
entities.items(),
key=lambda x: x[1]["sim"] * x[1]["pagerank"],
reverse=True
)
return ranked
```
### PageRank Formula
```
PageRank Algorithm:
PR(u) = (1-d)/N + d × Σ PR(v)/L(v)
for all v linking to u
where:
- d = damping factor (typically 0.85)
- N = total number of nodes
- L(v) = number of outbound links from v
Iterative computation until convergence:
PR^(t+1)(u) = (1-d)/N + d × Σ PR^(t)(v)/L(v)
```
### Usage in RAGFlow
```python
# In GraphRAG search
def get_relevant_entities(query, graph):
# 1. Get candidate entities by similarity
candidates = vector_search(query)
# 2. Compute PageRank
pagerank = nx.pagerank(graph)
# 3. Combine scores
for entity in candidates:
entity["final_score"] = entity["similarity"] * pagerank[entity["id"]]
return sorted(candidates, key=lambda x: x["final_score"], reverse=True)
```
---
## 2. Leiden Community Detection
### File Location
```
/graphrag/general/leiden.py (lines 72-141)
```
### Purpose
Phát hiện communities trong knowledge graph để tổ chức entities thành groups.
### Implementation
```python
from graspologic.partition import hierarchical_leiden
from graspologic.utils import largest_connected_component
def _compute_leiden_communities(graph, max_cluster_size=12, seed=0xDEADBEEF):
"""
Compute hierarchical communities using Leiden algorithm.
"""
# Extract largest connected component
lcc = largest_connected_component(graph)
# Run hierarchical Leiden
community_mapping = hierarchical_leiden(
lcc,
max_cluster_size=max_cluster_size,
random_seed=seed
)
# Process results by level
results = {}
for level, communities in community_mapping.items():
for community_id, nodes in communities.items():
# Calculate community weight
weight = sum(
graph.nodes[n].get("rank", 1) *
graph.nodes[n].get("weight", 1)
for n in nodes
)
results[(level, community_id)] = {
"nodes": nodes,
"weight": weight
}
return results
```
### Leiden Algorithm
```
Leiden Algorithm (improvement over Louvain):
1. Local Moving Phase:
- Move nodes between communities to improve modularity
- Refined node movement to avoid poorly connected communities
2. Refinement Phase:
- Partition communities into smaller subcommunities
- Ensures well-connected communities
3. Aggregation Phase:
- Create aggregate graph with communities as nodes
- Repeat from step 1 until no improvement
Modularity:
Q = (1/2m) × Σ [Aij - (ki×kj)/(2m)] × δ(ci, cj)
where:
- Aij = edge weight between i and j
- ki = degree of node i
- m = total edge weight
- δ(ci, cj) = 1 if same community, 0 otherwise
```
### Hierarchical Leiden
```
Hierarchical Leiden:
- Recursively applies Leiden to each community
- Creates multi-level community structure
- Controlled by max_cluster_size parameter
Level 0: Root community (all nodes)
Level 1: First-level subcommunities
Level 2: Second-level subcommunities
...
```
---
## 3. Entity Extraction (LLM-based)
### File Location
```
/graphrag/general/extractor.py
/graphrag/light/graph_extractor.py
```
### Purpose
Extract entities và relationships từ text sử dụng LLM.
### Implementation
```python
class GraphExtractor:
DEFAULT_ENTITY_TYPES = [
"organization", "person", "geo", "event", "category"
]
async def _process_single_content(self, content, entity_types):
"""
Extract entities from text using LLM with iterative gleaning.
"""
# Initial extraction
prompt = self._build_extraction_prompt(content, entity_types)
result = await self._llm_chat(prompt)
entities, relations = self._parse_result(result)
# Iterative gleaning (up to 2 times)
for _ in range(2): # ENTITY_EXTRACTION_MAX_GLEANINGS
glean_prompt = self._build_glean_prompt(result)
glean_result = await self._llm_chat(glean_prompt)
# Check if more entities found
if "NO" in glean_result.upper():
break
new_entities, new_relations = self._parse_result(glean_result)
entities.extend(new_entities)
relations.extend(new_relations)
return entities, relations
def _parse_result(self, result):
"""
Parse LLM output into structured entities/relations.
Format: (entity_type, entity_name, description)
Format: (source, target, relation_type, description)
"""
entities = []
relations = []
for line in result.split("\n"):
if line.startswith("(") and line.endswith(")"):
parts = line[1:-1].split(TUPLE_DELIMITER)
if len(parts) == 3: # Entity
entities.append({
"type": parts[0],
"name": parts[1],
"description": parts[2]
})
elif len(parts) == 4: # Relation
relations.append({
"source": parts[0],
"target": parts[1],
"type": parts[2],
"description": parts[3]
})
return entities, relations
```
### Extraction Pipeline
```
Entity Extraction Pipeline:
1. Initial Extraction
└── LLM extracts entities using structured prompt
2. Iterative Gleaning (max 2 iterations)
├── Ask LLM if more entities exist
├── If YES: extract additional entities
└── If NO: stop gleaning
3. Relation Extraction
└── Extract relationships between entities
4. Graph Construction
└── Build NetworkX graph from entities/relations
```
---
## 4. Entity Resolution
### File Location
```
/graphrag/entity_resolution.py
```
### Purpose
Merge duplicate entities trong knowledge graph.
### Implementation
```python
import editdistance
import networkx as nx
class EntityResolution:
def is_similarity(self, a: str, b: str) -> bool:
"""
Check if two entity names are similar.
"""
a, b = a.lower(), b.lower()
# Chinese: character set intersection
if self._is_chinese(a):
a_set, b_set = set(a), set(b)
max_len = max(len(a_set), len(b_set))
overlap = len(a_set & b_set)
return overlap / max_len >= 0.8
# English: Edit distance
else:
threshold = min(len(a), len(b)) // 2
distance = editdistance.eval(a, b)
return distance <= threshold
async def resolve(self, graph):
"""
Resolve duplicate entities in graph.
"""
# 1. Find candidate pairs
nodes = list(graph.nodes())
candidates = []
for i, a in enumerate(nodes):
for b in nodes[i+1:]:
if self.is_similarity(a, b):
candidates.append((a, b))
# 2. LLM verification (batch)
confirmed_pairs = []
for batch in self._batch(candidates, size=100):
results = await self._llm_verify_batch(batch)
confirmed_pairs.extend([
pair for pair, is_same in zip(batch, results)
if is_same
])
# 3. Merge confirmed pairs
for a, b in confirmed_pairs:
self._merge_nodes(graph, a, b)
# 4. Update PageRank
pagerank = nx.pagerank(graph)
for node in graph.nodes():
graph.nodes[node]["pagerank"] = pagerank[node]
return graph
```
### Similarity Metrics
```
English Similarity (Edit Distance):
distance(a, b) ≤ min(len(a), len(b)) // 2
Example:
- "microsoft" vs "microsft" → distance=1 ≤ 4 → Similar
- "google" vs "apple" → distance=5 > 2 → Not similar
Chinese Similarity (Character Set):
|a ∩ b| / max(|a|, |b|) ≥ 0.8
Example:
- "北京大学" vs "北京大" → 3/4 = 0.75 → Not similar
- "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
```
---
## 5. Largest Connected Component (LCC)
### File Location
```
/graphrag/general/leiden.py (line 67)
```
### Purpose
Extract largest connected subgraph trước khi community detection.
### Implementation
```python
from graspologic.utils import largest_connected_component
def _stabilize_graph(graph):
"""
Extract and stabilize the largest connected component.
"""
# Get LCC
lcc = largest_connected_component(graph)
# Sort nodes for reproducibility
sorted_nodes = sorted(lcc.nodes())
sorted_graph = lcc.subgraph(sorted_nodes).copy()
return sorted_graph
```
### LCC Algorithm
```
LCC Algorithm:
1. Find all connected components using BFS/DFS
2. Select component with maximum number of nodes
3. Return subgraph of that component
Complexity: O(V + E)
where V = vertices, E = edges
```
---
## 6. N-hop Path Scoring
### File Location
```
/graphrag/search.py (lines 181-187)
```
### Purpose
Score entities dựa trên path distance trong graph.
### Implementation
```python
def compute_nhop_scores(entity, neighbors, n_hops=2):
"""
Score entities based on graph distance.
"""
nhop_scores = {}
for neighbor in neighbors:
path = neighbor["path"]
weights = neighbor["weights"]
for i in range(len(path) - 1):
source, target = path[i], path[i + 1]
# Decay by distance
score = entity["sim"] / (2 + i)
if (source, target) in nhop_scores:
nhop_scores[(source, target)]["sim"] += score
else:
nhop_scores[(source, target)] = {"sim": score}
return nhop_scores
```
### Scoring Formula
```
N-hop Score with Decay:
score(e, path_i) = similarity(e) / (2 + distance_i)
where:
- distance_i = number of hops from source entity
- 2 = base constant to prevent division issues
Total score = Σ score(e, path_i) for all paths
```
---
## Summary
| Algorithm | Purpose | Library |
|-----------|---------|---------|
| PageRank | Entity importance | NetworkX |
| Leiden | Community detection | graspologic |
| Entity Extraction | KG construction | LLM |
| Entity Resolution | Deduplication | editdistance + LLM |
| LCC | Graph preprocessing | graspologic |
| N-hop Scoring | Path-based ranking | Custom |
## Related Files
- `/graphrag/entity_resolution.py` - Entity resolution
- `/graphrag/general/leiden.py` - Community detection
- `/graphrag/general/extractor.py` - Entity extraction
- `/graphrag/search.py` - Graph search

View file

@ -0,0 +1,571 @@
# NLP Algorithms
## Tong Quan
RAGFlow sử dụng multiple NLP algorithms cho tokenization, term weighting, và query processing.
## 1. Trie-based Tokenization
### File Location
```
/rag/nlp/rag_tokenizer.py (lines 72-90, 120-240)
```
### Purpose
Chinese word segmentation sử dụng Trie data structure.
### Implementation
```python
import datrie
class RagTokenizer:
def __init__(self):
# Load dictionary into Trie
self.trie = datrie.Trie(string.printable + "".join(
chr(i) for i in range(0x4E00, 0x9FFF) # CJK characters
))
# Load from huqie.txt dictionary
for line in open("rag/res/huqie.txt"):
word, freq, pos = line.strip().split("\t")
self.trie[word] = (int(freq), pos)
def _max_forward(self, text, start):
"""
Max-forward matching algorithm.
"""
end = len(text)
while end > start:
substr = text[start:end]
if substr in self.trie:
return substr, end
end -= 1
return text[start], start + 1
def _max_backward(self, text, end):
"""
Max-backward matching algorithm.
"""
start = 0
while start < end:
substr = text[start:end]
if substr in self.trie:
return substr, start
start += 1
return text[end-1], end - 1
```
### Trie Structure
```
Trie Data Structure:
root
/ \
机 学
/ \
器 习
/ \
学 人
Words: 机器, 机器学习, 机器人, 学习
Lookup: O(m) where m = word length
Insert: O(m)
Space: O(n × m) where n = number of words
```
### Max-Forward/Backward Algorithm
```
Max-Forward Matching:
Input: "机器学习是人工智能"
Step 1: Try "机器学习是人工智能" → Not found
Step 2: Try "机器学习是人工" → Not found
...
Step n: Try "机器学习" → Found!
Output: ["机器学习", ...]
Max-Backward Matching:
Input: "机器学习"
Step 1: Try "机器学习" from end → Found!
Output: ["机器学习"]
```
---
## 2. DFS with Memoization (Disambiguation)
### File Location
```
/rag/nlp/rag_tokenizer.py (lines 120-210)
```
### Purpose
Giải quyết ambiguity khi có nhiều cách tokenize.
### Implementation
```python
def dfs_(self, text, memo={}):
"""
DFS with memoization for tokenization disambiguation.
"""
if text in memo:
return memo[text]
if not text:
return [[]]
results = []
for end in range(1, len(text) + 1):
prefix = text[:end]
if prefix in self.trie or len(prefix) == 1:
suffix_results = self.dfs_(text[end:], memo)
for suffix in suffix_results:
results.append([prefix] + suffix)
# Score and select best tokenization
best = max(results, key=lambda x: self.score_(x))
memo[text] = [best]
return [best]
def score_(self, tokens):
"""
Score tokenization quality.
Formula: score = B/len(tokens) + L + F
where:
B = 30 (bonus for fewer tokens)
L = sum of token lengths / total length
F = sum of frequencies
"""
B = 30
L = sum(len(t) for t in tokens) / max(1, sum(len(t) for t in tokens))
F = sum(self.trie.get(t, (1, ''))[0] for t in tokens)
return B / len(tokens) + L + F
```
### Scoring Function
```
Tokenization Scoring:
score(tokens) = B/n + L + F
where:
- B = 30 (base bonus)
- n = number of tokens (fewer is better)
- L = coverage ratio
- F = sum of word frequencies (common words preferred)
Example:
"北京大学" →
Option 1: ["北京", "大学"] → score = 30/2 + 1.0 + (1000+500) = 1516
Option 2: ["北", "京", "大", "学"] → score = 30/4 + 1.0 + (10+10+10+10) = 48.5
Winner: Option 1
```
---
## 3. TF-IDF Term Weighting
### File Location
```
/rag/nlp/term_weight.py (lines 162-244)
```
### Purpose
Tính importance weight cho mỗi term trong query.
### Implementation
```python
import math
import numpy as np
class Dealer:
def weights(self, tokens, preprocess=True):
"""
Calculate TF-IDF based weights for tokens.
"""
def idf(s, N):
return math.log10(10 + ((N - s + 0.5) / (s + 0.5)))
# IDF based on term frequency in corpus
idf1 = np.array([idf(self.freq(t), 10000000) for t in tokens])
# IDF based on document frequency
idf2 = np.array([idf(self.df(t), 1000000000) for t in tokens])
# NER and POS weights
ner_weights = np.array([self.ner(t) for t in tokens])
pos_weights = np.array([self.postag(t) for t in tokens])
# Combined weight
weights = (0.3 * idf1 + 0.7 * idf2) * ner_weights * pos_weights
# Normalize
total = np.sum(weights)
return [(t, w / total) for t, w in zip(tokens, weights)]
```
### Formula
```
TF-IDF Variant:
IDF(term) = log₁₀(10 + (N - df + 0.5) / (df + 0.5))
where:
- N = total documents (10⁹ for df, 10⁷ for freq)
- df = document frequency of term
Combined Weight:
weight(term) = (0.3 × IDF_freq + 0.7 × IDF_df) × NER × POS
Normalization:
normalized_weight(term) = weight(term) / Σ weight(all_terms)
```
---
## 4. Named Entity Recognition (NER)
### File Location
```
/rag/nlp/term_weight.py (lines 84-86, 144-149)
```
### Purpose
Dictionary-based entity type detection với weight assignment.
### Implementation
```python
class Dealer:
def __init__(self):
# Load NER dictionary
self.ner_dict = json.load(open("rag/res/ner.json"))
def ner(self, token):
"""
Get NER weight for token.
"""
NER_WEIGHTS = {
"toxic": 2.0, # Toxic/sensitive words
"func": 1.0, # Functional words
"corp": 3.0, # Corporation names
"loca": 3.0, # Location names
"sch": 3.0, # School names
"stock": 3.0, # Stock symbols
"firstnm": 1.0, # First names
}
for entity_type, weight in NER_WEIGHTS.items():
if token in self.ner_dict.get(entity_type, set()):
return weight
return 1.0 # Default
```
### Entity Types
```
NER Categories:
┌──────────┬────────┬─────────────────────────────┐
│ Type │ Weight │ Examples │
├──────────┼────────┼─────────────────────────────┤
│ corp │ 3.0 │ Microsoft, Google, Apple │
│ loca │ 3.0 │ Beijing, New York │
│ sch │ 3.0 │ MIT, Stanford │
│ stock │ 3.0 │ AAPL, GOOG │
│ toxic │ 2.0 │ (sensitive words) │
│ func │ 1.0 │ the, is, are │
│ firstnm │ 1.0 │ John, Mary │
└──────────┴────────┴─────────────────────────────┘
```
---
## 5. Part-of-Speech (POS) Tagging
### File Location
```
/rag/nlp/term_weight.py (lines 179-189)
```
### Purpose
Assign weights based on grammatical category.
### Implementation
```python
def postag(self, token):
"""
Get POS weight for token.
"""
POS_WEIGHTS = {
"r": 0.3, # Pronoun
"c": 0.3, # Conjunction
"d": 0.3, # Adverb
"ns": 3.0, # Place noun
"nt": 3.0, # Organization noun
"n": 2.0, # Common noun
}
# Get POS tag from tokenizer
pos = self.tokenizer.tag(token)
# Check for numeric patterns
if re.match(r"^[\d.]+$", token):
return 2.0
return POS_WEIGHTS.get(pos, 1.0)
```
### POS Weight Table
```
POS Weight Assignments:
┌───────┬────────┬─────────────────────┐
│ Tag │ Weight │ Description │
├───────┼────────┼─────────────────────┤
│ n │ 2.0 │ Common noun │
│ ns │ 3.0 │ Place noun │
│ nt │ 3.0 │ Organization noun │
│ v │ 1.0 │ Verb │
│ a │ 1.0 │ Adjective │
│ r │ 0.3 │ Pronoun │
│ c │ 0.3 │ Conjunction │
│ d │ 0.3 │ Adverb │
│ num │ 2.0 │ Number │
└───────┴────────┴─────────────────────┘
```
---
## 6. Synonym Detection
### File Location
```
/rag/nlp/synonym.py (lines 71-93)
```
### Purpose
Query expansion qua synonym lookup.
### Implementation
```python
from nltk.corpus import wordnet
class SynonymLookup:
def __init__(self):
# Load custom dictionary
self.custom_dict = json.load(open("rag/res/synonym.json"))
def lookup(self, token, top_n=8):
"""
Find synonyms for token.
Strategy:
1. Check custom dictionary first
2. Fall back to WordNet for English
"""
# Custom dictionary
if token in self.custom_dict:
return self.custom_dict[token][:top_n]
# WordNet for English words
if re.match(r"^[a-z]+$", token.lower()):
synonyms = set()
for syn in wordnet.synsets(token):
for lemma in syn.lemmas():
name = lemma.name().replace("_", " ")
if name.lower() != token.lower():
synonyms.add(name)
return list(synonyms)[:top_n]
return []
```
### Synonym Sources
```
Synonym Lookup Strategy:
1. Custom Dictionary (highest priority)
- Domain-specific synonyms
- Chinese synonyms
- Technical terms
2. WordNet (English only)
- General synonyms
- Lemma extraction from synsets
Example:
"computer" → WordNet → ["machine", "calculator", "computing device"]
"机器学习" → Custom → ["ML", "machine learning", "深度学习"]
```
---
## 7. Query Expansion
### File Location
```
/rag/nlp/query.py (lines 85-218)
```
### Purpose
Build expanded query với weighted terms và synonyms.
### Implementation
```python
class FulltextQueryer:
QUERY_FIELDS = [
"title_tks^10", # Title: 10x boost
"title_sm_tks^5", # Title sub-tokens: 5x
"important_kwd^30", # Keywords: 30x
"important_tks^20", # Keyword tokens: 20x
"question_tks^20", # Question tokens: 20x
"content_ltks^2", # Content: 2x
"content_sm_ltks^1", # Content sub-tokens: 1x
]
def question(self, text, min_match=0.6):
"""
Build expanded query.
"""
# 1. Tokenize
tokens = self.tokenizer.tokenize(text)
# 2. Get weights
weighted_tokens = self.term_weight.weights(tokens)
# 3. Get synonyms
synonyms = [self.synonym.lookup(t) for t, _ in weighted_tokens]
# 4. Build query string
query_parts = []
for (token, weight), syns in zip(weighted_tokens, synonyms):
if syns:
# Token with synonyms
syn_str = " ".join(syns)
query_parts.append(f"({token}^{weight:.4f} OR ({syn_str})^0.2)")
else:
query_parts.append(f"{token}^{weight:.4f}")
# 5. Add phrase queries (bigrams)
for i in range(1, len(weighted_tokens)):
left, _ = weighted_tokens[i-1]
right, w = weighted_tokens[i]
query_parts.append(f'"{left} {right}"^{w*2:.4f}')
return MatchTextExpr(
query=" ".join(query_parts),
fields=self.QUERY_FIELDS,
min_match=f"{int(min_match * 100)}%"
)
```
### Query Expansion Example
```
Input: "machine learning tutorial"
After expansion:
(machine^0.35 OR (computer device)^0.2)
(learning^0.40 OR (study education)^0.2)
(tutorial^0.25 OR (guide lesson)^0.2)
"machine learning"^0.80
"learning tutorial"^0.50
With field boosting:
{
"query_string": {
"query": "(machine^0.35 learning^0.40 tutorial^0.25)",
"fields": ["title_tks^10", "important_kwd^30", "content_ltks^2"],
"minimum_should_match": "60%"
}
}
```
---
## 8. Fine-Grained Tokenization
### File Location
```
/rag/nlp/rag_tokenizer.py (lines 395-420)
```
### Purpose
Secondary tokenization cho compound words.
### Implementation
```python
def fine_grained_tokenize(self, text):
"""
Break compound words into sub-tokens.
"""
# First pass: standard tokenization
tokens = self.tokenize(text)
fine_tokens = []
for token in tokens:
# Skip short tokens
if len(token) < 3:
fine_tokens.append(token)
continue
# Try to break into sub-tokens
sub_tokens = self.dfs_(token)
if len(sub_tokens[0]) > 1:
fine_tokens.extend(sub_tokens[0])
else:
fine_tokens.append(token)
return fine_tokens
```
### Example
```
Standard: "机器学习" → ["机器学习"]
Fine-grained: "机器学习" → ["机器", "学习"]
Standard: "人工智能" → ["人工智能"]
Fine-grained: "人工智能" → ["人工", "智能"]
```
---
## Summary
| Algorithm | Purpose | File |
|-----------|---------|------|
| Trie Tokenization | Word segmentation | rag_tokenizer.py |
| Max-Forward/Backward | Matching strategy | rag_tokenizer.py |
| DFS + Memo | Disambiguation | rag_tokenizer.py |
| TF-IDF | Term weighting | term_weight.py |
| NER | Entity detection | term_weight.py |
| POS Tagging | Grammatical analysis | term_weight.py |
| Synonym | Query expansion | synonym.py |
| Query Expansion | Search enhancement | query.py |
| Fine-grained | Sub-tokenization | rag_tokenizer.py |
## Related Files
- `/rag/nlp/rag_tokenizer.py` - Tokenization
- `/rag/nlp/term_weight.py` - TF-IDF, NER, POS
- `/rag/nlp/synonym.py` - Synonym lookup
- `/rag/nlp/query.py` - Query processing

View file

@ -0,0 +1,455 @@
# Similarity & Distance Metrics
## Tong Quan
RAGFlow sử dụng multiple similarity metrics cho search, ranking, và entity resolution.
## 1. Cosine Similarity
### File Location
```
/rag/nlp/query.py (line 221)
/rag/raptor.py (line 189)
/rag/nlp/search.py (line 60)
```
### Purpose
Đo độ tương đồng giữa hai vectors (embeddings).
### Formula
```
Cosine Similarity:
cos(θ) = (A · B) / (||A|| × ||B||)
= Σ(Ai × Bi) / (√Σ(Ai²) × √Σ(Bi²))
Range: [-1, 1]
- cos = 1: Identical direction
- cos = 0: Orthogonal
- cos = -1: Opposite direction
For normalized vectors:
cos(θ) = A · B (dot product only)
```
### Implementation
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def compute_cosine_similarity(vec1, vec2):
"""
Compute cosine similarity between two vectors.
"""
# Using sklearn
sim = cosine_similarity([vec1], [vec2])[0][0]
return sim
def compute_batch_similarity(query_vec, doc_vecs):
"""
Compute similarity between query and multiple documents.
"""
# Returns array of similarities
sims = cosine_similarity([query_vec], doc_vecs)[0]
return sims
# Manual implementation
def cosine_sim_manual(a, b):
dot_product = np.dot(a, b)
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
return dot_product / (norm_a * norm_b)
```
### Usage in RAGFlow
```python
# Vector search scoring
def hybrid_similarity(self, query_vec, doc_vecs, tkweight=0.3, vtweight=0.7):
# Cosine similarity for vectors
vsim = cosine_similarity([query_vec], doc_vecs)[0]
# Token similarity
tksim = self.token_similarity(query_tokens, doc_tokens)
# Weighted combination
combined = vsim * vtweight + tksim * tkweight
return combined
```
---
## 2. Edit Distance (Levenshtein)
### File Location
```
/graphrag/entity_resolution.py (line 28, 246)
```
### Purpose
Measure string similarity cho entity resolution.
### Formula
```
Edit Distance (Levenshtein):
d(a, b) = minimum number of single-character edits
(insertions, deletions, substitutions)
Dynamic Programming:
d[i][j] = min(
d[i-1][j] + 1, # deletion
d[i][j-1] + 1, # insertion
d[i-1][j-1] + c # substitution (c=0 if same, 1 if different)
)
Base cases:
d[i][0] = i
d[0][j] = j
```
### Implementation
```python
import editdistance
def is_similar_by_edit_distance(a: str, b: str) -> bool:
"""
Check if two strings are similar using edit distance.
Threshold: distance ≤ min(len(a), len(b)) // 2
"""
a, b = a.lower(), b.lower()
threshold = min(len(a), len(b)) // 2
distance = editdistance.eval(a, b)
return distance <= threshold
# Examples:
# "microsoft" vs "microsft" → distance=1, threshold=4 → Similar
# "google" vs "apple" → distance=5, threshold=2 → Not similar
```
### Similarity Threshold
```
Edit Distance Threshold Strategy:
threshold = min(len(a), len(b)) // 2
Rationale:
- Allows ~50% character differences
- Handles typos and minor variations
- Stricter for short strings
Examples:
| String A | String B | Distance | Threshold | Similar? |
|-------------|-------------|----------|-----------|----------|
| microsoft | microsft | 1 | 4 | Yes |
| google | googl | 1 | 2 | Yes |
| amazon | apple | 5 | 2 | No |
| ibm | ibm | 0 | 1 | Yes |
```
---
## 3. Chinese Character Similarity
### File Location
```
/graphrag/entity_resolution.py (lines 250-255)
```
### Purpose
Similarity measure cho Chinese entity names.
### Formula
```
Chinese Character Similarity:
sim(a, b) = |set(a) ∩ set(b)| / max(|set(a)|, |set(b)|)
Threshold: sim ≥ 0.8
Example:
a = "北京大学" → set = {北, 京, 大, 学}
b = "北京大" → set = {北, 京, 大}
intersection = {北, 京, 大}
sim = 3 / max(4, 3) = 3/4 = 0.75 < 0.8 Not similar
```
### Implementation
```python
def is_similar_chinese(a: str, b: str) -> bool:
"""
Check if two Chinese strings are similar.
Uses character set intersection.
"""
a_set = set(a)
b_set = set(b)
max_len = max(len(a_set), len(b_set))
intersection = len(a_set & b_set)
similarity = intersection / max_len
return similarity >= 0.8
# Examples:
# "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
# "人工智能" vs "人工智慧" → 3/4 = 0.75 → Not similar
# "机器学习" vs "机器学习研究" → 4/6 = 0.67 → Not similar
```
---
## 4. Token Similarity (Weighted)
### File Location
```
/rag/nlp/query.py (lines 230-242)
```
### Purpose
Measure similarity based on token overlap với weights.
### Formula
```
Token Similarity:
sim(query, doc) = Σ weight(t) for t ∈ (query ∩ doc)
────────────────────────────────────
Σ weight(t) for t ∈ query
where weight(t) = TF-IDF weight of token t
Range: [0, 1]
- 0: No token overlap
- 1: All query tokens in document
```
### Implementation
```python
def token_similarity(self, query_tokens_weighted, doc_tokens):
"""
Compute weighted token similarity.
Args:
query_tokens_weighted: [(token, weight), ...]
doc_tokens: set of document tokens
Returns:
Similarity score in [0, 1]
"""
doc_set = set(doc_tokens)
matched_weight = 0
total_weight = 0
for token, weight in query_tokens_weighted:
total_weight += weight
if token in doc_set:
matched_weight += weight
if total_weight == 0:
return 0
return matched_weight / total_weight
# Example:
# query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
# doc = {"machine", "learning", "introduction"}
# matched = 0.4 + 0.35 = 0.75
# total = 1.0
# similarity = 0.75
```
---
## 5. Hybrid Similarity
### File Location
```
/rag/nlp/query.py (lines 220-228)
```
### Purpose
Combine token và vector similarity.
### Formula
```
Hybrid Similarity:
hybrid = α × token_sim + β × vector_sim
where:
- α = text weight (default: 0.3)
- β = vector weight (default: 0.7)
- α + β = 1.0
Alternative with rank features:
hybrid = (α × token_sim + β × vector_sim) × (1 + γ × pagerank)
```
### Implementation
```python
def hybrid_similarity(self, query_vec, doc_vecs,
query_tokens, doc_tokens_list,
tkweight=0.3, vtweight=0.7):
"""
Compute hybrid similarity combining token and vector similarity.
"""
from sklearn.metrics.pairwise import cosine_similarity
# Vector similarity (cosine)
vsim = cosine_similarity([query_vec], doc_vecs)[0]
# Token similarity
tksim = []
for doc_tokens in doc_tokens_list:
sim = self.token_similarity(query_tokens, doc_tokens)
tksim.append(sim)
tksim = np.array(tksim)
# Handle edge case
if np.sum(vsim) == 0:
return tksim, tksim, vsim
# Weighted combination
combined = vsim * vtweight + tksim * tkweight
return combined, tksim, vsim
```
### Weight Recommendations
```
Hybrid Weights by Use Case:
┌─────────────────────────┬────────┬────────┐
│ Use Case │ Token │ Vector │
├─────────────────────────┼────────┼────────┤
│ Conversational/Semantic │ 0.05 │ 0.95 │
│ Technical Documentation │ 0.30 │ 0.70 │
│ Legal/Exact Match │ 0.40 │ 0.60 │
│ Code Search │ 0.50 │ 0.50 │
│ Default │ 0.30 │ 0.70 │
└─────────────────────────┴────────┴────────┘
```
---
## 6. IoU (Intersection over Union)
### File Location
```
/deepdoc/vision/operators.py (lines 702-725)
```
### Purpose
Measure bounding box overlap.
### Formula
```
IoU = Area(A ∩ B) / Area(A B)
= Area(intersection) / (Area(A) + Area(B) - Area(intersection))
Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap
```
### Implementation
```python
def compute_iou(box1, box2):
"""
Compute IoU between two boxes [x1, y1, x2, y2].
"""
# Intersection
x1 = max(box1[0], box2[0])
y1 = max(box1[1], box2[1])
x2 = min(box1[2], box2[2])
y2 = min(box1[3], box2[3])
intersection = max(0, x2 - x1) * max(0, y2 - y1)
# Union
area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
union = area1 + area2 - intersection
return intersection / union if union > 0 else 0
```
---
## 7. N-gram Similarity
### File Location
```
/graphrag/entity_resolution.py (2-gram analysis)
```
### Purpose
Check digit differences trong entity names.
### Implementation
```python
def check_2gram_digit_difference(a: str, b: str) -> bool:
"""
Check if strings differ only in digit 2-grams.
"""
def get_2grams(s):
return [s[i:i+2] for i in range(len(s)-1)]
a_grams = get_2grams(a)
b_grams = get_2grams(b)
# Find different 2-grams
diff_grams = set(a_grams) ^ set(b_grams)
# Check if all differences are digit-only
for gram in diff_grams:
if not gram.isdigit():
return False
return True
# Example:
# "product2023" vs "product2024" → True (only digit diff)
# "productA" vs "productB" → False (letter diff)
```
---
## Summary Table
| Metric | Formula | Range | Use Case |
|--------|---------|-------|----------|
| Cosine | A·B / (‖A‖×‖B‖) | [-1, 1] | Vector search |
| Edit Distance | min edits | [0, ∞) | String matching |
| Chinese Char | \|A∩B\| / max(\|A\|,\|B\|) | [0, 1] | Chinese entities |
| Token | Σw(matched) / Σw(all) | [0, 1] | Keyword matching |
| Hybrid | α×token + β×vector | [0, 1] | Combined search |
| IoU | intersection / union | [0, 1] | Box overlap |
## Related Files
- `/rag/nlp/query.py` - Similarity calculations
- `/rag/nlp/search.py` - Search ranking
- `/graphrag/entity_resolution.py` - Entity matching
- `/deepdoc/vision/operators.py` - Box metrics

View file

@ -0,0 +1,637 @@
# Vision Algorithms
## Tong Quan
RAGFlow sử dụng computer vision algorithms cho document understanding, OCR, và layout analysis.
## 1. OCR (Optical Character Recognition)
### File Location
```
/deepdoc/vision/ocr.py (lines 30-120)
```
### Purpose
Text detection và recognition từ document images.
### Implementation
```python
import onnxruntime as ort
class OCR:
def __init__(self):
# Load ONNX models
self.det_model = ort.InferenceSession("ocr_det.onnx")
self.rec_model = ort.InferenceSession("ocr_rec.onnx")
def detect(self, image, device_id=0):
"""
Detect text regions in image.
Returns:
List of bounding boxes with confidence scores
"""
# Preprocess
img = self._preprocess_det(image)
# Run detection
outputs = self.det_model.run(None, {"input": img})
# Post-process to get boxes
boxes = self._postprocess_det(outputs[0])
return boxes
def recognize(self, image, boxes):
"""
Recognize text in detected regions.
Returns:
List of (text, confidence) tuples
"""
results = []
for box in boxes:
# Crop region
crop = self._crop_region(image, box)
# Preprocess
img = self._preprocess_rec(crop)
# Run recognition
outputs = self.rec_model.run(None, {"input": img})
# Decode to text
text, conf = self._decode_ctc(outputs[0])
results.append((text, conf))
return results
```
### OCR Pipeline
```
OCR Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ Input Image │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Detection Model (ONNX) │
│ - DB (Differentiable Binarization) based │
│ - Output: Text region polygons │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Post-processing │
│ - Polygon to bounding box │
│ - Filter by confidence │
│ - NMS for overlapping boxes │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Recognition Model (ONNX) │
│ - CRNN (CNN + RNN) based │
│ - CTC decoding │
│ - Output: Character sequence │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Output: [(text, confidence, box), ...] │
└─────────────────────────────────────────────────────────────────┘
```
### CTC Decoding
```
CTC (Connectionist Temporal Classification):
Input: Probability matrix P (T × C)
T = time steps, C = character classes
Algorithm:
1. For each time step, get most probable character
2. Merge consecutive duplicates
3. Remove blank tokens
Example:
Raw output: [a, a, -, b, b, b, -, c]
After merge: [a, -, b, -, c]
After blank removal: [a, b, c]
Final: "abc"
```
---
## 2. Layout Recognition (YOLOv10)
### File Location
```
/deepdoc/vision/layout_recognizer.py (lines 33-100)
```
### Purpose
Detect document layout elements (text, title, table, figure, etc.).
### Implementation
```python
class LayoutRecognizer:
LABELS = [
"text", "title", "figure", "figure caption",
"table", "table caption", "header", "footer",
"reference", "equation"
]
def __init__(self):
self.model = ort.InferenceSession("layout_yolov10.onnx")
def detect(self, image):
"""
Detect layout elements in document image.
"""
# Preprocess (resize, normalize)
img = self._preprocess(image)
# Run inference
outputs = self.model.run(None, {"images": img})
# Post-process
boxes, labels, scores = self._postprocess(outputs[0])
# Filter by confidence
results = []
for box, label, score in zip(boxes, labels, scores):
if score > 0.4: # Confidence threshold
results.append({
"box": box,
"type": self.LABELS[label],
"confidence": score
})
return results
```
### Layout Types
```
Document Layout Categories:
┌──────────────────┬────────────────────────────────────┐
│ Type │ Description │
├──────────────────┼────────────────────────────────────┤
│ text │ Body text paragraphs │
│ title │ Section/document titles │
│ figure │ Images, diagrams, charts │
│ figure caption │ Text describing figures │
│ table │ Data tables │
│ table caption │ Text describing tables │
│ header │ Page headers │
│ footer │ Page footers │
│ reference │ Bibliography, citations │
│ equation │ Mathematical equations │
└──────────────────┴────────────────────────────────────┘
```
### YOLO Detection
```
YOLOv10 Detection:
1. Backbone: Feature extraction (CSPDarknet)
2. Neck: Feature pyramid (PANet)
3. Head: Prediction heads for different scales
Output format:
[x_center, y_center, width, height, confidence, class_probs...]
Post-processing:
1. Apply sigmoid to confidence
2. Multiply conf × class_prob for class scores
3. Filter by score threshold
4. Apply NMS
```
---
## 3. Table Structure Recognition (TSR)
### File Location
```
/deepdoc/vision/table_structure_recognizer.py (lines 30-100)
```
### Purpose
Detect table structure (rows, columns, cells, headers).
### Implementation
```python
class TableStructureRecognizer:
LABELS = [
"table", "table column", "table row",
"table column header", "projected row header",
"spanning cell"
]
def __init__(self):
self.model = ort.InferenceSession("table_structure.onnx")
def recognize(self, table_image):
"""
Recognize structure of a table image.
"""
# Preprocess
img = self._preprocess(table_image)
# Run inference
outputs = self.model.run(None, {"input": img})
# Parse structure
structure = self._parse_structure(outputs)
return structure
def _parse_structure(self, outputs):
"""
Parse model output into table structure.
"""
rows = []
columns = []
cells = []
for detection in outputs:
label = self.LABELS[detection["class"]]
if label == "table row":
rows.append(detection["box"])
elif label == "table column":
columns.append(detection["box"])
elif label == "spanning cell":
cells.append({
"box": detection["box"],
"colspan": self._estimate_colspan(detection, columns),
"rowspan": self._estimate_rowspan(detection, rows)
})
return {
"rows": sorted(rows, key=lambda x: x[1]), # Sort by Y
"columns": sorted(columns, key=lambda x: x[0]), # Sort by X
"cells": cells
}
```
### TSR Output
```
Table Structure Output:
{
"rows": [
{"y": 10, "height": 30}, # Row 1
{"y": 40, "height": 30}, # Row 2
...
],
"columns": [
{"x": 0, "width": 100}, # Col 1
{"x": 100, "width": 150}, # Col 2
...
],
"cells": [
{"row": 0, "col": 0, "text": "Header 1"},
{"row": 0, "col": 1, "text": "Header 2"},
{"row": 1, "col": 0, "text": "Data 1", "colspan": 2},
...
]
}
```
---
## 4. Non-Maximum Suppression (NMS)
### File Location
```
/deepdoc/vision/operators.py (lines 702-725)
```
### Purpose
Filter overlapping bounding boxes trong object detection.
### Implementation
```python
def nms(boxes, scores, iou_threshold=0.5):
"""
Non-Maximum Suppression algorithm.
Args:
boxes: List of [x1, y1, x2, y2]
scores: Confidence scores
iou_threshold: IoU threshold for suppression
Returns:
Indices of kept boxes
"""
# Sort by score (descending)
indices = np.argsort(scores)[::-1]
keep = []
while len(indices) > 0:
# Keep highest scoring box
current = indices[0]
keep.append(current)
if len(indices) == 1:
break
# Compute IoU with remaining boxes
remaining = indices[1:]
ious = compute_iou(boxes[current], boxes[remaining])
# Keep boxes with IoU below threshold
indices = remaining[ious < iou_threshold]
return keep
```
### NMS Algorithm
```
NMS (Non-Maximum Suppression):
Input: Boxes B, Scores S, Threshold θ
Output: Filtered boxes
Algorithm:
1. Sort boxes by score (descending)
2. Select box with highest score → add to results
3. Remove boxes with IoU > θ with selected box
4. Repeat until no boxes remain
Example:
Boxes: [A(0.9), B(0.8), C(0.7)]
IoU(A,B) = 0.7 > 0.5 → Remove B
IoU(A,C) = 0.3 < 0.5 Keep C
Result: [A, C]
```
---
## 5. Intersection over Union (IoU)
### File Location
```
/deepdoc/vision/operators.py (lines 702-725)
/deepdoc/vision/recognizer.py (lines 339-357)
```
### Purpose
Measure overlap between bounding boxes.
### Implementation
```python
def compute_iou(box1, box2):
"""
Compute Intersection over Union.
Args:
box1, box2: [x1, y1, x2, y2] format
Returns:
IoU value in [0, 1]
"""
# Intersection coordinates
x1 = max(box1[0], box2[0])
y1 = max(box1[1], box2[1])
x2 = min(box1[2], box2[2])
y2 = min(box1[3], box2[3])
# Intersection area
intersection = max(0, x2 - x1) * max(0, y2 - y1)
# Union area
area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
union = area1 + area2 - intersection
# IoU
if union == 0:
return 0
return intersection / union
```
### IoU Formula
```
IoU (Intersection over Union):
IoU = Area(A ∩ B) / Area(A B)
= Area(A ∩ B) / (Area(A) + Area(B) - Area(A ∩ B))
Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap
Threshold Usage:
- Detection: IoU > 0.5 → Same object
- NMS: IoU > 0.5 → Suppress duplicate
```
---
## 6. Image Preprocessing
### File Location
```
/deepdoc/vision/operators.py
```
### Purpose
Prepare images for neural network input.
### Implementation
```python
class StandardizeImage:
"""Normalize image to [0, 1] range."""
def __call__(self, image):
return image.astype(np.float32) / 255.0
class NormalizeImage:
"""Apply mean/std normalization."""
def __init__(self, mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]):
self.mean = np.array(mean)
self.std = np.array(std)
def __call__(self, image):
return (image - self.mean) / self.std
class ToCHWImage:
"""Convert HWC to CHW format."""
def __call__(self, image):
return image.transpose((2, 0, 1))
class LinearResize:
"""Resize image maintaining aspect ratio."""
def __init__(self, target_size):
self.target = target_size
def __call__(self, image):
h, w = image.shape[:2]
scale = self.target / max(h, w)
new_h, new_w = int(h * scale), int(w * scale)
return cv2.resize(image, (new_w, new_h),
interpolation=cv2.INTER_CUBIC)
```
### Preprocessing Pipeline
```
Image Preprocessing Pipeline:
1. Resize (maintain aspect ratio)
- Target: 640 or 1280 depending on model
2. Standardize (0-255 → 0-1)
- image = image / 255.0
3. Normalize (ImageNet stats)
- image = (image - mean) / std
- mean = [0.485, 0.456, 0.406]
- std = [0.229, 0.224, 0.225]
4. Transpose (HWC → CHW)
- PyTorch format: (C, H, W)
5. Pad (to square)
- Pad with zeros to square shape
```
---
## 7. XGBoost Text Concatenation
### File Location
```
/deepdoc/parser/pdf_parser.py (lines 88-101, 131-170)
```
### Purpose
Predict whether adjacent text boxes should be merged.
### Implementation
```python
import xgboost as xgb
class PDFParser:
def __init__(self):
# Load pre-trained XGBoost model
self.concat_model = xgb.Booster()
self.concat_model.load_model("updown_concat_xgb.model")
def should_concat(self, box1, box2):
"""
Predict if two text boxes should be concatenated.
"""
# Extract features
features = self._extract_concat_features(box1, box2)
# Create DMatrix
dmatrix = xgb.DMatrix([features])
# Predict probability
prob = self.concat_model.predict(dmatrix)[0]
return prob > 0.5
def _extract_concat_features(self, box1, box2):
"""
Extract 20+ features for concatenation decision.
"""
features = []
# Distance features
y_dist = box2["top"] - box1["bottom"]
char_height = box1["bottom"] - box1["top"]
features.append(y_dist / max(char_height, 1))
# Alignment features
x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
features.append(x_overlap / max(box1["x1"] - box1["x0"], 1))
# Text pattern features
text1, text2 = box1["text"], box2["text"]
features.append(1 if text1.endswith((".", "。", "!", "?")) else 0)
features.append(1 if text2[0].isupper() else 0)
# Layout features
features.append(1 if box1.get("layout_num") == box2.get("layout_num") else 0)
# ... more features
return features
```
### Feature List
```
XGBoost Concatenation Features:
1. Spatial Features:
- Y-distance / char_height
- X-alignment overlap ratio
- Same page flag
2. Text Pattern Features:
- Ends with sentence punctuation
- Ends with continuation punctuation
- Next starts with uppercase
- Next starts with number
- Chinese numbering pattern
3. Layout Features:
- Same layout_type
- Same layout_num
- Same column
4. Tokenization Features:
- Token count ratio
- Last/first token match
Total: 20+ features
```
---
## Summary
| Algorithm | Purpose | Model Type |
|-----------|---------|------------|
| OCR | Text detection + recognition | ONNX (DB + CRNN) |
| Layout Recognition | Element detection | ONNX (YOLOv10) |
| TSR | Table structure | ONNX |
| NMS | Box filtering | Classical |
| IoU | Overlap measure | Classical |
| XGBoost | Text concatenation | Gradient Boosting |
## Related Files
- `/deepdoc/vision/ocr.py` - OCR models
- `/deepdoc/vision/layout_recognizer.py` - Layout detection
- `/deepdoc/vision/table_structure_recognizer.py` - TSR
- `/deepdoc/vision/operators.py` - Image processing
- `/deepdoc/parser/pdf_parser.py` - XGBoost integration