# Similarity & Distance Metrics

## Overview

RAGFlow uses multiple similarity metrics for search, ranking, and entity resolution.
## 1. Cosine Similarity

### File Location

- `/rag/nlp/query.py` (line 221)
- `/rag/raptor.py` (line 189)
- `/rag/nlp/search.py` (line 60)
### Purpose

Measures the similarity between two vectors (embeddings).
### Formula

```
cos(θ) = (A · B) / (||A|| × ||B||)
       = Σ(Aᵢ × Bᵢ) / (√Σ(Aᵢ²) × √Σ(Bᵢ²))
```

Range: [-1, 1]

- cos = 1: identical direction
- cos = 0: orthogonal
- cos = -1: opposite direction

For normalized vectors, this reduces to the dot product alone:

```
cos(θ) = A · B
```
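As a quick sanity check of the normalized-vector shortcut, the sketch below (an illustrative example, not RAGFlow code; the vectors are arbitrary) normalizes two vectors to unit length and confirms that their dot product matches the full cosine formula:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

# Full cosine formula on the raw vectors
cos_full = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize to unit length; then a plain dot product suffices
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
cos_dot = np.dot(a_n, b_n)

assert abs(cos_full - cos_dot) < 1e-12
```

This is why embedding stores typically normalize vectors at index time: ranking then needs only dot products.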
### Implementation

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def compute_cosine_similarity(vec1, vec2):
    """Compute cosine similarity between two vectors."""
    # sklearn expects 2D arrays: one row per sample
    sim = cosine_similarity([vec1], [vec2])[0][0]
    return sim


def compute_batch_similarity(query_vec, doc_vecs):
    """Compute similarity between a query and multiple documents."""
    # Returns an array with one similarity per document
    sims = cosine_similarity([query_vec], doc_vecs)[0]
    return sims


# Manual implementation
def cosine_sim_manual(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
```
### Usage in RAGFlow

```python
# Vector search scoring
def hybrid_similarity(self, query_vec, doc_vecs, query_tokens, doc_tokens,
                      tkweight=0.3, vtweight=0.7):
    # Cosine similarity for vectors
    vsim = cosine_similarity([query_vec], doc_vecs)[0]
    # Token similarity
    tksim = self.token_similarity(query_tokens, doc_tokens)
    # Weighted combination
    combined = vsim * vtweight + tksim * tkweight
    return combined
```
## 2. Edit Distance (Levenshtein)

### File Location

- `/graphrag/entity_resolution.py` (lines 28, 246)

### Purpose

Measures string similarity for entity resolution.
### Formula

Edit distance (Levenshtein): d(a, b) = the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform a into b.

Dynamic programming recurrence:

```
d[i][j] = min(
    d[i-1][j] + 1,      # deletion
    d[i][j-1] + 1,      # insertion
    d[i-1][j-1] + c     # substitution (c=0 if same, 1 if different)
)

Base cases:
    d[i][0] = i
    d[0][j] = j
```
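The recurrence above can be implemented directly. This minimal pure-Python sketch (for illustration only; RAGFlow uses the `editdistance` package instead) fills the DP table one row at a time, keeping just the previous row in memory:

```python
def levenshtein(a: str, b: str) -> int:
    # prev holds DP row i-1; start from the base case d[0][j] = j
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # base case d[i][0] = i
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # → 3
```

Keeping only two rows reduces memory from O(len(a) × len(b)) to O(len(b)), which matters when resolving many entity pairs.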
### Implementation

```python
import editdistance


def is_similar_by_edit_distance(a: str, b: str) -> bool:
    """
    Check if two strings are similar using edit distance.

    Threshold: distance ≤ min(len(a), len(b)) // 2
    """
    a, b = a.lower(), b.lower()
    threshold = min(len(a), len(b)) // 2
    distance = editdistance.eval(a, b)
    return distance <= threshold


# Examples:
# "microsoft" vs "microsft" → distance=1, threshold=4 → Similar
# "google" vs "apple"       → distance=4, threshold=2 → Not similar
```
### Similarity Threshold

```
threshold = min(len(a), len(b)) // 2
```

Rationale:

- Allows roughly 50% of the characters to differ
- Handles typos and minor variations
- Stricter for short strings

Examples:

| String A  | String B | Distance | Threshold | Similar? |
|-----------|----------|----------|-----------|----------|
| microsoft | microsft | 1        | 4         | Yes      |
| google    | googl    | 1        | 2         | Yes      |
| amazon    | apple    | 5        | 2         | No       |
| ibm       | ibm      | 0        | 1         | Yes      |
## 3. Chinese Character Similarity

### File Location

- `/graphrag/entity_resolution.py` (lines 250-255)

### Purpose

Similarity measure for Chinese entity names.
### Formula

```
sim(a, b) = |set(a) ∩ set(b)| / max(|set(a)|, |set(b)|)

Threshold: sim ≥ 0.8
```

Example:

```
a = "北京大学" → set = {北, 京, 大, 学}
b = "北京大"   → set = {北, 京, 大}
intersection = {北, 京, 大}
sim = 3 / max(4, 3) = 3/4 = 0.75 < 0.8 → Not similar
```
### Implementation

```python
def is_similar_chinese(a: str, b: str) -> bool:
    """
    Check if two Chinese strings are similar.
    Uses character set intersection.
    """
    a_set = set(a)
    b_set = set(b)
    max_len = max(len(a_set), len(b_set))
    intersection = len(a_set & b_set)
    similarity = intersection / max_len
    return similarity >= 0.8


# Examples:
# "清华大学" vs "清华"         → 2/4 = 0.5  → Not similar
# "人工智能" vs "人工智慧"     → 3/4 = 0.75 → Not similar
# "机器学习" vs "机器学习研究" → 4/6 ≈ 0.67 → Not similar
```
## 4. Token Similarity (Weighted)

### File Location

- `/rag/nlp/query.py` (lines 230-242)

### Purpose

Measures similarity based on weighted token overlap.
### Formula

```
                  Σ weight(t) for t ∈ (query ∩ doc)
sim(query, doc) = ──────────────────────────────────
                  Σ weight(t) for t ∈ query

where weight(t) = TF-IDF weight of token t
```

Range: [0, 1]

- 0: no token overlap
- 1: all query tokens appear in the document
### Implementation

```python
def token_similarity(self, query_tokens_weighted, doc_tokens):
    """
    Compute weighted token similarity.

    Args:
        query_tokens_weighted: [(token, weight), ...]
        doc_tokens: set of document tokens

    Returns:
        Similarity score in [0, 1]
    """
    doc_set = set(doc_tokens)
    matched_weight = 0
    total_weight = 0
    for token, weight in query_tokens_weighted:
        total_weight += weight
        if token in doc_set:
            matched_weight += weight
    if total_weight == 0:
        return 0
    return matched_weight / total_weight


# Example:
# query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
# doc = {"machine", "learning", "introduction"}
# matched = 0.4 + 0.35 = 0.75
# total = 1.0
# similarity = 0.75
```
## 5. Hybrid Similarity

### File Location

- `/rag/nlp/query.py` (lines 220-228)

### Purpose

Combines token and vector similarity.
### Formula

```
hybrid = α × token_sim + β × vector_sim

where:
- α = text weight (default: 0.3)
- β = vector weight (default: 0.7)
- α + β = 1.0
```

Alternative with rank features:

```
hybrid = (α × token_sim + β × vector_sim) × (1 + γ × pagerank)
```
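The rank-feature variant can be sketched in a few lines. The names `pagerank_scores` and `gamma` below are illustrative, not RAGFlow's actual parameters; the point is that the rank feature multiplies the base score rather than being added to it:

```python
import numpy as np


def hybrid_with_rank(token_sim, vector_sim, pagerank_scores,
                     alpha=0.3, beta=0.7, gamma=0.1):
    """Weighted token/vector mix, boosted by a per-document rank feature."""
    base = alpha * np.asarray(token_sim) + beta * np.asarray(vector_sim)
    return base * (1 + gamma * np.asarray(pagerank_scores))


# Two documents with identical base similarity but different PageRank:
# base = 0.3*0.5 + 0.7*0.8 = 0.71; the second document gets a 10% boost
scores = hybrid_with_rank([0.5, 0.5], [0.8, 0.8], [0.0, 1.0])
```

Because the boost is multiplicative, a document with zero base similarity stays at zero no matter how high its rank feature is.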
### Implementation

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def hybrid_similarity(self, query_vec, doc_vecs,
                      query_tokens, doc_tokens_list,
                      tkweight=0.3, vtweight=0.7):
    """
    Compute hybrid similarity combining token and vector similarity.
    """
    # Vector similarity (cosine)
    vsim = cosine_similarity([query_vec], doc_vecs)[0]
    # Token similarity
    tksim = []
    for doc_tokens in doc_tokens_list:
        sim = self.token_similarity(query_tokens, doc_tokens)
        tksim.append(sim)
    tksim = np.array(tksim)
    # Edge case: fall back to token similarity when vectors are unavailable
    if np.sum(vsim) == 0:
        return tksim, tksim, vsim
    # Weighted combination
    combined = vsim * vtweight + tksim * tkweight
    return combined, tksim, vsim
```
### Weight Recommendations

| Use Case                | Token | Vector |
|-------------------------|-------|--------|
| Conversational/Semantic | 0.05  | 0.95   |
| Technical Documentation | 0.30  | 0.70   |
| Legal/Exact Match       | 0.40  | 0.60   |
| Code Search             | 0.50  | 0.50   |
| Default                 | 0.30  | 0.70   |
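If weights vary per use case, a small lookup table keeps the defaults in one place. This is a hypothetical helper for illustration, not RAGFlow's API; the use-case keys are made up:

```python
# (token_weight, vector_weight) per use case; "default" is the fallback
HYBRID_WEIGHTS = {
    "conversational": (0.05, 0.95),
    "technical_docs": (0.30, 0.70),
    "legal":          (0.40, 0.60),
    "code_search":    (0.50, 0.50),
    "default":        (0.30, 0.70),
}


def get_weights(use_case: str):
    return HYBRID_WEIGHTS.get(use_case, HYBRID_WEIGHTS["default"])


print(get_weights("code_search"))  # → (0.5, 0.5)
```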
## 6. IoU (Intersection over Union)

### File Location

- `/deepdoc/vision/operators.py` (lines 702-725)

### Purpose

Measures bounding box overlap.

### Formula

```
IoU = Area(A ∩ B) / Area(A ∪ B)
    = Area(intersection) / (Area(A) + Area(B) - Area(intersection))
```

Range: [0, 1]

- IoU = 0: no overlap
- IoU = 1: perfect overlap
### Implementation

```python
def compute_iou(box1, box2):
    """
    Compute IoU between two boxes [x1, y1, x2, y2].
    """
    # Intersection rectangle
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    # Union
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0
```
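A quick worked example, using hypothetical boxes and a compact restatement of the same formula so the snippet is self-contained:

```python
def iou(b1, b2):
    # Boxes are [x1, y1, x2, y2]
    ix = max(0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
             + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    return inter / union if union > 0 else 0


# Two 10×10 boxes offset by 5 in each direction:
# intersection = 5×5 = 25, union = 100 + 100 - 25 = 175, IoU = 25/175 ≈ 0.143
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))
```

Note that even for boxes sharing a quarter of their area in each direction, the IoU is well below typical NMS thresholds (often 0.5), which is why IoU thresholds feel "strict" in practice.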
## 7. N-gram Similarity

### File Location

- `/graphrag/entity_resolution.py` (2-gram analysis)

### Purpose

Checks whether entity names differ only in digits.
### Implementation

```python
def check_2gram_digit_difference(a: str, b: str) -> bool:
    """
    Check if strings differ only in digit 2-grams.
    """
    def get_2grams(s):
        return [s[i:i+2] for i in range(len(s) - 1)]

    a_grams = get_2grams(a)
    b_grams = get_2grams(b)
    # 2-grams present in one string but not the other
    diff_grams = set(a_grams) ^ set(b_grams)
    # Check if all differences are digit-only
    for gram in diff_grams:
        if not gram.isdigit():
            return False
    return True


# Examples:
# "product2023" vs "product2024" → True  (only digit diff)
# "productA" vs "productB"       → False (letter diff)
```
## Summary Table

| Metric | Formula | Range | Use Case |
|---|---|---|---|
| Cosine | A·B / (‖A‖×‖B‖) | [-1, 1] | Vector search |
| Edit Distance | min edits | [0, ∞) | String matching |
| Chinese Char | \|A∩B\| / max(\|A\|,\|B\|) | [0, 1] | Chinese entities |
| Token | Σw(matched) / Σw(all) | [0, 1] | Keyword matching |
| Hybrid | α×token + β×vector | [0, 1] | Combined search |
| IoU | intersection / union | [0, 1] | Box overlap |
## Related Files

- `/rag/nlp/query.py` - Similarity calculations
- `/rag/nlp/search.py` - Search ranking
- `/graphrag/entity_resolution.py` - Entity matching
- `/deepdoc/vision/operators.py` - Box metrics