# Similarity & Distance Metrics
## Overview
RAGFlow uses multiple similarity metrics for search, ranking, and entity resolution.
## 1. Cosine Similarity
### File Location
```
/rag/nlp/query.py (line 221)
/rag/raptor.py (line 189)
/rag/nlp/search.py (line 60)
```
### Purpose
Measures the similarity between two vectors (embeddings).
### Formula
```
Cosine Similarity:
cos(θ) = (A · B) / (||A|| × ||B||)
= Σ(Ai × Bi) / (√Σ(Ai²) × √Σ(Bi²))
Range: [-1, 1]
- cos = 1: Identical direction
- cos = 0: Orthogonal
- cos = -1: Opposite direction
For normalized vectors:
cos(θ) = A · B (dot product only)
```
### Implementation
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def compute_cosine_similarity(vec1, vec2):
    """Compute cosine similarity between two vectors."""
    # Using sklearn (expects 2D arrays)
    sim = cosine_similarity([vec1], [vec2])[0][0]
    return sim

def compute_batch_similarity(query_vec, doc_vecs):
    """Compute similarity between a query and multiple documents."""
    # Returns an array of similarities, one per document
    sims = cosine_similarity([query_vec], doc_vecs)[0]
    return sims

# Manual implementation
def cosine_sim_manual(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
```
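The normalized-vector shortcut noted in the formula block (for unit vectors, cosine reduces to a plain dot product) can be checked numerically. A minimal sketch with made-up vectors:

```python
import numpy as np

def cosine_sim_manual(a, b):
    # Full cosine formula: dot product over the product of norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# L2-normalize both vectors
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# For unit vectors the denominator is 1, so cosine is just the dot product
assert np.isclose(cosine_sim_manual(a, b), np.dot(a_n, b_n))
```

This is why embedding stores typically normalize vectors at index time: scoring becomes a single dot product.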
### Usage in RAGFlow
```python
# Vector search scoring
def hybrid_similarity(self, query_vec, doc_vecs, query_tokens, doc_tokens,
                      tkweight=0.3, vtweight=0.7):
    # Cosine similarity for vectors
    vsim = cosine_similarity([query_vec], doc_vecs)[0]
    # Token similarity
    tksim = self.token_similarity(query_tokens, doc_tokens)
    # Weighted combination
    combined = vsim * vtweight + tksim * tkweight
    return combined
```
---
## 2. Edit Distance (Levenshtein)
### File Location
```
/graphrag/entity_resolution.py (line 28, 246)
```
### Purpose
Measures string similarity for entity resolution.
### Formula
```
Edit Distance (Levenshtein):
d(a, b) = minimum number of single-character edits
(insertions, deletions, substitutions)
Dynamic Programming:
d[i][j] = min(
d[i-1][j] + 1, # deletion
d[i][j-1] + 1, # insertion
d[i-1][j-1] + c # substitution (c=0 if same, 1 if different)
)
Base cases:
d[i][0] = i
d[0][j] = j
```
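The recurrence above translates directly into a bottom-up table fill. This is a from-scratch sketch for illustration, not the `editdistance` library call RAGFlow actually uses:

```python
def levenshtein(a: str, b: str) -> int:
    # d[i][j] = edit distance between a[:i] and b[:j]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # base case: delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # base case: insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[m][n]

assert levenshtein("microsoft", "microsft") == 1  # one deletion
```

Optimized libraries keep only two rows of the table, reducing memory from O(mn) to O(n).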
### Implementation
```python
import editdistance

def is_similar_by_edit_distance(a: str, b: str) -> bool:
    """
    Check if two strings are similar using edit distance.
    Threshold: distance ≤ min(len(a), len(b)) // 2
    """
    a, b = a.lower(), b.lower()
    threshold = min(len(a), len(b)) // 2
    distance = editdistance.eval(a, b)
    return distance <= threshold

# Examples:
# "microsoft" vs "microsft" → distance=1, threshold=4 → Similar
# "google" vs "apple"       → distance=4, threshold=2 → Not similar
```
### Similarity Threshold
```
Edit Distance Threshold Strategy:
threshold = min(len(a), len(b)) // 2
Rationale:
- Allows ~50% character differences
- Handles typos and minor variations
- Stricter for short strings
Examples:
| String A | String B | Distance | Threshold | Similar? |
|-------------|-------------|----------|-----------|----------|
| microsoft | microsft | 1 | 4 | Yes |
| google | googl | 1 | 2 | Yes |
| amazon | apple | 5 | 2 | No |
| ibm | ibm | 0 | 1 | Yes |
```
---
## 3. Chinese Character Similarity
### File Location
```
/graphrag/entity_resolution.py (lines 250-255)
```
### Purpose
Similarity measure for Chinese entity names.
### Formula
```
Chinese Character Similarity:
sim(a, b) = |set(a) ∩ set(b)| / max(|set(a)|, |set(b)|)
Threshold: sim ≥ 0.8
Example:
a = "北京大学" → set = {北, 京, 大, 学}
b = "北京大" → set = {北, 京, 大}
intersection = {北, 京, 大}
sim = 3 / max(4, 3) = 3/4 = 0.75 < 0.8 → Not similar
```
### Implementation
```python
def is_similar_chinese(a: str, b: str) -> bool:
    """
    Check if two Chinese strings are similar.
    Uses character-set intersection.
    """
    a_set = set(a)
    b_set = set(b)
    max_len = max(len(a_set), len(b_set))
    intersection = len(a_set & b_set)
    similarity = intersection / max_len
    return similarity >= 0.8

# Examples:
# "清华大学" vs "清华"         → 2/4 = 0.50 → Not similar
# "人工智能" vs "人工智慧"     → 3/4 = 0.75 → Not similar
# "机器学习" vs "机器学习研究" → 4/6 = 0.67 → Not similar
```
---
## 4. Token Similarity (Weighted)
### File Location
```
/rag/nlp/query.py (lines 230-242)
```
### Purpose
Measures similarity based on weighted token overlap.
### Formula
```
Token Similarity:
sim(query, doc) = Σ weight(t) for t ∈ (query ∩ doc)
────────────────────────────────────
Σ weight(t) for t ∈ query
where weight(t) = TF-IDF weight of token t
Range: [0, 1]
- 0: No token overlap
- 1: All query tokens in document
```
### Implementation
```python
def token_similarity(self, query_tokens_weighted, doc_tokens):
    """
    Compute weighted token similarity.

    Args:
        query_tokens_weighted: [(token, weight), ...]
        doc_tokens: set of document tokens

    Returns:
        Similarity score in [0, 1]
    """
    doc_set = set(doc_tokens)
    matched_weight = 0
    total_weight = 0
    for token, weight in query_tokens_weighted:
        total_weight += weight
        if token in doc_set:
            matched_weight += weight
    if total_weight == 0:
        return 0
    return matched_weight / total_weight

# Example:
# query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
# doc   = {"machine", "learning", "introduction"}
# matched = 0.4 + 0.35 = 0.75
# total   = 1.0
# similarity = 0.75
```
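The worked example in the comments can be run directly with a standalone version of the method (dropping `self`); a minimal sketch:

```python
def token_similarity(query_tokens_weighted, doc_tokens):
    # Sum of weights for query tokens found in the document,
    # normalized by the total query weight
    doc_set = set(doc_tokens)
    total = sum(w for _, w in query_tokens_weighted)
    if total == 0:
        return 0.0
    matched = sum(w for t, w in query_tokens_weighted if t in doc_set)
    return matched / total

query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
doc = {"machine", "learning", "introduction"}
# "machine" and "learning" match → (0.4 + 0.35) / 1.0 = 0.75
score = token_similarity(query, doc)
```

Note the asymmetry: the score is normalized by the query's total weight, so extra document tokens ("introduction") cost nothing, while a missing high-weight query token is heavily penalized.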
---
## 5. Hybrid Similarity
### File Location
```
/rag/nlp/query.py (lines 220-228)
```
### Purpose
Combines token and vector similarity.
### Formula
```
Hybrid Similarity:
hybrid = α × token_sim + β × vector_sim
where:
- α = text weight (default: 0.3)
- β = vector weight (default: 0.7)
- α + β = 1.0
Alternative with rank features:
hybrid = (α × token_sim + β × vector_sim) × (1 + γ × pagerank)
```
### Implementation
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_similarity(self, query_vec, doc_vecs,
                      query_tokens, doc_tokens_list,
                      tkweight=0.3, vtweight=0.7):
    """
    Compute hybrid similarity combining token and vector similarity.
    """
    # Vector similarity (cosine)
    vsim = cosine_similarity([query_vec], doc_vecs)[0]

    # Token similarity, one score per document
    tksim = []
    for doc_tokens in doc_tokens_list:
        sim = self.token_similarity(query_tokens, doc_tokens)
        tksim.append(sim)
    tksim = np.array(tksim)

    # Edge case: no vector signal → fall back to token similarity
    if np.sum(vsim) == 0:
        return tksim, tksim, vsim

    # Weighted combination
    combined = vsim * vtweight + tksim * tkweight
    return combined, tksim, vsim
```
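The rank-feature variant from the formula block (a multiplicative `(1 + γ × pagerank)` boost) is not shown above. A minimal numpy sketch; the function name, `gamma` default, and the pagerank values are illustrative assumptions, not RAGFlow's actual code:

```python
import numpy as np

def hybrid_with_rank(tksim, vsim, pagerank,
                     tkweight=0.3, vtweight=0.7, gamma=0.5):
    # Base hybrid score, then a multiplicative boost from a
    # per-document rank feature (e.g. a normalized PageRank score)
    base = tkweight * np.asarray(tksim) + vtweight * np.asarray(vsim)
    return base * (1.0 + gamma * np.asarray(pagerank))

tksim = [0.8, 0.2]
vsim = [0.9, 0.6]
pagerank = [0.0, 1.0]  # hypothetical rank scores

scores = hybrid_with_rank(tksim, vsim, pagerank)
# doc 0: 0.3*0.8 + 0.7*0.9 = 0.87, no boost    → 0.87
# doc 1: 0.3*0.2 + 0.7*0.6 = 0.48, boosted ×1.5 → 0.72
```

A multiplicative boost preserves the ordering among documents with equal rank scores while letting well-linked documents overtake slightly better textual matches.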
### Weight Recommendations
```
Hybrid Weights by Use Case:
┌─────────────────────────┬────────┬────────┐
│ Use Case │ Token │ Vector │
├─────────────────────────┼────────┼────────┤
│ Conversational/Semantic │ 0.05 │ 0.95 │
│ Technical Documentation │ 0.30 │ 0.70 │
│ Legal/Exact Match │ 0.40 │ 0.60 │
│ Code Search │ 0.50 │ 0.50 │
│ Default │ 0.30 │ 0.70 │
└─────────────────────────┴────────┴────────┘
```
---
## 6. IoU (Intersection over Union)
### File Location
```
/deepdoc/vision/operators.py (lines 702-725)
```
### Purpose
Measure bounding box overlap.
### Formula
```
IoU = Area(A ∩ B) / Area(A ∪ B)
= Area(intersection) / (Area(A) + Area(B) - Area(intersection))
Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap
```
### Implementation
```python
def compute_iou(box1, box2):
    """
    Compute IoU between two boxes [x1, y1, x2, y2].
    """
    # Intersection rectangle
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union = sum of areas minus intersection
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0
```
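A quick worked check of the function above on hand-computable boxes (the snippet restates the function so it runs standalone):

```python
def compute_iou(box1, box2):
    # Boxes are [x1, y1, x2, y2]
    x1 = max(box1[0], box2[0]); y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2]); y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0

# Two 2×2 boxes overlapping in a 1×1 square:
# intersection = 1, union = 4 + 4 - 1 = 7 → IoU = 1/7 ≈ 0.143
assert abs(compute_iou([0, 0, 2, 2], [1, 1, 3, 3]) - 1 / 7) < 1e-9

# Disjoint boxes clamp the intersection to 0
assert compute_iou([0, 0, 1, 1], [2, 2, 3, 3]) == 0
```

Note the `max(0, …)` clamps: without them, disjoint boxes would produce a negative "intersection" and a spurious nonzero IoU.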
---
## 7. N-gram Similarity
### File Location
```
/graphrag/entity_resolution.py (2-gram analysis)
```
### Purpose
Checks for digit-only differences in entity names.
### Implementation
```python
def check_2gram_digit_difference(a: str, b: str) -> bool:
    """
    Check if two strings differ only in digit 2-grams.
    """
    def get_2grams(s):
        return [s[i:i + 2] for i in range(len(s) - 1)]

    a_grams = get_2grams(a)
    b_grams = get_2grams(b)

    # 2-grams present in one string but not the other
    diff_grams = set(a_grams) ^ set(b_grams)

    # All differing 2-grams must be purely digits
    for gram in diff_grams:
        if not gram.isdigit():
            return False
    return True

# Examples:
# "product2023" vs "product2024" → True  (only digit 2-grams differ)
# "productA"   vs "productB"     → False (letter 2-grams differ)
```
---
## Summary Table
| Metric | Formula | Range | Use Case |
|--------|---------|-------|----------|
| Cosine | A·B / (‖A‖×‖B‖) | [-1, 1] | Vector search |
| Edit Distance | min edits | [0, ∞) | String matching |
| Chinese Char | \|A∩B\| / max(\|A\|,\|B\|) | [0, 1] | Chinese entities |
| Token | Σw(matched) / Σw(all) | [0, 1] | Keyword matching |
| Hybrid | α×token + β×vector | [0, 1] | Combined search |
| IoU | intersection / union | [0, 1] | Box overlap |
| 2-gram Digit | digit-only 2-gram diff | {True, False} | Entity name versions |
## Related Files
- `/rag/nlp/query.py` - Similarity calculations
- `/rag/nlp/search.py` - Search ranking
- `/graphrag/entity_resolution.py` - Entity matching
- `/deepdoc/vision/operators.py` - Box metrics