Claude 566bce428b
docs: Add comprehensive algorithm documentation (50+ algorithms)
- Updated README.md with complete algorithm map across 12 categories
- Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec)
- Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution)
- Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym)
- Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost)
- Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
2025-11-27 03:34:49 +00:00


Similarity & Distance Metrics

Overview

RAGFlow uses multiple similarity metrics for search, ranking, and entity resolution.

1. Cosine Similarity

File Location

/rag/nlp/query.py (line 221)
/rag/raptor.py (line 189)
/rag/nlp/search.py (line 60)

Purpose

Measures the similarity between two vectors (embeddings).

Formula

Cosine Similarity:

cos(θ) = (A · B) / (||A|| × ||B||)

       = Σ(Ai × Bi) / (√Σ(Ai²) × √Σ(Bi²))

Range: [-1, 1]
- cos = 1: Identical direction
- cos = 0: Orthogonal
- cos = -1: Opposite direction

For normalized vectors:
cos(θ) = A · B  (dot product only)
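The normalized-vector shortcut can be checked directly; a quick standalone sketch (not taken from the RAGFlow codebase):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Full formula: dot product divided by the product of norms
full = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, then the dot product alone is the cosine
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
shortcut = np.dot(a_unit, b_unit)

assert np.isclose(full, shortcut)  # both equal 24/25 = 0.96
```

This is why vector stores often normalize embeddings at index time: scoring then reduces to a plain dot product.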

Implementation

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def compute_cosine_similarity(vec1, vec2):
    """
    Compute cosine similarity between two vectors.
    """
    # Using sklearn
    sim = cosine_similarity([vec1], [vec2])[0][0]
    return sim

def compute_batch_similarity(query_vec, doc_vecs):
    """
    Compute similarity between query and multiple documents.
    """
    # Returns array of similarities
    sims = cosine_similarity([query_vec], doc_vecs)[0]
    return sims

# Manual implementation
def cosine_sim_manual(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

Usage in RAGFlow

# Vector search scoring
def hybrid_similarity(self, query_vec, doc_vecs,
                      query_tokens, doc_tokens,
                      tkweight=0.3, vtweight=0.7):
    # Cosine similarity for vectors
    vsim = cosine_similarity([query_vec], doc_vecs)[0]

    # Token similarity
    tksim = self.token_similarity(query_tokens, doc_tokens)

    # Weighted combination
    combined = vsim * vtweight + tksim * tkweight

    return combined

2. Edit Distance (Levenshtein)

File Location

/graphrag/entity_resolution.py (line 28, 246)

Purpose

Measures string similarity for entity resolution.

Formula

Edit Distance (Levenshtein):

d(a, b) = minimum number of single-character edits
          (insertions, deletions, substitutions)

Dynamic Programming:
d[i][j] = min(
    d[i-1][j] + 1,      # deletion
    d[i][j-1] + 1,      # insertion
    d[i-1][j-1] + c     # substitution (c=0 if same, 1 if different)
)

Base cases:
d[i][0] = i
d[0][j] = j
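The recurrence above can be implemented directly. RAGFlow itself calls the editdistance package (shown in the implementation below); this pure-Python version just makes the DP explicit:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the DP recurrence, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))               # base case: d[0][j] = j
    for i, ca in enumerate(a, start=1):
        curr = [i]                               # base case: d[i][0] = i
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1          # substitution cost c
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]
```

For example, `levenshtein("microsoft", "microsft")` returns 1 (one deleted character), matching the examples below.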

Implementation

import editdistance

def is_similar_by_edit_distance(a: str, b: str) -> bool:
    """
    Check if two strings are similar using edit distance.

    Threshold: distance ≤ min(len(a), len(b)) // 2
    """
    a, b = a.lower(), b.lower()
    threshold = min(len(a), len(b)) // 2
    distance = editdistance.eval(a, b)
    return distance <= threshold

# Examples:
# "microsoft" vs "microsft" → distance=1, threshold=4 → Similar
# "google" vs "apple" → distance=4, threshold=2 → Not similar

Similarity Threshold

Edit Distance Threshold Strategy:

threshold = min(len(a), len(b)) // 2

Rationale:
- Allows ~50% character differences
- Handles typos and minor variations
- Stricter for short strings

Examples:
| String A    | String B    | Distance | Threshold | Similar? |
|-------------|-------------|----------|-----------|----------|
| microsoft   | microsft    | 1        | 4         | Yes      |
| google      | googl       | 1        | 2         | Yes      |
| amazon      | apple       | 5        | 2         | No       |
| ibm         | ibm         | 0        | 1         | Yes      |

3. Chinese Character Similarity

File Location

/graphrag/entity_resolution.py (lines 250-255)

Purpose

Similarity measure for Chinese entity names.

Formula

Chinese Character Similarity:

sim(a, b) = |set(a) ∩ set(b)| / max(|set(a)|, |set(b)|)

Threshold: sim ≥ 0.8

Example:
a = "北京大学" → set = {北, 京, 大, 学}
b = "北京大" → set = {北, 京, 大}
intersection = {北, 京, 大}
sim = 3 / max(4, 3) = 3/4 = 0.75 < 0.8 → Not similar

Implementation

def is_similar_chinese(a: str, b: str) -> bool:
    """
    Check if two Chinese strings are similar.
    Uses character set intersection.
    """
    a_set = set(a)
    b_set = set(b)

    max_len = max(len(a_set), len(b_set))
    intersection = len(a_set & b_set)

    similarity = intersection / max_len

    return similarity >= 0.8

# Examples:
# "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
# "人工智能" vs "人工智慧" → 3/4 = 0.75 → Not similar
# "机器学习" vs "机器学习研究" → 4/6 = 0.67 → Not similar

4. Token Similarity (Weighted)

File Location

/rag/nlp/query.py (lines 230-242)

Purpose

Measures similarity based on token overlap with per-token weights.

Formula

Token Similarity:

sim(query, doc) = Σ weight(t) for t ∈ (query ∩ doc)
                  ────────────────────────────────────
                  Σ weight(t) for t ∈ query

where weight(t) = TF-IDF weight of token t

Range: [0, 1]
- 0: No token overlap
- 1: All query tokens in document

Implementation

def token_similarity(self, query_tokens_weighted, doc_tokens):
    """
    Compute weighted token similarity.

    Args:
        query_tokens_weighted: [(token, weight), ...]
        doc_tokens: set of document tokens

    Returns:
        Similarity score in [0, 1]
    """
    doc_set = set(doc_tokens)

    matched_weight = 0
    total_weight = 0

    for token, weight in query_tokens_weighted:
        total_weight += weight
        if token in doc_set:
            matched_weight += weight

    if total_weight == 0:
        return 0

    return matched_weight / total_weight

# Example:
# query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
# doc = {"machine", "learning", "introduction"}
# matched = 0.4 + 0.35 = 0.75
# total = 1.0
# similarity = 0.75
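The worked example can be verified with a standalone version of the same logic (a free function for illustration; the method above is shown as it appears in context):

```python
def token_similarity(query_tokens_weighted, doc_tokens):
    """Weighted fraction of query tokens that appear in the document."""
    doc_set = set(doc_tokens)
    total = sum(w for _, w in query_tokens_weighted)
    matched = sum(w for t, w in query_tokens_weighted if t in doc_set)
    return matched / total if total else 0.0

query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
doc = {"machine", "learning", "introduction"}
sim = token_similarity(query, doc)  # 0.75
```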

5. Hybrid Similarity

File Location

/rag/nlp/query.py (lines 220-228)

Purpose

Combines token and vector similarity.

Formula

Hybrid Similarity:

hybrid = α × token_sim + β × vector_sim

where:
- α = text weight (default: 0.3)
- β = vector weight (default: 0.7)
- α + β = 1.0

Alternative with rank features:
hybrid = (α × token_sim + β × vector_sim) × (1 + γ × pagerank)
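The rank-feature variant is not part of the implementation below; a minimal sketch of how a PageRank boost could be folded in (γ and the pagerank values here are illustrative assumptions, not RAGFlow's actual parameters):

```python
import numpy as np

def hybrid_with_rank(token_sim, vector_sim, pagerank,
                     alpha=0.3, beta=0.7, gamma=0.1):
    """hybrid = (alpha*token_sim + beta*vector_sim) * (1 + gamma*pagerank)"""
    base = alpha * np.asarray(token_sim) + beta * np.asarray(vector_sim)
    return base * (1.0 + gamma * np.asarray(pagerank))

# Two documents: same base scores structure, only the first has pagerank
scores = hybrid_with_rank([0.5, 0.2], [0.8, 0.9], [1.0, 0.0])
# doc 1: (0.15 + 0.56) * 1.1 = 0.781;  doc 2: (0.06 + 0.63) * 1.0 = 0.69
```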

Implementation

def hybrid_similarity(self, query_vec, doc_vecs,
                      query_tokens, doc_tokens_list,
                      tkweight=0.3, vtweight=0.7):
    """
    Compute hybrid similarity combining token and vector similarity.
    """
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Vector similarity (cosine)
    vsim = cosine_similarity([query_vec], doc_vecs)[0]

    # Token similarity
    tksim = []
    for doc_tokens in doc_tokens_list:
        sim = self.token_similarity(query_tokens, doc_tokens)
        tksim.append(sim)

    tksim = np.array(tksim)

    # Handle edge case
    if np.sum(vsim) == 0:
        return tksim, tksim, vsim

    # Weighted combination
    combined = vsim * vtweight + tksim * tkweight

    return combined, tksim, vsim

Weight Recommendations

Hybrid Weights by Use Case:
┌─────────────────────────┬────────┬────────┐
│ Use Case                │ Token  │ Vector │
├─────────────────────────┼────────┼────────┤
│ Conversational/Semantic │ 0.05   │ 0.95   │
│ Technical Documentation │ 0.30   │ 0.70   │
│ Legal/Exact Match       │ 0.40   │ 0.60   │
│ Code Search             │ 0.50   │ 0.50   │
│ Default                 │ 0.30   │ 0.70   │
└─────────────────────────┴────────┴────────┘
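The effect of the presets is easiest to see on a document that matches keywords strongly but embeds poorly (the similarity values below are made up for illustration):

```python
token_sim, vector_sim = 0.9, 0.4   # strong keyword match, weak semantic match

presets = {
    "semantic": (0.05, 0.95),
    "default": (0.30, 0.70),
    "code": (0.50, 0.50),
}
# combined = token_weight * token_sim + vector_weight * vector_sim
scores = {name: tk * token_sim + vt * vector_sim
          for name, (tk, vt) in presets.items()}
# The keyword-heavy document scores higher as the token weight grows:
# semantic -> 0.425, default -> 0.55, code -> 0.65
```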

6. IoU (Intersection over Union)

File Location

/deepdoc/vision/operators.py (lines 702-725)

Purpose

Measure bounding box overlap.

Formula

IoU = Area(A ∩ B) / Area(A ∪ B)

    = Area(intersection) / (Area(A) + Area(B) - Area(intersection))

Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap

Implementation

def compute_iou(box1, box2):
    """
    Compute IoU between two boxes [x1, y1, x2, y2].
    """
    # Intersection
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0
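A quick check with two partially overlapping boxes (the function is repeated here so the snippet runs standalone; coordinates are illustrative):

```python
def compute_iou(box1, box2):
    """IoU between two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    x2, y2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0

iou = compute_iou([0, 0, 10, 10], [5, 5, 15, 15])
# intersection = 5*5 = 25, union = 100 + 100 - 25 = 175 -> IoU = 25/175 ≈ 0.143
```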

7. N-gram Similarity

File Location

/graphrag/entity_resolution.py (2-gram analysis)

Purpose

Check digit differences trong entity names.

Implementation

def check_2gram_digit_difference(a: str, b: str) -> bool:
    """
    Check if strings differ only in digit 2-grams.
    """
    def get_2grams(s):
        return [s[i:i+2] for i in range(len(s)-1)]

    a_grams = get_2grams(a)
    b_grams = get_2grams(b)

    # Find different 2-grams
    diff_grams = set(a_grams) ^ set(b_grams)

    # Check if all differences are digit-only
    for gram in diff_grams:
        if not gram.isdigit():
            return False

    return True

# Example:
# "product2023" vs "product2024" → True (only digit diff)
# "productA" vs "productB" → False (letter diff)

Summary Table

| Metric        | Formula                    | Range   | Use Case         |
|---------------|----------------------------|---------|------------------|
| Cosine        | A·B / (‖A‖×‖B‖)            | [-1, 1] | Vector search    |
| Edit Distance | min edits                  | [0, ∞)  | String matching  |
| Chinese Char  | \|A∩B\| / max(\|A\|,\|B\|) | [0, 1]  | Chinese entities |
| Token         | Σw(matched) / Σw(all)      | [0, 1]  | Keyword matching |
| Hybrid        | α×token + β×vector         | [0, 1]  | Combined search  |
| IoU           | intersection / union       | [0, 1]  | Box overlap      |
Related Files

  • /rag/nlp/query.py - Similarity calculations
  • /rag/nlp/search.py - Search ranking
  • /graphrag/entity_resolution.py - Entity matching
  • /deepdoc/vision/operators.py - Box metrics