ragflow/personal_analyze/06-ALGORITHMS/clustering_algorithms.md
Claude 566bce428b
docs: Add comprehensive algorithm documentation (50+ algorithms)
- Updated README.md with complete algorithm map across 12 categories
- Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec)
- Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution)
- Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym)
- Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost)
- Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
2025-11-27 03:34:49 +00:00


Clustering Algorithms

Overview

RAGFlow uses clustering algorithms for PDF layout analysis and for RAPTOR hierarchical summarization.

1. K-Means Clustering

File Location

/deepdoc/parser/pdf_parser.py (lines 36, 394, 425, 1047-1055)

Purpose

Detect columns in a PDF layout by clustering text boxes by their X-coordinate.

Implementation

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def _assign_column(self):
    """
    Detect columns using KMeans clustering on X coordinates.
    """
    # Get X coordinates of text boxes
    x_coords = np.array([[b["x0"]] for b in self.bxs])

    best_k = 1
    best_score = -1

    # Try candidate column counts (silhouette is defined only for k >= 2)
    for k in range(2, min(5, len(self.bxs))):
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        labels = km.fit_predict(x_coords)

        score = silhouette_score(x_coords, labels)
        if score > best_score:
            best_score = score
            best_k = k

    # Re-fit with the optimal k and assign a column id to each box
    km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
    labels = km.fit_predict(x_coords)

    for i, bx in enumerate(self.bxs):
        bx["col_id"] = labels[i]
```

Algorithm

K-Means Algorithm:
1. Initialize k centroids randomly
2. Repeat until convergence:
   a. Assign each point to nearest centroid
   b. Update centroids as mean of assigned points
3. Return cluster assignments

Objective: minimize Σ ||xi - μci||²
where μci is centroid of cluster containing xi
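As a concrete illustration of the loop above, here is a minimal NumPy re-implementation of Lloyd's algorithm. This is a sketch for illustration only — RAGFlow itself calls sklearn's KMeans — and the sample x-coordinates are made up:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=42):
    """Minimal Lloyd's algorithm (illustrative, not sklearn's KMeans)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize k centroids by sampling distinct input points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # 2a. Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Update each centroid as the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated "columns" of x-coordinates
xs = np.array([[10.0], [12.0], [11.0], [300.0], [305.0], [298.0]])
labels, centroids = kmeans(xs, k=2)
```

The two groups of boxes end up in different clusters regardless of which points are chosen as initial centroids, because the columns are far apart relative to their spread.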

Parameters

| Parameter | Value | Description |
|---|---|---|
| n_clusters | 1-5 | Number of columns to detect |
| n_init | "auto" | Initialization runs |
| random_state | 42 | Reproducibility |

2. Gaussian Mixture Model (GMM)

File Location

/rag/raptor.py (lines 22, 102-106, 195-199)

Purpose

The RAPTOR algorithm uses a GMM to cluster document chunks before summarization.

Implementation

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int):
    """
    Find the optimal number of clusters using the BIC criterion.
    """
    max_clusters = min(self._max_cluster, len(embeddings))
    n_clusters = np.arange(1, max_clusters)

    bics = []
    for n in n_clusters:
        gm = GaussianMixture(
            n_components=n,
            random_state=random_state,
            covariance_type='full'
        )
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))

    # Select the cluster count with the minimum BIC
    optimal_clusters = n_clusters[np.argmin(bics)]
    return optimal_clusters

def _cluster_chunks(self, chunks, embeddings):
    """
    Cluster chunks using GMM with soft assignments.
    """
    # Reduce dimensions first
    reduced = self._reduce_dimensions(embeddings)

    # Find the optimal k
    n_clusters = self._get_optimal_clusters(reduced, random_state=42)

    # Fit the GMM
    gm = GaussianMixture(n_components=n_clusters, random_state=42)
    gm.fit(reduced)

    # Get soft assignments (posterior probabilities)
    probs = gm.predict_proba(reduced)

    # Assign each chunk to every cluster whose probability exceeds the threshold
    clusters = [[] for _ in range(n_clusters)]
    for i, prob in enumerate(probs):
        for j, p in enumerate(prob):
            if p > 0.1:  # threshold
                clusters[j].append(i)

    return clusters
```

GMM Formula

GMM Probability Density:
p(x) = Σ πk × N(x | μk, Σk)

where:
- πk = mixture weight for component k
- N(x | μk, Σk) = Gaussian distribution with mean μk and covariance Σk

BIC (Bayesian Information Criterion):
BIC = k × log(n) - 2 × log(L̂)

where:
- k = number of parameters
- n = number of samples
- L̂ = maximum likelihood
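The BIC formula above can be checked by hand for the simplest case: a single 1-D Gaussian fitted by maximum likelihood. The helper name and data below are ours, not RAGFlow code:

```python
import numpy as np

def gaussian_bic(data):
    """BIC = k*ln(n) - 2*ln(L̂) for one 1-D Gaussian fit by MLE.
    Free parameters counted: mean and variance, so k = 2."""
    n = len(data)
    mu, var = data.mean(), data.var()  # MLE estimates
    # Maximum log-likelihood: sum of log N(x_i | mu, var) simplifies to
    # -n/2 * (ln(2*pi*var) + 1) at the MLE
    log_l = -0.5 * n * (np.log(2 * np.pi * var) + 1)
    k = 2  # number of free parameters
    return k * np.log(n) - 2 * log_l

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=200)
bic = gaussian_bic(data)
```

Lower BIC means a better trade-off between fit and model size, which is why `_get_optimal_clusters` takes the argmin over candidate component counts.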

Soft Assignment

GMM allows soft assignment (a single chunk can belong to multiple clusters):

Chunk i belongs to Cluster j if P(j|xi) > threshold (0.1)
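A small sketch of this thresholding step, using a made-up probability matrix in place of real `gm.predict_proba` output:

```python
import numpy as np

# Hypothetical soft-assignment matrix: rows = chunks, columns = clusters
probs = np.array([
    [0.92, 0.05, 0.03],   # chunk 0: clearly cluster 0
    [0.50, 0.42, 0.08],   # chunk 1: straddles clusters 0 and 1
    [0.04, 0.06, 0.90],   # chunk 2: clearly cluster 2
])

threshold = 0.1
clusters = [[] for _ in range(probs.shape[1])]
for i, row in enumerate(probs):
    for j, p in enumerate(row):
        if p > threshold:      # P(j | x_i) > 0.1
            clusters[j].append(i)
```

Chunk 1 lands in both cluster 0 and cluster 1, so it contributes to two summaries — the point of soft assignment in RAPTOR.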

3. UMAP (Dimensionality Reduction)

File Location

/rag/raptor.py (lines 21, 186-190)

Purpose

Reduce the dimensionality of embeddings before clustering to improve cluster quality.

Implementation

```python
import numpy as np
import umap

def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray:
    """
    Reduce embedding dimensions using UMAP.
    """
    n_samples = len(embeddings)

    # Calculate neighborhood size based on sample count
    n_neighbors = int((n_samples - 1) ** 0.8)

    # Target dimensions
    n_components = min(12, n_samples - 2)

    reducer = umap.UMAP(
        n_neighbors=max(2, n_neighbors),
        n_components=n_components,
        metric="cosine",
        random_state=42
    )

    return reducer.fit_transform(embeddings)
```

UMAP Algorithm

UMAP (Uniform Manifold Approximation and Projection):

1. Build high-dimensional graph:
   - Compute k-nearest neighbors
   - Create weighted edges based on distance

2. Build low-dimensional representation:
   - Initialize randomly
   - Optimize layout using cross-entropy loss
   - Preserve local structure (neighbors stay neighbors)

Key idea: Preserve topological structure, not absolute distances

Parameters

| Parameter | Value | Description |
|---|---|---|
| n_neighbors | (n-1)^0.8 | Local neighborhood size |
| n_components | min(12, n-2) | Output dimensions |
| metric | cosine | Distance metric |
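The sizing heuristics in the table can be reproduced without umap-learn installed; the helper name below is ours, but the arithmetic matches the implementation above:

```python
def raptor_umap_params(n_samples: int):
    """Reproduce the UMAP sizing heuristics (helper name is illustrative).
    Neighborhood size grows sublinearly with sample count; the output
    dimensionality is capped at 12."""
    n_neighbors = max(2, int((n_samples - 1) ** 0.8))
    n_components = min(12, n_samples - 2)
    return n_neighbors, n_components

# 100 chunks -> 39 neighbors, 12 output dimensions
print(raptor_umap_params(100))
```

The `(n-1)^0.8` exponent keeps neighborhoods large relative to the sample (preserving more global structure) while still shrinking them for big inputs.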

4. Silhouette Score

File Location

/deepdoc/parser/pdf_parser.py (lines 37, 400, 1052)

Purpose

Evaluate cluster quality in order to select the optimal k for K-Means.

Formula

Silhouette Score:
s(i) = (b(i) - a(i)) / max(a(i), b(i))

where:
- a(i) = average distance to points in same cluster
- b(i) = average distance to points in nearest other cluster

Range: [-1, 1]
- s ≈ 1: Point well-clustered
- s ≈ 0: Point on boundary
- s < 0: Point may be misclassified

Overall score = mean(s(i)) for all points
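The definition above can be implemented directly. The following is an illustrative re-implementation (not `sklearn.metrics.silhouette_score`) on a tiny two-cluster example; it assumes every cluster has at least two points, since a(i) is undefined for singletons:

```python
import numpy as np

def silhouette_samples_manual(points, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    n = len(points)
    # Pairwise Euclidean distances
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        # a(i): mean distance to the rest of its own cluster
        a = dist[i, same].mean()
        # b(i): mean distance to the nearest other cluster
        b = min(dist[i, labels == c].mean()
                for c in set(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

pts = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_samples_manual(pts, labels)
```

For point 0: a = 1, b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5 ≈ 0.905 — close to 1, i.e. well-clustered, as expected for such well-separated groups.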

Usage

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Find the optimal k (silhouette is defined only for k >= 2)
best_k = 1
best_score = -1

for k in range(2, max_clusters):
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = km.fit_predict(data)

    score = silhouette_score(data, labels)
    if score > best_score:
        best_score = score
        best_k = k
```

5. Node2Vec (Graph Embedding)

File Location

/graphrag/general/entity_embedding.py (lines 24-44)

Purpose

Generate embeddings cho graph nodes trong knowledge graph.

Implementation

```python
from graspologic.embed import node2vec_embed

def embed_node2vec(graph, dimensions=1536, num_walks=10,
                   walk_length=40, window_size=2, iterations=3):
    """
    Generate node embeddings using the Node2Vec algorithm.
    """
    lcc_tensors, embedding = node2vec_embed(
        graph=graph,
        dimensions=dimensions,
        num_walks=num_walks,
        walk_length=walk_length,
        window_size=window_size,
        iterations=iterations,
        random_seed=86
    )

    return embedding
```

Node2Vec Algorithm

Node2Vec Algorithm:

1. Random Walk Generation:
   - For each node, perform biased random walks
   - Walk strategy controlled by p (return) and q (in-out)

2. Skip-gram Training:
   - Treat walks as sentences
   - Train Word2Vec Skip-gram model
   - Node → Embedding vector

Walk probabilities:
- p: Return parameter (go back to previous node)
- q: In-out parameter (explore vs exploit)

Low p, high q → BFS-like (local structure)
High p, low q → DFS-like (global structure)
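The walk bias can be sketched as a function returning unnormalized transition weights. This is a simplification of node2vec's second-order random walk — the function name and toy graph are ours, not graspologic's API:

```python
def biased_walk_weights(prev, curr, neighbors, edges, p, q):
    """Unnormalized node2vec transition weights out of `curr`, having
    just arrived from `prev`:
      - return to prev:                weight 1/p
      - neighbor shared with prev:     weight 1
      - node two hops from prev:       weight 1/q (move outward)
    """
    weights = {}
    for nxt in neighbors[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p
        elif (prev, nxt) in edges or (nxt, prev) in edges:
            weights[nxt] = 1.0
        else:
            weights[nxt] = 1.0 / q
    return weights

# Toy graph: triangle A-B-C plus a tail C-D
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}

# Walk arrived at C from A; high p discourages backtracking,
# low q encourages stepping outward to D
w = biased_walk_weights(prev="A", curr="C", neighbors=neighbors,
                        edges=edges, p=4.0, q=0.5)
```

With p = 4 and q = 0.5 the walk strongly prefers the unexplored node D (weight 2.0) over returning to A (weight 0.25), the DFS-like regime described above.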

Parameters

| Parameter | Value | Description |
|---|---|---|
| dimensions | 1536 | Embedding size |
| num_walks | 10 | Walks per node |
| walk_length | 40 | Steps per walk |
| window_size | 2 | Skip-gram window |
| iterations | 3 | Training iterations |

Summary

| Algorithm | Purpose | Library |
|---|---|---|
| K-Means | PDF column detection | sklearn |
| GMM | RAPTOR clustering | sklearn |
| UMAP | Dimension reduction | umap-learn |
| Silhouette | Cluster validation | sklearn |
| Node2Vec | Graph embedding | graspologic |

- /deepdoc/parser/pdf_parser.py - K-Means, Silhouette
- /rag/raptor.py - GMM, UMAP
- /graphrag/general/entity_embedding.py - Node2Vec