# Clustering Algorithms

## Overview

RAGFlow uses clustering algorithms for PDF layout analysis and for RAPTOR hierarchical summarization.
## 1. K-Means Clustering

### File Location

`/deepdoc/parser/pdf_parser.py` (lines 36, 394, 425, 1047-1055)

### Purpose

Detects columns in the PDF layout by clustering text boxes on their X coordinates.
### Implementation

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def _assign_column(self):
    """Detect columns using KMeans clustering on X coordinates."""
    # Get X coordinates of text boxes
    x_coords = np.array([[b["x0"]] for b in self.bxs])

    best_k = 1
    best_score = -1
    # Search candidate column counts (capped by the number of boxes)
    for k in range(1, min(5, len(self.bxs))):
        if k >= len(self.bxs):
            break
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        labels = km.fit_predict(x_coords)
        # The silhouette score is only defined for k >= 2
        if k > 1:
            score = silhouette_score(x_coords, labels)
            if score > best_score:
                best_score = score
                best_k = k

    # Assign columns with the optimal k
    km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
    labels = km.fit_predict(x_coords)
    for i, bx in enumerate(self.bxs):
        bx["col_id"] = labels[i]
```
### Algorithm

```
K-Means (Lloyd's algorithm):
1. Initialize k centroids randomly
2. Repeat until convergence:
   a. Assign each point to its nearest centroid
   b. Update each centroid as the mean of its assigned points
3. Return cluster assignments

Objective: minimize Σᵢ ||xᵢ − μ_{c(i)}||²
where μ_{c(i)} is the centroid of the cluster containing xᵢ
```
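To make the assign/update loop concrete, here is a minimal NumPy sketch of a single Lloyd iteration on toy X coordinates (the values and initial centroids are hypothetical, chosen to mimic a two-column page):

```python
import numpy as np

# Toy x0 values for a two-column page; initial centroids are a rough guess
x = np.array([[48.0], [52.0], [50.0], [298.0], [302.0], [300.0]])
centroids = np.array([[40.0], [310.0]])

# Step 2a: assign each point to its nearest centroid
labels = np.argmin(np.abs(x - centroids.T), axis=1)
# Step 2b: move each centroid to the mean of its assigned points
centroids = np.array([x[labels == k].mean(axis=0) for k in range(2)])

print(labels)              # [0 0 0 1 1 1]
print(centroids.ravel())   # [ 50. 300.] -- already converged for this data
```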
### Parameters
| Parameter | Value | Description |
|---|---|---|
| n_clusters | 1-5 | Number of columns to detect |
| n_init | "auto" | Initialization runs |
| random_state | 42 | Reproducibility |
## 2. Gaussian Mixture Model (GMM)

### File Location

`/rag/raptor.py` (lines 22, 102-106, 195-199)

### Purpose

The RAPTOR algorithm uses a GMM to cluster document chunks before summarization.
### Implementation

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int):
    """Find the optimal number of clusters using the BIC criterion."""
    max_clusters = min(self._max_cluster, len(embeddings))
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(
            n_components=n,
            random_state=random_state,
            covariance_type="full",
        )
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    # Select the cluster count with the minimum BIC
    optimal_clusters = n_clusters[np.argmin(bics)]
    return optimal_clusters


def _cluster_chunks(self, chunks, embeddings):
    """Cluster chunks using a GMM with soft assignments."""
    # Reduce dimensions first
    reduced = self._reduce_dimensions(embeddings)
    # Find the optimal k
    n_clusters = self._get_optimal_clusters(reduced, random_state=42)
    # Fit the GMM
    gm = GaussianMixture(n_components=n_clusters, random_state=42)
    gm.fit(reduced)
    # Get soft assignments (posterior probabilities)
    probs = gm.predict_proba(reduced)
    # Assign each chunk to every cluster whose probability exceeds the threshold
    clusters = [[] for _ in range(n_clusters)]
    for i, prob in enumerate(probs):
        for j, p in enumerate(prob):
            if p > 0.1:  # Threshold
                clusters[j].append(i)
    return clusters
```
### GMM Formula

```
GMM probability density:
p(x) = Σₖ πₖ · N(x | μₖ, Σₖ)
where:
- πₖ = mixture weight of component k
- N(x | μₖ, Σₖ) = Gaussian with mean μₖ and covariance Σₖ

BIC (Bayesian Information Criterion):
BIC = k · log(n) − 2 · log(L̂)
where:
- k = number of free parameters
- n = number of samples
- L̂ = maximized likelihood
```
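As a sanity check on the BIC formula, the sketch below recomputes it by hand from a fitted sklearn model and compares it to `gm.bic()` (the toy data and blob positions are hypothetical):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Toy data: two well-separated 2-D blobs
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=42).fit(X)

n, d, k = X.shape[0], X.shape[1], gm.n_components
# Free parameters of a full-covariance GMM:
# k*d means + k*d*(d+1)/2 covariance entries + (k-1) mixture weights
n_params = k * d + k * d * (d + 1) // 2 + (k - 1)
log_likelihood = gm.score(X) * n  # score() returns the mean log-likelihood

bic_manual = n_params * np.log(n) - 2 * log_likelihood
print(bic_manual, gm.bic(X))      # the two values match
```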
### Soft Assignment

GMM allows soft assignment (a single chunk can belong to multiple clusters):

```
Chunk i belongs to cluster j if P(j | xᵢ) > threshold (0.1)
```
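A minimal sketch of this thresholding rule with `predict_proba` (toy data; the 0.1 threshold matches the implementation above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D blobs plus a point near the boundary
X = np.vstack([rng.normal(0, 1, (30, 1)), rng.normal(4, 1, (30, 1)), [[2.0]]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)

# A point belongs to every cluster whose posterior exceeds 0.1
memberships = [np.flatnonzero(p > 0.1) for p in probs]
print(probs[-1])        # e.g. [0.47 0.53] for the boundary point
print(memberships[-1])  # [0 1] -- the boundary point belongs to both clusters
```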
## 3. UMAP (Dimensionality Reduction)

### File Location

`/rag/raptor.py` (lines 21, 186-190)

### Purpose

Reduces the dimensionality of embeddings before clustering to improve cluster quality.
### Implementation

```python
import numpy as np
import umap


def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray:
    """Reduce embedding dimensions using UMAP."""
    n_samples = len(embeddings)
    # Scale the neighborhood size with the sample count
    n_neighbors = int((n_samples - 1) ** 0.8)
    # Target dimensions
    n_components = min(12, n_samples - 2)
    reducer = umap.UMAP(
        n_neighbors=max(2, n_neighbors),
        n_components=n_components,
        metric="cosine",
        random_state=42,
    )
    return reducer.fit_transform(embeddings)
```
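A self-contained usage sketch of the same reduction on random stand-in embeddings (the shapes are hypothetical; the parameter heuristics mirror the implementation above):

```python
import numpy as np
import umap

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(64, 768))  # stand-in for chunk embeddings

n_samples = len(embeddings)
n_neighbors = max(2, int((n_samples - 1) ** 0.8))  # 63 ** 0.8 ≈ 27
n_components = min(12, n_samples - 2)              # 12

reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=n_components,
                    metric="cosine", random_state=42)
reduced = reducer.fit_transform(embeddings)
print(reduced.shape)  # (64, 12)
```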
### UMAP Algorithm

```
UMAP (Uniform Manifold Approximation and Projection):
1. Build a high-dimensional graph:
   - Compute k-nearest neighbors
   - Create weighted edges based on distance
2. Build a low-dimensional representation:
   - Initialize randomly
   - Optimize the layout with a cross-entropy loss
   - Preserve local structure (neighbors stay neighbors)

Key idea: preserve topological structure, not absolute distances
```
### Parameters
| Parameter | Value | Description |
|---|---|---|
| n_neighbors | (n-1)^0.8 | Local neighborhood size |
| n_components | min(12, n-2) | Output dimensions |
| metric | cosine | Distance metric |
## 4. Silhouette Score

### File Location

`/deepdoc/parser/pdf_parser.py` (lines 37, 400, 1052)

### Purpose

Evaluates cluster quality in order to select the optimal k for K-Means.
### Formula

```
Silhouette score:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
where:
- a(i) = average distance from point i to the other points in its own cluster
- b(i) = average distance from point i to the points in the nearest other cluster

Range: [-1, 1]
- s ≈ 1: point is well clustered
- s ≈ 0: point lies on a boundary
- s < 0: point may be misclassified

Overall score = mean of s(i) over all points
```
### Usage

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Find the optimal k (`data` and `max_clusters` are assumed to be defined)
best_k = 1
best_score = -1
for k in range(2, max_clusters):
    km = KMeans(n_clusters=k)
    labels = km.fit_predict(data)
    score = silhouette_score(data, labels)
    if score > best_score:
        best_score = score
        best_k = k
```
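To connect the formula to the library call, here is a hand computation of s(0) on a toy 1-D dataset, cross-checked against `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two obvious 1-D clusters
data = np.array([[1.0], [1.2], [0.8], [10.0], [10.2], [9.8]])
labels = np.array([0, 0, 0, 1, 1, 1])

# s(0) by hand: a = mean distance to the rest of cluster 0,
#               b = mean distance to cluster 1
a = np.mean([0.2, 0.2])
b = np.mean([9.0, 9.2, 8.8])
s0 = (b - a) / max(a, b)

print(s0, silhouette_samples(data, labels)[0])  # both ≈ 0.978
print(silhouette_score(data, labels))           # mean s(i) over all points
```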
## 5. Node2Vec (Graph Embedding)

### File Location

`/graphrag/general/entity_embedding.py` (lines 24-44)

### Purpose

Generates embeddings for graph nodes in the knowledge graph.
### Implementation

```python
from graspologic.embed import node2vec_embed


def embed_node2vec(graph, dimensions=1536, num_walks=10,
                   walk_length=40, window_size=2, iterations=3):
    """Generate node embeddings using the Node2Vec algorithm."""
    # node2vec_embed returns (embedding matrix, node labels)
    embeddings, node_labels = node2vec_embed(
        graph=graph,
        dimensions=dimensions,
        num_walks=num_walks,
        walk_length=walk_length,
        window_size=window_size,
        iterations=iterations,
        random_seed=86,
    )
    return embeddings
```
### Node2Vec Algorithm

```
Node2Vec:
1. Random walk generation:
   - For each node, perform biased random walks
   - Walk strategy controlled by p (return) and q (in-out)
2. Skip-gram training:
   - Treat walks as sentences
   - Train a Word2Vec skip-gram model
   - Node → embedding vector

Walk parameters:
- p: return parameter (likelihood of going back to the previous node)
- q: in-out parameter (explore vs. stay local)

Low p, high q → BFS-like (local structure)
High p, low q → DFS-like (global structure)
```
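The bias itself is easy to state in code. Below is an illustrative sketch (not graspologic's internals) of the unnormalized transition weights for one step of a walk that has just moved from node t to node v:

```python
import networkx as nx

def transition_weights(graph, t, v, p=1.0, q=1.0):
    """Unnormalized node2vec weight for stepping v -> x after arriving from t."""
    weights = {}
    for x in graph.neighbors(v):
        if x == t:
            weights[x] = 1.0 / p   # return to the previous node
        elif graph.has_edge(t, x):
            weights[x] = 1.0       # stay within t's neighborhood (BFS-like)
        else:
            weights[x] = 1.0 / q   # move away from t (DFS-like)
    return weights

g = nx.Graph([("t", "v"), ("v", "x1"), ("v", "x2"), ("t", "x1")])
print(transition_weights(g, "t", "v", p=4.0, q=0.5))
# {'t': 0.25, 'x1': 1.0, 'x2': 2.0} -> high p, low q pushes the walk outward
```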
### Parameters
| Parameter | Value | Description |
|---|---|---|
| dimensions | 1536 | Embedding size |
| num_walks | 10 | Walks per node |
| walk_length | 40 | Steps per walk |
| window_size | 2 | Skip-gram window |
| iterations | 3 | Training iterations |
## Summary
| Algorithm | Purpose | Library |
|---|---|---|
| K-Means | PDF column detection | sklearn |
| GMM | RAPTOR clustering | sklearn |
| UMAP | Dimension reduction | umap-learn |
| Silhouette | Cluster validation | sklearn |
| Node2Vec | Graph embedding | graspologic |
## Related Files

- `/deepdoc/parser/pdf_parser.py` – K-Means, Silhouette
- `/rag/raptor.py` – GMM, UMAP
- `/graphrag/general/entity_embedding.py` – Node2Vec