# Clustering Algorithms

## Overview

RAGFlow uses clustering algorithms for PDF layout analysis and for RAPTOR hierarchical summarization.
## 1. K-Means Clustering

### File Location

`/deepdoc/parser/pdf_parser.py` (lines 36, 394, 425, 1047-1055)

### Purpose

Detects columns in the PDF layout by clustering text boxes on their X coordinates.
### Implementation

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def _assign_column(self):
    """Detect columns using KMeans clustering on X coordinates."""
    # Get X coordinates of text boxes
    x_coords = np.array([[b["x0"]] for b in self.bxs])

    best_k = 1
    best_score = -1
    # Search candidate column counts (capped by the number of boxes)
    for k in range(1, min(5, len(self.bxs))):
        if k >= len(self.bxs):
            break
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        labels = km.fit_predict(x_coords)
        # The silhouette score is only defined for k >= 2
        if k > 1:
            score = silhouette_score(x_coords, labels)
            if score > best_score:
                best_score = score
                best_k = k

    # Assign columns with the optimal k
    km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
    labels = km.fit_predict(x_coords)
    for i, bx in enumerate(self.bxs):
        bx["col_id"] = labels[i]
```
### Algorithm

```
K-Means (Lloyd's algorithm):
1. Initialize k centroids randomly
2. Repeat until convergence:
   a. Assign each point to its nearest centroid
   b. Update each centroid as the mean of its assigned points
3. Return cluster assignments

Objective: minimize Σᵢ ||xᵢ − μ_{c(i)}||²
where μ_{c(i)} is the centroid of the cluster containing xᵢ
```
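To make the assign/update loop concrete, here is a minimal NumPy sketch of a single Lloyd iteration on toy X coordinates (the values and initial centroids are hypothetical, chosen to mimic a two-column page):

```python
import numpy as np

# Toy x0 values for a two-column page; initial centroids are a rough guess
x = np.array([[48.0], [52.0], [50.0], [298.0], [302.0], [300.0]])
centroids = np.array([[40.0], [310.0]])

# Step 2a: assign each point to its nearest centroid
labels = np.argmin(np.abs(x - centroids.T), axis=1)
# Step 2b: move each centroid to the mean of its assigned points
centroids = np.array([x[labels == k].mean(axis=0) for k in range(2)])

print(labels)              # [0 0 0 1 1 1]
print(centroids.ravel())   # [ 50. 300.] -- already converged for this data
```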
### Parameters
| Parameter | Value | Description |
|---|---|---|
| n_clusters | 1-5 | Number of columns to detect |
| n_init | "auto" | Initialization runs |
| random_state | 42 | Reproducibility |
## 2. Gaussian Mixture Model (GMM)

### File Location

`/rag/raptor.py` (lines 22, 102-106, 195-199)

### Purpose

The RAPTOR algorithm uses a GMM to cluster document chunks before summarization.
### Implementation

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int):
    """Find the optimal number of clusters using the BIC criterion."""
    max_clusters = min(self._max_cluster, len(embeddings))
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(
            n_components=n,
            random_state=random_state,
            covariance_type="full",
        )
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    # Select the cluster count with the minimum BIC
    optimal_clusters = n_clusters[np.argmin(bics)]
    return optimal_clusters


def _cluster_chunks(self, chunks, embeddings):
    """Cluster chunks using a GMM with soft assignments."""
    # Reduce dimensions first
    reduced = self._reduce_dimensions(embeddings)
    # Find the optimal k
    n_clusters = self._get_optimal_clusters(reduced, random_state=42)
    # Fit the GMM
    gm = GaussianMixture(n_components=n_clusters, random_state=42)
    gm.fit(reduced)
    # Get soft assignments (posterior probabilities)
    probs = gm.predict_proba(reduced)
    # Assign each chunk to every cluster whose probability exceeds the threshold
    clusters = [[] for _ in range(n_clusters)]
    for i, prob in enumerate(probs):
        for j, p in enumerate(prob):
            if p > 0.1:  # Threshold
                clusters[j].append(i)
    return clusters
```
### GMM Formula

```
GMM probability density:
p(x) = Σₖ πₖ · N(x | μₖ, Σₖ)
where:
- πₖ = mixture weight of component k
- N(x | μₖ, Σₖ) = Gaussian with mean μₖ and covariance Σₖ

BIC (Bayesian Information Criterion):
BIC = k · log(n) − 2 · log(L̂)
where:
- k = number of free parameters
- n = number of samples
- L̂ = maximized likelihood
```
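As a sanity check on the BIC formula, the sketch below recomputes it by hand from a fitted sklearn model and compares it to `gm.bic()` (the toy data and blob positions are hypothetical):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Toy data: two well-separated 2-D blobs
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=42).fit(X)

n, d, k = X.shape[0], X.shape[1], gm.n_components
# Free parameters of a full-covariance GMM:
# k*d means + k*d*(d+1)/2 covariance entries + (k-1) mixture weights
n_params = k * d + k * d * (d + 1) // 2 + (k - 1)
log_likelihood = gm.score(X) * n  # score() returns the mean log-likelihood

bic_manual = n_params * np.log(n) - 2 * log_likelihood
print(bic_manual, gm.bic(X))      # the two values match
```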
### Soft Assignment

GMM allows soft assignment (a single chunk can belong to multiple clusters):

```
Chunk i belongs to cluster j if P(j | xᵢ) > threshold (0.1)
```
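A minimal sketch of this thresholding rule with `predict_proba` (toy data; the 0.1 threshold matches the implementation above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D blobs plus a point near the boundary
X = np.vstack([rng.normal(0, 1, (30, 1)), rng.normal(4, 1, (30, 1)), [[2.0]]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)

# A point belongs to every cluster whose posterior exceeds 0.1
memberships = [np.flatnonzero(p > 0.1) for p in probs]
print(probs[-1])        # e.g. [0.47 0.53] for the boundary point
print(memberships[-1])  # [0 1] -- the boundary point belongs to both clusters
```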
## 3. UMAP (Dimensionality Reduction)

### File Location

`/rag/raptor.py` (lines 21, 186-190)

### Purpose

Reduces the dimensionality of embeddings before clustering to improve cluster quality.
### Implementation

```python
import numpy as np
import umap


def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray:
    """Reduce embedding dimensions using UMAP."""
    n_samples = len(embeddings)
    # Scale the neighborhood size with the sample count
    n_neighbors = int((n_samples - 1) ** 0.8)
    # Target dimensions
    n_components = min(12, n_samples - 2)
    reducer = umap.UMAP(
        n_neighbors=max(2, n_neighbors),
        n_components=n_components,
        metric="cosine",
        random_state=42,
    )
    return reducer.fit_transform(embeddings)
```
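A self-contained usage sketch of the same reduction on random stand-in embeddings (the shapes are hypothetical; the parameter heuristics mirror the implementation above):

```python
import numpy as np
import umap

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(64, 768))  # stand-in for chunk embeddings

n_samples = len(embeddings)
n_neighbors = max(2, int((n_samples - 1) ** 0.8))  # 63 ** 0.8 ≈ 27
n_components = min(12, n_samples - 2)              # 12

reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=n_components,
                    metric="cosine", random_state=42)
reduced = reducer.fit_transform(embeddings)
print(reduced.shape)  # (64, 12)
```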
### UMAP Algorithm

```
UMAP (Uniform Manifold Approximation and Projection):
1. Build a high-dimensional graph:
   - Compute k-nearest neighbors
   - Create weighted edges based on distance
2. Build a low-dimensional representation:
   - Initialize randomly
   - Optimize the layout with a cross-entropy loss
   - Preserve local structure (neighbors stay neighbors)

Key idea: preserve topological structure, not absolute distances
```
### Parameters
| Parameter | Value | Description |
|---|---|---|
| n_neighbors | (n-1)^0.8 | Local neighborhood size |
| n_components | min(12, n-2) | Output dimensions |
| metric | cosine | Distance metric |
## 4. Silhouette Score

### File Location

`/deepdoc/parser/pdf_parser.py` (lines 37, 400, 1052)

### Purpose

Evaluates cluster quality in order to select the optimal k for K-Means.
### Formula

```
Silhouette score:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
where:
- a(i) = average distance from point i to the other points in its own cluster
- b(i) = average distance from point i to the points in the nearest other cluster

Range: [-1, 1]
- s ≈ 1: point is well clustered
- s ≈ 0: point lies on a boundary
- s < 0: point may be misclassified

Overall score = mean of s(i) over all points
```
### Usage

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Find the optimal k (`data` and `max_clusters` are assumed to be defined)
best_k = 1
best_score = -1
for k in range(2, max_clusters):
    km = KMeans(n_clusters=k)
    labels = km.fit_predict(data)
    score = silhouette_score(data, labels)
    if score > best_score:
        best_score = score
        best_k = k
```
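To connect the formula to the library call, here is a hand computation of s(0) on a toy 1-D dataset, cross-checked against `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two obvious 1-D clusters
data = np.array([[1.0], [1.2], [0.8], [10.0], [10.2], [9.8]])
labels = np.array([0, 0, 0, 1, 1, 1])

# s(0) by hand: a = mean distance to the rest of cluster 0,
#               b = mean distance to cluster 1
a = np.mean([0.2, 0.2])
b = np.mean([9.0, 9.2, 8.8])
s0 = (b - a) / max(a, b)

print(s0, silhouette_samples(data, labels)[0])  # both ≈ 0.978
print(silhouette_score(data, labels))           # mean s(i) over all points
```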
## 5. Node2Vec (Graph Embedding)

### File Location

`/graphrag/general/entity_embedding.py` (lines 24-44)

### Purpose

Generates embeddings for graph nodes in the knowledge graph.
### Implementation

```python
from graspologic.embed import node2vec_embed


def embed_node2vec(graph, dimensions=1536, num_walks=10,
                   walk_length=40, window_size=2, iterations=3):
    """Generate node embeddings using the Node2Vec algorithm."""
    # node2vec_embed returns (embedding matrix, node labels)
    embeddings, node_labels = node2vec_embed(
        graph=graph,
        dimensions=dimensions,
        num_walks=num_walks,
        walk_length=walk_length,
        window_size=window_size,
        iterations=iterations,
        random_seed=86,
    )
    return embeddings
```
### Node2Vec Algorithm

```
Node2Vec:
1. Random walk generation:
   - For each node, perform biased random walks
   - Walk strategy controlled by p (return) and q (in-out)
2. Skip-gram training:
   - Treat walks as sentences
   - Train a Word2Vec skip-gram model
   - Node → embedding vector

Walk parameters:
- p: return parameter (likelihood of going back to the previous node)
- q: in-out parameter (explore vs. stay local)

Low p, high q → BFS-like (local structure)
High p, low q → DFS-like (global structure)
```
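The bias itself is easy to state in code. Below is an illustrative sketch (not graspologic's internals) of the unnormalized transition weights for one step of a walk that has just moved from node t to node v:

```python
import networkx as nx

def transition_weights(graph, t, v, p=1.0, q=1.0):
    """Unnormalized node2vec weight for stepping v -> x after arriving from t."""
    weights = {}
    for x in graph.neighbors(v):
        if x == t:
            weights[x] = 1.0 / p   # return to the previous node
        elif graph.has_edge(t, x):
            weights[x] = 1.0       # stay within t's neighborhood (BFS-like)
        else:
            weights[x] = 1.0 / q   # move away from t (DFS-like)
    return weights

g = nx.Graph([("t", "v"), ("v", "x1"), ("v", "x2"), ("t", "x1")])
print(transition_weights(g, "t", "v", p=4.0, q=0.5))
# {'t': 0.25, 'x1': 1.0, 'x2': 2.0} -> high p, low q pushes the walk outward
```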
### Parameters
| Parameter | Value | Description |
|---|---|---|
| dimensions | 1536 | Embedding size |
| num_walks | 10 | Walks per node |
| walk_length | 40 | Steps per walk |
| window_size | 2 | Skip-gram window |
| iterations | 3 | Training iterations |
## Summary
| Algorithm | Purpose | Library |
|---|---|---|
| K-Means | PDF column detection | sklearn |
| GMM | RAPTOR clustering | sklearn |
| UMAP | Dimension reduction | umap-learn |
| Silhouette | Cluster validation | sklearn |
| Node2Vec | Graph embedding | graspologic |
## Related Files

- `/deepdoc/parser/pdf_parser.py` – K-Means, Silhouette
- `/rag/raptor.py` – GMM, UMAP
- `/graphrag/general/entity_embedding.py` – Node2Vec