- Updated README.md with complete algorithm map across 12 categories - Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec) - Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution) - Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym) - Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost) - Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
365 lines
8.2 KiB
Markdown
365 lines
8.2 KiB
Markdown
# Clustering Algorithms
|
||
|
||
## Tong Quan
|
||
|
||
RAGFlow sử dụng clustering algorithms cho PDF layout analysis và RAPTOR hierarchical summarization.
|
||
|
||
## 1. K-Means Clustering
|
||
|
||
### File Location
|
||
```
|
||
/deepdoc/parser/pdf_parser.py (lines 36, 394, 425, 1047-1055)
|
||
```
|
||
|
||
### Purpose
|
||
Phát hiện cột (columns) trong PDF layout bằng cách clustering text boxes theo X-coordinate.
|
||
|
||
### Implementation
|
||
|
||
```python
|
||
from sklearn.cluster import KMeans
|
||
|
||
def _assign_column(self):
|
||
"""
|
||
Detect columns using KMeans clustering on X coordinates.
|
||
"""
|
||
# Get X coordinates of text boxes
|
||
x_coords = np.array([[b["x0"]] for b in self.bxs])
|
||
|
||
best_k = 1
|
||
best_score = -1
|
||
|
||
# Find optimal number of columns (1-5)
|
||
for k in range(1, min(5, len(self.bxs))):
|
||
if k >= len(self.bxs):
|
||
break
|
||
|
||
km = KMeans(n_clusters=k, random_state=42, n_init="auto")
|
||
labels = km.fit_predict(x_coords)
|
||
|
||
if k > 1:
|
||
score = silhouette_score(x_coords, labels)
|
||
if score > best_score:
|
||
best_score = score
|
||
best_k = k
|
||
|
||
# Assign columns with optimal k
|
||
km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
|
||
labels = km.fit_predict(x_coords)
|
||
|
||
for i, bx in enumerate(self.bxs):
|
||
bx["col_id"] = labels[i]
|
||
```
|
||
|
||
### Algorithm
|
||
|
||
```
|
||
K-Means Algorithm:
|
||
1. Initialize k centroids randomly
|
||
2. Repeat until convergence:
|
||
a. Assign each point to nearest centroid
|
||
b. Update centroids as mean of assigned points
|
||
3. Return cluster assignments
|
||
|
||
Objective: minimize Σ ||xi - μci||²
|
||
where μci is centroid of cluster containing xi
|
||
```
|
||
|
||
### Parameters
|
||
|
||
| Parameter | Value | Description |
|
||
|-----------|-------|-------------|
|
||
| n_clusters | 1-5 | Number of columns to detect |
|
||
| n_init | "auto" | Initialization runs |
|
||
| random_state | 42 | Reproducibility |
|
||
|
||
---
|
||
|
||
## 2. Gaussian Mixture Model (GMM)
|
||
|
||
### File Location
|
||
```
|
||
/rag/raptor.py (lines 22, 102-106, 195-199)
|
||
```
|
||
|
||
### Purpose
|
||
RAPTOR algorithm sử dụng GMM để cluster document chunks trước khi summarization.
|
||
|
||
### Implementation
|
||
|
||
```python
|
||
from sklearn.mixture import GaussianMixture
|
||
|
||
def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int):
|
||
"""
|
||
Find optimal number of clusters using BIC criterion.
|
||
"""
|
||
max_clusters = min(self._max_cluster, len(embeddings))
|
||
n_clusters = np.arange(1, max_clusters)
|
||
|
||
bics = []
|
||
for n in n_clusters:
|
||
gm = GaussianMixture(
|
||
n_components=n,
|
||
random_state=random_state,
|
||
covariance_type='full'
|
||
)
|
||
gm.fit(embeddings)
|
||
bics.append(gm.bic(embeddings))
|
||
|
||
# Select cluster count with minimum BIC
|
||
optimal_clusters = n_clusters[np.argmin(bics)]
|
||
return optimal_clusters
|
||
|
||
def _cluster_chunks(self, chunks, embeddings):
|
||
"""
|
||
Cluster chunks using GMM with soft assignments.
|
||
"""
|
||
# Reduce dimensions first
|
||
reduced = self._reduce_dimensions(embeddings)
|
||
|
||
# Find optimal k
|
||
n_clusters = self._get_optimal_clusters(reduced, random_state=42)
|
||
|
||
# Fit GMM
|
||
gm = GaussianMixture(n_components=n_clusters, random_state=42)
|
||
gm.fit(reduced)
|
||
|
||
# Get soft assignments (probabilities)
|
||
probs = gm.predict_proba(reduced)
|
||
|
||
# Assign chunks to clusters with probability > threshold
|
||
clusters = [[] for _ in range(n_clusters)]
|
||
for i, prob in enumerate(probs):
|
||
for j, p in enumerate(prob):
|
||
if p > 0.1: # Threshold
|
||
clusters[j].append(i)
|
||
|
||
return clusters
|
||
```
|
||
|
||
### GMM Formula
|
||
|
||
```
|
||
GMM Probability Density:
|
||
p(x) = Σ πk × N(x | μk, Σk)
|
||
|
||
where:
|
||
- πk = mixture weight for component k
|
||
- N(x | μk, Σk) = Gaussian distribution with mean μk and covariance Σk
|
||
|
||
BIC (Bayesian Information Criterion):
|
||
BIC = k × log(n) - 2 × log(L̂)
|
||
|
||
where:
|
||
- k = number of parameters
|
||
- n = number of samples
|
||
- L̂ = maximum likelihood
|
||
```
|
||
|
||
### Soft Assignment
|
||
|
||
GMM cho phép soft assignment (một chunk có thể thuộc nhiều clusters):
|
||
|
||
```
|
||
Chunk i belongs to Cluster j if P(j|xi) > threshold (0.1)
|
||
```
|
||
|
||
---
|
||
|
||
## 3. UMAP (Dimensionality Reduction)
|
||
|
||
### File Location
|
||
```
|
||
/rag/raptor.py (lines 21, 186-190)
|
||
```
|
||
|
||
### Purpose
|
||
Giảm số chiều của embeddings trước khi clustering để improve cluster quality.
|
||
|
||
### Implementation
|
||
|
||
```python
|
||
import umap
|
||
|
||
def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Reduce embedding dimensions using UMAP.
|
||
"""
|
||
n_samples = len(embeddings)
|
||
|
||
# Calculate neighbors based on sample size
|
||
n_neighbors = int((n_samples - 1) ** 0.8)
|
||
|
||
# Target dimensions
|
||
n_components = min(12, n_samples - 2)
|
||
|
||
reducer = umap.UMAP(
|
||
n_neighbors=max(2, n_neighbors),
|
||
n_components=n_components,
|
||
metric="cosine",
|
||
random_state=42
|
||
)
|
||
|
||
return reducer.fit_transform(embeddings)
|
||
```
|
||
|
||
### UMAP Algorithm
|
||
|
||
```
|
||
UMAP (Uniform Manifold Approximation and Projection):
|
||
|
||
1. Build high-dimensional graph:
|
||
- Compute k-nearest neighbors
|
||
- Create weighted edges based on distance
|
||
|
||
2. Build low-dimensional representation:
|
||
- Initialize randomly
|
||
- Optimize layout using cross-entropy loss
|
||
- Preserve local structure (neighbors stay neighbors)
|
||
|
||
Key idea: Preserve topological structure, not absolute distances
|
||
```
|
||
|
||
### Parameters
|
||
|
||
| Parameter | Value | Description |
|
||
|-----------|-------|-------------|
|
||
| n_neighbors | (n-1)^0.8 | Local neighborhood size |
|
||
| n_components | min(12, n-2) | Output dimensions |
|
||
| metric | cosine | Distance metric |
|
||
|
||
---
|
||
|
||
## 4. Silhouette Score
|
||
|
||
### File Location
|
||
```
|
||
/deepdoc/parser/pdf_parser.py (lines 37, 400, 1052)
|
||
```
|
||
|
||
### Purpose
|
||
Đánh giá cluster quality để chọn optimal k cho K-Means.
|
||
|
||
### Formula
|
||
|
||
```
|
||
Silhouette Score:
|
||
s(i) = (b(i) - a(i)) / max(a(i), b(i))
|
||
|
||
where:
|
||
- a(i) = average distance to points in same cluster
|
||
- b(i) = average distance to points in nearest other cluster
|
||
|
||
Range: [-1, 1]
|
||
- s ≈ 1: Point well-clustered
|
||
- s ≈ 0: Point on boundary
|
||
- s < 0: Point may be misclassified
|
||
|
||
Overall score = mean(s(i)) for all points
|
||
```
|
||
|
||
### Usage
|
||
|
||
```python
|
||
from sklearn.metrics import silhouette_score
|
||
|
||
# Find optimal k
|
||
best_k = 1
|
||
best_score = -1
|
||
|
||
for k in range(2, max_clusters):
|
||
km = KMeans(n_clusters=k)
|
||
labels = km.fit_predict(data)
|
||
|
||
score = silhouette_score(data, labels)
|
||
|
||
if score > best_score:
|
||
best_score = score
|
||
best_k = k
|
||
```
|
||
|
||
---
|
||
|
||
## 5. Node2Vec (Graph Embedding)
|
||
|
||
### File Location
|
||
```
|
||
/graphrag/general/entity_embedding.py (lines 24-44)
|
||
```
|
||
|
||
### Purpose
|
||
Generate embeddings cho graph nodes trong knowledge graph.
|
||
|
||
### Implementation
|
||
|
||
```python
|
||
from graspologic.embed import node2vec_embed
|
||
|
||
def embed_node2vec(graph, dimensions=1536, num_walks=10,
|
||
walk_length=40, window_size=2, iterations=3):
|
||
"""
|
||
Generate node embeddings using Node2Vec algorithm.
|
||
"""
|
||
lcc_tensors, embedding = node2vec_embed(
|
||
graph=graph,
|
||
dimensions=dimensions,
|
||
num_walks=num_walks,
|
||
walk_length=walk_length,
|
||
window_size=window_size,
|
||
iterations=iterations,
|
||
random_seed=86
|
||
)
|
||
|
||
return embedding
|
||
```
|
||
|
||
### Node2Vec Algorithm
|
||
|
||
```
|
||
Node2Vec Algorithm:
|
||
|
||
1. Random Walk Generation:
|
||
- For each node, perform biased random walks
|
||
- Walk strategy controlled by p (return) and q (in-out)
|
||
|
||
2. Skip-gram Training:
|
||
- Treat walks as sentences
|
||
- Train Word2Vec Skip-gram model
|
||
- Node → Embedding vector
|
||
|
||
Walk probabilities:
|
||
- p: Return parameter (go back to previous node)
|
||
- q: In-out parameter (explore vs exploit)
|
||
|
||
Low p, high q → BFS-like (local structure)
|
||
High p, low q → DFS-like (global structure)
|
||
```
|
||
|
||
### Parameters
|
||
|
||
| Parameter | Value | Description |
|
||
|-----------|-------|-------------|
|
||
| dimensions | 1536 | Embedding size |
|
||
| num_walks | 10 | Walks per node |
|
||
| walk_length | 40 | Steps per walk |
|
||
| window_size | 2 | Skip-gram window |
|
||
| iterations | 3 | Training iterations |
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
| Algorithm | Purpose | Library |
|
||
|-----------|---------|---------|
|
||
| K-Means | PDF column detection | sklearn |
|
||
| GMM | RAPTOR clustering | sklearn |
|
||
| UMAP | Dimension reduction | umap-learn |
|
||
| Silhouette | Cluster validation | sklearn |
|
||
| Node2Vec | Graph embedding | graspologic |
|
||
|
||
## Related Files
|
||
|
||
- `/deepdoc/parser/pdf_parser.py` - K-Means, Silhouette
|
||
- `/rag/raptor.py` - GMM, UMAP
|
||
- `/graphrag/general/entity_embedding.py` - Node2Vec
|