ragflow/personal_analyze/06-ALGORITHMS/clustering_algorithms.md
Claude 566bce428b
docs: Add comprehensive algorithm documentation (50+ algorithms)
- Updated README.md with complete algorithm map across 12 categories
- Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec)
- Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution)
- Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym)
- Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost)
- Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
2025-11-27 03:34:49 +00:00

# Clustering Algorithms
## Overview
RAGFlow uses clustering algorithms for PDF layout analysis and RAPTOR hierarchical summarization.
## 1. K-Means Clustering
### File Location
```
/deepdoc/parser/pdf_parser.py (lines 36, 394, 425, 1047-1055)
```
### Purpose
Detects columns in the PDF layout by clustering text boxes on their X coordinates.
### Implementation
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def _assign_column(self):
    """Detect columns using KMeans clustering on X coordinates."""
    # Get X coordinates of text boxes
    x_coords = np.array([[b["x0"]] for b in self.bxs])
    best_k = 1
    best_score = -1
    # Find the optimal number of columns (up to 4, bounded by the box count)
    for k in range(1, min(5, len(self.bxs))):
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        labels = km.fit_predict(x_coords)
        # Silhouette score is only defined for k >= 2
        if k > 1:
            score = silhouette_score(x_coords, labels)
            if score > best_score:
                best_score = score
                best_k = k
    # Assign columns with the optimal k
    km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
    labels = km.fit_predict(x_coords)
    for i, bx in enumerate(self.bxs):
        bx["col_id"] = labels[i]
```
### Algorithm
```
K-Means Algorithm:
1. Initialize k centroids
2. Repeat until convergence:
   a. Assign each point to the nearest centroid
   b. Update each centroid as the mean of its assigned points
3. Return cluster assignments

Objective: minimize Σi ||xi − μc(i)||²
where μc(i) is the centroid of the cluster containing xi
```
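The loop above can be sketched in plain NumPy. This is a minimal illustration of Lloyd's algorithm on 1-D X coordinates (the `kmeans` helper and the sample coordinates are hypothetical, not RAGFlow code); it uses quantile-based initialization instead of random sampling so the example is deterministic:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal 1-D Lloyd's algorithm: assign, then update means."""
    # Deterministic init: spread k centroids across the value range
    centroids = np.quantile(points, np.linspace(0, 1, k))
    for _ in range(iters):
        # Step 2a: assign each point to its nearest centroid
        dists = np.abs(points[:, None] - centroids[None, :])
        labels = dists.argmin(axis=1)
        # Step 2b: update each centroid as the mean of its points
        centroids = np.array([points[labels == j].mean() for j in range(k)])
    return labels, centroids

# Two text columns, one near x=50 and one near x=300
x0 = np.array([48.0, 52.0, 50.0, 301.0, 299.0, 305.0])
labels, cents = kmeans(x0, k=2)
# The three left boxes and the three right boxes land in separate clusters
```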
### Parameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| n_clusters | 1-4 | Candidate column counts tried |
| n_init | "auto" | Initialization runs |
| random_state | 42 | Reproducibility |
---
## 2. Gaussian Mixture Model (GMM)
### File Location
```
/rag/raptor.py (lines 22, 102-106, 195-199)
```
### Purpose
The RAPTOR algorithm uses a GMM to cluster document chunks before summarization.
### Implementation
```python
import numpy as np
from sklearn.mixture import GaussianMixture

def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int):
    """Find the optimal number of clusters using the BIC criterion."""
    max_clusters = min(self._max_cluster, len(embeddings))
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(
            n_components=n,
            random_state=random_state,
            covariance_type="full",
        )
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    # Select the cluster count with the minimum BIC
    optimal_clusters = n_clusters[np.argmin(bics)]
    return optimal_clusters

def _cluster_chunks(self, chunks, embeddings):
    """Cluster chunks using a GMM with soft assignments."""
    # Reduce dimensions first
    reduced = self._reduce_dimensions(embeddings)
    # Find the optimal k
    n_clusters = self._get_optimal_clusters(reduced, random_state=42)
    # Fit the GMM
    gm = GaussianMixture(n_components=n_clusters, random_state=42)
    gm.fit(reduced)
    # Get soft assignments (probabilities)
    probs = gm.predict_proba(reduced)
    # Assign each chunk to every cluster whose probability exceeds the threshold
    clusters = [[] for _ in range(n_clusters)]
    for i, prob in enumerate(probs):
        for j, p in enumerate(prob):
            if p > 0.1:  # threshold
                clusters[j].append(i)
    return clusters
```
### GMM Formula
```
GMM probability density:
p(x) = Σk πk · N(x | μk, Σk)
where:
- πk = mixture weight of component k (Σk πk = 1)
- N(x | μk, Σk) = Gaussian density with mean μk and covariance Σk

BIC (Bayesian Information Criterion):
BIC = k · log(n) − 2 · log(L̂)
where:
- k = number of free parameters
- n = number of samples
- L̂ = maximized likelihood
```
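The two formulas above can be evaluated by hand. A small pure-Python sketch (stdlib only; the mixture parameters are hypothetical example values, not RAPTOR's fitted values):

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def bic(log_likelihood, n_params, n_samples):
    """BIC = k * log(n) - 2 * log(L-hat); lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Two equal-weight components at mu=0 and mu=4, unit variance
p = mixture_density(0.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
# p ≈ 0.1995: almost all mass comes from the component centered at 0
b = bic(log_likelihood=-120.0, n_params=5, n_samples=100)
```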
### Soft Assignment
GMM allows soft assignment, so a single chunk can belong to multiple clusters:
```
Chunk i belongs to Cluster j if P(j|xi) > threshold (0.1)
```
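The thresholding rule can be demonstrated with a hand-written probability matrix (the values below are made up for illustration; in RAPTOR the matrix comes from `gm.predict_proba`):

```python
# probs[i][j] = P(cluster j | chunk i); each row sums to 1
probs = [
    [0.95, 0.05],  # chunk 0: clearly cluster 0
    [0.50, 0.50],  # chunk 1: ambiguous, joins BOTH clusters
    [0.08, 0.92],  # chunk 2: clearly cluster 1
]
threshold = 0.1
clusters = [[] for _ in range(2)]
for i, row in enumerate(probs):
    for j, p in enumerate(row):
        if p > threshold:
            clusters[j].append(i)
# clusters == [[0, 1], [1, 2]] — chunk 1 appears in both
```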
---
## 3. UMAP (Dimensionality Reduction)
### File Location
```
/rag/raptor.py (lines 21, 186-190)
```
### Purpose
Reduces the dimensionality of embeddings before clustering to improve cluster quality.
### Implementation
```python
import numpy as np
import umap

def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray:
    """Reduce embedding dimensions using UMAP."""
    n_samples = len(embeddings)
    # Scale the neighborhood size sublinearly with the sample count
    n_neighbors = int((n_samples - 1) ** 0.8)
    # Target dimensions, bounded by the sample count
    n_components = min(12, n_samples - 2)
    reducer = umap.UMAP(
        n_neighbors=max(2, n_neighbors),
        n_components=n_components,
        metric="cosine",
        random_state=42,
    )
    return reducer.fit_transform(embeddings)
```
### UMAP Algorithm
```
UMAP (Uniform Manifold Approximation and Projection):
1. Build a high-dimensional graph:
   - Compute k-nearest neighbors
   - Create weighted edges based on distance
2. Build a low-dimensional representation:
   - Initialize randomly
   - Optimize the layout with a cross-entropy loss
   - Preserve local structure (neighbors stay neighbors)

Key idea: preserve topological structure, not absolute distances
```
### Parameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| n_neighbors | max(2, ⌊(n−1)^0.8⌋) | Local neighborhood size |
| n_components | min(12, n-2) | Output dimensions |
| metric | cosine | Distance metric |
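The adaptive parameters can be computed standalone. A small sketch of the sizing formulas above (`umap_params` is a hypothetical helper name, not RAGFlow's):

```python
def umap_params(n_samples):
    """Derive UMAP parameters from the corpus size, per the table above."""
    # Neighborhood grows sublinearly with corpus size, floored at 2
    n_neighbors = max(2, int((n_samples - 1) ** 0.8))
    # Never request more output dimensions than n_samples - 2
    n_components = min(12, n_samples - 2)
    return n_neighbors, n_components

small = umap_params(10)   # tiny corpus → (5, 8): dims capped by sample count
large = umap_params(500)  # larger corpus → (144, 12): dims capped at 12
```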
---
## 4. Silhouette Score
### File Location
```
/deepdoc/parser/pdf_parser.py (lines 37, 400, 1052)
```
### Purpose
Evaluates cluster quality to choose the optimal k for K-Means.
### Formula
```
Silhouette score:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
where:
- a(i) = mean distance from point i to the other points in its own cluster
- b(i) = mean distance from point i to the points in the nearest other cluster

Range: [−1, 1]
- s ≈ 1: point is well clustered
- s ≈ 0: point lies on a cluster boundary
- s < 0: point may be misclassified

Overall score = mean of s(i) over all points
```
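The formula can be verified by hand on a tiny 1-D dataset. A pure-Python sketch for a single point (the `silhouette_sample` helper is illustrative, not sklearn's API):

```python
def silhouette_sample(i, points, labels):
    """s(i) = (b - a) / max(a, b) for one point, 1-D Euclidean distance."""
    # a: mean distance to the other points in the same cluster
    same = [abs(points[i] - points[j]) for j in range(len(points))
            if labels[j] == labels[i] and j != i]
    a = sum(same) / len(same)
    # b: mean distance to the nearest OTHER cluster
    b = min(
        sum(abs(points[i] - points[j]) for j in range(len(points))
            if labels[j] == lab) / labels.count(lab)
        for lab in set(labels) - {labels[i]}
    )
    return (b - a) / max(a, b)

points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
s0 = silhouette_sample(0, points, labels)
# For point 0: a = 1.0, b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5 ≈ 0.905
```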
### Usage
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Find the optimal k (silhouette is only defined for k >= 2)
best_k = 1
best_score = -1
for k in range(2, max_clusters):
    km = KMeans(n_clusters=k)
    labels = km.fit_predict(data)
    score = silhouette_score(data, labels)
    if score > best_score:
        best_score = score
        best_k = k
```
---
## 5. Node2Vec (Graph Embedding)
### File Location
```
/graphrag/general/entity_embedding.py (lines 24-44)
```
### Purpose
Generates embeddings for graph nodes in the knowledge graph.
### Implementation
```python
from graspologic.embed import node2vec_embed

def embed_node2vec(graph, dimensions=1536, num_walks=10,
                   walk_length=40, window_size=2, iterations=3):
    """Generate node embeddings using the Node2Vec algorithm."""
    lcc_tensors, embedding = node2vec_embed(
        graph=graph,
        dimensions=dimensions,
        num_walks=num_walks,
        walk_length=walk_length,
        window_size=window_size,
        iterations=iterations,
        random_seed=86,
    )
    return embedding
```
### Node2Vec Algorithm
```
Node2Vec Algorithm:
1. Random walk generation:
   - For each node, perform biased random walks
   - The walk strategy is controlled by p (return) and q (in-out)
2. Skip-gram training:
   - Treat walks as sentences
   - Train a Word2Vec Skip-gram model
   - Node → embedding vector

Walk probabilities:
- p: return parameter (go back to the previous node)
- q: in-out parameter (explore vs. exploit)
- Low p, high q → BFS-like walks (local structure)
- High p, low q → DFS-like walks (global structure)
```
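Step 1, the biased walk, can be sketched in the stdlib alone. A minimal illustration of the p/q transition weights (the graph, helper names, and adjacency-dict representation are hypothetical; graspologic handles this internally):

```python
import random

def biased_next(graph, prev, curr, p, q, rng):
    """One Node2Vec step: weight each neighbor of curr by 1/p if it is
    the previous node, 1 if it also neighbors prev, else 1/q."""
    nbrs = graph[curr]
    weights = []
    for x in nbrs:
        if x == prev:
            weights.append(1.0 / p)   # return to the previous node
        elif x in graph[prev]:
            weights.append(1.0)       # stays at distance 1 from prev
        else:
            weights.append(1.0 / q)   # moves outward (explore)
    return rng.choices(nbrs, weights=weights)[0]

def node2vec_walk(graph, start, walk_length, p=1.0, q=1.0, seed=86):
    """Generate one biased random walk of the given length."""
    rng = random.Random(seed)
    walk = [start, rng.choice(graph[start])]
    while len(walk) < walk_length:
        walk.append(biased_next(graph, walk[-2], walk[-1], p, q, rng))
    return walk

# Toy undirected graph as an adjacency dict
graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
walk = node2vec_walk(graph, "a", walk_length=10)
# Every consecutive pair in the walk is an edge of the graph
```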
### Parameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| dimensions | 1536 | Embedding size |
| num_walks | 10 | Walks per node |
| walk_length | 40 | Steps per walk |
| window_size | 2 | Skip-gram window |
| iterations | 3 | Training iterations |
---
## Summary
| Algorithm | Purpose | Library |
|-----------|---------|---------|
| K-Means | PDF column detection | sklearn |
| GMM | RAPTOR clustering | sklearn |
| UMAP | Dimension reduction | umap-learn |
| Silhouette | Cluster validation | sklearn |
| Node2Vec | Graph embedding | graspologic |
## Related Files
- `/deepdoc/parser/pdf_parser.py` - K-Means, Silhouette
- `/rag/raptor.py` - GMM, UMAP
- `/graphrag/general/entity_embedding.py` - Node2Vec