# Clustering Algorithms ## Tong Quan RAGFlow sử dụng clustering algorithms cho PDF layout analysis và RAPTOR hierarchical summarization. ## 1. K-Means Clustering ### File Location ``` /deepdoc/parser/pdf_parser.py (lines 36, 394, 425, 1047-1055) ``` ### Purpose Phát hiện cột (columns) trong PDF layout bằng cách clustering text boxes theo X-coordinate. ### Implementation ```python from sklearn.cluster import KMeans def _assign_column(self): """ Detect columns using KMeans clustering on X coordinates. """ # Get X coordinates of text boxes x_coords = np.array([[b["x0"]] for b in self.bxs]) best_k = 1 best_score = -1 # Find optimal number of columns (1-5) for k in range(1, min(5, len(self.bxs))): if k >= len(self.bxs): break km = KMeans(n_clusters=k, random_state=42, n_init="auto") labels = km.fit_predict(x_coords) if k > 1: score = silhouette_score(x_coords, labels) if score > best_score: best_score = score best_k = k # Assign columns with optimal k km = KMeans(n_clusters=best_k, random_state=42, n_init="auto") labels = km.fit_predict(x_coords) for i, bx in enumerate(self.bxs): bx["col_id"] = labels[i] ``` ### Algorithm ``` K-Means Algorithm: 1. Initialize k centroids randomly 2. Repeat until convergence: a. Assign each point to nearest centroid b. Update centroids as mean of assigned points 3. Return cluster assignments Objective: minimize Σ ||xi - μci||² where μci is centroid of cluster containing xi ``` ### Parameters | Parameter | Value | Description | |-----------|-------|-------------| | n_clusters | 1-5 | Number of columns to detect | | n_init | "auto" | Initialization runs | | random_state | 42 | Reproducibility | --- ## 2. Gaussian Mixture Model (GMM) ### File Location ``` /rag/raptor.py (lines 22, 102-106, 195-199) ``` ### Purpose RAPTOR algorithm sử dụng GMM để cluster document chunks trước khi summarization. ### Implementation ```python from sklearn.mixture import GaussianMixture def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int): """ Find optimal number of clusters using BIC criterion. """ max_clusters = min(self._max_cluster, len(embeddings)) n_clusters = np.arange(1, max_clusters) bics = [] for n in n_clusters: gm = GaussianMixture( n_components=n, random_state=random_state, covariance_type='full' ) gm.fit(embeddings) bics.append(gm.bic(embeddings)) # Select cluster count with minimum BIC optimal_clusters = n_clusters[np.argmin(bics)] return optimal_clusters def _cluster_chunks(self, chunks, embeddings): """ Cluster chunks using GMM with soft assignments. """ # Reduce dimensions first reduced = self._reduce_dimensions(embeddings) # Find optimal k n_clusters = self._get_optimal_clusters(reduced, random_state=42) # Fit GMM gm = GaussianMixture(n_components=n_clusters, random_state=42) gm.fit(reduced) # Get soft assignments (probabilities) probs = gm.predict_proba(reduced) # Assign chunks to clusters with probability > threshold clusters = [[] for _ in range(n_clusters)] for i, prob in enumerate(probs): for j, p in enumerate(prob): if p > 0.1: # Threshold clusters[j].append(i) return clusters ``` ### GMM Formula ``` GMM Probability Density: p(x) = Σ πk × N(x | μk, Σk) where: - πk = mixture weight for component k - N(x | μk, Σk) = Gaussian distribution with mean μk and covariance Σk BIC (Bayesian Information Criterion): BIC = k × log(n) - 2 × log(L̂) where: - k = number of parameters - n = number of samples - L̂ = maximum likelihood ``` ### Soft Assignment GMM cho phép soft assignment (một chunk có thể thuộc nhiều clusters): ``` Chunk i belongs to Cluster j if P(j|xi) > threshold (0.1) ``` --- ## 3. UMAP (Dimensionality Reduction) ### File Location ``` /rag/raptor.py (lines 21, 186-190) ``` ### Purpose Giảm số chiều của embeddings trước khi clustering để improve cluster quality. ### Implementation ```python import umap def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray: """ Reduce embedding dimensions using UMAP. """ n_samples = len(embeddings) # Calculate neighbors based on sample size n_neighbors = int((n_samples - 1) ** 0.8) # Target dimensions n_components = min(12, n_samples - 2) reducer = umap.UMAP( n_neighbors=max(2, n_neighbors), n_components=n_components, metric="cosine", random_state=42 ) return reducer.fit_transform(embeddings) ``` ### UMAP Algorithm ``` UMAP (Uniform Manifold Approximation and Projection): 1. Build high-dimensional graph: - Compute k-nearest neighbors - Create weighted edges based on distance 2. Build low-dimensional representation: - Initialize randomly - Optimize layout using cross-entropy loss - Preserve local structure (neighbors stay neighbors) Key idea: Preserve topological structure, not absolute distances ``` ### Parameters | Parameter | Value | Description | |-----------|-------|-------------| | n_neighbors | (n-1)^0.8 | Local neighborhood size | | n_components | min(12, n-2) | Output dimensions | | metric | cosine | Distance metric | --- ## 4. Silhouette Score ### File Location ``` /deepdoc/parser/pdf_parser.py (lines 37, 400, 1052) ``` ### Purpose Đánh giá cluster quality để chọn optimal k cho K-Means. ### Formula ``` Silhouette Score: s(i) = (b(i) - a(i)) / max(a(i), b(i)) where: - a(i) = average distance to points in same cluster - b(i) = average distance to points in nearest other cluster Range: [-1, 1] - s ≈ 1: Point well-clustered - s ≈ 0: Point on boundary - s < 0: Point may be misclassified Overall score = mean(s(i)) for all points ``` ### Usage ```python from sklearn.metrics import silhouette_score # Find optimal k best_k = 1 best_score = -1 for k in range(2, max_clusters): km = KMeans(n_clusters=k) labels = km.fit_predict(data) score = silhouette_score(data, labels) if score > best_score: best_score = score best_k = k ``` --- ## 5. Node2Vec (Graph Embedding) ### File Location ``` /graphrag/general/entity_embedding.py (lines 24-44) ``` ### Purpose Generate embeddings cho graph nodes trong knowledge graph. ### Implementation ```python from graspologic.embed import node2vec_embed def embed_node2vec(graph, dimensions=1536, num_walks=10, walk_length=40, window_size=2, iterations=3): """ Generate node embeddings using Node2Vec algorithm. """ lcc_tensors, embedding = node2vec_embed( graph=graph, dimensions=dimensions, num_walks=num_walks, walk_length=walk_length, window_size=window_size, iterations=iterations, random_seed=86 ) return embedding ``` ### Node2Vec Algorithm ``` Node2Vec Algorithm: 1. Random Walk Generation: - For each node, perform biased random walks - Walk strategy controlled by p (return) and q (in-out) 2. Skip-gram Training: - Treat walks as sentences - Train Word2Vec Skip-gram model - Node → Embedding vector Walk probabilities: - p: Return parameter (go back to previous node) - q: In-out parameter (explore vs exploit) Low p, high q → BFS-like (local structure) High p, low q → DFS-like (global structure) ``` ### Parameters | Parameter | Value | Description | |-----------|-------|-------------| | dimensions | 1536 | Embedding size | | num_walks | 10 | Walks per node | | walk_length | 40 | Steps per walk | | window_size | 2 | Skip-gram window | | iterations | 3 | Training iterations | --- ## Summary | Algorithm | Purpose | Library | |-----------|---------|---------| | K-Means | PDF column detection | sklearn | | GMM | RAPTOR clustering | sklearn | | UMAP | Dimension reduction | umap-learn | | Silhouette | Cluster validation | sklearn | | Node2Vec | Graph embedding | graspologic | ## Related Files - `/deepdoc/parser/pdf_parser.py` - K-Means, Silhouette - `/rag/raptor.py` - GMM, UMAP - `/graphrag/general/entity_embedding.py` - Node2Vec