Graph Algorithms
Overview
RAGFlow uses graph algorithms for knowledge graph construction and GraphRAG retrieval.
1. PageRank Algorithm
File Location
/graphrag/entity_resolution.py (line 150)
/graphrag/general/index.py (line 460)
/graphrag/search.py (line 83)
Purpose
Computes an importance score for each entity in the knowledge graph.
Implementation
import networkx as nx

def compute_pagerank(graph):
    """
    Compute PageRank scores for all nodes.
    """
    pagerank = nx.pagerank(graph)
    return pagerank

# Usage in search ranking
def rank_entities(entities, pagerank_scores):
    """
    Rank entities by similarity * pagerank.
    """
    ranked = sorted(
        entities.items(),
        key=lambda x: x[1]["sim"] * x[1]["pagerank"],
        reverse=True
    )
    return ranked
PageRank Formula
PageRank Algorithm:
PR(u) = (1-d)/N + d × Σ PR(v)/L(v)
for all v linking to u
where:
- d = damping factor (typically 0.85)
- N = total number of nodes
- L(v) = number of outbound links from v
Iterative computation until convergence:
PR^(t+1)(u) = (1-d)/N + d × Σ PR^(t)(v)/L(v)
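For intuition, here is a minimal, self-contained sketch of this iteration on a toy directed graph. It is an illustration only (RAGFlow calls nx.pagerank directly), and it assumes the graph has no dangling nodes:

import networkx as nx

def pagerank_iterative(graph, d=0.85, tol=1e-8, max_iter=100):
    nodes = list(graph.nodes())
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}  # uniform start
    for _ in range(max_iter):
        new_pr = {}
        for u in nodes:
            # Sum PR(v)/L(v) over all v linking to u
            incoming = sum(
                pr[v] / graph.out_degree(v)
                for v in graph.predecessors(u)
            )
            new_pr[u] = (1 - d) / n + d * incoming
        if sum(abs(new_pr[u] - pr[u]) for u in nodes) < tol:
            break
        pr = new_pr
    return pr

g = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
print(pagerank_iterative(g))  # agrees closely with nx.pagerank(g)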
Usage in RAGFlow
# In GraphRAG search
def get_relevant_entities(query, graph):
    # 1. Get candidate entities by similarity
    candidates = vector_search(query)
    # 2. Compute PageRank
    pagerank = nx.pagerank(graph)
    # 3. Combine scores
    for entity in candidates:
        entity["final_score"] = entity["similarity"] * pagerank[entity["id"]]
    return sorted(candidates, key=lambda x: x["final_score"], reverse=True)
2. Leiden Community Detection
File Location
/graphrag/general/leiden.py (lines 72-141)
Purpose
Detects communities in the knowledge graph to organize entities into groups.
Implementation
from graspologic.partition import hierarchical_leiden
from graspologic.utils import largest_connected_component

def _compute_leiden_communities(graph, max_cluster_size=12, seed=0xDEADBEEF):
    """
    Compute hierarchical communities using the Leiden algorithm.
    """
    # Extract largest connected component
    lcc = largest_connected_component(graph)

    # Run hierarchical Leiden; it returns a flat list of per-node
    # cluster assignments, one entry per hierarchy level
    community_mapping = hierarchical_leiden(
        lcc,
        max_cluster_size=max_cluster_size,
        random_seed=seed
    )

    # Group node assignments by (level, community)
    results = {}
    for partition in community_mapping:
        key = (partition.level, partition.cluster)
        results.setdefault(key, {"nodes": [], "weight": 0})
        results[key]["nodes"].append(partition.node)

    # Weight each community by its nodes' rank * weight
    for info in results.values():
        info["weight"] = sum(
            graph.nodes[n].get("rank", 1) * graph.nodes[n].get("weight", 1)
            for n in info["nodes"]
        )
    return results
Leiden Algorithm
Leiden Algorithm (improvement over Louvain):
1. Local Moving Phase:
- Move nodes between communities to improve modularity
- Refined node movement to avoid poorly connected communities
2. Refinement Phase:
- Partition communities into smaller subcommunities
- Ensures well-connected communities
3. Aggregation Phase:
- Create aggregate graph with communities as nodes
- Repeat from step 1 until no improvement
Modularity:
Q = (1/2m) × Σ [Aij - (ki×kj)/(2m)] × δ(ci, cj)
where:
- Aij = edge weight between i and j
- ki = degree of node i
- m = total edge weight
- δ(ci, cj) = 1 if same community, 0 otherwise
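As a quick illustration of Q, the snippet below scores a hand-built partition with NetworkX's modularity helper. This is not part of RAGFlow (Leiden optimizes Q internally); it only shows what the formula measures:

import networkx as nx
from networkx.algorithms.community import modularity

# Two triangles joined by a single bridge edge
g = nx.Graph([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
])
partition = [{"a", "b", "c"}, {"x", "y", "z"}]
print(modularity(g, partition))  # ~0.357: two well-separated communities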
Hierarchical Leiden
Hierarchical Leiden:
- Recursively applies Leiden to each community
- Creates multi-level community structure
- Controlled by max_cluster_size parameter
Level 0: Root community (all nodes)
Level 1: First-level subcommunities
Level 2: Second-level subcommunities
...
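A hypothetical usage of the helper above (assumes graspologic is installed; the graph and printed fields are illustrative):

import networkx as nx

g = nx.karate_club_graph()
communities = _compute_leiden_communities(g, max_cluster_size=10)
for (level, community_id), info in sorted(communities.items()):
    # Nodes lacking "rank"/"weight" attributes default to 1
    print(level, community_id, len(info["nodes"]), info["weight"])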
3. Entity Extraction (LLM-based)
File Location
/graphrag/general/extractor.py
/graphrag/light/graph_extractor.py
Purpose
Extracts entities and relationships from text using an LLM.
Implementation
class GraphExtractor:
    DEFAULT_ENTITY_TYPES = [
        "organization", "person", "geo", "event", "category"
    ]

    async def _process_single_content(self, content, entity_types):
        """
        Extract entities from text using LLM with iterative gleaning.
        """
        # Initial extraction
        prompt = self._build_extraction_prompt(content, entity_types)
        result = await self._llm_chat(prompt)
        entities, relations = self._parse_result(result)

        # Iterative gleaning (up to 2 times)
        for _ in range(2):  # ENTITY_EXTRACTION_MAX_GLEANINGS
            glean_prompt = self._build_glean_prompt(result)
            glean_result = await self._llm_chat(glean_prompt)
            # Stop when the LLM reports no further entities
            if "NO" in glean_result.upper():
                break
            new_entities, new_relations = self._parse_result(glean_result)
            entities.extend(new_entities)
            relations.extend(new_relations)
        return entities, relations

    def _parse_result(self, result):
        """
        Parse LLM output into structured entities/relations.
        Entity format:   (entity_type, entity_name, description)
        Relation format: (source, target, relation_type, description)
        """
        entities = []
        relations = []
        for line in result.split("\n"):
            if line.startswith("(") and line.endswith(")"):
                # TUPLE_DELIMITER is a module-level constant, e.g. "<|>"
                parts = line[1:-1].split(TUPLE_DELIMITER)
                if len(parts) == 3:  # Entity
                    entities.append({
                        "type": parts[0],
                        "name": parts[1],
                        "description": parts[2]
                    })
                elif len(parts) == 4:  # Relation
                    relations.append({
                        "source": parts[0],
                        "target": parts[1],
                        "type": parts[2],
                        "description": parts[3]
                    })
        return entities, relations
Extraction Pipeline
Entity Extraction Pipeline:
1. Initial Extraction
└── LLM extracts entities using structured prompt
2. Iterative Gleaning (max 2 iterations)
├── Ask LLM if more entities exist
├── If YES: extract additional entities
└── If NO: stop gleaning
3. Relation Extraction
└── Extract relationships between entities
4. Graph Construction
└── Build NetworkX graph from entities/relations
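Step 4 of the pipeline can be sketched as follows. This is an illustrative helper, not RAGFlow's exact code; it assumes the dict shapes produced by _parse_result above:

import networkx as nx

def build_graph(entities, relations):
    g = nx.Graph()
    for e in entities:
        # First mention wins; later duplicates are handled by entity resolution
        if e["name"] not in g:
            g.add_node(e["name"], entity_type=e["type"],
                       description=e["description"])
    for r in relations:
        g.add_edge(r["source"], r["target"],
                   relation_type=r["type"], description=r["description"])
    return g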
4. Entity Resolution
File Location
/graphrag/entity_resolution.py
Purpose
Merges duplicate entities in the knowledge graph.
Implementation
import editdistance
import networkx as nx

class EntityResolution:
    def is_similarity(self, a: str, b: str) -> bool:
        """
        Check if two entity names are similar.
        """
        a, b = a.lower(), b.lower()
        # Chinese: character set intersection
        if self._is_chinese(a):
            a_set, b_set = set(a), set(b)
            max_len = max(len(a_set), len(b_set))
            overlap = len(a_set & b_set)
            return overlap / max_len >= 0.8
        # English: edit distance
        else:
            threshold = min(len(a), len(b)) // 2
            distance = editdistance.eval(a, b)
            return distance <= threshold

    async def resolve(self, graph):
        """
        Resolve duplicate entities in graph.
        """
        # 1. Find candidate pairs
        nodes = list(graph.nodes())
        candidates = []
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if self.is_similarity(a, b):
                    candidates.append((a, b))

        # 2. LLM verification (batch)
        confirmed_pairs = []
        for batch in self._batch(candidates, size=100):
            results = await self._llm_verify_batch(batch)
            confirmed_pairs.extend([
                pair for pair, is_same in zip(batch, results)
                if is_same
            ])

        # 3. Merge confirmed pairs
        for a, b in confirmed_pairs:
            self._merge_nodes(graph, a, b)

        # 4. Update PageRank
        pagerank = nx.pagerank(graph)
        for node in graph.nodes():
            graph.nodes[node]["pagerank"] = pagerank[node]
        return graph
Similarity Metrics
English Similarity (Edit Distance):
distance(a, b) ≤ min(len(a), len(b)) // 2
Example:
- "microsoft" vs "microsft" → distance=1 ≤ 4 → Similar
- "google" vs "apple" → distance=5 > 2 → Not similar
Chinese Similarity (Character Set):
|a ∩ b| / max(|a|, |b|) ≥ 0.8
Example:
- "北京大学" vs "北京大" → 3/4 = 0.75 → Not similar
- "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
5. Largest Connected Component (LCC)
File Location
/graphrag/general/leiden.py (line 67)
Purpose
Extracts the largest connected subgraph before community detection.
Implementation
import networkx as nx
from graspologic.utils import largest_connected_component

def _stabilize_graph(graph):
    """
    Extract and stabilize the largest connected component.
    """
    # Get LCC
    lcc = largest_connected_component(graph)

    # Rebuild with sorted nodes and edges so downstream algorithms
    # see a deterministic, reproducible ordering
    stable = nx.Graph()
    stable.add_nodes_from(sorted(lcc.nodes(data=True)))
    stable.add_edges_from(sorted(lcc.edges(data=True)))
    return stable
LCC Algorithm
LCC Algorithm:
1. Find all connected components using BFS/DFS
2. Select component with maximum number of nodes
3. Return subgraph of that component
Complexity: O(V + E)
where V = vertices, E = edges
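For reference, the same idea in plain NetworkX (graspologic's largest_connected_component wraps an equivalent computation):

import networkx as nx

def largest_cc(graph):
    # connected_components traverses the graph once: O(V + E)
    biggest = max(nx.connected_components(graph), key=len)
    return graph.subgraph(biggest).copy()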
6. N-hop Path Scoring
File Location
/graphrag/search.py (lines 181-187)
Purpose
Scores entities based on path distance in the graph.
Implementation
def compute_nhop_scores(entity, neighbors, n_hops=2):
    """
    Score entities based on graph distance.
    """
    nhop_scores = {}
    for neighbor in neighbors:
        path = neighbor["path"]
        weights = neighbor["weights"]  # edge weights; unused in this simplified scoring
        for i in range(len(path) - 1):
            source, target = path[i], path[i + 1]
            # Decay by distance
            score = entity["sim"] / (2 + i)
            if (source, target) in nhop_scores:
                nhop_scores[(source, target)]["sim"] += score
            else:
                nhop_scores[(source, target)] = {"sim": score}
    return nhop_scores
Scoring Formula
N-hop Score with Decay:
score(e, path_i) = similarity(e) / (2 + distance_i)
where:
- distance_i = number of hops from source entity
- 2 = base offset (avoids division by zero at distance 0 and dampens all path scores)
Total score = Σ score(e, path_i) for all paths
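A toy walk-through of the decay, using the function above (the entity and neighbor dict shapes are hypothetical but match that code):

entity = {"sim": 0.9}
neighbors = [{"path": ["A", "B", "C"], "weights": [1.0, 1.0]}]

scores = compute_nhop_scores(entity, neighbors)
# Edge (A, B): 0.9 / (2 + 0) = 0.45
# Edge (B, C): 0.9 / (2 + 1) = 0.30
print(scores)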
Summary
| Algorithm | Purpose | Library |
|---|---|---|
| PageRank | Entity importance | NetworkX |
| Leiden | Community detection | graspologic |
| Entity Extraction | KG construction | LLM |
| Entity Resolution | Deduplication | editdistance + LLM |
| LCC | Graph preprocessing | graspologic |
| N-hop Scoring | Path-based ranking | Custom |
Related Files
/graphrag/entity_resolution.py - Entity resolution
/graphrag/general/leiden.py - Community detection
/graphrag/general/extractor.py - Entity extraction
/graphrag/search.py - Graph search