# Graph Algorithms

## Overview

RAGFlow uses graph algorithms for knowledge graph construction and GraphRAG retrieval.

## 1. PageRank Algorithm

### File Location

```
/graphrag/entity_resolution.py (line 150)
/graphrag/general/index.py (line 460)
/graphrag/search.py (line 83)
```

### Purpose

Computes importance scores for entities in the knowledge graph.

### Implementation

```python
import networkx as nx

def compute_pagerank(graph):
    """
    Compute PageRank scores for all nodes.
    """
    pagerank = nx.pagerank(graph)
    return pagerank

# Usage in search ranking
def rank_entities(entities, pagerank_scores):
    """
    Rank entities by similarity * pagerank.
    """
    ranked = sorted(
        entities.items(),
        key=lambda x: x[1]["sim"] * pagerank_scores[x[0]],
        reverse=True
    )
    return ranked
```

### PageRank Formula

```
PageRank Algorithm:

PR(u) = (1-d)/N + d × Σ PR(v)/L(v)
        for all v linking to u

where:
- d = damping factor (typically 0.85)
- N = total number of nodes
- L(v) = number of outbound links from v

Iterative computation until convergence:
PR^(t+1)(u) = (1-d)/N + d × Σ PR^(t)(v)/L(v)
```
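
For intuition, here is a minimal power-iteration sketch of the update rule above. It is illustrative only; RAGFlow simply calls `nx.pagerank`, which additionally redistributes mass from dangling nodes:

```python
import networkx as nx

def pagerank_power_iteration(graph: nx.DiGraph, d=0.85, tol=1e-6, max_iter=100):
    nodes = list(graph.nodes())
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}  # uniform initialization

    for _ in range(max_iter):
        new_pr = {}
        for u in nodes:
            # Σ PR(v)/L(v) over all v linking to u
            incoming = sum(
                pr[v] / graph.out_degree(v)
                for v in graph.predecessors(u)
                if graph.out_degree(v) > 0
            )
            new_pr[u] = (1 - d) / n + d * incoming
        if sum(abs(new_pr[u] - pr[u]) for u in nodes) < tol:  # converged
            return new_pr
        pr = new_pr
    return pr

g = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
print(pagerank_power_iteration(g))  # close to nx.pagerank(g)
```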

### Usage in RAGFlow

```python
# In GraphRAG search
def get_relevant_entities(query, graph):
    # 1. Get candidate entities by similarity
    candidates = vector_search(query)

    # 2. Compute PageRank
    pagerank = nx.pagerank(graph)

    # 3. Combine scores
    for entity in candidates:
        entity["final_score"] = entity["similarity"] * pagerank[entity["id"]]

    return sorted(candidates, key=lambda x: x["final_score"], reverse=True)
```

---

## 2. Leiden Community Detection

### File Location

```
/graphrag/general/leiden.py (lines 72-141)
```

### Purpose

Detects communities in the knowledge graph to organize entities into groups.

### Implementation

```python
from collections import defaultdict

from graspologic.partition import hierarchical_leiden
from graspologic.utils import largest_connected_component

def _compute_leiden_communities(graph, max_cluster_size=12, seed=0xDEADBEEF):
    """
    Compute hierarchical communities using the Leiden algorithm.
    """
    # Extract largest connected component
    lcc = largest_connected_component(graph)

    # Run hierarchical Leiden; returns a flat list of per-node
    # cluster assignments carrying level/cluster/node fields
    community_mapping = hierarchical_leiden(
        lcc,
        max_cluster_size=max_cluster_size,
        random_seed=seed
    )

    # Group nodes by (level, community)
    members = defaultdict(list)
    for partition in community_mapping:
        members[(partition.level, partition.cluster)].append(partition.node)

    # Weight each community by its members' rank and weight
    results = {}
    for (level, community_id), nodes in members.items():
        weight = sum(
            graph.nodes[n].get("rank", 1) *
            graph.nodes[n].get("weight", 1)
            for n in nodes
        )
        results[(level, community_id)] = {
            "nodes": nodes,
            "weight": weight
        }

    return results
```

### Leiden Algorithm

```
Leiden Algorithm (improvement over Louvain):

1. Local Moving Phase:
   - Move nodes between communities to improve modularity
   - Refined node movement avoids poorly connected communities

2. Refinement Phase:
   - Partition communities into smaller subcommunities
   - Ensures well-connected communities

3. Aggregation Phase:
   - Create aggregate graph with communities as nodes
   - Repeat from step 1 until no improvement

Modularity:
Q = (1/2m) × Σ [Aij - (ki×kj)/(2m)] × δ(ci, cj)

where:
- Aij = edge weight between i and j
- ki = degree of node i
- m = total edge weight
- δ(ci, cj) = 1 if same community, 0 otherwise
```
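
To see the modularity definition in action, here is a quick check with NetworkX's built-in implementation (illustrative; not part of RAGFlow):

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Zachary's karate club with its two known factions as communities
g = nx.karate_club_graph()
communities = [
    {n for n in g if g.nodes[n]["club"] == "Mr. Hi"},
    {n for n in g if g.nodes[n]["club"] == "Officer"},
]

# modularity() implements the Q formula above
q = modularity(g, communities)
print(f"Q = {q:.3f}")  # higher Q means better-separated communities
```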

### Hierarchical Leiden

```
Hierarchical Leiden:
- Recursively applies Leiden to each community
- Creates multi-level community structure
- Controlled by max_cluster_size parameter

Level 0: Root community (all nodes)
Level 1: First-level subcommunities
Level 2: Second-level subcommunities
...
```
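
A sketch of inspecting the resulting level structure, assuming the `_compute_leiden_communities` function above and any NetworkX graph `graph`:

```python
communities = _compute_leiden_communities(graph, max_cluster_size=12)

# Walk communities level by level
for (level, community_id), info in sorted(communities.items()):
    print(f"level={level} community={community_id} "
          f"size={len(info['nodes'])} weight={info['weight']:.1f}")
```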

---

## 3. Entity Extraction (LLM-based)

### File Location

```
/graphrag/general/extractor.py
/graphrag/light/graph_extractor.py
```

### Purpose

Extracts entities and relationships from text using an LLM.

### Implementation

```python
class GraphExtractor:
    DEFAULT_ENTITY_TYPES = [
        "organization", "person", "geo", "event", "category"
    ]

    async def _process_single_content(self, content, entity_types):
        """
        Extract entities from text using LLM with iterative gleaning.
        """
        # Initial extraction
        prompt = self._build_extraction_prompt(content, entity_types)
        result = await self._llm_chat(prompt)

        entities, relations = self._parse_result(result)

        # Iterative gleaning (up to 2 times)
        for _ in range(2):  # ENTITY_EXTRACTION_MAX_GLEANINGS
            glean_prompt = self._build_glean_prompt(result)
            glean_result = await self._llm_chat(glean_prompt)

            # Check if more entities were found
            if "NO" in glean_result.upper():
                break

            new_entities, new_relations = self._parse_result(glean_result)
            entities.extend(new_entities)
            relations.extend(new_relations)

        return entities, relations

    def _parse_result(self, result):
        """
        Parse LLM output into structured entities/relations.

        Entity format:   (entity_type, entity_name, description)
        Relation format: (source, target, relation_type, description)
        """
        entities = []
        relations = []

        for line in result.split("\n"):
            if line.startswith("(") and line.endswith(")"):
                # TUPLE_DELIMITER is the field separator defined by the
                # extraction prompt templates
                parts = line[1:-1].split(TUPLE_DELIMITER)
                if len(parts) == 3:  # Entity
                    entities.append({
                        "type": parts[0],
                        "name": parts[1],
                        "description": parts[2]
                    })
                elif len(parts) == 4:  # Relation
                    relations.append({
                        "source": parts[0],
                        "target": parts[1],
                        "type": parts[2],
                        "description": parts[3]
                    })

        return entities, relations
```
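
For concreteness, a hypothetical example of the tuple format `_parse_result` expects, assuming the delimiter is `<|>` (the actual delimiter is set by the prompt templates):

```python
TUPLE_DELIMITER = "<|>"  # assumed value for illustration

line = "(person<|>Ada Lovelace<|>Early computing pioneer)"
parts = line[1:-1].split(TUPLE_DELIMITER)
print(parts)  # ['person', 'Ada Lovelace', 'Early computing pioneer'] → 3 parts = entity
```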

### Extraction Pipeline

```
Entity Extraction Pipeline:

1. Initial Extraction
   └── LLM extracts entities using structured prompt

2. Iterative Gleaning (max 2 iterations)
   ├── Ask LLM if more entities exist
   ├── If YES: extract additional entities
   └── If NO: stop gleaning

3. Relation Extraction
   └── Extract relationships between entities

4. Graph Construction
   └── Build NetworkX graph from entities/relations
```

---

## 4. Entity Resolution

### File Location

```
/graphrag/entity_resolution.py
```

### Purpose

Merges duplicate entities in the knowledge graph.

### Implementation

```python
import editdistance
import networkx as nx

class EntityResolution:
    def is_similarity(self, a: str, b: str) -> bool:
        """
        Check if two entity names are similar.
        """
        a, b = a.lower(), b.lower()

        # Chinese: character set intersection
        if self._is_chinese(a):
            a_set, b_set = set(a), set(b)
            max_len = max(len(a_set), len(b_set))
            overlap = len(a_set & b_set)
            return overlap / max_len >= 0.8

        # English: edit distance
        else:
            threshold = min(len(a), len(b)) // 2
            distance = editdistance.eval(a, b)
            return distance <= threshold

    async def resolve(self, graph):
        """
        Resolve duplicate entities in the graph.
        """
        # 1. Find candidate pairs
        nodes = list(graph.nodes())
        candidates = []

        for i, a in enumerate(nodes):
            for b in nodes[i+1:]:
                if self.is_similarity(a, b):
                    candidates.append((a, b))

        # 2. LLM verification (batched)
        confirmed_pairs = []
        for batch in self._batch(candidates, size=100):
            results = await self._llm_verify_batch(batch)
            confirmed_pairs.extend([
                pair for pair, is_same in zip(batch, results)
                if is_same
            ])

        # 3. Merge confirmed pairs
        for a, b in confirmed_pairs:
            self._merge_nodes(graph, a, b)

        # 4. Update PageRank
        pagerank = nx.pagerank(graph)
        for node in graph.nodes():
            graph.nodes[node]["pagerank"] = pagerank[node]

        return graph
```

### Similarity Metrics

```
English Similarity (Edit Distance):
distance(a, b) ≤ min(len(a), len(b)) // 2

Example:
- "microsoft" vs "microsft" → distance=1 ≤ 4 → Similar
- "google" vs "apple" → distance=4 > 2 → Not similar

Chinese Similarity (Character Set):
|a ∩ b| / max(|a|, |b|) ≥ 0.8

Example:
- "北京大学" vs "北京大" → 3/4 = 0.75 → Not similar
- "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
```
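
A quick sanity check of both thresholds (illustrative only):

```python
import editdistance

def is_similar_en(a: str, b: str) -> bool:
    return editdistance.eval(a, b) <= min(len(a), len(b)) // 2

def is_similar_zh(a: str, b: str) -> bool:
    a_set, b_set = set(a), set(b)
    return len(a_set & b_set) / max(len(a_set), len(b_set)) >= 0.8

print(is_similar_en("microsoft", "microsft"))  # True:  distance 1 <= 8 // 2
print(is_similar_en("google", "apple"))        # False: distance 4 > 5 // 2
print(is_similar_zh("北京大学", "北京大"))        # False: 3/4 = 0.75 < 0.8
```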

---

## 5. Largest Connected Component (LCC)

### File Location

```
/graphrag/general/leiden.py (line 67)
```

### Purpose

Extracts the largest connected subgraph before community detection.

### Implementation

```python
from graspologic.utils import largest_connected_component

def _stabilize_graph(graph):
    """
    Extract and stabilize the largest connected component.
    """
    # Get LCC
    lcc = largest_connected_component(graph)

    # Sort nodes for reproducibility
    sorted_nodes = sorted(lcc.nodes())
    sorted_graph = lcc.subgraph(sorted_nodes).copy()

    return sorted_graph
```

### LCC Algorithm

```
LCC Algorithm:

1. Find all connected components using BFS/DFS
2. Select component with maximum number of nodes
3. Return subgraph of that component

Complexity: O(V + E)
where V = vertices, E = edges
```
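
The same extraction with plain NetworkX, as a minimal sketch of what the graspologic helper does conceptually:

```python
import networkx as nx

def lcc_networkx(graph: nx.Graph) -> nx.Graph:
    # nx.connected_components traverses the graph with BFS: O(V + E)
    largest = max(nx.connected_components(graph), key=len)
    return graph.subgraph(largest).copy()
```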

---

## 6. N-hop Path Scoring

### File Location

```
/graphrag/search.py (lines 181-187)
```

### Purpose

Scores entities based on path distance in the graph.

### Implementation

```python
def compute_nhop_scores(entity, neighbors, n_hops=2):
    """
    Score entities based on graph distance.
    """
    nhop_scores = {}

    for neighbor in neighbors:
        path = neighbor["path"]
        weights = neighbor["weights"]  # edge weights along the path

        for i in range(len(path) - 1):
            source, target = path[i], path[i + 1]

            # Decay by distance
            score = entity["sim"] / (2 + i)

            if (source, target) in nhop_scores:
                nhop_scores[(source, target)]["sim"] += score
            else:
                nhop_scores[(source, target)] = {"sim": score}

    return nhop_scores
```

### Scoring Formula

```
N-hop Score with Decay:

score(e, path_i) = similarity(e) / (2 + distance_i)

where:
- distance_i = number of hops from the source entity
- 2 = base constant to prevent division issues

Total score = Σ score(e, path_i) for all paths
```
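
A worked example with hypothetical inputs, using `compute_nhop_scores` from above:

```python
entity = {"sim": 0.9}
neighbors = [{"path": ["A", "B", "C"], "weights": [1.0, 1.0]}]

scores = compute_nhop_scores(entity, neighbors)
# hop 0 (A→B): 0.9 / (2 + 0) = 0.45
# hop 1 (B→C): 0.9 / (2 + 1) ≈ 0.30
print(scores)  # {('A', 'B'): {'sim': 0.45}, ('B', 'C'): {'sim': 0.3}}
```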

---

## Summary

| Algorithm | Purpose | Library |
|-----------|---------|---------|
| PageRank | Entity importance | NetworkX |
| Leiden | Community detection | graspologic |
| Entity Extraction | KG construction | LLM |
| Entity Resolution | Deduplication | editdistance + LLM |
| LCC | Graph preprocessing | graspologic |
| N-hop Scoring | Path-based ranking | Custom |

## Related Files

- `/graphrag/entity_resolution.py` - Entity resolution
- `/graphrag/general/leiden.py` - Community detection
- `/graphrag/general/extractor.py` - Entity extraction
- `/graphrag/search.py` - Graph search