# Graph Algorithms
## Overview
RAGFlow uses graph algorithms for knowledge graph construction and GraphRAG retrieval.
## 1. PageRank Algorithm
### File Location
```
/graphrag/entity_resolution.py (line 150)
/graphrag/general/index.py (line 460)
/graphrag/search.py (line 83)
```
### Purpose
Compute an importance score for each entity in the knowledge graph.
### Implementation
```python
import networkx as nx

def compute_pagerank(graph):
    """
    Compute PageRank scores for all nodes.
    """
    pagerank = nx.pagerank(graph)
    return pagerank

# Usage in search ranking
def rank_entities(entities, pagerank_scores):
    """
    Rank entities by similarity * pagerank.
    """
    ranked = sorted(
        entities.items(),
        key=lambda x: x[1]["sim"] * x[1]["pagerank"],
        reverse=True
    )
    return ranked
```
### PageRank Formula
```
PageRank Algorithm:
PR(u) = (1-d)/N + d × Σ PR(v)/L(v)
for all v linking to u
where:
- d = damping factor (typically 0.85)
- N = total number of nodes
- L(v) = number of outbound links from v
Iterative computation until convergence:
PR^(t+1)(u) = (1-d)/N + d × Σ PR^(t)(v)/L(v)
```
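To make the iterative update concrete, here is a minimal power-iteration sketch; the function name, damping default, and tolerance are illustrative and not taken from RAGFlow (which calls `nx.pagerank` directly, as shown above).

```python
import networkx as nx

def pagerank_power_iteration(graph, d=0.85, tol=1e-6, max_iter=100):
    """Iterate PR^(t+1)(u) = (1-d)/N + d * sum(PR^(t)(v)/L(v)) until convergence."""
    nodes = list(graph.nodes())
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}  # uniform initialization
    for _ in range(max_iter):
        new_pr = {}
        for u in nodes:
            # Sum contributions from every predecessor v that links to u
            incoming = sum(
                pr[v] / max(graph.out_degree(v), 1)
                for v in graph.predecessors(u)
            )
            new_pr[u] = (1 - d) / n + d * incoming
        # Stop once scores change less than the tolerance
        if max(abs(new_pr[u] - pr[u]) for u in nodes) < tol:
            return new_pr
        pr = new_pr
    return pr

# Example on a tiny directed graph
g = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")])
print(pagerank_power_iteration(g))
```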
### Usage in RAGFlow
```python
# In GraphRAG search
def get_relevant_entities(query, graph):
    # 1. Get candidate entities by similarity
    candidates = vector_search(query)

    # 2. Compute PageRank
    pagerank = nx.pagerank(graph)

    # 3. Combine scores
    for entity in candidates:
        entity["final_score"] = entity["similarity"] * pagerank[entity["id"]]

    return sorted(candidates, key=lambda x: x["final_score"], reverse=True)
```
---
## 2. Leiden Community Detection
### File Location
```
/graphrag/general/leiden.py (lines 72-141)
```
### Purpose
Detect communities in the knowledge graph in order to organize entities into groups.
### Implementation
```python
from collections import defaultdict

from graspologic.partition import hierarchical_leiden
from graspologic.utils import largest_connected_component

def _compute_leiden_communities(graph, max_cluster_size=12, seed=0xDEADBEEF):
    """
    Compute hierarchical communities using the Leiden algorithm.
    """
    # Extract the largest connected component
    lcc = largest_connected_component(graph)

    # Run hierarchical Leiden; the result is a flat sequence of
    # partition records carrying (node, cluster, level, ...)
    community_mapping = hierarchical_leiden(
        lcc,
        max_cluster_size=max_cluster_size,
        random_seed=seed
    )

    # Group member nodes by (level, community_id)
    members = defaultdict(list)
    for partition in community_mapping:
        members[(partition.level, partition.cluster)].append(partition.node)

    # Calculate a weight for each community from node rank * weight
    results = {}
    for (level, community_id), nodes in members.items():
        weight = sum(
            graph.nodes[n].get("rank", 1) * graph.nodes[n].get("weight", 1)
            for n in nodes
        )
        results[(level, community_id)] = {"nodes": nodes, "weight": weight}
    return results
```
### Leiden Algorithm
```
Leiden Algorithm (improvement over Louvain):
1. Local Moving Phase:
- Move nodes between communities to improve modularity
- Refined node movement to avoid poorly connected communities
2. Refinement Phase:
- Partition communities into smaller subcommunities
- Ensures well-connected communities
3. Aggregation Phase:
- Create aggregate graph with communities as nodes
- Repeat from step 1 until no improvement
Modularity:
Q = (1/2m) × Σ [Aij - (ki×kj)/(2m)] × δ(ci, cj)
where:
- Aij = edge weight between i and j
- ki = degree of node i
- m = total edge weight
- δ(ci, cj) = 1 if same community, 0 otherwise
```
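To sanity-check a partition against this modularity definition, NetworkX ships a ready-made `modularity` function; the toy graph below is made up purely for illustration.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Hypothetical toy graph with two obvious communities joined by one bridge edge
g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # community 1
                  ("x", "y"), ("y", "z"), ("x", "z"),   # community 2
                  ("c", "x")])                           # bridge

communities = [{"a", "b", "c"}, {"x", "y", "z"}]
print(modularity(g, communities))  # high value: the partition matches the structure
```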
### Hierarchical Leiden
```
Hierarchical Leiden:
- Recursively applies Leiden to each community
- Creates multi-level community structure
- Controlled by max_cluster_size parameter
Level 0: Root community (all nodes)
Level 1: First-level subcommunities
Level 2: Second-level subcommunities
...
```
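A small usage sketch of the multi-level output, assuming the `_compute_leiden_communities` helper above returns `(level, community_id) → {"nodes", "weight"}` entries; the summary function here is illustrative only.

```python
from collections import defaultdict

def summarize_levels(results):
    """Group (level, community_id) -> {'nodes', 'weight'} results by level."""
    by_level = defaultdict(list)
    for (level, community_id), info in results.items():
        by_level[level].append((community_id, len(info["nodes"]), info["weight"]))
    for level in sorted(by_level):
        comms = sorted(by_level[level], key=lambda c: c[2], reverse=True)
        print(f"Level {level}: {len(comms)} communities, "
              f"largest weight={comms[0][2]:.1f}")
```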
---
## 3. Entity Extraction (LLM-based)
### File Location
```
/graphrag/general/extractor.py
/graphrag/light/graph_extractor.py
```
### Purpose
Extract entities and relationships from text using an LLM.
### Implementation
```python
class GraphExtractor:
    DEFAULT_ENTITY_TYPES = [
        "organization", "person", "geo", "event", "category"
    ]

    async def _process_single_content(self, content, entity_types):
        """
        Extract entities from text using LLM with iterative gleaning.
        """
        # Initial extraction
        prompt = self._build_extraction_prompt(content, entity_types)
        result = await self._llm_chat(prompt)
        entities, relations = self._parse_result(result)

        # Iterative gleaning (up to 2 times)
        for _ in range(2):  # ENTITY_EXTRACTION_MAX_GLEANINGS
            glean_prompt = self._build_glean_prompt(result)
            glean_result = await self._llm_chat(glean_prompt)
            # Check if more entities were found
            if "NO" in glean_result.upper():
                break
            new_entities, new_relations = self._parse_result(glean_result)
            entities.extend(new_entities)
            relations.extend(new_relations)

        return entities, relations

    def _parse_result(self, result):
        """
        Parse LLM output into structured entities/relations.
        Format: (entity_type, entity_name, description)
        Format: (source, target, relation_type, description)
        """
        entities = []
        relations = []
        for line in result.split("\n"):
            if line.startswith("(") and line.endswith(")"):
                parts = line[1:-1].split(TUPLE_DELIMITER)
                if len(parts) == 3:  # Entity
                    entities.append({
                        "type": parts[0],
                        "name": parts[1],
                        "description": parts[2]
                    })
                elif len(parts) == 4:  # Relation
                    relations.append({
                        "source": parts[0],
                        "target": parts[1],
                        "type": parts[2],
                        "description": parts[3]
                    })
        return entities, relations
```
### Extraction Pipeline
```
Entity Extraction Pipeline:
1. Initial Extraction
└── LLM extracts entities using structured prompt
2. Iterative Gleaning (max 2 iterations)
├── Ask LLM if more entities exist
├── If YES: extract additional entities
└── If NO: stop gleaning
3. Relation Extraction
└── Extract relationships between entities
4. Graph Construction
└── Build NetworkX graph from entities/relations (see the sketch below)
```
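A minimal sketch of step 4 (graph construction), assuming the entity/relation dicts produced by `_parse_result` above; the attribute names (`entity_type`, `description`, `relation_type`) are illustrative rather than the exact schema RAGFlow persists.

```python
import networkx as nx

def build_graph(entities, relations):
    """Build an undirected NetworkX graph from extracted entities/relations."""
    graph = nx.Graph()
    for ent in entities:
        # Merge descriptions if the same entity was extracted more than once
        if graph.has_node(ent["name"]):
            graph.nodes[ent["name"]]["description"] += "\n" + ent["description"]
        else:
            graph.add_node(ent["name"], entity_type=ent["type"],
                           description=ent["description"])
    for rel in relations:
        graph.add_edge(rel["source"], rel["target"],
                       relation_type=rel["type"],
                       description=rel["description"])
    return graph
```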
---
## 4. Entity Resolution
### File Location
```
/graphrag/entity_resolution.py
```
### Purpose
Merge duplicate entities in the knowledge graph.
### Implementation
```python
import editdistance
import networkx as nx

class EntityResolution:
    def is_similarity(self, a: str, b: str) -> bool:
        """
        Check if two entity names are similar.
        """
        a, b = a.lower(), b.lower()

        # Chinese: character set intersection
        if self._is_chinese(a):
            a_set, b_set = set(a), set(b)
            max_len = max(len(a_set), len(b_set))
            overlap = len(a_set & b_set)
            return overlap / max_len >= 0.8

        # English: edit distance
        else:
            threshold = min(len(a), len(b)) // 2
            distance = editdistance.eval(a, b)
            return distance <= threshold

    async def resolve(self, graph):
        """
        Resolve duplicate entities in graph.
        """
        # 1. Find candidate pairs
        nodes = list(graph.nodes())
        candidates = []
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if self.is_similarity(a, b):
                    candidates.append((a, b))

        # 2. LLM verification (batch)
        confirmed_pairs = []
        for batch in self._batch(candidates, size=100):
            results = await self._llm_verify_batch(batch)
            confirmed_pairs.extend([
                pair for pair, is_same in zip(batch, results)
                if is_same
            ])

        # 3. Merge confirmed pairs
        for a, b in confirmed_pairs:
            self._merge_nodes(graph, a, b)

        # 4. Update PageRank
        pagerank = nx.pagerank(graph)
        for node in graph.nodes():
            graph.nodes[node]["pagerank"] = pagerank[node]

        return graph
```
### Similarity Metrics
```
English Similarity (Edit Distance):
distance(a, b) ≤ min(len(a), len(b)) // 2
Example:
- "microsoft" vs "microsft" → distance=1 ≤ 4 → Similar
- "google" vs "apple" → distance=4 > 2 → Not similar

Chinese Similarity (Character Set):
|a ∩ b| / max(|a|, |b|) ≥ 0.8
Example:
- "北京大学" vs "北京大" → 3/4 = 0.75 → Not similar
- "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
```
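The English rule can be checked directly with the `editdistance` package; a small demo reproducing the two examples above (the helper name is ours, not RAGFlow's):

```python
import editdistance

def is_similar_english(a: str, b: str) -> bool:
    """Same rule as above: edit distance <= min length // 2."""
    return editdistance.eval(a.lower(), b.lower()) <= min(len(a), len(b)) // 2

print(is_similar_english("microsoft", "microsft"))  # True  (distance 1 <= 4)
print(is_similar_english("google", "apple"))        # False (distance 4 >  2)
```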
---
## 5. Largest Connected Component (LCC)
### File Location
```
/graphrag/general/leiden.py (line 67)
```
### Purpose
Extract the largest connected subgraph before running community detection.
### Implementation
```python
from graspologic.utils import largest_connected_component

def _stabilize_graph(graph):
    """
    Extract and stabilize the largest connected component.
    """
    # Get LCC
    lcc = largest_connected_component(graph)

    # Sort nodes for reproducibility
    sorted_nodes = sorted(lcc.nodes())
    sorted_graph = lcc.subgraph(sorted_nodes).copy()
    return sorted_graph
```
### LCC Algorithm
```
LCC Algorithm:
1. Find all connected components using BFS/DFS
2. Select component with maximum number of nodes
3. Return subgraph of that component
Complexity: O(V + E)
where V = vertices, E = edges
```
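The same extraction can be done with plain NetworkX when graspologic is not at hand; a minimal sketch (the function name is ours):

```python
import networkx as nx

def largest_connected_subgraph(graph):
    """Find all connected components (BFS under the hood) and keep the biggest."""
    biggest = max(nx.connected_components(graph), key=len)  # set of node ids
    return graph.subgraph(biggest).copy()

# Example: two components of sizes 3 and 2
g = nx.Graph([("a", "b"), ("b", "c"), ("x", "y")])
print(largest_connected_subgraph(g).nodes())  # a, b, c
```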
---
## 6. N-hop Path Scoring
### File Location
```
/graphrag/search.py (lines 181-187)
```
### Purpose
Score entities based on their path distance in the graph.
### Implementation
```python
def compute_nhop_scores(entity, neighbors, n_hops=2):
    """
    Score entities based on graph distance.
    """
    nhop_scores = {}
    for neighbor in neighbors:
        path = neighbor["path"]
        weights = neighbor["weights"]
        for i in range(len(path) - 1):
            source, target = path[i], path[i + 1]
            # Decay by distance
            score = entity["sim"] / (2 + i)
            if (source, target) in nhop_scores:
                nhop_scores[(source, target)]["sim"] += score
            else:
                nhop_scores[(source, target)] = {"sim": score}
    return nhop_scores
```
### Scoring Formula
```
N-hop Score with Decay:
score(e, path_i) = similarity(e) / (2 + distance_i)
where:
- distance_i = number of hops from source entity
- 2 = base offset, so the first hop is already down-weighted and the denominator is never zero
Total score = Σ score(e, path_i) for all paths
```
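A quick usage sketch of `compute_nhop_scores` with made-up data, showing the decay: an edge on the first hop receives sim/2, the next hop sim/3.

```python
entity = {"sim": 0.9}
neighbors = [
    {"path": ["A", "B", "C"], "weights": [1.0, 1.0]},  # A->B (hop 0), B->C (hop 1)
]
scores = compute_nhop_scores(entity, neighbors)
print(scores[("A", "B")]["sim"])  # 0.45  (0.9 / 2)
print(scores[("B", "C")]["sim"])  # 0.30  (0.9 / 3)
```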
---
## Summary
| Algorithm | Purpose | Library |
|-----------|---------|---------|
| PageRank | Entity importance | NetworkX |
| Leiden | Community detection | graspologic |
| Entity Extraction | KG construction | LLM |
| Entity Resolution | Deduplication | editdistance + LLM |
| LCC | Graph preprocessing | graspologic |
| N-hop Scoring | Path-based ranking | Custom |
## Related Files
- `/graphrag/entity_resolution.py` - Entity resolution
- `/graphrag/general/leiden.py` - Community detection
- `/graphrag/general/extractor.py` - Entity extraction
- `/graphrag/search.py` - Graph search