
Graph Algorithms

Overview

RAGFlow uses graph algorithms for knowledge graph construction and GraphRAG retrieval.

1. PageRank Algorithm

File Location

/graphrag/entity_resolution.py (line 150)
/graphrag/general/index.py (line 460)
/graphrag/search.py (line 83)

Purpose

Computes an importance score for each entity in the knowledge graph.

Implementation

import networkx as nx

def compute_pagerank(graph):
    """
    Compute PageRank scores for all nodes.
    """
    pagerank = nx.pagerank(graph)
    return pagerank

# Usage in search ranking
def rank_entities(entities, pagerank_scores):
    """
    Rank entities by similarity * PageRank.
    """
    ranked = sorted(
        entities.items(),
        key=lambda x: x[1]["sim"] * pagerank_scores[x[0]],
        reverse=True
    )
    return ranked
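
A toy invocation with made-up scores (hypothetical data, just to show the expected input shape):

# Hypothetical data: entity id -> payload carrying a "sim" score
entities = {
    "Microsoft": {"sim": 0.92},
    "OpenAI": {"sim": 0.88},
}
pagerank_scores = {"Microsoft": 0.04, "OpenAI": 0.01}

ranked = rank_entities(entities, pagerank_scores)
# Microsoft first: 0.92 * 0.04 = 0.0368 > 0.88 * 0.01 = 0.0088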

PageRank Formula

PageRank Algorithm:

PR(u) = (1-d)/N + d × Σ PR(v)/L(v)
        for all v linking to u

where:
- d = damping factor (typically 0.85)
- N = total number of nodes
- L(v) = number of outbound links from v

Iterative computation until convergence:
PR^(t+1)(u) = (1-d)/N + d × Σ PR^(t)(v)/L(v)
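
The iterative rule maps directly to code. A minimal power-iteration sketch, not RAGFlow's implementation (nx.pagerank adds dangling-node handling and a convergence test):

def pagerank_power_iteration(graph, d=0.85, iterations=50):
    """Plain power iteration over a NetworkX DiGraph."""
    n = graph.number_of_nodes()
    pr = {node: 1.0 / n for node in graph}  # uniform start

    for _ in range(iterations):
        new_pr = {}
        for u in graph:
            # Sum PR(v)/L(v) over nodes v that link to u
            incoming = sum(
                pr[v] / graph.out_degree(v)
                for v in graph.predecessors(u)
                if graph.out_degree(v) > 0
            )
            new_pr[u] = (1 - d) / n + d * incoming
        pr = new_pr

    return pr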

Usage in RAGFlow

# In GraphRAG search
def get_relevant_entities(query, graph):
    # 1. Get candidate entities by similarity
    candidates = vector_search(query)

    # 2. Compute PageRank
    pagerank = nx.pagerank(graph)

    # 3. Combine scores
    for entity in candidates:
        entity["final_score"] = entity["similarity"] * pagerank[entity["id"]]

    return sorted(candidates, key=lambda x: x["final_score"], reverse=True)

2. Leiden Community Detection

File Location

/graphrag/general/leiden.py (lines 72-141)

Purpose

Detects communities in the knowledge graph, organizing entities into groups.

Implementation

from graspologic.partition import hierarchical_leiden
from graspologic.utils import largest_connected_component

def _compute_leiden_communities(graph, max_cluster_size=12, seed=0xDEADBEEF):
    """
    Compute hierarchical communities using the Leiden algorithm.
    """
    # Extract the largest connected component
    lcc = largest_connected_component(graph)

    # Run hierarchical Leiden; this returns a flat list of
    # HierarchicalCluster records (node, cluster, level, ...)
    community_mapping = hierarchical_leiden(
        lcc,
        max_cluster_size=max_cluster_size,
        random_seed=seed
    )

    # Group nodes by (level, community id)
    grouped = {}
    for partition in community_mapping:
        key = (partition.level, partition.cluster)
        grouped.setdefault(key, []).append(partition.node)

    # Weight each community by the rank * weight of its nodes
    results = {}
    for (level, community_id), nodes in grouped.items():
        weight = sum(
            graph.nodes[n].get("rank", 1) *
            graph.nodes[n].get("weight", 1)
            for n in nodes
        )
        results[(level, community_id)] = {
            "nodes": nodes,
            "weight": weight
        }

    return results

Leiden Algorithm

Leiden Algorithm (improvement over Louvain):

1. Local Moving Phase:
   - Move nodes between communities to improve modularity
   - Refined node movement to avoid poorly connected communities

2. Refinement Phase:
   - Partition communities into smaller subcommunities
   - Ensures well-connected communities

3. Aggregation Phase:
   - Create aggregate graph with communities as nodes
   - Repeat from step 1 until no improvement

Modularity:
Q = (1/2m) × Σ [Aij - (ki×kj)/(2m)] × δ(ci, cj)

where:
- Aij = edge weight between i and j
- ki = degree of node i
- m = total edge weight
- δ(ci, cj) = 1 if same community, 0 otherwise
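
NetworkX can evaluate Q directly for a candidate partition. A quick sanity check on a built-in graph (the 2-way split is a rough guess, not an optimal partition):

import networkx as nx
from networkx.algorithms.community import modularity

G = nx.karate_club_graph()  # 34 nodes
communities = [set(range(17)), set(range(17, 34))]
print(f"Q = {modularity(G, communities):.3f}")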

Hierarchical Leiden

Hierarchical Leiden:
- Recursively applies Leiden to each community
- Creates multi-level community structure
- Controlled by max_cluster_size parameter

Level 0: Root community (all nodes)
Level 1: First-level subcommunities
Level 2: Second-level subcommunities
...
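
Illustrative use of _compute_leiden_communities, assuming graph is the NetworkX knowledge graph; the result keys follow the grouping code above:

from collections import defaultdict

communities = _compute_leiden_communities(graph, max_cluster_size=12)

# Inspect the hierarchy level by level
by_level = defaultdict(list)
for (level, community_id), info in communities.items():
    by_level[level].append((community_id, len(info["nodes"]), info["weight"]))

for level in sorted(by_level):
    print(level, sorted(by_level[level]))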

3. Entity Extraction (LLM-based)

File Location

/graphrag/general/extractor.py
/graphrag/light/graph_extractor.py

Purpose

Extracts entities and relationships from text using an LLM.

Implementation

TUPLE_DELIMITER = "<|>"  # field delimiter in LLM output; "<|>" assumed here

class GraphExtractor:
    DEFAULT_ENTITY_TYPES = [
        "organization", "person", "geo", "event", "category"
    ]

    async def _process_single_content(self, content, entity_types):
        """
        Extract entities from text using LLM with iterative gleaning.
        """
        # Initial extraction
        prompt = self._build_extraction_prompt(content, entity_types)
        result = await self._llm_chat(prompt)

        entities, relations = self._parse_result(result)

        # Iterative gleaning (up to 2 times)
        for _ in range(2):  # ENTITY_EXTRACTION_MAX_GLEANINGS
            glean_prompt = self._build_glean_prompt(result)
            glean_result = await self._llm_chat(glean_prompt)

            # Check if more entities found
            if "NO" in glean_result.upper():
                break

            new_entities, new_relations = self._parse_result(glean_result)
            entities.extend(new_entities)
            relations.extend(new_relations)

        return entities, relations

    def _parse_result(self, result):
        """
        Parse LLM output into structured entities/relations.

        Format: (entity_type, entity_name, description)
        Format: (source, target, relation_type, description)
        """
        entities = []
        relations = []

        for line in result.split("\n"):
            if line.startswith("(") and line.endswith(")"):
                parts = line[1:-1].split(TUPLE_DELIMITER)
                if len(parts) == 3:  # Entity
                    entities.append({
                        "type": parts[0],
                        "name": parts[1],
                        "description": parts[2]
                    })
                elif len(parts) == 4:  # Relation
                    relations.append({
                        "source": parts[0],
                        "target": parts[1],
                        "type": parts[2],
                        "description": parts[3]
                    })

        return entities, relations
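
A worked example of the parsing step, with a hypothetical extractor instance and hand-written LLM output (the <|> delimiter comes from the constant above):

raw = "\n".join([
    "(organization<|>Microsoft<|>A technology company)",
    "(person<|>Satya Nadella<|>CEO of Microsoft)",
    "(Satya Nadella<|>Microsoft<|>leads<|>Satya Nadella runs Microsoft)",
])

entities, relations = extractor._parse_result(raw)
# entities  -> 2 dicts: {"type": ..., "name": ..., "description": ...}
# relations -> 1 dict:  {"source": ..., "target": ..., "type": ..., "description": ...}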

Extraction Pipeline

Entity Extraction Pipeline:

1. Initial Extraction
   └── LLM extracts entities using structured prompt

2. Iterative Gleaning (max 2 iterations)
   ├── Ask LLM if more entities exist
   ├── If YES: extract additional entities
   └── If NO: stop gleaning

3. Relation Extraction
   └── Extract relationships between entities

4. Graph Construction
   └── Build NetworkX graph from entities/relations

4. Entity Resolution

File Location

/graphrag/entity_resolution.py

Purpose

Merges duplicate entities in the knowledge graph.

Implementation

import editdistance
import networkx as nx

class EntityResolution:
    def is_similarity(self, a: str, b: str) -> bool:
        """
        Check if two entity names are similar.
        """
        a, b = a.lower(), b.lower()

        # Chinese: character set intersection
        if self._is_chinese(a):
            a_set, b_set = set(a), set(b)
            max_len = max(len(a_set), len(b_set))
            overlap = len(a_set & b_set)
            return overlap / max_len >= 0.8

        # English: Edit distance
        else:
            threshold = min(len(a), len(b)) // 2
            distance = editdistance.eval(a, b)
            return distance <= threshold

    async def resolve(self, graph):
        """
        Resolve duplicate entities in graph.
        """
        # 1. Find candidate pairs
        nodes = list(graph.nodes())
        candidates = []

        for i, a in enumerate(nodes):
            for b in nodes[i+1:]:
                if self.is_similarity(a, b):
                    candidates.append((a, b))

        # 2. LLM verification (batch)
        confirmed_pairs = []
        for batch in self._batch(candidates, size=100):
            results = await self._llm_verify_batch(batch)
            confirmed_pairs.extend([
                pair for pair, is_same in zip(batch, results)
                if is_same
            ])

        # 3. Merge confirmed pairs
        for a, b in confirmed_pairs:
            self._merge_nodes(graph, a, b)

        # 4. Update PageRank
        pagerank = nx.pagerank(graph)
        for node in graph.nodes():
            graph.nodes[node]["pagerank"] = pagerank[node]

        return graph
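
_merge_nodes is not shown above. A minimal sketch using NetworkX's built-in node contraction (an assumption about how the merge could work, not RAGFlow's exact code):

import networkx as nx

def _merge_nodes(graph, keep, drop):
    """Redirect drop's edges onto keep, then remove drop (in place)."""
    # A full implementation would also merge descriptions and weights;
    # contracted_nodes stores drop's attributes under a "contraction" key
    nx.contracted_nodes(graph, keep, drop, self_loops=False, copy=False)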

Similarity Metrics

English Similarity (Edit Distance):
distance(a, b) ≤ min(len(a), len(b)) // 2

Example:
- "microsoft" vs "microsft" → distance=1 ≤ 4 → Similar
- "google" vs "apple" → distance=4 > 2 → Not similar

Chinese Similarity (Character Set):
|a ∩ b| / max(|a|, |b|) ≥ 0.8

Example:
- "北京大学" vs "北京大" → 3/4 = 0.75 → Not similar
- "清华大学" vs "清华" → 2/4 = 0.5 → Not similar

5. Largest Connected Component (LCC)

File Location

/graphrag/general/leiden.py (line 67)

Purpose

Extracts the largest connected subgraph before community detection.

Implementation

import networkx as nx
from graspologic.utils import largest_connected_component

def _stabilize_graph(graph):
    """
    Extract the largest connected component and order it deterministically.
    """
    # Get the LCC
    lcc = largest_connected_component(graph)

    # Rebuild with sorted nodes and edges so repeated runs produce an
    # identical node ordering (a subgraph view keeps the original order,
    # so the sort must happen at construction time)
    stable = nx.Graph()
    stable.add_nodes_from(sorted(lcc.nodes(data=True)))
    stable.add_edges_from(sorted(lcc.edges(data=True)))

    return stable

LCC Algorithm

LCC Algorithm:

1. Find all connected components using BFS/DFS
2. Select component with maximum number of nodes
3. Return subgraph of that component

Complexity: O(V + E)
where V = vertices, E = edges
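
The same result with plain NetworkX, for readers without graspologic (a sketch assuming an undirected graph):

import networkx as nx

def largest_cc(graph):
    # connected_components yields one node set per component; keep the biggest
    biggest = max(nx.connected_components(graph), key=len)
    return graph.subgraph(biggest).copy()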

6. N-hop Path Scoring

File Location

/graphrag/search.py (lines 181-187)

Purpose

Scores entities based on path distance in the graph.

Implementation

def compute_nhop_scores(entity, neighbors, n_hops=2):
    """
    Score edges along paths, decaying with distance from the entity.
    """
    nhop_scores = {}

    for neighbor in neighbors:
        path = neighbor["path"]

        # Walk at most n_hops edges along the path
        for i in range(min(len(path) - 1, n_hops)):
            source, target = path[i], path[i + 1]

            # Decay by hop index: sim / (2 + i)
            score = entity["sim"] / (2 + i)

            if (source, target) in nhop_scores:
                nhop_scores[(source, target)]["sim"] += score
            else:
                nhop_scores[(source, target)] = {"sim": score}

    return nhop_scores

Scoring Formula

N-hop Score with Decay:

score(e, path_i) = similarity(e) / (2 + distance_i)

where:
- distance_i = number of hops from source entity
- 2 = base offset, so the nearest edge is already dampened to sim/2 and scores decay smoothly with distance

Total score = Σ score(e, path_i) for all paths
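
A toy run of compute_nhop_scores with made-up inputs, to make the decay concrete:

entity = {"sim": 0.9}
neighbors = [{"path": ["A", "B", "C"]}]  # one 2-hop path

scores = compute_nhop_scores(entity, neighbors, n_hops=2)
# ("A", "B"): 0.9 / (2 + 0) = 0.45
# ("B", "C"): 0.9 / (2 + 1) = 0.30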

Summary

Algorithm         | Purpose              | Library
------------------|----------------------|--------------------
PageRank          | Entity importance    | NetworkX
Leiden            | Community detection  | graspologic
Entity Extraction | KG construction      | LLM
Entity Resolution | Deduplication        | editdistance + LLM
LCC               | Graph preprocessing  | graspologic
N-hop Scoring     | Path-based ranking   | Custom

Key Files

  • /graphrag/entity_resolution.py - Entity resolution
  • /graphrag/general/leiden.py - Community detection
  • /graphrag/general/extractor.py - Entity extraction
  • /graphrag/search.py - Graph search