Graph Algorithms
Overview
RAGFlow uses graph algorithms for knowledge graph construction and GraphRAG retrieval.
1. PageRank Algorithm
File Location
/graphrag/entity_resolution.py (line 150)
/graphrag/general/index.py (line 460)
/graphrag/search.py (line 83)
Purpose
Computes an importance score for each entity in the knowledge graph.
Implementation
import networkx as nx

def compute_pagerank(graph):
    """
    Compute PageRank scores for all nodes.
    """
    pagerank = nx.pagerank(graph)
    return pagerank

# Usage in search ranking
def rank_entities(entities, pagerank_scores):
    """
    Rank entities by similarity * pagerank.
    """
    ranked = sorted(
        entities.items(),
        key=lambda x: x[1]["sim"] * x[1]["pagerank"],
        reverse=True
    )
    return ranked
PageRank Formula
PageRank Algorithm:
PR(u) = (1-d)/N + d × Σ PR(v)/L(v)
for all v linking to u
where:
- d = damping factor (typically 0.85)
- N = total number of nodes
- L(v) = number of outbound links from v
Iterative computation until convergence:
PR^(t+1)(u) = (1-d)/N + d × Σ PR^(t)(v)/L(v)
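For intuition, here is a minimal, self-contained sketch of this iteration on a toy directed graph. It is an illustration only (RAGFlow calls nx.pagerank directly), and it assumes the graph has no dangling nodes:

import networkx as nx

def pagerank_iterative(graph, d=0.85, tol=1e-8, max_iter=100):
    nodes = list(graph.nodes())
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}  # uniform start
    for _ in range(max_iter):
        new_pr = {}
        for u in nodes:
            # Sum PR(v)/L(v) over all v linking to u
            incoming = sum(
                pr[v] / graph.out_degree(v)
                for v in graph.predecessors(u)
            )
            new_pr[u] = (1 - d) / n + d * incoming
        if sum(abs(new_pr[u] - pr[u]) for u in nodes) < tol:
            break
        pr = new_pr
    return pr

g = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
print(pagerank_iterative(g))  # agrees closely with nx.pagerank(g)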
Usage in RAGFlow
# In GraphRAG search
def get_relevant_entities(query, graph):
    # 1. Get candidate entities by similarity
    candidates = vector_search(query)
    # 2. Compute PageRank
    pagerank = nx.pagerank(graph)
    # 3. Combine scores
    for entity in candidates:
        entity["final_score"] = entity["similarity"] * pagerank[entity["id"]]
    return sorted(candidates, key=lambda x: x["final_score"], reverse=True)
2. Leiden Community Detection
File Location
/graphrag/general/leiden.py (lines 72-141)
Purpose
Detects communities in the knowledge graph to organize entities into groups.
Implementation
from graspologic.partition import hierarchical_leiden
from graspologic.utils import largest_connected_component

def _compute_leiden_communities(graph, max_cluster_size=12, seed=0xDEADBEEF):
    """
    Compute hierarchical communities using the Leiden algorithm.
    """
    # Extract largest connected component
    lcc = largest_connected_component(graph)

    # Run hierarchical Leiden; it returns a flat list of per-node
    # cluster assignments, one entry per hierarchy level
    community_mapping = hierarchical_leiden(
        lcc,
        max_cluster_size=max_cluster_size,
        random_seed=seed
    )

    # Group node assignments by (level, community)
    results = {}
    for partition in community_mapping:
        key = (partition.level, partition.cluster)
        results.setdefault(key, {"nodes": [], "weight": 0})
        results[key]["nodes"].append(partition.node)

    # Weight each community by its nodes' rank * weight
    for info in results.values():
        info["weight"] = sum(
            graph.nodes[n].get("rank", 1) * graph.nodes[n].get("weight", 1)
            for n in info["nodes"]
        )
    return results
Leiden Algorithm
Leiden Algorithm (improvement over Louvain):
1. Local Moving Phase:
- Move nodes between communities to improve modularity
- Refined node movement to avoid poorly connected communities
2. Refinement Phase:
- Partition communities into smaller subcommunities
- Ensures well-connected communities
3. Aggregation Phase:
- Create aggregate graph with communities as nodes
- Repeat from step 1 until no improvement
Modularity:
Q = (1/2m) × Σ [Aij - (ki×kj)/(2m)] × δ(ci, cj)
where:
- Aij = edge weight between i and j
- ki = degree of node i
- m = total edge weight
- δ(ci, cj) = 1 if same community, 0 otherwise
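As a quick illustration of Q, the snippet below scores a hand-built partition with NetworkX's modularity helper. This is not part of RAGFlow (Leiden optimizes Q internally); it only shows what the formula measures:

import networkx as nx
from networkx.algorithms.community import modularity

# Two triangles joined by a single bridge edge
g = nx.Graph([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
])
partition = [{"a", "b", "c"}, {"x", "y", "z"}]
print(modularity(g, partition))  # ~0.357: two well-separated communities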
Hierarchical Leiden
Hierarchical Leiden:
- Recursively applies Leiden to each community
- Creates multi-level community structure
- Controlled by max_cluster_size parameter
Level 0: Root community (all nodes)
Level 1: First-level subcommunities
Level 2: Second-level subcommunities
...
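A hypothetical usage of the helper above (assumes graspologic is installed; the graph and printed fields are illustrative):

import networkx as nx

g = nx.karate_club_graph()
communities = _compute_leiden_communities(g, max_cluster_size=10)
for (level, community_id), info in sorted(communities.items()):
    # Nodes lacking "rank"/"weight" attributes default to 1
    print(level, community_id, len(info["nodes"]), info["weight"])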
3. Entity Extraction (LLM-based)
File Location
/graphrag/general/extractor.py
/graphrag/light/graph_extractor.py
Purpose
Extracts entities and relationships from text using an LLM.
Implementation
class GraphExtractor:
    DEFAULT_ENTITY_TYPES = [
        "organization", "person", "geo", "event", "category"
    ]

    async def _process_single_content(self, content, entity_types):
        """
        Extract entities from text using LLM with iterative gleaning.
        """
        # Initial extraction
        prompt = self._build_extraction_prompt(content, entity_types)
        result = await self._llm_chat(prompt)
        entities, relations = self._parse_result(result)

        # Iterative gleaning (up to 2 times)
        for _ in range(2):  # ENTITY_EXTRACTION_MAX_GLEANINGS
            glean_prompt = self._build_glean_prompt(result)
            glean_result = await self._llm_chat(glean_prompt)
            # Stop when the LLM reports no further entities
            if "NO" in glean_result.upper():
                break
            new_entities, new_relations = self._parse_result(glean_result)
            entities.extend(new_entities)
            relations.extend(new_relations)
        return entities, relations

    def _parse_result(self, result):
        """
        Parse LLM output into structured entities/relations.
        Entity format:   (entity_type, entity_name, description)
        Relation format: (source, target, relation_type, description)
        """
        entities = []
        relations = []
        for line in result.split("\n"):
            if line.startswith("(") and line.endswith(")"):
                # TUPLE_DELIMITER is a module-level constant, e.g. "<|>"
                parts = line[1:-1].split(TUPLE_DELIMITER)
                if len(parts) == 3:  # Entity
                    entities.append({
                        "type": parts[0],
                        "name": parts[1],
                        "description": parts[2]
                    })
                elif len(parts) == 4:  # Relation
                    relations.append({
                        "source": parts[0],
                        "target": parts[1],
                        "type": parts[2],
                        "description": parts[3]
                    })
        return entities, relations
Extraction Pipeline
Entity Extraction Pipeline:
1. Initial Extraction
└── LLM extracts entities using structured prompt
2. Iterative Gleaning (max 2 iterations)
├── Ask LLM if more entities exist
├── If YES: extract additional entities
└── If NO: stop gleaning
3. Relation Extraction
└── Extract relationships between entities
4. Graph Construction
└── Build NetworkX graph from entities/relations
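Step 4 of the pipeline can be sketched as follows. This is an illustrative helper, not RAGFlow's exact code; it assumes the dict shapes produced by _parse_result above:

import networkx as nx

def build_graph(entities, relations):
    g = nx.Graph()
    for e in entities:
        # First mention wins; later duplicates are handled by entity resolution
        if e["name"] not in g:
            g.add_node(e["name"], entity_type=e["type"],
                       description=e["description"])
    for r in relations:
        g.add_edge(r["source"], r["target"],
                   relation_type=r["type"], description=r["description"])
    return g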
4. Entity Resolution
File Location
/graphrag/entity_resolution.py
Purpose
Merges duplicate entities in the knowledge graph.
Implementation
import editdistance
import networkx as nx

class EntityResolution:
    def is_similarity(self, a: str, b: str) -> bool:
        """
        Check if two entity names are similar.
        """
        a, b = a.lower(), b.lower()
        # Chinese: character set intersection
        if self._is_chinese(a):
            a_set, b_set = set(a), set(b)
            max_len = max(len(a_set), len(b_set))
            overlap = len(a_set & b_set)
            return overlap / max_len >= 0.8
        # English: edit distance
        else:
            threshold = min(len(a), len(b)) // 2
            distance = editdistance.eval(a, b)
            return distance <= threshold

    async def resolve(self, graph):
        """
        Resolve duplicate entities in graph.
        """
        # 1. Find candidate pairs
        nodes = list(graph.nodes())
        candidates = []
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if self.is_similarity(a, b):
                    candidates.append((a, b))

        # 2. LLM verification (batch)
        confirmed_pairs = []
        for batch in self._batch(candidates, size=100):
            results = await self._llm_verify_batch(batch)
            confirmed_pairs.extend([
                pair for pair, is_same in zip(batch, results)
                if is_same
            ])

        # 3. Merge confirmed pairs
        for a, b in confirmed_pairs:
            self._merge_nodes(graph, a, b)

        # 4. Update PageRank
        pagerank = nx.pagerank(graph)
        for node in graph.nodes():
            graph.nodes[node]["pagerank"] = pagerank[node]
        return graph
Similarity Metrics
English Similarity (Edit Distance):
distance(a, b) ≤ min(len(a), len(b)) // 2
Example:
- "microsoft" vs "microsft" → distance=1 ≤ 4 → Similar
- "google" vs "apple" → distance=5 > 2 → Not similar
Chinese Similarity (Character Set):
|a ∩ b| / max(|a|, |b|) ≥ 0.8
Example:
- "北京大学" vs "北京大" → 3/4 = 0.75 → Not similar
- "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
5. Largest Connected Component (LCC)
File Location
/graphrag/general/leiden.py (line 67)
Purpose
Extracts the largest connected subgraph before community detection.
Implementation
import networkx as nx
from graspologic.utils import largest_connected_component

def _stabilize_graph(graph):
    """
    Extract and stabilize the largest connected component.
    """
    # Get LCC
    lcc = largest_connected_component(graph)

    # Rebuild with sorted nodes and edges so downstream algorithms
    # see a deterministic, reproducible ordering
    stable = nx.Graph()
    stable.add_nodes_from(sorted(lcc.nodes(data=True)))
    stable.add_edges_from(sorted(lcc.edges(data=True)))
    return stable
LCC Algorithm
LCC Algorithm:
1. Find all connected components using BFS/DFS
2. Select component with maximum number of nodes
3. Return subgraph of that component
Complexity: O(V + E)
where V = vertices, E = edges
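For reference, the same idea in plain NetworkX (graspologic's largest_connected_component wraps an equivalent computation):

import networkx as nx

def largest_cc(graph):
    # connected_components traverses the graph once: O(V + E)
    biggest = max(nx.connected_components(graph), key=len)
    return graph.subgraph(biggest).copy()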
6. N-hop Path Scoring
File Location
/graphrag/search.py (lines 181-187)
Purpose
Scores entities based on path distance in the graph.
Implementation
def compute_nhop_scores(entity, neighbors, n_hops=2):
    """
    Score entities based on graph distance.
    """
    nhop_scores = {}
    for neighbor in neighbors:
        path = neighbor["path"]
        weights = neighbor["weights"]  # edge weights; unused in this simplified scoring
        for i in range(len(path) - 1):
            source, target = path[i], path[i + 1]
            # Decay by distance
            score = entity["sim"] / (2 + i)
            if (source, target) in nhop_scores:
                nhop_scores[(source, target)]["sim"] += score
            else:
                nhop_scores[(source, target)] = {"sim": score}
    return nhop_scores
Scoring Formula
N-hop Score with Decay:
score(e, path_i) = similarity(e) / (2 + distance_i)
where:
- distance_i = number of hops from source entity
- 2 = base offset (avoids division by zero at distance 0 and dampens all path scores)
Total score = Σ score(e, path_i) for all paths
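A toy walk-through of the decay, using the function above (the entity and neighbor dict shapes are hypothetical but match that code):

entity = {"sim": 0.9}
neighbors = [{"path": ["A", "B", "C"], "weights": [1.0, 1.0]}]

scores = compute_nhop_scores(entity, neighbors)
# Edge (A, B): 0.9 / (2 + 0) = 0.45
# Edge (B, C): 0.9 / (2 + 1) = 0.30
print(scores)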
Summary
| Algorithm | Purpose | Library |
|---|---|---|
| PageRank | Entity importance | NetworkX |
| Leiden | Community detection | graspologic |
| Entity Extraction | KG construction | LLM |
| Entity Resolution | Deduplication | editdistance + LLM |
| LCC | Graph preprocessing | graspologic |
| N-hop Scoring | Path-based ranking | Custom |
Related Files
/graphrag/entity_resolution.py - Entity resolution
/graphrag/general/leiden.py - Community detection
/graphrag/general/extractor.py - Entity extraction
/graphrag/search.py - Graph search