NLP Algorithms

Overview

RAGFlow uses multiple NLP algorithms for tokenization, term weighting, and query processing.

1. Trie-based Tokenization

File Location

/rag/nlp/rag_tokenizer.py (lines 72-90, 120-240)

Purpose

Chinese word segmentation using a Trie data structure.

Implementation

import string

import datrie

class RagTokenizer:
    def __init__(self):
        # Build a Trie whose alphabet covers printable ASCII plus CJK ideographs
        self.trie = datrie.Trie(string.printable + "".join(
            chr(i) for i in range(0x4E00, 0xA000)  # CJK Unified Ideographs block
        ))

        # Load (word, frequency, POS tag) entries from the huqie.txt dictionary
        for line in open("rag/res/huqie.txt", encoding="utf-8"):
            word, freq, pos = line.strip().split("\t")
            self.trie[word] = (int(freq), pos)

    def _max_forward(self, text, start):
        """
        Max-forward matching algorithm.
        """
        end = len(text)
        while end > start:
            substr = text[start:end]
            if substr in self.trie:
                return substr, end
            end -= 1
        return text[start], start + 1

    def _max_backward(self, text, end):
        """
        Max-backward matching algorithm.
        """
        start = 0
        while start < end:
            substr = text[start:end]
            if substr in self.trie:
                return substr, start
            start += 1
        return text[end-1], end - 1

Trie Structure

Trie Data Structure:
          root
         /    \
        机      学
        |       |
        器      习
       / \
      学   人
      |
      习

Words: 机器, 机器学习, 机器人, 学习

Lookup: O(m) where m = word length
Insert: O(m)
Space: O(n × m) where n = number of words
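
Below is a minimal, self-contained sketch of the same lookup using datrie directly, with toy words and made-up frequencies rather than the real huqie.txt entries (requires pip install datrie):

import datrie

# Alphabet restricted to the CJK Unified Ideographs block for this toy example
alphabet = "".join(chr(i) for i in range(0x4E00, 0xA000))
trie = datrie.Trie(alphabet)
for word, freq in [("机器", 800), ("机器学习", 1200), ("机器人", 600), ("学习", 900)]:
    trie[word] = (freq, "n")

print("机器学习" in trie)           # True — lookup is O(m) in the word length
print(trie.prefixes("机器学习是"))  # ['机器', '机器学习']: dictionary words prefixing the text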

Max-Forward/Backward Algorithm

Max-Forward Matching:
Input: "机器学习是人工智能"

Step 1: Try "机器学习是人工智能" → Not found
Step 2: Try "机器学习是人工智" → Not found
...
Step 6: Try "机器学习" → Found!
Output: ["机器学习", ...] (matching continues from "是")

Max-Backward Matching:
Input: "机器学习"

Step 1: Try "机器学习" from end → Found!
Output: ["机器学习"]

2. DFS with Memoization (Disambiguation)

File Location

/rag/nlp/rag_tokenizer.py (lines 120-210)

Purpose

Resolves ambiguity when there are multiple ways to tokenize a span.

Implementation

def dfs_(self, text, memo=None):
    """
    DFS with memoization for tokenization disambiguation.
    """
    if memo is None:  # avoid the mutable-default-argument pitfall
        memo = {}
    if text in memo:
        return memo[text]

    if not text:
        return [[]]

    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        # A prefix is a valid split if it is a dictionary word, or a single
        # character (fallback for out-of-vocabulary text)
        if prefix in self.trie or len(prefix) == 1:
            suffix_results = self.dfs_(text[end:], memo)
            for suffix in suffix_results:
                results.append([prefix] + suffix)

    # Score and keep only the best tokenization for this substring
    best = max(results, key=lambda x: self.score_(x))
    memo[text] = [best]
    return [best]

def score_(self, tokens):
    """
    Score tokenization quality.

    Formula: score = B/n + L + F
    where:
        B = 30 (bonus that favors fewer tokens)
        n = number of tokens
        L = coverage ratio (token lengths / input length; 1.0 when the
            tokens cover the input exactly)
        F = sum of dictionary frequencies
    """
    B = 30
    total_len = sum(len(t) for t in tokens)
    L = total_len / max(1, total_len)  # always 1.0 for a full cover
    F = sum(self.trie.get(t, (1, ''))[0] for t in tokens)

    return B / len(tokens) + L + F

Scoring Function

Tokenization Scoring:

score(tokens) = B/n + L + F

where:
- B = 30 (base bonus)
- n = number of tokens (fewer is better)
- L = coverage ratio
- F = sum of word frequencies (common words preferred)

Example:
"北京大学" →
  Option 1: ["北京", "大学"] → score = 30/2 + 1.0 + (1000+500) = 1516
  Option 2: ["北", "京", "大", "学"] → score = 30/4 + 1.0 + (10+10+10+10) = 48.5

Winner: Option 1
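
A quick way to verify these numbers is the toy reproduction below, which uses the hypothetical frequencies from the example (1000, 500, and 10 per single character) rather than real dictionary values:

# Toy reproduction of the scoring example (hypothetical frequencies)
FREQ = {"北京": 1000, "大学": 500, "北": 10, "京": 10, "大": 10, "学": 10}

def score(tokens, B=30):
    L = 1.0  # coverage ratio; the tokens always cover the input exactly here
    F = sum(FREQ.get(t, 1) for t in tokens)
    return B / len(tokens) + L + F

print(score(["北京", "大学"]))          # 1516.0
print(score(["北", "京", "大", "学"]))  # 48.5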

3. TF-IDF Term Weighting

File Location

/rag/nlp/term_weight.py (lines 162-244)

Purpose

Computes an importance weight for each term in the query.

Implementation

import math
import numpy as np

class Dealer:
    def weights(self, tokens, preprocess=True):
        """
        Calculate TF-IDF based weights for tokens.
        """
        def idf(s, N):
            return math.log10(10 + ((N - s + 0.5) / (s + 0.5)))

        # IDF based on term frequency in corpus
        idf1 = np.array([idf(self.freq(t), 10000000) for t in tokens])

        # IDF based on document frequency
        idf2 = np.array([idf(self.df(t), 1000000000) for t in tokens])

        # NER and POS weights
        ner_weights = np.array([self.ner(t) for t in tokens])
        pos_weights = np.array([self.postag(t) for t in tokens])

        # Combined weight
        weights = (0.3 * idf1 + 0.7 * idf2) * ner_weights * pos_weights

        # Normalize
        total = np.sum(weights)
        return [(t, w / total) for t, w in zip(tokens, weights)]

Formula

TF-IDF Variant:

IDF(term) = log₁₀(10 + (N - df + 0.5) / (df + 0.5))

where:
- N = total documents (10⁹ for df, 10⁷ for freq)
- df = document frequency of term

Combined Weight:
weight(term) = (0.3 × IDF_freq + 0.7 × IDF_df) × NER × POS

Normalization:
normalized_weight(term) = weight(term) / Σ weight(all_terms)
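
For intuition, the short check below evaluates this IDF variant for a rare and a common term against the 10⁹-document constant (the document-frequency values are made up):

import math

def idf(df, N):
    return math.log10(10 + (N - df + 0.5) / (df + 0.5))

# Hypothetical document frequencies, N = 10^9
print(round(idf(100, 1_000_000_000), 2))         # ≈ 7.0  (rare term → high weight)
print(round(idf(50_000_000, 1_000_000_000), 2))  # ≈ 1.46 (common term → low weight)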

4. Named Entity Recognition (NER)

File Location

/rag/nlp/term_weight.py (lines 84-86, 144-149)

Purpose

Dictionary-based entity type detection with weight assignment.

Implementation

import json

class Dealer:
    def __init__(self):
        # Load NER dictionary (JSON arrays converted to sets for O(1) membership tests)
        raw = json.load(open("rag/res/ner.json", encoding="utf-8"))
        self.ner_dict = {k: set(v) for k, v in raw.items()}

    def ner(self, token):
        """
        Get NER weight for token.
        """
        NER_WEIGHTS = {
            "toxic": 2.0,      # Toxic/sensitive words
            "func": 1.0,       # Functional words
            "corp": 3.0,       # Corporation names
            "loca": 3.0,       # Location names
            "sch": 3.0,        # School names
            "stock": 3.0,      # Stock symbols
            "firstnm": 1.0,    # First names
        }

        for entity_type, weight in NER_WEIGHTS.items():
            if token in self.ner_dict.get(entity_type, set()):
                return weight

        return 1.0  # Default

Entity Types

NER Categories:
┌──────────┬────────┬─────────────────────────────┐
│ Type     │ Weight │ Examples                    │
├──────────┼────────┼─────────────────────────────┤
│ corp     │ 3.0    │ Microsoft, Google, Apple    │
│ loca     │ 3.0    │ Beijing, New York           │
│ sch      │ 3.0    │ MIT, Stanford               │
│ stock    │ 3.0    │ AAPL, GOOG                  │
│ toxic    │ 2.0    │ (sensitive words)           │
│ func     │ 1.0    │ the, is, are                │
│ firstnm  │ 1.0    │ John, Mary                  │
└──────────┴────────┴─────────────────────────────┘

5. Part-of-Speech (POS) Tagging

File Location

/rag/nlp/term_weight.py (lines 179-189)

Purpose

Assign weights based on grammatical category.

Implementation

def postag(self, token):
    """
    Get POS weight for token.
    """
    POS_WEIGHTS = {
        "r": 0.3,   # Pronoun
        "c": 0.3,   # Conjunction
        "d": 0.3,   # Adverb
        "ns": 3.0,  # Place noun
        "nt": 3.0,  # Organization noun
        "n": 2.0,   # Common noun
    }

    # Get POS tag from tokenizer
    pos = self.tokenizer.tag(token)

    # Check for numeric patterns
    if re.match(r"^[\d.]+$", token):
        return 2.0

    return POS_WEIGHTS.get(pos, 1.0)

POS Weight Table

POS Weight Assignments:
┌───────┬────────┬─────────────────────┐
│ Tag   │ Weight │ Description         │
├───────┼────────┼─────────────────────┤
│ n     │ 2.0    │ Common noun         │
│ ns    │ 3.0    │ Place noun          │
│ nt    │ 3.0    │ Organization noun   │
│ v     │ 1.0    │ Verb                │
│ a     │ 1.0    │ Adjective           │
│ r     │ 0.3    │ Pronoun             │
│ c     │ 0.3    │ Conjunction         │
│ d     │ 0.3    │ Adverb              │
│ num   │ 2.0    │ Number              │
└───────┴────────┴─────────────────────┘
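
These POS multipliers combine with the NER weights and the IDF terms from section 3; a toy calculation with made-up IDF components:

# Hypothetical IDF components for one term; NER/POS multipliers from the tables above
idf_freq, idf_df = 5.0, 6.0
ner, pos = 3.0, 3.0  # e.g. a location name tagged "ns"

weight = (0.3 * idf_freq + 0.7 * idf_df) * ner * pos
print(round(weight, 2))  # 51.3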

6. Synonym Detection

File Location

/rag/nlp/synonym.py (lines 71-93)

Purpose

Query expansion via synonym lookup.

Implementation

import json
import re

from nltk.corpus import wordnet

class SynonymLookup:
    def __init__(self):
        # Load the custom synonym dictionary (term -> list of synonyms)
        self.custom_dict = json.load(open("rag/res/synonym.json", encoding="utf-8"))

    def lookup(self, token, top_n=8):
        """
        Find synonyms for token.

        Strategy:
        1. Check custom dictionary first
        2. Fall back to WordNet for English
        """
        # Custom dictionary
        if token in self.custom_dict:
            return self.custom_dict[token][:top_n]

        # WordNet for English words
        if re.match(r"^[a-z]+$", token.lower()):
            synonyms = set()
            for syn in wordnet.synsets(token):
                for lemma in syn.lemmas():
                    name = lemma.name().replace("_", " ")
                    if name.lower() != token.lower():
                        synonyms.add(name)

            return list(synonyms)[:top_n]

        return []

Synonym Sources

Synonym Lookup Strategy:

1. Custom Dictionary (highest priority)
   - Domain-specific synonyms
   - Chinese synonyms
   - Technical terms

2. WordNet (English only)
   - General synonyms
   - Lemma extraction from synsets

Example:
"computer" → WordNet → ["machine", "calculator", "computing device"]
"机器学习" → Custom → ["ML", "machine learning", "深度学习"]

7. Query Expansion

File Location

/rag/nlp/query.py (lines 85-218)

Purpose

Builds an expanded query with weighted terms and synonyms.

Implementation

class FulltextQueryer:
    QUERY_FIELDS = [
        "title_tks^10",        # Title: 10x boost
        "title_sm_tks^5",      # Title sub-tokens: 5x
        "important_kwd^30",    # Keywords: 30x
        "important_tks^20",    # Keyword tokens: 20x
        "question_tks^20",     # Question tokens: 20x
        "content_ltks^2",      # Content: 2x
        "content_sm_ltks^1",   # Content sub-tokens: 1x
    ]

    def question(self, text, min_match=0.6):
        """
        Build expanded query.
        """
        # 1. Tokenize
        tokens = self.tokenizer.tokenize(text)

        # 2. Get weights
        weighted_tokens = self.term_weight.weights(tokens)

        # 3. Get synonyms
        synonyms = [self.synonym.lookup(t) for t, _ in weighted_tokens]

        # 4. Build query string
        query_parts = []
        for (token, weight), syns in zip(weighted_tokens, synonyms):
            if syns:
                # Token with synonyms
                syn_str = " ".join(syns)
                query_parts.append(f"({token}^{weight:.4f} OR ({syn_str})^0.2)")
            else:
                query_parts.append(f"{token}^{weight:.4f}")

        # 5. Add phrase queries (bigrams)
        for i in range(1, len(weighted_tokens)):
            left, _ = weighted_tokens[i-1]
            right, w = weighted_tokens[i]
            query_parts.append(f'"{left} {right}"^{w*2:.4f}')

        return MatchTextExpr(
            query=" ".join(query_parts),
            fields=self.QUERY_FIELDS,
            min_match=f"{int(min_match * 100)}%"
        )

Query Expansion Example

Input: "machine learning tutorial"

After expansion:
(machine^0.35 OR (computer device)^0.2)
(learning^0.40 OR (study education)^0.2)
(tutorial^0.25 OR (guide lesson)^0.2)
"machine learning"^0.80
"learning tutorial"^0.50

With field boosting:
{
    "query_string": {
        "query": "(machine^0.35 learning^0.40 tutorial^0.25)",
        "fields": ["title_tks^10", "important_kwd^30", "content_ltks^2"],
        "minimum_should_match": "60%"
    }
}
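
The term and bigram clauses can be assembled by a small standalone helper, shown here with hypothetical weights and synonyms (mirroring steps 4-5 of question() above):

def build_query(weighted_tokens, synonyms):
    parts = []
    # Step 4: per-term clauses, optionally OR-ed with down-weighted synonyms
    for (token, weight), syns in zip(weighted_tokens, synonyms):
        if syns:
            parts.append(f"({token}^{weight:.4f} OR ({' '.join(syns)})^0.2)")
        else:
            parts.append(f"{token}^{weight:.4f}")
    # Step 5: bigram phrase clauses, boosted to twice the right token's weight
    for (left, _), (right, w) in zip(weighted_tokens, weighted_tokens[1:]):
        parts.append(f'"{left} {right}"^{w * 2:.4f}')
    return " ".join(parts)

print(build_query(
    [("machine", 0.35), ("learning", 0.40), ("tutorial", 0.25)],
    [["computer"], [], ["guide"]],
))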

8. Fine-Grained Tokenization

File Location

/rag/nlp/rag_tokenizer.py (lines 395-420)

Purpose

Secondary tokenization for compound words.

Implementation

def fine_grained_tokenize(self, text):
    """
    Break compound words into sub-tokens.
    """
    # First pass: standard tokenization
    tokens = self.tokenize(text)

    fine_tokens = []
    for token in tokens:
        # Skip short tokens
        if len(token) < 3:
            fine_tokens.append(token)
            continue

        # Try to break into sub-tokens
        sub_tokens = self.dfs_(token)
        if len(sub_tokens[0]) > 1:
            fine_tokens.extend(sub_tokens[0])
        else:
            fine_tokens.append(token)

    return fine_tokens

Example

Standard: "机器学习" → ["机器学习"]
Fine-grained: "机器学习" → ["机器", "学习"]

Standard: "人工智能" → ["人工智能"]
Fine-grained: "人工智能" → ["人工", "智能"]

Summary

┌──────────────────────┬──────────────────────┬──────────────────┐
│ Algorithm            │ Purpose              │ File             │
├──────────────────────┼──────────────────────┼──────────────────┤
│ Trie Tokenization    │ Word segmentation    │ rag_tokenizer.py │
│ Max-Forward/Backward │ Matching strategy    │ rag_tokenizer.py │
│ DFS + Memo           │ Disambiguation       │ rag_tokenizer.py │
│ TF-IDF               │ Term weighting       │ term_weight.py   │
│ NER                  │ Entity detection     │ term_weight.py   │
│ POS Tagging          │ Grammatical analysis │ term_weight.py   │
│ Synonym              │ Query expansion      │ synonym.py       │
│ Query Expansion      │ Search enhancement   │ query.py         │
│ Fine-grained         │ Sub-tokenization     │ rag_tokenizer.py │
└──────────────────────┴──────────────────────┴──────────────────┘

Key files:
  • /rag/nlp/rag_tokenizer.py - Tokenization
  • /rag/nlp/term_weight.py - TF-IDF, NER, POS
  • /rag/nlp/synonym.py - Synonym lookup
  • /rag/nlp/query.py - Query processing