NLP Algorithms

Overview

RAGFlow uses multiple NLP algorithms for tokenization, term weighting, and query processing.

1. Trie-based Tokenization

File Location

/rag/nlp/rag_tokenizer.py (lines 72-90, 120-240)

Purpose

Chinese word segmentation using a Trie data structure.

Implementation

import string

import datrie

class RagTokenizer:
    def __init__(self):
        # Build a Trie whose alphabet covers printable ASCII plus CJK ideographs
        self.trie = datrie.Trie(string.printable + "".join(
            chr(i) for i in range(0x4E00, 0xA000)  # CJK Unified Ideographs block
        ))

        # Load (word, frequency, POS tag) entries from the huqie.txt dictionary
        for line in open("rag/res/huqie.txt", encoding="utf-8"):
            word, freq, pos = line.strip().split("\t")
            self.trie[word] = (int(freq), pos)

    def _max_forward(self, text, start):
        """
        Max-forward matching algorithm.
        """
        end = len(text)
        while end > start:
            substr = text[start:end]
            if substr in self.trie:
                return substr, end
            end -= 1
        return text[start], start + 1

    def _max_backward(self, text, end):
        """
        Max-backward matching algorithm.
        """
        start = 0
        while start < end:
            substr = text[start:end]
            if substr in self.trie:
                return substr, start
            start += 1
        return text[end-1], end - 1

Trie Structure

Trie Data Structure:
          root
         /    \
        机      学
        |       |
        器      习
       / \
      学   人
      |
      习

Words: 机器, 机器学习, 机器人, 学习

Lookup: O(m) where m = word length
Insert: O(m)
Space: O(n × m) where n = number of words
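
Below is a minimal, self-contained sketch of the same lookup using datrie directly, with toy words and made-up frequencies rather than the real huqie.txt entries (requires pip install datrie):

import datrie

# Alphabet restricted to the CJK Unified Ideographs block for this toy example
alphabet = "".join(chr(i) for i in range(0x4E00, 0xA000))
trie = datrie.Trie(alphabet)
for word, freq in [("机器", 800), ("机器学习", 1200), ("机器人", 600), ("学习", 900)]:
    trie[word] = (freq, "n")

print("机器学习" in trie)           # True — lookup is O(m) in the word length
print(trie.prefixes("机器学习是"))  # ['机器', '机器学习']: dictionary words prefixing the text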

Max-Forward/Backward Algorithm

Max-Forward Matching:
Input: "机器学习是人工智能"

Step 1: Try "机器学习是人工智能" → Not found
Step 2: Try "机器学习是人工智" → Not found
...
Step 6: Try "机器学习" → Found!
Output: ["机器学习", ...] (matching continues from "是")

Max-Backward Matching:
Input: "机器学习"

Step 1: Try "机器学习" from end → Found!
Output: ["机器学习"]

2. DFS with Memoization (Disambiguation)

File Location

/rag/nlp/rag_tokenizer.py (lines 120-210)

Purpose

Resolves ambiguity when there are multiple ways to tokenize a span.

Implementation

def dfs_(self, text, memo=None):
    """
    DFS with memoization for tokenization disambiguation.
    """
    if memo is None:  # avoid the mutable-default-argument pitfall
        memo = {}
    if text in memo:
        return memo[text]

    if not text:
        return [[]]

    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        # A prefix is a valid split if it is a dictionary word, or a single
        # character (fallback for out-of-vocabulary text)
        if prefix in self.trie or len(prefix) == 1:
            suffix_results = self.dfs_(text[end:], memo)
            for suffix in suffix_results:
                results.append([prefix] + suffix)

    # Score and keep only the best tokenization for this substring
    best = max(results, key=lambda x: self.score_(x))
    memo[text] = [best]
    return [best]

def score_(self, tokens):
    """
    Score tokenization quality.

    Formula: score = B/n + L + F
    where:
        B = 30 (bonus that favors fewer tokens)
        n = number of tokens
        L = coverage ratio (token lengths / input length; 1.0 when the
            tokens cover the input exactly)
        F = sum of dictionary frequencies
    """
    B = 30
    total_len = sum(len(t) for t in tokens)
    L = total_len / max(1, total_len)  # always 1.0 for a full cover
    F = sum(self.trie.get(t, (1, ''))[0] for t in tokens)

    return B / len(tokens) + L + F

Scoring Function

Tokenization Scoring:

score(tokens) = B/n + L + F

where:
- B = 30 (base bonus)
- n = number of tokens (fewer is better)
- L = coverage ratio
- F = sum of word frequencies (common words preferred)

Example:
"北京大学" →
  Option 1: ["北京", "大学"] → score = 30/2 + 1.0 + (1000+500) = 1516
  Option 2: ["北", "京", "大", "学"] → score = 30/4 + 1.0 + (10+10+10+10) = 48.5

Winner: Option 1
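
A quick way to verify these numbers is the toy reproduction below, which uses the hypothetical frequencies from the example (1000, 500, and 10 per single character) rather than real dictionary values:

# Toy reproduction of the scoring example (hypothetical frequencies)
FREQ = {"北京": 1000, "大学": 500, "北": 10, "京": 10, "大": 10, "学": 10}

def score(tokens, B=30):
    L = 1.0  # coverage ratio; the tokens always cover the input exactly here
    F = sum(FREQ.get(t, 1) for t in tokens)
    return B / len(tokens) + L + F

print(score(["北京", "大学"]))          # 1516.0
print(score(["北", "京", "大", "学"]))  # 48.5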

3. TF-IDF Term Weighting

File Location

/rag/nlp/term_weight.py (lines 162-244)

Purpose

Computes an importance weight for each term in the query.

Implementation

import math
import numpy as np

class Dealer:
    def weights(self, tokens, preprocess=True):
        """
        Calculate TF-IDF based weights for tokens.
        """
        def idf(s, N):
            return math.log10(10 + ((N - s + 0.5) / (s + 0.5)))

        # IDF based on term frequency in corpus
        idf1 = np.array([idf(self.freq(t), 10000000) for t in tokens])

        # IDF based on document frequency
        idf2 = np.array([idf(self.df(t), 1000000000) for t in tokens])

        # NER and POS weights
        ner_weights = np.array([self.ner(t) for t in tokens])
        pos_weights = np.array([self.postag(t) for t in tokens])

        # Combined weight
        weights = (0.3 * idf1 + 0.7 * idf2) * ner_weights * pos_weights

        # Normalize
        total = np.sum(weights)
        return [(t, w / total) for t, w in zip(tokens, weights)]

Formula

TF-IDF Variant:

IDF(term) = log₁₀(10 + (N - df + 0.5) / (df + 0.5))

where:
- N = total documents (10⁹ for df, 10⁷ for freq)
- df = document frequency of term

Combined Weight:
weight(term) = (0.3 × IDF_freq + 0.7 × IDF_df) × NER × POS

Normalization:
normalized_weight(term) = weight(term) / Σ weight(all_terms)
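
For intuition, the short check below evaluates this IDF variant for a rare and a common term against the 10⁹-document constant (the document-frequency values are made up):

import math

def idf(df, N):
    return math.log10(10 + (N - df + 0.5) / (df + 0.5))

# Hypothetical document frequencies, N = 10^9
print(round(idf(100, 1_000_000_000), 2))         # ≈ 7.0  (rare term → high weight)
print(round(idf(50_000_000, 1_000_000_000), 2))  # ≈ 1.46 (common term → low weight)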

4. Named Entity Recognition (NER)

File Location

/rag/nlp/term_weight.py (lines 84-86, 144-149)

Purpose

Dictionary-based entity type detection with weight assignment.

Implementation

import json

class Dealer:
    def __init__(self):
        # Load NER dictionary (JSON arrays converted to sets for O(1) membership tests)
        raw = json.load(open("rag/res/ner.json", encoding="utf-8"))
        self.ner_dict = {k: set(v) for k, v in raw.items()}

    def ner(self, token):
        """
        Get NER weight for token.
        """
        NER_WEIGHTS = {
            "toxic": 2.0,      # Toxic/sensitive words
            "func": 1.0,       # Functional words
            "corp": 3.0,       # Corporation names
            "loca": 3.0,       # Location names
            "sch": 3.0,        # School names
            "stock": 3.0,      # Stock symbols
            "firstnm": 1.0,    # First names
        }

        for entity_type, weight in NER_WEIGHTS.items():
            if token in self.ner_dict.get(entity_type, set()):
                return weight

        return 1.0  # Default

Entity Types

NER Categories:
┌──────────┬────────┬─────────────────────────────┐
│ Type     │ Weight │ Examples                    │
├──────────┼────────┼─────────────────────────────┤
│ corp     │ 3.0    │ Microsoft, Google, Apple    │
│ loca     │ 3.0    │ Beijing, New York           │
│ sch      │ 3.0    │ MIT, Stanford               │
│ stock    │ 3.0    │ AAPL, GOOG                  │
│ toxic    │ 2.0    │ (sensitive words)           │
│ func     │ 1.0    │ the, is, are                │
│ firstnm  │ 1.0    │ John, Mary                  │
└──────────┴────────┴─────────────────────────────┘

5. Part-of-Speech (POS) Tagging

File Location

/rag/nlp/term_weight.py (lines 179-189)

Purpose

Assign weights based on grammatical category.

Implementation

def postag(self, token):
    """
    Get POS weight for token.
    """
    POS_WEIGHTS = {
        "r": 0.3,   # Pronoun
        "c": 0.3,   # Conjunction
        "d": 0.3,   # Adverb
        "ns": 3.0,  # Place noun
        "nt": 3.0,  # Organization noun
        "n": 2.0,   # Common noun
    }

    # Get POS tag from tokenizer
    pos = self.tokenizer.tag(token)

    # Check for numeric patterns
    if re.match(r"^[\d.]+$", token):
        return 2.0

    return POS_WEIGHTS.get(pos, 1.0)

POS Weight Table

POS Weight Assignments:
┌───────┬────────┬─────────────────────┐
│ Tag   │ Weight │ Description         │
├───────┼────────┼─────────────────────┤
│ n     │ 2.0    │ Common noun         │
│ ns    │ 3.0    │ Place noun          │
│ nt    │ 3.0    │ Organization noun   │
│ v     │ 1.0    │ Verb                │
│ a     │ 1.0    │ Adjective           │
│ r     │ 0.3    │ Pronoun             │
│ c     │ 0.3    │ Conjunction         │
│ d     │ 0.3    │ Adverb              │
│ num   │ 2.0    │ Number              │
└───────┴────────┴─────────────────────┘
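
These POS multipliers combine with the NER weights and the IDF terms from section 3; a toy calculation with made-up IDF components:

# Hypothetical IDF components for one term; NER/POS multipliers from the tables above
idf_freq, idf_df = 5.0, 6.0
ner, pos = 3.0, 3.0  # e.g. a location name tagged "ns"

weight = (0.3 * idf_freq + 0.7 * idf_df) * ner * pos
print(round(weight, 2))  # 51.3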

6. Synonym Detection

File Location

/rag/nlp/synonym.py (lines 71-93)

Purpose

Query expansion via synonym lookup.

Implementation

import json
import re

from nltk.corpus import wordnet

class SynonymLookup:
    def __init__(self):
        # Load the custom synonym dictionary (term -> list of synonyms)
        self.custom_dict = json.load(open("rag/res/synonym.json", encoding="utf-8"))

    def lookup(self, token, top_n=8):
        """
        Find synonyms for token.

        Strategy:
        1. Check custom dictionary first
        2. Fall back to WordNet for English
        """
        # Custom dictionary
        if token in self.custom_dict:
            return self.custom_dict[token][:top_n]

        # WordNet for English words
        if re.match(r"^[a-z]+$", token.lower()):
            synonyms = set()
            for syn in wordnet.synsets(token):
                for lemma in syn.lemmas():
                    name = lemma.name().replace("_", " ")
                    if name.lower() != token.lower():
                        synonyms.add(name)

            return list(synonyms)[:top_n]

        return []

Synonym Sources

Synonym Lookup Strategy:

1. Custom Dictionary (highest priority)
   - Domain-specific synonyms
   - Chinese synonyms
   - Technical terms

2. WordNet (English only)
   - General synonyms
   - Lemma extraction from synsets

Example:
"computer" → WordNet → ["machine", "calculator", "computing device"]
"机器学习" → Custom → ["ML", "machine learning", "深度学习"]

7. Query Expansion

File Location

/rag/nlp/query.py (lines 85-218)

Purpose

Builds an expanded query with weighted terms and synonyms.

Implementation

class FulltextQueryer:
    QUERY_FIELDS = [
        "title_tks^10",        # Title: 10x boost
        "title_sm_tks^5",      # Title sub-tokens: 5x
        "important_kwd^30",    # Keywords: 30x
        "important_tks^20",    # Keyword tokens: 20x
        "question_tks^20",     # Question tokens: 20x
        "content_ltks^2",      # Content: 2x
        "content_sm_ltks^1",   # Content sub-tokens: 1x
    ]

    def question(self, text, min_match=0.6):
        """
        Build expanded query.
        """
        # 1. Tokenize
        tokens = self.tokenizer.tokenize(text)

        # 2. Get weights
        weighted_tokens = self.term_weight.weights(tokens)

        # 3. Get synonyms
        synonyms = [self.synonym.lookup(t) for t, _ in weighted_tokens]

        # 4. Build query string
        query_parts = []
        for (token, weight), syns in zip(weighted_tokens, synonyms):
            if syns:
                # Token with synonyms
                syn_str = " ".join(syns)
                query_parts.append(f"({token}^{weight:.4f} OR ({syn_str})^0.2)")
            else:
                query_parts.append(f"{token}^{weight:.4f}")

        # 5. Add phrase queries (bigrams)
        for i in range(1, len(weighted_tokens)):
            left, _ = weighted_tokens[i-1]
            right, w = weighted_tokens[i]
            query_parts.append(f'"{left} {right}"^{w*2:.4f}')

        return MatchTextExpr(
            query=" ".join(query_parts),
            fields=self.QUERY_FIELDS,
            min_match=f"{int(min_match * 100)}%"
        )

Query Expansion Example

Input: "machine learning tutorial"

After expansion:
(machine^0.35 OR (computer device)^0.2)
(learning^0.40 OR (study education)^0.2)
(tutorial^0.25 OR (guide lesson)^0.2)
"machine learning"^0.80
"learning tutorial"^0.50

With field boosting:
{
    "query_string": {
        "query": "(machine^0.35 learning^0.40 tutorial^0.25)",
        "fields": ["title_tks^10", "important_kwd^30", "content_ltks^2"],
        "minimum_should_match": "60%"
    }
}
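
The term and bigram clauses can be assembled by a small standalone helper, shown here with hypothetical weights and synonyms (mirroring steps 4-5 of question() above):

def build_query(weighted_tokens, synonyms):
    parts = []
    # Step 4: per-term clauses, optionally OR-ed with down-weighted synonyms
    for (token, weight), syns in zip(weighted_tokens, synonyms):
        if syns:
            parts.append(f"({token}^{weight:.4f} OR ({' '.join(syns)})^0.2)")
        else:
            parts.append(f"{token}^{weight:.4f}")
    # Step 5: bigram phrase clauses, boosted to twice the right token's weight
    for (left, _), (right, w) in zip(weighted_tokens, weighted_tokens[1:]):
        parts.append(f'"{left} {right}"^{w * 2:.4f}')
    return " ".join(parts)

print(build_query(
    [("machine", 0.35), ("learning", 0.40), ("tutorial", 0.25)],
    [["computer"], [], ["guide"]],
))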

8. Fine-Grained Tokenization

File Location

/rag/nlp/rag_tokenizer.py (lines 395-420)

Purpose

Secondary tokenization for compound words.

Implementation

def fine_grained_tokenize(self, text):
    """
    Break compound words into sub-tokens.
    """
    # First pass: standard tokenization
    tokens = self.tokenize(text)

    fine_tokens = []
    for token in tokens:
        # Skip short tokens
        if len(token) < 3:
            fine_tokens.append(token)
            continue

        # Try to break into sub-tokens
        sub_tokens = self.dfs_(token)
        if len(sub_tokens[0]) > 1:
            fine_tokens.extend(sub_tokens[0])
        else:
            fine_tokens.append(token)

    return fine_tokens

Example

Standard: "机器学习" → ["机器学习"]
Fine-grained: "机器学习" → ["机器", "学习"]

Standard: "人工智能" → ["人工智能"]
Fine-grained: "人工智能" → ["人工", "智能"]

Summary

┌──────────────────────┬──────────────────────┬──────────────────┐
│ Algorithm            │ Purpose              │ File             │
├──────────────────────┼──────────────────────┼──────────────────┤
│ Trie Tokenization    │ Word segmentation    │ rag_tokenizer.py │
│ Max-Forward/Backward │ Matching strategy    │ rag_tokenizer.py │
│ DFS + Memo           │ Disambiguation       │ rag_tokenizer.py │
│ TF-IDF               │ Term weighting       │ term_weight.py   │
│ NER                  │ Entity detection     │ term_weight.py   │
│ POS Tagging          │ Grammatical analysis │ term_weight.py   │
│ Synonym              │ Query expansion      │ synonym.py       │
│ Query Expansion      │ Search enhancement   │ query.py         │
│ Fine-grained         │ Sub-tokenization     │ rag_tokenizer.py │
└──────────────────────┴──────────────────────┴──────────────────┘

Key files:
  • /rag/nlp/rag_tokenizer.py - Tokenization
  • /rag/nlp/term_weight.py - TF-IDF, NER, POS
  • /rag/nlp/synonym.py - Synonym lookup
  • /rag/nlp/query.py - Query processing