# NLP Algorithms

## Overview

RAGFlow uses multiple NLP algorithms for tokenization, term weighting, and query processing.
## 1. Trie-based Tokenization

### File Location

/rag/nlp/rag_tokenizer.py (lines 72-90, 120-240)

### Purpose

Chinese word segmentation using a Trie data structure.
### Implementation

```python
import string

import datrie


class RagTokenizer:
    def __init__(self):
        # Build a Trie over printable ASCII plus the CJK character range.
        self.trie = datrie.Trie(string.printable + "".join(
            chr(i) for i in range(0x4E00, 0x9FFF)  # CJK characters
        ))
        # Load (word, frequency, POS tag) entries from the huqie.txt dictionary.
        for line in open("rag/res/huqie.txt"):
            word, freq, pos = line.strip().split("\t")
            self.trie[word] = (int(freq), pos)

    def _max_forward(self, text, start):
        """Max-forward matching: longest dictionary word starting at `start`."""
        end = len(text)
        while end > start:
            substr = text[start:end]
            if substr in self.trie:
                return substr, end
            end -= 1
        # No dictionary hit: emit the single character at `start`.
        return text[start], start + 1

    def _max_backward(self, text, end):
        """Max-backward matching: longest dictionary word ending at `end`."""
        start = 0
        while start < end:
            substr = text[start:end]
            if substr in self.trie:
                return substr, start
            start += 1
        # No dictionary hit: emit the single character before `end`.
        return text[end - 1], end - 1
```
### Trie Structure

```
Words: 机器, 机器学习, 机器人, 学习

root
├── 机
│   └── 器          (word: 机器)
│       ├── 学
│       │   └── 习  (word: 机器学习)
│       └── 人      (word: 机器人)
└── 学
    └── 习          (word: 学习)
```

Lookup: O(m), where m = word length
Insert: O(m)
Space: O(n × m), where n = number of words
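The dictionary above can be reproduced directly with the `datrie` package used by the tokenizer. This is a minimal sketch; the words and frequencies are illustrative, not taken from huqie.txt.

```python
import datrie

# Alphabet restricted to the CJK range used by the example words.
trie = datrie.Trie("".join(chr(i) for i in range(0x4E00, 0x9FFF)))

# Illustrative (word, frequency, POS) entries; real entries come from huqie.txt.
for word, freq in [("机器", 1000), ("机器学习", 800), ("机器人", 600), ("学习", 1200)]:
    trie[word] = (freq, "n")

print("机器学习" in trie)           # True -- O(m) lookup, m = word length
print(trie.prefixes("机器学习是"))  # ['机器', '机器学习'] -- dictionary prefixes of the text
```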
### Max-Forward/Backward Algorithm

```
Max-Forward Matching:
  Input: "机器学习是人工智能"
  Step 1: Try "机器学习是人工智能" → Not found
  Step 2: Try "机器学习是人工"     → Not found
  ...
  Step n: Try "机器学习"           → Found!
  Output: ["机器学习", ...]

Max-Backward Matching:
  Input: "机器学习"
  Step 1: Try "机器学习" from end → Found!
  Output: ["机器学习"]
```
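The same greedy longest-match idea can be sketched without the trie, using a plain set as the vocabulary. The `max_forward_segment` helper below is hypothetical and for illustration only.

```python
# Illustrative in-memory vocabulary.
VOCAB = {"机器", "机器学习", "学习", "人工智能"}

def max_forward_segment(text, vocab, max_len=8):
    tokens, start = [], 0
    while start < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        end = min(len(text), start + max_len)
        while end > start + 1 and text[start:end] not in vocab:
            end -= 1
        tokens.append(text[start:end])
        start = end
    return tokens

print(max_forward_segment("机器学习是人工智能", VOCAB))
# ['机器学习', '是', '人工智能']
```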
## 2. DFS with Memoization (Disambiguation)

### File Location

/rag/nlp/rag_tokenizer.py (lines 120-210)

### Purpose

Resolve ambiguity when multiple tokenizations are possible.

### Implementation
```python
def dfs_(self, text, memo={}):
    """DFS with memoization for tokenization disambiguation."""
    # NOTE: the mutable default `memo={}` persists across calls, acting as a cache.
    if text in memo:
        return memo[text]
    if not text:
        return [[]]

    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        # Extend with any dictionary word (or a single character as fallback).
        if prefix in self.trie or len(prefix) == 1:
            suffix_results = self.dfs_(text[end:], memo)
            for suffix in suffix_results:
                results.append([prefix] + suffix)

    # Score every candidate segmentation and keep the best one.
    best = max(results, key=lambda x: self.score_(x))
    memo[text] = [best]
    return [best]

def score_(self, tokens):
    """
    Score tokenization quality.

    Formula: score = B/len(tokens) + L + F
    where:
        B = 30 (bonus for fewer tokens)
        L = coverage ratio (sum of token lengths / total length;
            1.0 for a complete segmentation)
        F = sum of dictionary frequencies
    """
    B = 30
    L = sum(len(t) for t in tokens) / max(1, sum(len(t) for t in tokens))
    F = sum(self.trie.get(t, (1, ''))[0] for t in tokens)
    return B / len(tokens) + L + F
```
### Scoring Function

```
Tokenization Scoring:
  score(tokens) = B/n + L + F
  where:
    - B = 30 (base bonus)
    - n = number of tokens (fewer is better)
    - L = coverage ratio
    - F = sum of word frequencies (common words preferred)

Example: "北京大学"
  Option 1: ["北京", "大学"]        → score = 30/2 + 1.0 + (1000 + 500)    = 1516
  Option 2: ["北", "京", "大", "学"] → score = 30/4 + 1.0 + (10+10+10+10) = 48.5
  Winner: Option 1
```
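A quick way to check the arithmetic above, with the frequencies from the example (北京=1000, 大学=500, single characters=10) as assumed dictionary entries:

```python
# Assumed frequencies matching the worked example above.
FREQ = {"北京": 1000, "大学": 500, "北": 10, "京": 10, "大": 10, "学": 10}

def score(tokens, text_len, B=30):
    L = sum(len(t) for t in tokens) / text_len   # coverage ratio
    F = sum(FREQ.get(t, 1) for t in tokens)      # frequency sum
    return B / len(tokens) + L + F

print(score(["北京", "大学"], 4))           # 30/2 + 1.0 + 1500 = 1516.0
print(score(["北", "京", "大", "学"], 4))   # 30/4 + 1.0 + 40   = 48.5
```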
## 3. TF-IDF Term Weighting

### File Location

/rag/nlp/term_weight.py (lines 162-244)

### Purpose

Compute an importance weight for each term in the query.

### Implementation
```python
import math

import numpy as np


class Dealer:
    def weights(self, tokens, preprocess=True):
        """Calculate TF-IDF based weights for tokens."""
        def idf(s, N):
            return math.log10(10 + ((N - s + 0.5) / (s + 0.5)))

        # IDF based on term frequency in the corpus
        idf1 = np.array([idf(self.freq(t), 10000000) for t in tokens])
        # IDF based on document frequency
        idf2 = np.array([idf(self.df(t), 1000000000) for t in tokens])

        # NER and POS weights
        ner_weights = np.array([self.ner(t) for t in tokens])
        pos_weights = np.array([self.postag(t) for t in tokens])

        # Combined weight
        weights = (0.3 * idf1 + 0.7 * idf2) * ner_weights * pos_weights

        # Normalize so the weights sum to 1
        total = np.sum(weights)
        return [(t, w / total) for t, w in zip(tokens, weights)]
```
### Formula

```
TF-IDF Variant:
  IDF(term) = log₁₀(10 + (N - df + 0.5) / (df + 0.5))
  where:
    - N  = total documents (10⁹ for the document-frequency IDF, 10⁷ for the corpus-frequency IDF)
    - df = document frequency of the term

Combined Weight:
  weight(term) = (0.3 × IDF_freq + 0.7 × IDF_df) × NER × POS

Normalization:
  normalized_weight(term) = weight(term) / Σ weight(all_terms)
```
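A standalone sketch of the same formulas with assumed term counts; only `math.log10` is used, no RAGFlow internals. The counts, NER, and POS values are illustrative.

```python
import math

def idf(s, N):
    return math.log10(10 + (N - s + 0.5) / (s + 0.5))

freq, df = 50_000, 120_000        # assumed counts for one term
idf_freq = idf(freq, 10**7)       # corpus-frequency IDF
idf_df = idf(df, 10**9)           # document-frequency IDF
ner, pos = 3.0, 2.0               # e.g. a corporation name that is a noun

weight = (0.3 * idf_freq + 0.7 * idf_df) * ner * pos
print(idf_freq, idf_df, weight)
```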
## 4. Named Entity Recognition (NER)

### File Location

/rag/nlp/term_weight.py (lines 84-86, 144-149)

### Purpose

Dictionary-based entity type detection with weight assignment.

### Implementation
```python
import json


class Dealer:
    def __init__(self):
        # Load the NER dictionary (entity type → surface forms)
        self.ner_dict = json.load(open("rag/res/ner.json"))

    def ner(self, token):
        """Get the NER weight for a token."""
        NER_WEIGHTS = {
            "toxic": 2.0,    # Toxic/sensitive words
            "func": 1.0,     # Functional words
            "corp": 3.0,     # Corporation names
            "loca": 3.0,     # Location names
            "sch": 3.0,      # School names
            "stock": 3.0,    # Stock symbols
            "firstnm": 1.0,  # First names
        }
        for entity_type, weight in NER_WEIGHTS.items():
            if token in self.ner_dict.get(entity_type, set()):
                return weight
        return 1.0  # Default
```
### Entity Types

| Type    | Weight | Examples                 |
|---------|--------|--------------------------|
| corp    | 3.0    | Microsoft, Google, Apple |
| loca    | 3.0    | Beijing, New York        |
| sch     | 3.0    | MIT, Stanford            |
| stock   | 3.0    | AAPL, GOOG               |
| toxic   | 2.0    | (sensitive words)        |
| func    | 1.0    | the, is, are             |
| firstnm | 1.0    | John, Mary               |
## 5. Part-of-Speech (POS) Tagging

### File Location

/rag/nlp/term_weight.py (lines 179-189)

### Purpose

Assign weights based on grammatical category.

### Implementation
```python
import re


def postag(self, token):
    """Get the POS weight for a token."""
    POS_WEIGHTS = {
        "r": 0.3,   # Pronoun
        "c": 0.3,   # Conjunction
        "d": 0.3,   # Adverb
        "ns": 3.0,  # Place noun
        "nt": 3.0,  # Organization noun
        "n": 2.0,   # Common noun
    }
    # Get the POS tag from the tokenizer
    pos = self.tokenizer.tag(token)
    # Purely numeric tokens get a fixed boost
    if re.match(r"^[\d.]+$", token):
        return 2.0
    return POS_WEIGHTS.get(pos, 1.0)
```
### POS Weight Table

| Tag | Weight | Description       |
|-----|--------|-------------------|
| n   | 2.0    | Common noun       |
| ns  | 3.0    | Place noun        |
| nt  | 3.0    | Organization noun |
| v   | 1.0    | Verb              |
| a   | 1.0    | Adjective         |
| r   | 0.3    | Pronoun           |
| c   | 0.3    | Conjunction       |
| d   | 0.3    | Adverb            |
| num | 2.0    | Number            |
## 6. Synonym Detection

### File Location

/rag/nlp/synonym.py (lines 71-93)

### Purpose

Query expansion via synonym lookup.

### Implementation
```python
import json
import re

from nltk.corpus import wordnet


class SynonymLookup:
    def __init__(self):
        # Load the custom synonym dictionary
        self.custom_dict = json.load(open("rag/res/synonym.json"))

    def lookup(self, token, top_n=8):
        """
        Find synonyms for a token.

        Strategy:
            1. Check the custom dictionary first
            2. Fall back to WordNet for English words
        """
        # Custom dictionary
        if token in self.custom_dict:
            return self.custom_dict[token][:top_n]

        # WordNet for English words
        if re.match(r"^[a-z]+$", token.lower()):
            synonyms = set()
            for syn in wordnet.synsets(token):
                for lemma in syn.lemmas():
                    name = lemma.name().replace("_", " ")
                    if name.lower() != token.lower():
                        synonyms.add(name)
            return list(synonyms)[:top_n]

        return []
```
### Synonym Sources

Synonym lookup strategy:

1. Custom dictionary (highest priority)
   - Domain-specific synonyms
   - Chinese synonyms
   - Technical terms
2. WordNet (English only)
   - General synonyms
   - Lemma extraction from synsets

Example:

```
"computer"  → WordNet → ["machine", "calculator", "computing device"]
"机器学习"  → Custom  → ["ML", "machine learning", "深度学习"]
```
## 7. Query Expansion

### File Location

/rag/nlp/query.py (lines 85-218)

### Purpose

Build an expanded query with weighted terms and synonyms.

### Implementation
```python
class FulltextQueryer:
    QUERY_FIELDS = [
        "title_tks^10",       # Title: 10x boost
        "title_sm_tks^5",     # Title sub-tokens: 5x
        "important_kwd^30",   # Keywords: 30x
        "important_tks^20",   # Keyword tokens: 20x
        "question_tks^20",    # Question tokens: 20x
        "content_ltks^2",     # Content: 2x
        "content_sm_ltks^1",  # Content sub-tokens: 1x
    ]

    def question(self, text, min_match=0.6):
        """Build an expanded query."""
        # 1. Tokenize
        tokens = self.tokenizer.tokenize(text)

        # 2. Get term weights
        weighted_tokens = self.term_weight.weights(tokens)

        # 3. Get synonyms
        synonyms = [self.synonym.lookup(t) for t, _ in weighted_tokens]

        # 4. Build the query string
        query_parts = []
        for (token, weight), syns in zip(weighted_tokens, synonyms):
            if syns:
                # Token with synonyms
                syn_str = " ".join(syns)
                query_parts.append(f"({token}^{weight:.4f} OR ({syn_str})^0.2)")
            else:
                query_parts.append(f"{token}^{weight:.4f}")

        # 5. Add phrase queries (bigrams)
        for i in range(1, len(weighted_tokens)):
            left, _ = weighted_tokens[i - 1]
            right, w = weighted_tokens[i]
            query_parts.append(f'"{left} {right}"^{w * 2:.4f}')

        return MatchTextExpr(
            query=" ".join(query_parts),
            fields=self.QUERY_FIELDS,
            min_match=f"{int(min_match * 100)}%",
        )
```
### Query Expansion Example

Input: "machine learning tutorial"

After expansion:

```
(machine^0.35 OR (computer device)^0.2)
(learning^0.40 OR (study education)^0.2)
(tutorial^0.25 OR (guide lesson)^0.2)
"machine learning"^0.80
"learning tutorial"^0.50
```

With field boosting:

```json
{
  "query_string": {
    "query": "(machine^0.35 learning^0.40 tutorial^0.25)",
    "fields": ["title_tks^10", "important_kwd^30", "content_ltks^2"],
    "minimum_should_match": "60%"
  }
}
```
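The expanded string above can be reproduced with a few lines that mirror steps 4-5 of `question()`, using the illustrative weights and synonyms from the example and skipping the `MatchTextExpr` wrapper.

```python
# Illustrative weighted tokens and synonyms (not produced by RAGFlow here).
weighted = [("machine", 0.35), ("learning", 0.40), ("tutorial", 0.25)]
synonyms = {"machine": ["computer", "device"],
            "learning": ["study", "education"],
            "tutorial": ["guide", "lesson"]}

parts = []
for token, w in weighted:
    syns = synonyms.get(token, [])
    if syns:
        parts.append(f"({token}^{w:.2f} OR ({' '.join(syns)})^0.2)")
    else:
        parts.append(f"{token}^{w:.2f}")

# Bigram phrase boosts between adjacent tokens (2x the right token's weight).
for (left, _), (right, w) in zip(weighted, weighted[1:]):
    parts.append(f'"{left} {right}"^{w * 2:.2f}')

print(" ".join(parts))
```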
## 8. Fine-Grained Tokenization

### File Location

/rag/nlp/rag_tokenizer.py (lines 395-420)

### Purpose

Secondary tokenization for compound words.

### Implementation
```python
def fine_grained_tokenize(self, text):
    """Break compound words into sub-tokens."""
    # First pass: standard tokenization
    tokens = self.tokenize(text)

    fine_tokens = []
    for token in tokens:
        # Skip short tokens
        if len(token) < 3:
            fine_tokens.append(token)
            continue

        # Try to break the token into sub-tokens
        sub_tokens = self.dfs_(token)
        if len(sub_tokens[0]) > 1:
            fine_tokens.extend(sub_tokens[0])
        else:
            fine_tokens.append(token)

    return fine_tokens
```
### Example

```
Standard:     "机器学习" → ["机器学习"]
Fine-grained: "机器学习" → ["机器", "学习"]

Standard:     "人工智能" → ["人工智能"]
Fine-grained: "人工智能" → ["人工", "智能"]
```
## Summary
| Algorithm | Purpose | File |
|---|---|---|
| Trie Tokenization | Word segmentation | rag_tokenizer.py |
| Max-Forward/Backward | Matching strategy | rag_tokenizer.py |
| DFS + Memo | Disambiguation | rag_tokenizer.py |
| TF-IDF | Term weighting | term_weight.py |
| NER | Entity detection | term_weight.py |
| POS Tagging | Grammatical analysis | term_weight.py |
| Synonym | Query expansion | synonym.py |
| Query Expansion | Search enhancement | query.py |
| Fine-grained | Sub-tokenization | rag_tokenizer.py |
## Related Files

- /rag/nlp/rag_tokenizer.py - Tokenization
- /rag/nlp/term_weight.py - TF-IDF, NER, POS
- /rag/nlp/synonym.py - Synonym lookup
- /rag/nlp/query.py - Query processing