# NLP Algorithms
## Overview
RAGFlow uses multiple NLP algorithms for tokenization, term weighting, and query processing.
## 1. Trie-based Tokenization
### File Location
```
/rag/nlp/rag_tokenizer.py (lines 72-90, 120-240)
```
### Purpose
Chinese word segmentation using a Trie data structure.
### Implementation
```python
import string

import datrie


class RagTokenizer:
    def __init__(self):
        # Trie alphabet: printable ASCII plus the CJK Unified Ideographs block
        self.trie = datrie.Trie(string.printable + "".join(
            chr(i) for i in range(0x4E00, 0x9FFF)  # CJK characters
        ))
        # Load (word, frequency, POS tag) entries from the huqie.txt dictionary
        for line in open("rag/res/huqie.txt"):
            word, freq, pos = line.strip().split("\t")
            self.trie[word] = (int(freq), pos)

    def _max_forward(self, text, start):
        """Max-forward matching: longest dictionary word starting at `start`."""
        end = len(text)
        while end > start:
            substr = text[start:end]
            if substr in self.trie:
                return substr, end
            end -= 1
        # No dictionary word matched: emit a single character
        return text[start], start + 1

    def _max_backward(self, text, end):
        """Max-backward matching: longest dictionary word ending at `end`."""
        start = 0
        while start < end:
            substr = text[start:end]
            if substr in self.trie:
                return substr, start
            start += 1
        # No dictionary word matched: emit a single character
        return text[end - 1], end - 1
```
### Trie Structure
```
Trie Data Structure (words: 机器, 机器学习, 机器人, 学习):

root
├── 机
│   └── 器            ← word: 机器
│       ├── 学
│       │   └── 习    ← word: 机器学习
│       └── 人        ← word: 机器人
└── 学
    └── 习            ← word: 学习

Lookup: O(m) where m = word length
Insert: O(m)
Space:  O(n × m) where n = number of words
```
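The two trie operations that matter for segmentation are prefix membership and prefix enumeration. A minimal sketch using the `datrie` package; the toy vocabulary and frequencies below are illustrative, not the shipped `huqie.txt`:

```python
# Toy demo of the trie lookups behind max-forward matching; the vocabulary
# below is made up for illustration and is not RAGFlow's huqie.txt.
import datrie

# The alphabet must cover every character that can appear in a key.
trie = datrie.Trie("".join(chr(i) for i in range(0x4E00, 0x9FFF)))
for word, freq in [("机器", 1000), ("机器学习", 800), ("机器人", 600), ("学习", 500)]:
    trie[word] = freq

# Every dictionary word that prefixes the text; the longest one is the
# max-forward match at position 0.
print(trie.prefixes("机器学习是人工智能"))  # e.g. ['机器', '机器学习']
# Every dictionary word sharing a prefix (an O(m) descent to the prefix node).
print(trie.keys("机器"))                    # e.g. ['机器', '机器人', '机器学习']
```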
### Max-Forward/Backward Algorithm
```
Max-Forward Matching:
Input: "机器学习是人工智能"
Step 1: Try "机器学习是人工智能" → Not found
Step 2: Try "机器学习是人工智"   → Not found
...
Step 6: Try "机器学习"           → Found!
Output: ["机器学习", ...] (continue matching from position 4)

Max-Backward Matching:
Input: "机器学习"
Step 1: Try "机器学习" from the end → Found!
Output: ["机器学习"]
```
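Chaining max-forward matches across the sentence yields the full segmentation. A self-contained sketch, with a plain `set` standing in for the trie (toy vocabulary, illustrative only):

```python
# Repeated max-forward matching over a sentence; a plain set stands in
# for the trie, and `vocab` is toy data for illustration.
vocab = {"机器", "机器学习", "学习", "人工", "智能", "人工智能"}

def max_forward_segment(text, vocab):
    tokens, start = [], 0
    while start < len(text):
        # Shrink the window from the right until a dictionary word matches.
        end = len(text)
        while end > start + 1 and text[start:end] not in vocab:
            end -= 1
        tokens.append(text[start:end])  # falls back to a single character
        start = end
    return tokens

print(max_forward_segment("机器学习是人工智能", vocab))
# ['机器学习', '是', '人工智能']
```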
---
## 2. DFS with Memoization (Disambiguation)
### File Location
```
/rag/nlp/rag_tokenizer.py (lines 120-210)
```
### Purpose
Resolves ambiguity when there are multiple possible tokenizations.
### Implementation
```python
def dfs_(self, text, memo=None):
    """
    DFS with memoization for tokenization disambiguation.
    """
    if memo is None:  # avoid the mutable-default-argument pitfall
        memo = {}
    if text in memo:
        return memo[text]
    if not text:
        return [[]]
    # Enumerate every split whose prefix is a dictionary word;
    # single characters are always allowed as a fallback
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in self.trie or len(prefix) == 1:
            for suffix in self.dfs_(text[end:], memo):
                results.append([prefix] + suffix)
    # Score candidates and memoize only the best tokenization
    best = max(results, key=lambda x: self.score_(x))
    memo[text] = [best]
    return [best]

def score_(self, tokens):
    """
    Score tokenization quality.
    Formula: score = B/n + L + F
    where:
        B = 30 (bonus favoring fewer tokens)
        n = number of tokens
        L = coverage ratio (sum of token lengths / total length)
        F = sum of word frequencies
    """
    B = 30
    # A complete tokenization always covers the text, so L is 1.0 here
    # and only B/n and F differentiate candidates.
    L = 1.0
    F = sum(self.trie[t][0] if t in self.trie else 1 for t in tokens)
    return B / len(tokens) + L + F
```
### Scoring Function
```
Tokenization Scoring:
score(tokens) = B/n + L + F
where:
- B = 30 (base bonus)
- n = number of tokens (fewer is better)
- L = coverage ratio
- F = sum of word frequencies (common words preferred)
Example:
"北京大学" →
Option 1: ["北京", "大学"] → score = 30/2 + 1.0 + (1000+500) = 1516
Option 2: ["北", "京", "大", "学"] → score = 30/4 + 1.0 + (10+10+10+10) = 48.5
Winner: Option 1
```
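A quick check of the arithmetic in this example; the frequencies are the illustrative values above, not real dictionary counts:

```python
# Reproducing the worked example: frequencies 1000/500 for the two-token
# split, 10 per single character. All values are illustrative.
def score(tokens, freqs, B=30):
    L = 1.0  # full tokenizations always cover the text
    F = sum(freqs[t] for t in tokens)
    return B / len(tokens) + L + F

freqs = {"北京": 1000, "大学": 500, "北": 10, "京": 10, "大": 10, "学": 10}
print(score(["北京", "大学"], freqs))          # 1516.0
print(score(["北", "京", "大", "学"], freqs))  # 48.5
```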
---
## 3. TF-IDF Term Weighting
### File Location
```
/rag/nlp/term_weight.py (lines 162-244)
```
### Purpose
Computes an importance weight for each term in the query.
### Implementation
```python
import math

import numpy as np


class Dealer:
    def weights(self, tokens, preprocess=True):
        """
        Calculate TF-IDF-based weights for tokens.
        """
        def idf(s, N):
            # Smoothed IDF, kept positive by the +10 inside the log
            return math.log10(10 + ((N - s + 0.5) / (s + 0.5)))

        # IDF based on term frequency in the corpus (N = 10^7)
        idf1 = np.array([idf(self.freq(t), 10000000) for t in tokens])
        # IDF based on document frequency (N = 10^9)
        idf2 = np.array([idf(self.df(t), 1000000000) for t in tokens])
        # NER and POS weights
        ner_weights = np.array([self.ner(t) for t in tokens])
        pos_weights = np.array([self.postag(t) for t in tokens])
        # Combined weight: blend the two IDFs, then scale by NER and POS
        weights = (0.3 * idf1 + 0.7 * idf2) * ner_weights * pos_weights
        # Normalize so the weights sum to 1
        total = np.sum(weights)
        return [(t, w / total) for t, w in zip(tokens, weights)]
```
### Formula
```
TF-IDF Variant:
IDF(term) = log₁₀(10 + (N - df + 0.5) / (df + 0.5))
where:
- N = total documents (10⁹ for df, 10⁷ for freq)
- df = document frequency of term
Combined Weight:
weight(term) = (0.3 × IDF_freq + 0.7 × IDF_df) × NER × POS
Normalization:
normalized_weight(term) = weight(term) / Σ weight(all_terms)
```
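A worked instance of these formulas, showing how the smoothed IDF rewards rare terms and how the blend combines with the NER/POS multipliers; all df values and multipliers below are illustrative:

```python
# Worked example of the weighting formulas; the df/N values and the
# NER/POS multipliers are illustrative, not taken from RAGFlow resources.
import math

def idf(df, N):
    return math.log10(10 + (N - df + 0.5) / (df + 0.5))

idf_freq = idf(5_000, 10_000_000)     # term-frequency-based IDF   ≈ 3.30
idf_df = idf(80_000, 1_000_000_000)   # document-frequency-based IDF ≈ 4.10
ner, pos = 3.0, 2.0                   # e.g. a corporation name, common noun
weight = (0.3 * idf_freq + 0.7 * idf_df) * ner * pos
print(round(weight, 2))               # ≈ 23.15, before normalization
```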
---
## 4. Named Entity Recognition (NER)
### File Location
```
/rag/nlp/term_weight.py (lines 84-86, 144-149)
```
### Purpose
Dictionary-based entity type detection with weight assignment.
### Implementation
```python
import json


class Dealer:
    # Weight per entity category (higher = more important in ranking)
    NER_WEIGHTS = {
        "toxic": 2.0,    # Toxic/sensitive words
        "func": 1.0,     # Functional words
        "corp": 3.0,     # Corporation names
        "loca": 3.0,     # Location names
        "sch": 3.0,      # School names
        "stock": 3.0,    # Stock symbols
        "firstnm": 1.0,  # First names
    }

    def __init__(self):
        # Load the NER dictionary: entity type -> collection of surface forms
        self.ner_dict = json.load(open("rag/res/ner.json"))

    def ner(self, token):
        """
        Get the NER weight for a token (1.0 if it matches no category).
        """
        for entity_type, weight in self.NER_WEIGHTS.items():
            if token in self.ner_dict.get(entity_type, ()):
                return weight
        return 1.0  # Default
```
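Usage sketch: with a small inline dictionary standing in for `ner.json` (made-up entries), a categorized token returns its category weight and everything else defaults to 1.0:

```python
# Inline stand-in for rag/res/ner.json; entries are made up for illustration.
dealer = Dealer.__new__(Dealer)  # bypass __init__ so no resource file is needed
dealer.ner_dict = {"corp": ["Microsoft", "Google"], "loca": ["Beijing"]}
print(dealer.ner("Google"))   # 3.0 (corporation name)
print(dealer.ner("hello"))    # 1.0 (default weight)
```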
### Entity Types
```
NER Categories:
┌──────────┬────────┬─────────────────────────────┐
│ Type │ Weight │ Examples │
├──────────┼────────┼─────────────────────────────┤
│ corp │ 3.0 │ Microsoft, Google, Apple │
│ loca │ 3.0 │ Beijing, New York │
│ sch │ 3.0 │ MIT, Stanford │
│ stock │ 3.0 │ AAPL, GOOG │
│ toxic │ 2.0 │ (sensitive words) │
│ func │ 1.0 │ the, is, are │
│ firstnm │ 1.0 │ John, Mary │
└──────────┴────────┴─────────────────────────────┘
```
---
## 5. Part-of-Speech (POS) Tagging
### File Location
```
/rag/nlp/term_weight.py (lines 179-189)
```
### Purpose
Assign weights based on grammatical category.
### Implementation
```python
import re

def postag(self, token):
    """
    Get the POS weight for a token (method of Dealer).
    """
    POS_WEIGHTS = {
        "r": 0.3,   # Pronoun
        "c": 0.3,   # Conjunction
        "d": 0.3,   # Adverb
        "ns": 3.0,  # Place noun
        "nt": 3.0,  # Organization noun
        "n": 2.0,   # Common noun
    }
    # Numeric tokens get a fixed weight regardless of POS tag
    if re.match(r"^[\d.]+$", token):
        return 2.0
    # Look up the tokenizer's POS tag in the weight table
    pos = self.tokenizer.tag(token)
    return POS_WEIGHTS.get(pos, 1.0)
```
### POS Weight Table
```
POS Weight Assignments:
┌───────┬────────┬─────────────────────┐
│ Tag │ Weight │ Description │
├───────┼────────┼─────────────────────┤
│ n │ 2.0 │ Common noun │
│ ns │ 3.0 │ Place noun │
│ nt │ 3.0 │ Organization noun │
│ v │ 1.0 │ Verb │
│ a │ 1.0 │ Adjective │
│ r │ 0.3 │ Pronoun │
│ c │ 0.3 │ Conjunction │
│ d │ 0.3 │ Adverb │
│ num │ 2.0 │ Number │
└───────┴────────┴─────────────────────┘
```
---
## 6. Synonym Detection
### File Location
```
/rag/nlp/synonym.py (lines 71-93)
```
### Purpose
Query expansion via synonym lookup.
### Implementation
```python
import json
import re

from nltk.corpus import wordnet


class SynonymLookup:
    def __init__(self):
        # Load the custom synonym dictionary: token -> list of synonyms
        self.custom_dict = json.load(open("rag/res/synonym.json"))

    def lookup(self, token, top_n=8):
        """
        Find synonyms for a token.

        Strategy:
        1. Check the custom dictionary first
        2. Fall back to WordNet for English words
        """
        # Custom dictionary (domain-specific, highest priority)
        if token in self.custom_dict:
            return self.custom_dict[token][:top_n]
        # WordNet for purely alphabetic English words
        if re.match(r"^[a-z]+$", token.lower()):
            synonyms = set()
            for syn in wordnet.synsets(token):
                for lemma in syn.lemmas():
                    name = lemma.name().replace("_", " ")
                    if name.lower() != token.lower():
                        synonyms.add(name)
            return list(synonyms)[:top_n]
        return []
```
### Synonym Sources
```
Synonym Lookup Strategy:
1. Custom Dictionary (highest priority)
- Domain-specific synonyms
- Chinese synonyms
- Technical terms
2. WordNet (English only)
- General synonyms
- Lemma extraction from synsets
Example:
"computer" → WordNet → ["machine", "calculator", "computing device"]
"机器学习" → Custom → ["ML", "machine learning", "深度学习"]
```
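The WordNet branch can be exercised directly with NLTK's corpus API (requires a one-time `nltk.download("wordnet")`); the printed list is indicative, as synset contents vary by WordNet version:

```python
# Standalone demo of the WordNet fallback using NLTK's real corpus API.
# Run nltk.download("wordnet") once before using it.
from nltk.corpus import wordnet

synonyms = set()
for syn in wordnet.synsets("computer"):
    for lemma in syn.lemmas():
        name = lemma.name().replace("_", " ")
        if name.lower() != "computer":
            synonyms.add(name)
print(sorted(synonyms)[:8])
# e.g. ['calculator', 'computing device', 'computing machine', ...]
```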
---
## 7. Query Expansion
### File Location
```
/rag/nlp/query.py (lines 85-218)
```
### Purpose
Builds an expanded query with weighted terms and synonyms.
### Implementation
```python
class FulltextQueryer:
    # Field boosts: a match in important_kwd counts 30x a plain content match
    QUERY_FIELDS = [
        "title_tks^10",       # Title: 10x boost
        "title_sm_tks^5",     # Title sub-tokens: 5x
        "important_kwd^30",   # Keywords: 30x
        "important_tks^20",   # Keyword tokens: 20x
        "question_tks^20",    # Question tokens: 20x
        "content_ltks^2",     # Content: 2x
        "content_sm_ltks^1",  # Content sub-tokens: 1x
    ]

    def question(self, text, min_match=0.6):
        """
        Build an expanded query from the user's question.
        """
        # 1. Tokenize
        tokens = self.tokenizer.tokenize(text)
        # 2. Get per-term weights
        weighted_tokens = self.term_weight.weights(tokens)
        # 3. Get synonyms for each term
        synonyms = [self.synonym.lookup(t) for t, _ in weighted_tokens]
        # 4. Build the query string
        query_parts = []
        for (token, weight), syns in zip(weighted_tokens, synonyms):
            if syns:
                # Token OR its synonyms (synonyms get a low fixed boost)
                syn_str = " ".join(syns)
                query_parts.append(f"({token}^{weight:.4f} OR ({syn_str})^0.2)")
            else:
                query_parts.append(f"{token}^{weight:.4f}")
        # 5. Add phrase queries over adjacent token pairs (bigrams)
        for i in range(1, len(weighted_tokens)):
            left, _ = weighted_tokens[i - 1]
            right, w = weighted_tokens[i]
            query_parts.append(f'"{left} {right}"^{w * 2:.4f}')
        # MatchTextExpr is RAGFlow's internal full-text match expression
        return MatchTextExpr(
            query=" ".join(query_parts),
            fields=self.QUERY_FIELDS,
            min_match=f"{int(min_match * 100)}%",
        )
```
### Query Expansion Example
```
Input: "machine learning tutorial"
After expansion:
(machine^0.35 OR (computer device)^0.2)
(learning^0.40 OR (study education)^0.2)
(tutorial^0.25 OR (guide lesson)^0.2)
"machine learning"^0.80
"learning tutorial"^0.50
With field boosting:
{
"query_string": {
"query": "(machine^0.35 learning^0.40 tutorial^0.25)",
"fields": ["title_tks^10", "important_kwd^30", "content_ltks^2"],
"minimum_should_match": "60%"
}
}
```
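The string assembly itself is plain formatting. A small sketch that reproduces the expansion above from pre-computed (token, weight) pairs and synonym lists; all inputs are illustrative:

```python
# Assembling the expanded query string: weighted terms, synonym OR-groups,
# and boosted bigram phrases. All inputs below are illustrative.
weighted = [("machine", 0.35), ("learning", 0.40), ("tutorial", 0.25)]
synonyms = {
    "machine": ["computer", "device"],
    "learning": ["study", "education"],
    "tutorial": ["guide", "lesson"],
}

parts = []
for token, w in weighted:
    syns = synonyms.get(token)
    if syns:
        parts.append(f"({token}^{w:.4f} OR ({' '.join(syns)})^0.2)")
    else:
        parts.append(f"{token}^{w:.4f}")
for (left, _), (right, w) in zip(weighted, weighted[1:]):
    parts.append(f'"{left} {right}"^{w * 2:.4f}')  # bigram phrase boost
print(" ".join(parts))
# (machine^0.3500 OR (computer device)^0.2) ... "machine learning"^0.8000 ...
```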
---
## 8. Fine-Grained Tokenization
### File Location
```
/rag/nlp/rag_tokenizer.py (lines 395-420)
```
### Purpose
Secondary tokenization for compound words.
### Implementation
```python
def fine_grained_tokenize(self, text):
    """
    Break compound words into sub-tokens (method of RagTokenizer).
    """
    # First pass: standard tokenization
    tokens = self.tokenize(text)
    fine_tokens = []
    for token in tokens:
        # Short tokens cannot be usefully split further
        if len(token) < 3:
            fine_tokens.append(token)
            continue
        # Second pass: re-run the DFS splitter on the compound token
        sub_tokens = self.dfs_(token)
        if len(sub_tokens[0]) > 1:
            # The token splits into multiple dictionary words: keep the parts
            fine_tokens.extend(sub_tokens[0])
        else:
            fine_tokens.append(token)
    return fine_tokens
```
### Example
```
Standard: "机器学习" → ["机器学习"]
Fine-grained: "机器学习" → ["机器", "学习"]
Standard: "人工智能" → ["人工智能"]
Fine-grained: "人工智能" → ["人工", "智能"]
```
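A self-contained sketch of the two-pass idea with a toy vocabulary; the fixed two-character split below is a deliberately naive stand-in for the DFS splitter:

```python
# Two-pass fine-grained tokenization, simplified: long tokens are
# re-segmented, and the split is kept only if every piece is a dictionary
# word. Toy vocabulary; the 2-char split stands in for dfs_().
vocab = {"机器", "学习", "人工", "智能"}

def fine_grained(tokens):
    out = []
    for tok in tokens:
        parts = [tok[i:i + 2] for i in range(0, len(tok), 2)]
        if len(tok) >= 3 and all(p in vocab for p in parts):
            out.extend(parts)   # sub-tokens are all dictionary words
        else:
            out.append(tok)     # keep the original token
    return out

print(fine_grained(["机器学习", "是", "人工智能"]))
# ['机器', '学习', '是', '人工', '智能']
```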
---
## Summary
| Algorithm | Purpose | File |
|-----------|---------|------|
| Trie Tokenization | Word segmentation | rag_tokenizer.py |
| Max-Forward/Backward | Matching strategy | rag_tokenizer.py |
| DFS + Memo | Disambiguation | rag_tokenizer.py |
| TF-IDF | Term weighting | term_weight.py |
| NER | Entity detection | term_weight.py |
| POS Tagging | Grammatical analysis | term_weight.py |
| Synonym | Query expansion | synonym.py |
| Query Expansion | Search enhancement | query.py |
| Fine-grained | Sub-tokenization | rag_tokenizer.py |
## Related Files
- `/rag/nlp/rag_tokenizer.py` - Tokenization
- `/rag/nlp/term_weight.py` - TF-IDF, NER, POS
- `/rag/nlp/synonym.py` - Synonym lookup
- `/rag/nlp/query.py` - Query processing