# NLP Algorithms ## Tong Quan RAGFlow sử dụng multiple NLP algorithms cho tokenization, term weighting, và query processing. ## 1. Trie-based Tokenization ### File Location ``` /rag/nlp/rag_tokenizer.py (lines 72-90, 120-240) ``` ### Purpose Chinese word segmentation sử dụng Trie data structure. ### Implementation ```python import datrie class RagTokenizer: def __init__(self): # Load dictionary into Trie self.trie = datrie.Trie(string.printable + "".join( chr(i) for i in range(0x4E00, 0x9FFF) # CJK characters )) # Load from huqie.txt dictionary for line in open("rag/res/huqie.txt"): word, freq, pos = line.strip().split("\t") self.trie[word] = (int(freq), pos) def _max_forward(self, text, start): """ Max-forward matching algorithm. """ end = len(text) while end > start: substr = text[start:end] if substr in self.trie: return substr, end end -= 1 return text[start], start + 1 def _max_backward(self, text, end): """ Max-backward matching algorithm. """ start = 0 while start < end: substr = text[start:end] if substr in self.trie: return substr, start start += 1 return text[end-1], end - 1 ``` ### Trie Structure ``` Trie Data Structure: root / \ 机 学 / \ 器 习 / \ 学 人 Words: 机器, 机器学习, 机器人, 学习 Lookup: O(m) where m = word length Insert: O(m) Space: O(n × m) where n = number of words ``` ### Max-Forward/Backward Algorithm ``` Max-Forward Matching: Input: "机器学习是人工智能" Step 1: Try "机器学习是人工智能" → Not found Step 2: Try "机器学习是人工" → Not found ... Step n: Try "机器学习" → Found! Output: ["机器学习", ...] Max-Backward Matching: Input: "机器学习" Step 1: Try "机器学习" from end → Found! Output: ["机器学习"] ``` --- ## 2. DFS with Memoization (Disambiguation) ### File Location ``` /rag/nlp/rag_tokenizer.py (lines 120-210) ``` ### Purpose Giải quyết ambiguity khi có nhiều cách tokenize. ### Implementation ```python def dfs_(self, text, memo={}): """ DFS with memoization for tokenization disambiguation. """ if text in memo: return memo[text] if not text: return [[]] results = [] for end in range(1, len(text) + 1): prefix = text[:end] if prefix in self.trie or len(prefix) == 1: suffix_results = self.dfs_(text[end:], memo) for suffix in suffix_results: results.append([prefix] + suffix) # Score and select best tokenization best = max(results, key=lambda x: self.score_(x)) memo[text] = [best] return [best] def score_(self, tokens): """ Score tokenization quality. Formula: score = B/len(tokens) + L + F where: B = 30 (bonus for fewer tokens) L = sum of token lengths / total length F = sum of frequencies """ B = 30 L = sum(len(t) for t in tokens) / max(1, sum(len(t) for t in tokens)) F = sum(self.trie.get(t, (1, ''))[0] for t in tokens) return B / len(tokens) + L + F ``` ### Scoring Function ``` Tokenization Scoring: score(tokens) = B/n + L + F where: - B = 30 (base bonus) - n = number of tokens (fewer is better) - L = coverage ratio - F = sum of word frequencies (common words preferred) Example: "北京大学" → Option 1: ["北京", "大学"] → score = 30/2 + 1.0 + (1000+500) = 1516 Option 2: ["北", "京", "大", "学"] → score = 30/4 + 1.0 + (10+10+10+10) = 48.5 Winner: Option 1 ``` --- ## 3. TF-IDF Term Weighting ### File Location ``` /rag/nlp/term_weight.py (lines 162-244) ``` ### Purpose Tính importance weight cho mỗi term trong query. ### Implementation ```python import math import numpy as np class Dealer: def weights(self, tokens, preprocess=True): """ Calculate TF-IDF based weights for tokens. """ def idf(s, N): return math.log10(10 + ((N - s + 0.5) / (s + 0.5))) # IDF based on term frequency in corpus idf1 = np.array([idf(self.freq(t), 10000000) for t in tokens]) # IDF based on document frequency idf2 = np.array([idf(self.df(t), 1000000000) for t in tokens]) # NER and POS weights ner_weights = np.array([self.ner(t) for t in tokens]) pos_weights = np.array([self.postag(t) for t in tokens]) # Combined weight weights = (0.3 * idf1 + 0.7 * idf2) * ner_weights * pos_weights # Normalize total = np.sum(weights) return [(t, w / total) for t, w in zip(tokens, weights)] ``` ### Formula ``` TF-IDF Variant: IDF(term) = log₁₀(10 + (N - df + 0.5) / (df + 0.5)) where: - N = total documents (10⁹ for df, 10⁷ for freq) - df = document frequency of term Combined Weight: weight(term) = (0.3 × IDF_freq + 0.7 × IDF_df) × NER × POS Normalization: normalized_weight(term) = weight(term) / Σ weight(all_terms) ``` --- ## 4. Named Entity Recognition (NER) ### File Location ``` /rag/nlp/term_weight.py (lines 84-86, 144-149) ``` ### Purpose Dictionary-based entity type detection với weight assignment. ### Implementation ```python class Dealer: def __init__(self): # Load NER dictionary self.ner_dict = json.load(open("rag/res/ner.json")) def ner(self, token): """ Get NER weight for token. """ NER_WEIGHTS = { "toxic": 2.0, # Toxic/sensitive words "func": 1.0, # Functional words "corp": 3.0, # Corporation names "loca": 3.0, # Location names "sch": 3.0, # School names "stock": 3.0, # Stock symbols "firstnm": 1.0, # First names } for entity_type, weight in NER_WEIGHTS.items(): if token in self.ner_dict.get(entity_type, set()): return weight return 1.0 # Default ``` ### Entity Types ``` NER Categories: ┌──────────┬────────┬─────────────────────────────┐ │ Type │ Weight │ Examples │ ├──────────┼────────┼─────────────────────────────┤ │ corp │ 3.0 │ Microsoft, Google, Apple │ │ loca │ 3.0 │ Beijing, New York │ │ sch │ 3.0 │ MIT, Stanford │ │ stock │ 3.0 │ AAPL, GOOG │ │ toxic │ 2.0 │ (sensitive words) │ │ func │ 1.0 │ the, is, are │ │ firstnm │ 1.0 │ John, Mary │ └──────────┴────────┴─────────────────────────────┘ ``` --- ## 5. Part-of-Speech (POS) Tagging ### File Location ``` /rag/nlp/term_weight.py (lines 179-189) ``` ### Purpose Assign weights based on grammatical category. ### Implementation ```python def postag(self, token): """ Get POS weight for token. """ POS_WEIGHTS = { "r": 0.3, # Pronoun "c": 0.3, # Conjunction "d": 0.3, # Adverb "ns": 3.0, # Place noun "nt": 3.0, # Organization noun "n": 2.0, # Common noun } # Get POS tag from tokenizer pos = self.tokenizer.tag(token) # Check for numeric patterns if re.match(r"^[\d.]+$", token): return 2.0 return POS_WEIGHTS.get(pos, 1.0) ``` ### POS Weight Table ``` POS Weight Assignments: ┌───────┬────────┬─────────────────────┐ │ Tag │ Weight │ Description │ ├───────┼────────┼─────────────────────┤ │ n │ 2.0 │ Common noun │ │ ns │ 3.0 │ Place noun │ │ nt │ 3.0 │ Organization noun │ │ v │ 1.0 │ Verb │ │ a │ 1.0 │ Adjective │ │ r │ 0.3 │ Pronoun │ │ c │ 0.3 │ Conjunction │ │ d │ 0.3 │ Adverb │ │ num │ 2.0 │ Number │ └───────┴────────┴─────────────────────┘ ``` --- ## 6. Synonym Detection ### File Location ``` /rag/nlp/synonym.py (lines 71-93) ``` ### Purpose Query expansion qua synonym lookup. ### Implementation ```python from nltk.corpus import wordnet class SynonymLookup: def __init__(self): # Load custom dictionary self.custom_dict = json.load(open("rag/res/synonym.json")) def lookup(self, token, top_n=8): """ Find synonyms for token. Strategy: 1. Check custom dictionary first 2. Fall back to WordNet for English """ # Custom dictionary if token in self.custom_dict: return self.custom_dict[token][:top_n] # WordNet for English words if re.match(r"^[a-z]+$", token.lower()): synonyms = set() for syn in wordnet.synsets(token): for lemma in syn.lemmas(): name = lemma.name().replace("_", " ") if name.lower() != token.lower(): synonyms.add(name) return list(synonyms)[:top_n] return [] ``` ### Synonym Sources ``` Synonym Lookup Strategy: 1. Custom Dictionary (highest priority) - Domain-specific synonyms - Chinese synonyms - Technical terms 2. WordNet (English only) - General synonyms - Lemma extraction from synsets Example: "computer" → WordNet → ["machine", "calculator", "computing device"] "机器学习" → Custom → ["ML", "machine learning", "深度学习"] ``` --- ## 7. Query Expansion ### File Location ``` /rag/nlp/query.py (lines 85-218) ``` ### Purpose Build expanded query với weighted terms và synonyms. ### Implementation ```python class FulltextQueryer: QUERY_FIELDS = [ "title_tks^10", # Title: 10x boost "title_sm_tks^5", # Title sub-tokens: 5x "important_kwd^30", # Keywords: 30x "important_tks^20", # Keyword tokens: 20x "question_tks^20", # Question tokens: 20x "content_ltks^2", # Content: 2x "content_sm_ltks^1", # Content sub-tokens: 1x ] def question(self, text, min_match=0.6): """ Build expanded query. """ # 1. Tokenize tokens = self.tokenizer.tokenize(text) # 2. Get weights weighted_tokens = self.term_weight.weights(tokens) # 3. Get synonyms synonyms = [self.synonym.lookup(t) for t, _ in weighted_tokens] # 4. Build query string query_parts = [] for (token, weight), syns in zip(weighted_tokens, synonyms): if syns: # Token with synonyms syn_str = " ".join(syns) query_parts.append(f"({token}^{weight:.4f} OR ({syn_str})^0.2)") else: query_parts.append(f"{token}^{weight:.4f}") # 5. Add phrase queries (bigrams) for i in range(1, len(weighted_tokens)): left, _ = weighted_tokens[i-1] right, w = weighted_tokens[i] query_parts.append(f'"{left} {right}"^{w*2:.4f}') return MatchTextExpr( query=" ".join(query_parts), fields=self.QUERY_FIELDS, min_match=f"{int(min_match * 100)}%" ) ``` ### Query Expansion Example ``` Input: "machine learning tutorial" After expansion: (machine^0.35 OR (computer device)^0.2) (learning^0.40 OR (study education)^0.2) (tutorial^0.25 OR (guide lesson)^0.2) "machine learning"^0.80 "learning tutorial"^0.50 With field boosting: { "query_string": { "query": "(machine^0.35 learning^0.40 tutorial^0.25)", "fields": ["title_tks^10", "important_kwd^30", "content_ltks^2"], "minimum_should_match": "60%" } } ``` --- ## 8. Fine-Grained Tokenization ### File Location ``` /rag/nlp/rag_tokenizer.py (lines 395-420) ``` ### Purpose Secondary tokenization cho compound words. ### Implementation ```python def fine_grained_tokenize(self, text): """ Break compound words into sub-tokens. """ # First pass: standard tokenization tokens = self.tokenize(text) fine_tokens = [] for token in tokens: # Skip short tokens if len(token) < 3: fine_tokens.append(token) continue # Try to break into sub-tokens sub_tokens = self.dfs_(token) if len(sub_tokens[0]) > 1: fine_tokens.extend(sub_tokens[0]) else: fine_tokens.append(token) return fine_tokens ``` ### Example ``` Standard: "机器学习" → ["机器学习"] Fine-grained: "机器学习" → ["机器", "学习"] Standard: "人工智能" → ["人工智能"] Fine-grained: "人工智能" → ["人工", "智能"] ``` --- ## Summary | Algorithm | Purpose | File | |-----------|---------|------| | Trie Tokenization | Word segmentation | rag_tokenizer.py | | Max-Forward/Backward | Matching strategy | rag_tokenizer.py | | DFS + Memo | Disambiguation | rag_tokenizer.py | | TF-IDF | Term weighting | term_weight.py | | NER | Entity detection | term_weight.py | | POS Tagging | Grammatical analysis | term_weight.py | | Synonym | Query expansion | synonym.py | | Query Expansion | Search enhancement | query.py | | Fine-grained | Sub-tokenization | rag_tokenizer.py | ## Related Files - `/rag/nlp/rag_tokenizer.py` - Tokenization - `/rag/nlp/term_weight.py` - TF-IDF, NER, POS - `/rag/nlp/synonym.py` - Synonym lookup - `/rag/nlp/query.py` - Query processing