docs: Add comprehensive algorithm documentation (50+ algorithms)
- Updated README.md with complete algorithm map across 12 categories
- Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec)
- Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution)
- Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym)
- Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost)
- Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
Parent: a6ee18476d
Commit: 566bce428b
6 changed files with 2654 additions and 94 deletions

---

**Modified file:** `README.md`
## Overview

This module documents all of the algorithms used in RAGFlow: 50+ algorithms organized into 12 categories, covering scoring, similarity, chunking, and advanced RAG techniques.

## Algorithm Categories
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           RAGFLOW ALGORITHM MAP                              │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. CLUSTERING                     │ 2. DIMENSIONALITY REDUCTION              │
│ ├── K-Means                       │ ├── UMAP                                 │
│ ├── Gaussian Mixture Model (GMM)  │ └── Node2Vec Embedding                   │
│ └── Silhouette Score              │                                          │
└───────────────────────────────────┴──────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. GRAPH ALGORITHMS               │ 4. NLP/TEXT PROCESSING                   │
│ ├── PageRank                      │ ├── Trie-based Tokenization              │
│ ├── Leiden Community Detection    │ ├── Max-Forward/Backward Algorithm       │
│ ├── Entity Extraction (LLM)       │ ├── DFS with Memoization                 │
│ ├── Relation Extraction (LLM)     │ ├── TF-IDF Term Weighting                │
│ ├── Entity Resolution             │ ├── Named Entity Recognition (NER)       │
│ └── Largest Connected Component   │ ├── Part-of-Speech Tagging (POS)         │
│                                   │ ├── Synonym Detection (WordNet)          │
│                                   │ └── Query Expansion                      │
└───────────────────────────────────┴──────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 5. SIMILARITY/DISTANCE            │ 6. INFORMATION RETRIEVAL                 │
│ ├── Cosine Similarity             │ ├── BM25 Scoring                         │
│ ├── Edit Distance (Levenshtein)   │ ├── Hybrid Score Fusion                  │
│ ├── IoU (Intersection over Union) │ ├── Cross-Encoder Reranking              │
│ ├── Token Similarity              │ └── Weighted Sum Fusion                  │
│ └── Hybrid Similarity             │                                          │
└───────────────────────────────────┴──────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 7. CHUNKING/MERGING               │ 8. MACHINE LEARNING MODELS               │
│ ├── Naive Merge (Token-based)     │ ├── XGBoost (Text Concatenation)         │
│ ├── Hierarchical Merge            │ ├── ONNX Models (Vision)                 │
│ ├── Tree-based Merge              │ └── Reranking Models                     │
│ └── Binary Search Merge           │                                          │
└───────────────────────────────────┴──────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 9. VISION/IMAGE PROCESSING        │ 10. ADVANCED RAG                         │
│ ├── OCR (ONNX)                    │ ├── RAPTOR (Hierarchical Summarization)  │
│ ├── Layout Recognition (YOLOv10)  │ ├── GraphRAG                             │
│ ├── Table Structure Recognition   │ └── Community Reports                    │
│ └── Non-Maximum Suppression (NMS) │                                          │
└───────────────────────────────────┴──────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 11. OPTIMIZATION                  │ 12. DATA STRUCTURES                      │
│ ├── BIC (Bayesian Info Criterion) │ ├── Trie Tree                            │
│ └── Silhouette Score              │ ├── Hierarchical Tree                    │
│                                   │ └── NetworkX Graph                       │
└───────────────────────────────────┴──────────────────────────────────────────┘
```
## Files in This Module
| File | Description |
|------|-------------|
| [bm25_scoring.md](./bm25_scoring.md) | BM25 ranking algorithm |
| [vector_similarity.md](./vector_similarity.md) | Cosine similarity calculations |
| [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination strategies |
| [tfidf_weighting.md](./tfidf_weighting.md) | TF-IDF term weighting |
| [raptor_algorithm.md](./raptor_algorithm.md) | Hierarchical summarization |
| [graphrag_implementation.md](./graphrag_implementation.md) | Knowledge graph RAG |
| [clustering_algorithms.md](./clustering_algorithms.md) | KMeans, GMM, UMAP |
| [graph_algorithms.md](./graph_algorithms.md) | PageRank, Leiden, Entity Resolution |
| [nlp_algorithms.md](./nlp_algorithms.md) | Tokenization, TF-IDF, NER, POS |
| [vision_algorithms.md](./vision_algorithms.md) | OCR, Layout, NMS |
| [similarity_metrics.md](./similarity_metrics.md) | Cosine, Edit Distance, IoU |
## Complete Algorithm Reference

### 1. CLUSTERING ALGORITHMS

| Algorithm | File | Description |
|-----------|------|-------------|
| K-Means | `/deepdoc/parser/pdf_parser.py:36` | Column detection in PDF layout |
| GMM | `/rag/raptor.py:22` | RAPTOR cluster selection |
| Silhouette Score | `/deepdoc/parser/pdf_parser.py:37` | Cluster validation |

### 2. DIMENSIONALITY REDUCTION

| Algorithm | File | Description |
|-----------|------|-------------|
| UMAP | `/rag/raptor.py:21` | Pre-clustering dimension reduction |
| Node2Vec | `/graphrag/general/entity_embedding.py:24` | Graph node embedding |

### 3. GRAPH ALGORITHMS

| Algorithm | File | Description |
|-----------|------|-------------|
| PageRank | `/graphrag/entity_resolution.py:150` | Entity importance scoring |
| Leiden | `/graphrag/general/leiden.py:72` | Hierarchical community detection |
| Entity Extraction | `/graphrag/general/extractor.py` | LLM-based entity extraction |
| Relation Extraction | `/graphrag/general/extractor.py` | LLM-based relation extraction |
| Entity Resolution | `/graphrag/entity_resolution.py` | Entity deduplication |
| LCC | `/graphrag/general/leiden.py:67` | Largest connected component |

### 4. NLP/TEXT PROCESSING

| Algorithm | File | Description |
|-----------|------|-------------|
| Trie Tokenization | `/rag/nlp/rag_tokenizer.py:72` | Chinese word segmentation |
| Max-Forward | `/rag/nlp/rag_tokenizer.py:250` | Forward tokenization |
| Max-Backward | `/rag/nlp/rag_tokenizer.py:273` | Backward tokenization |
| DFS + Memo | `/rag/nlp/rag_tokenizer.py:120` | Disambiguation |
| TF-IDF | `/rag/nlp/term_weight.py:223` | Term weighting |
| NER | `/rag/nlp/term_weight.py:84` | Named entity recognition |
| POS Tagging | `/rag/nlp/term_weight.py:179` | Part-of-speech tagging |
| Synonym | `/rag/nlp/synonym.py:71` | Synonym lookup |
| Query Expansion | `/rag/nlp/query.py:85` | Query rewriting |
| Porter Stemmer | `/rag/nlp/rag_tokenizer.py:27` | English stemming |
| WordNet Lemmatizer | `/rag/nlp/rag_tokenizer.py:27` | Lemmatization |

### 5. SIMILARITY/DISTANCE METRICS

| Algorithm | File | Formula |
|-----------|------|---------|
| Cosine Similarity | `/rag/nlp/query.py:221` | `cos(θ) = A·B / (‖A‖×‖B‖)` |
| Edit Distance | `/graphrag/entity_resolution.py:28` | Levenshtein distance |
| IoU | `/deepdoc/vision/operators.py:702` | `intersection / union` |
| Token Similarity | `/rag/nlp/query.py:230` | Weighted token overlap |
| Hybrid Similarity | `/rag/nlp/query.py:220` | `α×token + β×vector` |

### 6. INFORMATION RETRIEVAL

| Algorithm | File | Formula |
|-----------|------|---------|
| BM25 | `/rag/nlp/search.py` | ES native BM25 |
| Hybrid Fusion | `/rag/nlp/search.py:126` | `0.05×BM25 + 0.95×Vector` |
| Reranking | `/rag/nlp/search.py:330` | Cross-encoder scoring |
| Argsort Ranking | `/rag/nlp/search.py:429` | Score-based sorting |

### 7. CHUNKING/MERGING

| Algorithm | File | Description |
|-----------|------|-------------|
| Naive Merge | `/rag/nlp/__init__.py:582` | Token-based chunking |
| Naive Merge + Images | `/rag/nlp/__init__.py:645` | With image tracking |
| Hierarchical Merge | `/rag/nlp/__init__.py:487` | Tree-based merging |
| Binary Search | `/rag/nlp/__init__.py:512` | Efficient section lookup |
| DFS Tree Traversal | `/rag/flow/hierarchical_merger/` | Document hierarchy |

### 8. MACHINE LEARNING MODELS

| Model | File | Purpose |
|-------|------|---------|
| XGBoost | `/deepdoc/parser/pdf_parser.py:88` | Text concatenation |
| ONNX OCR | `/deepdoc/vision/ocr.py:32` | Text recognition |
| ONNX Layout | `/deepdoc/vision/layout_recognizer.py` | Layout detection |
| ONNX TSR | `/deepdoc/vision/table_structure_recognizer.py` | Table structure |
| YOLOv10 | `/deepdoc/vision/layout_recognizer.py` | Object detection |

### 9. VISION/IMAGE PROCESSING

| Algorithm | File | Description |
|-----------|------|-------------|
| NMS | `/deepdoc/vision/operators.py:702` | Box filtering |
| IoU Filtering | `/deepdoc/vision/recognizer.py:359` | Overlap detection |
| Bounding Box Overlap | `/deepdoc/vision/layout_recognizer.py:94` | Spatial analysis |

### 10. ADVANCED RAG

| Algorithm | File | Description |
|-----------|------|-------------|
| RAPTOR | `/rag/raptor.py:37` | Hierarchical summarization |
| GraphRAG | `/graphrag/` | Knowledge graph RAG |
| Community Reports | `/graphrag/general/community_reports_extractor.py` | Graph summaries |

### 11. OPTIMIZATION CRITERIA

| Algorithm | File | Formula |
|-----------|------|---------|
| BIC | `/rag/raptor.py:92` | `k×log(n) - 2×log(L)` |
| Silhouette | `/deepdoc/parser/pdf_parser.py:400` | `(b-a) / max(a,b)` |

## Algorithm Parameters

| Algorithm | Parameter | Default | Range |
|-----------|-----------|---------|-------|
| **BM25** | k1 | 1.2 | 0-2.0 |
| | b | 0.75 | 0-1.0 |
| **Hybrid** | vector_weight | 0.95 | 0-1.0 |
| | text_weight | 0.05 | 0-1.0 |
| **TF-IDF** | IDF1 weight | 0.3 | - |
| | IDF2 weight | 0.7 | - |
| **Chunking** | chunk_size | 512 | 128-2048 |
| | overlap | 0-10% | 0-100% |
| **RAPTOR** | max_clusters | 10-50 | - |
| | GMM threshold | 0.1 | - |
| **GraphRAG** | entity_topN | 6 | 1-100 |
| | similarity_threshold | 0.3 | 0-1.0 |

## Key Implementation Files

- `/rag/nlp/search.py` - Search algorithms
- `/rag/nlp/term_weight.py` - TF-IDF implementation
- `/rag/nlp/query.py` - Query processing
- `/rag/raptor.py` - RAPTOR algorithm
- `/graphrag/search.py` - GraphRAG search
- `/rag/nlp/__init__.py` - Chunking algorithms

## Performance Metrics

| Metric | Typical Value |
|--------|---------------|
| Vector Search Latency | < 100ms |
| BM25 Search Latency | < 50ms |
| Reranking Latency | 200-500ms |
| Total Retrieval | < 1s |

## Algorithm Formulas

### BM25 Scoring

```
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

where:
  f(qi, D) = term frequency of qi in document D
  |D|      = document length
  avgdl    = average document length
  k1       = 1.2 (term frequency saturation)
  b        = 0.75 (length normalization)
```

### Cosine Similarity

```
cos(θ) = (A · B) / (||A|| × ||B||)

where:
  A, B  = embedding vectors
  A · B = dot product
  ||A|| = L2 norm
```

### Hybrid Score Fusion

```
Hybrid_Score = α × BM25_Score + (1-α) × Vector_Score

Default: α = 0.05 (5% BM25, 95% Vector)
```

### TF-IDF Weighting

```
IDF(term) = log10(10 + (N - df(term) + 0.5) / (df(term) + 0.5))
Weight    = (0.3 × IDF1 + 0.7 × IDF2) × NER × PoS
```

### Cross-Encoder Reranking

```
Final_Rank = α × Token_Sim + β × Vector_Sim + γ × Rank_Features

where:
  α = 0.3 (token weight)
  β = 0.7 (vector weight)
  γ = variable (PageRank, tag boost)
```
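For illustration, here is a minimal, self-contained sketch of the weighted-sum fusion and reranking formulas above. The function names and the toy inputs are hypothetical; RAGFlow's actual implementation lives in `/rag/nlp/search.py`.

```python
def fuse_scores(bm25_score: float, vector_score: float, text_weight: float = 0.05) -> float:
    """Hybrid retrieval fusion: alpha * BM25 + (1 - alpha) * vector similarity."""
    return text_weight * bm25_score + (1.0 - text_weight) * vector_score


def rerank_score(token_sim: float, vector_sim: float, rank_boost: float = 0.0,
                 tkweight: float = 0.3, vtweight: float = 0.7) -> float:
    """Cross-encoder style re-scoring: alpha*token + beta*vector + gamma rank features."""
    return tkweight * token_sim + vtweight * vector_sim + rank_boost


# Toy numbers: a chunk with normalized BM25 score 0.72 and cosine similarity 0.81.
print(round(fuse_scores(0.72, 0.81), 4))   # 0.05*0.72 + 0.95*0.81 = 0.8055
print(round(rerank_score(0.60, 0.90), 4))  # 0.3*0.60 + 0.7*0.90  = 0.81
```

With the default weights, the dense vector score dominates first-stage retrieval (0.95 vs 0.05), while reranking still leans on vector similarity but rewards exact token overlap.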
## Statistics

- **Total Algorithms**: 50+
- **Categories**: 12
- **Key Libraries**: sklearn, UMAP, XGBoost, NetworkX, graspologic, ONNX
|
||||
|
|
|
|||
---

**New file:** `personal_analyze/06-ALGORITHMS/clustering_algorithms.md` (365 lines)
# Clustering Algorithms
|
||||
|
||||
## Overview

RAGFlow uses clustering algorithms for PDF layout analysis and for RAPTOR hierarchical summarization.
## 1. K-Means Clustering
|
||||
|
||||
### File Location
|
||||
```
|
||||
/deepdoc/parser/pdf_parser.py (lines 36, 394, 425, 1047-1055)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Detect columns in the PDF layout by clustering text boxes on their X coordinates.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
|
||||
|
||||
def _assign_column(self):
|
||||
"""
|
||||
Detect columns using KMeans clustering on X coordinates.
|
||||
"""
|
||||
# Get X coordinates of text boxes
|
||||
x_coords = np.array([[b["x0"]] for b in self.bxs])
|
||||
|
||||
best_k = 1
|
||||
best_score = -1
|
||||
|
||||
# Find optimal number of columns (1-5)
|
||||
for k in range(1, min(5, len(self.bxs))):
|
||||
if k >= len(self.bxs):
|
||||
break
|
||||
|
||||
km = KMeans(n_clusters=k, random_state=42, n_init="auto")
|
||||
labels = km.fit_predict(x_coords)
|
||||
|
||||
if k > 1:
|
||||
score = silhouette_score(x_coords, labels)
|
||||
if score > best_score:
|
||||
best_score = score
|
||||
best_k = k
|
||||
|
||||
# Assign columns with optimal k
|
||||
km = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
|
||||
labels = km.fit_predict(x_coords)
|
||||
|
||||
for i, bx in enumerate(self.bxs):
|
||||
bx["col_id"] = labels[i]
|
||||
```
|
||||
|
||||
### Algorithm
|
||||
|
||||
```
|
||||
K-Means Algorithm:
|
||||
1. Initialize k centroids randomly
|
||||
2. Repeat until convergence:
|
||||
a. Assign each point to nearest centroid
|
||||
b. Update centroids as mean of assigned points
|
||||
3. Return cluster assignments
|
||||
|
||||
Objective: minimize Σ ||xi - μci||²
|
||||
where μci is centroid of cluster containing xi
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
| Parameter | Value | Description |
|
||||
|-----------|-------|-------------|
|
||||
| n_clusters | 1-5 | Number of columns to detect |
|
||||
| n_init | "auto" | Initialization runs |
|
||||
| random_state | 42 | Reproducibility |
|
||||
|
||||
---
|
||||
|
||||
## 2. Gaussian Mixture Model (GMM)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/raptor.py (lines 22, 102-106, 195-199)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
The RAPTOR algorithm uses a GMM to cluster document chunks before summarization.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
from sklearn.mixture import GaussianMixture
|
||||
|
||||
def _get_optimal_clusters(self, embeddings: np.ndarray, random_state: int):
|
||||
"""
|
||||
Find optimal number of clusters using BIC criterion.
|
||||
"""
|
||||
max_clusters = min(self._max_cluster, len(embeddings))
|
||||
n_clusters = np.arange(1, max_clusters)
|
||||
|
||||
bics = []
|
||||
for n in n_clusters:
|
||||
gm = GaussianMixture(
|
||||
n_components=n,
|
||||
random_state=random_state,
|
||||
covariance_type='full'
|
||||
)
|
||||
gm.fit(embeddings)
|
||||
bics.append(gm.bic(embeddings))
|
||||
|
||||
# Select cluster count with minimum BIC
|
||||
optimal_clusters = n_clusters[np.argmin(bics)]
|
||||
return optimal_clusters
|
||||
|
||||
def _cluster_chunks(self, chunks, embeddings):
|
||||
"""
|
||||
Cluster chunks using GMM with soft assignments.
|
||||
"""
|
||||
# Reduce dimensions first
|
||||
reduced = self._reduce_dimensions(embeddings)
|
||||
|
||||
# Find optimal k
|
||||
n_clusters = self._get_optimal_clusters(reduced, random_state=42)
|
||||
|
||||
# Fit GMM
|
||||
gm = GaussianMixture(n_components=n_clusters, random_state=42)
|
||||
gm.fit(reduced)
|
||||
|
||||
# Get soft assignments (probabilities)
|
||||
probs = gm.predict_proba(reduced)
|
||||
|
||||
# Assign chunks to clusters with probability > threshold
|
||||
clusters = [[] for _ in range(n_clusters)]
|
||||
for i, prob in enumerate(probs):
|
||||
for j, p in enumerate(prob):
|
||||
if p > 0.1: # Threshold
|
||||
clusters[j].append(i)
|
||||
|
||||
return clusters
|
||||
```
|
||||
|
||||
### GMM Formula
|
||||
|
||||
```
|
||||
GMM Probability Density:
|
||||
p(x) = Σ πk × N(x | μk, Σk)
|
||||
|
||||
where:
|
||||
- πk = mixture weight for component k
|
||||
- N(x | μk, Σk) = Gaussian distribution with mean μk and covariance Σk
|
||||
|
||||
BIC (Bayesian Information Criterion):
|
||||
BIC = k × log(n) - 2 × log(L̂)
|
||||
|
||||
where:
|
||||
- k = number of parameters
|
||||
- n = number of samples
|
||||
- L̂ = maximum likelihood
|
||||
```
|
||||
|
||||
### Soft Assignment
|
||||
|
||||
GMM allows soft assignment (a chunk can belong to more than one cluster):
|
||||
|
||||
```
|
||||
Chunk i belongs to Cluster j if P(j|xi) > threshold (0.1)
|
||||
```
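For example, if a chunk's posterior probabilities over three clusters were [0.62, 0.27, 0.11], it would be assigned to all three clusters, since each probability exceeds the 0.1 threshold (hypothetical numbers, used only to illustrate the rule above).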
|
||||
|
||||
---
|
||||
|
||||
## 3. UMAP (Dimensionality Reduction)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/raptor.py (lines 21, 186-190)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Reduce the dimensionality of the embeddings before clustering to improve cluster quality.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import umap
|
||||
|
||||
def _reduce_dimensions(self, embeddings: np.ndarray) -> np.ndarray:
|
||||
"""
|
||||
Reduce embedding dimensions using UMAP.
|
||||
"""
|
||||
n_samples = len(embeddings)
|
||||
|
||||
# Calculate neighbors based on sample size
|
||||
n_neighbors = int((n_samples - 1) ** 0.8)
|
||||
|
||||
# Target dimensions
|
||||
n_components = min(12, n_samples - 2)
|
||||
|
||||
reducer = umap.UMAP(
|
||||
n_neighbors=max(2, n_neighbors),
|
||||
n_components=n_components,
|
||||
metric="cosine",
|
||||
random_state=42
|
||||
)
|
||||
|
||||
return reducer.fit_transform(embeddings)
|
||||
```
|
||||
|
||||
### UMAP Algorithm
|
||||
|
||||
```
|
||||
UMAP (Uniform Manifold Approximation and Projection):
|
||||
|
||||
1. Build high-dimensional graph:
|
||||
- Compute k-nearest neighbors
|
||||
- Create weighted edges based on distance
|
||||
|
||||
2. Build low-dimensional representation:
|
||||
- Initialize randomly
|
||||
- Optimize layout using cross-entropy loss
|
||||
- Preserve local structure (neighbors stay neighbors)
|
||||
|
||||
Key idea: Preserve topological structure, not absolute distances
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
| Parameter | Value | Description |
|
||||
|-----------|-------|-------------|
|
||||
| n_neighbors | (n-1)^0.8 | Local neighborhood size |
|
||||
| n_components | min(12, n-2) | Output dimensions |
|
||||
| metric | cosine | Distance metric |
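As a worked example of these rules (assuming the parameter formulas above): with n_samples = 100 the reducer would use n_neighbors = int(99 ** 0.8) = 39 and n_components = min(12, 98) = 12.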
|
||||
|
||||
---
|
||||
|
||||
## 4. Silhouette Score
|
||||
|
||||
### File Location
|
||||
```
|
||||
/deepdoc/parser/pdf_parser.py (lines 37, 400, 1052)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Evaluate cluster quality in order to choose the optimal k for K-Means.
|
||||
|
||||
### Formula
|
||||
|
||||
```
|
||||
Silhouette Score:
|
||||
s(i) = (b(i) - a(i)) / max(a(i), b(i))
|
||||
|
||||
where:
|
||||
- a(i) = average distance to points in same cluster
|
||||
- b(i) = average distance to points in nearest other cluster
|
||||
|
||||
Range: [-1, 1]
|
||||
- s ≈ 1: Point well-clustered
|
||||
- s ≈ 0: Point on boundary
|
||||
- s < 0: Point may be misclassified
|
||||
|
||||
Overall score = mean(s(i)) for all points
|
||||
```
|
||||
|
||||
### Usage
|
||||
|
||||
```python
|
||||
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
|
||||
|
||||
# Find optimal k
|
||||
best_k = 1
|
||||
best_score = -1
|
||||
|
||||
for k in range(2, max_clusters):
|
||||
km = KMeans(n_clusters=k)
|
||||
labels = km.fit_predict(data)
|
||||
|
||||
score = silhouette_score(data, labels)
|
||||
|
||||
if score > best_score:
|
||||
best_score = score
|
||||
best_k = k
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Node2Vec (Graph Embedding)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/general/entity_embedding.py (lines 24-44)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Generate embeddings for the nodes of the knowledge graph.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
from graspologic.embed import node2vec_embed
|
||||
|
||||
def embed_node2vec(graph, dimensions=1536, num_walks=10,
|
||||
walk_length=40, window_size=2, iterations=3):
|
||||
"""
|
||||
Generate node embeddings using Node2Vec algorithm.
|
||||
"""
|
||||
lcc_tensors, embedding = node2vec_embed(
|
||||
graph=graph,
|
||||
dimensions=dimensions,
|
||||
num_walks=num_walks,
|
||||
walk_length=walk_length,
|
||||
window_size=window_size,
|
||||
iterations=iterations,
|
||||
random_seed=86
|
||||
)
|
||||
|
||||
return embedding
|
||||
```
|
||||
|
||||
### Node2Vec Algorithm
|
||||
|
||||
```
|
||||
Node2Vec Algorithm:
|
||||
|
||||
1. Random Walk Generation:
|
||||
- For each node, perform biased random walks
|
||||
- Walk strategy controlled by p (return) and q (in-out)
|
||||
|
||||
2. Skip-gram Training:
|
||||
- Treat walks as sentences
|
||||
- Train Word2Vec Skip-gram model
|
||||
- Node → Embedding vector
|
||||
|
||||
Walk probabilities:
|
||||
- p: Return parameter (go back to previous node)
|
||||
- q: In-out parameter (explore vs exploit)
|
||||
|
||||
Low p, high q → BFS-like (local structure)
|
||||
High p, low q → DFS-like (global structure)
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
| Parameter | Value | Description |
|
||||
|-----------|-------|-------------|
|
||||
| dimensions | 1536 | Embedding size |
|
||||
| num_walks | 10 | Walks per node |
|
||||
| walk_length | 40 | Steps per walk |
|
||||
| window_size | 2 | Skip-gram window |
|
||||
| iterations | 3 | Training iterations |
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
| Algorithm | Purpose | Library |
|
||||
|-----------|---------|---------|
|
||||
| K-Means | PDF column detection | sklearn |
|
||||
| GMM | RAPTOR clustering | sklearn |
|
||||
| UMAP | Dimension reduction | umap-learn |
|
||||
| Silhouette | Cluster validation | sklearn |
|
||||
| Node2Vec | Graph embedding | graspologic |
|
||||
|
||||
## Related Files
|
||||
|
||||
- `/deepdoc/parser/pdf_parser.py` - K-Means, Silhouette
|
||||
- `/rag/raptor.py` - GMM, UMAP
|
||||
- `/graphrag/general/entity_embedding.py` - Node2Vec
|
||||
---

**New file:** `personal_analyze/06-ALGORITHMS/graph_algorithms.md` (471 lines)
# Graph Algorithms
|
||||
|
||||
## Overview

RAGFlow uses graph algorithms for knowledge graph construction and for GraphRAG retrieval.
## 1. PageRank Algorithm
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/entity_resolution.py (line 150)
|
||||
/graphrag/general/index.py (line 460)
|
||||
/graphrag/search.py (line 83)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Compute an importance score for each entity in the knowledge graph.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import networkx as nx
|
||||
|
||||
def compute_pagerank(graph):
|
||||
"""
|
||||
Compute PageRank scores for all nodes.
|
||||
"""
|
||||
pagerank = nx.pagerank(graph)
|
||||
return pagerank
|
||||
|
||||
# Usage in search ranking
|
||||
def rank_entities(entities, pagerank_scores):
|
||||
"""
|
||||
Rank entities by similarity * pagerank.
|
||||
"""
|
||||
ranked = sorted(
|
||||
entities.items(),
|
||||
key=lambda x: x[1]["sim"] * x[1]["pagerank"],
|
||||
reverse=True
|
||||
)
|
||||
return ranked
|
||||
```
|
||||
|
||||
### PageRank Formula
|
||||
|
||||
```
|
||||
PageRank Algorithm:
|
||||
|
||||
PR(u) = (1-d)/N + d × Σ PR(v)/L(v)
|
||||
for all v linking to u
|
||||
|
||||
where:
|
||||
- d = damping factor (typically 0.85)
|
||||
- N = total number of nodes
|
||||
- L(v) = number of outbound links from v
|
||||
|
||||
Iterative computation until convergence:
|
||||
PR^(t+1)(u) = (1-d)/N + d × Σ PR^(t)(v)/L(v)
|
||||
```
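As a quick sanity check of the formula, here is a toy NetworkX example (the graph is made up and this is not RAGFlow code; `nx.pagerank`'s `alpha` argument is the damping factor d):

```python
import networkx as nx

# Toy directed graph: everything ultimately points at "A".
g = nx.DiGraph()
g.add_edges_from([("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")])

scores = nx.pagerank(g, alpha=0.85)
for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(node, round(score, 3))  # "A" receives the highest score
```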
|
||||
|
||||
### Usage in RAGFlow
|
||||
|
||||
```python
|
||||
# In GraphRAG search
|
||||
def get_relevant_entities(query, graph):
|
||||
# 1. Get candidate entities by similarity
|
||||
candidates = vector_search(query)
|
||||
|
||||
# 2. Compute PageRank
|
||||
pagerank = nx.pagerank(graph)
|
||||
|
||||
# 3. Combine scores
|
||||
for entity in candidates:
|
||||
entity["final_score"] = entity["similarity"] * pagerank[entity["id"]]
|
||||
|
||||
return sorted(candidates, key=lambda x: x["final_score"], reverse=True)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Leiden Community Detection
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/general/leiden.py (lines 72-141)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Detect communities in the knowledge graph in order to organize entities into groups.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
from graspologic.partition import hierarchical_leiden
|
||||
from graspologic.utils import largest_connected_component
|
||||
|
||||
def _compute_leiden_communities(graph, max_cluster_size=12, seed=0xDEADBEEF):
|
||||
"""
|
||||
Compute hierarchical communities using Leiden algorithm.
|
||||
"""
|
||||
# Extract largest connected component
|
||||
lcc = largest_connected_component(graph)
|
||||
|
||||
# Run hierarchical Leiden
|
||||
community_mapping = hierarchical_leiden(
|
||||
lcc,
|
||||
max_cluster_size=max_cluster_size,
|
||||
random_seed=seed
|
||||
)
|
||||
|
||||
# Process results by level
|
||||
results = {}
|
||||
for level, communities in community_mapping.items():
|
||||
for community_id, nodes in communities.items():
|
||||
# Calculate community weight
|
||||
weight = sum(
|
||||
graph.nodes[n].get("rank", 1) *
|
||||
graph.nodes[n].get("weight", 1)
|
||||
for n in nodes
|
||||
)
|
||||
results[(level, community_id)] = {
|
||||
"nodes": nodes,
|
||||
"weight": weight
|
||||
}
|
||||
|
||||
return results
|
||||
```
|
||||
|
||||
### Leiden Algorithm
|
||||
|
||||
```
|
||||
Leiden Algorithm (improvement over Louvain):
|
||||
|
||||
1. Local Moving Phase:
|
||||
- Move nodes between communities to improve modularity
|
||||
- Refined node movement to avoid poorly connected communities
|
||||
|
||||
2. Refinement Phase:
|
||||
- Partition communities into smaller subcommunities
|
||||
- Ensures well-connected communities
|
||||
|
||||
3. Aggregation Phase:
|
||||
- Create aggregate graph with communities as nodes
|
||||
- Repeat from step 1 until no improvement
|
||||
|
||||
Modularity:
|
||||
Q = (1/2m) × Σ [Aij - (ki×kj)/(2m)] × δ(ci, cj)
|
||||
|
||||
where:
|
||||
- Aij = edge weight between i and j
|
||||
- ki = degree of node i
|
||||
- m = total edge weight
|
||||
- δ(ci, cj) = 1 if same community, 0 otherwise
|
||||
```
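RAGFlow calls graspologic's `hierarchical_leiden` as shown above; purely to illustrate the modularity quantity Q that Leiden-style methods optimize, here is a small NetworkX sketch using a different (greedy) community detector:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Two triangles joined by a single bridge edge: two obvious communities.
g = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])

communities = greedy_modularity_communities(g)
print([sorted(c) for c in communities])      # e.g. [[1, 2, 3], [4, 5, 6]]
print(round(modularity(g, communities), 3))  # Q for this partition (~0.357)
```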
|
||||
|
||||
### Hierarchical Leiden
|
||||
|
||||
```
|
||||
Hierarchical Leiden:
|
||||
- Recursively applies Leiden to each community
|
||||
- Creates multi-level community structure
|
||||
- Controlled by max_cluster_size parameter
|
||||
|
||||
Level 0: Root community (all nodes)
|
||||
Level 1: First-level subcommunities
|
||||
Level 2: Second-level subcommunities
|
||||
...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Entity Extraction (LLM-based)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/general/extractor.py
|
||||
/graphrag/light/graph_extractor.py
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Extract entities and relationships from text using an LLM.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
class GraphExtractor:
|
||||
DEFAULT_ENTITY_TYPES = [
|
||||
"organization", "person", "geo", "event", "category"
|
||||
]
|
||||
|
||||
async def _process_single_content(self, content, entity_types):
|
||||
"""
|
||||
Extract entities from text using LLM with iterative gleaning.
|
||||
"""
|
||||
# Initial extraction
|
||||
prompt = self._build_extraction_prompt(content, entity_types)
|
||||
result = await self._llm_chat(prompt)
|
||||
|
||||
entities, relations = self._parse_result(result)
|
||||
|
||||
# Iterative gleaning (up to 2 times)
|
||||
for _ in range(2): # ENTITY_EXTRACTION_MAX_GLEANINGS
|
||||
glean_prompt = self._build_glean_prompt(result)
|
||||
glean_result = await self._llm_chat(glean_prompt)
|
||||
|
||||
# Check if more entities found
|
||||
if "NO" in glean_result.upper():
|
||||
break
|
||||
|
||||
new_entities, new_relations = self._parse_result(glean_result)
|
||||
entities.extend(new_entities)
|
||||
relations.extend(new_relations)
|
||||
|
||||
return entities, relations
|
||||
|
||||
def _parse_result(self, result):
|
||||
"""
|
||||
Parse LLM output into structured entities/relations.
|
||||
|
||||
Format: (entity_type, entity_name, description)
|
||||
Format: (source, target, relation_type, description)
|
||||
"""
|
||||
entities = []
|
||||
relations = []
|
||||
|
||||
for line in result.split("\n"):
|
||||
if line.startswith("(") and line.endswith(")"):
|
||||
parts = line[1:-1].split(TUPLE_DELIMITER)
|
||||
if len(parts) == 3: # Entity
|
||||
entities.append({
|
||||
"type": parts[0],
|
||||
"name": parts[1],
|
||||
"description": parts[2]
|
||||
})
|
||||
elif len(parts) == 4: # Relation
|
||||
relations.append({
|
||||
"source": parts[0],
|
||||
"target": parts[1],
|
||||
"type": parts[2],
|
||||
"description": parts[3]
|
||||
})
|
||||
|
||||
return entities, relations
|
||||
```
|
||||
|
||||
### Extraction Pipeline
|
||||
|
||||
```
|
||||
Entity Extraction Pipeline:
|
||||
|
||||
1. Initial Extraction
|
||||
└── LLM extracts entities using structured prompt
|
||||
|
||||
2. Iterative Gleaning (max 2 iterations)
|
||||
├── Ask LLM if more entities exist
|
||||
├── If YES: extract additional entities
|
||||
└── If NO: stop gleaning
|
||||
|
||||
3. Relation Extraction
|
||||
└── Extract relationships between entities
|
||||
|
||||
4. Graph Construction
|
||||
└── Build NetworkX graph from entities/relations
|
||||
```
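For concreteness, here is a standalone sketch of the tuple-parsing step only, assuming a `<|>` tuple delimiter (the real delimiter is a constant defined in RAGFlow's prompt templates and may differ):

```python
TUPLE_DELIMITER = "<|>"  # assumed for illustration only

def parse_records(llm_output: str):
    """Parse '(a<|>b<|>c)' style lines into entity and relation dicts."""
    entities, relations = [], []
    for line in llm_output.splitlines():
        line = line.strip()
        if not (line.startswith("(") and line.endswith(")")):
            continue
        parts = line[1:-1].split(TUPLE_DELIMITER)
        if len(parts) == 3:    # (entity_type, entity_name, description)
            entities.append({"type": parts[0], "name": parts[1], "description": parts[2]})
        elif len(parts) == 4:  # (source, target, relation_type, description)
            relations.append({"source": parts[0], "target": parts[1],
                              "type": parts[2], "description": parts[3]})
    return entities, relations

sample = ("(person<|>Ada Lovelace<|>Early computing pioneer)\n"
          "(Ada Lovelace<|>Charles Babbage<|>collaborator<|>Worked on the Analytical Engine)")
print(parse_records(sample))
```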
|
||||
|
||||
---
|
||||
|
||||
## 4. Entity Resolution
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/entity_resolution.py
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Merge duplicate entities in the knowledge graph.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import editdistance
|
||||
import networkx as nx
|
||||
|
||||
class EntityResolution:
|
||||
def is_similarity(self, a: str, b: str) -> bool:
|
||||
"""
|
||||
Check if two entity names are similar.
|
||||
"""
|
||||
a, b = a.lower(), b.lower()
|
||||
|
||||
# Chinese: character set intersection
|
||||
if self._is_chinese(a):
|
||||
a_set, b_set = set(a), set(b)
|
||||
max_len = max(len(a_set), len(b_set))
|
||||
overlap = len(a_set & b_set)
|
||||
return overlap / max_len >= 0.8
|
||||
|
||||
# English: Edit distance
|
||||
else:
|
||||
threshold = min(len(a), len(b)) // 2
|
||||
distance = editdistance.eval(a, b)
|
||||
return distance <= threshold
|
||||
|
||||
async def resolve(self, graph):
|
||||
"""
|
||||
Resolve duplicate entities in graph.
|
||||
"""
|
||||
# 1. Find candidate pairs
|
||||
nodes = list(graph.nodes())
|
||||
candidates = []
|
||||
|
||||
for i, a in enumerate(nodes):
|
||||
for b in nodes[i+1:]:
|
||||
if self.is_similarity(a, b):
|
||||
candidates.append((a, b))
|
||||
|
||||
# 2. LLM verification (batch)
|
||||
confirmed_pairs = []
|
||||
for batch in self._batch(candidates, size=100):
|
||||
results = await self._llm_verify_batch(batch)
|
||||
confirmed_pairs.extend([
|
||||
pair for pair, is_same in zip(batch, results)
|
||||
if is_same
|
||||
])
|
||||
|
||||
# 3. Merge confirmed pairs
|
||||
for a, b in confirmed_pairs:
|
||||
self._merge_nodes(graph, a, b)
|
||||
|
||||
# 4. Update PageRank
|
||||
pagerank = nx.pagerank(graph)
|
||||
for node in graph.nodes():
|
||||
graph.nodes[node]["pagerank"] = pagerank[node]
|
||||
|
||||
return graph
|
||||
```
|
||||
|
||||
### Similarity Metrics
|
||||
|
||||
```
|
||||
English Similarity (Edit Distance):
|
||||
distance(a, b) ≤ min(len(a), len(b)) // 2
|
||||
|
||||
Example:
|
||||
- "microsoft" vs "microsft" → distance=1 ≤ 4 → Similar
|
||||
- "google" vs "apple" → distance=5 > 2 → Not similar
|
||||
|
||||
Chinese Similarity (Character Set):
|
||||
|a ∩ b| / max(|a|, |b|) ≥ 0.8
|
||||
|
||||
Example:
|
||||
- "北京大学" vs "北京大" → 3/4 = 0.75 → Not similar
|
||||
- "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Largest Connected Component (LCC)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/general/leiden.py (line 67)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Extract the largest connected subgraph before community detection.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
from graspologic.utils import largest_connected_component
|
||||
|
||||
def _stabilize_graph(graph):
|
||||
"""
|
||||
Extract and stabilize the largest connected component.
|
||||
"""
|
||||
# Get LCC
|
||||
lcc = largest_connected_component(graph)
|
||||
|
||||
# Sort nodes for reproducibility
|
||||
sorted_nodes = sorted(lcc.nodes())
|
||||
sorted_graph = lcc.subgraph(sorted_nodes).copy()
|
||||
|
||||
return sorted_graph
|
||||
```
|
||||
|
||||
### LCC Algorithm
|
||||
|
||||
```
|
||||
LCC Algorithm:
|
||||
|
||||
1. Find all connected components using BFS/DFS
|
||||
2. Select component with maximum number of nodes
|
||||
3. Return subgraph of that component
|
||||
|
||||
Complexity: O(V + E)
|
||||
where V = vertices, E = edges
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. N-hop Path Scoring
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/search.py (lines 181-187)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Score entities based on path distance in the graph.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
def compute_nhop_scores(entity, neighbors, n_hops=2):
|
||||
"""
|
||||
Score entities based on graph distance.
|
||||
"""
|
||||
nhop_scores = {}
|
||||
|
||||
for neighbor in neighbors:
|
||||
path = neighbor["path"]
|
||||
weights = neighbor["weights"]
|
||||
|
||||
for i in range(len(path) - 1):
|
||||
source, target = path[i], path[i + 1]
|
||||
|
||||
# Decay by distance
|
||||
score = entity["sim"] / (2 + i)
|
||||
|
||||
if (source, target) in nhop_scores:
|
||||
nhop_scores[(source, target)]["sim"] += score
|
||||
else:
|
||||
nhop_scores[(source, target)] = {"sim": score}
|
||||
|
||||
return nhop_scores
|
||||
```
|
||||
|
||||
### Scoring Formula
|
||||
|
||||
```
|
||||
N-hop Score with Decay:
|
||||
|
||||
score(e, path_i) = similarity(e) / (2 + distance_i)
|
||||
|
||||
where:
|
||||
- distance_i = number of hops from source entity
|
||||
- 2 = base constant to prevent division issues
|
||||
|
||||
Total score = Σ score(e, path_i) for all paths
|
||||
```
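For example, an entity retrieved with similarity 0.9 contributes 0.9 / 2 = 0.45 to the first edge of a path and 0.9 / 3 = 0.30 to the second edge, so evidence decays with hop distance (a worked instance of the decay rule above).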
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
| Algorithm | Purpose | Library |
|
||||
|-----------|---------|---------|
|
||||
| PageRank | Entity importance | NetworkX |
|
||||
| Leiden | Community detection | graspologic |
|
||||
| Entity Extraction | KG construction | LLM |
|
||||
| Entity Resolution | Deduplication | editdistance + LLM |
|
||||
| LCC | Graph preprocessing | graspologic |
|
||||
| N-hop Scoring | Path-based ranking | Custom |
|
||||
|
||||
## Related Files
|
||||
|
||||
- `/graphrag/entity_resolution.py` - Entity resolution
|
||||
- `/graphrag/general/leiden.py` - Community detection
|
||||
- `/graphrag/general/extractor.py` - Entity extraction
|
||||
- `/graphrag/search.py` - Graph search
|
||||
---

**New file:** `personal_analyze/06-ALGORITHMS/nlp_algorithms.md` (571 lines)
# NLP Algorithms
|
||||
|
||||
## Overview

RAGFlow uses multiple NLP algorithms for tokenization, term weighting, and query processing.
## 1. Trie-based Tokenization
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/rag_tokenizer.py (lines 72-90, 120-240)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Chinese word segmentation using a Trie data structure.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import datrie
|
||||
|
||||
class RagTokenizer:
|
||||
def __init__(self):
|
||||
# Load dictionary into Trie
|
||||
self.trie = datrie.Trie(string.printable + "".join(
|
||||
chr(i) for i in range(0x4E00, 0x9FFF) # CJK characters
|
||||
))
|
||||
|
||||
# Load from huqie.txt dictionary
|
||||
for line in open("rag/res/huqie.txt"):
|
||||
word, freq, pos = line.strip().split("\t")
|
||||
self.trie[word] = (int(freq), pos)
|
||||
|
||||
def _max_forward(self, text, start):
|
||||
"""
|
||||
Max-forward matching algorithm.
|
||||
"""
|
||||
end = len(text)
|
||||
while end > start:
|
||||
substr = text[start:end]
|
||||
if substr in self.trie:
|
||||
return substr, end
|
||||
end -= 1
|
||||
return text[start], start + 1
|
||||
|
||||
def _max_backward(self, text, end):
|
||||
"""
|
||||
Max-backward matching algorithm.
|
||||
"""
|
||||
start = 0
|
||||
while start < end:
|
||||
substr = text[start:end]
|
||||
if substr in self.trie:
|
||||
return substr, start
|
||||
start += 1
|
||||
return text[end-1], end - 1
|
||||
```
|
||||
|
||||
### Trie Structure
|
||||
|
||||
```
|
||||
Trie Data Structure:
|
||||
root
|
||||
/ \
|
||||
机 学
|
||||
/ \
|
||||
器 习
|
||||
/ \
|
||||
学 人
|
||||
|
||||
Words: 机器, 机器学习, 机器人, 学习
|
||||
|
||||
Lookup: O(m) where m = word length
|
||||
Insert: O(m)
|
||||
Space: O(n × m) where n = number of words
|
||||
```
|
||||
|
||||
### Max-Forward/Backward Algorithm
|
||||
|
||||
```
|
||||
Max-Forward Matching:
|
||||
Input: "机器学习是人工智能"
|
||||
|
||||
Step 1: Try "机器学习是人工智能" → Not found
|
||||
Step 2: Try "机器学习是人工" → Not found
|
||||
...
|
||||
Step n: Try "机器学习" → Found!
|
||||
Output: ["机器学习", ...]
|
||||
|
||||
Max-Backward Matching:
|
||||
Input: "机器学习"
|
||||
|
||||
Step 1: Try "机器学习" from end → Found!
|
||||
Output: ["机器学习"]
|
||||
```
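As a minimal standalone illustration of max-forward matching, here is a sketch that uses a plain Python set as a toy dictionary (RAGFlow itself uses a `datrie.Trie` loaded from `huqie.txt`):

```python
DICTIONARY = {"机器", "机器学习", "学习", "人工", "智能", "人工智能"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def max_forward(text: str) -> list:
    """Greedily match the longest dictionary word starting at each position."""
    tokens, start = [], 0
    while start < len(text):
        end = min(len(text), start + MAX_WORD_LEN)
        while end > start + 1 and text[start:end] not in DICTIONARY:
            end -= 1  # shrink the window until a word matches or one char remains
        tokens.append(text[start:end])
        start = end
    return tokens

print(max_forward("机器学习是人工智能"))  # ['机器学习', '是', '人工智能']
```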
|
||||
|
||||
---
|
||||
|
||||
## 2. DFS with Memoization (Disambiguation)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/rag_tokenizer.py (lines 120-210)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Resolve the ambiguity that arises when a string has multiple possible tokenizations.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
def dfs_(self, text, memo=None):
    """
    DFS with memoization for tokenization disambiguation.
    """
    if memo is None:  # avoid sharing one mutable default dict across calls
        memo = {}
if text in memo:
|
||||
return memo[text]
|
||||
|
||||
if not text:
|
||||
return [[]]
|
||||
|
||||
results = []
|
||||
for end in range(1, len(text) + 1):
|
||||
prefix = text[:end]
|
||||
if prefix in self.trie or len(prefix) == 1:
|
||||
suffix_results = self.dfs_(text[end:], memo)
|
||||
for suffix in suffix_results:
|
||||
results.append([prefix] + suffix)
|
||||
|
||||
# Score and select best tokenization
|
||||
best = max(results, key=lambda x: self.score_(x))
|
||||
memo[text] = [best]
|
||||
return [best]
|
||||
|
||||
def score_(self, tokens):
|
||||
"""
|
||||
Score tokenization quality.
|
||||
|
||||
Formula: score = B/len(tokens) + L + F
|
||||
where:
|
||||
B = 30 (bonus for fewer tokens)
|
||||
L = sum of token lengths / total length
|
||||
F = sum of frequencies
|
||||
"""
|
||||
B = 30
|
||||
L = sum(len(t) for t in tokens) / max(1, sum(len(t) for t in tokens))
|
||||
F = sum(self.trie.get(t, (1, ''))[0] for t in tokens)
|
||||
|
||||
return B / len(tokens) + L + F
|
||||
```
|
||||
|
||||
### Scoring Function
|
||||
|
||||
```
|
||||
Tokenization Scoring:
|
||||
|
||||
score(tokens) = B/n + L + F
|
||||
|
||||
where:
|
||||
- B = 30 (base bonus)
|
||||
- n = number of tokens (fewer is better)
|
||||
- L = coverage ratio
|
||||
- F = sum of word frequencies (common words preferred)
|
||||
|
||||
Example:
|
||||
"北京大学" →
|
||||
Option 1: ["北京", "大学"] → score = 30/2 + 1.0 + (1000+500) = 1516
|
||||
Option 2: ["北", "京", "大", "学"] → score = 30/4 + 1.0 + (10+10+10+10) = 48.5
|
||||
|
||||
Winner: Option 1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. TF-IDF Term Weighting
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/term_weight.py (lines 162-244)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Compute an importance weight for each term in the query.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import math
|
||||
import numpy as np
|
||||
|
||||
class Dealer:
|
||||
def weights(self, tokens, preprocess=True):
|
||||
"""
|
||||
Calculate TF-IDF based weights for tokens.
|
||||
"""
|
||||
def idf(s, N):
|
||||
return math.log10(10 + ((N - s + 0.5) / (s + 0.5)))
|
||||
|
||||
# IDF based on term frequency in corpus
|
||||
idf1 = np.array([idf(self.freq(t), 10000000) for t in tokens])
|
||||
|
||||
# IDF based on document frequency
|
||||
idf2 = np.array([idf(self.df(t), 1000000000) for t in tokens])
|
||||
|
||||
# NER and POS weights
|
||||
ner_weights = np.array([self.ner(t) for t in tokens])
|
||||
pos_weights = np.array([self.postag(t) for t in tokens])
|
||||
|
||||
# Combined weight
|
||||
weights = (0.3 * idf1 + 0.7 * idf2) * ner_weights * pos_weights
|
||||
|
||||
# Normalize
|
||||
total = np.sum(weights)
|
||||
return [(t, w / total) for t, w in zip(tokens, weights)]
|
||||
```
|
||||
|
||||
### Formula
|
||||
|
||||
```
|
||||
TF-IDF Variant:
|
||||
|
||||
IDF(term) = log₁₀(10 + (N - df + 0.5) / (df + 0.5))
|
||||
|
||||
where:
|
||||
- N = total documents (10⁹ for df, 10⁷ for freq)
|
||||
- df = document frequency of term
|
||||
|
||||
Combined Weight:
|
||||
weight(term) = (0.3 × IDF_freq + 0.7 × IDF_df) × NER × POS
|
||||
|
||||
Normalization:
|
||||
normalized_weight(term) = weight(term) / Σ weight(all_terms)
|
||||
```
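A quick numeric check of the IDF variant above (a throwaway sketch, not RAGFlow code): a rare term gets a weight near 6, while a very common term gets a weight near 1.3.

```python
import math

def idf(df: int, N: int) -> float:
    """IDF variant used above: log10(10 + (N - df + 0.5) / (df + 0.5))."""
    return math.log10(10 + (N - df + 0.5) / (df + 0.5))

print(round(idf(1_000, 1_000_000_000), 2))        # rare term   -> ~6.0
print(round(idf(100_000_000, 1_000_000_000), 2))  # common term -> ~1.28
```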
|
||||
|
||||
---
|
||||
|
||||
## 4. Named Entity Recognition (NER)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/term_weight.py (lines 84-86, 144-149)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Dictionary-based entity type detection with weight assignment.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import json

class Dealer:
|
||||
def __init__(self):
|
||||
# Load NER dictionary
|
||||
self.ner_dict = json.load(open("rag/res/ner.json"))
|
||||
|
||||
def ner(self, token):
|
||||
"""
|
||||
Get NER weight for token.
|
||||
"""
|
||||
NER_WEIGHTS = {
|
||||
"toxic": 2.0, # Toxic/sensitive words
|
||||
"func": 1.0, # Functional words
|
||||
"corp": 3.0, # Corporation names
|
||||
"loca": 3.0, # Location names
|
||||
"sch": 3.0, # School names
|
||||
"stock": 3.0, # Stock symbols
|
||||
"firstnm": 1.0, # First names
|
||||
}
|
||||
|
||||
for entity_type, weight in NER_WEIGHTS.items():
|
||||
if token in self.ner_dict.get(entity_type, set()):
|
||||
return weight
|
||||
|
||||
return 1.0 # Default
|
||||
```
|
||||
|
||||
### Entity Types
|
||||
|
||||
```
|
||||
NER Categories:
|
||||
┌──────────┬────────┬─────────────────────────────┐
|
||||
│ Type │ Weight │ Examples │
|
||||
├──────────┼────────┼─────────────────────────────┤
|
||||
│ corp │ 3.0 │ Microsoft, Google, Apple │
|
||||
│ loca │ 3.0 │ Beijing, New York │
|
||||
│ sch │ 3.0 │ MIT, Stanford │
|
||||
│ stock │ 3.0 │ AAPL, GOOG │
|
||||
│ toxic │ 2.0 │ (sensitive words) │
|
||||
│ func │ 1.0 │ the, is, are │
|
||||
│ firstnm │ 1.0 │ John, Mary │
|
||||
└──────────┴────────┴─────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Part-of-Speech (POS) Tagging
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/term_weight.py (lines 179-189)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Assign weights based on grammatical category.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
def postag(self, token):
|
||||
"""
|
||||
Get POS weight for token.
|
||||
"""
|
||||
POS_WEIGHTS = {
|
||||
"r": 0.3, # Pronoun
|
||||
"c": 0.3, # Conjunction
|
||||
"d": 0.3, # Adverb
|
||||
"ns": 3.0, # Place noun
|
||||
"nt": 3.0, # Organization noun
|
||||
"n": 2.0, # Common noun
|
||||
}
|
||||
|
||||
# Get POS tag from tokenizer
|
||||
pos = self.tokenizer.tag(token)
|
||||
|
||||
# Check for numeric patterns
|
||||
if re.match(r"^[\d.]+$", token):
|
||||
return 2.0
|
||||
|
||||
return POS_WEIGHTS.get(pos, 1.0)
|
||||
```
|
||||
|
||||
### POS Weight Table
|
||||
|
||||
```
|
||||
POS Weight Assignments:
|
||||
┌───────┬────────┬─────────────────────┐
|
||||
│ Tag │ Weight │ Description │
|
||||
├───────┼────────┼─────────────────────┤
|
||||
│ n │ 2.0 │ Common noun │
|
||||
│ ns │ 3.0 │ Place noun │
|
||||
│ nt │ 3.0 │ Organization noun │
|
||||
│ v │ 1.0 │ Verb │
|
||||
│ a │ 1.0 │ Adjective │
|
||||
│ r │ 0.3 │ Pronoun │
|
||||
│ c │ 0.3 │ Conjunction │
|
||||
│ d │ 0.3 │ Adverb │
|
||||
│ num │ 2.0 │ Number │
|
||||
└───────┴────────┴─────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Synonym Detection
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/synonym.py (lines 71-93)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Query expansion via synonym lookup.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import json
import re

from nltk.corpus import wordnet
|
||||
|
||||
class SynonymLookup:
|
||||
def __init__(self):
|
||||
# Load custom dictionary
|
||||
self.custom_dict = json.load(open("rag/res/synonym.json"))
|
||||
|
||||
def lookup(self, token, top_n=8):
|
||||
"""
|
||||
Find synonyms for token.
|
||||
|
||||
Strategy:
|
||||
1. Check custom dictionary first
|
||||
2. Fall back to WordNet for English
|
||||
"""
|
||||
# Custom dictionary
|
||||
if token in self.custom_dict:
|
||||
return self.custom_dict[token][:top_n]
|
||||
|
||||
# WordNet for English words
|
||||
if re.match(r"^[a-z]+$", token.lower()):
|
||||
synonyms = set()
|
||||
for syn in wordnet.synsets(token):
|
||||
for lemma in syn.lemmas():
|
||||
name = lemma.name().replace("_", " ")
|
||||
if name.lower() != token.lower():
|
||||
synonyms.add(name)
|
||||
|
||||
return list(synonyms)[:top_n]
|
||||
|
||||
return []
|
||||
```
|
||||
|
||||
### Synonym Sources
|
||||
|
||||
```
|
||||
Synonym Lookup Strategy:
|
||||
|
||||
1. Custom Dictionary (highest priority)
|
||||
- Domain-specific synonyms
|
||||
- Chinese synonyms
|
||||
- Technical terms
|
||||
|
||||
2. WordNet (English only)
|
||||
- General synonyms
|
||||
- Lemma extraction from synsets
|
||||
|
||||
Example:
|
||||
"computer" → WordNet → ["machine", "calculator", "computing device"]
|
||||
"机器学习" → Custom → ["ML", "machine learning", "深度学习"]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Query Expansion
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/query.py (lines 85-218)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Build an expanded query with weighted terms and synonyms.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
class FulltextQueryer:
|
||||
QUERY_FIELDS = [
|
||||
"title_tks^10", # Title: 10x boost
|
||||
"title_sm_tks^5", # Title sub-tokens: 5x
|
||||
"important_kwd^30", # Keywords: 30x
|
||||
"important_tks^20", # Keyword tokens: 20x
|
||||
"question_tks^20", # Question tokens: 20x
|
||||
"content_ltks^2", # Content: 2x
|
||||
"content_sm_ltks^1", # Content sub-tokens: 1x
|
||||
]
|
||||
|
||||
def question(self, text, min_match=0.6):
|
||||
"""
|
||||
Build expanded query.
|
||||
"""
|
||||
# 1. Tokenize
|
||||
tokens = self.tokenizer.tokenize(text)
|
||||
|
||||
# 2. Get weights
|
||||
weighted_tokens = self.term_weight.weights(tokens)
|
||||
|
||||
# 3. Get synonyms
|
||||
synonyms = [self.synonym.lookup(t) for t, _ in weighted_tokens]
|
||||
|
||||
# 4. Build query string
|
||||
query_parts = []
|
||||
for (token, weight), syns in zip(weighted_tokens, synonyms):
|
||||
if syns:
|
||||
# Token with synonyms
|
||||
syn_str = " ".join(syns)
|
||||
query_parts.append(f"({token}^{weight:.4f} OR ({syn_str})^0.2)")
|
||||
else:
|
||||
query_parts.append(f"{token}^{weight:.4f}")
|
||||
|
||||
# 5. Add phrase queries (bigrams)
|
||||
for i in range(1, len(weighted_tokens)):
|
||||
left, _ = weighted_tokens[i-1]
|
||||
right, w = weighted_tokens[i]
|
||||
query_parts.append(f'"{left} {right}"^{w*2:.4f}')
|
||||
|
||||
return MatchTextExpr(
|
||||
query=" ".join(query_parts),
|
||||
fields=self.QUERY_FIELDS,
|
||||
min_match=f"{int(min_match * 100)}%"
|
||||
)
|
||||
```
|
||||
|
||||
### Query Expansion Example
|
||||
|
||||
```
|
||||
Input: "machine learning tutorial"
|
||||
|
||||
After expansion:
|
||||
(machine^0.35 OR (computer device)^0.2)
|
||||
(learning^0.40 OR (study education)^0.2)
|
||||
(tutorial^0.25 OR (guide lesson)^0.2)
|
||||
"machine learning"^0.80
|
||||
"learning tutorial"^0.50
|
||||
|
||||
With field boosting:
|
||||
{
|
||||
"query_string": {
|
||||
"query": "(machine^0.35 learning^0.40 tutorial^0.25)",
|
||||
"fields": ["title_tks^10", "important_kwd^30", "content_ltks^2"],
|
||||
"minimum_should_match": "60%"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Fine-Grained Tokenization
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/rag_tokenizer.py (lines 395-420)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Secondary tokenization for compound words.
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
def fine_grained_tokenize(self, text):
|
||||
"""
|
||||
Break compound words into sub-tokens.
|
||||
"""
|
||||
# First pass: standard tokenization
|
||||
tokens = self.tokenize(text)
|
||||
|
||||
fine_tokens = []
|
||||
for token in tokens:
|
||||
# Skip short tokens
|
||||
if len(token) < 3:
|
||||
fine_tokens.append(token)
|
||||
continue
|
||||
|
||||
# Try to break into sub-tokens
|
||||
sub_tokens = self.dfs_(token)
|
||||
if len(sub_tokens[0]) > 1:
|
||||
fine_tokens.extend(sub_tokens[0])
|
||||
else:
|
||||
fine_tokens.append(token)
|
||||
|
||||
return fine_tokens
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
```
|
||||
Standard: "机器学习" → ["机器学习"]
|
||||
Fine-grained: "机器学习" → ["机器", "学习"]
|
||||
|
||||
Standard: "人工智能" → ["人工智能"]
|
||||
Fine-grained: "人工智能" → ["人工", "智能"]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
| Algorithm | Purpose | File |
|
||||
|-----------|---------|------|
|
||||
| Trie Tokenization | Word segmentation | rag_tokenizer.py |
|
||||
| Max-Forward/Backward | Matching strategy | rag_tokenizer.py |
|
||||
| DFS + Memo | Disambiguation | rag_tokenizer.py |
|
||||
| TF-IDF | Term weighting | term_weight.py |
|
||||
| NER | Entity detection | term_weight.py |
|
||||
| POS Tagging | Grammatical analysis | term_weight.py |
|
||||
| Synonym | Query expansion | synonym.py |
|
||||
| Query Expansion | Search enhancement | query.py |
|
||||
| Fine-grained | Sub-tokenization | rag_tokenizer.py |
|
||||
|
||||
## Related Files
|
||||
|
||||
- `/rag/nlp/rag_tokenizer.py` - Tokenization
|
||||
- `/rag/nlp/term_weight.py` - TF-IDF, NER, POS
|
||||
- `/rag/nlp/synonym.py` - Synonym lookup
|
||||
- `/rag/nlp/query.py` - Query processing
|
||||
---

**New file:** `personal_analyze/06-ALGORITHMS/similarity_metrics.md` (455 lines)
# Similarity & Distance Metrics
|
||||
|
||||
## Overview

RAGFlow uses multiple similarity metrics for search, ranking, and entity resolution.
## 1. Cosine Similarity
|
||||
|
||||
### File Location
|
||||
```
|
||||
/rag/nlp/query.py (line 221)
|
||||
/rag/raptor.py (line 189)
|
||||
/rag/nlp/search.py (line 60)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Measure the similarity between two vectors (embeddings).
|
||||
|
||||
### Formula
|
||||
|
||||
```
|
||||
Cosine Similarity:
|
||||
|
||||
cos(θ) = (A · B) / (||A|| × ||B||)
|
||||
|
||||
= Σ(Ai × Bi) / (√Σ(Ai²) × √Σ(Bi²))
|
||||
|
||||
Range: [-1, 1]
|
||||
- cos = 1: Identical direction
|
||||
- cos = 0: Orthogonal
|
||||
- cos = -1: Opposite direction
|
||||
|
||||
For normalized vectors:
|
||||
cos(θ) = A · B (dot product only)
|
||||
```
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
import numpy as np
|
||||
|
||||
def compute_cosine_similarity(vec1, vec2):
|
||||
"""
|
||||
Compute cosine similarity between two vectors.
|
||||
"""
|
||||
# Using sklearn
|
||||
sim = cosine_similarity([vec1], [vec2])[0][0]
|
||||
return sim
|
||||
|
||||
def compute_batch_similarity(query_vec, doc_vecs):
|
||||
"""
|
||||
Compute similarity between query and multiple documents.
|
||||
"""
|
||||
# Returns array of similarities
|
||||
sims = cosine_similarity([query_vec], doc_vecs)[0]
|
||||
return sims
|
||||
|
||||
# Manual implementation
|
||||
def cosine_sim_manual(a, b):
|
||||
dot_product = np.dot(a, b)
|
||||
norm_a = np.linalg.norm(a)
|
||||
norm_b = np.linalg.norm(b)
|
||||
return dot_product / (norm_a * norm_b)
|
||||
```
|
||||
|
||||
### Usage in RAGFlow
|
||||
|
||||
```python
|
||||
# Vector search scoring
|
||||
def hybrid_similarity(self, query_vec, doc_vecs, tkweight=0.3, vtweight=0.7):
|
||||
# Cosine similarity for vectors
|
||||
vsim = cosine_similarity([query_vec], doc_vecs)[0]
|
||||
|
||||
# Token similarity
|
||||
tksim = self.token_similarity(query_tokens, doc_tokens)
|
||||
|
||||
# Weighted combination
|
||||
combined = vsim * vtweight + tksim * tkweight
|
||||
|
||||
return combined
|
||||
```
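For example, with tkweight = 0.3 and vtweight = 0.7, a document with token similarity 0.75 and cosine similarity 0.81 scores 0.3 × 0.75 + 0.7 × 0.81 = 0.792 (a worked instance of the weighting above, using made-up inputs).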
|
||||
|
||||
---
|
||||
|
||||
## 2. Edit Distance (Levenshtein)
|
||||
|
||||
### File Location
|
||||
```
|
||||
/graphrag/entity_resolution.py (line 28, 246)
|
||||
```
|
||||
|
||||
### Purpose
|
||||
Measure string similarity for entity resolution.
|
||||
|
||||
### Formula
|
||||
|
||||
```
|
||||
Edit Distance (Levenshtein):
|
||||
|
||||
d(a, b) = minimum number of single-character edits
|
||||
(insertions, deletions, substitutions)
|
||||
|
||||
Dynamic Programming:
|
||||
d[i][j] = min(
|
||||
d[i-1][j] + 1, # deletion
|
||||
d[i][j-1] + 1, # insertion
|
||||
d[i-1][j-1] + c # substitution (c=0 if same, 1 if different)
|
||||
)
|
||||
|
||||
Base cases:
|
||||
d[i][0] = i
|
||||
d[0][j] = j
|
||||
```
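The implementation below relies on the `editdistance` library; purely for illustration, here is a direct transcription of the recurrence above into Python (not RAGFlow code):

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, following the recurrence above."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                            # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                            # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

assert levenshtein("microsoft", "microsft") == 1
assert levenshtein("google", "apple") == 4
```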
### Implementation

```python
import editdistance

def is_similar_by_edit_distance(a: str, b: str) -> bool:
    """
    Check if two strings are similar using edit distance.

    Threshold: distance ≤ min(len(a), len(b)) // 2
    """
    a, b = a.lower(), b.lower()
    threshold = min(len(a), len(b)) // 2
    distance = editdistance.eval(a, b)
    return distance <= threshold

# Examples:
# "microsoft" vs "microsft" → distance=1, threshold=4 → Similar
# "google" vs "apple" → distance=4, threshold=2 → Not similar
```

### Similarity Threshold

```
Edit Distance Threshold Strategy:

threshold = min(len(a), len(b)) // 2

Rationale:
- Allows ~50% character differences
- Handles typos and minor variations
- Stricter for short strings

Examples:
| String A    | String B    | Distance | Threshold | Similar? |
|-------------|-------------|----------|-----------|----------|
| microsoft   | microsft    | 1        | 4         | Yes      |
| google      | googl       | 1        | 2         | Yes      |
| amazon      | apple       | 5        | 2         | No       |
| ibm         | ibm         | 0        | 1         | Yes      |
```

---

## 3. Chinese Character Similarity

### File Location
```
/graphrag/entity_resolution.py (lines 250-255)
```

### Purpose
Similarity measure for Chinese entity names.

### Formula

```
Chinese Character Similarity:

sim(a, b) = |set(a) ∩ set(b)| / max(|set(a)|, |set(b)|)

Threshold: sim ≥ 0.8

Example:
a = "北京大学" → set = {北, 京, 大, 学}
b = "北京大"   → set = {北, 京, 大}
intersection = {北, 京, 大}
sim = 3 / max(4, 3) = 3/4 = 0.75 < 0.8 → Not similar
```

### Implementation

```python
def is_similar_chinese(a: str, b: str) -> bool:
    """
    Check if two Chinese strings are similar.
    Uses character set intersection.
    """
    a_set = set(a)
    b_set = set(b)

    max_len = max(len(a_set), len(b_set))
    intersection = len(a_set & b_set)

    similarity = intersection / max_len

    return similarity >= 0.8

# Examples:
# "清华大学" vs "清华" → 2/4 = 0.5 → Not similar
# "人工智能" vs "人工智慧" → 3/4 = 0.75 → Not similar
# "机器学习" vs "机器学习研究" → 4/6 = 0.67 → Not similar
```

---

## 4. Token Similarity (Weighted)

### File Location
```
/rag/nlp/query.py (lines 230-242)
```

### Purpose
Measure similarity based on weighted token overlap.

### Formula

```
Token Similarity:

sim(query, doc) = Σ weight(t) for t ∈ (query ∩ doc)
                  ─────────────────────────────────
                  Σ weight(t) for t ∈ query

where weight(t) = TF-IDF weight of token t

Range: [0, 1]
- 0: No token overlap
- 1: All query tokens in document
```

### Implementation

```python
def token_similarity(self, query_tokens_weighted, doc_tokens):
    """
    Compute weighted token similarity.

    Args:
        query_tokens_weighted: [(token, weight), ...]
        doc_tokens: set of document tokens

    Returns:
        Similarity score in [0, 1]
    """
    doc_set = set(doc_tokens)

    matched_weight = 0
    total_weight = 0

    for token, weight in query_tokens_weighted:
        total_weight += weight
        if token in doc_set:
            matched_weight += weight

    if total_weight == 0:
        return 0

    return matched_weight / total_weight

# Example:
# query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
# doc = {"machine", "learning", "introduction"}
# matched = 0.4 + 0.35 = 0.75
# total = 1.0
# similarity = 0.75
```

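The worked example in the comments can be replayed with a standalone version of the same logic (a sketch without the class context, not the actual RAGFlow method):

```python
def token_similarity(query_tokens_weighted, doc_tokens):
    # Fraction of total query weight that also appears in the document
    doc_set = set(doc_tokens)
    total = sum(w for _, w in query_tokens_weighted)
    matched = sum(w for t, w in query_tokens_weighted if t in doc_set)
    return matched / total if total else 0.0

query = [("machine", 0.4), ("learning", 0.35), ("tutorial", 0.25)]
doc = {"machine", "learning", "introduction"}
print(token_similarity(query, doc))  # ≈ 0.75
```
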
---

## 5. Hybrid Similarity

### File Location
```
/rag/nlp/query.py (lines 220-228)
```

### Purpose
Combine token and vector similarity.

### Formula

```
Hybrid Similarity:

hybrid = α × token_sim + β × vector_sim

where:
- α = text weight (default: 0.3)
- β = vector weight (default: 0.7)
- α + β = 1.0

Alternative with rank features:
hybrid = (α × token_sim + β × vector_sim) × (1 + γ × pagerank)
```

### Implementation

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_similarity(self, query_vec, doc_vecs,
                      query_tokens, doc_tokens_list,
                      tkweight=0.3, vtweight=0.7):
    """
    Compute hybrid similarity combining token and vector similarity.
    """
    # Vector similarity (cosine)
    vsim = cosine_similarity([query_vec], doc_vecs)[0]

    # Token similarity
    tksim = []
    for doc_tokens in doc_tokens_list:
        sim = self.token_similarity(query_tokens, doc_tokens)
        tksim.append(sim)

    tksim = np.array(tksim)

    # Handle edge case: no vector signal, fall back to token similarity
    if np.sum(vsim) == 0:
        return tksim, tksim, vsim

    # Weighted combination
    combined = vsim * vtweight + tksim * tkweight

    return combined, tksim, vsim
```

### Weight Recommendations

```
Hybrid Weights by Use Case:
┌─────────────────────────┬────────┬────────┐
│ Use Case                │ Token  │ Vector │
├─────────────────────────┼────────┼────────┤
│ Conversational/Semantic │  0.05  │  0.95  │
│ Technical Documentation │  0.30  │  0.70  │
│ Legal/Exact Match       │  0.40  │  0.60  │
│ Code Search             │  0.50  │  0.50  │
│ Default                 │  0.30  │  0.70  │
└─────────────────────────┴────────┴────────┘
```

---

## 6. IoU (Intersection over Union)

### File Location
```
/deepdoc/vision/operators.py (lines 702-725)
```

### Purpose
Measure bounding box overlap.

### Formula

```
IoU = Area(A ∩ B) / Area(A ∪ B)

    = Area(intersection) / (Area(A) + Area(B) - Area(intersection))

Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap
```

### Implementation

```python
def compute_iou(box1, box2):
    """
    Compute IoU between two boxes [x1, y1, x2, y2].
    """
    # Intersection
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0
```

---

## 7. N-gram Similarity

### File Location
```
/graphrag/entity_resolution.py (2-gram analysis)
```

### Purpose
Check for digit-only differences in entity names.

### Implementation

```python
def check_2gram_digit_difference(a: str, b: str) -> bool:
    """
    Check if strings differ only in digit 2-grams.
    """
    def get_2grams(s):
        return [s[i:i+2] for i in range(len(s)-1)]

    a_grams = get_2grams(a)
    b_grams = get_2grams(b)

    # Find different 2-grams
    diff_grams = set(a_grams) ^ set(b_grams)

    # Check if all differences are digit-only
    for gram in diff_grams:
        if not gram.isdigit():
            return False

    return True

# Example:
# "product2023" vs "product2024" → True (only digit diff)
# "productA" vs "productB" → False (letter diff)
```

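The edit-distance, Chinese-character, and digit 2-gram checks all serve the same entity-resolution step. A purely illustrative sketch of how such checks could be combined into a single predicate; the actual combination logic lives in `/graphrag/entity_resolution.py`, and treating digit-only differences as a reason to reject a merge is an assumption here:

```python
def names_look_alike(a: str, b: str) -> bool:
    if a == b:
        return True

    # ASSUMPTION: names that differ only in digits (e.g. "product2023" vs
    # "product2024") are treated as distinct entities and rejected early.
    if check_2gram_digit_difference(a, b):
        return False

    # Chinese names: character-set overlap ratio
    if any('\u4e00' <= ch <= '\u9fff' for ch in a + b):
        return is_similar_chinese(a, b)

    # Everything else: edit-distance threshold
    return is_similar_by_edit_distance(a, b)
```
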
---

## Summary Table

| Metric | Formula | Range | Use Case |
|--------|---------|-------|----------|
| Cosine | A·B / (‖A‖×‖B‖) | [-1, 1] | Vector search |
| Edit Distance | min edits | [0, ∞) | String matching |
| Chinese Char | \|A∩B\| / max(\|A\|,\|B\|) | [0, 1] | Chinese entities |
| Token | Σw(matched) / Σw(all) | [0, 1] | Keyword matching |
| Hybrid | α×token + β×vector | [0, 1] | Combined search |
| IoU | intersection / union | [0, 1] | Box overlap |

## Related Files

- `/rag/nlp/query.py` - Similarity calculations
- `/rag/nlp/search.py` - Search ranking
- `/graphrag/entity_resolution.py` - Entity matching
- `/deepdoc/vision/operators.py` - Box metrics

personal_analyze/06-ALGORITHMS/vision_algorithms.md

# Vision Algorithms

## Overview

RAGFlow uses computer vision algorithms for document understanding, OCR, and layout analysis.

## 1. OCR (Optical Character Recognition)

### File Location
```
/deepdoc/vision/ocr.py (lines 30-120)
```

### Purpose
Text detection and recognition from document images.

### Implementation

```python
import onnxruntime as ort

class OCR:
    def __init__(self):
        # Load ONNX models
        self.det_model = ort.InferenceSession("ocr_det.onnx")
        self.rec_model = ort.InferenceSession("ocr_rec.onnx")

    def detect(self, image, device_id=0):
        """
        Detect text regions in image.

        Returns:
            List of bounding boxes with confidence scores
        """
        # Preprocess
        img = self._preprocess_det(image)

        # Run detection
        outputs = self.det_model.run(None, {"input": img})

        # Post-process to get boxes
        boxes = self._postprocess_det(outputs[0])

        return boxes

    def recognize(self, image, boxes):
        """
        Recognize text in detected regions.

        Returns:
            List of (text, confidence) tuples
        """
        results = []

        for box in boxes:
            # Crop region
            crop = self._crop_region(image, box)

            # Preprocess
            img = self._preprocess_rec(crop)

            # Run recognition
            outputs = self.rec_model.run(None, {"input": img})

            # Decode to text
            text, conf = self._decode_ctc(outputs[0])
            results.append((text, conf))

        return results
```

### OCR Pipeline

```
OCR Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│                         Input Image                             │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Detection Model (ONNX)                        │
│   - DB (Differentiable Binarization) based                      │
│   - Output: Text region polygons                                │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Post-processing                            │
│   - Polygon to bounding box                                     │
│   - Filter by confidence                                        │
│   - NMS for overlapping boxes                                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Recognition Model (ONNX)                       │
│   - CRNN (CNN + RNN) based                                      │
│   - CTC decoding                                                │
│   - Output: Character sequence                                  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│            Output: [(text, confidence, box), ...]               │
└─────────────────────────────────────────────────────────────────┘
```

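A minimal usage sketch tying the two stages together. The helper methods and model files referenced by the `OCR` class above are placeholders, and the image path here is illustrative:

```python
import cv2

# Assumes the OCR class and its ONNX model files shown above are available.
image = cv2.imread("page_001.png")

ocr = OCR()
boxes = ocr.detect(image)             # text region proposals
texts = ocr.recognize(image, boxes)   # [(text, confidence), ...]

for (text, conf), box in zip(texts, boxes):
    if conf > 0.5:                    # drop low-confidence lines
        print(box, text)
```
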
### CTC Decoding

```
CTC (Connectionist Temporal Classification):

Input: Probability matrix P (T × C)
       T = time steps, C = character classes

Algorithm:
1. For each time step, get most probable character
2. Merge consecutive duplicates
3. Remove blank tokens

Example:
Raw output:          [a, a, -, b, b, b, -, c]
After merge:         [a, -, b, -, c]
After blank removal: [a, b, c]
Final: "abc"
```

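A minimal greedy CTC decoder implementing the three steps above (a sketch, not the actual `_decode_ctc` used in RAGFlow):

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank_id=0):
    """probs: (T, C) per-timestep class probabilities."""
    # 1. Most probable class at each time step
    best = np.argmax(probs, axis=1)

    decoded = []
    prev = None
    for idx in best:
        # 2. Merge consecutive duplicates
        if idx != prev:
            # 3. Drop blank tokens
            if idx != blank_id:
                decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)

# charset[0] is the blank symbol; one-hot rows spell out a,a,-,b,b,b,-,c
charset = ["-", "a", "b", "c"]
probs = np.eye(4)[[1, 1, 0, 2, 2, 2, 0, 3]]
print(ctc_greedy_decode(probs, charset))  # "abc"
```
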
---

## 2. Layout Recognition (YOLOv10)

### File Location
```
/deepdoc/vision/layout_recognizer.py (lines 33-100)
```

### Purpose
Detect document layout elements (text, title, table, figure, etc.).

### Implementation

```python
import onnxruntime as ort

class LayoutRecognizer:
    LABELS = [
        "text", "title", "figure", "figure caption",
        "table", "table caption", "header", "footer",
        "reference", "equation"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("layout_yolov10.onnx")

    def detect(self, image):
        """
        Detect layout elements in document image.
        """
        # Preprocess (resize, normalize)
        img = self._preprocess(image)

        # Run inference
        outputs = self.model.run(None, {"images": img})

        # Post-process
        boxes, labels, scores = self._postprocess(outputs[0])

        # Filter by confidence
        results = []
        for box, label, score in zip(boxes, labels, scores):
            if score > 0.4:  # Confidence threshold
                results.append({
                    "box": box,
                    "type": self.LABELS[label],
                    "confidence": score
                })

        return results
```

### Layout Types

```
Document Layout Categories:
┌──────────────────┬────────────────────────────────────┐
│ Type             │ Description                        │
├──────────────────┼────────────────────────────────────┤
│ text             │ Body text paragraphs               │
│ title            │ Section/document titles            │
│ figure           │ Images, diagrams, charts           │
│ figure caption   │ Text describing figures            │
│ table            │ Data tables                        │
│ table caption    │ Text describing tables             │
│ header           │ Page headers                       │
│ footer           │ Page footers                       │
│ reference        │ Bibliography, citations            │
│ equation         │ Mathematical equations             │
└──────────────────┴────────────────────────────────────┘
```

### YOLO Detection

```
YOLOv10 Detection:

1. Backbone: Feature extraction (CSPDarknet)
2. Neck: Feature pyramid (PANet)
3. Head: Prediction heads for different scales

Output format:
[x_center, y_center, width, height, confidence, class_probs...]

Post-processing:
1. Apply sigmoid to confidence
2. Multiply conf × class_prob for class scores
3. Filter by score threshold
4. Apply NMS
```

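A sketch of the four post-processing steps for a generic YOLO-style output tensor. The `(N, 5 + num_classes)` layout, the sigmoid on the objectness logit, and the thresholds are assumptions for illustration, not read from the RAGFlow model; it reuses the `nms()` function shown in section 4 below:

```python
import numpy as np

def yolo_postprocess(raw, score_thr=0.4, iou_thr=0.5):
    """raw: (N, 5 + num_classes) rows of [cx, cy, w, h, conf_logit, class_probs...]."""
    # 1. Sigmoid on the objectness logit
    conf = 1.0 / (1.0 + np.exp(-raw[:, 4]))
    class_probs = raw[:, 5:]

    # 2. Class score = objectness × class probability
    scores = conf[:, None] * class_probs
    labels = scores.argmax(axis=1)
    best = scores.max(axis=1)

    # 3. Score threshold
    keep = best > score_thr
    cx, cy, w, h = raw[keep, 0], raw[keep, 1], raw[keep, 2], raw[keep, 3]
    labels, best = labels[keep], best[keep]

    # Convert center format to corner format for NMS
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

    # 4. NMS (nms() is defined in section 4 below)
    kept = nms(boxes, best, iou_threshold=iou_thr)
    return boxes[kept], labels[kept], best[kept]
```
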
---

## 3. Table Structure Recognition (TSR)

### File Location
```
/deepdoc/vision/table_structure_recognizer.py (lines 30-100)
```

### Purpose
Detect table structure (rows, columns, cells, headers).

### Implementation

```python
import onnxruntime as ort

class TableStructureRecognizer:
    LABELS = [
        "table", "table column", "table row",
        "table column header", "projected row header",
        "spanning cell"
    ]

    def __init__(self):
        self.model = ort.InferenceSession("table_structure.onnx")

    def recognize(self, table_image):
        """
        Recognize structure of a table image.
        """
        # Preprocess
        img = self._preprocess(table_image)

        # Run inference
        outputs = self.model.run(None, {"input": img})

        # Parse structure
        structure = self._parse_structure(outputs)

        return structure

    def _parse_structure(self, outputs):
        """
        Parse model output into table structure.
        """
        rows = []
        columns = []
        cells = []

        for detection in outputs:
            label = self.LABELS[detection["class"]]

            if label == "table row":
                rows.append(detection["box"])
            elif label == "table column":
                columns.append(detection["box"])
            elif label == "spanning cell":
                cells.append({
                    "box": detection["box"],
                    "colspan": self._estimate_colspan(detection, columns),
                    "rowspan": self._estimate_rowspan(detection, rows)
                })

        return {
            "rows": sorted(rows, key=lambda x: x[1]),        # Sort by Y
            "columns": sorted(columns, key=lambda x: x[0]),  # Sort by X
            "cells": cells
        }
```

### TSR Output

```
Table Structure Output:

{
  "rows": [
    {"y": 10, "height": 30},    # Row 1
    {"y": 40, "height": 30},    # Row 2
    ...
  ],
  "columns": [
    {"x": 0, "width": 100},     # Col 1
    {"x": 100, "width": 150},   # Col 2
    ...
  ],
  "cells": [
    {"row": 0, "col": 0, "text": "Header 1"},
    {"row": 0, "col": 1, "text": "Header 2"},
    {"row": 1, "col": 0, "text": "Data 1", "colspan": 2},
    ...
  ]
}
```

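To produce the `(row, col)` indices above, each detected cell box has to be assigned to a row band and a column band. A sketch of one common approach, assigning a cell to whichever band it overlaps most (illustrative only, not the exact RAGFlow logic):

```python
def overlap_1d(a_start, a_end, b_start, b_end):
    # Length of the overlap between two 1-D intervals
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def assign_cell(cell_box, rows, columns):
    """cell_box, rows, columns all use [x1, y1, x2, y2] boxes."""
    # Row band with the largest vertical overlap
    row_idx = max(range(len(rows)),
                  key=lambda i: overlap_1d(cell_box[1], cell_box[3],
                                           rows[i][1], rows[i][3]))
    # Column band with the largest horizontal overlap
    col_idx = max(range(len(columns)),
                  key=lambda i: overlap_1d(cell_box[0], cell_box[2],
                                           columns[i][0], columns[i][2]))
    return row_idx, col_idx
```
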
---

## 4. Non-Maximum Suppression (NMS)

### File Location
```
/deepdoc/vision/operators.py (lines 702-725)
```

### Purpose
Filter overlapping bounding boxes in object detection.

### Implementation

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression algorithm.

    Args:
        boxes: Array of [x1, y1, x2, y2]
        scores: Confidence scores
        iou_threshold: IoU threshold for suppression

    Returns:
        Indices of kept boxes
    """
    boxes = np.asarray(boxes)
    scores = np.asarray(scores)

    # Sort by score (descending)
    indices = np.argsort(scores)[::-1]

    keep = []
    while len(indices) > 0:
        # Keep highest scoring box
        current = indices[0]
        keep.append(current)

        if len(indices) == 1:
            break

        # Compute IoU with remaining boxes (compute_iou is defined in section 5)
        remaining = indices[1:]
        ious = np.array([compute_iou(boxes[current], boxes[i]) for i in remaining])

        # Keep boxes with IoU below threshold
        indices = remaining[ious < iou_threshold]

    return keep
```

### NMS Algorithm

```
NMS (Non-Maximum Suppression):

Input: Boxes B, Scores S, Threshold θ
Output: Filtered boxes

Algorithm:
1. Sort boxes by score (descending)
2. Select box with highest score → add to results
3. Remove boxes with IoU > θ with selected box
4. Repeat until no boxes remain

Example:
Boxes: [A(0.9), B(0.8), C(0.7)]
IoU(A,B) = 0.7 > 0.5 → Remove B
IoU(A,C) = 0.3 < 0.5 → Keep C
Result: [A, C]
```

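The worked example can be checked with the `nms()` implementation above using made-up coordinates, chosen so that B overlaps A heavily while C does not:

```python
boxes = [
    [0, 0, 10, 10],     # A
    [1, 0, 11, 10],     # B, overlaps A heavily (IoU ≈ 0.82)
    [20, 20, 30, 30],   # C, no overlap with A
]
scores = [0.9, 0.8, 0.7]

kept = nms(boxes, scores, iou_threshold=0.5)
print([int(i) for i in kept])  # [0, 2] → A and C survive, B is suppressed
```
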
---

## 5. Intersection over Union (IoU)

### File Location
```
/deepdoc/vision/operators.py (lines 702-725)
/deepdoc/vision/recognizer.py (lines 339-357)
```

### Purpose
Measure overlap between bounding boxes.

### Implementation

```python
def compute_iou(box1, box2):
    """
    Compute Intersection over Union.

    Args:
        box1, box2: [x1, y1, x2, y2] format

    Returns:
        IoU value in [0, 1]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    # IoU
    if union == 0:
        return 0

    return intersection / union
```

### IoU Formula

```
IoU (Intersection over Union):

IoU = Area(A ∩ B) / Area(A ∪ B)

    = Area(A ∩ B) / (Area(A) + Area(B) - Area(A ∩ B))

Range: [0, 1]
- IoU = 0: No overlap
- IoU = 1: Perfect overlap

Threshold Usage:
- Detection: IoU > 0.5 → Same object
- NMS: IoU > 0.5 → Suppress duplicate
```

---

## 6. Image Preprocessing

### File Location
```
/deepdoc/vision/operators.py
```

### Purpose
Prepare images for neural network input.

### Implementation

```python
import cv2
import numpy as np

class StandardizeImage:
    """Normalize image to [0, 1] range."""

    def __call__(self, image):
        return image.astype(np.float32) / 255.0

class NormalizeImage:
    """Apply mean/std normalization."""

    def __init__(self, mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]):
        self.mean = np.array(mean)
        self.std = np.array(std)

    def __call__(self, image):
        return (image - self.mean) / self.std

class ToCHWImage:
    """Convert HWC to CHW format."""

    def __call__(self, image):
        return image.transpose((2, 0, 1))

class LinearResize:
    """Resize image maintaining aspect ratio."""

    def __init__(self, target_size):
        self.target = target_size

    def __call__(self, image):
        h, w = image.shape[:2]
        scale = self.target / max(h, w)
        new_h, new_w = int(h * scale), int(w * scale)
        return cv2.resize(image, (new_w, new_h),
                          interpolation=cv2.INTER_CUBIC)
```

### Preprocessing Pipeline

```
Image Preprocessing Pipeline:

1. Resize (maintain aspect ratio)
   - Target: 640 or 1280 depending on model

2. Standardize (0-255 → 0-1)
   - image = image / 255.0

3. Normalize (ImageNet stats)
   - image = (image - mean) / std
   - mean = [0.485, 0.456, 0.406]
   - std = [0.229, 0.224, 0.225]

4. Transpose (HWC → CHW)
   - PyTorch format: (C, H, W)

5. Pad (to square)
   - Pad with zeros to square shape
```

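Chaining the operator classes from the implementation above reproduces steps 1–4 of this pipeline (a sketch: padding to square is left out, the 640 target is just an example, and the image path is illustrative):

```python
import cv2
import numpy as np

pipeline = [
    LinearResize(640),    # 1. resize, keep aspect ratio
    StandardizeImage(),   # 2. 0-255 → 0-1
    NormalizeImage(),     # 3. ImageNet mean/std
    ToCHWImage(),         # 4. HWC → CHW
]

image = cv2.imread("page_001.png")
for op in pipeline:
    image = op(image)

batch = np.expand_dims(image, axis=0).astype(np.float32)  # (1, C, H, W) for ONNX input
print(batch.shape)
```
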
---

## 7. XGBoost Text Concatenation

### File Location
```
/deepdoc/parser/pdf_parser.py (lines 88-101, 131-170)
```

### Purpose
Predict whether adjacent text boxes should be merged.

### Implementation

```python
import xgboost as xgb

class PDFParser:
    def __init__(self):
        # Load pre-trained XGBoost model
        self.concat_model = xgb.Booster()
        self.concat_model.load_model("updown_concat_xgb.model")

    def should_concat(self, box1, box2):
        """
        Predict if two text boxes should be concatenated.
        """
        # Extract features
        features = self._extract_concat_features(box1, box2)

        # Create DMatrix
        dmatrix = xgb.DMatrix([features])

        # Predict probability
        prob = self.concat_model.predict(dmatrix)[0]

        return prob > 0.5

    def _extract_concat_features(self, box1, box2):
        """
        Extract 20+ features for concatenation decision.
        """
        features = []

        # Distance features
        y_dist = box2["top"] - box1["bottom"]
        char_height = box1["bottom"] - box1["top"]
        features.append(y_dist / max(char_height, 1))

        # Alignment features
        x_overlap = min(box1["x1"], box2["x1"]) - max(box1["x0"], box2["x0"])
        features.append(x_overlap / max(box1["x1"] - box1["x0"], 1))

        # Text pattern features
        text1, text2 = box1["text"], box2["text"]
        features.append(1 if text1.endswith((".", "。", "!", "?")) else 0)
        features.append(1 if text2[:1].isupper() else 0)

        # Layout features
        features.append(1 if box1.get("layout_num") == box2.get("layout_num") else 0)

        # ... more features

        return features
```

### Feature List

```
XGBoost Concatenation Features:

1. Spatial Features:
   - Y-distance / char_height
   - X-alignment overlap ratio
   - Same page flag

2. Text Pattern Features:
   - Ends with sentence punctuation
   - Ends with continuation punctuation
   - Next starts with uppercase
   - Next starts with number
   - Chinese numbering pattern

3. Layout Features:
   - Same layout_type
   - Same layout_num
   - Same column

4. Tokenization Features:
   - Token count ratio
   - Last/first token match

Total: 20+ features
```

---

## Summary

| Algorithm | Purpose | Model Type |
|-----------|---------|------------|
| OCR | Text detection + recognition | ONNX (DB + CRNN) |
| Layout Recognition | Element detection | ONNX (YOLOv10) |
| TSR | Table structure | ONNX |
| NMS | Box filtering | Classical |
| IoU | Overlap measure | Classical |
| XGBoost | Text concatenation | Gradient Boosting |

## Related Files

- `/deepdoc/vision/ocr.py` - OCR models
- `/deepdoc/vision/layout_recognizer.py` - Layout detection
- `/deepdoc/vision/table_structure_recognizer.py` - TSR
- `/deepdoc/vision/operators.py` - Image processing
- `/deepdoc/parser/pdf_parser.py` - XGBoost integration