# 06-ALGORITHMS - Core Algorithms & Math

## Overview

This module documents all of the algorithms used in RAGFlow: 50+ algorithms grouped into 12 categories.

## Algorithm Categories

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            RAGFLOW ALGORITHM MAP                            │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. CLUSTERING                     │ 2. DIMENSIONALITY REDUCTION             │
│  ├── K-Means                      │  ├── UMAP                               │
│  ├── Gaussian Mixture Model (GMM) │  └── Node2Vec Embedding                 │
│  └── Silhouette Score             │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. GRAPH ALGORITHMS               │ 4. NLP/TEXT PROCESSING                  │
│  ├── PageRank                     │  ├── Trie-based Tokenization            │
│  ├── Leiden Community Detection   │  ├── Max-Forward/Backward Algorithm     │
│  ├── Entity Extraction (LLM)      │  ├── DFS with Memoization               │
│  ├── Relation Extraction (LLM)    │  ├── TF-IDF Term Weighting              │
│  ├── Entity Resolution            │  ├── Named Entity Recognition (NER)     │
│  └── Largest Connected Component  │  ├── Part-of-Speech Tagging (POS)       │
│                                   │  ├── Synonym Detection (WordNet)        │
│                                   │  └── Query Expansion                    │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 5. SIMILARITY/DISTANCE            │ 6. INFORMATION RETRIEVAL                │
│  ├── Cosine Similarity            │  ├── BM25 Scoring                       │
│  ├── Edit Distance (Levenshtein)  │  ├── Hybrid Score Fusion                │
│  ├── IoU (Intersection over Union)│  ├── Cross-Encoder Reranking            │
│  ├── Token Similarity             │  └── Weighted Sum Fusion                │
│  └── Hybrid Similarity            │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 7. CHUNKING/MERGING               │ 8. MACHINE LEARNING MODELS              │
│  ├── Naive Merge (Token-based)    │  ├── XGBoost (Text Concatenation)       │
│  ├── Hierarchical Merge           │  ├── ONNX Models (Vision)               │
│  ├── Tree-based Merge             │  └── Reranking Models                   │
│  └── Binary Search Merge          │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 9. VISION/IMAGE PROCESSING        │ 10. ADVANCED RAG                        │
│  ├── OCR (ONNX)                   │  ├── RAPTOR (Hierarchical Summarization)│
│  ├── Layout Recognition (YOLOv10) │  ├── GraphRAG                           │
│  ├── Table Structure Recognition  │  └── Community Reports                  │
│  └── Non-Maximum Suppression (NMS)│                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 11. OPTIMIZATION                  │ 12. DATA STRUCTURES                     │
│  ├── BIC (Bayesian Info Criterion)│  ├── Trie Tree                          │
│  └── Silhouette Score             │  ├── Hierarchical Tree                  │
│                                   │  └── NetworkX Graph                     │
└───────────────────────────────────┴─────────────────────────────────────────┘
```

## Files in This Module

| File | Description |
|------|-------------|
| [bm25_scoring.md](./bm25_scoring.md) | BM25 ranking algorithm |
| [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination |
| [raptor_algorithm.md](./raptor_algorithm.md) | Hierarchical summarization |
| [clustering_algorithms.md](./clustering_algorithms.md) | KMeans, GMM, UMAP |
| [graph_algorithms.md](./graph_algorithms.md) | PageRank, Leiden, Entity Resolution |
| [nlp_algorithms.md](./nlp_algorithms.md) | Tokenization, TF-IDF, NER, POS |
| [vision_algorithms.md](./vision_algorithms.md) | OCR, Layout, NMS |
| [similarity_metrics.md](./similarity_metrics.md) | Cosine, Edit Distance, IoU |

## Complete Algorithm Reference

### 1. CLUSTERING ALGORITHMS

| Algorithm | File | Description |
|-----------|------|-------------|
| K-Means | `/deepdoc/parser/pdf_parser.py:36` | Column detection in PDF layout |
| GMM | `/rag/raptor.py:22` | RAPTOR cluster selection |
| Silhouette Score | `/deepdoc/parser/pdf_parser.py:37` | Cluster validation |

### 2. DIMENSIONALITY REDUCTION

| Algorithm | File | Description |
|-----------|------|-------------|
| UMAP | `/rag/raptor.py:21` | Pre-clustering dimension reduction |
| Node2Vec | `/graphrag/general/entity_embedding.py:24` | Graph node embedding |

### 3. GRAPH ALGORITHMS

| Algorithm | File | Description |
|-----------|------|-------------|
| PageRank | `/graphrag/entity_resolution.py:150` | Entity importance scoring |
| Leiden | `/graphrag/general/leiden.py:72` | Hierarchical community detection |
| Entity Extraction | `/graphrag/general/extractor.py` | LLM-based entity extraction |
| Relation Extraction | `/graphrag/general/extractor.py` | LLM-based relation extraction |
| Entity Resolution | `/graphrag/entity_resolution.py` | Entity deduplication |
| LCC | `/graphrag/general/leiden.py:67` | Largest connected component |

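To make the PageRank entry concrete, here is a minimal power-iteration sketch on a small directed edge list. It is an illustration only, not RAGFlow's code: the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical entity graph: "a" is referenced by both "c" and "d".
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
```

Nodes with many incoming links from well-ranked nodes end up with higher scores, which is what makes the measure usable for entity importance.
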
### 4. NLP/TEXT PROCESSING

| Algorithm | File | Description |
|-----------|------|-------------|
| Trie Tokenization | `/rag/nlp/rag_tokenizer.py:72` | Chinese word segmentation |
| Max-Forward | `/rag/nlp/rag_tokenizer.py:250` | Forward tokenization |
| Max-Backward | `/rag/nlp/rag_tokenizer.py:273` | Backward tokenization |
| DFS + Memo | `/rag/nlp/rag_tokenizer.py:120` | Disambiguation |
| TF-IDF | `/rag/nlp/term_weight.py:223` | Term weighting |
| NER | `/rag/nlp/term_weight.py:84` | Named entity recognition |
| POS Tagging | `/rag/nlp/term_weight.py:179` | Part-of-speech tagging |
| Synonym | `/rag/nlp/synonym.py:71` | Synonym lookup |
| Query Expansion | `/rag/nlp/query.py:85` | Query rewriting |
| Porter Stemmer | `/rag/nlp/rag_tokenizer.py:27` | English stemming |
| WordNet Lemmatizer | `/rag/nlp/rag_tokenizer.py:27` | Lemmatization |

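The Max-Forward and Max-Backward entries refer to greedy maximum matching against a dictionary. The sketch below shows the general technique under simplifying assumptions: it uses a plain Python `set` as the dictionary instead of a trie, and the names `max_forward`, `max_backward`, `vocab`, and `max_len` are hypothetical rather than taken from `rag_tokenizer.py`.

```python
def max_forward(text, vocab, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def max_backward(text, vocab, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


# Toy dictionary for illustration only.
vocab = {"研究", "研究生", "生命", "起源"}
forward = max_forward("研究生命起源", vocab)    # ['研究生', '命', '起源']
backward = max_backward("研究生命起源", vocab)  # ['研究', '生命', '起源']
```

The two directions can disagree, as they do here. That disagreement is the ambiguity a disambiguation step (the DFS + Memo entry in the table) is meant to resolve.
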
### 5. SIMILARITY/DISTANCE METRICS

| Algorithm | File | Formula |
|-----------|------|---------|
| Cosine Similarity | `/rag/nlp/query.py:221` | `cos(θ) = A·B / (‖A‖×‖B‖)` |
| Edit Distance | `/graphrag/entity_resolution.py:28` | Levenshtein distance |
| IoU | `/deepdoc/vision/operators.py:702` | `intersection / union` |
| Token Similarity | `/rag/nlp/query.py:230` | Weighted token overlap |
| Hybrid Similarity | `/rag/nlp/query.py:220` | `α×token + β×vector` |

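The formulas in this table can be written out directly. The following is a minimal standard-library sketch, not the code in `query.py` or `entity_resolution.py`. The function names are hypothetical, and the default weights `alpha=0.3` and `beta=0.7` are placeholder assumptions, since the table only gives the symbolic form `α×token + β×vector`.

```python
import math


def cosine_similarity(a, b):
    """cos(θ) = A·B / (‖A‖×‖B‖); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(s, t):
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def hybrid_similarity(token_sim, vector_sim, alpha=0.3, beta=0.7):
    """Weighted sum α×token + β×vector; the weights here are assumed values."""
    return alpha * token_sim + beta * vector_sim
```

For example, `edit_distance("kitten", "sitting")` is 3, and two parallel vectors such as `[1, 2, 3]` and `[2, 4, 6]` have cosine similarity 1.
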
### 6. INFORMATION RETRIEVAL

| Algorithm | File | Formula |
|-----------|------|---------|
| BM25 | `/rag/nlp/search.py` | ES native BM25 |
| Hybrid Fusion | `/rag/nlp/search.py:126` | `0.05×BM25 + 0.95×Vector` |
| Reranking | `/rag/nlp/search.py:330` | Cross-encoder scoring |
| Argsort Ranking | `/rag/nlp/search.py:429` | Score-based sorting |

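The hybrid fusion formula and the argsort ranking step combine naturally. The sketch below uses the 0.05 and 0.95 weights from the table. Everything else is an assumption made for illustration: the function names, and the premise that both score lists are already normalized to a comparable range.

```python
def fuse_scores(bm25_scores, vector_scores, bm25_weight=0.05, vector_weight=0.95):
    """Weighted sum of per-document keyword and vector scores."""
    if len(bm25_scores) != len(vector_scores):
        raise ValueError("score lists must have the same length")
    return [bm25_weight * b + vector_weight * v
            for b, v in zip(bm25_scores, vector_scores)]


def argsort_desc(scores):
    """Indices of the scores, ordered from highest to lowest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Illustrative scores for three candidate chunks.
fused = fuse_scores([1.0, 0.2, 0.0], [0.1, 0.9, 0.5])
ranking = argsort_desc(fused)  # [1, 2, 0]
```

With these weights the vector score dominates: the chunk with the best BM25 score (index 0) still ranks last because its vector score is low.
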
### 7. CHUNKING/MERGING

| Algorithm | File | Description |
|-----------|------|-------------|
| Naive Merge | `/rag/nlp/__init__.py:582` | Token-based chunking |
| Naive Merge + Images | `/rag/nlp/__init__.py:645` | With image tracking |
| Hierarchical Merge | `/rag/nlp/__init__.py:487` | Tree-based merging |
| Binary Search | `/rag/nlp/__init__.py:512` | Efficient section lookup |
| DFS Tree Traversal | `/rag/flow/hierarchical_merger/` | Document hierarchy |

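As a rough picture of token-based chunking, the sketch below greedily packs consecutive sections into a chunk until a token budget would be exceeded. It is a simplified illustration rather than the function in `/rag/nlp/__init__.py`: the signature, the `chunk_token_num` default, and the whitespace-based token counter are all assumptions.

```python
def naive_merge(sections, chunk_token_num=128, count_tokens=None):
    """Greedily pack consecutive sections into chunks under a token budget."""
    if count_tokens is None:
        # Assumed stand-in for a real tokenizer: count whitespace-separated words.
        count_tokens = lambda s: len(s.split())

    chunks, current, current_tokens = [], [], 0
    for section in sections:
        n = count_tokens(section)
        # Start a new chunk when adding this section would exceed the budget.
        if current and current_tokens + n > chunk_token_num:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks


chunks = naive_merge(["a b c", "d e", "f g h i", "j"], chunk_token_num=5)
# ['a b c\nd e', 'f g h i\nj']
```

A single section larger than the budget becomes its own oversized chunk in this sketch. A production implementation would need a policy for splitting such sections.
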
### 8. MACHINE LEARNING MODELS

| Model | File | Purpose |
|-------|------|---------|
| XGBoost | `/deepdoc/parser/pdf_parser.py:88` | Text concatenation |
| ONNX OCR | `/deepdoc/vision/ocr.py:32` | Text recognition |
| ONNX Layout | `/deepdoc/vision/layout_recognizer.py` | Layout detection |
| ONNX TSR | `/deepdoc/vision/table_structure_recognizer.py` | Table structure |
| YOLOv10 | `/deepdoc/vision/layout_recognizer.py` | Object detection |

### 9. VISION/IMAGE PROCESSING

| Algorithm | File | Description |
|-----------|------|-------------|
| NMS | `/deepdoc/vision/operators.py:702` | Box filtering |
| IoU Filtering | `/deepdoc/vision/recognizer.py:359` | Overlap detection |
| Bounding Box Overlap | `/deepdoc/vision/layout_recognizer.py:94` | Spatial analysis |

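NMS and IoU filtering work together: boxes are visited in descending score order, and a box is dropped when it overlaps an already kept box too much. The sketch below shows the standard greedy form of the technique. The `(x1, y1, x2, y2)` box layout, the function names, and the 0.5 threshold are assumptions, not details taken from `operators.py`.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box above the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])  # [0, 2]
```

The first two boxes overlap with an IoU of 81/119 (about 0.68), so the lower-scoring one is suppressed, while the third box is disjoint and survives.
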
### 10. ADVANCED RAG

| Algorithm | File | Description |
|-----------|------|-------------|
| RAPTOR | `/rag/raptor.py:37` | Hierarchical summarization |
| GraphRAG | `/graphrag/` | Knowledge graph RAG |
| Community Reports | `/graphrag/general/community_reports_extractor.py` | Graph summaries |

### 11. OPTIMIZATION CRITERIA

| Algorithm | File | Formula |
|-----------|------|---------|
| BIC | `/rag/raptor.py:92` | `k×log(n) - 2×log(L)` |
| Silhouette | `/deepdoc/parser/pdf_parser.py:400` | `(b-a) / max(a,b)` |

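Both criteria are short enough to state as code. In the sketch below, `k` is the number of free model parameters, `n` the number of samples, and `log(L)` the maximized log-likelihood. For the silhouette, `a` is a point's mean distance to its own cluster and `b` its mean distance to the nearest other cluster. The function names are hypothetical, and library implementations compute these quantities from fitted models and data rather than taking them as arguments.

```python
import math


def bic(num_params, num_samples, log_likelihood):
    """BIC = k×log(n) - 2×log(L); lower values indicate a better trade-off."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood


def silhouette(a, b):
    """Per-point silhouette (b - a) / max(a, b), in the range [-1, 1]."""
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


# A point much closer to its own cluster (a=1) than to the next one (b=3).
score = silhouette(1.0, 3.0)  # 2/3
```

When BIC is used for model selection, the candidate with the lowest value is preferred: the `k×log(n)` term penalizes extra parameters, and the `-2×log(L)` term rewards fit.
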
## Statistics

- **Total Algorithms**: 50+
- **Categories**: 12
- **Key Libraries**: sklearn, UMAP, XGBoost, NetworkX, graspologic, ONNX