06-ALGORITHMS - Core Algorithms & Math
Overview
This module documents all of the algorithms used in RAGFlow: 50+ algorithms grouped into 12 categories.
Algorithm Categories
┌─────────────────────────────────────────────────────────────────────────────┐
│                            RAGFLOW ALGORITHM MAP                            │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│  1. CLUSTERING                    │  2. DIMENSIONALITY REDUCTION            │
│  ├── K-Means                      │  ├── UMAP                               │
│  ├── Gaussian Mixture Model (GMM) │  └── Node2Vec Embedding                 │
│  └── Silhouette Score             │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│  3. GRAPH ALGORITHMS              │  4. NLP/TEXT PROCESSING                 │
│  ├── PageRank                     │  ├── Trie-based Tokenization            │
│  ├── Leiden Community Detection   │  ├── Max-Forward/Backward Algorithm     │
│  ├── Entity Extraction (LLM)      │  ├── DFS with Memoization               │
│  ├── Relation Extraction (LLM)    │  ├── TF-IDF Term Weighting              │
│  ├── Entity Resolution            │  ├── Named Entity Recognition (NER)     │
│  └── Largest Connected Component  │  ├── Part-of-Speech Tagging (POS)       │
│                                   │  ├── Synonym Detection (WordNet)        │
│                                   │  └── Query Expansion                    │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│  5. SIMILARITY/DISTANCE           │  6. INFORMATION RETRIEVAL               │
│  ├── Cosine Similarity            │  ├── BM25 Scoring                       │
│  ├── Edit Distance (Levenshtein)  │  ├── Hybrid Score Fusion                │
│  ├── IoU (Intersection over Union)│  ├── Cross-Encoder Reranking            │
│  ├── Token Similarity             │  └── Weighted Sum Fusion                │
│  └── Hybrid Similarity            │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│  7. CHUNKING/MERGING              │  8. MACHINE LEARNING MODELS             │
│  ├── Naive Merge (Token-based)    │  ├── XGBoost (Text Concatenation)       │
│  ├── Hierarchical Merge           │  ├── ONNX Models (Vision)               │
│  ├── Tree-based Merge             │  └── Reranking Models                   │
│  └── Binary Search Merge          │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│  9. VISION/IMAGE PROCESSING       │  10. ADVANCED RAG                       │
│  ├── OCR (ONNX)                   │  ├── RAPTOR (Hierarchical Summarization)│
│  ├── Layout Recognition (YOLOv10) │  ├── GraphRAG                           │
│  ├── Table Structure Recognition  │  └── Community Reports                  │
│  └── Non-Maximum Suppression (NMS)│                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│  11. OPTIMIZATION                 │  12. DATA STRUCTURES                    │
│  ├── BIC (Bayesian Info Criterion)│  ├── Trie Tree                          │
│  └── Silhouette Score             │  ├── Hierarchical Tree                  │
│                                   │  └── NetworkX Graph                     │
└───────────────────────────────────┴─────────────────────────────────────────┘
Files in This Module
| File | Description |
|---|---|
| bm25_scoring.md | BM25 ranking algorithm |
| hybrid_score_fusion.md | Score combination |
| raptor_algorithm.md | Hierarchical summarization |
| clustering_algorithms.md | KMeans, GMM, UMAP |
| graph_algorithms.md | PageRank, Leiden, Entity Resolution |
| nlp_algorithms.md | Tokenization, TF-IDF, NER, POS |
| vision_algorithms.md | OCR, Layout, NMS |
| similarity_metrics.md | Cosine, Edit Distance, IoU |
Complete Algorithm Reference
1. CLUSTERING ALGORITHMS
| Algorithm | File | Description |
|---|---|---|
| K-Means | `/deepdoc/parser/pdf_parser.py:36` | Column detection in PDF layout |
| GMM | `/rag/raptor.py:22` | RAPTOR cluster selection |
| Silhouette Score | `/deepdoc/parser/pdf_parser.py:37` | Cluster validation |
2. DIMENSIONALITY REDUCTION
| Algorithm | File | Description |
|---|---|---|
| UMAP | `/rag/raptor.py:21` | Pre-clustering dimension reduction |
| Node2Vec | `/graphrag/general/entity_embedding.py:24` | Graph node embedding |
3. GRAPH ALGORITHMS
| Algorithm | File | Description |
|---|---|---|
| PageRank | `/graphrag/entity_resolution.py:150` | Entity importance scoring |
| Leiden | `/graphrag/general/leiden.py:72` | Hierarchical community detection |
| Entity Extraction | `/graphrag/general/extractor.py` | LLM-based entity extraction |
| Relation Extraction | `/graphrag/general/extractor.py` | LLM-based relation extraction |
| Entity Resolution | `/graphrag/entity_resolution.py` | Entity deduplication |
| LCC | `/graphrag/general/leiden.py:67` | Largest connected component |
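
To illustrate the entity importance scoring listed above, here is a minimal power-iteration PageRank in plain Python. It is a sketch under stated assumptions, not RAGFlow's code: the function name `pagerank`, the adjacency-dict input, the damping factor of 0.85 (the conventional default), and the fixed iteration count are all chosen for the example. A project that already depends on NetworkX would more likely call that library's implementation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph.

    `graph` maps each node to a list of its out-neighbours. Every node,
    including nodes with no outgoing edges, must appear as a key.
    (Hypothetical helper for illustration only.)
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                # Each node splits its rank equally among its out-neighbours.
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


if __name__ == "__main__":
    scores = pagerank({"a": ["c"], "b": ["c"], "c": []})
    print(max(scores, key=scores.get))  # the node with the most incoming links
```

The ranks always sum to 1, so they can be compared across entities of the same graph as relative importance scores.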
4. NLP/TEXT PROCESSING
| Algorithm | File | Description |
|---|---|---|
| Trie Tokenization | `/rag/nlp/rag_tokenizer.py:72` | Chinese word segmentation |
| Max-Forward | `/rag/nlp/rag_tokenizer.py:250` | Forward tokenization |
| Max-Backward | `/rag/nlp/rag_tokenizer.py:273` | Backward tokenization |
| DFS + Memo | `/rag/nlp/rag_tokenizer.py:120` | Disambiguation |
| TF-IDF | `/rag/nlp/term_weight.py:223` | Term weighting |
| NER | `/rag/nlp/term_weight.py:84` | Named entity recognition |
| POS Tagging | `/rag/nlp/term_weight.py:179` | Part-of-speech tagging |
| Synonym | `/rag/nlp/synonym.py:71` | Synonym lookup |
| Query Expansion | `/rag/nlp/query.py:85` | Query rewriting |
| Porter Stemmer | `/rag/nlp/rag_tokenizer.py:27` | English stemming |
| WordNet Lemmatizer | `/rag/nlp/rag_tokenizer.py:27` | Lemmatization |
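
The Max-Forward and Max-Backward rows refer to greedy maximum matching against a dictionary. The sketch below shows the general technique, not the RAGFlow tokenizer itself: it uses a plain Python `set` as the vocabulary instead of a trie, and the names `max_forward`, `max_backward`, and the `max_len` window are assumptions for the example. When the two directions disagree, the text is ambiguous and a separate disambiguation step (the DFS row above) has to choose.

```python
def max_forward(text, vocab, max_len=4):
    """Greedy left-to-right matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def max_backward(text, vocab, max_len=4):
    """Greedy right-to-left matching: take the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


if __name__ == "__main__":
    vocab = {"研究", "研究生", "生命", "起源"}
    print(max_forward("研究生命起源", vocab))   # ['研究生', '命', '起源']
    print(max_backward("研究生命起源", vocab))  # ['研究', '生命', '起源']
```

In this example the backward pass yields the more natural segmentation, which is why both passes are run and compared.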
5. SIMILARITY/DISTANCE METRICS
| Algorithm | File | Formula |
|---|---|---|
| Cosine Similarity | `/rag/nlp/query.py:221` | cos(θ) = A·B / (‖A‖×‖B‖) |
| Edit Distance | `/graphrag/entity_resolution.py:28` | Levenshtein distance |
| IoU | `/deepdoc/vision/operators.py:702` | intersection / union |
| Token Similarity | `/rag/nlp/query.py:230` | Weighted token overlap |
| Hybrid Similarity | `/rag/nlp/query.py:220` | α×token + β×vector |
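
The formulas in this table can be written out in a few lines of plain Python. This is an illustrative sketch only: the function names, the dict-of-weights input for token similarity, and the default `alpha`/`beta` values are assumptions for the example and are not taken from the RAGFlow source.

```python
import math


def cosine_similarity(a, b):
    """cos(θ) = A·B / (‖A‖ × ‖B‖); returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(s, t):
    """Levenshtein distance using two rolling rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def token_similarity(query_weights, doc_tokens):
    """Weighted token overlap: share of the query's term weight found in the document."""
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    doc = set(doc_tokens)
    return sum(w for tok, w in query_weights.items() if tok in doc) / total


def hybrid_similarity(token_sim, vector_sim, alpha=0.3, beta=0.7):
    """α×token + β×vector; the default weights here are placeholders."""
    return alpha * token_sim + beta * vector_sim


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # 3
```

Both similarity inputs to `hybrid_similarity` should be on comparable scales (for example, both in [0, 1]) for the weights to be meaningful.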
6. INFORMATION RETRIEVAL
| Algorithm | File | Formula |
|---|---|---|
| BM25 | `/rag/nlp/search.py` | ES native BM25 |
| Hybrid Fusion | `/rag/nlp/search.py:126` | 0.05×BM25 + 0.95×Vector |
| Reranking | `/rag/nlp/search.py:330` | Cross-encoder scoring |
| Argsort Ranking | `/rag/nlp/search.py:429` | Score-based sorting |
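
The Hybrid Fusion and Argsort Ranking rows combine into a simple weighted-sum ranking. The sketch below uses the 0.05 / 0.95 weights from the table; everything else (the function names, list inputs, and the assumption that both score lists are already normalized to a comparable range) is hypothetical and only meant to show the arithmetic.

```python
def fuse_scores(bm25_scores, vector_scores, text_weight=0.05, vector_weight=0.95):
    """Weighted sum of per-document scores: 0.05×BM25 + 0.95×Vector by default.

    Assumes both lists are aligned by document and on comparable scales.
    """
    if len(bm25_scores) != len(vector_scores):
        raise ValueError("score lists must have the same length")
    return [text_weight * t + vector_weight * v
            for t, v in zip(bm25_scores, vector_scores)]


def argsort_desc(scores):
    """Indices of `scores` ordered from highest to lowest (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    fused = fuse_scores([1.0, 0.0, 0.5], [0.0, 1.0, 0.5])
    print(argsort_desc(fused))  # [1, 2, 0]
```

With these weights the vector score dominates: a document that matches only on keywords ranks below one that matches only semantically.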
7. CHUNKING/MERGING
| Algorithm | File | Description |
|---|---|---|
| Naive Merge | `/rag/nlp/__init__.py:582` | Token-based chunking |
| Naive Merge + Images | `/rag/nlp/__init__.py:645` | With image tracking |
| Hierarchical Merge | `/rag/nlp/__init__.py:487` | Tree-based merging |
| Binary Search | `/rag/nlp/__init__.py:512` | Efficient section lookup |
| DFS Tree Traversal | `/rag/flow/hierarchical_merger/` | Document hierarchy |
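
Token-based chunking of the kind listed in the first row can be sketched as a greedy merge: consecutive sections are appended to the current chunk until a token budget would be exceeded. The version below is a simplified illustration, not the RAGFlow implementation. The name `naive_merge_sketch`, the `chunk_token_num` parameter, and the whitespace-based token counter are all assumptions; a real pipeline would count tokens with its own tokenizer.

```python
def naive_merge_sketch(sections, chunk_token_num=128, count_tokens=None):
    """Greedily merge consecutive text sections into chunks under a token budget.

    A section larger than the budget becomes its own chunk; it is not split.
    """
    if count_tokens is None:
        # Placeholder tokenizer: whitespace-separated words.
        count_tokens = lambda s: len(s.split())
    chunks, sizes = [], []
    for section in sections:
        n = count_tokens(section)
        if chunks and sizes[-1] + n <= chunk_token_num:
            # Still within budget: extend the current chunk.
            chunks[-1] += "\n" + section
            sizes[-1] += n
        else:
            # Budget exceeded (or first section): start a new chunk.
            chunks.append(section)
            sizes.append(n)
    return chunks


if __name__ == "__main__":
    print(naive_merge_sketch(["a b", "c d", "e f g"], chunk_token_num=4))
    # ['a b\nc d', 'e f g']
```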
8. MACHINE LEARNING MODELS
| Model | File | Purpose |
|---|---|---|
| XGBoost | `/deepdoc/parser/pdf_parser.py:88` | Text concatenation |
| ONNX OCR | `/deepdoc/vision/ocr.py:32` | Text recognition |
| ONNX Layout | `/deepdoc/vision/layout_recognizer.py` | Layout detection |
| ONNX TSR | `/deepdoc/vision/table_structure_recognizer.py` | Table structure |
| YOLOv10 | `/deepdoc/vision/layout_recognizer.py` | Object detection |
9. VISION/IMAGE PROCESSING
| Algorithm | File | Description |
|---|---|---|
| NMS | `/deepdoc/vision/operators.py:702` | Box filtering |
| IoU Filtering | `/deepdoc/vision/recognizer.py:359` | Overlap detection |
| Bounding Box Overlap | `/deepdoc/vision/layout_recognizer.py:94` | Spatial analysis |
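
IoU and NMS work together: NMS keeps the highest-scoring box and discards any remaining box whose IoU with an already kept box exceeds a threshold. The following is a minimal sketch of that standard procedure in plain Python. The box format `(x0, y0, x1, y1)`, the function names, and the default threshold of 0.5 are assumptions for the example rather than values read from the RAGFlow code.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```

In the usage example the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the third box does not overlap anything and is kept.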
10. ADVANCED RAG
| Algorithm | File | Description |
|---|---|---|
| RAPTOR | `/rag/raptor.py:37` | Hierarchical summarization |
| GraphRAG | `/graphrag/` | Knowledge graph RAG |
| Community Reports | `/graphrag/general/community_reports_extractor.py` | Graph summaries |
11. OPTIMIZATION CRITERIA
| Algorithm | File | Formula |
|---|---|---|
| BIC | `/rag/raptor.py:92` | k×log(n) - 2×log(L) |
| Silhouette | `/deepdoc/parser/pdf_parser.py:400` | (b-a) / max(a,b) |
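
Both criteria are used to choose a number of clusters: lower BIC is better, while a higher silhouette value (range -1 to 1) is better. The sketch below evaluates the two formulas from the table directly. The function names and the idea of passing precomputed log-likelihoods in a dict are assumptions for illustration; in practice a library such as scikit-learn would supply these values.

```python
import math


def bic(num_params, num_samples, log_likelihood):
    """BIC = k×log(n) - 2×log(L), where `log_likelihood` is already log(L)."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood


def silhouette(a, b):
    """Silhouette of one sample: (b - a) / max(a, b).

    a = mean distance to points in the same cluster,
    b = mean distance to points in the nearest other cluster.
    """
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def best_by_bic(bic_by_k):
    """Pick the cluster count with the lowest BIC from a {k: bic} mapping."""
    return min(bic_by_k, key=bic_by_k.get)


if __name__ == "__main__":
    # Hypothetical fits: (parameter count, log-likelihood) for k = 1, 2, 3 on 100 samples.
    fits = {1: (2, -220.0), 2: (5, -180.0), 3: (8, -178.0)}
    scores = {k: bic(p, 100, ll) for k, (p, ll) in fits.items()}
    print(best_by_bic(scores))  # 2
```

The example shows the penalty term at work: k = 3 has the best likelihood, but its extra parameters cost more than the fit improves, so k = 2 wins.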
Statistics
- Total Algorithms: 50+
- Categories: 12
- Key Libraries: sklearn, UMAP, XGBoost, NetworkX, graspologic, ONNX