# 06-ALGORITHMS - Core Algorithms & Math ## Tong Quan Module này chứa TẤT CẢ các thuật toán được sử dụng trong RAGFlow, bao gồm 50+ algorithms chia thành 12 categories. ## Algorithm Categories ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ RAGFLOW ALGORITHM MAP │ └─────────────────────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ 1. CLUSTERING │ 2. DIMENSIONALITY REDUCTION │ │ ├── K-Means │ ├── UMAP │ │ ├── Gaussian Mixture Model (GMM) │ └── Node2Vec Embedding │ │ └── Silhouette Score │ │ └───────────────────────────────────┴─────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ 3. GRAPH ALGORITHMS │ 4. NLP/TEXT PROCESSING │ │ ├── PageRank │ ├── Trie-based Tokenization │ │ ├── Leiden Community Detection │ ├── Max-Forward/Backward Algorithm │ │ ├── Entity Extraction (LLM) │ ├── DFS with Memoization │ │ ├── Relation Extraction (LLM) │ ├── TF-IDF Term Weighting │ │ ├── Entity Resolution │ ├── Named Entity Recognition (NER) │ │ └── Largest Connected Component │ ├── Part-of-Speech Tagging (POS) │ │ │ ├── Synonym Detection (WordNet) │ │ │ └── Query Expansion │ └───────────────────────────────────┴─────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ 5. SIMILARITY/DISTANCE │ 6. INFORMATION RETRIEVAL │ │ ├── Cosine Similarity │ ├── BM25 Scoring │ │ ├── Edit Distance (Levenshtein) │ ├── Hybrid Score Fusion │ │ ├── IoU (Intersection over Union)│ ├── Cross-Encoder Reranking │ │ ├── Token Similarity │ └── Weighted Sum Fusion │ │ └── Hybrid Similarity │ │ └───────────────────────────────────┴─────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ 7. CHUNKING/MERGING │ 8. MACHINE LEARNING MODELS │ │ ├── Naive Merge (Token-based) │ ├── XGBoost (Text Concatenation) │ │ ├── Hierarchical Merge │ ├── ONNX Models (Vision) │ │ ├── Tree-based Merge │ └── Reranking Models │ │ └── Binary Search Merge │ │ └───────────────────────────────────┴─────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ 9. VISION/IMAGE PROCESSING │ 10. ADVANCED RAG │ │ ├── OCR (ONNX) │ ├── RAPTOR (Hierarchical Summarization)│ │ ├── Layout Recognition (YOLOv10) │ ├── GraphRAG │ │ ├── Table Structure Recognition │ └── Community Reports │ │ └── Non-Maximum Suppression (NMS)│ │ └───────────────────────────────────┴─────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ 11. OPTIMIZATION │ 12. DATA STRUCTURES │ │ ├── BIC (Bayesian Info Criterion)│ ├── Trie Tree │ │ └── Silhouette Score │ ├── Hierarchical Tree │ │ │ └── NetworkX Graph │ └───────────────────────────────────┴─────────────────────────────────────────┘ ``` ## Files Trong Module Nay | File | Mo Ta | |------|-------| | [bm25_scoring.md](./bm25_scoring.md) | BM25 ranking algorithm | | [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination | | [raptor_algorithm.md](./raptor_algorithm.md) | Hierarchical summarization | | [clustering_algorithms.md](./clustering_algorithms.md) | KMeans, GMM, UMAP | | [graph_algorithms.md](./graph_algorithms.md) | PageRank, Leiden, Entity Resolution | | [nlp_algorithms.md](./nlp_algorithms.md) | Tokenization, TF-IDF, NER, POS | | [vision_algorithms.md](./vision_algorithms.md) | OCR, Layout, NMS | | [similarity_metrics.md](./similarity_metrics.md) | Cosine, Edit Distance, IoU | ## Complete Algorithm Reference ### 1. CLUSTERING ALGORITHMS | Algorithm | File | Description | |-----------|------|-------------| | K-Means | `/deepdoc/parser/pdf_parser.py:36` | Column detection in PDF layout | | GMM | `/rag/raptor.py:22` | RAPTOR cluster selection | | Silhouette Score | `/deepdoc/parser/pdf_parser.py:37` | Cluster validation | ### 2. DIMENSIONALITY REDUCTION | Algorithm | File | Description | |-----------|------|-------------| | UMAP | `/rag/raptor.py:21` | Pre-clustering dimension reduction | | Node2Vec | `/graphrag/general/entity_embedding.py:24` | Graph node embedding | ### 3. GRAPH ALGORITHMS | Algorithm | File | Description | |-----------|------|-------------| | PageRank | `/graphrag/entity_resolution.py:150` | Entity importance scoring | | Leiden | `/graphrag/general/leiden.py:72` | Hierarchical community detection | | Entity Extraction | `/graphrag/general/extractor.py` | LLM-based entity extraction | | Relation Extraction | `/graphrag/general/extractor.py` | LLM-based relation extraction | | Entity Resolution | `/graphrag/entity_resolution.py` | Entity deduplication | | LCC | `/graphrag/general/leiden.py:67` | Largest connected component | ### 4. NLP/TEXT PROCESSING | Algorithm | File | Description | |-----------|------|-------------| | Trie Tokenization | `/rag/nlp/rag_tokenizer.py:72` | Chinese word segmentation | | Max-Forward | `/rag/nlp/rag_tokenizer.py:250` | Forward tokenization | | Max-Backward | `/rag/nlp/rag_tokenizer.py:273` | Backward tokenization | | DFS + Memo | `/rag/nlp/rag_tokenizer.py:120` | Disambiguation | | TF-IDF | `/rag/nlp/term_weight.py:223` | Term weighting | | NER | `/rag/nlp/term_weight.py:84` | Named entity recognition | | POS Tagging | `/rag/nlp/term_weight.py:179` | Part-of-speech tagging | | Synonym | `/rag/nlp/synonym.py:71` | Synonym lookup | | Query Expansion | `/rag/nlp/query.py:85` | Query rewriting | | Porter Stemmer | `/rag/nlp/rag_tokenizer.py:27` | English stemming | | WordNet Lemmatizer | `/rag/nlp/rag_tokenizer.py:27` | Lemmatization | ### 5. SIMILARITY/DISTANCE METRICS | Algorithm | File | Formula | |-----------|------|---------| | Cosine Similarity | `/rag/nlp/query.py:221` | `cos(θ) = A·B / (‖A‖×‖B‖)` | | Edit Distance | `/graphrag/entity_resolution.py:28` | Levenshtein distance | | IoU | `/deepdoc/vision/operators.py:702` | `intersection / union` | | Token Similarity | `/rag/nlp/query.py:230` | Weighted token overlap | | Hybrid Similarity | `/rag/nlp/query.py:220` | `α×token + β×vector` | ### 6. INFORMATION RETRIEVAL | Algorithm | File | Formula | |-----------|------|---------| | BM25 | `/rag/nlp/search.py` | ES native BM25 | | Hybrid Fusion | `/rag/nlp/search.py:126` | `0.05×BM25 + 0.95×Vector` | | Reranking | `/rag/nlp/search.py:330` | Cross-encoder scoring | | Argsort Ranking | `/rag/nlp/search.py:429` | Score-based sorting | ### 7. CHUNKING/MERGING | Algorithm | File | Description | |-----------|------|-------------| | Naive Merge | `/rag/nlp/__init__.py:582` | Token-based chunking | | Naive Merge + Images | `/rag/nlp/__init__.py:645` | With image tracking | | Hierarchical Merge | `/rag/nlp/__init__.py:487` | Tree-based merging | | Binary Search | `/rag/nlp/__init__.py:512` | Efficient section lookup | | DFS Tree Traversal | `/rag/flow/hierarchical_merger/` | Document hierarchy | ### 8. MACHINE LEARNING MODELS | Model | File | Purpose | |-------|------|---------| | XGBoost | `/deepdoc/parser/pdf_parser.py:88` | Text concatenation | | ONNX OCR | `/deepdoc/vision/ocr.py:32` | Text recognition | | ONNX Layout | `/deepdoc/vision/layout_recognizer.py` | Layout detection | | ONNX TSR | `/deepdoc/vision/table_structure_recognizer.py` | Table structure | | YOLOv10 | `/deepdoc/vision/layout_recognizer.py` | Object detection | ### 9. VISION/IMAGE PROCESSING | Algorithm | File | Description | |-----------|------|-------------| | NMS | `/deepdoc/vision/operators.py:702` | Box filtering | | IoU Filtering | `/deepdoc/vision/recognizer.py:359` | Overlap detection | | Bounding Box Overlap | `/deepdoc/vision/layout_recognizer.py:94` | Spatial analysis | ### 10. ADVANCED RAG | Algorithm | File | Description | |-----------|------|-------------| | RAPTOR | `/rag/raptor.py:37` | Hierarchical summarization | | GraphRAG | `/graphrag/` | Knowledge graph RAG | | Community Reports | `/graphrag/general/community_reports_extractor.py` | Graph summaries | ### 11. OPTIMIZATION CRITERIA | Algorithm | File | Formula | |-----------|------|---------| | BIC | `/rag/raptor.py:92` | `k×log(n) - 2×log(L)` | | Silhouette | `/deepdoc/parser/pdf_parser.py:400` | `(b-a) / max(a,b)` | ## Statistics - **Total Algorithms**: 50+ - **Categories**: 12 - **Key Libraries**: sklearn, UMAP, XGBoost, NetworkX, graspologic, ONNX