ragflow/personal_analyze/06-ALGORITHMS
Claude 566bce428b
docs: Add comprehensive algorithm documentation (50+ algorithms)
- Updated README.md with complete algorithm map across 12 categories
- Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec)
- Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution)
- Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym)
- Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost)
- Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
2025-11-27 03:34:49 +00:00

06-ALGORITHMS - Core Algorithms & Math

Overview

This module documents all the algorithms used in RAGFlow: more than 50 algorithms grouped into 12 categories.

Algorithm Categories

┌─────────────────────────────────────────────────────────────────────────────┐
│                         RAGFLOW ALGORITHM MAP                                │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  1. CLUSTERING                    │  2. DIMENSIONALITY REDUCTION            │
│  ├── K-Means                      │  ├── UMAP                               │
│  ├── Gaussian Mixture Model (GMM) │  └── Node2Vec Embedding                 │
│  └── Silhouette Score             │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  3. GRAPH ALGORITHMS              │  4. NLP/TEXT PROCESSING                 │
│  ├── PageRank                     │  ├── Trie-based Tokenization            │
│  ├── Leiden Community Detection   │  ├── Max-Forward/Backward Algorithm     │
│  ├── Entity Extraction (LLM)      │  ├── DFS with Memoization               │
│  ├── Relation Extraction (LLM)    │  ├── TF-IDF Term Weighting              │
│  ├── Entity Resolution            │  ├── Named Entity Recognition (NER)     │
│  └── Largest Connected Component  │  ├── Part-of-Speech Tagging (POS)       │
│                                   │  ├── Synonym Detection (WordNet)        │
│                                   │  └── Query Expansion                    │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  5. SIMILARITY/DISTANCE           │  6. INFORMATION RETRIEVAL               │
│  ├── Cosine Similarity            │  ├── BM25 Scoring                       │
│  ├── Edit Distance (Levenshtein)  │  ├── Hybrid Score Fusion                │
│  ├── IoU (Intersection over Union)│  ├── Cross-Encoder Reranking            │
│  ├── Token Similarity             │  └── Weighted Sum Fusion                │
│  └── Hybrid Similarity            │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  7. CHUNKING/MERGING              │  8. MACHINE LEARNING MODELS             │
│  ├── Naive Merge (Token-based)    │  ├── XGBoost (Text Concatenation)       │
│  ├── Hierarchical Merge           │  ├── ONNX Models (Vision)               │
│  ├── Tree-based Merge             │  └── Reranking Models                   │
│  └── Binary Search Merge          │                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  9. VISION/IMAGE PROCESSING       │  10. ADVANCED RAG                       │
│  ├── OCR (ONNX)                   │  ├── RAPTOR (Hierarchical Summarization)│
│  ├── Layout Recognition (YOLOv10) │  ├── GraphRAG                           │
│  ├── Table Structure Recognition  │  └── Community Reports                  │
│  └── Non-Maximum Suppression (NMS)│                                         │
└───────────────────────────────────┴─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  11. OPTIMIZATION                 │  12. DATA STRUCTURES                    │
│  ├── BIC (Bayesian Info Criterion)│  ├── Trie Tree                          │
│  └── Silhouette Score             │  ├── Hierarchical Tree                  │
│                                   │  └── NetworkX Graph                     │
└───────────────────────────────────┴─────────────────────────────────────────┘

Files in This Module

| File | Description |
| --- | --- |
| bm25_scoring.md | BM25 ranking algorithm |
| hybrid_score_fusion.md | Score combination |
| raptor_algorithm.md | Hierarchical summarization |
| clustering_algorithms.md | KMeans, GMM, UMAP |
| graph_algorithms.md | PageRank, Leiden, Entity Resolution |
| nlp_algorithms.md | Tokenization, TF-IDF, NER, POS |
| vision_algorithms.md | OCR, Layout, NMS |
| similarity_metrics.md | Cosine, Edit Distance, IoU |

Complete Algorithm Reference

1. CLUSTERING ALGORITHMS

| Algorithm | File | Description |
| --- | --- | --- |
| K-Means | `/deepdoc/parser/pdf_parser.py:36` | Column detection in PDF layout |
| GMM | `/rag/raptor.py:22` | RAPTOR cluster selection |
| Silhouette Score | `/deepdoc/parser/pdf_parser.py:37` | Cluster validation |

2. DIMENSIONALITY REDUCTION

| Algorithm | File | Description |
| --- | --- | --- |
| UMAP | `/rag/raptor.py:21` | Pre-clustering dimension reduction |
| Node2Vec | `/graphrag/general/entity_embedding.py:24` | Graph node embedding |

3. GRAPH ALGORITHMS

| Algorithm | File | Description |
| --- | --- | --- |
| PageRank | `/graphrag/entity_resolution.py:150` | Entity importance scoring |
| Leiden | `/graphrag/general/leiden.py:72` | Hierarchical community detection |
| Entity Extraction | `/graphrag/general/extractor.py` | LLM-based entity extraction |
| Relation Extraction | `/graphrag/general/extractor.py` | LLM-based relation extraction |
| Entity Resolution | `/graphrag/entity_resolution.py` | Entity deduplication |
| LCC | `/graphrag/general/leiden.py:67` | Largest connected component |
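
To make "entity importance scoring" concrete, here is a minimal power-iteration PageRank in plain Python. It is a from-scratch sketch, not the code in `/graphrag/entity_resolution.py`; the `pagerank` function, the damping factor of 0.85, and the toy edge list are illustrative assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank over a directed edge list (illustrative sketch)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes an equal share of its rank to every target.
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical entity graph: three entities point at "C".
edges = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
ranks = pagerank(edges)
top_entity = max(ranks, key=ranks.get)  # "C"
```

The ranks always sum to 1, so they can be read as relative importance. For an undirected entity graph, each edge would be added in both directions.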

4. NLP/TEXT PROCESSING

| Algorithm | File | Description |
| --- | --- | --- |
| Trie Tokenization | `/rag/nlp/rag_tokenizer.py:72` | Chinese word segmentation |
| Max-Forward | `/rag/nlp/rag_tokenizer.py:250` | Forward tokenization |
| Max-Backward | `/rag/nlp/rag_tokenizer.py:273` | Backward tokenization |
| DFS + Memo | `/rag/nlp/rag_tokenizer.py:120` | Disambiguation |
| TF-IDF | `/rag/nlp/term_weight.py:223` | Term weighting |
| NER | `/rag/nlp/term_weight.py:84` | Named entity recognition |
| POS Tagging | `/rag/nlp/term_weight.py:179` | Part-of-speech tagging |
| Synonym | `/rag/nlp/synonym.py:71` | Synonym lookup |
| Query Expansion | `/rag/nlp/query.py:85` | Query rewriting |
| Porter Stemmer | `/rag/nlp/rag_tokenizer.py:27` | English stemming |
| WordNet Lemmatizer | `/rag/nlp/rag_tokenizer.py:27` | Lemmatization |
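
The Max-Forward and Max-Backward entries are the classic maximum-matching segmentation strategies. The sketch below shows both against a plain `set` vocabulary rather than a trie; the function names, the tiny vocabulary, and the single-character fallback are assumptions for illustration, not the `rag_tokenizer.py` implementation.

```python
def max_forward(text, vocab, max_len):
    """Greedy longest-match segmentation scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def max_backward(text, vocab, max_len):
    """Greedy longest-match segmentation scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical vocabulary; a real tokenizer would load a large dictionary.
VOCAB = {"研究", "研究生", "生命", "起源"}
MAX_LEN = max(len(word) for word in VOCAB)

forward = max_forward("研究生命起源", VOCAB, MAX_LEN)    # ['研究生', '命', '起源']
backward = max_backward("研究生命起源", VOCAB, MAX_LEN)  # ['研究', '生命', '起源']
```

The two directions disagree on this input, which is the kind of ambiguity a disambiguation step such as the DFS + Memo entry would need to resolve.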

5. SIMILARITY/DISTANCE METRICS

| Algorithm | File | Formula |
| --- | --- | --- |
| Cosine Similarity | `/rag/nlp/query.py:221` | cos(θ) = A·B / (‖A‖×‖B‖) |
| Edit Distance | `/graphrag/entity_resolution.py:28` | Levenshtein distance |
| IoU | `/deepdoc/vision/operators.py:702` | intersection / union |
| Token Similarity | `/rag/nlp/query.py:230` | Weighted token overlap |
| Hybrid Similarity | `/rag/nlp/query.py:220` | α×token + β×vector |
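
The formulas in this table can be written out directly. The following is a minimal sketch in plain Python; the function names, the reading of "weighted token overlap" as the share of query weight matched in the document, and the default weights `alpha=0.3` and `beta=0.7` are assumptions, not values taken from `/rag/nlp/query.py`.

```python
import math


def cosine_similarity(a, b):
    """cos(θ) = A·B / (‖A‖×‖B‖); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def edit_distance(s, t):
    """Levenshtein distance with a rolling row of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def token_similarity(query_weights, doc_tokens):
    """Share of the total query term weight that appears in the document."""
    total = sum(query_weights.values())
    if not total:
        return 0.0
    matched = sum(w for tok, w in query_weights.items() if tok in doc_tokens)
    return matched / total


def hybrid_similarity(token_sim, vector_sim, alpha=0.3, beta=0.7):
    """α×token + β×vector; the default weights here are placeholders."""
    return alpha * token_sim + beta * vector_sim


vec_sim = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # parallel vectors
tok_sim = token_similarity({"rag": 2.0, "flow": 1.0}, {"rag", "doc"})
score = hybrid_similarity(tok_sim, vec_sim)
distance = edit_distance("kitten", "sitting")              # 3
```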

6. INFORMATION RETRIEVAL

| Algorithm | File | Formula |
| --- | --- | --- |
| BM25 | `/rag/nlp/search.py` | ES native BM25 |
| Hybrid Fusion | `/rag/nlp/search.py:126` | 0.05×BM25 + 0.95×Vector |
| Reranking | `/rag/nlp/search.py:330` | Cross-encoder scoring |
| Argsort Ranking | `/rag/nlp/search.py:429` | Score-based sorting |
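
As a worked example of the fusion formula followed by score-based sorting, the sketch below combines two score lists with the 0.05 / 0.95 weights from the table and ranks the results. The function names and sample scores are hypothetical, and the sketch assumes both score lists are already on a comparable scale (for example, normalized to [0, 1]).

```python
def fuse_scores(bm25_scores, vector_scores, bm25_weight=0.05, vector_weight=0.95):
    """Weighted sum of keyword and vector scores, one value per candidate."""
    return [
        bm25_weight * b + vector_weight * v
        for b, v in zip(bm25_scores, vector_scores)
    ]


def argsort_desc(scores):
    """Indices of the candidates ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical, pre-normalized scores for three candidate chunks.
bm25 = [0.9, 0.2, 0.5]
vector = [0.1, 0.8, 0.6]

fused = fuse_scores(bm25, vector)   # approximately [0.14, 0.77, 0.595]
ranking = argsort_desc(fused)       # [1, 2, 0]
```

With these weights the vector score dominates: chunk 0 has the best BM25 score but ends up last.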

7. CHUNKING/MERGING

| Algorithm | File | Description |
| --- | --- | --- |
| Naive Merge | `/rag/nlp/__init__.py:582` | Token-based chunking |
| Naive Merge + Images | `/rag/nlp/__init__.py:645` | With image tracking |
| Hierarchical Merge | `/rag/nlp/__init__.py:487` | Tree-based merging |
| Binary Search | `/rag/nlp/__init__.py:512` | Efficient section lookup |
| DFS Tree Traversal | `/rag/flow/hierarchical_merger/` | Document hierarchy |
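
Token-based chunking can be pictured as greedily packing consecutive sections into a chunk until a token budget would be exceeded. The sketch below illustrates that idea only; `naive_merge`, `count_tokens`, the whitespace token count, and the budget of 5 are assumptions and do not reproduce the function in `/rag/nlp/__init__.py`.

```python
def count_tokens(text):
    # Whitespace splitting stands in for a real tokenizer (assumption).
    return len(text.split())


def naive_merge(sections, max_tokens=128):
    """Greedily pack consecutive sections into chunks of at most max_tokens.

    A single section longer than max_tokens still becomes its own chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for section in sections:
        n = count_tokens(section)
        # Close the current chunk if this section would overflow the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sections = ["a b c", "d e", "f g h i", "j"]
chunks = naive_merge(sections, max_tokens=5)  # ['a b c d e', 'f g h i j']
```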

8. MACHINE LEARNING MODELS

| Model | File | Purpose |
| --- | --- | --- |
| XGBoost | `/deepdoc/parser/pdf_parser.py:88` | Text concatenation |
| ONNX OCR | `/deepdoc/vision/ocr.py:32` | Text recognition |
| ONNX Layout | `/deepdoc/vision/layout_recognizer.py` | Layout detection |
| ONNX TSR | `/deepdoc/vision/table_structure_recognizer.py` | Table structure |
| YOLOv10 | `/deepdoc/vision/layout_recognizer.py` | Object detection |

9. VISION/IMAGE PROCESSING

| Algorithm | File | Description |
| --- | --- | --- |
| NMS | `/deepdoc/vision/operators.py:702` | Box filtering |
| IoU Filtering | `/deepdoc/vision/recognizer.py:359` | Overlap detection |
| Bounding Box Overlap | `/deepdoc/vision/layout_recognizer.py:94` | Spatial analysis |
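
IoU and NMS work together: NMS keeps the highest-scoring box and drops any remaining box whose IoU with an already kept box exceeds a threshold. The sketch below is a plain-Python illustration; the `(x0, y0, x1, y1)` box layout, the function names, and the 0.5 threshold are assumptions rather than the conventions used in `/deepdoc/vision/`.

```python
def area(box):
    """Area of an axis-aligned box given as (x0, y0, x1, y1)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # [0, 2]: box 1 overlaps box 0 with IoU 81/119
```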

10. ADVANCED RAG

| Algorithm | File | Description |
| --- | --- | --- |
| RAPTOR | `/rag/raptor.py:37` | Hierarchical summarization |
| GraphRAG | `/graphrag/` | Knowledge graph RAG |
| Community Reports | `/graphrag/general/community_reports_extractor.py` | Graph summaries |

11. OPTIMIZATION CRITERIA

| Algorithm | File | Formula |
| --- | --- | --- |
| BIC | `/rag/raptor.py:92` | k×log(n) - 2×log(L) |
| Silhouette | `/deepdoc/parser/pdf_parser.py:400` | (b-a) / max(a,b) |
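
Both criteria are short enough to compute by hand. In the BIC formula, k is the number of model parameters, n the number of samples, and L the maximized likelihood; lower values are preferred. In the silhouette formula, a is a point's mean distance to its own cluster and b its mean distance to the nearest other cluster. The sketch below uses 1-D points and hypothetical function names, and assumes at least two clusters are present.

```python
import math


def bic(num_params, num_samples, log_likelihood):
    """k×log(n) - 2×log(L), taking log(L) directly as log_likelihood."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood


def silhouette_value(a, b):
    """(b - a) / max(a, b) for a single point."""
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


def mean_silhouette(points, labels):
    """Mean silhouette over 1-D points; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # The point itself contributes distance 0, so divide by len - 1.
        a = sum(abs(point - q) for q in own) / (len(own) - 1)
        b = min(
            sum(abs(point - q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        values.append(silhouette_value(a, b))
    return sum(values) / len(values)


points = [0.0, 1.0, 10.0, 11.0]
good = mean_silhouette(points, [0, 0, 1, 1])  # close to 1: tight, separated
bad = mean_silhouette(points, [0, 1, 0, 1])   # negative: clusters interleaved
```

A typical use is to evaluate several candidate cluster counts and keep the one with the lowest BIC or the highest mean silhouette.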

Statistics

  • Total Algorithms: 50+
  • Categories: 12
  • Key Libraries: sklearn, UMAP, XGBoost, NetworkX, graspologic, ONNX