ragflow/personal_analyze/06-ALGORITHMS/README.md
Claude 566bce428b
docs: Add comprehensive algorithm documentation (50+ algorithms)
- Updated README.md with complete algorithm map across 12 categories
- Added clustering_algorithms.md (K-Means, GMM, UMAP, Silhouette, Node2Vec)
- Added graph_algorithms.md (PageRank, Leiden, Entity Extraction/Resolution)
- Added nlp_algorithms.md (Trie tokenization, TF-IDF, NER, POS, Synonym)
- Added vision_algorithms.md (OCR, Layout Recognition, TSR, NMS, IoU, XGBoost)
- Added similarity_metrics.md (Cosine, Edit Distance, Token, Hybrid)
2025-11-27 03:34:49 +00:00

189 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 06-ALGORITHMS - Core Algorithms & Math
## Tong Quan
Module này chứa TẤT CẢ các thuật toán được sử dụng trong RAGFlow, bao gồm 50+ algorithms chia thành 12 categories.
## Algorithm Categories
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAGFLOW ALGORITHM MAP │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. CLUSTERING │ 2. DIMENSIONALITY REDUCTION │
│ ├── K-Means │ ├── UMAP │
│ ├── Gaussian Mixture Model (GMM) │ └── Node2Vec Embedding │
│ └── Silhouette Score │ │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. GRAPH ALGORITHMS │ 4. NLP/TEXT PROCESSING │
│ ├── PageRank │ ├── Trie-based Tokenization │
│ ├── Leiden Community Detection │ ├── Max-Forward/Backward Algorithm │
│ ├── Entity Extraction (LLM) │ ├── DFS with Memoization │
│ ├── Relation Extraction (LLM) │ ├── TF-IDF Term Weighting │
│ ├── Entity Resolution │ ├── Named Entity Recognition (NER) │
│ └── Largest Connected Component │ ├── Part-of-Speech Tagging (POS) │
│ │ ├── Synonym Detection (WordNet) │
│ │ └── Query Expansion │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 5. SIMILARITY/DISTANCE │ 6. INFORMATION RETRIEVAL │
│ ├── Cosine Similarity │ ├── BM25 Scoring │
│ ├── Edit Distance (Levenshtein) │ ├── Hybrid Score Fusion │
│ ├── IoU (Intersection over Union)│ ├── Cross-Encoder Reranking │
│ ├── Token Similarity │ └── Weighted Sum Fusion │
│ └── Hybrid Similarity │ │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 7. CHUNKING/MERGING │ 8. MACHINE LEARNING MODELS │
│ ├── Naive Merge (Token-based) │ ├── XGBoost (Text Concatenation) │
│ ├── Hierarchical Merge │ ├── ONNX Models (Vision) │
│ ├── Tree-based Merge │ └── Reranking Models │
│ └── Binary Search Merge │ │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 9. VISION/IMAGE PROCESSING │ 10. ADVANCED RAG │
│ ├── OCR (ONNX) │ ├── RAPTOR (Hierarchical Summarization)│
│ ├── Layout Recognition (YOLOv10) │ ├── GraphRAG │
│ ├── Table Structure Recognition │ └── Community Reports │
│ └── Non-Maximum Suppression (NMS)│ │
└───────────────────────────────────┴─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 11. OPTIMIZATION │ 12. DATA STRUCTURES │
│ ├── BIC (Bayesian Info Criterion)│ ├── Trie Tree │
│ └── Silhouette Score │ ├── Hierarchical Tree │
│ │ └── NetworkX Graph │
└───────────────────────────────────┴─────────────────────────────────────────┘
```
## Files Trong Module Nay
| File | Mo Ta |
|------|-------|
| [bm25_scoring.md](./bm25_scoring.md) | BM25 ranking algorithm |
| [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination |
| [raptor_algorithm.md](./raptor_algorithm.md) | Hierarchical summarization |
| [clustering_algorithms.md](./clustering_algorithms.md) | KMeans, GMM, UMAP |
| [graph_algorithms.md](./graph_algorithms.md) | PageRank, Leiden, Entity Resolution |
| [nlp_algorithms.md](./nlp_algorithms.md) | Tokenization, TF-IDF, NER, POS |
| [vision_algorithms.md](./vision_algorithms.md) | OCR, Layout, NMS |
| [similarity_metrics.md](./similarity_metrics.md) | Cosine, Edit Distance, IoU |
## Complete Algorithm Reference
### 1. CLUSTERING ALGORITHMS
| Algorithm | File | Description |
|-----------|------|-------------|
| K-Means | `/deepdoc/parser/pdf_parser.py:36` | Column detection in PDF layout |
| GMM | `/rag/raptor.py:22` | RAPTOR cluster selection |
| Silhouette Score | `/deepdoc/parser/pdf_parser.py:37` | Cluster validation |
### 2. DIMENSIONALITY REDUCTION
| Algorithm | File | Description |
|-----------|------|-------------|
| UMAP | `/rag/raptor.py:21` | Pre-clustering dimension reduction |
| Node2Vec | `/graphrag/general/entity_embedding.py:24` | Graph node embedding |
### 3. GRAPH ALGORITHMS
| Algorithm | File | Description |
|-----------|------|-------------|
| PageRank | `/graphrag/entity_resolution.py:150` | Entity importance scoring |
| Leiden | `/graphrag/general/leiden.py:72` | Hierarchical community detection |
| Entity Extraction | `/graphrag/general/extractor.py` | LLM-based entity extraction |
| Relation Extraction | `/graphrag/general/extractor.py` | LLM-based relation extraction |
| Entity Resolution | `/graphrag/entity_resolution.py` | Entity deduplication |
| LCC | `/graphrag/general/leiden.py:67` | Largest connected component |
### 4. NLP/TEXT PROCESSING
| Algorithm | File | Description |
|-----------|------|-------------|
| Trie Tokenization | `/rag/nlp/rag_tokenizer.py:72` | Chinese word segmentation |
| Max-Forward | `/rag/nlp/rag_tokenizer.py:250` | Forward tokenization |
| Max-Backward | `/rag/nlp/rag_tokenizer.py:273` | Backward tokenization |
| DFS + Memo | `/rag/nlp/rag_tokenizer.py:120` | Disambiguation |
| TF-IDF | `/rag/nlp/term_weight.py:223` | Term weighting |
| NER | `/rag/nlp/term_weight.py:84` | Named entity recognition |
| POS Tagging | `/rag/nlp/term_weight.py:179` | Part-of-speech tagging |
| Synonym | `/rag/nlp/synonym.py:71` | Synonym lookup |
| Query Expansion | `/rag/nlp/query.py:85` | Query rewriting |
| Porter Stemmer | `/rag/nlp/rag_tokenizer.py:27` | English stemming |
| WordNet Lemmatizer | `/rag/nlp/rag_tokenizer.py:27` | Lemmatization |
### 5. SIMILARITY/DISTANCE METRICS
| Algorithm | File | Formula |
|-----------|------|---------|
| Cosine Similarity | `/rag/nlp/query.py:221` | `cos(θ) = A·B / (‖A‖×‖B‖)` |
| Edit Distance | `/graphrag/entity_resolution.py:28` | Levenshtein distance |
| IoU | `/deepdoc/vision/operators.py:702` | `intersection / union` |
| Token Similarity | `/rag/nlp/query.py:230` | Weighted token overlap |
| Hybrid Similarity | `/rag/nlp/query.py:220` | `α×token + β×vector` |
### 6. INFORMATION RETRIEVAL
| Algorithm | File | Formula |
|-----------|------|---------|
| BM25 | `/rag/nlp/search.py` | ES native BM25 |
| Hybrid Fusion | `/rag/nlp/search.py:126` | `0.05×BM25 + 0.95×Vector` |
| Reranking | `/rag/nlp/search.py:330` | Cross-encoder scoring |
| Argsort Ranking | `/rag/nlp/search.py:429` | Score-based sorting |
### 7. CHUNKING/MERGING
| Algorithm | File | Description |
|-----------|------|-------------|
| Naive Merge | `/rag/nlp/__init__.py:582` | Token-based chunking |
| Naive Merge + Images | `/rag/nlp/__init__.py:645` | With image tracking |
| Hierarchical Merge | `/rag/nlp/__init__.py:487` | Tree-based merging |
| Binary Search | `/rag/nlp/__init__.py:512` | Efficient section lookup |
| DFS Tree Traversal | `/rag/flow/hierarchical_merger/` | Document hierarchy |
### 8. MACHINE LEARNING MODELS
| Model | File | Purpose |
|-------|------|---------|
| XGBoost | `/deepdoc/parser/pdf_parser.py:88` | Text concatenation |
| ONNX OCR | `/deepdoc/vision/ocr.py:32` | Text recognition |
| ONNX Layout | `/deepdoc/vision/layout_recognizer.py` | Layout detection |
| ONNX TSR | `/deepdoc/vision/table_structure_recognizer.py` | Table structure |
| YOLOv10 | `/deepdoc/vision/layout_recognizer.py` | Object detection |
### 9. VISION/IMAGE PROCESSING
| Algorithm | File | Description |
|-----------|------|-------------|
| NMS | `/deepdoc/vision/operators.py:702` | Box filtering |
| IoU Filtering | `/deepdoc/vision/recognizer.py:359` | Overlap detection |
| Bounding Box Overlap | `/deepdoc/vision/layout_recognizer.py:94` | Spatial analysis |
### 10. ADVANCED RAG
| Algorithm | File | Description |
|-----------|------|-------------|
| RAPTOR | `/rag/raptor.py:37` | Hierarchical summarization |
| GraphRAG | `/graphrag/` | Knowledge graph RAG |
| Community Reports | `/graphrag/general/community_reports_extractor.py` | Graph summaries |
### 11. OPTIMIZATION CRITERIA
| Algorithm | File | Formula |
|-----------|------|---------|
| BIC | `/rag/raptor.py:92` | `k×log(n) - 2×log(L)` |
| Silhouette | `/deepdoc/parser/pdf_parser.py:400` | `(b-a) / max(a,b)` |
## Statistics
- **Total Algorithms**: 50+
- **Categories**: 12
- **Key Libraries**: sklearn, UMAP, XGBoost, NetworkX, graspologic, ONNX