ragflow/personal_analyze/06-ALGORITHMS/README.md
Claude a6ee18476d
docs: Add detailed backend module analysis documentation
Add comprehensive documentation covering 6 modules:
- 01-API-LAYER: Authentication, routing, SSE streaming
- 02-SERVICE-LAYER: Dialog, Task, LLM service analysis
- 03-RAG-ENGINE: Hybrid search, embedding, reranking
- 04-AGENT-SYSTEM: Canvas engine, components, tools
- 05-DOCUMENT-PROCESSING: Task executor, PDF parsing
- 06-ALGORITHMS: BM25, fusion, RAPTOR

Total 28 documentation files with code analysis, diagrams, and formulas.
2025-11-26 11:10:54 +00:00

128 lines
6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 06-ALGORITHMS - Core Algorithms & Math
## Tong Quan
Module này chứa các thuật toán core của RAGFlow bao gồm scoring, similarity, chunking, và advanced RAG techniques.
## Kien Truc Algorithms
```
┌─────────────────────────────────────────────────────────────────┐
│ CORE ALGORITHMS │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL ALGORITHMS │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ BM25 Scoring │ │ Vector Cosine │ │ Hybrid Fusion │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ ADVANCED RAG │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ RAPTOR │ │ GraphRAG │ │ Cross-Encoder │ │
│ │ (Hierarchical) │ │ (Knowledge G) │ │ (Reranking) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TEXT PROCESSING │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ TF-IDF Weight │ │ Tokenization │ │ Query Expand │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
## Files Trong Module Nay
| File | Mo Ta |
|------|-------|
| [bm25_scoring.md](./bm25_scoring.md) | BM25 ranking algorithm |
| [vector_similarity.md](./vector_similarity.md) | Cosine similarity calculations |
| [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination strategies |
| [tfidf_weighting.md](./tfidf_weighting.md) | TF-IDF term weighting |
| [raptor_algorithm.md](./raptor_algorithm.md) | Hierarchical summarization |
| [graphrag_implementation.md](./graphrag_implementation.md) | Knowledge graph RAG |
## Algorithm Formulas
### BM25 Scoring
```
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
where:
f(qi, D) = term frequency of qi in document D
|D| = document length
avgdl = average document length
k1 = 1.2 (term frequency saturation)
b = 0.75 (length normalization)
```
### Cosine Similarity
```
cos(θ) = (A · B) / (||A|| × ||B||)
where:
A, B = embedding vectors
A · B = dot product
||A|| = L2 norm
```
### Hybrid Score Fusion
```
Hybrid_Score = α × BM25_Score + (1-α) × Vector_Score
Default: α = 0.05 (5% BM25, 95% Vector)
```
### TF-IDF Weighting
```
IDF(term) = log10(10 + (N - df(term) + 0.5) / (df(term) + 0.5))
Weight = (0.3 × IDF1 + 0.7 × IDF2) × NER × PoS
```
### Cross-Encoder Reranking
```
Final_Rank = α × Token_Sim + β × Vector_Sim + γ × Rank_Features
where:
α = 0.3 (token weight)
β = 0.7 (vector weight)
γ = variable (PageRank, tag boost)
```
## Algorithm Parameters
| Algorithm | Parameter | Default | Range |
|-----------|-----------|---------|-------|
| **BM25** | k1 | 1.2 | 0-2.0 |
| | b | 0.75 | 0-1.0 |
| **Hybrid** | vector_weight | 0.95 | 0-1.0 |
| | text_weight | 0.05 | 0-1.0 |
| **TF-IDF** | IDF1 weight | 0.3 | - |
| | IDF2 weight | 0.7 | - |
| **Chunking** | chunk_size | 512 | 128-2048 |
| | overlap | 0-10% | 0-100% |
| **RAPTOR** | max_clusters | 10-50 | - |
| | GMM threshold | 0.1 | - |
| **GraphRAG** | entity_topN | 6 | 1-100 |
| | similarity_threshold | 0.3 | 0-1.0 |
## Key Implementation Files
- `/rag/nlp/search.py` - Search algorithms
- `/rag/nlp/term_weight.py` - TF-IDF implementation
- `/rag/nlp/query.py` - Query processing
- `/rag/raptor.py` - RAPTOR algorithm
- `/graphrag/search.py` - GraphRAG search
- `/rag/nlp/__init__.py` - Chunking algorithms
## Performance Metrics
| Metric | Typical Value |
|--------|---------------|
| Vector Search Latency | < 100ms |
| BM25 Search Latency | < 50ms |
| Reranking Latency | 200-500ms |
| Total Retrieval | < 1s |