# 06-ALGORITHMS - Core Algorithms & Math

## Overview

This module documents RAGFlow's core algorithms: scoring, similarity, chunking, and advanced RAG techniques.

## Algorithm Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    CORE ALGORITHMS                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  RETRIEVAL ALGORITHMS                                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  BM25 Scoring   │  │ Vector Cosine   │  │ Hybrid Fusion   │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  ADVANCED RAG                                                    │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  RAPTOR         │  │  GraphRAG       │  │  Cross-Encoder  │ │
│  │  (Hierarchical) │  │  (Knowledge G)  │  │  (Reranking)    │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  TEXT PROCESSING                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  TF-IDF Weight  │  │  Tokenization   │  │  Query Expand   │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```

## Files in This Module

| File | Description |
|------|-------------|
| [bm25_scoring.md](./bm25_scoring.md) | BM25 ranking algorithm |
| [vector_similarity.md](./vector_similarity.md) | Cosine similarity calculations |
| [hybrid_score_fusion.md](./hybrid_score_fusion.md) | Score combination strategies |
| [tfidf_weighting.md](./tfidf_weighting.md) | TF-IDF term weighting |
| [raptor_algorithm.md](./raptor_algorithm.md) | Hierarchical summarization |
| [graphrag_implementation.md](./graphrag_implementation.md) | Knowledge graph RAG |

## Algorithm Formulas

### BM25 Scoring
```
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

where:
    f(qi, D) = term frequency of qi in document D
    |D| = document length
    avgdl = average document length
    k1 = 1.2 (term frequency saturation)
    b = 0.75 (length normalization)
```
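The formula above can be sketched in plain Python as follows. This is a minimal illustration, not RAGFlow's implementation: the function name `bm25_score`, the in-memory `corpus` of pre-tokenized documents, and the IDF variant (`log(1 + (N - df + 0.5) / (df + 0.5))`, which is never negative) are all assumptions, since the formula does not specify how `IDF(qi)` is computed.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    `corpus` is a non-empty list of tokenized documents (lists of strings)
    and is only used to derive document frequencies and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)

    score = 0.0
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in corpus if term in d)
        # Assumed IDF variant; the section does not define IDF(qi).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]
        # Length-normalized saturation term from the BM25 formula.
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score


corpus = [
    ["rag", "flow", "search"],
    ["vector", "search"],
    ["graph", "rag"],
]
scores = [bm25_score(["flow", "search"], doc, corpus) for doc in corpus]
```

In this toy corpus the first document matches both query terms, so it receives the highest score, while the third matches neither and scores zero.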

### Cosine Similarity
```
cos(θ) = (A · B) / (||A|| × ||B||)

where:
    A, B = embedding vectors
    A · B = dot product
    ||A|| = L2 norm
```
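As a small worked sketch of the definition above, cosine similarity can be computed with the standard library alone. The function name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made for illustration; a production system would typically use a vectorized library or the vector store's built-in metric.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumed convention: a zero vector has no direction, so report 0.0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Parallel vectors give a value of 1 (up to floating-point error) regardless of magnitude, and orthogonal vectors give 0.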

### Hybrid Score Fusion
```
Hybrid_Score = α × BM25_Score + (1-α) × Vector_Score

Default: α = 0.05 (5% BM25, 95% Vector)
```
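The weighted sum above is simple enough to sketch directly. The function name `hybrid_score` is hypothetical, and the sketch assumes both input scores are already on comparable scales; raw BM25 scores are unbounded, so some normalization step would usually come first.

```python
def hybrid_score(bm25_score, vector_score, alpha=0.05):
    """Combine a text score and a vector score with weight alpha on BM25."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * bm25_score + (1.0 - alpha) * vector_score


# With the default alpha, the vector score dominates the result.
text_only = hybrid_score(1.0, 0.0)
vector_only = hybrid_score(0.0, 1.0)
```

With the default `alpha`, a chunk that matches only on keywords contributes 5% of its text score, while a chunk that matches only semantically keeps 95% of its vector score.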

### TF-IDF Weighting
```
IDF(term) = log10(10 + (N - df(term) + 0.5) / (df(term) + 0.5))
Weight = (0.3 × IDF1 + 0.7 × IDF2) × NER × PoS
```
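The two formulas above translate to Python roughly as follows. The function names `idf` and `term_weight` are illustrative. The section does not say how `IDF1` and `IDF2` differ, so this sketch simply takes them as two precomputed values supplied by the caller, and treats `NER` and `PoS` as multiplicative factors that default to 1.0.

```python
import math


def idf(df, n_docs):
    """Smoothed IDF: log10(10 + (N - df + 0.5) / (df + 0.5))."""
    return math.log10(10 + (n_docs - df + 0.5) / (df + 0.5))


def term_weight(idf1, idf2, ner=1.0, pos=1.0):
    """Blend two IDF estimates, then scale by NER and PoS factors."""
    return (0.3 * idf1 + 0.7 * idf2) * ner * pos


rare = idf(df=1, n_docs=1000)
common = idf(df=900, n_docs=1000)
weight = term_weight(rare, rare)
```

Because of the constant 10 inside the logarithm, the IDF stays at or above 1 for any `df` up to `n_docs`, so even very common terms keep a positive weight.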

### Cross-Encoder Reranking
```
Final_Rank = α × Token_Sim + β × Vector_Sim + γ × Rank_Features

where:
    α = 0.3 (token weight; not the fusion α above)
    β = 0.7 (vector weight)
    γ = variable (PageRank, tag boost)
```
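A minimal sketch of the ranking formula above, applied to a candidate list. The names `final_rank` and `rerank` and the dictionary keys are hypothetical, and `gamma` defaults to 1.0 here purely as an assumption, since the section only describes it as variable.

```python
def final_rank(token_sim, vector_sim, rank_features=0.0,
               alpha=0.3, beta=0.7, gamma=1.0):
    """Weighted sum of token similarity, vector similarity, and rank features."""
    return alpha * token_sim + beta * vector_sim + gamma * rank_features


def rerank(candidates):
    """Sort candidate dicts by final rank, highest first.

    Each candidate is assumed to carry 'token_sim' and 'vector_sim',
    plus an optional 'rank_features' value (e.g. a PageRank or tag boost).
    """
    return sorted(
        candidates,
        key=lambda c: final_rank(
            c["token_sim"], c["vector_sim"], c.get("rank_features", 0.0)
        ),
        reverse=True,
    )


candidates = [
    {"id": "a", "token_sim": 0.9, "vector_sim": 0.1},
    {"id": "b", "token_sim": 0.1, "vector_sim": 0.9},
]
ordered_ids = [c["id"] for c in rerank(candidates)]
```

Candidate `b` ranks first because the vector weight (0.7) outweighs the token weight (0.3).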

## Algorithm Parameters

| Algorithm | Parameter | Default | Range |
|-----------|-----------|---------|-------|
| **BM25** | k1 | 1.2 | 0-2.0 |
| | b | 0.75 | 0-1.0 |
| **Hybrid** | vector_weight | 0.95 | 0-1.0 |
| | text_weight | 0.05 | 0-1.0 |
| **TF-IDF** | IDF1 weight | 0.3 | - |
| | IDF2 weight | 0.7 | - |
| **Chunking** | chunk_size | 512 | 128-2048 |
| | overlap | 0-10% | 0-100% |
| **RAPTOR** | max_clusters | 10-50 | - |
| | GMM threshold | 0.1 | - |
| **GraphRAG** | entity_topN | 6 | 1-100 |
| | similarity_threshold | 0.3 | 0-1.0 |

## Key Implementation Files

- `/rag/nlp/search.py` - Search algorithms
- `/rag/nlp/term_weight.py` - TF-IDF implementation
- `/rag/nlp/query.py` - Query processing
- `/rag/raptor.py` - RAPTOR algorithm
- `/graphrag/search.py` - GraphRAG search
- `/rag/nlp/__init__.py` - Chunking algorithms

## Performance Metrics

| Metric | Typical Value |
|--------|---------------|
| Vector Search Latency | < 100ms |
| BM25 Search Latency | < 50ms |
| Reranking Latency | 200-500ms |
| Total Retrieval | < 1s |