06-ALGORITHMS - Core Algorithms & Math

Overview

This module documents the core algorithms of RAGFlow, including scoring, similarity, chunking, and advanced RAG techniques.

Algorithm Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    CORE ALGORITHMS                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  RETRIEVAL ALGORITHMS                                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  BM25 Scoring   │  │ Vector Cosine   │  │ Hybrid Fusion   │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  ADVANCED RAG                                                    │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  RAPTOR         │  │  GraphRAG       │  │  Cross-Encoder  │ │
│  │  (Hierarchical) │  │  (Knowledge G)  │  │  (Reranking)    │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  TEXT PROCESSING                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  TF-IDF Weight  │  │  Tokenization   │  │  Query Expand   │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Files in This Module

File	Description
bm25_scoring.md	BM25 ranking algorithm
vector_similarity.md	Cosine similarity calculations
hybrid_score_fusion.md	Score combination strategies
tfidf_weighting.md	TF-IDF term weighting
raptor_algorithm.md	Hierarchical summarization
graphrag_implementation.md	Knowledge graph RAG

Algorithm Formulas

BM25 Scoring

BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

where:
    qi = i-th term of query Q
    IDF(qi) = inverse document frequency of qi
    f(qi, D) = term frequency of qi in document D
    |D| = document length
    avgdl = average document length
    k1 = 1.2 (term frequency saturation)
    b = 0.75 (length normalization)
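
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not RAGFlow's implementation: the function name `bm25_score`, the in-memory `corpus` of token lists, and the non-negative IDF variant used here are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    `corpus` is a list of tokenized documents (hypothetical in-memory
    stand-in for a real index) used for document frequencies and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        # Assumed IDF variant: log(1 + ...) keeps the value non-negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]
        # Length normalization: longer-than-average documents are penalized.
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score


corpus = [["vector", "search"], ["keyword", "search"], ["graph"]]
scores = [bm25_score(["vector"], doc, corpus) for doc in corpus]
```

Only the first document contains the query term, so it is the only one with a non-zero score.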

Cosine Similarity

cos(θ) = (A · B) / (||A|| × ||B||)

where:
    A, B = embedding vectors
    A · B = dot product
    ||A|| = L2 norm
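
As a small worked example, cosine similarity can be computed with the standard library alone. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions for this sketch; production code would typically use a vectorized library instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumed convention: a zero vector has no direction, so return 0.0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Orthogonal vectors score 0 and vectors pointing in the same direction score 1, regardless of magnitude.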

Hybrid Score Fusion

Hybrid_Score = α × BM25_Score + (1-α) × Vector_Score

Default: α = 0.05 (5% BM25, 95% Vector)
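
A weighted sum like this is only meaningful when both scores are on a comparable scale. The sketch below assumes min-max normalization is applied first; that step, and the helper names `min_max_normalize` and `hybrid_score`, are illustrative assumptions rather than something the formula above specifies.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; returns all zeros if the scores are equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_score(bm25_score, vector_score, alpha=0.05):
    """Weighted sum: alpha on the BM25 score, (1 - alpha) on the vector score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * bm25_score + (1.0 - alpha) * vector_score


bm25_raw = [2.0, 4.0, 6.0]        # hypothetical raw BM25 scores
vector_raw = [0.9, 0.5, 0.1]      # hypothetical cosine similarities
fused = [
    hybrid_score(b, v)
    for b, v in zip(min_max_normalize(bm25_raw), min_max_normalize(vector_raw))
]
```

With the default α = 0.05 the vector score dominates, so the first candidate (best vector match, worst BM25 match) still ranks highest.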

TF-IDF Weighting

IDF(term) = log10(10 + (N - df(term) + 0.5) / (df(term) + 0.5))
Weight = (0.3 × IDF1 + 0.7 × IDF2) × NER × PoS
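
The two lines above translate directly into code. The text does not define IDF1 and IDF2, so this sketch assumes they are the same IDF formula evaluated against two different frequency statistics (each with its own corpus size); the parameter names `df1`, `n1`, `df2`, `n2` are hypothetical, as is treating NER and PoS as plain multiplicative factors that default to 1.0.

```python
import math


def idf(df, n):
    """Smoothed IDF from the formula above; always greater than 1 for df <= n."""
    return math.log10(10 + (n - df + 0.5) / (df + 0.5))


def term_weight(df1, n1, df2, n2, ner=1.0, pos=1.0):
    """Blend two IDF estimates (0.3 / 0.7) and scale by NER and PoS factors.

    df1/n1 and df2/n2 are assumed to come from two separate frequency
    sources; ner and pos are assumed per-term multipliers.
    """
    return (0.3 * idf(df1, n1) + 0.7 * idf(df2, n2)) * ner * pos


rare = term_weight(df1=2, n1=10_000, df2=5, n2=100_000)
common = term_weight(df1=5_000, n1=10_000, df2=60_000, n2=100_000)
```

A rare term receives a larger weight than a common one, and the `10 +` inside the logarithm keeps every weight positive.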

Cross-Encoder Reranking

Final_Rank = α × Token_Sim + β × Vector_Sim + γ × Rank_Features

where:
    α = 0.3 (token weight)
    β = 0.7 (vector weight)
    γ = variable (PageRank, tag boost)
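
The ranking formula can be illustrated with a small re-sorting routine. The candidate dictionaries, their key names, and the default `gamma=1.0` are assumptions for this sketch; the text only says γ is variable.

```python
def final_rank(token_sim, vector_sim, rank_features=0.0,
               alpha=0.3, beta=0.7, gamma=1.0):
    """Weighted combination of token similarity, vector similarity, and boosts."""
    return alpha * token_sim + beta * vector_sim + gamma * rank_features


def rerank(candidates):
    """Return candidates sorted by final rank, best first.

    Each candidate is a dict with hypothetical keys 'token_sim',
    'vector_sim', and optionally 'rank_features'.
    """
    return sorted(
        candidates,
        key=lambda c: final_rank(
            c["token_sim"], c["vector_sim"], c.get("rank_features", 0.0)
        ),
        reverse=True,
    )


candidates = [
    {"id": "a", "token_sim": 0.9, "vector_sim": 0.1},
    {"id": "b", "token_sim": 0.1, "vector_sim": 0.9},
]
ordered_ids = [c["id"] for c in rerank(candidates)]
```

Because β outweighs α, candidate "b" (strong vector match) ranks ahead of candidate "a" (strong token match).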

Algorithm Parameters

Algorithm	Parameter	Default	Range
BM25	k1	1.2	0-2.0
BM25	b	0.75	0-1.0
Hybrid	vector_weight	0.95	0-1.0
Hybrid	text_weight	0.05	0-1.0
TF-IDF	IDF1 weight	0.3	-
TF-IDF	IDF2 weight	0.7	-
Chunking	chunk_size	512	128-2048
Chunking	overlap	0-10%	0-100%
RAPTOR	max_clusters	10-50	-
RAPTOR	GMM threshold	0.1	-
GraphRAG	entity_topN	6	1-100
GraphRAG	similarity_threshold	0.3	0-1.0

Key Implementation Files

/rag/nlp/search.py - Search algorithms
/rag/nlp/term_weight.py - TF-IDF implementation
/rag/nlp/query.py - Query processing
/rag/raptor.py - RAPTOR algorithm
/graphrag/search.py - GraphRAG search
/rag/nlp/__init__.py - Chunking algorithms

Performance Metrics

Metric	Typical Value
Vector Search Latency	< 100ms
BM25 Search Latency	< 50ms
Reranking Latency	200-500ms
Total Retrieval	< 1s
