Add comprehensive documentation covering 6 modules: - 01-API-LAYER: Authentication, routing, SSE streaming - 02-SERVICE-LAYER: Dialog, Task, LLM service analysis - 03-RAG-ENGINE: Hybrid search, embedding, reranking - 04-AGENT-SYSTEM: Canvas engine, components, tools - 05-DOCUMENT-PROCESSING: Task executor, PDF parsing - 06-ALGORITHMS: BM25, fusion, RAPTOR Total 28 documentation files with code analysis, diagrams, and formulas.
5.5 KiB
5.5 KiB
BM25 Scoring Algorithm
Tong Quan
BM25 (Best Matching 25) là probabilistic ranking function sử dụng trong full-text search.
Mathematical Formula
┌─────────────────────────────────────────────────────────────────┐
│ BM25 FORMULA │
└─────────────────────────────────────────────────────────────────┘
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
i=1..n
where:
Q = query terms [q1, q2, ..., qn]
D = document
f(qi, D) = frequency of term qi in document D
|D| = length of document D (in words)
avgdl = average document length in collection
k1 = term frequency saturation parameter (default: 1.2)
b = length normalization parameter (default: 0.75)
IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
where:
N = total number of documents
n(qi) = number of documents containing qi
Implementation in RAGFlow
# RAGFlow uses Elasticsearch's native BM25 through query_string query
# In /rag/nlp/search.py
def search(self, kb_ids, question, embd_mdl, tenant_ids, ...):
# Build BM25 query
matchText, keywords = self.qryr.question(qst, min_match=0.3)
# matchText contains weighted terms
# Example: "(machine^0.85 ML AI) (learning^0.72) \"machine learning\"^1.7"
# Execute search
res = self.dataStore.search(
src, highlightFields, filters, matchExprs,
orderBy, offset, limit, idx_names, kb_ids
)
Query Construction
# In /rag/nlp/query.py
def question(self, txt, tbl="qa", min_match=0.6):
"""
Build BM25 query with weighted terms.
Returns:
MatchTextExpr with query string and field boosts
"""
# Normalize and tokenize
tks = rag_tokenizer.tokenize(txt)
# Get TF-IDF weights
tks_w = self.tw.weights(tks)
# Build query string
q = []
for (tk, w) in tks_w[:256]:
# Add term with weight
syn = self.syn.lookup(tk) # Get synonyms
q.append(f"({tk}^{w:.4f} {syn})")
# Add phrase queries with 2x boost
for i in range(1, len(tks_w)):
left, right = tks_w[i-1][0], tks_w[i][0]
weight = max(tks_w[i-1][1], tks_w[i][1]) * 2
q.append(f'"{left} {right}"^{weight:.4f}')
return MatchTextExpr(
query=" ".join(q),
fields=self.query_fields,
min_match=f"{int(min_match * 100)}%"
)
Field Boosting
# Fields searched with boost factors
query_fields = [
"title_tks^10", # Title: 10x boost
"title_sm_tks^5", # Title (semantic): 5x
"important_kwd^30", # Keywords: 30x
"important_tks^20", # Keyword tokens: 20x
"question_tks^20", # Question tokens: 20x
"content_ltks^2", # Content: 2x
"content_sm_ltks", # Content (semantic): 1x
]
# Final BM25 score = sum of field scores with boosts
Elasticsearch Query
{
"query": {
"query_string": {
"query": "(machine^0.85 ML AI) (learning^0.72) \"machine learning\"^1.7",
"fields": [
"title_tks^10",
"important_kwd^30",
"content_ltks^2"
],
"minimum_should_match": "30%",
"default_operator": "OR"
}
}
}
Parameter Tuning
k1 Parameter
k1 controls term frequency saturation:
- k1 = 0: Binary occurrence (term present or not)
- k1 = 1.2: Moderate saturation (default)
- k1 = 2.0: Less saturation, higher weight for frequent terms
Higher k1 = more weight to term frequency
b Parameter
b controls length normalization:
- b = 0: No length normalization
- b = 0.75: Moderate normalization (default)
- b = 1.0: Full normalization
Higher b = more penalty for long documents
BM25 Score Components
┌─────────────────────────────────────────────────────────────────┐
│ BM25 SCORE BREAKDOWN │
└─────────────────────────────────────────────────────────────────┘
For query "machine learning" on document D:
1. IDF Component:
IDF("machine") = log((10000 - 500 + 0.5) / (500 + 0.5) + 1) = 3.0
IDF("learning") = log((10000 - 300 + 0.5) / (300 + 0.5) + 1) = 3.5
2. TF Component (k1=1.2):
f("machine", D) = 3 occurrences
TF_factor = (3 × 2.2) / (3 + 1.2 × (1 - 0.75 + 0.75 × 100/50))
= 6.6 / (3 + 1.2 × 1.75) = 6.6 / 5.1 = 1.29
3. Final Score:
BM25 = IDF("machine") × TF("machine") + IDF("learning") × TF("learning")
= 3.0 × 1.29 + 3.5 × 1.15
= 3.87 + 4.03 = 7.90
Minimum Match
# minimum_should_match controls required term coverage
min_match = "30%"
# For query with 10 terms:
# At least 3 terms must match
# This prevents:
# - Very sparse matches
# - High recall but low precision
Related Files
/rag/nlp/search.py- BM25 query construction/rag/nlp/query.py- Query processing/rag/utils/es_conn.py- Elasticsearch integration