BM25 Scoring Algorithm

Overview

BM25 (Best Matching 25) is a probabilistic ranking function used in full-text search.

Mathematical Formula

┌─────────────────────────────────────────────────────────────────┐
│                      BM25 FORMULA                                │
└─────────────────────────────────────────────────────────────────┘

BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
             i=1..n

where:
    Q = query terms [q1, q2, ..., qn]
    D = document
    f(qi, D) = frequency of term qi in document D
    |D| = length of document D (in words)
    avgdl = average document length in collection
    k1 = term frequency saturation parameter (default: 1.2)
    b = length normalization parameter (default: 0.75)

IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)

where:
    N = total number of documents
    n(qi) = number of documents containing qi
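The two formulas above can be combined into a short, self-contained sketch (plain Python, not RAGFlow code; `bm25_score` and its arguments are illustrative names):

```python
import math

def idf(N, n_q):
    # IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

def bm25_score(query_terms, doc_terms, doc_freqs, N, avgdl, k1=1.2, b=0.75):
    """Score one tokenized document against a list of query terms.

    doc_freqs maps each term to n(qi), the number of documents containing it.
    """
    dl = len(doc_terms)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)  # f(qi, D)
        if f == 0:
            continue  # absent terms contribute nothing
        norm = k1 * (1 - b + b * dl / avgdl)  # length normalization
        score += idf(N, doc_freqs.get(q, 0)) * (f * (k1 + 1)) / (f + norm)
    return score
```

Note that the `+ 1` inside the log keeps IDF non-negative even for terms appearing in more than half the collection.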

Implementation in RAGFlow

# RAGFlow uses Elasticsearch's native BM25 through query_string query

# In /rag/nlp/search.py
def search(self, kb_ids, question, embd_mdl, tenant_ids, ...):
    # Build BM25 query
    matchText, keywords = self.qryr.question(question, min_match=0.3)

    # matchText contains weighted terms
    # Example: "(machine^0.85 ML AI) (learning^0.72) \"machine learning\"^1.7"

    # Execute search
    res = self.dataStore.search(
        src, highlightFields, filters, matchExprs,
        orderBy, offset, limit, idx_names, kb_ids
    )

Query Construction

# In /rag/nlp/query.py

def question(self, txt, tbl="qa", min_match=0.6):
    """
    Build BM25 query with weighted terms.

    Returns:
        MatchTextExpr with query string and field boosts
    """
    # Normalize and tokenize
    tks = rag_tokenizer.tokenize(txt)

    # Get TF-IDF weights
    tks_w = self.tw.weights(tks)

    # Build query string
    q = []
    for (tk, w) in tks_w[:256]:
        # Add term with weight
        syn = " ".join(self.syn.lookup(tk))  # Get synonyms as a space-separated string
        q.append(f"({tk}^{w:.4f} {syn})")

    # Add phrase queries with 2x boost
    for i in range(1, len(tks_w)):
        left, right = tks_w[i-1][0], tks_w[i][0]
        weight = max(tks_w[i-1][1], tks_w[i][1]) * 2
        q.append(f'"{left} {right}"^{weight:.4f}')

    return MatchTextExpr(
        query=" ".join(q),
        fields=self.query_fields,
        min_match=f"{int(min_match * 100)}%"
    )

Field Boosting

# Fields searched with boost factors
query_fields = [
    "title_tks^10",           # Title: 10x boost
    "title_sm_tks^5",         # Title (semantic): 5x
    "important_kwd^30",       # Keywords: 30x
    "important_tks^20",       # Keyword tokens: 20x
    "question_tks^20",        # Question tokens: 20x
    "content_ltks^2",         # Content: 2x
    "content_sm_ltks",        # Content (semantic): 1x
]

# Final score combines per-field BM25 scores with their boosts
# (query_string defaults to best_fields: the highest-scoring field wins per term)

Elasticsearch Query

{
    "query": {
        "query_string": {
            "query": "(machine^0.85 ML AI) (learning^0.72) \"machine learning\"^1.7",
            "fields": [
                "title_tks^10",
                "important_kwd^30",
                "content_ltks^2"
            ],
            "minimum_should_match": "30%",
            "default_operator": "OR"
        }
    }
}
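The JSON body above can also be assembled programmatically before handing it to the Elasticsearch client; a minimal sketch (`build_bm25_query` is a hypothetical helper, and a real call would go through RAGFlow's es_conn wrapper):

```python
def build_bm25_query(query_text, fields, min_match="30%"):
    """Build the query_string body shown above; fields carry their boost suffixes."""
    return {
        "query": {
            "query_string": {
                "query": query_text,
                "fields": fields,
                "minimum_should_match": min_match,
                "default_operator": "OR",
            }
        }
    }

body = build_bm25_query(
    '(machine^0.85 ML AI) (learning^0.72) "machine learning"^1.7',
    ["title_tks^10", "important_kwd^30", "content_ltks^2"],
)
```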

Parameter Tuning

k1 Parameter

k1 controls term frequency saturation:
- k1 = 0: Binary occurrence (term present or not)
- k1 = 1.2: Moderate saturation (default)
- k1 = 2.0: Less saturation, higher weight for frequent terms

Higher k1 = more weight to term frequency

b Parameter

b controls length normalization:
- b = 0: No length normalization
- b = 0.75: Moderate normalization (default)
- b = 1.0: Full normalization

Higher b = more penalty for long documents
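The effect of both parameters can be checked numerically with the TF factor from the formula above (a standalone sketch; `tf_factor` is an illustrative name):

```python
def tf_factor(f, k1=1.2, b=0.75, dl=100, avgdl=100):
    # (f(qi,D) * (k1 + 1)) / (f(qi,D) + k1 * (1 - b + b * |D|/avgdl))
    return (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))

# k1 = 0: binary occurrence -- any positive frequency scores 1.0
print(tf_factor(1, k1=0), tf_factor(10, k1=0))   # 1.0 1.0
# Higher k1: frequent terms keep gaining weight (less saturation)
print(tf_factor(10, k1=1.2), tf_factor(10, k1=2.0))
# Higher b: a document twice the average length is penalized more
print(tf_factor(3, b=0.0, dl=200), tf_factor(3, b=1.0, dl=200))
```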

BM25 Score Components

┌─────────────────────────────────────────────────────────────────┐
│                    BM25 SCORE BREAKDOWN                          │
└─────────────────────────────────────────────────────────────────┘

For query "machine learning" on document D:

1. IDF Component:
   IDF("machine") = log((10000 - 500 + 0.5) / (500 + 0.5) + 1) = 3.0
   IDF("learning") = log((10000 - 300 + 0.5) / (300 + 0.5) + 1) = 3.5

2. TF Component (k1 = 1.2, b = 0.75, |D| = 100 words, avgdl = 50):
   f("machine", D) = 3 occurrences
   TF_factor = (3 × 2.2) / (3 + 1.2 × (1 - 0.75 + 0.75 × 100/50))
             = 6.6 / (3 + 1.2 × 1.75) = 6.6 / 5.1 ≈ 1.29
   TF_factor("learning") ≈ 1.15, computed the same way from its own term frequency

3. Final Score:
   BM25 = IDF("machine") × TF("machine") + IDF("learning") × TF("learning")
        = 3.0 × 1.29 + 3.5 × 1.15
        = 3.87 + 4.03 ≈ 7.90
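The arithmetic above can be reproduced exactly, assuming |D| = 100, avgdl = 50, and f("learning", D) ≈ 2.3 so that its TF factor comes out to 1.15:

```python
import math

k1, b = 1.2, 0.75
N, avgdl, dl = 10_000, 50, 100

def idf(n_q):
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

def tf(f):
    return (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))

score = idf(500) * tf(3) + idf(300) * tf(2.3)
print(round(score, 2))  # 7.91
```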

Minimum Match

# minimum_should_match controls required term coverage

min_match = "30%"
# For query with 10 terms:
# At least 3 terms must match

# This prevents:
# - Very sparse matches
# - High recall but low precision
Related Files

  • /rag/nlp/query.py - BM25 query construction
  • /rag/nlp/search.py - Search execution
  • /rag/utils/es_conn.py - Elasticsearch integration