03-RAG-ENGINE - Retrieval-Augmented Generation
Overview
The RAG Engine is the core of RAGFlow. It implements the retrieval, embedding, reranking, and generation algorithms.
RAG Engine Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ RAG ENGINE ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────┐
│ User Query │
└──────────────┬──────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ QUERY PROCESSING │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Tokenize │→ │ TF-IDF │→ │ Synonym │ │
│ │ Query │ │ Weight │ │ Expansion │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────────┬──────────────────────────────────┘
│
┌────────────────┴────────────────┐
│ │
▼ ▼
┌───────────────────────────────┐ ┌───────────────────────────────────┐
│ VECTOR SEARCH │ │ BM25 SEARCH │
│ ┌─────────────────────────┐ │ │ ┌─────────────────────────────┐ │
│ │ Embedding Model │ │ │ │ Full-text Query │ │
│ │ (OpenAI/BGE/Jina) │ │ │ │ (Elasticsearch) │ │
│ └───────────┬─────────────┘ │ │ └───────────┬─────────────────┘ │
│ │ │ │ │ │
│ ┌───────────▼─────────────┐ │ │ ┌───────────▼─────────────────┐ │
│ │ Cosine Similarity │ │ │ │ BM25 Scoring │ │
│ │ Score (0-1) │ │ │ │ Score │ │
│ └───────────┬─────────────┘ │ │ └───────────┬─────────────────┘ │
└──────────────┼────────────────┘ └──────────────┼────────────────────┘
│ │
└───────────────┬───────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ SCORE FUSION │
│ │
│ Final = α × Vector_Score + (1-α) × BM25_Score │
│ where α = vector_similarity_weight (default: 0.3) │
│ │
└────────────────────────────────┬──────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ RERANKING (Optional) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Cross-Encoder Model (Jina/Cohere/BGE) │ │
│ │ Re-score each chunk against query │ │
│ │ Return Top-K after reranking │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬──────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ CONTEXT BUILDING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Format chunks into context string │ │
│ │ Add metadata (doc name, page, positions) │ │
│ │ Build citation mapping │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬──────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ LLM GENERATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ System Prompt + Context + User Query │ │
│ │ Token Fitting (stay within context window) │ │
│ │ Streaming Generation │ │
│ │ Citation Insertion │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
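The context-building and token-fitting steps in the diagram can be illustrated with a minimal sketch. This is not RAGFlow's implementation: the function name `build_context`, the chunk field names (`doc_name`, `page`, `text`), the `[ID:n]` marker format, and the whitespace-based token count are all assumptions made for illustration.

```python
def build_context(chunks, max_tokens):
    """Format chunks into a context string and build a citation mapping.

    `chunks` is a list of dicts with hypothetical keys 'doc_name', 'page'
    and 'text', already sorted by relevance. Tokens are approximated by
    whitespace splitting (an assumption; a real tokenizer would differ).
    """
    parts = []
    citations = {}
    used_tokens = 0
    for idx, chunk in enumerate(chunks):
        n_tokens = len(chunk["text"].split())
        # Token fitting: stop before the context window budget is exceeded.
        if used_tokens + n_tokens > max_tokens:
            break
        header = f"[ID:{idx}] ({chunk['doc_name']}, p.{chunk['page']})"
        parts.append(header + "\n" + chunk["text"])
        # Citation mapping: marker index -> source metadata.
        citations[idx] = {"doc_name": chunk["doc_name"], "page": chunk["page"]}
        used_tokens += n_tokens
    return "\n\n".join(parts), citations
```

The returned mapping lets a later citation-insertion step translate an `[ID:n]` marker in the generated answer back to a document name and page.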
Directory Structure
/rag/
├── llm/ # LLM Model Abstractions
│ ├── chat_model.py # Chat LLM interface (30+ providers)
│ ├── embedding_model.py # Embedding models
│ ├── rerank_model.py # Reranking models
│ ├── cv_model.py # Computer vision
│ └── tts_model.py # Text-to-speech
│
├── nlp/ # NLP Processing
│ ├── query.py # Query processing
│ ├── search.py # Search & retrieval ⭐
│ └── rag_tokenizer.py # Tokenization
│
├── app/ # RAG Application
│ └── naive.py # Naive RAG implementation
│
├── flow/ # Processing Pipeline
│ ├── pipeline.py # Pipeline orchestration
│ ├── parser/ # Document parsing
│ ├── tokenizer/ # Tokenization
│ ├── splitter/ # Chunking
│ └── extractor/ # Information extraction
│
├── utils/ # Utilities
│ ├── es_conn.py # Elasticsearch connection
│ └── infinity_conn.py # Infinity connection
│
├── prompts/ # Prompt Templates
│ ├── generator.py # Prompt generator
│ ├── citations.md # Citation prompt
│ ├── keywords.md # Keyword extraction
│ └── ... # Other templates
│
├── raptor.py # RAPTOR algorithm
├── settings.py # Configuration
└── benchmark.py # Performance testing
Files in This Module
Core Algorithms
1. Hybrid Search
# Score fusion formula
Final_Score = α × Vector_Score + (1-α) × BM25_Score
where:
α = vector_similarity_weight (default: 0.3)
Vector_Score = cosine_similarity(query_embedding, chunk_embedding)
BM25_Score = normalized_bm25(query_tokens, chunk_tokens)
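The fusion formula above can be sketched in a few lines of Python. The function names `fuse_scores` and `min_max_normalize` are hypothetical, and min-max scaling is only one plausible reading of `normalized_bm25`; the source does not say which normalization is used.

```python
def min_max_normalize(scores):
    """Scale raw scores into [0, 1] (an assumed normalization scheme)."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every candidate as equally relevant.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(vector_score, bm25_score, alpha=0.3):
    """Final = alpha * Vector_Score + (1 - alpha) * BM25_Score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * vector_score + (1.0 - alpha) * bm25_score
```

With the default α = 0.3, the keyword score carries 70% of the final score, so a chunk with a strong BM25 match can outrank one that is only semantically close.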
2. BM25 Scoring
# BM25 formula
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
where:
f(qi, D) = term frequency of qi in document D
|D| = document length
avgdl = average document length
k1 = 1.2 (term frequency saturation)
b = 0.75 (length normalization)
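A minimal, self-contained sketch of the BM25 formula follows. The source does not define IDF, so this sketch assumes the common smoothed form `ln(1 + (N - n + 0.5) / (n + 0.5))`; the function name `bm25_score` and the in-memory `corpus` argument are illustrative and unrelated to how Elasticsearch computes the score internally.

```python
import math
from collections import Counter


def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    `corpus` is a list of tokenized documents used for IDF and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    term_freq = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        f = term_freq.get(term, 0)
        if f == 0:
            continue  # term absent from the document contributes nothing
        n_t = sum(1 for d in corpus if term in d)  # documents containing term
        idf = math.log(1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5))  # assumed IDF
        denom = f + k1 * (1.0 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1.0) / denom
    return score
```

Because of the `b` term, a short document that mentions a query term once scores higher than a long document that also mentions it once.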
3. Cosine Similarity
# Cosine similarity formula
cos(θ) = (A · B) / (||A|| × ||B||)
where:
A, B = embedding vectors
A · B = dot product
||A|| = L2 norm
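The same formula in plain Python, as a small sketch with no external dependencies. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Undefined for zero vectors; this sketch returns 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)
```

In general the result lies in [-1, 1]: parallel vectors give 1, orthogonal vectors give 0, and opposite vectors give -1.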
4. Cross-Encoder Reranking
# Reranking score
Rerank_Score = CrossEncoder(query, document)
# Final ranking
Final_Rank = α × Token_Similarity + β × Vector_Similarity + γ × Rank_Features
where (in this formula α is the token weight, not the vector_similarity_weight α used in Hybrid Search):
α = 0.3 (token weight)
β = 0.7 (vector weight)
γ = variable (PageRank, tag boost)
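The final ranking step can be sketched as a weighted sum followed by a sort. The function names `final_rank_scores` and `rank_indices` are hypothetical, and treating γ as a single scalar with a default of 1.0 is an assumption, since the source only says it is variable.

```python
def final_rank_scores(token_sims, vector_sims, rank_features,
                      token_weight=0.3, vector_weight=0.7, gamma=1.0):
    """Combine per-chunk token similarity, vector similarity and rank features."""
    if not (len(token_sims) == len(vector_sims) == len(rank_features)):
        raise ValueError("all score lists must have the same length")
    return [
        token_weight * t + vector_weight * v + gamma * r
        for t, v, r in zip(token_sims, vector_sims, rank_features)
    ]


def rank_indices(scores):
    """Return chunk indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Keeping the rank-feature term additive means a PageRank or tag boost can lift a chunk without changing the balance between token and vector similarity.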
LLM Provider Support
Chat Models (30+)
| Provider | Models |
|----------|--------|
| OpenAI | GPT-3.5, GPT-4, GPT-4V |
| Anthropic | Claude 3 (Opus, Sonnet, Haiku) |
| Google | Gemini Pro |
| Alibaba | Qwen, Qwen-VL |
| Groq | LLaMA 3, Mixtral |
| Mistral | Mistral 7B, Mixtral 8x7B |
| Cohere | Command R, Command R+ |
| DeepSeek | DeepSeek Chat |
| Ollama | Local models |
| ... | And many more |
Embedding Models
| Provider | Models | Dimensions |
|----------|--------|------------|
| OpenAI | text-embedding-3-small | 1536 |
| OpenAI | text-embedding-3-large | 3072 |
| BGE | bge-large-en-v1.5 | 1024 |
| BGE | bge-m3 | 1024 |
| Jina | jina-embeddings-v2 | 768 |
| Cohere | embed-english-v3 | 1024 |
Reranking Models
| Provider | Models |
|----------|--------|
| Jina | jina-reranker-v2 |
| Cohere | rerank-english-v3 |
| BGE | bge-reranker-large |
| NVIDIA | rerank-qa-mistral-4b |
Configuration Parameters
Search Configuration
{
"similarity_threshold": 0.2, # Minimum similarity
"vector_similarity_weight": 0.3, # α in fusion formula
"top_n": 6, # Final results count
"top_k": 1024, # Initial candidates
"rerank_model": "jina-reranker-v2"
}
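How these parameters interact can be shown with a short sketch: `top_k` bounds the initial candidate pool, `similarity_threshold` drops weak matches, and `top_n` caps what is returned. The function name `select_chunks` and the `(chunk_id, score)` tuple shape are assumptions for illustration; the order of the steps is one reasonable reading, not a confirmed description of RAGFlow.

```python
SEARCH_CONFIG = {
    "similarity_threshold": 0.2,
    "vector_similarity_weight": 0.3,
    "top_n": 6,
    "top_k": 1024,
}


def select_chunks(scored_chunks, config=SEARCH_CONFIG):
    """Filter and truncate (chunk_id, score) pairs according to the config."""
    # Keep at most top_k highest-scoring candidates.
    candidates = sorted(scored_chunks, key=lambda c: c[1], reverse=True)
    candidates = candidates[: config["top_k"]]
    # Drop anything below the minimum similarity.
    kept = [c for c in candidates if c[1] >= config["similarity_threshold"]]
    # Return only the final top_n results.
    return kept[: config["top_n"]]
```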
Chunking Configuration
{
"chunk_token_num": 512, # Tokens per chunk
"delimiter": "\n!?。;!?", # Split delimiters
"layout_recognize": "DeepDOC", # Layout detection
"overlapped_percent": 0 # Chunk overlap
}
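The following sketch shows one way `chunk_token_num` and `delimiter` could drive chunking: text is split at any delimiter character, and the resulting sentences are packed greedily into chunks up to the token budget. The function name `split_into_chunks` and the whitespace token count are assumptions, and overlap is not implemented, matching `overlapped_percent: 0`.

```python
import re


def split_into_chunks(text, chunk_token_num=512, delimiter="\n!?。;!?"):
    """Split text at delimiter characters, then pack sentences into chunks."""
    # Each character in `delimiter` is a boundary; the capture group keeps it.
    pattern = "([" + re.escape(delimiter) + "])"
    pieces = re.split(pattern, text)
    sentences = []
    for i in range(0, len(pieces), 2):
        mark = pieces[i + 1] if i + 1 < len(pieces) else ""
        sentence = (pieces[i] + mark).strip()
        if sentence:
            sentences.append(sentence)

    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # crude token estimate (assumption)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_tokens + n_tokens > chunk_token_num:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `chunk_token_num` still becomes its own chunk in this sketch rather than being cut mid-sentence.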
Generation Configuration
{
"temperature": 0.7,
"max_tokens": 2048,
"top_p": 1.0,
"frequency_penalty": 0.0,
"presence_penalty": 0.0
}
Key Performance Metrics
| Metric | Typical Value | Description |
|--------|---------------|-------------|
| Vector Search Latency | < 100ms | Elasticsearch query time |
| BM25 Search Latency | < 50ms | Full-text search time |
| Reranking Latency | 200-500ms | Cross-encoder inference |
| Embedding Generation | 1-5s/batch | Per batch of 16 texts |
| Total Retrieval | < 1s | End-to-end search |
Related Files
/api/db/services/dialog_service.py - Uses RAG engine
/rag/nlp/search.py - Core search implementation
/rag/utils/es_conn.py - Elasticsearch queries