# Vietnamese Embedding Integration for LightRAG
This integration adds support for the AITeamVN/Vietnamese_Embedding model to LightRAG, enabling enhanced retrieval capabilities for Vietnamese text.
## Model Information
- Model: AITeamVN/Vietnamese_Embedding
- Base Model: BAAI/bge-m3
- Type: Sentence Transformer
- Maximum Sequence Length: 2048 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Dot product similarity
- Language: Vietnamese (also supports other languages as it's based on BGE-M3)
- Training Data: ~300,000 triplets of queries, positive documents, and negative documents for Vietnamese
## Features

- ✅ **High-quality Vietnamese embeddings**: fine-tuned specifically for Vietnamese text retrieval
- ✅ **Multilingual support**: inherits multilingual capabilities from BGE-M3
- ✅ **Long-context support**: handles up to 2048 tokens per input
- ✅ **Efficient processing**: automatic device detection (CUDA/MPS/CPU)
- ✅ **Normalized embeddings**: ready for dot-product similarity
- ✅ **Easy integration**: drop-in replacement for other embedding functions
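Because the embeddings come back L2-normalized, the dot product equals cosine similarity. The toy vectors below illustrate that property; they stand in for real model output and are not produced by `vietnamese_embed`:

```python
import numpy as np

# Toy unit-length vectors standing in for normalized embedding output.
a = np.array([0.6, 0.8])
b = np.array([0.8, 0.6])

# For unit vectors, the dot product is the cosine similarity.
similarity = float(a @ b)
```

This is why the model card specifies dot-product similarity: normalization folds the cosine denominator into the vectors themselves.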
## Installation

### 1. Install LightRAG

```bash
cd LightRAG
pip install -e .
```
### 2. Install Required Dependencies

The Vietnamese embedding integration requires:

- `transformers` (automatically installed)
- `torch` (automatically installed)
- `numpy` (automatically installed)

These are installed automatically via pipmaster the first time you use the Vietnamese embedding function.
### 3. Set Up HuggingFace Token

You need a HuggingFace token to access the model:

```bash
export HUGGINGFACE_API_KEY="your_hf_token_here"
# or
export HF_TOKEN="your_hf_token_here"
```

Get your token from: https://huggingface.co/settings/tokens
## Quick Start

### Simple Example
```python
import os
import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc

WORKING_DIR = "./vietnamese_rag_storage"

async def main():
    # Get HuggingFace token
    hf_token = os.environ.get("HUGGINGFACE_API_KEY")

    # Initialize LightRAG with Vietnamese embedding
    rag = LightRAG(
        working_dir=WORKING_DIR,
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=lambda texts: vietnamese_embed(
                texts,
                model_name="AITeamVN/Vietnamese_Embedding",
                token=hf_token,
            ),
        ),
    )

    # Initialize storage and pipeline
    await rag.initialize_storages()
    await initialize_pipeline_status()

    # Insert Vietnamese text
    await rag.ainsert("Việt Nam là một quốc gia nằm ở Đông Nam Á.")

    # Query
    result = await rag.aquery(
        "Việt Nam ở đâu?",
        param=QueryParam(mode="hybrid"),
    )
    print(result)

    await rag.finalize_storages()

if __name__ == "__main__":
    asyncio.run(main())
```
### Using with a .env File

Create a `.env` file in your project directory:

```bash
# HuggingFace Token for Vietnamese Embedding
HUGGINGFACE_API_KEY=your_key_here

# LLM Configuration
OPENAI_API_KEY=your_openai_key_here
LLM_BINDING=openai
LLM_MODEL=gpt-4o-mini

# Embedding Configuration
EMBEDDING_MODEL=AITeamVN/Vietnamese_Embedding
EMBEDDING_DIM=1024
```
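In Python scripts such a file is typically loaded with python-dotenv (`pip install python-dotenv`); the hand-rolled loader below is only a sketch of what that library does, to show the file format in action:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader (a sketch; python-dotenv is the usual choice)."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Do not override variables already set in the environment.
        os.environ.setdefault(key.strip(), value.strip())
```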
## Example Scripts

We provide several example scripts demonstrating different use cases:

### 1. Simple Example

```bash
python examples/lightrag_vietnamese_embedding_simple.py
```

A minimal example showing basic Vietnamese text processing.

### 2. Comprehensive Demo

```bash
python examples/vietnamese_embedding_demo.py
```

A comprehensive demo covering:

- Vietnamese text processing
- English text processing (multilingual support)
- Mixed-language processing
- Multiple query examples
## API Reference

### vietnamese_embed()

Generates embeddings for texts using the Vietnamese Embedding model.

```python
async def vietnamese_embed(
    texts: list[str],
    model_name: str = "AITeamVN/Vietnamese_Embedding",
    token: str | None = None,
) -> np.ndarray
```

**Parameters:**

- `texts` (list[str]): list of texts to embed
- `model_name` (str): HuggingFace model identifier
- `token` (str, optional): HuggingFace API token (read from the environment if not provided)

**Returns:**

- `np.ndarray`: array of embeddings with shape `(len(texts), 1024)`

**Example:**

```python
from lightrag.llm.vietnamese_embed import vietnamese_embed

texts = ["Xin chào", "Hello", "你好"]
embeddings = await vietnamese_embed(texts)
print(embeddings.shape)  # (3, 1024)
```
### vietnamese_embedding_func()

A convenience wrapper that automatically reads the token from the environment.

```python
async def vietnamese_embedding_func(texts: list[str]) -> np.ndarray
```

**Example:**

```python
from lightrag.llm.vietnamese_embed import vietnamese_embedding_func

# Token automatically read from HUGGINGFACE_API_KEY or HF_TOKEN
embeddings = await vietnamese_embedding_func(["Xin chào"])
```
## Advanced Usage

### Custom Model Configuration

```python
from lightrag.llm.vietnamese_embed import vietnamese_embed

# Use a different model based on BGE-M3
embeddings = await vietnamese_embed(
    texts=["Sample text"],
    model_name="BAAI/bge-m3",  # Use the base model
    token=your_token,
)
```
### Device Selection

The model automatically detects and uses the best available device:

- CUDA (if available)
- MPS (on Apple Silicon)
- CPU (fallback)

You can check which device is being used by enabling debug logging:

```python
from lightrag.utils import setup_logger

setup_logger("lightrag", level="DEBUG")
```
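The fallback order described above can be sketched roughly as follows. This is an illustration of the CUDA → MPS → CPU priority, not the integration's actual code:

```python
def pick_device() -> str:
    # Fallback order: CUDA, then MPS (Apple Silicon), then CPU.
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```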
### Batch Processing

The embedding function supports efficient batch processing:

```python
# Process multiple texts efficiently
large_batch = ["Text 1", "Text 2", ..., "Text 1000"]
embeddings = await vietnamese_embed(large_batch)
```
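For very large inputs you may still want to split the work into mini-batches and stack the results, e.g. to bound peak memory. The helper below is a hypothetical sketch: the `chunked`/`embed_in_batches` names and the default batch size are assumptions, not part of the API:

```python
import numpy as np

def chunked(seq: list, size: int) -> list[list]:
    # Split a list into consecutive chunks of at most `size` items.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

async def embed_in_batches(texts: list[str], batch_size: int = 64) -> np.ndarray:
    # Hypothetical wrapper around the integration's vietnamese_embed().
    from lightrag.llm.vietnamese_embed import vietnamese_embed
    parts = [await vietnamese_embed(batch) for batch in chunked(texts, batch_size)]
    return np.vstack(parts)  # (len(texts), 1024) after stacking
```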
## Integration with LightRAG Server

To use Vietnamese embedding with LightRAG Server, update your `.env` file:

```bash
# Vietnamese Embedding Configuration
EMBEDDING_MODEL=AITeamVN/Vietnamese_Embedding
EMBEDDING_DIM=1024
HUGGINGFACE_API_KEY=your_hf_token

# Or use a custom binding
EMBEDDING_BINDING=huggingface
```

Then start the server:

```bash
lightrag-server
```
## Performance Considerations

### Memory Requirements

- **GPU memory**: ~2-4 GB for the model
- **RAM**: ~4-8 GB recommended
- **Disk space**: ~2 GB for model weights (cached after the first download)

### Speed

On a typical GPU:

- ~1000 texts/second for short texts (< 512 tokens)
- ~200-400 texts/second for longer texts (1024-2048 tokens)
### Optimization Tips

- **Use a GPU**: significantly faster than CPU (10-50x)
- **Batch requests**: process multiple texts together
- **Cache the model**: the first run downloads the model; subsequent runs are faster
- **Adjust max_token_size**: use a smaller value if your texts are shorter

```python
# Example: Optimize for shorter texts
embedding_func = EmbeddingFunc(
    embedding_dim=1024,
    max_token_size=512,  # Reduce if texts are shorter
    func=lambda texts: vietnamese_embed(texts),
)
```
## Troubleshooting

### Issue: "No HuggingFace token found"

**Solution:** set the environment variable:

```bash
export HUGGINGFACE_API_KEY="your_token"
# or
export HF_TOKEN="your_token"
```

### Issue: "Model download fails"

**Solution:**

- Check your internet connection
- Verify that your HuggingFace token is valid
- Ensure you have enough disk space (~2 GB)

### Issue: "Out of memory error"

**Solution:**

- Reduce the batch size
- Use the CPU instead of the GPU (slower, but uses less memory)
- Close other applications using GPU memory/RAM

### Issue: "Slow embedding generation"

**Solution:**

- Ensure you are using a GPU (check the logs for device info)
- Install CUDA-enabled PyTorch:

  ```bash
  pip install torch --index-url https://download.pytorch.org/whl/cu118
  ```

- Reduce `max_token_size` if your texts are shorter
## Comparison with Other Embedding Models
| Model | Dimensions | Max Tokens | Languages | Fine-tuned for Vietnamese |
|---|---|---|---|---|
| Vietnamese_Embedding | 1024 | 2048 | Multilingual | ✅ Yes |
| BGE-M3 | 1024 | 8192 | Multilingual | ❌ No |
| text-embedding-3-large | 3072 | 8191 | Multilingual | ❌ No |
| text-embedding-3-small | 1536 | 8191 | Multilingual | ❌ No |
## Citation

If you use the Vietnamese Embedding model in your research, please cite:

```bibtex
@misc{vietnamese_embedding_2024,
  title={Vietnamese Embedding: Fine-tuned BGE-M3 for Vietnamese Retrieval},
  author={AITeamVN},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AITeamVN/Vietnamese_Embedding}
}
```
## Support

For issues specific to the Vietnamese embedding integration:

- Open an issue on the LightRAG GitHub repository
- Tag it with the `vietnamese-embedding` label

For issues with the model itself, see the model page: https://huggingface.co/AITeamVN/Vietnamese_Embedding
## License

This integration follows LightRAG's license. The Vietnamese_Embedding model may have its own license terms; please check the model page for details.
## Acknowledgments

- **AITeamVN** for training and releasing the Vietnamese_Embedding model
- **BAAI** for the base BGE-M3 model
- The **LightRAG** team for the excellent RAG framework