Vietnamese Embedding - Quick Reference
🚀 Quick Setup (5 minutes)
1. Install & Configure
# Navigate to LightRAG directory
cd LightRAG
# Install in editable mode
pip install -e .
# Set your HuggingFace token
export HUGGINGFACE_API_KEY="your_hf_token"
# Set your OpenAI key (or other LLM provider)
export OPENAI_API_KEY="your_openai_key"
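Before running the example, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch: `missing_env_vars` is a hypothetical helper, not part of LightRAG, and it assumes you are using the OpenAI key from the step above (swap in your own provider's variable otherwise).

```python
import os

# Variable names taken from the setup step above.
REQUIRED_VARS = ("HUGGINGFACE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```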
2. Run Example
python examples/lightrag_vietnamese_embedding_simple.py
📝 Minimal Code Example
import os
import asyncio

from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc


async def main():
    rag = LightRAG(
        working_dir="./vietnamese_rag",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=vietnamese_embed,
        ),
    )
    await rag.initialize_storages()
    await initialize_pipeline_status()

    # Insert Vietnamese text ("Vietnam is a country in Southeast Asia.")
    await rag.ainsert("Việt Nam là một quốc gia ở Đông Nam Á.")

    # Query ("Where is Vietnam?")
    result = await rag.aquery("Việt Nam ở đâu?", param=QueryParam(mode="hybrid"))
    print(result)

    await rag.finalize_storages()


asyncio.run(main())
🔧 Key Parameters
| Parameter | Value | Description |
|---|---|---|
| `embedding_dim` | 1024 | Output dimensions |
| `max_token_size` | 2048 | Maximum tokens per input |
| `model_name` | AITeamVN/Vietnamese_Embedding | HuggingFace model ID |
| `token` | Your HF token | HuggingFace API token |
🌐 Supported Languages
✅ Vietnamese (optimized)
✅ English (inherited from BGE-M3)
✅ Chinese (inherited from BGE-M3)
✅ 100+ other languages (multilingual support)
📊 Performance
| Metric | Value |
|---|---|
| GPU Memory | 2-4 GB |
| RAM | 4-8 GB |
| Disk Space | ~2 GB (first download) |
| Speed (GPU) | 200-1000 texts/sec |
| Speed (CPU) | 20-100 texts/sec |
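
To turn the throughput ranges above into a rough time budget, divide the number of texts by the rate. The sketch below uses only the figures from the table; `estimate_seconds` is a hypothetical helper, and real timings will depend on text length and hardware.

```python
# Throughput ranges (texts/sec) copied from the table above.
THROUGHPUT = {"gpu": (200, 1000), "cpu": (20, 100)}


def estimate_seconds(n_texts, device):
    """Return (best_case, worst_case) embedding time in seconds."""
    slow, fast = THROUGHPUT[device]
    return n_texts / fast, n_texts / slow


for device in ("gpu", "cpu"):
    best, worst = estimate_seconds(10_000, device)
    print(f"{device}: {best:.0f}-{worst:.0f} s for 10,000 texts")
```

For 10,000 texts this works out to roughly 10-50 seconds on GPU and 100-500 seconds on CPU.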
📚 Resources
Documentation
- English: docs/VietnameseEmbedding.md
- Vietnamese: docs/VietnameseEmbedding_VI.md
Examples
- Simple: examples/lightrag_vietnamese_embedding_simple.py
- Comprehensive: examples/vietnamese_embedding_demo.py
Testing
python tests/test_vietnamese_embedding_integration.py
🐛 Common Issues
"No HuggingFace token found"
export HUGGINGFACE_API_KEY="your_token"
"Model download fails"
- Check internet connection
- Verify HuggingFace token is valid
- Ensure 2GB+ free disk space
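
The disk-space check can be scripted with the standard library. This is a minimal sketch: `has_free_space` is a hypothetical helper, and the current directory is only a placeholder, so point it at whichever location your model cache actually lives in.

```python
import shutil

REQUIRED_BYTES = 2 * 1024**3  # 2 GiB, per the "2GB+" guidance above


def has_free_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    free_gib = shutil.disk_usage(".").free / 1024**3
    print(f"Free space: {free_gib:.1f} GiB, enough: {has_free_space()}")
```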
"Out of memory"
- Reduce batch size
- Use CPU instead of GPU
- Close other GPU applications
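
One way to reduce batch size is to split the input list yourself and embed one slice at a time. The sketch below is an assumption about how you might wire this up, not LightRAG's own batching: `chunked`, `embed_in_batches`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real async embedding function such as `vietnamese_embed`.

```python
import asyncio


def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def embed_in_batches(texts, embed_fn, batch_size=8):
    """Call `embed_fn` once per batch and collect the vectors in order."""
    vectors = []
    for batch in chunked(texts, batch_size):
        vectors.extend(await embed_fn(batch))
    return vectors


async def fake_embed(batch):
    """Stand-in for a real embedding call: one 1-element vector per text."""
    return [[float(len(text))] for text in batch]


if __name__ == "__main__":
    result = asyncio.run(
        embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
    )
    print(result)  # [[1.0], [2.0], [3.0]]
```

Smaller batches lower peak memory at the cost of more calls, so start with a modest size and increase it until memory becomes a problem.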
"Slow embedding"
- Install CUDA-enabled PyTorch
- Check GPU is being used (see logs)
- Reduce `max_token_size` for shorter texts
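
Independently of the logs, you can ask PyTorch directly whether it sees a CUDA device. This is a minimal sketch; `detect_device` is a hypothetical helper, and it only reports what PyTorch can see, not which device the embedding function actually selects.

```python
def detect_device():
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"PyTorch would use: {detect_device()}")
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed PyTorch build is likely CPU-only.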
💡 Tips
- First run is slow: Model downloads (~2GB), cached afterward
- Use GPU: 10-50x faster than CPU
- Batch requests: Process multiple texts together
- Enable debug logs: See device being used
from lightrag.utils import setup_logger
setup_logger("lightrag", level="DEBUG")
🔗 Links
- Model: AITeamVN/Vietnamese_Embedding
- Base Model: BAAI/bge-m3
- LightRAG: GitHub
📞 Support
- Issues: GitHub Issues
- Tag: Use the `vietnamese-embedding` label
- Model Issues: Vietnamese_Embedding page
✅ Quick Validation
Run this to test your setup:
python -c "
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embed

async def test():
    result = await vietnamese_embed(['Test text'])
    print(f'✓ Success! Shape: {result.shape}')

asyncio.run(test())
"
Expected output: ✓ Success! Shape: (1, 1024)
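
If you want to assert on the shape rather than read it from the terminal, a small check like the one below could be dropped into a test. It is a sketch only: `check_embedding_shape` is a hypothetical helper, and the default of 1024 comes from the `embedding_dim` value used in this guide.

```python
def check_embedding_shape(shape, n_texts, embedding_dim=1024):
    """Return True if `shape` equals (n_texts, embedding_dim)."""
    return tuple(shape) == (n_texts, embedding_dim)


print(check_embedding_shape((1, 1024), n_texts=1))  # True
print(check_embedding_shape((1, 768), n_texts=1))   # False
```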