# Vietnamese Embedding - Quick Reference

## 🚀 Quick Setup (5 minutes)

### 1. Install & Configure
```bash
# Navigate to LightRAG directory
cd LightRAG

# Install in editable mode
pip install -e .

# Set your HuggingFace token
export HUGGINGFACE_API_KEY="your key here"

# Set your OpenAI key (or other LLM provider)
export OPENAI_API_KEY="your_openai_key"
```

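Before running the example, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch that assumes only the two variable names exported above; `missing_env_vars` is a hypothetical helper for illustration and is not part of LightRAG.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("HUGGINGFACE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or blank.

    Hypothetical helper: `environ` defaults to os.environ and can be
    replaced with a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

If a name is reported as missing, re-run the matching `export` command in the same shell session before starting Python.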
### 2. Run Example
```bash
python examples/lightrag_vietnamese_embedding_simple.py
```

## 📝 Minimal Code Example

```python
import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc

async def main():
    rag = LightRAG(
        working_dir="./vietnamese_rag",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=vietnamese_embed
        )
    )

    await rag.initialize_storages()
    await initialize_pipeline_status()

    # Insert Vietnamese text ("Vietnam is a country in Southeast Asia.")
    await rag.ainsert("Việt Nam là một quốc gia ở Đông Nam Á.")

    # Query ("Where is Vietnam?")
    result = await rag.aquery("Việt Nam ở đâu?", param=QueryParam(mode="hybrid"))
    print(result)

    await rag.finalize_storages()

asyncio.run(main())
```

## 🔧 Key Parameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| `embedding_dim` | 1024 | Output dimensions |
| `max_token_size` | 2048 | Maximum tokens per input |
| `model_name` | AITeamVN/Vietnamese_Embedding | HuggingFace model ID |
| `token` | Your HF token | HuggingFace API token |

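Since `embedding_dim` is meant to describe the width of the vectors the model returns, a quick shape check can catch a mismatched value early. This is a minimal sketch: `check_embedding_shape` is a hypothetical helper, not a LightRAG function, and it assumes the embedding result exposes a `(num_texts, dim)` shape as in the validation output at the end of this page.

```python
# Value copied from the table above; the helper itself is hypothetical.
EMBEDDING_DIM = 1024


def check_embedding_shape(shape, num_texts, embedding_dim=EMBEDDING_DIM):
    """Raise ValueError unless `shape` equals (num_texts, embedding_dim)."""
    expected = (num_texts, embedding_dim)
    if tuple(shape) != expected:
        raise ValueError(f"expected shape {expected}, got {tuple(shape)}")
    return True


# Three input texts should produce a 3 x 1024 result.
print(check_embedding_shape((3, 1024), num_texts=3))  # → True
```

In practice the `shape` argument would come from the array returned by the embedding function, for example `result.shape`.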
## 🌐 Supported Languages

- ✅ **Vietnamese** (optimized)
- ✅ **English** (inherited from BGE-M3)
- ✅ **Chinese** (inherited from BGE-M3)
- ✅ **100+ other languages** (multilingual support)

## 📊 Performance

| Metric | Value |
|--------|-------|
| GPU Memory | 2-4 GB |
| RAM | 4-8 GB |
| Disk Space | ~2 GB (first download) |
| Speed (GPU) | 200-1000 texts/sec |
| Speed (CPU) | 20-100 texts/sec |

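The throughput figures above translate directly into rough indexing times. As a worked example, the sketch below estimates how long 100,000 texts would take at the quoted rates; `embedding_time_range` is a hypothetical helper, and the corpus size is an assumption chosen for illustration. Real timings depend on text length and hardware.

```python
def embedding_time_range(num_texts, min_rate, max_rate):
    """Return (best_case_seconds, worst_case_seconds) for embedding `num_texts`.

    `min_rate` and `max_rate` are throughputs in texts per second.
    """
    if num_texts < 0 or min_rate <= 0 or max_rate < min_rate:
        raise ValueError("need num_texts >= 0 and 0 < min_rate <= max_rate")
    return num_texts / max_rate, num_texts / min_rate


# Rates copied from the table above (texts/sec).
RATES = {"GPU": (200, 1000), "CPU": (20, 100)}

for label, (low, high) in RATES.items():
    best, worst = embedding_time_range(100_000, low, high)
    print(f"{label}: {best:.0f}-{worst:.0f} s")
# GPU: 100-500 s
# CPU: 1000-5000 s
```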
## 📚 Resources

### Documentation
- **English:** `docs/VietnameseEmbedding.md`
- **Vietnamese (Tiếng Việt):** `docs/VietnameseEmbedding_VI.md`

### Examples
- **Simple:** `examples/lightrag_vietnamese_embedding_simple.py`
- **Comprehensive:** `examples/vietnamese_embedding_demo.py`

### Testing
```bash
python tests/test_vietnamese_embedding_integration.py
```

## 🐛 Common Issues

### "No HuggingFace token found"
```bash
export HUGGINGFACE_API_KEY="your_token"
```

### "Model download fails"
- Check your internet connection
- Verify that your HuggingFace token is valid
- Ensure at least 2 GB of free disk space

### "Out of memory"
- Reduce the batch size
- Use the CPU instead of the GPU
- Close other applications that are using the GPU

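Reducing the batch size amounts to embedding the texts in smaller groups, so that fewer inputs are held in memory at once. A minimal sketch of that idea, assuming a plain Python list of texts; `iter_batches` is a hypothetical helper, and how the batch size is actually configured in LightRAG is not covered here.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Ten placeholder texts split into groups of at most four.
texts = [f"văn bản {i}" for i in range(10)]
batches = list(iter_batches(texts, batch_size=4))
print([len(batch) for batch in batches])  # → [4, 4, 2]
```

Each batch would then be passed to the embedding function in turn; a smaller `batch_size` lowers peak memory use at the cost of more calls.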
### "Slow embedding"
- Install a CUDA-enabled build of PyTorch
- Check that the GPU is being used (see the logs)
- Reduce `max_token_size` if your texts are short

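One quick way to see whether a CUDA-enabled PyTorch is installed is to ask PyTorch directly. The sketch below uses the standard `torch.cuda.is_available()` call; `detect_device` is a hypothetical helper, and it only reports what PyTorch can see, not which device the embedding function ends up choosing.

```python
def detect_device():
    """Return "cuda" if PyTorch sees a CUDA GPU, "cpu" if it does not,
    or "torch-missing" if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Detected device: {detect_device()}")
```

A result of `cpu` on a machine with a GPU usually points to a CPU-only PyTorch build or a driver problem.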
## 💡 Tips

1. **First run is slow:** The model (~2 GB) is downloaded once and cached afterward.
2. **Use a GPU:** 10-50x faster than CPU.
3. **Batch requests:** Process multiple texts together.
4. **Enable debug logs:** The logs show which device is being used.
   ```python
   from lightrag.utils import setup_logger
   setup_logger("lightrag", level="DEBUG")
   ```

## 🔗 Links

- **Model:** [AITeamVN/Vietnamese_Embedding](https://huggingface.co/AITeamVN/Vietnamese_Embedding)
- **Base Model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **LightRAG:** [GitHub](https://github.com/HKUDS/LightRAG)

## 📞 Support

- **Issues:** [GitHub Issues](https://github.com/HKUDS/LightRAG/issues)
- **Tag:** Use `vietnamese-embedding` label
- **Model Issues:** [Vietnamese_Embedding page](https://huggingface.co/AITeamVN/Vietnamese_Embedding)

## ✅ Quick Validation

Run this to test your setup:
```bash
python -c "
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embed
async def test():
    result = await vietnamese_embed(['Test text'])
    print(f'✓ Success! Shape: {result.shape}')
asyncio.run(test())
"
```

Expected output: `✓ Success! Shape: (1, 1024)`