
# Vietnamese Embedding - Quick Reference

## 🚀 Quick Setup (5 minutes)

### 1. Install & Configure
```bash
# Navigate to LightRAG directory
cd LightRAG

# Install in editable mode
pip install -e .

# Set your HuggingFace token
export HUGGINGFACE_API_KEY="your_hf_token"

# Set your OpenAI key (or other LLM provider)
export OPENAI_API_KEY="your_openai_key"
```

### 2. Run Example
```bash
python examples/lightrag_vietnamese_embedding_simple.py
```

## 📝 Minimal Code Example

```python
import os
import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc

async def main():
    rag = LightRAG(
        working_dir="./vietnamese_rag",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=vietnamese_embed
        )
    )

    await rag.initialize_storages()
    await initialize_pipeline_status()

    # Insert Vietnamese text
    await rag.ainsert("Việt Nam là một quốc gia ở Đông Nam Á.")

    # Query
    result = await rag.aquery("Việt Nam ở đâu?", param=QueryParam(mode="hybrid"))
    print(result)

    await rag.finalize_storages()

asyncio.run(main())
```

## 🔧 Key Parameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| `embedding_dim` | 1024 | Output dimensions |
| `max_token_size` | 2048 | Maximum tokens per input |
| `model_name` | AITeamVN/Vietnamese_Embedding | HuggingFace model ID |
| `token` | Your HF token | HuggingFace API token |

## 🌐 Supported Languages

- ✅ **Vietnamese** (optimized)
- ✅ **English** (inherited from BGE-M3)
- ✅ **Chinese** (inherited from BGE-M3)
- ✅ **100+ other languages** (multilingual support)

## 📊 Performance

| Metric | Value |
|--------|-------|
| GPU Memory | 2-4 GB |
| RAM | 4-8 GB |
| Disk Space | ~2 GB (first download) |
| Speed (GPU) | 200-1000 texts/sec |
| Speed (CPU) | 20-100 texts/sec |
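
To turn the throughput ranges above into rough wall-clock estimates, divide the number of texts by the texts-per-second figure. The sketch below does this for a corpus of 100,000 texts. The `estimate_seconds` helper and the corpus size are illustrative assumptions, and the ranges are copied from the table rather than measured.

```python
def estimate_seconds(num_texts: int, texts_per_sec: float) -> float:
    """Return the time in seconds to embed num_texts at a given throughput."""
    if texts_per_sec <= 0:
        raise ValueError("texts_per_sec must be positive")
    return num_texts / texts_per_sec


# Throughput ranges (texts/sec) taken from the Performance table.
RANGES = {"GPU": (200, 1000), "CPU": (20, 100)}

for device, (slow, fast) in RANGES.items():
    best = estimate_seconds(100_000, fast)
    worst = estimate_seconds(100_000, slow)
    print(f"{device}: {best:.0f}-{worst:.0f} s for 100,000 texts")
# GPU: 100-500 s for 100,000 texts
# CPU: 1000-5000 s for 100,000 texts
```

Actual throughput depends on text length and hardware, so treat these numbers as order-of-magnitude guidance.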

## 📚 Resources

### Documentation
- **English:** `docs/VietnameseEmbedding.md`
- **Tiếng Việt:** `docs/VietnameseEmbedding_VI.md`

### Examples
- **Simple:** `examples/lightrag_vietnamese_embedding_simple.py`
- **Comprehensive:** `examples/vietnamese_embedding_demo.py`

### Testing
```bash
python tests/test_vietnamese_embedding_integration.py
```

## 🐛 Common Issues

### "No HuggingFace token found"
```bash
export HUGGINGFACE_API_KEY="your_token"
```
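
A script can fail fast with a clearer message by checking for the token before any model is loaded. The sketch below is illustrative only: `require_hf_token` is a hypothetical helper, not part of LightRAG, and it assumes the token is read from the `HUGGINGFACE_API_KEY` environment variable as in the setup step.

```python
import os


def require_hf_token(env_var: str = "HUGGINGFACE_API_KEY") -> str:
    """Return the token stored in env_var, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of LightRAG.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f'{env_var} is not set. Run: export {env_var}="your_token"'
        )
    return token
```

Calling `require_hf_token()` at the top of a script surfaces a missing token before any download starts.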

### "Model download fails"
- Check internet connection
- Verify HuggingFace token is valid
- Ensure at least 2 GB of free disk space
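
The disk-space check can be scripted with the standard library. This is a minimal sketch: `has_free_space` is a hypothetical helper, and the path to check is an assumption. Hugging Face libraries typically cache under `~/.cache/huggingface` unless configured otherwise, so pass whichever directory your setup actually uses.

```python
import shutil
from pathlib import Path


def has_free_space(path: str = ".", required_gb: float = 2.0) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb free.

    Hypothetical helper for illustration; `path` must already exist.
    """
    free_bytes = shutil.disk_usage(Path(path).expanduser()).free
    return free_bytes >= required_gb * 1024**3


if not has_free_space(".", required_gb=2.0):
    print("Less than 2 GB free: the model download may fail.")
```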

### "Out of memory"
- Reduce batch size
- Use CPU instead of GPU
- Close other GPU applications
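
One way to reduce the batch size without touching the embedding function is to split the input yourself and embed one small chunk at a time. The sketch below assumes only that the embedding function is an async callable taking a list of strings, as in the Quick Validation snippet. `embed_in_batches` is a hypothetical helper, not a LightRAG API.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], Awaitable[Sequence]],
    batch_size: int = 8,
) -> list:
    """Embed texts in chunks of batch_size and return one row per input text."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rows: list = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = await embed_fn(batch)  # one call per small batch
        rows.extend(vectors)  # works for lists and for 2-D NumPy arrays
    return rows
```

With the embedding function from this guide, usage would look like `rows = await embed_in_batches(texts, vietnamese_embed, batch_size=4)`. Smaller batches lower peak memory at the cost of more calls.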

### "Slow embedding"
- Install CUDA-enabled PyTorch
- Check GPU is being used (see logs)
- Lower `max_token_size` if your texts are short
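
A quick way to confirm that a CUDA-enabled PyTorch build is installed is to ask PyTorch directly. The sketch below only reports what PyTorch can see. Whether the embedding function actually runs on that device is a separate question, so the debug logs remain the authoritative check. `detect_torch_device` is a hypothetical helper name.

```python
def detect_torch_device() -> str:
    """Return "cuda" or "cpu" as reported by PyTorch.

    Returns "unavailable" if PyTorch cannot be imported.
    """
    try:
        import torch
    except ImportError:
        return "unavailable"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"PyTorch device: {detect_torch_device()}")
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed PyTorch build is probably CPU-only.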

## 💡 Tips

1. **First run is slow:** The model (~2 GB) is downloaded once and cached afterward
2. **Use GPU:** 10-50x faster than CPU
3. **Batch requests:** Process multiple texts together in one call
4. **Enable debug logs:** Shows which device is being used
   ```python
   from lightrag.utils import setup_logger
   setup_logger("lightrag", level="DEBUG")
   ```

## 🔗 Links

- **Model:** [AITeamVN/Vietnamese_Embedding](https://huggingface.co/AITeamVN/Vietnamese_Embedding)
- **Base Model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **LightRAG:** [GitHub](https://github.com/HKUDS/LightRAG)

## 📞 Support

- **Issues:** [GitHub Issues](https://github.com/HKUDS/LightRAG/issues)
- **Tag:** Use `vietnamese-embedding` label
- **Model Issues:** [Vietnamese_Embedding page](https://huggingface.co/AITeamVN/Vietnamese_Embedding)

## ✅ Quick Validation

Run this to test your setup:
```bash
python -c "
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embed
async def test():
    result = await vietnamese_embed(['Test text'])
    print(f'✓ Success! Shape: {result.shape}')
asyncio.run(test())
"
```

Expected output: `✓ Success! Shape: (1, 1024)`
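
For automated checks (for example in CI), the same validation can be expressed as an assertion rather than a printed message. This is a sketch under the assumption that the result exposes a `.shape` attribute, as the snippet above implies. `check_embedding_shape` is a hypothetical helper.

```python
def check_embedding_shape(embeddings, num_texts: int, dim: int = 1024) -> None:
    """Raise ValueError unless embeddings.shape equals (num_texts, dim)."""
    shape = tuple(getattr(embeddings, "shape", ()))
    expected = (num_texts, dim)
    if shape != expected:
        raise ValueError(f"expected shape {expected}, got {shape}")
```

After `result = await vietnamese_embed(texts)`, calling `check_embedding_shape(result, len(texts))` would raise if the row count or the dimension is off.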