# Vietnamese Embedding Integration for LightRAG

This integration adds support for the **AITeamVN/Vietnamese_Embedding** model to LightRAG, enabling enhanced retrieval capabilities for Vietnamese text.

## Model Information

- **Model**: [AITeamVN/Vietnamese_Embedding](https://huggingface.co/AITeamVN/Vietnamese_Embedding)
- **Base Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **Type**: Sentence Transformer
- **Maximum Sequence Length**: 2048 tokens
- **Output Dimensionality**: 1024 dimensions
- **Similarity Function**: Dot product similarity
- **Language**: Vietnamese (also supports other languages, since it is based on BGE-M3)
- **Training Data**: ~300,000 Vietnamese triplets, each consisting of a query, a positive document, and a negative document

## Features

- ✅ **High-quality Vietnamese embeddings** - Fine-tuned specifically for Vietnamese text retrieval
- ✅ **Multilingual support** - Inherits multilingual capabilities from BGE-M3
- ✅ **Long context support** - Handles up to 2048 tokens per input
- ✅ **Efficient processing** - Automatic device detection (CUDA/MPS/CPU)
- ✅ **Normalized embeddings** - Ready for dot product similarity
- ✅ **Easy integration** - Drop-in replacement for other embedding functions

## Installation

### 1. Install LightRAG

```bash
cd LightRAG
pip install -e .
```

### 2. Install Required Dependencies

The Vietnamese embedding integration requires:

- `transformers`
- `torch`
- `numpy`

All three are installed automatically via `pipmaster` the first time you use the Vietnamese embedding function.

### 3. Set Up HuggingFace Token

You need a HuggingFace token to access the model:

```bash
export HUGGINGFACE_API_KEY="your_hf_token_here"
# or
export HF_TOKEN="your_hf_token_here"
```

Get your token from: https://huggingface.co/settings/tokens

## Quick Start

### Simple Example

```python
import os
import asyncio

from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc

WORKING_DIR = "./vietnamese_rag_storage"


async def main():
    # Get the HuggingFace token (either variable name works)
    hf_token = os.environ.get("HUGGINGFACE_API_KEY") or os.environ.get("HF_TOKEN")

    # Initialize LightRAG with Vietnamese embedding
    rag = LightRAG(
        working_dir=WORKING_DIR,
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=lambda texts: vietnamese_embed(
                texts,
                model_name="AITeamVN/Vietnamese_Embedding",
                token=hf_token,
            ),
        ),
    )

    # Initialize storage and pipeline
    await rag.initialize_storages()
    await initialize_pipeline_status()

    # Insert Vietnamese text
    await rag.ainsert("Việt Nam là một quốc gia nằm ở Đông Nam Á.")

    # Query
    result = await rag.aquery(
        "Việt Nam ở đâu?",
        param=QueryParam(mode="hybrid"),
    )
    print(result)

    await rag.finalize_storages()


if __name__ == "__main__":
    asyncio.run(main())
```

### Using with `.env` File

Create a `.env` file in your project directory:

```env
# HuggingFace Token for Vietnamese Embedding
HUGGINGFACE_API_KEY=your_key_here

# LLM Configuration
OPENAI_API_KEY=your_openai_key_here
LLM_BINDING=openai
LLM_MODEL=gpt-4o-mini

# Embedding Configuration
EMBEDDING_MODEL=AITeamVN/Vietnamese_Embedding
EMBEDDING_DIM=1024
```

## Example Scripts

We provide example scripts demonstrating different use cases:

### 1. Simple Example

```bash
python examples/lightrag_vietnamese_embedding_simple.py
```

A minimal example showing basic Vietnamese text processing.

### 2. Comprehensive Demo

```bash
python examples/vietnamese_embedding_demo.py
```

A comprehensive demo including:

- Vietnamese text processing
- English text processing (multilingual support)
- Mixed language processing
- Multiple query examples

## API Reference

### `vietnamese_embed()`

Generate embeddings for texts using the Vietnamese Embedding model.

```python
async def vietnamese_embed(
    texts: list[str],
    model_name: str = "AITeamVN/Vietnamese_Embedding",
    token: str | None = None,
) -> np.ndarray
```

**Parameters:**

- `texts` (list[str]): List of texts to embed
- `model_name` (str): HuggingFace model identifier
- `token` (str, optional): HuggingFace API token (read from the environment if not provided)

**Returns:**

- `np.ndarray`: Array of embeddings with shape `(len(texts), 1024)`

**Example:**

```python
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embed

texts = ["Xin chào", "Hello", "你好"]
# vietnamese_embed is a coroutine, so run it in an event loop
embeddings = asyncio.run(vietnamese_embed(texts))
print(embeddings.shape)  # (3, 1024)
```

### `vietnamese_embedding_func()`

Convenience wrapper that automatically reads the token from the environment.

```python
async def vietnamese_embedding_func(texts: list[str]) -> np.ndarray
```

**Example:**

```python
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embedding_func

# Token automatically read from HUGGINGFACE_API_KEY or HF_TOKEN
embeddings = asyncio.run(vietnamese_embedding_func(["Xin chào"]))
```

## Advanced Usage

### Custom Model Configuration

```python
import asyncio
import os

from lightrag.llm.vietnamese_embed import vietnamese_embed

# Use a different model based on BGE-M3
embeddings = asyncio.run(
    vietnamese_embed(
        texts=["Sample text"],
        model_name="BAAI/bge-m3",  # Use the base model
        token=os.environ.get("HF_TOKEN"),
    )
)
```

### Device Selection

The model automatically detects and uses the best available device, in this order:

1. CUDA (if available)
2. MPS (for Apple Silicon)
3. CPU (fallback)

You can check which device is being used by enabling debug logging:

```python
from lightrag.utils import setup_logger

setup_logger("lightrag", level="DEBUG")
```

### Batch Processing

The embedding function accepts a list of texts, so a large batch can be embedded in a single call:

```python
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embed

# Process multiple texts in one call
large_batch = [f"Text {i}" for i in range(1, 1001)]
embeddings = asyncio.run(vietnamese_embed(large_batch))
```

## Integration with LightRAG Server

To use Vietnamese embedding with LightRAG Server, update your `.env` file:

```env
# Vietnamese Embedding Configuration
EMBEDDING_MODEL=AITeamVN/Vietnamese_Embedding
EMBEDDING_DIM=1024
HUGGINGFACE_API_KEY=your_hf_token

# Or use custom binding
EMBEDDING_BINDING=huggingface
```

Then start the server:

```bash
lightrag-server
```

## Performance Considerations

### Memory Requirements

- **GPU Memory**: ~2-4 GB for the model
- **RAM**: ~4-8 GB recommended
- **Disk Space**: ~2 GB for model weights (cached after first download)

### Speed

On a typical GPU:

- ~1000 texts/second for short texts (< 512 tokens)
- ~200-400 texts/second for longer texts (1024-2048 tokens)

### Optimization Tips

1. **Use GPU**: Significantly faster than CPU (10-50x)
2. **Batch Requests**: Process multiple texts together
3. **Cache Model**: The first run downloads the model; subsequent runs are faster
4. **Adjust `max_token_size`**: Use a smaller `max_token_size` if your texts are shorter

```python
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.utils import EmbeddingFunc

# Example: optimize for shorter texts
embedding_func = EmbeddingFunc(
    embedding_dim=1024,
    max_token_size=512,  # Reduce if texts are shorter
    func=lambda texts: vietnamese_embed(texts),
)
```

## Troubleshooting

### Issue: "No HuggingFace token found"

**Solution:** Set the environment variable:

```bash
export HUGGINGFACE_API_KEY="your_token"
# or
export HF_TOKEN="your_token"
```

### Issue: "Model download fails"

**Solution:**

1. Check your internet connection
2. Verify your HuggingFace token is valid
3. Ensure you have enough disk space (~2 GB)

### Issue: "Out of memory error"

**Solution:**

1. Reduce batch size
2. Use CPU instead of GPU (slower, but avoids GPU memory limits)
3. Close other applications using GPU/RAM

### Issue: "Slow embedding generation"

**Solution:**

1. Ensure you're using GPU (check logs for device info)
2. Install CUDA-enabled PyTorch: `pip install torch --index-url https://download.pytorch.org/whl/cu118`
3. Reduce `max_token_size` if your texts are shorter

## Comparison with Other Embedding Models

| Model | Dimensions | Max Tokens | Languages | Fine-tuned for Vietnamese |
|-------|------------|------------|-----------|--------------------------|
| Vietnamese_Embedding | 1024 | 2048 | Multilingual | ✅ Yes |
| BGE-M3 | 1024 | 8192 | Multilingual | ❌ No |
| text-embedding-3-large | 3072 | 8191 | Multilingual | ❌ No |
| text-embedding-3-small | 1536 | 8191 | Multilingual | ❌ No |

## Citation

If you use the Vietnamese Embedding model in your research, please cite:

```bibtex
@misc{vietnamese_embedding_2024,
  title={Vietnamese Embedding: Fine-tuned BGE-M3 for Vietnamese Retrieval},
  author={AITeamVN},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AITeamVN/Vietnamese_Embedding}
}
```

## Support

For issues specific to the Vietnamese embedding integration:

- Open an issue on [LightRAG GitHub](https://github.com/HKUDS/LightRAG/issues)
- Tag it with the `vietnamese-embedding` label

For issues with the model itself:

- Visit [AITeamVN/Vietnamese_Embedding](https://huggingface.co/AITeamVN/Vietnamese_Embedding)

## License

This integration follows LightRAG's license. The Vietnamese_Embedding model may have its own license terms; please check the [model page](https://huggingface.co/AITeamVN/Vietnamese_Embedding) for details.

## Acknowledgments

- **AITeamVN** for training and releasing the Vietnamese_Embedding model
- **BAAI** for the base BGE-M3 model
- **LightRAG team** for the excellent RAG framework
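
## Appendix: Dot Product Ranking Sketch

The model information above lists dot product as the similarity function and describes the embeddings as normalized. The sketch below illustrates what that means for ranking: with unit-length vectors, the dot product equals cosine similarity, so documents can be ordered by a single matrix-vector product. The helper names `l2_normalize` and `rank_by_dot_product` are hypothetical and are not part of LightRAG, and the toy 2-dimensional vectors stand in for real 1024-dimensional embeddings.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length (hypothetical helper, not a LightRAG API)."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return vectors / np.clip(norms, 1e-12, None)


def rank_by_dot_product(query_vec, doc_vecs):
    """Return document indices from most to least similar, with their scores."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    doc_vecs = np.asarray(doc_vecs, dtype=np.float32)
    scores = doc_vecs @ query_vec  # one dot product per document
    order = np.argsort(-scores, kind="stable")
    return order, scores[order]


# Toy stand-ins for embeddings returned by an embedding function
docs = l2_normalize([[2.0, 0.0], [0.0, 3.0], [3.0, 4.0]])
query = l2_normalize([[5.0, 0.0]])[0]

order, scores = rank_by_dot_product(query, docs)
print(order.tolist())  # [0, 2, 1]
```

If the embeddings are already normalized, as the Features section states, the `l2_normalize` step is redundant and is shown only to keep the toy example self-contained.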