LightRAG/docs/VietnameseEmbedding_QuickRef.md
2025-10-25 16:09:06 +07:00

Vietnamese Embedding - Quick Reference

🚀 Quick Setup (5 minutes)

1. Install & Configure

# Navigate to LightRAG directory
cd LightRAG

# Install in editable mode
pip install -e .

# Set your HuggingFace token
export HUGGINGFACE_API_KEY="your_hf_token"

# Set your OpenAI key (or other LLM provider)
export OPENAI_API_KEY="your_openai_key"

2. Run Example

python examples/lightrag_vietnamese_embedding_simple.py

📝 Minimal Code Example

import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc

async def main():
    rag = LightRAG(
        working_dir="./vietnamese_rag",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=vietnamese_embed
        )
    )
    
    await rag.initialize_storages()
    await initialize_pipeline_status()
    
    # Insert Vietnamese text
    await rag.ainsert("Việt Nam là một quốc gia ở Đông Nam Á.")
    
    # Query
    result = await rag.aquery("Việt Nam ở đâu?", param=QueryParam(mode="hybrid"))
    print(result)
    
    await rag.finalize_storages()

asyncio.run(main())

🔧 Key Parameters

| Parameter | Value | Description |
|---|---|---|
| embedding_dim | 1024 | Output dimensions |
| max_token_size | 2048 | Maximum tokens per input |
| model_name | AITeamVN/Vietnamese_Embedding | HuggingFace model ID |
| token | Your HF token | HuggingFace API token |

🌐 Supported Languages

Vietnamese (optimized)
English (inherited from BGE-M3)
Chinese (inherited from BGE-M3)
100+ other languages (multilingual support)

📊 Performance

| Metric | Value |
|---|---|
| GPU Memory | 2-4 GB |
| RAM | 4-8 GB |
| Disk Space | ~2 GB (first download) |
| Speed (GPU) | 200-1000 texts/sec |
| Speed (CPU) | 20-100 texts/sec |
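
These figures depend on hardware and text length, so it can be worth measuring throughput locally. The following is a minimal sketch, assuming an async embedding callable that takes a list of strings and returns one vector per text. `measure_throughput` and `fake_embed` are hypothetical names introduced here; `fake_embed` stands in for `vietnamese_embed` so the snippet runs without downloading the model.

```python
import asyncio
import time


async def measure_throughput(embed, texts):
    """Return texts embedded per second for an async `embed(list_of_str)` callable."""
    start = time.perf_counter()
    vectors = await embed(texts)
    elapsed = time.perf_counter() - start
    assert len(vectors) == len(texts), "expected one vector per input text"
    return len(texts) / elapsed if elapsed > 0 else float("inf")


async def fake_embed(texts):
    # Hypothetical stand-in for vietnamese_embed so the sketch runs without the model.
    await asyncio.sleep(0.01)
    return [[0.0] * 1024 for _ in texts]


sample = ["Việt Nam là một quốc gia ở Đông Nam Á."] * 100
rate = asyncio.run(measure_throughput(fake_embed, sample))
print(f"{rate:.0f} texts/sec")
```

To measure the real model, pass `vietnamese_embed` in place of `fake_embed` and discard the first run, which includes model loading.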

📚 Resources

Documentation

  • English: docs/VietnameseEmbedding.md
  • Tiếng Việt: docs/VietnameseEmbedding_VI.md

Examples

  • Simple: examples/lightrag_vietnamese_embedding_simple.py
  • Comprehensive: examples/vietnamese_embedding_demo.py

Testing

python tests/test_vietnamese_embedding_integration.py

🐛 Common Issues

"No HuggingFace token found"

export HUGGINGFACE_API_KEY="your_token"
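
If you would rather fail fast with a clear message before any model loading starts, a small startup check can help. This is a sketch only: `require_hf_token` is a hypothetical helper, not part of LightRAG, and it assumes the token is read from the `HUGGINGFACE_API_KEY` environment variable as in the setup step.

```python
import os


def require_hf_token(env=None):
    """Return the HuggingFace token, or raise a descriptive error if it is unset."""
    env = os.environ if env is None else env
    token = (env.get("HUGGINGFACE_API_KEY") or "").strip()
    if not token:
        raise RuntimeError(
            "HUGGINGFACE_API_KEY is not set. "
            'Run: export HUGGINGFACE_API_KEY="your_token"'
        )
    return token
```

Calling `require_hf_token()` at the top of `main()` surfaces a missing or empty token immediately.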

"Model download fails"

  • Check internet connection
  • Verify HuggingFace token is valid
  • Ensure 2GB+ free disk space
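
The disk space check can be scripted with the standard library. A minimal sketch, assuming the model cache sits on the same filesystem as your home directory; `has_free_space` and `REQUIRED_BYTES` are hypothetical names, and the 2 GB threshold is taken from the figure above.

```python
import os
import shutil

REQUIRED_BYTES = 2 * 1024**3  # about 2 GB, the download size quoted above


def has_free_space(path=None, required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required` bytes free."""
    path = os.path.expanduser("~") if path is None else path
    return shutil.disk_usage(path).free >= required


print("Enough disk space:", has_free_space())
```

If your cache lives elsewhere, pass that directory as `path`.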

"Out of memory"

  • Reduce batch size
  • Use CPU instead of GPU
  • Close other GPU applications
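
One way to reduce the batch size is to split the input yourself and embed a few texts at a time. The sketch below assumes an async embedding callable that accepts a list of strings and returns one vector per text; `chunked` and `embed_in_batches` are hypothetical helpers, not LightRAG APIs.

```python
import asyncio


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]


async def embed_in_batches(embed, texts, batch_size=8):
    """Embed `texts` a few at a time so each call sees at most `batch_size` texts."""
    vectors = []
    for batch in chunked(texts, batch_size):
        # Batches run one after another, so only one batch is in flight at a time.
        vectors.extend(await embed(batch))
    return vectors
```

For example, `await embed_in_batches(vietnamese_embed, texts, batch_size=4)` would trade some speed for a lower memory peak. Results come back in input order.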

"Slow embedding"

  • Install CUDA-enabled PyTorch
  • Check GPU is being used (see logs)
  • Reduce max_token_size for shorter texts
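
A quick way to confirm that PyTorch can see a GPU at all is sketched below. It only reports what PyTorch detects, not which device the embedding function actually selected, so the debug logs remain the authoritative check. `detect_device` is a hypothetical helper name.

```python
def detect_device():
    """Return 'cuda' if PyTorch sees a GPU, 'cpu' if not, or None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Detected device:", detect_device())
```

A result of `cpu` on a machine with an NVIDIA GPU usually points to a CPU-only PyTorch build.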

💡 Tips

  1. First run is slow: the model (~2 GB) downloads once and is cached afterward
  2. Use GPU: 10-50x faster than CPU
  3. Batch requests: process multiple texts together
  4. Enable debug logs: shows which device is being used
    from lightrag.utils import setup_logger
    setup_logger("lightrag", level="DEBUG")
    

📞 Support

Quick Validation

Run this to test your setup:

python -c "
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embed
async def test():
    result = await vietnamese_embed(['Test text'])
    print(f'✓ Success! Shape: {result.shape}')
asyncio.run(test())
"

Expected output: ✓ Success! Shape: (1, 1024)