kimhoang0511 686cb97c19 add embedding vn

2025-10-25 16:09:06 +07:00

4 KiB

Vietnamese Embedding - Quick Reference

🚀 Quick Setup (5 minutes)

1. Install & Configure

# Navigate to LightRAG directory
cd LightRAG

# Install in editable mode
pip install -e .

# Set your HuggingFace token
export HUGGINGFACE_API_KEY="your_hf_token"

# Set your OpenAI key (or other LLM provider)
export OPENAI_API_KEY="your_openai_key"
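
Before running the example, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch, not part of LightRAG: the helper name `missing_env_vars` is hypothetical, and `OPENAI_API_KEY` only matters if you use OpenAI as the LLM provider.

```python
import os

# Variable names taken from the setup step above.
REQUIRED_VARS = ("HUGGINGFACE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or blank in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

If you use a different LLM provider, pass your own tuple of names as `required`.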

2. Run Example

python examples/lightrag_vietnamese_embedding_simple.py

📝 Minimal Code Example

import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc

async def main():
    rag = LightRAG(
        working_dir="./vietnamese_rag",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=vietnamese_embed
        )
    )
    
    await rag.initialize_storages()
    await initialize_pipeline_status()
    
    # Insert Vietnamese text ("Vietnam is a country in Southeast Asia.")
    await rag.ainsert("Việt Nam là một quốc gia ở Đông Nam Á.")
    
    # Query in hybrid mode ("Where is Vietnam?")
    result = await rag.aquery("Việt Nam ở đâu?", param=QueryParam(mode="hybrid"))
    print(result)
    
    await rag.finalize_storages()

asyncio.run(main())

🔧 Key Parameters

| Parameter | Value | Description |
|---|---|---|
| `embedding_dim` | 1024 | Output dimensions |
| `max_token_size` | 2048 | Maximum tokens per input |
| `model_name` | AITeamVN/Vietnamese_Embedding | HuggingFace model ID |
| `token` | Your HF token | HuggingFace API token |

🌐 Supported Languages

✅ Vietnamese (optimized)
✅ English (inherited from BGE-M3)
✅ Chinese (inherited from BGE-M3)
✅ 100+ other languages (multilingual support)

📊 Performance

| Metric | Value |
|---|---|
| GPU Memory | 2-4 GB |
| RAM | 4-8 GB |
| Disk Space | ~2 GB (first download) |
| Speed (GPU) | 200-1000 texts/sec |
| Speed (CPU) | 20-100 texts/sec |
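
To turn the speed figures above into a rough wall-clock estimate, divide the number of texts by the throughput. This is a back-of-the-envelope sketch only: the function name is hypothetical, and real throughput depends on text length, batch size, and hardware.

```python
def embedding_time_range(num_texts, slow_rate, fast_rate):
    """Return (best_case_seconds, worst_case_seconds) for embedding num_texts.

    slow_rate and fast_rate are throughputs in texts per second.
    """
    if num_texts < 0 or slow_rate <= 0 or fast_rate < slow_rate:
        raise ValueError("expected num_texts >= 0 and 0 < slow_rate <= fast_rate")
    return num_texts / fast_rate, num_texts / slow_rate


# Throughput ranges copied from the Performance figures (texts per second).
GPU_RATE = (200, 1000)
CPU_RATE = (20, 100)

if __name__ == "__main__":
    for label, (slow, fast) in (("GPU", GPU_RATE), ("CPU", CPU_RATE)):
        best, worst = embedding_time_range(100_000, slow, fast)
        print(f"{label}: {best:.0f}-{worst:.0f} s for 100,000 texts")
```

For 100,000 texts this gives about 100-500 seconds on GPU and 1000-5000 seconds on CPU.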

📚 Resources

Documentation

English: docs/VietnameseEmbedding.md
Vietnamese (Tiếng Việt): docs/VietnameseEmbedding_VI.md

Examples

Simple: examples/lightrag_vietnamese_embedding_simple.py
Comprehensive: examples/vietnamese_embedding_demo.py

Testing

python tests/test_vietnamese_embedding_integration.py

🐛 Common Issues

"No HuggingFace token found"

export HUGGINGFACE_API_KEY="your_token"

"Model download fails"

Check your internet connection
Verify that your HuggingFace token is valid
Ensure at least 2 GB of free disk space
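
The disk-space check can be scripted with the standard library. This is a minimal sketch: the helper name is hypothetical, the 2 GB figure is read as 2 GiB, and it assumes the model is cached somewhere under your home directory (the usual HuggingFace default); pass a different path if your cache lives elsewhere.

```python
import os
import shutil

# The ~2 GB model size quoted in this guide, read here as 2 GiB.
REQUIRED_BYTES = 2 * 1024**3


def has_free_space(path=None, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free bytes."""
    if path is None:
        # Assumption: the download cache sits on the same disk as the home directory.
        path = os.path.expanduser("~")
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    if has_free_space():
        print("Enough free disk space for the model download.")
    else:
        print("Less than 2 GiB free; the model download may fail.")
```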

"Out of memory"

Reduce batch size
Use CPU instead of GPU
Close other GPU applications
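
One way to reduce the batch size from the caller's side is to send texts in small chunks rather than in a single call. The sketch below is an assumption-laden illustration: `chunked`, `embed_in_batches`, and `fake_embed` are hypothetical names, and whether `vietnamese_embed` already batches internally is not stated here. Any async function that takes a list of texts and returns one vector per text can be passed as `embed`.

```python
import asyncio


def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def embed_in_batches(embed, texts, batch_size=8):
    """Call `embed` on one batch at a time and collect the vectors in order."""
    rows = []
    for batch in chunked(texts, batch_size):
        result = await embed(batch)
        # Iterating a 2-D array (or a list of lists) yields one row per text.
        rows.extend(result)
    return rows


async def fake_embed(batch):
    """Stand-in for a real embedding function; returns a 1-D vector per text."""
    return [[float(len(text))] for text in batch]


if __name__ == "__main__":
    texts = ["a", "bb", "ccc", "dddd", "eeeee"]
    rows = asyncio.run(embed_in_batches(fake_embed, texts, batch_size=2))
    print(len(rows), "vectors")
```

Smaller batches lower peak memory at the cost of more calls, so start with a moderate size and reduce it only if the out-of-memory error persists.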

"Slow embedding"

Install CUDA-enabled PyTorch
Check that the GPU is being used (see the debug logs)
Lower max_token_size if your texts are short
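
A quick way to see whether PyTorch can reach a GPU at all, independent of LightRAG, is to ask it directly. This is a small sketch with a hypothetical helper name; it only reports what PyTorch detects and does not show which device the embedding function ends up using.

```python
import importlib.util


def detect_device():
    """Return 'cuda' if PyTorch sees a CUDA GPU, 'cpu' if it does not,
    or 'no-torch' if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch  # imported lazily so the check works without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Detected device:", detect_device())
```

If this reports `cpu` on a machine with an NVIDIA GPU, the installed PyTorch build is likely CPU-only.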

💡 Tips

First run is slow: the model (~2 GB) is downloaded once and cached afterward
Use GPU: 10-50x faster than CPU
Batch requests: Process multiple texts together

Enable debug logs to see which device is being used:

from lightrag.utils import setup_logger
setup_logger("lightrag", level="DEBUG")

🔗 Links

Model: AITeamVN/Vietnamese_Embedding
Base Model: BAAI/bge-m3
LightRAG: GitHub

📞 Support

Issues: GitHub Issues
Tag: Use vietnamese-embedding label
Model Issues: Vietnamese_Embedding page

✅ Quick Validation

Run this to test your setup:

python -c "
import asyncio
from lightrag.llm.vietnamese_embed import vietnamese_embed
async def test():
    result = await vietnamese_embed(['Test text'])
    print(f'✓ Success! Shape: {result.shape}')
asyncio.run(test())
"

Expected output: ✓ Success! Shape: (1, 1024)
