
Vietnamese Embedding Integration for LightRAG

This integration adds support for the AITeamVN/Vietnamese_Embedding model to LightRAG, enabling enhanced retrieval capabilities for Vietnamese text.

Model Information

  • Model: AITeamVN/Vietnamese_Embedding
  • Base Model: BAAI/bge-m3
  • Type: Sentence Transformer
  • Maximum Sequence Length: 2048 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Dot product similarity
  • Language: Vietnamese (also supports other languages as it's based on BGE-M3)
  • Training Data: ~300,000 triplets of queries, positive documents, and negative documents for Vietnamese
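Because the model emits L2-normalized vectors, relevance scoring reduces to a plain dot product. A minimal sketch of this convention using synthetic vectors (not actual model output):

```python
import numpy as np

# Synthetic stand-ins for two 1024-dim embeddings; the real model
# returns vectors already normalized to unit length.
rng = np.random.default_rng(0)
a = rng.normal(size=1024)
b = a + 0.1 * rng.normal(size=1024)  # a nearby (similar) vector
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit-length vectors, dot product equals cosine similarity.
score = float(a @ b)
print(score)  # close to 1.0 for similar vectors
```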

Features

  • High-quality Vietnamese embeddings - Fine-tuned specifically for Vietnamese text retrieval
  • Multilingual support - Inherits multilingual capabilities from BGE-M3
  • Long context support - Handles up to 2048 tokens per input
  • Efficient processing - Automatic device detection (CUDA/MPS/CPU)
  • Normalized embeddings - Ready for dot product similarity
  • Easy integration - Drop-in replacement for other embedding functions

Installation

1. Install LightRAG

cd LightRAG
pip install -e .

2. Install Required Dependencies

The Vietnamese embedding integration requires:

  • transformers (automatically installed)
  • torch (automatically installed)
  • numpy (automatically installed)

These will be automatically installed via pipmaster when you first use the Vietnamese embedding function.

3. Set Up HuggingFace Token

You need a HuggingFace token to access the model:

export HUGGINGFACE_API_KEY="your_hf_token_here"
# or
export HF_TOKEN="your_hf_token_here"

Get your token from: https://huggingface.co/settings/tokens

Quick Start

Simple Example

import os
import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.llm.vietnamese_embed import vietnamese_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import EmbeddingFunc

WORKING_DIR = "./vietnamese_rag_storage"

async def main():
    # Get HuggingFace token
    hf_token = os.environ.get("HUGGINGFACE_API_KEY")
    
    # Initialize LightRAG with Vietnamese embedding
    rag = LightRAG(
        working_dir=WORKING_DIR,
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=EmbeddingFunc(
            embedding_dim=1024,
            max_token_size=2048,
            func=lambda texts: vietnamese_embed(
                texts,
                model_name="AITeamVN/Vietnamese_Embedding",
                token=hf_token
            )
        ),
    )
    
    # Initialize storage and pipeline
    await rag.initialize_storages()
    await initialize_pipeline_status()
    
    # Insert Vietnamese text
    await rag.ainsert("Việt Nam là một quốc gia nằm ở Đông Nam Á.")
    
    # Query
    result = await rag.aquery(
        "Việt Nam ở đâu?",
        param=QueryParam(mode="hybrid")
    )
    print(result)
    
    await rag.finalize_storages()

if __name__ == "__main__":
    asyncio.run(main())

Using with .env File

Create a .env file in your project directory:

# HuggingFace Token for Vietnamese Embedding
HUGGINGFACE_API_KEY=your_key_here

# LLM Configuration
OPENAI_API_KEY=your_openai_key_here
LLM_BINDING=openai
LLM_MODEL=gpt-4o-mini

# Embedding Configuration
EMBEDDING_MODEL=AITeamVN/Vietnamese_Embedding
EMBEDDING_DIM=1024

Example Scripts

We provide several example scripts demonstrating different use cases:

1. Simple Example

python examples/lightrag_vietnamese_embedding_simple.py

A minimal example showing basic Vietnamese text processing.

2. Comprehensive Demo

python examples/vietnamese_embedding_demo.py

A comprehensive demo including:

  • Vietnamese text processing
  • English text processing (multilingual support)
  • Mixed language processing
  • Multiple query examples

API Reference

vietnamese_embed()

Generate embeddings for texts using the Vietnamese Embedding model.

async def vietnamese_embed(
    texts: list[str],
    model_name: str = "AITeamVN/Vietnamese_Embedding",
    token: str | None = None,
) -> np.ndarray

Parameters:

  • texts (list[str]): List of texts to embed
  • model_name (str): HuggingFace model identifier
  • token (str, optional): HuggingFace API token (reads from env if not provided)

Returns:

  • np.ndarray: Array of embeddings with shape (len(texts), 1024)

Example:

from lightrag.llm.vietnamese_embed import vietnamese_embed

texts = ["Xin chào", "Hello", "你好"]
embeddings = await vietnamese_embed(texts)
print(embeddings.shape)  # (3, 1024)

vietnamese_embedding_func()

Convenience wrapper that automatically reads token from environment.

async def vietnamese_embedding_func(texts: list[str]) -> np.ndarray

Example:

from lightrag.llm.vietnamese_embed import vietnamese_embedding_func

# Token automatically read from HUGGINGFACE_API_KEY or HF_TOKEN
embeddings = await vietnamese_embedding_func(["Xin chào"])

Advanced Usage

Custom Model Configuration

from lightrag.llm.vietnamese_embed import vietnamese_embed

# Use a different model based on BGE-M3
embeddings = await vietnamese_embed(
    texts=["Sample text"],
    model_name="BAAI/bge-m3",  # Use base model
    token=your_token
)

Device Selection

The model automatically detects and uses the best available device:

  1. CUDA (if available)
  2. MPS (for Apple Silicon)
  3. CPU (fallback)

You can check which device is being used by enabling debug logging:

from lightrag.utils import setup_logger

setup_logger("lightrag", level="DEBUG")
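The fallback order above can be sketched as a simple priority check (`pick_device` is a hypothetical helper; the integration performs an equivalent check internally):

```python
# Sketch of the CUDA -> MPS -> CPU fallback order.
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

# With PyTorch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(pick_device(False, True))
```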

Batch Processing

The embedding function supports efficient batch processing:

# Process multiple texts efficiently in one call
large_batch = [f"Văn bản số {i}" for i in range(1000)]
embeddings = await vietnamese_embed(large_batch)
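For very large inputs, splitting the batch client-side bounds peak memory. A sketch, where `batch_size` and `embed_in_chunks` are illustrative (not parameters of vietnamese_embed), with a stub embedder standing in for the real model:

```python
import asyncio
import numpy as np

# Split texts into sub-batches, embed each, and stack the results.
async def embed_in_chunks(texts, embed_fn, batch_size=64):
    parts = []
    for start in range(0, len(texts), batch_size):
        parts.append(await embed_fn(texts[start:start + batch_size]))
    return np.concatenate(parts, axis=0)

# Stub standing in for vietnamese_embed, which returns an array
# of shape (len(texts), 1024).
async def fake_embed(texts):
    return np.zeros((len(texts), 1024))

out = asyncio.run(embed_in_chunks([f"text {i}" for i in range(150)], fake_embed))
print(out.shape)
```

In real use, pass `vietnamese_embed` (or a `functools.partial` with your token) in place of the stub.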

Integration with LightRAG Server

To use Vietnamese embedding with LightRAG Server, update your .env file:

# Vietnamese Embedding Configuration
EMBEDDING_MODEL=AITeamVN/Vietnamese_Embedding
EMBEDDING_DIM=1024
HUGGINGFACE_API_KEY=your_hf_token

# Or use custom binding
EMBEDDING_BINDING=huggingface

Then start the server:

lightrag-server

Performance Considerations

Memory Requirements

  • GPU Memory: ~2-4 GB for the model
  • RAM: ~4-8 GB recommended
  • Disk Space: ~2 GB for model weights (cached after first download)

Speed

On a typical GPU:

  • ~1000 texts/second for short texts (< 512 tokens)
  • ~200-400 texts/second for longer texts (1024-2048 tokens)

Optimization Tips

  1. Use GPU: Significantly faster than CPU (10-50x)
  2. Batch Requests: Process multiple texts together
  3. Cache Model: First run downloads model; subsequent runs are faster
  4. Adjust max_token_size: Use a smaller max_token_size if your texts are shorter
# Example: Optimize for shorter texts
embedding_func=EmbeddingFunc(
    embedding_dim=1024,
    max_token_size=512,  # Reduce if texts are shorter
    func=lambda texts: vietnamese_embed(texts)
)

Troubleshooting

Issue: "No HuggingFace token found"

Solution: Set the environment variable:

export HUGGINGFACE_API_KEY="your_token"
# or
export HF_TOKEN="your_token"

Issue: "Model download fails"

Solution:

  1. Check your internet connection
  2. Verify your HuggingFace token is valid
  3. Ensure you have enough disk space (~2 GB)

Issue: "Out of memory error"

Solution:

  1. Reduce batch size
  2. Use CPU instead of GPU (slower but uses less memory)
  3. Close other applications using GPU/RAM

Issue: "Slow embedding generation"

Solution:

  1. Ensure you're using GPU (check logs for device info)
  2. Install CUDA-enabled PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cu118
  3. Reduce max_token_size if your texts are shorter

Comparison with Other Embedding Models

| Model                  | Dimensions | Max Tokens | Languages    | Fine-tuned for Vietnamese |
|------------------------|------------|------------|--------------|---------------------------|
| Vietnamese_Embedding   | 1024       | 2048       | Multilingual | Yes                       |
| BGE-M3                 | 1024       | 8192       | Multilingual | No                        |
| text-embedding-3-large | 3072       | 8191       | Multilingual | No                        |
| text-embedding-3-small | 1536       | 8191       | Multilingual | No                        |

Citation

If you use the Vietnamese Embedding model in your research, please cite:

@misc{vietnamese_embedding_2024,
  title={Vietnamese Embedding: Fine-tuned BGE-M3 for Vietnamese Retrieval},
  author={AITeamVN},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AITeamVN/Vietnamese_Embedding}
}

Support

For issues specific to the Vietnamese embedding integration, open an issue on the LightRAG GitHub repository.

For issues with the model itself, see the model page: https://huggingface.co/AITeamVN/Vietnamese_Embedding

License

This integration follows LightRAG's license. The Vietnamese_Embedding model may have its own license terms - please check the model page for details.

Acknowledgments

  • AITeamVN for training and releasing the Vietnamese_Embedding model
  • BAAI for the base BGE-M3 model
  • LightRAG team for the excellent RAG framework