Vietnamese Embedding Examples
This directory contains example scripts demonstrating how to use the AITeamVN/Vietnamese_Embedding model with LightRAG.
Available Examples
1. Simple Example: lightrag_vietnamese_embedding_simple.py
Purpose: Minimal code to get started quickly
What it does:
- Initializes LightRAG with Vietnamese embedding
- Inserts a simple Vietnamese text
- Performs a basic query
- Kept intentionally clean and easy to understand
Run:
export HUGGINGFACE_API_KEY="your_hf_token"
export OPENAI_API_KEY="your_openai_key"
python examples/lightrag_vietnamese_embedding_simple.py
Expected output:
Inserting Vietnamese text...
Query: Thủ đô của Việt Nam là gì?
Answer: [Response about Hanoi being the capital]
Code size: ~50 lines
Execution time: ~30 seconds (first run with model download: ~2 minutes)
2. Comprehensive Demo: vietnamese_embedding_demo.py
Purpose: Full-featured demonstration with multiple scenarios
What it does:
- Demo 1: Vietnamese text processing
- Inserts Vietnamese content about Vietnam
- Performs multiple queries in Vietnamese
- Demonstrates hybrid mode retrieval
- Demo 2: English text processing (multilingual support)
- Inserts English content about AI
- Queries in English
- Shows model's multilingual capabilities
- Demo 3: Mixed language processing
- Inserts mixed Vietnamese-English content
- Queries in both languages
- Demonstrates language-agnostic retrieval
Run:
export HUGGINGFACE_API_KEY="your_hf_token"
export OPENAI_API_KEY="your_openai_key"
python examples/vietnamese_embedding_demo.py
Expected output:
=============================================================
DEMO 1: Vietnamese Text Processing
=============================================================
✓ Initializing LightRAG with Vietnamese Embedding Model
✓ Text inserted successfully!
Querying in Vietnamese:
------------------------------------------------------------
❓ Query: Thủ đô của Việt Nam là gì?
💡 Answer: [Detailed response about Hanoi]
...
[Similar output for Demo 2 and Demo 3]
=============================================================
✓ All demos completed successfully!
=============================================================
Code size: ~300 lines
Execution time: ~2-5 minutes (depending on LLM speed)
Prerequisites
Required Environment Variables
# HuggingFace token (required for model access)
export HUGGINGFACE_API_KEY="hf_your_token_here"
# or
export HF_TOKEN="hf_your_token_here"
# LLM API key (using OpenAI as example)
export OPENAI_API_KEY="sk-your_key_here"
Get Your Tokens
- HuggingFace Token:
- Visit: https://huggingface.co/settings/tokens
- Create a new token with "Read" permission
- Copy and export it
- OpenAI API Key:
- Visit: https://platform.openai.com/api-keys
- Create a new key
- Copy and export it
Alternative: Use .env File
Create a .env file in the project root:
HUGGINGFACE_API_KEY=hf_your_token_here
OPENAI_API_KEY=sk-your_key_here
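Many example scripts load a .env file via the python-dotenv package when it is installed. If you would rather avoid the extra dependency, the sketch below is a minimal stdlib-only equivalent (an assumption for illustration, not the loading logic the examples necessarily use):

```python
import os

def load_dotenv_minimal(path: str = ".env") -> dict[str, str]:
    """Parse KEY=value lines from a .env file and export them via os.environ.

    A stdlib-only sketch: no quoting, multiline values, or interpolation.
    Blank lines and lines starting with '#' are skipped.
    """
    loaded: dict[str, str] = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                loaded[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # no .env file is fine; rely on exported variables instead
    os.environ.update(loaded)
    return loaded
```

Exported variables always win if you call this before reading os.environ, since update() overwrites only the keys present in the file.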
What to Expect on First Run
Model Download (First Time Only)
- Size: ~2 GB
- Time: 2-5 minutes (depending on internet speed)
- Location: Cached in ~/.cache/huggingface/
After the first run, the model is cached and loads instantly.
Resource Usage
- GPU Memory: 2-4 GB (if using GPU)
- RAM: 4-8 GB
- Disk Space: 2 GB for model + storage for RAG data
Common Issues & Solutions
Issue: "No HuggingFace token found"
Solution:
export HUGGINGFACE_API_KEY="your_token"
Issue: "Model download fails"
Possible causes:
- No internet connection
- Invalid HuggingFace token
- Insufficient disk space
Solution:
- Check internet connection
- Verify token is correct
- Ensure 2GB+ free space
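You can verify the free-space requirement programmatically before triggering a download. This stdlib sketch assumes the default HuggingFace cache location mentioned above:

```python
import shutil
from pathlib import Path

def free_space_gb(cache_dir: str = "~/.cache/huggingface") -> float:
    """Return free disk space in GiB at the model cache location.

    Falls back to the current directory if the cache does not exist yet.
    """
    path = Path(cache_dir).expanduser()
    if not path.exists():
        path = Path(".")
    return shutil.disk_usage(path).free / 1024**3

if free_space_gb() < 2:
    print("Warning: less than 2 GB free; the model download may fail.")
```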
Issue: "Out of memory error"
Solution:
- Close other applications
- Use CPU instead of GPU (slower but less memory)
- Reduce batch size (if processing many texts)
Issue: "Slow performance"
Solution:
- Install CUDA-enabled PyTorch for GPU support
- Check that the GPU is actually being used (enable DEBUG logging)
- Use GPU instead of CPU (10-50x faster)
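To confirm whether PyTorch can actually see a GPU, a quick check like this helps; it is written so the snippet degrades gracefully when torch is not installed:

```python
def describe_compute_device() -> str:
    """Report whether PyTorch sees a CUDA GPU, without requiring torch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "CPU only (install CUDA-enabled PyTorch for 10-50x faster embedding)"

print(describe_compute_device())
```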
Tips for Best Results
1. Enable Debug Logging
See what's happening under the hood:
from lightrag.utils import setup_logger
setup_logger("lightrag", level="DEBUG")
2. Use GPU for Production
Much faster than CPU:
# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118
3. Optimize for Your Use Case
Adjust parameters based on your text length:
embedding_func=EmbeddingFunc(
embedding_dim=1024,
max_token_size=1024, # Reduce if your texts are shorter
func=vietnamese_embed
)
4. Batch Processing
Process multiple texts together for efficiency:
texts = ["Text 1", "Text 2", ..., "Text N"]
await rag.ainsert(texts) # More efficient than one by one
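If you have a very large corpus, inserting it in bounded batches keeps memory use predictable. A stdlib-only chunking helper might look like this (the commented rag.ainsert usage follows the snippet above and is hypothetical):

```python
from typing import Iterator

def batched(texts: list[str], size: int) -> Iterator[list[str]]:
    """Yield successive batches of at most `size` texts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]

# Hypothetical usage with the rag object from the examples:
#   for batch in batched(all_texts, 32):
#       await rag.ainsert(batch)
```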
Understanding the Examples
Key Components
- Embedding Function:
from lightrag.llm.vietnamese_embed import vietnamese_embed
This loads the Vietnamese_Embedding model.
- LightRAG Configuration:
embedding_func=EmbeddingFunc(
embedding_dim=1024, # Vietnamese_Embedding outputs 1024 dimensions
max_token_size=2048, # Model supports up to 2048 tokens
func=vietnamese_embed # The embedding function
)
- Text Insertion:
await rag.ainsert(text) # Asynchronous insertion
# or
rag.insert(text) # Synchronous insertion
- Querying:
result = await rag.aquery(
query,
param=QueryParam(mode="hybrid") # hybrid, local, global, naive, mix
)
Query Modes
- naive: Simple vector similarity search
- local: Context-dependent retrieval
- global: Global knowledge retrieval
- hybrid: Combines local and global
- mix: Integrates knowledge graph and vector retrieval
For Vietnamese text, hybrid mode typically works best.
Modifying the Examples
Use Different LLM
Replace gpt_4o_mini_complete with your preferred LLM:
# Using Ollama
from lightrag.llm.ollama import ollama_model_complete
llm_model_func=ollama_model_complete
# Using Azure OpenAI
from lightrag.llm.azure_openai import azure_openai_complete
llm_model_func=azure_openai_complete
Use Different Embedding Model
While keeping the Vietnamese embedding:
from lightrag.llm.vietnamese_embed import vietnamese_embed
# Use different model from HuggingFace
func=lambda texts: vietnamese_embed(
texts,
model_name="BAAI/bge-m3", # Use base model
token=hf_token
)
Add Your Own Data
Replace the sample text with your own:
# Read from file
with open("your_data.txt", "r", encoding="utf-8") as f:
text = f.read()
await rag.ainsert(text)
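To index a whole folder of documents rather than a single file, a small helper works too (stdlib sketch; the commented rag.ainsert call is the one shown above and is hypothetical):

```python
from pathlib import Path

def read_text_files(directory: str, pattern: str = "*.txt") -> list[str]:
    """Read every file matching `pattern` in `directory` as UTF-8 text."""
    return [
        path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob(pattern))
    ]

# Hypothetical usage:
#   texts = read_text_files("./my_documents")
#   await rag.ainsert(texts)
```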
Next Steps
- Try the simple example first to verify setup
- Run the comprehensive demo to see all features
- Modify examples for your specific use case
- Read the documentation for advanced usage:
- English: docs/VietnameseEmbedding.md
- Vietnamese: docs/VietnameseEmbedding_VI.md
- Run the test suite to validate your environment:
python tests/test_vietnamese_embedding_integration.py
Support
- Documentation: See docs/VietnameseEmbedding.md
- Issues: https://github.com/HKUDS/LightRAG/issues
- Model: https://huggingface.co/AITeamVN/Vietnamese_Embedding
Related Examples in This Directory
While you're here, check out these other LightRAG examples:
- lightrag_openai_demo.py - Basic OpenAI integration
- lightrag_ollama_demo.py - Using Ollama models
- lightrag_hf_demo.py - HuggingFace models
- rerank_example.py - Adding reranking
- graph_visual_with_neo4j.py - Neo4j visualization
Happy coding! 🚀
For questions or feedback, please open an issue on GitHub with the vietnamese-embedding tag.