LightRAG/requirements-trilingual.txt
Claude 12ab6ebb42
Add trilingual entity extractor (Chinese/English/Swedish)
Implements high-quality entity extraction for three languages using best-in-class tools:
- Chinese: HanLP (F1 95%)
- English: spaCy (F1 90%)
- Swedish: spaCy (F1 80-85%)

**Why not GLiNER?**
Quality gap too large:
- Chinese: 95% vs 24% (-71%)
- English: 90% vs 60% (-30%)
- Swedish: 85% vs 50% (-35%)

**Key Features:**
1. Lazy loading (memory efficient)
   - Loads models on-demand
   - Only one model in memory at a time (~1.5-1.8 GB)
   - Not 4-5 GB simultaneously

2. High quality
   - Each language uses optimal tool
   - Chinese: HanLP (specialized for Chinese)
   - English/Swedish: spaCy (official support)

3. Easy to use
   - Simple API: extract(text, language='zh'/'en'/'sv')
   - Automatic model management
   - Error handling and logging

**Files Added:**
- lightrag/kg/trilingual_entity_extractor.py - Core extractor class
- requirements-trilingual.txt - Dependencies (spacy + hanlp)
- scripts/install_trilingual_models.sh - One-click installation
- scripts/test_trilingual_extractor.py - Comprehensive test suite
- docs/TrilingualNER-Usage-zh.md - Complete usage guide

**Installation:**
```bash
# Method 1: One-click install
./scripts/install_trilingual_models.sh

# Method 2: Manual install
pip install -r requirements-trilingual.txt
python -m spacy download en_core_web_trf
python -m spacy download sv_core_news_lg
# HanLP downloads automatically on first use
```

**Usage:**
```python
from lightrag.kg.trilingual_entity_extractor import TrilingualEntityExtractor

extractor = TrilingualEntityExtractor()

# Chinese
entities = extractor.extract("苹果公司由史蒂夫·乔布斯创立。", language='zh')

# English
entities = extractor.extract("Apple Inc. was founded by Steve Jobs.", language='en')

# Swedish
entities = extractor.extract("Volvo grundades i Göteborg.", language='sv')
```

**Testing:**
```bash
python scripts/test_trilingual_extractor.py
```

**Resource Requirements:**
- Disk: ~1.4 GB (440MB + 545MB + 400MB)
- Memory: ~1.5-1.8 GB per language (lazy loaded)

**Performance (CPU):**
- Chinese: ~12 docs/s
- English: ~29 docs/s
- Swedish: ~26 docs/s

Addresses user's specific needs: pure Chinese, pure English, and pure Swedish documents.
2025-11-19 17:29:00 +00:00

21 lines
643 B
Text
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 三语言实体提取器依赖
# 用于支持中文、英文、瑞典语实体提取
# spaCy - 用于英文和瑞典语
spacy>=3.7.0
# HanLP - 用于中文
hanlp>=2.1.0
# 安装说明:
# 1. 安装基础依赖
# pip install -r requirements-trilingual.txt
#
# 2. 下载 spaCy 语言模型
# python -m spacy download en_core_web_trf # 英文 Transformer 模型 (~440 MB)
# python -m spacy download sv_core_news_lg # 瑞典语大模型 (~545 MB)
#
# 3. HanLP 模型会在首次使用时自动下载 (~400 MB)
#
# 总磁盘空间需求: ~1.4 GB
# 内存占用(按需加载): ~1.5-1.8 GB同时只加载一个语言模型