LightRAG

History

Claude dd8ad7c46d Add detailed comparison: HanLP vs GLiNER for Chinese entity extraction This guide addresses user's question about HanLP (33k stars) vs GLiNER for Chinese RAG systems. TL;DR: HanLP is significantly better for Chinese NER Quick Comparison: - Chinese F1: HanLP 95.22% vs GLiNER ~70% - Speed: GLiNER 4x faster, but quality gap is too large - Stars: HanLP 33k vs GLiNER 3k (but stars ≠ quality for specific languages) - Recommendation: HanLP for Chinese, GLiNER for English/multilingual Detailed Content: 1. Performance Comparison: - HanLP BERT on MSRA: F1 95.22% (P: 94.79%, R: 95.65%) - GLiNER-Multi on Chinese: Average score 24.3 (lowest among all languages) - Reason: GLiNER trained primarily on English, zero-shot transfer to Chinese is weak 2. Feature Comparison: HanLP Strengths: - Specifically designed for Chinese NLP - Integrated tokenization (critical for Chinese) - Multiple pre-trained models (MSRA, OntoNotes) - Rich Chinese documentation and community GLiNER Strengths: - Zero-shot learning (any entity type without training) - Extremely flexible - Faster inference (500-2000 sentences/sec vs 100-500) - Multilingual support (40+ languages) 3. Real-World Test: - Example Chinese text about Apple Inc., Tim Cook, MacBook Pro - HanLP: ~95% accuracy, correctly identifies all entities with boundaries - GLiNER: ~65-75% accuracy, misses some entities, lower confidence scores 4. Speed vs Quality Trade-off: - For 1417 chunks: HanLP 20s vs GLiNER 5s (15s difference) - Quality difference: 95% vs 70% F1 (25% gap) - Conclusion: 15 seconds saved not worth 25% quality loss 5. Use Case Recommendations: Choose HanLP: ✅ Pure Chinese RAG systems ✅ Quality priority ✅ Standard entity types (person, location, organization) ✅ Academic research Choose GLiNER: ✅ Custom entity types needed ✅ Multilingual text (English primary + some Chinese) ✅ Rapid prototyping ✅ Entity types change frequently 6. Hybrid Strategy (Best Practice): ``` Option 1: HanLP (entities) + LLM (relations) - Entity F1: 95%, Time: 20s - Relation quality maintained - Best balance Option 2: HanLP primary + GLiNER for custom types - Standard entities: HanLP - Domain-specific entities: GLiNER - Deduplicate results ``` 7. LightRAG Integration: - Provides complete code examples for: - Pure HanLP extractor - Hybrid HanLP + GLiNER extractor - Language-adaptive extractor - Performance comparison for indexing 1417 chunks 8. Cost Analysis: - Model size: HanLP 400MB vs GLiNER 280MB - Memory: HanLP 1.5GB vs GLiNER 1GB - Cost for 100k chunks on AWS: ~$0.03 vs ~$0.007 - Conclusion: Cost difference negligible compared to quality 9. Community & Support: - HanLP: Active Chinese community, comprehensive Chinese docs, widely cited - GLiNER: International community, strong English docs, fewer Chinese resources 10. Full Comparison Matrix: - vs other tools: spaCy, StanfordNLP, jieba - HanLP ranks #1 for Chinese NER (F1 95%) - GLiNER ranks better for flexibility but lower for Chinese accuracy Key Insights: - GitHub stars don't equal quality for specific languages - HanLP 33k stars reflects its Chinese NLP dominance - GLiNER 3k stars but excels at zero-shot and English - For Chinese RAG: HanLP >>> GLiNER (quality gap too large) - For multilingual RAG: Consider GLiNER - Recommended: HanLP for entities, LLM for relations Final Recommendation for LightRAG Chinese users: Stage 1: Try HanLP alone for entity extraction Stage 2: Use HanLP (entities) + LLM (relations) hybrid Stage 3: Evaluate quality vs pure LLM baseline Helps Chinese RAG users make informed decisions about entity extraction approaches.		2025-11-19 16:16:00 +00:00
..
Algorithm.md	Create Algorithm.md	2025-01-24 21:19:04 +01:00
DockerDeployment.md	Add BuildKit cache mounts to optimize Docker build performance	2025-11-03 12:40:30 +08:00
EvaluatingEntityRelationQuality-zh.md	Add comprehensive entity/relation extraction quality evaluation guide	2025-11-19 12:45:31 +00:00
FrontendBuildGuide.md	Use frozen lockfile for consistent frontend builds	2025-10-14 03:34:55 +08:00
HanLPvsGLiNER-zh.md	Add detailed comparison: HanLP vs GLiNER for Chinese entity extraction	2025-11-19 16:16:00 +00:00
LightRAG_concurrent_explain.md	Update README	2025-07-27 17:26:49 +08:00
OfflineDeployment.md	refactor: move document deps to api group, remove dynamic imports	2025-11-13 13:34:09 +08:00
PerformanceFAQ-zh.md	Add comprehensive performance FAQ addressing max_async, LLM selection, and database optimization	2025-11-19 10:21:58 +00:00
PerformanceOptimization-zh.md	Add performance optimization guide and configuration for LightRAG indexing	2025-11-19 09:55:28 +00:00
PerformanceOptimization.md	Add performance optimization guide and configuration for LightRAG indexing	2025-11-19 09:55:28 +00:00
RAGEvaluationMethodsComparison-zh.md	Add comprehensive comparison of RAG evaluation methods	2025-11-19 13:36:56 +00:00
SelfHostedOptimization-zh.md	Add comprehensive self-hosted LLM optimization guide for LightRAG	2025-11-19 10:53:48 +00:00
UV_LOCK_GUIDE.md	Migrate Dockerfile from pip to uv package manager for faster builds	2025-10-16 01:54:20 +08:00
WhatIsGleaning-zh.md	Add comprehensive guide explaining gleaning concept in LightRAG	2025-11-19 11:45:07 +00:00
WhatIsRAGAS-zh.md	Add comprehensive RAGAS evaluation framework guide	2025-11-19 12:52:22 +00:00