LightRAG

Author	SHA1	Message	Date
Claude	d78a8cb9df	Add comprehensive performance FAQ addressing max_async, LLM selection, and database optimization ## Questions Addressed 1. How does max_async work? - Explains two-layer concurrency control architecture - Code references: operate.py:2932 (chunk level), lightrag.py:647 (worker pool) - Clarifies difference between max_async and actual API concurrency 2. Why does concurrency help if TPS is fixed? - Addresses user's critical insight about API throughput limits - Explains difference between RPM/TPM limits vs instantaneous TPS - Shows how concurrency hides network latency - Provides concrete examples with timing calculations - Key insight: max_async doesn't increase API capacity, but helps fully utilize it 3. Which LLM models for entity/relationship extraction? - Comprehensive model comparison (GPT-4o, Claude, Gemini, DeepSeek, Qwen) - Performance benchmarks with actual metrics - Cost analysis per 1000 chunks - Recommendations for different scenarios: * Best value: GPT-4o-mini ($8/1000 chunks, 91% accuracy) * Highest quality: Claude 3.5 Sonnet (96% accuracy, $180/1000 chunks) * Fastest: Gemini 1.5 Flash (2s/chunk, $3/1000 chunks) * Self-hosted: DeepSeek-V3, Qwen2.5 (zero marginal cost) 4. Does switching graph database help extraction speed? - Detailed pipeline breakdown showing 95% time in LLM extraction - Graph database only affects 6-12% of total indexing time - Performance comparison: NetworkX vs Neo4j vs Memgraph - Conclusion: Optimize max_async first (4-8x speedup), database last (1-2% speedup) ## Key Technical Insights - Network latency hiding: Serial processing wastes time on network RTT * Serial (max_async=1): 128s for 4 requests * Concurrent (max_async=4): 34s for 4 requests (3.8x faster) - API utilization analysis: * max_async=1 achieves only 20% of TPM limit * max_async=16 achieves 100% of TPM limit * Demonstrates why default max_async=4 is too conservative - Optimization priority ranking: 1. Increase max_async: 4-8x speedup ✅✅✅ 2. Better LLM model: 2-3x speedup ✅✅ 3. Disable gleaning: 2x speedup ✅ 4. Optimize embedding concurrency: 1.2-1.5x speedup ✅ 5. Switch graph database: 1-2% speedup ⚠️ ## User's Optimization Roadmap Current state: 1417 chunks in 5.7 hours (0.07 chunks/s) Recommended steps: 1. Set MAX_ASYNC=16 → 1.5 hours (save 4.2 hours) 2. Switch to GPT-4o-mini → 1.2 hours (save 0.3 hours) 3. Optional: Disable gleaning → 0.6 hours (save 0.6 hours) 4. Optional: Self-host model → 0.25 hours (save 0.35 hours) ## Files Changed - docs/PerformanceFAQ-zh.md: Comprehensive FAQ (800+ lines) addressing all questions * Technical architecture explanation * Mathematical analysis of concurrency benefits * Model comparison with benchmarks * Pipeline breakdown with code references * Optimization priority ranking with ROI analysis	2025-11-19 10:21:58 +00:00
Claude	6a56829e69	Add performance optimization guide and configuration for LightRAG indexing ## Problem Default configuration leads to extremely slow indexing speed: - 100 chunks taking ~1500 seconds (0.1 chunks/s) - 1417 chunks requiring ~5.7 hours total - Root cause: Conservative concurrency limits (MAX_ASYNC=4, MAX_PARALLEL_INSERT=2) ## Solution Add comprehensive performance optimization resources: 1. Optimized configuration template (.env.performance): - MAX_ASYNC=16 (4x improvement from default 4) - MAX_PARALLEL_INSERT=4 (2x improvement from default 2) - EMBEDDING_FUNC_MAX_ASYNC=16 (2x improvement from default 8) - EMBEDDING_BATCH_NUM=32 (3.2x improvement from default 10) - Expected speedup: 4-8x faster indexing 2. Performance optimization guide (docs/PerformanceOptimization.md): - Root cause analysis with code references - Detailed configuration explanations - Performance benchmarks and comparisons - Quick fix instructions - Advanced optimization strategies - Troubleshooting guide - Multiple configuration templates for different scenarios 3. Chinese version (docs/PerformanceOptimization-zh.md): - Full translation of performance guide - Localized for Chinese users ## Performance Impact With recommended configuration (MAX_ASYNC=16): - Batch processing time: ~1500s → ~400s (4x faster) - Overall throughput: 0.07 → 0.28 chunks/s (4x faster) - User's 1417 chunks: ~5.7 hours → ~1.4 hours (save 4.3 hours) With aggressive configuration (MAX_ASYNC=32): - Batch processing time: ~1500s → ~200s (8x faster) - Overall throughput: 0.07 → 0.5 chunks/s (8x faster) - User's 1417 chunks: ~5.7 hours → ~0.7 hours (save 5 hours) ## Files Changed - .env.performance: Ready-to-use optimized configuration with detailed comments - docs/PerformanceOptimization.md: Comprehensive English guide (150+ lines) - docs/PerformanceOptimization-zh.md: Comprehensive Chinese guide (150+ lines) ## Usage Users can now: 1. Quick fix: `cp .env.performance .env` and restart 2. Learn: Read comprehensive guides for understanding bottlenecks 3. Customize: Use templates for different LLM providers and scenarios	2025-11-19 09:55:28 +00:00
yangdx	5cc916861f	Expand AGENTS.md with testing controls and automation guidelines - Add pytest marker and CLI toggle docs - Document automation workflow rules - Clarify integration test setup - Add agent-specific best practices - Update testing command examples	2025-11-19 11:30:54 +08:00
Daniel.y	af4d2a3dcc	Merge pull request #2386 from danielaskdd/excel-optimization Feat: Enhance XLSX Extraction by Adding Separators and Escape Special Characters	2025-11-19 10:26:32 +08:00
yangdx	95cd0ece74	Fix DOCX table extraction by escaping special characters in cells - Add escape_cell() function - Escape backslashes first - Handle tabs and newlines - Preserve tab-delimited format - Prevent double-escaping issues	2025-11-19 09:54:35 +08:00
yangdx	87de2b3e9e	Update XLSX extraction documentation to reflect current implementation	2025-11-19 04:26:41 +08:00
yangdx	0244699d81	Optimize XLSX extraction by using sheet.max_column instead of two-pass scan • Remove two-pass row scanning approach • Use built-in sheet.max_column property • Simplify column width detection logic • Improve memory efficiency • Maintain column alignment preservation	2025-11-19 04:02:39 +08:00
yangdx	2b16016312	Optimize XLSX extraction to avoid storing all rows in memory • Remove intermediate row storage • Use iterator twice instead of list() • Preserve column alignment logic • Reduce memory footprint • Maintain same output format	2025-11-19 03:48:36 +08:00
yangdx	ef659a1e09	Preserve column alignment in XLSX extraction with two-pass processing • Two-pass approach for consistent width • Maintain tabular structure integrity • Determine max columns first pass • Extract with alignment second pass • Prevent column misalignment issues	2025-11-19 03:34:22 +08:00
yangdx	3efb1716b4	Enhance XLSX extraction with structured tab-delimited format and escaping - Add clear sheet separators - Escape special characters - Trim trailing empty columns - Preserve row structure - Single-pass optimization	2025-11-19 03:06:29 +08:00
Daniel.y	efbbaaf7f9	Merge pull request #2383 from danielaskdd/doc-table Feat: Enhanced DOCX Extraction with Table Content Support	2025-11-19 02:26:02 +08:00
yangdx	e7d2803a65	Remove text stripping in DOCX extraction to preserve whitespace • Keep original paragraph spacing • Preserve cell whitespace in tables • Maintain document formatting • Don't strip leading/trailing spaces	2025-11-19 02:12:27 +08:00
yangdx	186c8f0e16	Preserve blank paragraphs in DOCX extraction to maintain spacing • Remove text emptiness check • Always append paragraph text • Maintain document formatting • Preserve original spacing	2025-11-19 02:03:10 +08:00
yangdx	fa887d811b	Fix table column structure preservation in DOCX extraction • Always append cell text to maintain columns • Preserve empty cells in table structure • Check for any content before adding rows • Use tab separation for proper alignment • Improve table formatting consistency	2025-11-19 01:52:02 +08:00
yangdx	4438ba41a3	Enhance DOCX extraction to preserve document order with tables • Include tables in extracted content • Maintain original document order • Add spacing around tables • Use tabs to separate table cells • Process all body elements sequentially	2025-11-19 01:31:33 +08:00
yangdx	d16c7840ab	Bump API version to 0256	2025-11-18 23:15:31 +08:00
yangdx	e77340d4a1	Adjust chunking parameters to match the default environment variable settings	2025-11-18 23:14:50 +08:00
yangdx	24423c9215	Merge branch 'fix_chunk_comment'	2025-11-18 22:47:23 +08:00
yangdx	1bfa1f81cb	Merge branch 'main' into fix_chunk_comment	2025-11-18 22:38:50 +08:00
yangdx	9c10c87554	Fix linting	2025-11-18 22:38:43 +08:00
yangdx	9109509b1a	Merge branch 'dev-postgres-vchordrq'	2025-11-18 22:25:35 +08:00
yangdx	dbae327a17	Merge branch 'main' into dev-postgres-vchordrq	2025-11-18 22:13:27 +08:00
yangdx	b583b8a59d	Merge branch 'feature/postgres-vchordrq-indexes' into dev-postgres-vchordrq	2025-11-18 22:05:48 +08:00
yangdx	3096f844fb	fix(postgres): allow vchordrq.epsilon config when probes is empty Previously, configure_vchordrq would fail silently when probes was empty (the default), preventing epsilon from being configured. Now each parameter is handled independently with conditional execution, and configuration errors fail-fast instead of being swallowed. This fixes the documented epsilon setting being impossible to use in the default configuration.	2025-11-18 21:58:36 +08:00
EightyOliveira	dacca334e0	refactor(chunking): rename params and improve docstring for chunking_by_token_size	2025-11-18 15:46:28 +08:00
wmsnp	f4bf5d279c	fix: add logger to configure_vchordrq() and format code	2025-11-18 15:31:08 +08:00
Daniel.y	dfbc97363c	Merge pull request #2369 from HKUDS/workspace-isolation Feat: Add Workspace Isolation for Pipeline Status and In-memory Storage	2025-11-18 15:21:10 +08:00
yangdx	702cfd2981	Fix document deletion concurrency control and validation logic • Clarify job naming for single vs batch deletion • Update job name validation in busy pipeline check	2025-11-18 13:59:24 +08:00
yangdx	656025b75e	Rename GitHub workflow from "Tests" to "Offline Unit Tests"	2025-11-18 13:36:00 +08:00
yangdx	7e9c8ed1e8	Rename test classes to prevent warning from pytest • TestResult → ExecutionResult • TestStats → ExecutionStats • Update class docstrings • Update type hints • Update variable references	2025-11-18 13:33:05 +08:00
yangdx	4048fc4b89	Fix: auto-acquire pipeline when idle in document deletion • Track if we acquired the pipeline lock • Auto-acquire pipeline when idle • Only release if we acquired it • Prevent concurrent deletion conflicts • Improve deletion job validation	2025-11-18 13:25:13 +08:00
yangdx	1745b30a5f	Fix missing workspace parameter in update flags status call	2025-11-18 12:55:48 +08:00
yangdx	f8dd2e0724	Fix namespace parsing when workspace contains colons • Use rsplit instead of split • Handle colons in workspace names	2025-11-18 12:23:05 +08:00
yangdx	472b498ade	Replace pytest group reference with explicit dependencies in evaluation • Remove pytest group dependency • Add explicit pytest>=8.4.2 • Add pytest-asyncio>=1.2.0 • Add pre-commit directly • Fix potential circular dependency	2025-11-18 12:17:21 +08:00
yangdx	a11912ffa5	Add testing workflow guidelines to basic development rules * Define pytest marker patterns * Document CI/CD test execution * Specify offline vs integration tests * Add test isolation best practices * Reference testing guidelines doc	2025-11-18 11:54:19 +08:00
yangdx	41bf6d0283	Fix test to use default workspace parameter behavior	2025-11-18 11:51:17 +08:00
wmsnp	d07023c962	feat(postgres_impl): add vchordrq vector index support and unify vector index creation logic	2025-11-18 11:45:16 +08:00
yangdx	4ea2124001	Add GitHub CI workflow and test markers for offline/integration tests - Add GitHub Actions workflow for CI - Mark integration tests requiring services - Add offline test markers for isolated tests - Skip integration tests by default - Configure pytest markers and collection	2025-11-18 11:36:10 +08:00
yangdx	4fef731f37	Standardize test directory creation and remove tempfile dependency • Remove unused tempfile import • Use consistent project temp/ structure • Clean up existing directories first • Create directories with os.makedirs • Use descriptive test directory names	2025-11-18 10:39:54 +08:00
yangdx	1fe05df211	Refactor test configuration to use pytest fixtures and CLI options • Add pytest command-line options • Create session-scoped fixtures • Remove hardcoded environment vars • Update test function signatures • Improve configuration priority	2025-11-18 10:31:53 +08:00
yangdx	6ae0c14438	test: add concurrent execution to workspace isolation test • Add async sleep to mock functions • Test concurrent ainsert operations • Use asyncio.gather for parallel exec • Measure concurrent execution time	2025-11-18 10:17:34 +08:00
yangdx	6cef8df159	Reduce log level and improve workspace mismatch message clarity • Change warning to info level • Simplify workspace mismatch wording	2025-11-18 08:25:21 +08:00
yangdx	fc9f7c705e	Fix linting	2025-11-18 08:07:54 +08:00
yangdx	f83b475ab1	Remove Dependabot configuration file • Delete .github/dependabot.yml • Remove weekly pip updates	2025-11-18 01:42:15 +08:00
yangdx	21ad990e36	Improve workspace isolation tests with better parallelism checks and cleanup • Add finalize_share_data cleanup • Refactor lock timing measurement • Add timeline overlap validation • Include purpose/scope documentation • Fix tokenizer integration	2025-11-18 01:38:31 +08:00
yangdx	5da82bb096	Add pre-commit to pytest dependencies and format test code • Add pre-commit to pytest extra deps • Update lock file dependencies	2025-11-18 00:42:04 +08:00
yangdx	99262adaaa	Enhance workspace isolation test with distinct mock data and persistence • Use different mock LLM per workspace • Add persistent test directory • Create workspace-specific responses • Skip cleanup for inspection	2025-11-18 00:38:31 +08:00
yangdx	b7b8d15632	Refactor pytest dependencies into separate optional group - Extract pytest deps to own group - Reference pytest group in evaluation - Add pytest config to pyproject.toml - Update uv.lock with new structure	2025-11-17 23:52:13 +08:00
yangdx	1874cfaf73	Fix linting	2025-11-17 23:32:38 +08:00
Daniel.y	3806892a40	Merge pull request #2371 from BukeLy/pytest-style-conversion test: Convert test_workspace_isolation.py to pytest style	2025-11-17 23:28:56 +08:00

1 2 3 4 5 ...

5777 commits