Add S3 storage client and API routes for document management:
- Implement s3_routes.py with file upload, download, and delete endpoints (sketched below)
- Enhance s3_client.py with improved error handling and operations
- Add S3 browser UI component with file viewing and management
- Implement FileViewer and PDFViewer components for storage preview
- Add Resizable and Sheet UI components for layout control
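A minimal sketch of what the three endpoints might look like, assuming FastAPI and boto3; the route prefix, bucket name, and response shapes are illustrative, not the actual s3_routes.py code:

```python
# Hedged sketch of upload/download/delete routes; bucket handling is assumed.
import boto3
from botocore.exceptions import ClientError
from fastapi import APIRouter, HTTPException, Response, UploadFile

router = APIRouter(prefix="/s3", tags=["s3"])
s3 = boto3.client("s3")
BUCKET = "documents"  # assumption: one configured bucket

@router.post("/upload")
async def upload_file(file: UploadFile):
    # Stream the multipart upload straight into the bucket.
    s3.upload_fileobj(file.file, BUCKET, file.filename)
    return {"key": file.filename}

@router.get("/download/{key}")
async def download_file(key: str):
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except ClientError as exc:
        raise HTTPException(status_code=404, detail=str(exc))
    return Response(content=obj["Body"].read(),
                    media_type="application/octet-stream")

@router.delete("/{key}")
async def delete_file(key: str):
    s3.delete_object(Bucket=BUCKET, Key=key)
    return {"deleted": key}
```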
Update backend infrastructure:
- Add bulk operations and parameterized queries to postgres_impl.py
- Enhance document routes with improved type hints
- Update API server registration for new S3 routes
- Refine upload routes and utility functions
Modernize web UI:
- Integrate S3 browser into main application layout
- Update localization files for storage UI strings
- Add storage settings to application configuration
- Sync package dependencies and lock files
Remove obsolete reproduction script:
- Delete reproduce_citation.py (replaced by test suite)
Update configuration:
- Enhance pyrightconfig.json for stricter type checking
Add comprehensive test suites for prompt evaluation:
- test_prompt_accuracy.py: 365 lines testing prompt extraction accuracy
- test_prompt_quality_deep.py: 672 lines for deep quality analysis
- Refactor prompt.py to consolidate optimized variants (removed prompt_optimized.py)
- Apply ruff formatting and type hints across 30 files
- Update pyrightconfig.json for static type checking
- Modernize reproduce scripts and examples with improved type annotations
- Sync uv.lock dependencies
Format entire codebase with ruff and add type hints across all modules:
- Apply ruff formatting to all Python files (121 files, 17K insertions)
- Add type hints to function signatures throughout lightrag core and API
- Update test suite with improved type annotations and docstrings
- Add pyrightconfig.json for static type checking configuration
- Create prompt_optimized.py and test_extraction_prompt_ab.py test files
- Update ruff.toml and .gitignore for improved linting configuration
- Standardize code style across examples, reproduce scripts, and utilities
Improve code quality and test robustness:
- Refactor environment variable parsing in rerank config using the centralized get_env_value helper (a plausible shape is sketched after this list)
- Add return type hints to all test methods for better type safety
- Fix patch path in test from lightrag.utils to lightrag.rerank for correct import location
- Clarify batch insert endpoint behavior regarding duplicate content rejection
- Expand .dockerignore to comprehensively exclude node_modules (200MB+), Python cache files, and venv directories
- Update dependency groups: align evaluation and test extras with pytest/pre-commit/ruff tools
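A plausible shape for the centralized helper; the actual get_env_value in lightrag.utils may differ in signature and coercion rules:

```python
# Sketch of a centralized env-var helper; signature is an assumption.
import os
from typing import TypeVar

T = TypeVar("T")

def get_env_value(key: str, default: T, value_type: type = str) -> T:
    """Read an environment variable and coerce it to value_type."""
    raw = os.environ.get(key)
    if raw is None:
        return default
    if value_type is bool:
        # Accept the usual truthy spellings for boolean flags.
        return raw.strip().lower() in ("1", "true", "yes", "on")  # type: ignore[return-value]
    return value_type(raw)

# Usage: a missing variable falls back to the default, a set one is coerced.
top_k = get_env_value("RERANK_TOP_K", 10, int)
```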
• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells (see the sketch below)
• Process all body elements sequentially
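A minimal sketch of the tab-separated extraction, assuming python-docx; the helper names are illustrative, and the real code walks body elements sequentially to preserve document order rather than appending tables at the end:

```python
# Illustrative DOCX table extraction with tab-separated cells.
from docx import Document

def table_to_text(table) -> str:
    lines = []
    for row in table.rows:
        # Always append each cell's text, even when empty, so columns stay aligned.
        cells = [cell.text.strip() for cell in row.cells]
        if any(cells):  # only add rows that contain some content
            lines.append("\t".join(cells))  # tabs separate the cells
    return "\n".join(lines)

def extract_docx(path: str) -> str:
    doc = Document(path)
    parts = [p.text for p in doc.paragraphs]
    # Simplification: the commit processes all body elements sequentially to
    # keep the original order; here tables are appended after the paragraphs.
    for table in doc.tables:
        parts.append("")  # spacing before the table
        parts.append(table_to_text(table))
        parts.append("")  # spacing after the table
    return "\n".join(parts)
```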
- Update import from PyPDF2 to pypdf (example below)
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library
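The migration is essentially a one-line import change, since pypdf keeps the PdfReader API; the filename below is illustrative:

```python
# Before (PyPDF2, now deprecated):
# from PyPDF2 import PdfReader
# After (pypdf >= 6.1.0):
from pypdf import PdfReader

reader = PdfReader("example.pdf")  # illustrative path
text = "".join(page.extract_text() or "" for page in reader.pages)
```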
Problem:
In multi-tenant scenarios, different workspaces share a single global
pipeline_status namespace, causing pipelines from different tenants to
block each other, severely impacting concurrent processing performance.
Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline
  namespaces with the pattern "{workspace}:pipeline", following the GraphDB
  pattern (sketched at the end of this message)
- Added workspace parameter to initialize_pipeline_status() for per-tenant
isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
* lightrag.py: process_document_queue(), aremove_document()
* document_routes.py: background_delete_documents(), clear_documents(),
cancel_pipeline(), get_pipeline_status(), delete_documents()
Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants
Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag
Code changes: 4 files, 141 insertions(+), 28 deletions(-)
Testing: All syntax checks passed, comprehensive workspace isolation tests completed
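A minimal sketch of the namespace scheme described above; the real shared-storage internals in lightrag differ:

```python
# Workspace-scoped pipeline namespaces; a sketch, not the actual code.
_shared: dict[str, dict] = {}

def _pipeline_namespace(workspace: str = "") -> str:
    # Empty workspace keeps the legacy global name for backward compatibility;
    # otherwise follow the "{workspace}:pipeline" pattern.
    return f"{workspace}:pipeline" if workspace else "pipeline_status"

async def initialize_pipeline_status(workspace: str = "") -> None:
    _shared.setdefault(_pipeline_namespace(workspace), {"busy": False, "jobs": []})

def get_namespace_data(namespace: str) -> dict:
    try:
        return _shared[namespace]
    except KeyError:
        # Fail fast: an uninitialized pipeline namespace is a clear error.
        raise RuntimeError(f"namespace {namespace!r} not initialized")
```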
• Use "preprocessed" to indicate multimodal processing is required
• Update DocProcessingStatus to handle status conversion automatically
• Remove the multimodal_processed value from the DocStatus enum
• Update UI filter logic
• Add delete_llm_cache parameter to API
• Collect cache IDs from text chunks
• Delete cache after graph operations (see the sketch below)
• Update UI with new checkbox option
• Add i18n translations for cache option
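An illustrative flow for the new option; the storage handles, method names, and the llm_cache_list field are assumptions, not LightRAG's actual internals:

```python
# Hypothetical deletion flow with optional LLM-cache cleanup.
from typing import Any

async def adelete_document(doc_id: str, text_chunks: Any, llm_cache: Any,
                           graph: Any, delete_llm_cache: bool = False) -> None:
    chunks = await text_chunks.get_by_doc_id(doc_id)  # assumed accessor
    # Collect the cache IDs from the text chunks before they are removed.
    cache_ids = [cid for c in chunks for cid in c.get("llm_cache_list", [])]
    await graph.remove_document(doc_id)  # graph operations run first
    if delete_llm_cache and cache_ids:
        await llm_cache.delete(cache_ids)  # then the cache entries are dropped
```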
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies (sketched below)
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage backends
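One plausible reading of the two strategies (KEEP retains the oldest tracked chunks, FIFO evicts them); the limit constant and list layout are assumptions:

```python
# Sketch of per-entity/relation chunk-list limiting.
MAX_CHUNK_IDS = 100  # hypothetical limit, configured via env.example

def apply_limit(existing: list[str], new: list[str],
                strategy: str = "FIFO") -> list[str]:
    merged = existing + [c for c in new if c not in existing]
    if len(merged) <= MAX_CHUNK_IDS:
        return merged
    if strategy == "KEEP":
        # KEEP: retain the oldest entries, drop the newest overflow.
        return merged[:MAX_CHUNK_IDS]
    # FIFO: drop the oldest entries, keep the most recent ones.
    return merged[-MAX_CHUNK_IDS:]
```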
• Add DocStatus.PREPROCESSED enum value (illustrated below)
• Update API routes and response models
• Add preprocessed filter in web UI
• Update localization files
• Handle preprocessed status in deletion
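An illustrative shape for the new status value; the actual enum in lightrag may define more members or different values:

```python
from enum import Enum

class DocStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    # Extracted, but multimodal processing is still required (per the commits above).
    PREPROCESSED = "preprocessed"
    PROCESSED = "processed"
    FAILED = "failed"
```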
Add a new `/documents/reprocess_failed` API endpoint and corresponding
UI button to retry processing of failed and pending documents. This
addresses a common recovery scenario when document processing fails due
to server crashes, network errors, or LLM service outages.
Backend changes:
- Add ReprocessResponse model with status, message, and track_id fields
- Add POST /documents/reprocess_failed endpoint that triggers background
  reprocessing of FAILED, PENDING, and interrupted PROCESSING documents
  (a sketch of the endpoint follows this message)
- Reuses existing apipeline_process_enqueue_documents for consistency
- Includes comprehensive docstring and logging for observability
Frontend changes:
- Add TypeScript types and API function for the new endpoint
- Add retry handler with intelligent polling (fast refresh → normal)
- Add "Retry Failed" button in Documents page toolbar
- Button disabled when pipeline is busy to prevent duplicate operations
- Complete i18n support (English and Chinese translations)
This feature provides a convenient way to recover from processing
failures without requiring a full filesystem rescan.
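A hedged sketch of the backend endpoint, assuming FastAPI; the track-id scheme and app wiring are illustrative:

```python
import uuid
from fastapi import APIRouter, BackgroundTasks
from pydantic import BaseModel

router = APIRouter(prefix="/documents")

class ReprocessResponse(BaseModel):
    status: str
    message: str
    track_id: str

@router.post("/reprocess_failed", response_model=ReprocessResponse)
async def reprocess_failed(background_tasks: BackgroundTasks) -> ReprocessResponse:
    track_id = f"retry-{uuid.uuid4().hex[:8]}"  # illustrative track-id scheme
    # Reuse the existing enqueue pipeline, per the commit; `rag` is assumed
    # to be the application's LightRAG instance.
    background_tasks.add_task(rag.apipeline_process_enqueue_documents)
    return ReprocessResponse(
        status="reprocessing_started",
        message="Retrying failed and pending documents",
        track_id=track_id,
    )
```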
- Add get_doc_by_file_path to all storages
- Skip processed files in scan operation
- Check duplicates in upload endpoints (see the sketch below)
- Check duplicates in text insert APIs
- Return status info in duplicate responses
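An illustrative duplicate check, assuming the new get_doc_by_file_path accessor; the response shape is a guess:

```python
from typing import Any, Optional

async def check_duplicate(doc_status: Any, file_path: str) -> Optional[dict]:
    existing = await doc_status.get_doc_by_file_path(file_path)
    if existing is None:
        return None  # not a duplicate; the caller proceeds with the enqueue
    # Surface the stored status so the client sees why the upload was skipped.
    return {
        "status": "duplicated",
        "message": f"{file_path} is already tracked with status {existing['status']}",
    }
```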
• Limit history to 1000 latest messages
• Add truncation message when needed (sketched below)
• Show count of truncated messages
• Update API documentation
• Prevent memory issues with large logs
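A simple truncation scheme matching the bullets above; the 1000-message limit follows the commit, while the function name and message format are illustrative:

```python
MAX_HISTORY = 1000

def truncate_history(messages: list[str]) -> list[str]:
    if len(messages) <= MAX_HISTORY:
        return messages
    dropped = len(messages) - MAX_HISTORY
    # Keep only the newest messages and note how many were dropped.
    return [f"... {dropped} earlier messages truncated ..."] + messages[-MAX_HISTORY:]
```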
- Reset documents with PROCESSING/FAILED status to PENDING when they pass consistency checks
- Update doc_status storage and clear error messages/metadata on reset
• Replace pyuca with a centralized utils function
• Add pinyin sort keys for file paths (sketched below)
• Update MongoDB indexes with zh collation
• Migrate existing indexes for compatibility
• Support Chinese chars in Redis/JSON storage
• Keep PostgreSQL sort order controlled by the database collation
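A sketch of a pinyin-aware sort key, assuming pypinyin as the backing library; the real utils function may differ:

```python
# Chinese characters sort by their pinyin reading; ASCII passes through as-is.
from pypinyin import lazy_pinyin

def pinyin_sort_key(path: str) -> list[str]:
    return lazy_pinyin(path.lower())

# Usage: mixed Chinese/ASCII file paths sort in a stable, readable order.
paths = sorted(["报告.pdf", "archive.zip", "数据.csv"], key=pinyin_sort_key)
```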
- Add apipeline_enqueue_error_documents function to LightRAG class for recording file processing errors in doc_status storage
- Enhance pipeline_enqueue_file with detailed error handling for all file processing stages:
* File access errors (permissions, not found)
* UTF-8 encoding errors
* Format-specific processing errors (PDF, DOCX, PPTX, XLSX)
* Content validation errors
* Unsupported file type errors
This implementation ensures all file extraction failures are tracked and recorded in the doc_status storage system. That gives better visibility into document processing issues and enables improved error monitoring and debugging, as sketched below.
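A hedged sketch of the per-stage error capture; the extractor and the exact signature of apipeline_enqueue_error_documents are assumptions based on the description above:

```python
from typing import Any

async def enqueue_file(rag: Any, file_path: str) -> bool:
    try:
        content = await extract_content(file_path)  # hypothetical extractor
        if not content.strip():
            raise ValueError("no extractable content")  # content validation
    except PermissionError as exc:      # file access errors
        await rag.apipeline_enqueue_error_documents(file_path, f"access: {exc}")
    except UnicodeDecodeError as exc:   # UTF-8 encoding errors
        await rag.apipeline_enqueue_error_documents(file_path, f"encoding: {exc}")
    except ValueError as exc:           # validation / format-specific errors
        await rag.apipeline_enqueue_error_documents(file_path, f"invalid: {exc}")
    else:
        await rag.apipeline_enqueue_documents(content, file_paths=file_path)
        return True
    return False
```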