Add S3 storage client and API routes for document management:
- Implement s3_routes.py with file upload, download, and delete endpoints (sketched below)
- Enhance s3_client.py with improved error handling and operations
- Add S3 browser UI component with file viewing and management
- Implement FileViewer and PDFViewer components for storage preview
- Add Resizable and Sheet UI components for layout control
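A minimal sketch of what the three endpoints might look like, assuming FastAPI and boto3; the route prefix, bucket name, and response shapes are illustrative, not the actual s3_routes.py code:

```python
# Hedged sketch of upload/download/delete routes; bucket handling is assumed.
import boto3
from botocore.exceptions import ClientError
from fastapi import APIRouter, HTTPException, Response, UploadFile

router = APIRouter(prefix="/s3", tags=["s3"])
s3 = boto3.client("s3")
BUCKET = "documents"  # assumption: one configured bucket

@router.post("/upload")
async def upload_file(file: UploadFile):
    # Stream the multipart upload straight into the bucket.
    s3.upload_fileobj(file.file, BUCKET, file.filename)
    return {"key": file.filename}

@router.get("/download/{key}")
async def download_file(key: str):
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except ClientError as exc:
        raise HTTPException(status_code=404, detail=str(exc))
    return Response(content=obj["Body"].read(),
                    media_type="application/octet-stream")

@router.delete("/{key}")
async def delete_file(key: str):
    s3.delete_object(Bucket=BUCKET, Key=key)
    return {"deleted": key}
```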
Update backend infrastructure:
- Add bulk operations and parameterized queries to postgres_impl.py
- Enhance document routes with improved type hints
- Update API server registration for new S3 routes
- Refine upload routes and utility functions
Modernize web UI:
- Integrate S3 browser into main application layout
- Update localization files for storage UI strings
- Add storage settings to application configuration
- Sync package dependencies and lock files
Remove obsolete reproduction script:
- Delete reproduce_citation.py (replaced by test suite)
Update configuration:
- Enhance pyrightconfig.json for stricter type checking
Add comprehensive test suites for prompt evaluation:
- test_prompt_accuracy.py: 365 lines testing prompt extraction accuracy
- test_prompt_quality_deep.py: 672 lines for deep quality analysis
- Refactor prompt.py to consolidate optimized variants (removed prompt_optimized.py)
- Apply ruff formatting and type hints across 30 files
- Update pyrightconfig.json for static type checking
- Modernize reproduce scripts and examples with improved type annotations
- Sync uv.lock dependencies
Format entire codebase with ruff and add type hints across all modules:
- Apply ruff formatting to all Python files (121 files, 17K insertions)
- Add type hints to function signatures throughout lightrag core and API
- Update test suite with improved type annotations and docstrings
- Add pyrightconfig.json for static type checking configuration
- Create prompt_optimized.py and test_extraction_prompt_ab.py test files
- Update ruff.toml and .gitignore for improved linting configuration
- Standardize code style across examples, reproduce scripts, and utilities
Improve code quality and test robustness:
- Refactor environment variable parsing in rerank config using the centralized get_env_value helper (a plausible shape is sketched after this list)
- Add return type hints to all test methods for better type safety
- Fix patch path in test from lightrag.utils to lightrag.rerank for correct import location
- Clarify batch insert endpoint behavior regarding duplicate content rejection
- Expand .dockerignore to comprehensively exclude node_modules (200MB+), Python cache files, and venv directories
- Update dependency groups: align evaluation and test extras with pytest/pre-commit/ruff tools
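A plausible shape for the centralized helper; the actual get_env_value in lightrag.utils may differ in signature and coercion rules:

```python
# Sketch of a centralized env-var helper; signature is an assumption.
import os
from typing import TypeVar

T = TypeVar("T")

def get_env_value(key: str, default: T, value_type: type = str) -> T:
    """Read an environment variable and coerce it to value_type."""
    raw = os.environ.get(key)
    if raw is None:
        return default
    if value_type is bool:
        # Accept the usual truthy spellings for boolean flags.
        return raw.strip().lower() in ("1", "true", "yes", "on")  # type: ignore[return-value]
    return value_type(raw)

# Usage: a missing variable falls back to the default, a set one is coerced.
top_k = get_env_value("RERANK_TOP_K", 10, int)
```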
• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells (see the sketch below)
• Process all body elements sequentially
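A minimal sketch of the tab-separated extraction, assuming python-docx; the helper names are illustrative, and the real code walks body elements sequentially to preserve document order rather than appending tables at the end:

```python
# Illustrative DOCX table extraction with tab-separated cells.
from docx import Document

def table_to_text(table) -> str:
    lines = []
    for row in table.rows:
        # Always append each cell's text, even when empty, so columns stay aligned.
        cells = [cell.text.strip() for cell in row.cells]
        if any(cells):  # only add rows that contain some content
            lines.append("\t".join(cells))  # tabs separate the cells
    return "\n".join(lines)

def extract_docx(path: str) -> str:
    doc = Document(path)
    parts = [p.text for p in doc.paragraphs]
    # Simplification: the commit processes all body elements sequentially to
    # keep the original order; here tables are appended after the paragraphs.
    for table in doc.tables:
        parts.append("")  # spacing before the table
        parts.append(table_to_text(table))
        parts.append("")  # spacing after the table
    return "\n".join(parts)
```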
- Update import from PyPDF2 to pypdf (example below)
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library
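The migration is essentially a one-line import change, since pypdf keeps the PdfReader API; the filename below is illustrative:

```python
# Before (PyPDF2, now deprecated):
# from PyPDF2 import PdfReader
# After (pypdf >= 6.1.0):
from pypdf import PdfReader

reader = PdfReader("example.pdf")  # illustrative path
text = "".join(page.extract_text() or "" for page in reader.pages)
```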
Problem:
In multi-tenant scenarios, different workspaces share a single global
pipeline_status namespace, causing pipelines from different tenants to
block each other, severely impacting concurrent processing performance.
Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline
  namespaces with the pattern "{workspace}:pipeline", following the GraphDB
  pattern (sketched at the end of this message)
- Added workspace parameter to initialize_pipeline_status() for per-tenant
isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
* lightrag.py: process_document_queue(), aremove_document()
* document_routes.py: background_delete_documents(), clear_documents(),
cancel_pipeline(), get_pipeline_status(), delete_documents()
Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants
Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag
Code changes: 4 files, 141 insertions(+), 28 deletions(-)
Testing: All syntax checks passed, comprehensive workspace isolation tests completed
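A minimal sketch of the namespace scheme described above; the real shared-storage internals in lightrag differ:

```python
# Workspace-scoped pipeline namespaces; a sketch, not the actual code.
_shared: dict[str, dict] = {}

def _pipeline_namespace(workspace: str = "") -> str:
    # Empty workspace keeps the legacy global name for backward compatibility;
    # otherwise follow the "{workspace}:pipeline" pattern.
    return f"{workspace}:pipeline" if workspace else "pipeline_status"

async def initialize_pipeline_status(workspace: str = "") -> None:
    _shared.setdefault(_pipeline_namespace(workspace), {"busy": False, "jobs": []})

def get_namespace_data(namespace: str) -> dict:
    try:
        return _shared[namespace]
    except KeyError:
        # Fail fast: an uninitialized pipeline namespace is a clear error.
        raise RuntimeError(f"namespace {namespace!r} not initialized")
```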
• Use "preprocessed" to indicate multimodal processing is required
• Update DocProcessingStatus to handle status conversion automatically
• Remove the multimodal_processed value from the DocStatus enum
• Update UI filter logic
• Add delete_llm_cache parameter to API
• Collect cache IDs from text chunks
• Delete cache after graph operations (see the sketch below)
• Update UI with new checkbox option
• Add i18n translations for cache option
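An illustrative flow for the new option; the storage handles, method names, and the llm_cache_list field are assumptions, not LightRAG's actual internals:

```python
# Hypothetical deletion flow with optional LLM-cache cleanup.
from typing import Any

async def adelete_document(doc_id: str, text_chunks: Any, llm_cache: Any,
                           graph: Any, delete_llm_cache: bool = False) -> None:
    chunks = await text_chunks.get_by_doc_id(doc_id)  # assumed accessor
    # Collect the cache IDs from the text chunks before they are removed.
    cache_ids = [cid for c in chunks for cid in c.get("llm_cache_list", [])]
    await graph.remove_document(doc_id)  # graph operations run first
    if delete_llm_cache and cache_ids:
        await llm_cache.delete(cache_ids)  # then the cache entries are dropped
```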
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies (sketched below)
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage backends
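One plausible reading of the two strategies (KEEP retains the oldest tracked chunks, FIFO evicts them); the limit constant and list layout are assumptions:

```python
# Sketch of per-entity/relation chunk-list limiting.
MAX_CHUNK_IDS = 100  # hypothetical limit, configured via env.example

def apply_limit(existing: list[str], new: list[str],
                strategy: str = "FIFO") -> list[str]:
    merged = existing + [c for c in new if c not in existing]
    if len(merged) <= MAX_CHUNK_IDS:
        return merged
    if strategy == "KEEP":
        # KEEP: retain the oldest entries, drop the newest overflow.
        return merged[:MAX_CHUNK_IDS]
    # FIFO: drop the oldest entries, keep the most recent ones.
    return merged[-MAX_CHUNK_IDS:]
```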
• Add DocStatus.PREPROCESSED enum value (illustrated below)
• Update API routes and response models
• Add preprocessed filter in web UI
• Update localization files
• Handle preprocessed status in deletion
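An illustrative shape for the new status value; the actual enum in lightrag may define more members or different values:

```python
from enum import Enum

class DocStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    # Extracted, but multimodal processing is still required (per the commits above).
    PREPROCESSED = "preprocessed"
    PROCESSED = "processed"
    FAILED = "failed"
```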
Add a new `/documents/reprocess_failed` API endpoint and corresponding
UI button to retry processing of failed and pending documents. This
addresses a common recovery scenario when document processing fails due
to server crashes, network errors, or LLM service outages.
Backend changes:
- Add ReprocessResponse model with status, message, and track_id fields
- Add POST /documents/reprocess_failed endpoint that triggers background
  reprocessing of FAILED, PENDING, and interrupted PROCESSING documents
  (a sketch of the endpoint follows this message)
- Reuses existing apipeline_process_enqueue_documents for consistency
- Includes comprehensive docstring and logging for observability
Frontend changes:
- Add TypeScript types and API function for the new endpoint
- Add retry handler with intelligent polling (fast refresh → normal)
- Add "Retry Failed" button in Documents page toolbar
- Button disabled when pipeline is busy to prevent duplicate operations
- Complete i18n support (English and Chinese translations)
This feature provides a convenient way to recover from processing
failures without requiring a full filesystem rescan.
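A hedged sketch of the backend endpoint, assuming FastAPI; the track-id scheme and app wiring are illustrative:

```python
import uuid
from fastapi import APIRouter, BackgroundTasks
from pydantic import BaseModel

router = APIRouter(prefix="/documents")

class ReprocessResponse(BaseModel):
    status: str
    message: str
    track_id: str

@router.post("/reprocess_failed", response_model=ReprocessResponse)
async def reprocess_failed(background_tasks: BackgroundTasks) -> ReprocessResponse:
    track_id = f"retry-{uuid.uuid4().hex[:8]}"  # illustrative track-id scheme
    # Reuse the existing enqueue pipeline, per the commit; `rag` is assumed
    # to be the application's LightRAG instance.
    background_tasks.add_task(rag.apipeline_process_enqueue_documents)
    return ReprocessResponse(
        status="reprocessing_started",
        message="Retrying failed and pending documents",
        track_id=track_id,
    )
```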
- Add get_doc_by_file_path to all storages
- Skip processed files in scan operation
- Check duplicates in upload endpoints (see the sketch below)
- Check duplicates in text insert APIs
- Return status info in duplicate responses
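An illustrative duplicate check, assuming the new get_doc_by_file_path accessor; the response shape is a guess:

```python
from typing import Any, Optional

async def check_duplicate(doc_status: Any, file_path: str) -> Optional[dict]:
    existing = await doc_status.get_doc_by_file_path(file_path)
    if existing is None:
        return None  # not a duplicate; the caller proceeds with the enqueue
    # Surface the stored status so the client sees why the upload was skipped.
    return {
        "status": "duplicated",
        "message": f"{file_path} is already tracked with status {existing['status']}",
    }
```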
• Limit history to 1000 latest messages
• Add truncation message when needed (sketched below)
• Show count of truncated messages
• Update API documentation
• Prevent memory issues with large logs
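A simple truncation scheme matching the bullets above; the 1000-message limit follows the commit, while the function name and message format are illustrative:

```python
MAX_HISTORY = 1000

def truncate_history(messages: list[str]) -> list[str]:
    if len(messages) <= MAX_HISTORY:
        return messages
    dropped = len(messages) - MAX_HISTORY
    # Keep only the newest messages and note how many were dropped.
    return [f"... {dropped} earlier messages truncated ..."] + messages[-MAX_HISTORY:]
```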
- Reset documents with PROCESSING/FAILED status to PENDING when they pass consistency checks
- Update doc_status storage and clear error messages/metadata on reset
• Replace pyuca with a centralized utils function
• Add pinyin sort keys for file paths (sketched below)
• Update MongoDB indexes with zh collation
• Migrate existing indexes for compatibility
• Support Chinese chars in Redis/JSON storage
• Keep PostgreSQL sort order controlled by the database collation
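A sketch of a pinyin-aware sort key, assuming pypinyin as the backing library; the real utils function may differ:

```python
# Chinese characters sort by their pinyin reading; ASCII passes through as-is.
from pypinyin import lazy_pinyin

def pinyin_sort_key(path: str) -> list[str]:
    return lazy_pinyin(path.lower())

# Usage: mixed Chinese/ASCII file paths sort in a stable, readable order.
paths = sorted(["报告.pdf", "archive.zip", "数据.csv"], key=pinyin_sort_key)
```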
- Add apipeline_enqueue_error_documents function to LightRAG class for recording file processing errors in doc_status storage
- Enhance pipeline_enqueue_file with detailed error handling for all file processing stages:
* File access errors (permissions, not found)
* UTF-8 encoding errors
* Format-specific processing errors (PDF, DOCX, PPTX, XLSX)
* Content validation errors
* Unsupported file type errors
This implementation ensures all file extraction failures are tracked and recorded in the doc_status storage system. That gives better visibility into document processing issues and enables improved error monitoring and debugging, as sketched below.
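A hedged sketch of the per-stage error capture; the extractor and the exact signature of apipeline_enqueue_error_documents are assumptions based on the description above:

```python
from typing import Any

async def enqueue_file(rag: Any, file_path: str) -> bool:
    try:
        content = await extract_content(file_path)  # hypothetical extractor
        if not content.strip():
            raise ValueError("no extractable content")  # content validation
    except PermissionError as exc:      # file access errors
        await rag.apipeline_enqueue_error_documents(file_path, f"access: {exc}")
    except UnicodeDecodeError as exc:   # UTF-8 encoding errors
        await rag.apipeline_enqueue_error_documents(file_path, f"encoding: {exc}")
    except ValueError as exc:           # validation / format-specific errors
        await rag.apipeline_enqueue_error_documents(file_path, f"invalid: {exc}")
    else:
        await rag.apipeline_enqueue_documents(content, file_paths=file_path)
        return True
    return False
```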