• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells
• Process all body elements sequentially
- Update import from PyPDF2 to pypdf
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library
Problem:
In multi-tenant scenarios, different workspaces share a single global
pipeline_status namespace, causing pipelines from different tenants to
block each other, severely impacting concurrent processing performance.
Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline
namespaces with pattern "{workspace}:pipeline" (following GraphDB pattern)
- Added workspace parameter to initialize_pipeline_status() for per-tenant
isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
* lightrag.py: process_document_queue(), aremove_document()
* document_routes.py: background_delete_documents(), clear_documents(),
cancel_pipeline(), get_pipeline_status(), delete_documents()
Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants
Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag
Code changes: 4 files, 141 insertions(+), 28 deletions(-)
Testing: All syntax checks passed, comprehensive workspace isolation tests completed
• Use "preprocessed" to indicate multimodal processing is required
• Update DocProcessingStatus to process status convertion automatically
• Remove multimodal_processed from DocStatus enum value
• Update UI filter logic
• Add delete_llm_cache parameter to API
• Collect cache IDs from text chunks
• Delete cache after graph operations
• Update UI with new checkbox option
• Add i18n translations for cache option
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage
• Add DocStatus.PREPROCESSED enum value
• Update API routes and response models
• Add preprocessed filter in web UI
• Update localization files
• Handle preprocessed status in deletion
Add a new `/documents/reprocess_failed` API endpoint and corresponding
UI button to retry processing of failed and pending documents. This
addresses a common recovery scenario when document processing fails due
to server crashes, network errors, or LLM service outages.
Backend changes:
- Add ReprocessResponse model with status, message, and track_id fields
- Add POST /documents/reprocess_failed endpoint that triggers background
reprocessing of FAILED, PENDING, and interrupted PROCESSING documents
- Reuses existing apipeline_process_enqueue_documents for consistency
- Includes comprehensive docstring and logging for observability
Frontend changes:
- Add TypeScript types and API function for the new endpoint
- Add retry handler with intelligent polling (fast refresh → normal)
- Add "Retry Failed" button in Documents page toolbar
- Button disabled when pipeline is busy to prevent duplicate operations
- Complete i18n support (English and Chinese translations)
This feature provides a convenient way to recover from processing
failures without requiring a full filesystem rescan.
- Add get_doc_by_file_path to all storages
- Skip processed files in scan operation
- Check duplicates in upload endpoints
- Check duplicates in text insert APIs
- Return status info in duplicate responses
• Limit history to 1000 latest messages
• Add truncation message when needed
• Show count of truncated messages
• Update API documentation
• Prevent memory issues with large logs
- Reset documents with PROCESSING/FAILED status to PENDING when they pass consistency checks
- Update doc_status storage and clear error messages/metadata on reset
• Replace pyuca with centralized utils function
• Add pinyin sort keys for file paths
• Update MongoDB indexes with zh collation
• Migrate existing indexes for compatibility
• Support Chinese chars in Redis/JSON storage
• Keep PostgreSQL sorting order controled by Database Collate order
- Add apipeline_enqueue_error_documents function to LightRAG class for recording file processing errors in doc_status storage
- Enhance pipeline_enqueue_file with detailed error handling for all file processing stages:
* File access errors (permissions, not found)
* UTF-8 encoding errors
* Format-specific processing errors (PDF, DOCX, PPTX, XLSX)
* Content validation errors
* Unsupported file type errors
This implementation ensures all file extraction failures are properly tracked and recorded in the doc_status storage system, providing better visibility into document processing issues and enabling improved error monitoring and debugging capabilities.
- Remove optional 'modes' parameter from aclear_cache() and clear_cache() methods
- Replace deprecated drop_cache_by_modes() with drop() method for complete cache clearing
- Update API endpoint to ignore mode-specific parameters and clear all cache
- Simplify frontend clearCache() function to send empty request body
This change ensures all LLM cache is cleared together.
- Add file_path sorting support to all database backends (JSON, Redis, PostgreSQL, MongoDB)
- Implement smart column header switching between "ID" and "File Name" based on display mode
- Add automatic sort field switching when toggling between ID and file name display
- Create composite indexes for workspace+file_path in PostgreSQL and MongoDB for better query performance
- Update frontend to maintain sort state when switching display modes
- Add internationalization support for "fileName" in English and Chinese locales
This enhancement improves user experience by providing intuitive file-based sorting
while maintaining performance through optimized database indexes.
- Add pagination support to BaseDocStatusStorage interface and all implementations (PostgreSQL, MongoDB, Redis, JSON)
- Implement RESTful API endpoints for paginated document queries and status counts
- Create reusable pagination UI components with internationalization support
- Optimize performance with database-level pagination and efficient in-memory processing
- Maintain backward compatibility while adding configurable page sizes (10-200 items)