Commit graph

1292 commits

Author SHA1 Message Date
yangdx
112ed234c4 Bump API version to 0258 2025-12-01 12:20:27 +08:00
yangdx
ea8d55ab42 Add documentation for embedding provider configuration rules 2025-11-28 17:49:30 +08:00
yangdx
4ab4a7ac94 Allow embedding models to use provider defaults when unspecified
- Set EMBEDDING_MODEL default to None
- Pass model param only when provided
- Let providers use their own defaults
- Fix lollms embed function params
- Add ollama embed_model default param
2025-11-28 16:57:33 +08:00
yangdx
881b8d3a50 Bump API version to 0257 2025-11-28 15:39:55 +08:00
yangdx
56e0365cf0 Add configurable model parameter to jina_embed function
- Add model parameter to jina_embed
- Pass model from API server
- Default to jina-embeddings-v4
- Update function documentation
- Make model selection flexible
2025-11-28 15:38:29 +08:00
yangdx
48b67d3077 Handle missing WebUI assets gracefully without blocking server startup
- Change build check from error to warning
- Redirect to /docs when WebUI unavailable
- Add webui_available to health endpoint
- Only mount /webui if assets exist
- Return status tuple from build check
2025-11-25 02:51:55 +08:00
yangdx
b7de694f48 Add comprehensive error logging across API routes
- Add error logs to Ollama API endpoints
- Replace logging with unified logger
- Log streaming query errors
- Add data query error logging
- Include stack traces for debugging
2025-11-19 22:50:06 +08:00
yangdx
0fb2925c6a Remove ascii_colors dependency and fix stream handling errors
• Remove ascii_colors.trace_exception calls
• Add SafeStreamHandler for closed streams
• Patch ascii_colors console handler
• Prevent ValueError on stream close
• Improve logging error handling
2025-11-19 21:38:17 +08:00
yangdx
95cd0ece74 Fix DOCX table extraction by escaping special characters in cells
- Add escape_cell() function
- Escape backslashes first
- Handle tabs and newlines
- Preserve tab-delimited format
- Prevent double-escaping issues
2025-11-19 09:54:35 +08:00
yangdx
87de2b3e9e Update XLSX extraction documentation to reflect current implementation 2025-11-19 04:26:41 +08:00
yangdx
0244699d81 Optimize XLSX extraction by using sheet.max_column instead of two-pass scan
• Remove two-pass row scanning approach
• Use built-in sheet.max_column property
• Simplify column width detection logic
• Improve memory efficiency
• Maintain column alignment preservation
2025-11-19 04:02:39 +08:00
yangdx
2b16016312 Optimize XLSX extraction to avoid storing all rows in memory
• Remove intermediate row storage
• Use iterator twice instead of list()
• Preserve column alignment logic
• Reduce memory footprint
• Maintain same output format
2025-11-19 03:48:36 +08:00
yangdx
ef659a1e09 Preserve column alignment in XLSX extraction with two-pass processing
• Two-pass approach for consistent width
• Maintain tabular structure integrity
• Determine max columns first pass
• Extract with alignment second pass
• Prevent column misalignment issues
2025-11-19 03:34:22 +08:00
yangdx
3efb1716b4 Enhance XLSX extraction with structured tab-delimited format and escaping
- Add clear sheet separators
- Escape special characters
- Trim trailing empty columns
- Preserve row structure
- Single-pass optimization
2025-11-19 03:06:29 +08:00
yangdx
e7d2803a65 Remove text stripping in DOCX extraction to preserve whitespace
• Keep original paragraph spacing
• Preserve cell whitespace in tables
• Maintain document formatting
• Don't strip leading/trailing spaces
2025-11-19 02:12:27 +08:00
yangdx
186c8f0e16 Preserve blank paragraphs in DOCX extraction to maintain spacing
• Remove text emptiness check
• Always append paragraph text
• Maintain document formatting
• Preserve original spacing
2025-11-19 02:03:10 +08:00
yangdx
fa887d811b Fix table column structure preservation in DOCX extraction
• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
2025-11-19 01:52:02 +08:00
yangdx
4438ba41a3 Enhance DOCX extraction to preserve document order with tables
• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells
• Process all body elements sequentially
2025-11-19 01:31:33 +08:00
yangdx
d16c7840ab Bump API version to 0256 2025-11-18 23:15:31 +08:00
yangdx
702cfd2981 Fix document deletion concurrency control and validation logic
• Clarify job naming for single vs batch deletion
• Update job name validation in busy pipeline check
2025-11-18 13:59:24 +08:00
yangdx
1745b30a5f Fix missing workspace parameter in update flags status call 2025-11-18 12:55:48 +08:00
yangdx
ddc76f0c80 Merge branch 'main' into workspace-isolation 2025-11-17 17:08:07 +08:00
yangdx
9262f66d13 Bump API version to 0255 2025-11-17 17:07:18 +08:00
yangdx
e22ac52ebc Auto-initialize pipeline status in LightRAG.initialize_storages()
• Remove manual initialize_pipeline_status calls
• Auto-init in initialize_storages method
• Update error messages for clarity
• Warn on workspace conflicts
2025-11-17 12:54:33 +08:00
yangdx
52c812b9a0 Fix workspace isolation for pipeline status across all operations
- Fix final_namespace error in get_namespace_data()
- Fix get_workspace_from_request return type
- Add workspace param to pipeline status calls
2025-11-17 12:54:33 +08:00
yangdx
926960e957 Refactor workspace handling to use default workspace and namespace locks
- Remove DB-specific workspace configs
- Add default workspace auto-setting
- Replace global locks with namespace locks
- Simplify pipeline status management
- Remove redundant graph DB locking
2025-11-17 12:54:33 +08:00
yangdx
ec05d89c2a Add macOS fork safety check for Gunicorn multi-worker mode
• Check OBJC_DISABLE_INITIALIZE_FORK_SAFETY
• Prevent NumPy/Accelerate crashes
• Show detailed error message
• Provide multiple fix options
• Exit early if misconfigured
2025-11-17 12:54:33 +08:00
yangdx
e5addf4d94 Improve embedding config priority and add debug logging
• Fix embedding_dim priority logic
• Add final config logging
2025-11-17 12:54:32 +08:00
yangdx
6b2af2b579 Refactor embedding function creation with proper attribute inheritance
- Extract max_token_size from providers
- Avoid double-wrapping EmbeddingFunc
- Improve configuration priority logic
- Add comprehensive debug logging
- Return complete EmbeddingFunc instance
2025-11-17 12:54:32 +08:00
yangdx
14a6c24ed7 Add configurable embedding token limit with validation
- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
2025-11-17 12:54:32 +08:00
yangdx
2f2f35b883 Add macOS compatibility check for DOCLING with multi-worker Gunicorn 2025-11-17 12:54:32 +08:00
yangdx
c246eff725 Improve docling integration with macOS compatibility and CLI flag
- Add --docling CLI flag for easier setup
- Add numpy version constraints
- Exclude docling on macOS (fork-safety)
2025-11-17 12:54:32 +08:00
yangdx
7b7f93d77c Implement lazy configuration initialization for API server
• Add lazy config initialization
• Maintain backward compatibility
• Support programmatic usage
• Add gunicorn dependency
• Explicit config in entry points
2025-11-17 12:54:32 +08:00
yangdx
69a0b74ce7 refactor: move document deps to api group, remove dynamic imports
- Merge offline-docs into api extras
- Remove pipmaster dynamic installs
- Add async document processing
- Pre-check docling availability
- Update offline deployment docs
2025-11-17 12:54:32 +08:00
yangdx
93a3e47134 Remove deprecated response_type parameter from query settings
- Bump API version to 0254
- Remove response format UI controls
- Hard-code response_type in query params
- Add migration for version 19
- Clean up settings store structure
2025-11-17 12:54:32 +08:00
yangdx
c434879c7a Replace PyPDF2 with pypdf for PDF processing
- Update import from PyPDF2 to pypdf
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library
2025-11-17 12:54:32 +08:00
BukeLy
18a4870229 fix: Add default workspace support for backward compatibility
Fixes two compatibility issues in workspace isolation:

1. Problem: lightrag_server.py calls initialize_pipeline_status()
   without workspace parameter, causing pipeline to initialize in
   global namespace instead of rag's workspace.

   Solution: Add set_default_workspace() mechanism in shared_storage.
   LightRAG.initialize_storages() now sets default workspace, which
   initialize_pipeline_status() uses when called without parameters.

2. Problem: /health endpoint hardcoded to use "pipeline_status",
   cannot return workspace-specific status or support frontend
   workspace selection.

   Solution: Add LIGHTRAG-WORKSPACE header support. Endpoint now
   extracts workspace from header or falls back to server default,
   returning correct workspace-specific pipeline status.

Changes:
- lightrag/kg/shared_storage.py: Add set/get_default_workspace()
- lightrag/lightrag.py: Call set_default_workspace() in initialize_storages()
- lightrag/api/lightrag_server.py: Add get_workspace_from_request() helper,
  update /health endpoint to support LIGHTRAG-WORKSPACE header

Testing:
- Backward compatibility: Old code works without modification
- Multi-instance safety: Explicit workspace passing preserved
- /health endpoint: Supports both default and header-specified workspaces

Related: #2353
2025-11-17 12:54:20 +08:00
BukeLy
eb52ec94d7 feat: Add workspace isolation support for pipeline status
Problem:
In multi-tenant scenarios, different workspaces share a single global
pipeline_status namespace, causing pipelines from different tenants to
block each other, severely impacting concurrent processing performance.

Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline
  namespaces with pattern "{workspace}:pipeline" (following GraphDB pattern)
- Added workspace parameter to initialize_pipeline_status() for per-tenant
  isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
  * lightrag.py: process_document_queue(), aremove_document()
  * document_routes.py: background_delete_documents(), clear_documents(),
    cancel_pipeline(), get_pipeline_status(), delete_documents()

Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants

Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag

Code changes: 4 files, 141 insertions(+), 28 deletions(-)

Testing: All syntax checks passed, comprehensive workspace isolation tests completed
2025-11-17 12:53:44 +08:00
yangdx
b5589ce4d5 Merge branch 'main' into embedding-limit 2025-11-15 01:10:34 +08:00
yangdx
4343db753a Add macOS fork safety check for Gunicorn multi-worker mode
• Check OBJC_DISABLE_INITIALIZE_FORK_SAFETY
• Prevent NumPy/Accelerate crashes
• Show detailed error message
• Provide multiple fix options
• Exit early if misconfigured
2025-11-15 00:58:23 +08:00
yangdx
5dec4deac7 Improve embedding config priority and add debug logging
• Fix embedding_dim priority logic
• Add final config logging
2025-11-14 23:22:44 +08:00
yangdx
963a0a5db1 Refactor embedding function creation with proper attribute inheritance
- Extract max_token_size from providers
- Avoid double-wrapping EmbeddingFunc
- Improve configuration priority logic
- Add comprehensive debug logging
- Return complete EmbeddingFunc instance
2025-11-14 22:29:08 +08:00
yangdx
ab4d7ac2b0 Add configurable embedding token limit with validation
- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
2025-11-14 19:28:36 +08:00
yangdx
cc031a3db9 Add macOS compatibility check for DOCLING with multi-worker Gunicorn 2025-11-13 19:18:04 +08:00
yangdx
a24d8181c2 Improve docling integration with macOS compatibility and CLI flag
- Add --docling CLI flag for easier setup
- Add numpy version constraints
- Exclude docling on macOS (fork-safety)
2025-11-13 18:58:09 +08:00
yangdx
746c069ab0 Implement lazy configuration initialization for API server
• Add lazy config initialization
• Maintain backward compatibility
• Support programmatic usage
• Add gunicorn dependency
• Explicit config in entry points
2025-11-13 15:28:05 +08:00
yangdx
4b31942e2a refactor: move document deps to api group, remove dynamic imports
- Merge offline-docs into api extras
- Remove pipmaster dynamic installs
- Add async document processing
- Pre-check docling availability
- Update offline deployment docs
2025-11-13 13:34:09 +08:00
yangdx
8c07c91833 Remove deprecated response_type parameter from query settings
- Bump API version to 0254
- Remove response format UI controls
- Hard-code response_type in query params
- Add migration for version 19
- Clean up settings store structure
2025-11-12 12:19:30 +08:00
yangdx
fdcb4d0b6d Replace PyPDF2 with pypdf for PDF processing
- Update import from PyPDF2 to pypdf
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library
2025-11-11 01:38:09 +08:00
yangdx
1f9d0735c3 Bump API version to 0253 2025-11-09 14:42:22 +08:00