Commit graph

1294 commits

Author SHA1 Message Date
clssck
8d099fc3ac chore: sync with upstream HKUDS/LightRAG
- Add KaTeX extensions (mhchem for chemistry, copy-tex for copying)
- Add CASCADE to AGE extension for PostgreSQL
- Remove future dependency, replace passlib with bcrypt
- Fix Jina embedding configuration and provider defaults
- Update gunicorn help text and bump API version to 0258
- Documentation and README updates
2025-12-01 21:30:19 +01:00
clssck
663ada943a chore: add citation system and enhance RAG UI components
Add citation tracking and display system across backend and frontend components.
Backend changes include citation.py for document attribution, enhanced query routes
with citation metadata, improved prompt templates, and PostgreSQL schema updates.
Frontend includes CitationMarker component, HoverCard UI, QuerySettings refinements,
and ChatMessage enhancements for displaying document sources. Update dependencies
and docker-compose test configuration for improved development workflow.
2025-12-01 17:50:00 +01:00
clssck
43af31f888 feat: add db_degree visibility and orphan connection UI
Graph Connectivity Awareness:
- Add db_degree property to all KG implementations (NetworkX, Postgres, Neo4j, Mongo, Memgraph)
- Show database degree vs visual degree in node panel with amber badge
- Add visual indicator (amber border) for nodes with hidden connections
- Add "Load X hidden connection(s)" button to expand hidden neighbors
- Add configurable "Expand Depth" setting (1-5) in graph settings
- Use global maxNodes setting for node expansion consistency

Orphan Connection UI:
- Add OrphanConnectionDialog component for manual orphan entity connection
- Add OrphanConnectionControl button in graph sidebar
- Expose /graph/orphans/connect API endpoint for frontend use

Backend Improvements:
- Add get_orphan_entities() and connect_orphan_entities() to base storage
- Add orphan connection configuration parameters
- Improve entity extraction with relationship density requirements

Frontend:
- Add graphExpandDepth and graphIncludeOrphans to settings store
- Add min_degree and include_orphans graph filtering parameters
- Update translations (en.json, zh.json)
2025-11-29 21:08:07 +01:00
clssck
d2c9e6e2ec test(lightrag): add orphan connection feature with quality validation tests
Implement automatic orphan entity connection system that identifies entities with
no relationships and creates meaningful connections via vector similarity + LLM
validation. This improves knowledge graph connectivity and retrieval quality.
Changes:
- Add orphan connection configuration parameters (thresholds, cross-connect settings)
- Implement aconnect_orphan_entities() method with 4-step validation pipeline
- Add SQL templates for efficient orphan and candidate entity queries
- Create POST /graph/orphans/connect API endpoint with configurable parameters
- Add orphan connection validation prompt for LLM-based relationship verification
- Include relationship density requirement in extraction prompts to prevent orphans
- Update docker-compose.test.yml with optimized extraction parameters
- Add quality validation test suite (run_quality_tests.py) for retrieval evaluation
- Add unit test framework (test_orphan_connection_quality.py) with test cases
- Enable auto-run of orphan connection after document processing
2025-11-28 18:23:30 +01:00
clssck
b6074b9a81 chore(lightrag, lightrag_webui): improve code quality and security
- Extract PostgreSQL storage check into named variable for clarity
- Move APIRouter initialization into create_table_routes function scope
- Add robust type handling for database query results
- Add input validation for table names and pagination parameters
- Add regex-based SQL injection prevention for table name sanitization
- Improve clipboard copy fallback logic and error handling
- Add memoization for JSON serialization to prevent unnecessary recalculations
- Hide meta column from table explorer UI display
- Sort table columns alphabetically for consistent ordering
- Add keyboard accessibility to status filter buttons
- Add preprocessed status filter to document manager
- Update @tanstack/react-query from 5.60.0 to 5.87.1
- Extract dev storage config into constant to reduce duplication
- Update documentation comments for clarity
2025-11-27 21:39:42 +01:00
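The table-name and pagination validation described in the commit above could look roughly like the following. This is a minimal sketch, not the project's code: `validate_table_name`, `validate_pagination`, the identifier pattern, and the `max_page_size` default are all hypothetical.

```python
import re

# Assumption: only plain SQL identifiers are accepted as table names.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_table_name(name):
    """Return the name unchanged if it is a plain identifier, else raise."""
    if not _TABLE_NAME.fullmatch(name):
        raise ValueError(f"invalid table name: {name!r}")
    return name


def validate_pagination(page, page_size, max_page_size=100):
    """Return (offset, limit) for a 1-based page number, else raise."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= page_size <= max_page_size:
        raise ValueError(f"page_size must be between 1 and {max_page_size}")
    return (page - 1) * page_size, page_size
```

Rejecting anything that is not a plain identifier is stricter than escaping, which is the safer default for a value that gets interpolated into a query string.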
clssck
a9edadef45 feat: add Table Explorer feature with dynamic table data fetching and schema display
- Implemented Table Explorer component to allow users to select and view database tables.
- Added API calls for fetching table list, schema, and paginated data.
- Introduced row detail modal for displaying and copying row data.
- Enhanced DataTable component to support row click events.
- Updated UI components for better user experience and accessibility.
- Added mock data for development mode to facilitate testing.
- Updated localization files to include new terms related to tables.
- Modified settings store to include storage configuration for conditional UI rendering.
- Improved styling and layout for various components to align with new design standards.
2025-11-27 18:27:14 +01:00
clssck
48c7732edc feat: add automatic entity resolution with 3-layer matching
Implement automatic entity resolution to prevent duplicate nodes in the
knowledge graph. The system uses a 3-layer approach:

1. Case-insensitive exact matching (free, instant)
2. Fuzzy string matching >85% threshold (free, instant)
3. Vector similarity + LLM verification (for acronyms/synonyms)

Key features:
- Pre-resolution phase prevents race conditions in parallel processing
- Numeric suffix detection blocks false matches (IL-4 ≠ IL-13)
- PostgreSQL alias cache for fast lookups on subsequent ingestion
- Configurable thresholds via environment variables

Bug fixes included:
- Fix fuzzy matching false positives for numbered entities
- Fix alias cache not being populated (missing db parameter)
- Skip entity_aliases table from generic id index creation

New files:
- lightrag/entity_resolution/ - Core resolution module
- tests/test_entity_resolution/ - Unit tests
- docker/postgres-age-vector/ - Custom PG image with pgvector + AGE
- docker-compose.test.yml - Integration test environment

Configuration (env.example):
- ENTITY_RESOLUTION_ENABLED=true
- ENTITY_RESOLUTION_FUZZY_THRESHOLD=0.85
- ENTITY_RESOLUTION_VECTOR_THRESHOLD=0.5
- ENTITY_RESOLUTION_MAX_CANDIDATES=3
2025-11-27 15:35:02 +01:00
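The first two layers described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the code in lightrag/entity_resolution/: the commit does not name the fuzzy-matching library, so `difflib.SequenceMatcher` stands in for it, and `numeric_suffix` and `resolve_entity` are hypothetical names. Layer 3 (vector similarity + LLM verification) is left out.

```python
import re
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # mirrors ENTITY_RESOLUTION_FUZZY_THRESHOLD

_TRAILING_NUMBER = re.compile(r"(\d+)\s*$")


def numeric_suffix(name):
    """Return the trailing number in a name, or None if there is none."""
    match = _TRAILING_NUMBER.search(name)
    return match.group(1) if match else None


def resolve_entity(name, known_entities, threshold=FUZZY_THRESHOLD):
    """Hypothetical helper: return the matching known entity, or None."""
    lowered = name.casefold()
    # Layer 1: case-insensitive exact match.
    for candidate in known_entities:
        if candidate.casefold() == lowered:
            return candidate
    # Layer 2: fuzzy match, skipping candidates whose numeric suffix differs.
    best, best_score = None, 0.0
    for candidate in known_entities:
        if numeric_suffix(candidate) != numeric_suffix(name):
            continue
        score = SequenceMatcher(None, lowered, candidate.casefold()).ratio()
        if score > threshold and score > best_score:
            best, best_score = candidate, score
    # Layer 3 (vector similarity + LLM verification) is out of scope here.
    return best
```

In this sketch, `resolve_entity("Protein 1", ["Protein 2"])` returns `None` even though the two strings are similar enough to pass the threshold, because the numeric suffixes differ.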
yangdx
48b67d3077 Handle missing WebUI assets gracefully without blocking server startup
- Change build check from error to warning
- Redirect to /docs when WebUI unavailable
- Add webui_available to health endpoint
- Only mount /webui if assets exist
- Return status tuple from build check
2025-11-25 02:51:55 +08:00
yangdx
b7de694f48 Add comprehensive error logging across API routes
- Add error logs to Ollama API endpoints
- Replace logging with unified logger
- Log streaming query errors
- Add data query error logging
- Include stack traces for debugging
2025-11-19 22:50:06 +08:00
yangdx
0fb2925c6a Remove ascii_colors dependency and fix stream handling errors
• Remove ascii_colors.trace_exception calls
• Add SafeStreamHandler for closed streams
• Patch ascii_colors console handler
• Prevent ValueError on stream close
• Improve logging error handling
2025-11-19 21:38:17 +08:00
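A handler that tolerates closed streams, as described in the commit above, might look like this. The class name comes from the commit message, but the body is an assumption about how such a handler could work, not the project's implementation.

```python
import logging


class SafeStreamHandler(logging.StreamHandler):
    """StreamHandler that ignores writes and flushes on a closed stream."""

    def emit(self, record):
        # Drop the record if the underlying stream has already been closed.
        if getattr(self.stream, "closed", False):
            return
        super().emit(record)

    def flush(self):
        try:
            super().flush()
        except ValueError:
            # Raised by file-like objects when flushing after close().
            pass
```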
yangdx
95cd0ece74 Fix DOCX table extraction by escaping special characters in cells
- Add escape_cell() function
- Escape backslashes first
- Handle tabs and newlines
- Preserve tab-delimited format
- Prevent double-escaping issues
2025-11-19 09:54:35 +08:00
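The escaping order described in the commit above matters: backslashes must be escaped before tabs and newlines, otherwise the backslashes introduced by the later replacements would be escaped a second time. A minimal sketch follows. The function name comes from the commit message, the body is an assumption, and the carriage-return handling is an addition not mentioned in the commit.

```python
def escape_cell(text):
    """Escape characters that would break a tab-delimited row."""
    # Backslashes go first so the escapes added below are not escaped again.
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\r", "\\r")  # assumption: carriage returns are handled too
        .replace("\n", "\\n")
    )
```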
yangdx
87de2b3e9e Update XLSX extraction documentation to reflect current implementation
2025-11-19 04:26:41 +08:00
yangdx
0244699d81 Optimize XLSX extraction by using sheet.max_column instead of two-pass scan
• Remove two-pass row scanning approach
• Use built-in sheet.max_column property
• Simplify column width detection logic
• Improve memory efficiency
• Maintain column alignment preservation
2025-11-19 04:02:39 +08:00
yangdx
2b16016312 Optimize XLSX extraction to avoid storing all rows in memory
• Remove intermediate row storage
• Use iterator twice instead of list()
• Preserve column alignment logic
• Reduce memory footprint
• Maintain same output format
2025-11-19 03:48:36 +08:00
yangdx
ef659a1e09 Preserve column alignment in XLSX extraction with two-pass processing
• Two-pass approach for consistent width
• Maintain tabular structure integrity
• Determine max columns first pass
• Extract with alignment second pass
• Prevent column misalignment issues
2025-11-19 03:34:22 +08:00
yangdx
3efb1716b4 Enhance XLSX extraction with structured tab-delimited format and escaping
- Add clear sheet separators
- Escape special characters
- Trim trailing empty columns
- Preserve row structure
- Single-pass optimization
2025-11-19 03:06:29 +08:00
yangdx
e7d2803a65 Remove text stripping in DOCX extraction to preserve whitespace
• Keep original paragraph spacing
• Preserve cell whitespace in tables
• Maintain document formatting
• Don't strip leading/trailing spaces
2025-11-19 02:12:27 +08:00
yangdx
186c8f0e16 Preserve blank paragraphs in DOCX extraction to maintain spacing
• Remove text emptiness check
• Always append paragraph text
• Maintain document formatting
• Preserve original spacing
2025-11-19 02:03:10 +08:00
yangdx
fa887d811b Fix table column structure preservation in DOCX extraction
• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
2025-11-19 01:52:02 +08:00
yangdx
4438ba41a3 Enhance DOCX extraction to preserve document order with tables
• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells
• Process all body elements sequentially
2025-11-19 01:31:33 +08:00
yangdx
d16c7840ab Bump API version to 0256
2025-11-18 23:15:31 +08:00
yangdx
702cfd2981 Fix document deletion concurrency control and validation logic
• Clarify job naming for single vs batch deletion
• Update job name validation in busy pipeline check
2025-11-18 13:59:24 +08:00
yangdx
1745b30a5f Fix missing workspace parameter in update flags status call
2025-11-18 12:55:48 +08:00
yangdx
ddc76f0c80 Merge branch 'main' into workspace-isolation
2025-11-17 17:08:07 +08:00
yangdx
9262f66d13 Bump API version to 0255
2025-11-17 17:07:18 +08:00
yangdx
e22ac52ebc Auto-initialize pipeline status in LightRAG.initialize_storages()
• Remove manual initialize_pipeline_status calls
• Auto-init in initialize_storages method
• Update error messages for clarity
• Warn on workspace conflicts
2025-11-17 12:54:33 +08:00
yangdx
52c812b9a0 Fix workspace isolation for pipeline status across all operations
- Fix final_namespace error in get_namespace_data()
- Fix get_workspace_from_request return type
- Add workspace param to pipeline status calls
2025-11-17 12:54:33 +08:00
yangdx
926960e957 Refactor workspace handling to use default workspace and namespace locks
- Remove DB-specific workspace configs
- Add default workspace auto-setting
- Replace global locks with namespace locks
- Simplify pipeline status management
- Remove redundant graph DB locking
2025-11-17 12:54:33 +08:00
yangdx
ec05d89c2a Add macOS fork safety check for Gunicorn multi-worker mode
• Check OBJC_DISABLE_INITIALIZE_FORK_SAFETY
• Prevent NumPy/Accelerate crashes
• Show detailed error message
• Provide multiple fix options
• Exit early if misconfigured
2025-11-17 12:54:33 +08:00
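The startup check described in the commit above can be sketched as a pure function that returns an error message when the configuration is unsafe. The environment variable name comes from the commit. The function name, the `"YES"` value test, and the message wording are assumptions.

```python
import os
import sys


def fork_safety_error(workers, platform=sys.platform, environ=os.environ):
    """Return an error message if multi-worker mode is unsafe, else None."""
    # The check only applies to macOS with more than one worker process.
    if platform != "darwin" or workers <= 1:
        return None
    if environ.get("OBJC_DISABLE_INITIALIZE_FORK_SAFETY") == "YES":
        return None
    return (
        "Multi-worker mode on macOS requires "
        "OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES to be set before startup, "
        "or run with a single worker."
    )
```

A caller would print the message and exit early when the result is not `None`, which matches the "exit early if misconfigured" behaviour listed in the commit.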
yangdx
e5addf4d94 Improve embedding config priority and add debug logging
• Fix embedding_dim priority logic
• Add final config logging
2025-11-17 12:54:32 +08:00
yangdx
6b2af2b579 Refactor embedding function creation with proper attribute inheritance
- Extract max_token_size from providers
- Avoid double-wrapping EmbeddingFunc
- Improve configuration priority logic
- Add comprehensive debug logging
- Return complete EmbeddingFunc instance
2025-11-17 12:54:32 +08:00
yangdx
14a6c24ed7 Add configurable embedding token limit with validation
- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
2025-11-17 12:54:32 +08:00
yangdx
2f2f35b883 Add macOS compatibility check for DOCLING with multi-worker Gunicorn
2025-11-17 12:54:32 +08:00
yangdx
c246eff725 Improve docling integration with macOS compatibility and CLI flag
- Add --docling CLI flag for easier setup
- Add numpy version constraints
- Exclude docling on macOS (fork-safety)
2025-11-17 12:54:32 +08:00
yangdx
7b7f93d77c Implement lazy configuration initialization for API server
• Add lazy config initialization
• Maintain backward compatibility
• Support programmatic usage
• Add gunicorn dependency
• Explicit config in entry points
2025-11-17 12:54:32 +08:00
yangdx
69a0b74ce7 refactor: move document deps to api group, remove dynamic imports
- Merge offline-docs into api extras
- Remove pipmaster dynamic installs
- Add async document processing
- Pre-check docling availability
- Update offline deployment docs
2025-11-17 12:54:32 +08:00
yangdx
93a3e47134 Remove deprecated response_type parameter from query settings
- Bump API version to 0254
- Remove response format UI controls
- Hard-code response_type in query params
- Add migration for version 19
- Clean up settings store structure
2025-11-17 12:54:32 +08:00
yangdx
c434879c7a Replace PyPDF2 with pypdf for PDF processing
- Update import from PyPDF2 to pypdf
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library
2025-11-17 12:54:32 +08:00
BukeLy
18a4870229 fix: Add default workspace support for backward compatibility
Fixes two compatibility issues in workspace isolation:

1. Problem: lightrag_server.py calls initialize_pipeline_status()
   without workspace parameter, causing pipeline to initialize in
   global namespace instead of rag's workspace.

   Solution: Add set_default_workspace() mechanism in shared_storage.
   LightRAG.initialize_storages() now sets default workspace, which
   initialize_pipeline_status() uses when called without parameters.

2. Problem: /health endpoint hardcoded to use "pipeline_status",
   cannot return workspace-specific status or support frontend
   workspace selection.

   Solution: Add LIGHTRAG-WORKSPACE header support. Endpoint now
   extracts workspace from header or falls back to server default,
   returning correct workspace-specific pipeline status.

Changes:
- lightrag/kg/shared_storage.py: Add set/get_default_workspace()
- lightrag/lightrag.py: Call set_default_workspace() in initialize_storages()
- lightrag/api/lightrag_server.py: Add get_workspace_from_request() helper,
  update /health endpoint to support LIGHTRAG-WORKSPACE header

Testing:
- Backward compatibility: Old code works without modification
- Multi-instance safety: Explicit workspace passing preserved
- /health endpoint: Supports both default and header-specified workspaces

Related: #2353
2025-11-17 12:54:20 +08:00
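The header fallback described in the commit above (use the LIGHTRAG-WORKSPACE header if present, otherwise the server default) can be illustrated with a plain mapping of headers. This is a sketch only: `workspace_from_headers` is a hypothetical name, and the real `get_workspace_from_request()` helper works on a request object whose details are not shown in the commit.

```python
def workspace_from_headers(headers, default_workspace=""):
    """Return the workspace named in the headers, or the server default."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    value = normalized.get("lightrag-workspace", "").strip()
    return value or default_workspace
```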
BukeLy
eb52ec94d7 feat: Add workspace isolation support for pipeline status
Problem:
In multi-tenant scenarios, different workspaces share a single global
pipeline_status namespace, causing pipelines from different tenants to
block each other, severely impacting concurrent processing performance.

Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline
  namespaces with pattern "{workspace}:pipeline" (following GraphDB pattern)
- Added workspace parameter to initialize_pipeline_status() for per-tenant
  isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
  * lightrag.py: process_document_queue(), aremove_document()
  * document_routes.py: background_delete_documents(), clear_documents(),
    cancel_pipeline(), get_pipeline_status(), delete_documents()

Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants

Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag

Code changes: 4 files, 141 insertions(+), 28 deletions(-)

Testing: All syntax checks passed, comprehensive workspace isolation tests completed
2025-11-17 12:53:44 +08:00
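The namespace rule described in the commit above can be sketched with an in-memory registry. The "{workspace}:pipeline" pattern and the "pipeline_status" fallback come from the commit. The function names and the dictionary-based store are hypothetical stand-ins for the shared-storage code.

```python
_status_by_namespace = {}  # hypothetical stand-in for shared storage


def pipeline_namespace(workspace=""):
    """Map a workspace to its pipeline namespace."""
    # An empty workspace keeps the legacy global namespace.
    return f"{workspace}:pipeline" if workspace else "pipeline_status"


def init_status(workspace=""):
    """Create the status record for a workspace if it does not exist yet."""
    namespace = pipeline_namespace(workspace)
    return _status_by_namespace.setdefault(namespace, {"busy": False})


def get_status(workspace=""):
    """Return the status record, failing fast if it was never initialized."""
    namespace = pipeline_namespace(workspace)
    if namespace not in _status_by_namespace:
        raise KeyError(f"pipeline namespace {namespace!r} is not initialized")
    return _status_by_namespace[namespace]
```

Because each workspace gets its own record, marking one tenant's pipeline busy leaves the others untouched.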
yangdx
b5589ce4d5 Merge branch 'main' into embedding-limit
2025-11-15 01:10:34 +08:00
yangdx
4343db753a Add macOS fork safety check for Gunicorn multi-worker mode
• Check OBJC_DISABLE_INITIALIZE_FORK_SAFETY
• Prevent NumPy/Accelerate crashes
• Show detailed error message
• Provide multiple fix options
• Exit early if misconfigured
2025-11-15 00:58:23 +08:00
yangdx
5dec4deac7 Improve embedding config priority and add debug logging
• Fix embedding_dim priority logic
• Add final config logging
2025-11-14 23:22:44 +08:00
yangdx
963a0a5db1 Refactor embedding function creation with proper attribute inheritance
- Extract max_token_size from providers
- Avoid double-wrapping EmbeddingFunc
- Improve configuration priority logic
- Add comprehensive debug logging
- Return complete EmbeddingFunc instance
2025-11-14 22:29:08 +08:00
yangdx
ab4d7ac2b0 Add configurable embedding token limit with validation
- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
2025-11-14 19:28:36 +08:00
yangdx
cc031a3db9 Add macOS compatibility check for DOCLING with multi-worker Gunicorn
2025-11-13 19:18:04 +08:00
yangdx
a24d8181c2 Improve docling integration with macOS compatibility and CLI flag
- Add --docling CLI flag for easier setup
- Add numpy version constraints
- Exclude docling on macOS (fork-safety)
2025-11-13 18:58:09 +08:00
yangdx
746c069ab0 Implement lazy configuration initialization for API server
• Add lazy config initialization
• Maintain backward compatibility
• Support programmatic usage
• Add gunicorn dependency
• Explicit config in entry points
2025-11-13 15:28:05 +08:00
yangdx
4b31942e2a refactor: move document deps to api group, remove dynamic imports
- Merge offline-docs into api extras
- Remove pipmaster dynamic installs
- Add async document processing
- Pre-check docling availability
- Update offline deployment docs
2025-11-13 13:34:09 +08:00
yangdx
8c07c91833 Remove deprecated response_type parameter from query settings
- Bump API version to 0254
- Remove response format UI controls
- Hard-code response_type in query params
- Add migration for version 19
- Clean up settings store structure
2025-11-12 12:19:30 +08:00