Commit graph

3776 commits

Author SHA1 Message Date
Bukely_
4e5351de63
Merge cf68cdfe3a into 9562a974d2 2025-12-12 10:18:32 +08:00
yangdx
9562a974d2 Bump API version to 0260 2025-12-12 10:13:34 +08:00
yangdx
f254ccae8e Merge branch 'fix/missing-langue-in-keywords-extraction-prompt' 2025-12-12 10:11:56 +08:00
yangdx
834778eb01 Reorganize entity extraction prompt for better clarity
- Move instructions before data section
- Update task description wording
2025-12-12 06:12:47 +08:00
José Carlos
d8cd48f43b fix(cache): include language parameter in keywords extraction cache key 2025-12-11 12:43:42 -03:00
yangdx
4fd3914194 Merge branch 'main' into fix/openai-prompt-caching-optimization-2355 2025-12-11 19:12:42 +08:00
yangdx
294f75438e Restructure entity extraction prompt format for consistency
• Move entity_types to user prompt
• Add XML-style formatting tags
• Update examples with entity_types
2025-12-11 19:12:34 +08:00
José Carlos
fd2ff358bf fix(prompt): use language parameter in keywords_extraction prompt
The language parameter was being passed to the prompt but was not
being used. Added {language} placeholder to ensure keywords are
extracted in the configured language.
2025-12-06 09:50:12 -03:00
yangdx
9009abed3e Fix top_n behavior with chunking to limit documents not chunks
- Disable API-level top_n when chunking
- Apply top_n to aggregated documents
- Add comprehensive test coverage
2025-12-03 13:08:26 +08:00
yangdx
561ba4e4b5 Fix trailing whitespace and update test mocking for rerank module
• Remove trailing whitespace
• Fix TiktokenTokenizer import patch
• Add async context manager mocks
• Update aiohttp.ClientSession patch
• Improve test reliability
2025-12-03 12:40:48 +08:00
yangdx
8e50eef58b Merge branch 'main' into cohere-rerank 2025-12-02 22:19:37 +08:00
yangdx
19c16bc464 Add content deduplication check for document insertion endpoints
• Check content hash before insertion
• Return duplicated status if exists
• Use sanitized text for hash computation
• Apply to both single and batch inserts
• Prevent duplicate content processing
2025-12-02 17:49:48 +08:00
yangdx
8d28b95966 Fix duplicate document responses to return original track_id
- Return existing track_id for duplicates
- Remove track_id generation in reprocess
- Update reprocess response documentation
- Clarify track_id behavior in comments
- Update API response examples
2025-12-02 14:32:28 +08:00
yangdx
381ddfffd4 Bump API version to 0259 2025-12-02 13:27:02 +08:00
yangdx
2ecf77efe2 Update help text to use correct gunicorn command with workers flag 2025-12-02 02:52:31 +08:00
yangdx
d6019c82af Add CASCADE to AGE extension creation in PostgreSQL implementation
- Add CASCADE option to CREATE EXTENSION
- Ensure dependencies are installed
- Fix potential AGE setup issues
2025-12-02 00:17:41 +08:00
yangdx
112ed234c4 Bump API version to 0258 2025-12-01 12:20:27 +08:00
yangdx
ea8d55ab42 Add documentation for embedding provider configuration rules 2025-11-28 17:49:30 +08:00
yangdx
4ab4a7ac94 Allow embedding models to use provider defaults when unspecified
- Set EMBEDDING_MODEL default to None
- Pass model param only when provided
- Let providers use their own defaults
- Fix lollms embed function params
- Add ollama embed_model default param
2025-11-28 16:57:33 +08:00
yangdx
881b8d3a50 Bump API version to 0257 2025-11-28 15:39:55 +08:00
yangdx
56e0365cf0 Add configurable model parameter to jina_embed function
- Add model parameter to jina_embed
- Pass model from API server
- Default to jina-embeddings-v4
- Update function documentation
- Make model selection flexible
2025-11-28 15:38:29 +08:00
yangdx
6e2946e78a Add max_token_size parameter to azure_openai_embed wrapper 2025-11-28 13:41:01 +08:00
yangdx
4f12fe121d Change entity extraction logging from warning to info level
• Reduce log noise for empty entities
2025-11-27 11:00:34 +08:00
Ghazi-raad
4e8e08cf4d
Update lightrag/operate.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-26 23:18:20 +00:00
Ghazi-raad
56677ae466
Update lightrag/prompt.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-26 23:18:12 +00:00
Ghazi-raad
207af40f54 Optimize for OpenAI Prompt Caching: Restructure entity extraction prompts
- Remove input_text from entity_extraction_system_prompt to enable caching
- Move input_text to entity_extraction_user_prompt for per-chunk variability
- Update operate.py to format system prompt once without input_text
- Format user prompts with input_text for each chunk

This enables OpenAI's automatic prompt caching (50% discount on cached tokens):
- ~1300 token system message cached and reused for ALL chunks
- Only ~150 token user message varies per chunk
- Expected 45% cost reduction on prompt tokens during indexing
- 2-3x faster response times from cached prompts

Fixes #2355
2025-11-26 21:56:25 +00:00
palanisd
a898f0548d
Merge branch 'HKUDS:main' into cohere-rerank 2025-11-25 14:21:43 -05:00
BukeLy
cf68cdfe3a refactor: improve PostgreSQL migration code quality
Why this change is needed:
1. Added clarifying comments to _pg_migrate_workspace_data() parameter handling
2. Removed dead code from PGDocStatusStorage.initialize() that was never executed

Changes:

1. PostgreSQL Migration Parameter Documentation (lightrag/kg/postgres_impl.py:2240-2241):
   - Added comments explaining dict rebuild for correct value ordering
   - Clarifies that Python 3.7+ dict insertion order is relied upon
   - Documents that execute() converts dict to tuple via .values()

2. Dead Code Removal (lightrag/kg/postgres_impl.py:3061-3062):
   - Removed unreachable table creation code from PGDocStatusStorage.initialize()
   - Table is already created by PostgreSQLDB.initdb() during initialization
   - This code path was never executed as table always exists before initialize() is called
   - Added NOTE comment explaining where table creation actually happens

Impact:
- No functional changes - only code clarification and cleanup
- Reduces maintenance burden by removing unreachable code
- Improves code readability with better documentation

Testing:
- All 14 PostgreSQL migration tests pass
- All 5 UnifiedLock safety tests pass
- Pre-commit checks pass (ruff-format, ruff)
2025-11-26 02:24:51 +08:00
BukeLy
a8f5c9bd33 fix: migrate workspace data in PostgreSQL Case 1 to prevent data loss
Why this change is needed:
In multi-tenant deployments, when workspace A migrates first (creating
the new model-suffixed table), subsequent workspace B initialization
enters Case 1 (both tables exist). The original Case 1 logic only
checked if the legacy table was empty globally, without checking if
the current workspace had unmigrated data. This caused workspace B's
data to remain in the legacy table while the application queried the
new table, resulting in data loss for workspace B.

How it solves the problem:
1. Extracted migration logic into _pg_migrate_workspace_data() helper
   function to avoid code duplication
2. Modified Case 1 to check if current workspace has data in legacy
   table and migrate it if found
3. Both Case 1 and Case 4 now use the same migration helper, ensuring
   consistent behavior
4. After migration, only delete the current workspace's data from
   legacy table, preserving other workspaces' data

Impact:
- Prevents data loss in multi-tenant PostgreSQL deployments
- Maintains backward compatibility with single-tenant setups
- Reduces code duplication between Case 1 and Case 4

Testing:
All PostgreSQL migration tests pass (8/8)
2025-11-26 01:16:57 +08:00
yangdx
93d445dfdd Add pipeline status lock function for legacy compatibility
- Add get_pipeline_status_lock function
- Return NamespaceLock for consistency
- Support workspace parameter
- Enable logging option
- Legacy code compatibility
2025-11-25 18:24:39 +08:00
EightyOliveira
8994c70f2f fix:exception handling order error 2025-11-25 16:36:41 +08:00
yangdx
48b67d3077 Handle missing WebUI assets gracefully without blocking server startup
- Change build check from error to warning
- Redirect to /docs when WebUI unavailable
- Add webui_available to health endpoint
- Only mount /webui if assets exist
- Return status tuple from build check
2025-11-25 02:51:55 +08:00
yangdx
8c4d7a00ad Refactor: Extract retry decorator to reduce code duplication in Neo4J storage
• Define READ_RETRY_EXCEPTIONS constant
• Create reusable READ_RETRY decorator
• Replace 11 duplicate retry decorators
• Improve code maintainability
• Add missing retry to edge_degrees_batch
2025-11-25 01:35:21 +08:00
yangdx
7aaa51cda9 Add retry decorators to Neo4j read operations for resilience 2025-11-24 22:28:15 +08:00
copilot-swe-agent[bot]
8835fc244a Improve edge case handling for max_tokens=1
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:43:05 +00:00
copilot-swe-agent[bot]
1d6ea0c5f7 Fix chunking infinite loop when overlap_tokens >= max_tokens
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:40:58 +00:00
BukeLy
510baebf62 fix: correct PostgreSQL execute() parameter format in workspace cleanup
Critical Bug Fix:
PostgreSQLDB.execute() expects data as dict, but workspace cleanup
was passing a list [workspace], causing cleanup to fail with
"PostgreSQLDB.execute() expects data as dict, got list" error.

Changes:
1. Fixed postgres_impl.py:2522
   - Changed: await db.execute(delete_query, [workspace])
   - To: await db.execute(delete_query, {"workspace": workspace})

2. Improved test_postgres_migration.py mock
   - Enhanced COUNT(*) mock to properly distinguish between:
     * Legacy table with workspace filter (returns 50)
     * Legacy table without filter after deletion (returns 0)
     * New table verification (returns 50)
   - Uses storage.legacy_table_name dynamically instead of hardcoded strings
   - Detects table type by checking for model suffix patterns

3. Fixed test_unified_lock_safety.py formatting
   - Applied ruff formatting to assert statement

Impact:
- Workspace-aware legacy cleanup now works correctly
- Legacy tables properly deleted when all workspace data migrated
- Legacy tables preserved when other workspace data remains

Tests: All 25 unit tests pass
2025-11-23 16:55:48 +08:00
BukeLy
16fff353d9 fix: prevent data loss in PostgreSQL migration and add doc_status table creation
This commit fixes two critical issues in PostgreSQL storage:

BUG 1: Legacy table cleanup causing data loss across workspaces
---------------------------------------------------------------
PROBLEM:
- After migrating workspace_a data from legacy table, the ENTIRE legacy
  table was deleted
- This caused workspace_b's data (still in legacy table) to be lost
- Multi-tenant data isolation was violated

FIX:
- Implement workspace-aware cleanup: only delete migrated workspace's data
- Check if other workspaces still have data before dropping table
- Only drop legacy table when it becomes completely empty
- If other workspace data exists, preserve legacy table with remaining records

Location: postgres_impl.py PGVectorStorage.setup_table() lines 2510-2567

Test verification:
- test_workspace_migration_isolation_e2e_postgres validates this fix

BUG 2: PGDocStatusStorage missing table initialization
-------------------------------------------------------
PROBLEM:
- PGDocStatusStorage.initialize() only set workspace, never created table
- Caused "relation 'lightrag_doc_status' does not exist" errors
- document insertion (ainsert) failed immediately

FIX:
- Add table creation to initialize() method using _pg_create_table()
- Consistent with other storage implementations:
  * MongoDocStatusStorage creates collections
  * JsonDocStatusStorage creates directories
  * PGDocStatusStorage now creates tables ✓

Location: postgres_impl.py PGDocStatusStorage.initialize() lines 2965-2971

Test Results:
- Unit tests: 13/13 passed (test_unified_lock_safety,
  test_workspace_migration_isolation, test_dimension_mismatch)
- E2E tests require PostgreSQL server

Related: PR #2391 (Vector Storage Model Isolation)
2025-11-23 16:43:49 +08:00
BukeLy
204a2535c8 fix: prevent double-release in UnifiedLock.__aexit__ error recovery
Problem:
When UnifiedLock.__aexit__ encountered an exception during async_lock.release(),
the error recovery logic would incorrectly attempt to release async_lock again
because it only checked main_lock_released flag. This could cause:
- Double-release attempts on already-failed locks
- Masking of original exceptions
- Undefined behavior in lock state

Root Cause:
The recovery logic used only main_lock_released to determine whether to attempt
async_lock release, without tracking whether async_lock.release() had already
been attempted and failed.

Fix:
- Added async_lock_released flag to track async_lock release attempts
- Updated recovery logic condition to check both main_lock_released AND
  async_lock_released before attempting async_lock release
- This ensures async_lock.release() is only called once, even if it fails

Testing:
- Added test_aexit_no_double_release_on_async_lock_failure:
  Verifies async_lock.release() is called only once when it fails
- Added test_aexit_recovery_on_main_lock_failure:
  Verifies recovery logic still works when main lock fails
- All 5 UnifiedLock safety tests pass

Impact:
- Eliminates double-release bugs in multiprocess lock scenarios
- Preserves correct error propagation
- Maintains recovery logic for legitimate failure cases

Files Modified:
- lightrag/kg/shared_storage.py: Added async_lock_released tracking
- tests/test_unified_lock_safety.py: Added 2 new tests (5 total now pass)
2025-11-23 16:34:08 +08:00
BukeLy
cfc6587e04 fix: prevent race conditions and cross-workspace data leakage in migration
Why this change is needed:
Two critical P0 security vulnerabilities were identified in CursorReview:
1. UnifiedLock silently allows unprotected execution when lock is None, creating
   false security and potential race conditions in multi-process scenarios
2. PostgreSQL migration copies ALL workspace data during legacy table migration,
   violating multi-tenant isolation and causing data leakage

How it solves it:
- UnifiedLock now raises RuntimeError when lock is None instead of WARNING
- Added workspace parameter to setup_table() for proper data isolation
- Migration queries now filter by workspace in both COUNT and SELECT operations
- Added clear error messages to help developers diagnose initialization issues

Impact:
- lightrag/kg/shared_storage.py: UnifiedLock raises exception on None lock
- lightrag/kg/postgres_impl.py: Added workspace filtering to migration logic
- tests/test_unified_lock_safety.py: 3 tests for lock safety
- tests/test_workspace_migration_isolation.py: 3 tests for workspace isolation
- tests/test_dimension_mismatch.py: Updated table names and mocks
- tests/test_postgres_migration.py: Updated mocks for workspace filtering

Testing:
- All 31 tests pass (16 migration + 4 safety + 3 lock + 3 workspace + 5 dimension)
- Backward compatible: existing code continues working unchanged
- Code style verified with ruff and pre-commit hooks
2025-11-23 16:09:59 +08:00
BukeLy
f69cf9bcd6 fix: prevent vector dimension mismatch crashes and data loss on no-suffix restarts
Why this change is needed:
Two critical issues were identified in Codex review of PR #2391:
1. Migration fails when legacy collections/tables use different embedding dimensions
   (e.g., upgrading from 1536d to 3072d models causes initialization failures)
2. When model_suffix is empty (no model_name provided), table_name equals legacy_table_name,
   causing Case 1 logic to delete the only table/collection on second startup

How it solves it:
- Added dimension compatibility checks before migration in both Qdrant and PostgreSQL
- PostgreSQL uses two-method detection: pg_attribute metadata query + vector sampling fallback
- When dimensions mismatch, skip migration and create new empty table/collection, preserving legacy data
- Added safety check to detect when new and legacy names are identical, preventing deletion
- Both backends log clear warnings about dimension mismatches and skipped migrations

Impact:
- lightrag/kg/qdrant_impl.py: Added dimension check (lines 254-297) and no-suffix safety (lines 163-169)
- lightrag/kg/postgres_impl.py: Added dimension check with fallback (lines 2347-2410) and no-suffix safety (lines 2281-2287)
- tests/test_no_model_suffix_safety.py: New test file with 4 test cases covering edge scenarios
- Backward compatible: All existing scenarios continue working unchanged

Testing:
- All 20 tests pass (16 existing migration tests + 4 new safety tests)
- E2E tests enhanced with explicit verification points for dimension mismatch scenarios
- Verified graceful degradation when dimension detection fails
- Code style verified with ruff and pre-commit hooks
2025-11-23 15:44:07 +08:00
netbrah
a05bbf105e Add Cohere reranker config, chunking, and tests 2025-11-22 16:43:13 -05:00
yangdx
7b76211066 Add fallback to AZURE_OPENAI_API_VERSION for embedding API version 2025-11-22 00:14:35 +08:00
yangdx
ffd8da512e Improve Azure OpenAI compatibility and error handling
• Reduce log noise for Azure content filters
• Add default API version fallback
• Change warning to debug log level
• Handle empty choices in streaming
• Better Azure OpenAI integration
2025-11-21 23:51:18 +08:00
yangdx
fafa1791f4 Fix Azure OpenAI model parameter to use deployment name consistently
- Use deployment name for Azure API calls
- Fix model param in embed function
- Consistent api_model logic
- Prevent Azure model name conflicts
2025-11-21 23:41:52 +08:00
yangdx
ac9f2574a5 Improve Azure OpenAI wrapper functions with full parameter support
• Add missing parameters to wrappers
• Update docstrings for clarity
• Ensure API consistency
• Fix parameter forwarding
• Maintain backward compatibility
2025-11-21 19:24:32 +08:00
yangdx
45f4f82392 Refactor Azure OpenAI client creation to support client_configs merging
- Handle None client_configs case
- Merge configs with explicit params
- Override client_configs with params
- Use dict unpacking for client init
- Maintain parameter precedence
2025-11-21 19:14:16 +08:00
yangdx
0c4cba3860 Fix double decoration in azure_openai_embed and document decorator usage
• Remove redundant @retry decorator
• Call openai_embed.func directly
• Add detailed decorator documentation
• Prevent double parameter injection
• Fix EmbeddingFunc wrapping issues
2025-11-21 18:03:53 +08:00
yangdx
b46c152306 Fix linting 2025-11-21 17:16:44 +08:00
yangdx
b709f8f869 Consolidate Azure OpenAI implementation into main OpenAI module
• Unified OpenAI/Azure client creation
• Azure module now re-exports functions
• Backward compatibility maintained
• Reduced code duplication
2025-11-21 17:12:33 +08:00