Commit graph

3689 commits

Author SHA1 Message Date
BukeLy
4c12301e81 fix: correct parameter passing in delete_entity_relation
Why this change is needed:
The previous fix in commit 7dc1f83e incorrectly "fixed" delete_entity_relation
by converting the parameter dict to a list. However, PostgreSQLDB.execute()
expects a dict[str, Any] parameter, not a list. The execute() method internally
converts dict values to tuple (line 1487: tuple(data.values())), so passing
a list bypasses the expected interface and causes parameter binding issues.

What was wrong:
```python
params = {"workspace": self.workspace, "entity_name": entity_name}
await self.db.execute(delete_sql, list(params.values()))  # WRONG
```

The correct approach (matching delete_entity method):
```python
await self.db.execute(
    delete_sql, {"workspace": self.workspace, "entity_name": entity_name}
)
```

How it solves it:
- Pass parameters as a dict directly to db.execute(), matching the method signature
- Maintain consistency with delete_entity() which correctly passes a dict
- Let db.execute() handle the dict-to-tuple conversion internally as designed

Impact:
- delete_entity_relation now correctly passes parameters to PostgreSQL
- Method interface consistency with other delete operations
- Proper parameter binding ensures reliable entity relation deletion

Testing:
- All 6 PostgreSQL migration tests pass
- Verified parameter passing matches delete_entity pattern
- Code review identified the issue before production use

Related:
- Fixes incorrect "fix" from commit 7dc1f83e
- Aligns with PostgreSQLDB.execute() interface (line 1477-1480)
2025-11-19 23:31:09 +08:00
BukeLy
7dc1f83efb fix: PostgreSQL read methods and delete_entity_relation bugs
Why this change is needed:
After implementing model isolation, two critical bugs were discovered that would cause data access failures:

Bug 1: In delete_entity_relation(), the SQL query uses positional parameters
($1, $2) but the parameter dict was not converted to a list of values before
passing to db.execute(). This caused parameter binding failures when trying to
delete entity relations.

Bug 2: Four read methods (get_by_id, get_by_ids, get_vectors_by_ids, drop)
were still using namespace_to_table_name(self.namespace) to get legacy table
names instead of self.table_name with model suffix. This meant these methods
would query the wrong table (legacy without suffix) while data was being
inserted into the new table (with suffix), causing data not found errors.

How it solves it:
- Bug 1: Convert parameter dict to list using list(params.values()) before
  passing to db.execute(), matching the pattern used in other methods
- Bug 2: Replace all namespace_to_table_name(self.namespace) calls with
  self.table_name in the four affected methods, ensuring they query the
  correct model-specific table

Impact:
- delete_entity_relation now correctly deletes relations by entity name
- All read operations now correctly query model-specific tables
- Data written with model isolation can now be properly retrieved
- Maintains consistency with write operations using self.table_name

Testing:
- All 6 PostgreSQL migration tests pass (test_postgres_migration.py)
- All 6 Qdrant migration tests pass (test_qdrant_migration.py)
- Verified parameter binding works correctly
- Verified read methods access correct tables
2025-11-19 23:01:01 +08:00
BukeLy
ad68624d02 feat: PostgreSQL model isolation and auto-migration
Why this change is needed:
PostgreSQL vector storage needs model isolation to prevent dimension
conflicts when different workspaces use different embedding models.
Without this, the first workspace locks the vector dimension for all
subsequent workspaces, causing failures.

How it solves it:
- Implements dynamic table naming with model suffix: {table}_{model}_{dim}d
- Adds setup_table() method mirroring Qdrant's approach for consistency
- Implements 4-branch migration logic: both exist -> warn, only new -> use,
  neither -> create, only legacy -> migrate
- Batch migration: 500 records/batch (same as Qdrant)
- No automatic rollback to support idempotent re-runs

Impact:
- PostgreSQL tables now isolated by embedding model and dimension
- Automatic data migration from legacy tables on startup
- Backward compatible: model_name=None defaults to "unknown"
- All SQL operations use dynamic table names

Testing:
- 6 new tests for PostgreSQL migration (100% pass)
- Tests cover: naming, migration trigger, scenarios 1-3
- 3 additional scenario tests added for Qdrant completeness

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 22:54:37 +08:00
BukeLy
df5aacb545 feat: Qdrant model isolation and auto-migration
Why this change is needed:
To implement vector storage model isolation for Qdrant, allowing different workspaces to use different embedding models without conflict, and automatically migrating existing data.

How it solves it:
- Modified QdrantVectorDBStorage to use model-specific collection suffixes
- Implemented automated migration logic from legacy collections to new schema
- Fixed Shared-Data lock re-entrancy issue in multiprocess mode
- Added comprehensive tests for collection naming and migration triggers

Impact:
- Existing users will have data automatically migrated on next startup
- New workspaces will use isolated collections based on embedding model
- Fixes potential lock-related bugs in shared storage

Testing:
- Added tests/test_qdrant_migration.py passing
- Verified migration logic covers all 4 states (New/Legacy existence combinations)
2025-11-19 18:47:38 +08:00
BukeLy
13f2440bbf feat: enhance BaseVectorStorage for model isolation
Why this change is needed:
To enforce consistent naming and migration strategy across all vector storages.

How it solves it:
- Added _generate_collection_suffix() helper
- Added _get_legacy_collection_name() and _get_new_collection_name() interfaces

Impact:
Prepares storage implementations for multi-model support.

Testing:
Added tests/test_base_storage_integrity.py passing.
2025-11-19 02:15:22 +08:00
BukeLy
5c10d3d58e feat: enhance EmbeddingFunc with model_name support
Why this change is needed:
To support vector storage model isolation, we need to track which model is used for embeddings and generate unique identifiers for collections/tables.

How it solves it:
- Added model_name field to EmbeddingFunc
- Added get_model_identifier() method to generate sanitized suffix
- Added unit tests to verify behavior

Impact:
Enables subsequent changes in storage backends to isolate data by model.

Testing:
Added tests/test_embedding_func.py passing.
2025-11-19 02:11:39 +08:00
yangdx
d16c7840ab Bump API version to 0256 2025-11-18 23:15:31 +08:00
yangdx
e77340d4a1 Adjust chunking parameters to match the default environment variable settings 2025-11-18 23:14:50 +08:00
yangdx
1bfa1f81cb Merge branch 'main' into fix_chunk_comment 2025-11-18 22:38:50 +08:00
yangdx
9c10c87554 Fix linting 2025-11-18 22:38:43 +08:00
yangdx
dbae327a17 Merge branch 'main' into dev-postgres-vchordrq 2025-11-18 22:13:27 +08:00
yangdx
3096f844fb fix(postgres): allow vchordrq.epsilon config when probes is empty
Previously, configure_vchordrq would fail silently when probes was empty
(the default), preventing epsilon from being configured. Now each parameter
is handled independently with conditional execution, and configuration
errors fail-fast instead of being swallowed.

This fixes the documented epsilon setting being impossible to use in the
default configuration.
2025-11-18 21:58:36 +08:00
EightyOliveira
dacca334e0 refactor(chunking): rename params and improve docstring for chunking_by_token_size 2025-11-18 15:46:28 +08:00
yangdx
702cfd2981 Fix document deletion concurrency control and validation logic
• Clarify job naming for single vs batch deletion
• Update job name validation in busy pipeline check
2025-11-18 13:59:24 +08:00
yangdx
4048fc4b89 Fix: auto-acquire pipeline when idle in document deletion
• Track if we acquired the pipeline lock
• Auto-acquire pipeline when idle
• Only release if we acquired it
• Prevent concurrent deletion conflicts
• Improve deletion job validation
2025-11-18 13:25:13 +08:00
yangdx
1745b30a5f Fix missing workspace parameter in update flags status call 2025-11-18 12:55:48 +08:00
yangdx
f8dd2e0724 Fix namespace parsing when workspace contains colons
• Use rsplit instead of split
• Handle colons in workspace names
2025-11-18 12:23:05 +08:00
wmsnp
d07023c962
feat(postgres_impl): add vchordrq vector index support and unify vector index creation logic 2025-11-18 11:45:16 +08:00
yangdx
6cef8df159 Reduce log level and improve workspace mismatch message clarity
• Change warning to info level
• Simplify workspace mismatch wording
2025-11-18 08:25:21 +08:00
yangdx
ddc76f0c80 Merge branch 'main' into workspace-isolation 2025-11-17 17:08:07 +08:00
yangdx
9262f66d13 Bump API version to 0255 2025-11-17 17:07:18 +08:00
yangdx
393f880311 Improve LightRAG initialization checker tool with better usage docs
• Add workspace param to get_namespace_data
• Update docstring with proper usage example
• Simplify demo to show correct workflow
• Remove confusing before/after comparison
• Clarify tool should run after init
2025-11-17 15:42:54 +08:00
yangdx
9d7b7981ce Add pipeline status validation before document deletion 2025-11-17 14:58:10 +08:00
yangdx
98e964dfc4 Fix initialization instructions in check_lightrag_setup function 2025-11-17 14:27:26 +08:00
yangdx
6d6716e9f8 Add _default_workspace to shared storage finalization
- Add _default_workspace to global vars
- Set _default_workspace to None on cleanup
- Ensure complete resource cleanup
- Fix missing workspace finalization
2025-11-17 13:46:46 +08:00
yangdx
f1d8f18c80 Merge branch 'main' into workspace-isolation 2025-11-17 13:01:33 +08:00
yangdx
cdd53ee875 Remove manual initialize_pipeline_status() calls across codebase
- Auto-init pipeline status in storages
- Remove redundant import statements
- Simplify initialization pattern
- Update docs and examples
2025-11-17 12:54:33 +08:00
yangdx
e22ac52ebc Auto-initialize pipeline status in LightRAG.initialize_storages()
• Remove manual initialize_pipeline_status calls
• Auto-init in initialize_storages method
• Update error messages for clarity
• Warn on workspace conflicts
2025-11-17 12:54:33 +08:00
yangdx
e8383df3b8 Fix NamespaceLock context variable timing to prevent lock bricking
* Acquire lock before setting ContextVar
* Prevent state corruption on cancellation
* Fix permanent lock brick scenario
* Store context only after success
* Handle acquisition failure properly
2025-11-17 12:54:33 +08:00
yangdx
95e1fb1612 Remove final_namespace attribute for in-memory storage and use namespace in clean_llm_query_cache.py 2025-11-17 12:54:33 +08:00
yangdx
7ed0eac4c9 Fix workspace filtering logic in get_all_update_flags_status
• Handle namespaces with/without prefixes
• Fix workspace matching logic
2025-11-17 12:54:33 +08:00
yangdx
78689e8837 Fix pipeline status namespace check to handle root case
- Add check for bare "pipeline_status"
- Handle namespace without prefix
2025-11-17 12:54:33 +08:00
yangdx
d54d0d55d9 Standardize empty workspace handling from "_" to "" across storage
* Unify empty workspace behavior by changing workspace from "_" to ""
* Fixed incorrect empty workspace detection in get_all_update_flags_status()
2025-11-17 12:54:33 +08:00
yangdx
b6a5a90eaf Fix NamespaceLock concurrent coroutine safety with ContextVar
- Use ContextVar for per-coroutine storage
- Prevent state interference between coroutines
- Add re-entrance protection check
2025-11-17 12:54:33 +08:00
yangdx
fd486bc922 Refactor storage classes to use namespace instead of final_namespace 2025-11-17 12:54:33 +08:00
yangdx
01814bfc7a Fix missing function call parentheses in get_all_update_flags_status 2025-11-17 12:54:33 +08:00
yangdx
7deb9a64b9 Refactor namespace lock to support reusable async context manager
• Add NamespaceLock class wrapper
• Fix lock re-entrance issues
• Enable concurrent lock usage
• Fresh context per async with block
• Update get_namespace_lock API
2025-11-17 12:54:33 +08:00
yangdx
52c812b9a0 Fix workspace isolation for pipeline status across all operations
- Fix final_namespace error in get_namespace_data()
- Fix get_workspace_from_request return type
- Add workspace param to pipeline status calls
2025-11-17 12:54:33 +08:00
yangdx
926960e957 Refactor workspace handling to use default workspace and namespace locks
- Remove DB-specific workspace configs
- Add default workspace auto-setting
- Replace global locks with namespace locks
- Simplify pipeline status management
- Remove redundant graph DB locking
2025-11-17 12:54:33 +08:00
yangdx
ec05d89c2a Add macOS fork safety check for Gunicorn multi-worker mode
• Check OBJC_DISABLE_INITIALIZE_FORK_SAFETY
• Prevent NumPy/Accelerate crashes
• Show detailed error message
• Provide multiple fix options
• Exit early if misconfigured
2025-11-17 12:54:33 +08:00
yangdx
e5addf4d94 Improve embedding config priority and add debug logging
• Fix embedding_dim priority logic
• Add final config logging
2025-11-17 12:54:32 +08:00
yangdx
2fb57e767d Fix embedding token limit initialization order
* Capture max_token_size before decorator
* Apply wrapper after capturing attribute
* Prevent decorator from stripping dataclass
* Ensure token limit is properly set
2025-11-17 12:54:32 +08:00
yangdx
6b2af2b579 Refactor embedding function creation with proper attribute inheritance
- Extract max_token_size from providers
- Avoid double-wrapping EmbeddingFunc
- Improve configuration priority logic
- Add comprehensive debug logging
- Return complete EmbeddingFunc instance
2025-11-17 12:54:32 +08:00
yangdx
f0254773c6 Convert embedding_token_limit from property to field with __post_init__
• Remove property decorator
• Add field with init=False
• Set value in __post_init__ method
• embedding_token_limit is now in config dictionary
2025-11-17 12:54:32 +08:00
yangdx
14a6c24ed7 Add configurable embedding token limit with validation
- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
2025-11-17 12:54:32 +08:00
yangdx
f5b48587ed Improve Bedrock error handling with retry logic and custom exceptions
• Add specific exception types
• Implement proper retry mechanism
• Better error classification
• Enhanced logging and validation
• Enable embedding retry decorator
2025-11-17 12:54:32 +08:00
yangdx
77221564b0 Add max_token_size parameter to embedding function decorators
- Add max_token_size=8192 to all embed funcs
- Move siliconcloud to deprecated folder
- Import wrap_embedding_func_with_attrs
- Update EmbeddingFunc docstring
- Fix langfuse import type annotation
2025-11-17 12:54:32 +08:00
yangdx
8283c86bce Refactor exception handling in MemgraphStorage label methods 2025-11-17 12:54:32 +08:00
yangdx
423e4e927a Fix null reference errors in graph database error handling
- Initialize result vars to None
- Add null checks before consume calls
- Prevent crashes in except blocks
- Apply fix to both Neo4J and Memgraph
2025-11-17 12:54:32 +08:00
yangdx
2f2f35b883 Add macOS compatibility check for DOCLING with multi-worker Gunicorn 2025-11-17 12:54:32 +08:00