57 commits

Author SHA1 Message Date
BukeLy
f69cf9bcd6 fix: prevent vector dimension mismatch crashes and data loss on no-suffix restarts
Why this change is needed:
Two critical issues were identified in Codex review of PR #2391:
1. Migration fails when legacy collections/tables use different embedding dimensions
   (e.g., upgrading from 1536d to 3072d models causes initialization failures)
2. When model_suffix is empty (no model_name provided), table_name equals legacy_table_name,
   causing Case 1 logic to delete the only table/collection on second startup

How it solves it:
- Added dimension compatibility checks before migration in both Qdrant and PostgreSQL
- PostgreSQL uses two-method detection: pg_attribute metadata query + vector sampling fallback
- When dimensions mismatch, skip migration and create new empty table/collection, preserving legacy data
- Added safety check to detect when new and legacy names are identical, preventing deletion
- Both backends log clear warnings about dimension mismatches and skipped migrations
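
The decision logic described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `plan_migration` and the returned action labels are hypothetical, and it assumes the legacy dimension has already been detected.

```python
def plan_migration(new_name: str, legacy_name: str,
                   legacy_dim: int, new_dim: int) -> str:
    """Pick a startup action for a legacy table/collection (illustrative only)."""
    # No-suffix safety: if both names are identical there is only one
    # table/collection, so it must never be treated as deletable legacy data.
    if new_name == legacy_name:
        return "keep_existing"
    # Dimension mismatch: skip the migration, create a new empty
    # table/collection, and leave the legacy data untouched.
    if legacy_dim != new_dim:
        return "skip_migration_create_empty"
    # Names differ and dimensions agree: migrating is safe.
    return "migrate"


# Upgrading from a 1536d model to a 3072d model skips the migration.
print(plan_migration("chunks_new_model", "chunks", 1536, 3072))
```

Checking name equality first means the dimension comparison is never reached when there is only one table, which is the case the second issue describes.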

Impact:
- lightrag/kg/qdrant_impl.py: Added dimension check (lines 254-297) and no-suffix safety (lines 163-169)
- lightrag/kg/postgres_impl.py: Added dimension check with fallback (lines 2347-2410) and no-suffix safety (lines 2281-2287)
- tests/test_no_model_suffix_safety.py: New test file with 4 test cases covering edge scenarios
- Backward compatible: All existing scenarios continue working unchanged

Testing:
- All 20 tests pass (16 existing migration tests + 4 new safety tests)
- E2E tests enhanced with explicit verification points for dimension mismatch scenarios
- Verified graceful degradation when dimension detection fails
- Code style verified with ruff and pre-commit hooks
2025-11-23 15:44:07 +08:00
BukeLy
5180c1e395 feat: implement dimension compatibility checks for PostgreSQL and Qdrant migrations
This update introduces checks for vector dimension compatibility before migrating legacy data in both PostgreSQL and Qdrant storage implementations. If a dimension mismatch is detected, the migration is skipped to prevent data loss, and a new empty table or collection is created for the new embedding model.

Key changes include:
- Added dimension checks in `PGVectorStorage` and `QdrantVectorDBStorage` classes.
- Enhanced logging to inform users about dimension mismatches and the creation of new storage.
- Updated E2E tests to validate the new behavior, ensuring legacy data is preserved and new structures are created correctly.

Impact:
- Prevents potential data corruption during migrations with mismatched dimensions.
- Improves user experience by providing clear logging and maintaining legacy data integrity.

Testing:
- New tests confirm that the system behaves as expected when encountering dimension mismatches.
2025-11-20 12:22:13 +08:00
BukeLy
8386ea061e refactor: unify PostgreSQL and Qdrant migration logic for consistency
Why this change is needed:
Previously, PostgreSQL and Qdrant had inconsistent migration behavior:
- PostgreSQL kept legacy tables after migration, requiring manual cleanup
- Qdrant auto-deleted legacy collections after migration
This inconsistency caused confusion for users and required different
documentation for each backend.

How it solves the problem:
Unified both backends to follow the same smart cleanup strategy:
- Case 1 (both exist): Auto-delete legacy if it is empty, warn if it has data
- Case 4 (migration): Auto-delete legacy after successful verification
This provides a fully automated migration experience without manual intervention.
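
A rough sketch of this cleanup strategy, under stated assumptions: the function name and action labels are hypothetical, and only Case 1 and Case 4 are taken from this message. The other two existence combinations are not described here, so the sketch returns a placeholder for them.

```python
def cleanup_action(new_exists: bool, legacy_exists: bool,
                   legacy_count: int) -> str:
    """Map the existence state to a cleanup action (illustrative only)."""
    if new_exists and legacy_exists:
        # Case 1: both exist. An empty legacy store is safe to delete;
        # a non-empty one is kept and a warning is logged.
        return "delete_legacy" if legacy_count == 0 else "warn_keep_legacy"
    if legacy_exists and not new_exists:
        # Case 4: migrate, verify the result, then delete the legacy store.
        return "migrate_verify_delete_legacy"
    # Remaining combinations are not covered by this commit message.
    return "no_cleanup"


print(cleanup_action(new_exists=True, legacy_exists=True, legacy_count=0))
```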

Impact:
- Eliminates need for users to manually delete legacy tables/collections
- Reduces storage waste from duplicate data
- Provides consistent behavior across PostgreSQL and Qdrant
- Simplifies documentation and user experience

Testing:
- All 16 unit tests pass (8 PostgreSQL + 8 Qdrant)
- Added 4 new tests for Case 1 scenarios (empty vs non-empty legacy)
- Updated E2E tests to verify auto-deletion behavior
- All lint checks pass (ruff-format, ruff, trailing-whitespace)
2025-11-20 11:37:59 +08:00
BukeLy
48f6511404 style: Apply ruff-format to qdrant_impl.py
Fix code formatting to comply with ruff-format requirements.
Split long conditional expression across multiple lines for better readability.
2025-11-20 02:43:59 +08:00
BukeLy
e24b2ed4fa fix: Prioritize workspace-specific legacy collections in Qdrant migration
Why this change is needed:
The E2E test test_backward_compat_old_workspace_naming_qdrant was failing
because _find_legacy_collection() searched for generic "lightrag_vdb_{namespace}"
before workspace-specific "{workspace}_{namespace}" collections. When both
existed, it would always find the generic one first (which might be empty),
ignoring the workspace collection that actually contained the data to migrate.

How it solves it:
Reordered the candidates list in _find_legacy_collection() to prioritize
more specific naming patterns over generic ones:
  1. {workspace}_{namespace}  (most specific, old workspace format)
  2. lightrag_vdb_{namespace}  (generic legacy format)
  3. {namespace}  (most generic, oldest format)

This ensures the migration finds the correct source collection with actual data.
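
The reordered lookup can be illustrated with a small stand-alone sketch. It is not the body of `_find_legacy_collection()`: the helper names are hypothetical, `existing` is a plain set of collection names standing in for the Qdrant client, and skipping the workspace candidate when no workspace is set is an assumption.

```python
from typing import List, Optional, Set


def legacy_candidates(workspace: str, namespace: str) -> List[str]:
    """List legacy collection names, most specific first."""
    candidates = []
    if workspace:  # assumption: no workspace means no workspace-specific name
        candidates.append(f"{workspace}_{namespace}")  # old workspace format
    candidates.append(f"lightrag_vdb_{namespace}")     # generic legacy format
    candidates.append(namespace)                       # oldest format
    return candidates


def find_legacy_collection(existing: Set[str], workspace: str,
                           namespace: str) -> Optional[str]:
    """Return the first candidate that exists, or None."""
    for name in legacy_candidates(workspace, namespace):
        if name in existing:
            return name
    return None


# With both collections present, the workspace-specific one is chosen.
print(find_legacy_collection({"prod_chunks", "lightrag_vdb_chunks"},
                             "prod", "chunks"))
```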

Impact:
- Fixes test_backward_compat_old_workspace_naming_qdrant which creates
  a "prod_chunks" collection with 10 points
- Migration will now correctly find and migrate from workspace-specific
  legacy collections before falling back to generic collections
- Maintains backward compatibility with all legacy naming patterns

Testing:
Run: pytest tests/test_e2e_multi_instance.py::test_backward_compat_old_workspace_naming_qdrant -v
2025-11-20 02:34:55 +08:00
BukeLy
42df825d30 fix: handle empty model_suffix in Qdrant collection naming
This change ensures that when the model_suffix is empty, the final_namespace falls back to the legacy_namespace, preventing potential naming issues. A warning is logged to inform users about the missing model suffix and the fallback to the legacy naming scheme.
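
A minimal sketch of the fallback, assuming the model-specific name is formed by joining the legacy name and the suffix with an underscore. The commit message does not state the exact format, so the function name, the join rule, and the example suffix are all illustrative.

```python
import logging

logger = logging.getLogger("naming_sketch")


def resolve_namespace(legacy_namespace: str, model_suffix: str) -> str:
    """Build the collection name, falling back to the legacy name."""
    if not model_suffix:
        # Without a suffix, appending "_" would leave a trailing underscore,
        # so warn and reuse the legacy naming scheme instead.
        logger.warning("model_suffix is empty; falling back to %s",
                       legacy_namespace)
        return legacy_namespace
    return f"{legacy_namespace}_{model_suffix}"


print(resolve_namespace("lightrag_vdb_chunks", ""))
print(resolve_namespace("lightrag_vdb_chunks", "example_model_1536d"))
```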

Additionally, comprehensive tests have been added to verify the behavior of both PostgreSQL and Qdrant storage when model_suffix is empty, ensuring that the naming conventions are correctly applied and that no trailing underscores are present.

Impact:
- Prevents crashes due to empty model_suffix
- Provides clear feedback to users regarding configuration issues
- Maintains backward compatibility with existing setups

Testing:
All new tests pass, validating the handling of empty model_suffix scenarios.
2025-11-20 01:55:20 +08:00
BukeLy
6bef40766d style: fix lint errors (trailing whitespace and formatting) 2025-11-20 01:41:23 +08:00
BukeLy
088b986ac6 style: fix lint issues (trailing whitespace and formatting) 2025-11-20 01:28:39 +08:00
BukeLy
5d9547344a fix: correct Qdrant legacy_namespace for data migration
Why this change is needed:
The legacy_namespace logic was incorrectly including workspace in the
collection name, causing migration to fail in E2E tests. When workspace
was set (e.g., to a temp directory path), legacy_namespace became
"/tmp/xxx_chunks" instead of "lightrag_vdb_chunks", so the migration
logic couldn't find the legacy collection.

How it solves it:
Changed legacy_namespace to always use the old naming scheme without
workspace prefix: "lightrag_vdb_{namespace}". This matches the actual
collection names from pre-migration code and aligns with PostgreSQL's
approach where legacy_table_name = base_table (without workspace).

Impact:
- Qdrant legacy data migration now works correctly in E2E tests
- All unit tests pass (6/6 for both Qdrant and PostgreSQL)
- E2E test_legacy_migration_qdrant should now pass

Testing:
- Unit tests: pytest tests/test_qdrant_migration.py -v (6/6 passed)
- Unit tests: pytest tests/test_postgres_migration.py -v (6/6 passed)
- Updated test_qdrant_collection_naming to verify new legacy_namespace
2025-11-20 01:08:15 +08:00
BukeLy
df5aacb545 feat: Qdrant model isolation and auto-migration
Why this change is needed:
To implement vector storage model isolation for Qdrant, allowing different workspaces to use different embedding models without conflict, and automatically migrating existing data.

How it solves it:
- Modified QdrantVectorDBStorage to use model-specific collection suffixes
- Implemented automated migration logic from legacy collections to new schema
- Fixed Shared-Data lock re-entrancy issue in multiprocess mode
- Added comprehensive tests for collection naming and migration triggers

Impact:
- Existing users will have data automatically migrated on next startup
- New workspaces will use isolated collections based on embedding model
- Fixes potential lock-related bugs in shared storage

Testing:
- Added tests/test_qdrant_migration.py passing
- Verified migration logic covers all 4 states (New/Legacy existence combinations)
2025-11-19 18:47:38 +08:00
yangdx
926960e957 Refactor workspace handling to use default workspace and namespace locks
- Remove DB-specific workspace configs
- Add default workspace auto-setting
- Replace global locks with namespace locks
- Simplify pipeline status management
- Remove redundant graph DB locking
2025-11-17 12:54:33 +08:00
yangdx
5f4a280458 Add Qdrant legacy collection migration with workspace support
- Add QdrantMigrationError exception
- Implement automatic data migration
- Support workspace-based partitioning
- Add migration verification logic
- Update collection naming scheme
2025-10-30 19:16:33 +08:00
Anush008
8584980e3a refactor: Qdrant Multi-tenancy (Include staged)
Signed-off-by: Anush008 <anushshetty90@gmail.com>
2025-10-26 09:58:24 +05:30
yangdx
9be22dd666 Preserve ordering in get_by_ids methods across all storage implementations
- Fix result ordering in vector stores
- Update KV storage get_by_ids methods
- Maintain order in doc status storage
- Return None for missing IDs
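
The ordering contract can be sketched independently of any storage backend. The helper below is hypothetical and assumes each fetched record is a dict with an `"id"` key; it is not the code of any `get_by_ids` implementation.

```python
from typing import Any, Dict, List, Optional


def order_by_ids(ids: List[str],
                 fetched: List[Dict[str, Any]]) -> List[Optional[Dict[str, Any]]]:
    """Reorder fetched records to match ids, using None for missing IDs."""
    by_id = {record["id"]: record for record in fetched}
    return [by_id.get(record_id) for record_id in ids]


# The backend returned the rows in arbitrary order and "b" was not found.
rows = [{"id": "c", "text": "third"}, {"id": "a", "text": "first"}]
print(order_by_ids(["a", "b", "c"], rows))
```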
2025-10-11 12:37:59 +08:00
yangdx
43f6fcea6c Fix linting 2025-09-12 17:00:53 +08:00
luxiang
fb4166ba2a chore: compatible with qdrant v1.7.3 2025-09-10 20:07:49 +08:00
yangdx
03d0fa3014 perf: add optional query_embedding parameter to avoid redundant embedding calls 2025-08-29 18:15:45 +08:00
yangdx
a923d378dd Remove deprecated ID-based filtering from vector storage queries
- Remove ids param from QueryParam
- Simplify BaseVectorStorage.query signature
- Update all vector storage implementations
- Streamline PostgreSQL query templates
- Remove ID filtering from operate.py calls
2025-08-29 17:06:48 +08:00
yangdx
8f7031b882 Add get_vectors_by_ids method to QdrantVectorDBStorage 2025-08-15 16:46:52 +08:00
yangdx
0b22ffb252 Refac: uniformly protect all storage initializations with get_data_init_lock 2025-08-14 03:46:19 +08:00
yangdx
5d1bc8b49d Relocate client creation to the initialize method to prevent race conditions in multi-process mode. 2025-08-12 18:20:56 +08:00
yangdx
74783d7781 Remove redundant debug logging for Qdrant operations 2025-08-12 17:29:05 +08:00
yangdx
fc8ca1a706 Fix: add multi-process lock for initialize and drop methods for all storage 2025-08-12 04:25:09 +08:00
yangdx
095e0cbfa2 Refac: Add workspace information to all logger output for all storage types 2025-08-12 01:19:09 +08:00
yangdx
033098c1bc Feat: Add WORKSPACE support to all storage types 2025-07-07 00:57:21 +08:00
yangdx
271722405f feat: Flatten LLM cache structure for improved recall efficiency
Refactored the LLM cache to a flat Key-Value (KV) structure, replacing the previous nested format. The old structure used the 'mode' as a key and stored specific cache content as JSON nested under it. This change significantly enhances cache recall efficiency.
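
To illustrate the difference between the two layouts, here is a small sketch that flattens a nested, mode-keyed cache into a single-level mapping. The `mode:entry_id` key format and the function name are assumptions made for the example; the commit message does not specify the actual flat key scheme.

```python
from typing import Any, Dict


def flatten_cache(nested: Dict[str, Dict[str, Any]],
                  sep: str = ":") -> Dict[str, Any]:
    """Turn {mode: {entry_id: content}} into {"mode:entry_id": content}."""
    flat: Dict[str, Any] = {}
    for mode, entries in nested.items():
        for entry_id, content in entries.items():
            # Hypothetical flat key: the mode and entry id joined by sep.
            flat[f"{mode}{sep}{entry_id}"] = content
    return flat


old_cache = {"local": {"abc123": {"return": "cached answer"}}}
print(flatten_cache(old_cache))
```

With a flat layout, a single key lookup reaches an entry directly instead of loading and parsing the whole per-mode JSON object first.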
2025-07-02 16:11:53 +08:00
yangdx
045993f7d2 Remove deprecated search_by_prefix 2025-05-03 11:17:49 +08:00
yangdx
08e8a7ead1 Fix linting 2025-05-03 00:46:28 +08:00
yangdx
6021796a61 Fix created_at problem for Qdrant vector db 2025-05-02 16:38:35 +08:00
yangdx
ca63386546 Increase embedding priority for query request 2025-04-28 20:10:39 +08:00
yangdx
95a8ee27ed Fix linting 2025-03-31 23:22:27 +08:00
yangdx
1df4b777d7 Add drop functions to storage implementations 2025-03-30 15:17:57 +08:00
Roy
8aa9d0e6ca Add optional ids filter to vector database query methods
- Updated query method signatures across multiple vector database implementations
- Added optional `ids` parameter to filter search results
- Consistent implementation across ChromaDB, Faiss, Milvus, MongoDB, NanoVectorDB, Oracle, Qdrant, and TiDB vector storage classes
2025-03-11 15:22:17 +00:00
zrguo
da59cc89d8 fix linting 2025-03-09 00:51:14 +08:00
dixyes
458eafd714 Fix qdrant payload id
Qdrant now uses PointStruct.payload["id"] instead of the PointStruct.id UUID.
This fixes the id overwrite.
2025-03-08 16:40:40 +08:00
zrguo
e822f35c89 Fix edit entity and relation bugs 2025-03-07 14:39:06 +08:00
zrguo
81568f3bad fix linting 2025-03-04 15:53:20 +08:00
zrguo
3a2a636862 Implement the missing methods. 2025-03-04 15:50:53 +08:00
Yannick Stephan
48a1ad9b3b Merge pull request #883 from YanSte/fix-return-none
Optimised returns
2025-02-19 22:24:50 +01:00
Yannick Stephan
9277fe8c29 fixed return 2025-02-19 22:22:41 +01:00
Saifeddine ALOUI
473e52a095 Update qdrant_impl.py 2025-02-19 19:51:39 +01:00
Yannick Stephan
2524e02428 remove tqdm and cleaned readme and ollama 2025-02-18 19:58:03 +01:00
Yannick Stephan
2b2c81a722 added some comments 2025-02-16 16:04:07 +01:00
Yannick Stephan
0e7aff96bb back to not making breaks 2025-02-16 15:08:50 +01:00
Yannick Stephan
a0844bca28 cleaned import 2025-02-16 14:45:45 +01:00
Yannick Stephan
3fef8201c6 added final, required methods and cleaned import 2025-02-16 14:38:09 +01:00
Yannick Stephan
931c31fa8c cleaned code 2025-02-16 13:55:30 +01:00
Yannick Stephan
3eba41aab6 updated clean of what implemented on BaseVectorStorage 2025-02-16 13:24:42 +01:00
ArnoChen
cac1c993a9 remove redundant cosine similarity filter in Qdrant query
fix
2025-02-14 03:16:01 +08:00
ArnoChen
9a91b68e62 fix configuration errors of mongodb, neo4j, and qdrant backends. 2025-02-14 02:48:15 +08:00