From d5bf5cf4e9568c8726eb13c5f18c09cd511a1217 Mon Sep 17 00:00:00 2001
From: hajdul88 <52442977+hajdul88@users.noreply.github.com>
Date: Fri, 5 Dec 2025 12:26:45 +0100
Subject: [PATCH] fix: fixes lancedb batch handling (#1872)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Description

Fixes a LanceDB batch handling issue: duplicate elements could appear in a
collection when the same ID occurred more than once in a single insert batch.

## Type of Change

- [x] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [ ] Performance improvement
- [ ] Other (please specify):

## Screenshots/Videos (if applicable)

## Pre-submission Checklist

- [x] **I have tested my changes thoroughly before submitting this PR**
- [x] **This PR contains minimal changes necessary to address the issue/feature**
- [x] My code follows the project's coding standards and style guidelines
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added necessary documentation (if applicable)
- [x] All new and existing tests pass
- [x] I have searched existing PRs to ensure this change hasn't been submitted already
- [x] I have linked any relevant issues in the description
- [x] My commits have clear and descriptive messages

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the
terms of the Topoteretes Developer Certificate of Origin.

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved data integrity by implementing deduplication logic to eliminate
    duplicate entries and ensure only the latest version is retained.
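
The deduplication idiom added by this patch can be illustrated in isolation. The following is a minimal sketch, not the adapter's actual code: `Point` and `dedupe_by_id` are hypothetical names that stand in for the adapter's data point objects and the inline expression in the diff. It relies only on standard `dict` behavior: assigning to an existing key overwrites its value but keeps the key's original position, so the last occurrence of each `id` is retained and the order of first appearance is preserved.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical stand-in for a vector data point; not cognee's real class."""

    id: str
    payload: str


def dedupe_by_id(points):
    """Return one point per id, keeping the last occurrence of each id."""
    # Later entries overwrite earlier ones under the same key, while the key
    # keeps the position at which it was first inserted.
    return list({p.id: p for p in points}.values())


# A batch in which id "a" appears twice.
batch = [Point("a", "old"), Point("b", "only"), Point("a", "new")]
deduped = dedupe_by_id(batch)

for point in deduped:
    print(point.id, point.payload)
```

Running the dedup before the `merge_insert("id")` call means the batch handed to LanceDB contains at most one row per `id`. This presumably matters because rows in a batch are matched against the existing table rather than against each other, though that reading of the upsert behavior is an assumption and is not stated in the PR.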
---
 .../infrastructure/databases/vector/lancedb/LanceDBAdapter.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py b/cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
index 30631ac4c..6d724f9d7 100644
--- a/cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
+++ b/cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
@@ -193,6 +193,8 @@ class LanceDBAdapter(VectorDBInterface):
             for (data_point_index, data_point) in enumerate(data_points)
         ]
 
+        lance_data_points = list({dp.id: dp for dp in lance_data_points}.values())
+
         async with self.VECTOR_DB_LOCK:
             await (
                 collection.merge_insert("id")