<!-- .github/pull_request_template.md --> ## Description This PR implements a data deletion system for unused DataPoint models based on last access tracking. The system tracks when data is accessed during search operations and provides cleanup functionality to remove data that hasn't been accessed within a configurable time threshold. **Key Changes:** 1. Added `last_accessed` timestamp field to the SQL `Data` model 2. Added `last_accessed_at` timestamp field to the graph `DataPoint` model 3. Implemented `update_node_access_timestamps()` function that updates both graph nodes and SQL records during search operations 4. Created `cleanup_unused_data()` function with SQL-based deletion mode for whole document cleanup 5. Added Alembic migration to add `last_accessed` column to the `data` table 6. Integrated timestamp tracking into in retrievers 7. Added comprehensive end-to-end test for the cleanup functionality ## Related Issues Fixes #[issue_number] ## Type of Change - [x] New feature (non-breaking change that adds functionality) - [ ] Bug fix (non-breaking change that fixes an issue) - [ ] Breaking change (fix or feature that would cause existing functionality to change) - [ ] Documentation update - [ ] Code refactoring - [ ] Performance improvement ## Database Changes - [x] This PR includes database schema changes - [x] Alembic migration included: `add_last_accessed_to_data` - [x] Migration adds `last_accessed` column to `data` table - [x] Migration includes backward compatibility (nullable column) - [x] Migration tested locally ## Implementation Details ### Files Modified: 1. **cognee/modules/data/models/Data.py** - Added `last_accessed` column 2. **cognee/infrastructure/engine/models/DataPoint.py** - Added `last_accessed_at` field 3. **cognee/modules/retrieval/chunks_retriever.py** - Integrated timestamp tracking in `get_context()` 4. **cognee/modules/retrieval/utils/update_node_access_timestamps.py** (new file) - Core tracking logic 5. **cognee/tasks/cleanup/cleanup_unused_data.py** (new file) - Cleanup implementation 6. **alembic/versions/[revision]_add_last_accessed_to_data.py** (new file) - Database migration 7. **cognee/tests/test_cleanup_unused_data.py** (new file) - End-to-end test ### Key Functions: - `update_node_access_timestamps(items)` - Updates timestamps in both graph and SQL - `cleanup_unused_data(minutes_threshold, dry_run, text_doc)` - Main cleanup function - SQL-based cleanup mode uses `cognee.delete()` for proper document deletion ## Testing - [x] Added end-to-end test: `test_textdocument_cleanup_with_sql()` - [x] Test covers: add → cognify → search → timestamp verification → aging → cleanup → deletion verification - [x] Test verifies cleanup across all storage systems (SQL, graph, vector) - [x] All existing tests pass - [x] Manual testing completed ## Screenshots/Videos N/A - Backend functionality ## Pre-submission Checklist - [x] **I have tested my changes thoroughly before submitting this PR** - [x] **This PR contains minimal changes necessary to address the issue/feature** - [x] My code follows the project's coding standards and style guidelines - [x] I have added tests that prove my fix is effective or that my feature works - [x] I have added necessary documentation (if applicable) - [x] All new and existing tests pass - [x] I have searched existing PRs to ensure this change hasn't been submitted already - [x] I have linked any relevant issues in the description - [x] My commits have clear and descriptive messages ## Breaking Changes None - This is a new feature that doesn't affect existing functionality. ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin. Resolves #1335 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added access timestamp tracking to monitor when data is last retrieved. * Introduced automatic cleanup of unused data based on configurable time thresholds and access history. * Retrieval operations now update access timestamps to ensure accurate tracking of data usage. * **Tests** * Added integration test validating end-to-end cleanup workflow across storage layers. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai --> |
||
|---|---|---|
| .. | ||
| versions | ||
| env.py | ||
| README | ||
| script.py.mako | ||
Generic single-database configuration with an async dbapi.