Commit graph

4586 commits

Author SHA1 Message Date
Geoff-Robin
d4ce340cb5 Removed unused imports 2025-10-06 18:31:08 +05:30
Geoff-Robin
7fe1de770d Remove assignment to unused variable graph_db 2025-10-06 18:29:58 +05:30
Geoff-Robin
0a9b624010 changed return type for fetch_page_content to Dict[str,str] 2025-10-06 18:27:54 +05:30
Geoff-Robin
3c9e5f830b Solved more nitpick comments 2025-10-06 18:16:31 +05:30
Geoff-Robin
791e38b2c0 Solved more nitpick comments 2025-10-06 18:00:20 +05:30
Geoff-Robin
1b5c099f8b CodeRabbit reviews solved 2025-10-06 17:15:25 +05:30
hajdul88
2184ae866b feat: adds redis as optional dependency 2025-10-06 13:39:51 +02:00
hajdul88
d537d6225c chore: adds error reproduction files (to delete later) 2025-10-06 13:19:18 +02:00
hajdul88
0776a07f0a feat: adds redis to docker compose 2025-10-06 13:17:22 +02:00
Geoff-Robin
ae740eda96 Added related documentation 2025-10-06 04:23:10 +05:30
Geoff-Robin
667bbd775e Added cron job and removed obvious comments 2025-10-06 04:12:32 +05:30
Geoff-Robin
4d5146c802 Added Documentation 2025-10-06 04:00:15 +05:30
Geoff-Robin
0f64f6804d Done adding cron job web scraping 2025-10-06 03:45:09 +05:30
Geoff-Robin
e5633bc368 corrected F402 error pointed out by ruff check 2025-10-06 03:44:24 +05:30
Geoff-Robin
f449fce0f1 Done with scraping_task successfully 2025-10-06 02:27:20 +05:30
Geoff-Robin
f148b1df89 Added support for multiple base_url extraction 2025-10-05 20:13:44 +05:30
Geoff-Robin
77ea7c4b1d Added APScheduler 2025-10-05 20:02:02 +05:30
Geoff-Robin
c2aa95521c removed structured argument 2025-10-05 20:00:19 +05:30
Geoff-Robin
2cba31a086 Tested and Debugged scraping usage in cognee.add() pipeline 2025-10-04 21:26:25 +05:30
Geoff-Robin
ab6fc65406 Added global context for bs4crawler and tavily config 2025-10-04 19:40:37 +05:30
Geoff-Robin
da7ebc4574 Removed asyncio import 2025-10-04 15:10:46 +05:30
Geoff-Robin
fbef6675bc removed unused Dict import from typing 2025-10-04 15:10:05 +05:30
Geoff-Robin
20fb77316c Done with integration with add workflow when incremental_loading is set to False 2025-10-04 15:01:13 +05:30
Geoff-Robin
1ab9d24cf0 Changed bs4_connector.py to bs4_crawler.py 2025-10-03 12:33:13 +05:30
Daulet Amirkhanov
ee45afed42
Fix test_cli_edge_cases, test_delete_all_with_user_id unit test (#1493)
## Description

GitHub Actions job:
https://github.com/topoteretes/cognee/actions/runs/18199627173/job/51815009426?pr=1493

## Type of Change
- [x] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [ ] Performance improvement
- [ ] Other (please specify):

## Pre-submission Checklist
- [ ] **I have tested my changes thoroughly before submitting this PR**
- [ ] **This PR contains minimal changes necessary to address the
issue/feature**
- [ ] My code follows the project's coding standards and style
guidelines
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added necessary documentation (if applicable)
- [ ] All new and existing tests pass
- [ ] I have searched existing PRs to ensure this change hasn't been
submitted already
- [ ] I have linked any relevant issues in the description
- [ ] My commits have clear and descriptive messages

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-10-02 18:00:00 +01:00
Daulet Amirkhanov
38070c489b fix test_cli_edge_cases.py, test_delete_all_with_user_id unit test 2025-10-02 17:44:01 +01:00
Geoff-Robin
edd119ef97 first iteration of bs4_connector.py done 2025-10-02 22:04:50 +05:30
Daulet Amirkhanov
2efce6949b
Feature/delete preview (#1385)
## Description

This pull request introduces a preview step to the `cognee delete`
command, fulfilling the requirements of issue #1366.

When a user runs the delete command, it now first queries the database
to calculate the scope of the deletion and presents a summary (number of
datasets, data entries, users) before asking for final confirmation.
This improves the safety and usability of the command, preventing
accidental data loss.

This PR also adds the `--force` flag to bypass the preview, which is
useful for scripting and automation.
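
The preview-then-confirm flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `DeletionCounts`, `format_preview`, and `confirm_and_delete` are hypothetical names, and the prompt wording is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeletionCounts:
    """Hypothetical container for the scope of a pending deletion."""
    datasets: int
    data_entries: int
    users: int


def format_preview(counts: DeletionCounts) -> str:
    """Build the summary line shown to the user before deletion."""
    return (
        f"This will delete {counts.datasets} datasets, "
        f"{counts.data_entries} data entries, and {counts.users} users."
    )


def confirm_and_delete(
    counts: DeletionCounts,
    delete: Callable[[], None],
    force: bool = False,
    ask: Callable[[str], str] = input,
    show: Callable[[str], None] = print,
) -> bool:
    """Show a preview and ask for confirmation unless force is set.

    Returns True if the deletion ran, False if the user declined.
    """
    if not force:
        show(format_preview(counts))
        answer = ask("Proceed? [y/N] ")
        if answer.strip().lower() not in {"y", "yes"}:
            return False  # anything other than an explicit yes cancels
    delete()
    return True
```

Passing `ask` and `show` as parameters keeps the sketch testable without a terminal; with `force=True` neither is called, which matches the scripting use case mentioned above.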

## Type of Change

- [x] New feature (non-breaking change that adds functionality)
- [ ] Bug fix (non-breaking change that fixes an issue)

## Changes Made

- **`cognee/cli/commands/delete_command.py`**: Modified to include the
preview logic. It now calls the counting function, displays the results,
and proceeds with deletion only after confirmation.
- **`cognee/modules/data/methods/get_deletion_counts.py`**: Added this
new file to contain the logic for querying the database and calculating
the deletion counts for datasets, data entries, and users.

## Testing

I have tested the changes through **Manual CLI Testing**: I ran the
`cognee delete` command with the `--dataset-name`, `--user-id`, and
`--all` flags to manually verify that the preview output is correct.

### Terminal Output
Here are screenshots of the command working with all possible flags:
<img width="1898" height="1087" alt="cognee1"
src="https://github.com/user-attachments/assets/939aa4d0-748c-45e4-a2a6-f5e7982c1fc0"
/>
<img width="1788" height="748" alt="cognee2"
src="https://github.com/user-attachments/assets/213884be-cce1-4007-90f9-5e6d3a302ced"
/>

## Pre-submission Checklist

- [x] **I have tested my changes thoroughly before submitting this PR**
- [x] **This PR contains minimal changes necessary to address the
issue/feature**
- [x] My code follows the project's coding standards and style
guidelines
- [x] I have added tests that prove my feature works
- [ ] I have not added or changed documentation (as it was not required
for this CLI change)
- [x] I have searched existing PRs to ensure this change hasn't been
submitted already
- [x] I have linked the relevant issue in the description

## Related Issues

Fixes #1366 

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-10-02 15:21:57 +01:00
Daulet Amirkhanov
a92f4bdf3f fix: update failing tests and refactor delete_preview implementation 2025-10-02 15:05:39 +01:00
Daulet Amirkhanov
d5dd6c2fc2
Merge branch 'dev' into feature/delete-preview 2025-10-02 12:02:16 +01:00
Andrej Milicevic
a744f8d435 test: Rollback pgvector test. Was failing for some reason. 2025-10-02 09:54:30 +02:00
shehab-badawy
9c87a10848 feat: Add delete preview for --dataset-name and --all flags
This commit introduces the preview functionality for the `cognee delete` command. The preview displays a summary of what will be deleted before asking for user confirmation.

The feature is fully functional for the following flags:
- `--dataset-name`: Correctly counts the number of data entries within the specified dataset.
- `--all`: Correctly counts the total number of datasets, data entries, and users in the system.

The logic for the `--user-id` flag is a work in progress. The current implementation uses a placeholder and needs a method to query a user directly by their ID to be completed.
2025-10-02 01:44:11 -04:00
Geoff-Robin
4979f43fc0 Added playwright as a dependency 2025-10-02 02:21:33 +05:30
Geoff-Robin
c283977035 switched httpx AsyncClient to fetch webpage 2025-10-02 02:01:46 +05:30
Geoff-Robin
60499c439c Added logging 2025-10-02 01:54:56 +05:30
Geoff-Robin
925bd38195 Setup models.py and utils.py 2025-10-02 01:32:00 +05:30
Geoff-Robin
70a2cc9d65 removed scrapy and added bs4 2025-10-02 01:28:48 +05:30
Andrej Milicevic
6f0756f312 test: Rollback deduplication test 2025-10-01 18:10:57 +02:00
Igor Ilic
95fdbab406
refactor: Remove macos13 from ci/cd and support (#1489)

## Description
Remove MacOS13 support and CI/CD tests

## Type of Change
- [ ] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [ ] Performance improvement
- [x] Other (please specify): Remove MacOS13 support

## Pre-submission Checklist
- [ ] **I have tested my changes thoroughly before submitting this PR**
- [ ] **This PR contains minimal changes necessary to address the
issue/feature**
- [ ] My code follows the project's coding standards and style
guidelines
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added necessary documentation (if applicable)
- [ ] All new and existing tests pass
- [ ] I have searched existing PRs to ensure this change hasn't been
submitted already
- [ ] I have linked any relevant issues in the description
- [ ] My commits have clear and descriptive messages

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-10-01 18:01:04 +02:00
Andrej Milicevic
5b46f86be5 test: Removed long text string about quantum computers from tests. Used a file instead. 2025-10-01 17:59:53 +02:00
Daulet Amirkhanov
0bf3490d63 chore: update cognee-cli to use MCP Docker image from main. Bring back deprecation warnings 2025-10-01 16:16:06 +01:00
Aniruddha Mandal
4412495d67 chore: update dependency specifications in pyproject.toml
- Changed "mistralai==1.9.10" to "mistralai>=1.9.10" for more flexible versioning.
- Removed "mistralai" from the optional dependencies under "mistral".
- Expanded the "docs" dependency to include "pdf" support.
2025-10-01 00:33:05 +05:30
Aniruddha Mandal
fedb945365 chore: remove uv.lock file
- Deleted the uv.lock file to streamline dependency management.
- This change may require regeneration of the lock file in future dependency updates.
2025-10-01 00:26:59 +05:30
Aniruddha Mandal
4e96e04405 chore: update dependencies in pyproject.toml and uv.lock
- Added "mistralai==1.9.10" to the dependencies in pyproject.toml.
- Updated sdist entries in uv.lock to remove unnecessary upload-time fields for various packages.
- Ensured consistency in package specifications across the project files.
2025-10-01 00:22:28 +05:30
Igor Ilic
3dba072c49
fix: resolve formatting issue (#1486)

## Description
ruff formatting

## Type of Change
- [ ] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [ ] Performance improvement
- [ ] Other (please specify):

## Pre-submission Checklist
- [ ] **I have tested my changes thoroughly before submitting this PR**
- [ ] **This PR contains minimal changes necessary to address the
issue/feature**
- [ ] My code follows the project's coding standards and style
guidelines
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added necessary documentation (if applicable)
- [ ] All new and existing tests pass
- [ ] I have searched existing PRs to ensure this change hasn't been
submitted already
- [ ] I have linked any relevant issues in the description
- [ ] My commits have clear and descriptive messages

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-09-30 18:12:57 +02:00
Igor Ilic
7ab000d891 refactor: Add test for updating of docs and visualization 2025-09-30 18:12:22 +02:00
Vasilije
2ee5a3ca7a
feat: Enhance PDF parsing (#1445)

## Description
I've just added a new PDF parser, AdvancedPdfLoader. It uses the
unstructured library and does a much better job of handling PDFs,
especially with its layout-aware parsing, table preservation, and image
handling.

I also built in a safeguard: if unstructured isn't installed or throws
an error, it'll automatically fall back to the old PyPdfLoader so it
won't just crash. All the related unit tests and project dependencies
are taken care of, too.
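
The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration of the pattern only, not the actual `AdvancedPdfLoader` code: `load_with_fallback` and its callable parameters are hypothetical, and the real loader interfaces may differ.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def load_with_fallback(
    path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Try the primary loader; on any failure, use the fallback loader.

    ImportError is a subclass of Exception, so a missing optional
    dependency raised inside `primary` is handled the same way as a
    parsing error.
    """
    try:
        return primary(path)
    except Exception as error:
        logger.warning(
            "Primary loader failed for %s (%s); using fallback", path, error
        )
        return fallback(path)
```

Here `primary` would stand in for the unstructured-based parser and `fallback` for the older loader; catching the broad `Exception` is a deliberate choice in this sketch so that the pipeline degrades instead of crashing.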

https://github.com/topoteretes/cognee/issues/1342

## Type of Change
- [ ] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [x] Performance improvement
- [ ] Other (please specify):

## Changes Made
- Added AdvancedPdfLoader class for enhanced PDF processing using the
unstructured library.
- Integrated fallback mechanism to PyPdfLoader in case of unstructured
library import failure or exceptions.
- Updated supported loaders to include AdvancedPdfLoader.
- Added unit tests for AdvancedPdfLoader to ensure functionality and
error handling.
- Updated poetry.lock and pyproject.toml to include new dependencies and
versions.

## Testing
pytest -v ./cognee/tests/test_advanced_pdf_loader.py

## Pre-submission Checklist
- [x] **I have tested my changes thoroughly before submitting this PR**
- [x] **This PR contains minimal changes necessary to address the
issue/feature**
- [x] My code follows the project's coding standards and style
guidelines
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added necessary documentation (if applicable)
- [x] All new and existing tests pass
- [x] I have searched existing PRs to ensure this change hasn't been
submitted already
- [x] I have linked any relevant issues in the description
- [x] My commits have clear and descriptive messages

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-09-30 17:46:53 +02:00
EricXiao
d868912df5
Merge branch 'dev' into feat/add-pdfproloader 2025-09-30 23:24:14 +08:00
Andrej Milicevic
45f00b022f test: Renamed s3 test. Commented out docling test. Fails until docling resolves their issue. 2025-09-30 17:22:43 +02:00
Geoff-Robin
6348c9d8de Created models.py 2025-09-30 20:46:26 +05:30