Commit graph

4218 commits

Author SHA1 Message Date
Andrej Milicevic
bb756a0e19 remove api version 2025-10-22 12:42:23 +02:00
Andrej Milicevic
1ecea0a955 change endpoint 2025-10-22 12:39:31 +02:00
Vasilije
738759bc5b
(WIP) Fix/fix web parsing (#1552)

## Description

This PR (using TDD):
1. Splits the web crawling implementation into separate fetching and
parsing (loader) steps.
2. Fetching is used in `save_data_item_to_storage`, with default
settings.
3. The loader scrapes the fetched HTML and saves the result in a txt
file (`html_hash.html` -> `html_hash.txt`), similar to how we process
pdf files.
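
The split described in the list above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function names `save_fetched_html` and `load_html_to_text` are hypothetical, SHA-256 is an assumed choice of hash, and the standard-library `html.parser` stands in for BeautifulSoup so the example has no dependencies.

```python
import hashlib
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def save_fetched_html(html: str, storage_dir: Path) -> Path:
    """Fetching step (hypothetical): store raw HTML as <hash>.html."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    path = storage_dir / f"{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path


def load_html_to_text(html_path: Path) -> Path:
    """Loading step (hypothetical): scrape stored HTML into <hash>.txt."""
    extractor = _TextExtractor()
    extractor.feed(html_path.read_text(encoding="utf-8"))
    extractor.close()
    txt_path = html_path.with_suffix(".txt")
    txt_path.write_text("\n".join(extractor.parts), encoding="utf-8")
    return txt_path
```

Keeping the two steps apart means the stored `.html` file can be re-parsed by a different loader later without fetching the page again.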

## Type of Change
- [x] Bug fix (non-breaking change that fixes an issue)
- [x] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Documentation update
- [x] Code refactoring
- [ ] Performance improvement
- [ ] Other (please specify):

## Screenshots/Videos (if applicable)

## Pre-submission Checklist
- [ ] **I have tested my changes thoroughly before submitting this PR**
- [ ] **This PR contains minimal changes necessary to address the
issue/feature**
- [ ] My code follows the project's coding standards and style
guidelines
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added necessary documentation (if applicable)
- [ ] All new and existing tests pass
- [ ] I have searched existing PRs to ensure this change hasn't been
submitted already
- [ ] I have linked any relevant issues in the description
- [ ] My commits have clear and descriptive messages

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-10-22 11:57:40 +02:00
Andrej Milicevic
5f2e9bd84b fix: change baml llm provider 2025-10-22 10:37:37 +02:00
Andrej Milicevic
da05671fd7 test: changed the api key of baml tests 2025-10-22 10:23:34 +02:00
Daulet Amirkhanov
10e4fd7681 Make BS4 loader compatible with tavily fetcher 2025-10-21 23:46:21 +01:00
Daulet Amirkhanov
20c9e5498b skip tavily in Github CI for now 2025-10-21 23:27:18 +01:00
Daulet Amirkhanov
a35bcecdf9 refactor tavily_crawler test 2025-10-21 23:13:40 +01:00
Daulet Amirkhanov
3f5c09eb45 lazy load cron_web_scraper_task and web_scraper_task 2025-10-21 23:11:01 +01:00
Daulet Amirkhanov
f02aa1abfc ruff format 2025-10-21 23:02:25 +01:00
Daulet Amirkhanov
0f6aac19e8 TDD: add test cases and finish loading stage 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
6895813ae8 tests: name integration tests more meaningfully 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
ed4eba4c44 add back in-code comments for ingest_data 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
03b4547b7f validate e2e - urls are saved as htmls, and loaders are selected correctly 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
f84e31c626 bs4_loader.py -> beautiful_soup_loader.py, add to supported loaders 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
322ef156cb redefine preferred_loaders param to allow for args per loader 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
7210198f2e implement bs4_loader.py methods, except load for now 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
16e1c60925 move bs4 html parsing into bs4_loader 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
9d9969676f Separate BeautifulSoup crawling from fetching 2025-10-21 22:47:52 +01:00
Daulet Amirkhanov
a7ff188018 add crawler tests 2025-10-21 22:47:22 +01:00
Daulet Amirkhanov
5035c872a7 refactor: update web scraper configurations and simplify fetch logic 2025-10-21 22:47:22 +01:00
Daulet Amirkhanov
95e735d397 remove fetchers_config, use default configs for Tavily and BeautifulSoup 2025-10-21 22:46:50 +01:00
Daulet Amirkhanov
abbbf88ad3 CI: use scraping dependencies for integration tests 2025-10-21 22:46:50 +01:00
Daulet Amirkhanov
085e81c082 Clean up - remove UnsupportedPathSchemeError 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
35d3c08779 Clean up add.py imports 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
fdf7c27fec refactor: remove WebUrlLoader imports 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
1213a3a4cb revert changes to LoaderEngine and LoaderInterface 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
f7c2187ce7 remove loaders_config as it's not in use 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
fc660b46bb remove web_url_loader since there is no logic post fetching for loader 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
d7417d9b06 refactor: move url data fetching logic into save_data_item_to_storage 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
17b33ab443 feat: web_url_fetcher 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
8fe789ee96 nit: remove unnecessary import 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
1a0978fb37 incremental loading - fallback to regular, update test cases 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
a0f760a3d1 refactor: remove redundant filestream arg from LoaderEngine.load_file(...) 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
a69a7e5fc4 tests: remove redundant bs4 configs from tests 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
b5190c90f1 add logging for crawling status; add cap to the crawl_delay from robots.txt
- Using the cap is not advised, but it can now be configured
2025-10-21 22:46:49 +01:00
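
For commit b5190c90f1 above, a minimal sketch of what capping the `Crawl-delay` from robots.txt could look like. The function name `effective_crawl_delay` and the `default_delay` and `max_delay` parameters are assumptions made for illustration, not the crawler's actual configuration. Only the standard-library `urllib.robotparser` is used.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def effective_crawl_delay(
    robots_txt: str,
    user_agent: str,
    default_delay: float = 0.5,
    max_delay: Optional[float] = None,
) -> float:
    """Return the delay to wait between requests, in seconds.

    Hypothetical helper: uses the Crawl-delay from robots.txt when one is
    given for this user agent, optionally capped at max_delay.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    delay = parser.crawl_delay(user_agent)
    if delay is None:
        # No Crawl-delay directive applies to this user agent.
        return default_delay
    if max_delay is not None:
        # Optional cap; leaving max_delay as None respects robots.txt fully.
        return min(float(delay), max_delay)
    return float(delay)
```

Leaving `max_delay` unset keeps the site's requested delay, which matches the commit's note that the cap is an option rather than a recommendation.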
Daulet Amirkhanov
b9877f9e87 create web_url_loader_example.py 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
9b802f651b fix: web_url_loader load_data should yield stored_path 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
d0f3e224cb refactor ingest_data to accommodate non-FS data items 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
2e7ff0b01b remove redundant HtmlContent class in save_data_item_to_storage 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
c0d450b165 tests: fix test_add - add missing required parameter 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
572c8ebce7 refactor: use pydantic models for tavily and beautifulsoup configs instead of dicts 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
36364285b2 tests: fix failing tests 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
9a9f9f6836 tests: add some tests to assert behaviour is as expected 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
185600fe17 revert url_crawler changes to cognee.add(), and update web_url_loader.load() 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
d884867d2c extend LoaderInterface to support web_url_loader, implement load() 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
305969c61b refactor web_url_loader filename 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
95106d5914 fix: ensure web urls correctly go through ingest_data and reach loaders 2025-10-21 22:46:49 +01:00
Daulet Amirkhanov
9395539868 feat: interface for WebLoader 2025-10-21 22:46:49 +01:00
Vasilije
62157a114d
feature: Cognee Search sessions/conversation related short-term memory (#1545)

## Description
This PR introduces QA sessions and conversation-related short-term
memory in cognee search, using Redis.

## Type of Change
- [ ] Bug fix (non-breaking change that fixes an issue)
- [x] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Documentation update
- [ ] Code refactoring
- [ ] Performance improvement
- [ ] Other (please specify):

## Screenshots/Videos (if applicable)
None

## Pre-submission Checklist
- [x] **I have tested my changes thoroughly before submitting this PR**
- [x] **This PR contains minimal changes necessary to address the
issue/feature**
- [x] My code follows the project's coding standards and style
guidelines
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added necessary documentation (if applicable)
- [x] All new and existing tests pass
- [x] I have searched existing PRs to ensure this change hasn't been
submitted already
- [x] I have linked any relevant issues in the description
- [x] My commits have clear and descriptive messages

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
2025-10-21 16:53:16 +02:00