fix: Clean up pokemons (#746)

<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
Vasilije 2025-04-19 10:51:51 +02:00 committed by GitHub
parent 8eda1eda74
commit 8374e402a8
2 changed files with 0 additions and 370 deletions

@ -1,162 +0,0 @@
import cognee
import asyncio
from cognee.shared.logging_utils import get_logger, ERROR
from cognee.api.v1.search import SearchType
from cognee.modules.retrieval.EntityCompletionRetriever import EntityCompletionRetriever
from cognee.modules.retrieval.context_providers.TripletSearchContextProvider import (
    TripletSearchContextProvider,
)
from cognee.modules.retrieval.context_providers.SummarizedTripletSearchContextProvider import (
    SummarizedTripletSearchContextProvider,
)
from cognee.modules.retrieval.entity_extractors.DummyEntityExtractor import DummyEntityExtractor
article_1 = """
Title: The Theory of Relativity: A Revolutionary Breakthrough
Author: Dr. Sarah Chen
Albert Einstein's theory of relativity fundamentally changed our understanding of space, time, and gravity. Published in 1915, the general theory of relativity describes gravity as a consequence of the curvature of spacetime caused by mass and energy. This groundbreaking work built upon his special theory of relativity from 1905, which introduced the famous equation E=mc².
Einstein's work at the Swiss Patent Office gave him time to develop these revolutionary ideas. His mathematical framework predicted several phenomena that were later confirmed, including:
- The bending of light by gravity
- The precession of Mercury's orbit
- The existence of black holes
The theory continues to be tested and validated today, most recently through the detection of gravitational waves by LIGO in 2015, exactly 100 years after its publication.
"""
article_2 = """
Title: The Manhattan Project and Its Scientific Director
Author: Prof. Michael Werner
J. Robert Oppenheimer's leadership of the Manhattan Project marked a pivotal moment in scientific history. As scientific director of the Los Alamos Laboratory, he assembled and led an extraordinary team of physicists in the development of the atomic bomb during World War II.
Oppenheimer's journey to Los Alamos began at Harvard and continued through his groundbreaking work in quantum mechanics and nuclear physics at Berkeley. His expertise in theoretical physics and exceptional leadership abilities made him the ideal candidate to head the secret weapons laboratory.
Key aspects of his directorship included:
- Recruitment of top scientific talent from across the country
- Integration of theoretical physics with practical engineering challenges
- Development of implosion-type nuclear weapons
- Management of complex security and ethical considerations
After witnessing the first nuclear test, codenamed Trinity, Oppenheimer famously quoted the Bhagavad Gita: "Now I am become Death, the destroyer of worlds." This moment reflected the profound moral implications of scientific advancement that would shape his later advocacy for international atomic controls.
"""
article_3 = """
Title: The Birth of Quantum Physics
Author: Dr. Lisa Martinez
The early 20th century witnessed a revolutionary transformation in our understanding of the microscopic world. The development of quantum mechanics emerged from the collaborative efforts of numerous brilliant physicists grappling with phenomena that classical physics couldn't explain.
Key contributors and their insights included:
- Max Planck's discovery of energy quantization (1900)
- Niels Bohr's model of the atom with discrete energy levels (1913)
- Werner Heisenberg's uncertainty principle (1927)
- Erwin Schrödinger's wave equation (1926)
- Paul Dirac's quantum theory of the electron (1928)
Einstein's 1905 paper on the photoelectric effect, which demonstrated light's particle nature, was a crucial contribution to this field. The Copenhagen interpretation, developed primarily by Bohr and Heisenberg, became the standard understanding of quantum mechanics, despite ongoing debates about its philosophical implications. These foundational developments continue to influence modern physics, from quantum computing to quantum field theory.
"""
async def main(enable_steps):
    # Step 1: Reset data and system state
    if enable_steps.get("prune_data"):
        await cognee.prune.prune_data()
        print("Data pruned.")

    if enable_steps.get("prune_system"):
        await cognee.prune.prune_system(metadata=True)
        print("System pruned.")

    # Step 2: Add text
    if enable_steps.get("add_text"):
        text_list = [article_1, article_2, article_3]
        for text in text_list:
            await cognee.add(text)
            print(f"Added text: {text[:50]}...")

    # Step 3: Create knowledge graph
    if enable_steps.get("cognify"):
        await cognee.cognify()
        print("Knowledge graph created.")

    # Step 4: Query insights using our new retrievers
    if enable_steps.get("retriever"):
        # Common settings
        search_settings = {
            "top_k": 5,
            "collections": ["Entity_name", "TextSummary_text"],
            "properties_to_project": ["name", "description", "text"],
        }

        # Create both context providers
        direct_provider = TripletSearchContextProvider(**search_settings)
        summary_provider = SummarizedTripletSearchContextProvider(**search_settings)

        # Create retrievers with different providers
        direct_retriever = EntityCompletionRetriever(
            extractor=DummyEntityExtractor(),
            context_provider=direct_provider,
            system_prompt_path="answer_simple_question.txt",
            user_prompt_path="context_for_question.txt",
        )
        summary_retriever = EntityCompletionRetriever(
            extractor=DummyEntityExtractor(),
            context_provider=summary_provider,
            system_prompt_path="answer_simple_question.txt",
            user_prompt_path="context_for_question.txt",
        )

        query = "What were the early contributions to quantum physics?"
        print("\nQuery:", query)

        # Try with direct triplets
        print("\n=== Direct Triplets ===")
        context = await direct_retriever.get_context(query)
        print("\nEntity Context:")
        print(context)
        result = await direct_retriever.get_completion(query)
        print("\nEntity Completion:")
        print(result)

        # Try with summarized triplets
        print("\n=== Summarized Triplets ===")
        context = await summary_retriever.get_context(query)
        print("\nEntity Context:")
        print(context)
        result = await summary_retriever.get_completion(query)
        print("\nEntity Completion:")
        print(result)

        # Compare with standard search
        print("\n=== Standard Search ===")
        search_results = await cognee.search(
            query_type=SearchType.GRAPH_COMPLETION, query_text=query
        )
        print(search_results)
if __name__ == "__main__":
    logger = get_logger(level=ERROR)

    rebuild_kg = True
    retrieve = True
    steps_to_enable = {
        "prune_data": rebuild_kg,
        "prune_system": rebuild_kg,
        "add_text": rebuild_kg,
        "cognify": rebuild_kg,
        "retriever": retrieve,
    }

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(main(steps_to_enable))
    finally:
        loop.run_until_complete(loop.shutdown_asyncgens())
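
The removed script drives `main` through a dictionary of step toggles and a manually managed event loop. As a rough sketch of the same toggle pattern, the version below uses `asyncio.run`, which creates the loop and shuts down async generators on exit. The function `run_steps` and its placeholder step bodies are hypothetical stand-ins; nothing here calls cognee.

```python
import asyncio

# Step names mirror the keys used in the removed example.
STEP_ORDER = ("prune_data", "prune_system", "add_text", "cognify", "retriever")


async def run_steps(enable_steps: dict) -> list:
    """Run the enabled steps in a fixed order and return their names.

    Hypothetical stand-in for the removed main(); each step body is a placeholder.
    """
    executed = []
    for step in STEP_ORDER:
        # Missing keys count as disabled, matching enable_steps.get(...) in the original.
        if enable_steps.get(step):
            await asyncio.sleep(0)  # placeholder for the real awaitable
            executed.append(step)
    return executed


if __name__ == "__main__":
    rebuild_kg = False  # skip the expensive rebuild steps
    retrieve = True
    steps_to_enable = {
        "prune_data": rebuild_kg,
        "prune_system": rebuild_kg,
        "add_text": rebuild_kg,
        "cognify": rebuild_kg,
        "retriever": retrieve,
    }
    # asyncio.run replaces new_event_loop / run_until_complete / shutdown_asyncgens.
    print(asyncio.run(run_steps(steps_to_enable)))
```

Setting `rebuild_kg = False` reruns only the retrieval step against an existing graph, which appears to be the purpose of the two flags in the original.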

@ -1,208 +0,0 @@
# Standard library imports
import os
import json
import asyncio
import pathlib
from uuid import uuid5, NAMESPACE_OID
from typing import List, Optional
from pathlib import Path
import dlt
import requests
import cognee
from cognee.low_level import DataPoint, setup as cognee_setup
from cognee.api.v1.search import SearchType
from cognee.tasks.storage import add_data_points
from cognee.modules.pipelines.tasks.task import Task
from cognee.modules.pipelines import run_tasks
BASE_URL = "https://pokeapi.co/api/v2/"
os.environ["BUCKET_URL"] = "./.data_storage"
os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "true"
# Data Models
class Abilities(DataPoint):
    name: str = "Abilities"
    metadata: dict = {"index_fields": ["name"]}


class PokemonAbility(DataPoint):
    name: str
    ability__name: str
    ability__url: str
    is_hidden: bool
    slot: int
    _dlt_load_id: str
    _dlt_id: str
    _dlt_parent_id: str
    _dlt_list_idx: str
    is_type: Abilities
    metadata: dict = {"index_fields": ["ability__name"]}


class Pokemons(DataPoint):
    name: str = "Pokemons"
    have: Abilities
    metadata: dict = {"index_fields": ["name"]}


class Pokemon(DataPoint):
    name: str
    base_experience: int
    height: int
    weight: int
    is_default: bool
    order: int
    location_area_encounters: str
    species__name: str
    species__url: str
    cries__latest: str
    cries__legacy: str
    sprites__front_default: str
    sprites__front_shiny: str
    sprites__back_default: Optional[str]
    sprites__back_shiny: Optional[str]
    _dlt_load_id: str
    _dlt_id: str
    is_type: Pokemons
    abilities: List[PokemonAbility]
    metadata: dict = {"index_fields": ["name"]}
# Data Collection Functions
@dlt.resource(write_disposition="replace")
def pokemon_list(limit: int = 50):
    response = requests.get(f"{BASE_URL}pokemon", params={"limit": limit})
    response.raise_for_status()
    yield response.json()["results"]


@dlt.transformer(data_from=pokemon_list)
def pokemon_details(pokemons):
    """Fetches detailed info for each Pokémon"""
    for pokemon in pokemons:
        response = requests.get(pokemon["url"])
        response.raise_for_status()
        yield response.json()
# Data Loading Functions
def load_abilities_data(jsonl_abilities):
    abilities_root = Abilities()
    pokemon_abilities = []

    for jsonl_ability in jsonl_abilities:
        with open(jsonl_ability, "r") as f:
            for line in f:
                ability = json.loads(line)
                ability["id"] = uuid5(NAMESPACE_OID, ability["_dlt_id"])
                ability["name"] = ability["ability__name"]
                ability["is_type"] = abilities_root
                pokemon_abilities.append(ability)

    return abilities_root, pokemon_abilities


def load_pokemon_data(jsonl_pokemons, pokemon_abilities, pokemon_root):
    pokemons = []

    for jsonl_pokemon in jsonl_pokemons:
        with open(jsonl_pokemon, "r") as f:
            for line in f:
                pokemon_data = json.loads(line)
                abilities = [
                    ability
                    for ability in pokemon_abilities
                    if ability["_dlt_parent_id"] == pokemon_data["_dlt_id"]
                ]
                pokemon_data["external_id"] = pokemon_data["id"]
                pokemon_data["id"] = uuid5(NAMESPACE_OID, str(pokemon_data["id"]))
                pokemon_data["abilities"] = [PokemonAbility(**ability) for ability in abilities]
                pokemon_data["is_type"] = pokemon_root
                pokemons.append(Pokemon(**pokemon_data))

    return pokemons
# Main Application Logic
async def setup_and_process_data():
    """Setup configuration and process Pokemon data"""
    # Setup configuration
    data_directory_path = str(
        pathlib.Path(os.path.join(pathlib.Path(__file__).parent, ".data_storage")).resolve()
    )
    cognee_directory_path = str(
        pathlib.Path(os.path.join(pathlib.Path(__file__).parent, ".cognee_system")).resolve()
    )
    cognee.config.data_root_directory(data_directory_path)
    cognee.config.system_root_directory(cognee_directory_path)

    # Initialize pipeline and collect data
    pipeline = dlt.pipeline(
        pipeline_name="pokemon_pipeline",
        destination="filesystem",
        dataset_name="pokemon_data",
    )
    info = pipeline.run([pokemon_list, pokemon_details])
    print(info)

    # Load and process data
    STORAGE_PATH = Path(".data_storage/pokemon_data/pokemon_details")
    jsonl_pokemons = sorted(STORAGE_PATH.glob("*.jsonl"))
    if not jsonl_pokemons:
        raise FileNotFoundError("No JSONL files found in the storage directory.")

    ABILITIES_PATH = Path(".data_storage/pokemon_data/pokemon_details__abilities")
    jsonl_abilities = sorted(ABILITIES_PATH.glob("*.jsonl"))
    if not jsonl_abilities:
        raise FileNotFoundError("No JSONL files found in the storage directory.")

    # Process data
    abilities_root, pokemon_abilities = load_abilities_data(jsonl_abilities)
    pokemon_root = Pokemons(have=abilities_root)
    pokemons = load_pokemon_data(jsonl_pokemons, pokemon_abilities, pokemon_root)

    return pokemons
async def pokemon_cognify(pokemons):
    """Process Pokemon data with Cognee and perform search"""
    # Setup and run Cognee tasks
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    await cognee_setup()

    # tasks = [Task(add_data_points, task_config={"batch_size": 50})]
    tasks = [Task(add_data_points)]

    results = run_tasks(
        tasks=tasks,
        data=pokemons,
        dataset_id=uuid5(NAMESPACE_OID, "Pokemon"),
        pipeline_name="pokemon_pipeline",
    )

    async for result in results:
        print(result)
    print("Done")

    # Perform search
    search_results = await cognee.search(
        query_type=SearchType.GRAPH_COMPLETION, query_text="pokemons?"
    )

    print("Search results:")
    for result_text in search_results:
        print(result_text)
async def main():
    pokemons = await setup_and_process_data()
    await pokemon_cognify(pokemons)


if __name__ == "__main__":
    asyncio.run(main())
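
Two techniques in the removed loader can be shown in isolation. Node IDs are derived with `uuid5(NAMESPACE_OID, ...)`, so the same source key always maps to the same UUID across runs. Abilities are attached to their Pokémon by comparing `_dlt_parent_id` with the parent's `_dlt_id`. The sketch below uses made-up rows and two hypothetical helpers, `stable_id` and `group_by_parent`. It builds the parent lookup once, whereas the original list comprehension rescans the ability list for every Pokémon.

```python
from collections import defaultdict
from uuid import NAMESPACE_OID, UUID, uuid5


def stable_id(key) -> UUID:
    """Derive a deterministic UUID from a source key (hypothetical helper)."""
    return uuid5(NAMESPACE_OID, str(key))


def group_by_parent(children: list) -> dict:
    """Index child rows by their _dlt_parent_id (hypothetical helper)."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child["_dlt_parent_id"]].append(child)
    return dict(grouped)


# Made-up rows shaped like the JSONL records the removed example read;
# real dlt output carries many more columns.
pokemon_rows = [
    {"_dlt_id": "p1", "id": 1, "name": "bulbasaur"},
    {"_dlt_id": "p2", "id": 4, "name": "charmander"},
]
ability_rows = [
    {"_dlt_parent_id": "p1", "ability__name": "overgrow"},
    {"_dlt_parent_id": "p1", "ability__name": "chlorophyll"},
    {"_dlt_parent_id": "p2", "ability__name": "blaze"},
]

by_parent = group_by_parent(ability_rows)
for row in pokemon_rows:
    # Rows without abilities fall back to an empty list.
    names = [a["ability__name"] for a in by_parent.get(row["_dlt_id"], [])]
    print(row["name"], stable_id(row["id"]), names)
```

Because `uuid5` hashes the namespace together with the name, rerunning the load produces the same IDs, which is presumably why the original used it rather than random `uuid4` values.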