graphiti/graphiti_core/prompts/dedupe_edges.py
Daniel Chalef 4a307dbf10
Optimize edge deduplication prompt for caching and clarity (#970)
* Optimize edge deduplication prompt for caching and clarity

- Restructure prompt to place invariant instructions at top and dynamic context at bottom for better LLM caching
- Change 'id' to 'idx' in edge context lists to avoid confusion with other identifiers
- Remove 'fact_type_id' from edge types context as LLM only needs fact_type_name
- Remove dynamic range values from prompt instructions (e.g., "range 0-N")
- Add debug logging before LLM call to track input sizes
- Add validation logging after LLM response to catch invalid idx values
- Clarify that duplicate_facts uses EXISTING FACTS idx and contradicted_facts uses INVALIDATION CANDIDATES idx

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Address terminology consistency and edge case logging

- Update Pydantic field descriptions to use 'idx' instead of 'ids' for consistency
- Fix debug logging to handle empty list edge case (avoid 'idx 0--1' display)

Note on review feedback:
- Validation is intentionally non-redundant: warnings provide visibility, list comprehensions ensure robustness
- WARNING level is appropriate for LLM output issues (not system errors)
- Existing test coverage is sufficient for this defensive logging addition

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-02 17:07:43 -07:00

174 lines
6.1 KiB
Python

"""
Copyright 2024, Zep Software, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any, Protocol, TypedDict
from pydantic import BaseModel, Field
from .models import Message, PromptFunction, PromptVersion
from .prompt_helpers import to_prompt_json
class EdgeDuplicate(BaseModel):
duplicate_facts: list[int] = Field(
...,
description='List of idx values of any duplicate facts. If no duplicate facts are found, default to empty list.',
)
contradicted_facts: list[int] = Field(
...,
description='List of idx values of facts that should be invalidated. If no facts should be invalidated, the list should be empty.',
)
fact_type: str = Field(..., description='One of the provided fact types or DEFAULT')
class UniqueFact(BaseModel):
uuid: str = Field(..., description='unique identifier of the fact')
fact: str = Field(..., description='fact of a unique edge')
class UniqueFacts(BaseModel):
unique_facts: list[UniqueFact]
class Prompt(Protocol):
edge: PromptVersion
edge_list: PromptVersion
resolve_edge: PromptVersion
class Versions(TypedDict):
edge: PromptFunction
edge_list: PromptFunction
resolve_edge: PromptFunction
def edge(context: dict[str, Any]) -> list[Message]:
return [
Message(
role='system',
content='You are a helpful assistant that de-duplicates edges from edge lists.',
),
Message(
role='user',
content=f"""
Given the following context, determine whether the New Edge represents any of the edges in the list of Existing Edges.
<EXISTING EDGES>
{to_prompt_json(context['related_edges'], indent=2)}
</EXISTING EDGES>
<NEW EDGE>
{to_prompt_json(context['extracted_edges'], indent=2)}
</NEW EDGE>
Task:
If the New Edges represents the same factual information as any edge in Existing Edges, return the id of the duplicate fact
as part of the list of duplicate_facts.
If the NEW EDGE is not a duplicate of any of the EXISTING EDGES, return an empty list.
Guidelines:
1. The facts do not need to be completely identical to be duplicates, they just need to express the same information.
""",
),
]
def edge_list(context: dict[str, Any]) -> list[Message]:
return [
Message(
role='system',
content='You are a helpful assistant that de-duplicates edges from edge lists.',
),
Message(
role='user',
content=f"""
Given the following context, find all of the duplicates in a list of facts:
Facts:
{to_prompt_json(context['edges'], indent=2)}
Task:
If any facts in Facts is a duplicate of another fact, return a new fact with one of their uuid's.
Guidelines:
1. identical or near identical facts are duplicates
2. Facts are also duplicates if they are represented by similar sentences
3. Facts will often discuss the same or similar relation between identical entities
4. The final list should have only unique facts. If 3 facts are all duplicates of each other, only one of their
facts should be in the response
""",
),
]
def resolve_edge(context: dict[str, Any]) -> list[Message]:
return [
Message(
role='system',
content='You are a helpful assistant that de-duplicates facts from fact lists and determines which existing '
'facts are contradicted by the new fact.',
),
Message(
role='user',
content=f"""
Task:
You will receive TWO separate lists of facts. Each list uses 'idx' as its index field, starting from 0.
1. DUPLICATE DETECTION:
- If the NEW FACT represents identical factual information as any fact in EXISTING FACTS, return those idx values in duplicate_facts.
- Facts with similar information that contain key differences should NOT be marked as duplicates.
- Return idx values from EXISTING FACTS.
- If no duplicates, return an empty list for duplicate_facts.
2. FACT TYPE CLASSIFICATION:
- Given the predefined FACT TYPES, determine if the NEW FACT should be classified as one of these types.
- Return the fact type as fact_type or DEFAULT if NEW FACT is not one of the FACT TYPES.
3. CONTRADICTION DETECTION:
- Based on FACT INVALIDATION CANDIDATES and NEW FACT, determine which facts the new fact contradicts.
- Return idx values from FACT INVALIDATION CANDIDATES.
- If no contradictions, return an empty list for contradicted_facts.
IMPORTANT:
- duplicate_facts: Use ONLY 'idx' values from EXISTING FACTS
- contradicted_facts: Use ONLY 'idx' values from FACT INVALIDATION CANDIDATES
- These are two separate lists with independent idx ranges starting from 0
Guidelines:
1. Some facts may be very similar but will have key differences, particularly around numeric values in the facts.
Do not mark these facts as duplicates.
<FACT TYPES>
{context['edge_types']}
</FACT TYPES>
<EXISTING FACTS>
{context['existing_edges']}
</EXISTING FACTS>
<FACT INVALIDATION CANDIDATES>
{context['edge_invalidation_candidates']}
</FACT INVALIDATION CANDIDATES>
<NEW FACT>
{context['new_edge']}
</NEW FACT>
""",
),
]
versions: Versions = {'edge': edge, 'edge_list': edge_list, 'resolve_edge': resolve_edge}