graphiti/graphiti_core/prompts/dedupe_nodes.py
HUGO SON ce9ef3ca79
Add support for non-ASCII characters in LLM prompts (#805)
* Add support for non-ASCII characters in LLM prompts

- Add ensure_ascii parameter to Graphiti class (default: True)
- Create to_prompt_json helper function for consistent JSON serialization
- Update all prompt files to use new helper function
- Preserve Korean/Japanese/Chinese characters when ensure_ascii=False
- Maintain backward compatibility with existing behavior

Fixes issue where non-ASCII characters were escaped as unicode sequences
in prompts, making them unreadable in LLM logs and potentially affecting
model understanding.

* Remove unused json imports after replacing with to_prompt_json helper

- Fix ruff lint errors (F401) for unused json imports
- All prompt files now use to_prompt_json helper instead of json.dumps
- Maintains clean code style and passes lint checks

* Fix ensure_ascii propagation to all LLM calls

- Add ensure_ascii parameter to maintenance operation functions that were missing it
- Update function signatures in node_operations, community_operations, temporal_operations, and edge_operations
- Ensure all llm_client.generate_response calls receive proper ensure_ascii context
- Fix hardcoded ensure_ascii: True values that prevented non-ASCII character preservation
- Maintain backward compatibility with default ensure_ascii=True
- Complete the fix for issue #804 ensuring Korean/Japanese/Chinese characters are properly handled in LLM prompts
2025-08-08 11:07:32 -04:00

208 lines
8 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

"""
Copyright 2024, Zep Software, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any, Protocol, TypedDict
from pydantic import BaseModel, Field
from .models import Message, PromptFunction, PromptVersion
from .prompt_helpers import to_prompt_json
class NodeDuplicate(BaseModel):
id: int = Field(..., description='integer id of the entity')
duplicate_idx: int = Field(
...,
description='idx of the duplicate entity. If no duplicate entities are found, default to -1.',
)
name: str = Field(
...,
description='Name of the entity. Should be the most complete and descriptive name of the entity. Do not include any JSON formatting in the Entity name such as {}.',
)
duplicates: list[int] = Field(
...,
description='idx of all entities that are a duplicate of the entity with the above id.',
)
class NodeResolutions(BaseModel):
entity_resolutions: list[NodeDuplicate] = Field(..., description='List of resolved nodes')
class Prompt(Protocol):
node: PromptVersion
node_list: PromptVersion
nodes: PromptVersion
class Versions(TypedDict):
node: PromptFunction
node_list: PromptFunction
nodes: PromptFunction
def node(context: dict[str, Any]) -> list[Message]:
return [
Message(
role='system',
content='You are a helpful assistant that determines whether or not a NEW ENTITY is a duplicate of any EXISTING ENTITIES.',
),
Message(
role='user',
content=f"""
<PREVIOUS MESSAGES>
{to_prompt_json([ep for ep in context['previous_episodes']], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{context['episode_content']}
</CURRENT MESSAGE>
<NEW ENTITY>
{to_prompt_json(context['extracted_node'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</NEW ENTITY>
<ENTITY TYPE DESCRIPTION>
{to_prompt_json(context['entity_type_description'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</ENTITY TYPE DESCRIPTION>
<EXISTING ENTITIES>
{to_prompt_json(context['existing_nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</EXISTING ENTITIES>
Given the above EXISTING ENTITIES and their attributes, MESSAGE, and PREVIOUS MESSAGES; Determine if the NEW ENTITY extracted from the conversation
is a duplicate entity of one of the EXISTING ENTITIES.
Entities should only be considered duplicates if they refer to the *same real-world object or concept*.
Semantic Equivalence: if a descriptive label in existing_entities clearly refers to a named entity in context, treat them as duplicates.
Do NOT mark entities as duplicates if:
- They are related but distinct.
- They have similar names or purposes but refer to separate instances or concepts.
TASK:
1. Compare `new_entity` against each item in `existing_entities`.
2. If it refers to the same realworld object or concept, collect its index.
3. Let `duplicate_idx` = the *first* collected index, or 1 if none.
4. Let `duplicates` = the list of *all* collected indices (empty list if none).
Also return the full name of the NEW ENTITY (whether it is the name of the NEW ENTITY, a node it
is a duplicate of, or a combination of the two).
""",
),
]
def nodes(context: dict[str, Any]) -> list[Message]:
return [
Message(
role='system',
content='You are a helpful assistant that determines whether or not ENTITIES extracted from a conversation are duplicates'
'of existing entities.',
),
Message(
role='user',
content=f"""
<PREVIOUS MESSAGES>
{to_prompt_json([ep for ep in context['previous_episodes']], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{context['episode_content']}
</CURRENT MESSAGE>
Each of the following ENTITIES were extracted from the CURRENT MESSAGE.
Each entity in ENTITIES is represented as a JSON object with the following structure:
{{
id: integer id of the entity,
name: "name of the entity",
entity_type: "ontological classification of the entity",
entity_type_description: "Description of what the entity type represents",
duplication_candidates: [
{{
idx: integer index of the candidate entity,
name: "name of the candidate entity",
entity_type: "ontological classification of the candidate entity",
...<additional attributes>
}}
]
}}
<ENTITIES>
{to_prompt_json(context['extracted_nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</ENTITIES>
<EXISTING ENTITIES>
{to_prompt_json(context['existing_nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</EXISTING ENTITIES>
For each of the above ENTITIES, determine if the entity is a duplicate of any of the EXISTING ENTITIES.
Entities should only be considered duplicates if they refer to the *same real-world object or concept*.
Do NOT mark entities as duplicates if:
- They are related but distinct.
- They have similar names or purposes but refer to separate instances or concepts.
Task:
Your response will be a list called entity_resolutions which contains one entry for each entity.
For each entity, return the id of the entity as id, the name of the entity as name, and the duplicate_idx
as an integer.
- If an entity is a duplicate of one of the EXISTING ENTITIES, return the idx of the candidate it is a
duplicate of.
- If an entity is not a duplicate of one of the EXISTING ENTITIES, return the -1 as the duplication_idx
""",
),
]
def node_list(context: dict[str, Any]) -> list[Message]:
return [
Message(
role='system',
content='You are a helpful assistant that de-duplicates nodes from node lists.',
),
Message(
role='user',
content=f"""
Given the following context, deduplicate a list of nodes:
Nodes:
{to_prompt_json(context['nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
Task:
1. Group nodes together such that all duplicate nodes are in the same list of uuids
2. All duplicate uuids should be grouped together in the same list
3. Also return a new summary that synthesizes the summary into a new short summary
Guidelines:
1. Each uuid from the list of nodes should appear EXACTLY once in your response
2. If a node has no duplicates, it should appear in the response in a list of only one uuid
Respond with a JSON object in the following format:
{{
"nodes": [
{{
"uuids": ["5d643020624c42fa9de13f97b1b3fa39", "node that is a duplicate of 5d643020624c42fa9de13f97b1b3fa39"],
"summary": "Brief summary of the node summaries that appear in the list of names."
}}
]
}}
""",
),
]
versions: Versions = {'node': node, 'node_list': node_list, 'nodes': nodes}