normalize string formatting in dedupe_nodes.py to use single quotes

Improve dedup prompts
remove poetry.lock
2025-09-27 14:01:47 -07:00 · 2025-09-27 14:00:44 -07:00 · 2025-09-27 14:00:44 -07:00
2 changed files with 42 additions and 5653 deletions
--- a/graphiti_core/prompts/dedupe_nodes.py
+++ b/graphiti_core/prompts/dedupe_nodes.py
@@ -92,12 +92,23 @@ def node(context: dict[str, Any]) -> list[Message]:
         TASK:
         1. Compare `new_entity` against each item in `existing_entities`.
-         2. If it refers to the same real‐world object or concept, collect its index.
+         2. If it refers to the same real-world object or concept, collect its index.
-         3. Let `duplicate_idx` = the *first* collected index, or –1 if none.
+         3. Let `duplicate_idx` = the smallest collected index, or -1 if none.
-         4. Let `duplicates` = the list of *all* collected indices (empty list if none).
+         4. Let `duplicates` = the sorted list of all collected indices (empty list if none).
-        
+
-        Also return the full name of the NEW ENTITY (whether it is the name of the NEW ENTITY, a node it
+        Respond with a JSON object containing an "entity_resolutions" array with a single entry:
-        is a duplicate of, or a combination of the two).
+        {{
+            "entity_resolutions": [
+                {{
+                    "id": integer id from NEW ENTITY,
+                    "name": the best full name for the entity,
+                    "duplicate_idx": integer index of the best duplicate in EXISTING ENTITIES, or -1 if none,
+                    "duplicates": sorted list of all duplicate indices you collected (deduplicate the list, use [] when none)
+                }}
+            ]
+        }}
+        Only reference indices that appear in EXISTING ENTITIES, and return [] / -1 when unsure.
        """,
        ),
    ]
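
Review note: steps 3 and 4 of the revised TASK tie `duplicate_idx` to `duplicates`: the first is the smallest collected index (or -1), the second is the sorted, deduplicated list. A minimal Python sketch of that relationship follows. The helper name `summarize_matches` is hypothetical and is not part of graphiti_core; it only illustrates the rule the prompt states.

```python
def summarize_matches(collected: list[int]) -> tuple[int, list[int]]:
    """Return (duplicate_idx, duplicates) for a list of collected indices."""
    # Deduplicate and sort, as the prompt asks of the model.
    duplicates = sorted(set(collected))
    # The smallest collected index is reported as duplicate_idx; -1 means no match.
    duplicate_idx = duplicates[0] if duplicates else -1
    return duplicate_idx, duplicates


print(summarize_matches([4, 1, 4]))  # (1, [1, 4])
print(summarize_matches([]))         # (-1, [])
```

A caller could compare the model's reply against this result to detect replies where `duplicate_idx` and `duplicates` disagree.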
@@ -126,26 +137,26 @@ def nodes(context: dict[str, Any]) -> list[Message]:
        {{
            id: integer id of the entity,
            name: "name of the entity",
-            entity_type: "ontological classification of the entity",
+            entity_type: ["Entity", "<optional additional label>", ...],
-            entity_type_description: "Description of what the entity type represents",
+            entity_type_description: "Description of what the entity type represents"
            duplication_candidates: [
                {{
                    idx: integer index of the candidate entity,
                    name: "name of the candidate entity",
                    entity_type: "ontological classification of the candidate entity",
                    ...<additional attributes>
                }}
            ]
        }}
-        
+
        <ENTITIES>
        {to_prompt_json(context['extracted_nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
        </ENTITIES>
-        
+
        <EXISTING ENTITIES>
        {to_prompt_json(context['existing_nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
        </EXISTING ENTITIES>
        Each entry in EXISTING ENTITIES is an object with the following structure:
        {{
            idx: integer index of the candidate entity (use this when referencing a duplicate),
            name: "name of the candidate entity",
            entity_types: ["Entity", "<optional additional label>", ...],
            ...<additional attributes such as summaries or metadata>
        }}
        For each of the above ENTITIES, determine if the entity is a duplicate of any of the EXISTING ENTITIES.
        Entities should only be considered duplicates if they refer to the *same real-world object or concept*.
@@ -155,14 +166,19 @@ def nodes(context: dict[str, Any]) -> list[Message]:
        - They have similar names or purposes but refer to separate instances or concepts.
        Task:
-        Your response will be a list called entity_resolutions which contains one entry for each entity.
+        Respond with a JSON object that contains an "entity_resolutions" array with one entry for each entity in ENTITIES, ordered by the entity id.
-        
+
-        For each entity, return the id of the entity as id, the name of the entity as name, and the duplicate_idx
+        For every entity, return an object with the following keys:
-        as an integer.
+        {{
-        
+            "id": integer id from ENTITIES,
-        - If an entity is a duplicate of one of the EXISTING ENTITIES, return the idx of the candidate it is a 
+            "name": the best full name for the entity (preserve the original name unless a duplicate has a more complete name),
-        duplicate of.
+            "duplicate_idx": the idx of the EXISTING ENTITY that is the best duplicate match, or -1 if there is no duplicate,
-        - If an entity is not a duplicate of one of the EXISTING ENTITIES, return the -1 as the duplication_idx
+            "duplicates": a sorted list of all idx values from EXISTING ENTITIES that refer to duplicates (deduplicate the list, use [] when none or unsure)
+        }}
+        - Only use idx values that appear in EXISTING ENTITIES.
+        - Set duplicate_idx to the smallest idx you collected for that entity, or -1 if duplicates is empty.
+        - Never fabricate entities or indices.
        """,
        ),
    ]
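
Review note: the batch prompt now states several checkable rules (one entry per entity, ordered by id, only known idx values, `duplicate_idx` equal to the smallest duplicate or -1). The sketch below shows how a caller might enforce them on the model's reply. It is an assumption-laden illustration, not the code graphiti_core uses: `check_resolutions`, `entity_ids`, and `valid_idx` are hypothetical names, and the reply is assumed to arrive as a JSON string.

```python
import json
from typing import Any


def check_resolutions(
    raw: str, entity_ids: list[int], valid_idx: set[int]
) -> list[dict[str, Any]]:
    """Parse a model reply and enforce the rules stated in the prompt."""
    entries = json.loads(raw)["entity_resolutions"]
    by_id = {entry["id"]: entry for entry in entries}
    # Exactly one entry per entity: no missing, extra, or repeated ids.
    if set(by_id) != set(entity_ids) or len(entries) != len(entity_ids):
        raise ValueError("expected exactly one entry per entity in ENTITIES")

    cleaned = []
    for entity_id in sorted(entity_ids):  # the prompt asks for ordering by id
        entry = by_id[entity_id]
        # Drop indices that are not in EXISTING ENTITIES, then dedupe and sort.
        duplicates = sorted(
            {i for i in entry.get("duplicates", []) if i in valid_idx}
        )
        cleaned.append(
            {
                "id": entity_id,
                "name": entry["name"],
                # Smallest surviving idx, or -1 when nothing survives.
                "duplicate_idx": duplicates[0] if duplicates else -1,
                "duplicates": duplicates,
            }
        )
    return cleaned


reply = (
    '{"entity_resolutions": [{"id": 0, "name": "Ada Lovelace", '
    '"duplicate_idx": 2, "duplicates": [2, 7, 2]}]}'
)
# idx 7 is not a known candidate, so it is dropped.
print(check_resolutions(reply, [0], {0, 1, 2}))
```

Recomputing `duplicate_idx` from the cleaned list, rather than trusting the model's value, keeps the two fields consistent even when a fabricated index has to be discarded.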
--- a/poetry.lock
+++ b/poetry.lock
Author	SHA1	Message	Date
Daniel Chalef	ad384372a7	normalize string formatting in dedupe_nodes.py to use single quotes	2025-09-27 14:01:47 -07:00
Daniel Chalef	23511f3b5e	Improve dedup prompts	2025-09-27 14:00:44 -07:00
Daniel Chalef	e40fe556d5	remove poetry.lock	2025-09-27 14:00:44 -07:00