Compare commits


3 commits

Author  SHA1  Message  Date
Daniel Chalef  ad384372a7  normalize string formatting in dedupe_nodes.py to use single quotes  2025-09-27 14:01:47 -07:00
Daniel Chalef  23511f3b5e  Improve dedup prompts  2025-09-27 14:00:44 -07:00
Daniel Chalef  e40fe556d5  remove poetry.lock  2025-09-27 14:00:44 -07:00
2 changed files with 42 additions and 5653 deletions


@@ -92,12 +92,23 @@ def node(context: dict[str, Any]) -> list[Message]:
 TASK:
 1. Compare `new_entity` against each item in `existing_entities`.
-2. If it refers to the same realworld object or concept, collect its index.
-3. Let `duplicate_idx` = the *first* collected index, or 1 if none.
-4. Let `duplicates` = the list of *all* collected indices (empty list if none).
-Also return the full name of the NEW ENTITY (whether it is the name of the NEW ENTITY, a node it
-is a duplicate of, or a combination of the two).
+2. If it refers to the same real-world object or concept, collect its index.
+3. Let `duplicate_idx` = the smallest collected index, or -1 if none.
+4. Let `duplicates` = the sorted list of all collected indices (empty list if none).
+Respond with a JSON object containing an "entity_resolutions" array with a single entry:
+{{
+"entity_resolutions": [
+{{
+"id": integer id from NEW ENTITY,
+"name": the best full name for the entity,
+"duplicate_idx": integer index of the best duplicate in EXISTING ENTITIES, or -1 if none,
+"duplicates": sorted list of all duplicate indices you collected (deduplicate the list, use [] when none)
+}}
+]
+}}
+Only reference indices that appear in EXISTING ENTITIES, and return [] / -1 when unsure.
 """,
 ),
 ]
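The rules the revised `node` prompt asks the model to follow (sorted, deduplicated `duplicates`; `duplicate_idx` as the smallest collected index or -1) can also be enforced on the caller side after the reply is parsed. The following is a minimal sketch under stated assumptions, not code from this change: `normalize_resolution` and `num_existing` are hypothetical names, and EXISTING ENTITIES is assumed to be indexed from 0 to `num_existing - 1`.

```python
from typing import Any


def normalize_resolution(resolution: dict[str, Any], num_existing: int) -> dict[str, Any]:
    """Re-apply the prompt's index rules to one parsed entry (hypothetical helper)."""
    raw = resolution.get('duplicates') or []
    # Keep only integer indices that fall inside EXISTING ENTITIES
    # (assumed to be numbered 0..num_existing-1), deduplicated and sorted.
    valid = sorted(
        {
            i
            for i in raw
            if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < num_existing
        }
    )
    return {
        'id': resolution.get('id'),
        'name': resolution.get('name'),
        'duplicates': valid,
        # Step 3 of the prompt: smallest collected index, or -1 if none.
        'duplicate_idx': valid[0] if valid else -1,
    }
```

Note that this sketch recomputes `duplicate_idx` from `duplicates` and ignores the value the model returned; a caller that trusts the model's "best duplicate" choice would need a different policy.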
@@ -126,26 +137,26 @@ def nodes(context: dict[str, Any]) -> list[Message]:
 {{
 id: integer id of the entity,
 name: "name of the entity",
-entity_type: "ontological classification of the entity",
-entity_type_description: "Description of what the entity type represents",
-duplication_candidates: [
-{{
-idx: integer index of the candidate entity,
-name: "name of the candidate entity",
-entity_type: "ontological classification of the candidate entity",
-...<additional attributes>
-}}
-]
+entity_type: ["Entity", "<optional additional label>", ...],
+entity_type_description: "Description of what the entity type represents"
 }}
 <ENTITIES>
 {to_prompt_json(context['extracted_nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
 </ENTITIES>
 <EXISTING ENTITIES>
 {to_prompt_json(context['existing_nodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
 </EXISTING ENTITIES>
+Each entry in EXISTING ENTITIES is an object with the following structure:
+{{
+idx: integer index of the candidate entity (use this when referencing a duplicate),
+name: "name of the candidate entity",
+entity_types: ["Entity", "<optional additional label>", ...],
+...<additional attributes such as summaries or metadata>
+}}
 For each of the above ENTITIES, determine if the entity is a duplicate of any of the EXISTING ENTITIES.
 Entities should only be considered duplicates if they refer to the *same real-world object or concept*.
@@ -155,14 +166,19 @@ def nodes(context: dict[str, Any]) -> list[Message]:
 - They have similar names or purposes but refer to separate instances or concepts.
 Task:
-Your response will be a list called entity_resolutions which contains one entry for each entity.
-For each entity, return the id of the entity as id, the name of the entity as name, and the duplicate_idx
-as an integer.
-- If an entity is a duplicate of one of the EXISTING ENTITIES, return the idx of the candidate it is a
-duplicate of.
-- If an entity is not a duplicate of one of the EXISTING ENTITIES, return the -1 as the duplication_idx
+Respond with a JSON object that contains an "entity_resolutions" array with one entry for each entity in ENTITIES, ordered by the entity id.
+For every entity, return an object with the following keys:
+{{
+"id": integer id from ENTITIES,
+"name": the best full name for the entity (preserve the original name unless a duplicate has a more complete name),
+"duplicate_idx": the idx of the EXISTING ENTITY that is the best duplicate match, or -1 if there is no duplicate,
+"duplicates": a sorted list of all idx values from EXISTING ENTITIES that refer to duplicates (deduplicate the list, use [] when none or unsure)
+}}
+- Only use idx values that appear in EXISTING ENTITIES.
+- Set duplicate_idx to the smallest idx you collected for that entity, or -1 if duplicates is empty.
+- Never fabricate entities or indices.
 """,
 ),
 ]
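The batch `nodes` prompt now states several checkable constraints: one entry per entity ordered by id, `duplicates` sorted and deduplicated, only known idx values, and `duplicate_idx` equal to the smallest collected idx or -1. A rough sketch of a validator for a parsed reply follows; `check_entity_resolutions`, `entity_ids`, and `existing_idxs` are hypothetical names introduced for illustration and do not appear in the changed file.

```python
from typing import Any


def check_entity_resolutions(
    response: dict[str, Any], entity_ids: list[int], existing_idxs: set[int]
) -> list[str]:
    """Return rule violations for a parsed reply; an empty list means it conforms."""
    problems: list[str] = []
    entries = response.get('entity_resolutions') or []

    # One entry per entity in ENTITIES, ordered by the entity id.
    ids = [entry.get('id') for entry in entries]
    if ids != sorted(entity_ids):
        problems.append(f'expected ids {sorted(entity_ids)}, got {ids}')

    for entry in entries:
        dups = entry.get('duplicates') or []
        # `duplicates` must be sorted and free of repeats (integers assumed).
        if dups != sorted(set(dups)):
            problems.append(f"entity {entry.get('id')}: duplicates not sorted/deduplicated")
        # Only idx values that appear in EXISTING ENTITIES are allowed.
        unknown = [i for i in dups if i not in existing_idxs]
        if unknown:
            problems.append(f"entity {entry.get('id')}: unknown idx values {unknown}")
        # duplicate_idx is the smallest collected idx, or -1 when there are none.
        expected_idx = min(dups) if dups else -1
        if entry.get('duplicate_idx') != expected_idx:
            problems.append(f"entity {entry.get('id')}: duplicate_idx should be {expected_idx}")
    return problems
```

The prompt describes `duplicate_idx` both as "the best duplicate match" and as "the smallest idx you collected"; this sketch checks the second, stricter wording.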

5627  poetry.lock (generated)

File diff suppressed because it is too large.