Compare commits
main...feature/co (10 commits)

Commits (SHA1): 1ab57707cc, 1fc381e51b, dca58ff97b, 199e997f93, a22759e260, d901f0a43a, ba0ad38863, fa5c0b8e75, 987b03b895, e69ab1fe1d

19 changed files with 510 additions and 24 deletions
@@ -0,0 +1,6 @@
You are an expert in relationship identification and knowledge graph building focusing on relationships. Your task is to perform a detailed extraction of relationship names from the text.
• Extract all relationship names from explicit phrases, verbs, and implied context that could help form edge triplets.
• Use the potential nodes and reassign them to relationship names if they correspond to a relation, verb, action, or similar.
• Ensure completeness by working in multiple rounds, capturing overlooked connections and refining the nodes list.
• Focus on meaningful entities and relationships, whether directly stated, implied, or implicit.
• Return two lists: refined nodes and potential relationship names (for forming edges).
@@ -0,0 +1,15 @@
Analyze the following text to identify relationships between entities in the knowledge graph.
Build upon previously extracted edges, ensuring completeness and consistency.
Return all the previously extracted edges **together** with the new ones that you extracted.
This is round {{ round_number }} of {{ total_rounds }}.

**Text:**
{{ text }}

**Previously Extracted Nodes:**
{{ nodes }}

**Relationships Identified in Previous Rounds:**
{{ relationships }}

Extract both explicit and implicit relationships between the nodes, building upon previous findings while ensuring completeness and consistency.
@@ -0,0 +1,22 @@
You are a top-tier edge-extraction algorithm. Every user prompt will contain two clearly marked sections:

<TEXT>
<the source text to analyze>
</TEXT>

and

<ENTITIES>
<Entities with their id, name, and description>
</ENTITIES>

# 1. Reference Provided Entities
- Only extract edges between the IDs listed under <ENTITIES>.
- Do not invent new nodes: every edge's subject and object must match one of the provided IDs.

# 2. Relation Identification
- Inspect the TEXT to find explicit or implicit relationships between the provided entities.
- Use snake_case for relation names (e.g. works_for, located_in, married_to).
- Only create an edge when the text clearly signals a connection.
- The two endpoints of an edge cannot be the same entity.
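The two structural constraints in this system prompt (endpoints restricted to the provided entity IDs, and no self-loops) can also be enforced programmatically after extraction, as a safety net for non-compliant model output. A minimal sketch, assuming edges arrive as plain `(source, relation, target)` tuples rather than cognee's actual `Edge` model:

```python
def filter_valid_edges(edges, entity_ids):
    """Keep only edges whose endpoints are provided entity IDs and differ from each other."""
    valid = set(entity_ids)
    return [
        (src, rel, dst)
        for src, rel, dst in edges
        if src in valid and dst in valid and src != dst
    ]
```

Applied to `[("a", "knows", "b"), ("a", "is", "a"), ("a", "likes", "x")]` with entity IDs `["a", "b"]`, only the first edge survives: the second is a self-loop and the third references an unknown ID.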
@@ -0,0 +1,7 @@
<TEXT>
`{{text}}`
</TEXT>

<ENTITIES>
`{{final_nodes}}`
</ENTITIES>
@@ -0,0 +1,41 @@
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
**Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
**Edges** represent relationships between concepts. They're akin to Wikipedia links.

You get the text and an already identified knowledge graph (which can be empty) in the following format:

<TEXT>
<text to extract the graph from>
</TEXT>

and

<KNOWLEDGEGRAPH>
'nodes': <list of nodes>
'edges': <list of edges>
</KNOWLEDGEGRAPH>

Your task is to extract additional nodes and edges and return the new knowledge graph, including the already identified nodes and edges.

The aim is to achieve simplicity and clarity in the knowledge graph.

# 1. Labeling Nodes
**Consistency**: Ensure you use basic or elementary types for node labels.
- For example, when you identify an entity representing a person, always label it as **"Person"**.
- Avoid using more specific terms like "Mathematician" or "Scientist"; keep those as a "profession" property.
- Don't use overly generic terms like "Entity".
**Node IDs**: Never use integers as node IDs.
- Node IDs should be names or human-readable identifiers found in the text.

# 2. Handling Numerical Data and Dates
- For example, when you identify an entity representing a date, make sure it has type **"Date"**.
- Extract the date in the format "YYYY-MM-DD".
- If it is not possible to extract the whole date, extract the month or year, or both if available.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use snake_case for relationship names, e.g., `acted_in`.

# 3. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the Person's ID.
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.

# 4. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.
@@ -0,0 +1,9 @@
<TEXT>
`{{ text }}`
</TEXT>

and

<KNOWLEDGEGRAPH>
`{{ graph }}`
</KNOWLEDGEGRAPH>
@@ -0,0 +1,15 @@
You are an assistant who *merges duplicate entities and their types* in a knowledge graph.

You will receive the list of entities extracted from a text.
Some of these refer to the same real-world entity but differ only in casing, minor typos, or partial information (for example, `"John Doe"` vs `"john_doe"` vs `"John_Doe"`).
There can also be synonyms present in the list.
Entities are duplicates only if they represent the same concept or object, or if they are synonyms of each other.

**Task**
– Detect duplicates.
– Deduplicate them, creating the final list of entities in which there are no duplicates anymore.
- Merge type information among the entities. Duplicated entity types are not allowed.
- Each type must be singular (for example, "skill" instead of "skills"). Also merge synonyms in the case of types.
- Map synonymous entity types to the most general type, so that multiple formats of the same type are reduced in the global knowledge graph.
- Filter out entities that represent more than one real-world concept (for example: "car, motorbike").
- Return the final list of nodes.
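The casing and underscore variants this prompt describes ("John Doe" vs "john_doe" vs "John_Doe") can also be collapsed deterministically before spending an LLM call on the merge. A minimal sketch, assuming entities arrive as plain strings (the real pipeline passes full node objects):

```python
def collapse_trivial_duplicates(names):
    """Collapse entries that differ only in casing or underscores, keeping the first spelling seen."""
    seen = {}
    for name in names:
        key = name.lower().replace("_", " ")
        seen.setdefault(key, name)
    return list(seen.values())
```

A pre-pass like this shrinks the entity list handed to the merge prompt; the LLM then only has to resolve the harder cases (synonyms, partial names) that string normalization cannot catch.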
@@ -0,0 +1,4 @@

<ENTITIES>
`{{nodes_to_deduplicate}}`
</ENTITIES>
26  cognee/infrastructure/llm/prompts/node_extraction_prompt.txt  Normal file
@@ -0,0 +1,26 @@
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
**Nodes** represent entities and concepts. They're akin to Wikipedia nodes.

The aim is to achieve simplicity and clarity in the knowledge graph.

# 1. Labeling Nodes
**Consistency**: Ensure you use basic or elementary types for node labels.
- For example, when you identify an entity representing a person, always label it as **"Person"**.
- Avoid using more specific terms like "Mathematician" or "Scientist"; keep those as a "profession" property.
- Don't use overly generic terms like "Entity".
**Node IDs**: Never use integers as node IDs.
- Node IDs should be names or human-readable identifiers found in the text.

# 2. Handling Numerical Data and Dates
- For example, when you identify an entity representing a date, make sure it has type **"Date"**.
- Allowed formats are "YYYY", "YYYY-MM", or "YYYY-MM-DD". Extract each date in the format in which it is represented in the text, and extract each date only once.
- If a date in the text represents a period, extract the start and end dates of the period separately.
- If it is not possible to extract the whole date, extract the month or year, or both if available.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use snake_case for relationship names, e.g., `acted_in`.

# 3. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the Person's ID.
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.

# 4. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.
@@ -0,0 +1,9 @@
You are an expert in entity extraction and knowledge graph building focusing on node identification.
Your task is to perform a detailed entity and concept extraction from the text to generate a list of potential nodes for a knowledge graph.
• Node IDs should be names or human-readable identifiers found in the text.
• Extract clear, distinct entities and concepts as individual strings.
• Be exhaustive; ensure completeness by capturing all entities, names, nouns, noun parts, and implied or implicit mentions.
• Also extract potential entity type nodes, whether directly mentioned or implied.
• Avoid duplicates and overly generic terms.
• Consider different perspectives and indirect references.
• Return only a list of unique node strings with all the entities.
@@ -0,0 +1,10 @@
Extract distinct entities and concepts from the following text to expand the knowledge graph.
Build upon previously extracted entities, ensuring completeness and consistency.
Return all the previously extracted entities **together** with the new ones that you extracted.
This is round {{ round_number }} of {{ total_rounds }}.

**Text:**
{{ text }}

**Previously Extracted Entities:**
{{ nodes }}
@@ -1,9 +1,17 @@
 import os
-from typing import Type
+import asyncio
+import json
+from fileinput import filename
+from typing import Type, List, Tuple, Dict, Any, Set
 
+from langchain_experimental.graph_transformers.llm import system_prompt
 from pydantic import BaseModel
+from streamlit import context
 
 from cognee.infrastructure.llm.get_llm_client import get_llm_client
 from cognee.infrastructure.llm.prompts import render_prompt
 from cognee.infrastructure.llm.config import get_llm_config
+from cognee.shared.data_models import KnowledgeGraph, NodeList, EdgeList, Node, Edge
 
 
 async def extract_content_graph(content: str, response_model: Type[BaseModel]):

@@ -21,10 +29,124 @@ async def extract_content_graph(content: str, response_model: Type[BaseModel]):
     else:
         base_directory = None
 
-    system_prompt = render_prompt(prompt_path, {}, base_directory=base_directory)
+    system_prompt_graph = render_prompt(prompt_path, {}, base_directory=base_directory)
 
     content_graph = await llm_client.acreate_structured_output(
-        content, system_prompt, response_model
+        content, system_prompt_graph, response_model
     )
 
     return content_graph
+
+
+def dedupe_and_normalize_nodes(nodes: List[Node]) -> List[Node]:
+    seen: Set[Tuple[str, str]] = set()
+    out: List[Node] = []
+
+    for node in nodes:
+        # Normalize casing and underscores before exact-match deduplication.
+        node.name = node.name.lower().replace("_", " ")
+        node.type = node.type.lower().replace("_", " ")
+
+        key = (node.name, node.type)
+        if key not in seen:
+            seen.add(key)
+            out.append(node)
+
+    return out
+
+
+def dedupe_and_normalize_edges(edges: List[Edge]) -> List[Edge]:
+    seen: Set[Tuple[str, str, str]] = set()
+    out: List[Edge] = []
+
+    for edge in edges:
+        edge.relationship_name = edge.relationship_name.lower()
+
+        key = (edge.source_node_id, edge.relationship_name, edge.target_node_id)
+        if key not in seen:
+            seen.add(key)
+            out.append(edge)
+
+    return out
+
+
+async def extract_content_graph2(
+    content: str, response_model: Type[BaseModel], node_rounds: int = 1, edge_rounds: int = 1
+):
+    llm_client = get_llm_client()
+
+    ###### NODE EXTRACTION
+    node_prompt_path = "node_extraction_prompt.txt"
+    node_system = render_prompt(node_prompt_path, {})
+
+    node_tasks = [
+        llm_client.acreate_structured_output(content, node_system, NodeList)
+        for _ in range(node_rounds)
+    ]
+    node_results = await asyncio.gather(*node_tasks)
+
+    all_nodes: List[Node] = [node for nl in node_results for node in nl.nodes]
+
+    ###### NODE DEDUPLICATION
+    all_nodes = dedupe_and_normalize_nodes(all_nodes)
+
+    all_nodes_merged = {
+        "nodes_to_deduplicate": json.dumps([n.model_dump() for n in all_nodes], ensure_ascii=False)
+    }
+
+    merge_system_prompt = "merge_nodes_system_prompt.txt"
+    merge_user_prompt = "merge_nodes_user_prompt.txt"
+
+    merge_system = render_prompt(filename=merge_system_prompt, context={})
+    merge_user = render_prompt(filename=merge_user_prompt, context=all_nodes_merged)
+
+    final_nodes_list = await llm_client.acreate_structured_output(
+        text_input=merge_user, system_prompt=merge_system, response_model=NodeList
+    )
+
+    ###### EDGE EXTRACTION
+    edge_system_prompt = "edge_extraction_system_prompt.txt"
+    edge_user_prompt = "edge_extraction_user_prompt.txt"
+
+    edge_system = render_prompt(edge_system_prompt, {})
+    nodes_for_edge_extraction = {
+        "final_nodes": json.dumps(
+            [n.model_dump() for n in final_nodes_list.nodes], ensure_ascii=False
+        ),
+        "text": content,
+    }
+
+    edge_user = render_prompt(edge_user_prompt, context=nodes_for_edge_extraction)
+
+    edge_tasks = [
+        llm_client.acreate_structured_output(
+            text_input=edge_user, system_prompt=edge_system, response_model=EdgeList
+        )
+        for _ in range(edge_rounds)
+    ]
+    edge_results = await asyncio.gather(*edge_tasks)
+
+    all_edges: List[Edge] = [edge for el in edge_results for edge in el.edges]
+
+    ###### EDGE DEDUPLICATION
+    all_edges = dedupe_and_normalize_edges(all_edges)
+
+    all_edges_merged = {
+        "edges_to_deduplicate": json.dumps([e.model_dump() for e in all_edges], ensure_ascii=False)
+    }
+
+    merge_system_prompt = "merge_edges_system_prompt.txt"
+    merge_user_prompt = "merge_edges_user_prompt.txt"
+
+    merge_system = render_prompt(filename=merge_system_prompt, context={})
+    merge_user = render_prompt(filename=merge_user_prompt, context=all_edges_merged)
+
+    final_edges_list = await llm_client.acreate_structured_output(
+        text_input=merge_user, system_prompt=merge_system, response_model=EdgeList
+    )
+
+    return KnowledgeGraph(nodes=final_nodes_list.nodes, edges=final_edges_list.edges)
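Note that exact-match deduplication like `dedupe_and_normalize_nodes` above only merges nodes that normalize to the identical (name, type) pair; semantically equivalent but differently worded nodes are left for the subsequent LLM merge step. A standalone illustration, using a minimal dataclass stand-in for cognee's Pydantic `Node` model:

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    type: str


def dedupe_and_normalize_nodes(nodes):
    """Lowercase, replace underscores with spaces, then drop exact (name, type) repeats."""
    seen, out = set(), []
    for node in nodes:
        node.name = node.name.lower().replace("_", " ")
        node.type = node.type.lower().replace("_", " ")
        key = (node.name, node.type)
        if key not in seen:
            seen.add(key)
            out.append(node)
    return out


nodes = [Node("John_Doe", "Person"), Node("john doe", "person"), Node("Acme", "Company")]
deduped = dedupe_and_normalize_nodes(nodes)
# "John_Doe" and "john doe" collapse to one node; "Acme" survives separately.
```

Because the normalization mutates the nodes in place, the survivors carry the normalized spelling ("john doe", not "John_Doe") into the merge prompt.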
@@ -0,0 +1,46 @@
import json
from typing import Type

from pydantic import BaseModel

from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import render_prompt
from cognee.shared.data_models import KnowledgeGraph


async def extract_content_graph_sequential(
    content: str, response_model: Type[BaseModel], graph_extraction_rounds: int = 2
):
    llm_client = get_llm_client()

    graph_system_prompt_path = "generate_graph_prompt_sequential.txt"
    graph_user_prompt_path = "generate_graph_prompt_sequential_user.txt"
    graph_system = render_prompt(graph_system_prompt_path, {})

    current_nodes = []
    current_edges = []

    knowledge_graph = KnowledgeGraph(nodes=[], edges=[])

    for round_idx in range(graph_extraction_rounds):
        nodes_json = json.dumps([n.model_dump() for n in current_nodes], ensure_ascii=False)
        edges_json = json.dumps([e.model_dump() for e in current_edges], ensure_ascii=False)

        graph_user = render_prompt(
            graph_user_prompt_path,  # TODO: this could use some formatting due to HTML escape codes (e.g. &#34;).
            {
                "text": content,
                "graph": f"nodes: {nodes_json}, edges: {edges_json}",
            },
        )

        knowledge_graph = await llm_client.acreate_structured_output(
            text_input=graph_user,
            system_prompt=graph_system,
            response_model=response_model,
        )

        current_nodes = knowledge_graph.nodes
        current_edges = knowledge_graph.edges

    return knowledge_graph
@@ -0,0 +1,81 @@
import asyncio
import json
from typing import List, Tuple, Set

from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import render_prompt
from cognee.shared.data_models import KnowledgeGraph, NodeList, EdgeList, Node, Edge


def dedupe_and_normalize_nodes(nodes: List[Node]) -> List[Node]:
    seen: Set[Tuple[str, str]] = set()
    out: List[Node] = []

    for node in nodes:
        # Normalize casing and underscores before exact-match deduplication.
        node.name = node.name.lower().replace("_", " ")
        node.type = node.type.lower().replace("_", " ")

        key = (node.name, node.type)
        if key not in seen:
            seen.add(key)
            out.append(node)

    return out


async def extract_content_node_edge_multi_parallel(content: str, node_rounds: int = 1):
    llm_client = get_llm_client()

    ###### NODE EXTRACTION
    node_prompt_path = "node_extraction_prompt.txt"
    node_system = render_prompt(node_prompt_path, {})

    node_tasks = [
        llm_client.acreate_structured_output(content, node_system, NodeList)
        for _ in range(node_rounds)
    ]
    node_results = await asyncio.gather(*node_tasks)

    all_nodes: List[Node] = [node for nl in node_results for node in nl.nodes]

    ###### NODE DEDUPLICATION
    all_nodes = dedupe_and_normalize_nodes(all_nodes)

    all_nodes_merged = {
        "nodes_to_deduplicate": json.dumps([n.model_dump() for n in all_nodes], ensure_ascii=False)
    }

    merge_system_prompt = "merge_nodes_system_prompt.txt"
    merge_user_prompt = "merge_nodes_user_prompt.txt"

    merge_system = render_prompt(filename=merge_system_prompt, context={})
    merge_user = render_prompt(filename=merge_user_prompt, context=all_nodes_merged)

    final_nodes_list = await llm_client.acreate_structured_output(
        text_input=merge_user, system_prompt=merge_system, response_model=NodeList
    )

    ###### EDGE EXTRACTION
    edge_system_prompt = "edge_extraction_system_prompt.txt"
    edge_user_prompt = "edge_extraction_user_prompt.txt"

    edge_system = render_prompt(edge_system_prompt, {})
    nodes_for_edge_extraction = {
        "final_nodes": json.dumps(
            [n.model_dump() for n in final_nodes_list.nodes], ensure_ascii=False
        ),
        "text": content,
    }

    edge_user = render_prompt(edge_user_prompt, context=nodes_for_edge_extraction)

    final_edges_list = await llm_client.acreate_structured_output(
        text_input=edge_user, system_prompt=edge_system, response_model=EdgeList
    )

    return KnowledgeGraph(nodes=final_nodes_list.nodes, edges=final_edges_list.edges)
@@ -0,0 +1,57 @@
import json

from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import render_prompt
from cognee.shared.data_models import KnowledgeGraph, NodeList, EdgeList


async def extract_content_node_edge_multi_sequential(
    content: str, node_rounds: int = 2, edge_rounds: int = 2
):
    llm_client = get_llm_client()

    current_nodes = NodeList()

    for pass_idx in range(node_rounds):
        nodes_json = json.dumps([n.model_dump() for n in current_nodes.nodes], ensure_ascii=False)

        node_system = render_prompt("node_extraction_prompt_sequential.txt", {})
        node_user = render_prompt(
            "node_extraction_prompt_sequential_user.txt",
            {
                "text": content,
                "nodes": nodes_json,
                "total_rounds": node_rounds,
                "round_number": pass_idx,
            },
        )

        current_nodes = await llm_client.acreate_structured_output(node_user, node_system, NodeList)

    final_nodes = current_nodes
    final_nodes_json = json.dumps([n.model_dump() for n in final_nodes.nodes], ensure_ascii=False)

    current_edges = EdgeList()

    for pass_idx in range(edge_rounds):
        edges_json = json.dumps([e.model_dump() for e in current_edges.edges], ensure_ascii=False)

        edges_system = render_prompt("edge_extraction_prompt_sequential.txt", {})
        edges_user = render_prompt(
            "edge_extraction_prompt_sequential_user.txt",
            {
                "text": content,
                "nodes": final_nodes_json,
                "edges": edges_json,
                "total_rounds": edge_rounds,
                "round_number": pass_idx,
            },
        )

        current_edges = await llm_client.acreate_structured_output(
            edges_user, edges_system, EdgeList
        )

    final_edges = current_edges

    return KnowledgeGraph(nodes=final_nodes.nodes, edges=final_edges.edges)
@@ -46,9 +46,6 @@ else:
     name: str
     type: str
     description: str
-    properties: Optional[Dict[str, Any]] = Field(
-        None, description="A dictionary of properties associated with the node."
-    )

 class Edge(BaseModel):
     """Edge in a knowledge graph."""

@@ -56,9 +53,16 @@ else:
     source_node_id: str
     target_node_id: str
     relationship_name: str
-    properties: Optional[Dict[str, Any]] = Field(
-        None, description="A dictionary of properties associated with the edge."
-    )
+
+class NodeList(BaseModel):
+    """Nodes"""
+
+    nodes: List[Node] = Field(default_factory=list)
+
+
+class EdgeList(BaseModel):
+    """Edges"""
+
+    edges: List[Edge] = Field(default_factory=list)

 class KnowledgeGraph(BaseModel):
     """Knowledge graph."""
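The `NodeList` and `EdgeList` wrappers added in this hunk exist so the LLM's structured output can be validated as a bare list of nodes or edges, and so the sequential extraction loops can start from an empty `NodeList()`. A minimal standalone sketch of the same pattern in Pydantic v2 (model names mirror the diff; the field set is trimmed for illustration):

```python
from typing import List

from pydantic import BaseModel, Field


class Node(BaseModel):
    name: str
    type: str


class NodeList(BaseModel):
    """Nodes"""

    # default_factory=list lets NodeList() be constructed empty,
    # which the round-based extraction loops rely on.
    nodes: List[Node] = Field(default_factory=list)


empty = NodeList()
parsed = NodeList.model_validate({"nodes": [{"name": "john doe", "type": "person"}]})
```

Passing both `...` (required marker) and `default_factory` to `Field` is contradictory; with only `default_factory=list` the field is optional and defaults to an empty list.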
@@ -6,7 +6,22 @@ from pydantic import BaseModel
 from cognee.infrastructure.databases.graph import get_graph_engine
 from cognee.modules.ontology.rdf_xml.OntologyResolver import OntologyResolver
 from cognee.modules.chunking.models.DocumentChunk import DocumentChunk
-from cognee.modules.data.extraction.knowledge_graph import extract_content_graph
+from cognee.modules.data.extraction.knowledge_graph.extract_content_graph import (
+    extract_content_graph,
+)
+from cognee.modules.data.extraction.knowledge_graph.extract_content_node_edge_multi_parallel import (
+    extract_content_node_edge_multi_parallel,
+)
+
+from cognee.modules.data.extraction.knowledge_graph.extract_content_graph_sequential import (
+    extract_content_graph_sequential,
+)
+
+from cognee.modules.data.extraction.knowledge_graph.extract_content_node_edge_multi_sequential import (
+    extract_content_node_edge_multi_sequential,
+)
+
 from cognee.modules.graph.utils import (
     expand_with_nodes_and_edges,
     retrieve_existing_edges,

@@ -59,10 +74,17 @@ async def extract_graph_from_data(
     Extracts and integrates a knowledge graph from the text content of document chunks using a specified graph model.
     """
     chunk_graphs = await asyncio.gather(
-        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
+        # *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
+        # *[extract_content_node_edge_multi_parallel(content=chunk.text, node_rounds=2) for chunk in data_chunks]
+        # *[extract_content_graph_sequential(content=chunk.text, response_model=graph_model, graph_extraction_rounds=2) for chunk in data_chunks]
+        *[
+            extract_content_node_edge_multi_sequential(
+                content=chunk.text, node_rounds=1, edge_rounds=1
+            )
+            for chunk in data_chunks
+        ]
     )
 
-    # Note: Filter edges with missing source or target nodes
     if graph_model == KnowledgeGraph:
         for graph in chunk_graphs:
             valid_node_ids = {node.id for node in graph.nodes}

@@ -71,7 +93,6 @@ async def extract_graph_from_data(
             for edge in graph.edges
             if edge.source_node_id in valid_node_ids and edge.target_node_id in valid_node_ids
         ]
-
     return await integrate_chunk_graphs(
         data_chunks, chunk_graphs, graph_model, ontology_adapter or OntologyResolver()
     )
@@ -180,14 +180,9 @@ async def main(enable_steps):
 
     # Step 3: Create knowledge graph
     if enable_steps.get("cognify"):
-        pipeline_run = await cognee.cognify()
+        await cognee.cognify()
         print("Knowledge graph created.")
 
-    # Step 4: Calculate descriptive metrics
-    if enable_steps.get("graph_metrics"):
-        await get_pipeline_run_metrics(pipeline_run, include_optional=True)
-        print("Descriptive graph metrics saved to database.")
-
     # Step 5: Query insights
     if enable_steps.get("retriever"):
         search_results = await cognee.search(
@@ -62,13 +62,9 @@ async def main():
         os.path.dirname(os.path.abspath(__file__)), "ontology_input_example/basic_ontology.owl"
     )
 
-    pipeline_run = await cognee.cognify(ontology_file_path=ontology_path)
+    await cognee.cognify(ontology_file_path=ontology_path)
     print("Knowledge with ontology created.")
 
-    # Step 4: Calculate descriptive metrics
-    await get_pipeline_run_metrics(pipeline_run, include_optional=True)
-    print("Descriptive graph metrics saved to database.")
-
     # Step 5: Query insights
     search_results = await cognee.search(
         query_type=SearchType.GRAPH_COMPLETION,