diff --git a/README-zh.md b/README-zh.md
index d3403b35..ea2ad24b 100644
--- a/README-zh.md
+++ b/README-zh.md
@@ -275,7 +275,7 @@ if __name__ == "__main__":
 | **vector_db_storage_cls_kwargs** | `dict` | 向量数据库的附加参数,如设置节点和关系检索的阈值 | cosine_better_than_threshold: 0.2(默认值由环境变量COSINE_THRESHOLD更改) |
 | **enable_llm_cache** | `bool` | 如果为`TRUE`,将LLM结果存储在缓存中;重复的提示返回缓存的响应 | `TRUE` |
 | **enable_llm_cache_for_entity_extract** | `bool` | 如果为`TRUE`,将实体提取的LLM结果存储在缓存中;适合初学者调试应用程序 | `TRUE` |
-| **addon_params** | `dict` | 附加参数,例如`{"example_number": 1, "language": "Simplified Chinese", "entity_types": ["organization", "person", "geo", "event"]}`:设置示例限制、输出语言和文档处理的批量大小 | `example_number: 所有示例, language: English` |
+| **addon_params** | `dict` | 附加参数,例如`{"language": "Simplified Chinese", "entity_types": ["Organization", "Person", "Location", "Event"]}`:设置实体/关系提取的输出语言和实体类型 | `language: English` |
 | **embedding_cache_config** | `dict` | 问答缓存的配置。包含三个参数:`enabled`:布尔值,启用/禁用缓存查找功能。启用时,系统将在生成新答案之前检查缓存的响应。`similarity_threshold`:浮点值(0-1),相似度阈值。当新问题与缓存问题的相似度超过此阈值时,将直接返回缓存的答案而不调用LLM。`use_llm_check`:布尔值,启用/禁用LLM相似度验证。启用时,在返回缓存答案之前,将使用LLM作为二次检查来验证问题之间的相似度。 | 默认:`{"enabled": False, "similarity_threshold": 0.95, "use_llm_check": False}` |
diff --git a/README.md b/README.md
index 5ad37f01..e5a1625f 100644
--- a/README.md
+++ b/README.md
@@ -282,7 +282,7 @@ A full list of LightRAG init parameters:
 | **vector_db_storage_cls_kwargs** | `dict` | Additional parameters for vector database, like setting the threshold for nodes and relations retrieval | cosine_better_than_threshold: 0.2(default value changed by env var COSINE_THRESHOLD) |
 | **enable_llm_cache** | `bool` | If `TRUE`, stores LLM results in cache; repeated prompts return cached responses | `TRUE` |
 | **enable_llm_cache_for_entity_extract** | `bool` | If `TRUE`, stores LLM results in cache for entity extraction; Good for beginners to debug your application | `TRUE` |
-| **addon_params** | `dict` | Additional parameters, e.g., `{"example_number": 1, "language": "Simplified Chinese", "entity_types": ["organization", "person", "geo", "event"]}`: sets example limit, entiy/relation extraction output language | `example_number: all examples, language: English` |
+| **addon_params** | `dict` | Additional parameters, e.g., `{"language": "Simplified Chinese", "entity_types": ["Organization", "Person", "Location", "Event"]}`: sets the entity/relation extraction output language and entity types | `language: English` |
 | **embedding_cache_config** | `dict` | Configuration for question-answer caching. Contains three parameters: `enabled`: Boolean value to enable/disable cache lookup functionality. When enabled, the system will check cached responses before generating new answers. `similarity_threshold`: Float value (0-1), similarity threshold. When a new question's similarity with a cached question exceeds this threshold, the cached answer will be returned directly without calling the LLM. `use_llm_check`: Boolean value to enable/disable LLM similarity verification. When enabled, LLM will be used as a secondary check to verify the similarity between questions before returning cached answers. | Default: `{"enabled": False, "similarity_threshold": 0.95, "use_llm_check": False}` |
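The two table rows above map directly onto LightRAG constructor arguments. A minimal configuration sketch; the working directory is a placeholder and the LLM/embedding wiring must come from your own setup, only `addon_params` and `embedding_cache_config` are taken from the table:

```python
from lightrag import LightRAG

# Minimal sketch, assuming defaults for everything not shown.
# In practice you also pass llm_model_func / embedding_func here.
rag = LightRAG(
    working_dir="./rag_storage",  # hypothetical path
    addon_params={
        "language": "Simplified Chinese",  # output language for entity/relation extraction
        "entity_types": ["Organization", "Person", "Location", "Event"],
    },
    embedding_cache_config={
        "enabled": True,               # check the cache before calling the LLM
        "similarity_threshold": 0.95,  # reuse a cached answer above this similarity
        "use_llm_check": False,        # optional LLM-based second check
    },
)
```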
diff --git a/lightrag/constants.py b/lightrag/constants.py
index d78d869c..9accdc52 100644
--- a/lightrag/constants.py
+++ b/lightrag/constants.py
@@ -24,11 +24,14 @@ DEFAULT_SUMMARY_LENGTH_RECOMMENDED = 600
 DEFAULT_SUMMARY_CONTEXT_SIZE = 12000
 # Default entities to extract if ENTITY_TYPES is not specified in .env
 DEFAULT_ENTITY_TYPES = [
-    "organization",
-    "person",
-    "geo",
-    "event",
-    "category",
+    "Organization",
+    "Person",
+    "Equipment",
+    "Product",
+    "Technology",
+    "Location",
+    "Event",
+    "Category",
 ]

 # Separator for graph fields
diff --git a/lightrag/operate.py b/lightrag/operate.py
index b8dd8bb8..1a4b9266 100644
--- a/lightrag/operate.py
+++ b/lightrag/operate.py
@@ -314,8 +314,8 @@ async def _handle_single_entity_extraction(
     chunk_key: str,
     file_path: str = "unknown_source",
 ):
-    if len(record_attributes) < 4 or '"entity"' not in record_attributes[0]:
-        if len(record_attributes) > 1 and '"entity"' in record_attributes[0]:
+    if len(record_attributes) < 4 or "entity" not in record_attributes[0]:
+        if len(record_attributes) > 1 and "entity" in record_attributes[0]:
             logger.warning(
                 f"Entity extraction failed in {chunk_key}: expecting 4 fields but got {len(record_attributes)}"
             )
@@ -381,10 +381,10 @@ async def _handle_single_relationship_extraction(
     chunk_key: str,
     file_path: str = "unknown_source",
 ):
-    if len(record_attributes) < 6 or '"relationship"' not in record_attributes[0]:
-        if len(record_attributes) > 1 and '"relationship"' in record_attributes[0]:
+    if len(record_attributes) < 5 or "relationship" not in record_attributes[0]:
+        if len(record_attributes) > 1 and "relationship" in record_attributes[0]:
             logger.warning(
-                f"Relation extraction failed in {chunk_key}: expecting 6 fields but got {len(record_attributes)}"
+                f"Relation extraction failed in {chunk_key}: expecting 5 fields but got {len(record_attributes)}"
             )
             logger.warning(f"Relation extracted: {record_attributes[1]}")
         return None
@@ -416,15 +416,15 @@ async def _handle_single_relationship_extraction(
         )
         return None

-    # Process relationship description with same cleaning pipeline
-    edge_description = sanitize_and_normalize_extracted_text(record_attributes[3])
-
     # Process keywords with same cleaning pipeline
     edge_keywords = sanitize_and_normalize_extracted_text(
-        record_attributes[4], remove_inner_quotes=True
+        record_attributes[3], remove_inner_quotes=True
     )
     edge_keywords = edge_keywords.replace(",", ",")

+    # Process relationship description with same cleaning pipeline
+    edge_description = sanitize_and_normalize_extracted_text(record_attributes[4])
+
     edge_source_id = chunk_key
     weight = (
         float(record_attributes[-1].strip('"').strip("'"))
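For reference, the record layout these checks now enforce: a relationship record carries 5 fields, with the keywords at index 3 and the description at index 4, and the type tag matches with or without the literal double quotes. An illustrative sketch, not the library's internal parser; the delimiter value stands in for `PROMPTS["DEFAULT_TUPLE_DELIMITER"]`:

```python
TUPLE_DELIMITER = "<|>"  # stand-in for PROMPTS["DEFAULT_TUPLE_DELIMITER"]

# Invented record, mirroring the new 5-field order.
record = '("relationship"<|>Acme Corp<|>Rocket Skates<|>manufacturing,commerce<|>Acme Corp designs and sells Rocket Skates)'
attrs = record.strip("()").split(TUPLE_DELIMITER)

assert len(attrs) >= 5 and "relationship" in attrs[0]  # relaxed check: quotes no longer required
source_entity, target_entity = attrs[1], attrs[2]
relationship_keywords = attrs[3].replace(",", ",")  # normalize full-width commas, as operate.py does
relationship_description = attrs[4]
```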
@@ -1686,13 +1686,8 @@ async def extract_entities(
     entity_types = global_config["addon_params"].get(
         "entity_types", DEFAULT_ENTITY_TYPES
     )
-    example_number = global_config["addon_params"].get("example_number", None)
-    if example_number and example_number < len(PROMPTS["entity_extraction_examples"]):
-        examples = "\n".join(
-            PROMPTS["entity_extraction_examples"][: int(example_number)]
-        )
-    else:
-        examples = "\n".join(PROMPTS["entity_extraction_examples"])
+
+    examples = "\n".join(PROMPTS["entity_extraction_examples"])

     example_context_base = dict(
         tuple_delimiter=PROMPTS["DEFAULT_TUPLE_DELIMITER"],
@@ -2137,13 +2132,8 @@ async def extract_keywords_only(
     )

     # 2. Build the examples
-    example_number = global_config["addon_params"].get("example_number", None)
-    if example_number and example_number < len(PROMPTS["keywords_extraction_examples"]):
-        examples = "\n".join(
-            PROMPTS["keywords_extraction_examples"][: int(example_number)]
-        )
-    else:
-        examples = "\n".join(PROMPTS["keywords_extraction_examples"])
+    examples = "\n".join(PROMPTS["keywords_extraction_examples"])
+
     language = global_config["addon_params"].get("language", DEFAULT_SUMMARY_LANGUAGE)

     # 3. Process conversation history
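The removed `example_number` branch and its replacement, side by side as a runnable sketch; `PROMPTS` stands in for `lightrag.prompt.PROMPTS` with invented entries:

```python
PROMPTS = {"entity_extraction_examples": ["Example 1 ...", "Example 2 ...", "Example 3 ..."]}
addon_params = {"example_number": 1}

# Old behavior: addon_params["example_number"] truncated the example list.
example_number = addon_params.get("example_number", None)
if example_number and example_number < len(PROMPTS["entity_extraction_examples"]):
    examples_old = "\n".join(PROMPTS["entity_extraction_examples"][: int(example_number)])
else:
    examples_old = "\n".join(PROMPTS["entity_extraction_examples"])

# New behavior: every example is always included in the prompt.
examples_new = "\n".join(PROMPTS["entity_extraction_examples"])
```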
diff --git a/lightrag/prompt.py b/lightrag/prompt.py
index f8ea6589..bd7451ee 100644
--- a/lightrag/prompt.py
+++ b/lightrag/prompt.py
@@ -15,27 +15,35 @@ Given a text document that is potentially relevant to this activity and a list o
 Use {language} as output language.

 ---Steps---
-1. Identify all entities. For each identified entity, extract the following information:
+1. Identify clearly defined entities in the text. For each identified entity, extract the following information:
 - entity_name: Name of the entity, use same language as input text. If English, capitalized the name
-- entity_type: One of the following types: [{entity_types}]
-- entity_description: Provide a comprehensive description of the entity's attributes and activities *based solely on the information present in the input text*. **Do not infer or hallucinate information not explicitly stated.** If the text provides insufficient information to create a comprehensive description, state "Description not available in text."
-Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
+- entity_type: One of the following types: [{entity_types}]. If the entity doesn't clearly fit any category, classify it as "Other".
+- entity_description: Provide a comprehensive description of the entity's attributes and activities based on the information present in the input text. Do not add external knowledge.

-2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
+2. Format each entity as:
+("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
+
+3. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are directly and clearly related based on the text. Unsubstantiated relationships must be excluded from the output.
 For each pair of related entities, extract the following information:
 - source_entity: name of the source entity, as identified in step 1
 - target_entity: name of the target entity, as identified in step 1
-- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
-- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
 - relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
-Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
+- relationship_description: Explain the nature of the relationship between the source and target entities, providing a clear rationale for their connection

-3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
-Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)
+4. Format each relationship as:
+("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_description>)

-4. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
+5. Use `{tuple_delimiter}` as the field delimiter, and use `{record_delimiter}` as the list delimiter.

-5. When finished, output {completion_delimiter}
+6. When finished, output `{completion_delimiter}`
+
+7. Return identified entities and relationships in {language}.
+
+---Quality Guidelines---
+- Only extract entities that are clearly defined and meaningful in the context
+- Avoid over-interpretation; stick to what is explicitly stated in the text
+- Include specific numerical data in entity name when relevant
+- Ensure entity names are consistent throughout the extraction

 ---Examples---
 {examples}
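Under the revised steps, a complete model response is a delimiter-joined record stream. A hypothetical example; the delimiter values mirror the library's defaults and the content is invented purely for illustration:

```python
tuple_d, record_d, done = "<|>", "##", "<|COMPLETE|>"  # default delimiter values

# Entity records first, then relationship records, then the completion marker.
output = record_d.join([
    f'("entity"{tuple_d}Acme Corp{tuple_d}Organization{tuple_d}A company that designs rocket-powered skates)',
    f'("entity"{tuple_d}Rocket Skates{tuple_d}Product{tuple_d}Skates designed and sold by Acme Corp)',
    f'("relationship"{tuple_d}Acme Corp{tuple_d}Rocket Skates{tuple_d}manufacturing{tuple_d}Acme Corp designs and sells Rocket Skates)',
]) + done
print(output)
```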
@@ -43,15 +51,18 @@ Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)
-- entity_type: One of the following types: [{entity_types}]
-- entity_description: Provide a comprehensive description of the entity's attributes and activities *based solely on the information present in the input text*. **Do not infer or hallucinate information not explicitly stated.** If the text provides insufficient information to create a comprehensive description, state "Description not available in text."
-Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
+- entity_type: One of the following types: [{entity_types}]. If the entity doesn't clearly fit any category, classify it as "Other".
+- entity_description: Provide a comprehensive description of the entity's attributes and activities based on the information present in the input text. Do not add external knowledge.

-2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
+2. Format each entity as:
+("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
+
+3. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are directly and clearly related based on the text. Unsubstantiated relationships must be excluded from the output.
 For each pair of related entities, extract the following information:
 - source_entity: name of the source entity, as identified in step 1
 - target_entity: name of the target entity, as identified in step 1
-- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
-- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
 - relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
-Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
+- relationship_description: Explain the nature of the relationship between the source and target entities, providing a clear rationale for their connection

-3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text.
-Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)
+4. Format each relationship as:
+("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_description>)

-4. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
+5. Use `{tuple_delimiter}` as the field delimiter, and use `{record_delimiter}` as the list delimiter.

-5. When finished, output {completion_delimiter}
+6. When finished, output `{completion_delimiter}`
+
+7. Return identified entities and relationships in {language}.

 ---Output---
-
-Add new entities and relations below using the same format, and do not include entities and relations that have been previously extracted.:\n
-""".strip()
+Output:
+"""

 PROMPTS["entity_if_loop_extraction"] = """
 ---Goal---
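The old ending of the continuation prompt explicitly told the model not to repeat previously extracted records; the terser `Output:` ending drops that instruction, so a caller may want to merge gleaning passes defensively. A hypothetical helper, not part of LightRAG's API:

```python
def merge_extraction_passes(first_pass: list[str], glean_pass: list[str]) -> list[str]:
    """Append records from a gleaning pass, dropping exact duplicates."""
    seen = set(first_pass)
    merged = list(first_pass)
    for record in glean_pass:
        if record not in seen:  # skip records the model re-emitted verbatim
            seen.add(record)
            merged.append(record)
    return merged

assert merge_extraction_passes(["a", "b"], ["b", "c"]) == ["a", "b", "c"]
```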