hzywhite 2025-09-16 14:32:36 +08:00
commit 680b7c5b89
20 changed files with 1707 additions and 494 deletions

View file

@ -33,6 +33,9 @@
<a href="README-zh.md"><img src="https://img.shields.io/badge/🇨🇳中文版-1a1a2e?style=for-the-badge"></a>
<a href="README.md"><img src="https://img.shields.io/badge/🇺🇸English-1a1a2e?style=for-the-badge"></a>
</p>
<p>
<a href="https://pepy.tech/projects/lightrag-hku"><img src="https://static.pepy.tech/personalized-badge/lightrag-hku?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads"></a>
</p>
</div>
</div>

View file

@ -33,6 +33,9 @@
<a href="README-zh.md"><img src="https://img.shields.io/badge/🇨🇳中文版-1a1a2e?style=for-the-badge"></a>
<a href="README.md"><img src="https://img.shields.io/badge/🇺🇸English-1a1a2e?style=for-the-badge"></a>
</p>
<p>
<a href="https://pepy.tech/projects/lightrag-hku"><img src="https://static.pepy.tech/personalized-badge/lightrag-hku?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads"></a>
</p>
</div>
</div>

View file

@ -65,7 +65,7 @@ ENABLE_LLM_CACHE=true
# COSINE_THRESHOLD=0.2
### Number of entities or relations retrieved from KG
# TOP_K=40
### Maxmium number or chunks for naive vactor search
### Maximum number of chunks for naive vector search
# CHUNK_TOP_K=20
### Control the actual entities sent to LLM
# MAX_ENTITY_TOKENS=6000
@ -125,7 +125,7 @@ ENABLE_LLM_CACHE_FOR_EXTRACT=true
SUMMARY_LANGUAGE=English
### Entity types that the LLM will attempt to recognize
# ENTITY_TYPES='["Organization", "Person", "Location", "Event", "Technology", "Equipment", "Product", "Document", "Category"]'
# ENTITY_TYPES='["Person", "Organization", "Location", "Event", "Concept", "Method", "Content", "Data", "Artifact", "NaturalObject"]'
### Chunk size for document splitting, 500~1500 is recommended
# CHUNK_SIZE=1200
@ -174,9 +174,17 @@ LLM_BINDING_API_KEY=your_api_key
# LLM_BINDING_API_KEY=your_api_key
# LLM_BINDING=openai
### OpenAI Specific Parameters
### Set the max_output_tokens to mitigate endless output of some LLM (less than LLM_TIMEOUT * llm_output_tokens/second, i.e. 9000 = 180s * 50 tokens/s)
### OpenAI-Compatible API Specific Parameters
### Set max_tokens to mitigate endless output from some LLMs (less than LLM_TIMEOUT * llm_output_tokens/second, e.g. 9000 = 180s * 50 tokens/s)
### Typically, max_tokens does not include prompt content, though some models, such as Gemini models, are exceptions
### For vLLM/SGLang deployed models, or most OpenAI-compatible API providers
# OPENAI_LLM_MAX_TOKENS=9000
### For OpenAI o1-mini or newer models
OPENAI_LLM_MAX_COMPLETION_TOKENS=9000
#### OpenAI's new API utilizes max_completion_tokens instead of max_tokens
# OPENAI_LLM_MAX_TOKENS=9000
# OPENAI_LLM_MAX_COMPLETION_TOKENS=9000
### OpenRouter Specific Parameters
# OPENAI_LLM_EXTRA_BODY='{"reasoning": {"enabled": false}}'
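A minimal sketch of how these settings might map onto a request made with the `openai` Python SDK; the client wiring and environment-variable names below mirror this file but are illustrative, not LightRAG's actual binding code.

```python
import os
from openai import OpenAI

# Hypothetical wiring: read the .env values above and apply them to one chat call.
client = OpenAI(
    base_url=os.getenv("LLM_BINDING_HOST"),        # assumed to point at the OpenAI-compatible endpoint
    api_key=os.getenv("LLM_BINDING_API_KEY"),
)

response = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "gpt-4o-mini"),   # placeholder model name
    messages=[{"role": "user", "content": "Extract entities from ..."}],
    max_tokens=int(os.getenv("OPENAI_LLM_MAX_TOKENS", "9000")),
    # max_completion_tokens=9000,                  # newer OpenAI models expect this field instead
    extra_body={"reasoning": {"enabled": False}},  # OPENAI_LLM_EXTRA_BODY passthrough (e.g. OpenRouter)
)
print(response.choices[0].message.content)
```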

View file

@ -1,5 +1,5 @@
from .lightrag import LightRAG as LightRAG, QueryParam as QueryParam
__version__ = "1.4.8"
__version__ = "1.4.8.1"
__author__ = "Zirui Guo"
__url__ = "https://github.com/HKUDS/LightRAG"

View file

@ -372,7 +372,20 @@ lightrag-server --llm-binding ollama --help
lightrag-server --embedding-binding ollama --help
```
> Please use the OpenAI-compatible method to access LLMs deployed by OpenRouter or vLLM. Additional parameters can be passed to OpenRouter or vLLM through the `OPENAI_LLM_EXTRA_BODY` environment variable to disable reasoning mode or apply other custom controls.
> Please use the OpenAI-compatible method to access LLMs deployed by OpenRouter, vLLM, or SGLang. Additional parameters can be passed to the OpenRouter, vLLM, or SGLang inference frameworks through the `OPENAI_LLM_EXTRA_BODY` environment variable to disable reasoning mode or apply other custom controls.
Setting the `max_tokens` parameter is intended to **prevent excessively long or endless looping LLM output during the entity-relationship extraction phase**. It truncates LLM output before a timeout occurs, thereby preventing document extraction failures. This addresses cases where certain text chunks containing many entities and relationships (for example, tables or citations) can cause the LLM to produce overly long or even endlessly looping output; it is especially important for locally deployed, small-parameter models. The `max_tokens` value can be calculated with the following formula:
```
# For vLLM/SGLang deployed models, or most OpenAI-compatible API providers
OPENAI_LLM_MAX_TOKENS=9000
# For Ollama deployed models
OLLAMA_LLM_NUM_PREDICT=9000
# For OpenAI o1-mini or newer models
OPENAI_LLM_MAX_COMPLETION_TOKENS=9000
```
### Entity Extraction Configuration

View file

@ -374,9 +374,23 @@ lightrag-server --llm-binding ollama --help
lightrag-server --embedding-binding ollama --help
```
> Please use OpenAI-compatible method to access LLMs deployed by OpenRouter or vLLM. You can pass additional parameters to OpenRouter or vLLM through the `OPENAI_LLM_EXTRA_BODY` environment variable to disable reasoning mode or achieve other personalized controls.
> Please use an OpenAI-compatible method to access LLMs deployed by OpenRouter, vLLM, or SGLang. You can pass additional parameters to OpenRouter, vLLM, or SGLang through the `OPENAI_LLM_EXTRA_BODY` environment variable to disable reasoning mode or apply other custom controls.
Set max_tokens to **prevent excessively long or endless looping output** in LLM responses during the entity-relationship extraction phase. The purpose of the max_tokens parameter is to truncate LLM output before a timeout occurs, thereby preventing document extraction failures. This addresses cases where certain text chunks containing numerous entities and relationships (e.g., tables or citations) can lead to overly long or even endlessly looping LLM output. This setting is particularly important for locally deployed, smaller-parameter models. The max_tokens value can be calculated with this formula: `LLM_TIMEOUT * llm_output_tokens/second` (e.g., `180s * 50 tokens/s = 9000`)
```
# For vLLM/SGLang deployed models, or most OpenAI-compatible API providers
OPENAI_LLM_MAX_TOKENS=9000
# For Ollama deployed models
OLLAMA_LLM_NUM_PREDICT=9000
# For OpenAI o1-mini or newer models
OPENAI_LLM_MAX_COMPLETION_TOKENS=9000
```
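As a quick illustration of the formula above, a sketch with an assumed (not measured) output throughput:

```python
# Hypothetical budget calculation: LLM_TIMEOUT * llm_output_tokens/second
llm_timeout_seconds = 180        # LLM_TIMEOUT
output_tokens_per_second = 50    # assumed throughput of the deployed model
max_tokens_budget = llm_timeout_seconds * output_tokens_per_second
print(max_tokens_budget)         # 9000 -> value for OPENAI_LLM_MAX_TOKENS or OLLAMA_LLM_NUM_PREDICT
```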
### Entity Extraction Configuration
* ENABLE_LLM_CACHE_FOR_EXTRACT: Enable LLM cache for entity extraction (default: true)
It's very common to set `ENABLE_LLM_CACHE_FOR_EXTRACT` to true for a test environment to reduce the cost of LLM calls.

View file

@ -1 +1 @@
__api_version__ = "0218"
__api_version__ = "0222"

View file

@ -134,6 +134,21 @@ class QueryResponse(BaseModel):
)
class QueryDataResponse(BaseModel):
entities: List[Dict[str, Any]] = Field(
description="Retrieved entities from knowledge graph"
)
relationships: List[Dict[str, Any]] = Field(
description="Retrieved relationships from knowledge graph"
)
chunks: List[Dict[str, Any]] = Field(
description="Retrieved text chunks from documents"
)
metadata: Dict[str, Any] = Field(
description="Query metadata including mode, keywords, and processing information"
)
def create_query_routes(rag, api_key: Optional[str] = None, top_k: int = 60):
combined_auth = get_combined_auth_dependency(api_key)
@ -221,4 +236,71 @@ def create_query_routes(rag, api_key: Optional[str] = None, top_k: int = 60):
trace_exception(e)
raise HTTPException(status_code=500, detail=str(e))
@router.post(
"/query/data",
response_model=QueryDataResponse,
dependencies=[Depends(combined_auth)],
)
async def query_data(request: QueryRequest):
"""
Retrieve structured data without LLM generation.
This endpoint returns raw retrieval results including entities, relationships,
and text chunks that would be used for RAG, but without generating a final response.
All parameters are compatible with the regular /query endpoint.
Parameters:
request (QueryRequest): The request object containing the query parameters.
Returns:
QueryDataResponse: A Pydantic model containing structured data with entities,
relationships, chunks, and metadata.
Raises:
HTTPException: Raised when an error occurs during the request handling process,
with status code 500 and detail containing the exception message.
"""
try:
param = request.to_query_params(False) # No streaming for data endpoint
response = await rag.aquery_data(request.query, param=param)
# The aquery_data method returns a dict with entities, relationships, chunks, and metadata
if isinstance(response, dict):
# Ensure all required fields exist and are lists/dicts
entities = response.get("entities", [])
relationships = response.get("relationships", [])
chunks = response.get("chunks", [])
metadata = response.get("metadata", {})
# Validate data types
if not isinstance(entities, list):
entities = []
if not isinstance(relationships, list):
relationships = []
if not isinstance(chunks, list):
chunks = []
if not isinstance(metadata, dict):
metadata = {}
return QueryDataResponse(
entities=entities,
relationships=relationships,
chunks=chunks,
metadata=metadata,
)
else:
# Fallback for unexpected response format
return QueryDataResponse(
entities=[],
relationships=[],
chunks=[],
metadata={
"error": "Unexpected response format",
"raw_response": str(response),
},
)
except Exception as e:
trace_exception(e)
raise HTTPException(status_code=500, detail=str(e))
return router
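For reference, a hedged sketch of calling the new endpoint from Python; the host, port, and API key are placeholders, and the response fields are those declared by `QueryDataResponse` above.

```python
import requests

# Hypothetical client for the /query/data endpoint added in this commit.
resp = requests.post(
    "http://localhost:9621/query/data",                      # assumed server address
    headers={"X-API-Key": "your-api-key"},                   # only needed if an API key is configured
    json={"query": "who authored LightRAG", "mode": "mix"},
    timeout=60,
)
resp.raise_for_status()
data = resp.json()
print(len(data["entities"]), len(data["relationships"]), len(data["chunks"]))
print(data["metadata"].get("query_mode"))
```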

View file

@ -24,15 +24,16 @@ DEFAULT_SUMMARY_LENGTH_RECOMMENDED = 600
DEFAULT_SUMMARY_CONTEXT_SIZE = 12000
# Default entities to extract if ENTITY_TYPES is not specified in .env
DEFAULT_ENTITY_TYPES = [
"Organization",
"Person",
"Organization",
"Location",
"Event",
"Technology",
"Equipment",
"Product",
"Document",
"Category",
"Concept",
"Method",
"Content",
"Data",
"Artifact",
"NaturalObject",
]
# Separator for graph fields
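A minimal sketch of how the `ENTITY_TYPES` variable from `env.example` could override this default list; the parsing shown is an assumption for illustration, not necessarily the loader LightRAG actually uses.

```python
import json
import os

DEFAULT_ENTITY_TYPES = [
    "Person", "Organization", "Location", "Event", "Concept",
    "Method", "Content", "Data", "Artifact", "NaturalObject",
]

def resolve_entity_types() -> list[str]:
    """Return ENTITY_TYPES from the environment (a JSON list) or fall back to the default."""
    raw = os.getenv("ENTITY_TYPES")
    if not raw:
        return DEFAULT_ENTITY_TYPES
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return DEFAULT_ENTITY_TYPES
    return parsed if isinstance(parsed, list) and parsed else DEFAULT_ENTITY_TYPES
```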

View file

@ -67,9 +67,21 @@ class QdrantVectorDBStorage(BaseVectorStorage):
def create_collection_if_not_exist(
client: QdrantClient, collection_name: str, **kwargs
):
if client.collection_exists(collection_name):
return
client.create_collection(collection_name, **kwargs)
exists = False
if hasattr(client, "collection_exists"):
try:
exists = client.collection_exists(collection_name)
except Exception:
exists = False
else:
try:
client.get_collection(collection_name)
exists = True
except Exception:
exists = False
if not exists:
client.create_collection(collection_name, **kwargs)
def __post_init__(self):
# Check for QDRANT_WORKSPACE environment variable first (higher priority)
@ -464,7 +476,20 @@ class QdrantVectorDBStorage(BaseVectorStorage):
async with get_storage_lock():
try:
# Delete the collection and recreate it
if self._client.collection_exists(self.final_namespace):
exists = False
if hasattr(self._client, "collection_exists"):
try:
exists = self._client.collection_exists(self.final_namespace)
except Exception:
exists = False
else:
try:
self._client.get_collection(self.final_namespace)
exists = True
except Exception:
exists = False
if exists:
self._client.delete_collection(self.final_namespace)
# Recreate the collection
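The version-tolerant existence check now appears twice in this file; a small helper along these lines could consolidate it. This is a sketch against the public `qdrant_client` API, not code from this commit.

```python
from qdrant_client import QdrantClient

def qdrant_collection_exists(client: QdrantClient, collection_name: str) -> bool:
    """Return True if the collection exists, tolerating older qdrant-client versions."""
    if hasattr(client, "collection_exists"):
        try:
            return client.collection_exists(collection_name)
        except Exception:
            return False
    try:
        client.get_collection(collection_name)  # raises if the collection is missing
        return True
    except Exception:
        return False
```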

View file

@ -2263,6 +2263,78 @@ class LightRAG:
await self._query_done()
return response
async def aquery_data(
self,
query: str,
param: QueryParam = QueryParam(),
) -> dict[str, Any]:
"""
Asynchronous data retrieval API: returns structured retrieval results without LLM generation.
This function reuses the same logic as aquery but stops before LLM generation,
returning the final processed entities, relationships, and chunks data that would be sent to LLM.
Args:
query: Query text.
param: Query parameters (same as aquery).
Returns:
dict[str, Any]: Structured data result with entities, relationships, chunks, and metadata
"""
global_config = asdict(self)
if param.mode in ["local", "global", "hybrid", "mix"]:
logger.debug(f"[aquery_data] Using kg_query for mode: {param.mode}")
final_data = await kg_query(
query.strip(),
self.chunk_entity_relation_graph,
self.entities_vdb,
self.relationships_vdb,
self.text_chunks,
param,
global_config,
hashing_kv=self.llm_response_cache,
system_prompt=None,
chunks_vdb=self.chunks_vdb,
return_raw_data=True, # Get final processed data
)
elif param.mode == "naive":
logger.debug(f"[aquery_data] Using naive_query for mode: {param.mode}")
final_data = await naive_query(
query.strip(),
self.chunks_vdb,
param,
global_config,
hashing_kv=self.llm_response_cache,
system_prompt=None,
return_raw_data=True, # Get final processed data
)
elif param.mode == "bypass":
logger.debug("[aquery_data] Using bypass mode")
# bypass mode returns empty data
final_data = {
"entities": [],
"relationships": [],
"chunks": [],
"metadata": {
"query_mode": "bypass",
"keywords": {"high_level": [], "low_level": []},
},
}
else:
raise ValueError(f"Unknown mode {param.mode}")
# Log final result counts
entities_count = len(final_data.get("entities", []))
relationships_count = len(final_data.get("relationships", []))
chunks_count = len(final_data.get("chunks", []))
logger.debug(
f"[aquery_data] Final result: {entities_count} entities, {relationships_count} relationships, {chunks_count} chunks"
)
await self._query_done()
return final_data
async def _query_done(self):
await self.llm_response_cache.index_done_callback()
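A hedged usage sketch for the new method, assuming a `LightRAG` instance has already been created and initialized elsewhere:

```python
import asyncio

from lightrag import LightRAG, QueryParam

async def demo(rag: LightRAG) -> None:
    # Retrieve the structured context without asking the LLM for a final answer.
    data = await rag.aquery_data(
        "Who authored LightRAG?",
        param=QueryParam(mode="mix"),
    )
    print(len(data["entities"]), "entities")
    print(len(data["relationships"]), "relationships")
    print(len(data["chunks"]), "chunks")
    print(data["metadata"].get("query_mode"))

# asyncio.run(demo(rag))  # 'rag' must be an initialized LightRAG instance
```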

File diff suppressed because it is too large.

View file

@ -4,8 +4,8 @@ from typing import Any
PROMPTS: dict[str, Any] = {}
PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|>"
PROMPTS["DEFAULT_RECORD_DELIMITER"] = "##"
# All delimiters must be formatted as "<|UPPER_CASE_STRING|>"
PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|#|>"
PROMPTS["DEFAULT_COMPLETION_DELIMITER"] = "<|COMPLETE|>"
PROMPTS["DEFAULT_USER_PROMPT"] = "n/a"
@ -16,22 +16,49 @@ You are a Knowledge Graph Specialist responsible for extracting entities and rel
---Instructions---
1. **Entity Extraction:** Identify clearly defined and meaningful entities in the input text, and extract the following information:
- entity_name: Name of the entity, ensure entity names are consistent throughout the extraction.
- entity_type: Categorize the entity using the following entity types: {entity_types}; if none of the provided types are suitable, classify it as `Other`.
- entity_description: Provide a comprehensive description of the entity's attributes and activities based on the information present in the input text.
2. **Entity Output Format:** (entity{tuple_delimiter}entity_name{tuple_delimiter}entity_type{tuple_delimiter}entity_description)
3. **Relationship Extraction:** Identify direct, clearly-stated and meaningful relationships between extracted entities within the input text, and extract the following information:
- source_entity: name of the source entity.
- target_entity: name of the target entity.
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details.
- relationship_description: Explain the nature of the relationship between the source and target entities, providing a clear rationale for their connection.
4. **Relationship Output Format:** (relationship{tuple_delimiter}source_entity{tuple_delimiter}target_entity{tuple_delimiter}relationship_keywords{tuple_delimiter}relationship_description)
5. **Relationship Order:** Prioritize relationships based on their significance to the intended meaning of input text, and output more crucial relationships first.
6. **Avoid Pronouns:** For entity names and all descriptions, explicitly name the subject or object instead of using pronouns; avoid pronouns such as `this document`, `our company`, `I`, `you`, and `he/she`.
7. **Undirectional Relationship:** Treat relationships as undirected; swapping the source and target entities does not constitute a new relationship. Avoid outputting duplicate relationships.
8. **Language:** Output entity names, keywords and descriptions in {language}.
9. **Delimiter:** Use `{record_delimiter}` as the entity or relationship list delimiter; output `{completion_delimiter}` when all the entities and relationships are extracted.
1. **Entity Extraction & Output:**
* **Identification:** Identify clearly defined and meaningful entities in the input text.
* **Entity Details:** For each identified entity, extract the following information:
* `entity_name`: The name of the entity. If the entity name is case-insensitive, capitalize the first letter of each significant word (title case). Ensure **consistent naming** across the entire extraction process.
* `entity_type`: Categorize the entity using one of the following types: `{entity_types}`. If none of the provided entity types apply, do not add a new entity type; classify it as `Other`.
* `entity_description`: Provide a concise yet comprehensive description of the entity's attributes and activities, based *solely* on the information present in the input text.
* **Output Format - Entities:** Output a total of 4 fields for each entity, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `entity`.
* Format: `entity{tuple_delimiter}entity_name{tuple_delimiter}entity_type{tuple_delimiter}entity_description`
2. **Relationship Extraction & Output:**
* **Identification:** Identify direct, clearly stated, and meaningful relationships between previously extracted entities.
* **N-ary Relationship Decomposition:** If a single statement describes a relationship involving more than two entities (an N-ary relationship), decompose it into multiple binary (two-entity) relationship pairs for separate description.
* **Example:** For "Alice, Bob, and Carol collaborated on Project X," extract binary relationships such as "Alice collaborated with Project X," "Bob collaborated with Project X," and "Carol collaborated with Project X," or "Alice collaborated with Bob," based on the most reasonable binary interpretations.
* **Relationship Details:** For each binary relationship, extract the following fields:
* `source_entity`: The name of the source entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive.
* `target_entity`: The name of the target entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive.
* `relationship_keywords`: One or more high-level keywords summarizing the overarching nature, concepts, or themes of the relationship. Multiple keywords within this field must be separated by a comma `,`. **DO NOT use `{tuple_delimiter}` for separating multiple keywords within this field.**
* `relationship_description`: A concise explanation of the nature of the relationship between the source and target entities, providing a clear rationale for their connection.
* **Output Format - Relationships:** Output a total of 5 fields for each relationship, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `relation`.
* Format: `relation{tuple_delimiter}source_entity{tuple_delimiter}target_entity{tuple_delimiter}relationship_keywords{tuple_delimiter}relationship_description`
3. **Delimiter Usage Protocol:**
* The `{tuple_delimiter}` is a complete, atomic marker and **must not be filled with content**. It serves strictly as a field separator.
* **Incorrect Example:** `entity{tuple_delimiter}Tokyo<|location|>Tokyo is the capital of Japan.`
* **Correct Example:** `entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the capital of Japan.`
4. **Relationship Direction & Duplication:**
* Treat all relationships as **undirected** unless explicitly stated otherwise. Swapping the source and target entities for an undirected relationship does not constitute a new relationship.
* Avoid outputting duplicate relationships.
5. **Output Order & Prioritization:**
* Output all extracted entities first, followed by all extracted relationships.
* Within the list of relationships, prioritize and output those relationships that are **most significant** to the core meaning of the input text first.
6. **Context & Objectivity:**
* Ensure all entity names and descriptions are written in the **third person**.
* Explicitly name the subject or object; **avoid using pronouns** such as `this article`, `this paper`, `our company`, `I`, `you`, and `he/she`.
7. **Language & Proper Nouns:**
* The entire output (entity names, keywords, and descriptions) must be written in `{language}`.
* Proper nouns (e.g., personal names, place names, organization names) should be retained in their original language if a proper, widely accepted translation is not available or would cause ambiguity.
8. **Completion Signal:** Output the literal string `{completion_delimiter}` only after all entities and relationships, following all criteria, have been completely extracted and outputted.
---Examples---
@ -47,25 +74,31 @@ Text:
"""
PROMPTS["entity_extraction_user_prompt"] = """---Task---
Extract entities and relationships from the input text to be Processed.
Extract entities and relationships from the input text to be processed.
---Instructions---
1. Output entities and relationships, prioritized by their relevance to the input text's core meaning.
2. Output `{completion_delimiter}` when all the entities and relationships are extracted.
3. Ensure the output language is {language}.
1. **Strict Adherence to Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system prompt.
2. **Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list.
3. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant entities and relationships have been extracted and presented.
4. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated.
<Output>
"""
PROMPTS["entity_continue_extraction_user_prompt"] = """---Task---
Identify any missed entities or relationships from the input text to be Processed of last extraction task.
Based on the last extraction task, identify and extract any **missed or incorrectly formatted** entities and relationships from the input text.
---Instructions---
1. Output the entities and realtionships in the same format as previous extraction task.
2. Do not include entities and relations that have been correctly extracted in last extraction task.
3. If the entity or relation output is truncated or has missing fields in last extraction task, please re-output it in the correct format.
4. Output `{completion_delimiter}` when all the entities and relationships are extracted.
5. Ensure the output language is {language}.
1. **Strict Adherence to System Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system instructions.
2. **Focus on Corrections/Additions:**
* **Do NOT** re-output entities and relationships that were **correctly and fully** extracted in the last task.
* If an entity or relationship was **missed** in the last task, extract and output it now according to the system format.
* If an entity or relationship was **truncated, had missing fields, or was otherwise incorrectly formatted** in the last task, re-output the *corrected and complete* version in the specified format.
3. **Output Format - Entities:** Output a total of 4 fields for each entity, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `entity`.
4. **Output Format - Relationships:** Output a total of 5 fields for each relationship, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `relation`.
5. **Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list.
6. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant missing or corrected entities and relationships have been extracted and presented.
7. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated.
<Output>
"""
@ -83,24 +116,24 @@ It was a small transformation, barely perceptible, but one that Alex noted with
```
<Output>
(entity{tuple_delimiter}Alex{tuple_delimiter}person{tuple_delimiter}Alex is a character who experiences frustration and is observant of the dynamics among other characters.){record_delimiter}
(entity{tuple_delimiter}Taylor{tuple_delimiter}person{tuple_delimiter}Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective.){record_delimiter}
(entity{tuple_delimiter}Jordan{tuple_delimiter}person{tuple_delimiter}Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device.){record_delimiter}
(entity{tuple_delimiter}Cruz{tuple_delimiter}person{tuple_delimiter}Cruz is associated with a vision of control and order, influencing the dynamics among other characters.){record_delimiter}
(entity{tuple_delimiter}The Device{tuple_delimiter}equiment{tuple_delimiter}The Device is central to the story, with potential game-changing implications, and is revered by Taylor.){record_delimiter}
(relationship{tuple_delimiter}Alex{tuple_delimiter}Taylor{tuple_delimiter}power dynamics, observation{tuple_delimiter}Alex observes Taylor's authoritarian behavior and notes changes in Taylor's attitude toward the device.){record_delimiter}
(relationship{tuple_delimiter}Alex{tuple_delimiter}Jordan{tuple_delimiter}shared goals, rebellion{tuple_delimiter}Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision.){record_delimiter}
(relationship{tuple_delimiter}Taylor{tuple_delimiter}Jordan{tuple_delimiter}conflict resolution, mutual respect{tuple_delimiter}Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce.){record_delimiter}
(relationship{tuple_delimiter}Jordan{tuple_delimiter}Cruz{tuple_delimiter}ideological conflict, rebellion{tuple_delimiter}Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order.){record_delimiter}
(relationship{tuple_delimiter}Taylor{tuple_delimiter}The Device{tuple_delimiter}reverence, technological significance{tuple_delimiter}Taylor shows reverence towards the device, indicating its importance and potential impact.){record_delimiter}
entity{tuple_delimiter}Alex{tuple_delimiter}person{tuple_delimiter}Alex is a character who experiences frustration and is observant of the dynamics among other characters.
entity{tuple_delimiter}Taylor{tuple_delimiter}person{tuple_delimiter}Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective.
entity{tuple_delimiter}Jordan{tuple_delimiter}person{tuple_delimiter}Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device.
entity{tuple_delimiter}Cruz{tuple_delimiter}person{tuple_delimiter}Cruz is associated with a vision of control and order, influencing the dynamics among other characters.
entity{tuple_delimiter}The Device{tuple_delimiter}equipment{tuple_delimiter}The Device is central to the story, with potential game-changing implications, and is revered by Taylor.
relation{tuple_delimiter}Alex{tuple_delimiter}Taylor{tuple_delimiter}power dynamics, observation{tuple_delimiter}Alex observes Taylor's authoritarian behavior and notes changes in Taylor's attitude toward the device.
relation{tuple_delimiter}Alex{tuple_delimiter}Jordan{tuple_delimiter}shared goals, rebellion{tuple_delimiter}Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision.
relation{tuple_delimiter}Taylor{tuple_delimiter}Jordan{tuple_delimiter}conflict resolution, mutual respect{tuple_delimiter}Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce.
relation{tuple_delimiter}Jordan{tuple_delimiter}Cruz{tuple_delimiter}ideological conflict, rebellion{tuple_delimiter}Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order.
relation{tuple_delimiter}Taylor{tuple_delimiter}The Device{tuple_delimiter}reverence, technological significance{tuple_delimiter}Taylor shows reverence towards the device, indicating its importance and potential impact.
{completion_delimiter}
""",
"""<Input Text>
```
Stock markets faced a sharp downturn today as tech giants saw significant declines, with the Global Tech Index dropping by 3.4% in midday trading. Analysts attribute the selloff to investor concerns over rising interest rates and regulatory uncertainty.
Stock markets faced a sharp downturn today as tech giants saw significant declines, with the global tech index dropping by 3.4% in midday trading. Analysts attribute the selloff to investor concerns over rising interest rates and regulatory uncertainty.
Among the hardest hit, Nexon Technologies saw its stock plummet by 7.8% after reporting lower-than-expected quarterly earnings. In contrast, Omega Energy posted a modest 2.1% gain, driven by rising oil prices.
Among the hardest hit, nexon technologies saw its stock plummet by 7.8% after reporting lower-than-expected quarterly earnings. In contrast, Omega Energy posted a modest 2.1% gain, driven by rising oil prices.
Meanwhile, commodity markets reflected a mixed sentiment. Gold futures rose by 1.5%, reaching $2,080 per ounce, as investors sought safe-haven assets. Crude oil prices continued their rally, climbing to $87.60 per barrel, supported by supply constraints and strong demand.
@ -108,18 +141,18 @@ Financial experts are closely watching the Federal Reserve's next move, as specu
```
<Output>
(entity{tuple_delimiter}Global Tech Index{tuple_delimiter}category{tuple_delimiter}The Global Tech Index tracks the performance of major technology stocks and experienced a 3.4% decline today.){record_delimiter}
(entity{tuple_delimiter}Nexon Technologies{tuple_delimiter}organization{tuple_delimiter}Nexon Technologies is a tech company that saw its stock decline by 7.8% after disappointing earnings.){record_delimiter}
(entity{tuple_delimiter}Omega Energy{tuple_delimiter}organization{tuple_delimiter}Omega Energy is an energy company that gained 2.1% in stock value due to rising oil prices.){record_delimiter}
(entity{tuple_delimiter}Gold Futures{tuple_delimiter}product{tuple_delimiter}Gold futures rose by 1.5%, indicating increased investor interest in safe-haven assets.){record_delimiter}
(entity{tuple_delimiter}Crude Oil{tuple_delimiter}product{tuple_delimiter}Crude oil prices rose to $87.60 per barrel due to supply constraints and strong demand.){record_delimiter}
(entity{tuple_delimiter}Market Selloff{tuple_delimiter}category{tuple_delimiter}Market selloff refers to the significant decline in stock values due to investor concerns over interest rates and regulations.){record_delimiter}
(entity{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}category{tuple_delimiter}The Federal Reserve's upcoming policy announcement is expected to impact investor confidence and market stability.){record_delimiter}
(entity{tuple_delimiter}3.4% Decline{tuple_delimiter}category{tuple_delimiter}The Global Tech Index experienced a 3.4% decline in midday trading.){record_delimiter}
(relationship{tuple_delimiter}Global Tech Index{tuple_delimiter}Market Selloff{tuple_delimiter}market performance, investor sentiment{tuple_delimiter}The decline in the Global Tech Index is part of the broader market selloff driven by investor concerns.){record_delimiter}
(relationship{tuple_delimiter}Nexon Technologies{tuple_delimiter}Global Tech Index{tuple_delimiter}company impact, index movement{tuple_delimiter}Nexon Technologies' stock decline contributed to the overall drop in the Global Tech Index.){record_delimiter}
(relationship{tuple_delimiter}Gold Futures{tuple_delimiter}Market Selloff{tuple_delimiter}market reaction, safe-haven investment{tuple_delimiter}Gold prices rose as investors sought safe-haven assets during the market selloff.){record_delimiter}
(relationship{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}Market Selloff{tuple_delimiter}interest rate impact, financial regulation{tuple_delimiter}Speculation over Federal Reserve policy changes contributed to market volatility and investor selloff.){record_delimiter}
entity{tuple_delimiter}Global Tech Index{tuple_delimiter}category{tuple_delimiter}The Global Tech Index tracks the performance of major technology stocks and experienced a 3.4% decline today.
entity{tuple_delimiter}Nexon Technologies{tuple_delimiter}organization{tuple_delimiter}Nexon Technologies is a tech company that saw its stock decline by 7.8% after disappointing earnings.
entity{tuple_delimiter}Omega Energy{tuple_delimiter}organization{tuple_delimiter}Omega Energy is an energy company that gained 2.1% in stock value due to rising oil prices.
entity{tuple_delimiter}Gold Futures{tuple_delimiter}product{tuple_delimiter}Gold futures rose by 1.5%, indicating increased investor interest in safe-haven assets.
entity{tuple_delimiter}Crude Oil{tuple_delimiter}product{tuple_delimiter}Crude oil prices rose to $87.60 per barrel due to supply constraints and strong demand.
entity{tuple_delimiter}Market Selloff{tuple_delimiter}category{tuple_delimiter}Market selloff refers to the significant decline in stock values due to investor concerns over interest rates and regulations.
entity{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}category{tuple_delimiter}The Federal Reserve's upcoming policy announcement is expected to impact investor confidence and market stability.
entity{tuple_delimiter}3.4% Decline{tuple_delimiter}category{tuple_delimiter}The Global Tech Index experienced a 3.4% decline in midday trading.
relation{tuple_delimiter}Global Tech Index{tuple_delimiter}Market Selloff{tuple_delimiter}market performance, investor sentiment{tuple_delimiter}The decline in the Global Tech Index is part of the broader market selloff driven by investor concerns.
relation{tuple_delimiter}Nexon Technologies{tuple_delimiter}Global Tech Index{tuple_delimiter}company impact, index movement{tuple_delimiter}Nexon Technologies' stock decline contributed to the overall drop in the Global Tech Index.
relation{tuple_delimiter}Gold Futures{tuple_delimiter}Market Selloff{tuple_delimiter}market reaction, safe-haven investment{tuple_delimiter}Gold prices rose as investors sought safe-haven assets during the market selloff.
relation{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}Market Selloff{tuple_delimiter}interest rate impact, financial regulation{tuple_delimiter}Speculation over Federal Reserve policy changes contributed to market volatility and investor selloff.
{completion_delimiter}
""",
@ -129,39 +162,52 @@ At the World Athletics Championship in Tokyo, Noah Carter broke the 100m sprint
```
<Output>
(entity{tuple_delimiter}World Athletics Championship{tuple_delimiter}event{tuple_delimiter}The World Athletics Championship is a global sports competition featuring top athletes in track and field.){record_delimiter}
(entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the host city of the World Athletics Championship.){record_delimiter}
(entity{tuple_delimiter}Noah Carter{tuple_delimiter}person{tuple_delimiter}Noah Carter is a sprinter who set a new record in the 100m sprint at the World Athletics Championship.){record_delimiter}
(entity{tuple_delimiter}100m Sprint Record{tuple_delimiter}category{tuple_delimiter}The 100m sprint record is a benchmark in athletics, recently broken by Noah Carter.){record_delimiter}
(entity{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}equipment{tuple_delimiter}Carbon-fiber spikes are advanced sprinting shoes that provide enhanced speed and traction.){record_delimiter}
(entity{tuple_delimiter}World Athletics Federation{tuple_delimiter}organization{tuple_delimiter}The World Athletics Federation is the governing body overseeing the World Athletics Championship and record validations.){record_delimiter}
(relationship{tuple_delimiter}World Athletics Championship{tuple_delimiter}Tokyo{tuple_delimiter}event location, international competition{tuple_delimiter}The World Athletics Championship is being hosted in Tokyo.){record_delimiter}
(relationship{tuple_delimiter}Noah Carter{tuple_delimiter}100m Sprint Record{tuple_delimiter}athlete achievement, record-breaking{tuple_delimiter}Noah Carter set a new 100m sprint record at the championship.){record_delimiter}
(relationship{tuple_delimiter}Noah Carter{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}athletic equipment, performance boost{tuple_delimiter}Noah Carter used carbon-fiber spikes to enhance performance during the race.){record_delimiter}
(relationship{tuple_delimiter}Noah Carter{tuple_delimiter}World Athletics Championship{tuple_delimiter}athlete participation, competition{tuple_delimiter}Noah Carter is competing at the World Athletics Championship.){record_delimiter}
entity{tuple_delimiter}World Athletics Championship{tuple_delimiter}event{tuple_delimiter}The World Athletics Championship is a global sports competition featuring top athletes in track and field.
entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the host city of the World Athletics Championship.
entity{tuple_delimiter}Noah Carter{tuple_delimiter}person{tuple_delimiter}Noah Carter is a sprinter who set a new record in the 100m sprint at the World Athletics Championship.
entity{tuple_delimiter}100m Sprint Record{tuple_delimiter}category{tuple_delimiter}The 100m sprint record is a benchmark in athletics, recently broken by Noah Carter.
entity{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}equipment{tuple_delimiter}Carbon-fiber spikes are advanced sprinting shoes that provide enhanced speed and traction.
entity{tuple_delimiter}World Athletics Federation{tuple_delimiter}organization{tuple_delimiter}The World Athletics Federation is the governing body overseeing the World Athletics Championship and record validations.
relation{tuple_delimiter}World Athletics Championship{tuple_delimiter}Tokyo{tuple_delimiter}event location, international competition{tuple_delimiter}The World Athletics Championship is being hosted in Tokyo.
relation{tuple_delimiter}Noah Carter{tuple_delimiter}100m Sprint Record{tuple_delimiter}athlete achievement, record-breaking{tuple_delimiter}Noah Carter set a new 100m sprint record at the championship.
relation{tuple_delimiter}Noah Carter{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}athletic equipment, performance boost{tuple_delimiter}Noah Carter used carbon-fiber spikes to enhance performance during the race.
relation{tuple_delimiter}Noah Carter{tuple_delimiter}World Athletics Championship{tuple_delimiter}athlete participation, competition{tuple_delimiter}Noah Carter is competing at the World Athletics Championship.
{completion_delimiter}
""",
]
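The reworked examples drop the parenthesized `(entity ...)` wrapping and rely purely on the `<|#|>` field delimiter. A small sketch of how such a line could be split back into fields (illustrative only, not the parser LightRAG uses):

```python
TUPLE_DELIMITER = "<|#|>"

def parse_record(line: str) -> dict | None:
    """Split one extraction line into named fields based on its leading record type."""
    fields = line.strip().split(TUPLE_DELIMITER)
    if fields[0] == "entity" and len(fields) == 4:
        return {"type": "entity", "name": fields[1],
                "entity_type": fields[2], "description": fields[3]}
    if fields[0] == "relation" and len(fields) == 5:
        return {"type": "relation", "source": fields[1], "target": fields[2],
                "keywords": fields[3], "description": fields[4]}
    return None  # malformed line or completion marker

print(parse_record("entity<|#|>Tokyo<|#|>location<|#|>Tokyo is the capital of Japan."))
```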
PROMPTS["summarize_entity_descriptions"] = """---Role---
You are a Knowledge Graph Specialist responsible for data curation and synthesis.
You are a Knowledge Graph Specialist, proficient in data curation and synthesis.
---Task---
Your task is to synthesize a list of descriptions of a given entity or relation into a single, comprehensive, and cohesive summary.
---Instructions---
1. **Comprehensiveness:** The summary must integrate key information from all provided descriptions. Do not omit important facts.
2. **Context:** The summary must explicitly mention the name of the entity or relation for full context.
3. **Conflict:** In case of conflicting or inconsistent descriptions, determine if they originate from multiple, distinct entities or relationships that share the same name. If so, summarize each entity or relationship separately and then consolidate all summaries.
4. **Style:** The output must be written from an objective, third-person perspective.
5. **Length:** Maintain depth and completeness while ensuring the summary's length not exceed {summary_length} tokens.
6. **Language:** The entire output must be written in {language}.
1. Input Format: The description list is provided in JSON format. Each JSON object (representing a single description) appears on a new line within the `Description List` section.
2. Output Format: The merged description will be returned as plain text, presented in multiple paragraphs, without any additional formatting or extraneous comments before or after the summary.
3. Comprehensiveness: The summary must integrate all key information from *every* provided description. Do not omit any important facts or details.
4. Context & Objectivity:
- Write the summary from an objective, third-person perspective.
- Explicitly mention the full name of the entity or relation at the beginning of the summary to ensure immediate clarity and context.
5. Conflict Handling:
- In cases of conflicting or inconsistent descriptions, first determine if these conflicts arise from multiple, distinct entities or relationships that share the same name.
- If distinct entities/relations are identified, summarize each one *separately* within the overall output.
- If conflicts within a single entity/relation (e.g., historical discrepancies) exist, attempt to reconcile them or present both viewpoints with noted uncertainty.
6. Length Constraint: The summary's total length must not exceed {summary_length} tokens, while still maintaining depth and completeness.
7. Language:
- The entire output must be written in {language}.
- Proper nouns (e.g., personal names, place names, organization names) should be retained in their original language if a proper, widely accepted translation is not available or would cause ambiguity.
---Data---
---Input---
{description_type} Name: {description_name}
Description List:
```
{description_list}
```
---Output---
"""
@ -179,32 +225,30 @@ You are a helpful assistant responding to user query about Knowledge Graph and D
Generate a concise response based on the Knowledge Base and follow the Response Rules, considering both the current query and the conversation history if provided. Summarize all information in the provided Knowledge Base and incorporate general knowledge relevant to it. Do not include information not provided by the Knowledge Base.
---Conversation History---
{history}
---Knowledge Graph and Document Chunks---
{context_data}
---Response Guidelines---
**1. Content & Adherence:**
- Strictly adhere to the provided context from the Knowledge Base. Do not invent, assume, or include any information not present in the source data.
- If the answer cannot be found in the provided context, state that you do not have enough information to answer.
- Ensure the response maintains continuity with the conversation history.
1. **Content & Adherence:**
- Strictly adhere to the provided context from the Knowledge Base. Do not invent, assume, or include any information not present in the source data.
- If the answer cannot be found in the provided context, state that you do not have enough information to answer.
- Ensure the response maintains continuity with the conversation history.
**2. Formatting & Language:**
- Format the response using markdown with appropriate section headings.
- The response language must in the same language as the user's question.
- Target format and length: {response_type}
2. **Formatting & Language:**
- Format the response using markdown with appropriate section headings.
- The response must be in the same language as the user's question.
- Target format and length: {response_type}
**3. Citations / References:**
- At the end of the response, under a "References" section, each citation must clearly indicate its origin (KG or DC).
- The maximum number of citations is 5, including both KG and DC.
- Use the following formats for citations:
- For a Knowledge Graph Entity: `[KG] <entity_name>`
- For a Knowledge Graph Relationship: `[KG] <entity1_name> - <entity2_name>`
- For a Document Chunk: `[DC] <file_path_or_document_name>`
3. **Citations / References:**
- At the end of the response, under a "References" section, each citation must clearly indicate its origin (KG or DC).
- The maximum number of citations is 5, including both KG and DC.
- Use the following formats for citations:
- For a Knowledge Graph Entity: `[KG] <entity_name>`
- For a Knowledge Graph Relationship: `[KG] <entity1_name> ~ <entity2_name>`
- For a Document Chunk: `[DC] <file_path_or_document_name>`
---USER CONTEXT---
---User Context---
- Additional user prompt: {user_prompt}
---Response---
@ -220,7 +264,7 @@ Given a user query, your task is to extract two distinct types of keywords:
---Instructions & Constraints---
1. **Output Format**: Your output MUST be a valid JSON object and nothing else. Do not include any explanatory text, markdown code fences (like ```json), or any other text before or after the JSON. It will be parsed directly by a JSON parser.
2. **Source of Truth**: All keywords must be explicitly derived from the user query, with both high-level and low-level keyword categories required to contain content.
2. **Source of Truth**: All keywords must be explicitly derived from the user query, with both the high-level and low-level keyword categories required to contain content.
3. **Concise & Meaningful**: Keywords should be concise words or meaningful phrases. Prioritize multi-word phrases when they represent a single concept. For example, from "latest financial report of Apple Inc.", you should extract "latest financial report" and "Apple Inc." rather than "latest", "financial", "report", and "Apple".
4. **Handle Edge Cases**: For queries that are too simple, vague, or nonsensical (e.g., "hello", "ok", "asdfghjkl"), you must return a JSON object with empty lists for both keyword types.
@ -277,9 +321,6 @@ You are a helpful assistant responding to user query about Document Chunks provi
Generate a concise response based on the Document Chunks and follow the Response Rules, considering both the conversation history and the current query. Summarize all information in the provided Document Chunks and incorporate general knowledge relevant to them. Do not include information not provided by the Document Chunks.
---Conversation History---
{history}
---Document Chunks(DC)---
{content_data}

View file

@ -9,6 +9,7 @@ import logging
import logging.handlers
import os
import re
import time
import uuid
from dataclasses import dataclass
from datetime import datetime
@ -1035,8 +1036,12 @@ async def handle_cache(
prompt,
mode="default",
cache_type="unknown",
) -> str | None:
"""Generic cache handling function with flattened cache keys"""
) -> tuple[str, int] | None:
"""Generic cache handling function with flattened cache keys
Returns:
tuple[str, int] | None: (content, create_time) if cache hit, None if cache miss
"""
if hashing_kv is None:
return None
@ -1052,7 +1057,9 @@ async def handle_cache(
cache_entry = await hashing_kv.get_by_id(flattened_key)
if cache_entry:
logger.debug(f"Flattened cache hit(key:{flattened_key})")
return cache_entry["return"]
content = cache_entry["return"]
timestamp = cache_entry.get("create_time", 0)
return content, timestamp
logger.debug(f"Cache missed(mode:{mode} type:{cache_type})")
return None
@ -1144,68 +1151,6 @@ def exists_func(obj, func_name: str) -> bool:
return False
def get_conversation_turns(
conversation_history: list[dict[str, Any]], num_turns: int
) -> str:
"""
Process conversation history to get the specified number of complete turns.
Args:
conversation_history: List of conversation messages in chronological order
num_turns: Number of complete turns to include
Returns:
Formatted string of the conversation history
"""
# Check if num_turns is valid
if num_turns <= 0:
return ""
# Group messages into turns
turns: list[list[dict[str, Any]]] = []
messages: list[dict[str, Any]] = []
# First, filter out keyword extraction messages
for msg in conversation_history:
if msg["role"] == "assistant" and (
msg["content"].startswith('{ "high_level_keywords"')
or msg["content"].startswith("{'high_level_keywords'")
):
continue
messages.append(msg)
# Then process messages in chronological order
i = 0
while i < len(messages) - 1:
msg1 = messages[i]
msg2 = messages[i + 1]
# Check if we have a user-assistant or assistant-user pair
if (msg1["role"] == "user" and msg2["role"] == "assistant") or (
msg1["role"] == "assistant" and msg2["role"] == "user"
):
# Always put user message first in the turn
if msg1["role"] == "assistant":
turn = [msg2, msg1] # user, assistant
else:
turn = [msg1, msg2] # user, assistant
turns.append(turn)
i += 2
# Keep only the most recent num_turns
if len(turns) > num_turns:
turns = turns[-num_turns:]
# Format the turns into a string
formatted_turns: list[str] = []
for turn in turns:
formatted_turns.extend(
[f"user: {turn[0]['content']}", f"assistant: {turn[1]['content']}"]
)
return "\n".join(formatted_turns)
def always_get_an_event_loop() -> asyncio.AbstractEventLoop:
"""
Ensure that there is always an event loop available.
@ -1655,7 +1600,7 @@ async def use_llm_func_with_cache(
cache_type: str = "extract",
chunk_id: str | None = None,
cache_keys_collector: list = None,
) -> str:
) -> tuple[str, int]:
"""Call LLM function with cache support and text sanitization
If cache is available and enabled (determined by handle_cache based on mode),
@ -1675,7 +1620,9 @@ async def use_llm_func_with_cache(
cache_keys_collector: Optional list to collect cache keys for batch processing
Returns:
LLM response text
tuple[str, int]: (LLM response text, timestamp)
- For cache hits: (content, cache_create_time)
- For cache misses: (content, current_timestamp)
"""
# Sanitize input text to prevent UTF-8 encoding errors for all LLM providers
safe_user_prompt = sanitize_text_for_encoding(user_prompt)
@ -1710,14 +1657,15 @@ async def use_llm_func_with_cache(
# Generate cache key for this LLM call
cache_key = generate_cache_key("default", cache_type, arg_hash)
cached_return = await handle_cache(
cached_result = await handle_cache(
llm_response_cache,
arg_hash,
_prompt,
"default",
cache_type=cache_type,
)
if cached_return:
if cached_result:
content, timestamp = cached_result
logger.debug(f"Found cache for {arg_hash}")
statistic_data["llm_cache"] += 1
@ -1725,7 +1673,7 @@ async def use_llm_func_with_cache(
if cache_keys_collector is not None:
cache_keys_collector.append(cache_key)
return cached_return
return content, timestamp
statistic_data["llm_call"] += 1
# Call LLM with sanitized input
@ -1741,6 +1689,9 @@ async def use_llm_func_with_cache(
res = remove_think_tags(res)
# Generate timestamp for cache miss (LLM call completion time)
current_timestamp = int(time.time())
if llm_response_cache.global_config.get("enable_llm_cache_for_entity_extract"):
await save_to_cache(
llm_response_cache,
@ -1757,7 +1708,7 @@ async def use_llm_func_with_cache(
if cache_keys_collector is not None:
cache_keys_collector.append(cache_key)
return res
return res, current_timestamp
# When cache is disabled, directly call LLM with sanitized input
kwargs = {}
@ -1776,7 +1727,9 @@ async def use_llm_func_with_cache(
# Re-raise with the same exception type but modified message
raise type(e)(error_msg) from e
return remove_think_tags(res)
# Generate timestamp for non-cached LLM call
current_timestamp = int(time.time())
return remove_think_tags(res), current_timestamp
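Callers of `use_llm_func_with_cache` now receive a `(text, timestamp)` pair rather than a bare string. A hedged sketch of the required adjustment; the surrounding variable names (`prompt`, `llm_func`, `cache`) are placeholders.

```python
from typing import Any, Awaitable, Callable

async def extract_with_cache(
    prompt: str,
    llm_func: Callable[..., Awaitable[str]],
    cache: Any,
) -> str:
    """Call the shared helper and unpack its (content, timestamp) return value."""
    content, created_at = await use_llm_func_with_cache(
        prompt,
        llm_func,
        llm_response_cache=cache,
        cache_type="extract",
    )
    # created_at is the cache entry's create_time on a hit, or the LLM completion time on a miss.
    return content
```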
def get_content_summary(content: str, max_length: int = 250) -> str:
@ -2490,7 +2443,9 @@ async def process_chunks_unified(
unique_chunks = truncate_list_by_token_size(
unique_chunks,
key=lambda x: json.dumps(x, ensure_ascii=False),
key=lambda x: "\n".join(
json.dumps(item, ensure_ascii=False) for item in [x]
),
max_token_size=chunk_token_limit,
tokenizer=tokenizer,
)
@ -2604,6 +2559,118 @@ def get_pinyin_sort_key(text: str) -> str:
return text.lower()
def fix_tuple_delimiter_corruption(
record: str, delimiter_core: str, tuple_delimiter: str
) -> str:
"""
Fix various forms of tuple_delimiter corruption from LLM output.
This function handles missing or replaced characters around the core delimiter.
It fixes common corruption patterns where the LLM output doesn't match the expected
tuple_delimiter format.
Args:
record: The text record to fix
delimiter_core: The core delimiter (e.g., "#" from "<|#|>")
tuple_delimiter: The complete tuple delimiter (e.g., "<|#|>")
Returns:
The corrected record with proper tuple_delimiter format
"""
if not record or not delimiter_core or not tuple_delimiter:
return record
# Escape the delimiter core for regex use
escaped_delimiter_core = re.escape(delimiter_core)
# Fix: <|##|> -> <|#|>, <|#||#|> -> <|#|>, <|#|||#|> -> <|#|>
record = re.sub(
rf"<\|{escaped_delimiter_core}\|*?{escaped_delimiter_core}\|>",
tuple_delimiter,
record,
)
# Fix: <|\#|> -> <|#|>
record = re.sub(
rf"<\|\\{escaped_delimiter_core}\|>",
tuple_delimiter,
record,
)
# Fix: <|> -> <|#|>, <||> -> <|#|>
record = re.sub(
r"<\|+>",
tuple_delimiter,
record,
)
# Fix: <X|#|> -> <|#|>, <|#|Y> -> <|#|>, <X|#|Y> -> <|#|>, <||#||> -> <|#|>, <||#> -> <|#|> (one extra character outside the pipes)
record = re.sub(
rf"<.?\|{escaped_delimiter_core}\|*?>",
tuple_delimiter,
record,
)
# Fix: <#>, <#|>, <|#> -> <|#|> (missing one or both pipes)
record = re.sub(
rf"<\|?{escaped_delimiter_core}\|?>",
tuple_delimiter,
record,
)
# Fix: <X#|> -> <|#|>, <|#X> -> <|#|> (one pipe is replaced by other character)
record = re.sub(
rf"<[^|]{escaped_delimiter_core}\|>|<\|{escaped_delimiter_core}[^|]>",
tuple_delimiter,
record,
)
# Fix: <|#| -> <|#|>, <|#|| -> <|#|> (missing closing >)
record = re.sub(
rf"<\|{escaped_delimiter_core}\|+(?!>)",
tuple_delimiter,
record,
)
# Fix <|#: -> <|#|> (missing closing >)
record = re.sub(
rf"<\|{escaped_delimiter_core}:(?!>)",
tuple_delimiter,
record,
)
# Fix: <|| -> <|#|>
record = re.sub(
r"<\|\|(?!>)",
tuple_delimiter,
record,
)
# Fix: |#|> -> <|#|> (missing opening <)
record = re.sub(
rf"(?<!<)\|{escaped_delimiter_core}\|>",
tuple_delimiter,
record,
)
# Fix: <|#|>| -> <|#|> ( this is a fix for: <|#|| -> <|#|> )
record = re.sub(
rf"<\|{escaped_delimiter_core}\|>\|",
tuple_delimiter,
record,
)
# Fix: ||#|| -> <|#|> (double pipes on both sides without angle brackets)
record = re.sub(
rf"\|\|{escaped_delimiter_core}\|\|",
tuple_delimiter,
record,
)
return record
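A quick illustrative check of what the repair patterns above do to a corrupted line; the sample string is invented for demonstration.

```python
corrupted = "entity<|#>Tokyo<#>location<|>Tokyo is the capital of Japan."
fixed = fix_tuple_delimiter_corruption(corrupted, "#", "<|#|>")
print(fixed)
# entity<|#|>Tokyo<|#|>location<|#|>Tokyo is the capital of Japan.
```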
def create_prefixed_exception(original_exception: Exception, prefix: str) -> Exception:
"""
Safely create a prefixed exception that adapts to all error types.
@ -2644,3 +2711,120 @@ def create_prefixed_exception(original_exception: Exception, prefix: str) -> Exc
f"{prefix}: {type(original_exception).__name__}: {str(original_exception)} "
f"(Original exception could not be reconstructed: {construct_error})"
)
def _convert_to_user_format(
entities_context: list[dict],
relations_context: list[dict],
final_chunks: list[dict],
query_mode: str,
entity_id_to_original: dict = None,
relation_id_to_original: dict = None,
) -> dict[str, Any]:
"""Convert internal data format to user-friendly format using original database data"""
# Convert entities format using original data when available
formatted_entities = []
for entity in entities_context:
entity_name = entity.get("entity", "")
# Try to get original data first
original_entity = None
if entity_id_to_original and entity_name in entity_id_to_original:
original_entity = entity_id_to_original[entity_name]
if original_entity:
# Use original database data
formatted_entities.append(
{
"entity_name": original_entity.get("entity_name", entity_name),
"entity_type": original_entity.get("entity_type", "UNKNOWN"),
"description": original_entity.get("description", ""),
"source_id": original_entity.get("source_id", ""),
"file_path": original_entity.get("file_path", "unknown_source"),
"created_at": original_entity.get("created_at", ""),
}
)
else:
# Fallback to LLM context data (for backward compatibility)
formatted_entities.append(
{
"entity_name": entity_name,
"entity_type": entity.get("type", "UNKNOWN"),
"description": entity.get("description", ""),
"source_id": entity.get("source_id", ""),
"file_path": entity.get("file_path", "unknown_source"),
"created_at": entity.get("created_at", ""),
}
)
# Convert relationships format using original data when available
formatted_relationships = []
for relation in relations_context:
entity1 = relation.get("entity1", "")
entity2 = relation.get("entity2", "")
relation_key = (entity1, entity2)
# Try to get original data first
original_relation = None
if relation_id_to_original and relation_key in relation_id_to_original:
original_relation = relation_id_to_original[relation_key]
if original_relation:
# Use original database data
formatted_relationships.append(
{
"src_id": original_relation.get("src_id", entity1),
"tgt_id": original_relation.get("tgt_id", entity2),
"description": original_relation.get("description", ""),
"keywords": original_relation.get("keywords", ""),
"weight": original_relation.get("weight", 1.0),
"source_id": original_relation.get("source_id", ""),
"file_path": original_relation.get("file_path", "unknown_source"),
"created_at": original_relation.get("created_at", ""),
}
)
else:
# Fallback to LLM context data (for backward compatibility)
formatted_relationships.append(
{
"src_id": entity1,
"tgt_id": entity2,
"description": relation.get("description", ""),
"keywords": relation.get("keywords", ""),
"weight": relation.get("weight", 1.0),
"source_id": relation.get("source_id", ""),
"file_path": relation.get("file_path", "unknown_source"),
"created_at": relation.get("created_at", ""),
}
)
# Convert chunks format (chunks already contain complete data)
formatted_chunks = []
for chunk in final_chunks:
chunk_data = {
"content": chunk.get("content", ""),
"file_path": chunk.get("file_path", "unknown_source"),
"chunk_id": chunk.get("chunk_id", ""),
}
formatted_chunks.append(chunk_data)
logger.debug(
f"[_convert_to_user_format] Formatted {len(formatted_chunks)}/{len(final_chunks)} chunks"
)
# Build basic metadata (metadata details will be added by calling functions)
metadata = {
"query_mode": query_mode,
"keywords": {
"high_level": [],
"low_level": [],
}, # Placeholder, will be set by calling functions
}
return {
"entities": formatted_entities,
"relationships": formatted_relationships,
"chunks": formatted_chunks,
"metadata": metadata,
}
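# Rough shape of the returned payload (keys only; values shown are placeholders, not real data):
# {
#     "entities":      [{"entity_name", "entity_type", "description",
#                        "source_id", "file_path", "created_at"}],
#     "relationships": [{"src_id", "tgt_id", "description", "keywords",
#                        "weight", "source_id", "file_path", "created_at"}],
#     "chunks":        [{"content", "file_path", "chunk_id"}],
#     "metadata":      {"query_mode": ..., "keywords": {"high_level": [], "low_level": []}},
# }
# The keyword lists (and any processing_info) are filled in afterwards by the calling functions.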

View file

@@ -194,6 +194,12 @@
"technology": "العلوم",
"product": "منتج",
"document": "وثيقة",
"content": "محتوى",
"data": "بيانات",
"artifact": "قطعة أثرية",
"concept": "مفهوم",
"naturalobject": "كائن طبيعي",
"method": "عملية",
"other": "أخرى"
},
"sideBar": {

View file

@@ -194,6 +194,12 @@
"technology": "Technology",
"product": "Product",
"document": "Document",
"content": "Content",
"data": "Data",
"artifact": "Artifact",
"concept": "Concept",
"naturalobject": "Natural Object",
"method": "Method",
"other": "Other"
},
"sideBar": {

View file

@@ -194,6 +194,12 @@
"technology": "Technologie",
"product": "Produit",
"document": "Document",
"content": "Contenu",
"data": "Données",
"artifact": "Artefact",
"concept": "Concept",
"naturalobject": "Objet naturel",
"method": "Méthode",
"other": "Autre"
},
"sideBar": {

View file

@@ -194,6 +194,12 @@
"technology": "技术",
"product": "产品",
"document": "文档",
"content": "内容",
"data": "数据",
"artifact": "人工制品",
"concept": "概念",
"naturalobject": "自然对象",
"method": "方法",
"other": "其他"
},
"sideBar": {

View file

@@ -194,6 +194,12 @@
"technology": "技術",
"product": "產品",
"document": "文檔",
"content": "內容",
"data": "資料",
"artifact": "人工製品",
"concept": "概念",
"naturalobject": "自然物件",
"method": "方法",
"other": "其他"
},
"sideBar": {

View file

@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""
Test script: Demonstrates usage of aquery_data FastAPI endpoint
Query content: Who is the author of LightRAG
"""
import requests
import time
from typing import Dict, Any
# API configuration
API_KEY = "your-secure-api-key-here-123"
BASE_URL = "http://localhost:9621"
# Unified authentication headers
AUTH_HEADERS = {"Content-Type": "application/json", "X-API-Key": API_KEY}
def test_aquery_data_endpoint():
"""Test the /query/data endpoint"""
# Use unified configuration
endpoint = f"{BASE_URL}/query/data"
# Query request
query_request = {
"query": "who authored LighRAG",
"mode": "mix", # Use mixed mode to get the most comprehensive results
"top_k": 20,
"chunk_top_k": 15,
"max_entity_tokens": 4000,
"max_relation_tokens": 4000,
"max_total_tokens": 16000,
"enable_rerank": True,
}
print("=" * 60)
print("LightRAG aquery_data endpoint test")
print(
" Returns structured data including entities, relationships and text chunks"
)
print(" Can be used for custom processing and analysis")
print("=" * 60)
print(f"Query content: {query_request['query']}")
print(f"Query mode: {query_request['mode']}")
print(f"API endpoint: {endpoint}")
print("-" * 60)
try:
# Send request
print("Sending request...")
start_time = time.time()
response = requests.post(
endpoint, json=query_request, headers=AUTH_HEADERS, timeout=30
)
end_time = time.time()
response_time = end_time - start_time
print(f"Response time: {response_time:.2f} seconds")
print(f"HTTP status code: {response.status_code}")
if response.status_code == 200:
data = response.json()
print_query_results(data)
else:
print(f"Request failed: {response.status_code}")
print(f"Error message: {response.text}")
except requests.exceptions.ConnectionError:
print("❌ Connection failed: Please ensure LightRAG API service is running")
print(" Start command: python -m lightrag.api.lightrag_server")
except requests.exceptions.Timeout:
print("❌ Request timeout: Query processing took too long")
except Exception as e:
print(f"❌ Error occurred: {str(e)}")
def print_query_results(data: Dict[str, Any]):
"""Format and print query results"""
entities = data.get("entities", [])
relationships = data.get("relationships", [])
chunks = data.get("chunks", [])
metadata = data.get("metadata", {})
print("\n📊 Query result statistics:")
print(f" Entity count: {len(entities)}")
print(f" Relationship count: {len(relationships)}")
print(f" Text chunk count: {len(chunks)}")
# Print metadata
if metadata:
print("\n🔍 Query metadata:")
print(f" Query mode: {metadata.get('query_mode', 'unknown')}")
keywords = metadata.get("keywords", {})
if keywords:
high_level = keywords.get("high_level", [])
low_level = keywords.get("low_level", [])
if high_level:
print(f" High-level keywords: {', '.join(high_level)}")
if low_level:
print(f" Low-level keywords: {', '.join(low_level)}")
processing_info = metadata.get("processing_info", {})
if processing_info:
print(" Processing info:")
for key, value in processing_info.items():
print(f" {key}: {value}")
# Print entity information
if entities:
print("\n👥 Retrieved entities (first 5):")
for i, entity in enumerate(entities[:5]):
entity_name = entity.get("entity_name", "Unknown")
entity_type = entity.get("entity_type", "Unknown")
description = entity.get("description", "No description")
file_path = entity.get("file_path", "Unknown source")
print(f" {i+1}. {entity_name} ({entity_type})")
print(
f" Description: {description[:100]}{'...' if len(description) > 100 else ''}"
)
print(f" Source: {file_path}")
print()
# Print relationship information
if relationships:
print("🔗 Retrieved relationships (first 5):")
for i, rel in enumerate(relationships[:5]):
src = rel.get("src_id", "Unknown")
tgt = rel.get("tgt_id", "Unknown")
description = rel.get("description", "No description")
keywords = rel.get("keywords", "No keywords")
file_path = rel.get("file_path", "Unknown source")
print(f" {i+1}. {src}{tgt}")
print(f" Keywords: {keywords}")
print(
f" Description: {description[:100]}{'...' if len(description) > 100 else ''}"
)
print(f" Source: {file_path}")
print()
# Print text chunk information
if chunks:
print("📄 Retrieved text chunks (first 3):")
for i, chunk in enumerate(chunks[:3]):
content = chunk.get("content", "No content")
file_path = chunk.get("file_path", "Unknown source")
chunk_id = chunk.get("chunk_id", "Unknown ID")
print(f" {i+1}. Text chunk ID: {chunk_id}")
print(f" Source: {file_path}")
print(
f" Content: {content[:200]}{'...' if len(content) > 200 else ''}"
)
print()
print("=" * 60)
def compare_with_regular_query():
"""Compare results between regular query and data query"""
query_text = "LightRAG的作者是谁"
print("\n🔄 Comparison test: Regular query vs Data query")
print("-" * 60)
# Regular query
try:
print("1. Regular query (/query):")
regular_response = requests.post(
f"{BASE_URL}/query",
json={"query": query_text, "mode": "mix"},
headers=AUTH_HEADERS,
timeout=30,
)
if regular_response.status_code == 200:
regular_data = regular_response.json()
response_text = regular_data.get("response", "No response")
print(
f" Generated answer: {response_text[:300]}{'...' if len(response_text) > 300 else ''}"
)
else:
print(f" Regular query failed: {regular_response.status_code}")
if regular_response.status_code == 403:
print(" Authentication failed - Please check API Key configuration")
elif regular_response.status_code == 401:
print(" Unauthorized - Please check authentication information")
print(f" Error details: {regular_response.text}")
except Exception as e:
print(f" Regular query error: {str(e)}")
if __name__ == "__main__":
# Run main test
test_aquery_data_endpoint()
# Run comparison test
compare_with_regular_query()
print("\n💡 Usage tips:")
print("1. Ensure LightRAG API service is running")
print("2. Adjust base_url and authentication information as needed")
print("3. Modify query parameters to test different retrieval strategies")
print("4. Data query results can be used for further analysis and processing")