from __future__ import annotations

from typing import Any

PROMPTS: dict[str, Any] = {}

# All delimiters must be formatted as "<|TOKEN|>" style markers (e.g., "<|#|>" or "<|COMPLETE|>")
PROMPTS['DEFAULT_TUPLE_DELIMITER'] = '<|#|>'
PROMPTS['DEFAULT_COMPLETION_DELIMITER'] = '<|COMPLETE|>'
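
# Hypothetical parsing sketch (not part of this module's API): how a caller
# might split delimiter-formatted model output back into records. The name
# parse_extraction_output and its return shape are illustrative assumptions.
def parse_extraction_output(raw: str) -> list[list[str]]:
    """Split delimiter-formatted extraction output into per-record field lists."""
    # Discard anything after the completion marker.
    body = raw.split(PROMPTS['DEFAULT_COMPLETION_DELIMITER'])[0]
    records = []
    for line in body.strip().splitlines():
        fields = line.split(PROMPTS['DEFAULT_TUPLE_DELIMITER'])
        if fields[0] in ('entity', 'relation'):  # skip malformed lines
            records.append(fields)
    return records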

PROMPTS['entity_extraction_system_prompt'] = """---Role---
You are a Knowledge Graph Specialist extracting entities and relationships from text.

---Output Format---
Output raw lines only—NO markdown, NO headers, NO backticks.

Entity: entity{tuple_delimiter}name{tuple_delimiter}type{tuple_delimiter}description
Relation: relation{tuple_delimiter}source{tuple_delimiter}target{tuple_delimiter}keywords{tuple_delimiter}description

Use Title Case for names. Separate keywords with commas. Output entities first, then relations. End with {completion_delimiter}.

---Entity Extraction---
Extract BOTH concrete and abstract entities:
- **Concrete:** Named people, organizations, places, products, dates
- **Abstract:** Concepts, events, categories, processes mentioned in text (e.g., "market selloff", "merger", "pandemic")

Types: `{entity_types}` (use `Other` if none fit)

---Relationship Extraction---
Extract meaningful relationships:
- **Direct:** explicit interactions, actions, connections
- **Categorical:** entities sharing group membership or classification
- **Causal:** cause-effect relationships
- **Hierarchical:** part-of, member-of, type-of

Create intermediate concept entities when they help connect related items (e.g., "Vaccines" connecting Pfizer/Moderna/AstraZeneca).

For N-ary relationships, decompose into binary pairs. Avoid duplicates.

---Guidelines---
- Third person only; no self-references such as "this article", "I", "you"
- Output in `{language}`. Keep proper nouns in original language.

---Examples---
{examples}

---Input---
Entity_types: [{entity_types}]
Text:
```
{input_text}
```
"""

PROMPTS['entity_extraction_user_prompt'] = """---Task---
Extract entities and relationships from the text. Include both concrete entities AND abstract concepts/events.

Follow format exactly. Output only extractions—no explanations. End with `{completion_delimiter}`.
Output in {language}; keep proper nouns in original language.

<Output>
"""

PROMPTS['entity_continue_extraction_user_prompt'] = """---Task---
Review extraction for missed entities/relationships.

Check for:
1. Abstract concepts that could serve as hubs (events, categories, processes)
2. Orphan entities that need connections
3. Formatting errors

Only output NEW or CORRECTED items. End with `{completion_delimiter}`. Output in {language}.

<Output>
"""

PROMPTS['entity_extraction_examples'] = [
# Example 1: Shows abstract concept extraction (Market Selloff as hub)
"""<Input Text>
```
Stock markets faced a sharp downturn as tech giants saw significant declines, with the global tech index dropping 3.4%. Nexon Technologies saw its stock plummet 7.8% after lower-than-expected earnings. In contrast, Omega Energy posted a 2.1% gain driven by rising oil prices.

Gold futures rose 1.5% to $2,080/oz as investors sought safe-haven assets. The Federal Reserve's upcoming policy announcement is expected to influence market stability.
```

<Output>
entity{tuple_delimiter}Market Selloff{tuple_delimiter}event{tuple_delimiter}Significant decline in stock values due to investor concerns.
entity{tuple_delimiter}Global Tech Index{tuple_delimiter}category{tuple_delimiter}Tracks major tech stocks; dropped 3.4% today.
entity{tuple_delimiter}Nexon Technologies{tuple_delimiter}organization{tuple_delimiter}Tech company whose stock fell 7.8% after disappointing earnings.
entity{tuple_delimiter}Omega Energy{tuple_delimiter}organization{tuple_delimiter}Energy company that gained 2.1% due to rising oil prices.
entity{tuple_delimiter}Gold Futures{tuple_delimiter}product{tuple_delimiter}Rose 1.5% to $2,080/oz as safe-haven investment.
entity{tuple_delimiter}Federal Reserve{tuple_delimiter}organization{tuple_delimiter}Central bank whose policy may impact markets.
relation{tuple_delimiter}Global Tech Index{tuple_delimiter}Market Selloff{tuple_delimiter}market decline{tuple_delimiter}Tech index drop is part of broader selloff.
relation{tuple_delimiter}Nexon Technologies{tuple_delimiter}Market Selloff{tuple_delimiter}tech decline{tuple_delimiter}Nexon among hardest hit in selloff.
relation{tuple_delimiter}Omega Energy{tuple_delimiter}Market Selloff{tuple_delimiter}contrast, resilience{tuple_delimiter}Omega gained while broader market sold off.
relation{tuple_delimiter}Gold Futures{tuple_delimiter}Market Selloff{tuple_delimiter}safe-haven{tuple_delimiter}Gold rose as investors fled stocks.
relation{tuple_delimiter}Federal Reserve{tuple_delimiter}Market Selloff{tuple_delimiter}policy impact{tuple_delimiter}Fed policy expectations contributed to volatility.
{completion_delimiter}

""",
# Example 2: Shows intermediate entity (Vaccines) connecting multiple orgs
"""<Input Text>
```
COVID-19 vaccines developed by Pfizer, Moderna, and AstraZeneca have shown high efficacy in preventing severe illness. The World Health Organization recommends vaccination for all eligible adults.
```

<Output>
entity{tuple_delimiter}COVID-19{tuple_delimiter}concept{tuple_delimiter}Disease that vaccines are designed to prevent.
entity{tuple_delimiter}Vaccines{tuple_delimiter}product{tuple_delimiter}Medical products developed to prevent COVID-19.
entity{tuple_delimiter}Pfizer{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}Moderna{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}AstraZeneca{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}World Health Organization{tuple_delimiter}organization{tuple_delimiter}Global health body recommending vaccination.
relation{tuple_delimiter}Pfizer{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}Pfizer developed a COVID-19 vaccine.
relation{tuple_delimiter}Moderna{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}Moderna developed a COVID-19 vaccine.
relation{tuple_delimiter}AstraZeneca{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}AstraZeneca developed a COVID-19 vaccine.
relation{tuple_delimiter}Vaccines{tuple_delimiter}COVID-19{tuple_delimiter}prevention{tuple_delimiter}Vaccines prevent severe COVID-19 illness.
relation{tuple_delimiter}World Health Organization{tuple_delimiter}Vaccines{tuple_delimiter}recommendation{tuple_delimiter}WHO recommends vaccination for adults.
{completion_delimiter}

""",
# Example 3: Short legal example with hub entity (Merger)
"""<Input Text>
```
The merger between Acme Corp and Beta Industries requires Federal Trade Commission approval due to antitrust concerns.
```

<Output>
entity{tuple_delimiter}Merger{tuple_delimiter}event{tuple_delimiter}Proposed business combination between Acme Corp and Beta Industries.
entity{tuple_delimiter}Acme Corp{tuple_delimiter}organization{tuple_delimiter}Company involved in proposed merger.
entity{tuple_delimiter}Beta Industries{tuple_delimiter}organization{tuple_delimiter}Company involved in proposed merger.
entity{tuple_delimiter}Federal Trade Commission{tuple_delimiter}organization{tuple_delimiter}Regulatory body that must approve the merger.
relation{tuple_delimiter}Acme Corp{tuple_delimiter}Merger{tuple_delimiter}party to{tuple_delimiter}Acme Corp is party to the merger.
relation{tuple_delimiter}Beta Industries{tuple_delimiter}Merger{tuple_delimiter}party to{tuple_delimiter}Beta Industries is party to the merger.
relation{tuple_delimiter}Federal Trade Commission{tuple_delimiter}Merger{tuple_delimiter}regulatory approval{tuple_delimiter}FTC must approve the merger.
{completion_delimiter}

""",
]
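
# Hypothetical rendering sketch: the example strings above keep their
# {tuple_delimiter}/{completion_delimiter} placeholders, so they are formatted
# first and then injected into the system prompt. The entity-type list and the
# function name are illustrative assumptions, not the pipeline's defaults.
def render_extraction_prompt(input_text: str, language: str = 'English') -> str:
    delimiters = {
        'tuple_delimiter': PROMPTS['DEFAULT_TUPLE_DELIMITER'],
        'completion_delimiter': PROMPTS['DEFAULT_COMPLETION_DELIMITER'],
    }
    examples = '\n'.join(
        example.format(**delimiters) for example in PROMPTS['entity_extraction_examples']
    )
    return PROMPTS['entity_extraction_system_prompt'].format(
        entity_types='organization, person, location, event, concept',
        examples=examples,
        language=language,
        input_text=input_text,
        **delimiters,
    )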

PROMPTS['summarize_entity_descriptions'] = """---Task---
Merge multiple descriptions of "{description_name}" ({description_type}) into one comprehensive summary.

Rules:
- Plain text output only, no formatting or extra text
- Include ALL key facts from every description
- Third person, mention entity name at start
- Max {summary_length} tokens
- Output in {language}; keep proper nouns in original language
- If descriptions conflict: reconcile or note uncertainty

Descriptions:
```
{description_list}
```

Output:"""

PROMPTS['fail_response'] = "Sorry, I'm not able to provide an answer to that question.[no-context]"

# Default RAG response prompt - cite-ready (no LLM-generated citations)
# Citations are added by post-processing. This gives cleaner, more accurate results.
# Optimized via DSPy/RAGAS testing - qtype variant achieved 0.887 relevance, 0.996 faithfulness
PROMPTS['rag_response'] = """Context:
{context_data}

---

Answer using ONLY the context above. Match your response to the question type:
{coverage_guidance}
IF "What were the lessons/challenges/considerations..." → Enumerate: "(1)..., (2)..., (3)..."
IF "How does X describe/explain..." → Summarize what X says about the topic
IF "What are the relationships/interdependencies..." → Describe the connections

RULES:
- Every fact must be from the context above
- Use the question's terminology in your answer
- If information is missing, acknowledge it

Do NOT include [1], [2] citations - they're added automatically.

Question: {user_prompt}

Answer:"""

# Coverage guidance templates (injected based on context sparsity detection)
PROMPTS['coverage_guidance_limited'] = """
CONTEXT NOTICE: The retrieved information for this topic is LIMITED.
- Only state facts that appear explicitly in the context below
- If key aspects of the question aren't covered, acknowledge: "The available information does not specify [aspect]"
- Avoid inferring or generalizing beyond what's stated
"""

PROMPTS['coverage_guidance_good'] = ''  # Empty for well-covered topics

# Strict mode suffix - append when response_type="strict"
PROMPTS['rag_response_strict_suffix'] = """
STRICT GROUNDING:
- NEVER state specific numbers/dates unless they appear EXACTLY in context
- If information isn't in context, say "not specified in available information"
- Entity summaries for overview, Source Excerpts for precision
"""

# Default naive RAG response prompt - cite-ready (no LLM-generated citations)
# Enhanced with strict grounding rules to prevent hallucination
PROMPTS['naive_rag_response'] = """---Task---
Answer the query using ONLY information present in the provided context. Do NOT add any external knowledge, assumptions, or inferences beyond the exact wording.

STRICT GUIDELINES
- Every sentence must be a verbatim fact or a direct logical consequence that can be explicitly traced to a specific chunk of the context.
- If the context lacks a required number, date, name, or any detail, respond with: "not specified in available information."
- If any part of the question cannot be answered from the context, explicitly note the missing coverage.
- Use the same terminology and phrasing found in the question whenever possible; mirror the question's key nouns and verbs.
- When the answer contains multiple items, present them as a concise list.

FORMAT
- Match the language of the question.
- Write clear, concise sentences; use simple Markdown (lists, bold) only if it aids clarity.
- Do not include a References section; it will be generated automatically.
- Response type: {response_type}
{coverage_guidance}

Question: {user_prompt}

---Context---
{context_data}
{content_data}

Answer:"""

PROMPTS['kg_query_context'] = """
## Entity Summaries (use for definitions and general facts)

```json
{entities_str}
```

## Relationships (use to explain connections between concepts)

```json
{relations_str}
```

## Source Excerpts (use for specific facts, numbers, quotes)

```json
{text_chunks_str}
```

## References
{reference_list_str}

"""

PROMPTS['naive_query_context'] = """
Document Chunks (Each entry includes a reference_id that refers to the `Reference Document List`):

```json
{text_chunks_str}
```

Reference Document List (Each entry starts with a [reference_id] that corresponds to entries in the Document Chunks):

```
{reference_list_str}
```

"""

PROMPTS['keywords_extraction'] = """---Task---
Extract keywords from the query for RAG retrieval.

Output valid JSON (no markdown):
{{"high_level_keywords": [...], "low_level_keywords": [...]}}

Guidelines:
- high_level: Topic categories, question types, abstract themes
- low_level: Specific terms from the query including:
  * Named entities (people, organizations, places)
  * Technical terms and key concepts
  * Dates, years, and time periods (e.g., "2017", "Q3 2024")
  * Document names, report titles, and identifiers
- Extract at least 1 keyword per category for meaningful queries
- Only return empty lists for nonsensical input (e.g., "asdfgh") or content-free greetings (e.g., "hello")

---Examples---
{examples}

---Query---
{query}

Output:"""

PROMPTS['keywords_extraction_examples'] = [
"""Query: "What is the capital of France?"
Output: {{"high_level_keywords": ["Geography", "Capital city"], "low_level_keywords": ["France"]}}
""",
"""Query: "Why does inflation affect interest rates?"
Output: {{"high_level_keywords": ["Economics", "Cause-effect"], "low_level_keywords": ["inflation", "interest rates"]}}
""",
"""Query: "How does Python compare to JavaScript for web development?"
Output: {{"high_level_keywords": ["Programming languages", "Comparison"], "low_level_keywords": ["Python", "JavaScript"]}}
""",
]
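
# Hypothetical parsing sketch: the prompt requests bare JSON, so a thin
# json.loads wrapper with an empty-keywords fallback is usually enough.
# The helper name and fallback behavior are illustrative assumptions.
import json


def parse_keywords(raw: str) -> dict[str, list[str]]:
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return {'high_level_keywords': [], 'low_level_keywords': []}
    return {
        'high_level_keywords': list(data.get('high_level_keywords', [])),
        'low_level_keywords': list(data.get('low_level_keywords', [])),
    }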

PROMPTS['orphan_connection_validation'] = """---Task---
Evaluate whether a meaningful relationship exists between two entities.

Orphan: {orphan_name} ({orphan_type}) - {orphan_description}
Candidate: {candidate_name} ({candidate_type}) - {candidate_description}
Similarity: {similarity_score}

Valid relationship types:
- Direct: One uses/creates/owns the other
- Industry: Both operate in same sector (finance, tech, healthcare)
- Competitive: Direct competitors or alternatives
- Temporal: Versions, successors, or historical connections
- Dependency: One relies on/runs on the other

Confidence levels (use these exact labels):
- HIGH: Direct/explicit relationship (Django is Python framework, iOS is Apple product)
- MEDIUM: Strong implicit or industry relationship (Netflix runs on AWS, Bitcoin and Visa both in payments)
- LOW: Very weak, tenuous connection
- NONE: No logical relationship

Output valid JSON:
{{"should_connect": bool, "confidence": "HIGH"|"MEDIUM"|"LOW"|"NONE", "relationship_type": str|null, "relationship_keywords": str|null, "relationship_description": str|null, "reasoning": str}}

Rules:
- HIGH/MEDIUM: should_connect=true (same industry = MEDIUM)
- LOW/NONE: should_connect=false
- High similarity alone is NOT sufficient
- Explain the specific relationship in reasoning

Example: Python↔Django
{{"should_connect": true, "confidence": "HIGH", "relationship_type": "direct", "relationship_keywords": "framework, built-with", "relationship_description": "Django is a web framework written in Python", "reasoning": "Direct explicit relationship - Django is implemented in Python"}}

Example: Mozart↔Docker
{{"should_connect": false, "confidence": "NONE", "relationship_type": null, "relationship_keywords": null, "relationship_description": null, "reasoning": "No logical connection between classical composer and container technology"}}

Output:"""
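
# Hypothetical validation sketch: parse the model's JSON verdict and enforce
# the consistency rule stated in the prompt above (only HIGH/MEDIUM verdicts
# create a connection). Uses the json import from the sketch above; the
# helper name is an illustrative assumption.
def parse_orphan_verdict(raw: str) -> dict[str, object]:
    verdict = json.loads(raw.strip())
    confidence = str(verdict.get('confidence', 'NONE')).upper()
    verdict['should_connect'] = confidence in ('HIGH', 'MEDIUM')
    return verdict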