from __future__ import annotations

from typing import Any

PROMPTS: dict[str, Any] = {}

# All delimiters must be formatted as "<|TOKEN|>" style markers (e.g., "<|#|>" or "<|COMPLETE|>")
PROMPTS['DEFAULT_TUPLE_DELIMITER'] = '<|#|>'
PROMPTS['DEFAULT_COMPLETION_DELIMITER'] = '<|COMPLETE|>'
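
# Hypothetical parsing sketch (not part of this module's API): how a caller
# might split delimiter-formatted model output back into records. The name
# parse_extraction_output and its return shape are illustrative assumptions.
def parse_extraction_output(raw: str) -> list[list[str]]:
    """Split delimiter-formatted extraction output into per-record field lists."""
    # Discard anything after the completion marker.
    body = raw.split(PROMPTS['DEFAULT_COMPLETION_DELIMITER'])[0]
    records = []
    for line in body.strip().splitlines():
        fields = line.split(PROMPTS['DEFAULT_TUPLE_DELIMITER'])
        if fields[0] in ('entity', 'relation'):  # skip malformed lines
            records.append(fields)
    return records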

PROMPTS['entity_extraction_system_prompt'] = """---Role---
You are a Knowledge Graph Specialist extracting entities and relationships from text.

---Output Format---
Output raw lines only—NO markdown, NO headers, NO backticks.

Entity: entity{tuple_delimiter}name{tuple_delimiter}type{tuple_delimiter}description
Relation: relation{tuple_delimiter}source{tuple_delimiter}target{tuple_delimiter}keywords{tuple_delimiter}description

Use Title Case for names. Separate keywords with commas. Output entities first, then relations. End with {completion_delimiter}.

---Entity Extraction---
Extract BOTH concrete and abstract entities:
- **Concrete:** Named people, organizations, places, products, dates
- **Abstract:** Concepts, events, categories, processes mentioned in text (e.g., "market selloff", "merger", "pandemic")

Types: `{entity_types}` (use `Other` if none fit)

---Relationship Extraction---
Extract meaningful relationships:
- **Direct:** explicit interactions, actions, connections
- **Categorical:** entities sharing group membership or classification
- **Causal:** cause-effect relationships
- **Hierarchical:** part-of, member-of, type-of

Create intermediate concept entities when they help connect related items (e.g., "Vaccines" connecting Pfizer/Moderna/AstraZeneca).

For N-ary relationships, decompose into binary pairs. Avoid duplicates.

---Guidelines---
- Third person only; no self-references such as "this article", "I", "you"
- Output in `{language}`. Keep proper nouns in original language.

---Examples---
{examples}

---Input---
Entity_types: [{entity_types}]
Text:
```
{input_text}
```
"""

PROMPTS['entity_extraction_user_prompt'] = """---Task---
Extract entities and relationships from the text. Include both concrete entities AND abstract concepts/events.

Follow format exactly. Output only extractions—no explanations. End with `{completion_delimiter}`.
Output in {language}; keep proper nouns in original language.

<Output>
"""

PROMPTS['entity_continue_extraction_user_prompt'] = """---Task---
Review extraction for missed entities/relationships.

Check for:
1. Abstract concepts that could serve as hubs (events, categories, processes)
2. Orphan entities that need connections
3. Formatting errors

Only output NEW or CORRECTED items. End with `{completion_delimiter}`. Output in {language}.

<Output>
"""

PROMPTS['entity_extraction_examples'] = [
# Example 1: Shows abstract concept extraction (Market Selloff as hub)
"""<Input Text>
```
Stock markets faced a sharp downturn as tech giants saw significant declines, with the global tech index dropping 3.4%. Nexon Technologies saw its stock plummet 7.8% after lower-than-expected earnings. In contrast, Omega Energy posted a 2.1% gain driven by rising oil prices.

Gold futures rose 1.5% to $2,080/oz as investors sought safe-haven assets. The Federal Reserve's upcoming policy announcement is expected to influence market stability.
```

<Output>
entity{tuple_delimiter}Market Selloff{tuple_delimiter}event{tuple_delimiter}Significant decline in stock values due to investor concerns.
entity{tuple_delimiter}Global Tech Index{tuple_delimiter}category{tuple_delimiter}Tracks major tech stocks; dropped 3.4% today.
entity{tuple_delimiter}Nexon Technologies{tuple_delimiter}organization{tuple_delimiter}Tech company whose stock fell 7.8% after disappointing earnings.
entity{tuple_delimiter}Omega Energy{tuple_delimiter}organization{tuple_delimiter}Energy company that gained 2.1% due to rising oil prices.
entity{tuple_delimiter}Gold Futures{tuple_delimiter}product{tuple_delimiter}Rose 1.5% to $2,080/oz as safe-haven investment.
entity{tuple_delimiter}Federal Reserve{tuple_delimiter}organization{tuple_delimiter}Central bank whose policy may impact markets.
relation{tuple_delimiter}Global Tech Index{tuple_delimiter}Market Selloff{tuple_delimiter}market decline{tuple_delimiter}Tech index drop is part of broader selloff.
relation{tuple_delimiter}Nexon Technologies{tuple_delimiter}Market Selloff{tuple_delimiter}tech decline{tuple_delimiter}Nexon among hardest hit in selloff.
relation{tuple_delimiter}Omega Energy{tuple_delimiter}Market Selloff{tuple_delimiter}contrast, resilience{tuple_delimiter}Omega gained while broader market sold off.
relation{tuple_delimiter}Gold Futures{tuple_delimiter}Market Selloff{tuple_delimiter}safe-haven{tuple_delimiter}Gold rose as investors fled stocks.
relation{tuple_delimiter}Federal Reserve{tuple_delimiter}Market Selloff{tuple_delimiter}policy impact{tuple_delimiter}Fed policy expectations contributed to volatility.
{completion_delimiter}

""",
# Example 2: Shows intermediate entity (Vaccines) connecting multiple orgs
"""<Input Text>
```
COVID-19 vaccines developed by Pfizer, Moderna, and AstraZeneca have shown high efficacy in preventing severe illness. The World Health Organization recommends vaccination for all eligible adults.
```

<Output>
entity{tuple_delimiter}COVID-19{tuple_delimiter}concept{tuple_delimiter}Disease that vaccines are designed to prevent.
entity{tuple_delimiter}Vaccines{tuple_delimiter}product{tuple_delimiter}Medical products developed to prevent COVID-19.
entity{tuple_delimiter}Pfizer{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}Moderna{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}AstraZeneca{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}World Health Organization{tuple_delimiter}organization{tuple_delimiter}Global health body recommending vaccination.
relation{tuple_delimiter}Pfizer{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}Pfizer developed a COVID-19 vaccine.
relation{tuple_delimiter}Moderna{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}Moderna developed a COVID-19 vaccine.
relation{tuple_delimiter}AstraZeneca{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}AstraZeneca developed a COVID-19 vaccine.
relation{tuple_delimiter}Vaccines{tuple_delimiter}COVID-19{tuple_delimiter}prevention{tuple_delimiter}Vaccines prevent severe COVID-19 illness.
relation{tuple_delimiter}World Health Organization{tuple_delimiter}Vaccines{tuple_delimiter}recommendation{tuple_delimiter}WHO recommends vaccination for adults.
{completion_delimiter}

""",
# Example 3: Short legal example with hub entity (Merger)
"""<Input Text>
```
The merger between Acme Corp and Beta Industries requires Federal Trade Commission approval due to antitrust concerns.
```

<Output>
entity{tuple_delimiter}Merger{tuple_delimiter}event{tuple_delimiter}Proposed business combination between Acme Corp and Beta Industries.
entity{tuple_delimiter}Acme Corp{tuple_delimiter}organization{tuple_delimiter}Company involved in proposed merger.
entity{tuple_delimiter}Beta Industries{tuple_delimiter}organization{tuple_delimiter}Company involved in proposed merger.
entity{tuple_delimiter}Federal Trade Commission{tuple_delimiter}organization{tuple_delimiter}Regulatory body that must approve the merger.
relation{tuple_delimiter}Acme Corp{tuple_delimiter}Merger{tuple_delimiter}party to{tuple_delimiter}Acme Corp is party to the merger.
relation{tuple_delimiter}Beta Industries{tuple_delimiter}Merger{tuple_delimiter}party to{tuple_delimiter}Beta Industries is party to the merger.
relation{tuple_delimiter}Federal Trade Commission{tuple_delimiter}Merger{tuple_delimiter}regulatory approval{tuple_delimiter}FTC must approve the merger.
{completion_delimiter}

""",
]
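
# Hypothetical rendering sketch: the example strings above keep their
# {tuple_delimiter}/{completion_delimiter} placeholders, so they are formatted
# first and then injected into the system prompt. The entity-type list and the
# function name are illustrative assumptions, not the pipeline's defaults.
def render_extraction_prompt(input_text: str, language: str = 'English') -> str:
    delimiters = {
        'tuple_delimiter': PROMPTS['DEFAULT_TUPLE_DELIMITER'],
        'completion_delimiter': PROMPTS['DEFAULT_COMPLETION_DELIMITER'],
    }
    examples = '\n'.join(
        example.format(**delimiters) for example in PROMPTS['entity_extraction_examples']
    )
    return PROMPTS['entity_extraction_system_prompt'].format(
        entity_types='organization, person, location, event, concept',
        examples=examples,
        language=language,
        input_text=input_text,
        **delimiters,
    )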

PROMPTS['summarize_entity_descriptions'] = """---Task---
Merge multiple descriptions of "{description_name}" ({description_type}) into one comprehensive summary.

Rules:
- Plain text output only, no formatting or extra text
- Include ALL key facts from every description
- Third person, mention entity name at start
- Max {summary_length} tokens
- Output in {language}; keep proper nouns in original language
- If descriptions conflict: reconcile or note uncertainty

Descriptions:
```
{description_list}
```

Output:"""

PROMPTS['fail_response'] = "Sorry, I'm not able to provide an answer to that question.[no-context]"

# Default RAG response prompt - cite-ready (no LLM-generated citations)
# Citations are added by post-processing. This gives cleaner, more accurate results.
# Optimized via DSPy/RAGAS testing - qtype variant achieved 0.887 relevance, 0.996 faithfulness
PROMPTS['rag_response'] = """Context:
{context_data}

---

Answer using ONLY the context above. Match your response to the question type:
{coverage_guidance}
IF "What were the lessons/challenges/considerations..." → Enumerate: "(1)..., (2)..., (3)..."
IF "How does X describe/explain..." → Summarize what X says about the topic
IF "What are the relationships/interdependencies..." → Describe the connections

RULES:
- Every fact must be from the context above
- Use the question's terminology in your answer
- If information is missing, acknowledge it

Do NOT include [1], [2] citations - they're added automatically.

Question: {user_prompt}

Answer:"""

# Coverage guidance templates (injected based on context sparsity detection)
PROMPTS['coverage_guidance_limited'] = """
CONTEXT NOTICE: The retrieved information for this topic is LIMITED.
- Only state facts that appear explicitly in the context below
- If key aspects of the question aren't covered, acknowledge: "The available information does not specify [aspect]"
- Avoid inferring or generalizing beyond what's stated
"""

PROMPTS['coverage_guidance_good'] = ''  # Empty for well-covered topics

# Strict mode suffix - append when response_type="strict"
PROMPTS['rag_response_strict_suffix'] = """
STRICT GROUNDING:
- NEVER state specific numbers/dates unless they appear EXACTLY in context
- If information isn't in context, say "not specified in available information"
- Entity summaries for overview, Source Excerpts for precision
"""

# Default naive RAG response prompt - cite-ready (no LLM-generated citations)
# Enhanced with strict grounding rules to prevent hallucination
PROMPTS['naive_rag_response'] = """---Task---
Answer the query using ONLY information present in the provided context. Do NOT add any external knowledge, assumptions, or inferences beyond the exact wording.

STRICT GUIDELINES
- Every sentence must be a verbatim fact or a direct logical consequence that can be explicitly traced to a specific chunk of the context.
- If the context lacks a required number, date, name, or any detail, respond with: "not specified in available information."
- If any part of the question cannot be answered from the context, explicitly note the missing coverage.
- Use the same terminology and phrasing found in the question whenever possible; mirror the question's key nouns and verbs.
- When the answer contains multiple items, present them as a concise list.

FORMAT
- Match the language of the question.
- Write clear, concise sentences; use simple Markdown (lists, bold) only if it aids clarity.
- Do not include a References section; it will be generated automatically.
- Response type: {response_type}
{coverage_guidance}

Question: {user_prompt}

---Context---
{context_data}
{content_data}

Answer:"""

PROMPTS['kg_query_context'] = """
## Entity Summaries (use for definitions and general facts)

```json
{entities_str}
```

## Relationships (use to explain connections between concepts)

```json
{relations_str}
```

## Source Excerpts (use for specific facts, numbers, quotes)

```json
{text_chunks_str}
```

## References
{reference_list_str}

"""

PROMPTS['naive_query_context'] = """
Document Chunks (Each entry includes a reference_id that refers to the `Reference Document List`):

```json
{text_chunks_str}
```

Reference Document List (Each entry starts with a [reference_id] that corresponds to entries in the Document Chunks):

```
{reference_list_str}
```

"""

PROMPTS['keywords_extraction'] = """---Task---
Extract keywords from the query for RAG retrieval.

Output valid JSON (no markdown):
{{"high_level_keywords": [...], "low_level_keywords": [...]}}

Guidelines:
- high_level: Topic categories, question types, abstract themes
- low_level: Specific terms from the query including:
  * Named entities (people, organizations, places)
  * Technical terms and key concepts
  * Dates, years, and time periods (e.g., "2017", "Q3 2024")
  * Document names, report titles, and identifiers
- Extract at least 1 keyword per category for meaningful queries
- Only return empty lists for nonsensical input (e.g., "asdfgh") or content-free greetings (e.g., "hello")

---Examples---
{examples}

---Query---
{query}

Output:"""

PROMPTS['keywords_extraction_examples'] = [
"""Query: "What is the capital of France?"
Output: {{"high_level_keywords": ["Geography", "Capital city"], "low_level_keywords": ["France"]}}
""",
"""Query: "Why does inflation affect interest rates?"
Output: {{"high_level_keywords": ["Economics", "Cause-effect"], "low_level_keywords": ["inflation", "interest rates"]}}
""",
"""Query: "How does Python compare to JavaScript for web development?"
Output: {{"high_level_keywords": ["Programming languages", "Comparison"], "low_level_keywords": ["Python", "JavaScript"]}}
""",
]
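
# Hypothetical parsing sketch: the prompt requests bare JSON, so a thin
# json.loads wrapper with an empty-keywords fallback is usually enough.
# The helper name and fallback behavior are illustrative assumptions.
import json


def parse_keywords(raw: str) -> dict[str, list[str]]:
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return {'high_level_keywords': [], 'low_level_keywords': []}
    return {
        'high_level_keywords': list(data.get('high_level_keywords', [])),
        'low_level_keywords': list(data.get('low_level_keywords', [])),
    }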

PROMPTS['orphan_connection_validation'] = """---Task---
Evaluate whether a meaningful relationship exists between two entities.

Orphan: {orphan_name} ({orphan_type}) - {orphan_description}
Candidate: {candidate_name} ({candidate_type}) - {candidate_description}
Similarity: {similarity_score}

Valid relationship types:
- Direct: One uses/creates/owns the other
- Industry: Both operate in same sector (finance, tech, healthcare)
- Competitive: Direct competitors or alternatives
- Temporal: Versions, successors, or historical connections
- Dependency: One relies on/runs on the other

Confidence levels (use these exact labels):
- HIGH: Direct/explicit relationship (Django is Python framework, iOS is Apple product)
- MEDIUM: Strong implicit or industry relationship (Netflix runs on AWS, Bitcoin and Visa both in payments)
- LOW: Very weak, tenuous connection
- NONE: No logical relationship

Output valid JSON:
{{"should_connect": bool, "confidence": "HIGH"|"MEDIUM"|"LOW"|"NONE", "relationship_type": str|null, "relationship_keywords": str|null, "relationship_description": str|null, "reasoning": str}}

Rules:
- HIGH/MEDIUM: should_connect=true (same industry = MEDIUM)
- LOW/NONE: should_connect=false
- High similarity alone is NOT sufficient
- Explain the specific relationship in reasoning

Example: Python↔Django
{{"should_connect": true, "confidence": "HIGH", "relationship_type": "direct", "relationship_keywords": "framework, built-with", "relationship_description": "Django is a web framework written in Python", "reasoning": "Direct explicit relationship - Django is implemented in Python"}}

Example: Mozart↔Docker
{{"should_connect": false, "confidence": "NONE", "relationship_type": null, "relationship_keywords": null, "relationship_description": null, "reasoning": "No logical connection between classical composer and container technology"}}

Output:"""
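
# Hypothetical validation sketch: parse the model's JSON verdict and enforce
# the consistency rule stated in the prompt above (only HIGH/MEDIUM verdicts
# create a connection). Uses the json import from the sketch above; the
# helper name is an illustrative assumption.
def parse_orphan_verdict(raw: str) -> dict[str, object]:
    verdict = json.loads(raw.strip())
    confidence = str(verdict.get('confidence', 'NONE')).upper()
    verdict['should_connect'] = confidence in ('HIGH', 'MEDIUM')
    return verdict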