feat(lightrag): improve entity extraction prompts and rerank chunking

Enhance entity extraction with better structured prompts:
- Reorganize prompt format for improved clarity and consistency
- Add XML-style formatting tags for better LLM parsing
- Include language parameter in keywords extraction cache key
- Fix language parameter usage in keywords_extraction prompt

Improve rerank module with chunking fixes:
- Fix top_n behavior to limit documents instead of chunks
- Add Cohere reranker support with proper chunking
- Improve error handling for rerank API responses

Update operate.py:
- Better entity extraction parsing and validation
- Improved cache key generation for multilingual support
clssck 2025-12-12 16:43:09 +01:00
parent 59e89772de
commit abb44eccb1
3 changed files with 2419 additions and 2633 deletions

File diff suppressed because it is too large.


@@ -1,260 +1,348 @@
from __future__ import annotations
from typing import Any
PROMPTS: dict[str, Any] = {}
# All delimiters must be formatted as "<|TOKEN|>" style markers (e.g., "<|#|>" or "<|COMPLETE|>")
PROMPTS['DEFAULT_TUPLE_DELIMITER'] = '<|#|>'
PROMPTS['DEFAULT_COMPLETION_DELIMITER'] = '<|COMPLETE|>'
# All delimiters must be formatted as "<|UPPER_CASE_STRING|>"
PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|#|>"
PROMPTS["DEFAULT_COMPLETION_DELIMITER"] = "<|COMPLETE|>"
PROMPTS['entity_extraction_system_prompt'] = """---Role---
You are a Knowledge Graph Specialist extracting entities and relationships from text.
PROMPTS["entity_extraction_system_prompt"] = """---Role---
You are a Knowledge Graph Specialist responsible for extracting entities and relationships from the input text.
---Output Format---
Output raw lines only: NO markdown, NO headers, NO backticks.
---Instructions---
1. **Entity Extraction & Output:**
* **Identification:** Identify clearly defined and meaningful entities in the input text.
* **Entity Details:** For each identified entity, extract the following information:
* `entity_name`: The name of the entity. If the entity name is case-insensitive, capitalize the first letter of each significant word (title case). Ensure **consistent naming** across the entire extraction process.
* `entity_type`: Categorize the entity using one of the following types: `{entity_types}`. If none of the provided entity types apply, do not add new entity type and classify it as `Other`.
* `entity_description`: Provide a concise yet comprehensive description of the entity's attributes and activities, based *solely* on the information present in the input text.
* **Output Format - Entities:** Output a total of 4 fields for each entity, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `entity`.
* Format: `entity{tuple_delimiter}entity_name{tuple_delimiter}entity_type{tuple_delimiter}entity_description`
Entity: entity{tuple_delimiter}name{tuple_delimiter}type{tuple_delimiter}description
Relation: relation{tuple_delimiter}source{tuple_delimiter}target{tuple_delimiter}keywords{tuple_delimiter}description
2. **Relationship Extraction & Output:**
* **Identification:** Identify direct, clearly stated, and meaningful relationships between previously extracted entities.
* **N-ary Relationship Decomposition:** If a single statement describes a relationship involving more than two entities (an N-ary relationship), decompose it into multiple binary (two-entity) relationship pairs for separate description.
* **Example:** For "Alice, Bob, and Carol collaborated on Project X," extract binary relationships such as "Alice collaborated with Project X," "Bob collaborated with Project X," and "Carol collaborated with Project X," or "Alice collaborated with Bob," based on the most reasonable binary interpretations.
* **Relationship Details:** For each binary relationship, extract the following fields:
* `source_entity`: The name of the source entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive.
* `target_entity`: The name of the target entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive.
* `relationship_keywords`: One or more high-level keywords summarizing the overarching nature, concepts, or themes of the relationship. Multiple keywords within this field must be separated by a comma `,`. **DO NOT use `{tuple_delimiter}` for separating multiple keywords within this field.**
* `relationship_description`: A concise explanation of the nature of the relationship between the source and target entities, providing a clear rationale for their connection.
* **Output Format - Relationships:** Output a total of 5 fields for each relationship, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `relation`.
* Format: `relation{tuple_delimiter}source_entity{tuple_delimiter}target_entity{tuple_delimiter}relationship_keywords{tuple_delimiter}relationship_description`
Use Title Case for names. Separate keywords with commas. Output entities first, then relations. End with {completion_delimiter}.
3. **Delimiter Usage Protocol:**
* The `{tuple_delimiter}` is a complete, atomic marker and **must not be filled with content**. It serves strictly as a field separator.
* **Incorrect Example:** `entity{tuple_delimiter}Tokyo<|location|>Tokyo is the capital of Japan.`
* **Correct Example:** `entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the capital of Japan.`
---Entity Extraction---
Extract BOTH concrete and abstract entities:
- **Concrete:** Named people, organizations, places, products, dates
- **Abstract:** Concepts, events, categories, processes mentioned in text (e.g., "market selloff", "merger", "pandemic")
4. **Relationship Direction & Duplication:**
* Treat all relationships as **undirected** unless explicitly stated otherwise. Swapping the source and target entities for an undirected relationship does not constitute a new relationship.
* Avoid outputting duplicate relationships.
Types: `{entity_types}` (use `Other` if none fit)
5. **Output Order & Prioritization:**
* Output all extracted entities first, followed by all extracted relationships.
* Within the list of relationships, prioritize and output those relationships that are **most significant** to the core meaning of the input text first.
---Relationship Extraction---
Extract meaningful relationships:
- **Direct:** explicit interactions, actions, connections
- **Categorical:** entities sharing group membership or classification
- **Causal:** cause-effect relationships
- **Hierarchical:** part-of, member-of, type-of
6. **Context & Objectivity:**
* Ensure all entity names and descriptions are written in the **third person**.
* Explicitly name the subject or object; **avoid using pronouns** such as `this article`, `this paper`, `our company`, `I`, `you`, and `he/she`.
Create intermediate concept entities when they help connect related items (e.g., "Vaccines" connecting Pfizer/Moderna/AstraZeneca).
7. **Language & Proper Nouns:**
* The entire output (entity names, keywords, and descriptions) must be written in `{language}`.
* Proper nouns (e.g., personal names, place names, organization names) should be retained in their original language if a proper, widely accepted translation is not available or would cause ambiguity.
For N-ary relationships, decompose into binary pairs. Avoid duplicates.
---Guidelines---
- Third person only; no pronouns like "this article", "I", "you"
- Output in `{language}`. Keep proper nouns in original language.
8. **Completion Signal:** Output the literal string `{completion_delimiter}` only after all entities and relationships, following all criteria, have been completely extracted and outputted.
---Examples---
{examples}
"""
---Input---
Entity_types: [{entity_types}]
Text:
PROMPTS["entity_extraction_user_prompt"] = """---Task---
Extract entities and relationships from the input text in the Data to be Processed section below.
---Instructions---
1. **Strict Adherence to Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system prompt.
2. **Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list.
3. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant entities and relationships have been extracted and presented.
4. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated.
---Data to be Processed---
<Entity_types>
[{entity_types}]
<Input Text>
```
{input_text}
```
"""
PROMPTS['entity_extraction_user_prompt'] = """---Task---
Extract entities and relationships from the text. Include both concrete entities AND abstract concepts/events.
Follow format exactly. Output only extractions; no explanations. End with `{completion_delimiter}`.
Output in {language}; keep proper nouns in original language.
<Output>
"""
PROMPTS['entity_continue_extraction_user_prompt'] = """---Task---
Review extraction for missed entities/relationships.
PROMPTS["entity_continue_extraction_user_prompt"] = """---Task---
Based on the last extraction task, identify and extract any **missed or incorrectly formatted** entities and relationships from the input text.
Check for:
1. Abstract concepts that could serve as hubs (events, categories, processes)
2. Orphan entities that need connections
3. Formatting errors
Only output NEW or CORRECTED items. End with `{completion_delimiter}`. Output in {language}.
---Instructions---
1. **Strict Adherence to System Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system instructions.
2. **Focus on Corrections/Additions:**
* **Do NOT** re-output entities and relationships that were **correctly and fully** extracted in the last task.
* If an entity or relationship was **missed** in the last task, extract and output it now according to the system format.
* If an entity or relationship was **truncated, had missing fields, or was otherwise incorrectly formatted** in the last task, re-output the *corrected and complete* version in the specified format.
3. **Output Format - Entities:** Output a total of 4 fields for each entity, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `entity`.
4. **Output Format - Relationships:** Output a total of 5 fields for each relationship, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `relation`.
5. **Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list.
6. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant missing or corrected entities and relationships have been extracted and presented.
7. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated.
<Output>
"""
PROMPTS['entity_extraction_examples'] = [
# Example 1: Shows abstract concept extraction (Market Selloff as hub)
"""<Input Text>
PROMPTS["entity_extraction_examples"] = [
"""<Entity_types>
["Person","Creature","Organization","Location","Event","Concept","Method","Content","Data","Artifact","NaturalObject"]
<Input Text>
```
Stock markets faced a sharp downturn as tech giants saw significant declines, with the global tech index dropping 3.4%. Nexon Technologies saw its stock plummet 7.8% after lower-than-expected earnings. In contrast, Omega Energy posted a 2.1% gain driven by rising oil prices.
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.
Gold futures rose 1.5% to $2,080/oz as investors sought safe-haven assets. The Federal Reserve's upcoming policy announcement is expected to influence market stability.
Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us."
The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.
It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
```
<Output>
entity{tuple_delimiter}Market Selloff{tuple_delimiter}event{tuple_delimiter}Significant decline in stock values due to investor concerns.
entity{tuple_delimiter}Global Tech Index{tuple_delimiter}category{tuple_delimiter}Tracks major tech stocks; dropped 3.4% today.
entity{tuple_delimiter}Nexon Technologies{tuple_delimiter}organization{tuple_delimiter}Tech company whose stock fell 7.8% after disappointing earnings.
entity{tuple_delimiter}Omega Energy{tuple_delimiter}organization{tuple_delimiter}Energy company that gained 2.1% due to rising oil prices.
entity{tuple_delimiter}Gold Futures{tuple_delimiter}product{tuple_delimiter}Rose 1.5% to $2,080/oz as safe-haven investment.
entity{tuple_delimiter}Federal Reserve{tuple_delimiter}organization{tuple_delimiter}Central bank whose policy may impact markets.
relation{tuple_delimiter}Global Tech Index{tuple_delimiter}Market Selloff{tuple_delimiter}market decline{tuple_delimiter}Tech index drop is part of broader selloff.
relation{tuple_delimiter}Nexon Technologies{tuple_delimiter}Market Selloff{tuple_delimiter}tech decline{tuple_delimiter}Nexon among hardest hit in selloff.
relation{tuple_delimiter}Omega Energy{tuple_delimiter}Market Selloff{tuple_delimiter}contrast, resilience{tuple_delimiter}Omega gained while broader market sold off.
relation{tuple_delimiter}Gold Futures{tuple_delimiter}Market Selloff{tuple_delimiter}safe-haven{tuple_delimiter}Gold rose as investors fled stocks.
relation{tuple_delimiter}Federal Reserve{tuple_delimiter}Market Selloff{tuple_delimiter}policy impact{tuple_delimiter}Fed policy expectations contributed to volatility.
entity{tuple_delimiter}Alex{tuple_delimiter}person{tuple_delimiter}Alex is a character who experiences frustration and is observant of the dynamics among other characters.
entity{tuple_delimiter}Taylor{tuple_delimiter}person{tuple_delimiter}Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective.
entity{tuple_delimiter}Jordan{tuple_delimiter}person{tuple_delimiter}Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device.
entity{tuple_delimiter}Cruz{tuple_delimiter}person{tuple_delimiter}Cruz is associated with a vision of control and order, influencing the dynamics among other characters.
entity{tuple_delimiter}The Device{tuple_delimiter}equipment{tuple_delimiter}The Device is central to the story, with potential game-changing implications, and is revered by Taylor.
relation{tuple_delimiter}Alex{tuple_delimiter}Taylor{tuple_delimiter}power dynamics, observation{tuple_delimiter}Alex observes Taylor's authoritarian behavior and notes changes in Taylor's attitude toward the device.
relation{tuple_delimiter}Alex{tuple_delimiter}Jordan{tuple_delimiter}shared goals, rebellion{tuple_delimiter}Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision.
relation{tuple_delimiter}Taylor{tuple_delimiter}Jordan{tuple_delimiter}conflict resolution, mutual respect{tuple_delimiter}Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce.
relation{tuple_delimiter}Jordan{tuple_delimiter}Cruz{tuple_delimiter}ideological conflict, rebellion{tuple_delimiter}Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order.
relation{tuple_delimiter}Taylor{tuple_delimiter}The Device{tuple_delimiter}reverence, technological significance{tuple_delimiter}Taylor shows reverence towards the device, indicating its importance and potential impact.
{completion_delimiter}
""",
# Example 2: Shows intermediate entity (Vaccines) connecting multiple orgs
"""<Input Text>
"""<Entity_types>
["Person","Creature","Organization","Location","Event","Concept","Method","Content","Data","Artifact","NaturalObject"]
<Input Text>
```
COVID-19 vaccines developed by Pfizer, Moderna, and AstraZeneca have shown high efficacy in preventing severe illness. The World Health Organization recommends vaccination for all eligible adults.
Stock markets faced a sharp downturn today as tech giants saw significant declines, with the global tech index dropping by 3.4% in midday trading. Analysts attribute the selloff to investor concerns over rising interest rates and regulatory uncertainty.
Among the hardest hit, nexon technologies saw its stock plummet by 7.8% after reporting lower-than-expected quarterly earnings. In contrast, Omega Energy posted a modest 2.1% gain, driven by rising oil prices.
Meanwhile, commodity markets reflected a mixed sentiment. Gold futures rose by 1.5%, reaching $2,080 per ounce, as investors sought safe-haven assets. Crude oil prices continued their rally, climbing to $87.60 per barrel, supported by supply constraints and strong demand.
Financial experts are closely watching the Federal Reserve's next move, as speculation grows over potential rate hikes. The upcoming policy announcement is expected to influence investor confidence and overall market stability.
```
<Output>
entity{tuple_delimiter}COVID-19{tuple_delimiter}concept{tuple_delimiter}Disease that vaccines are designed to prevent.
entity{tuple_delimiter}Vaccines{tuple_delimiter}product{tuple_delimiter}Medical products developed to prevent COVID-19.
entity{tuple_delimiter}Pfizer{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}Moderna{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}AstraZeneca{tuple_delimiter}organization{tuple_delimiter}Pharmaceutical company that developed a COVID-19 vaccine.
entity{tuple_delimiter}World Health Organization{tuple_delimiter}organization{tuple_delimiter}Global health body recommending vaccination.
relation{tuple_delimiter}Pfizer{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}Pfizer developed a COVID-19 vaccine.
relation{tuple_delimiter}Moderna{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}Moderna developed a COVID-19 vaccine.
relation{tuple_delimiter}AstraZeneca{tuple_delimiter}Vaccines{tuple_delimiter}development{tuple_delimiter}AstraZeneca developed a COVID-19 vaccine.
relation{tuple_delimiter}Vaccines{tuple_delimiter}COVID-19{tuple_delimiter}prevention{tuple_delimiter}Vaccines prevent severe COVID-19 illness.
relation{tuple_delimiter}World Health Organization{tuple_delimiter}Vaccines{tuple_delimiter}recommendation{tuple_delimiter}WHO recommends vaccination for adults.
entity{tuple_delimiter}Global Tech Index{tuple_delimiter}category{tuple_delimiter}The Global Tech Index tracks the performance of major technology stocks and experienced a 3.4% decline today.
entity{tuple_delimiter}Nexon Technologies{tuple_delimiter}organization{tuple_delimiter}Nexon Technologies is a tech company that saw its stock decline by 7.8% after disappointing earnings.
entity{tuple_delimiter}Omega Energy{tuple_delimiter}organization{tuple_delimiter}Omega Energy is an energy company that gained 2.1% in stock value due to rising oil prices.
entity{tuple_delimiter}Gold Futures{tuple_delimiter}product{tuple_delimiter}Gold futures rose by 1.5%, indicating increased investor interest in safe-haven assets.
entity{tuple_delimiter}Crude Oil{tuple_delimiter}product{tuple_delimiter}Crude oil prices rose to $87.60 per barrel due to supply constraints and strong demand.
entity{tuple_delimiter}Market Selloff{tuple_delimiter}category{tuple_delimiter}Market selloff refers to the significant decline in stock values due to investor concerns over interest rates and regulations.
entity{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}category{tuple_delimiter}The Federal Reserve's upcoming policy announcement is expected to impact investor confidence and market stability.
entity{tuple_delimiter}3.4% Decline{tuple_delimiter}category{tuple_delimiter}The Global Tech Index experienced a 3.4% decline in midday trading.
relation{tuple_delimiter}Global Tech Index{tuple_delimiter}Market Selloff{tuple_delimiter}market performance, investor sentiment{tuple_delimiter}The decline in the Global Tech Index is part of the broader market selloff driven by investor concerns.
relation{tuple_delimiter}Nexon Technologies{tuple_delimiter}Global Tech Index{tuple_delimiter}company impact, index movement{tuple_delimiter}Nexon Technologies' stock decline contributed to the overall drop in the Global Tech Index.
relation{tuple_delimiter}Gold Futures{tuple_delimiter}Market Selloff{tuple_delimiter}market reaction, safe-haven investment{tuple_delimiter}Gold prices rose as investors sought safe-haven assets during the market selloff.
relation{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}Market Selloff{tuple_delimiter}interest rate impact, financial regulation{tuple_delimiter}Speculation over Federal Reserve policy changes contributed to market volatility and investor selloff.
{completion_delimiter}
""",
# Example 3: Short legal example with hub entity (Merger)
"""<Input Text>
"""<Entity_types>
["Person","Creature","Organization","Location","Event","Concept","Method","Content","Data","Artifact","NaturalObject"]
<Input Text>
```
The merger between Acme Corp and Beta Industries requires Federal Trade Commission approval due to antitrust concerns.
At the World Athletics Championship in Tokyo, Noah Carter broke the 100m sprint record using cutting-edge carbon-fiber spikes.
```
<Output>
entity{tuple_delimiter}Merger{tuple_delimiter}event{tuple_delimiter}Proposed business combination between Acme Corp and Beta Industries.
entity{tuple_delimiter}Acme Corp{tuple_delimiter}organization{tuple_delimiter}Company involved in proposed merger.
entity{tuple_delimiter}Beta Industries{tuple_delimiter}organization{tuple_delimiter}Company involved in proposed merger.
entity{tuple_delimiter}Federal Trade Commission{tuple_delimiter}organization{tuple_delimiter}Regulatory body that must approve the merger.
relation{tuple_delimiter}Acme Corp{tuple_delimiter}Merger{tuple_delimiter}party to{tuple_delimiter}Acme Corp is party to the merger.
relation{tuple_delimiter}Beta Industries{tuple_delimiter}Merger{tuple_delimiter}party to{tuple_delimiter}Beta Industries is party to the merger.
relation{tuple_delimiter}Federal Trade Commission{tuple_delimiter}Merger{tuple_delimiter}regulatory approval{tuple_delimiter}FTC must approve the merger.
entity{tuple_delimiter}World Athletics Championship{tuple_delimiter}event{tuple_delimiter}The World Athletics Championship is a global sports competition featuring top athletes in track and field.
entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the host city of the World Athletics Championship.
entity{tuple_delimiter}Noah Carter{tuple_delimiter}person{tuple_delimiter}Noah Carter is a sprinter who set a new record in the 100m sprint at the World Athletics Championship.
entity{tuple_delimiter}100m Sprint Record{tuple_delimiter}category{tuple_delimiter}The 100m sprint record is a benchmark in athletics, recently broken by Noah Carter.
entity{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}equipment{tuple_delimiter}Carbon-fiber spikes are advanced sprinting shoes that provide enhanced speed and traction.
entity{tuple_delimiter}World Athletics Federation{tuple_delimiter}organization{tuple_delimiter}The World Athletics Federation is the governing body overseeing the World Athletics Championship and record validations.
relation{tuple_delimiter}World Athletics Championship{tuple_delimiter}Tokyo{tuple_delimiter}event location, international competition{tuple_delimiter}The World Athletics Championship is being hosted in Tokyo.
relation{tuple_delimiter}Noah Carter{tuple_delimiter}100m Sprint Record{tuple_delimiter}athlete achievement, record-breaking{tuple_delimiter}Noah Carter set a new 100m sprint record at the championship.
relation{tuple_delimiter}Noah Carter{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}athletic equipment, performance boost{tuple_delimiter}Noah Carter used carbon-fiber spikes to enhance performance during the race.
relation{tuple_delimiter}Noah Carter{tuple_delimiter}World Athletics Championship{tuple_delimiter}athlete participation, competition{tuple_delimiter}Noah Carter is competing at the World Athletics Championship.
{completion_delimiter}
""",
]
PROMPTS['summarize_entity_descriptions'] = """---Task---
Merge multiple descriptions of "{description_name}" ({description_type}) into one comprehensive summary.
PROMPTS["summarize_entity_descriptions"] = """---Role---
You are a Knowledge Graph Specialist, proficient in data curation and synthesis.
Rules:
- Plain text output only, no formatting or extra text
- Include ALL key facts from every description
- Third person, mention entity name at start
- Max {summary_length} tokens
- Output in {language}; keep proper nouns in original language
- If descriptions conflict: reconcile or note uncertainty
---Task---
Your task is to synthesize a list of descriptions of a given entity or relation into a single, comprehensive, and cohesive summary.
---Instructions---
1. Input Format: The description list is provided in JSON format. Each JSON object (representing a single description) appears on a new line within the `Description List` section.
2. Output Format: The merged description will be returned as plain text, presented in multiple paragraphs, without any additional formatting or extraneous comments before or after the summary.
3. Comprehensiveness: The summary must integrate all key information from *every* provided description. Do not omit any important facts or details.
4. Context: Ensure the summary is written from an objective, third-person perspective; explicitly mention the name of the entity or relation for full clarity and context.
5. Context & Objectivity:
- Write the summary from an objective, third-person perspective.
- Explicitly mention the full name of the entity or relation at the beginning of the summary to ensure immediate clarity and context.
6. Conflict Handling:
- In cases of conflicting or inconsistent descriptions, first determine if these conflicts arise from multiple, distinct entities or relationships that share the same name.
- If distinct entities/relations are identified, summarize each one *separately* within the overall output.
- If conflicts within a single entity/relation (e.g., historical discrepancies) exist, attempt to reconcile them or present both viewpoints with noted uncertainty.
7. Length Constraint: The summary's total length must not exceed {summary_length} tokens, while still maintaining depth and completeness.
8. Language: The entire output must be written in {language}. Proper nouns (e.g., personal names, place names, organization names) may remain in their original language if a proper translation is not available.
- The entire output must be written in {language}.
- Proper nouns (e.g., personal names, place names, organization names) should be retained in their original language if a proper, widely accepted translation is not available or would cause ambiguity.
---Input---
{description_type} Name: {description_name}
Description List:
Descriptions:
```
{description_list}
```
Output:"""
PROMPTS['fail_response'] = "Sorry, I'm not able to provide an answer to that question.[no-context]"
# Default RAG response prompt - cite-ready (no LLM-generated citations)
# Citations are added by post-processing. This gives cleaner, more accurate results.
# Optimized via DSPy/RAGAS testing - qtype variant achieved 0.887 relevance, 0.996 faithfulness
PROMPTS['rag_response'] = """Context:
{context_data}
STRICT GROUNDING RULES (MUST FOLLOW):
- ONLY state information that appears EXPLICITLY in the Context above
- NEVER add specific numbers, percentages, dates, or quantities unless they appear VERBATIM in the Context
- NEVER reference documents, meetings, or sources not mentioned in the Context
- NEVER elaborate, interpret, or infer beyond what the text actually says
- If information is missing, state: "not specified in the available context"
- Each sentence must be directly traceable to a specific passage in the Context
{coverage_guidance}
Format Guidelines:
- Use the exact terminology from the question in your response
- The first sentence must directly answer the question
- Response type: {response_type}
- If enumerated items are requested, present as (1), (2), (3)...
- Do not include citation markers; they will be added automatically
Question: {user_prompt}
Answer:"""
# Coverage guidance templates (injected based on context sparsity detection)
PROMPTS['coverage_guidance_limited'] = """
CONTEXT NOTICE: The retrieved information for this topic is LIMITED.
- Only state facts that appear explicitly in the context below
- If key aspects of the question aren't covered, acknowledge: "The available information does not specify [aspect]"
- Avoid inferring or generalizing beyond what's stated
---Output---
"""
PROMPTS['coverage_guidance_good'] = '' # Empty for well-covered topics
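The limited-coverage notice is injected only when retrieval looks sparse; otherwise the empty string is used. A minimal sketch of that selection (the threshold and helper name are illustrative; the real detection lives in operate.py):

```python
def select_coverage_guidance(num_retrieved_chunks: int, min_chunks: int = 3) -> str:
    # Sparse context -> inject the explicit "limited information" notice.
    if num_retrieved_chunks < min_chunks:
        return PROMPTS["coverage_guidance_limited"]
    return PROMPTS["coverage_guidance_good"]
```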
PROMPTS["fail_response"] = (
"Sorry, I'm not able to provide an answer to that question.[no-context]"
)
# Strict mode suffix - append when response_type="strict"
PROMPTS['rag_response_strict_suffix'] = """
STRICT GROUNDING:
- NEVER state specific numbers/dates unless they appear EXACTLY in context
- If information isn't in context, say "not specified in available information"
- Entity summaries for overview, Source Excerpts for precision
"""
PROMPTS["rag_response"] = """---Role---
# Default naive RAG response prompt - cite-ready (no LLM-generated citations)
# Enhanced with strict grounding rules to prevent hallucination
PROMPTS['naive_rag_response'] = """---Task---
Answer the query using ONLY information present in the provided context. Do NOT add any external knowledge, assumptions, or inference beyond the exact wording.
You are an expert AI assistant specializing in synthesizing information from a provided knowledge base. Your primary function is to answer user queries accurately by ONLY using the information within the provided **Context**.
STRICT GUIDELINES
- Every sentence must be a verbatim fact or a direct logical consequence that can be explicitly traced to a specific chunk of the context.
- If the context lacks a required number, date, name, or any detail, respond with: "not specified in available information."
- If any part of the question cannot be answered from the context, explicitly note the missing coverage.
- Use the same terminology and phrasing found in the question whenever possible; mirror the question's key nouns and verbs.
- When the answer contains multiple items, present them as a concise list.
---Goal---
FORMAT
- Match the language of the question.
- Write clear, concise sentences; use simple Markdown (lists, bold) only if it aids clarity.
- Do not include a References section; it will be generated automatically.
- Response type: {response_type}
{coverage_guidance}
Generate a comprehensive, well-structured answer to the user query.
The answer must integrate relevant facts from the Knowledge Graph and Document Chunks found in the **Context**.
Consider the conversation history if provided to maintain conversational flow and avoid repeating information.
---Instructions---
1. Step-by-Step Instruction:
- Carefully determine the user's query intent in the context of the conversation history to fully understand the user's information need.
- Scrutinize both `Knowledge Graph Data` and `Document Chunks` in the **Context**. Identify and extract all pieces of information that are directly relevant to answering the user query.
- Weave the extracted facts into a coherent and logical response. Your own knowledge must ONLY be used to formulate fluent sentences and connect ideas, NOT to introduce any external information.
- Track the reference_id of the document chunks which directly support the facts presented in the response. Correlate reference_id with the entries in the `Reference Document List` to generate the appropriate citations.
- Generate a references section at the end of the response. Each reference document must directly support the facts presented in the response.
- Do not generate anything after the reference section.
2. Content & Grounding:
- Strictly adhere to the provided context from the **Context**; DO NOT invent, assume, or infer any information not explicitly stated.
- If the answer cannot be found in the **Context**, state that you do not have enough information to answer. Do not attempt to guess.
3. Formatting & Language:
- The response MUST be in the same language as the user query.
- The response MUST utilize Markdown formatting for enhanced clarity and structure (e.g., headings, bold text, bullet points).
- The response should be presented in {response_type}.
4. References Section Format:
- The References section should be under heading: `### References`
- Reference list entries should adhere to the format: `* [n] Document Title`. Do not include a caret (`^`) after opening square bracket (`[`).
- The Document Title in the citation must retain its original language.
- Output each citation on an individual line.
- Provide a maximum of 5 most relevant citations.
- Do not generate a footnotes section or any comment, summary, or explanation after the references.
5. Reference Section Example:
```
### References
- [1] Document Title One
- [2] Document Title Two
- [3] Document Title Three
```
6. Additional Instructions: {user_prompt}
Question: {user_prompt}
---Context---
{context_data}
"""
PROMPTS["naive_rag_response"] = """---Role---
You are an expert AI assistant specializing in synthesizing information from a provided knowledge base. Your primary function is to answer user queries accurately by ONLY using the information within the provided **Context**.
---Goal---
Generate a comprehensive, well-structured answer to the user query.
The answer must integrate relevant facts from the Document Chunks found in the **Context**.
Consider the conversation history if provided to maintain conversational flow and avoid repeating information.
---Instructions---
1. Step-by-Step Instruction:
- Carefully determine the user's query intent in the context of the conversation history to fully understand the user's information need.
- Scrutinize `Document Chunks` in the **Context**. Identify and extract all pieces of information that are directly relevant to answering the user query.
- Weave the extracted facts into a coherent and logical response. Your own knowledge must ONLY be used to formulate fluent sentences and connect ideas, NOT to introduce any external information.
- Track the reference_id of the document chunks which directly support the facts presented in the response. Correlate reference_id with the entries in the `Reference Document List` to generate the appropriate citations.
- Generate a **References** section at the end of the response. Each reference document must directly support the facts presented in the response.
- Do not generate anything after the reference section.
2. Content & Grounding:
- Strictly adhere to the provided context from the **Context**; DO NOT invent, assume, or infer any information not explicitly stated.
- If the answer cannot be found in the **Context**, state that you do not have enough information to answer. Do not attempt to guess.
3. Formatting & Language:
- The response MUST be in the same language as the user query.
- The response MUST utilize Markdown formatting for enhanced clarity and structure (e.g., headings, bold text, bullet points).
- The response should be presented in {response_type}.
4. References Section Format:
- The References section should be under heading: `### References`
- Reference list entries should adhere to the format: `* [n] Document Title`. Do not include a caret (`^`) after opening square bracket (`[`).
- The Document Title in the citation must retain its original language.
- Output each citation on an individual line.
- Provide a maximum of 5 most relevant citations.
- Do not generate a footnotes section or any comment, summary, or explanation after the references.
5. Reference Section Example:
```
### References
- [1] Document Title One
- [2] Document Title Two
- [3] Document Title Three
```
6. Additional Instructions: {user_prompt}
---Context---
{content_data}
"""
Answer:"""
PROMPTS['kg_query_context'] = """
## Entity Summaries (use for definitions and general facts)
PROMPTS["kg_query_context"] = """
Knowledge Graph Data (Entity):
```json
{entities_str}
```
## Relationships (use to explain connections between concepts)
Knowledge Graph Data (Relationship):
```json
{relations_str}
```
## Source Excerpts (use for specific facts, numbers, quotes)
```json
{text_chunks_str}
```
## References
{reference_list_str}
"""
PROMPTS['naive_query_context'] = """
Document Chunks (Each entry includes a reference_id that refers to the `Reference Document List`):
Document Chunks (Each entry has a reference_id that refers to the `Reference Document List`):
```json
{text_chunks_str}
@@ -268,75 +356,77 @@ Reference Document List (Each entry starts with a [reference_id] that correspond
"""
PROMPTS['keywords_extraction'] = """---Task---
Extract keywords from the query for RAG retrieval.
PROMPTS["naive_query_context"] = """
Document Chunks (Each entry has a reference_id that refers to the `Reference Document List`):
Output valid JSON (no markdown):
{{"high_level_keywords": [...], "low_level_keywords": [...]}}
```json
{text_chunks_str}
```
Guidelines:
- high_level: Topic categories, question types, abstract themes
- low_level: Specific terms from the query including:
* Named entities (people, organizations, places)
* Technical terms and key concepts
* Dates, years, and time periods (e.g., "2017", "Q3 2024")
* Document names, report titles, and identifiers
- Extract at least 1 keyword per category for meaningful queries
- Only return empty lists for nonsensical input (e.g., "asdfgh", "hello")
Reference Document List (Each entry starts with a [reference_id] that corresponds to entries in the Document Chunks):
```
{reference_list_str}
```
"""
PROMPTS["keywords_extraction"] = """---Role---
You are an expert keyword extractor, specializing in analyzing user queries for a Retrieval-Augmented Generation (RAG) system. Your purpose is to identify both high-level and low-level keywords in the user's query that will be used for effective document retrieval.
---Goal---
Given a user query, your task is to extract two distinct types of keywords:
1. **high_level_keywords**: for overarching concepts or themes, capturing user's core intent, the subject area, or the type of question being asked.
2. **low_level_keywords**: for specific entities or details, identifying the specific entities, proper nouns, technical jargon, product names, or concrete items.
---Instructions & Constraints---
1. **Output Format**: Your output MUST be a valid JSON object and nothing else. Do not include any explanatory text, markdown code fences (like ```json), or any other text before or after the JSON. It will be parsed directly by a JSON parser.
2. **Source of Truth**: All keywords must be explicitly derived from the user query, and both the high-level and low-level keyword categories are required to contain content.
3. **Concise & Meaningful**: Keywords should be concise words or meaningful phrases. Prioritize multi-word phrases when they represent a single concept. For example, from "latest financial report of Apple Inc.", you should extract "latest financial report" and "Apple Inc." rather than "latest", "financial", "report", and "Apple".
4. **Handle Edge Cases**: For queries that are too simple, vague, or nonsensical (e.g., "hello", "ok", "asdfghjkl"), you must return a JSON object with empty lists for both keyword types.
5. **Language**: All extracted keywords MUST be in {language}. Proper nouns (e.g., personal names, place names, organization names) should be kept in their original language.
---Examples---
{examples}
---Query---
{query}
---Real Data---
User Query: {query}
---Output---
Output:"""
PROMPTS['keywords_extraction_examples'] = [
"""Query: "What is the capital of France?"
Output: {{"high_level_keywords": ["Geography", "Capital city"], "low_level_keywords": ["France"]}}
PROMPTS["keywords_extraction_examples"] = [
"""Example 1:
Query: "How does international trade influence global economic stability?"
Output:
{
"high_level_keywords": ["International trade", "Global economic stability", "Economic impact"],
"low_level_keywords": ["Trade agreements", "Tariffs", "Currency exchange", "Imports", "Exports"]
}
""",
"""Query: "Why does inflation affect interest rates?"
Output: {{"high_level_keywords": ["Economics", "Cause-effect"], "low_level_keywords": ["inflation", "interest rates"]}}
"""Example 2:
Query: "What are the environmental consequences of deforestation on biodiversity?"
Output:
{
"high_level_keywords": ["Environmental consequences", "Deforestation", "Biodiversity loss"],
"low_level_keywords": ["Species extinction", "Habitat destruction", "Carbon emissions", "Rainforest", "Ecosystem"]
}
""",
"""Query: "How does Python compare to JavaScript for web development?"
Output: {{"high_level_keywords": ["Programming languages", "Comparison"], "low_level_keywords": ["Python", "JavaScript"]}}
"""Example 3:
Query: "What is the role of education in reducing poverty?"
Output:
{
"high_level_keywords": ["Education", "Poverty reduction", "Socioeconomic development"],
"low_level_keywords": ["School access", "Literacy rates", "Job training", "Income inequality"]
}
""",
]
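Per the commit message, the keywords-extraction cache key now includes the language parameter. A hedged sketch of parsing the JSON reply and building such a key; the function names and key layout are illustrative, and the real logic lives in operate.py:

```python
import hashlib
import json

def parse_keywords(llm_output: str) -> tuple[list[str], list[str]]:
    data = json.loads(llm_output.strip())
    return data.get("high_level_keywords", []), data.get("low_level_keywords", [])

def keywords_cache_key(query: str, language: str) -> str:
    # Including the language prevents reusing keywords cached for a different output language.
    digest = hashlib.md5(f"{language}:{query}".encode("utf-8")).hexdigest()
    return f"keywords:{digest}"
```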
PROMPTS['orphan_connection_validation'] = """---Task---
Evaluate if a meaningful relationship exists between two entities.
Orphan: {orphan_name} ({orphan_type}) - {orphan_description}
Candidate: {candidate_name} ({candidate_type}) - {candidate_description}
Similarity: {similarity_score}
Valid relationship types:
- Direct: One uses/creates/owns the other
- Industry: Both operate in same sector (finance, tech, healthcare)
- Competitive: Direct competitors or alternatives
- Temporal: Versions, successors, or historical connections
- Dependency: One relies on/runs on the other
Confidence levels (use these exact labels):
- HIGH: Direct/explicit relationship (Django is Python framework, iOS is Apple product)
- MEDIUM: Strong implicit or industry relationship (Netflix runs on AWS, Bitcoin and Visa both in payments)
- LOW: Very weak, tenuous connection
- NONE: No logical relationship
Output valid JSON:
{{"should_connect": bool, "confidence": "HIGH"|"MEDIUM"|"LOW"|"NONE", "relationship_type": str|null, "relationship_keywords": str|null, "relationship_description": str|null, "reasoning": str}}
Rules:
- HIGH/MEDIUM: should_connect=true (same industry = MEDIUM)
- LOW/NONE: should_connect=false
- High similarity alone is NOT sufficient
- Explain the specific relationship in reasoning
Example: Python → Django
{{"should_connect": true, "confidence": "HIGH", "relationship_type": "direct", "relationship_keywords": "framework, built-with", "relationship_description": "Django is a web framework written in Python", "reasoning": "Direct explicit relationship - Django is implemented in Python"}}
Example: Mozart → Docker
{{"should_connect": false, "confidence": "NONE", "relationship_type": null, "relationship_keywords": null, "relationship_description": null, "reasoning": "No logical connection between classical composer and container technology"}}
Output:"""


@@ -1,211 +1,577 @@
"""
Local reranker using sentence-transformers CrossEncoder.
Uses cross-encoder/ms-marco-MiniLM-L-6-v2 by default - a 22M param model with
excellent accuracy and clean score separation (-11 to +10 range).
Runs entirely locally without API calls.
"""
from __future__ import annotations
import os
from collections.abc import Awaitable, Callable, Sequence
from typing import Protocol, SupportsFloat, TypedDict, runtime_checkable
import aiohttp
from typing import Any, List, Dict, Optional, Tuple
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
)
from .utils import logger
# Global model cache to avoid reloading on every call
_reranker_model: RerankerModel | None = None
_reranker_model_name: str | None = None
from dotenv import load_dotenv
# Default model - mxbai-rerank-xsmall-v1 performs best on domain-specific content
# Used for ordering only (no score filtering) - see constants.py DEFAULT_MIN_RERANK_SCORE
DEFAULT_RERANK_MODEL = 'mixedbread-ai/mxbai-rerank-xsmall-v1'
# Use the .env that is inside the current folder
# Allows using a different .env file for each LightRAG instance
# The OS environment variables take precedence over the .env file
load_dotenv(dotenv_path=".env", override=False)
class RerankResult(TypedDict):
index: int
relevance_score: float
@runtime_checkable
class SupportsToList(Protocol):
def tolist(self) -> list[float]: ...
ScoreLike = Sequence[SupportsFloat] | SupportsToList
@runtime_checkable
class RerankerModel(Protocol):
def predict(
self,
sentences: list[list[str]],
batch_size: int = ...,
) -> ScoreLike: ...
def get_reranker_model(model_name: str | None = None):
def chunk_documents_for_rerank(
documents: List[str],
max_tokens: int = 480,
overlap_tokens: int = 32,
tokenizer_model: str = "gpt-4o-mini",
) -> Tuple[List[str], List[int]]:
"""
Get or initialize the reranker model (cached).
Chunk documents that exceed token limit for reranking.
Args:
model_name: HuggingFace model name. Defaults to mxbai-rerank-xsmall-v1
documents: List of document strings to chunk
max_tokens: Maximum tokens per chunk (default 480 to leave margin for 512 limit)
overlap_tokens: Number of tokens to overlap between chunks
tokenizer_model: Model name for tiktoken tokenizer
Returns:
CrossEncoder-like model instance implementing predict(pairs)->list[float]
Tuple of (chunked_documents, original_doc_indices)
- chunked_documents: List of document chunks (may be more than input)
- original_doc_indices: Maps each chunk back to its original document index
"""
global _reranker_model, _reranker_model_name
if model_name is None:
model_name = os.getenv('RERANK_MODEL', DEFAULT_RERANK_MODEL)
# Return cached model if same name
if _reranker_model is not None and _reranker_model_name == model_name:
return _reranker_model
# Clamp overlap_tokens to ensure the loop always advances
# If overlap_tokens >= max_tokens, the chunking loop would hang
if overlap_tokens >= max_tokens:
original_overlap = overlap_tokens
# Ensure overlap is at least 1 token less than max to guarantee progress
# For very small max_tokens (e.g., 1), set overlap to 0
overlap_tokens = max(0, max_tokens - 1)
logger.warning(
f"overlap_tokens ({original_overlap}) must be less than max_tokens ({max_tokens}). "
f"Clamping to {overlap_tokens} to prevent infinite loop."
)
try:
from sentence_transformers import CrossEncoder
from .utils import TiktokenTokenizer
logger.info(f'Loading reranker model: {model_name}')
_reranker_model = CrossEncoder(model_name, trust_remote_code=True)
_reranker_model_name = model_name
logger.info(f'Reranker model loaded: {model_name}')
return _reranker_model
except ImportError as err:
raise ImportError(
'sentence-transformers is required for local reranking. Install with: pip install sentence-transformers'
) from err
tokenizer = TiktokenTokenizer(model_name=tokenizer_model)
except Exception as e:
logger.error(f'Failed to load reranker model {model_name}: {e}')
raise
logger.warning(
f"Failed to initialize tokenizer: {e}. Using character-based approximation."
)
# Fallback: approximate 1 token ≈ 4 characters
max_chars = max_tokens * 4
overlap_chars = overlap_tokens * 4
chunked_docs = []
doc_indices = []
for idx, doc in enumerate(documents):
if len(doc) <= max_chars:
chunked_docs.append(doc)
doc_indices.append(idx)
else:
# Split into overlapping chunks
start = 0
while start < len(doc):
end = min(start + max_chars, len(doc))
chunk = doc[start:end]
chunked_docs.append(chunk)
doc_indices.append(idx)
if end >= len(doc):
break
start = end - overlap_chars
return chunked_docs, doc_indices
# Use tokenizer for accurate chunking
chunked_docs = []
doc_indices = []
for idx, doc in enumerate(documents):
tokens = tokenizer.encode(doc)
if len(tokens) <= max_tokens:
# Document fits in one chunk
chunked_docs.append(doc)
doc_indices.append(idx)
else:
# Split into overlapping chunks
start = 0
while start < len(tokens):
end = min(start + max_tokens, len(tokens))
chunk_tokens = tokens[start:end]
chunk_text = tokenizer.decode(chunk_tokens)
chunked_docs.append(chunk_text)
doc_indices.append(idx)
if end >= len(tokens):
break
start = end - overlap_tokens
return chunked_docs, doc_indices
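A short usage sketch of the chunking helper (sample documents are illustrative): long documents are split into overlapping chunks while short ones pass through unchanged, and `doc_indices` maps every chunk back to its source document for later aggregation.

```python
docs = ["short document", "very long document ... " * 500]
chunks, doc_indices = chunk_documents_for_rerank(docs, max_tokens=480, overlap_tokens=32)

assert len(chunks) > len(docs)      # the long document was split into multiple chunks
assert doc_indices[0] == 0          # the first chunk comes from docs[0]
assert set(doc_indices) == {0, 1}   # every chunk maps back to an original document
```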
async def local_rerank(
query: str,
documents: list[str],
top_n: int | None = None,
model_name: str | None = None,
) -> list[RerankResult]:
def aggregate_chunk_scores(
chunk_results: List[Dict[str, Any]],
doc_indices: List[int],
num_original_docs: int,
aggregation: str = "max",
) -> List[Dict[str, Any]]:
"""
Rerank documents using local CrossEncoder model.
Aggregate rerank scores from document chunks back to original documents.
Args:
chunk_results: Rerank results for chunks [{"index": chunk_idx, "relevance_score": score}, ...]
doc_indices: Maps each chunk index to original document index
num_original_docs: Total number of original documents
aggregation: Strategy for aggregating scores ("max", "mean", "first")
Returns:
List of results for original documents [{"index": doc_idx, "relevance_score": score}, ...]
"""
# Group scores by original document index
doc_scores: Dict[int, List[float]] = {i: [] for i in range(num_original_docs)}
for result in chunk_results:
chunk_idx = result["index"]
score = result["relevance_score"]
if 0 <= chunk_idx < len(doc_indices):
original_doc_idx = doc_indices[chunk_idx]
doc_scores[original_doc_idx].append(score)
# Aggregate scores
aggregated_results = []
for doc_idx, scores in doc_scores.items():
if not scores:
continue
if aggregation == "max":
final_score = max(scores)
elif aggregation == "mean":
final_score = sum(scores) / len(scores)
elif aggregation == "first":
final_score = scores[0]
else:
logger.warning(f"Unknown aggregation strategy: {aggregation}, using max")
final_score = max(scores)
aggregated_results.append(
{
"index": doc_idx,
"relevance_score": final_score,
}
)
# Sort by relevance score (descending)
aggregated_results.sort(key=lambda x: x["relevance_score"], reverse=True)
return aggregated_results
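A usage sketch of the aggregation step (scores are illustrative): with the default "max" strategy, each document's best chunk score becomes its document score, and results come back sorted by score.

```python
chunk_results = [
    {"index": 0, "relevance_score": 0.31},  # chunk 0 -> document 0
    {"index": 1, "relevance_score": 0.92},  # chunk 1 -> document 1
    {"index": 2, "relevance_score": 0.47},  # chunk 2 -> document 1
]
doc_indices = [0, 1, 1]

doc_results = aggregate_chunk_scores(chunk_results, doc_indices, num_original_docs=2)
# -> [{"index": 1, "relevance_score": 0.92}, {"index": 0, "relevance_score": 0.31}]
```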
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=60),
retry=(
retry_if_exception_type(aiohttp.ClientError)
| retry_if_exception_type(aiohttp.ClientResponseError)
),
)
async def generic_rerank_api(
query: str,
documents: List[str],
model: str,
base_url: str,
api_key: Optional[str],
top_n: Optional[int] = None,
return_documents: Optional[bool] = None,
extra_body: Optional[Dict[str, Any]] = None,
response_format: str = "standard", # "standard" (Jina/Cohere) or "aliyun"
request_format: str = "standard", # "standard" (Jina/Cohere) or "aliyun"
enable_chunking: bool = False,
max_tokens_per_doc: int = 480,
) -> List[Dict[str, Any]]:
"""
Generic rerank API call for Jina/Cohere/Aliyun models.
Args:
query: The search query
documents: List of document strings to rerank
top_n: Number of top results to return (None = all)
model_name: HuggingFace model name (default: mxbai-rerank-xsmall-v1)
documents: List of strings to rerank
model: Model name to use
base_url: API endpoint URL
api_key: API key for authentication
top_n: Number of top results to return
return_documents: Whether to return document text (Jina only)
extra_body: Additional body parameters
response_format: Response format type ("standard" for Jina/Cohere, "aliyun" for Aliyun)
request_format: Request format type
enable_chunking: Whether to chunk documents exceeding token limit
max_tokens_per_doc: Maximum tokens per document for chunking
Returns:
List of dicts with 'index' and 'relevance_score', sorted by score descending
Example:
>>> results = await local_rerank(
... query="What is machine learning?",
... documents=["ML is a subset of AI...", "The weather is nice..."],
... top_n=5
... )
>>> print(results[0])
{'index': 0, 'relevance_score': 0.95}
List of dictionaries with keys "index" (int) and "relevance_score" (float)
"""
if not documents:
return []
if not base_url:
raise ValueError("Base URL is required")
model = get_reranker_model(model_name)
headers = {"Content-Type": "application/json"}
if api_key is not None:
headers["Authorization"] = f"Bearer {api_key}"
# Create query-document pairs
pairs = [[query, doc] for doc in documents]
# Handle document chunking if enabled
original_documents = documents
doc_indices = None
original_top_n = top_n # Save original top_n for post-aggregation limiting
# Get scores from model
# CrossEncoder.predict returns a list[float]; guard None for type checkers
if model is None:
raise RuntimeError('Reranker model failed to load')
raw_scores = model.predict(pairs)
if enable_chunking:
documents, doc_indices = chunk_documents_for_rerank(
documents, max_tokens=max_tokens_per_doc
)
logger.debug(
f"Chunked {len(original_documents)} documents into {len(documents)} chunks"
)
# When chunking is enabled, disable top_n at API level to get all chunk scores
# This ensures proper document-level coverage after aggregation
# We'll apply top_n to aggregated document results instead
if top_n is not None:
logger.debug(
f"Chunking enabled: disabled API-level top_n={top_n} to ensure complete document coverage"
)
top_n = None
# Normalize to a list[float] regardless of backend (list, numpy array, tensor)
if isinstance(raw_scores, SupportsToList):
raw_scores = raw_scores.tolist()
# Build request payload based on request format
if request_format == "aliyun":
# Aliyun format: nested input/parameters structure
payload = {
"model": model,
"input": {
"query": query,
"documents": documents,
},
"parameters": {},
}
scores = [float(score) for score in raw_scores]
# Add optional parameters to parameters object
if top_n is not None:
payload["parameters"]["top_n"] = top_n
# Build results with index and score
results: list[RerankResult] = [
RerankResult(index=i, relevance_score=float(score)) for i, score in enumerate(scores)
]
if return_documents is not None:
payload["parameters"]["return_documents"] = return_documents
# Sort by score descending
results.sort(key=lambda x: x['relevance_score'], reverse=True)
# Add extra parameters to parameters object
if extra_body:
payload["parameters"].update(extra_body)
else:
# Standard format for Jina/Cohere/OpenAI
payload = {
"model": model,
"query": query,
"documents": documents,
}
# Apply top_n limit if specified
if top_n is not None and top_n < len(results):
results = results[:top_n]
# Add optional parameters
if top_n is not None:
payload["top_n"] = top_n
return results
# Only Jina API supports return_documents parameter
if return_documents is not None and response_format in ("standard",):
payload["return_documents"] = return_documents
# Add extra parameters
if extra_body:
payload.update(extra_body)
logger.debug(
f"Rerank request: {len(documents)} documents, model: {model}, format: {response_format}"
)
async with aiohttp.ClientSession() as session:
async with session.post(base_url, headers=headers, json=payload) as response:
if response.status != 200:
error_text = await response.text()
content_type = response.headers.get("content-type", "").lower()
is_html_error = (
error_text.strip().startswith("<!DOCTYPE html>")
or "text/html" in content_type
)
if is_html_error:
if response.status == 502:
clean_error = "Bad Gateway (502) - Rerank service temporarily unavailable. Please try again in a few minutes."
elif response.status == 503:
clean_error = "Service Unavailable (503) - Rerank service is temporarily overloaded. Please try again later."
elif response.status == 504:
clean_error = "Gateway Timeout (504) - Rerank service request timed out. Please try again."
else:
clean_error = f"HTTP {response.status} - Rerank service error. Please try again later."
else:
clean_error = error_text
logger.error(f"Rerank API error {response.status}: {clean_error}")
raise aiohttp.ClientResponseError(
request_info=response.request_info,
history=response.history,
status=response.status,
message=f"Rerank API error: {clean_error}",
)
response_json = await response.json()
if response_format == "aliyun":
# Aliyun format: {"output": {"results": [...]}}
results = response_json.get("output", {}).get("results", [])
if not isinstance(results, list):
logger.warning(
f"Expected 'output.results' to be list, got {type(results)}: {results}"
)
results = []
elif response_format == "standard":
# Standard format: {"results": [...]}
results = response_json.get("results", [])
if not isinstance(results, list):
logger.warning(
f"Expected 'results' to be list, got {type(results)}: {results}"
)
results = []
else:
raise ValueError(f"Unsupported response format: {response_format}")
if not results:
logger.warning("Rerank API returned empty results")
return []
# Standardize return format
standardized_results = [
{"index": result["index"], "relevance_score": result["relevance_score"]}
for result in results
]
# Aggregate chunk scores back to original documents if chunking was enabled
if enable_chunking and doc_indices:
standardized_results = aggregate_chunk_scores(
standardized_results,
doc_indices,
len(original_documents),
aggregation="max",
)
# Apply original top_n limit at document level (post-aggregation)
# This preserves document-level semantics: top_n limits documents, not chunks
if (
original_top_n is not None
and len(standardized_results) > original_top_n
):
standardized_results = standardized_results[:original_top_n]
return standardized_results
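

# Minimal, self-contained sketch of the document-level aggregation semantics relied on
# above (illustration only; the real chunk_documents_for_rerank / aggregate_chunk_scores
# helpers are defined elsewhere in this module). doc_indices follows the convention used
# in the calls above: doc_indices[i] is the index of the original document that chunk i
# came from. With "max" aggregation, each document keeps its best chunk score, and top_n
# is then applied to these document-level results rather than to individual chunks, e.g.
# _sketch_max_chunk_aggregation(chunk_scores, [0, 0, 1], num_docs=2, top_n=1).
def _sketch_max_chunk_aggregation(
    chunk_results: list[dict],
    doc_indices: list[int],
    num_docs: int,
    top_n: int | None = None,
) -> list[dict]:
    best_scores: dict[int, float] = {}
    for item in chunk_results:
        doc_idx = doc_indices[int(item["index"])]
        score = float(item["relevance_score"])
        if doc_idx not in best_scores or score > best_scores[doc_idx]:
            best_scores[doc_idx] = score
    aggregated = [
        {"index": doc_idx, "relevance_score": best_scores[doc_idx]}
        for doc_idx in range(num_docs)
        if doc_idx in best_scores
    ]
    aggregated.sort(key=lambda x: x["relevance_score"], reverse=True)
    return aggregated if top_n is None else aggregated[:top_n]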
def create_local_rerank_func(
    model_name: str | None = None,
) -> Callable[..., Awaitable[list[RerankResult]]]:
    """
    Create a rerank function with pre-configured model.

    This is used by lightrag_server to create a rerank function
    that can be passed to LightRAG initialization.

    Args:
        model_name: HuggingFace model name (default: mxbai-rerank-xsmall-v1)

    Returns:
        Async rerank function
    """
    # Pre-load model to fail fast if there's an issue
    get_reranker_model(model_name)

    async def rerank_func(
        query: str,
        documents: list[str],
        top_n: int | None = None,
        **kwargs,
    ) -> list[RerankResult]:
        return await local_rerank(
            query=query,
            documents=documents,
            top_n=top_n,
            model_name=model_name,
        )

    return rerank_func


async def cohere_rerank(
    query: str,
    documents: List[str],
    top_n: Optional[int] = None,
    api_key: Optional[str] = None,
    model: str = "rerank-v3.5",
    base_url: str = "https://api.cohere.com/v2/rerank",
    extra_body: Optional[Dict[str, Any]] = None,
    enable_chunking: bool = False,
    max_tokens_per_doc: int = 4096,
) -> List[Dict[str, Any]]:
    """
    Rerank documents using Cohere API.

    Supports both the standard Cohere API and Cohere-compatible proxies.

    Args:
        query: The search query
        documents: List of strings to rerank
        top_n: Number of top results to return
        api_key: API key for authentication
        model: rerank model name (default: rerank-v3.5)
        base_url: API endpoint
        extra_body: Additional body for http request (reserved for extra params)
        enable_chunking: Whether to chunk documents exceeding max_tokens_per_doc
        max_tokens_per_doc: Maximum tokens per document (default: 4096 for Cohere v3.5)

    Returns:
        List of dictionary of ["index": int, "relevance_score": float]

    Example:
        >>> # Standard Cohere API
        >>> results = await cohere_rerank(
        ...     query="What is the meaning of life?",
        ...     documents=["Doc1", "Doc2"],
        ...     api_key="your-cohere-key"
        ... )

        >>> # LiteLLM proxy with user authentication
        >>> results = await cohere_rerank(
        ...     query="What is vector search?",
        ...     documents=["Doc1", "Doc2"],
        ...     model="answerai-colbert-small-v1",
        ...     base_url="https://llm-proxy.example.com/v2/rerank",
        ...     api_key="your-proxy-key",
        ...     enable_chunking=True,
        ...     max_tokens_per_doc=480
        ... )
    """
    if api_key is None:
        api_key = os.getenv("COHERE_API_KEY") or os.getenv("RERANK_BINDING_API_KEY")

    return await generic_rerank_api(
        query=query,
        documents=documents,
        model=model,
        base_url=base_url,
        api_key=api_key,
        top_n=top_n,
        return_documents=None,  # Cohere doesn't support this parameter
        extra_body=extra_body,
        response_format="standard",
        enable_chunking=enable_chunking,
        max_tokens_per_doc=max_tokens_per_doc,
    )
# For backwards compatibility - alias to local_rerank
rerank = local_rerank
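

# Hedged sketch of how the create_local_rerank_func factory above is intended to be used:
# build the async rerank function once, then hand it to LightRAG at initialization (the
# docstring above notes this is what lightrag_server does). The parameter name
# rerank_model_func is an assumption for illustration only; verify it against the actual
# LightRAG constructor signature.
#
#     from lightrag import LightRAG
#
#     rerank_func = create_local_rerank_func()  # uses the module's default model
#     rag = LightRAG(
#         working_dir="./rag_storage",
#         rerank_model_func=rerank_func,  # assumed parameter name
#     )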
async def jina_rerank(
query: str,
documents: List[str],
top_n: Optional[int] = None,
api_key: Optional[str] = None,
model: str = "jina-reranker-v2-base-multilingual",
base_url: str = "https://api.jina.ai/v1/rerank",
extra_body: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
"""
Rerank documents using Jina AI API.
Args:
query: The search query
documents: List of strings to rerank
top_n: Number of top results to return
api_key: API key
model: rerank model name
base_url: API endpoint
        extra_body: Additional body for http request (reserved for extra params)
Returns:
List of dictionary of ["index": int, "relevance_score": float]
"""
if api_key is None:
api_key = os.getenv("JINA_API_KEY") or os.getenv("RERANK_BINDING_API_KEY")
return await generic_rerank_api(
query=query,
documents=documents,
model=model,
base_url=base_url,
api_key=api_key,
top_n=top_n,
return_documents=False,
extra_body=extra_body,
response_format="standard",
)
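

# Hedged usage sketch for jina_rerank (illustration only): assumes JINA_API_KEY or
# RERANK_BINDING_API_KEY is set and that api.jina.ai is reachable; the query and
# document texts below are made-up examples.
async def _example_jina_rerank() -> None:
    results = await jina_rerank(
        query="What is a knowledge graph?",
        documents=[
            "A knowledge graph stores entities and their relationships.",
            "Bananas are rich in potassium.",
        ],
        top_n=1,
    )
    for item in results:
        print(f"index={item['index']} score={item['relevance_score']:.4f}")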
async def ali_rerank(
query: str,
documents: List[str],
top_n: Optional[int] = None,
api_key: Optional[str] = None,
model: str = "gte-rerank-v2",
base_url: str = "https://dashscope.aliyuncs.com/api/v1/services/rerank/text-rerank/text-rerank",
extra_body: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
"""
Rerank documents using Aliyun DashScope API.
Args:
query: The search query
documents: List of strings to rerank
top_n: Number of top results to return
api_key: Aliyun API key
model: rerank model name
base_url: API endpoint
        extra_body: Additional body for http request (reserved for extra params)
Returns:
List of dictionary of ["index": int, "relevance_score": float]
"""
if api_key is None:
api_key = os.getenv("DASHSCOPE_API_KEY") or os.getenv("RERANK_BINDING_API_KEY")
return await generic_rerank_api(
query=query,
documents=documents,
model=model,
base_url=base_url,
api_key=api_key,
top_n=top_n,
return_documents=False, # Aliyun doesn't need this parameter
extra_body=extra_body,
response_format="aliyun",
request_format="aliyun",
)
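

# For reference, a sketch of the DashScope-style request/response bodies assumed by the
# "aliyun" request_format/response_format branches above (field values illustrative only):
#
#   request:  {"model": "gte-rerank-v2",
#              "input": {"query": "...", "documents": ["...", "..."]},
#              "parameters": {"top_n": 2}}
#   response: {"output": {"results": [{"index": 0, "relevance_score": 0.87}, ...]}}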
"""Please run this test as a module:
python -m lightrag.rerank
"""
if __name__ == "__main__":
import asyncio
async def main():
# Example usage - documents should be strings, not dictionaries
docs = [
'The capital of France is Paris.',
'Tokyo is the capital of Japan.',
'London is the capital of England.',
'Python is a programming language.',
"The capital of France is Paris.",
"Tokyo is the capital of Japan.",
"London is the capital of England.",
]
query = 'What is the capital of France?'
query = "What is the capital of France?"
print('=== Local Reranker Test ===')
print(f'Model: {os.getenv("RERANK_MODEL", DEFAULT_RERANK_MODEL)}')
print(f'Query: {query}')
print()
# Test Jina rerank
try:
print("=== Jina Rerank ===")
result = await jina_rerank(
query=query,
documents=docs,
top_n=2,
)
print("Results:")
for item in result:
print(f"Index: {item['index']}, Score: {item['relevance_score']:.4f}")
print(f"Document: {docs[item['index']]}")
except Exception as e:
print(f"Jina Error: {e}")
results = await local_rerank(query=query, documents=docs, top_n=3)
# Test Cohere rerank
try:
print("\n=== Cohere Rerank ===")
result = await cohere_rerank(
query=query,
documents=docs,
top_n=2,
)
print("Results:")
for item in result:
print(f"Index: {item['index']}, Score: {item['relevance_score']:.4f}")
print(f"Document: {docs[item['index']]}")
except Exception as e:
print(f"Cohere Error: {e}")
print('Results (top 3):')
for item in results:
idx = item['index']
score = item['relevance_score']
print(f' [{idx}] Score: {score:.4f} - {docs[idx]}')
# Test Aliyun rerank
try:
print("\n=== Aliyun Rerank ===")
result = await ali_rerank(
query=query,
documents=docs,
top_n=2,
)
print("Results:")
for item in result:
print(f"Index: {item['index']}, Score: {item['relevance_score']:.4f}")
print(f"Document: {docs[item['index']]}")
except Exception as e:
print(f"Aliyun Error: {e}")
asyncio.run(main())
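
# The API tests above require the corresponding keys to be exported before running the
# module; values below are illustrative only:
#
#   export JINA_API_KEY=...        # or RERANK_BINDING_API_KEY=...
#   export COHERE_API_KEY=...
#   export DASHSCOPE_API_KEY=...
#   export RERANK_MODEL=...        # optional, for the local reranker test
#   python -m lightrag.rerank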