Commit graph

476 commits

Author SHA1 Message Date
Tong Da
a60a8704ba Add search method to lightrag. Search is for retrieve structured objects (entities, relations, chunks) in their raw data format. 2025-09-01 02:19:58 +08:00
yangdx
5fd7682f16 Fix LLM output instability for <|> tuple delimiter
- Replace <||> with <|>
- Replace < | > with <|>
- Apply fix in both functions
- Handle delimiter variations
- Improve parsing reliability
2025-09-01 01:22:27 +08:00
yangdx
4e751e0653 refac: Enhance extraction with improved prompts and parser
-   **Prompts**: Restructured prompts with clearer steps and quality guidelines. Simplified the relationship tuple by removing `relationship_strength`
-   **Model**: Updated default entity types to be more comprehensive and consistently capitalized (e.g., `Location`, `Product`)
2025-08-31 22:24:11 +08:00
yangdx
75de40da41 Fix typo in relationship extraction log messages 2025-08-31 17:45:16 +08:00
yangdx
97c9600085 Improve extraction error handling and field validation
• Add field count validation warnings
• Fix relationship field count (5→6)
• Change error logs to warnings
2025-08-31 17:33:42 +08:00
yangdx
b747417961 feat: enhance text extraction text sanitization and normalization
- Improve reduntant quotes in entity and relation name, type and keywords
- Add HTML tag cleaning and Chinese symbol conversion
- Filter out short numeric content and malformed text
- Enhance entity type validation with character filtering
2025-08-31 13:17:20 +08:00
yangdx
d4bbc5dea9 refactor: Merge multi-step text sanitization into single function 2025-08-31 10:36:56 +08:00
yangdx
03d0fa3014 perf: add optional query_embedding parameter to avoid redundant embedding calls 2025-08-29 18:15:45 +08:00
yangdx
a923d378dd Remove deprecated ID-based filtering from vector storage queries
- Remove ids param from QueryParam
- Simplify BaseVectorStorage.query signature
- Update all vector storage implementations
- Streamline PostgreSQL query templates
- Remove ID filtering from operate.py calls
2025-08-29 17:06:48 +08:00
yangdx
99e28e815b fix: prevent document processing failures from UTF-8 surrogate characters
- Change sanitize_text_for_encoding to fail-fast instead of returning error placeholders
- Add strict UTF-8 cleaning pipeline to entity/relationship extraction
- Skip problematic entities/relationships instead of corrupting data

Fixes document processing crashes when encountering surrogate characters (U+D800-U+DFFF)
2025-08-27 23:52:39 +08:00
yangdx
6a2a592224 Fix linting 2025-08-27 12:51:50 +08:00
yangdx
28e07c89f9 Fix linting 2025-08-27 12:35:51 +08:00
yangdx
2ccc39de9a Fix language fallback in summarize error 2025-08-27 12:34:27 +08:00
yangdx
ff0a18e08c Unify SUMMARY_LANGUANGE and ENTITY_TYPES implementation method 2025-08-27 12:23:22 +08:00
Thibo Rosemplatt
c3aabfc251 Merge branch 'main' into entityTypesServerSupport 2025-08-26 21:48:20 +02:00
yangdx
d3623cc9ae fix: resolve infinite loop risk in _handle_entity_relation_summary
- Ensure oversized descriptions are force-merged with subsequent ones
- Add len(current_list) <= 2 termination condition to guarantee convergence
- Implement token-based truncation in _summarize_descriptions to prevent overflow
2025-08-26 21:58:31 +08:00
yangdx
79e0226b2b Refactor: move force_llm_summary_on_merge to global_config access
- Remove parameter from function signature
- Access from global_config instead
- Improve code consistency
2025-08-26 18:02:39 +08:00
yangdx
6bcfe696ee feat: add output length recommendation and description type to LLM summary
- Add SUMMARY_LENGTH_RECOMMENDED parameter (600 tokens)
- Optimize prompt temple for LLM summary
2025-08-26 14:41:12 +08:00
yangdx
025f70089a Simplify status messages in knowledge rebuild operations 2025-08-26 04:26:15 +08:00
yangdx
9eb2be79b8 feat: track actual LLM usage in entity/relation merging
- Modified _handle_entity_relation_summary to return tuple[str, bool]
- Updated merge functions to log "LLMmerg" vs "Merging" based on actual LLM usage
- Replaced hardcoded fragment count prediction with real-time LLM usage tracking
2025-08-26 03:56:18 +08:00
yangdx
de2daf6565 refac: Rename summary_max_tokens to summary_context_size, comprehensive parameter validation for summary configuration
- Update algorithm logic in operate.py for better token management
- Fix health endpoint to use correct parameter names
2025-08-26 01:35:50 +08:00
yangdx
91767ffcee Improve warning message formatting in entity/relationship rebuild 2025-08-25 21:55:29 +08:00
yangdx
15cdd0dd8f fix: Sort cached extraction results by the create_time within each chunk
This ensures the KG rebuilds maintain the original creation order of the first extraction result for each chunk.
2025-08-25 21:41:33 +08:00
yangdx
882d6857d8 feat: Implement map-reduce summarization to handle large humber of description merging 2025-08-25 21:03:16 +08:00
yangdx
cac8e189e7 Remove redundant entity vector deletion before upsert 2025-08-25 17:18:51 +08:00
yangdx
9b6de7512d Optimize the stability of description merging order 2025-08-25 17:10:51 +08:00
yangdx
31f4f96944 Exclude conversation history from context length calculation 2025-08-25 12:43:34 +08:00
yangdx
f688e95f56 Add warning for vector chunks missing chunk_id 2025-08-25 12:42:25 +08:00
yangdx
b6aedba7ae Add logging for empty naive query results in vector context 2025-08-25 12:21:31 +08:00
yangdx
f1ff5cf93f fix: initialize truncated_chunks variable in _build_query_context
Prevents local variable 'truncated_chunks'referenced before assignment
2025-08-25 11:56:56 +08:00
Thibo Rosemplatt
d054ec5d00 Added entity_types as a user defined variable (via .env) 2025-08-23 20:16:11 +02:00
yangdx
9bc349ddd6 Improve Empty Keyword Handling logic 2025-08-23 11:50:58 +08:00
yangdx
b5c230abdd optimize: avoid duplicate embedding calls in _build_query_context
Reduces API costs and improves query performance while maintaining backward compatibility.
2025-08-21 16:49:24 +08:00
yangdx
2a7fec2873 Optimize keyword extraction prompt, and remove conversation history from keywork extraction.
- Remove history context processing
- Update prompt to focus on single query
- Clarify high/low level keyword types
- Improve JSON output instructions
- Add edge case handling guidance
2025-08-18 23:35:04 +08:00
yangdx
d3fde60938 refactor: remove file_path and created_at from context, improve token truncation
- Remove file_path and created_at fields from entity and relationship contexts
- Update token truncation to include full JSON serialization instead of content only
2025-08-18 18:30:09 +08:00
yangdx
1e2d5252d7 Add get_vectors_by_ids method and filter out vector data from query results 2025-08-15 16:32:26 +08:00
yangdx
6cab68bb47 Improve KG chunk selection documentation and configuration clarity 2025-08-15 10:09:44 +08:00
yangdx
3acb32f547 Add comments explaining chunk deduplication behavior in query context 2025-08-15 02:19:01 +08:00
yangdx
f733ac829c Remove debug logging statements from query context building 2025-08-14 23:44:34 +08:00
yangdx
4a19d0de25 Add chunk tracking system to monitor chunk sources and frequencies
• Track chunk sources (E/R/C types)
• Log frequency and order metadata
• Preserve chunk_id through processing
• Add debug logging for chunk tracking
• Handle rerank and truncation operations
2025-08-14 22:58:26 +08:00
yangdx
a8b7890470 Rename chunk selection functions for better clarity 2025-08-14 16:01:13 +08:00
yangdx
3343833571 Remove query params from cache key generation for keyword extration 2025-08-14 02:36:01 +08:00
yangdx
bac09118d5 Simplify embedding func extraction 2025-08-14 01:09:18 +08:00
yangdx
ac3b5605a1 Refactor logging for relation chunk discovery with dedup info 2025-08-14 00:41:58 +08:00
yangdx
edac10906c fix: Add total_relation_chunks statistics and improve logging in _find_related_text_unit_from_relations 2025-08-13 23:45:31 +08:00
yangdx
5a40ff654e Change KG chunk selection default to VECTOR
- Set KG_CHUNK_PICK_METHOD default to VECTOR
- Update env.example with new config option
2025-08-13 23:10:42 +08:00
yangdx
f1dafa0d01 feat: KG related chunks selection by vector similarity
- Add env switch to toggle weighted polling vs vector-similarity strategy
- Implement similarity-based sorting with fallback to weighted
- Introduce batch vector read API for vector storage
- Implement vector store and retrive funtion for Nanovector DB
- Preserve default behavior (weighted polling selection method)
2025-08-13 18:16:42 +08:00
yangdx
095e0cbfa2 Refac: Add workspace infomation to all logger output for all storage type 2025-08-12 01:19:09 +08:00
yangdx
cf064579ce Remove deprecated keyword extraction query methods
- Delete query_with_keywords function
- Remove kg_query_with_keywords helper
- Drop separate keyword extraction methods
2025-08-08 14:59:39 +08:00
yangdx
eded6d1187 Unify document chunks context format in only_need_context query
- Update Document Chunks label to include (DC) abbreviation
2025-08-08 00:02:53 +08:00