Commit graph

261 commits

Author SHA1 Message Date
yangdx
c87eb2cfcf Increase timeout buffers for async function calls
• Extend execution timeout buffer to 150s
• Extend task duration buffer to 180s
• Account for low-level retry delays
• Improve health check phase handling
• Reduce timeout-related failures
2025-09-06 23:56:24 +08:00
yangdx
6be462511f Add error prefixing for better debugging context in async operations
* Add create_prefixed_exception utility
* Prefix entity processing errors
* Prefix relationship processing errors
* Prefix chunk extraction progress info
* Maintain original exception chains
2025-09-05 21:28:00 +08:00
yangdx
2c551cb5db Add support for Chinese book title marks in normalize_extracted_info 2025-09-04 18:51:57 +08:00
yangdx
9b516a8a53 Hot Fix: Preserve whitespace chars in text sanitization
• Keep \t, \n, \r in control char removal
2025-09-04 10:58:29 +08:00
yangdx
a25ce7f078 Fix linting 2025-09-03 21:58:30 +08:00
yangdx
7ef2f0dff6 Add VDB error handling with retries for data consistency
- Add safe_vdb_operation_with_exception util
- Wrap VDB ops in entity/relationship code
- Ensure exceptions propagate on failure
- Add retry logic with configurable delays
2025-09-03 21:15:09 +08:00
yangdx
5b2deccbef Improve text normalization and add entity type capitalization
- Capitalize entity types with .title()
- Add non-breaking space handling
- Add narrow non-breaking space regex
2025-09-02 02:51:41 +08:00
yangdx
e95622ca7b fix(utils): enhance remove_think_tags to handle orphaned </think> closing tags
The function now properly handles cases where text contains </think> closing tags
without corresponding <think> opening tags, which can occur due to content
truncation or processing errors.
2025-09-01 07:17:30 +08:00
yangdx
c8c59c38b0 Fix entity types configuration to support JSON list parsing
- Add JSON parsing for list env vars
- Update entity types example format
- Add list type support to get_env_value
2025-09-01 00:14:57 +08:00
yangdx
1a015a7015 Add queue_name parameter to priority_limit_async_func_call for better logging
• Add queue_name parameter to decorator
• Update all log messages with queue names
• Pass specific names for LLM and embedding
2025-08-31 23:47:22 +08:00
yangdx
b747417961 feat: enhance text extraction text sanitization and normalization
- Improve reduntant quotes in entity and relation name, type and keywords
- Add HTML tag cleaning and Chinese symbol conversion
- Filter out short numeric content and malformed text
- Enhance entity type validation with character filtering
2025-08-31 13:17:20 +08:00
yangdx
d4bbc5dea9 refactor: Merge multi-step text sanitization into single function 2025-08-31 10:36:56 +08:00
yangdx
d7e0701b63 Improve logging setup and add error prefixes for LLM functions
- Move logger init to top of file
- Add console handler by default
- Prefix LLM errors with "[LLM func]"
- Update timeout log messages
- Comment out pypinyin success log
2025-08-29 14:19:13 +08:00
yangdx
925e631a9a refac: Add robust time out handling for LLM request 2025-08-29 13:50:35 +08:00
yangdx
99e28e815b fix: prevent document processing failures from UTF-8 surrogate characters
- Change sanitize_text_for_encoding to fail-fast instead of returning error placeholders
- Add strict UTF-8 cleaning pipeline to entity/relationship extraction
- Skip problematic entities/relationships instead of corrupting data

Fixes document processing crashes when encountering surrogate characters (U+D800-U+DFFF)
2025-08-27 23:52:39 +08:00
yangdx
bf43e1b8c1 fix: Resolve default rerank config problem when env var missing
- Read config from selected_rerank_func when env var missing
- Make api_key optional for rerank function
- Add response format validation with proper error handling
- Update Cohere rerank default to official API endpoint
2025-08-23 01:07:59 +08:00
yangdx
580cb7906c feat: Add multiple rerank provider support to LightRAG Server by adding new env vars and cli params
- Add --enable-rerank CLI argument and ENABLE_RERANK env var
- Simplify rerank configuration logic to only check enable flag and binding
- Update health endpoint to show enable_rerank and rerank_configured status
- Improve logging messages for rerank enable/disable states
- Maintain backward compatibility with default value True
2025-08-22 19:29:45 +08:00
yangdx
b5c230abdd optimize: avoid duplicate embedding calls in _build_query_context
Reduces API costs and improves query performance while maintaining backward compatibility.
2025-08-21 16:49:24 +08:00
yangdx
ced3aef7cb refactor: simplify text encoding by removing redundant safe_encode_for_llm 2025-08-19 19:37:46 +08:00
yangdx
806081645f Refactor text cleaning to use sanitize_text_for_encoding consistently
• Replace clean_text with sanitize_text
• Remove deprecated clean_text function
• Add whitespace trimming to sanitizer
• Improve UTF-8 encoding safety
• Consolidate text cleaning logic
2025-08-19 19:20:01 +08:00
yangdx
f9cf544805 Add text sanitization to prevent UTF-8 encoding errors in LLM calls
• Remove surrogate characters
• Clean control characters
• Sanitize input and history messages
• Add comprehensive error handling
• Log sanitization activities
2025-08-19 18:50:52 +08:00
yangdx
64015548df Refactor MD5 hash functions and consolidate Unicode error handling 2025-08-19 17:49:23 +08:00
yangdx
64058c771f Refactor: Harden compute_args_hash against Unicode errors 2025-08-19 17:19:39 +08:00
yangdx
d3fde60938 refactor: remove file_path and created_at from context, improve token truncation
- Remove file_path and created_at fields from entity and relationship contexts
- Update token truncation to include full JSON serialization instead of content only
2025-08-18 18:30:09 +08:00
yangdx
453efeb924 Fix file path length checking to use UTF-8 byte length instead of char count 2025-08-18 13:59:27 +08:00
yangdx
14e083a1a6 fix: replace pyuca with pypinyin for Chinese pinyin sorting and add file_path sort 2025-08-17 15:21:24 +08:00
yangdx
61469c0a56 Add Chinese pinyin sorting support across document operations
• Replace pyuca with centralized utils function
• Add pinyin sort keys for file paths
• Update MongoDB indexes with zh collation
• Migrate existing indexes for compatibility
• Support Chinese chars in Redis/JSON storage
• Keep PostgreSQL sorting order controled by Database Collate order
2025-08-17 12:45:48 +08:00
yangdx
4a19d0de25 Add chunk tracking system to monitor chunk sources and frequencies
• Track chunk sources (E/R/C types)
• Log frequency and order metadata
• Preserve chunk_id through processing
• Add debug logging for chunk tracking
• Handle rerank and truncation operations
2025-08-14 22:58:26 +08:00
yangdx
a8b7890470 Rename chunk selection functions for better clarity 2025-08-14 16:01:13 +08:00
yangdx
a11e8d77eb Improve missing-vector warning logic in vector similarity
- Check for any missing vectors
- Separate no-vector vs partial-vector warnings
- Ensure early return on empty vectors
2025-08-14 14:24:15 +08:00
yangdx
2e5487305e Merge branch 'main' into pick-trunk-by-vector 2025-08-14 03:12:38 +08:00
yangdx
7fb11193b0 Fix linting 2025-08-14 03:07:29 +08:00
yangdx
331dcf0509 Remove query params from cache key generation for keyword extration 2025-08-14 02:57:39 +08:00
yangdx
3343833571 Remove query params from cache key generation for keyword extration 2025-08-14 02:36:01 +08:00
yangdx
f1dafa0d01 feat: KG related chunks selection by vector similarity
- Add env switch to toggle weighted polling vs vector-similarity strategy
- Implement similarity-based sorting with fallback to weighted
- Introduce batch vector read API for vector storage
- Implement vector store and retrive funtion for Nanovector DB
- Preserve default behavior (weighted polling selection method)
2025-08-13 18:16:42 +08:00
zrguo
f1c7233763 Avoid UTF-8 BOM 2025-08-12 17:06:54 +08:00
yangdx
0463963520 fix: include all query parameters in LLM cache hash key generation
- Add missing query parameters (top_k, enable_rerank, max_tokens, etc.) to cache key generation in kg_query, naive_query, and extract_keywords_only functions
- Add queryparam field to CacheData structure and PostgreSQL storage for debugging
- Update PostgreSQL schema with automatic migration for queryparam JSONB column
- Prevent incorrect cache hits between queries with different parameters

Fixes issue where different query parameters incorrectly shared the same cached results.
2025-08-05 18:03:10 +08:00
yangdx
cb75e6631e Remove quantized embedding info from LLM cache
- Delete quantize_embedding function
- Delete dequantize_embedding function
- Remove embedding fields from CacheData
- Update save_to_cache to exclude embedding data
- Clean up unused quantization-related code
2025-08-05 17:58:34 +08:00
yangdx
32af45ff46 refactor: improve JSON parsing reliability with json-repair library
Replace regex-based JSON extraction with json-repair for better handling of malformed LLM responses. Remove deprecated JSON parsing utilities and clean up keyword_extraction parameter across LLM providers.

- Remove locate_json_string_body_from_string() and convert_response_to_json()
- Use json-repair.loads() in extract_keywords_only() for robust parsing
- Clean up LLM interfaces and remove unused parameters
- Add json-repair dependency
2025-08-01 19:36:20 +08:00
yangdx
2af8a93dc7 fix: resolve _sort_key error in Redis get_docs_paginated function 2025-07-31 02:16:56 +08:00
yangdx
d0bc5e7c4a Extend path filter to also cover POST requests 2025-07-31 02:06:56 +08:00
yangdx
3e5efd0b27 Add /documents/paginated to filtered logging paths 2025-07-31 02:00:00 +08:00
yangdx
6014b9bf73 feat: add track_id support for document processing progress monitoring
- Add get_docs_by_track_id() method to all storage backends (MongoDB, PostgreSQL, Redis, JSON)
- Implement automatic track_id generation with upload_/insert_ prefixes
- Add /track_status/{track_id} API endpoint for frontend progress queries
- Create database indexes for efficient track_id lookups
- Enable real-time document processing status tracking across all storage types
2025-07-29 22:24:21 +08:00
yangdx
9923821d75 refactor: Remove deprecated max_token_size from embedding configuration
This parameter is no longer used. Its removal simplifies the API and clarifies that token length management is handled by upstream text chunking logic rather than the embedding wrapper.
2025-07-29 10:49:35 +08:00
yangdx
e09929b42e Refine rerank filtering log message for clarity 2025-07-27 16:57:38 +08:00
yangdx
f4bca7bfb2 Fix linting 2025-07-27 16:50:45 +08:00
yangdx
a9565d7379 feat: Skip rerank filtering when min_rerank_score is 0.0 2025-07-27 16:50:12 +08:00
yangdx
ebaff228aa feat: Add rerank score filtering with configurable threshold
- Add DEFAULT_MIN_RERANK_SCORE constant (default: 0.0)
- Add MIN_RERANK_SCORE environment variable support
- Filter chunks with rerank scores below threshold in process_chunks_unified
- Add info-level logging for filtering operations
- Handle empty results gracefully after filtering
- Maintain backward compatibility with non-reranked chunks
2025-07-27 16:37:44 +08:00
yangdx
a67f93acc9 Replace hardcoded max tokens with DEFAULT_MAX_TOTAL_TOKENS constant
- Use constant in process_chunks_unified
- Update WebUI default to match (32000)
2025-07-26 11:23:54 +08:00
yangdx
7b915b34f6 Refactor: move build_file_path function from operate.py to utils.py 2025-07-26 10:52:59 +08:00