yangdx
9009abed3e
Fix top_n behavior with chunking to limit documents not chunks
...
- Disable API-level top_n when chunking
- Apply top_n to aggregated documents
- Add comprehensive test coverage
2025-12-03 13:08:26 +08:00
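A minimal sketch of what "apply top_n to aggregated documents" could look like. It is not the repository's implementation: the function name, the input shapes, and the choice of taking the best chunk score per document are all assumptions made for illustration.

```python
def aggregate_chunk_scores(chunk_results, chunk_to_doc, top_n=None):
    """Collapse per-chunk rerank scores into per-document scores.

    chunk_results: list of (chunk_index, score) pairs from the reranker.
    chunk_to_doc:  list mapping each chunk index to its source document index.
    top_n:         limit on the number of DOCUMENTS returned (not chunks).
    All names here are hypothetical.
    """
    best = {}
    for chunk_idx, score in chunk_results:
        doc_idx = chunk_to_doc[chunk_idx]
        # Assumption: a document is scored by its best-matching chunk.
        if doc_idx not in best or score > best[doc_idx]:
            best[doc_idx] = score

    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    # top_n is applied only after aggregation, so it counts documents.
    return ranked[:top_n] if top_n is not None else ranked
```

Applying the limit after aggregation means a long document split into many chunks cannot crowd other documents out of the result.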
yangdx
561ba4e4b5
Fix trailing whitespace and update test mocking for rerank module
...
• Remove trailing whitespace
• Fix TiktokenTokenizer import patch
• Add async context manager mocks
• Update aiohttp.ClientSession patch
• Improve test reliability
2025-12-03 12:40:48 +08:00
yangdx
8e50eef58b
Merge branch 'main' into cohere-rerank
2025-12-02 22:19:37 +08:00
yangdx
19c16bc464
Add content deduplication check for document insertion endpoints
...
• Check content hash before insertion
• Return duplicated status if exists
• Use sanitized text for hash computation
• Apply to both single and batch inserts
• Prevent duplicate content processing
2025-12-02 17:49:48 +08:00
yangdx
8d28b95966
Fix duplicate document responses to return original track_id
...
- Return existing track_id for duplicates
- Remove track_id generation in reprocess
- Update reprocess response documentation
- Clarify track_id behavior in comments
- Update API response examples
2025-12-02 14:32:28 +08:00
yangdx
381ddfffd4
Bump API version to 0259
2025-12-02 13:27:02 +08:00
yangdx
2ecf77efe2
Update help text to use correct gunicorn command with workers flag
2025-12-02 02:52:31 +08:00
yangdx
d6019c82af
Add CASCADE to AGE extension creation in PostgreSQL implementation
...
- Add CASCADE option to CREATE EXTENSION
- Ensure dependencies are installed
- Fix potential AGE setup issues
2025-12-02 00:17:41 +08:00
yangdx
112ed234c4
Bump API version to 0258
2025-12-01 12:20:27 +08:00
yangdx
ea8d55ab42
Add documentation for embedding provider configuration rules
2025-11-28 17:49:30 +08:00
yangdx
4ab4a7ac94
Allow embedding models to use provider defaults when unspecified
...
- Set EMBEDDING_MODEL default to None
- Pass model param only when provided
- Let providers use their own defaults
- Fix lollms embed function params
- Add ollama embed_model default param
2025-11-28 16:57:33 +08:00
yangdx
881b8d3a50
Bump API version to 0257
2025-11-28 15:39:55 +08:00
yangdx
56e0365cf0
Add configurable model parameter to jina_embed function
...
- Add model parameter to jina_embed
- Pass model from API server
- Default to jina-embeddings-v4
- Update function documentation
- Make model selection flexible
2025-11-28 15:38:29 +08:00
yangdx
6e2946e78a
Add max_token_size parameter to azure_openai_embed wrapper
2025-11-28 13:41:01 +08:00
yangdx
4f12fe121d
Change entity extraction logging from warning to info level
...
• Reduce log noise for empty entities
2025-11-27 11:00:34 +08:00
palanisd
a898f0548d
Merge branch 'HKUDS:main' into cohere-rerank
2025-11-25 14:21:43 -05:00
yangdx
93d445dfdd
Add pipeline status lock function for legacy compatibility
...
- Add get_pipeline_status_lock function
- Return NamespaceLock for consistency
- Support workspace parameter
- Enable logging option
- Legacy code compatibility
2025-11-25 18:24:39 +08:00
EightyOliveira
8994c70f2f
fix: exception handling order error
2025-11-25 16:36:41 +08:00
yangdx
48b67d3077
Handle missing WebUI assets gracefully without blocking server startup
...
- Change build check from error to warning
- Redirect to /docs when WebUI unavailable
- Add webui_available to health endpoint
- Only mount /webui if assets exist
- Return status tuple from build check
2025-11-25 02:51:55 +08:00
yangdx
8c4d7a00ad
Refactor: Extract retry decorator to reduce code duplication in Neo4J storage
...
• Define READ_RETRY_EXCEPTIONS constant
• Create reusable READ_RETRY decorator
• Replace 11 duplicate retry decorators
• Improve code maintainability
• Add missing retry to edge_degrees_batch
2025-11-25 01:35:21 +08:00
yangdx
7aaa51cda9
Add retry decorators to Neo4j read operations for resilience
2025-11-24 22:28:15 +08:00
copilot-swe-agent[bot]
8835fc244a
Improve edge case handling for max_tokens=1
...
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:43:05 +00:00
copilot-swe-agent[bot]
1d6ea0c5f7
Fix chunking infinite loop when overlap_tokens >= max_tokens
...
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:40:58 +00:00
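The two chunking fixes above concern a sliding window whose step is `max_tokens - overlap_tokens`: if the overlap is at least as large as the window, the step is zero or negative and the loop never advances. The sketch below shows one way to guard against that, assuming the overlap is clamped rather than rejected. The function name and the clamping policy are illustrative assumptions, not the code from these commits.

```python
def chunk_tokens(tokens, max_tokens, overlap_tokens):
    """Split a token list into overlapping windows (hypothetical helper)."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    # Clamp the overlap so the window always advances by at least one token.
    # With max_tokens == 1 this forces the overlap to 0.
    overlap = max(0, min(overlap_tokens, max_tokens - 1))
    step = max_tokens - overlap

    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks
```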
netbrah
a05bbf105e
Add Cohere reranker config, chunking, and tests
2025-11-22 16:43:13 -05:00
yangdx
7b76211066
Add fallback to AZURE_OPENAI_API_VERSION for embedding API version
2025-11-22 00:14:35 +08:00
yangdx
ffd8da512e
Improve Azure OpenAI compatibility and error handling
...
• Reduce log noise for Azure content filters
• Add default API version fallback
• Change warning to debug log level
• Handle empty choices in streaming
• Better Azure OpenAI integration
2025-11-21 23:51:18 +08:00
yangdx
fafa1791f4
Fix Azure OpenAI model parameter to use deployment name consistently
...
- Use deployment name for Azure API calls
- Fix model param in embed function
- Consistent api_model logic
- Prevent Azure model name conflicts
2025-11-21 23:41:52 +08:00
yangdx
ac9f2574a5
Improve Azure OpenAI wrapper functions with full parameter support
...
• Add missing parameters to wrappers
• Update docstrings for clarity
• Ensure API consistency
• Fix parameter forwarding
• Maintain backward compatibility
2025-11-21 19:24:32 +08:00
yangdx
45f4f82392
Refactor Azure OpenAI client creation to support client_configs merging
...
- Handle None client_configs case
- Merge configs with explicit params
- Override client_configs with params
- Use dict unpacking for client init
- Maintain parameter precedence
2025-11-21 19:14:16 +08:00
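The precedence described in this commit (explicit parameters override `client_configs`, and a `None` `client_configs` is tolerated) can be sketched as follows. The function name and the parameter set are hypothetical, and skipping explicit parameters that are `None` is an assumption rather than something the commit message states.

```python
def build_client_kwargs(api_key=None, base_url=None, client_configs=None):
    """Merge optional client_configs with explicit parameters (illustrative)."""
    # Handle the None case by starting from an empty dict; copy so the
    # caller's dict is never mutated.
    merged = dict(client_configs or {})

    explicit = {"api_key": api_key, "base_url": base_url}
    # Assumption: only explicitly provided (non-None) parameters override.
    merged.update({k: v for k, v in explicit.items() if v is not None})
    return merged


# The result would then be unpacked into the client constructor,
# e.g. SomeClient(**build_client_kwargs(...)), where SomeClient is a placeholder.
```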
yangdx
0c4cba3860
Fix double decoration in azure_openai_embed and document decorator usage
...
• Remove redundant @retry decorator
• Call openai_embed.func directly
• Add detailed decorator documentation
• Prevent double parameter injection
• Fix EmbeddingFunc wrapping issues
2025-11-21 18:03:53 +08:00
yangdx
b46c152306
Fix linting
2025-11-21 17:16:44 +08:00
yangdx
b709f8f869
Consolidate Azure OpenAI implementation into main OpenAI module
...
• Unified OpenAI/Azure client creation
• Azure module now re-exports functions
• Backward compatibility maintained
• Reduced code duplication
2025-11-21 17:12:33 +08:00
yangdx
66d6c7dd6f
Refactor main function to provide sync CLI entry point
2025-11-21 13:11:55 +08:00
yangdx
02fdceb959
Update OpenAI client to use stable API and bump minimum version to 2.0.0
...
- Remove beta prefix from completions.parse
- Update OpenAI dependency to >=2.0.0
- Fix whitespace formatting
- Update all requirement files
- Clean up pyproject.toml dependencies
2025-11-21 12:55:44 +08:00
yangdx
9f69c5bf85
feat: Support structured output parsed from OpenAI
...
Added support for structured output (JSON mode) from the OpenAI API in `openai.py` and `azure_openai.py`.
When `response_format` is used to request structured data, the new logic checks for the `message.parsed` attribute. If it exists, it's serialized into a JSON string as the final content. If not, the code falls back to the existing `message.content` handling, ensuring backward compatibility.
2025-11-21 12:46:31 +08:00
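The fallback described in this commit message can be sketched as below. This is an illustration only: `extract_content` is a hypothetical name, and the handling of Pydantic models via `model_dump()` is an assumption about how a parsed object might be serialized.

```python
import json
from types import SimpleNamespace


def extract_content(message):
    """Return JSON text for a parsed structured response, else raw content."""
    parsed = getattr(message, "parsed", None)
    if parsed is not None:
        # Assumption: parsed may be a Pydantic v2 model or a plain dict/list.
        if hasattr(parsed, "model_dump"):
            parsed = parsed.model_dump()
        return json.dumps(parsed, ensure_ascii=False)
    # No parsed attribute: keep the pre-existing message.content behaviour.
    return message.content


# Stand-in objects instead of real API responses:
structured = SimpleNamespace(parsed={"keywords": ["a", "b"]}, content=None)
plain = SimpleNamespace(content="hello")
```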
yangdx
c9e1c86e81
Refactor keyword extraction handling to centralize response format logic
...
• Move response format to core function
• Remove duplicate format assignments
• Standardize keyword extraction flow
• Clean up redundant parameter handling
• Improve Azure OpenAI compatibility
2025-11-21 12:10:04 +08:00
yangdx
46ce6d9a13
Fix Azure OpenAI embedding model parameter fallback
...
- Use model param if provided
- Fall back to deployment name
- Fix embedding API call
- Improve parameter handling
2025-11-20 18:20:22 +08:00
Amritpal Singh
30e86fa331
Use the deployment variable, which takes its value from the .env file or falls back to a default
2025-11-20 09:00:27 +00:00
yangdx
b7de694f48
Add comprehensive error logging across API routes
...
- Add error logs to Ollama API endpoints
- Replace logging with unified logger
- Log streaming query errors
- Add data query error logging
- Include stack traces for debugging
2025-11-19 22:50:06 +08:00
yangdx
0fb2925c6a
Remove ascii_colors dependency and fix stream handling errors
...
• Remove ascii_colors.trace_exception calls
• Add SafeStreamHandler for closed streams
• Patch ascii_colors console handler
• Prevent ValueError on stream close
• Improve logging error handling
2025-11-19 21:38:17 +08:00
yangdx
6fea68bff9
Fix ChunkTokenLimitExceededError message formatting
...
- Prevent passing two separate string objects to __init__
- Maintain same error output
2025-11-19 18:50:45 +08:00
yangdx
f988a22652
Add token limit validation for character-only chunking
...
- Add ChunkTokenLimitExceededError exception
- Validate chunks against token limits
- Include chunk preview in error messages
- Add comprehensive test coverage
- Log warnings for oversized chunks
2025-11-19 18:32:43 +08:00
yangdx
95cd0ece74
Fix DOCX table extraction by escaping special characters in cells
...
- Add escape_cell() function
- Escape backslashes first
- Handle tabs and newlines
- Preserve tab-delimited format
- Prevent double-escaping issues
2025-11-19 09:54:35 +08:00
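Why backslashes are escaped first: if tabs were replaced with `\t` before backslashes were doubled, the backslash introduced by that replacement would itself be doubled in the next step. A minimal sketch of such a helper follows. Only the name `escape_cell` comes from the commit message; the body, including the handling of carriage returns, is an assumption.

```python
def escape_cell(text):
    """Escape a table cell so it is safe inside tab-delimited output."""
    return (
        text.replace("\\", "\\\\")  # backslashes first, to avoid double-escaping
        .replace("\t", "\\t")       # a real tab would split the cell
        .replace("\n", "\\n")       # a real newline would split the row
        .replace("\r", "\\r")       # assumption: carriage returns treated likewise
    )
```

With this order, a literal backslash followed by `t` in the source cell stays distinguishable from an escaped tab character.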
yangdx
87de2b3e9e
Update XLSX extraction documentation to reflect current implementation
2025-11-19 04:26:41 +08:00
yangdx
0244699d81
Optimize XLSX extraction by using sheet.max_column instead of two-pass scan
...
• Remove two-pass row scanning approach
• Use built-in sheet.max_column property
• Simplify column width detection logic
• Improve memory efficiency
• Maintain column alignment preservation
2025-11-19 04:02:39 +08:00
yangdx
2b16016312
Optimize XLSX extraction to avoid storing all rows in memory
...
• Remove intermediate row storage
• Use iterator twice instead of list()
• Preserve column alignment logic
• Reduce memory footprint
• Maintain same output format
2025-11-19 03:48:36 +08:00
yangdx
ef659a1e09
Preserve column alignment in XLSX extraction with two-pass processing
...
• Two-pass approach for consistent width
• Maintain tabular structure integrity
• Determine max columns first pass
• Extract with alignment second pass
• Prevent column misalignment issues
2025-11-19 03:34:22 +08:00
yangdx
3efb1716b4
Enhance XLSX extraction with structured tab-delimited format and escaping
...
- Add clear sheet separators
- Escape special characters
- Trim trailing empty columns
- Preserve row structure
- Single-pass optimization
2025-11-19 03:06:29 +08:00
yangdx
e7d2803a65
Remove text stripping in DOCX extraction to preserve whitespace
...
• Keep original paragraph spacing
• Preserve cell whitespace in tables
• Maintain document formatting
• Don't strip leading/trailing spaces
2025-11-19 02:12:27 +08:00
yangdx
186c8f0e16
Preserve blank paragraphs in DOCX extraction to maintain spacing
...
• Remove text emptiness check
• Always append paragraph text
• Maintain document formatting
• Preserve original spacing
2025-11-19 02:03:10 +08:00