239 commits

Author SHA1 Message Date
yangdx
19c16bc464 Add content deduplication check for document insertion endpoints
• Check content hash before insertion
• Return duplicated status if exists
• Use sanitized text for hash computation
• Apply to both single and batch inserts
• Prevent duplicate content processing
2025-12-02 17:49:48 +08:00
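
The deduplication check described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `sanitize_text`, `content_hash`, `insert_text`, the in-memory `seen` dict and the choice of MD5 are all assumptions made for the example.

```python
import hashlib
import re


def sanitize_text(text: str) -> str:
    """Hypothetical sanitizer: drop control characters and trim whitespace."""
    cleaned = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return cleaned.strip()


def content_hash(text: str) -> str:
    """Hash the sanitized text so trivially different copies collide."""
    return hashlib.md5(sanitize_text(text).encode("utf-8")).hexdigest()


def insert_text(text: str, seen: dict[str, str], track_id: str) -> dict[str, str]:
    """Check the content hash before insertion (assumed in-memory index)."""
    key = content_hash(text)
    if key in seen:
        # Duplicate content: report it and hand back the original track_id.
        return {"status": "duplicated", "track_id": seen[key]}
    seen[key] = track_id
    return {"status": "success", "track_id": track_id}
```

The same helper would serve single and batch inserts by calling it once per text.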
yangdx
8d28b95966 Fix duplicate document responses to return original track_id
- Return existing track_id for duplicates
- Remove track_id generation in reprocess
- Update reprocess response documentation
- Clarify track_id behavior in comments
- Update API response examples
2025-12-02 14:32:28 +08:00
yangdx
b7de694f48 Add comprehensive error logging across API routes
- Add error logs to Ollama API endpoints
- Replace logging with unified logger
- Log streaming query errors
- Add data query error logging
- Include stack traces for debugging
2025-11-19 22:50:06 +08:00
yangdx
0fb2925c6a Remove ascii_colors dependency and fix stream handling errors
• Remove ascii_colors.trace_exception calls
• Add SafeStreamHandler for closed streams
• Patch ascii_colors console handler
• Prevent ValueError on stream close
• Improve logging error handling
2025-11-19 21:38:17 +08:00
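
A handler in the spirit of the SafeStreamHandler mentioned above might look like the sketch below. Only the class name comes from the commit message; the body is an assumption built on the standard `logging` module.

```python
import io
import logging


class SafeStreamHandler(logging.StreamHandler):
    """StreamHandler that skips writes once its stream has been closed."""

    def _stream_closed(self) -> bool:
        # Not every stream object exposes `closed`, so default to False.
        return getattr(self.stream, "closed", False)

    def emit(self, record: logging.LogRecord) -> None:
        if self._stream_closed():
            return  # drop the record instead of raising ValueError
        super().emit(record)

    def flush(self) -> None:
        if self._stream_closed():
            return
        try:
            super().flush()
        except ValueError:
            # The stream was closed between the check and the flush.
            pass
```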
yangdx
95cd0ece74 Fix DOCX table extraction by escaping special characters in cells
- Add escape_cell() function
- Escape backslashes first
- Handle tabs and newlines
- Preserve tab-delimited format
- Prevent double-escaping issues
2025-11-19 09:54:35 +08:00
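
The escaping order described above matters, and a short sketch makes the reason visible. The function name `escape_cell` comes from the commit message; the body and the `format_row` helper are assumptions.

```python
def escape_cell(text: str) -> str:
    """Escape characters that would break a tab-delimited row."""
    # Backslashes go first; otherwise the backslashes introduced below
    # for tabs and newlines would themselves be escaped a second time.
    text = text.replace("\\", "\\\\")
    text = text.replace("\t", "\\t")
    text = text.replace("\n", "\\n")
    return text


def format_row(cells: list[str]) -> str:
    """Join escaped cells with real tabs (hypothetical helper)."""
    return "\t".join(escape_cell(cell) for cell in cells)
```

After escaping, every real tab in a row is a column separator and every real newline ends a row.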
yangdx
87de2b3e9e Update XLSX extraction documentation to reflect current implementation 2025-11-19 04:26:41 +08:00
yangdx
0244699d81 Optimize XLSX extraction by using sheet.max_column instead of two-pass scan
• Remove two-pass row scanning approach
• Use built-in sheet.max_column property
• Simplify column width detection logic
• Improve memory efficiency
• Maintain column alignment preservation
2025-11-19 04:02:39 +08:00
yangdx
2b16016312 Optimize XLSX extraction to avoid storing all rows in memory
• Remove intermediate row storage
• Use iterator twice instead of list()
• Preserve column alignment logic
• Reduce memory footprint
• Maintain same output format
2025-11-19 03:48:36 +08:00
yangdx
ef659a1e09 Preserve column alignment in XLSX extraction with two-pass processing
• Two-pass approach for consistent width
• Maintain tabular structure integrity
• Determine max columns first pass
• Extract with alignment second pass
• Prevent column misalignment issues
2025-11-19 03:34:22 +08:00
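
The two-pass approach above can be illustrated on plain Python lists, without a spreadsheet library. The function names and the rule that `None` and `""` count as empty cells are assumptions for this sketch.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def last_used_column(row: Row) -> int:
    """Return the 1-based index of the last non-empty cell, or 0."""
    for index in range(len(row), 0, -1):
        if row[index - 1] not in (None, ""):
            return index
    return 0


def extract_aligned(rows: Sequence[Row]) -> list[str]:
    # First pass: find the widest row, ignoring trailing empty cells.
    width = max((last_used_column(row) for row in rows), default=0)
    lines = []
    # Second pass: emit every row with exactly `width` columns.
    for row in rows:
        cells = ["" if cell is None else str(cell) for cell in row[:width]]
        cells += [""] * (width - len(cells))
        lines.append("\t".join(cells))
    return lines
```

Every output line then carries the same number of tabs, so columns stay aligned even when some rows are short.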
yangdx
3efb1716b4 Enhance XLSX extraction with structured tab-delimited format and escaping
- Add clear sheet separators
- Escape special characters
- Trim trailing empty columns
- Preserve row structure
- Single-pass optimization
2025-11-19 03:06:29 +08:00
yangdx
e7d2803a65 Remove text stripping in DOCX extraction to preserve whitespace
• Keep original paragraph spacing
• Preserve cell whitespace in tables
• Maintain document formatting
• Don't strip leading/trailing spaces
2025-11-19 02:12:27 +08:00
yangdx
186c8f0e16 Preserve blank paragraphs in DOCX extraction to maintain spacing
• Remove text emptiness check
• Always append paragraph text
• Maintain document formatting
• Preserve original spacing
2025-11-19 02:03:10 +08:00
yangdx
fa887d811b Fix table column structure preservation in DOCX extraction
• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
2025-11-19 01:52:02 +08:00
yangdx
4438ba41a3 Enhance DOCX extraction to preserve document order with tables
• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells
• Process all body elements sequentially
2025-11-19 01:31:33 +08:00
yangdx
702cfd2981 Fix document deletion concurrency control and validation logic
• Clarify job naming for single vs batch deletion
• Update job name validation in busy pipeline check
2025-11-18 13:59:24 +08:00
yangdx
1745b30a5f Fix missing workspace parameter in update flags status call 2025-11-18 12:55:48 +08:00
yangdx
52c812b9a0 Fix workspace isolation for pipeline status across all operations
- Fix final_namespace error in get_namespace_data()
- Fix get_workspace_from_request return type
- Add workspace param to pipeline status calls
2025-11-17 12:54:33 +08:00
yangdx
926960e957 Refactor workspace handling to use default workspace and namespace locks
- Remove DB-specific workspace configs
- Add default workspace auto-setting
- Replace global locks with namespace locks
- Simplify pipeline status management
- Remove redundant graph DB locking
2025-11-17 12:54:33 +08:00
yangdx
c246eff725 Improve docling integration with macOS compatibility and CLI flag
- Add --docling CLI flag for easier setup
- Add numpy version constraints
- Exclude docling on macOS (fork-safety)
2025-11-17 12:54:32 +08:00
yangdx
69a0b74ce7 refactor: move document deps to api group, remove dynamic imports
- Merge offline-docs into api extras
- Remove pipmaster dynamic installs
- Add async document processing
- Pre-check docling availability
- Update offline deployment docs
2025-11-17 12:54:32 +08:00
yangdx
c434879c7a Replace PyPDF2 with pypdf for PDF processing
- Update import from PyPDF2 to pypdf
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library
2025-11-17 12:54:32 +08:00
BukeLy
eb52ec94d7 feat: Add workspace isolation support for pipeline status
Problem:
In multi-tenant scenarios, different workspaces share a single global
pipeline_status namespace, causing pipelines from different tenants to
block each other, severely impacting concurrent processing performance.

Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline
  namespaces with pattern "{workspace}:pipeline" (following GraphDB pattern)
- Added workspace parameter to initialize_pipeline_status() for per-tenant
  isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
  * lightrag.py: process_document_queue(), aremove_document()
  * document_routes.py: background_delete_documents(), clear_documents(),
    cancel_pipeline(), get_pipeline_status(), delete_documents()

Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants

Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag

Code changes: 4 files, 141 insertions(+), 28 deletions(-)

Testing: All syntax checks passed, comprehensive workspace isolation tests completed
2025-11-17 12:53:44 +08:00
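
The naming rule and the fail-fast behaviour described above can be sketched as below. Only the "{workspace}:pipeline" pattern and the "pipeline_status" default come from the commit message; `pipeline_namespace`, `init_status`, `get_status` and the in-memory registry are hypothetical stand-ins.

```python
_namespaces: dict[str, dict] = {}


def pipeline_namespace(workspace: str = "") -> str:
    """Map a workspace to its pipeline namespace name."""
    if workspace:
        return f"{workspace}:pipeline"
    # Backward compatible: no workspace means the shared legacy namespace.
    return "pipeline_status"


def init_status(workspace: str = "") -> None:
    """Create the per-workspace status dict if it does not exist yet."""
    _namespaces.setdefault(pipeline_namespace(workspace), {"busy": False})


def get_status(workspace: str = "") -> dict:
    """Fail fast when a namespace was never initialized."""
    name = pipeline_namespace(workspace)
    if name not in _namespaces:
        raise RuntimeError(f"Pipeline namespace '{name}' is not initialized")
    return _namespaces[name]
```

With separate dicts per workspace, one tenant setting its busy flag no longer blocks another tenant's pipeline.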
anouarbm
c9e1c6c1c2 fix(api): change content field to list in query responses
BREAKING CHANGE: content field is now List[str] instead of str

- Add ReferenceItem Pydantic model for type safety
- Update /query and /query/stream to return content as list
- Update OpenAPI schema and examples
- Add migration guide to API README
- Fix RAGAS evaluation to handle list format

Addresses PR #2297 feedback. Tested with RAGAS: 97.37% score.
2025-11-03 04:57:08 +01:00
anouarbm
9d69e8d776 fix(api): Change content field from string to list in query responses
BREAKING CHANGE: The `content` field in query response references is now
an array of strings instead of a concatenated string. This preserves
individual chunk boundaries when a single file has multiple chunks.

Changes:
- Update QueryResponse Pydantic model to accept List[str] for content
- Modify query_text endpoint to return content as list (query_routes.py:425)
- Modify query_text_stream endpoint to support chunk content enrichment
- Update OpenAPI schema and examples to reflect array structure
- Update API README with breaking change notice and migration guide
- Fix RAGAS evaluation to flatten chunk content lists
2025-11-03 04:37:09 +01:00
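
For clients affected by this breaking change, a small normalizer can accept both the old string form and the new list form. This is an illustrative sketch; the helper names and the `file_path` key in the sample data are assumptions, not the documented schema.

```python
from typing import Optional, Union


def normalize_content(content: Optional[Union[str, list[str]]]) -> list[str]:
    """Return chunk content as a list, whatever shape the server sent."""
    if content is None:
        return []
    if isinstance(content, str):
        return [content]  # legacy response: one concatenated string
    return list(content)  # new response: one entry per chunk


def flatten_reference_content(references: list[dict]) -> list[str]:
    """Collect all chunk texts across references, e.g. for evaluation."""
    chunks: list[str] = []
    for reference in references:
        chunks.extend(normalize_content(reference.get("content")))
    return chunks
```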
anouarbm
0b5e3f9dc4 Use logger in RAG evaluation and optimize reference content joins 2025-11-02 18:43:53 +01:00
anouarbm
963ad4c637 docs: Add documentation and examples for include_chunk_content parameter
Added comprehensive documentation for the new include_chunk_content parameter
that enables retrieval of actual chunk text content in API responses.

Documentation Updates:
- Added "Include Chunk Content in References" section to API README
- Explained use cases: RAG evaluation, debugging, citations, transparency
- Provided JSON request/response examples
- Clarified parameter interaction with include_references

OpenAPI/Swagger Examples:
- Added "Response with chunk content" example to /query endpoint
- Shows complete reference structure with content field
- Demonstrates realistic chunk text content

This makes the feature discoverable through:
1. API documentation (README.md)
2. Interactive Swagger UI (http://localhost:9621/docs)
3. Code examples for developers
2025-11-02 17:53:27 +01:00
anouarbm
0bbef9814e Optimize RAGAS evaluation with parallel execution and chunk content enrichment
Added efficient RAG evaluation system with optimized API calls and comprehensive benchmarking.

Key Features:
- Single API call per evaluation (2x faster than before)
- Parallel evaluation based on MAX_ASYNC environment variable
- Chunk content enrichment in /query endpoint responses
- Comprehensive benchmark statistics (averages)
- NaN-safe metric calculations

API Changes:
- Added include_chunk_content parameter to QueryRequest (backward compatible)
- /query endpoint enriches references with actual chunk content when requested
- No breaking changes - default behavior unchanged

Evaluation Improvements:
- Parallel execution using asyncio.Semaphore (respects MAX_ASYNC)
- Shared HTTP client with connection pooling
- Proper timeout handling (3min connect, 5min read)
- Debug output for context retrieval verification
- Benchmark statistics with averages, min/max scores

Results:
- Average RAGAS Score: 0.9772
- Perfect Faithfulness: 1.0000
- Perfect Context Recall: 1.0000
- Perfect Context Precision: 1.0000
- Excellent Answer Relevance: 0.9087
2025-11-02 17:39:43 +01:00
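
The bounded parallelism and NaN-safe averaging described above might be sketched like this. `evaluate_case` is a stand-in for a real evaluation call, and the default limit of 4 is an assumption; only `asyncio.Semaphore` and the `MAX_ASYNC` variable come from the commit message.

```python
import asyncio
import math
import os


def max_async_from_env(default: int = 4) -> int:
    """Read the concurrency limit from MAX_ASYNC (default is assumed)."""
    return int(os.getenv("MAX_ASYNC", str(default)))


async def evaluate_case(case: str) -> float:
    """Stand-in for one evaluation; a real one would query the API."""
    await asyncio.sleep(0)
    return float(len(case))


async def evaluate_all(cases: list[str], max_async: int) -> list[float]:
    semaphore = asyncio.Semaphore(max_async)

    async def bounded(case: str) -> float:
        # At most `max_async` evaluations hold the semaphore at once.
        async with semaphore:
            return await evaluate_case(case)

    # gather() returns results in the same order as `cases`.
    return await asyncio.gather(*(bounded(case) for case in cases))


def nan_safe_mean(values: list[float]) -> float:
    """Average that ignores NaN entries; NaN if nothing valid remains."""
    valid = [value for value in values if not math.isnan(value)]
    return sum(valid) / len(valid) if valid else float("nan")
```

A caller would run it with `asyncio.run(evaluate_all(cases, max_async_from_env()))`.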
yangdx
61b57cbb5d Add PDF decryption support for password-protected files
• Add PDF_DECRYPT_PASSWORD env variable
• Check encryption status before reading
• Handle decrypt errors gracefully
• Log detailed error messages
• Support both encrypted/plain PDFs
2025-11-01 15:01:17 +08:00
yangdx
c46c1b26a9 Add pycryptodome dependency for PDF encryption support 2025-10-31 01:49:42 +08:00
yangdx
5155edd8d2 feat: Improve entity merge and edit UX
- **API:** The `graph/entity/edit` endpoint now returns a detailed `operation_summary` for better client-side handling of update, rename, and merge outcomes.
- **Web UI:** Added an "auto-merge on rename" option. The UI now gracefully handles merge success, partial failures (update OK, merge fail), and other errors with specific user feedback.
2025-10-27 23:42:08 +08:00
yangdx
97034f06e3 Add allow_merge parameter to entity update API endpoint 2025-10-27 14:30:27 +08:00
yangdx
6015e8bc68 Refactor graph utils to use unified persistence callback
- Add _persist_graph_updates function
- Remove duplicate callback functions
2025-10-26 20:20:16 +08:00
yangdx
bf1897a67e Normalize entity order for undirected graph consistency
• Normalize entity pairs for storage
• Update API docs for undirected edges
2025-10-26 15:53:31 +08:00
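
Normalizing the entity pair means that (A, B) and (B, A) resolve to the same stored edge. A minimal sketch, assuming plain string ordering and a hypothetical in-memory `edges` dict:

```python
def normalize_pair(src: str, tgt: str) -> tuple[str, str]:
    """Return the two entity names in a fixed (sorted) order."""
    return (src, tgt) if src <= tgt else (tgt, src)


edges: dict[tuple[str, str], dict] = {}


def upsert_edge(src: str, tgt: str, data: dict) -> None:
    """Store an undirected edge under its normalized key."""
    edges[normalize_pair(src, tgt)] = data
```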
Daniel.y
c82485d94d Merge pull request #2253 from Mobious/main
Allow users to provide keywords with QueryRequest
2025-10-25 11:26:54 +08:00
yangdx
78ad8873b8 Add cancellation check in delete loop 2025-10-24 14:47:20 +08:00
yangdx
743aefc655 Add pipeline cancellation feature for graceful processing termination
• Add cancel_pipeline API endpoint
• Implement PipelineCancelledException
• Add cancellation checks in main loop
• Handle task cancellation gracefully
• Mark cancelled docs as FAILED
2025-10-24 14:08:12 +08:00
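
The cancellation flow above can be sketched as a loop that checks a flag between documents. The exception name comes from the commit message; the `cancellation_requested` key, the `handle` callback and the status strings are assumptions for illustration.

```python
from typing import Callable


class PipelineCancelledException(Exception):
    """Raised when a cancellation request is seen inside the pipeline."""


def process_documents(
    doc_ids: list[str],
    pipeline_status: dict,
    handle: Callable[[str], None],
) -> dict[str, str]:
    statuses: dict[str, str] = {}
    try:
        for doc_id in doc_ids:
            # Check the (hypothetical) flag before starting each document.
            if pipeline_status.get("cancellation_requested"):
                raise PipelineCancelledException("cancelled by user")
            handle(doc_id)
            statuses[doc_id] = "PROCESSED"
    except PipelineCancelledException:
        # Documents that never ran are marked FAILED so they can be retried.
        for doc_id in doc_ids:
            statuses.setdefault(doc_id, "FAILED")
    finally:
        # Clear the flag so the next run starts clean.
        pipeline_status["cancellation_requested"] = False
    return statuses
```

Checking only between documents lets the document in progress finish, which is what makes the termination graceful in this sketch.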
Mobious
f24a261613 Allow users to provide keywords with QueryRequest 2025-10-23 12:53:19 -10:00
yangdx
8dc23eeff2 Fix RayAnything compatibility problem
• Use "preprocessed" to indicate multimodal processing is required
• Update DocProcessingStatus to handle status conversion automatically
• Remove multimodal_processed from DocStatus enum value
• Update UI filter logic
2025-10-22 20:15:29 +08:00
yangdx
162370b6e6 Add optional LLM cache deletion when deleting documents
• Add delete_llm_cache parameter to API
• Collect cache IDs from text chunks
• Delete cache after graph operations
• Update UI with new checkbox option
• Add i18n translations for cache option
2025-10-22 12:19:23 +08:00
yangdx
dc62c78f98 Add entity/relation chunk tracking with configurable source ID limits
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage
2025-10-20 15:24:15 +08:00
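
One plausible reading of the KEEP and FIFO strategies is sketched below. The semantics are an assumption (KEEP retains the oldest IDs and ignores overflow, FIFO evicts the oldest to make room); the commit message names the strategies but does not define them, and `apply_source_id_limit` is a hypothetical helper.

```python
def apply_source_id_limit(
    existing: list[str],
    incoming: list[str],
    limit: int,
    strategy: str,
) -> list[str]:
    """Merge chunk IDs for an entity or relation and cap the result."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Deduplicate while preserving first-seen order.
    merged = list(dict.fromkeys([*existing, *incoming]))
    if len(merged) <= limit:
        return merged
    if strategy == "KEEP":
        return merged[:limit]  # assumed: keep the oldest, drop overflow
    if strategy == "FIFO":
        return merged[-limit:]  # assumed: evict the oldest first
    raise ValueError(f"unknown strategy: {strategy}")
```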
yangdx
130b4959dc Add PREPROCESSED (multimodal_processed) status for multimodal document processing
• Add DocStatus.PREPROCESSED enum value
• Update API routes and response models
• Add preprocessed filter in web UI
• Update localization files
• Handle preprocessed status in deletion
2025-10-14 14:02:05 +08:00
yangdx
12facac506 Enhance graph API endpoints with detailed docs and field validation
- Remove redundant README section
- Add Pydantic field validation
- Expand endpoint docstrings
- Include request/response examples
- Document merge operation benefits
2025-10-10 12:49:00 +08:00
yangdx
85d1a563b3 Merge branch 'adminunblinded/main' 2025-10-10 12:31:47 +08:00
NeelM0906
b7c77396a0 Fix entity/relation creation endpoints to properly update vector stores
- Changed create_entity to use rag.acreate_entity() instead of direct graph manipulation
- Changed create_relation to use rag.acreate_relation() instead of direct graph manipulation
- This ensures vector embeddings are created and entities/relations are searchable
- Adds proper concurrency locks and metadata population
2025-10-09 17:02:17 -04:00
NeelM0906
f6d1fb98ac Fix Linting errors 2025-10-09 16:52:22 -04:00
NeelM0906
9f44e89de7 Add knowledge graph manipulation endpoints
Added three new REST API endpoints for direct knowledge graph manipulation:

- POST /graph/entity/create: Create new entities in the knowledge graph
- POST /graph/relation/create: Create relationships between entities
- POST /graph/entities/merge: Merge duplicate/misspelled entities while preserving relationships

The merge endpoint is particularly useful for consolidating entities discovered after document processing, fixing spelling errors, and cleaning up the knowledge graph. All relationships from source entities are transferred to the target entity, with intelligent handling of duplicate relationships.

Updated API documentation in lightrag/api/README.md with usage examples for all three endpoints.
2025-10-08 15:59:47 -04:00
Jon
cf2a024e37 feat: Add endpoint and UI to retry failed documents
Add a new `/documents/reprocess_failed` API endpoint and corresponding
UI button to retry processing of failed and pending documents. This
addresses a common recovery scenario when document processing fails due
to server crashes, network errors, or LLM service outages.

Backend changes:
- Add ReprocessResponse model with status, message, and track_id fields
- Add POST /documents/reprocess_failed endpoint that triggers background
  reprocessing of FAILED, PENDING, and interrupted PROCESSING documents
- Reuses existing apipeline_process_enqueue_documents for consistency
- Includes comprehensive docstring and logging for observability

Frontend changes:
- Add TypeScript types and API function for the new endpoint
- Add retry handler with intelligent polling (fast refresh → normal)
- Add "Retry Failed" button in Documents page toolbar
- Button disabled when pipeline is busy to prevent duplicate operations
- Complete i18n support (English and Chinese translations)

This feature provides a convenient way to recover from processing
failures without requiring a full filesystem rescan.
2025-10-04 16:46:29 -04:00
yangdx
83d99e1424 fix(OllamaAPI): Add validation to ensure last message is from user role
• Validate last message role is "user"
• Raise 400 error for invalid role
• Improve API request validation
• Prevent invalid message sequences
2025-10-01 20:48:37 +08:00
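
The validation described above can be sketched without any web framework. `BadRequest` is a hypothetical exception standing in for whatever HTTP 400 mechanism the API uses, and the error messages are made up for the example.

```python
class BadRequest(Exception):
    """Hypothetical error that a route would translate to HTTP 400."""

    status_code = 400


def validate_messages(messages: list[dict]) -> None:
    """Reject a chat request whose final message is not from the user."""
    if not messages:
        raise BadRequest("No messages provided")
    if messages[-1].get("role") != "user":
        raise BadRequest("Last message must be from user role")
```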
yangdx
df43afc89b Relax conversation history role validation requirements
• Remove strict role value checking
• Allow any non-empty string roles
2025-09-29 13:10:15 +08:00
yangdx
7cba458f22 Limit deprecated documents endpoint to 1000 records with fair distribution 2025-09-28 11:18:10 +08:00
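
One way to read "fair distribution" is that the 1000-record cap is shared across document groups, with unused quota from small groups handed to larger ones. The sketch below follows that assumption; `fair_limits` and the grouping by status are hypothetical and not taken from the commit.

```python
def fair_limits(counts: dict[str, int], total_limit: int = 1000) -> dict[str, int]:
    """Split total_limit across groups without exceeding any group's size."""
    limits = {name: 0 for name in counts}
    remaining = total_limit
    open_groups = [name for name, count in counts.items() if count > 0]
    while remaining > 0 and open_groups:
        # Give each still-open group an equal share of what is left.
        share = max(remaining // len(open_groups), 1)
        for name in list(open_groups):
            if remaining == 0:
                break
            take = min(share, counts[name] - limits[name], remaining)
            limits[name] += take
            remaining -= take
            if limits[name] == counts[name]:
                # Group exhausted; its unused share flows to the others.
                open_groups.remove(name)
    return limits
```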