Commit graph

629 commits

Author SHA1 Message Date
clssck
663ada943a chore: add citation system and enhance RAG UI components
Add citation tracking and display system across backend and frontend components.
Backend changes include citation.py for document attribution, enhanced query routes
with citation metadata, improved prompt templates, and PostgreSQL schema updates.
Frontend includes CitationMarker component, HoverCard UI, QuerySettings refinements,
and ChatMessage enhancements for displaying document sources. Update dependencies
and docker-compose test configuration for improved development workflow.
2025-12-01 17:50:00 +01:00
clssck
43af31f888 feat: add db_degree visibility and orphan connection UI
Graph Connectivity Awareness:
- Add db_degree property to all KG implementations (NetworkX, Postgres, Neo4j, Mongo, Memgraph)
- Show database degree vs visual degree in node panel with amber badge
- Add visual indicator (amber border) for nodes with hidden connections
- Add "Load X hidden connection(s)" button to expand hidden neighbors
- Add configurable "Expand Depth" setting (1-5) in graph settings
- Use global maxNodes setting for node expansion consistency

Orphan Connection UI:
- Add OrphanConnectionDialog component for manual orphan entity connection
- Add OrphanConnectionControl button in graph sidebar
- Expose /graph/orphans/connect API endpoint for frontend use

Backend Improvements:
- Add get_orphan_entities() and connect_orphan_entities() to base storage
- Add orphan connection configuration parameters
- Improve entity extraction with relationship density requirements

Frontend:
- Add graphExpandDepth and graphIncludeOrphans to settings store
- Add min_degree and include_orphans graph filtering parameters
- Update translations (en.json, zh.json)
2025-11-29 21:08:07 +01:00
clssck
d2c9e6e2ec test(lightrag): add orphan connection feature with quality validation tests
Implement automatic orphan entity connection system that identifies entities with
no relationships and creates meaningful connections via vector similarity + LLM
validation. This improves knowledge graph connectivity and retrieval quality.
Changes:
- Add orphan connection configuration parameters (thresholds, cross-connect settings)
- Implement aconnect_orphan_entities() method with 4-step validation pipeline
- Add SQL templates for efficient orphan and candidate entity queries
- Create POST /graph/orphans/connect API endpoint with configurable parameters
- Add orphan connection validation prompt for LLM-based relationship verification
- Include relationship density requirement in extraction prompts to prevent orphans
- Update docker-compose.test.yml with optimized extraction parameters
- Add quality validation test suite (run_quality_tests.py) for retrieval evaluation
- Add unit test framework (test_orphan_connection_quality.py) with test cases
- Enable auto-run of orphan connection after document processing
2025-11-28 18:23:30 +01:00
clssck
48c7732edc feat: add automatic entity resolution with 3-layer matching
Implement automatic entity resolution to prevent duplicate nodes in the
knowledge graph. The system uses a 3-layer approach:

1. Case-insensitive exact matching (free, instant)
2. Fuzzy string matching >85% threshold (free, instant)
3. Vector similarity + LLM verification (for acronyms/synonyms)

Key features:
- Pre-resolution phase prevents race conditions in parallel processing
- Numeric suffix detection blocks false matches (IL-4 ≠ IL-13)
- PostgreSQL alias cache for fast lookups on subsequent ingestion
- Configurable thresholds via environment variables

Bug fixes included:
- Fix fuzzy matching false positives for numbered entities
- Fix alias cache not being populated (missing db parameter)
- Skip entity_aliases table from generic id index creation

New files:
- lightrag/entity_resolution/ - Core resolution module
- tests/test_entity_resolution/ - Unit tests
- docker/postgres-age-vector/ - Custom PG image with pgvector + AGE
- docker-compose.test.yml - Integration test environment

Configuration (env.example):
- ENTITY_RESOLUTION_ENABLED=true
- ENTITY_RESOLUTION_FUZZY_THRESHOLD=0.85
- ENTITY_RESOLUTION_VECTOR_THRESHOLD=0.5
- ENTITY_RESOLUTION_MAX_CANDIDATES=3
2025-11-27 15:35:02 +01:00
yangdx
9c10c87554 Fix linting 2025-11-18 22:38:43 +08:00
EightyOliveira
dacca334e0 refactor(chunking): rename params and improve docstring for chunking_by_token_size 2025-11-18 15:46:28 +08:00
yangdx
702cfd2981 Fix document deletion concurrency control and validation logic
• Clarify job naming for single vs batch deletion
• Update job name validation in busy pipeline check
2025-11-18 13:59:24 +08:00
yangdx
4048fc4b89 Fix: auto-acquire pipeline when idle in document deletion
• Track if we acquired the pipeline lock
• Auto-acquire pipeline when idle
• Only release if we acquired it
• Prevent concurrent deletion conflicts
• Improve deletion job validation
2025-11-18 13:25:13 +08:00
yangdx
6cef8df159 Reduce log level and improve workspace mismatch message clarity
• Change warning to info level
• Simplify workspace mismatch wording
2025-11-18 08:25:21 +08:00
yangdx
9d7b7981ce Add pipeline status validation before document deletion 2025-11-17 14:58:10 +08:00
yangdx
e22ac52ebc Auto-initialize pipeline status in LightRAG.initialize_storages()
• Remove manual initialize_pipeline_status calls
• Auto-init in initialize_storages method
• Update error messages for clarity
• Warn on workspace conflicts
2025-11-17 12:54:33 +08:00
yangdx
52c812b9a0 Fix workspace isolation for pipeline status across all operations
- Fix final_namespace error in get_namespace_data()
- Fix get_workspace_from_request return type
- Add workspace param to pipeline status calls
2025-11-17 12:54:33 +08:00
yangdx
926960e957 Refactor workspace handling to use default workspace and namespace locks
- Remove DB-specific workspace configs
- Add default workspace auto-setting
- Replace global locks with namespace locks
- Simplify pipeline status management
- Remove redundant graph DB locking
2025-11-17 12:54:33 +08:00
yangdx
2fb57e767d Fix embedding token limit initialization order
* Capture max_token_size before decorator
* Apply wrapper after capturing attribute
* Prevent decorator from stripping dataclass
* Ensure token limit is properly set
2025-11-17 12:54:32 +08:00
yangdx
f0254773c6 Convert embedding_token_limit from property to field with __post_init__
• Remove property decorator
• Add field with init=False
• Set value in __post_init__ method
• embedding_token_limit is now in config dictionary
2025-11-17 12:54:32 +08:00
yangdx
14a6c24ed7 Add configurable embedding token limit with validation
- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
2025-11-17 12:54:32 +08:00
yangdx
7d394fb0a4 Replace asyncio.iscoroutine with inspect.isawaitable for better detection 2025-11-17 12:54:32 +08:00
yangdx
af5423919b Support async chunking functions in LightRAG processing pipeline
- Add Awaitable and Union type imports
- Update chunking_func type annotation
- Handle coroutine results with await
- Add return type validation
- Update docstring for async support
2025-11-17 12:54:32 +08:00
Tong Da
5016025453 easier version: detect chunking_func result is coroutine or not 2025-11-17 12:54:32 +08:00
Tong Da
7740500693 support async chunking func to improve processing performance when a heavy chunking_func is passed in by user 2025-11-17 12:54:32 +08:00
BukeLy
18a4870229 fix: Add default workspace support for backward compatibility
Fixes two compatibility issues in workspace isolation:

1. Problem: lightrag_server.py calls initialize_pipeline_status()
   without workspace parameter, causing pipeline to initialize in
   global namespace instead of rag's workspace.

   Solution: Add set_default_workspace() mechanism in shared_storage.
   LightRAG.initialize_storages() now sets default workspace, which
   initialize_pipeline_status() uses when called without parameters.

2. Problem: /health endpoint hardcoded to use "pipeline_status",
   cannot return workspace-specific status or support frontend
   workspace selection.

   Solution: Add LIGHTRAG-WORKSPACE header support. Endpoint now
   extracts workspace from header or falls back to server default,
   returning correct workspace-specific pipeline status.

Changes:
- lightrag/kg/shared_storage.py: Add set/get_default_workspace()
- lightrag/lightrag.py: Call set_default_workspace() in initialize_storages()
- lightrag/api/lightrag_server.py: Add get_workspace_from_request() helper,
  update /health endpoint to support LIGHTRAG-WORKSPACE header

Testing:
- Backward compatibility: Old code works without modification
- Multi-instance safety: Explicit workspace passing preserved
- /health endpoint: Supports both default and header-specified workspaces

Related: #2353
2025-11-17 12:54:20 +08:00
BukeLy
eb52ec94d7 feat: Add workspace isolation support for pipeline status
Problem:
In multi-tenant scenarios, different workspaces share a single global
pipeline_status namespace, causing pipelines from different tenants to
block each other, severely impacting concurrent processing performance.

Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline
  namespaces with pattern "{workspace}:pipeline" (following GraphDB pattern)
- Added workspace parameter to initialize_pipeline_status() for per-tenant
  isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
  * lightrag.py: process_document_queue(), aremove_document()
  * document_routes.py: background_delete_documents(), clear_documents(),
    cancel_pipeline(), get_pipeline_status(), delete_documents()

Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants

Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag

Code changes: 4 files, 141 insertions(+), 28 deletions(-)

Testing: All syntax checks passed, comprehensive workspace isolation tests completed
2025-11-17 12:53:44 +08:00
yangdx
ea141e2779 Fix: Remove redundant entity/relation chunk deletions 2025-11-07 02:56:16 +08:00
yangdx
04ed709b34 Optimize entity deletion by batching edge queries to avoid N+1 problem
• Add batch get_nodes_edges_batch call
• Remove individual get_node_edges calls
• Improve query performance
2025-11-06 21:34:47 +08:00
yangdx
afb5e5c1cb Fix edge cleanup when deleting entities to prevent orphaned relationships
- Track edges to delete in set
- Clean VDB before node deletion
- Remove from relation chunks storage
- Prevent orphaned relationship data
2025-10-31 02:36:15 +08:00
yangdx
c36afecba4 Remove redundant await call in file extraction pipeline 2025-10-30 20:35:41 +08:00
yangdx
3fa79026e0 Fix Entity Source IDs Tracking Problem
- Handle existing node updates properly in edge merging stage
- Fix source_ids merging logic
- Reorder entity deletion and optimize node operations
- Delete relationships before entities
- Add edge existence debugging logs
2025-10-29 01:19:55 +08:00
yangdx
c81a56a113 Fix entity and relationship deletion when no chunk references remain 2025-10-28 16:02:35 +08:00
yangdx
5155edd8d2 feat: Improve entity merge and edit UX
- **API:** The `graph/entity/edit` endpoint now returns a detailed `operation_summary` for better client-side handling of update, rename, and merge outcomes.
- **Web UI:** Added an "auto-merge on rename" option. The UI now gracefully handles merge success, partial failures (update OK, merge fail), and other errors with specific user feedback.
2025-10-27 23:42:08 +08:00
yangdx
2c09adb8d3 Add chunk tracking support to entity merge functionality
- Pass chunk storages to merge function
- Merge relation chunk tracking data
- Merge entity chunk tracking data
- Delete old chunk tracking records
- Persist chunk storage updates
2025-10-27 02:06:21 +08:00
yangdx
3fbd704bf9 Enhance entity/relation editing with chunk tracking synchronization
• Add chunk storage sync to edit ops
• Implement incremental chunk ID updates
• Support entity renaming migrations
• Normalize relation keys consistently
• Preserve chunk references on edits
2025-10-26 14:34:56 +08:00
yangdx
29bf593663 Fix entity and relation chunk cleanup in deletion pipeline
• Delete from entity_chunks storage
• Delete from relation_chunks storage
2025-10-25 22:32:27 +08:00
yangdx
a9bc348446 Remove enable_logging parameter from data init lock call 2025-10-25 11:48:14 +08:00
yangdx
97a2ee4ef1 Rename rebuild function name and improve relationship logging format 2025-10-25 11:17:43 +08:00
yangdx
a9ec15e669 Resolve lock leakage issue during user cancellation handling
• Change default log level to INFO
• Force enable error logging output
• Add lock cleanup rollback protection
• Handle LLM cache persistence errors
• Fix async task exception handling
2025-10-25 03:06:45 +08:00
yangdx
77336e50b6 Improve error handling and add cancellation checks in pipeline 2025-10-24 17:54:17 +08:00
yangdx
743aefc655 Add pipeline cancellation feature for graceful processing termination
• Add cancel_pipeline API endpoint
• Implement PipelineCancelledException
• Add cancellation checks in main loop
• Handle task cancellation gracefully
• Mark cancelled docs as FAILED
2025-10-24 14:08:12 +08:00
yangdx
b76350a3bc Fix linting 2025-10-22 12:53:42 +08:00
yangdx
d7e2527e1a Handle cache deletion errors gracefully instead of raising exceptions 2025-10-22 12:53:19 +08:00
yangdx
162370b6e6 Add optional LLM cache deletion when deleting documents
• Add delete_llm_cache parameter to API
• Collect cache IDs from text chunks
• Delete cache after graph operations
• Update UI with new checkbox option
• Add i18n translations for cache option
2025-10-22 12:19:23 +08:00
yangdx
e5e16b7bd1 Fix Redis data migration error
• Use proper Redis connection context
• Fix namespace pattern for key scanning
• Propagate storage check exceptions
• Remove defensive error swallowing
2025-10-21 16:27:04 +08:00
yangdx
a9fec26798 Add file path limit configuration for entities and relations
• Add MAX_FILE_PATHS env variable
• Implement file path count limiting
• Support KEEP/FIFO strategies
• Add truncation placeholder
• Remove old build_file_path function
2025-10-20 20:12:53 +08:00
yangdx
dc62c78f98 Add entity/relation chunk tracking with configurable source ID limits
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage
2025-10-20 15:24:15 +08:00
yangdx
9f49e56a44 Merge branch 'main' into feat-entity-size-caps 2025-10-17 15:59:44 +08:00
DivinesLight
c06522b927 Get max source Id config from .env and lightRAG init 2025-10-15 18:24:38 +05:00
yangdx
29bac49fb9 Handle empty query results by returning None instead of fail responses
• Return None when no context found
• Add structured failure metadata
• Use PROMPTS["fail_response"] for content
• Keep API compatible
2025-10-15 12:04:49 +08:00
yangdx
130b4959dc Add PREPROCESSED (multimodal_processed) status for multimodal document processing
• Add DocStatus.PREPROCESSED enum value
• Update API routes and response models
• Add preprocessed filter in web UI
• Update localization files
• Handle preprocessed status in deletion
2025-10-14 14:02:05 +08:00
yangdx
074f0c8b23 Update docstring for adelete_by_doc_id method clarity 2025-10-12 10:12:45 +08:00
yangdx
457d51952e Add doc_name field to full docs storage
- Store file_path in full_docs storage
- Update PostgreSQL implementation by map file_path to doc_name
- Other storage implementation automatically handles the new field
2025-10-05 11:44:27 +08:00
yangdx
1766cddd6c Fix mode parameter serialization error in Ollama chat API
• Use mode.value for API requests
• Add debug logging in aquery_llm
2025-09-27 15:11:51 +08:00