Commit graph

606 commits

Author SHA1 Message Date
clssck
abb44eccb1 feat(lightrag): improve entity extraction prompts and rerank chunking
Enhance entity extraction with better structured prompts:
- Reorganize prompt format for improved clarity and consistency
- Add XML-style formatting tags for better LLM parsing
- Include language parameter in keywords extraction cache key
- Fix language parameter usage in keywords_extraction prompt

Improve rerank module with chunking fixes:
- Fix top_n behavior to limit documents instead of chunks
- Add Cohere reranker support with proper chunking
- Improve error handling for rerank API responses

Update operate.py:
- Better entity extraction parsing and validation
- Improved cache key generation for multilingual support
2025-12-12 16:45:14 +01:00
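Note on abb44eccb1: the top_n fix above means the limit counts documents, not the chunks they were split into. A minimal sketch of that idea follows. The function name top_documents, the (doc_index, score) input shape, and the choice to score a document by its best chunk are assumptions made for illustration, not the rerank module's actual code.

```python
def top_documents(chunk_scores, top_n):
    """Collapse per-chunk rerank scores to per-document scores, then apply top_n.

    chunk_scores: iterable of (doc_index, score) pairs, one per reranked chunk.
    Assumption: a document is scored by its best-scoring chunk.
    """
    best = {}
    for doc_index, score in chunk_scores:
        # Keep only the highest chunk score seen for each document.
        if doc_index not in best or score > best[doc_index]:
            best[doc_index] = score
    # Rank documents by aggregated score, highest first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    # top_n now limits documents rather than chunks.
    return ranked[:top_n]


if __name__ == "__main__":
    scores = [(0, 0.2), (0, 0.9), (1, 0.5), (2, 0.7), (1, 0.1)]
    print(top_documents(scores, top_n=2))
```

Applying top_n before this aggregation step could return several chunks of one document and crowd out the others, which is the behavior the commit describes fixing.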
clssck
59e89772de refactor: consolidate to PostgreSQL-only backend and modernize stack
Remove legacy storage implementations and deprecated examples:
- Delete FAISS, JSON, Memgraph, Milvus, MongoDB, Nano Vector DB, Neo4j, NetworkX, Qdrant, Redis storage backends
- Remove Kubernetes deployment manifests and installation scripts
- Delete unofficial examples for deprecated backends and offline deployment docs
Streamline core infrastructure:
- Consolidate storage layer to PostgreSQL-only implementation
- Add full-text search caching with FTS cache module
- Implement metrics collection and monitoring pipeline
- Add explain and metrics API routes
Modernize frontend and tooling:
- Switch web UI to Bun with bun.lock, remove npm and pnpm lockfiles
- Update Dockerfile for PostgreSQL-only deployment
- Add Makefile for common development tasks
- Update environment and configuration examples
Enhance evaluation and testing capabilities:
- Add prompt optimization with DSPy and auto-tuning
- Implement ground truth regeneration and variant testing
- Add prompt debugging and response comparison utilities
- Expand test coverage with new integration scenarios
Simplify dependencies and configuration:
- Remove offline-specific requirement files
- Update pyproject.toml with streamlined dependencies
- Add Python version pinning with .python-version
- Create project guidelines in CLAUDE.md and AGENTS.md
2025-12-12 16:28:49 +01:00
clssck
da9070ecf7 refactor: remove legacy storage implementations and k8s deployment
Remove deprecated storage backends and Kubernetes deployment configuration:
- Delete unused storage implementations: FAISS, JSON, Memgraph, Milvus, MongoDB, Nano Vector DB, Neo4j, NetworkX, Qdrant, Redis
- Remove Kubernetes deployment manifests and installation scripts
- Delete legacy examples for deprecated backends
- Consolidate to PostgreSQL-only storage backend
Streamline dependencies and add new capabilities:
- Remove deprecated code documentation and migration guides
- Add full-text search caching layer with FTS cache module
- Implement metrics collection and monitoring pipeline
- Add explain and metrics API routes
- Simplify configuration with PostgreSQL-focused setup
Update documentation and configuration:
- Rewrite README to focus on supported features
- Update environment and configuration examples
- Remove Kubernetes-specific documentation
- Add new utility scripts for PDF uploads and pipeline monitoring
2025-12-09 14:02:00 +01:00
clssck
95c83abcf8 feat(lightrag,lightrag_webui): add S3 storage integration and UI
Add S3 storage client and API routes for document management:
- Implement s3_routes.py with file upload, download, delete endpoints
- Enhance s3_client.py with improved error handling and operations
- Add S3 browser UI component with file viewing and management
- Implement FileViewer and PDFViewer components for storage preview
- Add Resizable and Sheet UI components for layout control
Update backend infrastructure:
- Add bulk operations and parameterized queries to postgres_impl.py
- Enhance document routes with improved type hints
- Update API server registration for new S3 routes
- Refine upload routes and utility functions
Modernize web UI:
- Integrate S3 browser into main application layout
- Update localization files for storage UI strings
- Add storage settings to application configuration
- Sync package dependencies and lock files
Remove obsolete reproduction script:
- Delete reproduce_citation.py (replaced by test suite)
Update configuration:
- Enhance pyrightconfig.json for stricter type checking
2025-12-07 11:04:38 +01:00
clssck
082a5a8fad test(lightrag,api): add comprehensive test coverage and S3 support
Add extensive test suites for API routes and utilities:
- Implement test_search_routes.py (406 lines) for search endpoint validation
- Implement test_upload_routes.py (724 lines) for document upload workflows
- Implement test_s3_client.py (618 lines) for S3 storage operations
- Implement test_citation_utils.py (352 lines) for citation extraction
- Implement test_chunking.py (216 lines) for text chunking validation
Add S3 storage client implementation:
- Create lightrag/storage/s3_client.py with S3 operations
- Add storage module initialization with exports
- Integrate S3 client with document upload handling
Enhance API routes and core functionality:
- Add search_routes.py with full-text and graph search endpoints
- Add upload_routes.py with multipart document upload support
- Update operate.py with bulk operations and health checks
- Enhance postgres_impl.py with bulk upsert and parameterized queries
- Update lightrag_server.py to register new API routes
- Improve utils.py with citation and formatting utilities
Update dependencies and configuration:
- Add S3 and test dependencies to pyproject.toml
- Update docker-compose.test.yml for testing environment
- Sync uv.lock with new dependencies
Apply code quality improvements across all modified files:
- Add type hints to function signatures
- Update imports and router initialization
- Fix logging and error handling
2025-12-05 23:13:39 +01:00
clssck
dd1413f3eb test(lightrag,examples): add prompt accuracy and quality tests
Add comprehensive test suites for prompt evaluation:
- test_prompt_accuracy.py: 365 lines testing prompt extraction accuracy
- test_prompt_quality_deep.py: 672 lines for deep quality analysis
- Refactor prompt.py to consolidate optimized variants (removed prompt_optimized.py)
- Apply ruff formatting and type hints across 30 files
- Update pyrightconfig.json for static type checking
- Modernize reproduce scripts and examples with improved type annotations
- Sync uv.lock dependencies
2025-12-05 16:39:52 +01:00
clssck
69358d830d test(lightrag,examples,api): comprehensive ruff formatting and type hints
Format entire codebase with ruff and add type hints across all modules:
- Apply ruff formatting to all Python files (121 files, 17K insertions)
- Add type hints to function signatures throughout lightrag core and API
- Update test suite with improved type annotations and docstrings
- Add pyrightconfig.json for static type checking configuration
- Create prompt_optimized.py and test_extraction_prompt_ab.py test files
- Update ruff.toml and .gitignore for improved linting configuration
- Standardize code style across examples, reproduce scripts, and utilities
2025-12-05 15:17:06 +01:00
clssck
a6b87df758 feat(postgres): add bulk operations and health check
- Implement bulk upsert_nodes/edges via UNWIND reducing round trips
- Add health_check for graph connectivity and AGE catalog status
- Switch to parameterized queries preventing Cypher injection
- Fix node ID sanitization: strip control chars, escape quotes
2025-12-03 18:19:26 +00:00
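Note on a6b87df758: the node ID sanitization described above (strip control characters, escape quotes) could look roughly like the sketch below. The function name sanitize_node_id, the exact character range, and the extra backslash escaping are assumptions. The commit message does not spell out the rules used in postgres_impl.py.

```python
import re

# ASCII control characters (0x00-0x1F and 0x7F); the exact set is an assumption.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_node_id(node_id: str) -> str:
    """Strip control characters, then escape backslashes and double quotes."""
    cleaned = _CONTROL_CHARS.sub("", node_id)
    # Escape backslashes first so the quote escapes added next are not doubled.
    cleaned = cleaned.replace("\\", "\\\\")
    return cleaned.replace('"', '\\"')


if __name__ == "__main__":
    print(sanitize_node_id('bad\x00id "quoted"'))
```

Sanitization of this kind complements parameterized queries rather than replacing them: values passed as parameters never become part of the query text in the first place.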
clssck
663ada943a chore: add citation system and enhance RAG UI components
Add citation tracking and display system across backend and frontend components.
Backend changes include citation.py for document attribution, enhanced query routes
with citation metadata, improved prompt templates, and PostgreSQL schema updates.
Frontend includes CitationMarker component, HoverCard UI, QuerySettings refinements,
and ChatMessage enhancements for displaying document sources. Update dependencies
and docker-compose test configuration for improved development workflow.
2025-12-01 17:50:00 +01:00
clssck
43af31f888 feat: add db_degree visibility and orphan connection UI
Graph Connectivity Awareness:
- Add db_degree property to all KG implementations (NetworkX, Postgres, Neo4j, Mongo, Memgraph)
- Show database degree vs visual degree in node panel with amber badge
- Add visual indicator (amber border) for nodes with hidden connections
- Add "Load X hidden connection(s)" button to expand hidden neighbors
- Add configurable "Expand Depth" setting (1-5) in graph settings
- Use global maxNodes setting for node expansion consistency

Orphan Connection UI:
- Add OrphanConnectionDialog component for manual orphan entity connection
- Add OrphanConnectionControl button in graph sidebar
- Expose /graph/orphans/connect API endpoint for frontend use

Backend Improvements:
- Add get_orphan_entities() and connect_orphan_entities() to base storage
- Add orphan connection configuration parameters
- Improve entity extraction with relationship density requirements

Frontend:
- Add graphExpandDepth and graphIncludeOrphans to settings store
- Add min_degree and include_orphans graph filtering parameters
- Update translations (en.json, zh.json)
2025-11-29 21:08:07 +01:00
clssck
ef7327bb3e chore(docker-compose, lightrag): optimize test infrastructure and add evaluation tools
Add comprehensive E2E testing infrastructure with PostgreSQL performance tuning,
Gunicorn multi-worker support, and evaluation scripts for RAGAS-based quality
assessment. Introduces 4 new evaluation utilities: compare_results.py for A/B test
analysis, download_wikipedia.py for reproducible test datasets, e2e_test_harness.py
for automated evaluation pipelines, and ingest_test_docs.py for batch document
ingestion. Updates docker-compose.test.yml with aggressive async settings, memory
limits, and optimized chunking parameters. Parallelize entity summarization in
operate.py for improved extraction performance. Fix typos in merge node/edge logs.
2025-11-29 10:39:20 +01:00
clssck
48c7732edc feat: add automatic entity resolution with 3-layer matching
Implement automatic entity resolution to prevent duplicate nodes in the
knowledge graph. The system uses a 3-layer approach:

1. Case-insensitive exact matching (free, instant)
2. Fuzzy string matching >85% threshold (free, instant)
3. Vector similarity + LLM verification (for acronyms/synonyms)

Key features:
- Pre-resolution phase prevents race conditions in parallel processing
- Numeric suffix detection blocks false matches (IL-4 ≠ IL-13)
- PostgreSQL alias cache for fast lookups on subsequent ingestion
- Configurable thresholds via environment variables

Bug fixes included:
- Fix fuzzy matching false positives for numbered entities
- Fix alias cache not being populated (missing db parameter)
- Skip entity_aliases table from generic id index creation

New files:
- lightrag/entity_resolution/ - Core resolution module
- tests/test_entity_resolution/ - Unit tests
- docker/postgres-age-vector/ - Custom PG image with pgvector + AGE
- docker-compose.test.yml - Integration test environment

Configuration (env.example):
- ENTITY_RESOLUTION_ENABLED=true
- ENTITY_RESOLUTION_FUZZY_THRESHOLD=0.85
- ENTITY_RESOLUTION_VECTOR_THRESHOLD=0.5
- ENTITY_RESOLUTION_MAX_CANDIDATES=3
2025-11-27 15:35:02 +01:00
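Note on 48c7732edc: the first two resolution layers and the numeric suffix guard can be sketched with the standard library alone. This is an illustration under stated assumptions: difflib.SequenceMatcher stands in for whatever fuzzy matcher the module uses, the names resolve and numeric_suffix are hypothetical, and layer 3 (vector similarity plus LLM verification) is omitted.

```python
import re
from difflib import SequenceMatcher

_TRAILING_NUMBER = re.compile(r"(\d+)\s*$")


def numeric_suffix(name):
    """Return the trailing digits of a name, or None if there are none."""
    match = _TRAILING_NUMBER.search(name)
    return match.group(1) if match else None


def resolve(name, known, fuzzy_threshold=0.85):
    """Return the known entity that `name` resolves to, or None."""
    lowered = name.casefold()
    # Layer 1: case-insensitive exact match.
    for candidate in known:
        if candidate.casefold() == lowered:
            return candidate
    # Layer 2: fuzzy match, skipping candidates whose numeric suffix differs
    # (so that IL-4 is never merged into IL-13).
    best, best_ratio = None, 0.0
    for candidate in known:
        if numeric_suffix(candidate) != numeric_suffix(name):
            continue
        ratio = SequenceMatcher(None, lowered, candidate.casefold()).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best if best_ratio > fuzzy_threshold else None


if __name__ == "__main__":
    known = ["Microsoft Corporation", "IL-13"]
    print(resolve("microsoft corporation", known))
    print(resolve("Microsft Corporation", known))
    print(resolve("IL-4", known))
```

Running the cheap layers first keeps the expensive vector and LLM step for names that neither exact nor fuzzy matching can settle, such as acronyms and synonyms.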
yangdx
4f12fe121d Change entity extraction logging from warning to info level
• Reduce log noise for empty entities
2025-11-27 11:00:34 +08:00
yangdx
f988a22652 Add token limit validation for character-only chunking
- Add ChunkTokenLimitExceededError exception
- Validate chunks against token limits
- Include chunk preview in error messages
- Add comprehensive test coverage
- Log warnings for oversized chunks
2025-11-19 18:32:43 +08:00
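Note on f988a22652: a rough sketch of validating chunks against a token limit and including a preview in the error. Only the exception name ChunkTokenLimitExceededError comes from the commit message. The function validate_chunks, its parameters, and the whitespace tokenizer used as a default are assumptions for illustration.

```python
class ChunkTokenLimitExceededError(ValueError):
    """Raised when a chunk is larger than the configured token limit."""

    def __init__(self, token_count, limit, preview):
        super().__init__(
            f"Chunk has {token_count} tokens, limit is {limit}. Preview: {preview!r}"
        )
        self.token_count = token_count
        self.limit = limit
        self.preview = preview


def validate_chunks(chunks, limit, count_tokens=None, preview_chars=40):
    """Return chunks unchanged, or raise on the first chunk over the limit."""
    if count_tokens is None:
        # Assumption: a whitespace split stands in for the real tokenizer.
        count_tokens = lambda text: len(text.split())
    for chunk in chunks:
        token_count = count_tokens(chunk)
        if token_count > limit:
            # Include a short preview so the offending chunk is easy to locate.
            raise ChunkTokenLimitExceededError(
                token_count, limit, chunk[:preview_chars]
            )
    return chunks


if __name__ == "__main__":
    try:
        validate_chunks(["short chunk", "one two three four five"], limit=4)
    except ChunkTokenLimitExceededError as exc:
        print(exc)
```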
yangdx
e77340d4a1 Adjust chunking parameters to match the default environment variable settings 2025-11-18 23:14:50 +08:00
EightyOliveira
dacca334e0 refactor(chunking): rename params and improve docstring for chunking_by_token_size 2025-11-18 15:46:28 +08:00
yangdx
ab4d7ac2b0 Add configurable embedding token limit with validation
- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
2025-11-14 19:28:36 +08:00
yangdx
03cc6262c4 Prohibit direct access to internal functions of EmbeddingFunc.
• Fix similarity search error in query stage
• Remove redundant null checks
• Improve log readability
2025-11-08 01:43:36 +08:00
yangdx
ec2ea4fd3f Rename function and variables for clarity in context building
- Rename _build_llm_context to _build_context_str
- Change text_units_context to chunks_context
- Move string building before early return
- Update log messages and comments
- Consistent variable naming throughout
2025-11-01 12:15:24 +08:00
yangdx
3fa79026e0 Fix Entity Source IDs Tracking Problem
- Handle existing node updates properly in edge merging stage
- Fix source_ids merging logic
- Reorder entity deletion and optimize node operations
- Delete relationships before entities
- Add edge existence debugging logs
2025-10-29 01:19:55 +08:00
yangdx
29c4a91dc3 Move relationship ID sorting to before vector DB operations
• Remove verbose entity rebuild logging
• Sort IDs before vector DB updates
• Keep graph storage with original order
2025-10-28 19:13:48 +08:00
yangdx
5ee9a2f8c6 Fix entity consistency in knowledge graph rebuilding and merging
• Sort src/tgt for consistent ordering
• Create missing nodes before edges
• Update entity chunks storage
• Pass entity_vdb to rebuild function
• Ensure entities exist in all storages
2025-10-25 21:37:03 +08:00
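Note on 5ee9a2f8c6: sorting src/tgt gives an undirected relationship one canonical key regardless of the direction it was extracted in. A minimal sketch, with edge_key as a hypothetical helper name:

```python
def edge_key(src, tgt):
    """Return a direction-independent key for an undirected relationship."""
    # Sorting the endpoints makes (A, B) and (B, A) map to the same key.
    return tuple(sorted((src, tgt)))


if __name__ == "__main__":
    print(edge_key("Paris", "France") == edge_key("France", "Paris"))
```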
yangdx
97a2ee4ef1 Rename rebuild function name and improve relationship logging format 2025-10-25 11:17:43 +08:00
yangdx
a9ec15e669 Resolve lock leakage issue during user cancellation handling
• Change default log level to INFO
• Force enable error logging output
• Add lock cleanup rollback protection
• Handle LLM cache persistence errors
• Fix async task exception handling
2025-10-25 03:06:45 +08:00
yangdx
77336e50b6 Improve error handling and add cancellation checks in pipeline 2025-10-24 17:54:17 +08:00
yangdx
78ad8873b8 Add cancellation check in delete loop 2025-10-24 14:47:20 +08:00
yangdx
743aefc655 Add pipeline cancellation feature for graceful processing termination
• Add cancel_pipeline API endpoint
• Implement PipelineCancelledException
• Add cancellation checks in main loop
• Handle task cancellation gracefully
• Mark cancelled docs as FAILED
2025-10-24 14:08:12 +08:00
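Note on 743aefc655: the cancellation flow described above can be sketched as a flag checked at the top of each loop iteration. PipelineCancelledException and the FAILED status come from the commit message. The threading.Event flag, the PENDING and PROCESSED status names, and the function names are assumptions, and the real pipeline is presumably asynchronous.

```python
import threading


class PipelineCancelledException(Exception):
    """Raised when a cancellation request is observed."""


def check_cancelled(cancel_event):
    if cancel_event.is_set():
        raise PipelineCancelledException("pipeline cancelled by user")


def process_documents(doc_ids, cancel_event, handle):
    """Process documents in order, stopping cleanly once cancellation is requested."""
    status = {doc_id: "PENDING" for doc_id in doc_ids}
    try:
        for doc_id in doc_ids:
            # Cancellation check in the main loop, before starting each document.
            check_cancelled(cancel_event)
            handle(doc_id)
            status[doc_id] = "PROCESSED"
    except PipelineCancelledException:
        # Documents that never finished are marked FAILED.
        for doc_id, state in status.items():
            if state == "PENDING":
                status[doc_id] = "FAILED"
    return status


if __name__ == "__main__":
    cancel = threading.Event()

    def handle(doc_id):
        if doc_id == "doc-2":
            cancel.set()  # simulate a cancel request arriving mid-run

    print(process_documents(["doc-1", "doc-2", "doc-3"], cancel, handle))
```

Checking the flag only between documents lets the document in progress finish, so no document is left half-written when the pipeline stops.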
yangdx
00aa5e53a7 Improve entity identifier truncation warning message format 2025-10-22 15:56:19 +08:00
yangdx
904b1f46f9 Add entity name length truncation with configurable limit 2025-10-22 14:02:30 +08:00
yangdx
a809245aed Preserve file path order by using lists instead of sets 2025-10-21 18:57:54 +08:00
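Note on a809245aed: a set removes duplicates but does not keep insertion order, which is why switching to lists matters here. One common way to deduplicate while keeping first-seen order is dict.fromkeys. The helper name merge_file_paths is hypothetical and the commit may do this differently.

```python
def merge_file_paths(existing, new):
    """Merge two path lists, dropping duplicates but keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys([*existing, *new]))


if __name__ == "__main__":
    print(merge_file_paths(["b.txt", "a.txt"], ["a.txt", "c.txt"]))
```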
yangdx
fe890fca15 Improve formatting of limit method info in rebuild functions 2025-10-21 18:34:06 +08:00
yangdx
3ed2abd82c Improve logging to show source ID ratios when skipping entities/edges 2025-10-21 16:20:34 +08:00
yangdx
80668aae22 Improve file path truncation labels and UI consistency
• Standardize FIFO/KEEP truncation labels
• Update UI truncation text format
2025-10-21 15:39:31 +08:00
yangdx
be3d274a0b Refactor node and edge merging logic with improved code structure
• Add numbered steps for clarity
• Improve early return handling
• Enhance file path limiting logic
2025-10-21 15:16:47 +08:00
yangdx
a5253244f9 Simplify skip logging and reduce pipeline status updates 2025-10-21 06:33:34 +08:00
yangdx
cd1c48beaf Standardize placeholder format to use colon separator consistently 2025-10-21 05:03:57 +08:00
yangdx
1154c5683f Refactor deduplication calculation and remove unused variables 2025-10-21 04:41:15 +08:00
yangdx
665f60b90f Refactor entity/relation merge to consolidate VDB operations within functions
• Move VDB upserts into merge functions
• Fix early return data structure issues
• Update status messages (IGNORE_NEW → KEEP)
• Consolidate error handling paths
• Improve relationship content format
2025-10-21 03:19:34 +08:00
yangdx
e01c998ee9 Track placeholders in file paths for accurate source count display
• Add has_placeholder tracking variable
• Detect placeholder patterns in paths
• Show + sign for truncated counts
2025-10-20 23:48:04 +08:00
yangdx
e0fd31a60d Fix logging message formatting 2025-10-20 22:09:09 +08:00
yangdx
a9fec26798 Add file path limit configuration for entities and relations
• Add MAX_FILE_PATHS env variable
• Implement file path count limiting
• Support KEEP/FIFO strategies
• Add truncation placeholder
• Remove old build_file_path function
2025-10-20 20:12:53 +08:00
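Note on a9fec26798: a sketch of limiting a file path list under the two strategies named above. It assumes KEEP retains the earliest paths and ignores newer ones (the commit 665f60b90f renames IGNORE_NEW to KEEP) and that FIFO drops the oldest paths first. The function name, the placeholder text, and its position at the end of the list are all hypothetical.

```python
def limit_file_paths(paths, max_paths, strategy="FIFO", placeholder="...truncated..."):
    """Cap a list of file paths and mark the result when entries were dropped."""
    if len(paths) <= max_paths:
        return list(paths)
    if strategy == "KEEP":
        # Keep the earliest entries and ignore anything added later.
        kept = paths[:max_paths]
    elif strategy == "FIFO":
        # First in, first out: the oldest entries are dropped.
        kept = paths[-max_paths:]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: a placeholder entry signals that the list was truncated.
    return list(kept) + [placeholder]


if __name__ == "__main__":
    paths = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"]
    print(limit_file_paths(paths, 2, "KEEP"))
    print(limit_file_paths(paths, 2, "FIFO"))
```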
yangdx
dc62c78f98 Add entity/relation chunk tracking with configurable source ID limits
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage
2025-10-20 15:24:15 +08:00
yangdx
9f49e56a44 Merge branch 'main' into feat-entity-size-caps 2025-10-17 15:59:44 +08:00
yangdx
35cd567c9e Allow missing related chunks in knowledge graph queries 2025-10-17 00:19:30 +08:00
DivinesLight
c06522b927 Get max source ID config from .env and LightRAG init 2025-10-15 18:24:38 +05:00
yangdx
29bac49fb9 Handle empty query results by returning None instead of fail responses
• Return None when no context found
• Add structured failure metadata
• Use PROMPTS["fail_response"] for content
• Keep API compatible
2025-10-15 12:04:49 +08:00
haseebuchiha
d52c3377b4 Import from env, use default if none, and remove useless import 2025-10-14 16:14:03 +05:00
DivinesLight
54f0a7d1ca Quick fix to limit source_id ballooning while inserting nodes 2025-10-14 14:47:04 +05:00
yangdx
85d1a563b3 Merge branch 'adminunblinded/main' 2025-10-10 12:31:47 +08:00
NeelM0906
f6d1fb98ac Fix linting errors 2025-10-09 16:52:22 -04:00