Commit graph

267 commits

Author SHA1 Message Date
clssck
59e89772de refactor: consolidate to PostgreSQL-only backend and modernize stack
Remove legacy storage implementations and deprecated examples:
- Delete FAISS, JSON, Memgraph, Milvus, MongoDB, Nano Vector DB, Neo4j, NetworkX, Qdrant, Redis storage backends
- Remove Kubernetes deployment manifests and installation scripts
- Delete unofficial examples for deprecated backends and offline deployment docs
Streamline core infrastructure:
- Consolidate storage layer to PostgreSQL-only implementation
- Add full-text search caching with FTS cache module
- Implement metrics collection and monitoring pipeline
- Add explain and metrics API routes
Modernize frontend and tooling:
- Switch web UI to Bun with bun.lock, remove npm and pnpm lockfiles
- Update Dockerfile for PostgreSQL-only deployment
- Add Makefile for common development tasks
- Update environment and configuration examples
Enhance evaluation and testing capabilities:
- Add prompt optimization with DSPy and auto-tuning
- Implement ground truth regeneration and variant testing
- Add prompt debugging and response comparison utilities
- Expand test coverage with new integration scenarios
Simplify dependencies and configuration:
- Remove offline-specific requirement files
- Update pyproject.toml with streamlined dependencies
- Add Python version pinning with .python-version
- Create project guidelines in CLAUDE.md and AGENTS.md
2025-12-12 16:28:49 +01:00
clssck
da9070ecf7 refactor: remove legacy storage implementations and k8s deployment
Remove deprecated storage backends and Kubernetes deployment configuration:
- Delete unused storage implementations: FAISS, JSON, Memgraph, Milvus, MongoDB, Nano Vector DB, Neo4j, NetworkX, Qdrant, Redis
- Remove Kubernetes deployment manifests and installation scripts
- Delete legacy examples for deprecated backends
- Consolidate to PostgreSQL-only storage backend
Streamline dependencies and add new capabilities:
- Remove deprecated code documentation and migration guides
- Add full-text search caching layer with FTS cache module
- Implement metrics collection and monitoring pipeline
- Add explain and metrics API routes
- Simplify configuration with PostgreSQL-focused setup
Update documentation and configuration:
- Rewrite README to focus on supported features
- Update environment and configuration examples
- Remove Kubernetes-specific documentation
- Add new utility scripts for PDF uploads and pipeline monitoring
2025-12-09 14:02:00 +01:00
clssck
95c83abcf8 feat(lightrag,lightrag_webui): add S3 storage integration and UI
Add S3 storage client and API routes for document management:
- Implement s3_routes.py with file upload, download, delete endpoints
- Enhance s3_client.py with improved error handling and operations
- Add S3 browser UI component with file viewing and management
- Implement FileViewer and PDFViewer components for storage preview
- Add Resizable and Sheet UI components for layout control
Update backend infrastructure:
- Add bulk operations and parameterized queries to postgres_impl.py
- Enhance document routes with improved type hints
- Update API server registration for new S3 routes
- Refine upload routes and utility functions
Modernize web UI:
- Integrate S3 browser into main application layout
- Update localization files for storage UI strings
- Add storage settings to application configuration
- Sync package dependencies and lock files
Remove obsolete reproduction script:
- Delete reproduce_citation.py (replaced by test suite)
Update configuration:
- Enhance pyrightconfig.json for stricter type checking
2025-12-07 11:04:38 +01:00
clssck
082a5a8fad test(lightrag,api): add comprehensive test coverage and S3 support
Add extensive test suites for API routes and utilities:
- Implement test_search_routes.py (406 lines) for search endpoint validation
- Implement test_upload_routes.py (724 lines) for document upload workflows
- Implement test_s3_client.py (618 lines) for S3 storage operations
- Implement test_citation_utils.py (352 lines) for citation extraction
- Implement test_chunking.py (216 lines) for text chunking validation
Add S3 storage client implementation:
- Create lightrag/storage/s3_client.py with S3 operations
- Add storage module initialization with exports
- Integrate S3 client with document upload handling
Enhance API routes and core functionality:
- Add search_routes.py with full-text and graph search endpoints
- Add upload_routes.py with multipart document upload support
- Update operate.py with bulk operations and health checks
- Enhance postgres_impl.py with bulk upsert and parameterized queries
- Update lightrag_server.py to register new API routes
- Improve utils.py with citation and formatting utilities
Update dependencies and configuration:
- Add S3 and test dependencies to pyproject.toml
- Update docker-compose.test.yml for testing environment
- Sync uv.lock with new dependencies
Apply code quality improvements across all modified files:
- Add type hints to function signatures
- Update imports and router initialization
- Fix logging and error handling
2025-12-05 23:13:39 +01:00
clssck
65d2cd16b1 feat(examples, lightrag): fix logging and code improvements
Fix logging output in evaluation test harness and examples:
- Replace print() statements with logger calls in e2e_test_harness.py
- Update copy_llm_cache_to_another_storage.py to use logger instead of print
- Remove redundant logging configuration in copy_llm_cache_to_another_storage.py
Fix path handling and typos:
- Correct makedirs() call in lightrag_openai_demo.py to create log_dir directly
- Update constants.py comments to clarify SOURCE_IDS_LIMIT_METHOD options
- Remove duplicate return statement in utils.py normalize_extracted_info()
- Fix error string formatting in chroma_impl.py with !s conversion
- Remove unused pipmaster import from chroma_impl.py
2025-12-05 18:10:19 +01:00
clssck
69358d830d test(lightrag,examples,api): comprehensive ruff formatting and type hints
Format entire codebase with ruff and add type hints across all modules:
- Apply ruff formatting to all Python files (121 files, 17K insertions)
- Add type hints to function signatures throughout lightrag core and API
- Update test suite with improved type annotations and docstrings
- Add pyrightconfig.json for static type checking configuration
- Create prompt_optimized.py and test_extraction_prompt_ab.py test files
- Update ruff.toml and .gitignore for improved linting configuration
- Standardize code style across examples, reproduce scripts, and utilities
2025-12-05 15:17:06 +01:00
yangdx
0c4cba3860 Fix double decoration in azure_openai_embed and document decorator usage
• Remove redundant @retry decorator
• Call openai_embed.func directly
• Add detailed decorator documentation
• Prevent double parameter injection
• Fix EmbeddingFunc wrapping issues
2025-11-21 18:03:53 +08:00
yangdx
0fb2925c6a Remove ascii_colors dependency and fix stream handling errors
• Remove ascii_colors.trace_exception calls
• Add SafeStreamHandler for closed streams
• Patch ascii_colors console handler
• Prevent ValueError on stream close
• Improve logging error handling
2025-11-19 21:38:17 +08:00
yangdx
90f52acf0c Fix linting 2025-11-17 12:28:53 +08:00
yangdx
c13f9116d9 Add embedding dimension validation to EmbeddingFunc wrapper
• Validate total elements divisibility
• Check vector count matches input count
• Raise clear error messages on mismatch
• Ensure embedding output correctness
• Add docstring for EmbeddingFunc class
2025-11-17 12:26:54 +08:00
yangdx
05852e1ab2 Add max_token_size parameter to embedding function decorators
- Add max_token_size=8192 to all embed funcs
- Move siliconcloud to deprecated folder
- Import wrap_embedding_func_with_attrs
- Update EmbeddingFunc docstring
- Fix langfuse import type annotation
2025-11-14 18:41:43 +08:00
yangdx
6de4123f74 Optimize JSON string sanitization with precompiled regex and zero-copy
- Precompile regex pattern at module level
- Zero-copy path for clean strings
- Use C-level regex for performance
- Remove deprecated _sanitize_json_data
- Fast detection for common case
2025-11-12 15:42:07 +08:00
yangdx
777c987371 Optimize JSON write with fast/slow path to reduce memory usage
- Fast path for clean data (no sanitization)
- Slow path sanitizes during encoding
- Reload shared memory after sanitization
- Custom encoder avoids deep copies
- Comprehensive test coverage
2025-11-12 13:48:56 +08:00
yangdx
f28a0c25b1 Improve JSON data sanitization to handle tuples and dict keys
- Sanitize dictionary keys
- Preserve tuple types
- Handle nested structures better
2025-11-12 00:50:18 +08:00
yangdx
6918a88f92 Add specialized JSON string sanitizer to prevent UTF-8 encoding errors
• Remove surrogate characters (U+D800-DFFF)
• Filter Unicode non-characters
• Direct char-by-char filtering
2025-11-12 00:38:47 +08:00
yangdx
d1f4b6e515 Add data sanitization to JSON writing to prevent UTF-8 encoding errors
• Add _sanitize_json_data helper function
• Recursively clean strings in data
• Sanitize before JSON serialization
• Prevent encoding-related crashes
• Use existing sanitize_text_for_encoding
2025-11-12 00:11:13 +08:00
yangdx
c14f25b7f8 Add mandatory dimension parameter handling for Jina API compliance 2025-11-07 21:08:34 +08:00
yangdx
33a1482f7f Add optional embedding dimension parameter control via env var
* Add EMBEDDING_SEND_DIM environment variable
* Update Jina/OpenAI embed functions
* Add send_dimensions to EmbeddingFunc
* Auto-inject embedding_dim when enabled
* Add parameter validation warnings
2025-11-07 20:46:40 +08:00
yangdx
5f49cee20f Merge branch 'main' into VOXWAVE-FOUNDRY/main 2025-11-06 15:37:35 +08:00
yangdx
3fbd704bf9 Enhance entity/relation editing with chunk tracking synchronization
• Add chunk storage sync to edit ops
• Implement incremental chunk ID updates
• Support entity renaming migrations
• Normalize relation keys consistently
• Preserve chunk references on edits
2025-10-26 14:34:56 +08:00
yangdx
a9fec26798 Add file path limit configuration for entities and relations
• Add MAX_FILE_PATHS env variable
• Implement file path count limiting
• Support KEEP/FIFO strategies
• Add truncation placeholder
• Remove old build_file_path function
2025-10-20 20:12:53 +08:00
Humphry
0b3d31507e extended to use gemini, sswitched to use gemini-flash-latest 2025-10-20 13:17:16 +03:00
yangdx
dc62c78f98 Add entity/relation chunk tracking with configurable source ID limits
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage
2025-10-20 15:24:15 +08:00
yangdx
03333d63f3 Merge branch 'main' into limit-vdb-metadata-size 2025-10-17 21:36:06 +08:00
yangdx
f555824064 Fix tuple delimiter corruption handling in regex patterns 2025-10-17 18:43:45 +08:00
DivinesLight
c06522b927 Get max source Id config from .env and lightRAG init 2025-10-15 18:24:38 +05:00
haseebuchiha
d52c3377b4 Import from env and use default if none and removed useless import 2025-10-14 16:14:03 +05:00
DivinesLight
54f0a7d1ca Quick fix to limit source_id ballooning while inserting nodes 2025-10-14 14:47:04 +05:00
NeelM0906
f6d1fb98ac Fix Linting errors 2025-10-09 16:52:22 -04:00
yangdx
a528213210 Fix logging filter logic
• Fix boolean operator precedence in filter
• Consolidate GET/POST condition logic
2025-09-26 19:42:33 +08:00
yangdx
8cd4139cbf refactor: fix double query problem by add aquery_llm function for consistent response handling
- Add new aquery_llm/query_llm methods providing structured responses
- Consolidate /query and /query/stream endpoints to use unified aquery_llm
- Optimize cache handling by moving cache checks before LLM calls
2025-09-26 19:05:03 +08:00
yangdx
5eb4a4b799 feat: simplify citations, add reference merging, and restructure API response format 2025-09-24 14:30:10 +08:00
yangdx
6e2eab5c23 Add ID fields to entities, relations, and chunks in raw data query results 2025-09-21 23:31:35 +08:00
yangdx
18e886d7e9 Improve context item identification with meaningful IDs
- Add EN prefix to entitie IDs
- Add RE prefix to relation IDs
-Add DC prefix chunk IDs
- Enhance traceability across contexts
2025-09-21 20:19:14 +08:00
yangdx
37d01e2df8 fix: Ensures complete metadata (source_id, created_at, file_path) is preserved in aquery_data responses 2025-09-15 03:45:09 +08:00
yangdx
e71229698d refactor: centralize metadata generation in query functions
- Remove processing_info generation from _convert_to_user_format function
- Move all metadata generation (keywords, processing_info) to kg_query and naive_query functions
- Simplify _convert_to_user_format to focus only on data format conversion
2025-09-15 03:11:07 +08:00
yangdx
b1c8206346 Add aquery_data endpoint for structured retrieval without LLM generation
- Add QueryDataResponse model
- Implement /query/data endpoint
- Add aquery_data method to LightRAG
- Return entities, relationships, chunks
2025-09-15 02:15:14 +08:00
yangdx
82a67354d0 Code formatting improvements and style consistency fixes
* Remove trailing whitespace
* Fix function signature ellipsis style
2025-09-14 17:49:02 +08:00
yangdx
87bb8a023b Fix tuple delimiter regex patterns and add debug logging
- Add debug logs for malformed records
- Fix regex for consecutive delimiters
- Handle missing closing brackets
2025-09-14 17:29:27 +08:00
yangdx
70fee5bbeb Fix syntax warning by removin examples from fix_tuple_delimiter_corruption docstring 2025-09-14 12:37:21 +08:00
yangdx
20c5127c7c Merge branch 'optimize-extraction' into return-data-only 2025-09-14 12:33:37 +08:00
yangdx
ff705a2323 Fix tuple delimiter corruption when missing closing bracket, Handle <|#: -> <|#|> pattern 2025-09-14 11:44:21 +08:00
yangdx
1dc96f3959 Merge branch 'optimize-extraction' into return-data-only 2025-09-14 05:37:48 +08:00
yangdx
2686fc526e Change entity type from CreativeWork to Content and update delimiter
• Replace CreativeWork with Content type
• Improve LLM output error messages
• Update prompt for binary relationships
• Fix delimiter corruption examples
2025-09-14 00:55:15 +08:00
yangdx
0ffb5d5f2d Replace search API with aquery_data for consistent raw data retrieval, mirroring aquery results
• Reuse existing query logic paths and remove kg_search function entirely
• Update kg_query/naive_query to return raw data as needed
2025-09-13 15:30:29 +08:00
yangdx
8088b7e07a Fix tuple delimiter corruption handling and update documentation 2025-09-12 18:03:37 +08:00
yangdx
8a3e2c03a9 Fix tuple delimiter corruption patterns with pipes and brackets
- Handle <||S||> malformed delimiters
- Fix <||> empty pipe sequences
- Repair <|| incomplete patterns
- Process ||S|| missing brackets
- Improve delimiter normalization
2025-09-12 17:45:32 +08:00
yangdx
0221213b9b Improve entity summarization with JSONL format and fix tuple delimiters
• Convert descriptions to JSONL format
• Add token-based truncation helper
• Enhance entity name consistency rules
• Improve summarization prompt clarity
• Fix tuple delimiter corruption patterns
2025-09-12 12:32:08 +08:00
yangdx
1892ed23cc Change tuple delimiter from <|SEP|> to <|S|> across codebase
• Update prompt instruction clarity
• Correct utility function examples
• Update regex pattern comments
2025-09-12 08:57:46 +08:00
yangdx
c07bcbff44 Fix tuple delimiter corruption patterns and add missing edge cases 2025-09-12 08:35:37 +08:00