LightRAG/docs/diff_hku/unmerged_upstream.md
Raphaël MANSUY 6cca895ba9 Add logs for recent actions and decisions regarding upstream changes
- Documented major changes after pulling from upstream (HKUDS/LightRAG), focusing on multi-tenant support, security hardening, and RLS/RBAC.
- Created concise documentation under docs/diff_hku, including migration guides and security audits.
- Enumerated unmerged upstream commits and summarized substantive features and fixes.
- Outlined next steps for DB migrations, CI tests, and potential cherry-picking of upstream fixes.
2025-12-04 18:28:44 +08:00

# Upstream (HKUDS/main @ f0d67f16) — features & fixes missing from this version
Summary: this file lists new features, bug fixes, CI/docs/tooling changes and important refactors that exist in upstream `HKUDS/LightRAG` main (commit f0d67f16 and its ancestors) but are not merged into this local branch. I grouped changes into functional areas and included short remediation notes where appropriate.
NOTE: upstream/main contains many small dependency bumps and documentation commits; this document focuses on substantive features and functional fixes that affect runtime behavior, storage, security, tooling and testing.
1) Core storage, DB & Postgres improvements
- Add PostgreSQL vchordrq vector index support and unify vector index creation logic (dev-postgres-vchordrq) — improves Postgres vector indexing semantics and config handling.
- Add CASCADE to AGE extension creation in Postgres init scripts (avoid failures when recreating extension)
- Add postgres_impl fixes, retry improvements, and support for the vchordrq epsilon config when probes is empty.
- Postgres RLS-related and storage refinements (note: local branch already added postgres_rls.sql; upstream brings complementary DB/VECTOR engine improvements and fixes). Remediation: merge upstream postgres/vchordrq changes and ensure migration scripts align.
2) Chunking, indexing and document ingestion fixes
- Fix top_n behavior: limit by documents instead of chunks to avoid over-counting (important for retrieval ranking).
- Fix infinite loop when overlap_tokens >= max_tokens and edge-case handling for max_tokens == 1.
- Add comprehensive tests for chunking logic (multi-token tokenizer, recursive split) and chunking parameters tuning.
- Add content deduplication check for document insertion endpoints and fix duplicate document response handling to return original track_id. (prevents duplicates and preserves original IDs)
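
The overlap fix above can be illustrated with a minimal sketch. It is not the upstream implementation: the function name `chunk_tokens` and the choice to clamp the overlap (rather than raise an error) are assumptions made for illustration. The point is that the window must advance by at least one token, otherwise `overlap_tokens >= max_tokens` gives a step of zero or less and the loop never ends.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], max_tokens: int, overlap_tokens: int
) -> List[List[int]]:
    """Split tokens into overlapping windows (hypothetical helper)."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")
    # Clamp the overlap so the step is always >= 1.
    # With max_tokens == 1 this forces an overlap of 0.
    overlap = max(0, min(overlap_tokens, max_tokens - 1))
    step = max_tokens - overlap

    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks
```

With this guard, `chunk_tokens(list(range(5)), 2, 5)` terminates and behaves as if the overlap were 1. Whether upstream clamps or rejects such parameters should be checked against the merged commit.
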
3) Embeddings & LLM / cloud provider support improvements
- Major improvements in OpenAI/OLLAMA/Azure/Bedrock embedding wrappers and clients:
  - Allow embedding provider defaults when unspecified
  - Add configurable embedding token limits and validation
  - Fix Azure OpenAI compatibility and support various deployments, falling back to AZURE_OPENAI_API_VERSION
  - Convert OpenAI client to use a stable API and bump minimum version (>=2.0.0)
  - Add support for structured OpenAI outputs via parsed field
  - Improve Bedrock error handling and add retry logic/custom exceptions
- Additional refactors for embedding function wrapping rules, model param handling and function attribute inheritance
- Add helper flags like configurable model parameter to jina_embed
- Support async chunking functions, so that large or asynchronous chunkers can be used
- Add new LLM provider support under lightrag/llm (e.g., gemini file added upstream)
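
As a rough illustration of what "configurable embedding token limits and validation" could look like, here is a small sketch. All names (`validate_token_limit`, `max_token_size`, `whitespace_token_count`) are hypothetical and not taken from upstream, and the whitespace counter is only a stand-in for a real tokenizer.

```python
from typing import Callable, Optional, Sequence


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words (assumption)."""
    return len(text.split())


def validate_token_limit(
    texts: Sequence[str],
    max_token_size: Optional[int],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> None:
    """Raise ValueError if any text exceeds the configured token limit.

    A limit of None means no limit is configured, so validation is skipped.
    """
    if max_token_size is None:
        return
    if max_token_size <= 0:
        raise ValueError(f"max_token_size must be positive, got {max_token_size}")
    too_long = [
        index for index, text in enumerate(texts)
        if count_tokens(text) > max_token_size
    ]
    if too_long:
        raise ValueError(
            f"inputs at indexes {too_long} exceed the limit of "
            f"{max_token_size} tokens"
        )
```

Running the check before the provider call turns an opaque remote error into a local, descriptive one. The real limit and tokenizer depend on the embedding model in use.
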
4) Document / file extraction improvements
- DOCX/XLSX handling fixes (preserve table structure, whitespace, column alignment; optimize memory use)
- Replace PyPDF2 with pypdf for PDF processing (faster, more reliable parsing)
5) Workspace isolation, pipeline status, RAG lifecycle fixes
- Fix document deletion concurrency control and auto-acquire pipeline when idle.
- Auto-initialize pipeline status on LightRAG.initialize_storages() (reduces error-prone manual calls)
- Namespace, workspace handling and locking fixes: improvements to NamespaceLock (ContextVar), default workspace handling, filtering logic, consistent empty workspace handling and many concurrency bug fixes.
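
To make the ContextVar idea concrete, the sketch below shows one way a namespace lock can track per-task state. It is an illustration of the general technique, not upstream's `NamespaceLock`: the key format, the `namespace_key` helper and the re-entrancy check are assumptions. Each asyncio task gets its own copy of the context, so one lock object can be shared by many coroutines without their bookkeeping colliding.

```python
import asyncio
from contextvars import ContextVar
from typing import Dict, FrozenSet, Optional

# One asyncio.Lock per namespace key, shared by all tasks.
_LOCKS: Dict[str, asyncio.Lock] = {}
# Keys held by the *current* task; every task sees its own value.
_HELD_KEYS: ContextVar[FrozenSet[str]] = ContextVar(
    "held_namespace_keys", default=frozenset()
)


def namespace_key(namespace: str, workspace: Optional[str] = None) -> str:
    """Treat None, "" and whitespace-only workspaces identically."""
    ws = (workspace or "").strip()
    return f"{ws}:{namespace}" if ws else namespace


class NamespaceLock:
    """Hypothetical async lock scoped to a (workspace, namespace) pair."""

    def __init__(self, namespace: str, workspace: Optional[str] = None) -> None:
        self.key = namespace_key(namespace, workspace)

    async def __aenter__(self) -> "NamespaceLock":
        if self.key in _HELD_KEYS.get():
            # Waiting on a lock this task already holds would deadlock.
            raise RuntimeError(f"{self.key!r} is already held by this task")
        lock = _LOCKS.setdefault(self.key, asyncio.Lock())
        await lock.acquire()
        _HELD_KEYS.set(_HELD_KEYS.get() | {self.key})
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        _HELD_KEYS.set(_HELD_KEYS.get() - {self.key})
        _LOCKS[self.key].release()
```

Usage is `async with NamespaceLock("pipeline_status", workspace="tenant_a"): ...`. Normalizing the workspace in one place is what keeps empty-workspace handling consistent across callers.
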
6) Web UI — upgrades, feature additions, fixes
- Large set of dependency upgrades for `lightrag_webui` (vite, react-i18next, plugin-react-swc, syntax highlighter, etc.). Upstream also cleaned duplicate deps and improved build tooling.
- Add new UI components / improvements (MergeDialog, graph features, translations updates, many components updated).
- Handle missing WebUI assets gracefully so server startup is not blocked.
- Add static swagger UI assets for API docs (swagger-ui files added upstream).
7) CI, testing, and developer tooling
- New/updated GitHub workflows and test runners: tests.yml, improved offline/integration CI markers, Copilot setup steps, docker-build* workflows and improved GitHub Actions versions.
- Drop older Python versions in test matrices (3.10/3.11 removed; 3.13/3.14 added) — keep CI modern.
- Add ruff to pytest extras, add pre-commit hooks and refine pytest fixtures and markers.
- Add many new tests including workspace isolation, chunking tests, overlap validation, postgres retry integration tests, rerank chunking tests, and E2E test improvements.
8) Tools & CLI
- New helper tools: clean_llm_query_cache.py, migrate_llm_cache.py, download_cache.py and related README docs for cleaning/migrating LLM caches.
- Add `lightrag-clean-llmqc` console script entrypoint.
9) Docs & deployment support
- Added docs: FrontendBuildGuide.md, OfflineDeployment.md, UV_LOCK_GUIDE.md and evaluation assets.
- Added Dockerfile.lite and docker-build-push.sh to support smaller builds and multi-format distribution.
10) KaTeX & math / feature parity
- Upstream adds KaTeX copytex extension support and mhchem extension for chemistry formulas (enables better formula copying and chemistry rendering). Also fixed KaTeX loading in startup.
11) JSON, sanitization and performance
- Multiple JSON write/sanitizer enhancements (specialized sanitizers to handle tuples/dict keys/UTF8 errors, optimize sanitization performance) and fixes to avoid memory corruption on migrations.
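
The kind of sanitization described above can be sketched as follows. This is a simplified, hypothetical helper (`sanitize_for_json` is not an upstream name) covering the three cases mentioned: tuples, non-string dict keys, and strings that cannot be encoded as UTF-8.

```python
import json
from typing import Any


def _sanitize_key(key: Any) -> str:
    """JSON object keys must be strings; stringify anything else."""
    return sanitize_for_json(key) if isinstance(key, str) else str(key)


def sanitize_for_json(value: Any) -> Any:
    """Return a copy of value that json.dumps can write as UTF-8."""
    if isinstance(value, str):
        # Drop code points that cannot be encoded as UTF-8,
        # such as lone surrogates left over from bad decoding.
        return value.encode("utf-8", errors="ignore").decode("utf-8")
    if isinstance(value, dict):
        return {_sanitize_key(k): sanitize_for_json(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples become lists, which is how JSON represents them anyway.
        return [sanitize_for_json(item) for item in value]
    return value


def dumps_utf8(value: Any) -> bytes:
    """Serialize with ensure_ascii=False and encode to UTF-8 bytes."""
    return json.dumps(sanitize_for_json(value), ensure_ascii=False).encode("utf-8")
```

Without the string step, `json.dumps(..., ensure_ascii=False)` succeeds on a lone surrogate but the later UTF-8 encode fails. Without the key step, a tuple key raises `TypeError` inside `json.dumps`.
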
12) Cloud model & misc improvements
- Improve cloud model detection/safety, macOS fork-safety check for Gunicorn multiworker cases; many small fixes for cloud model defaults and config.
13) Security / dependency hygiene
- Remove the `future` dependency and replace passlib usage with direct bcrypt calls (adopt modern, maintained libraries).
Actionable remediation checklist (priority):
- Merge and test upstream changes that affect: chunking, embeddings/LLM wrappers, doc processing, and Postgres vector indexing + RLS compatibility (High).
- Add or adapt DB migrations to incorporate any upstream schema changes required by tenant features and ensure no conflicts (High).
- Update CI matrix and tests to incorporate upstream tests (esp. workspace isolation and chunking tests) to verify no regressions (High).
- Merge Web UI updates separately behind feature flag/workflow (Medium) — major dependency churn.
If you want, I can automatically:
- generate a full commit-by-commit list (746 commits) in docs/diff_hku/unmerged_upstream_commits.txt (raw), useful for an exhaustive audit.
- cherry-pick safe/high-priority upstream commits onto this branch and prepare a candidate PR with resolved conflicts.