# Upstream (HKUDS/main @ f0d67f16) — features & fixes missing from this version Summary: this file lists new features, bug fixes, CI/docs/tooling changes and important refactors that exist in upstream `HKUDS/LightRAG` main (commit f0d67f16 and its ancestors) but are not merged into this local branch. I grouped changes into functional areas and included short remediation notes where appropriate. NOTE: upstream/main contains many small dependency bumps and documentation commits; this document focuses on substantive features and functional fixes that affect runtime behavior, storage, security, tooling and testing. 1) Core storage, DB & Postgres improvements - Add PostgreSQL vchordrq vector index support and unify vector index creation logic (dev-postgres-vchordrq) — improves Postgres vector indexing semantics and config handling. - Add CASCADE to AGE extension creation in Postgres init scripts (avoid failures when recreating extension) - Add postgres_impl fixes, retry improvements and support for vchordrq epsilon config when probes empty - Postgres RLS-related and storage refinements (note: local branch already added postgres_rls.sql; upstream brings complementary DB/VECTOR engine improvements and fixes). Remediation: merge upstream postgres/vchordrq changes and ensure migration scripts align. 2) Chunking, indexing and document ingestion fixes - Fix top_n behavior: limit by documents instead of chunks to avoid over-counting. (important for retrieval ranking) - Fix infinite loop when overlap_tokens >= max_tokens and edge-case handling for max_tokens == 1. - Add comprehensive tests for chunking logic (multi-token tokenizer, recursive split) and chunking parameters tuning. - Add content deduplication check for document insertion endpoints and fix duplicate document response handling to return original track_id. (prevents duplicates and preserves original IDs) 3) Embeddings & LLM / cloud provider support improvements - Major improvements in OpenAI/OLLAMA/Azure/Bedrock embedding wrappers and clients: - Allow embedding provider defaults when unspecified - Add configurable embedding token limits and validation - Fix Azure OpenAI compatibility and support various deployments, fallback to AZURE_OPENAI_API_VERSION - Convert OpenAI client to use a stable API and bump minimum version (>=2.0.0) - Add support for structured OpenAI outputs via parsed field - Improve Bedrock error handling and add retry logic/custom exceptions - Additional refactors for embedding function wrapping rules, model param handling and function attribute inheritance - Add helper flags like configurable model parameter to jina_embed - Support async chunking functions for large, async chunkers - Add new LLM support, additions under lightrag/llm (e.g., gemini file added upstream) 4) Document / file extraction improvements - DOCX/XLSX handling fixes (preserve table structure, whitespace, column alignment; optimize memory use) - Replace PyPDF2 with pypdf for PDF processing (faster, more reliable parsing) 5) Workspace isolation, pipeline status, RAG lifecycle fixes - Fix document deletion concurrency control and auto-acquire pipeline when idle. - Auto-initialize pipeline status on LightRAG.initialize_storages() (reduces error-prone manual calls) - Namespace, workspace handling and locking fixes: improvements to NamespaceLock (ContextVar), default workspace handling, filtering logic, consistent empty workspace handling and many concurrency bug fixes. 6) Web UI — upgrades, feature additions, fixes - Large set of dependency upgrades for `lightrag_webui` (vite, react-i18next, plugin-react-swc, syntax highlighter, etc.). Upstream also cleaned duplicate deps and improved build tooling. - Add new UI components / improvements (MergeDialog, graph features, translations updates, many components updated). - Handle missing WebUI assets gracefully so server startup is not blocked. - Add static swagger UI assets for API docs (swagger-ui files added upstream). 7) CI, testing, and developer tooling - New/updated GitHub workflows and test runners: tests.yml, improved offline/integration CI markers, Copilot setup steps, docker-build* workflows and improved GitHub Actions versions. - Drop older Python versions in test matrices (3.10/3.11 removed; 3.13/3.14 added) — keep CI modern. - Add ruff to pytest extras, add pre-commit hooks and refine pytest fixtures and markers. - Add many new tests including workspace isolation, chunking tests, overlap validation, postgres retry integration tests, rerank chunking tests, and E2E test improvements. 8) Tools & CLI - New helper tools: clean_llm_query_cache.py, migrate_llm_cache.py, download_cache.py and related README docs for cleaning/migrating LLM caches. - Add `lightrag-clean-llmqc` console script entrypoint. 9) Docs & deployment support - Added docs: FrontendBuildGuide.md, OfflineDeployment.md, UV_LOCK_GUIDE.md and evaluation assets. - Added Dockerfile.lite and docker-build-push.sh to support smaller builds and multi-format distribution. 10) KaTeX & math / feature parity - Upstream adds KaTeX copy‑tex extension support and mhchem extension for chemistry formulas (enables better formula copying and chemistry rendering). Also fixed KaTeX loading in startup. 11) JSON, sanitization and performance - Multiple JSON write/sanitizer enhancements (specialized sanitizers to handle tuples/dict keys/UTF8 errors, optimize sanitization performance) and fixes to avoid memory corruption on migrations. 12) Cloud model & misc improvements - Improve cloud model detection/safety, macOS fork-safety check for Gunicorn multiworker cases; many small fixes for cloud model defaults and config. 13) Security / dependency hygiene - Remove future dependency and replace passlib usage with direct bcrypt (adopt modern libs) Actionable remediation checklist (priority): - Merge and test upstream changes that affect: chunking, embeddings/LLM wrappers, doc processing, and Postgres vector indexing + RLS compatibility (High). - Add or adapt DB migrations to incorporate any upstream schema changes required by tenant features and ensure no conflicts (High). - Update CI matrix and tests to incorporate upstream tests (esp. workspace isolation and chunking tests) to verify no regressions (High). - Merge Web UI updates separately behind feature flag/workflow (Medium) — major dependency churn. If you want, I can automatically generate: - a full commit-by-commit list (746 commits) in docs/diff_hku/unmerged_upstream_commits.txt (raw) — useful for exhaustive audit. - cherry-pick safe/high-priority upstream commits onto this branch and prepare a candidate PR with resolved conflicts.