Top 10 upstream commits to cherry-pick next (curated, safe order)
Goal: small, prioritized set of high-impact fixes/features to bring from upstream/main into this branch before opening a clean PR to HKUDS. Each entry has: commit hash, short purpose, why it matters here, cherry-pick command and tests to run.
Prereq: create an integration branch from upstream/main:
    git fetch upstream
    git checkout -b premerge/integration upstream/main
Apply cherry-picks in order. Resolve conflicts locally, add tests if needed, run CI locally, then push small PRs.
- 9009abed: Fix top_n behavior with chunking (retrieval correctness)
  - Why: Retrieval results rely on top_n semantics. Fix prevents over-counting chunks vs documents.
  - Command: git cherry-pick 9009abed
  - Tests: pytest tests/test_rerank_chunking.py && pytest tests/test_chunking.py
- 1d6ea0c5: Fix chunking infinite loop when overlap_tokens >= max_tokens
  - Why: Critical stability fix for document processing pipelines; prevents hangs on pathological configs.
  - Command: git cherry-pick 1d6ea0c5
  - Tests: pytest tests/test_chunking.py::test_overlap_edge_cases
- 19c16bc4: Add content deduplication check for document insertion endpoints
  - Why: Prevents duplicate ingestion and storage bloat; improves data correctness.
  - Command: git cherry-pick 19c16bc4
  - Tests: add/ensure unit test that duplicates are deduped; run e2e insertion test.
- 8d28b959: Fix duplicate document responses to return original track_id
  - Why: Preserves client-visible IDs for duplicated docs (compatibility and dedup tracking).
  - Command: git cherry-pick 8d28b959
  - Tests: replay ingestion tests and verify returned track_id is original.
- d07023c9: feat(postgres_impl): add vchordrq vector index support
  - Why: Adds Postgres vector index support; required when Postgres is used as vector DB in production.
  - Command: git cherry-pick d07023c9
  - Special: Might require environment or lib changes (vchordrq) and additional tests. Run Postgres integration tests: pytest tests/test_postgres_retry_integration.py
- d6019c82: Add CASCADE to AGE extension creation in PostgreSQL implementation
  - Why: Database initialization becomes idempotent and robust when re-creating extensions (deployment reliability).
  - Command: git cherry-pick d6019c82
  - Tests: Create fresh DB using docker-compose, run migrations and initialization script confirming no failure.
- 02fdceb9: Update OpenAI client to stable API and bump minimum version to 2.0.0
  - Why: Major provider change; ensures embedding/LLM clients use stable official API and match upstream examples.
  - Command: git cherry-pick 02fdceb9
  - Special: Requires bumping the openai dependency and adding to CI matrix; run tests that cover OpenAI adapters.
- 4ab4a7ac: Allow embedding models to use provider defaults when unspecified
  - Why: Makes embedding config more resilient and simpler; avoids unexpected failures when settings are missing.
  - Command: git cherry-pick 4ab4a7ac
  - Tests: ensure embedding function wrapper tests pass and add a test where provider default is used.
- e22ac52e: Auto-initialize pipeline status in LightRAG.initialize_storages()
  - Why: Reduces manual initialization errors and avoids missing pipeline state in new deployments.
  - Command: git cherry-pick e22ac52e
  - Tests: integration tests calling LightRAG.initialize_storages(); check pipeline status created.
- c434879c: Replace PyPDF2 with pypdf for PDF processing
  - Why: pypdf is actively maintained and fixes several parsing bugs; improves reliability of PDF doc parsing.
  - Command: git cherry-pick c434879c
  - Special: Update requirements; run PDF extraction tests (if present) and any E2E doc parsing tests.
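To make the intent of 19c16bc4 and 8d28b959 concrete, here is a minimal sketch of content-hash deduplication that returns the original track_id for a repeated document. It is not the upstream implementation: the class `InMemoryDocStore`, the `insert` method, the status strings, and the track_id format are all hypothetical names chosen for illustration.

```python
import hashlib
import uuid


class InMemoryDocStore:
    """Hypothetical stand-in for a document store; not LightRAG's real API."""

    def __init__(self) -> None:
        # Maps content hash -> track_id assigned on first insertion.
        self._track_ids: dict[str, str] = {}

    @staticmethod
    def _content_hash(text: str) -> str:
        # Strip surrounding whitespace so trivially different copies collide.
        # (Assumption: the real dedup key may be computed differently.)
        return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

    def insert(self, text: str) -> dict:
        key = self._content_hash(text)
        if key in self._track_ids:
            # Duplicate: report it, but hand back the ORIGINAL track_id
            # so clients can keep following the first ingestion.
            return {"status": "duplicated", "track_id": self._track_ids[key]}
        track_id = f"insert_{uuid.uuid4().hex[:12]}"
        self._track_ids[key] = track_id
        return {"status": "success", "track_id": track_id}
```

A test for these two commits would likely follow the same shape: insert a document twice and check that the second response carries the first response's track_id.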
Post-cherry-pick checklist for each commit:
- Run targeted unit tests and the repo full test suite where practical (pytest -q).
- Add/adjust any missing migrations or dependency changes introduced by the cherry-picks (e.g., openai>=2.0.0, pypdf). Commit those as separate small commits.
- If a cherry-pick touches database code, verify migrations exist and add small migration files when needed.
- Push each cherry-picked & verified branch to your fork and open a small PR referencing this upstream commit and linking back to the full planned merge.
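The per-commit loop (cherry-pick, then run the targeted tests, stop on the first failure) can be sketched in Python as below. This is only a sketch under assumptions: `Pick`, `build_commands`, `run_plan`, and `PLAN` are hypothetical names, the plan lists just the first two commits, and conflict resolution is left to the operator.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Pick:
    commit: str
    # pytest targets to run after the pick; empty means "verify manually".
    tests: Sequence[str] = field(default_factory=tuple)


def build_commands(pick: Pick) -> List[List[str]]:
    """Return the commands for one pick: the cherry-pick, then its tests."""
    cmds = [["git", "cherry-pick", pick.commit]]
    if pick.tests:
        cmds.append(["pytest", "-q", *pick.tests])
    return cmds


def run_plan(
    picks: Sequence[Pick],
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Apply picks in order and stop at the first failing command.

    Returns the commit hashes whose cherry-pick and tests both succeeded.
    """
    if runner is None:
        # Default runner executes the command and returns its exit code.
        runner = lambda cmd: subprocess.run(cmd).returncode
    done: List[str] = []
    for pick in picks:
        for cmd in build_commands(pick):
            if runner(cmd) != 0:
                return done  # leave the repo as-is for manual resolution
        done.append(pick.commit)
    return done


# First two entries of the plan; extend in the same order as the list above.
PLAN = [
    Pick("9009abed", ["tests/test_rerank_chunking.py", "tests/test_chunking.py"]),
    Pick("1d6ea0c5", ["tests/test_chunking.py::test_overlap_edge_cases"]),
]
```

Calling `run_plan(PLAN)` from the integration branch would execute the commands for real. Passing a custom `runner` makes it possible to dry-run the plan and inspect the commands first.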
After doing the top-10 picks, re-run the repo-level test matrix and then consider the next wave (embedding refinements, JSON sanitization, workspace isolation tests).
Additional picks (11–25) — next wave (high impact, medium complexity)
- 702cfd29: Fix document deletion concurrency control and validation logic
  - Why: Prevents race conditions during deletion operations and fixes validation bugs that could leak resources or leave partial deletions.
  - Command: git cherry-pick 702cfd29
  - Tests: run tests/test_deletion.py and concurrency-focused deletion scenarios.
- fec7c67f: Add comprehensive chunking tests with multi-token tokenizer edge cases
  - Why: Strengthens confidence in the chunking pipeline; these tests capture subtleties of tokenizer behavior.
  - Command: git cherry-pick fec7c67f
  - Tests: pytest tests/test_chunking.py (all chunking tests)
- 57332925: Add comprehensive tests for chunking with recursive splitting
  - Why: Complements the previous commit; ensures coverage for recursive splitting and prevents regressions.
  - Command: git cherry-pick 57332925
  - Tests: pytest tests/test_chunking.py
- 3e759f46: Add real integration and E2E tests for workspace isolation
  - Why: These tests increase confidence that tenant/workspace isolation does not regress after merges.
  - Command: git cherry-pick 3e759f46
  - Tests: pytest tests/test_workspace_isolation.py (integration + E2E variants)
- 436e4143: Enhance workspace isolation test suite to 100% coverage
  - Why: Makes the test suite more resilient and ensures changes don't break isolation guarantees.
  - Command: git cherry-pick 436e4143
  - Tests: pytest tests/test_workspace_isolation.py
- 95cd0ece: Fix DOCX table extraction by escaping special characters in cells
  - Why: Improves DOCX parsing reliability for complex documents and preserves table structure.
  - Command: git cherry-pick 95cd0ece
  - Tests: run DOCX extraction tests under tests/ and e2e doc parsing tests.
- 0244699d: Optimize XLSX extraction by using sheet.max_column instead of two-pass scan
  - Why: Reduces memory and CPU during XLSX ingestion on large spreadsheets.
  - Command: git cherry-pick 0244699d
  - Tests: tests covering XLSX extraction performance/behavior.
- 2b160163: Optimize XLSX extraction to avoid storing all rows in memory
  - Why: Major memory optimization for XLSX ingestion of large files.
  - Command: git cherry-pick 2b160163
  - Tests: add a large-file XLSX test or run existing XLSX tests.
- 23cbb9c9: Add data sanitization to JSON writing to prevent UTF-8 encoding errors
  - Why: Prevents crashes/data corruption when storing untrusted metadata or malformed JSON.
  - Command: git cherry-pick 23cbb9c9
  - Tests: tests/test_write_json_optimization.py and new tests for edge-case JSON characters.
- fc44f113: Remove future dependency and replace passlib with direct bcrypt
  - Why: Dependency hygiene/security improvement; reduces transitive legacy deps and uses modern bcrypt.
  - Command: git cherry-pick fc44f113
  - Special: Update requirements/lockfiles and run auth-related tests.
- e5addf4d: Improve embedding config priority and add debug logging
  - Why: Embedding config precedence issues can cause confusing failures; this improves observability.
  - Command: git cherry-pick e5addf4d
  - Tests: run embedding tests and check logs for clarity.
- 6e2946e7: Add max_token_size parameter to azure_openai_embed wrapper
  - Why: Azure OpenAI compatibility and safety for embeddings with token limits.
  - Command: git cherry-pick 6e2946e7
  - Tests: tests for azure_openai wrapper and edge cases where token limits matter.
- 4ea21240: Add GitHub CI workflow and test markers for offline/integration tests
  - Why: Brings a repeatable CI job to run offline/integration markers, needed to run the new tests reliably.
  - Command: git cherry-pick 4ea21240
  - Tests: Validate new CI job config in a branch (dry-run); locally run tests that use markers.
- ff8f1588: Update env.example (important fixes/clarifications)
  - Why: Env example updates reduce deployment mistakes; low complexity, important for ops.
  - Command: git cherry-pick ff8f1588
  - Tests: manual verification of environment examples and readme changes.
- f72f435c: Fix chunk size handling (stability/regression prevention)
  - Why: Prevents accidental misconfiguration of chunk sizes and related errors.
  - Command: git cherry-pick f72f435c
  - Tests: pytest tests/test_chunking.py variations and edge-case runs.
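The chunking fixes in this plan (the overlap_tokens >= max_tokens hang and the chunk size handling) guard against a sliding window whose step is zero or negative. The sketch below shows one way such a guard could look. It is an illustration, not the upstream code: `chunk_tokens` is a hypothetical function, and clamping the overlap (rather than raising an error) is an assumed design choice.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], max_tokens: int, overlap_tokens: int
) -> List[List[int]]:
    """Split tokens into windows of at most max_tokens with some overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if overlap_tokens < 0:
        raise ValueError("overlap_tokens must be non-negative")

    # Clamp the overlap so the window always advances by at least one token.
    # Without this, overlap_tokens >= max_tokens gives a step <= 0 and a
    # naive while-loop would never make progress.
    overlap = min(overlap_tokens, max_tokens - 1)
    step = max_tokens - overlap

    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

An edge-case test in the spirit of test_overlap_edge_cases would call this with overlap_tokens equal to and greater than max_tokens and assert that it returns in finite time.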
Post-cherry-pick notes: This round focuses on tests, chunking stability, workspace isolation tests, DOCX/XLSX memory improvements, JSON sanitizer, and dependency hygiene. After applying, re-run the suite and promote problematic commits into small follow-up PRs (with migration files if DB changes are required).
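For the JSON sanitizer mentioned above, one common source of UTF-8 encoding errors in Python is a lone surrogate code point inside a string. The following sketch shows the general idea of sanitizing before writing. The function names `sanitize_text`, `sanitize`, and `dumps_utf8` are hypothetical, and the upstream commit may handle this differently.

```python
import json
from typing import Any


def sanitize_text(text: str) -> str:
    # Lone surrogates (U+D800..U+DFFF) cannot be encoded as UTF-8.
    # Encoding with errors="replace" turns each one into "?".
    return text.encode("utf-8", errors="replace").decode("utf-8")


def sanitize(obj: Any) -> Any:
    """Recursively sanitize strings inside dicts and lists."""
    if isinstance(obj, str):
        return sanitize_text(obj)
    if isinstance(obj, dict):
        return {
            (sanitize_text(k) if isinstance(k, str) else k): sanitize(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [sanitize(v) for v in obj]
    return obj  # numbers, booleans, None pass through unchanged


def dumps_utf8(obj: Any) -> bytes:
    """Serialize to UTF-8 JSON bytes without raising on bad code points."""
    return json.dumps(sanitize(obj), ensure_ascii=False).encode("utf-8")
```

New edge-case tests for 23cbb9c9 could feed strings like "bad\ud800" through the writer and assert that the output file still decodes as valid UTF-8 JSON.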
If you want, I can:
- Apply these cherry-picks sequentially into a premerge/integration branch here and run the tests.
- (Option B) Or produce a shell script that runs the cherry-picks and tests locally so you can run it on your machine.