# Top 10 upstream commits to cherry-pick next (curated, safe order)

Goal: a small, prioritized set of high-impact fixes/features to bring from upstream/main into this branch before opening a clean PR to HKUDS. Each entry has: commit hash, short purpose, why it matters here, cherry-pick command, and tests to run.

Prereq: fetch upstream and create an integration branch from this branch. The commits below are already part of upstream/main, so a branch created from upstream/main would turn every cherry-pick into an empty one.

- `git fetch upstream`
- `git checkout -b premerge/integration`

Apply cherry-picks in order. Resolve conflicts locally, add tests if needed, run CI locally, then push small PRs.

1) 9009abed: Fix top_n behavior with chunking (retrieval correctness)
   - Why: Retrieval results rely on top_n semantics. The fix prevents counting chunks instead of documents.
   - Command: `git cherry-pick 9009abed`
   - Tests: `pytest tests/test_rerank_chunking.py && pytest tests/test_chunking.py`
2) 1d6ea0c5: Fix chunking infinite loop when overlap_tokens >= max_tokens
   - Why: Critical stability fix for document processing pipelines; prevents hangs on pathological configs.
   - Command: `git cherry-pick 1d6ea0c5`
   - Tests: `pytest tests/test_chunking.py::test_overlap_edge_cases`
3) 19c16bc4: Add content deduplication check for document insertion endpoints
   - Why: Prevents duplicate ingestion and storage bloat, which improves data correctness.
   - Command: `git cherry-pick 19c16bc4`
   - Tests: add or confirm a unit test that duplicate documents are deduplicated; run the E2E insertion test.
4) 8d28b959: Fix duplicate document responses to return original track_id
   - Why: Preserves client-visible IDs for duplicated docs (compatibility and dedup tracking).
   - Command: `git cherry-pick 8d28b959`
   - Tests: re-run the ingestion tests and verify that the returned track_id is the original one.
5) d07023c9: feat(postgres_impl): add vchordrq vector index support
   - Why: Adds vchordrq vector index support for Postgres; required when Postgres is used as the vector DB in production.
   - Command: `git cherry-pick d07023c9`
   - Special: Might require environment or library changes (vchordrq) and additional tests.
   - Run Postgres integration tests: `pytest tests/test_postgres_retry_integration.py`
6) d6019c82: Add CASCADE to AGE extension creation in PostgreSQL implementation
   - Why: Database initialization becomes idempotent and robust when re-creating extensions (deployment reliability).
   - Command: `git cherry-pick d6019c82`
   - Tests: create a fresh DB using docker-compose, run the migrations and the initialization script, and confirm that neither fails.
7) 02fdceb9: Update OpenAI client to stable API and bump minimum version to 2.0.0
   - Why: Major provider change; ensures embedding/LLM clients use the stable official API and match upstream examples.
   - Command: `git cherry-pick 02fdceb9`
   - Special: Requires bumping the `openai` dependency and adding the new version to the CI matrix; run the tests that cover the OpenAI adapters.
8) 4ab4a7ac: Allow embedding models to use provider defaults when unspecified
   - Why: Makes embedding config simpler and more resilient; avoids unexpected failures when settings are missing.
   - Command: `git cherry-pick 4ab4a7ac`
   - Tests: ensure the embedding function wrapper tests pass and add a test where the provider default is used.
9) e22ac52e: Auto-initialize pipeline status in LightRAG.initialize_storages()
   - Why: Reduces manual initialization errors and avoids missing pipeline state in new deployments.
   - Command: `git cherry-pick e22ac52e`
   - Tests: run integration tests that call `LightRAG.initialize_storages()` and check that the pipeline status is created.
10) c434879c: Replace PyPDF2 with pypdf for PDF processing
    - Why: pypdf is actively maintained and fixes several parsing bugs; improves the reliability of PDF parsing.
    - Command: `git cherry-pick c434879c`
    - Special: Update requirements; run the PDF extraction tests (if present) and any E2E document parsing tests.

Post-cherry-pick checklist for each commit:

- Run the targeted unit tests and, where practical, the full test suite (`pytest -q`).
- Add or adjust any migrations or dependency changes the cherry-picks introduce (e.g., `openai>=2.0.0`, `pypdf`). Commit those as separate small commits.
- If a cherry-pick touches database code, verify that migrations exist and add small migration files when needed.
- Push each cherry-picked and verified branch to your fork and open a small PR that references the upstream commit and links back to the full planned merge.

After the top-10 picks, re-run the repo-level test matrix and then consider the next wave (embedding refinements, JSON sanitization, workspace isolation tests).

---

Additional picks (11–25): next wave (high impact, medium complexity)

11) 702cfd29: Fix document deletion concurrency control and validation logic
    - Why: Prevents race conditions during deletion and fixes validation bugs that could leak resources or leave partial deletions.
    - Command: `git cherry-pick 702cfd29`
    - Tests: run `tests/test_deletion.py` and concurrency-focused deletion scenarios.
12) fec7c67f: Add comprehensive chunking tests with multi-token tokenizer edge cases
    - Why: Strengthens confidence in the chunking pipeline; these tests capture subtleties of tokenizer behavior.
    - Command: `git cherry-pick fec7c67f`
    - Tests: `pytest tests/test_chunking.py` (all chunking tests)
13) 57332925: Add comprehensive tests for chunking with recursive splitting
    - Why: Complements the previous commit; covers recursive splitting and guards against regressions.
    - Command: `git cherry-pick 57332925`
    - Tests: `pytest tests/test_chunking.py`
14) 3e759f46: Add real integration and E2E tests for workspace isolation
    - Why: These tests increase confidence that tenant/workspace isolation does not regress after merges.
    - Command: `git cherry-pick 3e759f46`
    - Tests: `pytest tests/test_workspace_isolation.py` (integration + E2E variants)
15) 436e4143: Enhance workspace isolation test suite to 100% coverage
    - Why: Makes the test suite more resilient and ensures changes don't break isolation guarantees.
    - Command: `git cherry-pick 436e4143`
    - Tests: `pytest tests/test_workspace_isolation.py`
16) 95cd0ece: Fix DOCX table extraction by escaping special characters in cells
    - Why: Improves DOCX parsing reliability for complex documents and preserves table structure.
    - Command: `git cherry-pick 95cd0ece`
    - Tests: run the DOCX extraction tests under `tests/` and the E2E document parsing tests.
17) 0244699d: Optimize XLSX extraction by using sheet.max_column instead of two-pass scan
    - Why: Reduces memory and CPU use during XLSX ingestion of large spreadsheets.
    - Command: `git cherry-pick 0244699d`
    - Tests: tests covering XLSX extraction performance and behavior.
18) 2b160163: Optimize XLSX extraction to avoid storing all rows in memory
    - Why: Major memory optimization for XLSX ingestion of large files.
    - Command: `git cherry-pick 2b160163`
    - Tests: add a large-file XLSX test or run the existing XLSX tests.
19) 23cbb9c9: Add data sanitization to JSON writing to prevent UTF-8 encoding errors
    - Why: Prevents crashes and data corruption when storing untrusted metadata or malformed JSON.
    - Command: `git cherry-pick 23cbb9c9`
    - Tests: `tests/test_write_json_optimization.py` and new tests for edge-case JSON characters.
20) fc44f113: Remove future dependency and replace passlib with direct bcrypt
    - Why: Dependency hygiene and security improvement; reduces transitive legacy dependencies and uses modern bcrypt.
    - Command: `git cherry-pick fc44f113`
    - Special: Update requirements/lockfiles and run the auth-related tests.
21) e5addf4d: Improve embedding config priority and add debug logging
    - Why: Embedding config precedence issues can cause confusing failures; this commit improves observability.
    - Command: `git cherry-pick e5addf4d`
    - Tests: run the embedding tests and check the logs for clarity.
22) 6e2946e7: Add max_token_size parameter to azure_openai_embed wrapper
    - Why: Azure OpenAI compatibility and safety for embeddings with token limits.
    - Command: `git cherry-pick 6e2946e7`
    - Tests: tests for the azure_openai wrapper and edge cases where token limits matter.
23) 4ea21240: Add GitHub CI workflow and test markers for offline/integration tests
    - Why: Brings a repeatable CI job that runs the offline/integration markers; needed to run the new tests reliably.
    - Command: `git cherry-pick 4ea21240`
    - Tests: validate the new CI job config in a branch (dry run); locally run the tests that use the markers.
24) ff8f1588: Update env.example (important fixes/clarifications)
    - Why: Env example updates reduce deployment mistakes; low complexity, important for ops.
    - Command: `git cherry-pick ff8f1588`
    - Tests: manual verification of the environment examples and README changes.
25) f72f435c: Fix chunk size handling (stability/regression prevention)
    - Why: Prevents accidental misconfiguration of chunk sizes and related errors.
    - Command: `git cherry-pick f72f435c`
    - Tests: `pytest tests/test_chunking.py` variations and edge-case runs.

Post-cherry-pick notes: This round focuses on tests, chunking stability, workspace isolation tests, DOCX/XLSX memory improvements, the JSON sanitizer, and dependency hygiene. After applying, re-run the suite and move problematic commits into small follow-up PRs (with migration files if DB changes are required).

If you want, I can:

- Apply these cherry-picks sequentially into a `premerge/integration` branch here and run the tests. (Option B)
- Or produce a shell script that runs the cherry-picks and tests locally so you can run it on your machine.
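
As a rough illustration of the pick-then-test loop described in this plan, a local script could look like the sketch below. It is an assumption-laden outline, not a tested tool: the names `PICKS`, `run`, `apply_pick`, and `DRY_RUN` are made up for this example, only the first two entries of the list are filled in, and it assumes `git` and `pytest` are on the PATH and that the integration branch is already checked out. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: cherry-pick commits in order and run each entry's tests afterwards.
# DRY_RUN defaults to 1, so nothing is changed unless you opt in with DRY_RUN=0.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

# Each entry is "<commit>|<test command>" (hypothetical format for this sketch).
# Only picks 1 and 2 are listed; extend the array with the remaining entries.
PICKS=(
  "9009abed|pytest tests/test_rerank_chunking.py && pytest tests/test_chunking.py"
  "1d6ea0c5|pytest tests/test_chunking.py::test_overlap_edge_cases"
)

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [[ "$DRY_RUN" == "0" ]]; then
    "$@"
  fi
}

# Cherry-pick one commit, then run its test command.
apply_pick() {
  local entry="$1"
  local commit="${entry%%|*}"   # text before the first "|"
  local tests="${entry#*|}"     # text after the first "|"
  # -x records the source commit hash in the new commit message.
  run git cherry-pick -x "$commit"
  # The test command is a plain string, so hand it to a shell for parsing.
  run bash -c "$tests"
}

main() {
  local entry
  for entry in "${PICKS[@]}"; do
    apply_pick "$entry"
  done
}

main
```

Because of `set -e`, a conflicting cherry-pick or a failing test stops the script at that commit, which leaves the repository in a state where the conflict can be resolved by hand before re-running with the already-applied entries removed. Running it once with the default `DRY_RUN=1` is a cheap way to review the order before executing it with `DRY_RUN=0`.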