Comprehensive premerge plan — map of all upstream features / fixes
Purpose
- This document expands the curated top-25 cherry-pick plan to a complete, review-friendly roadmap covering all functional areas present in upstream/main (the full commit list is saved as docs/diff_hku/unmerged_upstream_commits.txt).
- It does NOT automatically cherry-pick all 700+ commits. Instead it groups features/fixes, identifies safe integration order, test targets, migration notes, and the minimal follow-up actions needed for merging the whole upstream delta into your branch in safe waves.
How to use:
- Use the per-category sections below to plan incremental merges / PRs.
- Each group lists the types of commits present upstream, what they change, which tests you must run, and integration notes.
- For commit-level details, consult docs/diff_hku/unmerged_upstream_commits.txt which contains the raw upstream commit list (746 lines).
High-level merge waves (summary)
- Wave 0 — Safety & infra: DB migrations & RLS handling, secrets (token), CI for integration tests, and DB driver session wiring.
- Wave 1 — Stability & correctness: chunking fixes, document dedup, dedup ID handling, content parsers (DOCX/XLSX/PDF) and JSON sanitization.
- Wave 2 — Embeddings & LLMs: OpenAI/Ollama/Azure/Bedrock wrappers, embedding config, token limits and provider defaults; bump the openai client if required.
- Wave 3 — Storage & vector DBs: Postgres vchordrq, faiss/milvus/qdrant adjustments, vector indexing and retry logic.
- Wave 4 — Workspace & lifecycle: workspace isolation tests, pipeline status, namespace locks and RAG lifecycle improvements.
- Wave 5 — Tests & CI: add upstream tests into CI, update matrix, add offline/integration jobs.
- Wave 6 — Web UI & tooling: web UI dependency updates, new components and docs; merge separately or behind flag to reduce churn.
- Wave 7 — Cleanups: dependency hygiene, docs, small refactors and non-critical enhancements.
Category-by-category (full integration plan)
- Chunking & tokenization (critical)
- What upstream changed: top_n behavior fix, infinite-loop fix when overlap_tokens >= max_tokens, token-limit validation, recursive split tests and many test additions.
- Why it matters: chunking affects indexing, retrieval correctness and can cause hangs or wrong search results.
- Merge plan: bring all chunking fixes first (top_n + overlap fixes), add full chunking tests and regressions to CI. Validate across tokenizers used in tests.
- Tests: tests/test_chunking.py, tests/test_rerank_chunking.py, new chunking-focused tests in upstream.
- Risk: low-medium; mainly algorithmic / test coverage. Keep refactors separate if they touch storage.
- Document ingestion & deduplication (high priority)
- What upstream changed: content dedup checks on insertion, duplicate track_id return fixes, DOCX/XLSX table handling improvements and PDF migration to pypdf, XLSX memory optimizations.
- Why it matters: correctness of ingestion and storage footprint; fixes prevent duplicate documents and structural loss for tables.
- Merge plan: merge dedup + duplicate-id fixes + DOCX/XLSX/PDF changes together. Add E2E ingestion tests using sample documents from upstream/evaluation samples.
- Tests: tests for ingestion endpoints, e2e parsing tests, sample docs under lightrag/evaluation.
- Embeddings & LLM adapters (high impact)
- What upstream changed: provider-default rules for embeddings, configurable token limits (Azure), OpenAI client upgrade to v2+, structured output support (parsed), new support for jina/embed wrappers, and integration notes.
- Why it matters: embedding quality, API compatibility and provider stability.
- Merge plan: apply embedding config fixes + OpenAI client upgrade in a separate PR; confirm dependency bump and run embedding adapter tests.
- Tests: embedding unit tests, e2e embedding flows, cloud-provider-specific wrappers (azure/openai/batch modes).
- Postgres & vector DB improvements (high priority for PG deployments)
- What upstream changed: vchordrq vector index support, AGE extension CASCADE, Postgres retry/instrumentation, various storage fixes and improvements.
- Why it matters: important for Postgres-backed vector DBs, migrations and RLS compatibility.
- Merge plan: ensure DB migrations are present before merging code that expects new schemas or behavior; add integration tests against a Postgres test container with vchordrq/AGE; ensure session var management for RLS.
- Tests: pytest tests/test_postgres_retry_integration.py and any DB-heavy E2E tests.
- Workspace isolation, pipeline status & concurrency safety (high)
- What upstream changed: namespace locks (ContextVar) safety, default workspace handling, auto-initialize pipeline status, deletion concurrency fixes and robust isolation tests.
- Why it matters: prevents correctness/consistency/DoS bugs across tenants/workspaces.
- Merge plan: merge isolation tests and small fixes early; add E2E tests and load test simulation for race conditions.
- Tests, CI & developer tooling (medium-high)
- What upstream changed: new CI workflows, offline/integration markers, test improvements (workspaces, chunking), updated matrices (Python 3.13/3.14), and dependabot hygiene.
- Why it matters: keeps repo quality verifiable and avoids regressions.
- Merge plan: add CI changes to integration branch early to ensure subsequent PRs run tests properly; keep frontend CI separate where needed.
- Web UI & frontend (medium)
- What upstream changed: many dependency bumps (vite/react-i18next), UI components added/changed, static swagger assets added, handle missing webui assets gracefully.
- Why it matters: high churn and large PR surface — do separately to keep backend merge risk low.
- Merge plan: open dedicated PR for webui dependency upgrades and new components; keep it independent from backend changes.
- Tools / scripts / CLI (low-medium)
- What upstream changed: cache-clean tools, migrate_llm_cache, download_cache, CLI entrypoints and other helper utilities.
- Why it matters: operational utilities that help migrations and maintenance.
- Merge plan: merge these small tools after core infra stabilization.
- JSON sanitization, performance & data hygiene (medium)
- What upstream changed: improved JSON sanitizers, UTF-8 fixes, memory optimization for JSON write paths.
- Why it matters: prevents corruption and memory issues on large or malformed inputs.
- Merge plan: merge with ingestion fixes and run heavy-data tests.
- KaTeX & rendering enhancements (low)
- What upstream changed: KaTeX copy-tex extension, mhchem support, startup import fixes.
- Merge plan: low risk; safe to merge after the core features land.
- Dependency hygiene & small refactors (ongoing)
- What upstream changed: many dependabot bumps, removal of the "future" dependency, a switch from passlib to bcrypt, and cleanup of pyproject and extras.
- Merge plan: merge progressively; keep CI green after each bump.
- Misc / smaller improvements
- Cloud model detection, macOS Gunicorn safety, neo4j retry decorator/resilience, and other incremental robustness changes.
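To make the chunking item above concrete: the infinite-loop case arises when overlap_tokens >= max_tokens, because the window then never advances. The following is a minimal sketch of the guard, not the upstream implementation. The function name chunk_tokens and its signature are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], max_tokens: int, overlap_tokens: int
) -> List[List[int]]:
    """Split tokens into windows of at most max_tokens tokens.

    Neighbouring windows share overlap_tokens tokens. This is a hypothetical
    helper for illustration, not the upstream chunker.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        # If overlap >= max, the stride below would be <= 0 and a naive
        # while-loop would never move forward.
        raise ValueError("overlap_tokens must satisfy 0 <= overlap_tokens < max_tokens")

    stride = max_tokens - overlap_tokens
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + max_tokens]))
        # Stop once a window reaches the end, to avoid a trailing chunk
        # that only repeats the overlap.
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

A regression test for this wave would likely assert both the normal windowing and the rejected configuration, since the rejected case is the one that previously hung.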
Automated next-step options (pick one)
- Option A (recommended): follow the merge waves above and implement small PRs per bullet (one category = several narrow PRs). This yields safe, reviewable commits and clear rollback paths.
- Option B: create an automated script to cherry-pick all commits from docs/diff_hku/unmerged_upstream_commits.txt into a premerge/integration branch and run tests. This is riskier; expect many conflicts and manual resolution.
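If Option B is explored, one cautious starting point is to plan the cherry-pick commands in small batches before running anything. The sketch below only builds the command lists and does not invoke git. It assumes, without confirmation from the file itself, that each line of the commit list starts with a commit hash and that the list is ordered oldest first (reverse it otherwise). The helper names parse_hashes and plan_batches are hypothetical.

```python
from typing import Iterable, List


def parse_hashes(lines: Iterable[str]) -> List[str]:
    """Return the first whitespace-separated field of each non-empty line.

    Assumption: that first field is the commit hash.
    """
    hashes: List[str] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hashes.append(line.split()[0])
    return hashes


def plan_batches(hashes: List[str], batch_size: int) -> List[List[str]]:
    """Group hashes into `git cherry-pick -x` commands of batch_size commits."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        ["git", "cherry-pick", "-x", *hashes[i:i + batch_size]]
        for i in range(0, len(hashes), batch_size)
    ]
```

Each planned command could then be run with subprocess.run on the integration branch, stopping at the first non-zero exit code so that conflicts are resolved by hand before the next batch.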
Practical guidance for full merge (how to convert this map to actions):
- Create premerge/integration from upstream/main.
- Add migration PR (Wave 0) that includes DB SQL, session escrows for RLS and strict config toggles.
- Add CI workflows to run newly-added upstream tests (Wave 5).
- Merge chunking + ingestion fixes and corresponding tests (Wave 1).
- Merge embed/LLM + client upgrades (Wave 2) with dependency bumps.
- Merge Postgres/vector improvements + storage tests (Wave 3).
- Merge workspace-isolation/pipeline fixes and add integration tests (Wave 4).
- Merge web UI separately (Wave 6) once backend has stabilized.
- Gradually merge remaining minor items and dependency bumps (Wave 7).
Appendix & raw commit list
- Full raw upstream commits: docs/diff_hku/unmerged_upstream_commits.txt (746 commits) — use this as the authoritative per-commit mapping.
If you want I can:
- Generate a per-commit mapping file that assigns every upstream commit to one of the categories above (CSV) so you can cherry-pick in precise order, or
- Attempt an automated run here to apply the first N commits in the prioritized list into a premerge/integration branch and run tests. (I recommend small batches first.)
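As a rough idea of what the per-commit mapping could look like, the sketch below assigns a commit subject to a category by keyword and emits CSV. The keyword table and category labels are illustrative assumptions derived from the category names in this document, not from the actual commit list. A real mapping would need manual review of everything that lands in "uncategorized".

```python
import csv
import io
from typing import Dict, Iterable, Tuple

# Hypothetical keyword table; the first matching category wins.
CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "chunking": ("chunk", "overlap", "top_n"),
    "ingestion": ("dedup", "docx", "xlsx", "pdf"),
    "embeddings_llm": ("embedding", "openai", "ollama", "azure", "bedrock"),
    "postgres_vector": ("postgres", "vchordrq", "milvus", "qdrant", "faiss"),
    "workspace": ("workspace", "namespace", "pipeline status"),
    "webui": ("webui", "vite", "katex"),
}


def categorize(subject: str) -> str:
    """Return the first category whose keyword occurs in the subject line."""
    lowered = subject.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "uncategorized"


def to_csv(commits: Iterable[Tuple[str, str]]) -> str:
    """Render (hash, subject) pairs as CSV with a category column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["hash", "category", "subject"])
    for commit_hash, subject in commits:
        writer.writerow([commit_hash, categorize(subject), subject])
    return buffer.getvalue()
```

Sorting the resulting rows by wave order and then by original commit order would give a cherry-pick sequence that follows the merge waves.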