ragflow/rag/app
少卿 7719fd6350
Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702)
### What problem does this PR solve?

This PR addresses **two independent issues** encountered when using the
MinerU engine in Ragflow:

1. **MinerU API output path mismatch for non-ASCII filenames**
   MinerU sanitizes the root directory name inside the returned ZIP when
   the original filename contains non-ASCII characters (e.g., Chinese).
   Ragflow's client-side unzip logic assumed the original filename stem and
   therefore failed to locate `_content_list.json`.
   This PR adds:

   * root-directory detection
   * fallback lookup using sanitized names
   * a broadened `_read_output` search with a glob fallback

   so that output files are consistently located regardless of filename
   encoding.
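
The lookup order described in item 1 could be sketched roughly as follows. This is only an illustration, not the code from this PR: `find_content_list` is a hypothetical helper, and the directory layout (`<root>/<root>_content_list.json`) is an assumption about how the extracted ZIP is organized.

```python
from pathlib import Path
from typing import Optional


def find_content_list(extract_dir: Path, file_stem: str) -> Optional[Path]:
    """Locate a `*_content_list.json` file under an extracted ZIP directory.

    Hypothetical helper; the layout it probes is an assumption.
    """
    # 1. Expected location, assuming the root directory keeps the original stem.
    expected = extract_dir / file_stem / f"{file_stem}_content_list.json"
    if expected.is_file():
        return expected

    # 2. Root-directory detection: if the archive has exactly one top-level
    #    directory, try it even though its name may have been sanitized.
    roots = [p for p in extract_dir.iterdir() if p.is_dir()]
    if len(roots) == 1:
        candidate = roots[0] / f"{roots[0].name}_content_list.json"
        if candidate.is_file():
            return candidate

    # 3. Glob fallback: accept any matching file anywhere below extract_dir.
    matches = sorted(extract_dir.rglob("*_content_list.json"))
    return matches[0] if matches else None
```

Trying the exact path first keeps the common ASCII case cheap; the glob is a last resort because it may pick up an unrelated file if several documents share one extraction directory.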

2. **Chunker crash due to tuple-structure mismatch in manual mode**
   Some parsers (e.g., MinerU / Docling) return **2-tuple sections**, but
   Ragflow’s chunker expects **3-tuple sections**, leading to:
   `ValueError: not enough values to unpack (expected 3, got 2)`
   This PR normalizes all sections to a uniform structure `(text, layout,
   positions)`:

   * parse position tags when present
   * default to empty positions when missing

   preserving backward compatibility and preventing crashes.
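
The normalization in item 2 might look something like the sketch below. It is not the code from `manual.py`: `normalize_section` and `parse_positions` are hypothetical names, and the tag shape `@@<page>\t<x0>\t<x1>\t<top>\t<bottom>##` as well as the meaning of the second tuple element are assumptions made for illustration.

```python
import re
from typing import List, Sequence, Tuple

Position = Tuple[int, float, float, float, float]
Section = Tuple[str, str, List[Position]]

# Assumed tag shape: "@@<page>\t<x0>\t<x1>\t<top>\t<bottom>##".
_NUM = r"(\d+(?:\.\d+)?)"
TAG_RE = re.compile(rf"@@(\d+)\t{_NUM}\t{_NUM}\t{_NUM}\t{_NUM}##")


def parse_positions(tag: str) -> List[Position]:
    """Extract every (page, x0, x1, top, bottom) found in a tag string."""
    return [
        (int(page), float(x0), float(x1), float(top), float(bottom))
        for page, x0, x1, top, bottom in TAG_RE.findall(tag)
    ]


def normalize_section(section: Sequence) -> Section:
    """Coerce a 2- or 3-element section into (text, layout, positions)."""
    if len(section) == 3:
        text, layout, positions = section
        return text, layout, list(positions)
    if len(section) == 2:
        text, second = section
        # Parse position tags when present; otherwise positions stay empty.
        positions = parse_positions(second) if isinstance(second, str) else []
        # Assumption: a non-tag string in second place is a layout label.
        layout = second if isinstance(second, str) and not positions else ""
        return text, layout, positions
    raise ValueError(f"unsupported section length: {len(section)}")
```

With every section passed through such a function first, a loop like `for text, layout, positions in sections:` no longer fails on 2-tuples, and sections that were already 3-tuples pass through unchanged.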

### Type of change

* [x] Bug Fix (non-breaking change which fixes an issue)


[#11136](https://github.com/infiniflow/ragflow/issues/11136)
[#11700](https://github.com/infiniflow/ragflow/issues/11700)
[#11620](https://github.com/infiniflow/ragflow/issues/11620)
[#11701](https://github.com/infiniflow/ragflow/pull/11701)

We need your help, [yongtenglei](https://github.com/yongtenglei).

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-05 19:25:45 +08:00
..
__init__.py Update comments (#4569) 2025-01-21 20:52:28 +08:00
audio.py Move some constants to common (#11004) 2025-11-05 08:01:39 +08:00
book.py Feat: add context for figure and table (#11547) 2025-11-27 10:21:44 +08:00
email.py Refactor: Email parser use with to handle buffer (#11496) 2025-11-25 10:03:37 +08:00
laws.py Fix: missing parameters in by_plaintext method for PDF naive mode (#11408) 2025-11-21 09:33:36 +08:00
manual.py Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702) 2025-12-05 19:25:45 +08:00
naive.py Feat: add child parent chunking method in backend. (#11598) 2025-11-28 19:25:32 +08:00
one.py Fix: missing parameters in by_plaintext method for PDF naive mode (#11408) 2025-11-21 09:33:36 +08:00
paper.py Feat: add context for figure and table (#11547) 2025-11-27 10:21:44 +08:00
picture.py Feat: add context for figure and table (#11547) 2025-11-27 10:21:44 +08:00
presentation.py Fix: relative page_number in boxes (#11712) 2025-12-04 11:23:34 +08:00
qa.py Refactor: rename rmSpace to remove_redundant_spaces (#10796) 2025-10-28 09:46:32 +08:00
resume.py Refactor: rename rmSpace to remove_redundant_spaces (#10796) 2025-10-28 09:46:32 +08:00
table.py Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819) 2025-10-28 09:40:37 +08:00
tag.py Move api.settings to common.settings (#11036) 2025-11-06 09:36:38 +08:00