Commit graph

20 commits

Author SHA1 Message Date
少卿
2d4750535f fix: MinerU crop tag matching and manual.py bbox parsing
- Fixed crop() to extract original tags from text instead of reconstructing
- Added MinerU-specific logic in manual.py to handle space/tab separated tags
- Removed redundant import re that caused UnboundLocalError
- Ensures correct bbox coordinates for native images, fallback images, and page selection
2025-12-10 23:43:32 +08:00
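The tag handling described in 2d4750535f can be illustrated with a small sketch. The tag shape `@@<page> <x0> <x1> <top> <bottom>##`, the regular expression, and the helper names `extract_tags` and `parse_tag` are assumptions made for illustration; the commit only states that original tags are extracted from the text rather than reconstructed, and that MinerU tags may be space- or tab-separated.

```python
import re  # imported once at module level, not inside a function

# Hypothetical tag shape: "@@<page><sep><n><sep><n><sep><n><sep><n>##",
# where <sep> is one or more spaces or tabs.
NUM = r"[0-9]+(?:\.[0-9]+)?"
SEP = r"[ \t]+"
TAG_RE = re.compile(
    r"@@([0-9-]+)" + "".join(SEP + "(" + NUM + ")" for _ in range(4)) + "##"
)


def extract_tags(text):
    """Return every tag exactly as it appears in the text (no reconstruction)."""
    return [m.group(0) for m in TAG_RE.finditer(text)]


def parse_tag(tag):
    """Split one tag into (page, four floats), or return None if it does not match."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        return None
    return m.group(1), tuple(float(g) for g in m.groups()[1:])
```

Keeping the matched substring untouched means the tag can later be used as a lookup key without worrying about float formatting differences. Importing `re` at module level also sidesteps the Python behavior where a function-local `import re` placed after an earlier use of `re` in the same function raises `UnboundLocalError`.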
少卿
8a285d1230 feat(mineru): implement smart crop with page-width fallback and native image mixing
- Changed fallback image generation to page-width strips (full horizontal, bbox vertical)
- Implemented smart crop() with native+fallback mixing and deduplication
- Added thresholds: max 10 images, total height <2000px
- Established native_img_map for table/image/equation priority
- Removed 120px padding logic that caused super-long stitched thumbnails

This fixes the issue where chunk thumbnails were either missing or excessively long due to:
1. MinerU not providing images for pure text blocks
2. Official crop() adding 120px padding and stitching across pages
3. Manual.py merging multiple sections into one chunk

The new approach:
- Priority 1: Use MinerU's native high-quality images (tables/equations)
- Priority 2: Use page-width fallback strips (consistent width for stitching)
- Priority 3: Use full page as last resort
- Deduplicates identical bboxes during stitching
- Limits output to reasonable dimensions for UX
2025-12-10 21:26:25 +08:00
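The priority order, deduplication, and thresholds listed in 8a285d1230 could be sketched roughly as follows. This is not the code from the commit: `plan_crop`, the shape of `positions`, and `fallback_map` are hypothetical, and only `native_img_map`, the limit of 10 images, and the 2000 px height cap come from the message.

```python
MAX_IMAGES = 10          # from the commit message: at most 10 images
MAX_TOTAL_HEIGHT = 2000  # from the commit message: total height under 2000 px


def plan_crop(positions, native_img_map, fallback_map):
    """Decide which image source to use for each position.

    positions: list of (tag, bbox, height_px) in reading order, where bbox
    is a hashable tuple. Returns a list of (source, tag) pairs.
    """
    seen = set()
    plan = []
    total = 0
    for tag, bbox, height in positions:
        if bbox in seen:
            continue  # identical bbox already scheduled: deduplicate
        seen.add(bbox)
        if len(plan) >= MAX_IMAGES or total + height >= MAX_TOTAL_HEIGHT:
            break  # keep the stitched thumbnail within the limits
        if tag in native_img_map:
            source = "native"      # priority 1: MinerU's own image
        elif tag in fallback_map:
            source = "fallback"    # priority 2: page-width strip
        else:
            source = "full_page"   # priority 3: last resort
        plan.append((source, tag))
        total += height
    return plan
```

Separating the planning step from the actual pixel work keeps the thresholds easy to test without loading any page images.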
少卿
3bc3d82aa8 fix: Initialize imgs list in crop() fallback path
- Critical bug fix: imgs list was not initialized before use (line 439)
- Without this fix, a NameError occurs when a cache miss triggers the fallback path
- Discovered during reliability audit of MinerU image generation fix
2025-12-10 00:49:14 +08:00
少卿
1c7bc47579 fix(mineru): robust coordinate conversion in crop() fallback for 0-1000 tags
- Implement coordinate conversion (normalized -> pixels) in crop() fallback loop
- Ensures correct cropping from page_images when cache lookup fails
- Works consistently with _raw_line_tag (0-1000 normalized) changes
2025-12-09 23:32:27 +08:00
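The normalized-to-pixel conversion mentioned in 1c7bc47579 might look like the sketch below. The function name `norm_to_pixels` and the `(x0, y0, x1, y1)` ordering are assumptions; the commit only states that tags carry 0-1000 normalized coordinates that are converted to pixels before cropping from `page_images`.

```python
def norm_to_pixels(bbox, page_width, page_height, scale=1000):
    """Convert a bbox in 0..scale normalized units to pixel coordinates.

    bbox is assumed to be (x0, y0, x1, y1); page_width and page_height are
    the pixel dimensions of the rendered page image.
    """
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * page_width / scale),
        round(y0 * page_height / scale),
        round(x1 * page_width / scale),
        round(y1 * page_height / scale),
    )
```

For an 800 x 1200 px page, `norm_to_pixels((0, 250, 500, 1000), 800, 1200)` yields `(0, 300, 400, 1200)`.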
少卿
8049cb9275 fix(mineru): use consistent 0-1000 normalized coords for line_tag cache matching
2025-12-09 22:17:15 +08:00
少卿
eb004b6254 fix(mineru): use cached img_path in crop() to consume generated_images
- Add _img_path_cache dict to cache line_tag -> img_path mapping
- Populate cache in _generate_missing_images for fallback text block images
- Refactor crop() to check cache first, return cached image directly
- Fallback to single-position cropping to avoid super-tall merged images
- Fix text_types to use both string literals and enums for compatibility
- Add bbox clamping to prevent cropping errors
2025-12-09 21:18:19 +08:00
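Two of the ideas in eb004b6254, the `line_tag -> img_path` cache and bbox clamping, can be sketched in a few lines. Only the name `_img_path_cache` comes from the commit message; `remember_image`, `cached_image`, and `clamp_bbox` are hypothetical helpers, and the `(x0, y0, x1, y1)` ordering is an assumption.

```python
_img_path_cache = {}  # line_tag -> img_path


def remember_image(line_tag, img_path):
    """Record the image generated for a text block so crop() can reuse it."""
    _img_path_cache[line_tag] = img_path


def cached_image(line_tag):
    """Return the cached image path, or None on a cache miss."""
    return _img_path_cache.get(line_tag)


def clamp_bbox(bbox, width, height):
    """Clamp (x0, y0, x1, y1) to the page so cropping never leaves the image."""
    x0, y0, x1, y1 = bbox
    left = min(max(x0, 0), width)
    top = min(max(y0, 0), height)
    right = min(max(x1, left), width)    # never left of the clamped left edge
    bottom = min(max(y1, top), height)   # never above the clamped top edge
    return left, top, right, bottom
```

On a cache miss the caller would fall back to cropping a single position, passing the clamped box to the image library.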
少卿
b443d34faf Fix: Generate missing images for MinerU text blocks using local crop
2025-12-09 19:53:56 +08:00
Yongteng Lei
648342b62f
Fix: handle MinerU sanitized filenames when reading output (#11701)
### What problem does this PR solve?

Handle MinerU sanitized filenames when reading output. #11613, #11620.

Thanks @shaoqing404 for raising this issue.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-03 17:24:37 +08:00
Yongteng Lei
9d0309aedc
Fix: [MinerU] Missing output file (#11623)
### What problem does this PR solve?

Add fallbacks for MinerU output path. #11613, #11620.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 12:17:43 +08:00
Billy Bao
d3d2ccc76c
Feat: add more chunking methods (#11413)
### What problem does this PR solve?

Feat: add more chunking methods #11311

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 19:07:17 +08:00
Billy Bao
0884e9a4d9
Fix: bbox not included in mineru output (#11365)
### What problem does this PR solve?

Fix: bbox not included in mineru output #11315

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-19 13:59:32 +08:00
Yongteng Lei
c2b7c305fa
Fix: crop index may be out of range (#11341)
### What problem does this PR solve?

The crop index may be out of range. #11323


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-18 17:01:54 +08:00
Billy Bao
fea157ba08
Fix: manual parser with mineru (#11336)
### What problem does this PR solve?

Fix: manual parser with mineru #11320
Fix: missing parameter in mineru #11334
Fix: add outlines parameter for pdf parsers

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-18 15:22:52 +08:00
Billy Bao
e7e89d3ecb
Doc: style fix (#11295)
### What problem does this PR solve?

Style fix based on #11283.

### Type of change

- [x] Documentation Update
2025-11-17 11:16:34 +08:00
Stephen Hu
12db62b9c7
Refactor: improve mineru_parser get property logic (#11268)
### What problem does this PR solve?

improve mineru_parser get property logic

### Type of change

- [x] Refactoring
2025-11-14 16:32:35 +08:00
Yongteng Lei
2677617f93
Feat: supports MinerU http-client/server method (#10961)
### What problem does this PR solve?

Add support for MinerU http-client/server method.

To use MinerU with vLLM server:

1. Set up a vLLM server running MinerU:
   ```bash
   mineru-vllm-server --port 30000
   ```

2. Configure the following environment variables:
   - `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru` (or the path to your MinerU executable)
   - `MINERU_BACKEND="vlm-http-client"`
   - `MINERU_SERVER_URL="http://your-vllm-server-ip:30000"`

3. Follow the standard MinerU setup steps as described above.

With this configuration, RAGFlow will connect to your vLLM server to
perform document parsing, which can significantly improve parsing
performance for complex documents while reducing the resource
requirements on your RAGFlow server.
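As a rough illustration of how the variables above fit together, the following sketch reads them and checks that the HTTP client backend has a server URL. The function `mineru_settings` and its validation rule are assumptions, not RAGFlow's actual code; the variable names and the `pipeline` and `mineru` defaults are taken from the commit messages in this log.

```python
import os


def mineru_settings(env=None):
    """Collect MinerU settings from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    backend = env.get("MINERU_BACKEND", "pipeline")
    server_url = env.get("MINERU_SERVER_URL", "")
    # Assumption: the HTTP client backend is unusable without a server URL.
    if backend == "vlm-http-client" and not server_url:
        raise ValueError("MINERU_SERVER_URL is required for vlm-http-client")
    return {
        "executable": env.get("MINERU_EXECUTABLE", "mineru"),
        "backend": backend,
        "server_url": server_url,
    }
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real process environment.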



![1](https://github.com/user-attachments/assets/46624a0c-0f3b-423e-ace8-81801e97a27d)

![2](https://github.com/user-attachments/assets/66ccc004-a598-47d4-93cb-fe176834f83b)


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---------

Co-authored-by: writinwaters <cai.keith@gmail.com>
2025-11-04 16:03:30 +08:00
Stephen Hu
09dd786674
Fix: KeyError: 'table_body' in mineru parser (#10773)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/10769

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-31 10:07:56 +08:00
Edward Chen
b52f09adfe
MinerU API support (#10874)
### What problem does this PR solve?

Support a local MinerU API from a Docker instance, for example when WSL on
Windows has no GPU but a MinerU API with GPU support is available.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-10-30 17:31:46 +08:00
Yongteng Lei
5acc407240
Feat: MinerU supports VLM-Transformers backend (#10809)
### What problem does this PR solve?

MinerU supports VLM-Transformers backend.

Set `MINERU_BACKEND="pipeline"` to choose the backend. (Options:
pipeline | vlm-transformers, default is pipeline)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-10-27 17:04:13 +08:00
Yongteng Lei
387baf858f
Feat: add MinerU parser (#10621)
### What problem does this PR solve?

Add MinerU parser. #3945, #8092.

Set `MINERU_EXECUTABLE` to the MinerU executable path, defaults to
`mineru`.

Set `MINERU_DELETE_OUTPUT=0` to preserve MinerU's output, default is 1,
which deletes temporary output.

Set `MINERU_OUTPUT_DIR` to choose the MinerU output directory (uses the
temporary directory if unset).
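The output-related variables could be interpreted along these lines. This is a sketch under assumptions: `resolve_output` is a hypothetical helper, treating only the literal string `0` as "preserve" is a guess at the parsing rule, and the system temporary directory stands in for whatever temporary location the parser actually uses.

```python
import os
import tempfile


def resolve_output(env=None):
    """Return (output_dir, delete_output) from the environment (illustrative)."""
    env = os.environ if env is None else env
    # Default is 1 (delete temporary output); 0 preserves it.
    delete_output = env.get("MINERU_DELETE_OUTPUT", "1") != "0"
    # Fall back to a temporary location when no directory is configured.
    output_dir = env.get("MINERU_OUTPUT_DIR") or tempfile.gettempdir()
    return output_dir, delete_output
```

Setting `MINERU_DELETE_OUTPUT=0` together with an explicit `MINERU_OUTPUT_DIR` is the combination that makes MinerU's intermediate files easy to inspect when debugging.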

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-17 09:55:39 +08:00