- Changed fallback image generation to page-width strips (full horizontal, bbox vertical)
- Implemented smart crop() with native+fallback mixing and deduplication
- Added thresholds: max 10 images, total height <2000px
- Established native_img_map for table/image/equation priority
- Removed 120px padding logic that caused super-long stitched thumbnails

This fixes the issue where chunk thumbnails were either missing or excessively long due to:

1. MinerU not providing images for pure text blocks
2. Official crop() adding 120px padding and stitching across pages
3. Manual.py merging multiple sections into one chunk

The new approach:

- Priority 1: Use MinerU's native high-quality images (tables/equations)
- Priority 2: Use page-width fallback strips (consistent width for stitching)
- Priority 3: Use full page as last resort
- Deduplicates identical bboxes during stitching
- Limits output to reasonable dimensions for UX

| Name | | |
|---|---|---|
| resume | ||
| __init__.py | ||
| docling_parser.py | ||
| docx_parser.py | ||
| excel_parser.py | ||
| figure_parser.py | ||
| html_parser.py | ||
| json_parser.py | ||
| markdown_parser.py | ||
| mineru_parser.py | ||
| pdf_parser.py | ||
| ppt_parser.py | ||
| tcadp_parser.py | ||
| txt_parser.py | ||
| utils.py | ||
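
The priority order, deduplication, and thresholds described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the code in `mineru_parser.py`: the `Block` type, the `select_crops` and `fallback_strip` helpers, the shape of `native_img_map` (keyed by page and bbox), and the choice to stop at the first crop that would exceed a limit are all assumptions made for the example. Only the limits (10 images, total height under 2000px) and the three priorities come from the commit message.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels

MAX_IMAGES = 10          # from the commit message: max 10 images
MAX_TOTAL_HEIGHT = 2000  # from the commit message: total height < 2000px


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block; bbox is None when no position is known."""
    page: int
    bbox: Optional[BBox]


def fallback_strip(bbox: BBox, page_width: int) -> BBox:
    """Page-width strip: full horizontal extent, the block's vertical extent."""
    _, y0, _, y1 = bbox
    return (0, y0, page_width, y1)


def select_crops(
    blocks: List[Block],
    native_img_map: Dict[Tuple[int, BBox], str],  # assumed key: (page, bbox)
    page_width: int,
    page_height: int,
) -> List[Tuple[int, BBox, str]]:
    """Pick the crop regions to stitch, as (page, bbox, source) tuples."""
    seen = set()
    chosen: List[Tuple[int, BBox, str]] = []
    total_height = 0

    for block in blocks:
        if block.bbox is None:
            # Priority 3: full page as a last resort.
            bbox, source = (0, 0, page_width, page_height), "full_page"
        elif (block.page, block.bbox) in native_img_map:
            # Priority 1: a native image exists for this block.
            bbox, source = block.bbox, "native"
        else:
            # Priority 2: page-width fallback strip.
            bbox, source = fallback_strip(block.bbox, page_width), "fallback"

        key = (block.page, bbox)
        if key in seen:
            continue  # deduplicate identical regions before stitching

        height = bbox[3] - bbox[1]
        # Assumption: stop at the first crop that would break a limit.
        if len(chosen) >= MAX_IMAGES or total_height + height >= MAX_TOTAL_HEIGHT:
            break

        seen.add(key)
        chosen.append((block.page, bbox, source))
        total_height += height

    return chosen
```

Because every fallback strip spans the full page width, two text blocks that share the same vertical extent on a page collapse to one region, which is one way the deduplication step could keep stitched thumbnails short.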