ragflow/deepdoc/parser
liuzhenghua ea5e8caa69
feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562)
### What problem does this PR solve?

When a PDF uses vector fonts, text in the rendered page image often
has missing strokes, which causes many OCR errors such as wrong or
dropped characters. The same problem affects extracted chart images.
Enabling antialiasing when rendering the page image avoids this.
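
As a rough illustration of why thin strokes disappear, the sketch below rasterizes a single stroke along one row of pixels. It is a simplified 1-D model, not the renderer's actual algorithm: the helper names `coverage` and `rasterize_row` and the stroke coordinates are made up for this example. Without antialiasing a pixel is inked only if its center falls inside the stroke, so a stroke narrower than one pixel can vanish entirely; with antialiasing each pixel gets a gray level proportional to how much of it the stroke covers.

```python
def coverage(pixel: int, left: float, right: float) -> float:
    """Fraction of the unit pixel [pixel, pixel + 1) covered by the stroke [left, right)."""
    overlap = min(pixel + 1, right) - max(pixel, left)
    return max(0.0, overlap)


def rasterize_row(left: float, right: float, width: int, antialias: bool) -> list[float]:
    """Rasterize one stroke onto a row of `width` pixels (0.0 = blank, 1.0 = full ink)."""
    row = []
    for px in range(width):
        if antialias:
            # Area sampling: ink is proportional to the covered fraction of the pixel.
            row.append(round(coverage(px, left, right), 3))
        else:
            # Point sampling: ink only if the pixel center lies inside the stroke.
            center = px + 0.5
            row.append(1.0 if left <= center < right else 0.0)
    return row


# A hypothetical stroke 0.4 px wide that misses every pixel center.
aliased = rasterize_row(2.05, 2.45, width=6, antialias=False)
smooth = rasterize_row(2.05, 2.45, width=6, antialias=True)
print(aliased)  # the stroke is lost completely
print(smooth)   # the stroke survives as a partially inked pixel
```

In the aliased row the stroke leaves no trace, which is the kind of gap an OCR model then has to guess around; the antialiased row still carries the stroke as a gray pixel.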

**Before**

![0089e1f76205b5b3](https://github.com/user-attachments/assets/a84f8cd7-48ae-4da4-81ca-fc0bd93320f1)

**After**

![03053149e919773a](https://github.com/user-attachments/assets/45fa5ebb-a2de-42b1-9535-1ea087877eb2)

You can use the following document for testing.

[Casio instruction manual (Casio说明书.pdf)](https://github.com/user-attachments/files/20119690/Casio.pdf)
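
To reproduce the before/after comparison on the test document, something like the following sketch could be used. It assumes the page images are rendered with pdfplumber, whose `Page.to_image` accepts an `antialias` flag; the helper names `resolution_for` and `render_page`, the `zoom` parameter, the local file name `Casio.pdf`, and the output file names are all hypothetical and not taken from this PR.

```python
def resolution_for(zoom: int) -> int:
    """PDF user space is 72 units per inch, so a zoom factor maps to 72 * zoom DPI."""
    return 72 * zoom


def render_page(pdf_path: str, page_index: int = 0, zoom: int = 3, antialias: bool = True):
    """Render one page to a PIL image, with or without antialiasing."""
    import pdfplumber  # third-party dependency, imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        page_image = page.to_image(resolution=resolution_for(zoom), antialias=antialias)
        # Copy the rendered bitmap so it stays usable after the PDF is closed.
        return page_image.original.copy()


if __name__ == "__main__":
    # Assumes the test document was downloaded next to this script as Casio.pdf.
    render_page("Casio.pdf", antialias=False).save("before.png")
    render_page("Casio.pdf", antialias=True).save("after.png")
```

Viewing `before.png` and `after.png` side by side at the same zoom level should make any difference in stroke rendering visible.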


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
2025-05-12 09:50:21 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| resume | Fix:when start with source code not in docker env report 'UnicodeDec… (#5802) | 2025-03-10 11:22:06 +08:00 |
| __init__.py | Update comments (#4569) | 2025-01-21 20:52:28 +08:00 |
| docx_parser.py | Update comments (#4569) | 2025-01-21 20:52:28 +08:00 |
| excel_parser.py | Fix: When Excel is a formula, the parsed result is a formula, but cannot be correctly parsed as a value type (#6613) | 2025-03-28 09:33:49 +08:00 |
| figure_parser.py | Fix: Sometimes VisionFigureParser.figures may is tuple (#7477) | 2025-05-06 17:38:22 +08:00 |
| html_parser.py | Update comments (#4569) | 2025-01-21 20:52:28 +08:00 |
| json_parser.py | Update comments (#4569) | 2025-01-21 20:52:28 +08:00 |
| markdown_parser.py | Feat:Optimize the table extraction logic in the Markdown parser: (#5663) | 2025-03-07 17:02:35 +08:00 |
| pdf_parser.py | feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562) | 2025-05-12 09:50:21 +08:00 |
| ppt_parser.py | Refa: Optimize pptx shape extraction to reduce content loss (#6703) | 2025-04-22 10:16:24 +08:00 |
| txt_parser.py | Fix: delimiter issue. (#5720) | 2025-03-06 17:51:22 +08:00 |
| utils.py | Update comments (#4569) | 2025-01-21 20:52:28 +08:00 |