- Add whitespace validation to the PDF English text checking regex
- Reduce false negatives in English PDF content recognition
### What problem does this PR solve?
English text in PDFs was sometimes not recognized as English because the detection regex did not accept spaces. This PR **widens the regex used for English text detection** so that it accepts more of the valid characters commonly found in English PDFs. The changes:
- Add the **space** character to the set of characters the regex accepts.
- Ensure the update does not reduce existing detection accuracy.
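
The effect of allowing spaces can be illustrated with a minimal sketch. The patterns, the 30-character run length, and the `looks_english` helper below are assumptions made for illustration only. They are not the exact regex or code in `pdf_parser.py`.

```python
import re

# Hypothetical patterns for illustration; not the project's actual regex.
# Both require a run of at least 30 consecutive "English-looking" characters.
WITHOUT_SPACE = re.compile(r"[A-Za-z0-9,.;:!?()\-]{30,}")
WITH_SPACE = re.compile(r"[A-Za-z0-9,.;:!?()\- ]{30,}")  # space added to the class


def looks_english(text: str, pattern: re.Pattern) -> bool:
    """Return True if the text contains a long enough run of accepted characters."""
    return pattern.search(text) is not None


english = "The quick brown fox jumps over the lazy dog near the river."
chinese = "这是一个中文文档 with few words"

# Without the space, every word boundary breaks the run, so ordinary
# English prose is missed (a false negative).
print(looks_english(english, WITHOUT_SPACE))
# With the space allowed, the whole sentence forms a single run.
print(looks_english(english, WITH_SPACE))
# Mostly non-English text still fails, since its English run is too short.
print(looks_english(chinese, WITH_SPACE))
```

In this sketch the only difference between the two patterns is the space inside the character class, which mirrors the kind of change described above.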
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)

| Name |
|---|
| `resume` |
| `__init__.py` |
| `docling_parser.py` |
| `docx_parser.py` |
| `excel_parser.py` |
| `figure_parser.py` |
| `html_parser.py` |
| `json_parser.py` |
| `markdown_parser.py` |
| `mineru_parser.py` |
| `pdf_parser.py` |
| `ppt_parser.py` |
| `tcadp_parser.py` |
| `txt_parser.py` |
| `utils.py` |