ragflow

FallingSnowFlake 1033a3ae26 Fix: improve PDF text type detection by expanding regex content (#11432)	2025-11-21 14:33:29 +08:00

- Add whitespace validation to the PDF English text checking regex
- Reduce false negatives in English PDF content recognition

### What problem does this PR solve?

The core idea is to expand the regex content used for English text detection so it can accommodate more valid characters commonly found in English PDFs. The modifications include:

- Adding support for space in the regex.
- Ensuring the update does not reduce existing detection accuracy.

### Type of change

- [✅] Bug Fix (non-breaking change which fixes an issue)

..
resume	Fix: resolve regex library warnings (#7782)	2025-05-22 10:06:28 +08:00
__init__.py	Feat: advanced markdown parsing (#9607)	2025-08-21 09:36:18 +08:00
docling_parser.py	Feat: add more chunking method (#11413)	2025-11-20 19:07:17 +08:00
docx_parser.py	Refactor parser code (#9042)	2025-07-25 12:04:07 +08:00
excel_parser.py	Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819)	2025-10-28 09:40:37 +08:00
figure_parser.py	Refa: rm useless code. (#11238)	2025-11-13 09:59:55 +08:00
html_parser.py	Fix: set default chunk_token_num in html_parser (#10118)	2025-09-17 09:36:31 +08:00
json_parser.py	Feat: parsing supports jsonl or ldjson format (#9087)	2025-07-30 09:48:20 +08:00
markdown_parser.py	Fix: Markdown table extractor (#11018)	2025-11-05 16:10:21 +08:00
mineru_parser.py	Feat: add more chunking method (#11413)	2025-11-20 19:07:17 +08:00
pdf_parser.py	Fix: improve PDF text type detection by expanding regex content (#11432)	2025-11-21 14:33:29 +08:00
ppt_parser.py	fix "TypeError: '<' not supported between instances of 'Emu' and 'Non… (#9209)	2025-08-04 16:07:03 +08:00
tcadp_parser.py	Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041)	2025-11-20 10:08:42 +08:00
txt_parser.py	Move token related functions to common (#10942)	2025-11-03 08:50:05 +08:00
utils.py	Update comments (#4569)	2025-01-21 20:52:28 +08:00