ragflow/test
hsparks.codes ff408dea69 fix: Detect HTML tables in PDF content for Raptor auto-disable
Addresses issue reported by @ahmadshakil where PDFs with HTML tables
(like Fbr_IncomeTaxOrdinance_2001) were still being sent to Raptor.

Problem:
- Original implementation only checked parser_id and html4excel config
- The 'naive' parser extracts tables from PDFs as <table> HTML
- These tables were not detected, so Raptor processed them anyway

Solution:
- Add content-based detection: analyze chunks for <table> HTML tags
- Skip Raptor if 30%+ of chunks contain HTML tables
- Check happens after chunks are loaded, before Raptor processing
- Configurable threshold via TABLE_CONTENT_THRESHOLD

New functions:
- contains_html_table(): Detect <table> tags in content
- analyze_chunks_for_tables(): Calculate table percentage in chunks
- should_skip_raptor_for_chunks(): Content-based skip decision
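
Illustrative sketch of how the three functions could fit together. Only the
function names, TABLE_CONTENT_THRESHOLD, and the 30% figure come from this
commit message. The signatures, return types, regex, and the assumption that
chunks are plain strings are hypothetical and may differ from the actual code.

```python
import re
from typing import Iterable

# From the commit message: skip Raptor when 30%+ of chunks contain tables.
# Representing the threshold as a 0..1 fraction is an assumption.
TABLE_CONTENT_THRESHOLD = 0.3

# Matches an opening <table> tag, with or without attributes (assumed pattern).
_TABLE_TAG = re.compile(r"<table\b", re.IGNORECASE)


def contains_html_table(content: str) -> bool:
    """Return True if the text contains an opening <table> tag."""
    return bool(content) and _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks: Iterable[str]) -> float:
    """Return the fraction of chunks (0.0 to 1.0) that contain an HTML table."""
    chunks = list(chunks)
    if not chunks:
        return 0.0
    with_tables = sum(1 for chunk in chunks if contains_html_table(chunk))
    return with_tables / len(chunks)


def should_skip_raptor_for_chunks(
    chunks: Iterable[str],
    threshold: float = TABLE_CONTENT_THRESHOLD,
) -> bool:
    """Content-based decision: skip Raptor when the table fraction meets the threshold."""
    return analyze_chunks_for_tables(chunks) >= threshold
```

In this sketch the check depends only on chunk text, so it applies to any
parser that emits <table> HTML, which matches the intent described above.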

Tests:
- Added 21 new tests for content-based detection (65 total)
- Includes test case simulating ahmadshakil's PDF scenario
- All tests passing

This fix ensures PDFs with extracted tables are properly skipped,
regardless of which parser was used.
2025-12-05 09:43:38 +01:00
testcases Feat: add context for figure and table (#11547) 2025-11-27 10:21:44 +08:00
unit_test fix: Detect HTML tables in PDF content for Raptor auto-disable 2025-12-05 09:43:38 +01:00