Addresses issue reported by @ahmadshakil where PDFs with HTML tables (like Fbr_IncomeTaxOrdinance_2001) were still being sent to Raptor.

Problem:
- Original implementation only checked parser_id and html4excel config
- PDFs parsed with 'naive' parser extract tables as <table> HTML
- These tables were not detected, so Raptor processed them anyway

Solution:
- Add content-based detection: analyze chunks for <table> HTML tags
- Skip Raptor if 30%+ of chunks contain HTML tables
- Check happens after chunks are loaded, before Raptor processing
- Configurable threshold via TABLE_CONTENT_THRESHOLD

New functions:
- contains_html_table(): Detect <table> tags in content
- analyze_chunks_for_tables(): Calculate table percentage in chunks
- should_skip_raptor_for_chunks(): Content-based skip decision

Tests:
- Added 21 new tests for content-based detection (65 total)
- Includes test case simulating ahmadshakil's PDF scenario
- All tests passing

This fix ensures PDFs with extracted tables are properly skipped, regardless of which parser was used.
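
As a rough illustration of the content-based check described in this commit message, the three functions might look something like the sketch below. Only the function names, the `<table>` detection, the 30% figure, and the `TABLE_CONTENT_THRESHOLD` name come from the message. The signatures, the representation of chunks as plain strings, the ratio return value, and the regex are assumptions, not the actual implementation.

```python
import re
from typing import Iterable, Optional

# Assumed default: the commit message says "30%+" and that the value is
# configurable, but not how it is stored or loaded.
TABLE_CONTENT_THRESHOLD = 0.30

# Matches an opening <table> tag, with or without attributes, case-insensitively.
# "<tablet>" is deliberately not matched.
_TABLE_TAG = re.compile(r"<table[\s>]", re.IGNORECASE)


def contains_html_table(content: Optional[str]) -> bool:
    """Return True if the text contains an opening <table> tag."""
    return bool(content) and _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks: Iterable[Optional[str]]) -> float:
    """Return the fraction of chunks (0.0 to 1.0) that contain an HTML table."""
    chunk_list = list(chunks)
    if not chunk_list:
        return 0.0
    with_tables = sum(1 for chunk in chunk_list if contains_html_table(chunk))
    return with_tables / len(chunk_list)


def should_skip_raptor_for_chunks(
    chunks: Iterable[Optional[str]],
    threshold: float = TABLE_CONTENT_THRESHOLD,
) -> bool:
    """Decide whether Raptor should be skipped, based on table-heavy content."""
    return analyze_chunks_for_tables(chunks) >= threshold
```

Under these assumptions, a document where 4 of 10 chunks carry a `<table>` tag would be skipped, while one with a single table chunk out of 10 would still go through Raptor.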
| Name |
|---|
| testcases |
| unit_test |