ragflow


test_raptor_utils.py	fix: Detect HTML tables in PDF content for Raptor auto-disable	2025-12-05 09:43:38 +01:00

hsparks.codes ff408dea69 fix: Detect HTML tables in PDF content for Raptor auto-disable

Addresses issue reported by @ahmadshakil where PDFs with HTML tables
(like Fbr_IncomeTaxOrdinance_2001) were still being sent to Raptor.

Problem:
- Original implementation only checked parser_id and html4excel config
- The 'naive' parser extracts tables from PDFs as <table> HTML
- These tables were not detected, so Raptor processed them anyway

Solution:
- Add content-based detection: analyze chunks for <table> HTML tags
- Skip Raptor if 30%+ of chunks contain HTML tables
- Check happens after chunks are loaded, before Raptor processing
- Configurable threshold via TABLE_CONTENT_THRESHOLD

New functions:
- contains_html_table(): Detect <table> tags in content
- analyze_chunks_for_tables(): Calculate table percentage in chunks
- should_skip_raptor_for_chunks(): Content-based skip decision
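
A minimal sketch of how the three functions listed above could fit together. The function names and TABLE_CONTENT_THRESHOLD come from the commit message. The signatures, the return shape, the regex, and the reading of "30%+" as a fraction of 0.30 are assumptions made for illustration, not the actual code in this commit.

```python
import re
from typing import Dict, Sequence

# Assumed default: "30%+" in the commit message, expressed as a fraction.
TABLE_CONTENT_THRESHOLD = 0.30

# Matches an opening <table> tag, with or without attributes, in any case.
_TABLE_TAG = re.compile(r"<table\b", re.IGNORECASE)


def contains_html_table(content: str) -> bool:
    """Return True if the text contains an opening <table> tag."""
    return bool(content) and _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks: Sequence[str]) -> Dict[str, float]:
    """Count chunks containing a table and compute their share of the total.

    The dict keys are hypothetical; the real return type is not shown
    in the commit message.
    """
    total = len(chunks)
    with_tables = sum(1 for chunk in chunks if contains_html_table(chunk))
    fraction = with_tables / total if total else 0.0
    return {"total": total, "with_tables": with_tables, "fraction": fraction}


def should_skip_raptor_for_chunks(
    chunks: Sequence[str], threshold: float = TABLE_CONTENT_THRESHOLD
) -> bool:
    """Skip Raptor when the share of table chunks reaches the threshold."""
    stats = analyze_chunks_for_tables(chunks)
    # An empty chunk list gives no evidence of tables, so do not skip.
    return stats["total"] > 0 and stats["fraction"] >= threshold
```

In this sketch the decision depends only on chunk text, so it applies whichever parser produced the chunks, which is the property the fix is after.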

Tests:
- Added 21 new tests for content-based detection (65 total)
- Includes test case simulating ahmadshakil's PDF scenario
- All tests passing

This fix ensures PDFs with extracted tables are properly skipped,
regardless of which parser was used.

2025-12-05 09:43:38 +01:00
