ragflow/rag/utils
hsparks.codes ff408dea69 fix: Detect HTML tables in PDF content for Raptor auto-disable
Addresses issue reported by @ahmadshakil where PDFs with HTML tables
(like Fbr_IncomeTaxOrdinance_2001) were still being sent to Raptor.

Problem:
- Original implementation only checked parser_id and html4excel config
- PDFs parsed with 'naive' parser extract tables as <table> HTML
- These tables were not detected, so Raptor processed them anyway

Solution:
- Add content-based detection: analyze chunks for <table> HTML tags
- Skip Raptor if at least 30% of chunks contain HTML tables
- Check happens after chunks are loaded, before Raptor processing
- Configurable threshold via TABLE_CONTENT_THRESHOLD

New functions:
- contains_html_table(): Detect <table> tags in content
- analyze_chunks_for_tables(): Calculate table percentage in chunks
- should_skip_raptor_for_chunks(): Content-based skip decision
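
A minimal sketch of how these three functions could fit together, for illustration only. The function names and TABLE_CONTENT_THRESHOLD come from this commit message. The signatures, the regex, the choice to treat chunks as plain strings, and returning a 0.0 to 1.0 fraction are assumptions, not the actual contents of raptor_utils.py.

```python
import re
from typing import Iterable

# Assumed default: the commit message states a 30% cutoff, configurable under this name.
TABLE_CONTENT_THRESHOLD = 0.3

# Matches an opening <table> tag, with or without attributes, case-insensitively.
_TABLE_TAG = re.compile(r"<table[\s>]", re.IGNORECASE)


def contains_html_table(content: str) -> bool:
    """Return True if the text contains an opening <table> tag."""
    return bool(content) and _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks: Iterable[str]) -> float:
    """Return the fraction (0.0 to 1.0) of chunks that contain an HTML table."""
    chunks = list(chunks)
    if not chunks:
        return 0.0
    with_tables = sum(1 for chunk in chunks if contains_html_table(chunk))
    return with_tables / len(chunks)


def should_skip_raptor_for_chunks(
    chunks: Iterable[str],
    threshold: float = TABLE_CONTENT_THRESHOLD,
) -> bool:
    """Skip Raptor when the share of table-bearing chunks reaches the threshold."""
    return analyze_chunks_for_tables(chunks) >= threshold


if __name__ == "__main__":
    # Hypothetical document: 3 of 10 chunks carry an extracted HTML table.
    sample = ["<table><tr><td>1</td></tr></table>"] * 3 + ["plain text"] * 7
    print(should_skip_raptor_for_chunks(sample))
```

Under these assumptions the decision depends only on chunk content, so it applies regardless of which parser produced the chunks.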

Tests:
- Added 21 new tests for content-based detection (65 total)
- Includes test case simulating ahmadshakil's PDF scenario
- All tests passing

This fix ensures PDFs with extracted tables are properly skipped,
regardless of which parser was used.
2025-12-05 09:43:38 +01:00
__init__.py Move token related functions to common (#10942) 2025-11-03 08:50:05 +08:00
azure_sas_conn.py Move api.settings to common.settings (#11036) 2025-11-06 09:36:38 +08:00
azure_spn_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
base64_image.py Move some vars to globals (#11017) 2025-11-05 14:14:38 +08:00
doc_store_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
es_conn.py Fix: Table parse method issue. (#11627) 2025-12-01 12:42:35 +08:00
file_utils.py Move some funcs from api to rag module (#10972) 2025-11-03 19:26:09 +08:00
gcs_conn.py feat(gcs): Add support for Google Cloud Storage (GCS) integration (#11718) 2025-12-04 10:44:05 +08:00
infinity_conn.py Fix ft_title_rag_fine (#11555) 2025-11-27 10:26:08 +08:00
minio_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
ob_conn.py feat: add OceanBase doc engine (#11228) 2025-11-20 10:00:14 +08:00
opendal_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
opensearch_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
oss_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
raptor_utils.py fix: Detect HTML tables in PDF content for Raptor auto-disable 2025-12-05 09:43:38 +01:00
redis_conn.py feat: add Redis username support (#11608) 2025-12-01 11:26:20 +08:00
s3_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
storage_factory.py Move api.settings to common.settings (#11036) 2025-11-06 09:36:38 +08:00
tavily_conn.py Remove 'get_lan_ip' and add common misc_utils.py (#10880) 2025-10-31 16:42:01 +08:00