ragflow/rag
hsparks.codes ff408dea69 fix: Detect HTML tables in PDF content for Raptor auto-disable
Addresses issue reported by @ahmadshakil where PDFs with HTML tables
(like Fbr_IncomeTaxOrdinance_2001) were still being sent to Raptor.

Problem:
- The original implementation only checked parser_id and the html4excel config
- The 'naive' parser extracts PDF tables as <table> HTML
- These tables went undetected, so Raptor processed them anyway

Solution:
- Add content-based detection: analyze chunks for <table> HTML tags
- Skip Raptor if at least 30% of chunks contain HTML tables
- The check runs after chunks are loaded and before Raptor processing
- Configurable threshold via TABLE_CONTENT_THRESHOLD

New functions:
- contains_html_table(): Detect <table> tags in content
- analyze_chunks_for_tables(): Calculate table percentage in chunks
- should_skip_raptor_for_chunks(): Content-based skip decision
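
A minimal sketch of how the three functions above could fit together. This is
not the code from the commit. The signatures, the return types, the regex, and
the assumption that each chunk is a plain string are all illustrative. Only the
function names, TABLE_CONTENT_THRESHOLD, and the 30% default come from this
message.

```python
import re
from typing import Optional, Sequence

# Default taken from the commit message (30%); the constant's location is assumed.
TABLE_CONTENT_THRESHOLD = 0.3

# Matches an opening <table> tag, with or without attributes, in any case.
# The word boundary keeps look-alike tags such as <tablet> from matching.
_TABLE_TAG = re.compile(r"<table\b", re.IGNORECASE)


def contains_html_table(content: Optional[str]) -> bool:
    """Return True if the text contains an opening <table> tag."""
    if not content:
        return False
    return _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks: Sequence[str]) -> float:
    """Return the fraction of chunks (0.0 to 1.0) that contain an HTML table."""
    if not chunks:
        return 0.0
    with_tables = sum(1 for chunk in chunks if contains_html_table(chunk))
    return with_tables / len(chunks)


def should_skip_raptor_for_chunks(
    chunks: Sequence[str],
    threshold: float = TABLE_CONTENT_THRESHOLD,
) -> bool:
    """Return True when the table fraction reaches the threshold."""
    return analyze_chunks_for_tables(chunks) >= threshold


# Hypothetical usage: one table chunk out of three meets the 30% threshold.
sample_chunks = [
    "<table><tr><td>Rate</td><td>10%</td></tr></table>",
    "Section 1. Short title.",
    "Section 2. Definitions.",
]
skip = should_skip_raptor_for_chunks(sample_chunks)
```

An empty chunk list yields a fraction of 0.0 in this sketch, so Raptor is not
skipped when there is nothing to analyze. Whether the actual implementation
makes the same choice is not stated here.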

Tests:
- Added 21 new tests for content-based detection (65 total)
- Includes test case simulating ahmadshakil's PDF scenario
- All tests passing

This fix ensures PDFs with extracted tables are properly skipped,
regardless of which parser was used.
2025-12-05 09:43:38 +01:00
app Fix: relative page_number in boxes (#11712) 2025-12-04 11:23:34 +08:00
flow Feat: support TOC transformer. (#11685) 2025-12-03 12:27:50 +08:00
llm Refa: make RAGFlow more asynchronous 2 (#11689) 2025-12-03 14:19:53 +08:00
nlp Fix: Correct pagination and early termination bugs in chunk_list() (#11692) 2025-12-03 19:44:20 +08:00
prompts Refa: cleanup synchronous functions in agent_with_tools (#11736) 2025-12-04 14:15:05 +08:00
res Remove huqie.txt from RAGFlow and bump infinity to 0.6.10 (#11661) 2025-12-04 14:53:57 +08:00
svr fix: Detect HTML tables in PDF content for Raptor auto-disable 2025-12-05 09:43:38 +01:00
utils fix: Detect HTML tables in PDF content for Raptor auto-disable 2025-12-05 09:43:38 +01:00
__init__.py Fix: incorrect async chat streamly output (#11679) 2025-12-03 11:15:45 +08:00
benchmark.py Move api.settings to common.settings (#11036) 2025-11-06 09:36:38 +08:00
raptor.py Feat: add fault-tolerant mechanism to RAPTOR (#11206) 2025-11-13 18:48:07 +08:00
settings.py Move api.settings to common.settings (#11036) 2025-11-06 09:36:38 +08:00