ragflow/rag/utils
hsparks-codes 4870d42949
feat: Auto-disable Raptor for structured data (Issue #11653) (#11676)
### What problem does this PR solve?

Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.

**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.

**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)

**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications

**Usage Examples**:
```
# Excel file - automatically skipped
should_skip_raptor(".xlsx")  # True

# CSV file - automatically skipped  
should_skip_raptor(".csv")  # True

# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table")  # True

# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive")  # False

# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False})  # False
```

**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.

**Testing**: 44 comprehensive tests, 100% passing

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 17:02:29 +08:00
..
__init__.py Move token related functions to common (#10942) 2025-11-03 08:50:05 +08:00
azure_sas_conn.py Move api.settings to common.settings (#11036) 2025-11-06 09:36:38 +08:00
azure_spn_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
base64_image.py Move some vars to globals (#11017) 2025-11-05 14:14:38 +08:00
doc_store_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
es_conn.py Fix: Table parse method issue. (#11627) 2025-12-01 12:42:35 +08:00
file_utils.py Move some funcs from api to rag module (#10972) 2025-11-03 19:26:09 +08:00
infinity_conn.py Fix ft_title_rag_fine (#11555) 2025-11-27 10:26:08 +08:00
minio_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
ob_conn.py feat: add OceanBase doc engine (#11228) 2025-11-20 10:00:14 +08:00
opendal_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
opensearch_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
oss_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
raptor_utils.py feat: Auto-disable Raptor for structured data (Issue #11653) (#11676) 2025-12-03 17:02:29 +08:00
redis_conn.py feat: add Redis username support (#11608) 2025-12-01 11:26:20 +08:00
s3_conn.py Refactor function name (#11210) 2025-11-12 19:00:15 +08:00
storage_factory.py Move api.settings to common.settings (#11036) 2025-11-06 09:36:38 +08:00
tavily_conn.py Remove 'get_lan_ip' and add common misc_utils.py (#10880) 2025-10-31 16:42:01 +08:00