ragflow

myoldcat 8c28587821 Fix issue where HTML file parsing may lose content. (#11536 )

### What problem does this PR solve?

##### Problem Description

When parsing HTML files, some page content may be lost. For example, text inside nested `<font>` tags within multiple `<div>` elements (e.g., `<div><font>Text_1</font></div><div><font>Text_2</font></div>`) is not preserved correctly.

###### Root Cause #1: Block ID propagation is interrupted

1. Block ID generation: When the parser encounters a `<div>`, it generates a new `block_id` because `<div>` belongs to `BLOCK_TAGS`.
2. Recursive processing: This `block_id` is passed down recursively to process the `<div>`’s child nodes.
3. Interruption occurs: When processing a child `<font>` tag, the code enters the `else` branch of `read_text_recursively` (since `<font>` is a Tag).
4. Bug location: The first line in this `else` branch explicitly sets `block_id = None`.
   - This discards the valid `block_id` inherited from the parent `<div>`.
   - Since `<font>` is not in `BLOCK_TAGS`, it does not generate a new `block_id`, so it passes `None` to its child text nodes.
5. Consequence: The extracted text nodes have an empty `block_id` in their `metadata`. During the subsequent `merge_block_text` step, these texts cannot be associated with their original `<div>` block because the ID is missing. As a result, all text from `<font>` tags gets merged together, which then triggers a second issue during concatenation.
6. Solution: Remove the forced reset of `block_id` to `None`. When the current tag (e.g., `<font>`) is not a block-level element, it should inherit the `block_id` passed down from its parent. This ensures consistent ownership across the hierarchy: `div` → `font` → `text`.

###### Root Cause #2: Data loss during text concatenation

1. The line `current_content += (" " if current_content else "" + content)` has a misplaced parenthesis. When `current_content` is non-empty (`True`):
   - The ternary expression evaluates to `" "` (a single space).
   - The code executes `current_content += " "`.
   - Result: Only a space is appended; the new `content` string is completely discarded.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

2025-11-27 09:40:10 +08:00
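The two root causes described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the code in `html_parser.py`: the toy tree format, the function names `read_text`, `append_buggy` and `append_fixed`, the `reset_block_id` flag, and the contents of `BLOCK_TAGS` are all assumptions made for illustration. The corrected concatenation shown here is one plausible reading of the "misplaced parenthesis" remark; the commit message does not quote the fixed line.

```python
from itertools import count

# Assumption: a small stand-in for the parser's real BLOCK_TAGS set.
BLOCK_TAGS = {"div", "p"}


def read_text(node, block_id=None, reset_block_id=False, ids=None):
    """Walk a toy tree of (tag, children) tuples and plain strings.

    Returns a list of (text, block_id) pairs. With reset_block_id=True the
    walker mimics the forced reset described in Root Cause #1.
    """
    ids = ids if ids is not None else count(1)
    if isinstance(node, str):
        return [(node, block_id)]
    tag, children = node
    if reset_block_id:
        block_id = None  # discards the ID inherited from the parent
    if tag in BLOCK_TAGS:
        block_id = next(ids)  # block-level tags start a new block
    pairs = []
    for child in children:
        pairs.extend(read_text(child, block_id, reset_block_id, ids))
    return pairs


def append_buggy(current_content, content):
    # The conditional expression binds more loosely than "+", so this is
    # parsed as: " " if current_content else ("" + content)
    current_content += (" " if current_content else "" + content)
    return current_content


def append_fixed(current_content, content):
    # Parenthesize only the separator, then append the new content.
    current_content += (" " if current_content else "") + content
    return current_content


tree = ("body", [
    ("div", [("font", ["Text_1"])]),
    ("div", [("font", ["Text_2"])]),
])

with_reset = read_text(tree, reset_block_id=True)
without_reset = read_text(tree, reset_block_id=False)
```

In this sketch, `with_reset` carries `None` as the block ID for both text nodes, while `without_reset` keeps each text attached to its own `<div>`. Likewise, `append_buggy("Text_1", "Text_2")` drops the second string, and `append_fixed` keeps it.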
..
resume	Fix: resolve regex library warnings (#7782 )	2025-05-22 10:06:28 +08:00
__init__.py	Feat: advanced markdown parsing (#9607 )	2025-08-21 09:36:18 +08:00
docling_parser.py	Feat: add more chunking method (#11413 )	2025-11-20 19:07:17 +08:00
docx_parser.py	Refactor parser code (#9042 )	2025-07-25 12:04:07 +08:00
excel_parser.py	Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819 )	2025-10-28 09:40:37 +08:00
figure_parser.py	Refa: rm useless code. (#11238 )	2025-11-13 09:59:55 +08:00
html_parser.py	Fix issue where HTML file parsing may lose content. (#11536 )	2025-11-27 09:40:10 +08:00
json_parser.py	Feat: parsing supports jsonl or ldjson format (#9087 )	2025-07-30 09:48:20 +08:00
markdown_parser.py	Fix: incorrect image merging for naive markdown parser (#11520 )	2025-11-25 19:54:06 +08:00
mineru_parser.py	Feat: add more chunking method (#11413 )	2025-11-20 19:07:17 +08:00
pdf_parser.py	Fix: improve PDF text type detection by expanding regex content (#11432 )	2025-11-21 14:33:29 +08:00
ppt_parser.py	fix "TypeError: '<' not supported between instances of 'Emu' and 'Non… (#9209 )	2025-08-04 16:07:03 +08:00
tcadp_parser.py	Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041 )	2025-11-20 10:08:42 +08:00
txt_parser.py	Move token related functions to common (#10942 )	2025-11-03 08:50:05 +08:00
utils.py	Update comments (#4569 )	2025-01-21 20:52:28 +08:00