ragflow/deepdoc/parser
myoldcat 8c28587821
Fix issue where HTML file parsing may lose content. (#11536)
### What problem does this PR solve?

##### Problem Description
When parsing HTML files, some page content may be lost.  
For example, text inside `<font>` tags nested in sibling `<div>`
elements (e.g.,
`<div><font>Text_1</font></div><div><font>Text_2</font></div>`) is not
preserved correctly.

###### Root Cause #1: Block ID propagation is interrupted
1. **Block ID generation**: When the parser encounters a `<div>`, it
generates a new `block_id` because `<div>` belongs to `BLOCK_TAGS`.
2. **Recursive processing**: This `block_id` is passed down recursively
to process the `<div>`’s child nodes.
3. **Interruption occurs**: When processing a child `<font>` tag, the
code enters the `else` branch of `read_text_recursively` (since `<font>`
is a Tag).
4. **Bug location**: The first line in this `else` branch explicitly
sets **`block_id = None`**.
    - This discards the valid `block_id` inherited from the parent `<div>`.
    - Since `<font>` is not in `BLOCK_TAGS`, it does not generate a new
    `block_id`, so it passes `None` to its child text nodes.
5. **Consequence**: The extracted text nodes have an empty `block_id` in
their `metadata`. During the subsequent `merge_block_text` step, these
texts cannot be correctly associated with their original `<div>` block
due to the missing ID. As a result, all text from `<font>` tags gets
merged together, which then triggers a second issue during
concatenation.
6. **Solution:** Remove the forced reset of `block_id` to `None`. When
the current tag (e.g., `<font>`) is not a block-level element, it should
inherit the `block_id` passed down from its parent. This ensures
consistent ownership across the hierarchy: `div` → `font` → `text`.
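
The propagation described above can be reproduced with a minimal sketch. This is not the code from `html_parser.py`: the `Tag` class, the `read_text` and `merge_by_block_id` helpers, the `reset_block_id` flag, and the contents of `BLOCK_TAGS` are all hypothetical stand-ins, and plain Python strings play the role of text nodes. The merge step is only assumed to group consecutive texts by `block_id`.

```python
from dataclasses import dataclass, field
from itertools import count, groupby

# Illustrative subset only; the parser's real BLOCK_TAGS list is not shown in the source.
BLOCK_TAGS = {"div", "p", "h1", "li"}


@dataclass
class Tag:
    """Hypothetical stand-in for an HTML element; str children act as text nodes."""
    name: str
    children: list = field(default_factory=list)


def read_text(node, block_id=None, reset_block_id=False, ids=None, out=None):
    """Collect (text, block_id) pairs. reset_block_id=True reproduces the bug."""
    ids = ids if ids is not None else count(1)
    out = out if out is not None else []
    if isinstance(node, str):
        # Text node: record it with whatever block_id was passed down.
        out.append((node, block_id))
        return out
    if reset_block_id:
        # Buggy behaviour: the inherited block_id is thrown away for every tag.
        block_id = None
    if node.name in BLOCK_TAGS:
        # Block-level tag: start a new block.
        block_id = next(ids)
    for child in node.children:
        read_text(child, block_id, reset_block_id, ids, out)
    return out


def merge_by_block_id(pairs):
    """Assumed merge step: join consecutive texts that share a block_id."""
    return [" ".join(text for text, _ in group)
            for _, group in groupby(pairs, key=lambda pair: pair[1])]


doc = Tag("body", [
    Tag("div", [Tag("font", ["Text_1"])]),
    Tag("div", [Tag("font", ["Text_2"])]),
])

buggy = read_text(doc, reset_block_id=True)   # [("Text_1", None), ("Text_2", None)]
fixed = read_text(doc, reset_block_id=False)  # [("Text_1", 1), ("Text_2", 2)]
```

With the reset in place, both text nodes carry `None` and the assumed merge step folds them into one block; without it, each text keeps the ID of its own `<div>`.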

###### Root Cause #2: Data loss during text concatenation
1. The line `current_content += (" " if current_content else "" +
content)` has a misplaced parenthesis. Because `+` binds more tightly
than a conditional expression, Python parses it as
`" " if current_content else ("" + content)`. When `current_content` is
non-empty (truthy):
    - The conditional expression evaluates to `" "` (a single space).
    - The code executes `current_content += " "`.
    - **Result**: Only a space is appended, and **the new `content`
    string is completely discarded**.
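
The precedence problem can be checked in isolation. The sketch below wraps the quoted line in a hypothetical `join_buggy` helper and compares it with `join_fixed`, which moves the closing parenthesis so that `content` is always appended. The corrected line is an assumption about what the fix looks like; the description above does not quote it.

```python
def join_buggy(parts):
    """Concatenate parts using the line quoted in the description."""
    current_content = ""
    for content in parts:
        # Parsed as: " " if current_content else ("" + content)
        current_content += (" " if current_content else "" + content)
    return current_content


def join_fixed(parts):
    """Same loop with the parenthesis closed before `+ content` (assumed fix)."""
    current_content = ""
    for content in parts:
        # The separator is chosen first, then content is always appended.
        current_content += (" " if current_content else "") + content
    return current_content


buggy_result = join_buggy(["Text_1", "Text_2"])  # "Text_1 " (second part lost)
fixed_result = join_fixed(["Text_1", "Text_2"])  # "Text_1 Text_2"
```

Only the first piece survives in the buggy version, because every later iteration appends a lone space.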

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 09:40:10 +08:00

| Name | Last commit | Date |
| --- | --- | --- |
| resume | Fix: resolve regex library warnings (#7782) | 2025-05-22 10:06:28 +08:00 |
| __init__.py | Feat: advanced markdown parsing (#9607) | 2025-08-21 09:36:18 +08:00 |
| docling_parser.py | Feat: add more chunking method (#11413) | 2025-11-20 19:07:17 +08:00 |
| docx_parser.py | Refactor parser code (#9042) | 2025-07-25 12:04:07 +08:00 |
| excel_parser.py | Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819) | 2025-10-28 09:40:37 +08:00 |
| figure_parser.py | Refa: rm useless code. (#11238) | 2025-11-13 09:59:55 +08:00 |
| html_parser.py | Fix issue where HTML file parsing may lose content. (#11536) | 2025-11-27 09:40:10 +08:00 |
| json_parser.py | Feat: parsing supports jsonl or ldjson format (#9087) | 2025-07-30 09:48:20 +08:00 |
| markdown_parser.py | Fix: incorrect image merging for naive markdown parser (#11520) | 2025-11-25 19:54:06 +08:00 |
| mineru_parser.py | Feat: add more chunking method (#11413) | 2025-11-20 19:07:17 +08:00 |
| pdf_parser.py | Fix: improve PDF text type detection by expanding regex content (#11432) | 2025-11-21 14:33:29 +08:00 |
| ppt_parser.py | fix "TypeError: '<' not supported between instances of 'Emu' and 'Non… (#9209) | 2025-08-04 16:07:03 +08:00 |
| tcadp_parser.py | Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041) | 2025-11-20 10:08:42 +08:00 |
| txt_parser.py | Move token related functions to common (#10942) | 2025-11-03 08:50:05 +08:00 |
| utils.py | Update comments (#4569) | 2025-01-21 20:52:28 +08:00 |