ragflow/personal_analyze/07-DEEPDOC-DEEP-GUIDE/pdf_parser_steps_detail.md
Claude 1dcc9a870b
docs: Add detailed PDF parser processing steps documentation
Created comprehensive documentation for RAGFlowPdfParser processing pipeline:

- 10 major processing steps with code references
- Complete data flow diagrams
- Algorithm explanations (K-Means column detection, text merging)
- Box data structure evolution through pipeline
- Position tag format specification
- Line-by-line code analysis for key methods:
  - __init__ (model loading)
  - __images__ (OCR processing)
  - _layouts_rec (layout detection)
  - _table_transformer_job (table structure)
  - _assign_column (column detection)
  - _text_merge (horizontal merge)
  - _naive_vertical_merge (vertical merge)
  - _filter_forpages (cleanup)
  - _extract_table_figure (extraction)
  - __filterout_scraps (final output)
2025-11-27 06:29:12 +00:00

Detailed Processing Steps When Calling RAGFlowPdfParser

Table of Contents

  1. Pipeline Overview
  2. Step 1: Parser Initialization
  3. Step 2: Load Images & OCR
  4. Step 3: Layout Recognition
  5. Step 4: Table Structure Detection
  6. Step 5: Column Detection
  7. Step 6: Text Merge (Horizontal)
  8. Step 7: Text Merge (Vertical)
  9. Step 8: Filter & Cleanup
  10. Step 9: Extract Tables & Figures
  11. Step 10: Final Output

1. Pipeline Overview

1.1 Entry Points

There are two main entry points:

# Entry Point 1: Simple call (Lines 1160-1168)
def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
    self.__images__(fnm, zoomin)           # Step 2
    self._layouts_rec(zoomin)              # Step 3
    self._table_transformer_job(zoomin)    # Step 4
    self._text_merge()                     # Step 6
    self._concat_downward()                # Step 7 (disabled)
    self._filter_forpages()                # Step 8
    tbls = self._extract_table_figure(...) # Step 9
    return self.__filterout_scraps(...), tbls  # Step 10

# Entry Point 2: Detailed parsing (Lines 1170-1252)
def parse_into_bboxes(self, fnm, callback=None, zoomin=3):
    # Same steps, but with progress callbacks and more detailed output

1.2 Pipeline Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PDF PARSING PIPELINE                                 │
└─────────────────────────────────────────────────────────────────────────────┘

PDF File (path/bytes)
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 1: __init__()                                                          │
│  • Load OCR model (DBNet + CRNN)                                            │
│  • Load LayoutRecognizer (YOLOv10)                                          │
│  • Load TableStructureRecognizer (YOLOv10)                                  │
│  • Load XGBoost model (text concatenation)                                  │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 2: __images__()                                                        │
│  • Convert PDF pages to images (pdfplumber)                                 │
│  • Extract native PDF characters                                            │
│  • Run OCR detection + recognition                                          │
│  • Merge native chars with OCR boxes                                        │
│  Output: self.boxes[], self.page_images[], self.page_chars[]               │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 3: _layouts_rec()                                                      │
│  • Run YOLOv10 on page images                                               │
│  • Detect 10 layout types (Text, Title, Table, Figure...)                   │
│  • Associate OCR boxes with layouts                                         │
│  • Filter garbage (headers, footers, page numbers)                          │
│  Output: boxes[] with layout_type, layoutno attributes                      │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 4: _table_transformer_job()                                            │
│  • Crop table regions from images                                           │
│  • Run TableStructureRecognizer                                             │
│  • Detect rows, columns, headers, spanning cells                            │
│  • Tag boxes with R (row), C (column), H (header), SP (spanning)           │
│  Output: self.tb_cpns[], boxes[] with table attributes                      │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 5: _assign_column() (called in _text_merge)                            │
│  • K-Means clustering on X coordinates                                      │
│  • Silhouette score to find optimal k (1-4 columns)                         │
│  • Assign col_id to each text box                                           │
│  Output: boxes[] with col_id attribute                                      │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 6: _text_merge()                                                       │
│  • Horizontal merge: same line, same column, same layout                    │
│  Output: Fewer, wider text boxes                                            │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 7: _naive_vertical_merge() / _concat_downward()                        │
│  • Vertical merge: adjacent paragraphs                                      │
│  • Semantic checks (punctuation, distance, overlap)                         │
│  Output: Merged paragraphs                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 8: _filter_forpages()                                                  │
│  • Remove table of contents                                                 │
│  • Remove dirty pages (repetitive patterns)                                 │
│  Output: Cleaned boxes[]                                                    │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 9: _extract_table_figure()                                             │
│  • Extract table boxes → construct HTML/descriptive                         │
│  • Extract figure boxes → crop images                                       │
│  • Associate captions with tables/figures                                   │
│  Output: tables[], figures[]                                                │
└─────────────────────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STEP 10: __filterout_scraps()                                               │
│  • Filter low-quality text blocks                                           │
│  • Add position tags                                                        │
│  • Format final output                                                      │
│  Output: (documents, tables)                                                │
└─────────────────────────────────────────────────────────────────────────────┘

Step 1: Parser Initialization (__init__)

File: pdf_parser.py Lines: 52-105

Code Analysis

class RAGFlowPdfParser:
    def __init__(self, **kwargs):
        # ═══════════════════════════════════════════════════════════════════
        # 1. LOAD OCR MODEL
        # ═══════════════════════════════════════════════════════════════════
        self.ocr = OCR()  # Line 66
        # The OCR class contains:
        # - TextDetector (DBNet): detects text regions
        # - TextRecognizer (CRNN): recognizes the text inside each region

        # ═══════════════════════════════════════════════════════════════════
        # 2. SETUP PARALLEL PROCESSING
        # ═══════════════════════════════════════════════════════════════════
        self.parallel_limiter = None
        if settings.PARALLEL_DEVICES > 1:
            # Create one capacity limiter per GPU
            self.parallel_limiter = [
                trio.CapacityLimiter(1)  # 1 task per device
                for _ in range(settings.PARALLEL_DEVICES)
            ]

        # ═══════════════════════════════════════════════════════════════════
        # 3. LOAD LAYOUT RECOGNIZER
        # ═══════════════════════════════════════════════════════════════════
        layout_recognizer_type = os.getenv("LAYOUT_RECOGNIZER_TYPE", "onnx")

        if layout_recognizer_type == "ascend":
            self.layouter = AscendLayoutRecognizer(recognizer_domain)  # Huawei NPU
        else:
            self.layouter = LayoutRecognizer(recognizer_domain)  # ONNX (default)

        # ═══════════════════════════════════════════════════════════════════
        # 4. LOAD TABLE STRUCTURE RECOGNIZER
        # ═══════════════════════════════════════════════════════════════════
        self.tbl_det = TableStructureRecognizer()  # Line 86

        # ═══════════════════════════════════════════════════════════════════
        # 5. LOAD XGBOOST MODEL (Text Concatenation)
        # ═══════════════════════════════════════════════════════════════════
        self.updown_cnt_mdl = xgb.Booster()  # Line 88

        # Try GPU first
        try:
            import torch.cuda
            if torch.cuda.is_available():
                self.updown_cnt_mdl.set_param({"device": "cuda"})
        except:
            pass

        # Load model weights
        model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
        self.updown_cnt_mdl.load_model(
            os.path.join(model_dir, "updown_concat_xgb.model")
        )

        # ═══════════════════════════════════════════════════════════════════
        # 6. INITIALIZE STATE
        # ═══════════════════════════════════════════════════════════════════
        self.page_from = 0
        self.column_num = 1

Models Loaded

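| Model                    | Type           | Purpose            | Size  |
|--------------------------|----------------|--------------------|-------|
| OCR (DBNet)              | ONNX           | Text detection     | ~30MB |
| OCR (CRNN)               | ONNX           | Text recognition   | ~20MB |
| LayoutRecognizer         | ONNX (YOLOv10) | Layout detection   | ~50MB |
| TableStructureRecognizer | ONNX (YOLOv10) | Table structure    | ~50MB |
| XGBoost                  | Binary         | Text concatenation | ~5MB  |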

Step 2: Load Images & OCR (__images__)

File: pdf_parser.py Lines: 1042-1159

2.1 PDF to Images Conversion

def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # ═══════════════════════════════════════════════════════════════════
    # INITIALIZE STATE VARIABLES
    # ═══════════════════════════════════════════════════════════════════
    self.lefted_chars = []       # Characters that match no OCR box
    self.mean_height = []        # Average character height per page
    self.mean_width = []         # Average character width per page
    self.boxes = []              # OCR results
    self.garbages = {}           # Garbage patterns found
    self.page_cum_height = [0]   # Cumulative page heights
    self.page_layout = []        # Layout detection results
    self.page_from = page_from

    # ═══════════════════════════════════════════════════════════════════
    # CONVERT PDF PAGES TO IMAGES (Lines 1052-1067)
    # ═══════════════════════════════════════════════════════════════════
    with pdfplumber.open(fnm) as pdf:
        self.pdf = pdf

        # Convert each page to image
        # resolution = 72 * zoomin (default: 72 * 3 = 216 DPI)
        self.page_images = [
            p.to_image(resolution=72 * zoomin, antialias=True).annotated
            for p in pdf.pages[page_from:page_to]
        ]

        # ═══════════════════════════════════════════════════════════════
        # EXTRACT NATIVE PDF CHARACTERS (Lines 1058-1062)
        # ═══════════════════════════════════════════════════════════════
        # Extract character-level info from PDF text layer
        self.page_chars = [
            [c for c in page.dedupe_chars().chars if self._has_color(c)]
            for page in pdf.pages[page_from:page_to]
        ]

        self.total_page = len(pdf.pages)

2.2 Language Detection

    # ═══════════════════════════════════════════════════════════════════
    # DETECT DOCUMENT LANGUAGE (Lines 1093-1100)
    # ═══════════════════════════════════════════════════════════════════
    # Sample random characters, check if English
    self.is_english = [
        re.search(r"[ a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}",
                  "".join(random.choices([c["text"] for c in self.page_chars[i]],
                                        k=min(100, len(self.page_chars[i])))))
        for i in range(len(self.page_chars))
    ]

    # If >50% pages are English, mark document as English
    if sum([1 if e else 0 for e in self.is_english]) > len(self.page_images) / 2:
        self.is_english = True
    else:
        self.is_english = False
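
The language check above can be tried in isolation. The following is a minimal sketch, not the parser's own code: the function names `page_looks_english` and `document_is_english` are hypothetical, pages are assumed to be plain lists of single-character strings, and the character class is a simplified version of the pattern quoted above.

```python
import random
import re

# Simplified character class; a run of 30+ such characters counts as "English".
ENGLISH_RUN = re.compile(r"[ a-zA-Z0-9,/;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}")


def page_looks_english(chars, sample_size=100):
    """Sample characters (with replacement) and look for a long ASCII run."""
    if not chars:
        return False
    sample = random.choices(chars, k=min(sample_size, len(chars)))
    return ENGLISH_RUN.search("".join(sample)) is not None


def document_is_english(pages):
    """Majority vote: more than half of the pages must look English."""
    votes = sum(page_looks_english(p) for p in pages)
    return votes > len(pages) / 2


english_page = list("The quick brown fox jumps over the lazy dog again and again")
chinese_page = list("这是一个中文页面" * 5)

print(document_is_english([english_page, english_page, chinese_page]))
print(document_is_english([english_page, chinese_page, chinese_page]))
```

Because the sample is drawn with replacement, a page with mixed scripts can flip between runs; the vote across pages is what dampens that noise.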

2.3 OCR Processing (Parallel)

    # ═══════════════════════════════════════════════════════════════════
    # ASYNC OCR PROCESSING (Lines 1102-1145)
    # ═══════════════════════════════════════════════════════════════════
    async def __img_ocr(i, device_id, img, chars, limiter):
        # Add spaces between characters if needed
        j = 0
        while j + 1 < len(chars):
            if (chars[j]["text"] and chars[j + 1]["text"]
                and re.match(r"[0-9a-zA-Z,.:;!%]+",
                            chars[j]["text"] + chars[j + 1]["text"])
                and chars[j + 1]["x0"] - chars[j]["x1"] >=
                    min(chars[j + 1]["width"], chars[j]["width"]) / 2):
                chars[j]["text"] += " "
            j += 1

        # Run OCR with rate limiting for parallel execution
        if limiter:
            async with limiter:
                await trio.to_thread.run_sync(
                    lambda: self.__ocr(i + 1, img, chars, zoomin, device_id)
                )
        else:
            self.__ocr(i + 1, img, chars, zoomin, device_id)

    # Launch OCR tasks
    async def __img_ocr_launcher():
        if self.parallel_limiter:
            # Parallel processing across multiple GPUs
            async with trio.open_nursery() as nursery:
                for i, img in enumerate(self.page_images):
                    chars = preprocess(i)
                    nursery.start_soon(
                        __img_ocr, i,
                        i % settings.PARALLEL_DEVICES,  # Round-robin GPU
                        img, chars,
                        self.parallel_limiter[i % settings.PARALLEL_DEVICES]
                    )
        else:
            # Sequential processing
            for i, img in enumerate(self.page_images):
                chars = preprocess(i)
                await __img_ocr(i, 0, img, chars, None)

    trio.run(__img_ocr_launcher)
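
The round-robin scheduling above (page `i` goes to device `i % PARALLEL_DEVICES`, with one task per device at a time) can be illustrated without any GPU. This is a rough sketch that swaps in the standard library's `asyncio.Semaphore` for `trio.CapacityLimiter`; `run_pages` and `ocr_page` are hypothetical names, and the sleep is a stand-in for the real OCR call.

```python
import asyncio


async def run_pages(num_pages, num_devices):
    """Assign pages to devices round-robin, one running task per device."""
    limiters = [asyncio.Semaphore(1) for _ in range(num_devices)]
    active = [0] * num_devices       # tasks currently running per device
    max_active = [0] * num_devices   # peak concurrency observed per device
    assignment = {}

    async def ocr_page(i):
        dev = i % num_devices        # round-robin device choice
        async with limiters[dev]:    # at most one page per device at a time
            active[dev] += 1
            max_active[dev] = max(max_active[dev], active[dev])
            await asyncio.sleep(0)   # stand-in for the OCR work
            assignment[i] = dev
            active[dev] -= 1

    await asyncio.gather(*(ocr_page(i) for i in range(num_pages)))
    return assignment, max_active


assignment, max_active = asyncio.run(run_pages(5, 2))
print(assignment, max_active)
```

Pages on different devices overlap freely, while pages mapped to the same device queue behind its limiter.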

2.4 OCR Core Function (__ocr)

def __ocr(self, pagenum, img, chars, ZM=3, device_id=None):
    """
    Core OCR function for a single page.

    Lines: 282-345
    """
    # ═══════════════════════════════════════════════════════════════════
    # STEP 2.4.1: TEXT DETECTION
    # ═══════════════════════════════════════════════════════════════════
    bxs = self.ocr.detect(np.array(img), device_id)  # Line 284
    # Returns: [(box_points, (text_hint, confidence)), ...]

    if not bxs:
        self.boxes.append([])
        return

    # ═══════════════════════════════════════════════════════════════════
    # STEP 2.4.2: CONVERT TO BOX FORMAT
    # ═══════════════════════════════════════════════════════════════════
    bxs = [(line[0], line[1][0]) for line in bxs]
    bxs = Recognizer.sort_Y_firstly([
        {
            "x0": b[0][0] / ZM,
            "x1": b[1][0] / ZM,
            "top": b[0][1] / ZM,
            "bottom": b[-1][1] / ZM,
            "text": "",
            "txt": t,
            "chars": [],
            "page_number": pagenum
        }
        for b, t in bxs
        if b[0][0] <= b[1][0] and b[0][1] <= b[-1][1]
    ], self.mean_height[pagenum - 1] / 3)

    # ═══════════════════════════════════════════════════════════════════
    # STEP 2.4.3: MERGE NATIVE PDF CHARS WITH OCR BOXES
    # ═══════════════════════════════════════════════════════════════════
    for c in chars:
        # Find overlapping OCR box
        ii = Recognizer.find_overlapped(c, bxs)
        if ii is None:
            self.lefted_chars.append(c)
            continue

        # Check height compatibility (within 70% tolerance)
        ch = c["bottom"] - c["top"]
        bh = bxs[ii]["bottom"] - bxs[ii]["top"]
        if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != " ":
            self.lefted_chars.append(c)
            continue

        # Add character to box
        bxs[ii]["chars"].append(c)

    # ═══════════════════════════════════════════════════════════════════
    # STEP 2.4.4: RECONSTRUCT TEXT FROM CHARS
    # ═══════════════════════════════════════════════════════════════════
    for b in bxs:
        if not b["chars"]:
            del b["chars"]
            continue

        # Sort chars by Y position, then concatenate
        m_ht = np.mean([c["height"] for c in b["chars"]])
        for c in Recognizer.sort_Y_firstly(b["chars"], m_ht):
            if c["text"] == " " and b["text"]:
                if re.match(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]", b["text"][-1]):
                    b["text"] += " "
            else:
                b["text"] += c["text"]
        del b["chars"]

    # ═══════════════════════════════════════════════════════════════════
    # STEP 2.4.5: OCR RECOGNITION FOR BOXES WITHOUT NATIVE TEXT
    # ═══════════════════════════════════════════════════════════════════
    boxes_to_reg = []
    img_np = np.array(img)
    for b in bxs:
        if not b["text"]:
            # Crop region for OCR
            left, right = b["x0"] * ZM, b["x1"] * ZM
            top, bott = b["top"] * ZM, b["bottom"] * ZM
            b["box_image"] = self.ocr.get_rotate_crop_image(
                img_np,
                np.array([[left, top], [right, top],
                         [right, bott], [left, bott]], dtype=np.float32)
            )
            boxes_to_reg.append(b)
        del b["txt"]

    # Batch recognition
    texts = self.ocr.recognize_batch(
        [b["box_image"] for b in boxes_to_reg],
        device_id
    )
    for i, b in enumerate(boxes_to_reg):
        b["text"] = texts[i]
        del b["box_image"]

    # Filter empty boxes
    bxs = [b for b in bxs if b["text"]]
    self.boxes.append(bxs)
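
The character-to-box matching in step 2.4.3 can be sketched on toy data. The code below is an illustration under stated assumptions: `find_overlapped` here is a simple largest-overlap search standing in for `Recognizer.find_overlapped` (whose actual implementation is not shown in this document), and `assign_chars` is a hypothetical helper that applies the same 0.7 height rule.

```python
def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return w * h if w > 0 and h > 0 else 0.0


def find_overlapped(char, boxes):
    """Index of the box overlapping `char` the most, or None (stand-in)."""
    best, best_area = None, 0.0
    for i, box in enumerate(boxes):
        area = overlap_area(char, box)
        if area > best_area:
            best, best_area = i, area
    return best


def assign_chars(chars, boxes, max_height_diff=0.7):
    """Attach each char to a box; return the chars that were left over."""
    leftover = []
    for c in chars:
        i = find_overlapped(c, boxes)
        if i is None:
            leftover.append(c)
            continue
        ch = c["bottom"] - c["top"]
        bh = boxes[i]["bottom"] - boxes[i]["top"]
        # Reject chars whose height differs too much from the box height.
        if abs(ch - bh) / max(ch, bh) >= max_height_diff and c["text"] != " ":
            leftover.append(c)
            continue
        boxes[i].setdefault("chars", []).append(c)
    return leftover


boxes = [{"x0": 10, "x1": 200, "top": 100, "bottom": 112}]
chars = [
    {"text": "A", "x0": 12, "x1": 18, "top": 101, "bottom": 111},   # fits
    {"text": "B", "x0": 20, "x1": 60, "top": 100, "bottom": 160},   # too tall
    {"text": "C", "x0": 300, "x1": 306, "top": 101, "bottom": 111}, # no box
]
leftover = assign_chars(chars, boxes)
assigned = [c["text"] for c in boxes[0]["chars"]]
print(assigned, [c["text"] for c in leftover])
```

The oversized "B" (a drop cap, for example) overlaps the box but is rejected by the height rule, so it ends up with the leftover characters just like the unmatched "C".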

2.5 Data Flow Diagram

PDF File
    │
    ├──────────────────────────────────────────────────────────────┐
    │                                                              │
    ▼                                                              ▼
┌─────────────────────────┐                          ┌─────────────────────────┐
│    pdfplumber.open()    │                          │     pdf.pages[i]        │
│                         │                          │    .to_image()          │
│  Extract text layer     │                          │                         │
│  (native characters)    │                          │  Resolution: 216 DPI    │
└───────────┬─────────────┘                          └───────────┬─────────────┘
            │                                                    │
            ▼                                                    ▼
    page_chars[]                                          page_images[]
    (Native PDF text)                                     (PIL Images)
            │                                                    │
            │                                                    │
            │                     ┌──────────────────────────────┘
            │                     │
            │                     ▼
            │         ┌─────────────────────────┐
            │         │   OCR Detection         │
            │         │   (DBNet)               │
            │         │                         │
            │         │   Input: page_image     │
            │         │   Output: bounding boxes│
            │         └───────────┬─────────────┘
            │                     │
            │                     ▼
            │         ┌─────────────────────────┐
            │         │   Box-Char Matching     │
            │         │                         │
            └────────▶│   Match native chars    │
                      │   to OCR boxes          │
                      │   (overlap detection)   │
                      └───────────┬─────────────┘
                                  │
                    ┌─────────────┴─────────────┐
                    │                           │
                    ▼                           ▼
            Boxes with text              Boxes without text
            (from native)                (need OCR recognition)
                    │                           │
                    │                           ▼
                    │               ┌─────────────────────────┐
                    │               │   OCR Recognition       │
                    │               │   (CRNN)                │
                    │               │                         │
                    │               │   Crop → Recognize      │
                    │               └───────────┬─────────────┘
                    │                           │
                    └─────────────┬─────────────┘
                                  │
                                  ▼
                          self.boxes[]
                          [{"x0", "x1", "top", "bottom", "text", "page_number"}, ...]

Step 3: Layout Recognition (_layouts_rec)

File: pdf_parser.py Lines: 347-353

Code Analysis

def _layouts_rec(self, ZM, drop=True):
    """
    Run layout recognition on all pages.

    Args:
        ZM: Zoom factor (default 3)
        drop: Whether to filter garbage layouts (headers, footers)
    """
    assert len(self.page_images) == len(self.boxes)

    # ═══════════════════════════════════════════════════════════════════
    # CALL LAYOUT RECOGNIZER
    # ═══════════════════════════════════════════════════════════════════
    # LayoutRecognizer.__call__() internally:
    # 1. Runs YOLOv10 on each page image
    # 2. Detects 10 layout types
    # 3. Associates OCR boxes with layouts
    # 4. Filters garbage if drop=True
    self.boxes, self.page_layout = self.layouter(
        self.page_images,  # List of page images
        self.boxes,        # List of OCR boxes per page (flattened after this)
        ZM,                # Zoom factor
        drop=drop          # Filter garbage
    )

    # ═══════════════════════════════════════════════════════════════════
    # ADD CUMULATIVE Y COORDINATES
    # ═══════════════════════════════════════════════════════════════════
    # After layouter, self.boxes is flattened (not per-page anymore)
    for i in range(len(self.boxes)):
        self.boxes[i]["top"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]
        self.boxes[i]["bottom"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]

Layout Types

# From layout_recognizer.py, lines 34-46
labels = [
    "_background_",     # 0: Ignored
    "Text",             # 1: Body text paragraphs
    "Title",            # 2: Section titles
    "Figure",           # 3: Images, charts, diagrams
    "Figure caption",   # 4: Text describing figures
    "Table",            # 5: Data tables
    "Table caption",    # 6: Text describing tables
    "Header",           # 7: Page headers
    "Footer",           # 8: Page footers
    "Reference",        # 9: Bibliography
    "Equation",         # 10: Mathematical formulas
]

Box Attributes After Layout Recognition

# Each box in self.boxes now has:
{
    "x0": float,           # Left edge
    "x1": float,           # Right edge
    "top": float,          # Top edge (cumulative)
    "bottom": float,       # Bottom edge (cumulative)
    "text": str,           # Recognized text
    "page_number": int,    # 1-indexed page number
    "layout_type": str,    # "text", "title", "table", "figure", etc.
    "layoutno": int,       # Layout region ID
}

Step 4: Table Structure Detection (_table_transformer_job)

File: pdf_parser.py Lines: 196-281

Code Analysis

def _table_transformer_job(self, ZM):
    """
    Detect table structure and tag boxes with R/C/H/SP attributes.
    """
    logging.debug("Table processing...")

    # ═══════════════════════════════════════════════════════════════════
    # STEP 4.1: EXTRACT TABLE REGIONS
    # ═══════════════════════════════════════════════════════════════════
    imgs, pos = [], []
    tbcnt = [0]
    MARGIN = 10
    self.tb_cpns = []

    for p, tbls in enumerate(self.page_layout):
        # Filter only table layouts
        tbls = [f for f in tbls if f["type"] == "table"]
        tbcnt.append(len(tbls))

        if not tbls:
            continue

        for tb in tbls:
            # Crop table region with margin
            left = tb["x0"] - MARGIN
            top = tb["top"] - MARGIN
            right = tb["x1"] + MARGIN
            bott = tb["bottom"] + MARGIN

            # Scale by zoom factor
            pos.append((left * ZM, top * ZM))
            imgs.append(self.page_images[p].crop((
                left * ZM, top * ZM,
                right * ZM, bott * ZM
            )))

    if not imgs:
        return

    # ═══════════════════════════════════════════════════════════════════
    # STEP 4.2: RUN TABLE STRUCTURE RECOGNIZER
    # ═══════════════════════════════════════════════════════════════════
    recos = self.tbl_det(imgs)  # Line 220
    # Returns per table: [{"label": "table row|column|header|spanning", "x0", "top", ...}, ...]

    # ═══════════════════════════════════════════════════════════════════
    # STEP 4.3: MAP COORDINATES BACK TO FULL PAGE
    # ═══════════════════════════════════════════════════════════════════
    tbcnt = np.cumsum(tbcnt)
    for i in range(len(tbcnt) - 1):  # For each page
        pg = []
        for j, tb_items in enumerate(recos[tbcnt[i]:tbcnt[i + 1]]):
            poss = pos[tbcnt[i]:tbcnt[i + 1]]
            for it in tb_items:
                # Add offset back
                it["x0"] += poss[j][0]
                it["x1"] += poss[j][0]
                it["top"] += poss[j][1]
                it["bottom"] += poss[j][1]

                # Scale back from zoom
                for n in ["x0", "x1", "top", "bottom"]:
                    it[n] /= ZM

                # Add cumulative height
                it["top"] += self.page_cum_height[i]
                it["bottom"] += self.page_cum_height[i]
                it["pn"] = i
                it["layoutno"] = j
                pg.append(it)
        self.tb_cpns.extend(pg)

    # ═══════════════════════════════════════════════════════════════════
    # STEP 4.4: GATHER COMPONENTS BY TYPE
    # ═══════════════════════════════════════════════════════════════════
    def gather(kwd, fzy=10, ption=0.6):
        eles = Recognizer.sort_Y_firstly(
            [r for r in self.tb_cpns if re.match(kwd, r["label"])],
            fzy
        )
        eles = Recognizer.layouts_cleanup(self.boxes, eles, 5, ption)
        return Recognizer.sort_Y_firstly(eles, 0)

    headers = gather(r".*header$")
    rows = gather(r".* (row|header)")
    spans = gather(r".*spanning")
    clmns = sorted(
        [r for r in self.tb_cpns if re.match(r"table column$", r["label"])],
        key=lambda x: (x["pn"], x["layoutno"], x["x0"])
    )
    clmns = Recognizer.layouts_cleanup(self.boxes, clmns, 5, 0.5)

    # ═══════════════════════════════════════════════════════════════════
    # STEP 4.5: TAG BOXES WITH TABLE ATTRIBUTES
    # ═══════════════════════════════════════════════════════════════════
    for b in self.boxes:
        if b.get("layout_type", "") != "table":
            continue

        # Find row (R)
        ii = Recognizer.find_overlapped_with_threshold(b, rows, thr=0.3)
        if ii is not None:
            b["R"] = ii
            b["R_top"] = rows[ii]["top"]
            b["R_bott"] = rows[ii]["bottom"]

        # Find header (H)
        ii = Recognizer.find_overlapped_with_threshold(b, headers, thr=0.3)
        if ii is not None:
            b["H"] = ii
            b["H_top"] = headers[ii]["top"]
            b["H_bott"] = headers[ii]["bottom"]
            b["H_left"] = headers[ii]["x0"]
            b["H_right"] = headers[ii]["x1"]

        # Find column (C)
        ii = Recognizer.find_horizontally_tightest_fit(b, clmns)
        if ii is not None:
            b["C"] = ii
            b["C_left"] = clmns[ii]["x0"]
            b["C_right"] = clmns[ii]["x1"]

        # Find spanning cell (SP). Note: its bounds are written to the H_* keys,
        # overwriting any header bounds set above.
        ii = Recognizer.find_overlapped_with_threshold(b, spans, thr=0.3)
        if ii is not None:
            b["SP"] = ii
            b["H_top"] = spans[ii]["top"]
            b["H_bott"] = spans[ii]["bottom"]
            b["H_left"] = spans[ii]["x0"]
            b["H_right"] = spans[ii]["x1"]

Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                    TABLE STRUCTURE DETECTION                                 │
└─────────────────────────────────────────────────────────────────────────────┘

page_layout[]  ───────────────────────────┐
(Table regions)                           │
                                          ▼
                              ┌─────────────────────────┐
                              │   Crop Table Regions    │
                              │   + MARGIN (10px)       │
                              └───────────┬─────────────┘
                                          │
                                          ▼
                              ┌─────────────────────────┐
                              │  TableStructureRec()    │
                              │  (YOLOv10)              │
                              │                         │
                              │  Detects:               │
                              │  • table row            │
                              │  • table column         │
                              │  • table column header  │
                              │  • table spanning cell  │
                              └───────────┬─────────────┘
                                          │
                                          ▼
                              ┌─────────────────────────┐
                              │  Tag OCR Boxes          │
                              │                         │
                              │  • R (row index)        │
                              │  • C (column index)     │
                              │  • H (header index)     │
                              │  • SP (spanning cell)   │
                              └─────────────────────────┘

After this step, table boxes have:
{
    "R": 0,           # Row index
    "R_top": 100,     # Row top boundary
    "R_bott": 150,    # Row bottom boundary
    "C": 1,           # Column index
    "C_left": 50,     # Column left boundary
    "C_right": 200,   # Column right boundary
    "H": 0,           # Header row index (if header)
    "SP": 2,          # Spanning cell index (if spanning)
}

Step 5: Column Detection (_assign_column)

File: pdf_parser.py Lines: 355-440

Algorithm Overview

K-Means Column Detection:

1. Group boxes by page
2. For each page:
   a. Extract X0 coordinates
   b. Normalize indented text (within 12% page width)
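   c. Try K from 1 to min(4, number of boxes on the page)
   d. Select K with the highest silhouette score (K=1 is scored as 0)
3. Take a majority vote over pages for a global column count (only logged in the code below)
4. Final clustering per page, using that page's selected K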
5. Remap cluster IDs to left-to-right order
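
The indent normalization (step 2b) and the majority vote (step 3) can be illustrated without scikit-learn. The following is a minimal sketch, assuming plain Python lists of coordinates; `snap_indents` and `majority_column_count` are hypothetical helper names, not functions from pdf_parser.py.

```python
from collections import Counter


def snap_indents(x0s, x1s, tol_ratio=0.12):
    """Snap left edges that sit close to the leftmost edge onto it, so
    indented first lines are not mistaken for a separate column."""
    min_x0 = min(x0s)
    width = max(x1s) - min_x0          # text width of the page, as in the listing
    tol = width * tol_ratio            # 12% of that width by default
    return [min_x0 if abs(x - min_x0) < tol else x for x in x0s]


def majority_column_count(page_cols):
    """Return the most common per-page column count."""
    return Counter(page_cols.values()).most_common(1)[0][0]


# Two indented lines (x0=70, x0=52) collapse onto the left margin at x0=50.
x0s = [50, 70, 52, 300, 320]
x1s = [280, 280, 280, 550, 550]
snapped = snap_indents(x0s, x1s)
print(snapped)
print(majority_column_count({1: 2, 2: 2, 3: 1}))
```

Snapping before clustering keeps a paragraph indent from pulling K-Means toward an extra cluster on single-column pages.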

Code Analysis

def _assign_column(self, boxes, zoomin=3):
    """
    Detect number of columns using K-Means clustering.
    """
    if not boxes:
        return boxes
    if all("col_id" in b for b in boxes):
        return boxes

    # ═══════════════════════════════════════════════════════════════════
    # GROUP BOXES BY PAGE
    # ═══════════════════════════════════════════════════════════════════
    by_page = defaultdict(list)
    for b in boxes:
        by_page[b["page_number"]].append(b)

    page_cols = {}

    # ═══════════════════════════════════════════════════════════════════
    # FOR EACH PAGE: FIND OPTIMAL K
    # ═══════════════════════════════════════════════════════════════════
    for pg, bxs in by_page.items():
        if not bxs:
            page_cols[pg] = 1
            continue

        x0s_raw = np.array([b["x0"] for b in bxs], dtype=float)

        # Calculate page width
        min_x0 = np.min(x0s_raw)
        max_x1 = np.max([b["x1"] for b in bxs])
        width = max_x1 - min_x0

        # ═══════════════════════════════════════════════════════════════
        # INDENT TOLERANCE: Normalize near-left-edge text
        # ═══════════════════════════════════════════════════════════════
        INDENT_TOL = width * 0.12  # 12% of page width
        x0s = []
        for x in x0s_raw:
            if abs(x - min_x0) < INDENT_TOL:
                x0s.append([min_x0])  # Snap to left edge
            else:
                x0s.append([x])
        x0s = np.array(x0s, dtype=float)

        # ═══════════════════════════════════════════════════════════════
        # TRY K FROM 1 TO 4
        # ═══════════════════════════════════════════════════════════════
        max_try = min(4, len(bxs))
        if max_try < 2:
            max_try = 1

        best_k = 1
        best_score = -1

        for k in range(1, max_try + 1):
            km = KMeans(n_clusters=k, n_init="auto")
            labels = km.fit_predict(x0s)

            centers = np.sort(km.cluster_centers_.flatten())
            if len(centers) > 1:
                try:
                    score = silhouette_score(x0s, labels)
                except ValueError:
                    continue
            else:
                score = 0

            if score > best_score:
                best_score = score
                best_k = k

        page_cols[pg] = best_k
        logging.info(f"[Page {pg}] best_score={best_score:.2f}, best_k={best_k}")

    # ═══════════════════════════════════════════════════════════════════
    # MAJORITY VOTING FOR GLOBAL COLUMN COUNT
    # ═══════════════════════════════════════════════════════════════════
    global_cols = Counter(page_cols.values()).most_common(1)[0][0]
    logging.info(f"Global column_num by majority: {global_cols}")

    # ═══════════════════════════════════════════════════════════════════
    # FINAL CLUSTERING WITH SELECTED K
    # ═══════════════════════════════════════════════════════════════════
    for pg, bxs in by_page.items():
        if not bxs:
            continue

        k = page_cols[pg]
        if len(bxs) < k:
            k = 1

        x0s = np.array([[b["x0"]] for b in bxs], dtype=float)
        km = KMeans(n_clusters=k, n_init="auto")
        labels = km.fit_predict(x0s)

        # ═══════════════════════════════════════════════════════════════
        # REMAP CLUSTER IDS: Left-to-right order
        # ═══════════════════════════════════════════════════════════════
        centers = km.cluster_centers_.flatten()
        order = np.argsort(centers)
        remap = {orig: new for new, orig in enumerate(order)}

        for b, lb in zip(bxs, labels):
            b["col_id"] = remap[lb]

    return boxes

Visualization

Single column (k=1):                    Two columns (k=2):
┌────────────────────────────┐          ┌─────────────┬─────────────┐
│ Text text text text        │          │ Col 0       │ Col 1       │
│ text text text text        │          │ Text text   │ Text text   │
│ text text text text        │          │ text text   │ text text   │
│ text text text text        │          │ text text   │ text text   │
└────────────────────────────┘          └─────────────┴─────────────┘
       col_id = 0                        col_id = 0    col_id = 1

X coordinates:                          X coordinates:
[50, 52, 48, 51, ...]                   [50, 52, 300, 302, 49, 301, ...]
     ↓ K-Means                               ↓ K-Means
  k=1, all → 0                           k=2, cluster 0 → 0, cluster 1 → 1
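
K-Means assigns cluster labels in arbitrary order, so the label of the leftmost column is not guaranteed to be 0. The remapping step can be sketched in plain Python as follows; `remap_left_to_right` is a hypothetical name and the centers and labels are made-up sample values.

```python
def remap_left_to_right(centers, labels):
    """Renumber cluster labels so that the cluster with the smallest
    center becomes column 0, the next one column 1, and so on."""
    order = sorted(range(len(centers)), key=lambda i: centers[i])
    remap = {orig: new for new, orig in enumerate(order)}
    return [remap[lb] for lb in labels]


# Cluster 0 happens to be the right-hand column (center near x=301).
centers = [301.0, 50.0]
labels = [1, 1, 0, 0, 1, 0]
col_ids = remap_left_to_right(centers, labels)
print(col_ids)
```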

Step 6: Text Merge (Horizontal) (_text_merge)

File: pdf_parser.py Lines: 442-478

Algorithm

Horizontal Merge Conditions:
1. Same page
2. Same column (col_id)
3. Same layout (layoutno)
4. Not table/figure/equation
5. Y distance < mean_height / 3
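
The five conditions can be read as a single predicate over two neighbouring boxes. The sketch below is an illustration only: `can_merge_horizontally` is a hypothetical helper, and `y_center_distance` is an assumed stand-in for `_y_dis` (its real definition is not shown in this section).

```python
SPECIAL_TYPES = {"table", "figure", "equation"}


def y_center_distance(a, b):
    """Assumed stand-in for _y_dis: distance between vertical centers."""
    return abs((a["top"] + a["bottom"]) / 2 - (b["top"] + b["bottom"]) / 2)


def can_merge_horizontally(box, nxt, mean_height):
    """True when `nxt` may be appended to `box` on the same text line."""
    if box["page_number"] != nxt["page_number"]:
        return False
    if box.get("col_id") != nxt.get("col_id"):
        return False
    # Mismatched defaults: two boxes that both lack layoutno never merge.
    if box.get("layoutno", "0") != nxt.get("layoutno", "1"):
        return False
    if box.get("layout_type", "") in SPECIAL_TYPES:
        return False
    return y_center_distance(box, nxt) < mean_height / 3


base = {"page_number": 1, "col_id": 0, "layoutno": "text-0",
        "layout_type": "text", "top": 100, "bottom": 112}
same_line = dict(base, top=101, bottom=113)
next_line = dict(base, top=130, bottom=142)
print(can_merge_horizontally(base, same_line, mean_height=12))
print(can_merge_horizontally(base, next_line, mean_height=12))
```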

Code Analysis

def _text_merge(self, zoomin=3):
    """
    Merge horizontally adjacent boxes with same layout.
    """
    bxs = self._assign_column(self.boxes, zoomin)  # Ensure col_id assigned

    # Helper functions
    def end_with(b, txt):
        txt = txt.strip()
        tt = b.get("text", "").strip()
        return tt and tt.find(txt) == len(tt) - len(txt)

    def start_with(b, txts):
        tt = b.get("text", "").strip()
        return tt and any([tt.find(t.strip()) == 0 for t in txts])

    # ═══════════════════════════════════════════════════════════════════
    # HORIZONTAL MERGE LOOP
    # ═══════════════════════════════════════════════════════════════════
    i = 0
    while i < len(bxs) - 1:
        b = bxs[i]
        b_ = bxs[i + 1]

        # Skip if different page or column
        if b["page_number"] != b_["page_number"]:
            i += 1
            continue
        if b.get("col_id") != b_.get("col_id"):
            i += 1
            continue

        # Skip if different layout or special type.
        # The mismatched defaults ("0" vs "1") keep two boxes that both lack layoutno from merging.
        if b.get("layoutno", "0") != b_.get("layoutno", "1"):
            i += 1
            continue
        if b.get("layout_type", "") in ["table", "figure", "equation"]:
            i += 1
            continue

        # Check Y distance
        y_dis = abs(self._y_dis(b, b_))
        threshold = self.mean_height[bxs[i]["page_number"] - 1] / 3

        if y_dis < threshold:
            # ═══════════════════════════════════════════════════════════
            # MERGE: Expand box to include next
            # ═══════════════════════════════════════════════════════════
            bxs[i]["x1"] = b_["x1"]                    # Extend right edge
            bxs[i]["top"] = (b["top"] + b_["top"]) / 2      # Average top
            bxs[i]["bottom"] = (b["bottom"] + b_["bottom"]) / 2  # Average bottom
            bxs[i]["text"] += b_["text"]              # Concatenate text
            bxs.pop(i + 1)                             # Remove merged box
            continue  # Check if can merge more

        i += 1

    self.boxes = bxs

Visualization

Before horizontal merge:
┌──────┐ ┌──────┐ ┌──────┐
│Hello │ │World │ │!     │  (same line, same layout)
└──────┘ └──────┘ └──────┘

After horizontal merge (texts are concatenated as-is; no separator is inserted):
┌────────────────────────┐
│Hello World!            │
└────────────────────────┘

Step 7: Text Merge (Vertical) (_naive_vertical_merge)

File: pdf_parser.py Lines: 480-556

Algorithm

Vertical Merge Conditions:
1. Same page and column
2. Same layout (layoutno)
3. Vertical gap (next top - current bottom) <= 1.5 * mean_height
4. Horizontal overlap >= 30% of the narrower box's width
5. Semantic checks (punctuation, text patterns)
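
Conditions 3 and 4 are purely geometric. A minimal sketch of both checks, assuming boxes are plain dicts with `x0`, `x1`, `top` and `bottom` keys; the function names are hypothetical.

```python
def horizontal_overlap_ratio(a, b):
    """Overlap of the two x-ranges, relative to the narrower box."""
    overlap = max(0, min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]))
    min_width = min(a["x1"] - a["x0"], b["x1"] - b["x0"])
    return overlap / max(1, min_width)   # max(1, ...) guards zero-width boxes


def vertically_close(upper, lower, mean_height):
    """True when the gap between the boxes is at most 1.5 line heights."""
    return lower["top"] - upper["bottom"] <= mean_height * 1.5


line1 = {"x0": 50, "x1": 250, "top": 100, "bottom": 112}
line2 = {"x0": 60, "x1": 200, "top": 115, "bottom": 127}
aside = {"x0": 240, "x1": 440, "top": 115, "bottom": 127}

print(horizontal_overlap_ratio(line1, line2))
print(horizontal_overlap_ratio(line1, aside))
print(vertically_close(line1, line2, mean_height=12))
```

Dividing by the narrower width means a short last line of a paragraph still counts as fully overlapping the wider line above it.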

Code Analysis

def _naive_vertical_merge(self, zoomin=3):
    """
    Merge vertically adjacent boxes within same layout.
    """
    bxs = self._assign_column(self.boxes, zoomin)

    # ═══════════════════════════════════════════════════════════════════
    # GROUP BY PAGE AND COLUMN
    # ═══════════════════════════════════════════════════════════════════
    grouped = defaultdict(list)
    for b in bxs:
        grouped[(b["page_number"], b.get("col_id", 0))].append(b)

    merged_boxes = []

    for (pg, col), bxs in grouped.items():
        # Sort by top-to-bottom, left-to-right
        bxs = sorted(bxs, key=lambda x: (x["top"], x["x0"]))
        if not bxs:
            continue

        mh = self.mean_height[pg - 1] if self.mean_height else 10

        i = 0
        while i + 1 < len(bxs):
            b = bxs[i]
            b_ = bxs[i + 1]

            # ═══════════════════════════════════════════════════════════
            # SKIP CONDITIONS
            # ═══════════════════════════════════════════════════════════

            # Remove page numbers at page boundaries
            if b["page_number"] < b_["page_number"]:
                if re.match(r"[0-9  •一—-]+$", b["text"]):
                    bxs.pop(i)
                    continue

            # Skip empty text
            if not b["text"].strip():
                bxs.pop(i)
                continue

            # Skip different layouts
            if b.get("layoutno") != b_.get("layoutno"):
                i += 1
                continue

            # Skip if too far apart vertically
            if b_["top"] - b["bottom"] > mh * 1.5:
                i += 1
                continue

            # ═══════════════════════════════════════════════════════════
            # CHECK HORIZONTAL OVERLAP
            # ═══════════════════════════════════════════════════════════
            overlap = max(0, min(b["x1"], b_["x1"]) - max(b["x0"], b_["x0"]))
            min_width = min(b["x1"] - b["x0"], b_["x1"] - b_["x0"])
            if overlap / max(1, min_width) < 0.3:
                i += 1
                continue

            # ═══════════════════════════════════════════════════════════
            # SEMANTIC ANALYSIS
            # ═══════════════════════════════════════════════════════════
            # Features favoring concatenation
            concatting_feats = [
                b["text"].strip()[-1] in ",;:'\",、‘“-",       # Ends with continuation punct
                len(b["text"].strip()) > 1 and
                    b["text"].strip()[-2] in ",;:'\"‘“、;:",
                b_["text"].strip() and
                    b_["text"].strip()[0] in "。;?!?”),,、:", # Starts with ending punct
            ]

            # Features preventing concatenation
            feats = [
                b.get("layoutno", 0) != b_.get("layoutno", 0),    # Different layout
                b["text"].strip()[-1] in "。?!?",               # Sentence end
                self.is_english and b["text"].strip()[-1] in ".!?",
                b["page_number"] == b_["page_number"] and
                    b_["top"] - b["bottom"] > mh * 1.5,           # Too far
                b["page_number"] < b_["page_number"] and
                    abs(b["x0"] - b_["x0"]) > self.mean_width[b["page_number"] - 1] * 4,
            ]

            # Features for definite split
            detach_feats = [
                b["x1"] < b_["x0"],  # No horizontal overlap at all
                b["x0"] > b_["x1"],
            ]

            # ═══════════════════════════════════════════════════════════
            # DECISION
            # ═══════════════════════════════════════════════════════════
            if (any(feats) and not any(concatting_feats)) or any(detach_feats):
                i += 1
                continue

            # ═══════════════════════════════════════════════════════════
            # MERGE
            # ═══════════════════════════════════════════════════════════
            b["text"] = (b["text"].rstrip() + " " + b_["text"].lstrip()).strip()
            b["bottom"] = b_["bottom"]
            b["x0"] = min(b["x0"], b_["x0"])
            b["x1"] = max(b["x1"], b_["x1"])
            bxs.pop(i + 1)

        merged_boxes.extend(bxs)

    self.boxes = sorted(merged_boxes, key=lambda x: (x["page_number"], x.get("col_id", 0), x["top"]))

Visualization

Before vertical merge:
┌────────────────────────┐
│This is paragraph one   │
└────────────────────────┘
┌────────────────────────┐
│that continues here and │
└────────────────────────┘
┌────────────────────────┐
│ends with this line.    │
└────────────────────────┘

After vertical merge:
┌────────────────────────┐
│This is paragraph one   │
│that continues here and │
│ends with this line.    │
└────────────────────────┘
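
The merge decision combines three feature lists: blocking features stop a merge unless a continuation feature overrides them, while detach features always stop it. A small sketch of that rule in isolation, with a hypothetical function name and hand-picked feature values:

```python
def should_keep_separate(feats, concatting_feats, detach_feats):
    """Mirror of the decision rule: split when something blocks the merge
    and nothing argues for continuation, or when the boxes are detached."""
    return (any(feats) and not any(concatting_feats)) or any(detach_feats)


# The upper line ends a sentence and nothing suggests continuation: keep separate.
print(should_keep_separate([True], [False], [False]))
# A continuation feature overrides the blocking feature: merge.
print(should_keep_separate([True], [True], [False]))
# No horizontal overlap at all: keep separate regardless of punctuation.
print(should_keep_separate([False], [True], [True]))
```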

Step 8: Filter & Cleanup

8.1 _filter_forpages

Lines: 685-729

def _filter_forpages(self):
    """
    Remove table of contents and dirty pages.
    """
    if not self.boxes:
        return

    # ═══════════════════════════════════════════════════════════════════
    # DETECT AND REMOVE TABLE OF CONTENTS
    # ═══════════════════════════════════════════════════════════════════
    findit = False
    i = 0
    while i < len(self.boxes):
        # Check for TOC headers
        text_lower = re.sub(r"( | |\u3000)+", "", self.boxes[i]["text"].lower())
        if not re.match(r"(contents|目录|目次|table of contents|致谢|acknowledge)$", text_lower):
            i += 1
            continue

        findit = True
        eng = re.match(r"[0-9a-zA-Z :'.-]{5,}", self.boxes[i]["text"].strip())
        self.boxes.pop(i)  # Remove TOC header

        if i >= len(self.boxes):
            break

        # Get prefix of first TOC entry
        prefix = self.boxes[i]["text"].strip()[:3] if not eng else \
                 " ".join(self.boxes[i]["text"].strip().split()[:2])

        # Remove empty entries
        while not prefix:
            self.boxes.pop(i)
            if i >= len(self.boxes):
                break
            prefix = self.boxes[i]["text"].strip()[:3] if not eng else \
                     " ".join(self.boxes[i]["text"].strip().split()[:2])

        self.boxes.pop(i)
        if i >= len(self.boxes) or not prefix:
            break

        # Remove entries matching TOC pattern
        for j in range(i, min(i + 128, len(self.boxes))):
            if not re.match(prefix, self.boxes[j]["text"]):
                continue
            for k in range(i, j):
                self.boxes.pop(i)
            break

    if findit:
        return

    # ═══════════════════════════════════════════════════════════════════
    # DETECT AND REMOVE DIRTY PAGES
    # ═══════════════════════════════════════════════════════════════════
    page_dirty = [0] * len(self.page_images)
    for b in self.boxes:
        # Count repetitive patterns (common in scanned TOC)
        if re.search(r"(··|··|··)", b["text"]):
            page_dirty[b["page_number"] - 1] += 1

    # Pages with >3 repetitive patterns are dirty
    page_dirty = set([i + 1 for i, t in enumerate(page_dirty) if t > 3])

    if not page_dirty:
        return

    # Remove all boxes from dirty pages
    i = 0
    while i < len(self.boxes):
        if self.boxes[i]["page_number"] in page_dirty:
            self.boxes.pop(i)
            continue
        i += 1
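
The dirty-page pass can be sketched as two small functions over a list of box dicts. This is an illustration under assumptions: the helper names are hypothetical, and the pattern only covers the doubled middle dot visible in the listing above.

```python
import re

# Assumption: only the doubled middle dot from the listing is matched here.
LEADER = re.compile(r"··")


def find_dirty_pages(boxes, page_count, threshold=3):
    """Return 1-based page numbers with more than `threshold` leader hits."""
    counts = [0] * page_count
    for b in boxes:
        if LEADER.search(b["text"]):
            counts[b["page_number"] - 1] += 1
    return {i + 1 for i, n in enumerate(counts) if n > threshold}


def drop_pages(boxes, pages):
    """Remove every box that sits on one of the given pages."""
    return [b for b in boxes if b["page_number"] not in pages]


boxes = [{"page_number": 1, "text": "Introduction to the topic"}]
boxes += [{"page_number": 2, "text": f"Chapter {n} ····· {n * 10}"} for n in range(1, 5)]

dirty = find_dirty_pages(boxes, page_count=2)
kept = drop_pages(boxes, dirty)
print(dirty, len(kept))
```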

Step 9: Extract Tables & Figures (_extract_table_figure)

File: pdf_parser.py Lines: 757-930

Code Analysis

def _extract_table_figure(self, need_image, ZM, return_html, need_position,
                          separate_tables_figures=False):
    """
    Extract tables and figures from detected layouts.
    """
    tables = {}
    figures = {}

    # ═══════════════════════════════════════════════════════════════════
    # STEP 9.1: SEPARATE TABLE AND FIGURE BOXES
    # ═══════════════════════════════════════════════════════════════════
    i = 0
    lst_lout_no = ""
    nomerge_lout_no = []

    while i < len(self.boxes):
        if "layoutno" not in self.boxes[i]:
            i += 1
            continue

        lout_no = f"{self.boxes[i]['page_number']}-{self.boxes[i]['layoutno']}"

        # Mark captions as non-mergeable
        if (TableStructureRecognizer.is_caption(self.boxes[i]) or
            self.boxes[i]["layout_type"] in ["table caption", "title",
                                             "figure caption", "reference"]):
            nomerge_lout_no.append(lst_lout_no)

        # ═══════════════════════════════════════════════════════════════
        # EXTRACT TABLE BOXES
        # ═══════════════════════════════════════════════════════════════
        if self.boxes[i]["layout_type"] == "table":
            # Skip source citations
            if re.match(r"(数据|资料|图表)*来源[: ]", self.boxes[i]["text"]):
                self.boxes.pop(i)
                continue

            if lout_no not in tables:
                tables[lout_no] = []
            tables[lout_no].append(self.boxes[i])
            self.boxes.pop(i)
            lst_lout_no = lout_no
            continue

        # ═══════════════════════════════════════════════════════════════
        # EXTRACT FIGURE BOXES
        # ═══════════════════════════════════════════════════════════════
        if need_image and self.boxes[i]["layout_type"] == "figure":
            if re.match(r"(数据|资料|图表)*来源[: ]", self.boxes[i]["text"]):
                self.boxes.pop(i)
                continue

            if lout_no not in figures:
                figures[lout_no] = []
            figures[lout_no].append(self.boxes[i])
            self.boxes.pop(i)
            lst_lout_no = lout_no
            continue

        i += 1

    # ═══════════════════════════════════════════════════════════════════
    # STEP 9.2: MERGE CROSS-PAGE TABLES
    # ═══════════════════════════════════════════════════════════════════
    nomerge_lout_no = set(nomerge_lout_no)
    tbls = sorted([(k, bxs) for k, bxs in tables.items()],
                  key=lambda x: (x[1][0]["top"], x[1][0]["x0"]))

    i = len(tbls) - 1
    while i - 1 >= 0:
        k0, bxs0 = tbls[i - 1]
        k, bxs = tbls[i]
        i -= 1

        if k0 in nomerge_lout_no:
            continue
        if bxs[0]["page_number"] == bxs0[0]["page_number"]:
            continue
        if bxs[0]["page_number"] - bxs0[0]["page_number"] > 1:
            continue

        mh = self.mean_height[bxs[0]["page_number"] - 1]
        if self._y_dis(bxs0[-1], bxs[0]) > mh * 23:
            continue

        # Merge tables
        tables[k0].extend(tables[k])
        del tables[k]

    # ═══════════════════════════════════════════════════════════════════
    # STEP 9.3: ASSOCIATE CAPTIONS WITH TABLES/FIGURES
    # ═══════════════════════════════════════════════════════════════════
    i = 0
    while i < len(self.boxes):
        c = self.boxes[i]
        if not TableStructureRecognizer.is_caption(c):
            i += 1
            continue

        # Find nearest table/figure
        def nearest(tbls):
            mink, minv = "", float('inf')
            for k, bxs in tbls.items():
                for b in bxs:
                    if b.get("layout_type", "").find("caption") >= 0:
                        continue
                    y_dis = self._y_dis(c, b)
                    x_dis = self._x_dis(c, b) if not x_overlapped(c, b) else 0
                    dis = y_dis**2 + x_dis**2
                    if dis < minv:
                        mink, minv = k, dis
            return mink, minv

        tk, tv = nearest(tables)
        fk, fv = nearest(figures)

        if tv < fv and tk:
            tables[tk].insert(0, c)
        elif fk:
            figures[fk].insert(0, c)
        self.boxes.pop(i)

    # ═══════════════════════════════════════════════════════════════════
    # STEP 9.4: CONSTRUCT TABLE OUTPUT
    # ═══════════════════════════════════════════════════════════════════
    res = []
    for k, bxs in tables.items():
        if not bxs:
            continue

        bxs = Recognizer.sort_Y_firstly(bxs, np.mean([(b["bottom"] - b["top"]) / 2 for b in bxs]))
        poss = []

        # Crop table image
        img = cropout(bxs, "table", poss)

        # Construct table content (HTML or descriptive)
        content = self.tbl_det.construct_table(
            bxs,
            html=return_html,
            is_english=self.is_english
        )

        res.append((img, content))

    return res

Table Construction Flow

Table boxes with R/C/H/SP attributes
                │
                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  TableStructureRecognizer.construct_table()                                  │
│                                                                              │
│  1. Sort by row (R attribute)                                               │
│  2. Group into rows                                                         │
│  3. Sort each row by column (C attribute)                                   │
│  4. Build 2D table matrix                                                   │
│  5. Handle spanning cells (SP attribute)                                    │
│  6. Generate output format                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                │
        ┌───────┴───────┐
        │               │
        ▼               ▼
    HTML Output     Descriptive Output

HTML:
<table>
  <caption>Table 1: Data</caption>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>Item 1</td><td>100</td></tr>
</table>

Descriptive:
Name: Item 1; Value: 100
Name: Item 2; Value: 200
(from "Table 1: Data")
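
To make the flow concrete, here is a simplified sketch that turns boxes tagged with `R` and `C` into a grid and then into the descriptive format. It is not the `construct_table()` implementation: the helper names are hypothetical, the first row is assumed to be the header (the `H` tag is ignored), and spanning cells (`SP`) are not handled.

```python
from collections import defaultdict


def build_grid(boxes):
    """Group boxes by row index R, then place each text at column index C."""
    rows = defaultdict(dict)
    for b in boxes:
        cell = rows[b["R"]]
        cell[b["C"]] = (cell.get(b["C"], "") + " " + b["text"]).strip()
    n_cols = max(c for r in rows.values() for c in r) + 1
    return [[rows[r].get(c, "") for c in range(n_cols)] for r in sorted(rows)]


def describe(grid, caption=""):
    """Render each body row as 'header: value' pairs (first row = header)."""
    header, *body = grid
    lines = ["; ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in body]
    if caption:
        lines.append(f'(from "{caption}")')
    return lines


boxes = [
    {"R": 0, "C": 0, "text": "Name"}, {"R": 0, "C": 1, "text": "Value"},
    {"R": 1, "C": 0, "text": "Item 1"}, {"R": 1, "C": 1, "text": "100"},
    {"R": 2, "C": 0, "text": "Item 2"}, {"R": 2, "C": 1, "text": "200"},
]
grid = build_grid(boxes)
lines = describe(grid, caption="Table 1: Data")
print("\n".join(lines))
```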

Step 10: Final Output (__filterout_scraps)

File: pdf_parser.py Lines: 971-1029

Code Analysis

def __filterout_scraps(self, boxes, ZM):
    """
    Filter low-quality text blocks and format final output.
    """
    def width(b):
        return b["x1"] - b["x0"]

    def height(b):
        return b["bottom"] - b["top"]

    def usefull(b):
        """Check if box is useful."""
        if b.get("layout_type"):
            return True
        # Width > 1/3 page width
        if width(b) > self.page_images[b["page_number"] - 1].size[0] / ZM / 3:
            return True
        # Height > mean character height
        if height(b) > self.mean_height[b["page_number"] - 1]:
            return True
        return False

    res = []

    while boxes:
        lines = []
        widths = []
        pw = self.page_images[boxes[0]["page_number"] - 1].size[0] / ZM
        mh = self.mean_height[boxes[0]["page_number"] - 1]
        mj = self.proj_match(boxes[0]["text"]) or \
             boxes[0].get("layout_type", "") == "title"

        # ═══════════════════════════════════════════════════════════════
        # DFS TO FIND CONNECTED LINES
        # ═══════════════════════════════════════════════════════════════
        def dfs(line, st):
            nonlocal mh, pw, lines, widths
            lines.append(line)
            widths.append(width(line))
            mmj = self.proj_match(line["text"]) or \
                  line.get("layout_type", "") == "title"

            for i in range(st + 1, min(st + 20, len(boxes))):
                # Stop at page boundary
                if boxes[i]["page_number"] - line["page_number"] > 0:
                    break

                # Stop if too far vertically
                if not mmj and self._y_dis(line, boxes[i]) >= 3 * mh and \
                   height(line) < 1.5 * mh:
                    break

                if not usefull(boxes[i]):
                    continue

                # Check horizontal proximity
                if mmj or (self._x_dis(boxes[i], line) < pw / 10):
                    dfs(boxes[i], i)
                    boxes.pop(i)
                    break

        try:
            if usefull(boxes[0]):
                dfs(boxes[0], 0)
            else:
                logging.debug("WASTE: " + boxes[0]["text"])
        except Exception:  # best-effort: a failure on one block must not abort the pass
            pass

        boxes.pop(0)

        # ═══════════════════════════════════════════════════════════════
        # FILTER AND FORMAT OUTPUT
        # ═══════════════════════════════════════════════════════════════
        mw = np.mean(widths)
        if mj or mw / pw >= 0.35 or mw > 200:
            # Add position tags to each line
            result = "\n".join([
                c["text"] + self._line_tag(c, ZM)
                for c in lines
            ])
            res.append(result)
        else:
            logging.debug("REMOVED: " + "<<".join([c["text"] for c in lines]))

    return "\n\n".join(res)
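
The acceptance test applied to each group of connected lines can be isolated as below. A minimal sketch, assuming widths are already in page units; `keep_block` is a hypothetical name, and rejecting an empty group is this sketch's own choice rather than behaviour taken from the listing.

```python
from statistics import mean


def keep_block(line_widths, page_width, is_title_like=False):
    """Keep a block when it looks like a title/heading, or when its lines
    are wide enough on average (>= 35% of the page, or > 200 units)."""
    if not line_widths:
        return False                     # sketch-only guard for empty groups
    mw = mean(line_widths)
    return is_title_like or mw / page_width >= 0.35 or mw > 200


print(keep_block([300, 280], page_width=600))        # wide body text
print(keep_block([40, 60], page_width=600))          # narrow scraps
print(keep_block([40], page_width=600, is_title_like=True))
```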

Position Tag Format

def _line_tag(self, bx, ZM):
    """
    Generate position tag for a text box.

    Format: @@{page_numbers}\t{x0}\t{x1}\t{top}\t{bottom}##

    Example: @@1-2\t50.0\t450.0\t100.0\t120.0##
    (Text spans pages 1-2, coordinates in original scale)
    """
    pn = [bx["page_number"]]
    top = bx["top"] - self.page_cum_height[pn[0] - 1]
    bott = bx["bottom"] - self.page_cum_height[pn[0] - 1]

    # Handle multi-page spanning
    while bott * ZM > self.page_images[pn[-1] - 1].size[1]:
        bott -= self.page_images[pn[-1] - 1].size[1] / ZM
        pn.append(pn[-1] + 1)

    return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format(
        "-".join([str(p) for p in pn]),
        bx["x0"], bx["x1"], top, bott
    )
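
Consumers of the output usually need to turn a tag back into numbers. The following sketch parses the format documented in the docstring above; `parse_line_tag` and its returned dict layout are illustrative assumptions, not part of pdf_parser.py.

```python
import re

# One page list ("1" or "1-2") followed by four tab-separated floats.
TAG = re.compile(r"@@([0-9-]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)##")


def parse_line_tag(tag):
    """Parse '@@pages\\tx0\\tx1\\ttop\\tbottom##' into a dict."""
    m = TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a position tag: {tag!r}")
    pages = [int(p) for p in m.group(1).split("-")]
    x0, x1, top, bottom = (float(v) for v in m.group(2, 3, 4, 5))
    return {"pages": pages, "x0": x0, "x1": x1, "top": top, "bottom": bottom}


pos = parse_line_tag("@@1-2\t50.0\t450.0\t100.0\t120.0##")
print(pos)
```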

Final Output Format

# Return value of __call__:
(
    # documents: str (paragraphs separated by \n\n)
    "Paragraph 1 text@@1\t50.0\t450.0\t100.0\t150.0##\n\n"
    "Paragraph 2 text@@1\t50.0\t450.0\t200.0\t250.0##\n\n"
    "...",

    # tables: List[Tuple[PIL.Image, str|List[str]]]
    [
        (table_image_1, "<table>...</table>"),
        (table_image_2, ["desc line 1", "desc line 2"]),
    ]
)
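
When only the text is needed, the documents string can be split into paragraphs and the position tags stripped. A minimal sketch, assuming the tag format shown earlier; the helper names and the regular expression are this example's own.

```python
import re

# Matches '@@' + page list + four tab-separated numbers + '##'.
TAG = re.compile(r"@@[0-9-]+(?:\t[0-9.]+){4}##")


def split_paragraphs(documents):
    """Split the documents string on blank lines, dropping empty parts."""
    return [p for p in documents.split("\n\n") if p]


def strip_tags(text):
    """Remove every position tag from a piece of text."""
    return TAG.sub("", text)


documents = (
    "Paragraph 1 text@@1\t50.0\t450.0\t100.0\t150.0##\n\n"
    "Paragraph 2 text@@1\t50.0\t450.0\t200.0\t250.0##"
)
plain = [strip_tags(p) for p in split_paragraphs(documents)]
print(plain)
```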

Summary

Complete Pipeline Summary

| Step | Method | Lines | Input | Output |
|------|--------|-------|-------|--------|
| 1 | `__init__` | 52-105 | - | Models loaded |
| 2 | `__images__` | 1042-1159 | PDF file | boxes[], page_images[] |
| 3 | `_layouts_rec` | 347-353 | page_images, boxes | boxes with layout_type |
| 4 | `_table_transformer_job` | 196-281 | page_images, boxes | boxes with R/C/H/SP |
| 5 | `_assign_column` | 355-440 | boxes | boxes with col_id |
| 6 | `_text_merge` | 442-478 | boxes | merged boxes (horizontal) |
| 7 | `_naive_vertical_merge` | 480-556 | boxes | merged boxes (vertical) |
| 8 | `_filter_forpages` | 685-729 | boxes | cleaned boxes |
| 9 | `_extract_table_figure` | 757-930 | boxes | tables[], figures[] |
| 10 | `__filterout_scraps` | 971-1029 | boxes | formatted text |

Key Data Structures

# Box structure throughout pipeline
{
    # Basic (from OCR)
    "x0": float,           # Left edge
    "x1": float,           # Right edge
    "top": float,          # Top edge (cumulative Y)
    "bottom": float,       # Bottom edge (cumulative Y)
    "text": str,           # Recognized text
    "page_number": int,    # 1-indexed page

    # From layout recognition (Step 3)
    "layout_type": str,    # "text", "title", "table", "figure"...
    "layoutno": int,       # Layout region ID

    # From table detection (Step 4)
    "R": int,              # Row index
    "C": int,              # Column index
    "H": int,              # Header row index
    "SP": int,             # Spanning cell index

    # From column detection (Step 5)
    "col_id": int,         # Column ID (0-based)
}
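
For type-checked client code, the box layout can be written down as a `TypedDict`. This is a sketch that mirrors the field list above as documented; the `Box` class and the `last_stage` helper are hypothetical and do not exist in pdf_parser.py.

```python
from typing import TypedDict


class Box(TypedDict, total=False):
    # Basic (from OCR)
    x0: float
    x1: float
    top: float
    bottom: float
    text: str
    page_number: int
    # From layout recognition (Step 3)
    layout_type: str
    layoutno: int
    # From table detection (Step 4)
    R: int
    C: int
    H: int
    SP: int
    # From column detection (Step 5)
    col_id: int


def last_stage(box: Box) -> int:
    """Guess the latest pipeline step that has annotated this box."""
    if "col_id" in box:
        return 5
    if any(k in box for k in ("R", "C", "H", "SP")):
        return 4
    if "layout_type" in box:
        return 3
    return 2


box = Box(x0=50.0, x1=200.0, top=100.0, bottom=120.0, text="Hello", page_number=1)
print(last_stage(box))
```

`total=False` makes every key optional, which matches how the attributes are added step by step.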