ragflow/personal_analyze/07-DEEPDOC-DEEP-GUIDE/pdf_parser_steps_detail.md
Claude 1dcc9a870b
docs: Add detailed PDF parser processing steps documentation
Created comprehensive documentation for RAGFlowPdfParser processing pipeline:

- 10 major processing steps with code references
- Complete data flow diagrams
- Algorithm explanations (K-Means column detection, text merging)
- Box data structure evolution through pipeline
- Position tag format specification
- Line-by-line code analysis for key methods:
  - __init__ (model loading)
  - __images__ (OCR processing)
  - _layouts_rec (layout detection)
  - _table_transformer_job (table structure)
  - _assign_column (column detection)
  - _text_merge (horizontal merge)
  - _naive_vertical_merge (vertical merge)
  - _filter_forpages (cleanup)
  - _extract_table_figure (extraction)
  - __filterout_scraps (final output)
2025-11-27 06:29:12 +00:00

1651 lines
77 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Chi Tiết Các Bước Xử Lý Khi Gọi RAGFlowPdfParser
## Mục Lục
1. [Tổng Quan Pipeline](#1-tổng-quan-pipeline)
2. [Step 1: Khởi Tạo Parser](#step-1-khởi-tạo-parser-__init__)
3. [Step 2: Load Images & OCR](#step-2-load-images--ocr-__images__)
4. [Step 3: Layout Recognition](#step-3-layout-recognition-_layouts_rec)
5. [Step 4: Table Structure Detection](#step-4-table-structure-detection-_table_transformer_job)
6. [Step 5: Column Detection](#step-5-column-detection-_assign_column)
7. [Step 6: Text Merge (Horizontal)](#step-6-text-merge-horizontal-_text_merge)
8. [Step 7: Text Merge (Vertical)](#step-7-text-merge-vertical-_naive_vertical_merge)
9. [Step 8: Filter & Cleanup](#step-8-filter--cleanup)
10. [Step 9: Extract Tables & Figures](#step-9-extract-tables--figures-_extract_table_figure)
11. [Step 10: Final Output](#step-10-final-output-__filterout_scraps)
---
## 1. Tổng Quan Pipeline
### 1.1 Entry Points
Có 2 entry points chính:
```python
# Entry Point 1: Simple call (Line 1160-1168)
def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
self.__images__(fnm, zoomin) # Step 2
self._layouts_rec(zoomin) # Step 3
self._table_transformer_job(zoomin) # Step 4
self._text_merge() # Step 6
self._concat_downward() # Step 7 (disabled)
self._filter_forpages() # Step 8
tbls = self._extract_table_figure(...) # Step 9
return self.__filterout_scraps(...), tbls # Step 10
# Entry Point 2: Detailed parsing (Line 1170-1252)
def parse_into_bboxes(self, fnm, callback=None, zoomin=3):
# Same steps but with callbacks và more detailed output
```
### 1.2 Pipeline Flow Diagram
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ PDF PARSING PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
PDF File (path/bytes)
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 1: __init__() │
│ • Load OCR model (DBNet + CRNN) │
│ • Load LayoutRecognizer (YOLOv10) │
│ • Load TableStructureRecognizer (YOLOv10) │
│ • Load XGBoost model (text concatenation) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 2: __images__() │
│ • Convert PDF pages to images (pdfplumber) │
│ • Extract native PDF characters │
│ • Run OCR detection + recognition │
│ • Merge native chars with OCR boxes │
│ Output: self.boxes[], self.page_images[], self.page_chars[] │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 3: _layouts_rec() │
│ • Run YOLOv10 on page images │
│ • Detect 10 layout types (Text, Title, Table, Figure...) │
│ • Associate OCR boxes with layouts │
│ • Filter garbage (headers, footers, page numbers) │
│ Output: boxes[] with layout_type, layoutno attributes │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 4: _table_transformer_job() │
│ • Crop table regions from images │
│ • Run TableStructureRecognizer │
│ • Detect rows, columns, headers, spanning cells │
│ • Tag boxes with R (row), C (column), H (header), SP (spanning) │
│ Output: self.tb_cpns[], boxes[] with table attributes │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 5: _assign_column() (called in _text_merge) │
│ • K-Means clustering on X coordinates │
│ • Silhouette score to find optimal k (1-4 columns) │
│ • Assign col_id to each text box │
│ Output: boxes[] with col_id attribute │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 6: _text_merge() │
│ • Horizontal merge: same line, same column, same layout │
│ Output: Fewer, wider text boxes │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 7: _naive_vertical_merge() / _concat_downward() │
│ • Vertical merge: adjacent paragraphs │
│ • Semantic checks (punctuation, distance, overlap) │
│ Output: Merged paragraphs │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 8: _filter_forpages() │
│ • Remove table of contents │
│ • Remove dirty pages (repetitive patterns) │
│ Output: Cleaned boxes[] │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 9: _extract_table_figure() │
│ • Extract table boxes → construct HTML/descriptive │
│ • Extract figure boxes → crop images │
│ • Associate captions with tables/figures │
│ Output: tables[], figures[] │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 10: __filterout_scraps() │
│ • Filter low-quality text blocks │
│ • Add position tags │
│ • Format final output │
│ Output: (documents, tables) │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Step 1: Khởi Tạo Parser (`__init__`)
**File**: `pdf_parser.py`
**Lines**: 52-105
### Code Analysis
```python
class RAGFlowPdfParser:
def __init__(self, **kwargs):
# ═══════════════════════════════════════════════════════════════════
# 1. LOAD OCR MODEL
# ═══════════════════════════════════════════════════════════════════
self.ocr = OCR() # Line 66
# OCR class chứa:
# - TextDetector (DBNet): Phát hiện vùng text
# - TextRecognizer (CRNN): Nhận dạng text trong vùng
# ═══════════════════════════════════════════════════════════════════
# 2. SETUP PARALLEL PROCESSING
# ═══════════════════════════════════════════════════════════════════
self.parallel_limiter = None
if settings.PARALLEL_DEVICES > 1:
# Tạo capacity limiter cho mỗi GPU
self.parallel_limiter = [
trio.CapacityLimiter(1) # 1 task per device
for _ in range(settings.PARALLEL_DEVICES)
]
# ═══════════════════════════════════════════════════════════════════
# 3. LOAD LAYOUT RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
layout_recognizer_type = os.getenv("LAYOUT_RECOGNIZER_TYPE", "onnx")
if layout_recognizer_type == "ascend":
self.layouter = AscendLayoutRecognizer(recognizer_domain) # Huawei NPU
else:
self.layouter = LayoutRecognizer(recognizer_domain) # ONNX (default)
# ═══════════════════════════════════════════════════════════════════
# 4. LOAD TABLE STRUCTURE RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
self.tbl_det = TableStructureRecognizer() # Line 86
# ═══════════════════════════════════════════════════════════════════
# 5. LOAD XGBOOST MODEL (Text Concatenation)
# ═══════════════════════════════════════════════════════════════════
self.updown_cnt_mdl = xgb.Booster() # Line 88
# Try GPU first
try:
import torch.cuda
if torch.cuda.is_available():
self.updown_cnt_mdl.set_param({"device": "cuda"})
except:
pass
# Load model weights
model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
self.updown_cnt_mdl.load_model(
os.path.join(model_dir, "updown_concat_xgb.model")
)
# ═══════════════════════════════════════════════════════════════════
# 6. INITIALIZE STATE
# ═══════════════════════════════════════════════════════════════════
self.page_from = 0
self.column_num = 1
```
### Models Loaded
| Model | Type | Purpose | Size |
|-------|------|---------|------|
| OCR (DBNet) | ONNX | Text detection | ~30MB |
| OCR (CRNN) | ONNX | Text recognition | ~20MB |
| LayoutRecognizer | ONNX (YOLOv10) | Layout detection | ~50MB |
| TableStructureRecognizer | ONNX (YOLOv10) | Table structure | ~50MB |
| XGBoost | Binary | Text concatenation | ~5MB |
---
## Step 2: Load Images & OCR (`__images__`)
**File**: `pdf_parser.py`
**Lines**: 1042-1159
### 2.1 PDF to Images Conversion
```python
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
# ═══════════════════════════════════════════════════════════════════
# INITIALIZE STATE VARIABLES
# ═══════════════════════════════════════════════════════════════════
self.lefted_chars = [] # Characters không match với OCR box
self.mean_height = [] # Average character height per page
self.mean_width = [] # Average character width per page
self.boxes = [] # OCR results
self.garbages = {} # Garbage patterns found
self.page_cum_height = [0] # Cumulative page heights
self.page_layout = [] # Layout detection results
self.page_from = page_from
# ═══════════════════════════════════════════════════════════════════
# CONVERT PDF PAGES TO IMAGES (Lines 1052-1067)
# ═══════════════════════════════════════════════════════════════════
with pdfplumber.open(fnm) as pdf:
self.pdf = pdf
# Convert each page to image
# resolution = 72 * zoomin (default: 72 * 3 = 216 DPI)
self.page_images = [
p.to_image(resolution=72 * zoomin, antialias=True).annotated
for p in pdf.pages[page_from:page_to]
]
# ═══════════════════════════════════════════════════════════════
# EXTRACT NATIVE PDF CHARACTERS (Lines 1058-1062)
# ═══════════════════════════════════════════════════════════════
# Extract character-level info from PDF text layer
self.page_chars = [
[c for c in page.dedupe_chars().chars if self._has_color(c)]
for page in pdf.pages[page_from:page_to]
]
self.total_page = len(pdf.pages)
```
### 2.2 Language Detection
```python
# ═══════════════════════════════════════════════════════════════════
# DETECT DOCUMENT LANGUAGE (Lines 1093-1100)
# ═══════════════════════════════════════════════════════════════════
# Sample random characters, check if English
self.is_english = [
re.search(r"[ a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}",
"".join(random.choices([c["text"] for c in self.page_chars[i]],
k=min(100, len(self.page_chars[i])))))
for i in range(len(self.page_chars))
]
# If >50% pages are English, mark document as English
if sum([1 if e else 0 for e in self.is_english]) > len(self.page_images) / 2:
self.is_english = True
else:
self.is_english = False
```
### 2.3 OCR Processing (Parallel)
```python
# ═══════════════════════════════════════════════════════════════════
# ASYNC OCR PROCESSING (Lines 1102-1145)
# ═══════════════════════════════════════════════════════════════════
async def __img_ocr(i, device_id, img, chars, limiter):
# Add spaces between characters if needed
j = 0
while j + 1 < len(chars):
if (chars[j]["text"] and chars[j + 1]["text"]
and re.match(r"[0-9a-zA-Z,.:;!%]+",
chars[j]["text"] + chars[j + 1]["text"])
and chars[j + 1]["x0"] - chars[j]["x1"] >=
min(chars[j + 1]["width"], chars[j]["width"]) / 2):
chars[j]["text"] += " "
j += 1
# Run OCR with rate limiting for parallel execution
if limiter:
async with limiter:
await trio.to_thread.run_sync(
lambda: self.__ocr(i + 1, img, chars, zoomin, device_id)
)
else:
self.__ocr(i + 1, img, chars, zoomin, device_id)
# Launch OCR tasks
async def __img_ocr_launcher():
if self.parallel_limiter:
# Parallel processing across multiple GPUs
async with trio.open_nursery() as nursery:
for i, img in enumerate(self.page_images):
chars = preprocess(i)
nursery.start_soon(
__img_ocr, i,
i % settings.PARALLEL_DEVICES, # Round-robin GPU
img, chars,
self.parallel_limiter[i % settings.PARALLEL_DEVICES]
)
else:
# Sequential processing
for i, img in enumerate(self.page_images):
chars = preprocess(i)
await __img_ocr(i, 0, img, chars, None)
trio.run(__img_ocr_launcher)
```
### 2.4 OCR Core Function (`__ocr`)
```python
def __ocr(self, pagenum, img, chars, ZM=3, device_id=None):
"""
Core OCR function for a single page.
Lines: 282-345
"""
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.1: TEXT DETECTION
# ═══════════════════════════════════════════════════════════════════
bxs = self.ocr.detect(np.array(img), device_id) # Line 284
# Returns: [(box_points, (text_hint, confidence)), ...]
if not bxs:
self.boxes.append([])
return
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.2: CONVERT TO BOX FORMAT
# ═══════════════════════════════════════════════════════════════════
bxs = [(line[0], line[1][0]) for line in bxs]
bxs = Recognizer.sort_Y_firstly([
{
"x0": b[0][0] / ZM,
"x1": b[1][0] / ZM,
"top": b[0][1] / ZM,
"bottom": b[-1][1] / ZM,
"text": "",
"txt": t,
"chars": [],
"page_number": pagenum
}
for b, t in bxs
if b[0][0] <= b[1][0] and b[0][1] <= b[-1][1]
], self.mean_height[pagenum - 1] / 3)
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.3: MERGE NATIVE PDF CHARS WITH OCR BOXES
# ═══════════════════════════════════════════════════════════════════
for c in chars:
# Find overlapping OCR box
ii = Recognizer.find_overlapped(c, bxs)
if ii is None:
self.lefted_chars.append(c)
continue
# Check height compatibility (within 70% tolerance)
ch = c["bottom"] - c["top"]
bh = bxs[ii]["bottom"] - bxs[ii]["top"]
if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != " ":
self.lefted_chars.append(c)
continue
# Add character to box
bxs[ii]["chars"].append(c)
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.4: RECONSTRUCT TEXT FROM CHARS
# ═══════════════════════════════════════════════════════════════════
for b in bxs:
if not b["chars"]:
del b["chars"]
continue
# Sort chars by Y position, then concatenate
m_ht = np.mean([c["height"] for c in b["chars"]])
for c in Recognizer.sort_Y_firstly(b["chars"], m_ht):
if c["text"] == " " and b["text"]:
if re.match(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]", b["text"][-1]):
b["text"] += " "
else:
b["text"] += c["text"]
del b["chars"]
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.5: OCR RECOGNITION FOR BOXES WITHOUT NATIVE TEXT
# ═══════════════════════════════════════════════════════════════════
boxes_to_reg = []
img_np = np.array(img)
for b in bxs:
if not b["text"]:
# Crop region for OCR
left, right = b["x0"] * ZM, b["x1"] * ZM
top, bott = b["top"] * ZM, b["bottom"] * ZM
b["box_image"] = self.ocr.get_rotate_crop_image(
img_np,
np.array([[left, top], [right, top],
[right, bott], [left, bott]], dtype=np.float32)
)
boxes_to_reg.append(b)
del b["txt"]
# Batch recognition
texts = self.ocr.recognize_batch(
[b["box_image"] for b in boxes_to_reg],
device_id
)
for i, b in enumerate(boxes_to_reg):
b["text"] = texts[i]
del b["box_image"]
# Filter empty boxes
bxs = [b for b in bxs if b["text"]]
self.boxes.append(bxs)
```
### 2.5 Data Flow Diagram
```
PDF File
├──────────────────────────────────────────────────────────────┐
│ │
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ pdfplumber.open() │ │ pdf.pages[i] │
│ │ │ .to_image() │
│ Extract text layer │ │ │
│ (native characters) │ │ Resolution: 216 DPI │
└───────────┬─────────────┘ └───────────┬─────────────┘
│ │
▼ ▼
page_chars[] page_images[]
(Native PDF text) (PIL Images)
│ │
│ │
│ ┌──────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────┐
│ │ OCR Detection │
│ │ (DBNet) │
│ │ │
│ │ Input: page_image │
│ │ Output: bounding boxes│
│ └───────────┬─────────────┘
│ │
│ ▼
│ ┌─────────────────────────┐
│ │ Box-Char Matching │
│ │ │
└────────▶│ Match native chars │
│ to OCR boxes │
│ (overlap detection) │
└───────────┬─────────────┘
┌─────────────┴─────────────┐
│ │
▼ ▼
Boxes with text Boxes without text
(from native) (need OCR recognition)
│ │
│ ▼
│ ┌─────────────────────────┐
│ │ OCR Recognition │
│ │ (CRNN) │
│ │ │
│ │ Crop → Recognize │
│ └───────────┬─────────────┘
│ │
└─────────────┬─────────────┘
self.boxes[]
[{"x0", "x1", "top", "bottom", "text", "page_number"}, ...]
```
---
## Step 3: Layout Recognition (`_layouts_rec`)
**File**: `pdf_parser.py`
**Lines**: 347-353
### Code Analysis
```python
def _layouts_rec(self, ZM, drop=True):
"""
Run layout recognition on all pages.
Args:
ZM: Zoom factor (default 3)
drop: Whether to filter garbage layouts (headers, footers)
"""
assert len(self.page_images) == len(self.boxes)
# ═══════════════════════════════════════════════════════════════════
# CALL LAYOUT RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
# LayoutRecognizer.__call__() internally:
# 1. Runs YOLOv10 on each page image
# 2. Detects 10 layout types
# 3. Associates OCR boxes with layouts
# 4. Filters garbage if drop=True
self.boxes, self.page_layout = self.layouter(
self.page_images, # List of page images
self.boxes, # List of OCR boxes per page (flattened after this)
ZM, # Zoom factor
drop=drop # Filter garbage
)
# ═══════════════════════════════════════════════════════════════════
# ADD CUMULATIVE Y COORDINATES
# ═══════════════════════════════════════════════════════════════════
# After layouter, self.boxes is flattened (not per-page anymore)
for i in range(len(self.boxes)):
self.boxes[i]["top"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]
self.boxes[i]["bottom"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]
```
### Layout Types
```python
# From layout_recognizer.py, lines 34-46
labels = [
"_background_", # 0: Ignored
"Text", # 1: Body text paragraphs
"Title", # 2: Section titles
"Figure", # 3: Images, charts, diagrams
"Figure caption", # 4: Text describing figures
"Table", # 5: Data tables
"Table caption", # 6: Text describing tables
"Header", # 7: Page headers
"Footer", # 8: Page footers
"Reference", # 9: Bibliography
"Equation", # 10: Mathematical formulas
]
```
### Box Attributes After Layout Recognition
```python
# Each box in self.boxes now has:
{
"x0": float, # Left edge
"x1": float, # Right edge
"top": float, # Top edge (cumulative)
"bottom": float, # Bottom edge (cumulative)
"text": str, # Recognized text
"page_number": int, # 1-indexed page number
"layout_type": str, # "text", "title", "table", "figure", etc.
"layoutno": int, # Layout region ID
}
```
---
## Step 4: Table Structure Detection (`_table_transformer_job`)
**File**: `pdf_parser.py`
**Lines**: 196-281
### Code Analysis
```python
def _table_transformer_job(self, ZM):
"""
Detect table structure and tag boxes with R/C/H/SP attributes.
"""
logging.debug("Table processing...")
# ═══════════════════════════════════════════════════════════════════
# STEP 4.1: EXTRACT TABLE REGIONS
# ═══════════════════════════════════════════════════════════════════
imgs, pos = [], []
tbcnt = [0]
MARGIN = 10
self.tb_cpns = []
for p, tbls in enumerate(self.page_layout):
# Filter only table layouts
tbls = [f for f in tbls if f["type"] == "table"]
tbcnt.append(len(tbls))
if not tbls:
continue
for tb in tbls:
# Crop table region with margin
left = tb["x0"] - MARGIN
top = tb["top"] - MARGIN
right = tb["x1"] + MARGIN
bott = tb["bottom"] + MARGIN
# Scale by zoom factor
pos.append((left * ZM, top * ZM))
imgs.append(self.page_images[p].crop((
left * ZM, top * ZM,
right * ZM, bott * ZM
)))
if not imgs:
return
# ═══════════════════════════════════════════════════════════════════
# STEP 4.2: RUN TABLE STRUCTURE RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
recos = self.tbl_det(imgs) # Line 220
# Returns per table: [{"label": "table row|column|header|spanning", "x0", "top", ...}, ...]
# ═══════════════════════════════════════════════════════════════════
# STEP 4.3: MAP COORDINATES BACK TO FULL PAGE
# ═══════════════════════════════════════════════════════════════════
tbcnt = np.cumsum(tbcnt)
for i in range(len(tbcnt) - 1): # For each page
pg = []
for j, tb_items in enumerate(recos[tbcnt[i]:tbcnt[i + 1]]):
poss = pos[tbcnt[i]:tbcnt[i + 1]]
for it in tb_items:
# Add offset back
it["x0"] += poss[j][0]
it["x1"] += poss[j][0]
it["top"] += poss[j][1]
it["bottom"] += poss[j][1]
# Scale back from zoom
for n in ["x0", "x1", "top", "bottom"]:
it[n] /= ZM
# Add cumulative height
it["top"] += self.page_cum_height[i]
it["bottom"] += self.page_cum_height[i]
it["pn"] = i
it["layoutno"] = j
pg.append(it)
self.tb_cpns.extend(pg)
# ═══════════════════════════════════════════════════════════════════
# STEP 4.4: GATHER COMPONENTS BY TYPE
# ═══════════════════════════════════════════════════════════════════
def gather(kwd, fzy=10, ption=0.6):
eles = Recognizer.sort_Y_firstly(
[r for r in self.tb_cpns if re.match(kwd, r["label"])],
fzy
)
eles = Recognizer.layouts_cleanup(self.boxes, eles, 5, ption)
return Recognizer.sort_Y_firstly(eles, 0)
headers = gather(r".*header$")
rows = gather(r".* (row|header)")
spans = gather(r".*spanning")
clmns = sorted(
[r for r in self.tb_cpns if re.match(r"table column$", r["label"])],
key=lambda x: (x["pn"], x["layoutno"], x["x0"])
)
clmns = Recognizer.layouts_cleanup(self.boxes, clmns, 5, 0.5)
# ═══════════════════════════════════════════════════════════════════
# STEP 4.5: TAG BOXES WITH TABLE ATTRIBUTES
# ═══════════════════════════════════════════════════════════════════
for b in self.boxes:
if b.get("layout_type", "") != "table":
continue
# Find row (R)
ii = Recognizer.find_overlapped_with_threshold(b, rows, thr=0.3)
if ii is not None:
b["R"] = ii
b["R_top"] = rows[ii]["top"]
b["R_bott"] = rows[ii]["bottom"]
# Find header (H)
ii = Recognizer.find_overlapped_with_threshold(b, headers, thr=0.3)
if ii is not None:
b["H"] = ii
b["H_top"] = headers[ii]["top"]
b["H_bott"] = headers[ii]["bottom"]
b["H_left"] = headers[ii]["x0"]
b["H_right"] = headers[ii]["x1"]
# Find column (C)
ii = Recognizer.find_horizontally_tightest_fit(b, clmns)
if ii is not None:
b["C"] = ii
b["C_left"] = clmns[ii]["x0"]
b["C_right"] = clmns[ii]["x1"]
# Find spanning cell (SP)
ii = Recognizer.find_overlapped_with_threshold(b, spans, thr=0.3)
if ii is not None:
b["SP"] = ii
b["H_top"] = spans[ii]["top"]
b["H_bott"] = spans[ii]["bottom"]
b["H_left"] = spans[ii]["x0"]
b["H_right"] = spans[ii]["x1"]
```
### Data Flow
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ TABLE STRUCTURE DETECTION │
└─────────────────────────────────────────────────────────────────────────────┘
page_layout[] ───────────────────────────┐
(Table regions) │
┌─────────────────────────┐
│ Crop Table Regions │
│ + MARGIN (10px) │
└───────────┬─────────────┘
┌─────────────────────────┐
│ TableStructureRec() │
│ (YOLOv10) │
│ │
│ Detects: │
│ • table row │
│ • table column │
│ • table column header │
│ • table spanning cell │
└───────────┬─────────────┘
┌─────────────────────────┐
│ Tag OCR Boxes │
│ │
│ • R (row index) │
│ • C (column index) │
│ • H (header index) │
│ • SP (spanning cell) │
└─────────────────────────┘
After this step, table boxes have:
{
"R": 0, # Row index
"R_top": 100, # Row top boundary
"R_bott": 150, # Row bottom boundary
"C": 1, # Column index
"C_left": 50, # Column left boundary
"C_right": 200, # Column right boundary
"H": 0, # Header row index (if header)
"SP": 2, # Spanning cell index (if spanning)
}
```
---
## Step 5: Column Detection (`_assign_column`)
**File**: `pdf_parser.py`
**Lines**: 355-440
### Algorithm Overview
```
K-Means Column Detection:
1. Group boxes by page
2. For each page:
a. Extract X0 coordinates
b. Normalize indented text (within 12% page width)
c. Try K from 1 to 4
d. Select K with highest silhouette score
3. Use majority voting for global column count
4. Final clustering with selected K
5. Remap cluster IDs to left-to-right order
```
### Code Analysis
```python
def _assign_column(self, boxes, zoomin=3):
"""
Detect number of columns using K-Means clustering.
"""
if not boxes:
return boxes
if all("col_id" in b for b in boxes):
return boxes
# ═══════════════════════════════════════════════════════════════════
# GROUP BOXES BY PAGE
# ═══════════════════════════════════════════════════════════════════
by_page = defaultdict(list)
for b in boxes:
by_page[b["page_number"]].append(b)
page_cols = {}
# ═══════════════════════════════════════════════════════════════════
# FOR EACH PAGE: FIND OPTIMAL K
# ═══════════════════════════════════════════════════════════════════
for pg, bxs in by_page.items():
if not bxs:
page_cols[pg] = 1
continue
x0s_raw = np.array([b["x0"] for b in bxs], dtype=float)
# Calculate page width
min_x0 = np.min(x0s_raw)
max_x1 = np.max([b["x1"] for b in bxs])
width = max_x1 - min_x0
# ═══════════════════════════════════════════════════════════════
# INDENT TOLERANCE: Normalize near-left-edge text
# ═══════════════════════════════════════════════════════════════
INDENT_TOL = width * 0.12 # 12% of page width
x0s = []
for x in x0s_raw:
if abs(x - min_x0) < INDENT_TOL:
x0s.append([min_x0]) # Snap to left edge
else:
x0s.append([x])
x0s = np.array(x0s, dtype=float)
# ═══════════════════════════════════════════════════════════════
# TRY K FROM 1 TO 4
# ═══════════════════════════════════════════════════════════════
max_try = min(4, len(bxs))
if max_try < 2:
max_try = 1
best_k = 1
best_score = -1
for k in range(1, max_try + 1):
km = KMeans(n_clusters=k, n_init="auto")
labels = km.fit_predict(x0s)
centers = np.sort(km.cluster_centers_.flatten())
if len(centers) > 1:
try:
score = silhouette_score(x0s, labels)
except ValueError:
continue
else:
score = 0
if score > best_score:
best_score = score
best_k = k
page_cols[pg] = best_k
logging.info(f"[Page {pg}] best_score={best_score:.2f}, best_k={best_k}")
# ═══════════════════════════════════════════════════════════════════
# MAJORITY VOTING FOR GLOBAL COLUMN COUNT
# ═══════════════════════════════════════════════════════════════════
global_cols = Counter(page_cols.values()).most_common(1)[0][0]
logging.info(f"Global column_num by majority: {global_cols}")
# ═══════════════════════════════════════════════════════════════════
# FINAL CLUSTERING WITH SELECTED K
# ═══════════════════════════════════════════════════════════════════
for pg, bxs in by_page.items():
if not bxs:
continue
k = page_cols[pg]
if len(bxs) < k:
k = 1
x0s = np.array([[b["x0"]] for b in bxs], dtype=float)
km = KMeans(n_clusters=k, n_init="auto")
labels = km.fit_predict(x0s)
# ═══════════════════════════════════════════════════════════════
# REMAP CLUSTER IDS: Left-to-right order
# ═══════════════════════════════════════════════════════════════
centers = km.cluster_centers_.flatten()
order = np.argsort(centers)
remap = {orig: new for new, orig in enumerate(order)}
for b, lb in zip(bxs, labels):
b["col_id"] = remap[lb]
return boxes
```
### Visualization
```
Single column (k=1): Two columns (k=2):
┌────────────────────────────┐ ┌─────────────┬─────────────┐
│ Text text text text │ │ Col 0 │ Col 1 │
│ text text text text │ │ Text text │ Text text │
│ text text text text │ │ text text │ text text │
│ text text text text │ │ text text │ text text │
└────────────────────────────┘ └─────────────┴─────────────┘
col_id = 0 col_id = 0 col_id = 1
X coordinates: X coordinates:
[50, 52, 48, 51, ...] [50, 52, 300, 302, 49, 301, ...]
↓ K-Means ↓ K-Means
k=1, all → 0 k=2, cluster 0 → 0, cluster 1 → 1
```
---
## Step 6: Text Merge (Horizontal) (`_text_merge`)
**File**: `pdf_parser.py`
**Lines**: 442-478
### Algorithm
```
Horizontal Merge Conditions:
1. Same page
2. Same column (col_id)
3. Same layout (layoutno)
4. Not table/figure/equation
5. Y distance < mean_height / 3
```
### Code Analysis
```python
def _text_merge(self, zoomin=3):
"""
Merge horizontally adjacent boxes with same layout.
"""
bxs = self._assign_column(self.boxes, zoomin) # Ensure col_id assigned
# Helper functions
def end_with(b, txt):
txt = txt.strip()
tt = b.get("text", "").strip()
return tt and tt.find(txt) == len(tt) - len(txt)
def start_with(b, txts):
tt = b.get("text", "").strip()
return tt and any([tt.find(t.strip()) == 0 for t in txts])
# ═══════════════════════════════════════════════════════════════════
# HORIZONTAL MERGE LOOP
# ═══════════════════════════════════════════════════════════════════
i = 0
while i < len(bxs) - 1:
b = bxs[i]
b_ = bxs[i + 1]
# Skip if different page or column
if b["page_number"] != b_["page_number"]:
i += 1
continue
if b.get("col_id") != b_.get("col_id"):
i += 1
continue
# Skip if different layout or special type
if b.get("layoutno", "0") != b_.get("layoutno", "1"):
i += 1
continue
if b.get("layout_type", "") in ["table", "figure", "equation"]:
i += 1
continue
# Check Y distance
y_dis = abs(self._y_dis(b, b_))
threshold = self.mean_height[bxs[i]["page_number"] - 1] / 3
if y_dis < threshold:
# ═══════════════════════════════════════════════════════════
# MERGE: Expand box to include next
# ═══════════════════════════════════════════════════════════
bxs[i]["x1"] = b_["x1"] # Extend right edge
bxs[i]["top"] = (b["top"] + b_["top"]) / 2 # Average top
bxs[i]["bottom"] = (b["bottom"] + b_["bottom"]) / 2 # Average bottom
bxs[i]["text"] += b_["text"] # Concatenate text
bxs.pop(i + 1) # Remove merged box
continue # Check if can merge more
i += 1
self.boxes = bxs
```
### Visualization
```
Before horizontal merge:
┌──────┐ ┌──────┐ ┌──────┐
│Hello │ │World │ │! │ (same line, same layout)
└──────┘ └──────┘ └──────┘
After horizontal merge:
┌────────────────────────┐
│Hello World! │
└────────────────────────┘
```
---
## Step 7: Text Merge (Vertical) (`_naive_vertical_merge`)
**File**: `pdf_parser.py`
**Lines**: 480-556
### Algorithm
```
Vertical Merge Conditions:
1. Same page and column
2. Same layout (layoutno)
3. Y distance < 1.5 * mean_height
4. Horizontal overlap > 30%
5. Semantic checks (punctuation, text patterns)
```
### Code Analysis
```python
def _naive_vertical_merge(self, zoomin=3):
"""
Merge vertically adjacent boxes within same layout.
"""
bxs = self._assign_column(self.boxes, zoomin)
# ═══════════════════════════════════════════════════════════════════
# GROUP BY PAGE AND COLUMN
# ═══════════════════════════════════════════════════════════════════
grouped = defaultdict(list)
for b in bxs:
grouped[(b["page_number"], b.get("col_id", 0))].append(b)
merged_boxes = []
for (pg, col), bxs in grouped.items():
# Sort by top-to-bottom, left-to-right
bxs = sorted(bxs, key=lambda x: (x["top"], x["x0"]))
if not bxs:
continue
mh = self.mean_height[pg - 1] if self.mean_height else 10
i = 0
while i + 1 < len(bxs):
b = bxs[i]
b_ = bxs[i + 1]
# ═══════════════════════════════════════════════════════════
# SKIP CONDITIONS
# ═══════════════════════════════════════════════════════════
# Remove page numbers at page boundaries
if b["page_number"] < b_["page_number"]:
if re.match(r"[0-9 •一—-]+$", b["text"]):
bxs.pop(i)
continue
# Skip empty text
if not b["text"].strip():
bxs.pop(i)
continue
# Skip different layouts
if b.get("layoutno") != b_.get("layoutno"):
i += 1
continue
# Skip if too far apart vertically
if b_["top"] - b["bottom"] > mh * 1.5:
i += 1
continue
# ═══════════════════════════════════════════════════════════
# CHECK HORIZONTAL OVERLAP
# ═══════════════════════════════════════════════════════════
overlap = max(0, min(b["x1"], b_["x1"]) - max(b["x0"], b_["x0"]))
min_width = min(b["x1"] - b["x0"], b_["x1"] - b_["x0"])
if overlap / max(1, min_width) < 0.3:
i += 1
continue
# ═══════════════════════════════════════════════════════════
# SEMANTIC ANALYSIS
# ═══════════════════════════════════════════════════════════
# Features favoring concatenation
concatting_feats = [
b["text"].strip()[-1] in ",;:'\",、'"-", # Ends with continuation punct
len(b["text"].strip()) > 1 and
b["text"].strip()[-2] in ",;:'\"'"、;:",
b_["text"].strip() and
b_["text"].strip()[0] in "。;?!?"),,、:", # Starts with ending punct
]
# Features preventing concatenation
feats = [
b.get("layoutno", 0) != b_.get("layoutno", 0), # Different layout
b["text"].strip()[-1] in "。?!?", # Sentence end
self.is_english and b["text"].strip()[-1] in ".!?",
b["page_number"] == b_["page_number"] and
b_["top"] - b["bottom"] > mh * 1.5, # Too far
b["page_number"] < b_["page_number"] and
abs(b["x0"] - b_["x0"]) > self.mean_width[b["page_number"] - 1] * 4,
]
# Features for definite split
detach_feats = [
b["x1"] < b_["x0"], # No horizontal overlap at all
b["x0"] > b_["x1"],
]
# ═══════════════════════════════════════════════════════════
# DECISION
# ═══════════════════════════════════════════════════════════
if (any(feats) and not any(concatting_feats)) or any(detach_feats):
i += 1
continue
# ═══════════════════════════════════════════════════════════
# MERGE
# ═══════════════════════════════════════════════════════════
b["text"] = (b["text"].rstrip() + " " + b_["text"].lstrip()).strip()
b["bottom"] = b_["bottom"]
b["x0"] = min(b["x0"], b_["x0"])
b["x1"] = max(b["x1"], b_["x1"])
bxs.pop(i + 1)
merged_boxes.extend(bxs)
self.boxes = sorted(merged_boxes, key=lambda x: (x["page_number"], x.get("col_id", 0), x["top"]))
```
### Visualization
```
Before vertical merge:
┌────────────────────────┐
│This is paragraph one │
└────────────────────────┘
┌────────────────────────┐
│that continues here and │
└────────────────────────┘
┌────────────────────────┐
│ends with this line. │
└────────────────────────┘
After vertical merge:
┌────────────────────────┐
│This is paragraph one │
│that continues here and │
│ends with this line. │
└────────────────────────┘
```
---
## Step 8: Filter & Cleanup
### 8.1 `_filter_forpages`
**Lines**: 685-729
```python
def _filter_forpages(self):
"""
Remove table of contents and dirty pages.
"""
if not self.boxes:
return
# ═══════════════════════════════════════════════════════════════════
# DETECT AND REMOVE TABLE OF CONTENTS
# ═══════════════════════════════════════════════════════════════════
findit = False
i = 0
while i < len(self.boxes):
# Check for TOC headers
text_lower = re.sub(r"( | |\u3000)+", "", self.boxes[i]["text"].lower())
if not re.match(r"(contents|目录|目次|table of contents|致谢|acknowledge)$", text_lower):
i += 1
continue
findit = True
eng = re.match(r"[0-9a-zA-Z :'.-]{5,}", self.boxes[i]["text"].strip())
self.boxes.pop(i) # Remove TOC header
if i >= len(self.boxes):
break
# Get prefix of first TOC entry
prefix = self.boxes[i]["text"].strip()[:3] if not eng else \
" ".join(self.boxes[i]["text"].strip().split()[:2])
# Remove empty entries
while not prefix:
self.boxes.pop(i)
if i >= len(self.boxes):
break
prefix = self.boxes[i]["text"].strip()[:3] if not eng else \
" ".join(self.boxes[i]["text"].strip().split()[:2])
self.boxes.pop(i)
if i >= len(self.boxes) or not prefix:
break
# Remove entries matching TOC pattern
for j in range(i, min(i + 128, len(self.boxes))):
if not re.match(prefix, self.boxes[j]["text"]):
continue
for k in range(i, j):
self.boxes.pop(i)
break
if findit:
return
# ═══════════════════════════════════════════════════════════════════
# DETECT AND REMOVE DIRTY PAGES
# ═══════════════════════════════════════════════════════════════════
page_dirty = [0] * len(self.page_images)
for b in self.boxes:
# Count repetitive patterns (common in scanned TOC)
if re.search(r"(··|··|··)", b["text"]):
page_dirty[b["page_number"] - 1] += 1
# Pages with >3 repetitive patterns are dirty
page_dirty = set([i + 1 for i, t in enumerate(page_dirty) if t > 3])
if not page_dirty:
return
# Remove all boxes from dirty pages
i = 0
while i < len(self.boxes):
if self.boxes[i]["page_number"] in page_dirty:
self.boxes.pop(i)
continue
i += 1
```
---
## Step 9: Extract Tables & Figures (`_extract_table_figure`)
**File**: `pdf_parser.py`
**Lines**: 757-930
### Code Analysis
```python
def _extract_table_figure(self, need_image, ZM, return_html, need_position,
separate_tables_figures=False):
"""
Extract tables and figures from detected layouts.
"""
tables = {}
figures = {}
# ═══════════════════════════════════════════════════════════════════
# STEP 9.1: SEPARATE TABLE AND FIGURE BOXES
# ═══════════════════════════════════════════════════════════════════
i = 0
lst_lout_no = ""
nomerge_lout_no = []
while i < len(self.boxes):
if "layoutno" not in self.boxes[i]:
i += 1
continue
lout_no = f"{self.boxes[i]['page_number']}-{self.boxes[i]['layoutno']}"
# Mark captions as non-mergeable
if (TableStructureRecognizer.is_caption(self.boxes[i]) or
self.boxes[i]["layout_type"] in ["table caption", "title",
"figure caption", "reference"]):
nomerge_lout_no.append(lst_lout_no)
# ═══════════════════════════════════════════════════════════════
# EXTRACT TABLE BOXES
# ═══════════════════════════════════════════════════════════════
if self.boxes[i]["layout_type"] == "table":
# Skip source citations
if re.match(r"(数据|资料|图表)*来源[: ]", self.boxes[i]["text"]):
self.boxes.pop(i)
continue
if lout_no not in tables:
tables[lout_no] = []
tables[lout_no].append(self.boxes[i])
self.boxes.pop(i)
lst_lout_no = lout_no
continue
# ═══════════════════════════════════════════════════════════════
# EXTRACT FIGURE BOXES
# ═══════════════════════════════════════════════════════════════
if need_image and self.boxes[i]["layout_type"] == "figure":
if re.match(r"(数据|资料|图表)*来源[: ]", self.boxes[i]["text"]):
self.boxes.pop(i)
continue
if lout_no not in figures:
figures[lout_no] = []
figures[lout_no].append(self.boxes[i])
self.boxes.pop(i)
lst_lout_no = lout_no
continue
i += 1
# ═══════════════════════════════════════════════════════════════════
# STEP 9.2: MERGE CROSS-PAGE TABLES
# ═══════════════════════════════════════════════════════════════════
nomerge_lout_no = set(nomerge_lout_no)
tbls = sorted([(k, bxs) for k, bxs in tables.items()],
key=lambda x: (x[1][0]["top"], x[1][0]["x0"]))
i = len(tbls) - 1
while i - 1 >= 0:
k0, bxs0 = tbls[i - 1]
k, bxs = tbls[i]
i -= 1
if k0 in nomerge_lout_no:
continue
if bxs[0]["page_number"] == bxs0[0]["page_number"]:
continue
if bxs[0]["page_number"] - bxs0[0]["page_number"] > 1:
continue
mh = self.mean_height[bxs[0]["page_number"] - 1]
if self._y_dis(bxs0[-1], bxs[0]) > mh * 23:
continue
# Merge tables
tables[k0].extend(tables[k])
del tables[k]
# ═══════════════════════════════════════════════════════════════════
# STEP 9.3: ASSOCIATE CAPTIONS WITH TABLES/FIGURES
# ═══════════════════════════════════════════════════════════════════
i = 0
while i < len(self.boxes):
c = self.boxes[i]
if not TableStructureRecognizer.is_caption(c):
i += 1
continue
# Find nearest table/figure
def nearest(tbls):
mink, minv = "", float('inf')
for k, bxs in tbls.items():
for b in bxs:
if b.get("layout_type", "").find("caption") >= 0:
continue
y_dis = self._y_dis(c, b)
x_dis = self._x_dis(c, b) if not x_overlapped(c, b) else 0
dis = y_dis**2 + x_dis**2
if dis < minv:
mink, minv = k, dis
return mink, minv
tk, tv = nearest(tables)
fk, fv = nearest(figures)
if tv < fv and tk:
tables[tk].insert(0, c)
elif fk:
figures[fk].insert(0, c)
self.boxes.pop(i)
# ═══════════════════════════════════════════════════════════════════
# STEP 9.4: CONSTRUCT TABLE OUTPUT
# ═══════════════════════════════════════════════════════════════════
res = []
for k, bxs in tables.items():
if not bxs:
continue
bxs = Recognizer.sort_Y_firstly(bxs, np.mean([(b["bottom"] - b["top"]) / 2 for b in bxs]))
poss = []
# Crop table image
img = cropout(bxs, "table", poss)
# Construct table content (HTML or descriptive)
content = self.tbl_det.construct_table(
bxs,
html=return_html,
is_english=self.is_english
)
res.append((img, content))
return res
```
### Table Construction Flow
```
Table boxes with R/C/H/SP attributes
┌─────────────────────────────────────────────────────────────────────────────┐
│ TableStructureRecognizer.construct_table() │
│ │
│ 1. Sort by row (R attribute) │
│ 2. Group into rows │
│ 3. Sort each row by column (C attribute) │
│ 4. Build 2D table matrix │
│ 5. Handle spanning cells (SP attribute) │
│ 6. Generate output format │
└─────────────────────────────────────────────────────────────────────────────┘
┌───────┴───────┐
│ │
▼ ▼
HTML Output Descriptive Output
HTML:
<table>
<caption>Table 1: Data</caption>
<tr><th>Name</th><th>Value</th></tr>
<tr><td>Item 1</td><td>100</td></tr>
</table>
Descriptive:
Name: Item 1; Value: 100
Name: Item 2; Value: 200
(from "Table 1: Data")
```
---
## Step 10: Final Output (`__filterout_scraps`)
**File**: `pdf_parser.py`
**Lines**: 971-1029
### Code Analysis
```python
def __filterout_scraps(self, boxes, ZM):
"""
Filter low-quality text blocks and format final output.
"""
def width(b):
return b["x1"] - b["x0"]
def height(b):
return b["bottom"] - b["top"]
def usefull(b):
"""Check if box is useful."""
if b.get("layout_type"):
return True
# Width > 1/3 page width
if width(b) > self.page_images[b["page_number"] - 1].size[0] / ZM / 3:
return True
# Height > mean character height
if height(b) > self.mean_height[b["page_number"] - 1]:
return True
return False
res = []
while boxes:
lines = []
widths = []
pw = self.page_images[boxes[0]["page_number"] - 1].size[0] / ZM
mh = self.mean_height[boxes[0]["page_number"] - 1]
mj = self.proj_match(boxes[0]["text"]) or \
boxes[0].get("layout_type", "") == "title"
# ═══════════════════════════════════════════════════════════════
# DFS TO FIND CONNECTED LINES
# ═══════════════════════════════════════════════════════════════
def dfs(line, st):
nonlocal mh, pw, lines, widths
lines.append(line)
widths.append(width(line))
mmj = self.proj_match(line["text"]) or \
line.get("layout_type", "") == "title"
for i in range(st + 1, min(st + 20, len(boxes))):
# Stop at page boundary
if boxes[i]["page_number"] - line["page_number"] > 0:
break
# Stop if too far vertically
if not mmj and self._y_dis(line, boxes[i]) >= 3 * mh and \
height(line) < 1.5 * mh:
break
if not usefull(boxes[i]):
continue
# Check horizontal proximity
if mmj or (self._x_dis(boxes[i], line) < pw / 10):
dfs(boxes[i], i)
boxes.pop(i)
break
try:
if usefull(boxes[0]):
dfs(boxes[0], 0)
else:
logging.debug("WASTE: " + boxes[0]["text"])
except:
pass
boxes.pop(0)
# ═══════════════════════════════════════════════════════════════
# FILTER AND FORMAT OUTPUT
# ═══════════════════════════════════════════════════════════════
mw = np.mean(widths)
if mj or mw / pw >= 0.35 or mw > 200:
# Add position tags to each line
result = "\n".join([
c["text"] + self._line_tag(c, ZM)
for c in lines
])
res.append(result)
else:
logging.debug("REMOVED: " + "<<".join([c["text"] for c in lines]))
return "\n\n".join(res)
```
### Position Tag Format
```python
def _line_tag(self, bx, ZM):
"""
Generate position tag for a text box.
Format: @@{page_numbers}\t{x0}\t{x1}\t{top}\t{bottom}##
Example: @@1-2\t50.0\t450.0\t100.0\t120.0##
(Text spans pages 1-2, coordinates in original scale)
"""
pn = [bx["page_number"]]
top = bx["top"] - self.page_cum_height[pn[0] - 1]
bott = bx["bottom"] - self.page_cum_height[pn[0] - 1]
# Handle multi-page spanning
while bott * ZM > self.page_images[pn[-1] - 1].size[1]:
bott -= self.page_images[pn[-1] - 1].size[1] / ZM
pn.append(pn[-1] + 1)
return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format(
"-".join([str(p) for p in pn]),
bx["x0"], bx["x1"], top, bott
)
```
### Final Output Format
```python
# Return value of __call__:
(
# documents: str (paragraphs separated by \n\n)
"Paragraph 1 text@@1\t50.0\t450.0\t100.0\t150.0##\n\n"
"Paragraph 2 text@@1\t50.0\t450.0\t200.0\t250.0##\n\n"
"...",
# tables: List[Tuple[PIL.Image, str|List[str]]]
[
(table_image_1, "<table>...</table>"),
(table_image_2, ["desc line 1", "desc line 2"]),
]
)
```
---
## Tổng Kết
### Complete Pipeline Summary
| Step | Method | Lines | Input | Output |
|------|--------|-------|-------|--------|
| 1 | `__init__` | 52-105 | - | Models loaded |
| 2 | `__images__` | 1042-1159 | PDF file | boxes[], page_images[] |
| 3 | `_layouts_rec` | 347-353 | page_images, boxes | boxes with layout_type |
| 4 | `_table_transformer_job` | 196-281 | page_images, boxes | boxes with R/C/H/SP |
| 5 | `_assign_column` | 355-440 | boxes | boxes with col_id |
| 6 | `_text_merge` | 442-478 | boxes | merged boxes (horizontal) |
| 7 | `_naive_vertical_merge` | 480-556 | boxes | merged boxes (vertical) |
| 8 | `_filter_forpages` | 685-729 | boxes | cleaned boxes |
| 9 | `_extract_table_figure` | 757-930 | boxes | tables[], figures[] |
| 10 | `__filterout_scraps` | 971-1029 | boxes | formatted text |
### Key Data Structures
```python
# Box structure throughout pipeline
{
# Basic (from OCR)
"x0": float, # Left edge
"x1": float, # Right edge
"top": float, # Top edge (cumulative Y)
"bottom": float, # Bottom edge (cumulative Y)
"text": str, # Recognized text
"page_number": int, # 1-indexed page
# From layout recognition (Step 3)
"layout_type": str, # "text", "title", "table", "figure"...
"layoutno": int, # Layout region ID
# From table detection (Step 4)
"R": int, # Row index
"C": int, # Column index
"H": int, # Header row index
"SP": int, # Spanning cell index
# From column detection (Step 5)
"col_id": int, # Column ID (0-based)
}
```