Created comprehensive documentation for RAGFlowPdfParser processing pipeline: - 10 major processing steps with code references - Complete data flow diagrams - Algorithm explanations (K-Means column detection, text merging) - Box data structure evolution through pipeline - Position tag format specification - Line-by-line code analysis for key methods: - __init__ (model loading) - __images__ (OCR processing) - _layouts_rec (layout detection) - _table_transformer_job (table structure) - _assign_column (column detection) - _text_merge (horizontal merge) - _naive_vertical_merge (vertical merge) - _filter_forpages (cleanup) - _extract_table_figure (extraction) - __filterout_scraps (final output)
77 KiB
77 KiB
Chi Tiết Các Bước Xử Lý Khi Gọi RAGFlowPdfParser
Mục Lục
- Tổng Quan Pipeline
- Step 1: Khởi Tạo Parser
- Step 2: Load Images & OCR
- Step 3: Layout Recognition
- Step 4: Table Structure Detection
- Step 5: Column Detection
- Step 6: Text Merge (Horizontal)
- Step 7: Text Merge (Vertical)
- Step 8: Filter & Cleanup
- Step 9: Extract Tables & Figures
- Step 10: Final Output
1. Tổng Quan Pipeline
1.1 Entry Points
Có 2 entry points chính:
# Entry Point 1: Simple call (Line 1160-1168)
def __call__(self, fnm, need_image=True, zoomin=3, return_html=False):
self.__images__(fnm, zoomin) # Step 2
self._layouts_rec(zoomin) # Step 3
self._table_transformer_job(zoomin) # Step 4
self._text_merge() # Step 6
self._concat_downward() # Step 7 (disabled)
self._filter_forpages() # Step 8
tbls = self._extract_table_figure(...) # Step 9
return self.__filterout_scraps(...), tbls # Step 10
# Entry Point 2: Detailed parsing (Line 1170-1252)
def parse_into_bboxes(self, fnm, callback=None, zoomin=3):
# Same steps but with callbacks và more detailed output
1.2 Pipeline Flow Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│ PDF PARSING PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
PDF File (path/bytes)
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 1: __init__() │
│ • Load OCR model (DBNet + CRNN) │
│ • Load LayoutRecognizer (YOLOv10) │
│ • Load TableStructureRecognizer (YOLOv10) │
│ • Load XGBoost model (text concatenation) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 2: __images__() │
│ • Convert PDF pages to images (pdfplumber) │
│ • Extract native PDF characters │
│ • Run OCR detection + recognition │
│ • Merge native chars with OCR boxes │
│ Output: self.boxes[], self.page_images[], self.page_chars[] │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 3: _layouts_rec() │
│ • Run YOLOv10 on page images │
│ • Detect 10 layout types (Text, Title, Table, Figure...) │
│ • Associate OCR boxes with layouts │
│ • Filter garbage (headers, footers, page numbers) │
│ Output: boxes[] with layout_type, layoutno attributes │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 4: _table_transformer_job() │
│ • Crop table regions from images │
│ • Run TableStructureRecognizer │
│ • Detect rows, columns, headers, spanning cells │
│ • Tag boxes with R (row), C (column), H (header), SP (spanning) │
│ Output: self.tb_cpns[], boxes[] with table attributes │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 5: _assign_column() (called in _text_merge) │
│ • K-Means clustering on X coordinates │
│ • Silhouette score to find optimal k (1-4 columns) │
│ • Assign col_id to each text box │
│ Output: boxes[] with col_id attribute │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 6: _text_merge() │
│ • Horizontal merge: same line, same column, same layout │
│ Output: Fewer, wider text boxes │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 7: _naive_vertical_merge() / _concat_downward() │
│ • Vertical merge: adjacent paragraphs │
│ • Semantic checks (punctuation, distance, overlap) │
│ Output: Merged paragraphs │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 8: _filter_forpages() │
│ • Remove table of contents │
│ • Remove dirty pages (repetitive patterns) │
│ Output: Cleaned boxes[] │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 9: _extract_table_figure() │
│ • Extract table boxes → construct HTML/descriptive │
│ • Extract figure boxes → crop images │
│ • Associate captions with tables/figures │
│ Output: tables[], figures[] │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP 10: __filterout_scraps() │
│ • Filter low-quality text blocks │
│ • Add position tags │
│ • Format final output │
│ Output: (documents, tables) │
└─────────────────────────────────────────────────────────────────────────────┘
Step 1: Khởi Tạo Parser (__init__)
File: pdf_parser.py
Lines: 52-105
Code Analysis
class RAGFlowPdfParser:
def __init__(self, **kwargs):
# ═══════════════════════════════════════════════════════════════════
# 1. LOAD OCR MODEL
# ═══════════════════════════════════════════════════════════════════
self.ocr = OCR() # Line 66
# OCR class chứa:
# - TextDetector (DBNet): Phát hiện vùng text
# - TextRecognizer (CRNN): Nhận dạng text trong vùng
# ═══════════════════════════════════════════════════════════════════
# 2. SETUP PARALLEL PROCESSING
# ═══════════════════════════════════════════════════════════════════
self.parallel_limiter = None
if settings.PARALLEL_DEVICES > 1:
# Tạo capacity limiter cho mỗi GPU
self.parallel_limiter = [
trio.CapacityLimiter(1) # 1 task per device
for _ in range(settings.PARALLEL_DEVICES)
]
# ═══════════════════════════════════════════════════════════════════
# 3. LOAD LAYOUT RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
layout_recognizer_type = os.getenv("LAYOUT_RECOGNIZER_TYPE", "onnx")
if layout_recognizer_type == "ascend":
self.layouter = AscendLayoutRecognizer(recognizer_domain) # Huawei NPU
else:
self.layouter = LayoutRecognizer(recognizer_domain) # ONNX (default)
# ═══════════════════════════════════════════════════════════════════
# 4. LOAD TABLE STRUCTURE RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
self.tbl_det = TableStructureRecognizer() # Line 86
# ═══════════════════════════════════════════════════════════════════
# 5. LOAD XGBOOST MODEL (Text Concatenation)
# ═══════════════════════════════════════════════════════════════════
self.updown_cnt_mdl = xgb.Booster() # Line 88
# Try GPU first
try:
import torch.cuda
if torch.cuda.is_available():
self.updown_cnt_mdl.set_param({"device": "cuda"})
except:
pass
# Load model weights
model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
self.updown_cnt_mdl.load_model(
os.path.join(model_dir, "updown_concat_xgb.model")
)
# ═══════════════════════════════════════════════════════════════════
# 6. INITIALIZE STATE
# ═══════════════════════════════════════════════════════════════════
self.page_from = 0
self.column_num = 1
Models Loaded
| Model | Type | Purpose | Size |
|---|---|---|---|
| OCR (DBNet) | ONNX | Text detection | ~30MB |
| OCR (CRNN) | ONNX | Text recognition | ~20MB |
| LayoutRecognizer | ONNX (YOLOv10) | Layout detection | ~50MB |
| TableStructureRecognizer | ONNX (YOLOv10) | Table structure | ~50MB |
| XGBoost | Binary | Text concatenation | ~5MB |
Step 2: Load Images & OCR (__images__)
File: pdf_parser.py
Lines: 1042-1159
2.1 PDF to Images Conversion
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
# ═══════════════════════════════════════════════════════════════════
# INITIALIZE STATE VARIABLES
# ═══════════════════════════════════════════════════════════════════
self.lefted_chars = [] # Characters không match với OCR box
self.mean_height = [] # Average character height per page
self.mean_width = [] # Average character width per page
self.boxes = [] # OCR results
self.garbages = {} # Garbage patterns found
self.page_cum_height = [0] # Cumulative page heights
self.page_layout = [] # Layout detection results
self.page_from = page_from
# ═══════════════════════════════════════════════════════════════════
# CONVERT PDF PAGES TO IMAGES (Lines 1052-1067)
# ═══════════════════════════════════════════════════════════════════
with pdfplumber.open(fnm) as pdf:
self.pdf = pdf
# Convert each page to image
# resolution = 72 * zoomin (default: 72 * 3 = 216 DPI)
self.page_images = [
p.to_image(resolution=72 * zoomin, antialias=True).annotated
for p in pdf.pages[page_from:page_to]
]
# ═══════════════════════════════════════════════════════════════
# EXTRACT NATIVE PDF CHARACTERS (Lines 1058-1062)
# ═══════════════════════════════════════════════════════════════
# Extract character-level info from PDF text layer
self.page_chars = [
[c for c in page.dedupe_chars().chars if self._has_color(c)]
for page in pdf.pages[page_from:page_to]
]
self.total_page = len(pdf.pages)
2.2 Language Detection
# ═══════════════════════════════════════════════════════════════════
# DETECT DOCUMENT LANGUAGE (Lines 1093-1100)
# ═══════════════════════════════════════════════════════════════════
# Sample random characters, check if English
self.is_english = [
re.search(r"[ a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}",
"".join(random.choices([c["text"] for c in self.page_chars[i]],
k=min(100, len(self.page_chars[i])))))
for i in range(len(self.page_chars))
]
# If >50% pages are English, mark document as English
if sum([1 if e else 0 for e in self.is_english]) > len(self.page_images) / 2:
self.is_english = True
else:
self.is_english = False
2.3 OCR Processing (Parallel)
# ═══════════════════════════════════════════════════════════════════
# ASYNC OCR PROCESSING (Lines 1102-1145)
# ═══════════════════════════════════════════════════════════════════
async def __img_ocr(i, device_id, img, chars, limiter):
# Add spaces between characters if needed
j = 0
while j + 1 < len(chars):
if (chars[j]["text"] and chars[j + 1]["text"]
and re.match(r"[0-9a-zA-Z,.:;!%]+",
chars[j]["text"] + chars[j + 1]["text"])
and chars[j + 1]["x0"] - chars[j]["x1"] >=
min(chars[j + 1]["width"], chars[j]["width"]) / 2):
chars[j]["text"] += " "
j += 1
# Run OCR with rate limiting for parallel execution
if limiter:
async with limiter:
await trio.to_thread.run_sync(
lambda: self.__ocr(i + 1, img, chars, zoomin, device_id)
)
else:
self.__ocr(i + 1, img, chars, zoomin, device_id)
# Launch OCR tasks
async def __img_ocr_launcher():
if self.parallel_limiter:
# Parallel processing across multiple GPUs
async with trio.open_nursery() as nursery:
for i, img in enumerate(self.page_images):
chars = preprocess(i)
nursery.start_soon(
__img_ocr, i,
i % settings.PARALLEL_DEVICES, # Round-robin GPU
img, chars,
self.parallel_limiter[i % settings.PARALLEL_DEVICES]
)
else:
# Sequential processing
for i, img in enumerate(self.page_images):
chars = preprocess(i)
await __img_ocr(i, 0, img, chars, None)
trio.run(__img_ocr_launcher)
2.4 OCR Core Function (__ocr)
def __ocr(self, pagenum, img, chars, ZM=3, device_id=None):
"""
Core OCR function for a single page.
Lines: 282-345
"""
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.1: TEXT DETECTION
# ═══════════════════════════════════════════════════════════════════
bxs = self.ocr.detect(np.array(img), device_id) # Line 284
# Returns: [(box_points, (text_hint, confidence)), ...]
if not bxs:
self.boxes.append([])
return
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.2: CONVERT TO BOX FORMAT
# ═══════════════════════════════════════════════════════════════════
bxs = [(line[0], line[1][0]) for line in bxs]
bxs = Recognizer.sort_Y_firstly([
{
"x0": b[0][0] / ZM,
"x1": b[1][0] / ZM,
"top": b[0][1] / ZM,
"bottom": b[-1][1] / ZM,
"text": "",
"txt": t,
"chars": [],
"page_number": pagenum
}
for b, t in bxs
if b[0][0] <= b[1][0] and b[0][1] <= b[-1][1]
], self.mean_height[pagenum - 1] / 3)
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.3: MERGE NATIVE PDF CHARS WITH OCR BOXES
# ═══════════════════════════════════════════════════════════════════
for c in chars:
# Find overlapping OCR box
ii = Recognizer.find_overlapped(c, bxs)
if ii is None:
self.lefted_chars.append(c)
continue
# Check height compatibility (within 70% tolerance)
ch = c["bottom"] - c["top"]
bh = bxs[ii]["bottom"] - bxs[ii]["top"]
if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != " ":
self.lefted_chars.append(c)
continue
# Add character to box
bxs[ii]["chars"].append(c)
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.4: RECONSTRUCT TEXT FROM CHARS
# ═══════════════════════════════════════════════════════════════════
for b in bxs:
if not b["chars"]:
del b["chars"]
continue
# Sort chars by Y position, then concatenate
m_ht = np.mean([c["height"] for c in b["chars"]])
for c in Recognizer.sort_Y_firstly(b["chars"], m_ht):
if c["text"] == " " and b["text"]:
if re.match(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]", b["text"][-1]):
b["text"] += " "
else:
b["text"] += c["text"]
del b["chars"]
# ═══════════════════════════════════════════════════════════════════
# STEP 2.4.5: OCR RECOGNITION FOR BOXES WITHOUT NATIVE TEXT
# ═══════════════════════════════════════════════════════════════════
boxes_to_reg = []
img_np = np.array(img)
for b in bxs:
if not b["text"]:
# Crop region for OCR
left, right = b["x0"] * ZM, b["x1"] * ZM
top, bott = b["top"] * ZM, b["bottom"] * ZM
b["box_image"] = self.ocr.get_rotate_crop_image(
img_np,
np.array([[left, top], [right, top],
[right, bott], [left, bott]], dtype=np.float32)
)
boxes_to_reg.append(b)
del b["txt"]
# Batch recognition
texts = self.ocr.recognize_batch(
[b["box_image"] for b in boxes_to_reg],
device_id
)
for i, b in enumerate(boxes_to_reg):
b["text"] = texts[i]
del b["box_image"]
# Filter empty boxes
bxs = [b for b in bxs if b["text"]]
self.boxes.append(bxs)
2.5 Data Flow Diagram
PDF File
│
├──────────────────────────────────────────────────────────────┐
│ │
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ pdfplumber.open() │ │ pdf.pages[i] │
│ │ │ .to_image() │
│ Extract text layer │ │ │
│ (native characters) │ │ Resolution: 216 DPI │
└───────────┬─────────────┘ └───────────┬─────────────┘
│ │
▼ ▼
page_chars[] page_images[]
(Native PDF text) (PIL Images)
│ │
│ │
│ ┌──────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────┐
│ │ OCR Detection │
│ │ (DBNet) │
│ │ │
│ │ Input: page_image │
│ │ Output: bounding boxes│
│ └───────────┬─────────────┘
│ │
│ ▼
│ ┌─────────────────────────┐
│ │ Box-Char Matching │
│ │ │
└────────▶│ Match native chars │
│ to OCR boxes │
│ (overlap detection) │
└───────────┬─────────────┘
│
┌─────────────┴─────────────┐
│ │
▼ ▼
Boxes with text Boxes without text
(from native) (need OCR recognition)
│ │
│ ▼
│ ┌─────────────────────────┐
│ │ OCR Recognition │
│ │ (CRNN) │
│ │ │
│ │ Crop → Recognize │
│ └───────────┬─────────────┘
│ │
└─────────────┬─────────────┘
│
▼
self.boxes[]
[{"x0", "x1", "top", "bottom", "text", "page_number"}, ...]
Step 3: Layout Recognition (_layouts_rec)
File: pdf_parser.py
Lines: 347-353
Code Analysis
def _layouts_rec(self, ZM, drop=True):
"""
Run layout recognition on all pages.
Args:
ZM: Zoom factor (default 3)
drop: Whether to filter garbage layouts (headers, footers)
"""
assert len(self.page_images) == len(self.boxes)
# ═══════════════════════════════════════════════════════════════════
# CALL LAYOUT RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
# LayoutRecognizer.__call__() internally:
# 1. Runs YOLOv10 on each page image
# 2. Detects 10 layout types
# 3. Associates OCR boxes with layouts
# 4. Filters garbage if drop=True
self.boxes, self.page_layout = self.layouter(
self.page_images, # List of page images
self.boxes, # List of OCR boxes per page (flattened after this)
ZM, # Zoom factor
drop=drop # Filter garbage
)
# ═══════════════════════════════════════════════════════════════════
# ADD CUMULATIVE Y COORDINATES
# ═══════════════════════════════════════════════════════════════════
# After layouter, self.boxes is flattened (not per-page anymore)
for i in range(len(self.boxes)):
self.boxes[i]["top"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]
self.boxes[i]["bottom"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]
Layout Types
# From layout_recognizer.py, lines 34-46
labels = [
"_background_", # 0: Ignored
"Text", # 1: Body text paragraphs
"Title", # 2: Section titles
"Figure", # 3: Images, charts, diagrams
"Figure caption", # 4: Text describing figures
"Table", # 5: Data tables
"Table caption", # 6: Text describing tables
"Header", # 7: Page headers
"Footer", # 8: Page footers
"Reference", # 9: Bibliography
"Equation", # 10: Mathematical formulas
]
Box Attributes After Layout Recognition
# Each box in self.boxes now has:
{
"x0": float, # Left edge
"x1": float, # Right edge
"top": float, # Top edge (cumulative)
"bottom": float, # Bottom edge (cumulative)
"text": str, # Recognized text
"page_number": int, # 1-indexed page number
"layout_type": str, # "text", "title", "table", "figure", etc.
"layoutno": int, # Layout region ID
}
Step 4: Table Structure Detection (_table_transformer_job)
File: pdf_parser.py
Lines: 196-281
Code Analysis
def _table_transformer_job(self, ZM):
"""
Detect table structure and tag boxes with R/C/H/SP attributes.
"""
logging.debug("Table processing...")
# ═══════════════════════════════════════════════════════════════════
# STEP 4.1: EXTRACT TABLE REGIONS
# ═══════════════════════════════════════════════════════════════════
imgs, pos = [], []
tbcnt = [0]
MARGIN = 10
self.tb_cpns = []
for p, tbls in enumerate(self.page_layout):
# Filter only table layouts
tbls = [f for f in tbls if f["type"] == "table"]
tbcnt.append(len(tbls))
if not tbls:
continue
for tb in tbls:
# Crop table region with margin
left = tb["x0"] - MARGIN
top = tb["top"] - MARGIN
right = tb["x1"] + MARGIN
bott = tb["bottom"] + MARGIN
# Scale by zoom factor
pos.append((left * ZM, top * ZM))
imgs.append(self.page_images[p].crop((
left * ZM, top * ZM,
right * ZM, bott * ZM
)))
if not imgs:
return
# ═══════════════════════════════════════════════════════════════════
# STEP 4.2: RUN TABLE STRUCTURE RECOGNIZER
# ═══════════════════════════════════════════════════════════════════
recos = self.tbl_det(imgs) # Line 220
# Returns per table: [{"label": "table row|column|header|spanning", "x0", "top", ...}, ...]
# ═══════════════════════════════════════════════════════════════════
# STEP 4.3: MAP COORDINATES BACK TO FULL PAGE
# ═══════════════════════════════════════════════════════════════════
tbcnt = np.cumsum(tbcnt)
for i in range(len(tbcnt) - 1): # For each page
pg = []
for j, tb_items in enumerate(recos[tbcnt[i]:tbcnt[i + 1]]):
poss = pos[tbcnt[i]:tbcnt[i + 1]]
for it in tb_items:
# Add offset back
it["x0"] += poss[j][0]
it["x1"] += poss[j][0]
it["top"] += poss[j][1]
it["bottom"] += poss[j][1]
# Scale back from zoom
for n in ["x0", "x1", "top", "bottom"]:
it[n] /= ZM
# Add cumulative height
it["top"] += self.page_cum_height[i]
it["bottom"] += self.page_cum_height[i]
it["pn"] = i
it["layoutno"] = j
pg.append(it)
self.tb_cpns.extend(pg)
# ═══════════════════════════════════════════════════════════════════
# STEP 4.4: GATHER COMPONENTS BY TYPE
# ═══════════════════════════════════════════════════════════════════
def gather(kwd, fzy=10, ption=0.6):
eles = Recognizer.sort_Y_firstly(
[r for r in self.tb_cpns if re.match(kwd, r["label"])],
fzy
)
eles = Recognizer.layouts_cleanup(self.boxes, eles, 5, ption)
return Recognizer.sort_Y_firstly(eles, 0)
headers = gather(r".*header$")
rows = gather(r".* (row|header)")
spans = gather(r".*spanning")
clmns = sorted(
[r for r in self.tb_cpns if re.match(r"table column$", r["label"])],
key=lambda x: (x["pn"], x["layoutno"], x["x0"])
)
clmns = Recognizer.layouts_cleanup(self.boxes, clmns, 5, 0.5)
# ═══════════════════════════════════════════════════════════════════
# STEP 4.5: TAG BOXES WITH TABLE ATTRIBUTES
# ═══════════════════════════════════════════════════════════════════
for b in self.boxes:
if b.get("layout_type", "") != "table":
continue
# Find row (R)
ii = Recognizer.find_overlapped_with_threshold(b, rows, thr=0.3)
if ii is not None:
b["R"] = ii
b["R_top"] = rows[ii]["top"]
b["R_bott"] = rows[ii]["bottom"]
# Find header (H)
ii = Recognizer.find_overlapped_with_threshold(b, headers, thr=0.3)
if ii is not None:
b["H"] = ii
b["H_top"] = headers[ii]["top"]
b["H_bott"] = headers[ii]["bottom"]
b["H_left"] = headers[ii]["x0"]
b["H_right"] = headers[ii]["x1"]
# Find column (C)
ii = Recognizer.find_horizontally_tightest_fit(b, clmns)
if ii is not None:
b["C"] = ii
b["C_left"] = clmns[ii]["x0"]
b["C_right"] = clmns[ii]["x1"]
# Find spanning cell (SP)
ii = Recognizer.find_overlapped_with_threshold(b, spans, thr=0.3)
if ii is not None:
b["SP"] = ii
b["H_top"] = spans[ii]["top"]
b["H_bott"] = spans[ii]["bottom"]
b["H_left"] = spans[ii]["x0"]
b["H_right"] = spans[ii]["x1"]
Data Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ TABLE STRUCTURE DETECTION │
└─────────────────────────────────────────────────────────────────────────────┘
page_layout[] ───────────────────────────┐
(Table regions) │
▼
┌─────────────────────────┐
│ Crop Table Regions │
│ + MARGIN (10px) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ TableStructureRec() │
│ (YOLOv10) │
│ │
│ Detects: │
│ • table row │
│ • table column │
│ • table column header │
│ • table spanning cell │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Tag OCR Boxes │
│ │
│ • R (row index) │
│ • C (column index) │
│ • H (header index) │
│ • SP (spanning cell) │
└─────────────────────────┘
After this step, table boxes have:
{
"R": 0, # Row index
"R_top": 100, # Row top boundary
"R_bott": 150, # Row bottom boundary
"C": 1, # Column index
"C_left": 50, # Column left boundary
"C_right": 200, # Column right boundary
"H": 0, # Header row index (if header)
"SP": 2, # Spanning cell index (if spanning)
}
Step 5: Column Detection (_assign_column)
File: pdf_parser.py
Lines: 355-440
Algorithm Overview
K-Means Column Detection:
1. Group boxes by page
2. For each page:
a. Extract X0 coordinates
b. Normalize indented text (within 12% page width)
c. Try K from 1 to 4
d. Select K with highest silhouette score
3. Use majority voting for global column count
4. Final clustering with selected K
5. Remap cluster IDs to left-to-right order
Code Analysis
def _assign_column(self, boxes, zoomin=3):
"""
Detect number of columns using K-Means clustering.
"""
if not boxes:
return boxes
if all("col_id" in b for b in boxes):
return boxes
# ═══════════════════════════════════════════════════════════════════
# GROUP BOXES BY PAGE
# ═══════════════════════════════════════════════════════════════════
by_page = defaultdict(list)
for b in boxes:
by_page[b["page_number"]].append(b)
page_cols = {}
# ═══════════════════════════════════════════════════════════════════
# FOR EACH PAGE: FIND OPTIMAL K
# ═══════════════════════════════════════════════════════════════════
for pg, bxs in by_page.items():
if not bxs:
page_cols[pg] = 1
continue
x0s_raw = np.array([b["x0"] for b in bxs], dtype=float)
# Calculate page width
min_x0 = np.min(x0s_raw)
max_x1 = np.max([b["x1"] for b in bxs])
width = max_x1 - min_x0
# ═══════════════════════════════════════════════════════════════
# INDENT TOLERANCE: Normalize near-left-edge text
# ═══════════════════════════════════════════════════════════════
INDENT_TOL = width * 0.12 # 12% of page width
x0s = []
for x in x0s_raw:
if abs(x - min_x0) < INDENT_TOL:
x0s.append([min_x0]) # Snap to left edge
else:
x0s.append([x])
x0s = np.array(x0s, dtype=float)
# ═══════════════════════════════════════════════════════════════
# TRY K FROM 1 TO 4
# ═══════════════════════════════════════════════════════════════
max_try = min(4, len(bxs))
if max_try < 2:
max_try = 1
best_k = 1
best_score = -1
for k in range(1, max_try + 1):
km = KMeans(n_clusters=k, n_init="auto")
labels = km.fit_predict(x0s)
centers = np.sort(km.cluster_centers_.flatten())
if len(centers) > 1:
try:
score = silhouette_score(x0s, labels)
except ValueError:
continue
else:
score = 0
if score > best_score:
best_score = score
best_k = k
page_cols[pg] = best_k
logging.info(f"[Page {pg}] best_score={best_score:.2f}, best_k={best_k}")
# ═══════════════════════════════════════════════════════════════════
# MAJORITY VOTING FOR GLOBAL COLUMN COUNT
# ═══════════════════════════════════════════════════════════════════
global_cols = Counter(page_cols.values()).most_common(1)[0][0]
logging.info(f"Global column_num by majority: {global_cols}")
# ═══════════════════════════════════════════════════════════════════
# FINAL CLUSTERING WITH SELECTED K
# ═══════════════════════════════════════════════════════════════════
for pg, bxs in by_page.items():
if not bxs:
continue
k = page_cols[pg]
if len(bxs) < k:
k = 1
x0s = np.array([[b["x0"]] for b in bxs], dtype=float)
km = KMeans(n_clusters=k, n_init="auto")
labels = km.fit_predict(x0s)
# ═══════════════════════════════════════════════════════════════
# REMAP CLUSTER IDS: Left-to-right order
# ═══════════════════════════════════════════════════════════════
centers = km.cluster_centers_.flatten()
order = np.argsort(centers)
remap = {orig: new for new, orig in enumerate(order)}
for b, lb in zip(bxs, labels):
b["col_id"] = remap[lb]
return boxes
Visualization
Single column (k=1): Two columns (k=2):
┌────────────────────────────┐ ┌─────────────┬─────────────┐
│ Text text text text │ │ Col 0 │ Col 1 │
│ text text text text │ │ Text text │ Text text │
│ text text text text │ │ text text │ text text │
│ text text text text │ │ text text │ text text │
└────────────────────────────┘ └─────────────┴─────────────┘
col_id = 0 col_id = 0 col_id = 1
X coordinates: X coordinates:
[50, 52, 48, 51, ...] [50, 52, 300, 302, 49, 301, ...]
↓ K-Means ↓ K-Means
k=1, all → 0 k=2, cluster 0 → 0, cluster 1 → 1
Step 6: Text Merge (Horizontal) (_text_merge)
File: pdf_parser.py
Lines: 442-478
Algorithm
Horizontal Merge Conditions:
1. Same page
2. Same column (col_id)
3. Same layout (layoutno)
4. Not table/figure/equation
5. Y distance < mean_height / 3
Code Analysis
def _text_merge(self, zoomin=3):
"""
Merge horizontally adjacent boxes with same layout.
"""
bxs = self._assign_column(self.boxes, zoomin) # Ensure col_id assigned
# Helper functions
def end_with(b, txt):
txt = txt.strip()
tt = b.get("text", "").strip()
return tt and tt.find(txt) == len(tt) - len(txt)
def start_with(b, txts):
tt = b.get("text", "").strip()
return tt and any([tt.find(t.strip()) == 0 for t in txts])
# ═══════════════════════════════════════════════════════════════════
# HORIZONTAL MERGE LOOP
# ═══════════════════════════════════════════════════════════════════
i = 0
while i < len(bxs) - 1:
b = bxs[i]
b_ = bxs[i + 1]
# Skip if different page or column
if b["page_number"] != b_["page_number"]:
i += 1
continue
if b.get("col_id") != b_.get("col_id"):
i += 1
continue
# Skip if different layout or special type
if b.get("layoutno", "0") != b_.get("layoutno", "1"):
i += 1
continue
if b.get("layout_type", "") in ["table", "figure", "equation"]:
i += 1
continue
# Check Y distance
y_dis = abs(self._y_dis(b, b_))
threshold = self.mean_height[bxs[i]["page_number"] - 1] / 3
if y_dis < threshold:
# ═══════════════════════════════════════════════════════════
# MERGE: Expand box to include next
# ═══════════════════════════════════════════════════════════
bxs[i]["x1"] = b_["x1"] # Extend right edge
bxs[i]["top"] = (b["top"] + b_["top"]) / 2 # Average top
bxs[i]["bottom"] = (b["bottom"] + b_["bottom"]) / 2 # Average bottom
bxs[i]["text"] += b_["text"] # Concatenate text
bxs.pop(i + 1) # Remove merged box
continue # Check if can merge more
i += 1
self.boxes = bxs
Visualization
Before horizontal merge:
┌──────┐ ┌──────┐ ┌──────┐
│Hello │ │World │ │! │ (same line, same layout)
└──────┘ └──────┘ └──────┘
After horizontal merge:
┌────────────────────────┐
│Hello World! │
└────────────────────────┘
Step 7: Text Merge (Vertical) (_naive_vertical_merge)
File: pdf_parser.py
Lines: 480-556
Algorithm
Vertical Merge Conditions:
1. Same page and column
2. Same layout (layoutno)
3. Y distance < 1.5 * mean_height
4. Horizontal overlap > 30%
5. Semantic checks (punctuation, text patterns)
Code Analysis
def _naive_vertical_merge(self, zoomin=3):
"""
Merge vertically adjacent boxes within same layout.
"""
bxs = self._assign_column(self.boxes, zoomin)
# ═══════════════════════════════════════════════════════════════════
# GROUP BY PAGE AND COLUMN
# ═══════════════════════════════════════════════════════════════════
grouped = defaultdict(list)
for b in bxs:
grouped[(b["page_number"], b.get("col_id", 0))].append(b)
merged_boxes = []
for (pg, col), bxs in grouped.items():
# Sort by top-to-bottom, left-to-right
bxs = sorted(bxs, key=lambda x: (x["top"], x["x0"]))
if not bxs:
continue
mh = self.mean_height[pg - 1] if self.mean_height else 10
i = 0
while i + 1 < len(bxs):
b = bxs[i]
b_ = bxs[i + 1]
# ═══════════════════════════════════════════════════════════
# SKIP CONDITIONS
# ═══════════════════════════════════════════════════════════
# Remove page numbers at page boundaries
if b["page_number"] < b_["page_number"]:
if re.match(r"[0-9 •一—-]+$", b["text"]):
bxs.pop(i)
continue
# Skip empty text
if not b["text"].strip():
bxs.pop(i)
continue
# Skip different layouts
if b.get("layoutno") != b_.get("layoutno"):
i += 1
continue
# Skip if too far apart vertically
if b_["top"] - b["bottom"] > mh * 1.5:
i += 1
continue
# ═══════════════════════════════════════════════════════════
# CHECK HORIZONTAL OVERLAP
# ═══════════════════════════════════════════════════════════
overlap = max(0, min(b["x1"], b_["x1"]) - max(b["x0"], b_["x0"]))
min_width = min(b["x1"] - b["x0"], b_["x1"] - b_["x0"])
if overlap / max(1, min_width) < 0.3:
i += 1
continue
# ═══════════════════════════════════════════════════════════
# SEMANTIC ANALYSIS
# ═══════════════════════════════════════════════════════════
# Features favoring concatenation
concatting_feats = [
b["text"].strip()[-1] in ",;:'\",、'";:-", # Ends with continuation punct
len(b["text"].strip()) > 1 and
b["text"].strip()[-2] in ",;:'\",'"、;:",
b_["text"].strip() and
b_["text"].strip()[0] in "。;?!?")),,、:", # Starts with ending punct
]
# Features preventing concatenation
feats = [
b.get("layoutno", 0) != b_.get("layoutno", 0), # Different layout
b["text"].strip()[-1] in "。?!?", # Sentence end
self.is_english and b["text"].strip()[-1] in ".!?",
b["page_number"] == b_["page_number"] and
b_["top"] - b["bottom"] > mh * 1.5, # Too far
b["page_number"] < b_["page_number"] and
abs(b["x0"] - b_["x0"]) > self.mean_width[b["page_number"] - 1] * 4,
]
# Features for definite split
detach_feats = [
b["x1"] < b_["x0"], # No horizontal overlap at all
b["x0"] > b_["x1"],
]
# ═══════════════════════════════════════════════════════════
# DECISION
# ═══════════════════════════════════════════════════════════
if (any(feats) and not any(concatting_feats)) or any(detach_feats):
i += 1
continue
# ═══════════════════════════════════════════════════════════
# MERGE
# ═══════════════════════════════════════════════════════════
b["text"] = (b["text"].rstrip() + " " + b_["text"].lstrip()).strip()
b["bottom"] = b_["bottom"]
b["x0"] = min(b["x0"], b_["x0"])
b["x1"] = max(b["x1"], b_["x1"])
bxs.pop(i + 1)
merged_boxes.extend(bxs)
self.boxes = sorted(merged_boxes, key=lambda x: (x["page_number"], x.get("col_id", 0), x["top"]))
Visualization
Before vertical merge:
┌────────────────────────┐
│This is paragraph one │
└────────────────────────┘
┌────────────────────────┐
│that continues here and │
└────────────────────────┘
┌────────────────────────┐
│ends with this line. │
└────────────────────────┘
After vertical merge:
┌────────────────────────┐
│This is paragraph one │
│that continues here and │
│ends with this line. │
└────────────────────────┘
Step 8: Filter & Cleanup
8.1 _filter_forpages
Lines: 685-729
def _filter_forpages(self):
"""
Remove table of contents and dirty pages.
"""
if not self.boxes:
return
# ═══════════════════════════════════════════════════════════════════
# DETECT AND REMOVE TABLE OF CONTENTS
# ═══════════════════════════════════════════════════════════════════
findit = False
i = 0
while i < len(self.boxes):
# Check for TOC headers
text_lower = re.sub(r"( | |\u3000)+", "", self.boxes[i]["text"].lower())
if not re.match(r"(contents|目录|目次|table of contents|致谢|acknowledge)$", text_lower):
i += 1
continue
findit = True
eng = re.match(r"[0-9a-zA-Z :'.-]{5,}", self.boxes[i]["text"].strip())
self.boxes.pop(i) # Remove TOC header
if i >= len(self.boxes):
break
# Get prefix of first TOC entry
prefix = self.boxes[i]["text"].strip()[:3] if not eng else \
" ".join(self.boxes[i]["text"].strip().split()[:2])
# Remove empty entries
while not prefix:
self.boxes.pop(i)
if i >= len(self.boxes):
break
prefix = self.boxes[i]["text"].strip()[:3] if not eng else \
" ".join(self.boxes[i]["text"].strip().split()[:2])
self.boxes.pop(i)
if i >= len(self.boxes) or not prefix:
break
# Remove entries matching TOC pattern
for j in range(i, min(i + 128, len(self.boxes))):
if not re.match(prefix, self.boxes[j]["text"]):
continue
for k in range(i, j):
self.boxes.pop(i)
break
if findit:
return
# ═══════════════════════════════════════════════════════════════════
# DETECT AND REMOVE DIRTY PAGES
# ═══════════════════════════════════════════════════════════════════
page_dirty = [0] * len(self.page_images)
for b in self.boxes:
# Count repetitive patterns (common in scanned TOC)
if re.search(r"(··|··|··)", b["text"]):
page_dirty[b["page_number"] - 1] += 1
# Pages with >3 repetitive patterns are dirty
page_dirty = set([i + 1 for i, t in enumerate(page_dirty) if t > 3])
if not page_dirty:
return
# Remove all boxes from dirty pages
i = 0
while i < len(self.boxes):
if self.boxes[i]["page_number"] in page_dirty:
self.boxes.pop(i)
continue
i += 1
Step 9: Extract Tables & Figures (_extract_table_figure)
File: pdf_parser.py
Lines: 757-930
Code Analysis
def _extract_table_figure(self, need_image, ZM, return_html, need_position,
separate_tables_figures=False):
"""
Extract tables and figures from detected layouts.
"""
tables = {}
figures = {}
# ═══════════════════════════════════════════════════════════════════
# STEP 9.1: SEPARATE TABLE AND FIGURE BOXES
# ═══════════════════════════════════════════════════════════════════
i = 0
lst_lout_no = ""
nomerge_lout_no = []
while i < len(self.boxes):
if "layoutno" not in self.boxes[i]:
i += 1
continue
lout_no = f"{self.boxes[i]['page_number']}-{self.boxes[i]['layoutno']}"
# Mark captions as non-mergeable
if (TableStructureRecognizer.is_caption(self.boxes[i]) or
self.boxes[i]["layout_type"] in ["table caption", "title",
"figure caption", "reference"]):
nomerge_lout_no.append(lst_lout_no)
# ═══════════════════════════════════════════════════════════════
# EXTRACT TABLE BOXES
# ═══════════════════════════════════════════════════════════════
if self.boxes[i]["layout_type"] == "table":
# Skip source citations
if re.match(r"(数据|资料|图表)*来源[:: ]", self.boxes[i]["text"]):
self.boxes.pop(i)
continue
if lout_no not in tables:
tables[lout_no] = []
tables[lout_no].append(self.boxes[i])
self.boxes.pop(i)
lst_lout_no = lout_no
continue
# ═══════════════════════════════════════════════════════════════
# EXTRACT FIGURE BOXES
# ═══════════════════════════════════════════════════════════════
if need_image and self.boxes[i]["layout_type"] == "figure":
if re.match(r"(数据|资料|图表)*来源[:: ]", self.boxes[i]["text"]):
self.boxes.pop(i)
continue
if lout_no not in figures:
figures[lout_no] = []
figures[lout_no].append(self.boxes[i])
self.boxes.pop(i)
lst_lout_no = lout_no
continue
i += 1
# ═══════════════════════════════════════════════════════════════════
# STEP 9.2: MERGE CROSS-PAGE TABLES
# ═══════════════════════════════════════════════════════════════════
nomerge_lout_no = set(nomerge_lout_no)
tbls = sorted([(k, bxs) for k, bxs in tables.items()],
key=lambda x: (x[1][0]["top"], x[1][0]["x0"]))
i = len(tbls) - 1
while i - 1 >= 0:
k0, bxs0 = tbls[i - 1]
k, bxs = tbls[i]
i -= 1
if k0 in nomerge_lout_no:
continue
if bxs[0]["page_number"] == bxs0[0]["page_number"]:
continue
if bxs[0]["page_number"] - bxs0[0]["page_number"] > 1:
continue
mh = self.mean_height[bxs[0]["page_number"] - 1]
if self._y_dis(bxs0[-1], bxs[0]) > mh * 23:
continue
# Merge tables
tables[k0].extend(tables[k])
del tables[k]
# ═══════════════════════════════════════════════════════════════════
# STEP 9.3: ASSOCIATE CAPTIONS WITH TABLES/FIGURES
# ═══════════════════════════════════════════════════════════════════
i = 0
while i < len(self.boxes):
c = self.boxes[i]
if not TableStructureRecognizer.is_caption(c):
i += 1
continue
# Find nearest table/figure
def nearest(tbls):
mink, minv = "", float('inf')
for k, bxs in tbls.items():
for b in bxs:
if b.get("layout_type", "").find("caption") >= 0:
continue
y_dis = self._y_dis(c, b)
x_dis = self._x_dis(c, b) if not x_overlapped(c, b) else 0
dis = y_dis**2 + x_dis**2
if dis < minv:
mink, minv = k, dis
return mink, minv
tk, tv = nearest(tables)
fk, fv = nearest(figures)
if tv < fv and tk:
tables[tk].insert(0, c)
elif fk:
figures[fk].insert(0, c)
self.boxes.pop(i)
# ═══════════════════════════════════════════════════════════════════
# STEP 9.4: CONSTRUCT TABLE OUTPUT
# ═══════════════════════════════════════════════════════════════════
res = []
for k, bxs in tables.items():
if not bxs:
continue
bxs = Recognizer.sort_Y_firstly(bxs, np.mean([(b["bottom"] - b["top"]) / 2 for b in bxs]))
poss = []
# Crop table image
img = cropout(bxs, "table", poss)
# Construct table content (HTML or descriptive)
content = self.tbl_det.construct_table(
bxs,
html=return_html,
is_english=self.is_english
)
res.append((img, content))
return res
Table Construction Flow
Table boxes with R/C/H/SP attributes
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TableStructureRecognizer.construct_table() │
│ │
│ 1. Sort by row (R attribute) │
│ 2. Group into rows │
│ 3. Sort each row by column (C attribute) │
│ 4. Build 2D table matrix │
│ 5. Handle spanning cells (SP attribute) │
│ 6. Generate output format │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌───────┴───────┐
│ │
▼ ▼
HTML Output Descriptive Output
HTML:
<table>
<caption>Table 1: Data</caption>
<tr><th>Name</th><th>Value</th></tr>
<tr><td>Item 1</td><td>100</td></tr>
</table>
Descriptive:
Name: Item 1; Value: 100
Name: Item 2; Value: 200
(from "Table 1: Data")
Step 10: Final Output (__filterout_scraps)
File: pdf_parser.py
Lines: 971-1029
Code Analysis
def __filterout_scraps(self, boxes, ZM):
"""
Filter low-quality text blocks and format final output.
"""
def width(b):
return b["x1"] - b["x0"]
def height(b):
return b["bottom"] - b["top"]
def usefull(b):
"""Check if box is useful."""
if b.get("layout_type"):
return True
# Width > 1/3 page width
if width(b) > self.page_images[b["page_number"] - 1].size[0] / ZM / 3:
return True
# Height > mean character height
if height(b) > self.mean_height[b["page_number"] - 1]:
return True
return False
res = []
while boxes:
lines = []
widths = []
pw = self.page_images[boxes[0]["page_number"] - 1].size[0] / ZM
mh = self.mean_height[boxes[0]["page_number"] - 1]
mj = self.proj_match(boxes[0]["text"]) or \
boxes[0].get("layout_type", "") == "title"
# ═══════════════════════════════════════════════════════════════
# DFS TO FIND CONNECTED LINES
# ═══════════════════════════════════════════════════════════════
def dfs(line, st):
nonlocal mh, pw, lines, widths
lines.append(line)
widths.append(width(line))
mmj = self.proj_match(line["text"]) or \
line.get("layout_type", "") == "title"
for i in range(st + 1, min(st + 20, len(boxes))):
# Stop at page boundary
if boxes[i]["page_number"] - line["page_number"] > 0:
break
# Stop if too far vertically
if not mmj and self._y_dis(line, boxes[i]) >= 3 * mh and \
height(line) < 1.5 * mh:
break
if not usefull(boxes[i]):
continue
# Check horizontal proximity
if mmj or (self._x_dis(boxes[i], line) < pw / 10):
dfs(boxes[i], i)
boxes.pop(i)
break
try:
if usefull(boxes[0]):
dfs(boxes[0], 0)
else:
logging.debug("WASTE: " + boxes[0]["text"])
except:
pass
boxes.pop(0)
# ═══════════════════════════════════════════════════════════════
# FILTER AND FORMAT OUTPUT
# ═══════════════════════════════════════════════════════════════
mw = np.mean(widths)
if mj or mw / pw >= 0.35 or mw > 200:
# Add position tags to each line
result = "\n".join([
c["text"] + self._line_tag(c, ZM)
for c in lines
])
res.append(result)
else:
logging.debug("REMOVED: " + "<<".join([c["text"] for c in lines]))
return "\n\n".join(res)
Position Tag Format
def _line_tag(self, bx, ZM):
"""
Generate position tag for a text box.
Format: @@{page_numbers}\t{x0}\t{x1}\t{top}\t{bottom}##
Example: @@1-2\t50.0\t450.0\t100.0\t120.0##
(Text spans pages 1-2, coordinates in original scale)
"""
pn = [bx["page_number"]]
top = bx["top"] - self.page_cum_height[pn[0] - 1]
bott = bx["bottom"] - self.page_cum_height[pn[0] - 1]
# Handle multi-page spanning
while bott * ZM > self.page_images[pn[-1] - 1].size[1]:
bott -= self.page_images[pn[-1] - 1].size[1] / ZM
pn.append(pn[-1] + 1)
return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format(
"-".join([str(p) for p in pn]),
bx["x0"], bx["x1"], top, bott
)
Final Output Format
# Return value of __call__:
(
# documents: str (paragraphs separated by \n\n)
"Paragraph 1 text@@1\t50.0\t450.0\t100.0\t150.0##\n\n"
"Paragraph 2 text@@1\t50.0\t450.0\t200.0\t250.0##\n\n"
"...",
# tables: List[Tuple[PIL.Image, str|List[str]]]
[
(table_image_1, "<table>...</table>"),
(table_image_2, ["desc line 1", "desc line 2"]),
]
)
Tổng Kết
Complete Pipeline Summary
| Step | Method | Lines | Input | Output |
|---|---|---|---|---|
| 1 | __init__ |
52-105 | - | Models loaded |
| 2 | __images__ |
1042-1159 | PDF file | boxes[], page_images[] |
| 3 | _layouts_rec |
347-353 | page_images, boxes | boxes with layout_type |
| 4 | _table_transformer_job |
196-281 | page_images, boxes | boxes with R/C/H/SP |
| 5 | _assign_column |
355-440 | boxes | boxes with col_id |
| 6 | _text_merge |
442-478 | boxes | merged boxes (horizontal) |
| 7 | _naive_vertical_merge |
480-556 | boxes | merged boxes (vertical) |
| 8 | _filter_forpages |
685-729 | boxes | cleaned boxes |
| 9 | _extract_table_figure |
757-930 | boxes | tables[], figures[] |
| 10 | __filterout_scraps |
971-1029 | boxes | formatted text |
Key Data Structures
# Box structure throughout pipeline
{
# Basic (from OCR)
"x0": float, # Left edge
"x1": float, # Right edge
"top": float, # Top edge (cumulative Y)
"bottom": float, # Bottom edge (cumulative Y)
"text": str, # Recognized text
"page_number": int, # 1-indexed page
# From layout recognition (Step 3)
"layout_type": str, # "text", "title", "table", "figure"...
"layoutno": int, # Layout region ID
# From table detection (Step 4)
"R": int, # Row index
"C": int, # Column index
"H": int, # Header row index
"SP": int, # Spanning cell index
# From column detection (Step 5)
"col_id": int, # Column ID (0-based)
}