# 05-DOCUMENT-PROCESSING - Document Parsing Pipeline

## Tong Quan

Document Processing pipeline chuyển đổi raw documents thành searchable chunks với layout analysis, OCR, và intelligent chunking.

## Kien Truc Document Processing

```
┌─────────────────────────────────────────────────────────────────┐
│                 DOCUMENT PROCESSING PIPELINE                     │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   File Upload   │────▶│  Task Creation  │────▶│  Task Queue     │
│   (MinIO)       │     │   (MySQL)       │     │   (Redis)       │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                     TASK EXECUTOR                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  1. Download file from MinIO                             │   │
│  │  2. Select parser based on file type                     │   │
│  │  3. Execute parsing pipeline                             │   │
│  │  4. Generate embeddings                                  │   │
│  │  5. Index in Elasticsearch                               │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   PDF Parser    │  │  Office Parser  │  │  Text Parser    │
│                 │  │                 │  │                 │
│ - Layout detect │  │ - DOCX/XLSX    │  │ - TXT/MD/CSV   │
│ - OCR          │  │ - Table extract │  │ - Direct chunk  │
│ - Table struct  │  │ - Image embed   │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

## Cau Truc Thu Muc

```
/rag/
├── svr/
│   └── task_executor.py    # Main task executor ⭐
├── app/
│   ├── naive.py           # Document parsing logic
│   ├── paper.py           # Academic paper parser
│   ├── qa.py              # Q&A document parser
│   └── table.py           # Structured table parser
├── flow/
│   ├── parser/            # Document parsers
│   ├── splitter/          # Chunking logic
│   └── tokenizer/         # Tokenization
└── nlp/
    └── __init__.py        # naive_merge() chunking

/deepdoc/
├── parser/
│   └── pdf_parser.py      # RAGFlow PDF parser ⭐
├── vision/
│   ├── ocr.py            # PaddleOCR integration
│   ├── layout_recognizer.py  # Detectron2 layout
│   └── table_structure_recognizer.py  # TSR
└── images/
    └── ...               # Image processing
```

## Files Trong Module Nay

| File | Mo Ta |
|------|-------|
| [task_executor_analysis.md](./task_executor_analysis.md) | Task execution pipeline |
| [pdf_parsing.md](./pdf_parsing.md) | PDF parsing với layout analysis |
| [ocr_pipeline.md](./ocr_pipeline.md) | OCR với PaddleOCR |
| [layout_detection.md](./layout_detection.md) | Detectron2 layout recognition |
| [table_extraction.md](./table_extraction.md) | Table structure recognition |
| [file_type_handlers.md](./file_type_handlers.md) | Handler cho từng file type |

## Processing Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                    PDF PROCESSING PIPELINE                       │
└─────────────────────────────────────────────────────────────────┘

                    PDF Binary Input
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  1. IMAGE EXTRACTION (0-40%)                                     │
│     pdfplumber → PIL Images (3x zoom)                           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  2. OCR DETECTION (40-63%)                                       │
│     PaddleOCR → Bounding boxes + Text                           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  3. LAYOUT RECOGNITION (63-83%)                                  │
│     Detectron2 → Layout types (Text, Title, Table, Figure)      │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  4. TABLE STRUCTURE (TSR)                                        │
│     TableTransformer → Rows, Columns, Cells                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  5. TEXT MERGING                                                 │
│     ML-based vertical merge (XGBoost)                           │
│     Column detection (KMeans)                                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  6. CHUNKING                                                     │
│     naive_merge() → Token-based chunks                          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  7. EMBEDDING + INDEXING                                         │
│     Vector generation → Elasticsearch                           │
└─────────────────────────────────────────────────────────────────┘
```

## Supported File Types

| Category | Extensions | Parser |
|----------|------------|--------|
| **PDF** | .pdf | RAGFlowPdfParser, PlainParser, VisionParser |
| **Office** | .docx, .xlsx, .pptx | python-docx, openpyxl |
| **Text** | .txt, .md, .csv | Direct reading |
| **Images** | .jpg, .png, .tiff | Vision LLM |
| **Email** | .eml | Email parser |
| **Web** | .html | Beautiful Soup |

## Layout Types Detected

| Type | Description |
|------|-------------|
| Text | Regular body text |
| Title | Section/document titles |
| Figure | Images and diagrams |
| Figure caption | Figure descriptions |
| Table | Data tables |
| Table caption | Table descriptions |
| Header | Page headers |
| Footer | Page footers |
| Reference | Bibliography |
| Equation | Mathematical equations |

## Key Algorithms

### Text Merging (XGBoost)
```
Features:
- Y-distance normalized by char height
- Same layout number
- Ending punctuation patterns
- Beginning character patterns
- Chinese numbering patterns

Output: Merge probability → threshold decision
```

### Column Detection (KMeans)
```
Input: X-coordinates of text boxes
Output: Optimal column assignments

Algorithm:
1. For k = 1 to max_columns:
   - Fit KMeans(k)
   - Calculate silhouette_score
2. Select k with best score
3. Assign boxes to columns
```

## Configuration

```python
parser_config = {
    "chunk_token_num": 512,           # Tokens per chunk
    "delimiter": "\n。；！？",         # Chunk boundaries
    "layout_recognize": "DeepDOC",    # Layout method
    "task_page_size": 12,             # Pages per task
}

# Task executor config
MAX_CONCURRENT_TASKS = 5
EMBEDDING_BATCH_SIZE = 16
DOC_BULK_SIZE = 64
```

## Related Files

- `/rag/svr/task_executor.py` - Main executor
- `/deepdoc/parser/pdf_parser.py` - PDF parsing
- `/deepdoc/vision/ocr.py` - OCR engine
- `/rag/nlp/__init__.py` - Chunking algorithms