Add detailed analysis documentation for RAGFlow's document processing pipeline:

- README.md: Overview and architecture diagram
- task_executor_analysis.md: Task execution pipeline details
- pdf_parsing.md: PDF parsing with layout analysis
- ocr_pipeline.md: PaddleOCR integration and text detection
- layout_detection.md: Detectron2 layout recognition
- table_extraction.md: Table structure recognition (TSR)
- file_type_handlers.md: Handlers for all supported file types

These documents explain the document processing flow for newcomers to understand how RAGFlow handles various file formats.

# 05-DOCUMENT-PROCESSING - Document Parsing Pipeline

## Overview

The document processing pipeline converts raw documents into searchable chunks through layout analysis, OCR, and intelligent chunking.

## Document Processing Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                  DOCUMENT PROCESSING PIPELINE                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   File Upload   │────▶│  Task Creation  │────▶│   Task Queue    │
│     (MinIO)     │     │     (MySQL)     │     │     (Redis)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                          TASK EXECUTOR                          │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  1. Download file from MinIO                            │   │
│   │  2. Select parser based on file type                    │   │
│   │  3. Execute parsing pipeline                            │   │
│   │  4. Generate embeddings                                 │   │
│   │  5. Index in Elasticsearch                              │   │
│   └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                 │
           ┌─────────────────────┼─────────────────────┐
           │                     │                     │
           ▼                     ▼                     ▼
  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
  │   PDF Parser    │   │  Office Parser  │   │   Text Parser   │
  │                 │   │                 │   │                 │
  │ - Layout detect │   │ - DOCX/XLSX     │   │ - TXT/MD/CSV    │
  │ - OCR           │   │ - Table extract │   │ - Direct chunk  │
  │ - Table struct  │   │ - Image embed   │   │                 │
  └─────────────────┘   └─────────────────┘   └─────────────────┘
```

## Directory Structure

```
/rag/
├── svr/
│   └── task_executor.py           # Main task executor ⭐
├── app/
│   ├── naive.py                   # Document parsing logic
│   ├── paper.py                   # Academic paper parser
│   ├── qa.py                      # Q&A document parser
│   └── table.py                   # Structured table parser
├── flow/
│   ├── parser/                    # Document parsers
│   ├── splitter/                  # Chunking logic
│   └── tokenizer/                 # Tokenization
└── nlp/
    └── __init__.py                # naive_merge() chunking

/deepdoc/
├── parser/
│   └── pdf_parser.py              # RAGFlow PDF parser ⭐
├── vision/
│   ├── ocr.py                     # PaddleOCR integration
│   ├── layout_recognizer.py       # Detectron2 layout
│   └── table_structure_recognizer.py  # TSR
└── images/
    └── ...                        # Image processing
```

## Files in This Module

| File | Description |
|------|-------------|
| [task_executor_analysis.md](./task_executor_analysis.md) | Task execution pipeline |
| [pdf_parsing.md](./pdf_parsing.md) | PDF parsing with layout analysis |
| [ocr_pipeline.md](./ocr_pipeline.md) | OCR with PaddleOCR |
| [layout_detection.md](./layout_detection.md) | Detectron2 layout recognition |
| [table_extraction.md](./table_extraction.md) | Table structure recognition |
| [file_type_handlers.md](./file_type_handlers.md) | Handlers for each file type |

## Processing Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                     PDF PROCESSING PIPELINE                     │
└─────────────────────────────────────────────────────────────────┘

                   PDF Binary Input
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  1. IMAGE EXTRACTION (0-40%)                                    │
│     pdfplumber → PIL Images (3x zoom)                           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  2. OCR DETECTION (40-63%)                                      │
│     PaddleOCR → Bounding boxes + Text                           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  3. LAYOUT RECOGNITION (63-83%)                                 │
│     Detectron2 → Layout types (Text, Title, Table, Figure)      │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  4. TABLE STRUCTURE (TSR)                                       │
│     TableTransformer → Rows, Columns, Cells                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  5. TEXT MERGING                                                │
│     ML-based vertical merge (XGBoost)                           │
│     Column detection (KMeans)                                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  6. CHUNKING                                                    │
│     naive_merge() → Token-based chunks                          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  7. EMBEDDING + INDEXING                                        │
│     Vector generation → Elasticsearch                           │
└─────────────────────────────────────────────────────────────────┘
```

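The percentage ranges on the first three stages suggest that each stage reports progress inside its own slice of the overall progress bar. The following is a minimal sketch of that mapping, not RAGFlow's actual code: the stage keys and the `overall_progress` helper are hypothetical names, and stages 4 to 7 are left out because the diagram gives no range for them.

```python
# Progress slices copied from the diagram; the dictionary keys are hypothetical.
STAGE_PROGRESS = {
    "image_extraction": (0.00, 0.40),
    "ocr_detection": (0.40, 0.63),
    "layout_recognition": (0.63, 0.83),
}


def overall_progress(stage: str, done: int, total: int) -> float:
    """Map per-stage progress (done out of total items) onto the overall 0..1 scale."""
    start, end = STAGE_PROGRESS[stage]
    # Clamp the local fraction so a miscounted page cannot push progress out of range.
    fraction = 0.0 if total <= 0 else min(max(done / total, 0.0), 1.0)
    return round(start + (end - start) * fraction, 4)


if __name__ == "__main__":
    # Halfway through OCR on a 10-page document.
    print(overall_progress("ocr_detection", 5, 10))
```

Keeping the ranges in one table makes it easy to rebalance the slices if one stage turns out to dominate the runtime.
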
## Supported File Types

| Category | Extensions | Parser |
|----------|------------|--------|
| **PDF** | .pdf | RAGFlowPdfParser, PlainParser, VisionParser |
| **Office** | .docx, .xlsx, .pptx | python-docx, openpyxl |
| **Text** | .txt, .md, .csv | Direct reading |
| **Images** | .jpg, .png, .tiff | Vision LLM |
| **Email** | .eml | Email parser |
| **Web** | .html | Beautiful Soup |

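Selecting a parser from the file type amounts to a lookup on the file extension. The sketch below illustrates the idea with the categories from the table above; `PARSER_BY_EXTENSION` and `select_parser` are hypothetical names for illustration and do not reflect how the task executor is actually written.

```python
from pathlib import Path

# Hypothetical lookup table: extension -> parser category from the table above.
PARSER_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "office", ".xlsx": "office", ".pptx": "office",
    ".txt": "text", ".md": "text", ".csv": "text",
    ".jpg": "image", ".png": "image", ".tiff": "image",
    ".eml": "email",
    ".html": "web",
}


def select_parser(filename: str) -> str:
    """Return the parser category for a file name (suffix matched case-insensitively)."""
    suffix = Path(filename).suffix.lower()
    try:
        return PARSER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {filename!r}") from None


if __name__ == "__main__":
    for name in ("Report.PDF", "notes.md", "inbox.eml"):
        print(name, "->", select_parser(name))
```

Raising on an unknown extension, rather than silently falling back to plain text, makes unsupported uploads visible early in the pipeline.
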
## Layout Types Detected

| Type | Description |
|------|-------------|
| Text | Regular body text |
| Title | Section/document titles |
| Figure | Images and diagrams |
| Figure caption | Figure descriptions |
| Table | Data tables |
| Table caption | Table descriptions |
| Header | Page headers |
| Footer | Page footers |
| Reference | Bibliography |
| Equation | Mathematical equations |

## Key Algorithms

### Text Merging (XGBoost)

```
Features:
- Y-distance normalized by char height
- Same layout number
- Ending punctuation patterns
- Beginning character patterns
- Chinese numbering patterns

Output: Merge probability → threshold decision
```

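To make the feature list concrete, here is a rough sketch of how such features could be computed for two vertically adjacent text boxes. Everything in it is an assumption for illustration: the `TextBox` fields, the regular expressions, and the 0.5 threshold are hypothetical, and the trained XGBoost model that would turn the feature vector into a merge probability is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    top: float       # y-coordinate of the upper edge
    bottom: float    # y-coordinate of the lower edge
    layout_no: int   # index of the layout region the box belongs to


# Hypothetical patterns, one per feature family in the list above.
ENDS_SENTENCE = re.compile(r"[。?!;.?!;]$")
STARTS_CONTINUATION = re.compile(r"^[a-z,,、。)\])]")
CHINESE_NUMBERING = re.compile(r"^[一二三四五六七八九十]+[、.]|^[((][一二三四五六七八九十]+[))]")


def merge_features(up: TextBox, down: TextBox) -> list[float]:
    """Build a feature vector describing whether `down` continues `up`."""
    char_height = max(up.bottom - up.top, 1e-6)  # avoid division by zero
    up_text, down_text = up.text.strip(), down.text.strip()
    return [
        (down.top - up.bottom) / char_height,            # normalized Y-distance
        float(up.layout_no == down.layout_no),           # same layout number
        float(bool(ENDS_SENTENCE.search(up_text))),      # ending punctuation
        float(bool(STARTS_CONTINUATION.search(down_text))),  # beginning character
        float(bool(CHINESE_NUMBERING.search(down_text))),    # Chinese numbering
    ]


def should_merge(probability: float, threshold: float = 0.5) -> bool:
    """Threshold decision on a classifier's merge probability (threshold is assumed)."""
    return probability >= threshold
```

A small gap, a shared layout region, no sentence-ending punctuation above, and a lowercase start below all point toward merging; a numbering prefix on the lower box points toward a new paragraph.
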
### Column Detection (KMeans)

```
Input:  X-coordinates of text boxes
Output: Optimal column assignments

Algorithm:
1. For k = 1 to max_columns:
   - Fit KMeans(k)
   - Calculate silhouette_score (defined only for k ≥ 2)
2. Select k with best score
3. Assign boxes to columns
```

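The selection loop can be sketched without external dependencies. The code below is a simplified stand-in, not the pipeline's implementation: it replaces scikit-learn's `KMeans` and `silhouette_score` with small pure-Python equivalents for 1-D data, and `max_columns=4` and `min_score=0.5` are assumed values. Because the silhouette score needs at least two clusters, the single-column case is handled as the fallback when no k ≥ 2 scores above `min_score`.

```python
from statistics import mean


def kmeans_1d(xs: list[float], k: int, iters: int = 50) -> list[int]:
    """Lloyd's algorithm on 1-D points; initial centers are evenly spaced quantiles."""
    ordered = sorted(xs)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels


def silhouette(xs: list[float], labels: list[int]) -> float:
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters: dict[int, list[float]] = {}
    for x, lab in zip(xs, labels):
        clusters.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(xs, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(abs(x - y) for y in own) / (len(own) - 1)   # cohesion
        b = min(mean(abs(x - y) for y in other)             # separation
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def assign_columns(x0s: list[float], max_columns: int = 4,
                   min_score: float = 0.5) -> list[int]:
    """Pick the k with the best silhouette; fall back to a single column."""
    best_labels, best_score = [0] * len(x0s), min_score
    for k in range(2, min(max_columns, len(set(x0s))) + 1):
        labels = kmeans_1d(x0s, k)
        if len(set(labels)) < 2:
            continue
        score = silhouette(x0s, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels


if __name__ == "__main__":
    # Left edges of six text boxes laid out in two columns.
    print(assign_columns([50, 52, 51, 300, 302, 301]))
```

Note that the silhouette score is scale-invariant, so tightly aligned boxes in a single column can still split cleanly into two clusters; a production version would likely also require a minimum horizontal gap between column centers.
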
## Configuration

```python
parser_config = {
    "chunk_token_num": 512,         # Tokens per chunk
    "delimiter": "\n。;!?",          # Chunk boundaries
    "layout_recognize": "DeepDOC",  # Layout method
    "task_page_size": 12,           # Pages per task
}

# Task executor config
MAX_CONCURRENT_TASKS = 5
EMBEDDING_BATCH_SIZE = 16
DOC_BULK_SIZE = 64
```

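The `chunk_token_num` and `delimiter` settings work together: text is cut at delimiter characters, and the resulting segments are packed greedily into chunks up to the token budget. The sketch below illustrates that idea only; it is not RAGFlow's `naive_merge()`. The function names are hypothetical, and `count_tokens` is a whitespace-based stand-in for a real tokenizer.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def merge_into_chunks(text: str, chunk_token_num: int = 512,
                      delimiter: str = "\n。;!?") -> list[str]:
    """Split after each delimiter character, then pack segments up to the token budget."""
    # Zero-width split point right after any delimiter, so the delimiter stays
    # attached to the segment it terminates.
    pattern = f"(?<=[{re.escape(delimiter)}])"
    segments = [s for s in re.split(pattern, text) if s.strip()]

    chunks: list[str] = []
    current, current_tokens = "", 0
    for segment in segments:
        n = count_tokens(segment)
        # Close the current chunk if adding this segment would exceed the budget.
        if current and current_tokens + n > chunk_token_num:
            chunks.append(current)
            current, current_tokens = "", 0
        current += segment
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "one two three; four five! six seven eight nine? ten"
    for chunk in merge_into_chunks(sample, chunk_token_num=5):
        print(repr(chunk))
```

A single segment longer than the budget is kept whole in this sketch, since cutting inside a sentence would defeat the purpose of the delimiters.
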
## Related Files

- `/rag/svr/task_executor.py` - Main executor
- `/deepdoc/parser/pdf_parser.py` - PDF parsing
- `/deepdoc/vision/ocr.py` - OCR engine
- `/rag/nlp/__init__.py` - Chunking algorithms