ragflow/personal_analyze/05-DOCUMENT-PROCESSING/README.md
Claude a6ee18476d
docs: Add detailed backend module analysis documentation
Add comprehensive documentation covering 6 modules:
- 01-API-LAYER: Authentication, routing, SSE streaming
- 02-SERVICE-LAYER: Dialog, Task, LLM service analysis
- 03-RAG-ENGINE: Hybrid search, embedding, reranking
- 04-AGENT-SYSTEM: Canvas engine, components, tools
- 05-DOCUMENT-PROCESSING: Task executor, PDF parsing
- 06-ALGORITHMS: BM25, fusion, RAPTOR

Total 28 documentation files with code analysis, diagrams, and formulas.
2025-11-26 11:10:54 +00:00


# 05-DOCUMENT-PROCESSING - Document Parsing Pipeline
## Overview
The document processing pipeline converts raw documents into searchable chunks using layout analysis, OCR, and intelligent chunking.
## Document Processing Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ DOCUMENT PROCESSING PIPELINE │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ File Upload │────▶│ Task Creation │────▶│ Task Queue │
│ (MinIO) │ │ (MySQL) │ │ (Redis) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TASK EXECUTOR │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. Download file from MinIO │ │
│ │ 2. Select parser based on file type │ │
│ │ 3. Execute parsing pipeline │ │
│ │ 4. Generate embeddings │ │
│ │ 5. Index in Elasticsearch │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ PDF Parser │ │ Office Parser │ │ Text Parser │
│ │ │ │ │ │
│ - Layout detect │ │ - DOCX/XLSX │ │ - TXT/MD/CSV │
│ - OCR │ │ - Table extract │ │ - Direct chunk │
│ - Table struct │ │ - Image embed │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
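
The Task Creation step can be illustrated with a minimal sketch of how a large document might be split into page-range tasks before queueing. It assumes the `task_page_size` value of 12 listed under Configuration; the function name `split_into_page_tasks` and the half-open range convention are hypothetical and not taken from the RAGFlow source.

```python
def split_into_page_tasks(total_pages: int, task_page_size: int = 12) -> list[tuple[int, int]]:
    """Return half-open (from_page, to_page) ranges that cover every page once."""
    if task_page_size <= 0:
        raise ValueError("task_page_size must be positive")
    return [
        (start, min(start + task_page_size, total_pages))
        for start in range(0, total_pages, task_page_size)
    ]


# A 30-page PDF becomes three tasks; the last one is shorter.
print(split_into_page_tasks(30))  # [(0, 12), (12, 24), (24, 30)]
```

Each range could then be stored as one task row and pushed to the queue, so that several executors can work on the same document in parallel.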
## Directory Structure
```
/rag/
├── svr/
│   └── task_executor.py               # Main task executor ⭐
├── app/
│   ├── naive.py                       # Document parsing logic
│   ├── paper.py                       # Academic paper parser
│   ├── qa.py                          # Q&A document parser
│   └── table.py                       # Structured table parser
├── flow/
│   ├── parser/                        # Document parsers
│   ├── splitter/                      # Chunking logic
│   └── tokenizer/                     # Tokenization
└── nlp/
    └── __init__.py                    # naive_merge() chunking

/deepdoc/
├── parser/
│   └── pdf_parser.py                  # RAGFlow PDF parser ⭐
├── vision/
│   ├── ocr.py                         # PaddleOCR integration
│   ├── layout_recognizer.py           # Detectron2 layout
│   └── table_structure_recognizer.py  # TSR
└── images/
    └── ...                            # Image processing
```
## Files in This Module
| File | Description |
|------|-------|
| [task_executor_analysis.md](./task_executor_analysis.md) | Task execution pipeline |
| [pdf_parsing.md](./pdf_parsing.md) | PDF parsing with layout analysis |
| [ocr_pipeline.md](./ocr_pipeline.md) | OCR with PaddleOCR |
| [layout_detection.md](./layout_detection.md) | Detectron2 layout recognition |
| [table_extraction.md](./table_extraction.md) | Table structure recognition |
| [file_type_handlers.md](./file_type_handlers.md) | Handlers for each file type |
## Processing Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ PDF PROCESSING PIPELINE │
└─────────────────────────────────────────────────────────────────┘
PDF Binary Input
┌─────────────────────────────────────────────────────────────────┐
│ 1. IMAGE EXTRACTION (0-40%) │
│ pdfplumber → PIL Images (3x zoom) │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 2. OCR DETECTION (40-63%) │
│ PaddleOCR → Bounding boxes + Text │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 3. LAYOUT RECOGNITION (63-83%) │
│ Detectron2 → Layout types (Text, Title, Table, Figure) │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 4. TABLE STRUCTURE (TSR) │
│ TableTransformer → Rows, Columns, Cells │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 5. TEXT MERGING │
│ ML-based vertical merge (XGBoost) │
│ Column detection (KMeans) │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 6. CHUNKING │
│ naive_merge() → Token-based chunks │
└──────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 7. EMBEDDING + INDEXING │
│ Vector generation → Elasticsearch │
└─────────────────────────────────────────────────────────────────┘
```
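
The percentage ranges in the diagram suggest that each stage reports progress inside its own window of the overall task. A minimal sketch of that mapping follows. The window values come from the diagram; the names `STAGE_WINDOWS` and `overall_progress` are hypothetical, and stages without a stated range are left out rather than guessed.

```python
# Stage windows as fractions of overall progress, taken from the diagram.
STAGE_WINDOWS = {
    "image_extraction": (0.00, 0.40),
    "ocr": (0.40, 0.63),
    "layout_recognition": (0.63, 0.83),
}


def overall_progress(stage: str, fraction_done: float) -> float:
    """Map a stage-local fraction (0..1) into that stage's overall window."""
    start, end = STAGE_WINDOWS[stage]
    fraction_done = min(max(fraction_done, 0.0), 1.0)  # clamp bad inputs
    return start + (end - start) * fraction_done


# Halfway through OCR sits halfway between 40% and 63%.
print(round(overall_progress("ocr", 0.5), 3))
```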
## Supported File Types
| Category | Extensions | Parser |
|----------|------------|--------|
| **PDF** | .pdf | RAGFlowPdfParser, PlainParser, VisionParser |
| **Office** | .docx, .xlsx, .pptx | python-docx, openpyxl |
| **Text** | .txt, .md, .csv | Direct reading |
| **Images** | .jpg, .png, .tiff | Vision LLM |
| **Email** | .eml | Email parser |
| **Web** | .html | Beautiful Soup |
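
Selecting a parser from the file extension, as the table describes, can be sketched as a lookup table. The extension groups mirror the table; the category strings and the `select_parser` helper are illustrative assumptions, not the actual RAGFlow dispatch code.

```python
from pathlib import Path

# Extension -> parser category, copied from the table above.
PARSER_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "office", ".xlsx": "office", ".pptx": "office",
    ".txt": "text", ".md": "text", ".csv": "text",
    ".jpg": "image", ".png": "image", ".tiff": "image",
    ".eml": "email",
    ".html": "web",
}


def select_parser(filename: str) -> str:
    """Return the parser category for a file name, ignoring extension case."""
    suffix = Path(filename).suffix.lower()
    try:
        return PARSER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None


print(select_parser("Quarterly Report.PDF"))  # pdf
print(select_parser("notes.v2.md"))           # text
```

Raising on unknown extensions keeps unsupported uploads from silently falling through to a text parser.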
## Layout Types Detected
| Type | Description |
|------|-------------|
| Text | Regular body text |
| Title | Section/document titles |
| Figure | Images and diagrams |
| Figure caption | Figure descriptions |
| Table | Data tables |
| Table caption | Table descriptions |
| Header | Page headers |
| Footer | Page footers |
| Reference | Bibliography |
| Equation | Mathematical equations |
## Key Algorithms
### Text Merging (XGBoost)
```
Features:
  - Y-distance normalized by char height
  - Same layout number
  - Ending punctuation patterns
  - Beginning character patterns
  - Chinese numbering patterns

Output: Merge probability → threshold decision
```
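
To make the feature list concrete, here is a minimal sketch of how such features could be computed for two vertically adjacent text boxes. It is not the feature set of the trained model: the box keys (`text`, `top`, `bottom`, `layoutno`), the regular expressions, and the 0.5 threshold are all assumptions chosen for illustration. The classifier itself is omitted, so `should_merge` takes a probability from any model.

```python
import re


def merge_features(upper: dict, lower: dict) -> list[float]:
    """Features for deciding whether `lower` continues the text of `upper`."""
    char_height = max(upper["bottom"] - upper["top"], 1e-6)
    y_gap = (lower["top"] - upper["bottom"]) / char_height
    return [
        y_gap,                                                    # normalized Y-distance
        float(upper["layoutno"] == lower["layoutno"]),            # same layout number
        float(bool(re.search(r"[。?!.?!]$", upper["text"]))),    # upper ends a sentence
        float(bool(re.search(r"[,,;;::]$", upper["text"]))),   # upper ends mid-clause
        float(bool(re.match(r"[a-z]", lower["text"]))),           # lower starts lowercase
        float(bool(re.match(r"[一二三四五六七八九十]+[、.]", lower["text"]))),  # Chinese numbering
    ]


def should_merge(probability: float, threshold: float = 0.5) -> bool:
    """Threshold decision on the model's merge probability (threshold is assumed)."""
    return probability >= threshold


upper = {"text": "The parser first detects", "top": 100, "bottom": 110, "layoutno": "text-1"}
lower = {"text": "layout regions.", "top": 112, "bottom": 122, "layoutno": "text-1"}
print(merge_features(upper, lower))  # [0.2, 1.0, 0.0, 0.0, 1.0, 0.0]
```

A small gap, a shared layout number, no sentence-ending punctuation, and a lowercase start all point toward merging the two lines.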
### Column Detection (KMeans)
```
Input: X-coordinates of text boxes
Output: Optimal column assignments

Algorithm:
1. For k = 1 to max_columns:
   - Fit KMeans(k)
   - Calculate silhouette_score (defined only for k >= 2; k = 1 needs a fallback score)
2. Select k with best score
3. Assign boxes to columns
```
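
The procedure above can be sketched without external dependencies. This is a simplified stand-in for scikit-learn's `KMeans` and `silhouette_score`, written for 1-D data only. The quantile initialization, the `max_columns=4` default, and the choice to give a single column a score of 0 are assumptions, since the silhouette score is undefined for one cluster.

```python
def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on 1-D data with deterministic quantile initialization."""
    ordered = sorted(xs)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(xs, labels):
    """Mean silhouette coefficient; returns 0.0 when fewer than two clusters exist."""
    clusters = {}
    for x, lab in zip(xs, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for x, lab in zip(xs, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        b = min(sum(abs(x - y) for y in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(xs)


def assign_columns(xs, max_columns=4):
    """Pick the k with the best silhouette and return left-to-right column ids."""
    best_score, best_labels = 0.0, [0] * len(xs)  # k = 1 is the fallback
    for k in range(2, min(max_columns, len(set(xs))) + 1):
        labels = kmeans_1d(xs, k)
        score = silhouette(xs, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    means = {}
    for lab in set(best_labels):
        members = [x for x, lab2 in zip(xs, best_labels) if lab2 == lab]
        means[lab] = sum(members) / len(members)
    order = {lab: i for i, lab in enumerate(sorted(means, key=means.get))}
    return [order[lab] for lab in best_labels]


# Left edges of six text boxes on a two-column page.
print(assign_columns([50, 52, 51, 300, 302, 301]))  # [0, 0, 0, 1, 1, 1]
```

One limitation of this sketch: because a single column scores 0, slightly jittered x-coordinates on a one-column page can still be split into two clusters, so a practical version would likely need a minimum-separation check.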
## Configuration
```python
parser_config = {
    "chunk_token_num": 512,         # Tokens per chunk
    "delimiter": "\n。;!?",        # Chunk boundaries
    "layout_recognize": "DeepDOC",  # Layout method
    "task_page_size": 12,           # Pages per task
}

# Task executor config
MAX_CONCURRENT_TASKS = 5
EMBEDDING_BATCH_SIZE = 16
DOC_BULK_SIZE = 64
```
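
To show how `chunk_token_num` and `delimiter` interact, the following is a minimal sketch of delimiter-aware greedy chunking. It is not the real `naive_merge()`: the helper names are hypothetical, and tokens are approximated by whitespace-separated words, which undercounts for languages written without spaces.

```python
def count_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (an assumption)."""
    return len(text.split())


def split_sentences(text: str, delimiter: str) -> list[str]:
    """Cut after every delimiter character, keeping it attached to its sentence."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch in delimiter:
            sentences.append("".join(current))
            current = []
    if current:
        sentences.append("".join(current))
    return sentences


def merge_into_chunks(text: str, chunk_token_num: int = 512,
                      delimiter: str = "\n。;!?") -> list[str]:
    """Greedily pack sentences into chunks of at most chunk_token_num tokens.

    A single sentence longer than the limit becomes its own oversized chunk.
    """
    chunks, current, current_tokens = [], "", 0
    for sentence in split_sentences(text, delimiter):
        n = count_tokens(sentence)
        if current.strip() and current_tokens + n > chunk_token_num:
            chunks.append(current)
            current, current_tokens = "", 0
        current += sentence
        current_tokens += n
    if current.strip():
        chunks.append(current)
    return chunks


text = "one two three\nfour five\nsix seven eight nine\n"
print(merge_into_chunks(text, chunk_token_num=5))
# ['one two three\nfour five\n', 'six seven eight nine\n']
```

Splitting only at delimiter characters keeps sentences intact, so chunk boundaries fall at natural breaks rather than at an exact token count.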
## Related Files
- `/rag/svr/task_executor.py` - Main executor
- `/deepdoc/parser/pdf_parser.py` - PDF parsing
- `/deepdoc/vision/ocr.py` - OCR engine
- `/rag/nlp/__init__.py` - Chunking algorithms