docs: Add detailed PDF parser processing steps documentation

Created comprehensive documentation for the RAGFlowPdfParser processing pipeline:

- 10 major processing steps with code references
- Complete data flow diagrams
- Algorithm explanations (K-Means column detection, text merging)
- Box data structure evolution through the pipeline
- Position tag format specification
- Line-by-line code analysis for key methods:
  - __init__ (model loading)
  - __images__ (OCR processing)
  - _layouts_rec (layout detection)
  - _table_transformer_job (table structure)
  - _assign_column (column detection)
  - _text_merge (horizontal merge)
  - _naive_vertical_merge (vertical merge)
  - _filter_forpages (cleanup)
  - _extract_table_figure (extraction)
  - __filterout_scraps (final output)

parent 6d4dbbfe2c
commit 1dcc9a870b
1 changed file with 1651 additions and 0 deletions
personal_analyze/07-DEEPDOC-DEEP-GUIDE/pdf_parser_steps_detail.md
File diff suppressed because it is too large