ragflow

Author	SHA1	Message	Date
buua436	2b9145948f	Fix: not enough values to unpack (expected 3, got 2) in general chunk (#11139 ) ### What problem does this PR solve? issue: #11136 change: not enough values to unpack (expected 3, got 2) in general chunk ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-10 15:08:24 +08:00
Kevin Hu	d207291217	Fix: add download stats to kb logs. (#11112 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-10 13:28:07 +08:00
Billy Bao	b137de1def	Fix: Plain parser is skipped (#11094 ) ### What problem does this PR solve? The plain parser was skipped. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-07 13:39:29 +08:00
YngvarHuang	2cb1046cbf	fix: The doc file cannot be parsed(#11092 ) (#11093 ) ### What problem does this PR solve? The doc file cannot be parsed(#11092) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: virgilwong <hyhvirgil@gmail.com>	2025-11-07 11:46:10 +08:00
Billy Bao	4b8ce08050	Fix: fix pdf_parser ignored in rag/app/naive.py (#11065 ) ### What problem does this PR solve? Fix: fix pdf_parser ignored in rag/app/naive.py #11000 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-06 15:20:35 +08:00
Billy Bao	121c51661d	Fix: Markdown table extractor (#11018 ) ### What problem does this PR solve? Now markdown table extractor supports <table ...>. #10966 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-05 16:10:21 +08:00
Billy Bao	cf9611c96f	Feat: Support more chunking methods (#11000 ) ### What problem does this PR solve? Feat: Support more chunking methods #10772 This PR enables multiple chunking methods — including books, laws, naive, one, and presentation — to be used with all existing PDF parsers (DeepDOC, MinerU, Docling, TCADP, Plain Text, and Vision modes). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-05 13:00:42 +08:00
Jin Hai	bab3fce136	Move some constants to common (#11004 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-05 08:01:39 +08:00
Yongteng Lei	2677617f93	Feat: supports MinerU http-client/server method (#10961 ) ### What problem does this PR solve? Add support for MinerU http-client/server method. To use MinerU with vLLM server: 1. Set up a vLLM server running MinerU: ```bash mineru-vllm-server --port 30000 ``` 2. Configure the following environment variables: - `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru` (or the path to your MinerU executable) - `MINERU_BACKEND="vlm-http-client"` - `MINERU_SERVER_URL="http://your-vllm-server-ip:30000"` 3. Follow the standard MinerU setup steps as described above. With this configuration, RAGFlow will connect to your vLLM server to perform document parsing, which can significantly improve parsing performance for complex documents while reducing the resource requirements on your RAGFlow server. ![1](https://github.com/user-attachments/assets/46624a0c-0f3b-423e-ace8-81801e97a27d) ![2](https://github.com/user-attachments/assets/66ccc004-a598-47d4-93cb-fe176834f83b) ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2025-11-04 16:03:30 +08:00
Jin Hai	9a486e0f51	Move some funcs from api to rag module (#10972 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 19:26:09 +08:00
Billy Bao	fa210e7c58	Feat: parsing hyperlinks in docx and pdf & Fix: default parser config of toc extraction (#10877 ) ### What problem does this PR solve? Feat: parsing hyperlinks in docx and pdf #10848 Fix: default parser config of toc extraction ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-03 09:34:12 +08:00
buua436	bb9504d1cc	Fix: enhance delimiters in markdown parser (#10896 ) ### What problem does this PR solve? issue: [#10890](https://github.com/infiniflow/ragflow/issues/10890) change: enhance delimiters in markdown parser ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-30 17:36:51 +08:00
Edward Chen	b52f09adfe	MinerU API support (#10874 ) ### What problem does this PR solve? Support a local MinerU API from a Docker instance, for example when WSL on Windows has no GPU but a MinerU API with GPU support is available. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-10-30 17:31:46 +08:00
Yongteng Lei	5acc407240	Feat: MinerU supports VLM-Transformers backend (#10809 ) ### What problem does this PR solve? MinerU supports the VLM-Transformers backend. Set `MINERU_BACKEND="pipeline"` to choose the backend. (Options: pipeline | vlm-transformers, default is pipeline) ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-10-27 17:04:13 +08:00
aidan	33a189f620	Feat: add TCADP Parser (#10775 ) ### What problem does this PR solve? This PR adds a new TCADP (Tencent Cloud Advanced Document Processing) parser to RAGFlow, enabling users to leverage Tencent Cloud's document parsing capabilities for more accurate and structured document processing. The implementation includes: New TCADP Parser: A complete implementation of Tencent Cloud's document parsing API without SDK dependency Configuration Support: Added configuration options in service_conf.yaml for Tencent Cloud API credentials Frontend Integration: Updated UI components to support the new TCADP parser option Error Handling: Comprehensive error handling and retry mechanisms for API calls Result Processing: Support for both SSE streaming and JSON response formats from Tencent Cloud API ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-10-27 15:14:58 +08:00
buua436	0ff2042fc1	Feat: add Docling parser (#10759 ) ### What problem does this PR solve? issue: #3945 change: add Docling parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-23 19:44:25 +08:00
buua436	6ab96287c9	Feat: Vision Model Image Enhancement in Manual/Paper/Book/One chunker (#10640 ) ### What problem does this PR solve? issue: [#7472](https://github.com/infiniflow/ragflow/issues/7472) change: Vision Model Image Enhancement in Manual chunker ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-21 09:36:27 +08:00
Billy Bao	8ee0b6ea54	File: Parsing now supports all types of embedded documents, solved #10059 (#10635 ) ### What problem does this PR solve? File: Parsing now supports all types of embedded documents, solved #10059 Fix: Incomplete words in chat #10530 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-17 18:46:47 +08:00
Yongteng Lei	387baf858f	Feat: add MinerU parser (#10621 ) ### What problem does this PR solve? Add MinerU parser. #3945, #8092. Set `MINERU_EXECUTABLE` to the MinerU executable path, defaults to `mineru`. Set `MINERU_DELETE_OUTPUT=0` to preserve MinerU's output, default is 1, which deletes temporary output. Set `MINERU_OUTPUT_DIR` to choose the MinerU output directory (uses the temporary directory if unset). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-17 09:55:39 +08:00
Yongteng Lei	5200711441	Feat: add support for multi-column PDF parsing (#10475 ) ### What problem does this PR solve? Add support for multi-column PDF parsing. #9878, #9919. Two-column sample: <img width="1885" height="1020" alt="image" src="https://github.com/user-attachments/assets/0270c028-2db8-4ca6-a4b7-cd5830882d28" /> Three-column sample: <img width="1881" height="992" alt="image" src="https://github.com/user-attachments/assets/9ee88844-d5b1-4927-9e4e-3bd810d6e03a" /> Single-column sample: <img width="1883" height="1042" alt="image" src="https://github.com/user-attachments/assets/e93d3d18-43c3-4067-b5fa-e454ed0ab093" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:46:09 +08:00
Yongteng Lei	8aabc2807c	Feat: Pipeline Docx file supports Markdown output (#10439 ) ### What problem does this PR solve? Pipeline Docx file supports Markdown output. <img width="1242" height="755" alt="image" src="https://github.com/user-attachments/assets/63cca75b-20b9-4a90-a01c-c0c2fccf1f2a" /> <img width="1227" height="717" alt="image" src="https://github.com/user-attachments/assets/0dcb94b2-7ba0-48d5-9231-dc6e5c4b4192" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-10 09:39:15 +08:00
Billy Bao	ea0f1d47a5	Support image recognition for url links in Markdown file, fix log error in code_exec (#10139 ) ### What problem does this PR solve? Support image recognition with image links in markdown files, solved issue: #8755 Fixed log info error in code_exec, solved issue: #10064 ### Type of change (8755) - [x] New Feature (non-breaking change which adds functionality) ### Type of change (10064) - [x] Bug Fix (non-breaking change which fixes an issue)	2025-09-18 09:44:17 +08:00
Stephen Hu	179091b1a4	Fix: In ragflow/rag/app/naive.py, if there are multiple images in one line, the other images will be lost (#9968 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/9966 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-09-11 11:08:31 +08:00
pingguoCooler	cf0011be67	Feat: Upgrade html parser (#9675 ) ### What problem does this PR solve? parse more html content. ### Type of change - [x] Other (please describe):	2025-08-27 12:43:55 +08:00
Yongteng Lei	382458ace7	Feat: advanced markdown parsing (#9607 ) ### What problem does this PR solve? Using AST parsing to handle markdown more accurately, preventing components from being cut off by chunking. #9564 <img width="1746" height="993" alt="image" src="https://github.com/user-attachments/assets/4aaf4bf6-5714-4d48-a9cf-864f59633f7f" /> <img width="1739" height="982" alt="image" src="https://github.com/user-attachments/assets/dc00233f-7a55-434f-bbb7-74ce7f57a6cf" /> <img width="559" height="100" alt="image" src="https://github.com/user-attachments/assets/4a556b5b-d9c6-4544-a486-8ac342bd504e" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-21 09:36:18 +08:00
Kevin Hu	312f1a0477	Fix: enlarge raptor timeout limits. (#9600 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-20 17:29:15 +08:00
Yongteng Lei	eef43fa25c	Fix: unexpected truncated Excel files (#9500 ) ### What problem does this PR solve? Handle unexpected truncated Excel files. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-15 17:00:34 +08:00
Jay Xu	6d1078b538	fix 'KeyError: "There is no item named 'word/NULL' in the archive"' (#9455 ) ### What problem does this PR solve? Issue referring to: https://github.com/python-openxml/python-docx/issues/797 Fix referring to: https://github.com/python-openxml/python-docx/issues/1105#issuecomment-1298075246 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-14 12:14:03 +08:00
Jay Xu	7f08ba47d7	Fix "no `tc` element at grid_offset" (#9375 ) ### What problem does this PR solve? fix "no `tc` element at grid_offset", just log warning and ignore. stacktrace: ``` Traceback (most recent call last): File "/ragflow/rag/svr/task_executor.py", line 620, in handle_task await do_handle_task(task) File "/ragflow/rag/svr/task_executor.py", line 553, in do_handle_task chunks = await build_chunks(task, progress_callback) File "/ragflow/rag/svr/task_executor.py", line 257, in build_chunks cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"], File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 447, in to_thread_run_sync return msg_from_thread.unwrap() File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap raise captured_error File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 373, in do_release_then_return_result return result.unwrap() File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap raise captured_error File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 392, in worker_fn ret = context.run(sync_fn, *args) File "/ragflow/rag/svr/task_executor.py", line 257, in <lambda> cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"], File "/ragflow/rag/app/naive.py", line 384, in chunk sections, tables = Docx()(filename, binary) File "/ragflow/rag/app/naive.py", line 230, in __call__ while i < len(r.cells): File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 438, in cells return tuple(_iter_row_cells()) File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 436, in _iter_row_cells yield from iter_tc_cells(tc) File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 424, in iter_tc_cells yield from iter_tc_cells(tc._tc_above) # pyright: ignore[reportPrivateUsage] File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 741, in _tc_above return self._tr_above.tc_at_grid_offset(self.grid_offset) File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 98, in tc_at_grid_offset raise ValueError(f"no `tc` element at grid_offset={grid_offset}") ValueError: no `tc` element at grid_offset=10 ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-11 17:13:10 +08:00
Kevin Hu	d9fe279dde	Feat: Redesign and refactor agent module (#9113 ) ### What problem does this PR solve? #9082 #6365 <u> WARNING: it's not compatible with the older version of `Agent` module, which means that `Agent` from older versions can not work anymore.</u> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 19:41:09 +08:00
Yongteng Lei	39ef2ffba9	Feat: parsing supports jsonl or ldjson format (#9087 ) ### What problem does this PR solve? Supports jsonl or ldjson format. Feature request from [discussion](https://github.com/orgs/infiniflow/discussions/8774). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 09:48:20 +08:00
Stephen Hu	92cfbcb382	Fix: support extracting local images when parsing Markdown (#8906 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8902 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-18 17:06:58 +08:00
Yongteng Lei	51a8604dcb	Fix: fixed context loss caused by separating markdown tables from original text (#8844 ) ### What problem does this PR solve? Fix context loss caused by separating markdown tables from original text. #6871, #8804. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-15 13:03:01 +08:00
wenxuan.zhang	f586dd0a96	Fix: docx parse error. (#8600 ) ### What problem does this PR solve? docx parse error. ![image](https://github.com/user-attachments/assets/efbe6d1b-10c8-415e-b693-a86f73e1ffa6) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### What problem does this PR solve? Some docx files cause an error when parsed with the naive method: `block.style.name` in the function `__get_nearest_title` can be None in some cases. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: wenxuan.zhang <wenxuan.zhang@chinacreator.com>	2025-07-01 17:38:11 +08:00
Jin Hai	4a2ff633e0	Fix typo in code (#8327 ) ### What problem does this PR solve? Fix typo in code ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-06-18 09:41:09 +08:00
Yongteng Lei	bd4678bca6	Fix: Unnecessary truncation in markdown parser (#7972 ) ### What problem does this PR solve? Fix unnecessary truncation in markdown parser. So that markdown can work perfectly like [this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576) in #7824, supporting multiple special delimiters. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-30 15:04:21 +08:00
Kevin Hu	bfe97d896d	Fix: docx get image exception. (#7636 ) ### What problem does this PR solve? Close #7631 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-14 12:24:48 +08:00
alkscr	baa108f5cc	Fix: markdown table conversion error (#7570 ) ### What problem does this PR solve? Since `import markdown.markdown` has been changed to `import markdown` in `rag/app/naive.py`, the previous code for converting markdown tables would call the markdown module instead of a callable function. This causes an error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2025-05-12 17:16:55 +08:00
Stephen Hu	1662c7eda3	Feat: Markdown add image (#7124 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/6984 1. Markdown parser supports extracting pictures 2. For Native, when handling Markdown, it will handle images 3. improve merge and ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-04-25 18:35:28 +08:00
Kevin Hu	14a3efd756	Fix: docx image exceptions. (#6839 ) ### What problem does this PR solve? Close #6784 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-04-07 12:33:34 +08:00
fansir	0e0ebaac5f	Feat: Adds hierarchical title path tracking for tables in DOCX documents to improve context association (#6374 ) ### What problem does this PR solve? Adds hierarchical title path tracking for tables in DOCX documents to improve context association. Previously, extracted tables lacked positional context within document structure. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-21 18:42:36 +08:00
Kevin Hu	95497b4aab	Fix: adapt to old configurations. (#6321 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-20 14:50:59 +08:00
Yongteng Lei	9611185eb4	Feat: add VLM-boosted DocX parser (#6307 ) ### What problem does this PR solve? Add VLM-boosted DocX parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 11:24:44 +08:00
Yongteng Lei	e4380843c4	Feat: add fallback for PDF figure parser (#6305 ) ### What problem does this PR solve? Add fallback for PDF figure parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 10:48:38 +08:00
Yongteng Lei	1d6760dd84	Feat: add VLM-boosted PDF parser (#6278 ) ### What problem does this PR solve? Add VLM-boosted PDF parser if VLM is set. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 09:39:32 +08:00
Yongteng Lei	5cf610af40	Feat: add vision LLM PDF parser (#6173 ) ### What problem does this PR solve? Add vision LLM PDF parser ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-03-18 14:52:20 +08:00
Kevin Hu	3a99c2b5f4	Refa: PARALLEL_DEVICES is a static parameter. (#6168 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-03-17 16:49:54 +08:00
Kevin Hu	bfa8d342b3	Fix: retrieval debug mode issue. (#6150 ) ### What problem does this PR solve? #6139 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-17 13:07:13 +08:00
Debug Doctor	3e19044dee	Feat: add multi-GPU and parallel processing support for OCR (#5972 ) ### What problem does this PR solve? Add multi-GPU and parallel processing support for OCR ### Type of change - [x] New Feature (non-breaking change which adds functionality) @yuzhichang I've tried to resolve the comments in #5697. OCR jobs can now be done on both CPU and GPU. ( By the way, I've encountered a “Generate embedding error” issue #5954 that might be due to my outdated GPUs? idk. ) Please review it and give me suggestions. GPU: ![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e) ![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d) CPU: ![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)	2025-03-17 11:58:40 +08:00
Yongteng Lei	4ff609b6a8	Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027 ) ### What problem does this PR solve? Optimize OCR garbage identification to reduce unnecessary filtering. #5713 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-13 18:48:32 +08:00

118 commits