ragflow

Author	SHA1	Message	Date
David Eberto Domenech Castillo	3c224c817b	Fix: Correct pagination and early termination bugs in chunk_list() (#11692 ) ## Summary This PR fixes two critical bugs in the `chunk_list()` method that prevent processing large documents (>128 chunks) in GraphRAG and other workflows. ## Bugs Fixed ### Bug 1: Incorrect pagination offset calculation Location: `rag/nlp/search.py` lines 530-531 Problem: The loop variable `p` was used directly as offset, causing incorrect pagination. Before (buggy): `for p in range(offset, max_count, bs):` (p = 0, 128, 256, 384...) followed by `es_res = self.dataStore.search(..., p, bs, ...)` (p used as offset). Fix: Use page number multiplied by batch size. After (fixed): `for page_num, p in enumerate(range(offset, max_count, bs)):` followed by `es_res = self.dataStore.search(..., page_num * bs, bs, ...)`. Bug 2: Premature loop termination Location: `rag/nlp/search.py` lines 538-539 Problem: Loop terminates when any page returns fewer than 128 chunks, even when thousands more remain. Before (buggy): `if len(dict_chunks.values()) < bs: break` (breaks at 126 chunks even if 3,000+ remain). Fix: Only terminate when zero chunks are returned. After (fixed): `if len(dict_chunks.values()) == 0: break`. Enhancement: Add max_count parameter to GraphRAG Location: `graphrag/general/index.py` line 60 Added `max_count=10000` parameter to chunk loading for both LightRAG and General GraphRAG paths to ensure all chunks are processed. Testing Validated with a 314-page legal document containing 3,207 chunks: Before fixes: - Only 2-126 chunks processed - GraphRAG generated 25 nodes, 8 edges After fixes: - All 3,209 chunks processed ✅ - GraphRAG processing complete dataset Impact These bugs affect any workflow using `chunk_list()` with large documents, particularly: - GraphRAG knowledge graph generation - RAPTOR hierarchical summarization - Document processing pipelines with >128 chunks Related Issue Fixes #11687 Checklist - Code follows project style guidelines - Tested with large documents (3,207+ chunks) - Both bugs validated by Dosu bot in issue #11687 - No breaking changes to API --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-03 19:44:20 +08:00
hsparks-codes	4870d42949	feat: Auto-disable Raptor for structured data (Issue #11653 ) (#11676 ) ### What problem does this PR solve? Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653. Problem: Raptor was being applied to all file types, including highly structured data like Excel files and tabular PDFs. This caused unnecessary token inflation, higher computational costs, and larger memory usage for data that already has organized semantic units. Solution: Automatically skip Raptor processing for: - Excel files (.xls, .xlsx, .xlsm, .xlsb) - CSV files (.csv, .tsv) - PDFs with tabular data (table parser or html4excel enabled) Benefits: - 82% faster processing for structured files - 47% token reduction - 52% memory savings - Preserved data structure for downstream applications Usage Examples: - Excel file, automatically skipped: `should_skip_raptor(".xlsx")` returns True - CSV file, automatically skipped: `should_skip_raptor(".csv")` returns True - Tabular PDF, automatically skipped: `should_skip_raptor(".pdf", parser_id="table")` returns True - Regular PDF, Raptor runs normally: `should_skip_raptor(".pdf", parser_id="naive")` returns False - Override for special cases: `should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False})` returns False Configuration: Includes `auto_disable_for_structured_data` toggle (default: true) to allow override for special use cases. Testing: 44 comprehensive tests, 100% passing ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 17:02:29 +08:00
Jin Hai	3c50c7d3ac	Refactor code (#11694 ) ### What problem does this PR solve? Rename function and refactor log message ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-03 15:15:00 +08:00
Yongteng Lei	e3f40db963	Refa: make RAGFlow more asynchronous 2 (#11689 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-12-03 14:19:53 +08:00
Kevin Hu	b5ad7b7062	Feat: support TOC transformer. (#11685 ) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 12:27:50 +08:00
Yongteng Lei	5c81e01de5	Fix: incorrect async chat streamly output (#11679 ) ### What problem does this PR solve? Incorrect async chat streamly output. #11677. Disable beartype for #11666. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-03 11:15:45 +08:00
Kevin Hu	a6681d6366	Revert "Refa: make RAGFlow more asynchronous 2" (#11669 ) Reverts infiniflow/ragflow#11664	2025-12-02 19:42:05 +08:00
Yongteng Lei	627c11c429	Refa: make RAGFlow more asynchronous 2 (#11664 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring - [x] Performance Improvement	2025-12-02 18:57:07 +08:00
rommy2017	4ba17361e9	feat: improve presentation PdfParser (#11639 ) The old presentation PdfParser lost table format after parse ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:35:14 +08:00
Billy Bao	c946858328	Feat: add mineru auto installer (#11649 ) ### What problem does this PR solve? Feat: add mineru auto installer ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:29:26 +08:00
qinling0210	2ffe6f7439	Import rag_tokenizer from Infinity (#11647 ) ### What problem does this PR solve? - Original rag/nlp/rag_tokenizer.py is put to Infinity and infinity-sdk via https://github.com/infiniflow/infinity/pull/3117 . Import rag_tokenizer from infinity and inherit from rag_tokenizer.RagTokenizer in new rag/nlp/rag_tokenizer.py. - Bump infinity to 0.6.8 ### Type of change - [x] Refactoring	2025-12-02 14:59:37 +08:00
Yongteng Lei	a713f54732	Refa: add MiniMax-M2 and remove deprecated MiniMax models (#11642 ) ### What problem does this PR solve? Add MiniMax-M2 and remove deprecated models. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-02 14:43:44 +08:00
buua436	b8c0fb4572	Feat:new api /sequence2txt and update QWenSeq2txt (#11643 ) ### What problem does this PR solve? change: new api /sequence2txt, update QWenSeq2txt and ZhipuSeq2txt ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 11:17:31 +08:00
Stephen Hu	d1e172171f	Refactor: better describe how to get prefix for sync data source (#11636 ) ### What problem does this PR solve? better describe how to get prefix for sync data source ### Type of change - [x] Refactoring	2025-12-01 17:46:44 +08:00
Kevin Hu	81ae6cf78d	Feat: support uploading in dialog. (#11634 ) ### What problem does this PR solve? #9590 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-01 16:54:57 +08:00
Billy Bao	41cff3e09e	Fix: jina embedding issue (#11628 ) ### What problem does this PR solve? Fix: jina embedding issue #11614 Feat: Add jina embedding v4 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 14:24:35 +08:00
Yongteng Lei	b6c4722687	Refa: make RAGFlow more asynchronous (#11601 ) ### What problem does this PR solve? Try to make this more asynchronous. Verified in chat and agent scenarios, reducing blocking behavior. #11551, #11579. However, the impact of these changes still requires further investigation to ensure everything works as expected. ### Type of change - [x] Refactoring	2025-12-01 14:24:06 +08:00
Kevin Hu	6ea4248bdc	Feat: support parent-child in search procedure. (#11629 ) ### What problem does this PR solve? #7996 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-01 14:03:09 +08:00
Kevin Hu	88a28212b3	Fix: Table parse method issue. (#11627 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:42:35 +08:00
Yongteng Lei	9d0309aedc	Fix: [MinerU] Missing output file (#11623 ) ### What problem does this PR solve? Add fallbacks for MinerU output path. #11613, #11620. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:17:43 +08:00
Lei Zhang	7499608a8b	feat: add Redis username support (#11608 ) ### What problem does this PR solve? Support for Redis 6+ ACL authentication (username) close #11606 ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2025-12-01 11:26:20 +08:00
omahs	80f6d22d2a	Fix typos (#11607 ) ### What problem does this PR solve? Fix typos ### Type of change - [x] Fix typos	2025-12-01 09:49:46 +08:00
Kevin Hu	14616cf845	Feat: add child parent chunking method in backend. (#11598 ) ### What problem does this PR solve? #7996 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-28 19:25:32 +08:00
TeslaZY	dbdda0fbab	Feat: optimize meta filter generation for better structure handling (#11586 ) ### What problem does this PR solve? optimize meta filter generation for better structure handling ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-28 13:30:53 +08:00
Billy Bao	cf7fdd274b	Feat: add gmail connector (#11549 ) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-28 13:09:40 +08:00
Billy Bao	982ed233a2	Fix: doc_aggs not correctly returned when no chunks retrieved. (#11578 ) ### What problem does this PR solve? Fix: doc_aggs not correctly returned when no chunks retrieved. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-28 13:09:05 +08:00
Yongteng Lei	8604c4f57c	Feat: add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5 (#11559 ) ### What problem does this PR solve? Add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5. #11548 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 17:59:17 +08:00
Zhichang Yu	856201c0f2	Fix ft_title_rag_fine (#11555 ) ### What problem does this PR solve? Fix ft_title_rag_fine ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-27 10:26:08 +08:00
Yongteng Lei	9d8b96c1d0	Feat: add context for figure and table (#11547 ) ### What problem does this PR solve? Add context for figures and tables. ![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e) `==================()` is shown for demonstration purposes. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 10:21:44 +08:00
Levi	12979a3f21	feat: improve metadata handling in connector service (#11421 ) ### What problem does this PR solve? - Update sync data source to handle metadata properly ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-11-26 19:55:48 +08:00
Jonah Hartmann	2fd5ac1031	Feat: Add Webdav storage as data source (#11422 ) ### What problem does this PR solve? This PR adds webdav storage as data source for data sync service. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-26 14:14:42 +08:00
Zhichang Yu	40e84ca41a	Use Infinity single-field-multi-index (#11444 ) ### What problem does this PR solve? Use Infinity single-field-multi-index ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-11-26 11:06:37 +08:00
Kevin Hu	74e0b58d89	Fix: excel default optimization. (#11519 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 19:54:20 +08:00
Yongteng Lei	7c20c964b4	Fix: incorrect image merging for naive markdown parser (#11520 ) ### What problem does this PR solve? Fix incorrect image merging for naive markdown parser. #9349 [ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 19:54:06 +08:00
Kevin Hu	915e385244	Fix: uv lock updates (#11511 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 16:01:12 +08:00
Kevin Hu	f5faf0c94f	Feat: support operator in/not in for metadata filter. (#11503 ) ### What problem does this PR solve? #11376 #11378 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-25 12:44:26 +08:00
Kevin Hu	bcd70affb5	Fix: unexpected parameter. (#11497 ) ### What problem does this PR solve? #11489 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 11:17:27 +08:00
Stephen Hu	41665b0865	Refactor: Email parser use with to handle buffer (#11496 ) ### What problem does this PR solve? The email parser now uses a `with` statement to manage its buffer. ### Type of change - [x] Refactoring	2025-11-25 10:03:37 +08:00
Yongteng Lei	d1744aaaf3	Feat: add datasource Dropbox (#11488 ) ### What problem does this PR solve? Add datasource Dropbox. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-25 09:40:03 +08:00
Levi	f0a14f5fce	Add Moodle data source integration (#11325 ) ### What problem does this PR solve? This PR adds a native Moodle connector to sync content (courses, resources, forums, assignments, pages, books) into RAGFlow. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-11-21 19:58:49 +08:00
Yongteng Lei	174a2578e8	Feat: add auth header for Ollama chat model (#11452 ) ### What problem does this PR solve? Add auth header for Ollama chat model. #11350 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-21 19:47:06 +08:00
Yongteng Lei	db0f6840d9	Feat: ignore chunk size when using custom delimiters (#11434 ) ### What problem does this PR solve? Ignore chunk size when using custom delimiter. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-21 14:36:26 +08:00
coding	971c1bcba7	Fix: missing parameters in by_plaintext method for PDF naive mode (#11408 ) ### What problem does this PR solve? Fix: missing parameters in by_plaintext method for PDF naive mode ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: lih <dev_lih@139.com>	2025-11-21 09:33:36 +08:00
Billy Bao	d3d2ccc76c	Feat: add more chunking method (#11413 ) ### What problem does this PR solve? Feat: add more chunking method #11311 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 19:07:17 +08:00
Yongteng Lei	b846a0f547	Fix: incorrect retrieval total count with pagination enabled (#11400 ) ### What problem does this PR solve? Incorrect retrieval total count with pagination enabled. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-20 15:35:09 +08:00
Kevin Hu	06cef71ba6	Feat: add or logic operations for meta data filters. (#11404 ) ### What problem does this PR solve? #11376 #11387 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 14:31:12 +08:00
aidan	420c97199a	Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041 ) ### What problem does this PR solve? - Added TCADP Parser configuration fields to PDF, PPT, and spreadsheet parsing forms - Implemented support for setting table result type (Markdown/HTML) and Markdown image response type (URL/Text) - Updated TCADP Parser to handle return format settings from configuration or parameters - Enhanced frontend to dynamically show TCADP options based on selected parsing method - Modified backend to pass format parameters when calling TCADP API - Optimized form default value logic for TCADP configuration items - Updated multilingual resource files for new configuration options ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 10:08:42 +08:00
He Wang	38234aca53	feat: add OceanBase doc engine (#11228 ) ### What problem does this PR solve? Add OceanBase doc engine. Close #5350 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 10:00:14 +08:00
Philipp Heyken Soares	1c06ec39ca	fix cohere rerank base_url default (#11353 ) ### What problem does this PR solve? Cohere rerank base_url default handling - Background: When no rerank base URL is configured, the settings pipeline was passing an empty string through RERANK_CFG → TenantLLMService → CoHereRerank, so the Cohere client received base_url="" and produced “missing protocol” errors during rerank calls. - What changed: The CoHereRerank constructor now only forwards base_url to the Cohere client when it isn’t empty/whitespace, causing the client to fall back to its default API endpoint otherwise. - Why it matters: This prevents invalid URL construction in the rerank workflow and keeps tests/sanity checks that rely on the default Cohere endpoint from failing when no custom base URL is specified. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>	2025-11-20 09:46:39 +08:00
Yongteng Lei	e8fe580d7a	Feat: add Gemini 3 Pro preview (#11361 ) ### What problem does this PR solve? Add Gemini 3 Pro preview. Change `GenerativeModel` to `genai`. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-19 13:17:22 +08:00
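
Note on commit 3c224c817b (early termination in `chunk_list()`): the termination rule described in that message can be illustrated with a small stand-alone sketch. This is not the code from `rag/nlp/search.py`. The names `list_all_chunks`, `make_fake_store` and `fetch_page` are hypothetical, and the fake store only imitates a backend whose pages can come back short (here because some records are filtered out after slicing) while more data remains.

```python
from typing import Callable, Dict, List


def list_all_chunks(
    fetch_page: Callable[[int, int], List[Dict]],
    max_count: int = 10000,
    batch_size: int = 128,
) -> List[Dict]:
    """Walk the result set page by page and stop only on an empty page."""
    chunks: List[Dict] = []
    for offset in range(0, max_count, batch_size):
        page = fetch_page(offset, batch_size)
        # A short page is not proof that the data is exhausted;
        # only an empty page ends the loop.
        if not page:
            break
        chunks.extend(page)
    return chunks


def make_fake_store(total: int, hidden_every: int) -> Callable[[int, int], List[Dict]]:
    """Hypothetical backend: slices by offset, then drops some records."""
    records = [{"id": i} for i in range(total)]

    def fetch_page(offset: int, limit: int) -> List[Dict]:
        window = records[offset:offset + limit]
        # Records whose id is a multiple of hidden_every are filtered out,
        # so a full window can yield fewer than `limit` results.
        return [r for r in window if r["id"] % hidden_every != 0]

    return fetch_page


fetch = make_fake_store(total=300, hidden_every=50)
all_chunks = list_all_chunks(fetch)
print(len(fetch(0, 128)), len(all_chunks))  # 125 294
```

With a `len(page) < batch_size` check instead of the empty-page check, the loop above would stop after the first page of 125 records and miss the remaining 169.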


1103 commits
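
Note on commit 4870d42949 (auto-disabling Raptor for structured data): the skip rule listed in that message can be sketched as below. This is a guess at the decision logic reconstructed from the usage examples in the commit message, not the function shipped in the PR. The name `should_skip_raptor_sketch` is hypothetical, and reading the html4excel flag from a `parser_config` dict is an assumption.

```python
from typing import Any, Dict, Optional

# Extensions the commit message lists as structured data.
STRUCTURED_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".csv", ".tsv"}


def should_skip_raptor_sketch(
    file_type: str,
    parser_id: str = "",
    parser_config: Optional[Dict[str, Any]] = None,  # assumed location of html4excel
    raptor_config: Optional[Dict[str, Any]] = None,
) -> bool:
    """Return True when Raptor should be skipped for this file."""
    raptor_config = raptor_config or {}
    # The toggle defaults to true; turning it off always lets Raptor run.
    if not raptor_config.get("auto_disable_for_structured_data", True):
        return False

    ext = file_type.lower()
    if ext in STRUCTURED_EXTENSIONS:
        return True

    if ext == ".pdf":
        parser_config = parser_config or {}
        # Tabular PDFs: table parser selected, or html4excel enabled.
        return parser_id == "table" or bool(parser_config.get("html4excel", False))

    return False


print(should_skip_raptor_sketch(".xlsx"))                    # True
print(should_skip_raptor_sketch(".pdf", parser_id="naive"))  # False
```

Checking the override first keeps the opt-out unconditional, which matches the last usage example in the commit message.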