---
title: Docling in OpenRAG
slug: /ingestion
---

import Icon from "@site/src/components/icon/icon";
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import PartialModifyFlows from '@site/docs/_partial-modify-flows.mdx';

OpenRAG uses [Docling](https://docling-project.github.io/docling/) for document ingestion.
More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling serve` process on your local machine and runs Docling ingestion through an API service.
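
Because Docling Serve is a plain HTTP service, you can exercise it outside of OpenRAG. The following is a minimal sketch using Python's standard library; the port (`5001`) and the `v1alpha` convert endpoint are assumed defaults that vary by docling-serve version, so check the Swagger UI at `/docs` on your instance:

```python
# Minimal sketch: ask a running docling-serve instance to convert a
# document. Port 5001 and the /v1alpha/convert/source path are assumed
# defaults; verify them against your instance's Swagger UI at /docs.
import json
import urllib.request

payload = {"http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}]}
request = urllib.request.Request(
    "http://localhost:5001/v1alpha/convert/source",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

# The response carries the converted document, including a Markdown
# rendering; key names can also vary by docling-serve version.
print(result["document"]["md_content"][:500])
```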

Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch `documents` index.

OpenRAG chose Docling for its broad file format support, high performance, and advanced understanding of tables and images.

To modify OpenRAG's ingestion settings, including the Docling settings and ingestion flows, click <Icon name="Settings2" aria-hidden="true"/> **Settings**.

## Knowledge ingestion settings

These settings configure the Docling ingestion parameters.

OpenRAG warns you if `docling serve` is not running.
To start or stop `docling serve` or any other native services, go to the TUI main menu, and then click **Start Native Services** or **Stop Native Services**.

**Embedding model** determines which AI model is used to create vector embeddings. The default is the OpenAI `text-embedding-3-small` model.
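
As an illustration of what this setting controls, you can call the default model directly with the `openai` Python package (assuming `OPENAI_API_KEY` is set in your environment):

```python
# Sketch: generate one embedding with the default model. Assumes the
# openai package is installed and OPENAI_API_KEY is exported.
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Docling splits documents into chunks before embedding.",
)
vector = response.data[0].embedding
print(len(vector))  # text-embedding-3-small produces 1536-dimensional vectors
```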

**Chunk size** determines how large each text chunk is in number of characters.
Larger chunks yield more context per chunk, but they may include irrelevant information. Smaller chunks yield more precise semantic search, but they may lack context.
The default value of `1000` characters is a good starting point that balances these considerations.

**Chunk overlap** controls the number of characters that are shared across chunk boundaries.
Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents or when optimization is most important.
The default of `200` characters of overlap with a chunk size of `1000` (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase it to 40% for more complex documents.
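
The interaction between chunk size and overlap is easiest to see in code. This is an illustrative sketch of character-based splitting, not OpenRAG's actual splitter:

```python
# Illustrative sketch of character chunking with overlap: each chunk
# starts chunk_size - overlap characters after the previous one, so
# neighboring chunks share `overlap` characters.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900, 100]
```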

**Table Structure** enables Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/) table parsing. Instead of treating tables as plain text, Docling outputs tables as structured table data with preserved relationships and metadata. **Table Structure** is enabled by default.

**OCR** enables or disables OCR processing when extracting text from images and scanned documents.
OCR is disabled by default. The default is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/); images are ignored and not processed.

Enable OCR when you are processing scanned documents or documents containing images with text that requires extraction. Enabling OCR can slow ingestion performance.

If OpenRAG detects that the local machine is running macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).

**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
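
OpenRAG applies these settings for you through Docling Serve, but it can help to see roughly how they map onto Docling's Python API. The following sketch is based on Docling's documented pipeline options; option availability varies by Docling version:

```python
# Rough mapping of the settings above onto Docling's pipeline options.
# This is a sketch for illustration; OpenRAG configures these through
# Docling Serve rather than calling Docling directly.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True       # Table Structure (default: on)
pipeline_options.do_ocr = True                   # OCR (default: off)
pipeline_options.ocr_options = EasyOcrOptions()  # OpenRAG picks ocrmac on macOS
pipeline_options.do_picture_description = True   # Picture descriptions setting

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("scanned-report.pdf")
print(result.document.export_to_markdown())
```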

## Knowledge ingestion flows

[Flows](https://docs.langflow.org/concepts-overview) in Langflow are functional representations of application workflows, with multiple [component](https://docs.langflow.org/concepts-components) nodes connected as individual steps in a workflow.

The **OpenSearch Ingestion** flow is the default knowledge ingestion flow in OpenRAG: when you click **Add Knowledge** in OpenRAG, you run the OpenSearch Ingestion flow in the background. The flow uses **Docling Serve** to import and process documents.

This flow connects the following components to process and store documents in your knowledge base:

* The [**Docling Serve** component](https://docs.langflow.org/bundles-docling) processes input documents by connecting to your instance of Docling Serve.
* The [**Export DoclingDocument** component](https://docs.langflow.org/components-docling) exports the processed DoclingDocument to Markdown format with the image export mode set to placeholder. This conversion turns the structured document data into a standardized format for further processing.
* Three [**DataFrame Operations** components](https://docs.langflow.org/components-processing#dataframe-operations) sequentially add `filename`, `file_size`, and `mimetype` metadata columns to the document data.
* The [**Split Text** component](https://docs.langflow.org/components-processing#split-text) splits the processed text into chunks with a chunk size of 1000 characters and an overlap of 200 characters.
* Four **Secret Input** components provide secure access to configuration variables: `CONNECTOR_TYPE`, `OWNER`, `OWNER_EMAIL`, and `OWNER_NAME`. These are runtime variables populated from OAuth login.
* The **Create Data** component combines the secret inputs into a structured data object that is associated with the document embeddings.
* The [**Embedding Model** component](https://docs.langflow.org/components-embedding-models) generates vector embeddings using OpenAI's `text-embedding-3-small` model. The embedding model is selected during application onboarding and cannot be changed.
* The [**OpenSearch** component](https://docs.langflow.org/bundles-elastic#opensearch) stores the processed documents and their embeddings in the `documents` index at `https://opensearch:9200`, as sketched after this list. By default, the component authenticates with a JWT token, but you can also select `basic` auth mode and enter your OpenSearch admin username and password.
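
To make the end of the flow concrete, here is a hypothetical sketch of indexing one processed chunk with the `opensearch-py` client. The field names are inferred from the flow description above; OpenRAG's actual document schema may differ:

```python
# Hypothetical sketch: index one processed chunk into the `documents`
# index with opensearch-py. Field names are inferred from the flow
# description; OpenRAG's actual schema may differ.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=["https://opensearch:9200"],
    http_auth=("admin", "your-admin-password"),  # `basic` auth mode; JWT is the default
    verify_certs=False,
)

client.index(
    index="documents",
    body={
        "text": "First 1000-character chunk of the document...",
        "filename": "report.pdf",
        "file_size": 182304,
        "mimetype": "application/pdf",
        "owner": "user@example.com",
        "embedding": [0.0] * 1536,  # placeholder for the text-embedding-3-small vector
    },
)
```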

<PartialModifyFlows />

### OpenSearch URL Ingestion flow {#url-flow}

The **OpenSearch URL Ingestion** flow is an additional knowledge ingestion flow included in OpenRAG. It is used as an MCP tool by the [**Open Search Agent flow**](/agents#flow):
the agent calls this flow to fetch web content, and the results are ingested into OpenSearch.
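
Although the agent normally triggers this flow as an MCP tool, any Langflow flow can also be run directly through the Langflow run API. In this hypothetical sketch, the flow ID, port, and API key are placeholders:

```python
# Hypothetical sketch: trigger a flow through Langflow's run API.
# The flow ID, port, and API key below are placeholders.
import json
import urllib.request

request = urllib.request.Request(
    "http://localhost:7860/api/v1/run/your-flow-id",
    data=json.dumps({"input_value": "https://example.com/article"}).encode("utf-8"),
    headers={"Content-Type": "application/json", "x-api-key": "your-api-key"},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))
```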

For more on using MCP clients in Langflow, see [MCP clients](https://docs.langflow.org/mcp-client).\
To connect additional MCP servers to the MCP client, see [Connect to MCP servers from your application](https://docs.langflow.org/mcp-tutorial).

## Use OpenRAG default ingestion instead of Docling Serve

If you want to use OpenRAG's built-in pipeline instead of Docling Serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/reference/configuration#document-processing).

The built-in pipeline still uses the Docling processor, but it calls Docling directly without the Docling Serve API.

For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).

## Performance expectations

On a local VM with 7 vCPUs and 8 GiB of RAM, OpenRAG ingested approximately 5.03 GB across 1,082 files in about 42 minutes.
This equates to approximately 2.3 seconds per document, or about 0.43 documents per second (1,082 files / 2,535 seconds).

You can generally expect equal or better performance on developer laptops, and significantly faster ingestion on servers.
Throughput scales with CPU cores, memory, storage speed, and configuration choices such as the embedding model, chunk size and overlap, and concurrency.

This test returned 12 errors (approximately 1.1%).
All errors were file-specific, and they didn't stop the pipeline.

Ingestion dataset:

* Total files: 1,083 items mounted
* Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)

Hardware specifications:

* Machine: Apple M4 Pro
* Podman VM:
    * Name: `podman-machine-default`
    * Type: `applehv`
    * vCPUs: 7
    * Memory: 8 GiB
    * Disk size: 100 GiB

Test results:

```text
2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
...
2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
```

Elapsed time: ~42 minutes 15 seconds (2,535 seconds)

Throughput: ~0.43 documents/second (~2.3 seconds per document)