---
title: Docling Ingestion
slug: /ingestion
---

import Icon from "@site/src/components/icon/icon";
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import PartialModifyFlows from '@site/docs/_partial-modify-flows.mdx';

OpenRAG uses [Docling](https://docling-project.github.io/docling/) for its document ingestion pipeline.
More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling-serve` process on your local machine and runs Docling ingestion through an API service.

Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch `documents` index.

OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.

## Docling ingestion settings

These settings configure the Docling ingestion parameters.

OpenRAG warns you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native service, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.

**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`.

**Chunk size** sets the size of each text chunk, measured in characters.
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
The default value of `1000` characters provides a good starting point that balances these considerations.

**Chunk overlap** controls the number of characters that adjacent chunks share across chunk boundaries.
Use larger overlap values for documents where preserving context across chunk boundaries is most important, and use smaller overlap values for simpler documents or when pipeline efficiency is the priority.
The default value of `200` characters of overlap with a chunk size of `1000` (20% overlap) is suitable for general use cases. Decrease the overlap to 10% (100 characters) for a more efficient pipeline, or increase it to 40% (400 characters) for more complex documents.
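
To make the interaction between chunk size and chunk overlap concrete, the following is a minimal sketch of a character-based sliding window. It is not OpenRAG's or Docling's actual splitter, and the function names `chunk_text` and `overlap_for_ratio` are hypothetical. It assumes only that each chunk holds at most `chunk_size` characters and repeats the last `chunk_overlap` characters of the previous chunk.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows that share chunk_overlap characters.

    Illustrative only: a plain sliding window, not OpenRAG's real splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and less than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def overlap_for_ratio(chunk_size: int, ratio: float) -> int:
    """Convert an overlap percentage (for example 0.20) into a character count."""
    return round(chunk_size * ratio)
```

With the defaults, a 2,600-character document produces three chunks that start at characters 0, 800, and 1,600. A smaller overlap increases the step between chunks, so the same document yields fewer chunks to embed and index.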

**OCR** enables or disables OCR processing when extracting text from images and scanned documents.
OCR is disabled by default. With OCR disabled, Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/) processes text-based documents as quickly as possible, and images are ignored.

Enable OCR when you are processing scanned documents or documents that contain images with text that must be extracted. Enabling OCR can slow ingestion performance.

If OpenRAG detects that the local machine is running macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
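
The platform rule above can be sketched as follows. How OpenRAG actually detects the operating system is not stated here, so this example assumes the Python standard library's `platform.system()`, which returns `"Darwin"` on macOS. The helper name `select_ocr_engine` is hypothetical.

```python
import platform
from typing import Optional


def select_ocr_engine(system_name: Optional[str] = None) -> str:
    """Return the OCR engine name for a platform: ocrmac on macOS, easyocr elsewhere.

    Illustrative only: the detection method is an assumption, not OpenRAG's code.
    """
    # Fall back to the current machine when no platform name is passed in.
    system = system_name if system_name is not None else platform.system()
    return "ocrmac" if system == "Darwin" else "easyocr"
```

Passing the platform name explicitly, as in `select_ocr_engine("Linux")`, makes the rule easy to check without running on each operating system.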

**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.

## Use OpenRAG default ingestion instead of Docling Serve

If you want to use OpenRAG's built-in pipeline instead of Docling Serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/reference/configuration#document-processing).

The built-in pipeline still uses the Docling processor, but calls it directly instead of going through the Docling Serve API.

For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).