ingestion-settings

This commit is contained in:
Mendon Kissling 2025-09-30 14:54:22 -04:00
parent b1c395462d
commit 13e30c1b74

View file

@ -11,13 +11,46 @@ import PartialModifyFlows from '@site/docs/_partial-modify-flows.mdx';
OpenRAG uses [Docling](https://docling-project.github.io/docling/) for its document ingestion pipeline.
More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling-serve` process on your local machine and runs Docling ingestion through an API service.
Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch `documents` index.
OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.
## Docling ingestion settings
These settings control the Docling ingestion parameters.
These settings configure the Docling ingestion parameters, from using no OCR to using advanced vision language models.
OpenRAG will warn you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
## Use OpenRAG default ingestion instead of Docling
**Embedding model** determines which AI model is used to create vector embeddings. The default is
**Chunk size** determines how large each text chunk is in number of characters.
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
The default value of `1000` characters provides a good starting point that balances these considerations.
**Chunk overlap** controls the number of characters that overlap over chunk boundaries.
Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents, or when optimization is most important.
The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase to 40% for more complex documents.
**OCR** enables or disabled OCR processing when extracting text from images and scanned documents.
OCR is disabled by default. This setting is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/). Images are ignored and not processed.
Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing.
**VLM (Vision Language Model)** enables or disables VLM processing.
VLM processing is used _instead of_ OCR processing.
It uses an LLM to understand a document's structure and return text in a structured `doctags` format.
For more information, see [Vision models](https://docling-project.github.io/docling/usage/vision_models/).
Enable a VLM when you are processing complex documents containing a mixture of text, images, tables, and charts.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [SmolDocling-256M-preview-mlx-bf16](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16) VLM, which includes the [MLX framework](https://ml-explore.github.io/mlx/build/html/index.html) for Apple silicon.
Other platforms use [SmolDocling-256M-preview](https://huggingface.co/ds4sd/SmolDocling-256M-preview).
## Use OpenRAG default ingestion instead of Docling
If you want to use OpenRAG's built in pipeline instead of Docling, set `DISABLE_INGEST_WITH_LANGFLOW=true`.