From 37bd1880b36e46895e593c83ac9e1f0d993aabb0 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Wed, 1 Oct 2025 10:41:34 -0400
Subject: [PATCH] openrag-defaults

---
 docs/docs/core-components/ingestion.mdx | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/docs/docs/core-components/ingestion.mdx b/docs/docs/core-components/ingestion.mdx
index f4820bf7..cb1c28be 100644
--- a/docs/docs/core-components/ingestion.mdx
+++ b/docs/docs/core-components/ingestion.mdx
@@ -22,7 +22,7 @@ These settings configure the Docling ingestion parameters, from using no OCR to
 OpenRAG will warn you if `docling-serve` is not running. To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
-**Embedding model** determines which AI model is used to create vector embeddings. The default is
+**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`.
 ` **Chunk size** determines how large each text chunk is in number of characters. Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
@@ -35,22 +35,16 @@ The default value of 200 characters of overlap with a chunk size of 1000 (20% ov
 **OCR** enables or disabled OCR processing when extracting text from images and scanned documents. OCR is disabled by default. This setting is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/). Images are ignored and not processed.
-Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents.
+Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents. Enabling OCR can slow ingestion performance.
 If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
-**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing.
+**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
-**VLM (Vision Language Model)** enables or disables VLM processing.
-VLM processing is used _instead of_ OCR processing.
-It uses an LLM to understand a document's structure and return text in a structured `doctags` format.
-For more information, see [Vision models](https://docling-project.github.io/docling/usage/vision_models/).
+## Use OpenRAG default ingestion instead of Docling serve
-Enable a VLM when you are processing complex documents containing a mixture of text, images, tables, and charts.
+If you want to use OpenRAG's built-in pipeline instead of Docling serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/configure/configuration#ingestion-configuration).
-If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [SmolDocling-256M-preview-mlx-bf16](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16) VLM, which includes the [MLX framework](https://ml-explore.github.io/mlx/build/html/index.html) for Apple silicon.
-Other platforms use [SmolDocling-256M-preview](https://huggingface.co/ds4sd/SmolDocling-256M-preview).
+The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
-## Use OpenRAG default ingestion instead of Docling
-
-If you want to use OpenRAG's built in pipeline instead of Docling, set `DISABLE_INGEST_WITH_LANGFLOW=true`.
\ No newline at end of file
+For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).
\ No newline at end of file
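
The chunk size and overlap settings that appear in this patch's context lines (a chunk size of 1000 characters with 200 characters of overlap) can be illustrated with a minimal sketch. This is a plain sliding-window splitter written for illustration only. The function name `chunk_text` and its parameters are hypothetical, and nothing here is taken from OpenRAG's or Docling's actual implementation, which may split on different boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows that share `chunk_overlap` characters.

    Hypothetical helper: the defaults mirror the values quoted in the docs
    (1000 characters per chunk, 200 characters of overlap), but the real
    splitter used by OpenRAG is not shown in this patch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2500))
    for index, chunk in enumerate(chunk_text(sample)):
        print(index, len(chunk))
```

With these defaults each window advances 800 characters, so consecutive chunks repeat the last 200 characters of the previous one. Raising the chunk size gives each chunk more context, and lowering it gives more precise matches, which is the trade-off the documentation describes.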