Docling in OpenRAG
OpenRAG uses Docling for document ingestion.
More specifically, OpenRAG uses Docling Serve, which starts a docling serve process on your local machine and runs Docling ingestion through an API service.
Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch documents index.
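To illustrate the result of ingestion, each chunk could be stored as a structured document in the OpenSearch index. The field names below are hypothetical, for illustration only; they are not OpenRAG's actual index schema:

```python
# Hypothetical shape of one ingested chunk document.
# Field names are illustrative; OpenRAG's actual mapping may differ.
chunk_doc = {
    "filename": "report.pdf",        # source file on the local machine or connector
    "chunk_index": 3,                # position of this chunk within the document
    "text": "…chunk text extracted by Docling…",
    "embedding": [0.012, -0.094, 0.051],  # truncated vector from the embedding model
}
```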
OpenRAG chose Docling for its support for a wide variety of file formats, its high performance, and its advanced understanding of tables and images.
To modify OpenRAG's ingestion settings, including the Docling settings and ingestion flows, click Settings.
Knowledge ingestion settings
These settings configure the Docling ingestion parameters.
OpenRAG will warn you if docling serve is not running.
To start or stop docling serve or any other native services, in the TUI main menu, click Start Native Services or Stop Native Services.
Embedding model determines which AI model is used to create vector embeddings. The default is the OpenAI text-embedding-3-small model.
Chunk size determines how large each text chunk is in number of characters.
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
The default value of 1000 characters provides a good starting point that balances these considerations.
Chunk overlap controls the number of characters that overlap over chunk boundaries. Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents, or when optimization is most important. The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase to 40% for more complex documents.
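The interaction of chunk size and chunk overlap can be sketched in a few lines. This is not OpenRAG's implementation, just a minimal character-based chunker showing how a 1000-character chunk size with 200 characters of overlap (20%) walks through a document:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlapping boundaries.

    Each chunk starts (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters of context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults, a 2,500-character document yields four chunks, and each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.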
Table Structure enables Docling's DocumentConverter tool for parsing tables. Instead of treating tables as plain text, tables are output as structured table data with preserved relationships and metadata. Table Structure is enabled by default.
OCR enables or disables OCR processing when extracting text from images and scanned documents.
OCR is disabled by default. Leaving OCR disabled is best suited for processing text-based documents as quickly as possible with Docling's DocumentConverter; images are ignored and not processed.
Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents. Enabling OCR can slow ingestion performance.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the ocrmac OCR engine. Other platforms use easyocr.
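The platform-based engine selection described above can be sketched as follows. This is an illustrative helper, not OpenRAG's code; only the platform-to-engine mapping comes from the documentation:

```python
import sys


def select_ocr_engine() -> str:
    """Pick the OCR engine by platform, mirroring the behavior described above:
    ocrmac on macOS, easyocr everywhere else."""
    return "ocrmac" if sys.platform == "darwin" else "easyocr"
```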
Picture descriptions adds image descriptions generated by the SmolVLM-256M-Instruct model to OCR processing. Enabling picture descriptions can slow ingestion performance.
Use OpenRAG default ingestion instead of Docling Serve
If you want to use OpenRAG's built-in pipeline instead of Docling Serve, set DISABLE_INGEST_WITH_LANGFLOW=true in Environment variables.
The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
For more information, see processors.py in the OpenRAG repository.
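As a quick sketch, setting the variable in a shell environment or `.env` file might look like this (the variable name is from the documentation above; where you place it depends on how you run OpenRAG):

```shell
# Switch from Docling Serve to OpenRAG's built-in ingestion pipeline.
export DISABLE_INGEST_WITH_LANGFLOW=true
```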
Knowledge ingestion flows
Flows in Langflow are functional representations of application workflows, with multiple component nodes connected as single steps in a workflow.
The OpenSearch Ingestion flow is the default knowledge ingestion flow in OpenRAG: when you Add Knowledge in OpenRAG, you run the OpenSearch Ingestion flow in the background. The flow ingests documents using Docling Serve to import and process documents.
OpenRAG's visual editor is based on the Open Search Agent flow. The agent calls this component to fetch web content, and the results are ingested into OpenSearch. For more on using MCP clients in Langflow, see MCP clients.
To connect additional MCP servers to the MCP client, see Connect to MCP servers from your application.
