From f6bb375860fc821af56ac278939e5d34e5c89300 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Tue, 30 Sep 2025 09:51:42 -0400
Subject: [PATCH 1/8] init
---
docs/docs/core-components/ingestion.mdx | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
create mode 100644 docs/docs/core-components/ingestion.mdx
diff --git a/docs/docs/core-components/ingestion.mdx b/docs/docs/core-components/ingestion.mdx
new file mode 100644
index 00000000..d240d53e
--- /dev/null
+++ b/docs/docs/core-components/ingestion.mdx
@@ -0,0 +1,23 @@
+---
+title: Docling Ingestion
+slug: /ingestion
+---
+
+import Icon from "@site/src/components/icon/icon";
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+import PartialModifyFlows from '@site/docs/_partial-modify-flows.mdx';
+
+OpenRAG uses [Docling](https://docling-project.github.io/docling/) for its document ingestion pipeline.
+More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling-serve` process on your local machine and runs Docling ingestion through an API service.
+
+OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.
+
+## Docling ingestion settings
+
+These settings control the Docling ingestion parameters.
+
+OpenRAG will warn you if `docling-serve` is not running.
+To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
+
+## Use OpenRAG default ingestion instead of Docling
\ No newline at end of file
From 13e30c1b7408102cc28b6b43a6eb776e86d30b87 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Tue, 30 Sep 2025 14:54:22 -0400
Subject: [PATCH 2/8] ingestion-settings
---
docs/docs/core-components/ingestion.mdx | 37 +++++++++++++++++++++++--
1 file changed, 35 insertions(+), 2 deletions(-)
diff --git a/docs/docs/core-components/ingestion.mdx b/docs/docs/core-components/ingestion.mdx
index d240d53e..f4820bf7 100644
--- a/docs/docs/core-components/ingestion.mdx
+++ b/docs/docs/core-components/ingestion.mdx
@@ -11,13 +11,46 @@ import PartialModifyFlows from '@site/docs/_partial-modify-flows.mdx';
OpenRAG uses [Docling](https://docling-project.github.io/docling/) for its document ingestion pipeline.
More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling-serve` process on your local machine and runs Docling ingestion through an API service.
+Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch `documents` index.
+
OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.
## Docling ingestion settings
-These settings control the Docling ingestion parameters.
+These settings configure the Docling ingestion parameters, from using no OCR to using advanced vision language models.
OpenRAG will warn you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
-## Use OpenRAG default ingestion instead of Docling
\ No newline at end of file
+**Embedding model** determines which AI model is used to create vector embeddings. The default is
+
+**Chunk size** determines how large each text chunk is in number of characters.
+Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
+The default value of `1000` characters provides a good starting point that balances these considerations.
+
+**Chunk overlap** controls the number of characters shared between consecutive chunks.
+Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents, or when optimization is most important.
+The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase to 40% for more complex documents.
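The interaction of chunk size and overlap can be sketched in a few lines; `chunk_text` below is a hypothetical illustration of character-based chunking with overlap, not OpenRAG's actual chunking code.

```python
# Hypothetical sketch of character-based chunking with overlap;
# not OpenRAG's actual implementation.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of chunk_size characters, where each chunk
    repeats the last `overlap` characters of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # 800 with the defaults (20% overlap)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 2500)
print(len(chunks))     # 4
print(len(chunks[0]))  # 1000
```

With the defaults, a 2,500-character document yields four chunks, and each chunk after the first repeats the final 200 characters of its predecessor.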
+
+**OCR** enables or disables OCR processing when extracting text from images and scanned documents.
+OCR is disabled by default, which suits processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/); images are ignored and not processed.
+
+Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents.
+
+If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
+
+**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing.
+
+**VLM (Vision Language Model)** enables or disables VLM processing.
+VLM processing is used _instead of_ OCR processing.
+It uses an LLM to understand a document's structure and return text in a structured `doctags` format.
+For more information, see [Vision models](https://docling-project.github.io/docling/usage/vision_models/).
+
+Enable a VLM when you are processing complex documents containing a mixture of text, images, tables, and charts.
+
+If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [SmolDocling-256M-preview-mlx-bf16](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16) VLM, which includes the [MLX framework](https://ml-explore.github.io/mlx/build/html/index.html) for Apple silicon.
+Other platforms use [SmolDocling-256M-preview](https://huggingface.co/ds4sd/SmolDocling-256M-preview).
+
+## Use OpenRAG default ingestion instead of Docling
+
+If you want to use OpenRAG's built in pipeline instead of Docling, set `DISABLE_INGEST_WITH_LANGFLOW=true`.
\ No newline at end of file
From 3fdbac561a30bf002444130f2b159444d7bc6f34 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Tue, 30 Sep 2025 14:59:26 -0400
Subject: [PATCH 3/8] sidebars
---
docs/sidebars.js | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 3048cb70..affab754 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -60,6 +60,11 @@ const sidebars = {
type: "doc",
id: "core-components/knowledge",
label: "OpenSearch Knowledge"
+ },
+ {
+ type: "doc",
+ id: "core-components/ingestion",
+ label: "Docling Ingestion"
}
],
},
From e902ad59019410bc9bec160d1345dd4812ac3799 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Tue, 30 Sep 2025 15:03:43 -0400
Subject: [PATCH 4/8] docs-link-to-page
---
docs/docs/core-components/knowledge.mdx | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/docs/docs/core-components/knowledge.mdx b/docs/docs/core-components/knowledge.mdx
index 2ea5ef9f..852991a7 100644
--- a/docs/docs/core-components/knowledge.mdx
+++ b/docs/docs/core-components/knowledge.mdx
@@ -97,6 +97,10 @@ You can monitor the sync progress in the
Once processing is complete, the synced documents become available in your knowledge base and can be searched through the chat interface or Knowledge page.
+### Knowledge ingestion settings
+
+To configure the knowledge ingestion pipeline parameters, see [Docling Ingestion](/ingestion).
+
## Create knowledge filters
OpenRAG includes a knowledge filter system for organizing and managing document collections.
From 6773ff85cca5b8093c65644dff7a6f4a4d7cb170 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Wed, 1 Oct 2025 10:04:34 -0400
Subject: [PATCH 5/8] docs-setup-versioning-but-dont-enable
---
docs/VERSIONING_SETUP.md | 111 ++++++++++++++++++++++++++++++++++++++
docs/docusaurus.config.js | 10 ++++
2 files changed, 121 insertions(+)
create mode 100644 docs/VERSIONING_SETUP.md
diff --git a/docs/VERSIONING_SETUP.md b/docs/VERSIONING_SETUP.md
new file mode 100644
index 00000000..17a8c8f6
--- /dev/null
+++ b/docs/VERSIONING_SETUP.md
@@ -0,0 +1,111 @@
+# Docusaurus versioning setup
+
+Docs versioning is currently **DISABLED** but configured and ready to enable.
+The configuration is found in `docusaurus.config.js` with commented-out sections.
+
+To enable versioning, do the following:
+
+1. Open `docusaurus.config.js`
+2. Find the versioning configuration section (around line 57)
+3. Uncomment the versioning configuration:
+
+```javascript
+docs: {
+ // ... other config
+ lastVersion: 'current', // Use 'current' to make ./docs the latest version
+ versions: {
+ current: {
+ label: 'Next (unreleased)',
+ path: 'next',
+ },
+ },
+ onlyIncludeVersions: ['current'], // Limit versions for faster builds
+},
+```
+
+## Create docs versions
+
+See the [Docusaurus docs](https://docusaurus.io/docs/versioning) for more info.
+
+1. Use the Docusaurus CLI command to create a version.
+You can use `yarn` instead of `npm`.
+```bash
+# Create version 1.0.0 from current docs
+npm run docusaurus docs:version 1.0.0
+```
+
+This command will:
+- Copy the full `docs/` folder contents into `versioned_docs/version-1.0.0/`
+- Create a versioned sidebar file at `versioned_sidebars/version-1.0.0-sidebars.json`
+- Append the new version to `versions.json`
+
+2. After creating a version, update the Docusaurus configuration to include multiple versions.
+`lastVersion: '1.0.0'` makes the '1.0.0' release the `latest` version.
+`current` is the work-in-progress docset, accessible at `/docs/next`.
+To remove a version, remove it from `onlyIncludeVersions`.
+
+```javascript
+docs: {
+ // ... other config
+ lastVersion: '1.0.0', // Make 1.0.0 the latest version
+ versions: {
+ current: {
+ label: 'Next (unreleased)',
+ path: 'next',
+ },
+ '1.0.0': {
+ label: '1.0.0',
+ path: '1.0.0',
+ },
+ },
+ onlyIncludeVersions: ['current', '1.0.0'], // Include both versions
+},
+```
+
+3. Test the deployment locally.
+
+```bash
+npm run build
+npm run serve
+```
+
+4. To add subsequent versions, repeat the process: first run the CLI command, then update `docusaurus.config.js`.
+
+```bash
+# Create version 2.0.0 from current docs
+npm run docusaurus docs:version 2.0.0
+```
+
+After creating a new version, update `docusaurus.config.js`.
+
+```javascript
+docs: {
+ lastVersion: '2.0.0', // Make 2.0.0 the latest version
+ versions: {
+ current: {
+ label: 'Next (unreleased)',
+ path: 'next',
+ },
+ '2.0.0': {
+ label: '2.0.0',
+ path: '2.0.0',
+ },
+ '1.0.0': {
+ label: '1.0.0',
+ path: '1.0.0',
+ },
+ },
+ onlyIncludeVersions: ['current', '2.0.0', '1.0.0'], // Include all versions
+},
+```
+
+## Disable versioning
+
+1. Remove the `versions` configuration from `docusaurus.config.js`.
+2. Delete the `docs/versioned_docs/` and `docs/versioned_sidebars/` directories.
+3. Delete `docs/versions.json`.
+
+## References
+
+- [Official Docusaurus Versioning Documentation](https://docusaurus.io/docs/versioning)
+- [Docusaurus Versioning Best Practices](https://docusaurus.io/docs/versioning#recommended-practices)
\ No newline at end of file
diff --git a/docs/docusaurus.config.js b/docs/docusaurus.config.js
index c4175c09..92a3d6dd 100644
--- a/docs/docusaurus.config.js
+++ b/docs/docusaurus.config.js
@@ -53,6 +53,16 @@ const config = {
editUrl:
'https://github.com/openrag/openrag/tree/main/docs/',
routeBasePath: '/',
+ // Versioning configuration - see VERSIONING_SETUP.md
+ // To enable versioning, uncomment the following lines:
+ // lastVersion: 'current',
+ // versions: {
+ // current: {
+ // label: 'Next (unreleased)',
+ // path: 'next',
+ // },
+ // },
+ // onlyIncludeVersions: ['current'],
},
theme: {
customCss: './src/css/custom.css',
From 37bd1880b36e46895e593c83ac9e1f0d993aabb0 Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Wed, 1 Oct 2025 10:41:34 -0400
Subject: [PATCH 6/8] openrag-defaults
---
docs/docs/core-components/ingestion.mdx | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/docs/docs/core-components/ingestion.mdx b/docs/docs/core-components/ingestion.mdx
index f4820bf7..cb1c28be 100644
--- a/docs/docs/core-components/ingestion.mdx
+++ b/docs/docs/core-components/ingestion.mdx
@@ -22,7 +22,7 @@ These settings configure the Docling ingestion parameters, from using no OCR to
OpenRAG will warn you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
-**Embedding model** determines which AI model is used to create vector embeddings. The default is
+**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`. `
**Chunk size** determines how large each text chunk is in number of characters.
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
@@ -35,22 +35,16 @@ The default value of 200 characters of overlap with a chunk size of 1000 (20% ov
**OCR** enables or disables OCR processing when extracting text from images and scanned documents.
OCR is disabled by default, which suits processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/); images are ignored and not processed.
-Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents.
+Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents. Enabling OCR can slow ingestion performance.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
-**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing.
+**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
-**VLM (Vision Language Model)** enables or disables VLM processing.
-VLM processing is used _instead of_ OCR processing.
-It uses an LLM to understand a document's structure and return text in a structured `doctags` format.
-For more information, see [Vision models](https://docling-project.github.io/docling/usage/vision_models/).
+## Use OpenRAG default ingestion instead of Docling Serve
-Enable a VLM when you are processing complex documents containing a mixture of text, images, tables, and charts.
+If you want to use OpenRAG's built-in pipeline instead of Docling Serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/configure/configuration#ingestion-configuration).
-If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [SmolDocling-256M-preview-mlx-bf16](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16) VLM, which includes the [MLX framework](https://ml-explore.github.io/mlx/build/html/index.html) for Apple silicon.
-Other platforms use [SmolDocling-256M-preview](https://huggingface.co/ds4sd/SmolDocling-256M-preview).
+The built-in pipeline still uses the Docling processor, but calls it directly without the Docling Serve API.
-## Use OpenRAG default ingestion instead of Docling
-
-If you want to use OpenRAG's built in pipeline instead of Docling, set `DISABLE_INGEST_WITH_LANGFLOW=true`.
\ No newline at end of file
+For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).
\ No newline at end of file
From 60c61d6519408c822a5de00318e4eb23dcba78fc Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Wed, 1 Oct 2025 10:51:06 -0400
Subject: [PATCH 7/8] remove-vlm
---
docs/docs/core-components/ingestion.mdx | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/docs/core-components/ingestion.mdx b/docs/docs/core-components/ingestion.mdx
index cb1c28be..9491652d 100644
--- a/docs/docs/core-components/ingestion.mdx
+++ b/docs/docs/core-components/ingestion.mdx
@@ -17,7 +17,7 @@ OpenRAG chose Docling for its support for a wide variety of file formats, high p
## Docling ingestion settings
-These settings configure the Docling ingestion parameters, from using no OCR to using advanced vision language models.
+These settings configure the Docling ingestion parameters.
OpenRAG will warn you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
From 7370ad3bf71107e78b44f8ddc0a0235f3129cdff Mon Sep 17 00:00:00 2001
From: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
Date: Wed, 1 Oct 2025 11:07:40 -0400
Subject: [PATCH 8/8] Apply suggestion from @mfortman11
Co-authored-by: Mike Fortman
---
docs/docs/core-components/ingestion.mdx | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/docs/core-components/ingestion.mdx b/docs/docs/core-components/ingestion.mdx
index 9491652d..7e5afb20 100644
--- a/docs/docs/core-components/ingestion.mdx
+++ b/docs/docs/core-components/ingestion.mdx
@@ -22,7 +22,7 @@ These settings configure the Docling ingestion parameters.
OpenRAG will warn you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
-**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`. `
+**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`.
**Chunk size** determines how large each text chunk is in number of characters.
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.