openrag/docs/docs/core-components/ingestion.mdx
2026-01-05 12:31:20 -08:00

281 lines
No EOL
15 KiB
Text

---
title: Ingest knowledge
slug: /ingestion
---
import Icon from "@site/src/components/icon/icon";
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import PartialTempKnowledge from '@site/docs/_partial-temp-knowledge.mdx';
import PartialIngestionFlow from '@site/docs/_partial-ingestion-flow.mdx';
import PartialDockerComposeUp from '@site/docs/_partial-docker-compose-up.mdx';
import PartialDockerStopAll from '@site/docs/_partial-docker-stop-all.mdx';
Upload documents to your [OpenRAG OpenSearch instance](/knowledge) to populate your knowledge base with unique content, such as your own company documents, research papers, or websites.
Documents are processed through OpenRAG's knowledge ingestion flows with Docling.
OpenRAG can ingest knowledge from direct file uploads, URLs, and OAuth authenticated connectors.
Knowledge ingestion is powered by OpenRAG's built-in knowledge ingestion flows that use Docling to process documents before storing the documents in your OpenSearch database.
During ingestion, documents are broken into smaller chunks of content that are then embedded using your selected [embedding model](/knowledge#set-the-embedding-model-and-dimensions).
Then, the chunks, embeddings, and associated metadata (which connects chunks of the same document) are stored in your OpenSearch database.
To modify chunking behavior and other ingestion settings, see [Knowledge ingestion settings](/knowledge#knowledge-ingestion-settings) and [Inspect and modify flows](/agents#inspect-and-modify-flows).
## Ingest local files and folders
You can upload files and folders from your local machine to your knowledge base:
1. Click <Icon name="Library" aria-hidden="true"/> **Knowledge** to view your OpenSearch knowledge base.
2. Click **Add Knowledge** to add your own documents to your OpenRAG knowledge base.
3. To upload one file, click <Icon name="File" aria-hidden="true"/> **File**. To upload all documents in a folder, click <Icon name="Folder" aria-hidden="true"/> **Folder**.
The default path is `~/.openrag/documents`.
To change this path, see [Set the local documents path](/knowledge#set-the-local-documents-path).
The selected files are processed in the background through the **OpenSearch Ingestion** flow.
<PartialIngestionFlow />
You can [monitor ingestion](#monitor-ingestion) to see the progress of the uploads and check for failed uploads.
## Ingest local files temporarily
<PartialTempKnowledge />
## Ingest files with OAuth connectors {#oauth-ingestion}
OpenRAG can use OAuth authenticated connectors to ingest documents from the following external services:
* AWS S3
* Google Drive
* Microsoft OneDrive
* Microsoft Sharepoint
These connectors enable seamless ingestion of files from cloud storage to your OpenRAG knowledge base.
Individual users can connect their personal cloud storage accounts to OpenRAG. Each user must separately authorize OpenRAG to access their own cloud storage. When a user connects a cloud storage service, they are redirected to authenticate with that service provider and grant OpenRAG permission to sync documents from their personal cloud storage.
### Enable OAuth connectors
Before users can connect their own cloud storage accounts, you must configure the provider's OAuth credentials in OpenRAG. Typically, this requires that you register OpenRAG as an OAuth application in your cloud provider, and then obtain the app's OAuth credentials, such as a client ID and secret key.
To enable multiple connectors, you must register an app and generate credentials for each provider.
<Tabs>
<TabItem value="TUI" label="TUI-managed services" default>
If you use the [Terminal User Interface (TUI)](/tui) to manage your OpenRAG services, enter OAuth credentials on the **Advanced Setup** page.
You can do this during [installation](/install#setup), or you can add the credentials afterwards:
1. If OpenRAG is running, click **Stop All Services** in the TUI.
2. Open the **Advanced Setup** page, and then add the OAuth credentials for the cloud storage providers that you want to use under **API Keys**:
* **Google**: Provide your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials). For more information, see the [Google OAuth client documentation](https://developers.google.com/identity/protocols/oauth2).
* **Microsoft**: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, provide [Azure application registration credentials for SharePoint and OneDrive](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/app-registration?view=odsp-graph-online). For more information, see the [Microsoft Graph OAuth client documentation](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/graph-oauth).
* **Amazon**: Provide your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on [Configuring access to AWS applications](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-applications.html).
3. Register the redirect URIs shown in the TUI in your OAuth provider.
These are the URLs your OAuth provider will use to redirect users back to OpenRAG after they sign in.
4. Click **Save Configuration** to add the OAuth credentials to your [OpenRAG `.env` file](/reference/configuration).
5. Click **Start Services** to restart the OpenRAG containers with OAuth enabled.
6. Launch the OpenRAG app.
You should be prompted to sign in to your OAuth provider before being redirected to your OpenRAG instance.
</TabItem>
<TabItem value="env" label="Self-managed services">
If you [installed OpenRAG with self-managed services](/docker), set OAuth credentials in your [OpenRAG `.env` file](/reference/configuration).
You can do this during [initial set up](/docker#setup), or you can add the credentials afterwards:
1. Stop all OpenRAG containers:
<PartialDockerStopAll />
2. Edit your OpenRAG `.env` file to add the OAuth credentials for the cloud storage providers that you want to use:
* **Google**: Provide your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials). For more information, see the [Google OAuth client documentation](https://developers.google.com/identity/protocols/oauth2).
```env
GOOGLE_OAUTH_CLIENT_ID=
GOOGLE_OAUTH_CLIENT_SECRET=
```
* **Microsoft**: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, provide [Azure application registration credentials for SharePoint and OneDrive](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/app-registration?view=odsp-graph-online). For more information, see the [Microsoft Graph OAuth client documentation](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/graph-oauth).
```env
MICROSOFT_GRAPH_OAUTH_CLIENT_ID=
MICROSOFT_GRAPH_OAUTH_CLIENT_SECRET=
```
* **Amazon**: Provide your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on [Configuring access to AWS applications](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-applications.html).
```env
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
```
3. Save the `.env` file.
4. Restart your OpenRAG containers:
<PartialDockerComposeUp />
5. Access the OpenRAG frontend at `http://localhost:3000`.
You should be prompted to sign in to your OAuth provider before being redirected to your OpenRAG instance.
</TabItem>
</Tabs>
### Authenticate and ingest files from cloud storage
After you start OpenRAG with OAuth connectors enabled, each user is prompted to authenticate with the OAuth provider upon accessing your OpenRAG instance.
Individual authentication is required to access a user's cloud storage from your OpenRAG instance.
For example, if a user navigates to the default OpenRAG URL at `http://localhost:3000`, they are redirected to the OAuth provider's sign-in page.
After authenticating and granting the required permissions for OpenRAG, the user is redirected back to OpenRAG.
To ingest knowledge with an OAuth connector, do the following:
1. Click <Icon name="Library" aria-hidden="true"/> **Knowledge** to view your OpenSearch knowledge base.
2. Click **Add Knowledge**, and then select a storage provider.
3. On the **Add Cloud Knowledge** page, click **Add Files**, and then select the files and folders to ingest from the connected storage.
4. Click **Ingest Files**.
The selected files are processed in the background through the **OpenSearch Ingestion** flow.
<PartialIngestionFlow />
You can [monitor ingestion](#monitor-ingestion) to see the progress of the uploads and check for failed uploads.
## Ingest knowledge from URLs {#url-flow}
When using the OpenRAG chat, you can enter URLs into the chat to be ingested in real-time during your conversation.
:::tip
Use [UTF-8 encoding](https://www.w3schools.com/tags/ref_urlencode.ASP) for URLs with special characters other than the standard slash, period, and colon characters.
For example, use `https://en.wikipedia.org/wiki/Caf%C3%A9` instead of `https://en.wikipedia.org/wiki/Café` or `https://en.wikipedia.org/wiki/Coffee%5Fculture` instead of `https://en.wikipedia.org/wiki/Coffee_culture`.
:::
The **OpenSearch URL Ingestion** flow is used to ingest web content from URLs.
This flow isn't directly accessible from the OpenRAG user interface.
Instead, this flow is called by the [**OpenRAG OpenSearch Agent** flow](/chat#flow) as a Model Context Protocol (MCP) tool.
The agent can call this component to fetch web content from a given URL, and then ingest that content into your OpenSearch knowledge base.
Like all OpenRAG flows, you can [inspect the flow in Langflow](/agents#inspect-and-modify-flows), and you can customize it.
For more information about MCP in Langflow, see the Langflow documentation on [MCP clients](https://docs.langflow.org/mcp-client) and [MCP servers](https://docs.langflow.org/mcp-tutorial).
## Monitor ingestion
Document ingestion tasks run in the background.
In the OpenRAG user interface, a badge is shown on <Icon name="Bell" aria-hidden="true"/> **Tasks** when OpenRAG tasks are active.
Click <Icon name="Bell" aria-hidden="true"/> **Tasks** to inspect and cancel tasks:
* **Active Tasks**: All tasks that are **Pending**, **Running**, or **Processing**.
For each active task, depending on its state, you can find the task ID, start time, duration, number of files processed, and the total files enqueued for processing.
* **Pending**: The task is queued and waiting to start.
* **Running**: The task is actively processing files.
* **Processing**: The task is performing ingestion operations.
* **Failed**: Something went wrong during ingestion, or the task was manually canceled.
For troubleshooting advice, see [Troubleshoot ingestion](#troubleshoot-ingestion).
To stop an active task, click <Icon name="X" aria-hidden="true"/> **Cancel**. Canceling a task stops processing immediately and marks the task as **Failed**.
### Ingestion performance expectations
The following performance test was conducted with Docling Serve.
On a local VM with 7 vCPUs and 8 GiB RAM, OpenRAG ingested approximately 5.03 GB across 1,083 files in about 42 minutes.
This equates to approximately 2.4 documents per second.
You can generally expect equal or better performance on developer laptops, and significantly faster performance on servers.
Throughput scales with CPU cores, memory, storage speed, and configuration choices, such as the embedding model, chunk size, overlap, and concurrency.
This test returned 12 error, approximately 1.1 percent of the total files ingested.
All errors were file-specific, and they didn't stop the pipeline.
<details>
<summary>Ingestion performance test details</summary>
* Ingestion dataset:
* Total files: 1,083 items mounted
* Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)
* Hardware specifications:
* Machine: Apple M4 Pro
* Podman VM:
* Name: podman-machine-default
* Type: applehv
* vCPUs: 7
* Memory: 8 GiB
* Disk size: 100 GiB
* Test results:
```text
2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
...
2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
```
* Elapsed time: Approximately 42 minutes 15 seconds (2,535 seconds)
* Throughput: Approximately 2.4 documents per second
</details>
## Troubleshoot ingestion {#troubleshoot-ingestion}
The following issues can occur during document ingestion.
### Failed or slow ingestion
If an ingestion task fails, do the following:
* Make sure you are uploading supported file types.
* Split excessively large files into smaller files before uploading.
* Remove unusual embedded content, such as videos or animations, before uploading. Although Docling can replace some non-text content with placeholders during ingestion, some embedded content might cause errors.
* Make sure your Podman/Docker VM has sufficient memory for the ingestion tasks.
The minimum recommendation is 8 GB of RAM.
If you regularly upload large files, more RAM is recommended.
For more information, see [Memory issue with Podman on macOS](/support/troubleshoot#memory-issue-with-podman-on-macos) and [Container out of memory errors](/support/troubleshoot#container-out-of-memory-errors).
* If OCR ingestion fails due to OCR missing, see [OCR ingestion fails (easyocr not installed)](/support/troubleshoot#ocr-ingestion-fails-easyocr-not-installed).
### Problems when referencing documents in chat
If the OpenRAG **Chat** doesn't seem to use your documents correctly, [browse your knowledge base](/knowledge#browse-knowledge) to confirm that the documents are uploaded in full, and the chunks are correct.
If the documents are present and well-formed, check your [knowledge filters](/knowledge-filters).
If a global filter is applied, make sure the expected documents are included in the global filter.
If the global filter excludes any documents, the agent cannot access those documents unless you apply a chat-level filter or change the global filter.
If text is missing or incorrectly processed, you need to reupload the documents after modifying the ingestion parameters or the documents themselves.
For example:
* Break combined documents into separate files for better metadata context.
* Make sure scanned documents are legible enough for extraction, and enable the **OCR** option. Poorly scanned documents might require additional preparation or rescanning before ingestion.
* Adjust the **Chunk Size** and **Chunk Overlap** settings to better suit your documents. Larger chunks provide more context but can include irrelevant information, while smaller chunks yield more precise semantic search but can lack context.
For more information about modifying ingestion parameters and flows, see [Knowledge ingestion settings](/knowledge#knowledge-ingestion-settings).
## See also
* [Configure knowledge](/knowledge)
* [Filter knowledge](/knowledge-filters)
* [Chat with knowledge](/chat)
* [Inspect and modify flows](/agents#inspect-and-modify-flows)