finish knowledge and oauth

This commit is contained in:
April M 2025-11-26 16:22:59 -08:00
parent ae8638b071
commit 0a83ea2e6c
8 changed files with 332 additions and 194 deletions


@ -0,0 +1,24 @@
<details>
<summary>About the OpenSearch Ingestion flow</summary>
When you upload documents locally or with OAuth connectors, the **OpenSearch Ingestion** flow runs in the background.
By default, this flow uses Docling Serve to import and process documents.
Like all [OpenRAG flows](/agents), you can [inspect the flow in Langflow](/agents#inspect-and-modify-flows), and you can customize it if you want to change the knowledge ingestion settings.
The **OpenSearch Ingestion** flow consists of several components that work together to process and store documents in your knowledge base:
* [**Docling Serve** component](https://docs.langflow.org/bundles-docling#docling-serve): Ingests files and processes them by connecting to OpenRAG's local Docling Serve service. The output is `DoclingDocument` data that contains the extracted text and metadata from the documents.
* [**Export DoclingDocument** component](https://docs.langflow.org/bundles-docling#export-doclingdocument): Exports processed `DoclingDocument` data to Markdown format with image placeholders. This conversion standardizes the document data in preparation for further processing.
* [**DataFrame Operations** component](https://docs.langflow.org/components-processing#dataframe-operations): Three of these components run sequentially to add metadata to the document data: `filename`, `file_size`, and `mimetype`.
* [**Split Text** component](https://docs.langflow.org/components-processing#split-text): Splits the processed text into chunks, based on the configured [chunk size and overlap settings](/knowledge#knowledge-ingestion-settings).
* **Secret Input** component: If needed, four of these components securely fetch the [OAuth authentication](/knowledge#auth) configuration variables: `CONNECTOR_TYPE`, `OWNER`, `OWNER_EMAIL`, and `OWNER_NAME`.
* **Create Data** component: Combines the authentication credentials from the **Secret Input** components into a structured data object that is associated with the document embeddings.
* [**Embedding Model** component](https://docs.langflow.org/components-embedding-models): Generates vector embeddings using your selected [embedding model](/knowledge#set-the-embedding-model-and-dimensions).
* [**OpenSearch** component](https://docs.langflow.org/bundles-elastic#opensearch): Stores the processed documents and their embeddings in a `documents` index of your OpenRAG [OpenSearch knowledge base](/knowledge).
The default address for the OpenSearch instance is `https://opensearch:9200`. To change this address, edit the `OPENSEARCH_PORT` [environment variable](/reference/configuration#opensearch-settings).
The default authentication method is JSON Web Token (JWT) authentication. If you [edit the flow](/agents#inspect-and-modify-flows), you can select `basic` auth mode, which uses the `OPENSEARCH_USERNAME` and `OPENSEARCH_PASSWORD` [environment variables](/reference/configuration#opensearch-settings) for authentication instead of JWT.
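For example, if you switch the component to `basic` auth mode, the relevant settings in your `.env` file might look like the following sketch. The values are placeholders; use the port and credentials configured for your own OpenSearch instance.
```env
# OpenSearch connection settings (placeholder values)
OPENSEARCH_PORT=9200
OPENSEARCH_USERNAME=admin
OPENSEARCH_PASSWORD=your-opensearch-password
```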
</details>


@ -46,7 +46,7 @@ For example, to view and edit the built-in **Chat** flow (the **OpenRAG OpenSear
If you modify the built-in **Chat** flow, make sure you click <Icon name="Plus" aria-hidden="true"/> in the **Conversations** tab to start a new conversation. This ensures that the chat doesn't persist any context from the previous conversation with the original flow settings.
:::
### Revert a built-in flow to its original configuration {#revert-a-built-in-flow-to-its-original-configuration}
After you edit a built-in flow, you can click **Restore flow** on the **Settings** page to revert the flow to its original state when you first installed OpenRAG.
This is a destructive action that discards all customizations to the flow.


@ -7,21 +7,22 @@ import Icon from "@site/src/components/icon/icon";
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import PartialTempKnowledge from '@site/docs/_partial-temp-knowledge.mdx';
import PartialIngestionFlow from '@site/docs/_partial-ingestion-flow.mdx';
Upload documents to your [OpenRAG OpenSearch instance](/knowledge) to populate your knowledge base with unique content, such as your own company documents, research papers, or websites.
Documents are processed through OpenRAG's knowledge ingestion flows with Docling.
OpenRAG can ingest knowledge from direct file uploads, URLs, and OAuth authenticated connectors.
Knowledge ingestion is powered by OpenRAG's built-in knowledge ingestion flows that use Docling to process documents before storing the documents in your OpenSearch database.
During ingestion, documents are broken into smaller chunks of content that are then embedded using your selected [embedding model](/knowledge#set-the-embedding-model-and-dimensions).
Then, the chunks, embeddings, and associated metadata (which connects chunks of the same document) are stored in your OpenSearch database.
Like all [OpenRAG flows](/agents), you can [inspect the flows in Langflow](/agents#inspect-and-modify-flows), and you can customize them if you want to change the knowledge ingestion settings.
To modify chunking behavior and other ingestion settings, see [Knowledge ingestion settings](/knowledge#knowledge-ingestion-settings) and [Inspect and modify flows](/agents#inspect-and-modify-flows).
## Ingest local files and folders
You can upload files and folders from your local machine to your knowledge base:
1. Click <Icon name="Library" aria-hidden="true"/> **Knowledge** to view your OpenSearch knowledge base.
@ -29,32 +30,156 @@ You can upload files and folders from your local machine to your knowledge base.
3. To upload one file, click <Icon name="File" aria-hidden="true"/> **File**. To upload all documents in a folder, click <Icon name="Folder" aria-hidden="true"/> **Folder**.
The default path for both **File** and **Folder** uploads is the `./documents` subdirectory in your OpenRAG installation directory.
To change this path, see [Set the local documents path](/knowledge#set-the-local-documents-path).
The selected files are processed in the background through the **OpenSearch Ingestion** flow.
<PartialIngestionFlow />
You can [monitor ingestion](#monitor-ingestion) to see the progress of the uploads and check for failed uploads.
## Ingest local files temporarily
<PartialTempKnowledge />
## Ingest files with OAuth connectors {#oauth-ingestion}
OpenRAG can use OAuth authenticated connectors to ingest documents from the following external services:
* AWS S3
* Google Drive
* Microsoft OneDrive
* Microsoft SharePoint
These connectors let you ingest files from cloud storage directly into your OpenRAG knowledge base.
Individual users connect their own cloud storage accounts to OpenRAG, and each user must separately authorize OpenRAG to access their storage. When a user connects a cloud storage service, they are redirected to authenticate with that provider and grant OpenRAG permission to sync documents from their cloud storage.
### Enable OAuth connectors
Before users can connect their own cloud storage accounts, you must configure the provider's OAuth credentials in OpenRAG. Typically, this requires that you register OpenRAG as an OAuth application with your cloud provider, and then obtain the app's OAuth credentials, such as a client ID and client secret.
To enable multiple connectors, you must register an app and generate credentials for each provider.
<Tabs>
<TabItem value="TUI" label="TUI Advanced Setup" default>
If you use the TUI to manage your OpenRAG containers, provide OAuth credentials in the **Advanced Setup**.
You can do this during [installation](/install#setup), or you can add the credentials afterwards:
1. If OpenRAG is running, stop it: Go to [**Status**](/install#tui-container-management), and then click **Stop Services**.
2. Click **Advanced Setup**, and then add the OAuth credentials for the cloud storage providers that you want to use:
* **Amazon**: Provide your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on [Configuring access to AWS applications](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-applications.html).
* **Google**: Provide your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials). For more information, see the [Google OAuth client documentation](https://developers.google.com/identity/protocols/oauth2).
* **Microsoft**: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, provide [Azure application registration credentials for SharePoint and OneDrive](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/app-registration?view=odsp-graph-online). For more information, see the [Microsoft Graph OAuth client documentation](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/graph-oauth).
3. Register the redirect URIs presented by the OpenRAG TUI with your OAuth provider.
These are the URLs that your OAuth provider redirects back to after users authenticate and grant access to their cloud storage.
4. Click **Save Configuration**.
OpenRAG regenerates the [`.env`](/reference/configuration) file with the given credentials.
5. Click **Start Container Services**.
</TabItem>
<TabItem value="env" label="Docker Compose .env file">
If you [install OpenRAG with self-managed containers](/docker), set OAuth credentials in the `.env` file for Docker Compose.
You can do this during [initial setup](/docker#install-openrag-with-docker-compose), or you can add the credentials afterwards:
1. Stop your OpenRAG deployment.
<Tabs>
<TabItem value="podman" label="Podman">
```bash
podman stop --all
```
</TabItem>
<TabItem value="docker" label="Docker">
```bash
docker stop $(docker ps -q)
```
</TabItem>
</Tabs>
2. Edit the `.env` file for Docker Compose to add the OAuth credentials for the cloud storage providers that you want to use:
* **Amazon**: Provide your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on [Configuring access to AWS applications](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-applications.html).
```env
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
```
* **Google**: Provide your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials). For more information, see the [Google OAuth client documentation](https://developers.google.com/identity/protocols/oauth2).
```env
GOOGLE_OAUTH_CLIENT_ID=
GOOGLE_OAUTH_CLIENT_SECRET=
```
* **Microsoft**: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, provide [Azure application registration credentials for SharePoint and OneDrive](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/app-registration?view=odsp-graph-online). For more information, see the [Microsoft Graph OAuth client documentation](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/graph-oauth).
```env
MICROSOFT_GRAPH_OAUTH_CLIENT_ID=
MICROSOFT_GRAPH_OAUTH_CLIENT_SECRET=
```
3. Save the `.env` file.
4. Restart your OpenRAG deployment:
<Tabs>
<TabItem value="podman" label="Podman">
```bash
podman-compose up -d
```
</TabItem>
<TabItem value="docker" label="Docker">
```bash
docker-compose up -d
```
</TabItem>
</Tabs>
</TabItem>
</Tabs>
### Authenticate and ingest files from cloud storage
After you start OpenRAG with OAuth connectors enabled, each user is prompted to authenticate with the OAuth provider upon accessing your OpenRAG instance.
Individual authentication is required to access a user's cloud storage from your OpenRAG instance.
For example, if a user navigates to the default OpenRAG URL at `http://localhost:3000`, they are redirected to the OAuth provider's sign-in page.
After authenticating and granting the required permissions for OpenRAG, the user is redirected back to OpenRAG.
To ingest knowledge with an OAuth connector, do the following:
1. Click <Icon name="Library" aria-hidden="true"/> **Knowledge** to view your OpenSearch knowledge base.
2. Click **Add Knowledge**, and then select a storage provider.
3. On the **Add Cloud Knowledge** page, click **Add Files**, and then select the files and folders to ingest from the connected storage.
4. Click **Ingest Files**.
The selected files are processed in the background through the **OpenSearch Ingestion** flow.
<PartialIngestionFlow />
You can [monitor ingestion](#monitor-ingestion) to see the progress of the uploads and check for failed uploads.
## Ingest knowledge from URLs {#url-flow}
@ -63,62 +188,15 @@ This flow isn't directly accessible from the OpenRAG user interface.
Instead, this flow is called by the [**OpenRAG OpenSearch Agent** flow](/chat#flow) as a Model Context Protocol (MCP) tool.
The agent can call this component to fetch web content from a given URL, and then ingest that content into your OpenSearch knowledge base.
Like all OpenRAG flows, you can [inspect the flow in Langflow](/agents#inspect-and-modify-flows), and you can customize it.
For more information about MCP in Langflow, see the Langflow documentation on [MCP clients](https://docs.langflow.org/mcp-client) and [MCP servers](https://docs.langflow.org/mcp-tutorial).
## Monitor ingestion
Document ingestion tasks run in the background.
In the OpenRAG user interface, a badge is shown on <Icon name="Bell" aria-hidden="true"/> **Tasks** when OpenRAG tasks are active.
Click <Icon name="Bell" aria-hidden="true"/> **Tasks** to inspect and cancel tasks:
* **Active Tasks**: All tasks that are **Pending**, **Running**, or **Processing**.
@ -135,87 +213,7 @@ For troubleshooting advice, see [Troubleshoot ingestion](#troubleshoot-ingestion
To stop an active task, click <Icon name="X" aria-hidden="true"/> **Cancel**. Canceling a task stops processing immediately and marks the task as **Failed**.
### Ingestion performance expectations
The following performance test was conducted with Docling Serve.
@ -228,6 +226,9 @@ Throughput scales with CPU cores, memory, storage speed, and configuration choic
This test returned 12 errors, approximately 1.1 percent of the total files ingested.
All errors were file-specific, and they didn't stop the pipeline.
<details>
<summary>Ingestion performance test details</summary>
* Ingestion dataset:
* Total files: 1,083 items mounted
@ -256,8 +257,34 @@ All errors were file-specific, and they didn't stop the pipeline.
* Throughput: Approximately 2.4 documents per second
</details>
## Troubleshoot ingestion {#troubleshoot-ingestion}
If an ingestion task fails, do the following:
* Make sure you are uploading supported file types.
* Split excessively large files into smaller files before uploading.
* Remove unusual embedded content, such as videos or animations, before uploading. Although Docling can replace some non-text content with placeholders during ingestion, some embedded content might cause errors.
If the OpenRAG **Chat** doesn't seem to use your documents correctly, [browse your knowledge base](#browse-knowledge) to confirm that the documents are uploaded in full, and the chunks are correct.
If the documents are present and well-formed, check your [knowledge filters](/knowledge-filters).
If a global filter is applied, make sure the expected documents are included in the global filter.
If the global filter excludes any documents, the agent cannot access those documents unless you apply a chat-level filter or change the global filter.
If text is missing or incorrectly processed, you need to reupload the documents after modifying the ingestion parameters or the documents themselves.
For example:
* Break combined documents into separate files for better metadata context.
* Make sure scanned documents are legible enough for extraction, and enable the **OCR** option. Poorly scanned documents might require additional preparation or rescanning before ingestion.
* Adjust the **Chunk Size** and **Chunk Overlap** settings to better suit your documents. Larger chunks provide more context but can include irrelevant information, while smaller chunks yield more precise semantic search but can lack context.
For more information about modifying ingestion parameters and flows, see [Knowledge ingestion settings](/knowledge#knowledge-ingestion-settings).
## See also
* [Configure knowledge](/knowledge)
* [Filter knowledge](/knowledge-filters)
* [Chat with knowledge](/chat)
* [Inspect and modify flows](/agents#inspect-and-modify-flows)


@ -21,12 +21,13 @@ You can configure how documents are ingested and how the **Chat** interacts with
## Browse knowledge {#browse-knowledge}
The **Knowledge** page lists the documents OpenRAG has ingested into your OpenSearch database, specifically in an [OpenSearch index](https://docs.opensearch.org/latest/getting-started/intro/#index) named `documents`.
To explore the raw contents of your knowledge base, click <Icon name="Library" aria-hidden="true"/> **Knowledge** to get a list of all ingested documents.
Click a document to view the chunks produced from splitting the document during ingestion.
OpenRAG includes some sample documents that you can use to see how the agent references documents in the [**Chat**](/chat).
You might want to [delete these documents](#delete-knowledge) before uploading your own documents to avoid polluting the agent's context with these samples.
## OpenSearch authentication and document access {#auth}
@ -36,38 +37,116 @@ The mode you choose determines how OpenRAG authenticates with OpenSearch and con
* **Basic Setup (no-auth mode)**: If you choose **Basic Setup**, then OpenRAG is installed in no-auth mode.
This mode uses a single anonymous JWT token for OpenSearch authentication.
There is no differentiation between users.
All users that access your OpenRAG instance can access all documents uploaded to your OpenSearch knowledge base.
* **Advanced Setup (OAuth mode)**: If you choose **Advanced Setup**, then OpenRAG is installed in OAuth mode.
This mode uses a unique JWT token for each OpenRAG user, and each document is tagged with user ownership. Documents are filtered by user owner.
This means users see only the documents that they uploaded or have access to.
You can enable OAuth mode after installation.
For more information, see [Ingest files with OAuth connectors](/ingestion#oauth-ingestion).
## OpenSearch indexes
An [OpenSearch index](https://docs.opensearch.org/latest/getting-started/intro/#index) is a collection of documents in an OpenSearch database.
By default, all documents you upload to your OpenRAG knowledge base are stored in an index named `documents`.
It is possible to change the index name by [editing the ingestion flow](/agents#inspect-and-modify-flows).
However, this can impact dependent processes, such as the [filters](/knowledge-filters) and [**Chat**](/chat) flow, that reference the `documents` index by default.
Make sure you edit other flows as needed to ensure all processes use the same index name.
If you encounter errors or unexpected behavior after changing the index name, you can [revert the flows to their original configuration](/agents#revert-a-built-in-flow-to-its-original-configuration), or [delete knowledge](/knowledge#delete-knowledge) to clear the existing documents from your knowledge base.
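As a quick way to verify the index contents, you can query the `documents` index directly with the OpenSearch REST API. The following sketch assumes the OpenSearch port is published on `localhost:9200` and that basic authentication is enabled; adjust the address and credentials for your deployment.
```bash
# Count the chunks stored in the default `documents` index.
curl -k -u "$OPENSEARCH_USERNAME:$OPENSEARCH_PASSWORD" "https://localhost:9200/documents/_count"

# Preview one stored chunk and its metadata.
curl -k -u "$OPENSEARCH_USERNAME:$OPENSEARCH_PASSWORD" "https://localhost:9200/documents/_search?size=1&pretty"
```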
## Knowledge ingestion settings {#knowledge-ingestion-settings}
:::warning
Knowledge ingestion settings apply only to documents that you upload after making the changes.
Documents uploaded before changing these settings aren't reprocessed.
To ensure consistency across your knowledge base, you must reupload all documents after adjusting any of these settings.
:::
### Set the embedding model and dimensions {#set-the-embedding-model-and-dimensions}
When you [install OpenRAG](/install), you select an embedding model during **Application Onboarding**.
OpenRAG automatically detects and configures the appropriate vector dimensions for your selected embedding model, ensuring optimal search performance and compatibility.
In the OpenRAG repository, you can find the complete list of supported models in [`models_service.py`](https://github.com/langflow-ai/openrag/blob/main/src/services/models_service.py) and the corresponding vector dimensions in [`settings.py`](https://github.com/langflow-ai/openrag/blob/main/src/config/settings.py).
The default embedding dimension is `1536`, and the default model is OpenAI's `text-embedding-3-small`.
You can use any supported or unsupported embedding model by specifying the model in your OpenRAG configuration during installation.
If you use an unsupported embedding model that doesn't have defined dimensions in `settings.py`, then OpenRAG falls back to the default dimensions (1536) and logs a warning. OpenRAG's OpenSearch instance and flows continue to work, but [similarity search](https://www.ibm.com/think/topics/vector-search) quality can be affected if the actual model dimensions aren't 1536.
The embedding model that you choose during **Application Onboarding** is immutable and can only be changed by [reinstalling OpenRAG](/install#reinstall).
Alternatively, you can [edit the OpenRAG flows](/agents#inspect-and-modify-flows) for knowledge ingestion and chat. Make sure all flows use the same embedding model.
### Set Docling parameters
OpenRAG uses [Docling](https://docling-project.github.io/docling/) for document ingestion because it supports many file formats, processes tables and images well, and performs efficiently.
When you [upload documents](/ingestion), Docling processes the files, splits them into chunks, and stores them as separate, structured documents in your OpenSearch knowledge base.
You can use either Docling Serve or OpenRAG's built-in Docling ingestion pipeline to process documents.
<Tabs>
<TabItem value="serve" label="Docling Serve ingestion" default>
By default, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve).
This means that OpenRAG starts a `docling serve` process on your local machine and runs Docling ingestion through an API service.
</TabItem>
<TabItem value="docling" label="Built-in Docling ingestion">
If you want to use OpenRAG's built-in Docling ingestion pipeline instead of the separate Docling Serve service, set `DISABLE_INGEST_WITH_LANGFLOW=true` in your [OpenRAG environment variables](/reference/configuration#document-processing).
The built-in pipeline uses the Docling processor directly instead of through the Docling Serve API.
For the underlying functionality, see [`processors.py`](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58) in the OpenRAG repository.
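For example, to enable the built-in pipeline, add the following line to the `.env` file for your deployment:
```env
# Use the built-in Docling pipeline instead of the Docling Serve API.
DISABLE_INGEST_WITH_LANGFLOW=true
```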
</TabItem>
</Tabs>
To modify the Docling ingestion and embedding parameters, click <Icon name="Settings2" aria-hidden="true"/> **Settings** in the OpenRAG user interface.
:::tip
OpenRAG warns you if `docling serve` isn't running.
You can [start and stop OpenRAG services](/install#tui-container-management) from the TUI main menu with **Start Native Services** or **Stop Native Services**.
:::
* **Embedding model**: Select the model to use to generate vector embeddings for your documents. This is initially set during installation.
The recommended way to change this setting is by [reinstalling OpenRAG](/install#reinstall).
If you change this value by directly [editing the flow](/agents#inspect-and-modify-flows), you must also change the embedding model in other [OpenRAG flows](/agents) to ensure that similarity search results are consistent.
If you uploaded documents prior to changing the embedding model, you must either [create filters](/knowledge-filters) to prevent mixing documents embedded with different models, or you must reupload all documents to regenerate embeddings with the new model.
* **Chunk size**: Set the number of characters for each text chunk when breaking down a file.
Larger chunks yield more context per chunk, but can include irrelevant information. Smaller chunks yield more precise semantic search, but can lack context.
The default value is 1000 characters, which is usually a good balance between context and precision.
* **Chunk overlap**: Set the number of characters to overlap over chunk boundaries.
Use larger overlap values for documents where context is most important. Use smaller overlap values for simpler documents or when optimization is most important.
The default value is 200 characters, which represents an overlap of 20 percent if the **Chunk size** is 1000. This is suitable for general use. For faster processing, decrease the overlap to approximately 10 percent. For more complex documents where you need to preserve context across chunks, increase it to approximately 40 percent.
* **Table Structure**: Enables Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/) tool for parsing tables. Instead of treating tables as plain text, tables are output as structured table data with preserved relationships and metadata. This option is enabled by default.
* **OCR**: Enables Optical Character Recognition (OCR) processing when extracting text from images and ingesting scanned documents.
This option is disabled by default, which is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/); images are ignored and not processed. Enabling OCR can slow ingestion performance.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
* **Picture descriptions**: Only applicable if **OCR** is enabled. Adds image descriptions generated by the [`SmolVLM-256M-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model. Enabling picture descriptions can slow ingestion performance.
### Set the local documents path {#set-the-local-documents-path}
The default path for local uploads is the `./documents` subdirectory in your OpenRAG installation directory. This is mounted to the `/app/documents/` directory inside the OpenRAG container. Files added to the host or container directory are visible in both locations.
To change this location, modify the **Documents Paths** variable in either the [**Advanced Setup** menu](/install#setup) or in the `.env` used by Docker Compose.
## Delete knowledge {#delete-knowledge}
To clear your entire knowledge base, you can delete the contents of the `./opensearch-data` folder in your OpenRAG installation directory, or you can [reset the containers](/install#tui-container-management).
Be aware that both of these operations are destructive and cannot be undone.
In particular, resetting containers reverts your OpenRAG instance to the initial state as though it were a fresh installation.
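For example, the following sketch clears the OpenSearch data directory. It assumes that you run it from your OpenRAG installation directory and that you stop the OpenSearch container first.
```bash
# Destructive: permanently removes all indexed documents and embeddings.
rm -rf ./opensearch-data/*
```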
@ -76,4 +155,5 @@ In particular, resetting containers reverts your OpenRAG instance to the initial
* [Ingest knowledge](/ingestion)
* [Filter knowledge](/knowledge-filters)
* [Chat with knowledge](/chat)
* [Inspect and modify flows](/agents#inspect-and-modify-flows)


@ -180,18 +180,15 @@ If you encounter errors during installation, see [Troubleshoot OpenRAG](/support
## Set up OpenRAG with the TUI {#setup}
The OpenRAG setup process creates a `.env` file at the root of your OpenRAG directory, and then starts OpenRAG.
If the setup process detects an existing `.env` file in the OpenRAG root directory, it sources any variables from that file.
The TUI offers two setup methods to populate the required values. **Basic Setup** can generate all minimum required values for OpenRAG. However, **Basic Setup** doesn't enable [OAuth connectors for cloud storage](/knowledge#auth). If you want to use OAuth connectors to upload documents from cloud storage, select **Advanced Setup**.
If OpenRAG detects OAuth credentials, it recommends **Advanced Setup**.
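For example, if a `.env` file like the following sketch is present in the OpenRAG root directory, the TUI sources the Google OAuth values and recommends **Advanced Setup**. The values are placeholders.
```env
# Existing OAuth credentials detected by the TUI (placeholder values)
GOOGLE_OAUTH_CLIENT_ID=your-client-id.apps.googleusercontent.com
GOOGLE_OAUTH_CLIENT_SECRET=your-client-secret
```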
<Tabs groupId="Setup method">
<TabItem value="Basic setup" label="Basic setup" default>
**Basic Setup** can generate all of the required values for OpenRAG. The OpenAI API key is optional and can be provided during onboarding.
For information about the difference between basic (no auth) and OAuth in OpenRAG, see [OpenSearch authentication and document access](/knowledge#auth).
1. To install OpenRAG with **Basic Setup**, click **Basic Setup** or press <kbd>1</kbd>.
2. Click **Generate Passwords** to generate passwords for OpenSearch and Langflow.
@ -215,16 +212,21 @@ If the TUI detects OAuth credentials, it enforces the **Advanced Setup** path.
</TabItem>
<TabItem value="Advanced setup" label="Advanced setup">
1. To install OpenRAG with **Advanced Setup**, click **Advanced Setup** or press <kbd>2</kbd>.
2. Click **Generate Passwords** to generate passwords for OpenSearch and Langflow.
The OpenSearch password is required. The Langflow admin password is optional.
If no Langflow admin password is generated, Langflow runs in [autologin mode](https://docs.langflow.org/api-keys-and-authentication#langflow-auto-login) with no password required.
3. Paste your OpenAI API key in the OpenAI API key field.
4. If you want to upload documents from external storage, such as Google Drive, add the required OAuth credentials for the connectors that you want to use. These settings can be populated automatically if OpenRAG detects these credentials in a `.env` file in the OpenRAG installation directory.
* **Amazon**: Provide your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on [Configuring access to AWS applications](https://docs.aws.amazon.com/singlesignon/latest/userguide/manage-your-applications.html).
* **Google**: Provide your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials). For more information, see the [Google OAuth client documentation](https://developers.google.com/identity/protocols/oauth2).
* **Microsoft**: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, provide [Azure application registration credentials for SharePoint and OneDrive](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/app-registration?view=odsp-graph-online). For more information, see the [Microsoft Graph OAuth client documentation](https://learn.microsoft.com/en-us/onedrive/developer/rest-api/getting-started/graph-oauth).
You can [manage OAuth credentials](/ingestion#oauth-ingestion) later, but it is recommended to configure them during initial setup.
5. The OpenRAG TUI presents redirect URIs for your OAuth app.
These are the URLs your OAuth provider will redirect back to after user sign-in.
Register these redirect values with your OAuth provider as they are presented in the TUI.
@ -239,21 +241,23 @@ If the TUI detects OAuth credentials, it enforces the **Advanced Setup** path.
8. To start the Docling service, under **Native Services**, click **Start**.
9. To open the OpenRAG application, navigate to the TUI main menu, and then click **Open App**.
Alternatively, in your browser, navigate to `localhost:3000`.
10. If you enabled OAuth connectors, you must sign in to your OAuth provider before being redirected to your OpenRAG instance.
11. Two additional variables are available for **Advanced Setup** at this point.
Only change these variables if you have a non-default network configuration for your deployment, such as a reverse proxy or custom domain. For an example, see the sketch after these steps.
* `LANGFLOW_PUBLIC_URL`: Sets the base address to access the Langflow web interface. This is where users interact with flows in a browser.
* `WEBHOOK_BASE_URL`: Sets the base address of the OpenRAG OAuth connector endpoint.
Supported webhook endpoints:
- Amazon S3: Not applicable.
- Google Drive: `/connectors/google_drive/webhook`
- OneDrive: `/connectors/onedrive/webhook`
- SharePoint: `/connectors/sharepoint/webhook`
12. Continue with [Application Onboarding](#application-onboarding).
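For example, for a deployment behind a reverse proxy, the two variables from step 11 might be set as in the following sketch; the domains are placeholders.
```env
# Non-default network configuration (placeholder domains)
LANGFLOW_PUBLIC_URL=https://langflow.example.com
WEBHOOK_BASE_URL=https://openrag.example.com
```
With this `WEBHOOK_BASE_URL`, the Google Drive webhook, for example, is served at `https://openrag.example.com/connectors/google_drive/webhook`.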
</TabItem>
</Tabs>


@ -104,7 +104,7 @@ You can click a document to view the chunks of the document as they are stored i
**Folder** uploads an entire directory.
The default directory is the `./documents` subdirectory in your OpenRAG installation directory.
For information about the cloud storage provider options, see [Ingest files with OAuth connectors](/ingestion#oauth-ingestion).
5. Return to the **Chat** window, and then ask a question related to the documents that you just uploaded.
@ -137,7 +137,6 @@ You can click a document to view the chunks of the document as they are stored i
Click the **Language Model** component, and then change the **Model Name** to a different OpenAI model.
After you edit a built-in flow, you can click **Restore flow** on the **Settings** page to revert the flow to its original state when you first installed OpenRAG.
This is a destructive action that discards all customizations to the flow.
4. Press <kbd>Command</kbd>+<kbd>S</kbd> (<kbd>Ctrl</kbd>+<kbd>S</kbd>) to save your changes.


@ -17,7 +17,7 @@ OpenRAG connects and amplifies three popular, proven open-source projects into o
* [OpenSearch](https://docs.opensearch.org/latest/): OpenSearch is a community-driven, Apache 2.0-licensed open source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data.
It provides powerful hybrid search capabilities with enterprise-grade security and multi-tenancy support.
OpenRAG uses OpenSearch as the underlying vector database for storing and retrieving your documents and associated vector data (embeddings). You can ingest documents from a variety of sources, including your local filesystem and OAuth authenticated connectors to popular cloud storage services.
* [Docling](https://docling-project.github.io/docling/): Docling simplifies document processing, supports many file formats and advanced PDF parsing, and provides seamless integrations with the generative AI ecosystem.
@ -63,6 +63,6 @@ flowchart TD
* **Docling Serve**: This is a local document processing service managed by the **OpenRAG backend**.
* **External connectors**: Integrate third-party cloud storage services with OAuth authenticated connectors to the **OpenRAG backend**, allowing you to load documents from external storage to your OpenSearch knowledge base.
* **OpenRAG frontend**: Provides the user interface for interacting with the OpenRAG platform.


@ -210,4 +210,8 @@ After removing the containers, retry the upgrade in the OpenRAG TUI by clicking
```
</TabItem>
</Tabs>
## Document ingestion or similarity search issues
See [Troubleshoot ingestion](/ingestion#troubleshoot-ingestion).