organize knowledge pages
parent 069135d563
commit fa825525f8
11 changed files with 310 additions and 297 deletions
@@ -13,7 +13,7 @@ In a flow, the individual workflow steps are represented by [_components_](https
OpenRAG includes several built-in flows:
|
||||
|
||||
* The [**OpenRAG OpenSearch Agent** flow](/chat#flow) powers the **Chat** feature in OpenRAG.
|
||||
* The [**OpenSearch Ingestion** and **OpenSearch URL Ingestion** flows](/knowledge#knowledge-ingestion-flows) process documents and web content for storage in your OpenSearch knowledge base.
|
||||
* The [**OpenSearch Ingestion** and **OpenSearch URL Ingestion** flows](/ingestion#knowledge-ingestion-flows) process documents and web content for storage in your OpenSearch knowledge base.
|
||||
* The [**OpenRAG OpenSearch Nudges** flow](/chat#nudges) provides optional contextual suggestions in the OpenRAG **Chat**.
|
||||
|
||||
You can customize these flows and create your own flows using OpenRAG's embedded Langflow visual editor.
@@ -8,7 +8,7 @@ import Tabs from '@theme/Tabs';
|
|||
import TabItem from '@theme/TabItem';
|
||||
import PartialIntegrateChat from '@site/docs/_partial-integrate-chat.mdx';
|
||||
|
||||
After you [upload documents to your knowledge base](/knowledge), you can use the OpenRAG <Icon name="MessageSquare" aria-hidden="true"/> **Chat** feature to interact with your knowledge through natural language queries.
|
||||
After you [upload documents to your knowledge base](/ingestion), you can use the OpenRAG <Icon name="MessageSquare" aria-hidden="true"/> **Chat** feature to interact with your knowledge through natural language queries.
|
||||
|
||||
:::tip
|
||||
Try chatting, uploading documents, and modifying chat settings in the [quickstart](/quickstart).
|
||||
|
|
@@ -48,13 +48,15 @@ One or more specialized tools can be attached to the **Tools** port to extend th
|
|||
|
||||
Different models can change the style and content of the agent's responses, and some models might be better suited for certain tasks than others. If the agent doesn't seem to be handling requests well, try changing the model to see how the responses change. For example, fast models might be good for simple queries, but they might not have the depth of reasoning for complex, multi-faceted queries.
|
||||
|
||||
* [**MCP Tools** component](https://docs.langflow.org/mcp-client): Connected to the **Agent** component's **Tools** port, this component can be used to [access any Model Context Protocol (MCP) server](https://docs.langflow.org/mcp-server) and the MCP tools provided by that server. In this case, your OpenRAG Langflow instance's [**Starter Project**](https://docs.langflow.org/concepts-flows#projects) is the MCP server, and the [**OpenSearch URL Ingestion** flow](/knowledge#url-flow) is the MCP tool.
|
||||
* [**MCP Tools** component](https://docs.langflow.org/mcp-client): Connected to the **Agent** component's **Tools** port, this component can be used to [access any Model Context Protocol (MCP) server](https://docs.langflow.org/mcp-server) and the MCP tools provided by that server. In this case, your OpenRAG Langflow instance's [**Starter Project**](https://docs.langflow.org/concepts-flows#projects) is the MCP server, and the [**OpenSearch URL Ingestion** flow](/ingestion#url-flow) is the MCP tool.
|
||||
This flow fetches content from URLs, and then stores the content in your OpenRAG OpenSearch knowledge base. By serving this flow as an MCP tool, the agent can selectively call this tool if a URL is detected in the chat input.
|
||||
|
||||
* [**OpenSearch** component](https://docs.langflow.org/bundles-elastic#opensearch): Connected to the **Agent** component's **Tools** port, this component lets the agent search your [OpenRAG OpenSearch knowledge base](/knowledge). The agent might not use this database for every request; the agent uses this connection only if it decides that documents in your knowledge base are relevant to your query.
|
||||
|
||||
* [**Embedding Model** component](https://docs.langflow.org/components-embedding-models): Connected to the **OpenSearch** component's **Embedding** port, this component generates embeddings from chat input that are used in [similarity search](https://www.ibm.com/think/topics/vector-search) to find content in your knowledge base that is relevant to the chat input. The agent uses this information to generate context-aware responses that are specialized for your data.
|
||||
|
||||
It is critical that the embedding model used here matches the embedding model used when you [upload documents to your knowledge base](/ingestion). Mismatched models and dimensions can degrade the quality of similarity search results, causing the agent to retrieve irrelevant documents from your knowledge base. A retrieval sketch follows this list.
|
||||
|
||||
* [**Text Input** component](https://docs.langflow.org/components-io): Connected to the **OpenSearch** component's **Search Filters** port, this component is populated with a Langflow global variable named `OPENRAG-QUERY-FILTER`. If a global or chat-level [knowledge filter](/knowledge-filters) is set, then the variable contains the filter expression, which limits the documents that the agent can access in the knowledge base.
|
||||
If no knowledge filter is set, then the `OPENRAG-QUERY-FILTER` variable is empty, and the agent can access all documents in the knowledge base.
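To make the retrieval path described above concrete, the following is a minimal sketch, not OpenRAG's actual implementation. It assumes the `openai` and `opensearch-py` packages, an OpenSearch instance reachable at `localhost:9200` with admin credentials, a vector field named `chunk_embedding`, and a filter expression stored as JSON; the real flow resolves these values inside the Langflow components described above.

```python
import json
import os

from openai import OpenAI
from opensearchpy import OpenSearch

# Embed the chat input with the same model used at ingestion time.
# A mismatch in model or dimensions degrades similarity search.
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
question = "What does our travel policy say about rental cars?"
embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=question,
).data[0].embedding

# Connect to the OpenSearch knowledge base (host, port, and credentials are assumptions).
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", os.environ["OPENSEARCH_PASSWORD"]),
    use_ssl=True,
    verify_certs=False,
)

# Approximate k-NN query; "chunk_embedding" is a placeholder field name.
knn_clause = {"knn": {"chunk_embedding": {"vector": embedding, "k": 10}}}

# Stand-in for the `OPENRAG-QUERY-FILTER` Langflow global variable;
# assumed here to hold a JSON filter clause when a knowledge filter is set.
filter_expression = os.environ.get("OPENRAG_QUERY_FILTER", "")
query = {"bool": {"must": [knn_clause]}}
if filter_expression:
    query["bool"]["filter"] = [json.loads(filter_expression)]

results = client.search(index="documents", body={"size": 10, "query": query})
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("filename"))
```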
@@ -73,7 +75,7 @@ Like OpenRAG's other built-in flows, you can [inspect the flow in Langflow](/age
|
|||
|
||||
During the chat, you'll see information about the agent's process. For more detail, you can inspect individual tool calls. This is helpful for troubleshooting because it shows you how the agent used particular tools. For example, click <Icon name="Gear" aria-hidden="true"/> **Function Call: search_documents (tool_call)** to view the log of tool calls made by the agent to the **OpenSearch** component.
|
||||
|
||||
If documents in your knowledge base seem to be missing or interpreted incorrectly, see [Troubleshoot ingestion](/knowledge#troubleshoot-ingestion).
|
||||
If documents in your knowledge base seem to be missing or interpreted incorrectly, see [Troubleshoot ingestion](/ingestion#troubleshoot-ingestion).
|
||||
|
||||
If tool calls and knowledge appear normal, but the agent's responses seem off-topic or incorrect, consider changing the agent's language model or prompt, as explained in [Inspect and modify flows](/agents#inspect-and-modify-flows).
@@ -1,5 +1,5 @@
|
|||
---
|
||||
title: Docling in OpenRAG
|
||||
title: Ingest knowledge
|
||||
slug: /ingestion
|
||||
---
@@ -7,6 +7,10 @@ import Icon from "@site/src/components/icon/icon";
|
|||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
The documents in your OpenRAG [OpenSearch knowledge base](/knowledge) provide specialized context in addition to the general knowledge available to the language model that you select when you [install OpenRAG](/install).
|
||||
Upload documents to populate your knowledge base with unique content, such as your own company documents, research papers, or websites.
|
||||
Then, the [OpenRAG **Chat**](/chat) can retrieve relevant content from your knowledge base to provide context-aware responses.
|
||||
|
||||
<!-- reuse for ingestion sources and flows. Reuse knowledge.mdx for browse & general info about what knowledge is and the role of Docling & OpenSearch. -->
@@ -19,4 +23,245 @@ You can click a document to view the chunks of the document as they are stored i
|
|||
**Folder** uploads an entire directory.
|
||||
The default directory is the `/documents` subdirectory in your OpenRAG installation directory.
|
||||
|
||||
For information about the cloud storage provider options, see [Ingest files through OAuth connectors](/knowledge#oauth-ingestion).
|
||||
For information about the cloud storage provider options, see [Ingest files through OAuth connectors](/ingestion#oauth-ingestion).
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## Ingestion overview
|
||||
|
||||
OpenRAG can ingest knowledge from direct file uploads, URLs, and OAuth connectors.
|
||||
|
||||
Knowledge ingestion is powered by OpenRAG's built-in [knowledge ingestion flows](/ingestion#knowledge-ingestion-flows) that use Docling Serve to process documents before storing the documents in your OpenSearch database.
|
||||
|
||||
During ingestion, documents are broken into smaller chunks of content that are then embedded using your selected [embedding model](/knowledge#set-the-embedding-model-and-dimensions).
|
||||
The chunks, embeddings, and associated metadata (which connects chunks of the same document) are stored in your OpenSearch database.
|
||||
|
||||
Like all [OpenRAG flows](/agents), you can [inspect the flows in Langflow](/agents#inspect-and-modify-flows), and you can customize them if you want to change the [ingestion settings](/ingestion#knowledge-ingestion-settings).
|
||||
|
||||
## Ingest local files and folders {#knowledge-ingestion-flows}
|
||||
|
||||
<!-- You can upload files and folders from your local machine to your knowledge base. When you do this, the OpenSearch Ingestion flow runs in the background -->
|
||||
|
||||
The **OpenSearch Ingestion** flow uses Langflow's [**File** component](https://docs.langflow.org/components-data#file) to split and embed files loaded from your local machine into the OpenSearch database.
|
||||
|
||||
The default path to your local folder is mounted from the `./documents` folder in your OpenRAG project directory to the `/app/documents/` directory inside the Docker container. Files added to the host or the container will be visible in both locations. To configure this location, modify the **Documents Paths** variable in either the TUI's [Advanced Setup](/install#setup) menu or in the `.env` used by Docker Compose.
|
||||
|
||||
To load and process a single file from the mapped location, click **Add Knowledge**, and then click <Icon name="File" aria-hidden="true"/> **File**.
|
||||
The file is loaded into your OpenSearch database, and appears in the Knowledge page.
|
||||
|
||||
To load and process a directory from the mapped location, click **Add Knowledge**, and then click <Icon name="Folder" aria-hidden="true"/> **Folder**.
|
||||
The files are loaded into your OpenSearch database, and appear in the Knowledge page.
|
||||
|
||||
To add files directly to a chat session, click <Icon name="Plus" aria-hidden="true"/> in the chat input and select the files you want to include. Files added this way are processed and made available to the agent for the current conversation, and are not permanently added to the knowledge base.
|
||||
|
||||
### OpenSearch Ingestion flow
|
||||
|
||||
<!-- combine with above -->
|
||||
|
||||
The **OpenSearch Ingestion** flow is the default knowledge ingestion flow in OpenRAG. When you **Add Knowledge** in OpenRAG, the **OpenSearch Ingestion** flow runs in the background. The flow uses Docling Serve to import and process documents.
|
||||
|
||||
If you [inspect the flow in Langflow](/agents#inspect-and-modify-flows), you'll see that it is composed of the following components, which work together to process and store documents in your knowledge base:

* The [**Docling Serve** component](https://docs.langflow.org/bundles-docling) processes input documents by connecting to your instance of Docling Serve.
* The [**Export DoclingDocument** component](https://docs.langflow.org/components-docling) exports the processed DoclingDocument to markdown format with image export mode set to placeholder. This conversion turns the structured document data into a standardized format for further processing.
* Three [**DataFrame Operations** components](https://docs.langflow.org/components-processing#dataframe-operations) sequentially add `filename`, `file_size`, and `mimetype` metadata columns to the document data.
* The [**Split Text** component](https://docs.langflow.org/components-processing#split-text) splits the processed text into chunks with a chunk size of 1000 characters and an overlap of 200 characters.
* Four **Secret Input** components provide secure access to configuration variables: `CONNECTOR_TYPE`, `OWNER`, `OWNER_EMAIL`, and `OWNER_NAME`. These are runtime variables populated from OAuth login.
* The **Create Data** component combines the secret inputs into a structured data object that will be associated with the document embeddings.
* The [**Embedding Model** component](https://docs.langflow.org/components-embedding-models) generates vector embeddings using OpenAI's `text-embedding-3-small` model. The embedding model is selected during [Application Onboarding](/install#application-onboarding) and cannot be changed.
* The [**OpenSearch** component](https://docs.langflow.org/bundles-elastic#opensearch) stores the processed documents and their embeddings in the `documents` index at `https://opensearch:9200`. By default, the component is authenticated with a JWT token, but you can also select `basic` auth mode, and enter your OpenSearch admin username and password. A minimal indexing sketch follows this list.
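As a rough illustration of that final storage step, this sketch indexes one chunk together with its embedding and the metadata fields named above. It assumes `opensearch-py`, basic-auth credentials, and a vector field called `chunk_embedding`; the real flow assembles this document inside the **OpenSearch** component.

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "your-opensearch-password"),  # or a JWT, as in the default flow
    use_ssl=True,
    verify_certs=False,
)

# One chunk produced by the Split Text step, plus the metadata columns
# added by the DataFrame Operations and Create Data components.
chunk_document = {
    "text": "OpenRAG ingests documents with Docling and stores them in OpenSearch...",
    "chunk_embedding": [0.012, -0.034, 0.056],  # stand-in; real vectors have 1536 values
    "filename": "handbook.pdf",
    "file_size": 482133,
    "mimetype": "application/pdf",
    "owner": "jane@example.com",
    "connector_type": "local",
}

response = client.index(index="documents", body=chunk_document, refresh=True)
print(response["result"])  # "created" on success
```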
To customize this flow, see [Inspect and modify flows](/agents#inspect-and-modify-flows).
|
||||
|
||||
## Ingest knowledge from URLs {#url-flow}
|
||||
|
||||
The **OpenSearch URL Ingestion** flow is used to ingest web content from URLs.
|
||||
This flow isn't directly accessible from the OpenRAG user interface.
|
||||
Instead, this flow is called by the [**OpenRAG OpenSearch Agent** flow](/chat#flow) as a Model Context Protocol (MCP) tool.
|
||||
The agent can call this component to fetch web content from a given URL, and then ingest that content into your OpenSearch knowledge base.
|
||||
|
||||
For more information about MCP in Langflow, see the Langflow documentation on [MCP clients](https://docs.langflow.org/mcp-client) and [MCP servers](https://docs.langflow.org/mcp-tutorial).
|
||||
|
||||
## Ingest files through OAuth connectors {#oauth-ingestion}
|
||||
|
||||
OpenRAG supports Google Drive, OneDrive, and SharePoint as OAuth connectors for seamless document synchronization.
|
||||
|
||||
OAuth integration allows individual users to connect their personal cloud storage accounts to OpenRAG. Each user must separately authorize OpenRAG to access their own cloud storage files. When a user connects a cloud service, they are redirected to authenticate with that service provider and grant OpenRAG permission to sync documents from their personal cloud storage.
|
||||
|
||||
Before users can connect their cloud storage accounts, you must configure OAuth credentials in OpenRAG. This requires registering OpenRAG as an OAuth application with a cloud provider and obtaining client ID and secret keys for each service you want to support.
|
||||
|
||||
To add an OAuth connector to OpenRAG, do the following.
This example uses Google OAuth.
If you use another provider, add that provider's client ID and client secret instead.
|
||||
|
||||
<Tabs groupId="Installation type">
|
||||
<TabItem value="TUI" label="TUI" default>
|
||||
1. If OpenRAG is running, stop it with **Status** > **Stop Services**.
|
||||
2. Click **Advanced Setup**.
|
||||
3. Add the OAuth provider's client ID and client secret in the [Advanced Setup](/install#setup) menu.
|
||||
4. Click **Save Configuration**.
|
||||
The TUI generates a new `.env` file with your OAuth values.
|
||||
5. Click **Start Container Services**.
|
||||
</TabItem>
|
||||
<TabItem value=".env" label=".env">
|
||||
1. Stop the Docker deployment.
|
||||
2. Add the OAuth provider's client ID and client secret in the `.env` file for Docker Compose.
|
||||
```bash
|
||||
GOOGLE_OAUTH_CLIENT_ID='YOUR_OAUTH_CLIENT_ID'
|
||||
GOOGLE_OAUTH_CLIENT_SECRET='YOUR_OAUTH_CLIENT_SECRET'
|
||||
```
|
||||
3. Save your `.env` file.
|
||||
4. Start the Docker deployment.
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
The OpenRAG frontend at `http://localhost:3000` now redirects to an OAuth callback login page for your OAuth provider.
|
||||
A successful authentication opens OpenRAG with the required scopes for your connected storage.
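To make the redirect concrete, the following sketch builds a Google authorization URL from the `.env` values shown above. The redirect URI and scope are illustrative assumptions; OpenRAG constructs and handles this request for you during login.

```python
import os
from urllib.parse import urlencode

params = {
    "client_id": os.environ["GOOGLE_OAUTH_CLIENT_ID"],
    # Hypothetical callback path; use the redirect URI registered for your deployment.
    "redirect_uri": "http://localhost:3000/oauth/callback",
    "response_type": "code",
    "scope": "https://www.googleapis.com/auth/drive.readonly",
    "access_type": "offline",  # request a refresh token so syncing can continue later
}

authorization_url = "https://accounts.google.com/o/oauth2/v2/auth?" + urlencode(params)
print(authorization_url)
```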
To add knowledge from an OAuth-connected storage provider, do the following:
|
||||
|
||||
1. Click **Add Knowledge**, and then select the storage provider, for example, **Google Drive**.
|
||||
The **Add Cloud Knowledge** page opens.
|
||||
2. To add files or folders from the connected storage, click **Add Files**.
|
||||
Select the files or folders you want and click **Select**.
|
||||
You can select multiple files.
|
||||
3. When your files are selected, click **Ingest Files**.
|
||||
The ingestion process can take some time depending on the size of your documents.
|
||||
4. When ingestion is complete, your documents are available in the Knowledge screen.
|
||||
|
||||
If ingestion fails, click **Status** to view the logged error.
|
||||
|
||||
## Monitor ingestion
|
||||
|
||||
Document ingestion tasks run in the background.
|
||||
|
||||
In the OpenRAG UI, a badge is shown on <Icon name="Bell" aria-hidden="true"/> **Tasks** when OpenRAG tasks are active.
|
||||
Click <Icon name="Bell" aria-hidden="true"/> **Tasks** to inspect and cancel tasks:
|
||||
|
||||
* **Active Tasks**: All tasks that are **Pending**, **Running**, or **Processing**.
|
||||
For each active task, depending on its state, you can find the task ID, start time, duration, number of files processed, and the total files enqueued for processing.
|
||||
|
||||
* **Pending**: The task is queued and waiting to start.
|
||||
|
||||
* **Running**: The task is actively processing files.
|
||||
|
||||
* **Processing**: The task is performing ingestion operations.
|
||||
|
||||
* **Failed**: Something went wrong during ingestion, or the task was manually canceled.
|
||||
|
||||
To stop an active task, click <Icon name="X" aria-hidden="true"/> **Cancel**. Canceling a task stops processing immediately and marks the task as **Failed**.
|
||||
|
||||
## Troubleshoot ingestion {#troubleshoot-ingestion}
|
||||
|
||||
If an ingestion task fails, do the following:
|
||||
|
||||
* Make sure you are uploading supported file types.
|
||||
* Split excessively large files into smaller files before uploading.
|
||||
* Remove unusual embedded content, such as videos or animations, before uploading. Although Docling can replace some non-text content with placeholders during ingestion, some embedded content might cause errors.
|
||||
|
||||
If the OpenRAG **Chat** doesn't seem to use your documents correctly, [browse your knowledge base](#browse-knowledge) to confirm that the documents are uploaded in full, and the chunks are correct.
|
||||
|
||||
If the documents are present and well-formed, check your [knowledge filters](/knowledge-filters).
|
||||
If a global filter is applied, make sure the expected documents are included in the global filter.
|
||||
If the global filter excludes any documents, the agent cannot access those documents unless you apply a chat-level filter or change the global filter.
|
||||
|
||||
If text is missing or incorrectly processed, you need to reupload the documents after modifying the ingestion parameters or the documents themselves.
|
||||
For example:
|
||||
|
||||
* Break combined documents into separate files for better metadata context.
|
||||
* Make sure scanned documents are legible enough for extraction, and enable the **OCR** option. Poorly scanned documents might require additional preparation or rescanning before ingestion.
|
||||
* Adjust the **Chunk Size** and **Chunk Overlap** settings to better suit your documents. Larger chunks provide more context but can include irrelevant information, while smaller chunks yield more precise semantic search but can lack context.
|
||||
|
||||
For more information about modifying ingestion parameters and flows, see [Docling Serve for knowledge ingestion](/ingestion#docling-serve-for-knowledge-ingestion).
|
||||
|
||||
## Docling Serve for knowledge ingestion {#docling-serve-for-knowledge-ingestion}
|
||||
|
||||
<!-- revise this section and subsections. Move to the knowledge-configure page and rename that page to "Configure knowledge". -->
|
||||
|
||||
OpenRAG uses [Docling](https://docling-project.github.io/docling/) for document ingestion.
|
||||
More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling serve` process on your local machine and runs Docling ingestion through an API service.
|
||||
|
||||
Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch `documents` index.
|
||||
|
||||
OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.
|
||||
|
||||
### Knowledge ingestion settings {#knowledge-ingestion-settings}
|
||||
|
||||
To modify OpenRAG's ingestion settings, including the Docling settings and ingestion flows, click <Icon name="Settings2" aria-hidden="true"/> **Settings**.
|
||||
|
||||
These settings configure the Docling ingestion parameters.
|
||||
|
||||
OpenRAG will warn you if `docling serve` is not running.
|
||||
To start or stop `docling serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
|
||||
|
||||
**Embedding model** determines which AI model is used to create vector embeddings. The default is the OpenAI `text-embedding-3-small` model.
|
||||
|
||||
**Chunk size** determines how large each text chunk is in number of characters.
|
||||
Larger chunks yield more context per chunk, but can include irrelevant information. Smaller chunks yield more precise semantic search, but can lack context.
|
||||
The default value of `1000` characters provides a good starting point that balances these considerations.
|
||||
|
||||
**Chunk overlap** controls the number of characters that overlap across chunk boundaries.
Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents, or when optimization is most important.
The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase to 40% for more complex documents.
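The effect of these two settings is easier to see in code. The following is a simplified character-based splitter, assuming fixed-size chunks; the actual **Split Text** component may also respect separators such as paragraph breaks.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]


document = "OpenRAG stores knowledge as overlapping chunks. " * 200
chunks = split_text(document)
print(len(chunks), "chunks;", len(chunks[0]), "characters in the first chunk")
```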
**Table Structure** enables Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/) tool for parsing tables. Instead of treating tables as plain text, tables are output as structured table data with preserved relationships and metadata. **Table Structure** is enabled by default.
|
||||
|
||||
**OCR** enables or disables OCR processing when extracting text from images and scanned documents.
OCR is disabled by default, which is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/). Images are ignored and not processed.
|
||||
|
||||
Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents. Enabling OCR can slow ingestion performance.
|
||||
|
||||
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
|
||||
|
||||
**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
|
||||
|
||||
### Use OpenRAG default ingestion instead of Docling Serve

If you want to use OpenRAG's built-in pipeline instead of Docling Serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/reference/configuration#document-processing).
|
||||
|
||||
The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
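For orientation, this is roughly what calling the Docling library directly looks like. The snippet below uses the public `docling` Python API with an example file path, not OpenRAG's internal `processors.py` wrapper:

```python
from docling.document_converter import DocumentConverter

# Convert a local file (or a URL) into a structured DoclingDocument,
# then export it to Markdown, similar to the Export DoclingDocument step.
converter = DocumentConverter()
result = converter.convert("documents/handbook.pdf")
print(result.document.export_to_markdown()[:500])
```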
For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).
|
||||
|
||||
## Ingestion performance expectations
|
||||
|
||||
On a local VM with 7 vCPUs and 8 GiB RAM, OpenRAG ingested approximately 5.03 GB across 1,083 files in about 42 minutes.
|
||||
This equates to approximately 2.4 documents per second.
|
||||
|
||||
You can generally expect equal or better performance on developer laptops and significantly faster on servers.
|
||||
Throughput scales with CPU cores, memory, storage speed, and configuration choices such as embedding model, chunk size and overlap, and concurrency.
|
||||
|
||||
This test returned 12 errors (approximately 1.1 percent).
|
||||
All errors were file-specific, and they didn't stop the pipeline.
|
||||
|
||||
* Ingestion dataset:
|
||||
|
||||
* Total files: 1,083 items mounted
|
||||
* Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)
|
||||
|
||||
* Hardware specifications:
|
||||
|
||||
* Machine: Apple M4 Pro
|
||||
* Podman VM:
|
||||
* Name: `podman-machine-default`
|
||||
* Type: `applehv`
|
||||
* vCPUs: 7
|
||||
* Memory: 8 GiB
|
||||
* Disk size: 100 GiB
|
||||
|
||||
* Test results:
|
||||
|
||||
```text
|
||||
2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
|
||||
2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
|
||||
...
|
||||
2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
|
||||
```
|
||||
|
||||
* Elapsed time: Approximately 42 minutes 15 seconds (2,535 seconds)
|
||||
|
||||
* Throughput: Approximately 2.4 documents/second
|
||||
|
|
@@ -1,58 +0,0 @@
|
|||
---
|
||||
title: Configure OpenSearch in OpenRAG
|
||||
slug: /knowledge-configure
|
||||
---
|
||||
|
||||
import Icon from "@site/src/components/icon/icon";
|
||||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
OpenRAG includes a built-in [OpenSearch](https://docs.opensearch.org/latest/) instance that serves as the underlying datastore for your knowledge.
|
||||
This specialized database is used to store and retrieve your documents and the associated vector data (embeddings).
|
||||
|
||||
The [OpenRAG **Chat**](/chat) runs [similarity searches](https://www.ibm.com/think/topics/vector-search) against your OpenSearch database to retrieve relevant information and generate context-aware responses.
|
||||
|
||||
Additionally, OpenSearch provides powerful hybrid search capabilities with enterprise-grade security and multi-tenancy support.
|
||||
|
||||
## OpenSearch authentication and document access {#auth}
|
||||
|
||||
When you [install OpenRAG](/install), you can choose between two setup modes: **Basic Setup** and **Advanced Setup**.
|
||||
The mode you choose determines how OpenRAG authenticates with OpenSearch and controls access to documents:
|
||||
|
||||
* **Basic Setup (no-auth mode)**: If you choose **Basic Setup**, then OpenRAG is installed in no-auth mode.
|
||||
This mode uses one, anonymous JWT token for OpenSearch authentication.
|
||||
There is no differentiation between users.
|
||||
All users that access your OpenRAG instance can access all documents uploaded to your OpenSearch `documents` index.
|
||||
|
||||
* **Advanced Setup (OAuth mode)**: If you choose **Advanced Setup**, then OpenRAG is installed in OAuth mode.
|
||||
This mode uses a unique JWT token for each OpenRAG user, and each document is tagged with user ownership. Documents are filtered by user owner.
|
||||
This means users see only the documents that they uploaded or have access to.
|
||||
|
||||
You can enable OAuth mode after installation.
|
||||
For more information, see [Ingest files through OAuth connectors](/knowledge#oauth-ingestion).
|
||||
|
||||
## Set the embedding model and dimensions {#set-the-embedding-model-and-dimensions}
|
||||
|
||||
When you [install OpenRAG](/install), you select an embedding model during **Application Onboarding**.
|
||||
OpenRAG automatically detects and configures the appropriate vector dimensions for your selected embedding model, ensuring optimal search performance and compatibility.
|
||||
|
||||
In the OpenRAG repository, you can find the complete list of supported models in [`models_service.py`](https://github.com/langflow-ai/openrag/blob/main/src/services/models_service.py) and the corresponding vector dimensions in [`settings.py`](https://github.com/langflow-ai/openrag/blob/main/src/config/settings.py).
|
||||
|
||||
The default embedding dimension is `1536` and the default model is `text-embedding-3-small`.
|
||||
|
||||
You can use any supported or unsupported embedding model by specifying the model in your OpenRAG configuration during installation.
|
||||
|
||||
If you use an unsupported embedding model that doesn't have defined dimensions in `settings.py`, then OpenRAG falls back to the default dimensions (1536) and logs a warning. OpenRAG's OpenSearch instance and flows continue to work, but [similarity search](https://www.ibm.com/think/topics/vector-search) quality can be affected if the actual model dimensions aren't 1536.
|
||||
|
||||
The embedding model setting is immutable.
|
||||
To change the embedding model, you must [reinstall OpenRAG](/install#reinstall).
|
||||
|
||||
## Set ingestion parameters
|
||||
|
||||
For information about modifying ingestion parameters and flows, see [Knowledge ingestion settings](/knowledge#knowledge-ingestion-settings) and [Knowledge ingestion flows](/knowledge#knowledge-ingestion-flows).
|
||||
|
||||
## See also
|
||||
|
||||
* [Ingest knowledge](/knowledge)
|
||||
* [Filter knowledge](/knowledge-filters)
|
||||
* [Chat with knowledge](/chat)
@@ -7,7 +7,7 @@ import Icon from "@site/src/components/icon/icon";
|
|||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
OpenRAG's knowledge filters help you organize and manage your [knowledge base](/knowledge-configure) by creating pre-defined views of your documents.
|
||||
OpenRAG's knowledge filters help you organize and manage your [knowledge base](/knowledge) by creating pre-defined views of your documents.
|
||||
|
||||
Each knowledge filter captures a specific subset of documents based on a given search query and filters.
@@ -31,7 +31,7 @@ To create a knowledge filter, do the following:
|
|||
* **Data Sources**: Select specific data sources or folders to include.
|
||||
* **Document Types**: Filter by file type.
|
||||
* **Owners**: Filter by the user that uploaded the documents.
|
||||
* **Connectors**: Filter by [upload source](/knowledge), such as the local file system or a Google Drive OAuth connector.
|
||||
* **Connectors**: Filter by [upload source](/ingestion), such as the local file system or a Google Drive OAuth connector.
|
||||
* **Response Limit**: Set the maximum number of results to return from the knowledge base. The default is `10`.
|
||||
* **Score Threshold**: Set the minimum relevance score for similarity search. The default score is `0`.
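As a rough illustration, the criteria above could translate into an OpenSearch query along these lines. The field names (`mimetype`, `owner`, `text`) are assumptions based on the metadata added during ingestion; the exact filter expression OpenRAG generates may differ.

```python
import json

knowledge_filter = {
    "size": 10,        # Response Limit
    "min_score": 0.4,  # Score Threshold (the default is 0)
    "query": {
        "bool": {
            "must": [
                {"match": {"text": "quarterly revenue"}}  # the filter's search query
            ],
            "filter": [
                {"terms": {"mimetype": ["application/pdf", "text/markdown"]}},  # Document Types
                {"term": {"owner": "jane@example.com"}},                        # Owners
            ],
        }
    },
}

print(json.dumps(knowledge_filter, indent=2))
```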
@@ -1,5 +1,5 @@
|
|||
---
|
||||
title: Ingest knowledge
|
||||
title: Configure knowledge
|
||||
slug: /knowledge
|
||||
---
|
@@ -7,131 +7,15 @@ import Icon from "@site/src/components/icon/icon";
|
|||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
The documents in your OpenRAG [OpenSearch knowledge base](/knowledge-configure) provide specialized context in addition to the general knowledge available to the language model that you select when you [install OpenRAG](/install).
|
||||
Upload documents to populate your knowledge base with unique content, such as your own company documents, research papers, or websites.
|
||||
Then, the [OpenRAG **Chat**](/chat) can retrieve relevant content from your knowledge base to provide context-aware responses.
|
||||
OpenRAG includes a built-in [OpenSearch](https://docs.opensearch.org/latest/) instance that serves as the underlying datastore for your _knowledge_ (documents).
|
||||
This specialized database is used to store and retrieve your documents and the associated vector data (embeddings).
|
||||
|
||||
OpenRAG can ingest knowledge from direct file uploads, URLs, and OAuth connectors.
|
||||
You can [upload documents](/ingestion) from a variety of sources.
|
||||
Documents are processed through OpenRAG's knowledge ingestion flows with Docling.
|
||||
|
||||
Knowledge ingestion is powered by OpenRAG's built-in [knowledge ingestion flows](/knowledge#knowledge-ingestion-flows) that use Docling Serve to process documents before storing the documents in your OpenSearch database.
|
||||
The [OpenRAG **Chat**](/chat) runs [similarity searches](https://www.ibm.com/think/topics/vector-search) against your OpenSearch database to retrieve relevant information and generate context-aware responses.
|
||||
|
||||
During ingestion, documents are broken into smaller chunks of content that are then embedded using your selected [embedding model](/knowledge-configure#set-the-embedding-model-and-dimensions).
|
||||
The chunks, embeddings, and associated metadata (which connects chunks of the same document) are stored in your OpenSearch database.
|
||||
|
||||
Like all [OpenRAG flows](/agents), you can [inspect the flows in Langflow](/agents#inspect-and-modify-flows), and you can customize them if you want to change the [ingestion settings](/knowledge#knowledge-ingestion-settings).
|
||||
|
||||
## Ingest local files and folders {#knowledge-ingestion-flows}
|
||||
|
||||
<!-- You can upload files and folders from your local machine to your knowledge base. When you do this, the OpenSearch Ingestion flow runs in the background -->
|
||||
|
||||
The **OpenSearch Ingestion** flow uses Langflow's [**File** component](https://docs.langflow.org/components-data#file) to split and embed files loaded from your local machine into the OpenSearch database.
|
||||
|
||||
The default path to your local folder is mounted from the `./documents` folder in your OpenRAG project directory to the `/app/documents/` directory inside the Docker container. Files added to the host or the container will be visible in both locations. To configure this location, modify the **Documents Paths** variable in either the TUI's [Advanced Setup](/install#setup) menu or in the `.env` used by Docker Compose.
|
||||
|
||||
To load and process a single file from the mapped location, click **Add Knowledge**, and then click <Icon name="File" aria-hidden="true"/> **File**.
|
||||
The file is loaded into your OpenSearch database, and appears in the Knowledge page.
|
||||
|
||||
To load and process a directory from the mapped location, click **Add Knowledge**, and then click <Icon name="Folder" aria-hidden="true"/> **Folder**.
|
||||
The files are loaded into your OpenSearch database, and appear in the Knowledge page.
|
||||
|
||||
To add files directly to a chat session, click <Icon name="Plus" aria-hidden="true"/> in the chat input and select the files you want to include. Files added this way are processed and made available to the agent for the current conversation, and are not permanently added to the knowledge base.
|
||||
|
||||
### OpenSearch Ingestion flow
|
||||
|
||||
<!-- combine with above -->
|
||||
|
||||
The **OpenSearch Ingestion** flow is the default knowledge ingestion flow in OpenRAG. When you **Add Knowledge** in OpenRAG, the **OpenSearch Ingestion** flow runs in the background. The flow ingests documents using Docling Serve to import and process documents.
|
||||
|
||||
If you [inspect the flow in Langflow](/agents#inspect-and-modify-flows), you'll see that it is comprised of ten components that work together to process and store documents in your knowledge base:
|
||||
|
||||
* The [**Docling Serve** component](https://docs.langflow.org/bundles-docling) processes input documents by connecting to your instance of Docling Serve.
|
||||
* The [**Export DoclingDocument** component](https://docs.langflow.org/components-docling) exports the processed DoclingDocument to markdown format with image export mode set to placeholder. This conversion makes the structured document data into a standardized format for further processing.
|
||||
* Three [**DataFrame Operations** components](https://docs.langflow.org/components-processing#dataframe-operations) sequentially add metadata columns to the document data of `filename`, `file_size`, and `mimetype`.
|
||||
* The [**Split Text** component](https://docs.langflow.org/components-processing#split-text) splits the processed text into chunks with a chunk size of 1000 characters and an overlap of 200 characters.
|
||||
* Four **Secret Input** components provide secure access to configuration variables: `CONNECTOR_TYPE`, `OWNER`, `OWNER_EMAIL`, and `OWNER_NAME`. These are runtime variables populated from OAuth login.
|
||||
* The **Create Data** component combines the secret inputs into a structured data object that will be associated with the document embeddings.
|
||||
* The [**Embedding Model** component](https://docs.langflow.org/components-embedding-models) generates vector embeddings using OpenAI's `text-embedding-3-small` model. The embedding model is selected at [Application onboarding] and cannot be changed.
|
||||
* The [**OpenSearch** component](https://docs.langflow.org/bundles-elastic#opensearch) stores the processed documents and their embeddings in the `documents` index at `https://opensearch:9200`. By default, the component is authenticated with a JWT token, but you can also select `basic` auth mode, and enter your OpenSearch admin username and password.
|
||||
|
||||
To customize this flow, see [Inspect and modify flows](/agents#inspect-and-modify-flows).
|
||||
|
||||
## Ingest knowledge from URLs {#url-flow}
|
||||
|
||||
The **OpenSearch URL Ingestion** flow is used to ingest web content from URLs.
|
||||
This flow isn't directly accessible from the OpenRAG user interface.
|
||||
Instead, this flow is called by the [**OpenRAG OpenSearch Agent** flow](/chat#flow) as a Model Context Protocol (MCP) tool.
|
||||
The agent can call this component to fetch web content from a given URL, and then ingest that content into your OpenSearch knowledge base.
|
||||
|
||||
For more information about MCP in Langflow, see the Langflow documentation on [MCP clients](https://docs.langflow.org/mcp-client) and [MCP servers](https://docs.langflow.org/mcp-tutorial).
|
||||
|
||||
## Ingest files through OAuth connectors {#oauth-ingestion}
|
||||
|
||||
OpenRAG supports Google Drive, OneDrive, and Sharepoint as OAuth connectors for seamless document synchronization.
|
||||
|
||||
OAuth integration allows individual users to connect their personal cloud storage accounts to OpenRAG. Each user must separately authorize OpenRAG to access their own cloud storage files. When a user connects a cloud service, they are redirected to authenticate with that service provider and grant OpenRAG permission to sync documents from their personal cloud storage.
|
||||
|
||||
Before users can connect their cloud storage accounts, you must configure OAuth credentials in OpenRAG. This requires registering OpenRAG as an OAuth application with a cloud provider and obtaining client ID and secret keys for each service you want to support.
|
||||
|
||||
To add an OAuth connector to OpenRAG, do the following.
|
||||
This example uses Google OAuth.
|
||||
If you wish to use another provider, add the secrets to another provider.
|
||||
|
||||
<Tabs groupId="Installation type">
|
||||
<TabItem value="TUI" label="TUI" default>
|
||||
1. If OpenRAG is running, stop it with **Status** > **Stop Services**.
|
||||
2. Click **Advanced Setup**.
|
||||
3. Add the OAuth provider's client and secret key in the [Advanced Setup](/install#setup) menu.
|
||||
4. Click **Save Configuration**.
|
||||
The TUI generates a new `.env` file with your OAuth values.
|
||||
5. Click **Start Container Services**.
|
||||
</TabItem>
|
||||
<TabItem value=".env" label=".env">
|
||||
1. Stop the Docker deployment.
|
||||
2. Add the OAuth provider's client and secret key in the `.env` file for Docker Compose.
|
||||
```bash
|
||||
GOOGLE_OAUTH_CLIENT_ID='YOUR_OAUTH_CLIENT_ID'
|
||||
GOOGLE_OAUTH_CLIENT_SECRET='YOUR_OAUTH_CLIENT_SECRET'
|
||||
```
|
||||
3. Save your `.env` file.
|
||||
4. Start the Docker deployment.
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
The OpenRAG frontend at `http://localhost:3000` now redirects to an OAuth callback login page for your OAuth provider.
|
||||
A successful authentication opens OpenRAG with the required scopes for your connected storage.
|
||||
|
||||
To add knowledge from an OAuth-connected storage provider, do the following:
|
||||
|
||||
1. Click **Add Knowledge**, and then select the storage provider, for example, **Google Drive**.
|
||||
The **Add Cloud Knowledge** page opens.
|
||||
2. To add files or folders from the connected storage, click **Add Files**.
|
||||
Select the files or folders you want and click **Select**.
|
||||
You can select multiple files.
|
||||
3. When your files are selected, click **Ingest Files**.
|
||||
The ingestion process can take some time depending on the size of your documents.
|
||||
4. When ingestion is complete, your documents are available in the Knowledge screen.
|
||||
|
||||
If ingestion fails, click **Status** to view the logged error.
|
||||
|
||||
## Monitor ingestion
|
||||
|
||||
Document ingestion tasks run in the background.
|
||||
|
||||
In the OpenRAG UI, a badge is shown on <Icon name="Bell" aria-hidden="true"/> **Tasks** when OpenRAG tasks are active.
|
||||
Click <Icon name="Bell" aria-hidden="true"/> **Tasks** to inspect and cancel tasks:
|
||||
|
||||
* **Active Tasks**: All tasks that are **Pending**, **Running**, or **Processing**.
|
||||
For each active task, depending on its state, you can find the task ID, start time, duration, number of files processed, and the total files enqueued for processing.
|
||||
|
||||
* **Pending**: The task is queued and waiting to start.
|
||||
|
||||
* **Running**: The task is actively processing files.
|
||||
|
||||
* **Processing**: The task is performing ingestion operations.
|
||||
|
||||
* **Failed**: Something went wrong during ingestion, or the task was manually canceled.
|
||||
|
||||
To stop an active task, click <Icon name="X" aria-hidden="true"/> **Cancel**. Canceling a task stops processing immediately and marks the task as **Failed**.
|
||||
You can configure how documents are ingested and how the **Chat** interacts with your knowledge base.
|
||||
|
||||
## Browse knowledge {#browse-knowledge}
@@ -140,118 +24,54 @@ The **Knowledge** page lists the documents OpenRAG has ingested into your OpenSe
|
|||
To explore the raw contents of your knowledge base, click <Icon name="Library" aria-hidden="true"/> **Knowledge** to get a list of all ingested documents.
|
||||
Click a document to view the chunks produced from splitting the document during ingestion.
|
||||
|
||||
## Troubleshoot ingestion (#troubleshoot-ingestion)
|
||||
OpenRAG includes some sample documents that you can use to see how the agent references documents in the [**Chat**](/chat).
|
||||
|
||||
If an ingestion task fails, do the following:
|
||||
## OpenSearch authentication and document access {#auth}
|
||||
|
||||
* Make sure you are uploading supported file types.
|
||||
* Split excessively large files into smaller files before uploading.
|
||||
* Remove unusual embedded content, such as videos or animations, before uploading. Although Docling can replace some non-text content with placeholders during ingestion, some embedded content might cause errors.
|
||||
When you [install OpenRAG](/install), you can choose between two setup modes: **Basic Setup** and **Advanced Setup**.
|
||||
The mode you choose determines how OpenRAG authenticates with OpenSearch and controls access to documents:
|
||||
|
||||
If the OpenRAG **Chat** doesn't seem to use your documents correctly, [browse your knowledge base](#browse-knowledge) to confirm that the documents are uploaded in full, and the chunks are correct.
|
||||
* **Basic Setup (no-auth mode)**: If you choose **Basic Setup**, then OpenRAG is installed in no-auth mode.
|
||||
This mode uses one anonymous JWT token for OpenSearch authentication.
|
||||
There is no differentiation between users.
|
||||
All users that access your OpenRAG instance can access all documents uploaded to your OpenSearch `documents` index.
|
||||
|
||||
If the documents are present and well-formed, check your [knowledge filters](/knowledge-filters).
|
||||
If a global filter is applied, make sure the expected documents are included in the global filter.
|
||||
If the global filter excludes any documents, the agent cannot access those documents unless you apply a chat-level filter or change the global filter.
|
||||
* **Advanced Setup (OAuth mode)**: If you choose **Advanced Setup**, then OpenRAG is installed in OAuth mode.
|
||||
This mode uses a unique JWT token for each OpenRAG user, and each document is tagged with user ownership. Documents are filtered by user owner.
|
||||
This means users see only the documents that they uploaded or have access to.
|
||||
|
||||
If text is missing or incorrectly processed, you need to reupload the documents after modifying the ingestion parameters or the documents themselves.
|
||||
For example:
|
||||
You can enable OAuth mode after installation.
|
||||
For more information, see [Ingest files through OAuth connectors](/ingestion#oauth-ingestion).
|
||||
|
||||
* Break combined documents into separate files for better metadata context.
|
||||
* Make sure scanned documents are legible enough for extraction, and enable the **OCR** option. Poorly scanned documents might require additional preparation or rescanning before ingestion.
|
||||
* Adjust the **Chunk Size** and **Chunk Overlap** settings to better suit your documents. Larger chunks provide more context but can include irrelevant information, while smaller chunks yield more precise semantic search but can lack context.
|
||||
## Set the embedding model and dimensions {#set-the-embedding-model-and-dimensions}
|
||||
|
||||
For more information about modifying ingestion parameters and flows, see [Docling Serve for knowledge ingestion](/knowledge#docling-serve-for-knowledge-ingestion).
|
||||
When you [install OpenRAG](/install), you select an embedding model during **Application Onboarding**.
|
||||
OpenRAG automatically detects and configures the appropriate vector dimensions for your selected embedding model, ensuring optimal search performance and compatibility.
|
||||
|
||||
## Docling Serve for knowledge ingestion {#docling-serve-for-knowledge-ingestion}
|
||||
In the OpenRAG repository, you can find the complete list of supported models in [`models_service.py`](https://github.com/langflow-ai/openrag/blob/main/src/services/models_service.py) and the corresponding vector dimensions in [`settings.py`](https://github.com/langflow-ai/openrag/blob/main/src/config/settings.py).
|
||||
|
||||
<!-- revise this section and subsections. Move to the knowledge-configure page and rename that page to "Configure knowledge". -->
|
||||
The default embedding dimension is `1536` and the default model is `text-embedding-3-small`.
|
||||
|
||||
OpenRAG uses [Docling](https://docling-project.github.io/docling/) for document ingestion.
|
||||
More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling serve` process on your local machine and runs Docling ingestion through an API service.
|
||||
You can use any supported or unsupported embedding model by specifying the model in your OpenRAG configuration during installation.
|
||||
|
||||
Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch `documents` index.
|
||||
If you use an unsupported embedding model that doesn't have defined dimensions in `settings.py`, then OpenRAG falls back to the default dimensions (1536) and logs a warning. OpenRAG's OpenSearch instance and flows continue to work, but [similarity search](https://www.ibm.com/think/topics/vector-search) quality can be affected if the actual model dimensions aren't 1536.
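A quick way to check a model before committing to it is to generate one embedding and compare its length with the dimension OpenRAG will configure. The sketch below mirrors the fallback behavior described above with a hand-written lookup table; OpenRAG's actual table lives in `settings.py`.

```python
import warnings

from openai import OpenAI

# Dimensions for a few common OpenAI models; unknown models fall back to 1536.
KNOWN_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}
DEFAULT_DIMENSION = 1536

model = "text-embedding-3-small"
expected = KNOWN_DIMENSIONS.get(model)
if expected is None:
    warnings.warn(f"No known dimension for {model}; falling back to {DEFAULT_DIMENSION}")
    expected = DEFAULT_DIMENSION

client = OpenAI()  # reads OPENAI_API_KEY from the environment
vector = client.embeddings.create(model=model, input="dimension check").data[0].embedding
print(f"{model} returned {len(vector)} dimensions (expected {expected})")
```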
OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.
|
||||
The embedding model setting is immutable.
|
||||
To change the embedding model, you must [reinstall OpenRAG](/install#reinstall).
|
||||
|
||||
### Knowledge ingestion settings {#knowledge-ingestion-settings}
|
||||
## Set ingestion parameters
|
||||
|
||||
To modify OpenRAG's ingestion settings, including the Docling settings and ingestion flows, <Icon name="Settings2" aria-hidden="true"/> **Settings**.
|
||||
For information about modifying ingestion parameters and flows, see [Knowledge ingestion settings](/ingestion#knowledge-ingestion-settings) and [Knowledge ingestion flows](/ingestion#knowledge-ingestion-flows).
|
||||
|
||||
These settings configure the Docling ingestion parameters.
|
||||
## Delete knowledge
|
||||
|
||||
OpenRAG will warn you if `docling serve` is not running.
|
||||
To start or stop `docling serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
|
||||
To clear your entire knowledge base, you can delete the contents of the `./opensearch-data` folder in your OpenRAG installation directory, or you can [reset the OpenRAG containers](/install#tui-container-management).
|
||||
|
||||
**Embedding model** determines which AI model is used to create vector embeddings. The default is the OpenAI `text-embedding-3-small` model.
|
||||
|
||||
**Chunk size** determines how large each text chunk is in number of characters.
|
||||
Larger chunks yield more context per chunk, but can include irrelevant information. Smaller chunks yield more precise semantic search, but can lack context.
|
||||
The default value of `1000` characters provides a good starting point that balances these considerations.
|
||||
|
||||
**Chunk overlap** controls the number of characters that overlap over chunk boundaries.
|
||||
Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents, or when optimization is most important.
|
||||
The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase to 40% for more complex documents.
|
||||
|
||||
**Table Structure** enables Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/) tool for parsing tables. Instead of treating tables as plain text, tables are output as structured table data with preserved relationships and metadata. **Table Structure** is enabled by default.
|
||||
|
||||
**OCR** enables or disabled OCR processing when extracting text from images and scanned documents.
|
||||
OCR is disabled by default. This setting is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/). Images are ignored and not processed.
|
||||
|
||||
Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents. Enabling OCR can slow ingestion performance.
|
||||
|
||||
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
|
||||
|
||||
**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
|
||||
|
||||
### Use OpenRAG default ingestion instead of Docling serve
|
||||
|
||||
If you want to use OpenRAG's built-in pipeline instead of Docling serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/reference/configuration#document-processing).
|
||||
|
||||
The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
|
||||
|
||||
For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).
|
||||
|
||||
## Ingestion performance expectations
|
||||
|
||||
On a local VM with 7 vCPUs and 8 GiB RAM, OpenRAG ingested approximately 5.03 GB across 1,083 files in about 42 minutes.
|
||||
This equates to approximately 2.4 documents per second.
|
||||
|
||||
You can generally expect equal or better performance on developer laptops and significantly faster on servers.
|
||||
Throughput scales with CPU cores, memory, storage speed, and configuration choices such as embedding model, chunk size and overlap, and concurrency.
|
||||
|
||||
This test returned 12 errors (approximately 1.1 percent).
|
||||
All errors were file-specific, and they didn't stop the pipeline.
|
||||
|
||||
* Ingestion dataset:
|
||||
|
||||
* Total files: 1,083 items mounted
|
||||
* Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)
|
||||
|
||||
* Hardware specifications:
|
||||
|
||||
* Machine: Apple M4 Pro
|
||||
* Podman VM:
|
||||
* Name: `podman-machine-default`
|
||||
* Type: `applehv`
|
||||
* vCPUs: 7
|
||||
* Memory: 8 GiB
|
||||
* Disk size: 100 GiB
|
||||
|
||||
* Test results:
|
||||
|
||||
```text
|
||||
2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
|
||||
2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
|
||||
...
|
||||
2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
|
||||
```
|
||||
|
||||
* Elapsed time: Approximately 42 minutes 15 seconds (2,535 seconds)
|
||||
|
||||
* Throughput: Approximately 2.4 documents/second
|
||||
Be aware that both of these operations are destructive and cannot be undone.
|
||||
In particular, resetting containers reverts your OpenRAG instance to the initial state as though it were a fresh installation.
|
||||
|
||||
## See also
|
||||
|
||||
* [Configure OpenSearch in OpenRAG](/knowledge-configure)
|
||||
* [Filter knowledge](/knowledge-filters)
|
||||
* [Ingest knowledge](/ingestion)
|
||||
* [Filter knowledge](/knowledge-filters)
|
||||
* [Chat with knowledge](/chat)
@@ -190,7 +190,7 @@ If the TUI detects OAuth credentials, it enforces the **Advanced Setup** path.
|
|||
**Basic Setup** can generate all of the required values for OpenRAG. The OpenAI API key is optional and can be provided during onboarding.
|
||||
**Basic Setup** does not set up OAuth connections for ingestion from cloud providers.
|
||||
For OAuth setup, use **Advanced Setup**.
|
||||
For information about the difference between basic (no auth) and OAuth in OpenRAG, see [OpenSearch authentication and document access](/knowledge-configure#auth).
|
||||
For information about the difference between basic (no auth) and OAuth in OpenRAG, see [OpenSearch authentication and document access](/knowledge#auth).
|
||||
|
||||
1. To install OpenRAG with **Basic Setup**, click **Basic Setup** or press <kbd>1</kbd>.
|
||||
2. Click **Generate Passwords** to generate passwords for OpenSearch and Langflow.
@@ -104,7 +104,7 @@ You can click a document to view the chunks of the document as they are stored i
|
|||
**Folder** uploads an entire directory.
|
||||
The default directory is the `/documents` subdirectory in your OpenRAG installation directory.
|
||||
|
||||
For information about the cloud storage provider options, see [Ingest files through OAuth connectors](/knowledge#oauth-ingestion).
|
||||
For information about the cloud storage provider options, see [Ingest files through OAuth connectors](/ingestion#oauth-ingestion).
|
||||
|
||||
5. Return to the **Chat** window, and then ask a question related to the documents that you just uploaded.
|
||||
|
||||
|
|
@@ -117,7 +117,7 @@ You can click a document to view the chunks of the document as they are stored i
|
|||
|
||||
* Click <Icon name="Settings2" aria-hidden="true"/> **Settings** to modify the knowledge ingestion settings.
|
||||
|
||||
For more information, see [Configure OpenSearch in OpenRAG](/knowledge-configure) and [Ingest knowledge](/knowledge).
|
||||
For more information, see [Configure knowledge](/knowledge) and [Ingest knowledge](/ingestion).
|
||||
|
||||
## Change the language model and chat settings {#change-components}
@@ -10,13 +10,18 @@ OpenRAG connects and amplifies three popular, proven open-source projects into o
|
|||
|
||||
* [Langflow](https://docs.langflow.org): Langflow is a versatile tool for building and deploying AI agents and MCP servers. It supports all major LLMs, vector databases, and a growing library of AI tools.
|
||||
|
||||
OpenRAG uses several built-in flows, and it provides full access to all Langflow features through the embedded Langflow visual editor.
|
||||
|
||||
By customizing the built-in flows or creating your own flows, every part of the OpenRAG stack is interchangeable. You can modify any aspect of the flows from basic settings, like changing the language model, to replacing entire components. You can also write your own custom Langflow components, integrate MCP servers, call APIs, and leverage any other functionality provided by Langflow.
|
||||
|
||||
* [OpenSearch](https://docs.opensearch.org/latest/): OpenSearch is a community-driven, Apache 2.0-licensed open source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data.
|
||||
It provides powerful hybrid search capabilities with enterprise-grade security and multi-tenancy support.
|
||||
|
||||
* [Docling](https://docling-project.github.io/docling/): Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
||||
OpenRAG uses OpenSearch as the underlying vector database for storing and retrieving your documents and associated vector data (embeddings). You can ingest documents from a variety of sources, including your local filesystem and OAuth authenticated connections to popular cloud storage services.
|
||||
|
||||
OpenRAG builds on Langflow's familiar interface while adding OpenSearch for vector storage and Docling for simplified document parsing. It uses opinionated flows that serve as ready-to-use recipes for ingestion, retrieval, and generation from familiar sources like Google Drive, OneDrive, and SharePoint.
|
||||
* [Docling](https://docling-project.github.io/docling/): Docling simplifies document processing, supports many file formats and advanced PDF parsing, and provides seamless integrations with the generative AI ecosystem.
|
||||
|
||||
What's more, every part of the stack is interchangeable: You can write your own custom components in Python, try different language models, and customize your flows to build a personalized agentic RAG system.
|
||||
OpenRAG uses Docling to parse and chunk documents that are stored in your OpenSearch knowledge base.
|
||||
|
||||
:::tip
|
||||
Ready to get started? Try the [quickstart](/quickstart) to install OpenRAG and start exploring in minutes.
|
||||
|
|
@@ -52,12 +57,12 @@ flowchart TD
|
|||
ext --> backend
|
||||
```
|
||||
|
||||
* The **OpenRAG Backend** is the central orchestration service that coordinates all other components.
|
||||
* **OpenRAG backend**: The central orchestration service that coordinates all other components.
|
||||
|
||||
* **Langflow** provides a visual workflow engine for building AI agents, and connects to **OpenSearch** for vector storage and retrieval.
|
||||
* **Langflow**: This container runs a Langflow instance. It provides the embedded Langflow visual editor for editing and creating flows, and it connects to the **OpenSearch** container for vector storage and retrieval.
|
||||
|
||||
* **Docling Serve** is a local document processing service managed by the **OpenRAG Backend**.
|
||||
* **Docling Serve**: This is a local document processing service managed by the **OpenRAG backend**.
|
||||
|
||||
* **External connectors** integrate third-party cloud storage services through OAuth authenticated connections to the **OpenRAG Backend**, allowing synchronization of external storage with your OpenSearch knowledge base.
|
||||
* **External connectors**: Integrate third-party cloud storage services through OAuth authenticated connections to the **OpenRAG backend**, allowing synchronization of external storage with your OpenSearch knowledge base.
|
||||
|
||||
* The **OpenRAG Frontend** provides the user interface for interacting with the platform.
|
||||
* **OpenRAG frontend**: Provides the user interface for interacting with the OpenRAG platform.
|
||||
|
|
@@ -71,7 +71,7 @@ For more information, see [Application onboarding](/install#application-onboardi
|
|||
|
||||
### Document processing
|
||||
|
||||
Control how OpenRAG [processes and ingests documents](/knowledge) into your knowledge base.
|
||||
Control how OpenRAG [processes and ingests documents](/ingestion) into your knowledge base.
|
||||
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
|
|
|
|||
|
|
@@ -33,7 +33,6 @@ const sidebars = {
|
|||
type: "category",
|
||||
label: "Knowledge",
|
||||
items: [
|
||||
"core-components/knowledge-configure",
|
||||
"core-components/knowledge",
|
||||
"core-components/ingestion",
|
||||
"core-components/knowledge-filters",
|
||||