Merge branch 'main' into docs-peer-review

This commit is contained in:
Mendon Kissling 2025-10-01 11:10:08 -04:00 committed by GitHub
commit daa85dc5e7
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
5 changed files with 180 additions and 0 deletions

111
docs/VERSIONING_SETUP.md Normal file
View file

@ -0,0 +1,111 @@
# Docusaurus versioning setup
Docs versioning is currently **DISABLED** but configured and ready to enable.
The configuration is found in `docusaurus.config.js` with commented-out sections.
To enable versioning, do the following:
1. Open `docusaurus.config.js`
2. Find the versioning configuration section (around line 57)
3. Uncomment the versioning configuration:
```javascript
docs: {
// ... other config
lastVersion: 'current', // Use 'current' to make ./docs the latest version
versions: {
current: {
label: 'Next (unreleased)',
path: 'next',
},
},
onlyIncludeVersions: ['current'], // Limit versions for faster builds
},
```
## Create docs versions
See the [Docusaurus docs](https://docusaurus.io/docs/versioning) for more info.
1. Use the Docusaurus CLI command to create a version.
You can use `yarn` instead of `npm`.
```bash
# Create version 1.0.0 from current docs
npm run docusaurus docs:version 1.0.0
```
This command will:
- Copy the full `docs/` folder contents into `versioned_docs/version-1.0.0/`
- Create a versioned sidebar file at `versioned_sidebars/version-1.0.0-sidebars.json`
- Append the new version to `versions.json`
3. After creating a version, update the Docusaurus configuration to include multiple versions.
`lastVersion:'1.0.0'` makes the '1.0.0' release the `latest` version.
`current` is the work-in-progress docset, accessible at `/docs/next`.
To remove a version, remove it from `onlyIncludeVersions`.
```javascript
docs: {
// ... other config
lastVersion: '1.0.0', // Make 1.0.0 the latest version
versions: {
current: {
label: 'Next (unreleased)',
path: 'next',
},
'1.0.0': {
label: '1.0.0',
path: '1.0.0',
},
},
onlyIncludeVersions: ['current', '1.0.0'], // Include both versions
},
```
4. Test the deployment locally.
```bash
npm run build
npm run serve
```
5. To add subsequent versions, repeat the process, first running the CLI command then updating `docusaurus.config.js`.
```bash
# Create version 2.0.0 from current docs
npm run docusaurus docs:version 2.0.0
```
After creating a new version, update `docusaurus.config.js`.
```javascript
docs: {
lastVersion: '2.0.0', // Make 2.0.0 the latest version
versions: {
current: {
label: 'Next (unreleased)',
path: 'next',
},
'2.0.0': {
label: '2.0.0',
path: '2.0.0',
},
'1.0.0': {
label: '1.0.0',
path: '1.0.0',
},
},
onlyIncludeVersions: ['current', '2.0.0', '1.0.0'], // Include all versions
},
```
## Disable versioning
1. Remove the `versions` configuration from `docusaurus.config.js`.
2. Delete the `docs/versioned_docs/` and `docs/versioned_sidebars/` directories.
3. Delete `docs/versions.json`.
## References
- [Official Docusaurus Versioning Documentation](https://docusaurus.io/docs/versioning)
- [Docusaurus Versioning Best Practices](https://docusaurus.io/docs/versioning#recommended-practices)

View file

@ -0,0 +1,50 @@
---
title: Docling Ingestion
slug: /ingestion
---
import Icon from "@site/src/components/icon/icon";
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import PartialModifyFlows from '@site/docs/_partial-modify-flows.mdx';
OpenRAG uses [Docling](https://docling-project.github.io/docling/) for its document ingestion pipeline.
More specifically, OpenRAG uses [Docling Serve](https://github.com/docling-project/docling-serve), which starts a `docling-serve` process on your local machine and runs Docling ingestion through an API service.
Docling ingests documents from your local machine or OAuth connectors, splits them into chunks, and stores them as separate, structured documents in the OpenSearch `documents` index.
OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.
## Docling ingestion settings
These settings configure the Docling ingestion parameters.
OpenRAG will warn you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`.
**Chunk size** determines how large each text chunk is in number of characters.
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
The default value of `1000` characters provides a good starting point that balances these considerations.
**Chunk overlap** controls the number of characters that overlap over chunk boundaries.
Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents, or when optimization is most important.
The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase to 40% for more complex documents.
**OCR** enables or disabled OCR processing when extracting text from images and scanned documents.
OCR is disabled by default. This setting is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/). Images are ignored and not processed.
Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents. Enabling OCR can slow ingestion performance.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
## Use OpenRAG default ingestion instead of Docling serve
If you want to use OpenRAG's built-in pipeline instead of Docling serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/configure/configuration#ingestion-configuration).
The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).

View file

@ -98,6 +98,10 @@ Documents are processed with the default **Knowledge Ingest** flow, so if you wa
<PartialModifyFlows />
### Knowledge ingestion settings
To configure the knowledge ingestion pipeline parameters, see [Docling Ingestion](/ingestion).
## Create knowledge filters
OpenRAG includes a knowledge filter system for organizing and managing document collections.

View file

@ -53,6 +53,16 @@ const config = {
editUrl:
'https://github.com/openrag/openrag/tree/main/docs/',
routeBasePath: '/',
// Versioning configuration - see VERSIONING_SETUP.md
// To enable versioning, uncomment the following lines:
// lastVersion: 'current',
// versions: {
// current: {
// label: 'Next (unreleased)',
// path: 'next',
// },
// },
// onlyIncludeVersions: ['current'],
},
theme: {
customCss: './src/css/custom.css',

View file

@ -60,6 +60,11 @@ const sidebars = {
type: "doc",
id: "core-components/knowledge",
label: "OpenSearch Knowledge"
},
{
type: "doc",
id: "core-components/ingestion",
label: "Docling Ingestion"
}
],
},