clarify-ingestion
parent
3e618271c0
commit
cc0eaf8317
1 changed files with 15 additions and 11 deletions
|
|
@ -15,14 +15,16 @@ Docling ingests documents from your local machine or OAuth connectors, splits th
|
|||
|
||||
OpenRAG chose Docling for its support for a wide variety of file formats, high performance, and advanced understanding of tables and images.
|
||||
|
||||
## Docling ingestion settings
|
||||
To modify OpenRAG's Knowledge Ingest settings or flows, click <Icon name="Settings" aria-hidden="true"/> **Settings**.
|
||||
|
||||
## Knowledge ingestion settings
|
||||
|
||||
These settings configure the Docling ingestion parameters.
|
||||
|
||||
OpenRAG will warn you if `docling serve` is not running.
|
||||
To start or stop `docling serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
|
||||
|
||||
**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`.
|
||||
**Embedding model** determines which AI model is used to create vector embeddings. The default is the OpenAI `text-embedding-3-small` model.
|
||||
|
||||
**Chunk size** determines how large each text chunk is in number of characters.
|
||||
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
|
||||
|
|
@ -32,6 +34,8 @@ The default value of `1000` characters provides a good starting point that balan
|
|||
Use larger overlap values for documents where context is most important, and use smaller overlap values for simpler documents, or when optimization is most important.
|
||||
The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap) is suitable for general use cases. Decrease the overlap to 10% for a more efficient pipeline, or increase to 40% for more complex documents.
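
The relationship between chunk size and overlap can be sketched in a few lines of Python. This is an illustrative character-based splitter, not the splitter that OpenRAG or Docling actually uses; the function name `chunk_text` and its parameters are hypothetical and chosen only to mirror the settings described above.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character chunks where neighbors share chunk_overlap characters.

    Hypothetical helper for illustration only; not part of OpenRAG or Docling.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# With the defaults, the overlap is 200 / 1000 = 20% of each chunk.
sample = "".join(chr(97 + i % 26) for i in range(2600))
chunks = chunk_text(sample)
overlap_ratio = 200 / 1000
```

With these defaults, each chunk begins 800 characters after the previous one, so a larger overlap produces more chunks for the same document. That is the efficiency cost mentioned above.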
|
||||
|
||||
**Table Structure** enables Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/) tool for parsing tables. Instead of treating tables as plain text, tables are output as structured table data with preserved relationships and metadata. **Table Structure** is enabled by default.
|
||||
|
||||
**OCR** enables or disables OCR processing when extracting text from images and scanned documents.
|
||||
OCR is disabled by default. The disabled setting is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/). When OCR is disabled, images are ignored and not processed.
|
||||
|
||||
|
|
@ -41,14 +45,6 @@ If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the
|
|||
|
||||
**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
|
||||
|
||||
## Use OpenRAG default ingestion instead of Docling serve
|
||||
|
||||
If you want to use OpenRAG's built-in pipeline instead of Docling serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/reference/configuration#document-processing).
|
||||
|
||||
The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
|
||||
|
||||
For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).
|
||||
|
||||
## Knowledge ingestion flows
|
||||
|
||||
[Flows](https://docs.langflow.org/concepts-overview) in Langflow are functional representations of application workflows, with multiple [component](https://docs.langflow.org/concepts-components) nodes connected as single steps in a workflow.
|
||||
|
|
@ -74,4 +70,12 @@ An additional knowledge ingestion flow is included in OpenRAG, where it is used
|
|||
The agent calls this component to fetch web content, and the results are ingested into OpenSearch.
|
||||
|
||||
For more on using MCP clients in Langflow, see [MCP clients](https://docs.langflow.org/mcp-client).\
|
||||
To connect additional MCP servers to the MCP client, see [Connect to MCP servers from your application](https://docs.langflow.org/mcp-tutorial).
|
||||
To connect additional MCP servers to the MCP client, see [Connect to MCP servers from your application](https://docs.langflow.org/mcp-tutorial).
|
||||
|
||||
## Use OpenRAG default ingestion instead of Docling serve
|
||||
|
||||
If you want to use OpenRAG's built-in pipeline instead of Docling serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/reference/configuration#document-processing).
|
||||
|
||||
The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
|
||||
|
||||
For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).
|
||||