openrag-defaults
parent 74565d3835
commit 37bd1880b3
1 changed file with 7 additions and 13 deletions
@@ -22,7 +22,7 @@ These settings configure the Docling ingestion parameters, from using no OCR to
OpenRAG will warn you if `docling-serve` is not running.
To start or stop `docling-serve` or any other native services, in the TUI main menu, click **Start Native Services** or **Stop Native Services**.
**Embedding model** determines which AI model is used to create vector embeddings. The default is `text-embedding-3-small`.
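The exact call OpenRAG makes depends on its internal pipeline, but as a rough sketch, generating an embedding with the default model through the OpenAI Python client looks like this (only the model name comes from the setting above; the rest is illustrative):

```python
# Hypothetical illustration, not OpenRAG's internal code: embed one chunk of text
# with the default embedding model using the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="OpenRAG ingests documents and stores vector embeddings for retrieval.",
)

vector = response.data[0].embedding
print(len(vector))  # text-embedding-3-small returns 1536-dimensional vectors
```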
**Chunk size** determines how large each text chunk is in number of characters.
Larger chunks yield more context per chunk, but may include irrelevant information. Smaller chunks yield more precise semantic search, but may lack context.
@@ -35,22 +35,16 @@ The default value of 200 characters of overlap with a chunk size of 1000 (20% overlap)
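As an illustration of how chunk size and chunk overlap interact, here is a minimal, hypothetical character-based chunker using the defaults above (1000-character chunks, 200-character overlap); OpenRAG's actual splitter may handle boundaries differently:

```python
# Hypothetical sketch of character-based chunking with overlap; OpenRAG's real
# splitter may break on sentences or words rather than raw character offsets.
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    # With the defaults, each new chunk starts 800 characters after the previous one,
    # so the last 200 characters of a chunk are repeated at the start of the next.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]

chunks = chunk_text("example document text " * 300)
# A sentence that straddles a chunk boundary still appears intact in one of the
# overlapping chunks, which is the point of the overlap setting.
```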
**OCR** enables or disables OCR processing when extracting text from images and scanned documents.
OCR is disabled by default. This setting is best suited for processing text-based documents as quickly as possible with Docling's [`DocumentConverter`](https://docling-project.github.io/docling/reference/document_converter/). Images are ignored and not processed.
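For reference, a minimal sketch of what this fast, OCR-free path looks like with Docling's own API; this illustrates the library call, not OpenRAG's exact configuration, and the input file name is hypothetical:

```python
# Hypothetical sketch: convert a PDF with Docling's DocumentConverter, OCR disabled.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(do_ocr=False)  # skip OCR; images are ignored

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("report.pdf")  # hypothetical input file
print(result.document.export_to_markdown())
```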
Enable OCR when you are processing documents containing images with text that requires extraction, or for scanned documents. Enabling OCR can slow ingestion performance.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [ocrmac](https://www.piwheels.org/project/ocrmac/) OCR engine. Other platforms use [easyocr](https://www.jaided.ai/easyocr/).
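A hedged sketch of that platform check using Docling's OCR option classes; the class names here are assumptions about Docling's API, and this is not a quote from OpenRAG's code:

```python
# Hypothetical sketch of picking an OCR engine per platform; OpenRAG's own
# detection logic may differ. OcrMacOptions and EasyOcrOptions are Docling option classes.
import platform

from docling.datamodel.pipeline_options import EasyOcrOptions, OcrMacOptions, PdfPipelineOptions

pipeline_options = PdfPipelineOptions(do_ocr=True)
pipeline_options.ocr_options = (
    OcrMacOptions() if platform.system() == "Darwin" else EasyOcrOptions()
)
```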
**Picture descriptions** adds image descriptions generated by the [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) model to OCR processing. Enabling picture descriptions can slow ingestion performance.
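As a rough sketch of how that option maps onto Docling's pipeline settings; the attribute and preset names are assumptions about Docling's picture-description support, not OpenRAG's code:

```python
# Hypothetical sketch: enable picture descriptions in a Docling PDF pipeline.
from docling.datamodel.pipeline_options import PdfPipelineOptions, smolvlm_picture_description

pipeline_options = PdfPipelineOptions(do_ocr=True)
pipeline_options.do_picture_description = True
# Preset pointing at a small captioning VLM; per the docs above, OpenRAG pairs
# this setting with SmolVLM-256M-Instruct.
pipeline_options.picture_description_options = smolvlm_picture_description
```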
**VLM (Vision Language Model)** enables or disables VLM processing.
VLM processing is used _instead of_ OCR processing.
It uses an LLM to understand a document's structure and return text in a structured `doctags` format.
For more information, see [Vision models](https://docling-project.github.io/docling/usage/vision_models/).
Enable a VLM when you are processing complex documents containing a mixture of text, images, tables, and charts.
If OpenRAG detects that the local machine is running on macOS, OpenRAG uses the [SmolDocling-256M-preview-mlx-bf16](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16) VLM, which includes the [MLX framework](https://ml-explore.github.io/mlx/build/html/index.html) for Apple silicon.
Other platforms use [SmolDocling-256M-preview](https://huggingface.co/ds4sd/SmolDocling-256M-preview).
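A minimal, hedged sketch of routing a conversion through Docling's VLM pipeline (which, per the models above, uses a SmolDocling checkpoint); the exact options OpenRAG passes are not shown here, and the input file name is hypothetical:

```python
# Hypothetical sketch: use Docling's VLM pipeline instead of the OCR-based PDF pipeline.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline)}
)

# The VLM reads page images and emits structured doctags, which Docling assembles
# into a document; export to Markdown (or another format) afterwards.
result = converter.convert("scanned_mixed_content.pdf")  # hypothetical input file
print(result.document.export_to_markdown())
```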

## Use OpenRAG default ingestion instead of Docling serve

If you want to use OpenRAG's built-in pipeline instead of Docling serve, set `DISABLE_INGEST_WITH_LANGFLOW=true` in [Environment variables](/configure/configuration#ingestion-configuration).
The built-in pipeline still uses the Docling processor, but uses it directly without the Docling Serve API.
For more information, see [`processors.py` in the OpenRAG repository](https://github.com/langflow-ai/openrag/blob/main/src/models/processors.py#L58).