openrag/FIRST_RUN_SETUP.md
2025-09-03 10:28:50 -04:00

OpenRAG First-Run Setup

This document describes the automatic dataset loading feature that initializes OpenRAG with default documents on first startup.

Overview

OpenRAG now includes a first-run initialization system that automatically loads documents from a default dataset directory when the application starts for the first time.

Configuration

Environment Variables

  • DATA_DIRECTORY: Path to the directory containing default documents to load on first run (default: ./documents)
  • SKIP_FIRST_RUN_INIT: Set to true to disable automatic first-run initialization (default: false)
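
As a minimal sketch of how these two variables might be read, the snippet below resolves them with the documented defaults. The function name `read_first_run_settings` is hypothetical, and the case-insensitive comparison against `true` is an assumption. OpenRAG's own parsing may differ.

```python
import os
from pathlib import Path


def read_first_run_settings(env=None):
    """Return (data_directory, skip_init) from an environment-style mapping.

    Hypothetical helper: the defaults match the documentation, but the
    case-insensitive handling of "true" is an assumption.
    """
    env = os.environ if env is None else env
    # Fall back to ./documents when the variable is unset or empty.
    data_directory = Path(env.get("DATA_DIRECTORY") or "./documents")
    # Anything other than "true" leaves first-run initialization enabled.
    skip_init = env.get("SKIP_FIRST_RUN_INIT", "false").strip().lower() == "true"
    return data_directory, skip_init
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.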

External Dataset Loading

You can point DATA_DIRECTORY to any external location containing your default datasets. The system will:

  • Copy files from the external directory to ./documents to ensure Docker volume access
  • Maintain directory structure during copying
  • Only copy newer files (based on modification time)
  • Skip files that already exist and are up-to-date

Example with external directory:

DATA_DIRECTORY=/path/to/my/external/datasets

This allows you to maintain your datasets outside the OpenRAG project while still leveraging the automatic loading feature.
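
The copy rules listed above (preserve directory structure, copy only newer files, skip up-to-date ones) can be sketched roughly as follows. This illustrates the described behavior and is not OpenRAG's actual code. The function name `sync_newer_files` is hypothetical, and comparing `st_mtime` values is an assumption about how "newer" is decided.

```python
import shutil
from pathlib import Path


def sync_newer_files(source_dir, target_dir):
    """Copy files from source_dir into target_dir, keeping relative paths.

    A file is copied only when the target is missing or older than the
    source (by modification time). Returns the list of copied targets.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = target_dir / src.relative_to(source_dir)
        # Skip files that already exist and are up to date.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        # copy2 preserves the modification time, so a second run skips the file.
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Using `shutil.copy2` rather than `shutil.copy` matters here: without preserved timestamps, the modification-time comparison would still work, but it would rely on copy time instead of the file's real age.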

Example .env Configuration

# Default dataset directory for automatic ingestion on first run
DATA_DIRECTORY=./documents

# Skip first-run initialization (set to true to disable automatic dataset loading)
# SKIP_FIRST_RUN_INIT=false

How It Works

  1. First-Run Detection: The system checks for a .openrag_initialized marker file in the application root
  2. Document Detection: If no marker file exists and there are no existing documents in the OpenSearch index, first-run initialization triggers
  3. File Copying: If DATA_DIRECTORY points to a different location than ./documents, files are copied to the documents folder to ensure Docker volume access
  4. File Discovery: The system scans the documents folder for supported document types (PDF, TXT, DOC, DOCX, MD, RTF, ODT)
  5. Existing Workflow Reuse: Found files are processed using the same create_upload_task method as the manual "Upload Path" feature
  6. Document Ownership: In no-auth mode, documents are owned by the anonymous user; in auth mode, documents are created without an owner and are therefore globally accessible
  7. Initialization Marker: After successful setup, a marker file is created to prevent re-initialization on subsequent startups
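
The detection logic in steps 1, 2, and 7 can be summarized in a short sketch. The function names and the `indexed_document_count` parameter are hypothetical. In the real system the count would come from a query against the OpenSearch index, which is omitted here.

```python
from pathlib import Path

MARKER_NAME = ".openrag_initialized"


def should_run_first_time_init(app_root, indexed_document_count, skip_init=False):
    """Decide whether first-run initialization should trigger.

    indexed_document_count is assumed to be supplied by the caller,
    for example from a count query against the OpenSearch index.
    """
    if skip_init:
        return False
    if (Path(app_root) / MARKER_NAME).exists():
        return False
    return indexed_document_count == 0


def mark_initialized(app_root):
    """Create the marker file that prevents re-initialization."""
    marker = Path(app_root) / MARKER_NAME
    marker.touch()
    return marker
```

Checking the marker before the index keeps subsequent startups cheap, since a present marker means the index does not need to be queried at all.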

Docker Configuration

The DATA_DIRECTORY environment variable is automatically passed to the Docker containers. The default ./documents directory is already mounted as a volume in the Docker configuration.

Docker Compose

Both docker-compose.yml and docker-compose-cpu.yml have been updated to include:

environment:
  - DATA_DIRECTORY=${DATA_DIRECTORY}
volumes:
  - ./documents:/app/documents:Z

File Structure

openrag/
├── documents/          # Default dataset directory
│   ├── sample1.pdf
│   ├── sample2.txt
│   └── ...
├── .openrag_initialized  # Created after first successful initialization
└── src/
    └── utils/
        ├── first_run.py          # First-run detection logic
        └── default_ingestion.py  # Dataset ingestion logic

Supported File Types

The first-run initialization supports the following document types:

  • PDF (.pdf)
  • Plain text (.txt)
  • Microsoft Word (.doc, .docx)
  • Markdown (.md)
  • Rich Text Format (.rtf)
  • OpenDocument Text (.odt)
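
A file scan limited to these extensions might look like the sketch below. The name `find_supported_files` is hypothetical, and both the recursive scan and the case-insensitive extension match are assumptions rather than confirmed OpenRAG behavior.

```python
from pathlib import Path

# Extensions taken from the list above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".doc", ".docx", ".md", ".rtf", ".odt"}


def find_supported_files(documents_dir):
    """Return supported files under documents_dir, sorted by path.

    Assumptions: subdirectories are scanned recursively and the
    extension comparison ignores case.
    """
    root = Path(documents_dir)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```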

Behavior

Normal First Run

  1. Application starts
  2. OpenSearch index is initialized
  3. System checks for existing documents
  4. If none found, copies files from DATA_DIRECTORY to ./documents (if different)
  5. Scans documents folder for supported files
  6. Creates an upload task using the existing create_upload_task method (same as manual "Upload Path")
  7. Documents are processed through the complete knowledge pipeline (conversion, chunking, embedding, indexing)
  8. Creates .openrag_initialized marker file
  9. Processing continues asynchronously in the background

Subsequent Runs

  1. Application starts
  2. System detects .openrag_initialized marker file
  3. First-run initialization is skipped
  4. Application starts normally

Skipping Initialization

Set SKIP_FIRST_RUN_INIT=true in your environment to disable first-run initialization entirely.

Monitoring

First-run initialization creates a background task that can be monitored through:

  • Console logs with [FIRST_RUN] prefix
  • Task API endpoints (for system tasks)
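
When the console output is noisy, the `[FIRST_RUN]` lines can be isolated with a small filter. This is a generic sketch: the helper name `first_run_log_lines` is hypothetical, and it assumes only that the prefix appears somewhere in each relevant line.

```python
FIRST_RUN_PREFIX = "[FIRST_RUN]"


def first_run_log_lines(lines):
    """Return only the log lines that mention the first-run prefix."""
    return [line.rstrip("\n") for line in lines if FIRST_RUN_PREFIX in line]
```

The function accepts any iterable of strings, so it works equally on an open log file or on captured container output split into lines.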

Troubleshooting

No Documents Were Loaded

  1. Check that DATA_DIRECTORY points to a valid directory
  2. Verify the directory contains supported file types
  3. Check console logs for [FIRST_RUN] messages
  4. Ensure OpenSearch is running and accessible

Disable First-Run Setup

If you want to prevent automatic initialization:

  1. Set SKIP_FIRST_RUN_INIT=true in your .env file
  2. Or create an empty .openrag_initialized file manually

Files Not Visible in Knowledge List

If first-run files don't appear in the knowledge interface:

For No-Auth Mode:

  • Files should be owned by the "anonymous" user and visible immediately

For Auth Mode:

  • Files are created without an owner field, making them globally accessible
  • All authenticated users should see these files in their knowledge list
  • Check OpenSearch DLS configuration in securityconfig/roles.yml

Force Re-initialization

To force first-run setup to run again:

  1. Stop the application
  2. Delete the .openrag_initialized file
  3. Optionally clear the OpenSearch index
  4. Restart the application
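
Step 2 can also be scripted. The sketch below removes the marker file if it is present. The function name `reset_first_run_marker` is hypothetical, and it assumes the marker lives in the application root, as described under How It Works. Clearing the OpenSearch index is left out.

```python
from pathlib import Path


def reset_first_run_marker(app_root="."):
    """Delete the .openrag_initialized marker; return True if it was removed."""
    marker = Path(app_root) / ".openrag_initialized"
    if marker.exists():
        marker.unlink()
        return True
    return False
```

Run it only while the application is stopped, so that the next startup sees a clean state.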