OpenRAG First-Run Setup
This document describes the automatic dataset loading feature that initializes OpenRAG with default documents on first startup.
Overview
OpenRAG now includes a first-run initialization system that automatically loads documents from a default dataset directory when the application starts for the first time.
Configuration
Environment Variables
- DATA_DIRECTORY: Path to the directory containing default documents to load on first run (default: ./documents)
- SKIP_FIRST_RUN_INIT: Set to true to disable automatic first-run initialization (default: false)
External Dataset Loading
You can point DATA_DIRECTORY to any external location containing your default datasets. The system will:
- Copy files from the external directory to ./documents to ensure Docker volume access
- Maintain directory structure during copying
- Only copy newer files (based on modification time)
- Skip files that already exist and are up-to-date
Example with external directory:
DATA_DIRECTORY=/path/to/my/external/datasets
This allows you to maintain your datasets outside the OpenRAG project while still leveraging the automatic loading feature.
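The copy rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code in default_ingestion.py: the function name sync_newer_files is hypothetical, and it assumes "newer" means a strictly later modification time on the source file.

```python
import shutil
from pathlib import Path


def sync_newer_files(source_dir: Path, target_dir: Path) -> list[Path]:
    """Copy files from source_dir into target_dir, keeping the directory layout.

    A file is copied only when the target is missing or older than the source.
    Returns the target paths that were written. (Hypothetical helper.)
    """
    copied = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = target_dir / src.relative_to(source_dir)
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue  # target already exists and is up to date
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(dst)
    return copied
```

Because shutil.copy2 preserves modification times, a second call over unchanged files copies nothing, which matches the "skip files that already exist and are up-to-date" rule.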
Example .env Configuration
# Default dataset directory for automatic ingestion on first run
DATA_DIRECTORY=./documents
# Skip first-run initialization (set to true to disable automatic dataset loading)
# SKIP_FIRST_RUN_INIT=false
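For illustration, the two variables could be read in Python roughly as follows. The function name load_first_run_settings is hypothetical, and treating the value case-insensitively ("TRUE", "True") is an assumption; the documented value is the lowercase string true.

```python
import os
from pathlib import Path


def load_first_run_settings(env=None):
    """Return (data_directory, skip_first_run) using the documented defaults.

    Hypothetical helper; pass a dict as env to avoid touching os.environ.
    """
    env = os.environ if env is None else env
    data_directory = Path(env.get("DATA_DIRECTORY", "./documents"))
    # Assumption: any casing of "true" disables first-run initialization.
    skip = env.get("SKIP_FIRST_RUN_INIT", "false").strip().lower() == "true"
    return data_directory, skip
```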
How It Works
- First-Run Detection: The system checks for a .openrag_initialized marker file in the application root
- Document Detection: If no marker file exists and there are no existing documents in the OpenSearch index, first-run initialization triggers
- File Copying: If DATA_DIRECTORY points to a different location than ./documents, files are copied to the documents folder to ensure Docker volume access
- File Discovery: The system scans the documents folder for supported document types (PDF, TXT, DOC, DOCX, MD, RTF, ODT)
- Existing Workflow Reuse: Found files are processed using the same create_upload_task method as the manual "Upload Path" feature
- Document Ownership: In no-auth mode, documents are owned by the anonymous user; in auth mode, documents are created without an owner (globally accessible)
- Initialization Marker: After successful setup, a marker file is created to prevent re-initialization on subsequent startups
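The detection and marker steps can be sketched as a small decision function. This is an assumption-laden outline rather than the contents of first_run.py: should_initialize and mark_initialized are hypothetical names, and the document count is passed in as a plain integer instead of being queried from OpenSearch.

```python
from pathlib import Path

MARKER_NAME = ".openrag_initialized"


def should_initialize(app_root: Path, indexed_document_count: int, skip: bool = False) -> bool:
    """Decide whether first-run initialization should trigger (hypothetical helper)."""
    if skip:  # corresponds to SKIP_FIRST_RUN_INIT=true
        return False
    if (app_root / MARKER_NAME).exists():  # a previous run already completed setup
        return False
    return indexed_document_count == 0  # only run against an empty index


def mark_initialized(app_root: Path) -> Path:
    """Create the marker file so later startups skip initialization."""
    marker = app_root / MARKER_NAME
    marker.touch()
    return marker
```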
Docker Configuration
The DATA_DIRECTORY environment variable is automatically passed to the Docker containers. The default ./documents directory is already mounted as a volume in the Docker configuration.
Docker Compose
Both docker-compose.yml and docker-compose-cpu.yml have been updated to include:
environment:
- DATA_DIRECTORY=${DATA_DIRECTORY}
volumes:
- ./documents:/app/documents:Z
File Structure
openrag/
├── documents/ # Default dataset directory
│ ├── sample1.pdf
│ ├── sample2.txt
│ └── ...
├── .openrag_initialized # Created after first successful initialization
└── src/
    └── utils/
        ├── first_run.py # First-run detection logic
        └── default_ingestion.py # Dataset ingestion logic
Supported File Types
The first-run initialization supports the following document types:
- PDF (.pdf)
- Plain text (.txt)
- Microsoft Word (.doc, .docx)
- Markdown (.md)
- Rich Text Format (.rtf)
- OpenDocument Text (.odt)
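As a rough sketch, file discovery over these types amounts to filtering by extension. The name find_supported_files is hypothetical, and two behaviors here are assumptions not stated above: the scan is recursive, and extensions are matched case-insensitively.

```python
from pathlib import Path

# Extensions taken from the list of supported document types.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".doc", ".docx", ".md", ".rtf", ".odt"}


def find_supported_files(documents_dir: Path) -> list[Path]:
    """Return supported files under documents_dir in sorted order (hypothetical helper)."""
    return sorted(
        path
        for path in documents_dir.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```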
Behavior
Normal First Run
- Application starts
- OpenSearch index is initialized
- System checks for existing documents
- If none found, copies files from DATA_DIRECTORY to ./documents (if different)
- Scans documents folder for supported files
- Creates upload task using existing create_upload_task method (same as manual "Upload Path")
- Documents are processed through complete knowledge pipeline (conversion, chunking, embedding, indexing)
- Creates .openrag_initialized marker file
- Processing continues asynchronously in the background
Subsequent Runs
- Application starts
- System detects .openrag_initialized marker file
- First-run initialization is skipped
- Application starts normally
Skipping Initialization
Set SKIP_FIRST_RUN_INIT=true in your environment to disable first-run initialization entirely.
Monitoring
First-run initialization creates a background task that can be monitored through:
- Console logs with [FIRST_RUN] prefix
- Task API endpoints (for system tasks)
Troubleshooting
No Documents Were Loaded
- Check that DATA_DIRECTORY points to a valid directory
- Verify the directory contains supported file types
- Check console logs for [FIRST_RUN] messages
- Ensure OpenSearch is running and accessible
Disable First-Run Setup
If you want to prevent automatic initialization:
- Set SKIP_FIRST_RUN_INIT=true in your .env file
- Or create an empty .openrag_initialized file manually
Files Not Visible in Knowledge List
If first-run files don't appear in the knowledge interface:
For No-Auth Mode:
- Files should be owned by the "anonymous" user and visible immediately
For Auth Mode:
- Files are created without an owner field, making them globally accessible
- All authenticated users should see these files in their knowledge list
- Check OpenSearch DLS configuration in securityconfig/roles.yml
Force Re-initialization
To force first-run setup to run again:
- Stop the application
- Delete the .openrag_initialized file
- Optionally clear the OpenSearch index
- Restart the application
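The marker deletion step could be scripted as below. This is a small sketch with a hypothetical function name, clear_first_run_marker, and it assumes the marker lives directly in the application root as described earlier. It does not stop the application or touch the OpenSearch index.

```python
from pathlib import Path

MARKER_NAME = ".openrag_initialized"


def clear_first_run_marker(app_root: Path) -> bool:
    """Delete the marker file if present; return True when a marker was removed."""
    marker = app_root / MARKER_NAME
    existed = marker.exists()
    marker.unlink(missing_ok=True)  # no error if the marker is already gone
    return existed
```

Run it only while the application is stopped, then restart so the first-run check sees no marker.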