# OpenRAG First-Run Setup

This document describes the automatic dataset loading feature that initializes OpenRAG with default documents on first startup.

## Overview

OpenRAG now includes a first-run initialization system that automatically loads documents from a default dataset directory when the application starts for the first time.

## Configuration

### Environment Variables

- `DATA_DIRECTORY`: Path to the directory containing default documents to load on first run (default: `./documents`)
- `SKIP_FIRST_RUN_INIT`: Set to `true` to disable automatic first-run initialization (default: `false`)

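
As a rough illustration of how these two variables could be read, here is a minimal Python sketch. The names `FirstRunConfig` and `load_first_run_config` are hypothetical and not taken from the OpenRAG source. The sketch also assumes that only the string `true` (case-insensitive) enables the skip flag, and that an empty `DATA_DIRECTORY` falls back to the documented default.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class FirstRunConfig:
    """Hypothetical container for the two documented settings."""
    data_directory: Path
    skip_first_run_init: bool


def load_first_run_config(env: Mapping[str, str] = os.environ) -> FirstRunConfig:
    """Read the documented variables, falling back to the documented defaults."""
    # An unset or empty DATA_DIRECTORY falls back to ./documents (assumption).
    data_directory = Path(env.get("DATA_DIRECTORY") or "./documents")
    # Assumption: only "true" (any letter case) disables first-run initialization.
    skip = env.get("SKIP_FIRST_RUN_INIT", "false").strip().lower() == "true"
    return FirstRunConfig(data_directory=data_directory, skip_first_run_init=skip)
```

Passing the environment as a mapping keeps the function easy to test with a plain dictionary instead of the real process environment.
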
### External Dataset Loading

You can point `DATA_DIRECTORY` to any external location containing your default datasets. The system will:
- Copy files from the external directory to `./documents` to ensure Docker volume access
- Maintain directory structure during copying
- Only copy newer files (based on modification time)
- Skip files that already exist and are up-to-date

Example with external directory:
```bash
DATA_DIRECTORY=/path/to/my/external/datasets
```

This allows you to maintain your datasets outside the OpenRAG project while still leveraging the automatic loading feature.

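
The copy rules listed in this section (preserve the directory structure, copy only newer files, skip up-to-date ones) can be sketched in Python as follows. This is an illustration under stated assumptions, not the OpenRAG implementation: the function name `copy_newer_files` is hypothetical, and "newer" is assumed to mean a strictly greater modification time on the source file.

```python
import shutil
from pathlib import Path


def copy_newer_files(source: Path, target: Path) -> list[Path]:
    """Copy files from source into target, keeping their relative paths.

    A file is copied only when it is missing in target or when the source
    copy has a newer modification time. Returns the copied target paths.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        dst_file = target / src_file.relative_to(source)
        if dst_file.exists() and dst_file.stat().st_mtime >= src_file.stat().st_mtime:
            continue  # already present and up to date
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        # copy2 also copies the modification time, so a second run skips this file.
        shutil.copy2(src_file, dst_file)
        copied.append(dst_file)
    return copied
```

Because `shutil.copy2` carries the source timestamp over to the copy, repeated runs only touch files that changed in the external directory.
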
### Example .env Configuration

```bash
# Default dataset directory for automatic ingestion on first run
DATA_DIRECTORY=./documents

# Skip first-run initialization (set to true to disable automatic dataset loading)
# SKIP_FIRST_RUN_INIT=false
```

## How It Works

1. **First-Run Detection**: The system checks for a `.openrag_initialized` marker file in the application root
2. **Document Detection**: If no marker file exists and there are no existing documents in the OpenSearch index, first-run initialization triggers
3. **File Copying**: If `DATA_DIRECTORY` points to a different location than `./documents`, files are copied to the documents folder to ensure Docker volume access
4. **File Discovery**: The system scans the documents folder for supported document types (PDF, TXT, DOC, DOCX, MD, RTF, ODT)
5. **Existing Workflow Reuse**: Found files are processed using the same `create_upload_task` method as the manual "Upload Path" feature
6. **Document Ownership**: In no-auth mode, documents are owned by the anonymous user; in auth mode, documents are created without an owner and are therefore globally accessible
7. **Initialization Marker**: After successful setup, a marker file is created to prevent re-initialization on subsequent startups

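
The detection logic in steps 1, 2, and 7 can be sketched in Python as shown here. The function names are hypothetical and the real code in `first_run.py` may be organized differently. The count of indexed documents is passed in as a plain integer so the sketch does not assume any particular OpenSearch client call.

```python
from pathlib import Path

MARKER_NAME = ".openrag_initialized"  # marker file name from the steps above


def should_initialize(app_root: Path, indexed_document_count: int, skip: bool = False) -> bool:
    """Return True only when first-run initialization should trigger."""
    if skip:
        return False  # SKIP_FIRST_RUN_INIT=true disables the feature entirely
    if (app_root / MARKER_NAME).exists():
        return False  # a previous run already completed initialization
    return indexed_document_count == 0  # only initialize an empty index


def mark_initialized(app_root: Path) -> Path:
    """Create the empty marker file that prevents re-initialization."""
    marker = app_root / MARKER_NAME
    marker.touch()
    return marker
```

Keeping the decision in a pure function of three inputs makes each of the documented conditions easy to verify in isolation.
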
## Docker Configuration

The `DATA_DIRECTORY` environment variable is automatically passed to the Docker containers. The default `./documents` directory is already mounted as a volume in the Docker configuration.

### Docker Compose

Both `docker-compose.yml` and `docker-compose-cpu.yml` have been updated to include:

```yaml
environment:
  - DATA_DIRECTORY=${DATA_DIRECTORY}
volumes:
  - ./documents:/app/documents:Z
```

## File Structure

```
openrag/
├── documents/                    # Default dataset directory
│   ├── sample1.pdf
│   ├── sample2.txt
│   └── ...
├── .openrag_initialized          # Created after first successful initialization
└── src/
    └── utils/
        ├── first_run.py          # First-run detection logic
        └── default_ingestion.py  # Dataset ingestion logic
```

## Supported File Types

The first-run initialization supports the following document types:
- PDF (.pdf)
- Plain text (.txt)
- Microsoft Word (.doc, .docx)
- Markdown (.md)
- Rich Text Format (.rtf)
- OpenDocument Text (.odt)

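
A file discovery step restricted to these extensions could look like the following Python sketch. The function name `find_supported_files` is hypothetical. The recursive scan and the case-insensitive extension match are assumptions, not documented behavior.

```python
from pathlib import Path

# Extensions taken from the list of supported file types above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".doc", ".docx", ".md", ".rtf", ".odt"}


def find_supported_files(folder: Path) -> list[Path]:
    """Return all supported files under folder, sorted for stable ordering."""
    return sorted(
        path
        for path in folder.rglob("*")  # assumption: subdirectories are scanned too
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```
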
## Behavior

### Normal First Run
1. Application starts
2. OpenSearch index is initialized
3. System checks for the `.openrag_initialized` marker file and for existing documents
4. If neither is found, copies files from `DATA_DIRECTORY` to `./documents` (if different)
5. Scans the documents folder for supported files
6. Creates an upload task using the existing `create_upload_task` method (same as manual "Upload Path")
7. Documents are processed through the complete knowledge pipeline (conversion, chunking, embedding, indexing)
8. Creates the `.openrag_initialized` marker file
9. Processing continues asynchronously in the background

### Subsequent Runs
1. Application starts
2. System detects the `.openrag_initialized` marker file
3. First-run initialization is skipped
4. Application starts normally

### Skipping Initialization
Set `SKIP_FIRST_RUN_INIT=true` in your environment to disable first-run initialization entirely.

## Monitoring

First-run initialization creates a background task that can be monitored through:
- Console logs with `[FIRST_RUN]` prefix
- Task API endpoints (for system tasks)

## Troubleshooting

### No Documents Were Loaded
1. Check that `DATA_DIRECTORY` points to a valid directory
2. Verify the directory contains supported file types
3. Check console logs for `[FIRST_RUN]` messages
4. Ensure OpenSearch is running and accessible

### Disable First-Run Setup
If you want to prevent automatic initialization:
1. Set `SKIP_FIRST_RUN_INIT=true` in your .env file
2. Or create an empty `.openrag_initialized` file manually

### Files Not Visible in Knowledge List
If first-run files don't appear in the knowledge interface:

**For No-Auth Mode:**
- Files should be owned by the "anonymous" user and visible immediately

**For Auth Mode:**
- Files are created without an owner field, making them globally accessible
- All authenticated users should see these files in their knowledge list
- Check the OpenSearch DLS configuration in `securityconfig/roles.yml`

### Force Re-initialization
To force first-run setup to run again:
1. Stop the application
2. Delete the `.openrag_initialized` file
3. Optionally clear the OpenSearch index
4. Restart the application

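
Deleting the marker file for a forced re-initialization can also be scripted. The following is a small Python sketch with a hypothetical helper name, assuming the marker lives directly in the application root as described earlier. Run it only while the application is stopped.

```python
from pathlib import Path


def clear_first_run_marker(app_root: Path) -> bool:
    """Delete the marker file; return True if a file was actually removed."""
    marker = app_root / ".openrag_initialized"
    if marker.exists():
        marker.unlink()
        return True
    return False  # nothing to delete, first-run setup was not marked as done
```
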