Introduces the LightRAG Retrieval-Augmented Generation framework as an Apolo app, including input/output schemas, types, and processors. Adds Helm chart value processing, environment and persistence configurations, and output service discovery for deployment. Includes scripts for generating type schemas and testing support, along with CI and linting setup tailored for the new app. Provides a documentation loader script to ingest markdown files into LightRAG with flexible referencing modes. Relates to MLO-469
174 lines
5.7 KiB
Markdown
174 lines
5.7 KiB
Markdown
# LightRAG Documentation Loader
|
|
|
|
Advanced script to load markdown documentation into LightRAG with flexible reference modes.
|
|
|
|
## Quick Start
|
|
|
|
```bash
|
|
# Default mode (file path references)
|
|
python load_docs.py /path/to/your/docs
|
|
|
|
# URL mode (website link references)
|
|
python load_docs.py /path/to/docs --mode urls --base-url https://docs.example.com/
|
|
```
|
|
|
|
## Reference Modes
|
|
|
|
### Files Mode (Default)
|
|
Uses local file paths in query response citations:
|
|
```bash
|
|
python load_docs.py docs/
|
|
python load_docs.py docs/ --mode files
|
|
```
|
|
|
|
**Query Response Example:**
|
|
```
|
|
### References
|
|
- [DC] getting-started/installation.md
|
|
- [KG] administration/setup.md
|
|
```
|
|
|
|
### URLs Mode
|
|
Uses website URLs in query response citations:
|
|
```bash
|
|
python load_docs.py docs/ --mode urls --base-url https://docs.apolo.us/index/
|
|
python load_docs.py docs/ --mode urls --base-url https://my-docs.com/v1/
|
|
```
|
|
|
|
**Query Response Example:**
|
|
```
|
|
### References
|
|
- [DC] https://docs.apolo.us/index/getting-started/installation
|
|
- [KG] https://docs.apolo.us/index/administration/setup
|
|
```
|
|
|
|
**⚠️ Important for URLs Mode**: Your local file structure must match your documentation site's URL structure for proper link generation.
|
|
|
|
**File Structure Requirements:**
|
|
```
|
|
docs/
|
|
├── getting-started/
|
|
│ ├── installation.md → https://docs.example.com/getting-started/installation
|
|
│ └── first-steps.md → https://docs.example.com/getting-started/first-steps
|
|
├── administration/
|
|
│ ├── README.md → https://docs.example.com/administration
|
|
│ └── setup.md → https://docs.example.com/administration/setup
|
|
└── README.md → https://docs.example.com/
|
|
```
|
|
|
|
**URL Mapping Rules:**
|
|
- `.md` extension is removed from URLs
|
|
- `README.md` files map to their directory URL
|
|
- Subdirectories become URL path segments
|
|
- Hyphens and underscores in filenames are preserved
|
|
|
|
### Organizing Docs for URL Mode
|
|
|
|
**Step 1: Analyze Your Documentation Site Structure**
|
|
```bash
|
|
# Visit your docs site and note the URL patterns:
|
|
# https://docs.example.com/getting-started/installation
|
|
# https://docs.example.com/api/authentication
|
|
# https://docs.example.com/guides/deployment
|
|
```
|
|
|
|
**Step 2: Create Matching Directory Structure**
|
|
```bash
|
|
mkdir -p docs/{getting-started,api,guides}
|
|
```
|
|
|
|
**Step 3: Organize Your Markdown Files**
|
|
```bash
|
|
# Match each URL to a file path:
|
|
docs/getting-started/installation.md # → /getting-started/installation
|
|
docs/api/authentication.md # → /api/authentication
|
|
docs/guides/deployment.md # → /guides/deployment
|
|
docs/guides/README.md # → /guides (overview page)
|
|
```
|
|
|
|
**Step 4: Verify URL Mapping**
|
|
```bash
|
|
# Test a few URLs manually to ensure they work:
|
|
curl -I https://docs.example.com/getting-started/installation
|
|
curl -I https://docs.example.com/api/authentication
|
|
```
|
|
|
|
**Common Documentation Site Patterns:**
|
|
|
|
| Site Type | File Structure | URL Structure |
|
|
|-----------|---------------|---------------|
|
|
| **GitBook** | `docs/section/page.md` | `/section/page` |
|
|
| **Docusaurus** | `docs/section/page.md` | `/docs/section/page` |
|
|
| **MkDocs** | `docs/section/page.md` | `/section/page/` |
|
|
| **Custom** | Varies | Match your site's pattern |
|
|
|
|
**Real Example: Apolo Documentation**
|
|
```bash
|
|
# Apolo docs site: https://docs.apolo.us/index/
|
|
# Your local structure should match:
|
|
apolo-docs/
|
|
├── getting-started/
|
|
│ ├── first-steps/
|
|
│ │ ├── getting-started.md → /index/getting-started/first-steps/getting-started
|
|
│ │ └── README.md → /index/getting-started/first-steps
|
|
│ ├── apolo-base-docker-image.md → /index/getting-started/apolo-base-docker-image
|
|
│ └── faq.md → /index/getting-started/faq
|
|
├── apolo-console/
|
|
│ └── getting-started/
|
|
│ └── sign-up-login.md → /index/apolo-console/getting-started/sign-up-login
|
|
└── README.md → /index/
|
|
|
|
# Load with correct base URL:
|
|
python load_docs.py apolo-docs/ --mode urls --base-url https://docs.apolo.us/index/
|
|
```
|
|
|
|
## Complete Usage Examples
|
|
|
|
```bash
|
|
# Load Apolo documentation with URL references
|
|
python load_docs.py ../apolo-copilot/docs/official-apolo-documentation/docs \
|
|
--mode urls --base-url https://docs.apolo.us/index/
|
|
|
|
# Load with custom LightRAG endpoint
|
|
python load_docs.py docs/ --endpoint https://lightrag.example.com
|
|
|
|
# Load to local instance, skip test query
|
|
python load_docs.py docs/ --no-test
|
|
|
|
# Files mode with custom endpoint
|
|
python load_docs.py docs/ --mode files --endpoint http://localhost:9621
|
|
```
|
|
|
|
## Features
|
|
|
|
- **Dual Reference Modes**: File paths or live website URLs in citations
|
|
- **Flexible Base URL**: Works with any documentation site structure
|
|
- **Simple dependency**: Only requires `httpx` and Python standard library
|
|
- **Automatic discovery**: Finds all `.md` files recursively
|
|
- **Smart metadata**: Adds appropriate title, path/URL, and source information
|
|
- **Progress tracking**: Shows loading progress with success/failure counts
|
|
- **Health checks**: Verifies LightRAG connectivity before loading
|
|
- **Test queries**: Validates functionality after loading
|
|
- **Error handling**: Clear validation and error messages
|
|
|
|
## Requirements
|
|
|
|
```bash
|
|
pip install httpx
|
|
```
|
|
|
|
## Use Cases
|
|
|
|
This loader is perfect for:
|
|
- **Kubernetes deployments**: Self-contained with minimal dependencies
|
|
- **Quick testing**: Immediate setup without complex environments
|
|
- **Documentation loading**: Any markdown-based documentation
|
|
- **Development workflows**: Fast iteration and testing
|
|
|
|
## Requirements
|
|
|
|
```bash
|
|
pip install httpx
|
|
```
|
|
|
|
**Note**: This script is included with LightRAG deployments and provides a simple way to load any markdown documentation into your LightRAG instance.
|