LightRAG/README_load_docs.md
Taddeus a70ba1f75a
Phase 1: LightRAG Minimal Helm chart and documentation indexing using url references (#2)
* Partial implementation of phase-0

* Partial implementation of phase-1

* add report

* add postgress

* Revert "add postgress"

This reverts commit 27778dc6bb3906b5220dd386e47fe32ca7415332.

* remove junk

* Cleaned up annd setup docs

* update docs

* moved report

* Updated load_markdown_files function: Now returns tuples with (content, title, relative_path) instead of just (content, title)

* fixes to load docs script and more env variables for llm configuration

* update prod values

* update docs

* apolo docs support with linking

* update docs to reflect url conventions and mapping with docs

* Adds ingress and forwardAuth configurations

Adds ingress configuration to expose the application.

Adds forwardAuth configuration to enable user authentication.

Includes middleware to strip headers.

* Adds ingress and forward authentication middleware support
2025-06-23 20:04:34 +03:00

5.7 KiB

LightRAG Documentation Loader

Advanced script to load markdown documentation into LightRAG with flexible reference modes.

Quick Start

# Default mode (file path references)
python load_docs.py /path/to/your/docs

# URL mode (website link references) 
python load_docs.py /path/to/docs --mode urls --base-url https://docs.example.com/

Reference Modes

Files Mode (Default)

Uses local file paths in query response citations:

python load_docs.py docs/
python load_docs.py docs/ --mode files

Query Response Example:

### References
- [DC] getting-started/installation.md
- [KG] administration/setup.md

URLs Mode

Uses website URLs in query response citations:

python load_docs.py docs/ --mode urls --base-url https://docs.apolo.us/index/
python load_docs.py docs/ --mode urls --base-url https://my-docs.com/v1/

Query Response Example:

### References  
- [DC] https://docs.apolo.us/index/getting-started/installation
- [KG] https://docs.apolo.us/index/administration/setup

⚠️ Important for URLs Mode: Your local file structure must match your documentation site's URL structure for proper link generation.

File Structure Requirements:

docs/
├── getting-started/
│   ├── installation.md          → https://docs.example.com/getting-started/installation
│   └── first-steps.md           → https://docs.example.com/getting-started/first-steps
├── administration/
│   ├── README.md                → https://docs.example.com/administration
│   └── setup.md                 → https://docs.example.com/administration/setup
└── README.md                    → https://docs.example.com/

URL Mapping Rules:

  • .md extension is removed from URLs
  • README.md files map to their directory URL
  • Subdirectories become URL path segments
  • Hyphens and underscores in filenames are preserved

Organizing Docs for URL Mode

Step 1: Analyze Your Documentation Site Structure

# Visit your docs site and note the URL patterns:
# https://docs.example.com/getting-started/installation
# https://docs.example.com/api/authentication  
# https://docs.example.com/guides/deployment

Step 2: Create Matching Directory Structure

mkdir -p docs/{getting-started,api,guides}

Step 3: Organize Your Markdown Files

# Match each URL to a file path:
docs/getting-started/installation.md     # → /getting-started/installation  
docs/api/authentication.md              # → /api/authentication
docs/guides/deployment.md               # → /guides/deployment
docs/guides/README.md                   # → /guides (overview page)

Step 4: Verify URL Mapping

# Test a few URLs manually to ensure they work:
curl -I https://docs.example.com/getting-started/installation
curl -I https://docs.example.com/api/authentication

Common Documentation Site Patterns:

Site Type File Structure URL Structure
GitBook docs/section/page.md /section/page
Docusaurus docs/section/page.md /docs/section/page
MkDocs docs/section/page.md /section/page/
Custom Varies Match your site's pattern

Real Example: Apolo Documentation

# Apolo docs site: https://docs.apolo.us/index/
# Your local structure should match:
apolo-docs/
├── getting-started/
│   ├── first-steps/
│   │   ├── getting-started.md   → /index/getting-started/first-steps/getting-started  
│   │   └── README.md           → /index/getting-started/first-steps
│   ├── apolo-base-docker-image.md → /index/getting-started/apolo-base-docker-image
│   └── faq.md                  → /index/getting-started/faq
├── apolo-console/
│   └── getting-started/
│       └── sign-up-login.md    → /index/apolo-console/getting-started/sign-up-login
└── README.md                   → /index/

# Load with correct base URL:
python load_docs.py apolo-docs/ --mode urls --base-url https://docs.apolo.us/index/

Complete Usage Examples

# Load Apolo documentation with URL references
python load_docs.py ../apolo-copilot/docs/official-apolo-documentation/docs \
  --mode urls --base-url https://docs.apolo.us/index/

# Load with custom LightRAG endpoint
python load_docs.py docs/ --endpoint https://lightrag.example.com

# Load to local instance, skip test query
python load_docs.py docs/ --no-test

# Files mode with custom endpoint  
python load_docs.py docs/ --mode files --endpoint http://localhost:9621

Features

  • Dual Reference Modes: File paths or live website URLs in citations
  • Flexible Base URL: Works with any documentation site structure
  • Simple dependency: Only requires httpx and Python standard library
  • Automatic discovery: Finds all .md files recursively
  • Smart metadata: Adds appropriate title, path/URL, and source information
  • Progress tracking: Shows loading progress with success/failure counts
  • Health checks: Verifies LightRAG connectivity before loading
  • Test queries: Validates functionality after loading
  • Error handling: Clear validation and error messages

Requirements

pip install httpx

Use Cases

This loader is perfect for:

  • Kubernetes deployments: Self-contained with minimal dependencies
  • Quick testing: Immediate setup without complex environments
  • Documentation loading: Any markdown-based documentation
  • Development workflows: Fast iteration and testing

Requirements

pip install httpx

Note: This script is included with LightRAG deployments and provides a simple way to load any markdown documentation into your LightRAG instance.