Task Log: Document Processing Debug Session

Date: 2025-12-05 12:45 Mode: Beastmode Topic: Investigation - Document stuck in "Processing" state

Summary

Successfully investigated and resolved a document stuck in "Processing" state for TechStart Inc tenant.

Actions Performed

Verified server health - Server running on port 9621, PostgreSQL connected, multi-tenant enabled
Identified tenant issue - UI was using cached tenant_id: default which doesn't exist in PostgreSQL
Switched to valid tenant - Searched and selected "TechStart Inc" with "Main KB"
Identified stuck document - doc-408153a6090f3deeeea5a56df844fef8 ("Can AI Really Check Its Own Math Homework?")
Found root cause in logs - LLM extraction timeout after 360s at 03:29:09
Deleted stuck document - Used UI to delete the orphaned document
Verified resolution - Processing count dropped from 1 to 0

Decisions Made

Document was orphaned due to server crash/restart during processing
The document status was never updated to "Failed" after timeout exception
Best solution: delete and re-upload rather than fixing state manually

Root Cause Analysis

2025-12-05 03:29:09 - Failed to extract entities and relationships:
C[1/1]: chunk-408153a6090f3deeeea5a56df844fef8: LLM func: Worker execution timeout after 360s

The document started processing at 00:53:00 and failed at 03:29:09 with a timeout. The exception handling code should have marked the document as "Failed", but likely the server was restarted or crashed during error handling.

Technical Details

Affected Components

lightrag/lightrag.py - Entity extraction with timeout
lightrag/api/routers/document_routes.py - Document management endpoints
PostgreSQL doc_status storage

Files Verified (from previous session)

lightrag/services/tenant_service.py - Fixed datetime deserialization
lightrag_webui/src/features/DocumentManager.tsx - Pipeline status sync

Next Steps

Consider adding a "stale document cleanup" job that marks documents stuck in "Processing" for >1 hour as "Failed"
Add UI button to manually reset document status to "Pending" for retry
Improve error handling in _process_extract_entities to ensure status is always updated

Lessons Learned

Document state can become inconsistent if server crashes during processing
The "default" tenant in localStorage can cause 500 errors when it doesn't exist in PostgreSQL
Always verify tenant/KB selection before debugging document issues
LLM extraction can timeout (360s default) for complex documents

Verification Steps

# Health check
curl -s "http://localhost:9621/health" | jq '.status, .pipeline_busy, .multi_tenant_enabled'
# Result: "healthy", false, true

# Document status after fix
# All (1), Completed (1), Processing (0), Failed (0)

Session End

Document stuck state: RESOLVED ✅
Application functional: YES ✅
No pending processing: VERIFIED ✅

3 KiB Raw Blame History