LightRAG/logs/2025-12-05-12-45-beastmode-processing-debug.md

3 KiB

Task Log: Document Processing Debug Session

Date: 2025-12-05 12:45 Mode: Beastmode Topic: Investigation - Document stuck in "Processing" state


Summary

Successfully investigated and resolved a document stuck in "Processing" state for TechStart Inc tenant.


Actions Performed

  1. Verified server health - Server running on port 9621, PostgreSQL connected, multi-tenant enabled
  2. Identified tenant issue - UI was using cached tenant_id: default which doesn't exist in PostgreSQL
  3. Switched to valid tenant - Searched and selected "TechStart Inc" with "Main KB"
  4. Identified stuck document - doc-408153a6090f3deeeea5a56df844fef8 ("Can AI Really Check Its Own Math Homework?")
  5. Found root cause in logs - LLM extraction timeout after 360s at 03:29:09
  6. Deleted stuck document - Used UI to delete the orphaned document
  7. Verified resolution - Processing count dropped from 1 to 0

Decisions Made

  • Document was orphaned due to server crash/restart during processing
  • The document status was never updated to "Failed" after timeout exception
  • Best solution: delete and re-upload rather than fixing state manually

Root Cause Analysis

2025-12-05 03:29:09 - Failed to extract entities and relationships:
C[1/1]: chunk-408153a6090f3deeeea5a56df844fef8: LLM func: Worker execution timeout after 360s

The document started processing at 00:53:00 and failed at 03:29:09 with a timeout. The exception handling code should have marked the document as "Failed", but likely the server was restarted or crashed during error handling.


Technical Details

Affected Components

  • lightrag/lightrag.py - Entity extraction with timeout
  • lightrag/api/routers/document_routes.py - Document management endpoints
  • PostgreSQL doc_status storage

Files Verified (from previous session)

  • lightrag/services/tenant_service.py - Fixed datetime deserialization
  • lightrag_webui/src/features/DocumentManager.tsx - Pipeline status sync

Next Steps

  1. Consider adding a "stale document cleanup" job that marks documents stuck in "Processing" for >1 hour as "Failed"
  2. Add UI button to manually reset document status to "Pending" for retry
  3. Improve error handling in _process_extract_entities to ensure status is always updated

Lessons Learned

  • Document state can become inconsistent if server crashes during processing
  • The "default" tenant in localStorage can cause 500 errors when it doesn't exist in PostgreSQL
  • Always verify tenant/KB selection before debugging document issues
  • LLM extraction can timeout (360s default) for complex documents

Verification Steps

# Health check
curl -s "http://localhost:9621/health" | jq '.status, .pipeline_busy, .multi_tenant_enabled'
# Result: "healthy", false, true

# Document status after fix
# All (1), Completed (1), Processing (0), Failed (0)

Session End

  • Document stuck state: RESOLVED
  • Application functional: YES
  • No pending processing: VERIFIED