LightRAG/logs/2025-12-05-12-45-beastmode-processing-debug.md

# Task Log: Document Processing Debug Session
**Date:** 2025-12-05 12:45
**Mode:** Beastmode
**Topic:** Investigation - Document stuck in "Processing" state
---
## Summary
Investigated and resolved a document stuck in the "Processing" state for the TechStart Inc tenant.
---
## Actions Performed
1. **Verified server health** - Server running on port 9621, PostgreSQL connected, multi-tenant enabled
2. **Identified tenant issue** - UI was using cached `tenant_id: default` which doesn't exist in PostgreSQL
3. **Switched to valid tenant** - Searched and selected "TechStart Inc" with "Main KB"
4. **Identified stuck document** - `doc-408153a6090f3deeeea5a56df844fef8` ("Can AI Really Check Its Own Math Homework?")
5. **Found root cause in logs** - LLM extraction timeout after 360s at 03:29:09
6. **Deleted stuck document** - Used UI to delete the orphaned document
7. **Verified resolution** - Processing count dropped from 1 to 0
---
## Decisions Made
- The document was most likely orphaned by a server crash or restart during processing
- The document status was never updated to "Failed" after the timeout exception
- Best solution: delete and re-upload rather than repairing the state manually
---
## Root Cause Analysis
```
2025-12-05 03:29:09 - Failed to extract entities and relationships:
C[1/1]: chunk-408153a6090f3deeeea5a56df844fef8: LLM func: Worker execution timeout after 360s
```
The document started processing at 00:53:00 and failed at 03:29:09 with a timeout, roughly 2 h 36 min later. The exception handling code should have marked the document as "Failed"; the most likely explanation is that the server was restarted or crashed during error handling, before the status update was written.
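
The failure mode above can be illustrated with a minimal sketch of timeout handling that always records a terminal status. This is not the LightRAG implementation: `DOC_STATUS`, `extract_with_status`, and the in-memory dictionary are hypothetical stand-ins for the real doc_status storage, and only the 360s timeout and the status names come from this log.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical in-memory stand-in for the PostgreSQL doc_status storage.
DOC_STATUS: Dict[str, str] = {}


async def extract_with_status(
    doc_id: str,
    extract: Callable[[], Awaitable[None]],
    timeout_s: float = 360.0,  # matches the 360s timeout seen in the log
) -> None:
    """Run `extract` and always leave a terminal status behind."""
    DOC_STATUS[doc_id] = "Processing"
    try:
        # wait_for cancels the extraction and raises asyncio.TimeoutError on expiry.
        await asyncio.wait_for(extract(), timeout=timeout_s)
    except BaseException:
        # Covers timeouts, ordinary errors, and task cancellation at shutdown.
        DOC_STATUS[doc_id] = "Failed"
        raise
    DOC_STATUS[doc_id] = "Completed"
```

A hard crash (process killed before the `except` branch runs) would still bypass this handler, so a pattern like this reduces orphaned documents but cannot rule them out.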
---
## Technical Details
### Affected Components
- `lightrag/lightrag.py` - Entity extraction with timeout
- `lightrag/api/routers/document_routes.py` - Document management endpoints
- PostgreSQL doc_status storage
### Files Verified (from previous session)
- `lightrag/services/tenant_service.py` - Fixed datetime deserialization
- `lightrag_webui/src/features/DocumentManager.tsx` - Pipeline status sync
---
## Next Steps
1. Consider adding a "stale document cleanup" job that marks documents stuck in "Processing" for >1 hour as "Failed"
2. Add UI button to manually reset document status to "Pending" for retry
3. Improve error handling in `_process_extract_entities` to ensure status is always updated
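
As a rough sketch of the first step, the cleanup rule could look like the function below. `DocRecord`, `mark_stale_as_failed`, and the `updated_at` field are assumptions made for illustration; the real doc_status schema may differ, and a production job would write the change back to PostgreSQL rather than mutate objects in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class DocRecord:
    """Hypothetical view of one row in the doc_status storage."""
    doc_id: str
    status: str
    updated_at: datetime


def mark_stale_as_failed(
    docs: Iterable[DocRecord],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),
) -> List[str]:
    """Mark documents stuck in "Processing" longer than max_age as "Failed".

    Returns the IDs of the documents that were changed.
    """
    changed = []
    for doc in docs:
        if doc.status == "Processing" and now - doc.updated_at > max_age:
            doc.status = "Failed"
            changed.append(doc.doc_id)
    return changed
```

Passing `now` explicitly keeps the rule easy to test; a scheduled job would supply the current time. The one-hour threshold has to stay well above the extraction timeout, otherwise slow but healthy documents would be marked as failed.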
---
## Lessons Learned
- Document state can become inconsistent if server crashes during processing
- The "default" tenant in localStorage can cause 500 errors when it doesn't exist in PostgreSQL
- Always verify tenant/KB selection before debugging document issues
- LLM extraction can time out (360s default) for complex documents
---
## Verification Steps
```bash
# Health check
curl -s "http://localhost:9621/health" | jq '.status, .pipeline_busy, .multi_tenant_enabled'
# Result: "healthy", false, true
# Document status after fix
# All (1), Completed (1), Processing (0), Failed (0)
```
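
The same health check could be scripted for repeated use. The sketch below assumes only what the `curl` call above shows: a `/health` endpoint on port 9621 returning JSON with `status` and `pipeline_busy` fields. The function names `fetch_health` and `pipeline_idle` are hypothetical.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_health(base_url: str = "http://localhost:9621") -> Dict[str, Any]:
    """Fetch and decode the /health payload (requires a running server)."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def pipeline_idle(health: Dict[str, Any]) -> bool:
    """True when the server reports healthy and no pipeline run is active."""
    return health.get("status") == "healthy" and health.get("pipeline_busy") is False
```

For example, `pipeline_idle(fetch_health())` would return `True` for the response recorded in this session.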
---
## Session End
- Document stuck state: **RESOLVED**
- Application functional: **YES**
- No pending processing: **VERIFIED**