- web/db.py: Add last_heartbeat column to Run model
- core/utils.py: Add set_heartbeat_callback() and send_heartbeat()
- web/tasks.py: Add _robust_update_run_status() with 5-retry exponential backoff; add db_heartbeat_callback(); remove all bare except:pass on DB status updates; set start_time + last_heartbeat when marking run as 'running'
- web/app.py: Add last_heartbeat column migration; add _stale_job_watcher() background thread (checks every 5 min, 15-min heartbeat threshold, 2-hr start_time threshold)
- cli/engine.py: Add phase-level logging banners and try/except wrappers in process_book(); add utils.send_heartbeat() after each chapter save; add start/finish logging in run_generation()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AI Blueprint: Addressing Stuck Book Generation Jobs
Status: IMPLEMENTED (v2.14). All five steps below were implemented on 2026-02-21.
1. The Problem: Progress Stalls
The primary issue is that book generation jobs can get "stuck" in a "running" state, preventing users from starting new runs and causing confusion as the UI shows no progress. This is likely caused by worker processes crashing or encountering unhandled errors before they can update the job's final status to "completed" or "failed".
2. Investigation Findings
- State Management: The `Run` table in the database has a `status` column. Tasks in `web/tasks.py` are responsible for updating this from "queued" to "running" and finally to "completed" or "failed".
- Point of Failure: The most likely failure point is a catastrophic crash of the Huey worker process (e.g., out-of-memory error) or a deadlock within the core `cli.engine.run_generation` function. In these scenarios, the `finally` block that updates the status is never reached.
- Database Contention: The direct use of `sqlite3` in the tasks can lead to `database is locked` errors. While there are some retries, prolonged locks could cause status updates to fail.
- Silent Errors: Some task functions use a bare `try...except: pass` around the final status update. If updating the database fails, the error is swallowed, and the job remains in a "running" state.
3. The Plan: Enhancing Robustness
Step 1: Implement a "Stale Job" Cleanup Process ✅
- `last_heartbeat` column added to `Run` model (`web/db.py`).
- Migration added in `web/app.py` startup to add `last_heartbeat` column to existing databases.
- Startup reset already present: all `status='running'` jobs are reset to `failed` at boot.
- Periodic stale-job watcher thread (`_stale_job_watcher`) started in `web/app.py`:
  - Runs every 5 minutes.
  - Marks jobs `failed` if `last_heartbeat` is > 15 minutes stale.
  - Marks jobs `failed` if `start_time` is > 2 hours old and no heartbeat was ever recorded.
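The migration and the watcher rules in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the code in `web/app.py`: the table name `run`, the storage of timestamps as ISO-8601 text, and the helper names `ensure_heartbeat_column`, `mark_stale_runs`, and `start_stale_job_watcher` are all assumptions made for the example. Only the column names and the 5-minute, 15-minute, and 2-hour thresholds come from the plan.

```python
import logging
import sqlite3
import threading
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

CHECK_INTERVAL_SECONDS = 5 * 60             # watcher runs every 5 minutes
HEARTBEAT_TIMEOUT = timedelta(minutes=15)   # heartbeat older than this is stale
START_TIMEOUT = timedelta(hours=2)          # used when no heartbeat was recorded


def ensure_heartbeat_column(conn):
    """Add last_heartbeat to an existing `run` table if it is missing."""
    columns = [row[1] for row in conn.execute("PRAGMA table_info(run)")]
    if "last_heartbeat" not in columns:
        conn.execute("ALTER TABLE run ADD COLUMN last_heartbeat TEXT")
        conn.commit()


def mark_stale_runs(conn, now=None):
    """Mark stale 'running' rows as 'failed'; return how many rows changed."""
    now = now or datetime.now()
    heartbeat_cutoff = (now - HEARTBEAT_TIMEOUT).isoformat(timespec="seconds")
    start_cutoff = (now - START_TIMEOUT).isoformat(timespec="seconds")
    cur = conn.execute(
        """
        UPDATE run SET status = 'failed'
        WHERE status = 'running'
          AND (
                (last_heartbeat IS NOT NULL AND last_heartbeat < ?)
             OR (last_heartbeat IS NULL AND start_time < ?)
          )
        """,
        (heartbeat_cutoff, start_cutoff),
    )
    conn.commit()
    return cur.rowcount


def start_stale_job_watcher(db_path, stop_event):
    """Run mark_stale_runs() periodically in a daemon thread."""

    def loop():
        # Event.wait() returns False on timeout, True once stop_event is set.
        while not stop_event.wait(CHECK_INTERVAL_SECONDS):
            try:
                conn = sqlite3.connect(db_path)
                try:
                    mark_stale_runs(conn)
                finally:
                    conn.close()
            except sqlite3.Error:
                # Keep the watcher alive; the next pass will try again.
                logger.exception("stale-job pass failed")

    thread = threading.Thread(target=loop, daemon=True, name="stale-job-watcher")
    thread.start()
    return thread
```

Keeping the staleness rule in a single `UPDATE` means each pass is one short write transaction, which limits how long the watcher itself holds the SQLite write lock.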
Step 2: Fortify Database Updates ✅
- `_robust_update_run_status()` helper added to `web/tasks.py`:
  - 5 retries with linear backoff (1–5 seconds per attempt).
  - Handles `sqlite3.OperationalError` specifically with retry; raises `RuntimeError` on total failure.
- All bare `except: pass` blocks around DB status updates removed from:
  - `generate_book_task`: final status update now uses robust helper with retry.
  - `regenerate_artifacts_task`: all three status-update sites fixed.
  - `rewrite_chapter_task`: `db_path` moved above the outer `try` block to prevent `NameError`; all status-update sites fixed.
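A retrying status update along the lines of Step 2 might look like the sketch below. It is not the actual `_robust_update_run_status()`: the signature, the table name `run`, and the injectable `sleep` and `connect` parameters (included only so the retry path can be exercised without real waiting) are assumptions. The backoff here follows the linear schedule described in this step, waiting `attempt` seconds after each failed attempt.

```python
import sqlite3
import time


def robust_update_run_status(db_path, run_id, status, retries=5,
                             sleep=time.sleep, connect=sqlite3.connect):
    """Set run.status, retrying on sqlite3.OperationalError.

    Returns the attempt number that succeeded; raises RuntimeError if
    every attempt fails.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            conn = connect(db_path, timeout=5)
            try:
                with conn:  # commits on success, rolls back on error
                    conn.execute(
                        "UPDATE run SET status = ? WHERE id = ?",
                        (status, run_id),
                    )
            finally:
                conn.close()
            return attempt
        except sqlite3.OperationalError as exc:
            # Typically "database is locked"; other error types propagate.
            last_error = exc
            if attempt < retries:
                sleep(attempt)  # linear backoff: 1 s, 2 s, 3 s, ...
    raise RuntimeError(
        f"could not set run {run_id} to {status!r} after {retries} attempts"
    ) from last_error
```

Raising `RuntimeError` instead of swallowing the failure is what replaces the old `except: pass` behaviour: the caller sees that the status write did not land and can log or escalate it.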
Step 3: Add Granular Logging to Core Engine ✅
- `cli/engine.py`, `run_generation()`: logs series title at start; logs start/finish of each `process_book` call; catches and re-logs exceptions before re-raising.
- `cli/engine.py`, `process_book()`: Added `--- Phase: X ---` banners at the start of each major stage (Blueprint, Structure & Events, Chapter Planning, Writing, Post-Processing). Each phase is wrapped in `try/except` that logs `ERROR` with the exception type before re-raising.
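One compact way to get the banner-plus-wrapper behaviour described in Step 3 is a context manager, sketched below. The names `phase` and `process_book_sketch` and the logger name `engine` are hypothetical. The actual `process_book()` may use plain `try/except` blocks around each stage rather than a shared helper.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("engine")


@contextmanager
def phase(name):
    """Log a banner on entry; log the exception type on failure, then re-raise."""
    logger.info("--- Phase: %s ---", name)
    try:
        yield
    except Exception as exc:
        logger.error("Phase %s failed with %s: %s", name, type(exc).__name__, exc)
        raise


def process_book_sketch(steps):
    """Run (name, callable) pairs in order, one logged phase per pair."""
    for name, step in steps:
        with phase(name):
            step()
```

Because the exception is re-raised, a failure in one phase still stops the run, but the last log lines now name the phase and the exception type.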
Step 4: Introduce a Task Heartbeat ✅
- `core/utils.py`: `set_heartbeat_callback()` and `send_heartbeat()` added (mirrors the existing progress/log callback pattern).
- `web/tasks.py`: `db_heartbeat_callback()` writes `last_heartbeat = NOW` to the DB with up to 3 retries. Set as the heartbeat callback in `generate_book_task`.
- `cli/engine.py`: `utils.send_heartbeat()` called after each chapter is saved to disk, the most meaningful signal that the worker is still processing.
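The callback wiring in Step 4 can be illustrated with the sketch below. The function names `set_heartbeat_callback` and `send_heartbeat` come from the plan, but their bodies here are guesses. `make_db_heartbeat_callback`, the table name `run`, the ISO-8601 timestamp format, and the choice to log and continue when a heartbeat fails are all assumptions for the example.

```python
import logging
import sqlite3
import time
from datetime import datetime

logger = logging.getLogger(__name__)

_heartbeat_callback = None  # module-level slot, like a progress/log callback


def set_heartbeat_callback(callback):
    """Register the function called on each heartbeat (None clears it)."""
    global _heartbeat_callback
    _heartbeat_callback = callback


def send_heartbeat():
    """Invoke the registered callback; return True if it ran without error."""
    if _heartbeat_callback is None:
        return False
    try:
        _heartbeat_callback()
    except Exception:
        # Assumption: a failed heartbeat should not abort chapter generation.
        logger.exception("heartbeat callback failed")
        return False
    return True


def make_db_heartbeat_callback(db_path, run_id, retries=3, sleep=time.sleep):
    """Build a callback that stamps run.last_heartbeat with the current time."""

    def callback():
        stamp = datetime.now().isoformat(timespec="seconds")
        for attempt in range(1, retries + 1):
            try:
                conn = sqlite3.connect(db_path, timeout=5)
                try:
                    with conn:  # commits on success
                        conn.execute(
                            "UPDATE run SET last_heartbeat = ? WHERE id = ?",
                            (stamp, run_id),
                        )
                finally:
                    conn.close()
                return
            except sqlite3.OperationalError:
                if attempt == retries:
                    raise
                sleep(attempt)

    return callback
```

In this arrangement the engine only calls `send_heartbeat()` and knows nothing about the database, so the CLI can run with no callback registered while the web task registers the DB-backed one.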
Step 5: Commit and Push Changes ✅
Changes committed to main branch with message Auto-commit: v2.14 — Stuck job robustness (heartbeat, retry, stale watcher, granular logging).
Together, these layers reduce the chance of a job staying stuck and give better diagnostics when one does. The stale-job watcher recovers runs whose worker died without reporting, and the retrying status update rides out transient database locks instead of silently dropping the write.