bookapp/ai_blueprint.md
Mike Wichers 81340a18ea Auto-commit: v2.14 — Stuck job robustness (heartbeat, retry, stale watcher, granular logging)
- web/db.py: Add last_heartbeat column to Run model
- core/utils.py: Add set_heartbeat_callback() and send_heartbeat()
- web/tasks.py: Add _robust_update_run_status() with 5-retry exponential backoff;
  add db_heartbeat_callback(); remove all bare except:pass on DB status updates;
  set start_time + last_heartbeat when marking run as 'running'
- web/app.py: Add last_heartbeat column migration; add _stale_job_watcher()
  background thread (checks every 5 min, 15-min heartbeat threshold, 2-hr start_time threshold)
- cli/engine.py: Add phase-level logging banners and try/except wrappers in
  process_book(); add utils.send_heartbeat() after each chapter save;
  add start/finish logging in run_generation()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 19:00:29 -05:00


AI Blueprint: Addressing Stuck Book Generation Jobs

Status: IMPLEMENTED — v2.14. All five steps below were implemented on 2026-02-21.

1. The Problem: Progress Stalls

The primary issue is that book generation jobs can get "stuck" in a "running" state, preventing users from starting new runs and causing confusion as the UI shows no progress. This is likely caused by worker processes crashing or encountering unhandled errors before they can update the job's final status to "completed" or "failed".

2. Investigation Findings

  • State Management: The Run table in the database has a status column. Tasks in web/tasks.py are responsible for updating this from "queued" to "running" and finally to "completed" or "failed".
  • Point of Failure: The most likely failure point is a catastrophic crash of the Huey worker process (e.g., out-of-memory error) or a deadlock within the core cli.engine.run_generation function. In these scenarios, the finally block that updates the status is never reached.
  • Database Contention: The direct use of sqlite3 in the tasks can lead to "database is locked" errors. While there are some retries, prolonged locks could cause status updates to fail.
  • Silent Errors: Some task functions use a bare try...except: pass around the final status update. If updating the database fails, the error is swallowed, and the job remains in a "running" state.
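
A minimal sketch of that failure mode (toy schema and function name, not the project's actual code): the bare except means a failed final update leaves the row at 'running' with nothing in the logs.

```python
import sqlite3

def finish_run_fragile(conn, run_id):
    """Illustrates the failure mode: if the UPDATE fails (e.g. the
    connection is dead or the database is locked), the bare except
    swallows the error and the run stays 'running' forever."""
    try:
        conn.execute("UPDATE run SET status = 'completed' WHERE id = ?",
                     (run_id,))
        conn.commit()
    except Exception:
        pass  # error swallowed -> stuck job, nothing in the logs
```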

3. The Plan: Enhancing Robustness

Step 1: Implement a "Stale Job" Cleanup Process

  • last_heartbeat column added to Run model (web/db.py).
  • Migration added in web/app.py startup to add last_heartbeat column to existing databases.
  • Startup reset already present — all status='running' jobs are reset to failed at boot.
  • Periodic stale-job watcher thread (_stale_job_watcher) started in web/app.py:
    • Runs every 5 minutes.
    • Marks jobs failed if last_heartbeat is > 15 minutes stale.
    • Marks jobs failed if start_time is > 2 hours old and no heartbeat was ever recorded.
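
One pass of the watcher can be sketched as follows. This assumes a run table with epoch-float start_time and last_heartbeat columns; the real schema and thread wiring in web/app.py may differ. The production watcher would call this in a daemon thread every 5 minutes.

```python
import sqlite3
import time

HEARTBEAT_STALE_S = 15 * 60   # 15-minute heartbeat threshold
START_STALE_S = 2 * 60 * 60   # 2-hour start_time threshold

def mark_stale_runs_failed(conn, now=None):
    """One watcher pass: fail 'running' jobs whose heartbeat is older
    than 15 minutes, or that never heartbeated and started more than
    2 hours ago. Returns the number of rows updated."""
    now = time.time() if now is None else now
    cur = conn.execute(
        """
        UPDATE run SET status = 'failed'
        WHERE status = 'running'
          AND (
                (last_heartbeat IS NOT NULL AND ? - last_heartbeat > ?)
             OR (last_heartbeat IS NULL AND ? - start_time > ?)
          )
        """,
        (now, HEARTBEAT_STALE_S, now, START_STALE_S),
    )
    conn.commit()
    return cur.rowcount
```

A background thread would then loop `mark_stale_runs_failed(conn); time.sleep(300)`.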

Step 2: Fortify Database Updates

  • _robust_update_run_status() helper added to web/tasks.py:
    • 5 retries with linear backoff (the delay grows by 15 seconds per attempt: 15s, 30s, 45s, ...).
    • Handles sqlite3.OperationalError specifically with retry; raises RuntimeError on total failure.
  • All bare except: pass blocks around DB status updates removed from:
    • generate_book_task — final status update now uses robust helper with retry.
    • regenerate_artifacts_task — all three status-update sites fixed.
    • rewrite_chapter_task — db_path moved above the outer try block to prevent a NameError; all status-update sites fixed.
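
A sketch of what the retry helper might look like. Only the retry count, linear backoff, sqlite3.OperationalError handling, and RuntimeError-on-total-failure contract come from the notes above; the SQL, schema, and injectable `_sleep` parameter are illustrative.

```python
import sqlite3
import time

def robust_update_run_status(db_path, run_id, status,
                             retries=5, backoff_s=15, _sleep=time.sleep):
    """Retry OperationalError (e.g. 'database is locked') with a
    linearly growing delay; raise RuntimeError if all attempts fail
    so the caller sees the problem instead of a silently stuck job."""
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            conn = sqlite3.connect(db_path, timeout=5)
            try:
                conn.execute("UPDATE run SET status = ? WHERE id = ?",
                             (status, run_id))
                conn.commit()
                return
            finally:
                conn.close()
        except sqlite3.OperationalError as err:
            last_err = err
            _sleep(backoff_s * attempt)  # linear backoff: 15s, 30s, 45s, ...
    raise RuntimeError(
        f"failed to update run {run_id} status after {retries} attempts"
    ) from last_err
```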

Step 3: Add Granular Logging to Core Engine

  • cli/engine.py — run_generation(): logs series title at start; logs start/finish of each process_book call; catches and re-logs exceptions before re-raising.
  • cli/engine.py — process_book(): Added --- Phase: X --- banners at the start of each major stage (Blueprint, Structure & Events, Chapter Planning, Writing, Post-Processing). Each phase is wrapped in try/except that logs ERROR with the exception type before re-raising.
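
The banner-plus-re-raise pattern can be sketched as a context manager. This is a hypothetical helper for illustration; the actual engine may inline the try/except per phase.

```python
import logging
from contextlib import contextmanager

log = logging.getLogger("engine")

@contextmanager
def phase(name):
    """Emit a '--- Phase: X ---' banner, and on failure log the
    exception type at ERROR before re-raising, so the last banner in
    the worker log shows exactly where a run died."""
    log.info("--- Phase: %s ---", name)
    try:
        yield
    except Exception as err:
        log.error("Phase %r failed with %s: %s",
                  name, type(err).__name__, err)
        raise
```

Usage would look like `with phase("Writing"): write_chapters(...)` for each stage.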

Step 4: Introduce a Task Heartbeat

  • core/utils.py: set_heartbeat_callback() and send_heartbeat() added (mirrors the existing progress/log callback pattern).
  • web/tasks.py: db_heartbeat_callback() writes last_heartbeat = NOW to the DB with up to 3 retries. Set as the heartbeat callback in generate_book_task.
  • cli/engine.py: utils.send_heartbeat() called after each chapter is saved to disk — the most meaningful signal that the worker is still processing.
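
The callback pattern described above can be sketched like this. The function names mirror the document; the in-memory example callback is illustrative, whereas the real db_heartbeat_callback writes last_heartbeat to the DB with retries.

```python
import time

# Module-level callback slot, mirroring the existing progress/log
# callback pattern; the engine never imports the web layer directly.
_heartbeat_callback = None

def set_heartbeat_callback(cb):
    """Register (or clear, with None) the heartbeat callback."""
    global _heartbeat_callback
    _heartbeat_callback = cb

def send_heartbeat():
    """Invoke the registered callback, if any. The engine calls this
    after each chapter save, the most meaningful liveness signal."""
    if _heartbeat_callback is not None:
        _heartbeat_callback()

# Illustrative web-side registration using an in-memory store:
beats = []
set_heartbeat_callback(lambda: beats.append(time.time()))
```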

Step 5: Commit and Push Changes

Changes committed to main branch with message Auto-commit: v2.14 — Stuck job robustness (heartbeat, retry, stale watcher, granular logging).


This multi-layered approach significantly reduces the chances of jobs getting stuck and provides better diagnostics when they do. It ensures the system can recover gracefully from worker failures and database locks.