bookapp/ai_blueprint.md
Mike Wichers 81340a18ea Auto-commit: v2.14 — Stuck job robustness (heartbeat, retry, stale watcher, granular logging)
- web/db.py: Add last_heartbeat column to Run model
- core/utils.py: Add set_heartbeat_callback() and send_heartbeat()
- web/tasks.py: Add _robust_update_run_status() with 5-retry exponential backoff;
  add db_heartbeat_callback(); remove all bare except:pass on DB status updates;
  set start_time + last_heartbeat when marking run as 'running'
- web/app.py: Add last_heartbeat column migration; add _stale_job_watcher()
  background thread (checks every 5 min, 15-min heartbeat threshold, 2-hr start_time threshold)
- cli/engine.py: Add phase-level logging banners and try/except wrappers in
  process_book(); add utils.send_heartbeat() after each chapter save;
  add start/finish logging in run_generation()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 19:00:29 -05:00

# AI Blueprint: Addressing Stuck Book Generation Jobs
> **Status: IMPLEMENTED — v2.14**
> All five steps below were implemented on 2026-02-21.
## 1. The Problem: Progress Stalls
The primary issue is that book generation jobs can get "stuck" in a "running" state, preventing users from starting new runs and causing confusion as the UI shows no progress. This is likely caused by worker processes crashing or encountering unhandled errors before they can update the job's final status to "completed" or "failed".
## 2. Investigation Findings
- **State Management:** The `Run` table in the database has a `status` column. Tasks in `web/tasks.py` are responsible for updating this from "queued" to "running" and finally to "completed" or "failed".
- **Point of Failure:** The most likely failure point is a catastrophic crash of the Huey worker process (e.g., out-of-memory error) or a deadlock within the core `cli.engine.run_generation` function. In these scenarios, the `finally` block that updates the status is never reached.
- **Database Contention:** The direct use of `sqlite3` in the tasks can lead to `database is locked` errors. While there are some retries, prolonged locks could cause status updates to fail.
- **Silent Errors:** Some task functions use a bare `try...except: pass` around the final status update. If updating the database fails, the error is swallowed, and the job remains in a "running" state.
## 3. The Plan: Enhancing Robustness
### Step 1: Implement a "Stale Job" Cleanup Process ✅
- **`last_heartbeat` column added to `Run` model** (`web/db.py`).
- **Migration** added in `web/app.py` startup to add `last_heartbeat` column to existing databases.
- **Startup reset** already present — all `status='running'` jobs are reset to `failed` at boot.
- **Periodic stale-job watcher thread** (`_stale_job_watcher`) started in `web/app.py`:
  - Runs every 5 minutes.
  - Marks jobs `failed` if `last_heartbeat` is more than 15 minutes stale.
  - Marks jobs `failed` if `start_time` is more than 2 hours old and no heartbeat was ever recorded.
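
The migration and watcher rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `web/app.py`: the table name `run`, the ISO-8601 text timestamps, and the helper names `ensure_heartbeat_column`, `fail_stale_runs`, and `start_watcher` are hypothetical. Only the column names, the 5-minute interval, and the 15-minute and 2-hour thresholds come from this document.

```python
import sqlite3
import threading
from datetime import datetime, timedelta

HEARTBEAT_LIMIT = timedelta(minutes=15)  # heartbeat threshold from the plan
START_LIMIT = timedelta(hours=2)         # start_time threshold from the plan
CHECK_INTERVAL_S = 5 * 60                # watcher period from the plan


def ensure_heartbeat_column(conn):
    """Add last_heartbeat to an existing table if it is missing (hypothetical migration)."""
    columns = [row[1] for row in conn.execute("PRAGMA table_info(run)")]
    if "last_heartbeat" not in columns:
        conn.execute("ALTER TABLE run ADD COLUMN last_heartbeat TEXT")
        conn.commit()


def fail_stale_runs(conn, now):
    """Mark stale 'running' rows as 'failed' and return how many rows changed."""
    heartbeat_cutoff = (now - HEARTBEAT_LIMIT).isoformat()
    start_cutoff = (now - START_LIMIT).isoformat()
    cur = conn.execute(
        """
        UPDATE run SET status = 'failed'
        WHERE status = 'running'
          AND ((last_heartbeat IS NOT NULL AND last_heartbeat < ?)
            OR (last_heartbeat IS NULL AND start_time < ?))
        """,
        (heartbeat_cutoff, start_cutoff),
    )
    conn.commit()
    return cur.rowcount


def start_watcher(db_path, stop_event):
    """Run fail_stale_runs every CHECK_INTERVAL_S seconds in a daemon thread."""
    def loop():
        # Event.wait returns False on timeout, so the loop runs until stop_event is set.
        while not stop_event.wait(CHECK_INTERVAL_S):
            conn = sqlite3.connect(db_path)
            try:
                fail_stale_runs(conn, datetime.now())
            finally:
                conn.close()

    thread = threading.Thread(target=loop, daemon=True, name="stale-job-watcher")
    thread.start()
    return thread
```

Keeping the SQL in a function that takes `now` as an argument makes the staleness rules testable without waiting for real time to pass; the thread wrapper only handles scheduling.
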
### Step 2: Fortify Database Updates ✅
- **`_robust_update_run_status()`** helper added to `web/tasks.py`:
  - 5 retries with linear backoff (15 seconds per attempt).
  - Handles `sqlite3.OperationalError` specifically with retry; raises `RuntimeError` on total failure.
- All bare `except: pass` blocks around DB status updates removed from:
  - `generate_book_task`: final status update now uses the robust helper with retry.
  - `regenerate_artifacts_task`: all three status-update sites fixed.
  - `rewrite_chapter_task`: `db_path` moved above the outer `try` block to prevent a `NameError`; all status-update sites fixed.
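
A retrying status update of the kind described above might look like the sketch below. It is not the actual `_robust_update_run_status()`: the function name `update_run_status_with_retry`, the table name `run`, the `id` column, the `base_delay` value, and the injectable `sleep` parameter are all assumptions made for illustration. "5 retries" is read here as five attempts in total.

```python
import sqlite3
import time


def update_run_status_with_retry(db_path, run_id, status, retries=5,
                                 base_delay=1.0, sleep=time.sleep):
    """Set run.status, retrying on sqlite3.OperationalError with linear backoff.

    Returns the attempt number that succeeded; raises RuntimeError if all fail.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            conn = sqlite3.connect(db_path, timeout=30)
            try:
                conn.execute(
                    "UPDATE run SET status = ? WHERE id = ?", (status, run_id)
                )
                conn.commit()
            finally:
                conn.close()
            return attempt
        except sqlite3.OperationalError as exc:  # e.g. "database is locked"
            last_error = exc
            if attempt < retries:
                sleep(base_delay * attempt)  # linear: 1x, 2x, 3x, ... base_delay
    # Surface the failure instead of swallowing it, so the job is not left "running" silently.
    raise RuntimeError(
        f"could not set run {run_id} to {status!r} after {retries} attempts"
    ) from last_error
```

Raising on total failure, rather than passing silently, is what lets the caller log the problem and lets the stale-job watcher act as the last line of defence.
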
### Step 3: Add Granular Logging to Core Engine ✅
- **`cli/engine.py`, `run_generation()`**: logs the series title at start; logs the start and finish of each `process_book` call; catches and logs exceptions before re-raising.
- **`cli/engine.py`, `process_book()`**: logs a `--- Phase: X ---` banner at the start of each major stage (Blueprint, Structure & Events, Chapter Planning, Writing, Post-Processing). Each phase is wrapped in a `try/except` that logs an `ERROR` with the exception type before re-raising.
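
One compact way to get the banner-plus-error-logging behaviour described above is a context manager, sketched here. This swaps the document's explicit `try/except` wrappers for `contextlib.contextmanager`; the logger name `engine`, the helper name `phase`, and the exact error message format are assumptions, not taken from `cli/engine.py`.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("engine")  # hypothetical logger name


@contextmanager
def phase(name):
    """Log a phase banner on entry; on failure, log the exception type and re-raise."""
    logger.info("--- Phase: %s ---", name)
    try:
        yield
    except Exception as exc:
        logger.error("Phase %s failed: %s: %s", name, type(exc).__name__, exc)
        raise  # the caller still sees the original exception
```

A stage would then be wrapped as `with phase("Blueprint"): ...`, so the last banner in the log shows which stage a stuck or crashed job had reached.
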
### Step 4: Introduce a Task Heartbeat ✅
- **`core/utils.py`**: `set_heartbeat_callback()` and `send_heartbeat()` added (mirrors the existing progress/log callback pattern).
- **`web/tasks.py`**: `db_heartbeat_callback()` writes `last_heartbeat = NOW` to the DB with up to 3 retries. Set as the heartbeat callback in `generate_book_task`.
- **`cli/engine.py`**: `utils.send_heartbeat()` called after each chapter is saved to disk — the most meaningful signal that the worker is still processing.
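
The callback pattern above can be illustrated with the following sketch. The names `set_heartbeat_callback` and `send_heartbeat` come from this document, but their bodies here are assumptions, as are the factory `make_db_heartbeat_callback`, the table name `run`, the `id` column, the retry delay, and the choice to log (rather than propagate) a failed heartbeat.

```python
import logging
import sqlite3
import time
from datetime import datetime

logger = logging.getLogger("heartbeat")  # hypothetical logger name
_heartbeat_callback = None               # module-level hook set by the web layer


def set_heartbeat_callback(callback):
    """Register the function that send_heartbeat() will invoke."""
    global _heartbeat_callback
    _heartbeat_callback = callback


def send_heartbeat():
    """Invoke the registered callback; return True if it ran without error."""
    if _heartbeat_callback is None:
        return False
    try:
        _heartbeat_callback()
        return True
    except Exception:
        # Assumption: a failed heartbeat is logged but must not abort generation.
        logger.exception("heartbeat callback failed")
        return False


def make_db_heartbeat_callback(db_path, run_id, retries=3, delay=0.5):
    """Build a callback that writes last_heartbeat = now, with up to `retries` attempts."""
    def callback():
        for attempt in range(1, retries + 1):
            try:
                conn = sqlite3.connect(db_path)
                try:
                    conn.execute(
                        "UPDATE run SET last_heartbeat = ? WHERE id = ?",
                        (datetime.now().isoformat(), run_id),
                    )
                    conn.commit()
                finally:
                    conn.close()
                return
            except sqlite3.OperationalError:
                if attempt == retries:
                    raise
                time.sleep(delay * attempt)
    return callback
```

Because the engine only calls `send_heartbeat()`, it stays unaware of the database; the web task decides what a heartbeat means by registering a callback before generation starts.
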
### Step 5: Commit and Push Changes ✅
Changes committed to `main` branch with message `Auto-commit: v2.14 — Stuck job robustness (heartbeat, retry, stale watcher, granular logging)`.
---
This multi-layered approach will significantly reduce the chances of jobs getting stuck and provide better diagnostics if they do. It ensures the system can recover gracefully from worker failures and database locks.