feat: Implement ai_blueprint.md action plan — architectural review & optimisations

Steps 1–7 of the ai_blueprint.md action plan executed:

DOCUMENTATION (Steps 1–3, 6–7):
- docs/current_state_analysis.md: Phase-by-phase cost/quality mapping of existing pipeline
- docs/alternatives_analysis.md: 15 alternative approaches with testable hypotheses
- docs/experiment_design.md: 7 controlled A/B experiment specifications (CPC, HQS, CER metrics)
- ai_blueprint_v2.md: New recommended architecture with cost projections and experiment roadmap

CODE IMPROVEMENTS (Step 4 — Experiments 1–4 implemented):
- story/writer.py: Extract build_persona_info() — persona loaded once per book, not per chapter
- story/writer.py: Adaptive scoring thresholds — SCORE_PASSING scales 6.5→7.5 by chapter position
- story/writer.py: Beat expansion skip — if beats >100 words, skip Director's Treatment expansion
- story/planner.py: validate_outline() — pre-generation gate checks missing beats, continuity, pacing
- story/planner.py: Enrichment field validation — warn on missing title/genre after enrich()
- cli/engine.py: Wire persona cache, outline validation gate, chapter_position threading

Expected savings: ~285K tokens per 30-chapter novel (~7% cost reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 22:01:30 -05:00
parent 6684ec2bf5
commit 2100ca2312
8 changed files with 1143 additions and 32 deletions

docs/alternatives_analysis.md (new file, 264 lines)
# Alternatives Analysis: Hypotheses for Each Phase
**Date:** 2026-02-22
**Status:** Completed — fulfills Action Plan Step 2
---
## Methodology
For each phase, we present the current approach, document credible alternatives, and state a testable hypothesis about cost and quality impact. Each alternative is rated for implementation complexity and expected payoff.
---
## Phase 1: Foundation & Ideation
### Current Approach
A single Logic-model call expands a minimal user prompt into `book_metadata`, `characters`, and `plot_beats`. The author persona is created in a separate single-pass call.
---
### Alt 1-A: Dynamic Bible (Just-In-Time Generation)
**Description:** Instead of creating the full bible upfront, generate only world rules and core character archetypes at start. Flesh out secondary characters and specific locations only when the planner references them during outlining.
**Mechanism:**
1. Upfront: title, genre, tone, 1–2 core characters, 3 immutable world rules
2. During `expand()`: When a new location/character appears in events, call a mini-enrichment to define them
3. Benefits: Only define what's actually used; no wasted detail on characters who don't appear
**Hypothesis:** Dynamic bible reduces Phase 1 token cost by ~30% and improves character coherence because every detail is tied to a specific narrative purpose. May increase Phase 2 cost by ~15% due to incremental enrichment calls.
**Complexity:** Medium — requires refactoring `planner.py` to support on-demand enrichment
**Risk:** New characters generated mid-outline might not be coherent with established world
---
### Alt 1-B: Lean Bible (Rules + Emergence)
**Description:** Define only immutable "physics" of the world (e.g., "no magic exists", "set in 1920s London") and let all characters and plot details emerge from the writing process. Only characters explicitly named by the user are pre-defined.
**Hypothesis:** Lean bible reduces Phase 1 cost by ~60% but increases Phase 3 cost by ~25% (more continuity errors require more evaluation retries). Net effect depends on how many characters the user pre-defines.
**Complexity:** Low — strip `enrich()` down to essentials
**Risk:** Characters might be inconsistent across chapters without a shared bible anchor
---
### Alt 1-C: Iterative Persona Validation
**Description:** After `create_initial_persona()`, immediately generate a 200-word sample passage in that persona's voice and evaluate it with the editor. Only accept the persona if the sample scores ≥ 7/10.
**Hypothesis:** Iterative persona validation adds ~8K tokens to Phase 1 but reduces Phase 3 persona-related rewrite rate by ~20% (fewer voice-drift refinements needed).
**Complexity:** Low — add one evaluation call after persona creation
**Risk:** Minimal — only adds cost if persona is rejected
---
## Phase 2: Structuring & Outlining
### Current Approach
Sequential depth-expansion passes convert plot beats into a chapter plan. Each `expand()` call is unaware of the final desired state, so multiple passes are needed.
---
### Alt 2-A: Single-Pass Hierarchical Outline
**Description:** Replace sequential `expand()` calls with a single multi-step prompt that builds the outline in one shot — specifying the desired depth level in the instructions. The model produces both high-level events and chapter-level detail simultaneously.
**Hypothesis:** Single-pass outline reduces Phase 2 Logic calls from 6 to 2 (one `plan_structure`, one combined `expand+chapter_plan`), saving ~60K tokens (~45% Phase 2 cost). Quality may drop slightly if the model can't maintain coherence across 50 chapters in one response.
**Complexity:** Low — prompt rewrite; no code structure change
**Risk:** Large single-response JSON might fail or be truncated by model. Novel (30 chapters) is manageable; Epic (50 chapters) is borderline.
---
### Alt 2-B: Outline Validation Gate
**Description:** After `create_chapter_plan()`, run a validation call that checks the outline for: (a) missing required plot beats, (b) character deaths/revivals, (c) pacing imbalances, (d) POV distribution. Block writing phase until outline passes validation.
**Hypothesis:** Pre-generation outline validation (1 Logic call, ~15K tokens, FREE on Pro-Exp) prevents ~3–5 expensive rewrite cycles during Phase 3, saving 75K–125K Writer tokens (~$0.05–$0.10 per book).
**Complexity:** Low — add `validate_outline()` function, call it before Phase 3 begins
**Risk:** Validation might be overly strict and reject valid creative choices
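The gate's decision logic might look like the following sketch. The JSON verdict shape (`issues` plus a `severity` field) follows the description above, but `parse_validation()` and `outline_passes()` are assumed helper names, and the actual Logic-model call is stubbed out:

```python
import json

def parse_validation(response_text: str) -> dict:
    """Parse the model's JSON verdict, failing closed on unparseable output."""
    try:
        verdict = json.loads(response_text)
    except json.JSONDecodeError:
        # A response we cannot parse is treated as a critical failure.
        return {"issues": ["unparseable validation response"], "severity": "critical"}
    verdict.setdefault("issues", [])
    verdict.setdefault("severity", "ok")
    return verdict

def outline_passes(verdict: dict) -> bool:
    # Block Phase 3 only on critical issues; warnings are logged but allowed.
    return verdict["severity"] != "critical"

ok = parse_validation('{"issues": [], "severity": "ok"}')
bad = parse_validation('{"issues": ["Ch 12: dead character speaks"], "severity": "critical"}')
```

Failing closed on malformed JSON matters here: a validation gate that silently passes on a parse error would reintroduce the silent-failure pattern flagged as P1-L1.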
---
### Alt 2-C: Dynamic Personas (Mood/POV Adaptation)
**Description:** Instead of a single author persona, create sub-personas for different scene types: (a) action sequences, (b) introspection/emotion, (c) dialogue-heavy scenes. The writer prompt selects the appropriate sub-persona based on chapter pacing.
**Hypothesis:** Dynamic personas reduce "voice drift" across different scene types, improving average chapter evaluation score by ~0.3 points. Cost increases by ~12K tokens/book for the additional persona generation calls.
**Complexity:** Medium — requires sub-persona generation, storage, and selection logic in `write_chapter()`
**Risk:** Sub-personas might be inconsistent with each other if not carefully designed
---
### Alt 2-D: Specialized Chapter Templates
**Description:** Create genre-specific "chapter templates" for common patterns: opening chapters, mid-point reversals, climax chapters, denouements. The planner selects the appropriate template when assigning structure, reducing the amount of creative work needed per chapter.
**Hypothesis:** Chapter templates reduce Phase 3 beat expansion cost by ~40% (pre-structured templates need less expansion) and reduce rewrite rate by ~15% (templates encode known-good patterns).
**Complexity:** Medium — requires template library and selection logic
**Risk:** Templates might make books feel formulaic
---
## Phase 3: The Writing Engine
### Current Approach
Single-model drafting with up to 3 attempts. Low-scoring drafts trigger full rewrites using the Pro model. Evaluation happens after each draft.
---
### Alt 3-A: Two-Pass Drafting (Cheap Draft + Expensive Polish)
**Description:** Use the cheapest available Flash model for a rough first draft (focused on getting beats covered and word count right), then use the Pro model to polish prose quality. Skip the evaluation + rewrite loop entirely.
**Hypothesis:** Two-pass drafting reduces average chapter evaluation score variance (fewer very-low scores), but might be slower because every chapter gets polished regardless of quality. Net cost impact uncertain — depends on Flash vs Pro price differential. At current pricing (Flash free on Pro-Exp), this is equivalent to the current approach.
**Complexity:** Low — add a "polish" pass after initial draft in `write_chapter()`
**Risk:** Polish pass might not improve chapters that have structural problems (wrong beats covered)
---
### Alt 3-B: Adaptive Scoring Thresholds
**Description:** Use different scoring thresholds based on chapter position and importance:
- Setup chapters (1–20% of book): SCORE_PASSING = 6.5 (accept imperfect early work)
- Midpoint + rising action (20–70%): SCORE_PASSING = 7.0 (current standard)
- Climax + resolution (70–100%): SCORE_PASSING = 7.5 (stricter standards for crucial chapters)
**Hypothesis:** Adaptive thresholds reduce refinement calls on setup chapters by ~25% while improving quality of climax chapters. Net token saving ~100K per book (~$0.02) with no quality loss on high-stakes scenes.
**Complexity:** Very low — change 2 constants in `write_chapter()` to be position-aware
**Risk:** Lower-quality setup chapters might affect reader engagement in early pages
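The bucketed thresholds above can be sketched as a small position-aware helper; the function name is illustrative, and `position` is assumed to be `chapter_num / total_chapters` in the 0.0–1.0 range:

```python
def adaptive_passing_score(position: float) -> float:
    """Position-aware SCORE_PASSING threshold (sketch of Alt 3-B)."""
    if position <= 0.20:   # setup chapters: accept imperfect early work
        return 6.5
    if position <= 0.70:   # midpoint + rising action: current standard
        return 7.0
    return 7.5             # climax + resolution: stricter standard
```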
---
### Alt 3-C: Pre-Scoring Outline Beats
**Description:** Before writing any chapter, use the Logic model to score each chapter's beat list for "writability" — the likelihood that the beats will produce a high-quality first draft. Flag chapters scoring below 6/10 as "high-risk" and assign them extra write attempts upfront.
**Hypothesis:** Pre-scoring beats adds ~5K tokens per book but reduces full-rewrite incidents by ~30% (the most expensive outcome). Expected saving: 30% × 15 rewrites × 50K tokens = ~225K tokens (~$0.05).
**Complexity:** Low — add `score_beats_writability()` call before Phase 3 loop
**Risk:** Pre-scoring accuracy might be low; Logic model can't fully predict quality from beats alone
---
### Alt 3-D: Persona Caching (Immediate Win)
**Description:** Load the author persona (bio, sample text, sample files) once per book run rather than re-reading from disk for each chapter. Store in memory and pass to `write_chapter()` as a pre-built string.
**Hypothesis:** Persona caching reduces per-chapter I/O overhead and eliminates redundant file reads. No quality impact. Saves ~90K tokens per book (3K tokens × 30 chapters from persona sample files).
**Complexity:** Very low — refactor engine.py to load persona once and pass it
**Risk:** None
---
### Alt 3-E: Skip Beat Expansion for Detailed Beats
**Description:** If a chapter's beats already exceed 100 words each, skip `expand_beats_to_treatment()`. The existing beats are detailed enough to guide the writer.
**Hypothesis:** ~30% of chapters have detailed beats. Skipping expansion saves 5K tokens × 30% × 30 chapters = ~45K tokens. Quality impact negligible for already-detailed beats.
**Complexity:** Very low — add word-count check before calling `expand_beats_to_treatment()`
**Risk:** None for already-detailed beats; risk only if threshold is set too low
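The guard might be sketched as below, thresholding on the combined word count of the beat list (the reading used in Experiment 2; the helper names are assumptions):

```python
def beats_word_count(beats: list) -> int:
    """Total word count across all beats for a chapter."""
    return sum(len(b.split()) for b in beats)

def needs_expansion(beats: list, threshold: int = 100) -> bool:
    # Sparse beats still need the Director's Treatment pass;
    # detailed beats above the threshold skip it.
    return beats_word_count(beats) <= threshold
```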
---
## Phase 4: Review & Refinement
### Current Approach
Per-chapter evaluation with 13 rubrics. Post-generation consistency check. Dynamic pacing interventions. User-triggered ripple propagation.
---
### Alt 4-A: Batched Chapter Evaluation
**Description:** Instead of evaluating each chapter individually (~20K tokens/eval), batch 3–5 chapters per evaluation call. The evaluator assesses them together and can identify cross-chapter issues (pacing, voice consistency) that per-chapter evaluation misses.
**Hypothesis:** Batched evaluation reduces evaluation token cost by ~60% (from 600K to 240K tokens) while improving cross-chapter quality detection. Risk: individual chapter scores may be less granular.
**Complexity:** Medium — refactor `evaluate_chapter_quality()` to accept chapter arrays
**Risk:** Batched scoring might be less precise per-chapter; harder to pinpoint which chapter needs rewriting
---
### Alt 4-B: Mid-Generation Consistency Snapshots
**Description:** Run `analyze_consistency()` every 10 chapters (not just post-generation). If contradictions are found, pause writing and resolve them before proceeding.
**Hypothesis:** Mid-generation consistency checks add ~3 Logic calls per 30-chapter book (~75K tokens, FREE) but reduce post-generation ripple propagation cost by ~50% by catching issues early.
**Complexity:** Low — add consistency snapshot call to engine.py loop
**Risk:** Consistency check might generate false positives that stall generation
---
### Alt 4-C: Semantic Ripple Detection
**Description:** Replace LLM-based ripple detection in `check_and_propagate()` with an embedding-similarity approach. When Chapter N is edited, compute semantic similarity between Chapter N's content and all downstream chapters. Only rewrite chapters above a similarity threshold.
**Hypothesis:** Semantic ripple detection reduces per-ripple token cost from ~15K (LLM scan) to ~2K (embedding query) — 87% reduction. Accuracy comparable to LLM for direct references; may miss indirect narrative impacts.
**Complexity:** High — requires adding `sentence-transformers` or Gemini embedding API dependency
**Risk:** Embedding similarity doesn't capture narrative causality (e.g., a character dying affects later chapters even if the death isn't mentioned verbatim)
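The thresholding logic might look like the sketch below. A bag-of-words vector stands in for a real embedding model (sentence-transformers or the Gemini embedding API); the embedding quality is not the point here, the similarity gate is. Function names and the 0.35 threshold are assumptions:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chapters_to_rewrite(edited_text: str, downstream: dict,
                        threshold: float = 0.35) -> list:
    """Return downstream chapter numbers semantically close to the edit."""
    edited_vec = embed(edited_text)
    return [n for n, text in sorted(downstream.items())
            if cosine(edited_vec, embed(text)) >= threshold]
```

With real embeddings the vectors come from a model rather than word counts, but the gate is the same: only chapters above the similarity threshold enter the rewrite queue.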
---
### Alt 4-D: Editor Bot Specialization
**Description:** Create specialized sub-evaluators for specific failure modes:
- `check_filter_words()` — fast regex-based scan (no LLM needed)
- `check_summary_mode()` — detect scene-skipping patterns
- `check_voice_consistency()` — compare chapter voice against persona sample
- `check_plot_adherence()` — verify beats were covered
Run cheap checks first; only invoke full 13-rubric LLM evaluation if fast checks pass.
**Hypothesis:** Specialized editor bots reduce evaluation cost by ~40% (many chapters fail fast checks and don't need full LLM eval). Quality detection equal or better because fast checks are more precise for rule violations.
**Complexity:** Medium — implement regex-based fast checks; modify evaluation pipeline
**Risk:** Fast checks might have false positives that reject good chapters prematurely
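The cheapest of these, `check_filter_words()`, might be a pure-regex scan like the sketch below. The filter-word list is a tiny illustrative subset, not the project's real blacklist, and the 1/120 density limit comes from the auto-fail conditions described elsewhere in this analysis:

```python
import re

# Illustrative subset of filter words; the real blacklist lives in the
# style guidelines, not here.
FILTER_WORDS = re.compile(r"\b(felt|saw|heard|noticed|realized|seemed)\b",
                          re.IGNORECASE)

def check_filter_words(text: str, max_density: float = 1 / 120) -> bool:
    """Return True if the chapter passes the filter-word density limit."""
    words = text.split()
    if not words:
        return True
    hits = len(FILTER_WORDS.findall(text))
    return hits / len(words) <= max_density
```

Because this runs in microseconds with no LLM call, it can gate every draft before any evaluation tokens are spent.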
---
## Summary: Hypotheses Ranked by Expected Value
| Alt | Phase | Expected Token Saving | Quality Impact | Complexity |
|-----|-------|----------------------|----------------|------------|
| 3-D (Persona Cache) | 3 | ~90K | None | Very Low |
| 3-E (Skip Beat Expansion) | 3 | ~45K | None | Very Low |
| 2-B (Outline Validation) | 2 | Prevents ~100K rewrites | Positive | Low |
| 3-B (Adaptive Thresholds) | 3 | ~100K | Positive | Very Low |
| 1-C (Persona Validation) | 1 | ~60K (prevented rewrites) | Positive | Low |
| 4-B (Mid-gen Consistency) | 4 | ~75K (prevented rewrites) | Positive | Low |
| 3-C (Pre-score Beats) | 3 | ~225K | Positive | Low |
| 4-A (Batch Evaluation) | 4 | ~360K | Neutral/Positive | Medium |
| 2-A (Single-pass Outline) | 2 | ~60K | Neutral | Low |
| 3-A (Two-Pass Drafting) | 3 | Neutral | Potentially Positive | Low |
| 4-D (Editor Bots) | 4 | ~240K | Positive | Medium |
| 2-C (Dynamic Personas) | 2 | -12K (slight increase) | Positive | Medium |
| 4-C (Semantic Ripple) | 4 | ~200K | Neutral | High |

docs/current_state_analysis.md (new file, 238 lines)
# Current State Analysis: BookApp AI Pipeline
**Date:** 2026-02-22
**Scope:** Mapping existing codebase to the four phases defined in `ai_blueprint.md`
**Status:** Completed — fulfills Action Plan Step 1
---
## Overview
BookApp is an AI-powered novel generation engine using Google Gemini. The pipeline is structured into four phases that map directly to the review framework in `ai_blueprint.md`. This document catalogues the current implementation, identifies efficiency metrics, and surfaces limitations in each phase.
---
## Phase 1: Foundation & Ideation ("The Seed")
**Primary File:** `story/planner.py` (lines 1–86)
**Supporting:** `story/style_persona.py` (lines 81–104), `core/config.py`
### What Happens
1. User provides a minimal `manual_instruction` (can be a single sentence).
2. `enrich(bp, folder, context)` calls the Logic model to expand this into:
- `book_metadata`: title, genre, tone, time period, structure type, formatting rules, content warnings
- `characters`: 2–8 named characters with roles and descriptions
- `plot_beats`: 5–7 concrete narrative beats
3. If the project is part of a series, context from previous books is injected.
4. `create_initial_persona()` generates a fictional author persona (name, bio, age, gender).
### Costs (Per Book)
| Task | Model | Input Tokens | Output Tokens | Cost (Pro-Exp) |
|------|-------|-------------|---------------|----------------|
| `enrich()` | Logic | ~10K | ~3K | FREE |
| `create_initial_persona()` | Logic | ~5.5K | ~1.5K | FREE |
| **Phase 1 Total** | — | ~15.5K | ~4.5K | **FREE** |
### Known Limitations
| ID | Issue | Impact |
|----|-------|--------|
| P1-L1 | `enrich()` silently returns original BP on exception (line 84) | Invalid enrichment passes downstream without warning |
| P1-L2 | `filter_characters()` blacklists keywords like "TBD", "protagonist" — can cull valid names | Characters named "The Protagonist" are silently dropped |
| P1-L3 | Single-pass persona creation — no quality check on output | Generic personas produce poor voice throughout book |
| P1-L4 | No validation that required `book_metadata` fields are non-null | Downstream crashes when title/genre are missing |
---
## Phase 2: Structuring & Outlining
**Primary File:** `story/planner.py` (lines 89–290)
**Supporting:** `story/style_persona.py`
### What Happens
1. `plan_structure(bp, folder)` maps plot beats to a structural framework (Hero's Journey, Three-Act, etc.) and produces ~10–15 events.
2. `expand(events, pass_num, ...)` iteratively enriches the outline. Called `depth` times (1–4 based on length preset). Each pass targets chapter count × 1.5 events as ceiling.
3. `create_chapter_plan(events, bp, folder)` converts events into concrete chapter objects with POV, pacing, and estimated word count.
4. `get_style_guidelines()` loads or refreshes the AI-ism blacklist and filter-word list.
### Depth Strategy
| Preset | Depth | Expand Calls | Approx Events |
|--------|-------|-------------|---------------|
| Flash Fiction | 1 | 1 | 1 |
| Short Story | 1 | 1 | 5 |
| Novella | 2 | 2 | 15 |
| Novel | 3 | 3 | 30 |
| Epic | 4 | 4 | 50 |
### Costs (30-Chapter Novel)
| Task | Calls | Input Tokens | Cost (Pro-Exp) |
|------|-------|-------------|----------------|
| `plan_structure` | 1 | ~15K | FREE |
| `expand` × 3 | 3 | ~12K each | FREE |
| `create_chapter_plan` | 1 | ~14K | FREE |
| `get_style_guidelines` | 1 | ~8K | FREE |
| **Phase 2 Total** | 6 | ~73K | **FREE** |
### Known Limitations
| ID | Issue | Impact |
|----|-------|--------|
| P2-L1 | Sequential `expand()` calls — each call unaware of final state | Redundant inter-call work; could be one multi-step prompt |
| P2-L2 | No continuity validation on outline — character deaths/revivals not detected | Plot holes remain until expensive Phase 3 rewrite |
| P2-L3 | Static chapter plan — cannot adapt if early chapters reveal pacing problem | Dynamic interventions in Phase 4 are costly workarounds |
| P2-L4 | POV assignment is AI-generated, not validated against narrative logic | Wrong POV on key scenes; caught only during editing |
| P2-L5 | Word count estimates are rough (~±30% actual variance) | Writer overshoots/undershoots target; word count normalization fails |
---
## Phase 3: The Writing Engine (Drafting)
**Primary File:** `story/writer.py`
**Orchestrated by:** `cli/engine.py`
### What Happens
For each chapter:
1. `expand_beats_to_treatment()` — Logic model expands sparse beats into a "Director's Treatment" (staging, sensory anchors, emotional arc, subtext).
2. `write_chapter()` constructs a ~310-line prompt injecting:
- Author persona (bio, sample text, sample files from disk)
- Filtered characters (only those named in beats + POV character)
- Character tracking state (location, clothing, held items)
- Lore context (relevant locations/items from tracking)
- Style guidelines + genre-specific mandates
- Smart context tail: last ~1000 tokens of previous chapter
- Director's Treatment
3. Writer model generates first draft.
4. Logic model evaluates on 13 rubrics (1–10 scale). Automatic fail conditions apply for filter-word density, summary mode, and labeled emotions.
5. Iterative quality loop (up to 3 attempts):
- Score ≥ 8.0 → Auto-accept
- Score ≥ 7.0 → Accept after max attempts
- Score < 7.0 → Refinement pass (Writer model)
- Score < 6.0 → Full rewrite (Pro model)
6. Every 5 chapters: `refine_persona()` updates author bio based on actual written text.
### Key Innovations
- **Dynamic Character Injection:** Only injects characters named in chapter beats (saves ~5K tokens/chapter).
- **Smart Context Tail:** Takes last ~1000 tokens of previous chapter (not first 1000) — preserves handoff point.
- **Auto Model Escalation:** Low-scoring drafts trigger switch to Pro model for full rewrite.
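The smart context tail can be sketched as below, using a rough 1 token ≈ 0.75 words heuristic in place of the model's actual tokenizer (the function name and the heuristic are assumptions):

```python
def context_tail(previous_chapter: str, max_tokens: int = 1000) -> str:
    """Return roughly the last `max_tokens` tokens of the previous chapter.

    Keeps the end of the chapter, not the start, so the handoff point
    into the next chapter is preserved.
    """
    max_words = int(max_tokens * 0.75)  # crude tokens-to-words conversion
    words = previous_chapter.split()
    return " ".join(words[-max_words:])
```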
### Costs (30-Chapter Novel, Mixed Model Strategy)
| Task | Calls | Input Tokens | Output Tokens | Cost Estimate |
|------|-------|-------------|---------------|---------------|
| `expand_beats_to_treatment` × 30 | 30 | ~5K | ~2K | FREE (Logic) |
| `write_chapter` draft × 30 | 30 | ~25K | ~3.5K | ~$0.087 (Writer) |
| Evaluation × 30 | 30 | ~20K | ~1.5K | FREE (Logic) |
| Refinement passes × 15 (est.) | 15 | ~20K | ~3K | ~$0.090 (Writer) |
| `refine_persona` × 6 | 6 | ~6K | ~1.5K | FREE (Logic) |
| **Phase 3 Total** | ~111 | ~1.9M | ~310K | **~$0.18** |
### Known Limitations
| ID | Issue | Impact |
|----|-------|--------|
| P3-L1 | Persona files re-read from disk on every chapter | I/O overhead; persona doesn't change between reads |
| P3-L2 | Beat expansion called even when beats are already detailed (>100 words) | Wastes ~5K tokens/chapter on ~30% of chapters |
| P3-L3 | Full rewrite triggered at score < 6.0 — discards entire draft | If draft scores 5.9, all 25K output tokens wasted |
| P3-L4 | No priority weighting for climax chapters | Ch 28 (climax) uses same resources/attempts as Ch 3 (setup) |
| P3-L5 | Previous chapter context hard-capped at 1000 tokens | For long chapters, might miss setup context from earlier pages |
| P3-L6 | Scoring thresholds fixed regardless of book position | Strict standards in early chapters = expensive refinement for setup scenes |
---
## Phase 4: Review & Refinement (Editing)
**Primary Files:** `story/editor.py`, `story/bible_tracker.py`
**Orchestrated by:** `cli/engine.py`
### What Happens
**During writing loop (every chapter):**
- `update_tracking()` refreshes character state (location, clothing, held items, speech style, events).
- `update_lore_index()` extracts canonical descriptions of locations and items.
**Every 2 chapters:**
- `check_pacing()` detects if story is rushing or repeating beats; triggers ADD_BRIDGE or CUT_NEXT interventions.
**After writing completes:**
- `analyze_consistency()` scans entire manuscript for plot holes and contradictions.
- `harvest_metadata()` extracts newly invented characters not in the original bible.
- `check_and_propagate()` cascades chapter edits forward through the manuscript.
### 13 Evaluation Rubrics
1. Engagement & tension
2. Scene execution (no summaries)
3. Voice & tone
4. Sensory immersion
5. Show, Don't Tell / Deep POV (**auto-fail trigger**)
6. Character agency
7. Pacing
8. Genre appropriateness
9. Dialogue authenticity
10. Plot relevance
11. Staging & flow
12. Prose dynamics (sentence variety)
13. Clarity & readability
**Automatic fail conditions:** filter-word density > 1/120 words → cap at 5; summary mode detected → cap at 6; >3 labeled emotions → cap at 5.
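Those caps reduce to a few `min()` clamps. A sketch, assuming the evaluator's structured output already supplies the density and the two flags:

```python
def apply_auto_fail_caps(score: float, filter_word_density: float,
                         summary_mode: bool, labeled_emotions: int) -> float:
    """Clamp a rubric score according to the automatic fail conditions."""
    if filter_word_density > 1 / 120:   # more than 1 filter word per 120 words
        score = min(score, 5.0)
    if summary_mode:                    # scene-skipping detected
        score = min(score, 6.0)
    if labeled_emotions > 3:            # too many labeled emotions
        score = min(score, 5.0)
    return score
```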
### Costs (30-Chapter Novel)
| Task | Calls | Input Tokens | Cost (Pro-Exp) |
|------|-------|-------------|----------------|
| `update_tracking` × 30 | 30 | ~18K | FREE |
| `update_lore_index` × 30 | 30 | ~15K | FREE |
| `check_pacing` × 15 | 15 | ~18K | FREE |
| `analyze_consistency` | 1 | ~25K | FREE |
| `harvest_metadata` | 1 | ~25K | FREE |
| **Phase 4 Total** | 77 | ~1.34M | **FREE** |
### Known Limitations
| ID | Issue | Impact |
|----|-------|--------|
| P4-L1 | Consistency check is post-generation only | Plot holes caught too late to cheaply fix |
| P4-L2 | Ripple propagation (`check_and_propagate`) has no cost ceiling | A single user edit in Ch 5 can trigger 100K+ tokens of cascading rewrites |
| P4-L3 | `rewrite_chapter_content()` uses Logic model instead of Writer model | Less creative rewrite output — Logic model optimizes reasoning, not prose |
| P4-L4 | `check_pacing()` sampling only looks at recent chapters, not cumulative arc | Slow-building issues across 10+ chapters not detected until critical |
| P4-L5 | No quality metric for the evaluator itself | Can't confirm if 13-rubric scores are calibrated correctly |
---
## Cross-Phase Summary
### Total Costs (30-Chapter Novel)
| Phase | Token Budget | Cost Estimate |
|-------|-------------|---------------|
| Phase 1: Ideation | ~20K | FREE |
| Phase 2: Outline | ~73K | FREE |
| Phase 3: Writing | ~2.2M | ~$0.18 |
| Phase 4: Review | ~1.34M | FREE |
| Imagen Cover (3 images) | — | ~$0.12 |
| **Total** | **~3.63M** | **~$0.30** |
*Assumes quality-first model selection (Pro-Exp for Logic, Flash for Writer)*
### Efficiency Frontier
- **Best case** (all chapters pass first attempt): ~$0.18 text + $0.04 cover = ~$0.22
- **Worst case** (30% rewrite rate with Pro escalations): ~$0.45 text + $0.12 cover = ~$0.57
- **Budget per blueprint goal:** $2.00 total — current system is 15–29% of budget
### Top 5 Immediate Optimization Opportunities
| Priority | ID | Change | Savings |
|----------|----|--------|---------|
| 1 | P3-L1 | Cache persona per book (not per chapter) | ~90K tokens |
| 2 | P3-L2 | Skip beat expansion for detailed beats | ~45K tokens |
| 3 | P2-L2 | Add pre-generation outline validation | Prevent expensive rewrites |
| 4 | P1-L1 | Fix silent failure in `enrich()` | Prevent silent corrupt state |
| 5 | P3-L6 | Adaptive scoring thresholds by chapter position | ~15% fewer refinement passes |

docs/experiment_design.md (new file, 290 lines)
# Experiment Design: A/B Tests for BookApp Optimization
**Date:** 2026-02-22
**Status:** Completed — fulfills Action Plan Step 3
---
## Methodology
All experiments follow a controlled A/B design. We hold all variables constant except the single variable under test. Success is measured against three primary metrics:
- **Cost per chapter (CPC):** Total token cost / number of chapters written
- **Human Quality Score (HQS):** 1–10 score from a human reviewer blind to which variant generated the chapter
- **Continuity Error Rate (CER):** Number of plot/character contradictions per 10 chapters (lower is better)
Each experiment runs on the same 3 prompts (one each of short story, novella, and novel length). Results are averaged across all 3.
**Baseline:** Current production configuration as of 2026-02-22.
---
## Experiment 1: Persona Caching
**Alt Reference:** Alt 3-D
**Hypothesis:** Caching persona per book reduces I/O overhead with no quality impact.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Persona loading | Re-read from disk each chapter | Load once per book run, pass as argument |
| Everything else | Identical | Identical |
### Metrics to Measure
- Token count per chapter (to verify savings)
- Wall-clock generation time per book
- Chapter quality scores (should be identical)
### Success Criterion
- Token reduction ≥ 2,000 tokens/chapter on books with sample files
- HQS difference < 0.1 between A and B (no quality impact)
- Zero new errors introduced
### Implementation Notes
- Modify `cli/engine.py`: call `style_persona.load_persona_data()` once before chapter loop
- Modify `story/writer.py`: accept optional `persona_info` parameter, skip disk reads if provided
- Estimated implementation: 30 minutes
---
## Experiment 2: Skip Beat Expansion for Detailed Beats
**Alt Reference:** Alt 3-E
**Hypothesis:** Skipping `expand_beats_to_treatment()` when beats exceed 100 words saves tokens with no quality loss.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Beat expansion | Always called | Skipped if total beats > 100 words |
| Everything else | Identical | Identical |
### Metrics to Measure
- Percentage of chapters that skip expansion (expected: ~30%)
- Token savings per book
- HQS for chapters that skip vs. chapters that don't skip
- Rate of beat-coverage failures (chapters that miss a required beat)
### Success Criterion
- ≥ 25% of chapters skip expansion (validating hypothesis)
- HQS difference < 0.2 between chapters that skip and those that don't
- Beat-coverage failure rate unchanged
### Implementation Notes
- Modify `story/writer.py` `write_chapter()`: add `if sum(len(b.split()) for b in beats) > 100` guard (total word count, not character count) before calling expansion
- Estimated implementation: 15 minutes
---
## Experiment 3: Outline Validation Gate
**Alt Reference:** Alt 2-B
**Hypothesis:** Pre-generation outline validation prevents costly Phase 3 rewrites by catching plot holes at the outline stage.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Outline validation | None | Run `validate_outline()` after `create_chapter_plan()`; block if critical issues found |
| Everything else | Identical | Identical |
### Metrics to Measure
- Number of critical outline issues flagged per run
- Rewrite rate during Phase 3 (did validation prevent rewrites?)
- Phase 3 token cost difference (A vs B)
- CER difference (did validation reduce continuity errors?)
### Success Criterion
- Validation blocks at least 1 critical issue per 3 runs
- Phase 3 rewrite rate drops ≥ 15% when validation is active
- CER improves ≥ 0.5 per 10 chapters
### Implementation Notes
- Add `validate_outline(events, chapters, bp, folder)` to `story/planner.py`
- Prompt: "Review this chapter plan for: (1) missing required plot beats, (2) character deaths/revivals without explanation, (3) severe pacing imbalances, (4) POV character inconsistency. Return: {issues: [...], severity: 'critical'|'warning'|'ok'}"
- Modify `cli/engine.py`: call `validate_outline()` and log issues before Phase 3 begins
- Estimated implementation: 2 hours
---
## Experiment 4: Adaptive Scoring Thresholds
**Alt Reference:** Alt 3-B
**Hypothesis:** Lowering SCORE_PASSING for early setup chapters reduces refinement cost while maintaining quality on high-stakes scenes.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| SCORE_AUTO_ACCEPT | 8.0 (all chapters) | 8.0 (all chapters) |
| SCORE_PASSING | 7.0 (all chapters) | 6.5 (ch 1–20%), 7.0 (ch 20–70%), 7.5 (ch 70–100%) |
| Everything else | Identical | Identical |
### Metrics to Measure
- Refinement pass count per chapter position bucket
- HQS per chapter position bucket (A vs B)
- CPC for each bucket
- Overall HQS for full book (A vs B)
### Success Criterion
- Setup chapters (1–20%): ≥ 20% fewer refinement passes in B
- Climax chapters (70–100%): HQS improvement ≥ 0.3 in B
- Full book HQS unchanged or improved
### Implementation Notes
- Modify `story/writer.py` `write_chapter()`: accept `chapter_position` (0.0–1.0 float)
- Compute adaptive threshold: `passing = 6.5 + position * 1.0` (linear scaling)
- Modify `cli/engine.py`: pass `chapter_num / total_chapters` to `write_chapter()`
- Estimated implementation: 1 hour
---
## Experiment 5: Mid-Generation Consistency Snapshots
**Alt Reference:** Alt 4-B
**Hypothesis:** Running `analyze_consistency()` every 10 chapters reduces post-generation CER without significant cost increase.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Consistency check | Post-generation only | Every 10 chapters + post-generation |
| Everything else | Identical | Identical |
### Metrics to Measure
- CER post-generation (A vs B)
- Number of issues caught mid-generation vs post-generation
- Token cost difference (mid-gen checks add ~25K × N/10 tokens)
- Generation time difference
### Success Criterion
- Post-generation CER drops ≥ 30% in B
- Issues caught mid-generation prevent at least 1 expensive post-gen ripple propagation per run
- Additional cost ≤ $0.01 per book (all free on Pro-Exp)
### Implementation Notes
- Modify `cli/engine.py`: every 10 chapters, call `analyze_consistency()` on written chapters so far
- If issues found: log warning and optionally pause for user review
- Estimated implementation: 1 hour
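The checkpoint loop described in the notes above might look like this; `analyze_consistency()` and the shape of the chapter list are assumptions drawn from this experiment's description, not the actual `cli/engine.py` interface.

```python
CHECK_INTERVAL = 10  # run a consistency snapshot every N chapters

def maybe_run_consistency_check(chapters_written, analyze_consistency, log=print):
    """After every CHECK_INTERVAL chapters, review everything written so far.

    Returns the list of issues found (empty when no check ran or none found).
    """
    if not chapters_written or len(chapters_written) % CHECK_INTERVAL != 0:
        return []
    issues = analyze_consistency(chapters_written)
    for issue in issues:
        log(f"[consistency] mid-generation issue: {issue}")
    return issues
```

Returning the issues (rather than raising) leaves the pause-for-review decision to the caller, matching the "log warning and optionally pause" behavior described above.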
---
## Experiment 6: Iterative Persona Validation
**Alt Reference:** Alt 1-C
**Hypothesis:** Validating the initial persona with a sample passage reduces voice-drift rewrites in Phase 3.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Persona creation | Single-pass, no validation | Generate persona → generate 200-word sample → evaluate → accept if ≥ 7/10, else regenerate (max 3 attempts) |
| Everything else | Identical | Identical |
### Metrics to Measure
- Initial persona acceptance rate (how often does first-pass persona pass the check?)
- Phase 3 persona-related rewrite rate (rewrites where critique mentions "voice inconsistency" or "doesn't match persona")
- HQS for first 5 chapters (voice is most important early on)
### Success Criterion
- Phase 3 persona-related rewrite rate drops ≥ 20% in B
- HQS for first 5 chapters improves ≥ 0.2
### Implementation Notes
- Modify `story/style_persona.py`: after `create_initial_persona()`, call a new `validate_persona()` function
- `validate_persona()` generates 200-word sample, evaluates with `evaluate_chapter_quality()` (light version)
- Estimated implementation: 2 hours
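The generate → sample → evaluate → regenerate loop from the setup table can be sketched as below. The callable parameters stand in for `create_initial_persona()`, the 200-word sample generator, and the light `evaluate_chapter_quality()` pass; their signatures here are assumptions.

```python
MAX_ATTEMPTS = 3    # regeneration cap from the setup table
ACCEPT_SCORE = 7.0  # accept the persona once its sample scores >= 7/10

def create_validated_persona(create_persona, write_sample, score_sample):
    """Regenerate the persona until a 200-word sample passes the quality check."""
    persona = None
    for _ in range(MAX_ATTEMPTS):
        persona = create_persona()
        sample = write_sample(persona, word_count=200)
        if score_sample(sample) >= ACCEPT_SCORE:
            return persona
    return persona  # best effort: keep the last attempt even below threshold
```

Falling back to the last attempt (rather than failing the run) keeps treatment B strictly additive over the control's single-pass behavior.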
---
## Experiment 7: Two-Pass Drafting (Draft + Polish)
**Alt Reference:** Alt 3-A
**Hypothesis:** A cheap rough draft followed by a polished revision produces better quality than iterative retrying.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Drafting strategy | Single draft → evaluate → retry | Rough draft (Flash) → polish (Pro) → evaluate → accept if ≥ 7.0 (max 1 retry) |
| Max retry attempts | 3 | 1 (after polish) |
| Everything else | Identical | Identical |
### Metrics to Measure
- CPC (A vs B)
- HQS (A vs B)
- Rate of chapters needing retry (A vs B)
- Total generation time per book
### Success Criterion
- HQS improvement ≥ 0.3 in B with no cost increase
- OR: CPC reduction ≥ 20% in B with no HQS decrease
### Implementation Notes
- Modify `story/writer.py` `write_chapter()`: add polish pass using Pro model after initial draft
- Reduce max_attempts to 1 for final retry (after polish)
- This requires Pro model to be available (handled by auto-selection)
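The treatment-B flow above can be sketched as a short control loop; `draft_flash` and `polish_pro` stand in for the Flash and Pro model calls, and the single post-polish retry matches the "max 1 retry" row in the setup table. All names here are illustrative, not the real `write_chapter()` signature.

```python
PASSING = 7.0  # acceptance threshold from the setup table

def two_pass_chapter(draft_flash, polish_pro, evaluate, max_retries=1):
    """Cheap rough draft, expensive polish, at most one full retry."""
    polished = None
    for _ in range(max_retries + 1):
        rough = draft_flash()          # fast, low-cost first pass
        polished = polish_pro(rough)   # quality pass on the stronger model
        if evaluate(polished) >= PASSING:
            return polished
    return polished  # accept the last attempt rather than loop forever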
---
## Experiment Execution Order
Run experiments in this order to minimize dependency conflicts:
1. **Exp 1** (Persona Caching) — independent, 30 min, no risk
2. **Exp 2** (Skip Beat Expansion) — independent, 15 min, no risk
3. **Exp 4** (Adaptive Thresholds) — independent, 1 hr, low risk
4. **Exp 3** (Outline Validation) — independent, 2 hrs, low risk
5. **Exp 6** (Persona Validation) — independent, 2 hrs, low risk
6. **Exp 5** (Mid-gen Consistency) — requires stable Phase 3, 1 hr, low risk
7. **Exp 7** (Two-Pass Drafting) — highest risk, run last; 3 hrs, medium risk
---
## Success Metrics Definitions
### Cost per Chapter (CPC)
```
CPC = (total_input_tokens × input_price + total_output_tokens × output_price) / num_chapters
```
Measure in both USD and token-count to separate model-price effects from efficiency effects.
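A direct transcription of the formula above; the token counts and per-token prices in the usage example are placeholders, not real model pricing.

```python
def cost_per_chapter(total_input_tokens, total_output_tokens,
                     input_price, output_price, num_chapters):
    """CPC in USD, given per-token prices."""
    total_usd = (total_input_tokens * input_price
                 + total_output_tokens * output_price)
    return total_usd / num_chapters
```

For example, a 30-chapter run with 3M input tokens at $1.25/M and 900K output tokens at $5/M works out to `cost_per_chapter(3_000_000, 900_000, 1.25e-6, 5e-6, 30)` ≈ $0.275 per chapter.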
### Human Quality Score (HQS)
Blind evaluation by a human reviewer:
1. Read 3 chapters from treatment A and 3 from treatment B (same book premise)
2. Score each on: prose quality (1-5), pacing (1-5), character consistency (1-5)
3. HQS = average across all dimensions, normalized to 1-10
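The aggregation above can be sketched as follows; the linear rescale from the 1-5 dimension scale onto 1-10 is an assumed normalization, since the protocol does not pin one down.

```python
def human_quality_score(chapter_scores):
    """chapter_scores: list of (prose, pacing, consistency) tuples, each 1-5.

    Averages across all chapters and dimensions, then maps 1-5 linearly to 1-10.
    """
    flat = [score for chapter in chapter_scores for score in chapter]
    mean_1_to_5 = sum(flat) / len(flat)
    return 1.0 + (mean_1_to_5 - 1.0) * 9.0 / 4.0  # 1 -> 1, 5 -> 10
```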
### Continuity Error Rate (CER)
After generation, manually review character states and key plot facts across chapters. Count:
- Character location contradictions
- Continuity breaks (held items, injuries, time-of-day)
- Plot event contradictions (character alive vs. dead)
Report as errors per 10 chapters.
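The rollup is a one-liner; this sketch just fixes the normalization, with the three counts coming from the manual review categories above.

```python
def continuity_error_rate(location_errors, continuity_breaks,
                          plot_contradictions, num_chapters):
    """CER: total continuity errors normalized to errors per 10 chapters."""
    total = location_errors + continuity_breaks + plot_contradictions
    return total / num_chapters * 10
```

For example, 2 location contradictions, 1 continuity break, and 1 plot contradiction across a 20-chapter book give a CER of 2.0 errors per 10 chapters.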