bookapp/docs/experiment_design.md
Mike Wichers 2100ca2312 feat: Implement ai_blueprint.md action plan — architectural review & optimisations
Steps 1–7 of the ai_blueprint.md action plan executed:

DOCUMENTATION (Steps 1–3, 6–7):
- docs/current_state_analysis.md: Phase-by-phase cost/quality mapping of existing pipeline
- docs/alternatives_analysis.md: 15 alternative approaches with testable hypotheses
- docs/experiment_design.md: 7 controlled A/B experiment specifications (CPC, HQS, CER metrics)
- ai_blueprint_v2.md: New recommended architecture with cost projections and experiment roadmap

CODE IMPROVEMENTS (Step 4 — Experiments 1–4 implemented):
- story/writer.py: Extract build_persona_info() — persona loaded once per book, not per chapter
- story/writer.py: Adaptive scoring thresholds — SCORE_PASSING scales 6.5→7.5 by chapter position
- story/writer.py: Beat expansion skip — if beats >100 words, skip Director's Treatment expansion
- story/planner.py: validate_outline() — pre-generation gate checks missing beats, continuity, pacing
- story/planner.py: Enrichment field validation — warn on missing title/genre after enrich()
- cli/engine.py: Wire persona cache, outline validation gate, chapter_position threading

Expected savings: ~285K tokens per 30-chapter novel (~7% cost reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 22:01:30 -05:00


Experiment Design: A/B Tests for BookApp Optimization

Date: 2026-02-22
Status: Completed — fulfills Action Plan Step 3


Methodology

All experiments follow a controlled A/B design. We hold all variables constant except the single variable under test. Success is measured against three primary metrics:

  • Cost per chapter (CPC): Total token cost / number of chapters written
  • Human Quality Score (HQS): 1–10 score from a human reviewer blind to which variant generated the chapter
  • Continuity Error Rate (CER): Number of plot/character contradictions per 10 chapters (lower is better)

Each experiment runs on the same 3 prompts (one each of short story, novella, and novel length). Results are averaged across all 3.

Baseline: Current production configuration as of 2026-02-22.


Experiment 1: Persona Caching

Alt Reference: Alt 3-D
Hypothesis: Caching persona per book reduces I/O overhead with no quality impact.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Persona loading | Re-read from disk each chapter | Load once per book run, pass as argument |
| Everything else | Identical | Identical |

Metrics to Measure

  • Token count per chapter (to verify savings)
  • Wall-clock generation time per book
  • Chapter quality scores (should be identical)

Success Criterion

  • Token reduction ≥ 2,000 tokens/chapter on books with sample files
  • HQS difference < 0.1 between A and B (no quality impact)
  • Zero new errors introduced

Implementation Notes

  • Modify cli/engine.py: call style_persona.load_persona_data() once before chapter loop
  • Modify story/writer.py: accept optional persona_info parameter, skip disk reads if provided
  • Estimated implementation: 30 minutes
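The two changes above can be sketched together; `load_persona_data` and `write_chapter` follow the names in this document, but the signatures, return shape, and persona contents below are assumptions for illustration:

```python
# Hypothetical sketch of the persona-caching change (Experiment 1).
# The real load_persona_data() reads persona files from disk.

def load_persona_data(folder):
    # Stand-in for the disk read performed in story/style_persona.py.
    return {"voice": "wry, first-person", "folder": folder}

def write_chapter(chapter_num, persona_info=None, folder="book/"):
    # Treatment B: skip the disk read when the caller supplies the persona.
    if persona_info is None:
        persona_info = load_persona_data(folder)  # Control A: re-read each chapter
    return f"ch{chapter_num} in voice: {persona_info['voice']}"

# Treatment B in cli/engine.py: load once, thread through the chapter loop.
persona = load_persona_data("book/")
chapters = [write_chapter(n, persona_info=persona) for n in range(1, 4)]
```

With this shape, Control A is simply calling `write_chapter(n)` with no `persona_info`, so the same code path serves both arms of the experiment.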

Experiment 2: Skip Beat Expansion for Detailed Beats

Alt Reference: Alt 3-E
Hypothesis: Skipping expand_beats_to_treatment() when beats exceed 100 words saves tokens with no quality loss.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Beat expansion | Always called | Skipped if total beats > 100 words |
| Everything else | Identical | Identical |

Metrics to Measure

  • Percentage of chapters that skip expansion (expected: ~30%)
  • Token savings per book
  • HQS for chapters that skip vs. chapters that don't skip
  • Rate of beat-coverage failures (chapters that miss a required beat)

Success Criterion

  • ≥ 25% of chapters skip expansion (validating hypothesis)
  • HQS difference < 0.2 between chapters that skip and those that don't
  • Beat-coverage failure rate unchanged

Implementation Notes

  • Modify story/writer.py write_chapter(): add a guard such as if sum(len(b.split()) for b in beats) > 100 before calling expansion (len(b) alone would count characters, not words)
  • Estimated implementation: 15 minutes
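A minimal sketch of the guard, assuming beats arrive as a list of strings (the helper names are invented for illustration):

```python
def total_beat_words(beats):
    # Word count across all beats for the chapter.
    return sum(len(b.split()) for b in beats)

def needs_expansion(beats, threshold=100):
    # Treatment B: call expand_beats_to_treatment() only when the beats
    # are sparse; detailed beats (> threshold words total) skip it.
    return total_beat_words(beats) <= threshold
```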

Experiment 3: Outline Validation Gate

Alt Reference: Alt 2-B
Hypothesis: Pre-generation outline validation prevents costly Phase 3 rewrites by catching plot holes at the outline stage.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Outline validation | None | Run validate_outline() after create_chapter_plan(); block if critical issues found |
| Everything else | Identical | Identical |

Metrics to Measure

  • Number of critical outline issues flagged per run
  • Rewrite rate during Phase 3 (did validation prevent rewrites?)
  • Phase 3 token cost difference (A vs B)
  • CER difference (did validation reduce continuity errors?)

Success Criterion

  • Validation blocks at least 1 critical issue per 3 runs
  • Phase 3 rewrite rate drops ≥ 15% when validation is active
  • CER improves ≥ 0.5 per 10 chapters

Implementation Notes

  • Add validate_outline(events, chapters, bp, folder) to story/planner.py
  • Prompt: "Review this chapter plan for: (1) missing required plot beats, (2) character deaths/revivals without explanation, (3) severe pacing imbalances, (4) POV character inconsistency. Return: {issues: [...], severity: 'critical'|'warning'|'ok'}"
  • Modify cli/engine.py: call validate_outline() and log issues before Phase 3 begins
  • Estimated implementation: 2 hours
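The gate logic can be sketched as follows. In the real pipeline the review prompt above goes to a model; the structural checks here are stand-ins, and the chapter-dict keys (`beats`, `pov`) are assumptions:

```python
# Hypothetical sketch of the outline validation gate (Experiment 3).

def validate_outline(chapters):
    """Return {issues: [...], severity: 'critical'|'warning'|'ok'}."""
    issues = []
    for i, ch in enumerate(chapters, start=1):
        if not ch.get("beats"):
            issues.append({"chapter": i,
                           "problem": "missing required plot beats",
                           "severity": "critical"})
        if ch.get("pov") is None:
            issues.append({"chapter": i,
                           "problem": "POV character not assigned",
                           "severity": "warning"})
    if any(i["severity"] == "critical" for i in issues):
        severity = "critical"
    elif issues:
        severity = "warning"
    else:
        severity = "ok"
    return {"issues": issues, "severity": severity}

def gate_phase3(chapters):
    # Treatment B: log issues and block Phase 3 on critical findings.
    report = validate_outline(chapters)
    if report["severity"] == "critical":
        raise ValueError(f"outline blocked: {report['issues']}")
    return report
```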

Experiment 4: Adaptive Scoring Thresholds

Alt Reference: Alt 3-B
Hypothesis: Lowering SCORE_PASSING for early setup chapters reduces refinement cost while maintaining quality on high-stakes scenes.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| SCORE_AUTO_ACCEPT | 8.0 (all chapters) | 8.0 (all chapters) |
| SCORE_PASSING | 7.0 (all chapters) | 6.5 (ch 1–20%), 7.0 (ch 20–70%), 7.5 (ch 70–100%) |
| Everything else | Identical | Identical |

Metrics to Measure

  • Refinement pass count per chapter position bucket
  • HQS per chapter position bucket (A vs B)
  • CPC for each bucket
  • Overall HQS for full book (A vs B)

Success Criterion

  • Setup chapters (1–20%): ≥ 20% fewer refinement passes in B
  • Climax chapters (70–100%): HQS improvement ≥ 0.3 in B
  • Full book HQS unchanged or improved

Implementation Notes

  • Modify story/writer.py write_chapter(): accept chapter_position (0.0–1.0 float)
  • Compute adaptive threshold: passing = 6.5 + position * 1.0 (linear scaling)
  • Modify cli/engine.py: pass chapter_num / total_chapters to write_chapter()
  • Estimated implementation: 1 hour
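The linear scaling from the notes above fits in a few lines; the function name is invented for illustration:

```python
def passing_threshold(chapter_num, total_chapters):
    # Linear scaling: SCORE_PASSING = 6.5 at the start of the book,
    # rising to 7.5 at the end (position = chapter_num / total_chapters).
    position = max(0.0, min(1.0, chapter_num / total_chapters))
    return 6.5 + position * 1.0
```

A linear ramp approximates the three buckets in the table: roughly 6.5 for early setup chapters, about 7.0 near the midpoint, and 7.5 approaching the climax.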

Experiment 5: Mid-Generation Consistency Snapshots

Alt Reference: Alt 4-B
Hypothesis: Running analyze_consistency() every 10 chapters reduces post-generation CER without significant cost increase.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Consistency check | Post-generation only | Every 10 chapters + post-generation |
| Everything else | Identical | Identical |

Metrics to Measure

  • CER post-generation (A vs B)
  • Number of issues caught mid-generation vs post-generation
  • Token cost difference (mid-gen checks add ~25K × N/10 tokens)
  • Generation time difference

Success Criterion

  • Post-generation CER drops ≥ 30% in B
  • Issues caught mid-generation prevent at least 1 expensive post-gen ripple propagation per run
  • Additional cost ≤ $0.01 per book (all free on Pro-Exp)

Implementation Notes

  • Modify cli/engine.py: every 10 chapters, call analyze_consistency() on written chapters so far
  • If issues found: log warning and optionally pause for user review
  • Estimated implementation: 1 hour
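The snapshot loop can be sketched as follows; `analyze_consistency` and the chapter generation are stand-ins for the real calls, and the loop shape in cli/engine.py is an assumption:

```python
# Hypothetical sketch of the mid-generation snapshot loop (Experiment 5).

def analyze_consistency(chapters):
    # Real version asks a model for contradictions; this stub finds none.
    return []

def generate_book(total_chapters, check_every=10):
    written, warnings = [], []
    for n in range(1, total_chapters + 1):
        written.append(f"chapter {n}")
        if n % check_every == 0:
            # Treatment B: check everything written so far.
            issues = analyze_consistency(written)
            if issues:
                warnings.append((n, issues))  # log; optionally pause for review
    return written, warnings
```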

Experiment 6: Iterative Persona Validation

Alt Reference: Alt 1-C
Hypothesis: Validating the initial persona with a sample passage reduces voice-drift rewrites in Phase 3.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Persona creation | Single-pass, no validation | Generate persona → generate 200-word sample → evaluate → accept if ≥ 7/10, else regenerate (max 3 attempts) |
| Everything else | Identical | Identical |

Metrics to Measure

  • Initial persona acceptance rate (how often does first-pass persona pass the check?)
  • Phase 3 persona-related rewrite rate (rewrites where critique mentions "voice inconsistency" or "doesn't match persona")
  • HQS for first 5 chapters (voice is most important early on)

Success Criterion

  • Phase 3 persona-related rewrite rate drops ≥ 20% in B
  • HQS for first 5 chapters improves ≥ 0.2

Implementation Notes

  • Modify story/style_persona.py: after create_initial_persona(), call a new validate_persona() function
  • validate_persona() generates 200-word sample, evaluates with evaluate_chapter_quality() (light version)
  • Estimated implementation: 2 hours
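The regenerate loop can be sketched as below; `create`, `sample`, and `evaluate` stand in for persona generation, the 200-word sample draft, and the light scoring call, and the callback-based shape is an assumption:

```python
# Hypothetical sketch of the persona validation loop (Experiment 6).

def validated_persona(create, sample, evaluate, passing=7.0, max_attempts=3):
    persona, score = None, 0.0
    for _ in range(max_attempts):
        persona = create()                 # new candidate persona
        score = evaluate(sample(persona))  # score a 200-word sample
        if score >= passing:
            return persona, score
    return persona, score  # keep the last attempt if none passed
```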

Experiment 7: Two-Pass Drafting (Draft + Polish)

Alt Reference: Alt 3-A
Hypothesis: A cheap rough draft followed by a polished revision produces better quality than iterative retrying.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Drafting strategy | Single draft → evaluate → retry | Rough draft (Flash) → polish (Pro) → evaluate → accept if ≥ 7.0 (max 1 retry) |
| Max retry attempts | 3 | 1 (after polish) |
| Everything else | Identical | Identical |

Metrics to Measure

  • CPC (A vs B)
  • HQS (A vs B)
  • Rate of chapters needing retry (A vs B)
  • Total generation time per book

Success Criterion

  • HQS improvement ≥ 0.3 in B with no cost increase
  • OR: CPC reduction ≥ 20% in B with no HQS decrease

Implementation Notes

  • Modify story/writer.py write_chapter(): add polish pass using Pro model after initial draft
  • Reduce max_attempts to 1 for final retry (after polish)
  • This requires Pro model to be available (handled by auto-selection)
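The treatment arm can be sketched as follows; `rough_draft` (Flash) and `polish` (Pro) are stand-ins for the real model calls, and the callback-based shape is an assumption:

```python
# Hypothetical sketch of two-pass drafting (Experiment 7, Treatment B).

def write_chapter_two_pass(rough_draft, polish, evaluate, passing=7.0):
    draft = polish(rough_draft())   # cheap rough draft, then polish pass
    score = evaluate(draft)
    if score < passing:             # at most one retry after the polish
        draft = polish(rough_draft())
        score = evaluate(draft)
    return draft, score
```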

Experiment Execution Order

Run experiments in this order to minimize dependency conflicts:

  1. Exp 1 (Persona Caching) — independent, 30 min, no risk
  2. Exp 2 (Skip Beat Expansion) — independent, 15 min, no risk
  3. Exp 4 (Adaptive Thresholds) — independent, 1 hr, low risk
  4. Exp 3 (Outline Validation) — independent, 2 hrs, low risk
  5. Exp 6 (Persona Validation) — independent, 2 hrs, low risk
  6. Exp 5 (Mid-gen Consistency) — requires stable Phase 3, 1 hr, low risk
  7. Exp 7 (Two-Pass Drafting) — riskiest of the set, so run last; 3 hrs, medium risk

Success Metrics Definitions

Cost per Chapter (CPC)

CPC = (total_input_tokens × input_price + total_output_tokens × output_price) / num_chapters

Measure in both USD and token-count to separate model-price effects from efficiency effects.
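As a worked sketch of the formula above (the per-token prices below are illustrative placeholders, not real model rates):

```python
def cost_per_chapter(input_tokens, output_tokens, num_chapters,
                     input_price=1.0e-6, output_price=4.0e-6):
    # Returns (USD per chapter, tokens per chapter) so model-price
    # effects and efficiency effects can be reported separately.
    usd = input_tokens * input_price + output_tokens * output_price
    tokens = (input_tokens + output_tokens) / num_chapters
    return usd / num_chapters, tokens
```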

Human Quality Score (HQS)

Blind evaluation by a human reviewer:

  1. Read 3 chapters from treatment A and 3 from treatment B (same book premise)
  2. Score each on: prose quality (1–5), pacing (1–5), character consistency (1–5)
  3. HQS = average across all dimensions, normalized to 1–10

Continuity Error Rate (CER)

After generation, manually review character states and key plot facts across chapters. Count:

  • Character location contradictions
  • Continuity breaks (held items, injuries, time-of-day)
  • Plot event contradictions (character alive vs. dead)

Report as errors per 10 chapters.