bookapp/docs/experiment_design.md
Mike Wichers 2100ca2312 feat: Implement ai_blueprint.md action plan — architectural review & optimisations
Steps 1–7 of the ai_blueprint.md action plan executed:

DOCUMENTATION (Steps 1–3, 6–7):
- docs/current_state_analysis.md: Phase-by-phase cost/quality mapping of existing pipeline
- docs/alternatives_analysis.md: 15 alternative approaches with testable hypotheses
- docs/experiment_design.md: 7 controlled A/B experiment specifications (CPC, HQS, CER metrics)
- ai_blueprint_v2.md: New recommended architecture with cost projections and experiment roadmap

CODE IMPROVEMENTS (Step 4 — Experiments 1–4 implemented):
- story/writer.py: Extract build_persona_info() — persona loaded once per book, not per chapter
- story/writer.py: Adaptive scoring thresholds — SCORE_PASSING scales 6.5→7.5 by chapter position
- story/writer.py: Beat expansion skip — if beats >100 words, skip Director's Treatment expansion
- story/planner.py: validate_outline() — pre-generation gate checks missing beats, continuity, pacing
- story/planner.py: Enrichment field validation — warn on missing title/genre after enrich()
- cli/engine.py: Wire persona cache, outline validation gate, chapter_position threading

Expected savings: ~285K tokens per 30-chapter novel (~7% cost reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 22:01:30 -05:00


Experiment Design: A/B Tests for BookApp Optimization

Date: 2026-02-22
Status: Completed — fulfills Action Plan Step 3


Methodology

All experiments follow a controlled A/B design. We hold all variables constant except the single variable under test. Success is measured against three primary metrics:

  • Cost per chapter (CPC): Total token cost / number of chapters written
  • Human Quality Score (HQS): 1–10 score from a human reviewer blind to which variant generated the chapter
  • Continuity Error Rate (CER): Number of plot/character contradictions per 10 chapters (lower is better)

Each experiment runs on the same 3 prompts (one each of short story, novella, and novel length). Results are averaged across all 3.

Baseline: Current production configuration as of 2026-02-22.


Experiment 1: Persona Caching

Alt Reference: Alt 3-D
Hypothesis: Caching persona per book reduces I/O overhead with no quality impact.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Persona loading | Re-read from disk each chapter | Load once per book run, pass as argument |
| Everything else | Identical | Identical |

Metrics to Measure

  • Token count per chapter (to verify savings)
  • Wall-clock generation time per book
  • Chapter quality scores (should be identical)

Success Criterion

  • Token reduction ≥ 2,000 tokens/chapter on books with sample files
  • HQS difference < 0.1 between A and B (no quality impact)
  • Zero new errors introduced

Implementation Notes

  • Modify cli/engine.py: call style_persona.load_persona_data() once before chapter loop
  • Modify story/writer.py: accept optional persona_info parameter, skip disk reads if provided
  • Estimated implementation: 30 minutes
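The two changes above can be sketched together; `load_persona_data` and `write_chapter` follow the names in this document, but the signatures, return shape, and persona contents below are assumptions for illustration:

```python
# Hypothetical sketch of the persona-caching change (Experiment 1).
# The real load_persona_data() reads persona files from disk.

def load_persona_data(folder):
    # Stand-in for the disk read performed in story/style_persona.py.
    return {"voice": "wry, first-person", "folder": folder}

def write_chapter(chapter_num, persona_info=None, folder="book/"):
    # Treatment B: skip the disk read when the caller supplies the persona.
    if persona_info is None:
        persona_info = load_persona_data(folder)  # Control A: re-read each chapter
    return f"ch{chapter_num} in voice: {persona_info['voice']}"

# Treatment B in cli/engine.py: load once, thread through the chapter loop.
persona = load_persona_data("book/")
chapters = [write_chapter(n, persona_info=persona) for n in range(1, 4)]
```

With this shape, Control A is simply calling `write_chapter(n)` with no `persona_info`, so the same code path serves both arms of the experiment.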

Experiment 2: Skip Beat Expansion for Detailed Beats

Alt Reference: Alt 3-E
Hypothesis: Skipping expand_beats_to_treatment() when beats exceed 100 words saves tokens with no quality loss.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Beat expansion | Always called | Skipped if total beats > 100 words |
| Everything else | Identical | Identical |

Metrics to Measure

  • Percentage of chapters that skip expansion (expected: ~30%)
  • Token savings per book
  • HQS for chapters that skip vs. chapters that don't skip
  • Rate of beat-coverage failures (chapters that miss a required beat)

Success Criterion

  • ≥ 25% of chapters skip expansion (validating hypothesis)
  • HQS difference < 0.2 between chapters that skip and those that don't
  • Beat-coverage failure rate unchanged

Implementation Notes

  • Modify story/writer.py write_chapter(): add a guard such as if sum(len(b.split()) for b in beats) > 100 before calling expansion (len(b) alone would count characters, not words)
  • Estimated implementation: 15 minutes
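A minimal sketch of the guard, assuming beats arrive as a list of strings (the helper names are invented for illustration):

```python
def total_beat_words(beats):
    # Word count across all beats for the chapter.
    return sum(len(b.split()) for b in beats)

def needs_expansion(beats, threshold=100):
    # Treatment B: call expand_beats_to_treatment() only when the beats
    # are sparse; detailed beats (> threshold words total) skip it.
    return total_beat_words(beats) <= threshold
```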

Experiment 3: Outline Validation Gate

Alt Reference: Alt 2-B
Hypothesis: Pre-generation outline validation prevents costly Phase 3 rewrites by catching plot holes at the outline stage.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Outline validation | None | Run validate_outline() after create_chapter_plan(); block if critical issues found |
| Everything else | Identical | Identical |

Metrics to Measure

  • Number of critical outline issues flagged per run
  • Rewrite rate during Phase 3 (did validation prevent rewrites?)
  • Phase 3 token cost difference (A vs B)
  • CER difference (did validation reduce continuity errors?)

Success Criterion

  • Validation blocks at least 1 critical issue per 3 runs
  • Phase 3 rewrite rate drops ≥ 15% when validation is active
  • CER improves ≥ 0.5 per 10 chapters

Implementation Notes

  • Add validate_outline(events, chapters, bp, folder) to story/planner.py
  • Prompt: "Review this chapter plan for: (1) missing required plot beats, (2) character deaths/revivals without explanation, (3) severe pacing imbalances, (4) POV character inconsistency. Return: {issues: [...], severity: 'critical'|'warning'|'ok'}"
  • Modify cli/engine.py: call validate_outline() and log issues before Phase 3 begins
  • Estimated implementation: 2 hours
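The gate logic can be sketched as follows. In the real pipeline the review prompt above goes to a model; the structural checks here are stand-ins, and the chapter-dict keys (`beats`, `pov`) are assumptions:

```python
# Hypothetical sketch of the outline validation gate (Experiment 3).

def validate_outline(chapters):
    """Return {issues: [...], severity: 'critical'|'warning'|'ok'}."""
    issues = []
    for i, ch in enumerate(chapters, start=1):
        if not ch.get("beats"):
            issues.append({"chapter": i,
                           "problem": "missing required plot beats",
                           "severity": "critical"})
        if ch.get("pov") is None:
            issues.append({"chapter": i,
                           "problem": "POV character not assigned",
                           "severity": "warning"})
    if any(i["severity"] == "critical" for i in issues):
        severity = "critical"
    elif issues:
        severity = "warning"
    else:
        severity = "ok"
    return {"issues": issues, "severity": severity}

def gate_phase3(chapters):
    # Treatment B: log issues and block Phase 3 on critical findings.
    report = validate_outline(chapters)
    if report["severity"] == "critical":
        raise ValueError(f"outline blocked: {report['issues']}")
    return report
```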

Experiment 4: Adaptive Scoring Thresholds

Alt Reference: Alt 3-B
Hypothesis: Lowering SCORE_PASSING for early setup chapters reduces refinement cost while maintaining quality on high-stakes scenes.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| SCORE_AUTO_ACCEPT | 8.0 (all chapters) | 8.0 (all chapters) |
| SCORE_PASSING | 7.0 (all chapters) | 6.5 (ch 1–20%), 7.0 (ch 20–70%), 7.5 (ch 70–100%) |
| Everything else | Identical | Identical |

Metrics to Measure

  • Refinement pass count per chapter position bucket
  • HQS per chapter position bucket (A vs B)
  • CPC for each bucket
  • Overall HQS for full book (A vs B)

Success Criterion

  • Setup chapters (1–20%): ≥ 20% fewer refinement passes in B
  • Climax chapters (70–100%): HQS improvement ≥ 0.3 in B
  • Full book HQS unchanged or improved

Implementation Notes

  • Modify story/writer.py write_chapter(): accept chapter_position (0.0–1.0 float)
  • Compute adaptive threshold: passing = 6.5 + position * 1.0 (linear scaling)
  • Modify cli/engine.py: pass chapter_num / total_chapters to write_chapter()
  • Estimated implementation: 1 hour
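The linear scaling from the notes above fits in a few lines; the function name is invented for illustration:

```python
def passing_threshold(chapter_num, total_chapters):
    # Linear scaling: SCORE_PASSING = 6.5 at the start of the book,
    # rising to 7.5 at the end (position = chapter_num / total_chapters).
    position = max(0.0, min(1.0, chapter_num / total_chapters))
    return 6.5 + position * 1.0
```

A linear ramp approximates the three buckets in the table: roughly 6.5 for early setup chapters, about 7.0 near the midpoint, and 7.5 approaching the climax.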

Experiment 5: Mid-Generation Consistency Snapshots

Alt Reference: Alt 4-B
Hypothesis: Running analyze_consistency() every 10 chapters reduces post-generation CER without significant cost increase.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Consistency check | Post-generation only | Every 10 chapters + post-generation |
| Everything else | Identical | Identical |

Metrics to Measure

  • CER post-generation (A vs B)
  • Number of issues caught mid-generation vs post-generation
  • Token cost difference (mid-gen checks add ~25K × N/10 tokens)
  • Generation time difference

Success Criterion

  • Post-generation CER drops ≥ 30% in B
  • Issues caught mid-generation prevent at least 1 expensive post-gen ripple propagation per run
  • Additional cost ≤ $0.01 per book (all free on Pro-Exp)

Implementation Notes

  • Modify cli/engine.py: every 10 chapters, call analyze_consistency() on written chapters so far
  • If issues found: log warning and optionally pause for user review
  • Estimated implementation: 1 hour
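The snapshot loop can be sketched as follows; `analyze_consistency` and the chapter generation are stand-ins for the real calls, and the loop shape in cli/engine.py is an assumption:

```python
# Hypothetical sketch of the mid-generation snapshot loop (Experiment 5).

def analyze_consistency(chapters):
    # Real version asks a model for contradictions; this stub finds none.
    return []

def generate_book(total_chapters, check_every=10):
    written, warnings = [], []
    for n in range(1, total_chapters + 1):
        written.append(f"chapter {n}")
        if n % check_every == 0:
            # Treatment B: check everything written so far.
            issues = analyze_consistency(written)
            if issues:
                warnings.append((n, issues))  # log; optionally pause for review
    return written, warnings
```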

Experiment 6: Iterative Persona Validation

Alt Reference: Alt 1-C
Hypothesis: Validating the initial persona with a sample passage reduces voice-drift rewrites in Phase 3.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Persona creation | Single-pass, no validation | Generate persona → generate 200-word sample → evaluate → accept if ≥ 7/10, else regenerate (max 3 attempts) |
| Everything else | Identical | Identical |

Metrics to Measure

  • Initial persona acceptance rate (how often does first-pass persona pass the check?)
  • Phase 3 persona-related rewrite rate (rewrites where critique mentions "voice inconsistency" or "doesn't match persona")
  • HQS for first 5 chapters (voice is most important early on)

Success Criterion

  • Phase 3 persona-related rewrite rate drops ≥ 20% in B
  • HQS for first 5 chapters improves ≥ 0.2

Implementation Notes

  • Modify story/style_persona.py: after create_initial_persona(), call a new validate_persona() function
  • validate_persona() generates 200-word sample, evaluates with evaluate_chapter_quality() (light version)
  • Estimated implementation: 2 hours
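The regenerate loop can be sketched as below; `create`, `sample`, and `evaluate` stand in for persona generation, the 200-word sample draft, and the light scoring call, and the callback-based shape is an assumption:

```python
# Hypothetical sketch of the persona validation loop (Experiment 6).

def validated_persona(create, sample, evaluate, passing=7.0, max_attempts=3):
    persona, score = None, 0.0
    for _ in range(max_attempts):
        persona = create()                 # new candidate persona
        score = evaluate(sample(persona))  # score a 200-word sample
        if score >= passing:
            return persona, score
    return persona, score  # keep the last attempt if none passed
```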

Experiment 7: Two-Pass Drafting (Draft + Polish)

Alt Reference: Alt 3-A
Hypothesis: A cheap rough draft followed by a polished revision produces better quality than iterative retrying.

Setup

| Parameter | Control (A) | Treatment (B) |
|---|---|---|
| Drafting strategy | Single draft → evaluate → retry | Rough draft (Flash) → polish (Pro) → evaluate → accept if ≥ 7.0 (max 1 retry) |
| Max retry attempts | 3 | 1 (after polish) |
| Everything else | Identical | Identical |

Metrics to Measure

  • CPC (A vs B)
  • HQS (A vs B)
  • Rate of chapters needing retry (A vs B)
  • Total generation time per book

Success Criterion

  • HQS improvement ≥ 0.3 in B with no cost increase
  • OR: CPC reduction ≥ 20% in B with no HQS decrease

Implementation Notes

  • Modify story/writer.py write_chapter(): add polish pass using Pro model after initial draft
  • Reduce max_attempts to 1 for final retry (after polish)
  • This requires Pro model to be available (handled by auto-selection)
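The treatment arm can be sketched as follows; `rough_draft` (Flash) and `polish` (Pro) are stand-ins for the real model calls, and the callback-based shape is an assumption:

```python
# Hypothetical sketch of two-pass drafting (Experiment 7, Treatment B).

def write_chapter_two_pass(rough_draft, polish, evaluate, passing=7.0):
    draft = polish(rough_draft())   # cheap rough draft, then polish pass
    score = evaluate(draft)
    if score < passing:             # at most one retry after the polish
        draft = polish(rough_draft())
        score = evaluate(draft)
    return draft, score
```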

Experiment Execution Order

Run experiments in this order to minimize dependency conflicts:

  1. Exp 1 (Persona Caching) — independent, 30 min, no risk
  2. Exp 2 (Skip Beat Expansion) — independent, 15 min, no risk
  3. Exp 4 (Adaptive Thresholds) — independent, 1 hr, low risk
  4. Exp 3 (Outline Validation) — independent, 2 hrs, low risk
  5. Exp 6 (Persona Validation) — independent, 2 hrs, low risk
  6. Exp 5 (Mid-gen Consistency) — requires stable Phase 3, 1 hr, low risk
  7. Exp 7 (Two-Pass Drafting) — riskiest of the set, so run last; 3 hrs, medium risk

Success Metrics Definitions

Cost per Chapter (CPC)

CPC = (total_input_tokens × input_price + total_output_tokens × output_price) / num_chapters

Measure in both USD and token-count to separate model-price effects from efficiency effects.
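As a worked sketch of the formula above (the per-token prices below are illustrative placeholders, not real model rates):

```python
def cost_per_chapter(input_tokens, output_tokens, num_chapters,
                     input_price=1.0e-6, output_price=4.0e-6):
    # Returns (USD per chapter, tokens per chapter) so model-price
    # effects and efficiency effects can be reported separately.
    usd = input_tokens * input_price + output_tokens * output_price
    tokens = (input_tokens + output_tokens) / num_chapters
    return usd / num_chapters, tokens
```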

Human Quality Score (HQS)

Blind evaluation by a human reviewer:

  1. Read 3 chapters from treatment A and 3 from treatment B (same book premise)
  2. Score each on: prose quality (1–5), pacing (1–5), character consistency (1–5)
  3. HQS = average across all dimensions, normalized to 1–10

Continuity Error Rate (CER)

After generation, manually review character states and key plot facts across chapters. Count:

  • Character location contradictions
  • Continuity breaks (held items, injuries, time-of-day)
  • Plot event contradictions (character alive vs. dead)

Report as errors per 10 chapters.