bookapp/docs/experiment_design.md
Mike Wichers 2100ca2312 feat: Implement ai_blueprint.md action plan — architectural review & optimisations
Steps 1–7 of the ai_blueprint.md action plan executed:

DOCUMENTATION (Steps 1–3, 6–7):
- docs/current_state_analysis.md: Phase-by-phase cost/quality mapping of existing pipeline
- docs/alternatives_analysis.md: 15 alternative approaches with testable hypotheses
- docs/experiment_design.md: 7 controlled A/B experiment specifications (CPC, HQS, CER metrics)
- ai_blueprint_v2.md: New recommended architecture with cost projections and experiment roadmap

CODE IMPROVEMENTS (Step 4 — Experiments 1–4 implemented):
- story/writer.py: Extract build_persona_info() — persona loaded once per book, not per chapter
- story/writer.py: Adaptive scoring thresholds — SCORE_PASSING scales 6.5→7.5 by chapter position
- story/writer.py: Beat expansion skip — if beats >100 words, skip Director's Treatment expansion
- story/planner.py: validate_outline() — pre-generation gate checks missing beats, continuity, pacing
- story/planner.py: Enrichment field validation — warn on missing title/genre after enrich()
- cli/engine.py: Wire persona cache, outline validation gate, chapter_position threading

Expected savings: ~285K tokens per 30-chapter novel (~7% cost reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 22:01:30 -05:00


# Experiment Design: A/B Tests for BookApp Optimization
**Date:** 2026-02-22
**Status:** Completed — fulfills Action Plan Step 3
---
## Methodology
All experiments follow a controlled A/B design. We hold all variables constant except the single variable under test. Success is measured against three primary metrics:
- **Cost per chapter (CPC):** Total token cost / number of chapters written
- **Human Quality Score (HQS):** A 1–10 score from a human reviewer blind to which variant generated the chapter
- **Continuity Error Rate (CER):** Number of plot/character contradictions per 10 chapters (lower is better)
Each experiment runs on the same 3 prompts (one each of short story, novella, and novel length). Results are averaged across all 3.
**Baseline:** Current production configuration as of 2026-02-22.
---
## Experiment 1: Persona Caching
**Alt Reference:** Alt 3-D
**Hypothesis:** Caching persona per book reduces I/O overhead with no quality impact.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Persona loading | Re-read from disk each chapter | Load once per book run, pass as argument |
| Everything else | Identical | Identical |
### Metrics to Measure
- Token count per chapter (to verify savings)
- Wall-clock generation time per book
- Chapter quality scores (should be identical)
### Success Criterion
- Token reduction ≥ 2,000 tokens/chapter on books with sample files
- HQS difference < 0.1 between A and B (no quality impact)
- Zero new errors introduced
### Implementation Notes
- Modify `cli/engine.py`: call `style_persona.load_persona_data()` once before chapter loop
- Modify `story/writer.py`: accept optional `persona_info` parameter, skip disk reads if provided
- Estimated implementation: 30 minutes
---
## Experiment 2: Skip Beat Expansion for Detailed Beats
**Alt Reference:** Alt 3-E
**Hypothesis:** Skipping `expand_beats_to_treatment()` when beats exceed 100 words saves tokens with no quality loss.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Beat expansion | Always called | Skipped if total beats > 100 words |
| Everything else | Identical | Identical |
### Metrics to Measure
- Percentage of chapters that skip expansion (expected: ~30%)
- Token savings per book
- HQS for chapters that skip vs. chapters that don't skip
- Rate of beat-coverage failures (chapters that miss a required beat)
### Success Criterion
- ≥ 25% of chapters skip expansion (validating hypothesis)
- HQS difference < 0.2 between chapters that skip and those that don't
- Beat-coverage failure rate unchanged
### Implementation Notes
- Modify `story/writer.py` `write_chapter()`: add an `if sum(len(b.split()) for b in beats) > 100` guard (word count, not character count) before calling expansion
- Estimated implementation: 15 minutes
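A minimal sketch of the guard, with `expand_beats_to_treatment()` stubbed for illustration. Note the word count uses `b.split()`; counting raw string length would measure characters, not the 100-word threshold the hypothesis states.

```python
# Hypothetical guard for Experiment 2: skip Director's Treatment expansion
# when the beats are already detailed (> 100 words total).
def expand_beats_to_treatment(beats: list[str]) -> str:
    # Stand-in for the real expansion call.
    return " / ".join(beats) + " (expanded)"

def treatment_for(beats: list[str]) -> str:
    word_count = sum(len(b.split()) for b in beats)
    if word_count > 100:
        # Treatment B: beats are detailed enough to write from directly.
        return "\n".join(beats)
    return expand_beats_to_treatment(beats)

short = ["Hero arrives.", "Conflict erupts."]
detailed = ["word " * 60, "word " * 60]  # 120 words total: expansion skipped
```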
---
## Experiment 3: Outline Validation Gate
**Alt Reference:** Alt 2-B
**Hypothesis:** Pre-generation outline validation prevents costly Phase 3 rewrites by catching plot holes at the outline stage.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Outline validation | None | Run `validate_outline()` after `create_chapter_plan()`; block if critical issues found |
| Everything else | Identical | Identical |
### Metrics to Measure
- Number of critical outline issues flagged per run
- Rewrite rate during Phase 3 (did validation prevent rewrites?)
- Phase 3 token cost difference (A vs B)
- CER difference (did validation reduce continuity errors?)
### Success Criterion
- Validation blocks at least 1 critical issue per 3 runs
- Phase 3 rewrite rate drops ≥ 15% when validation is active
- CER improves ≥ 0.5 per 10 chapters
### Implementation Notes
- Add `validate_outline(events, chapters, bp, folder)` to `story/planner.py`
- Prompt: "Review this chapter plan for: (1) missing required plot beats, (2) character deaths/revivals without explanation, (3) severe pacing imbalances, (4) POV character inconsistency. Return: {issues: [...], severity: 'critical'|'warning'|'ok'}"
- Modify `cli/engine.py`: call `validate_outline()` and log issues before Phase 3 begins
- Estimated implementation: 2 hours
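The gate's shape can be sketched like this. The LLM review described in the prompt above is replaced here by a toy beat-coverage check, and the argument names mirror the signature in the notes; the result schema (`issues`, `severity`) follows the prompt's contract.

```python
# Illustrative shape of the Experiment 3 validate_outline() gate.
# The real version would ask an LLM to review the plan; this toy check
# only flags required beats that no chapter schedules.
def validate_outline(events: list[str], chapters: list[dict],
                     bp=None, folder=None) -> dict:
    issues = []
    planned = {beat for ch in chapters for beat in ch.get("beats", [])}
    for beat in events:
        if beat not in planned:
            issues.append(f"missing required beat: {beat}")
    severity = "critical" if issues else "ok"
    return {"issues": issues, "severity": severity}

result = validate_outline(
    events=["inciting incident", "midpoint reversal"],
    chapters=[{"beats": ["inciting incident"]}],
)
# Engine gate: block Phase 3 on critical issues, log otherwise.
if result["severity"] == "critical":
    print("Blocking Phase 3:", result["issues"])
```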
---
## Experiment 4: Adaptive Scoring Thresholds
**Alt Reference:** Alt 3-B
**Hypothesis:** Lowering SCORE_PASSING for early setup chapters reduces refinement cost while maintaining quality on high-stakes scenes.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| SCORE_AUTO_ACCEPT | 8.0 (all chapters) | 8.0 (all chapters) |
| SCORE_PASSING | 7.0 (all chapters) | 6.5 (ch 1–20%), 7.0 (ch 20–70%), 7.5 (ch 70–100%) |
| Everything else | Identical | Identical |
### Metrics to Measure
- Refinement pass count per chapter position bucket
- HQS per chapter position bucket (A vs B)
- CPC for each bucket
- Overall HQS for full book (A vs B)
### Success Criterion
- Setup chapters (1–20%): ≥ 20% fewer refinement passes in B
- Climax chapters (70–100%): HQS improvement ≥ 0.3 in B
- Full book HQS unchanged or improved
### Implementation Notes
- Modify `story/writer.py` `write_chapter()`: accept `chapter_position` (0.0–1.0 float)
- Compute adaptive threshold: `passing = 6.5 + position * 1.0` (linear scaling)
- Modify `cli/engine.py`: pass `chapter_num / total_chapters` to `write_chapter()`
- Estimated implementation: 1 hour
---
## Experiment 5: Mid-Generation Consistency Snapshots
**Alt Reference:** Alt 4-B
**Hypothesis:** Running `analyze_consistency()` every 10 chapters reduces post-generation CER without significant cost increase.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Consistency check | Post-generation only | Every 10 chapters + post-generation |
| Everything else | Identical | Identical |
### Metrics to Measure
- CER post-generation (A vs B)
- Number of issues caught mid-generation vs post-generation
- Token cost difference (mid-gen checks add ~25K × N/10 tokens)
- Generation time difference
### Success Criterion
- Post-generation CER drops ≥ 30% in B
- Issues caught mid-generation prevent at least 1 expensive post-gen ripple propagation per run
- Additional cost ≤ $0.01 per book (all free on Pro-Exp)
### Implementation Notes
- Modify `cli/engine.py`: every 10 chapters, call `analyze_consistency()` on written chapters so far
- If issues found: log warning and optionally pause for user review
- Estimated implementation: 1 hour
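The engine-loop change can be sketched as below. `analyze_consistency()` is stubbed; in the real pipeline it would review the chapters written so far and return issue descriptions.

```python
# Hypothetical Experiment 5 loop: snapshot consistency every 10 chapters.
def analyze_consistency(written: list[str]) -> list[str]:
    # Stand-in for the real checker; returns a list of issue strings.
    return []

snapshots = []
written: list[str] = []
for chapter_num in range(1, 31):
    written.append(f"chapter {chapter_num}")
    if chapter_num % 10 == 0:  # mid-generation checkpoint
        issues = analyze_consistency(written)
        snapshots.append((chapter_num, issues))
        if issues:
            print(f"Warning at ch {chapter_num}: {issues}")
# A 30-chapter book yields three snapshots: chapters 10, 20, 30.
```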
---
## Experiment 6: Iterative Persona Validation
**Alt Reference:** Alt 1-C
**Hypothesis:** Validating the initial persona with a sample passage reduces voice-drift rewrites in Phase 3.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Persona creation | Single-pass, no validation | Generate persona → generate 200-word sample → evaluate → accept if ≥ 7/10, else regenerate (max 3 attempts) |
| Everything else | Identical | Identical |
### Metrics to Measure
- Initial persona acceptance rate (how often does first-pass persona pass the check?)
- Phase 3 persona-related rewrite rate (rewrites where critique mentions "voice inconsistency" or "doesn't match persona")
- HQS for first 5 chapters (voice is most important early on)
### Success Criterion
- Phase 3 persona-related rewrite rate drops ≥ 20% in B
- HQS for first 5 chapters improves ≥ 0.2
### Implementation Notes
- Modify `story/style_persona.py`: after `create_initial_persona()`, call a new `validate_persona()` function
- `validate_persona()` generates 200-word sample, evaluates with `evaluate_chapter_quality()` (light version)
- Estimated implementation: 2 hours
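An illustrative retry loop for the treatment arm. Both the persona generator and the sample evaluator are stubs (the real flow would generate a 200-word sample and score it with a light `evaluate_chapter_quality()` pass); the 7/10 bar and 3-attempt cap come from the setup table.

```python
# Sketch of Experiment 6's validate-then-regenerate loop.
def create_initial_persona(attempt: int) -> dict:
    return {"name": f"persona-attempt-{attempt}"}

def evaluate_sample(persona: dict, attempt: int) -> float:
    # Stub scores: pretend regenerated personas improve each attempt.
    return [6.2, 6.8, 7.4][attempt]

def validated_persona(max_attempts: int = 3, threshold: float = 7.0):
    best, best_score = None, -1.0
    for attempt in range(max_attempts):
        persona = create_initial_persona(attempt)
        score = evaluate_sample(persona, attempt)
        if score >= threshold:
            return persona, score
        if score > best_score:
            best, best_score = persona, score
    return best, best_score  # fall back to the best attempt seen

persona, score = validated_persona()
```

Falling back to the best-scoring attempt (rather than failing) keeps the pipeline from stalling when all three personas miss the bar; that fallback policy is an assumption, not stated in the experiment.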
---
## Experiment 7: Two-Pass Drafting (Draft + Polish)
**Alt Reference:** Alt 3-A
**Hypothesis:** A cheap rough draft followed by a polished revision produces better quality than iterative retrying.
### Setup
| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Drafting strategy | Single draft → evaluate → retry | Rough draft (Flash) → polish (Pro) → evaluate → accept if ≥ 7.0 (max 1 retry) |
| Max retry attempts | 3 | 1 (after polish) |
| Everything else | Identical | Identical |
### Metrics to Measure
- CPC (A vs B)
- HQS (A vs B)
- Rate of chapters needing retry (A vs B)
- Total generation time per book
### Success Criterion
- HQS improvement ≥ 0.3 in B with no cost increase
- OR: CPC reduction ≥ 20% in B with no HQS decrease
### Implementation Notes
- Modify `story/writer.py` `write_chapter()`: add polish pass using Pro model after initial draft
- Reduce max_attempts to 1 for final retry (after polish)
- This requires Pro model to be available (handled by auto-selection)
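The treatment arm's control flow, sketched with stubbed model calls (`draft_with_flash`, `polish_with_pro`, and the evaluator are all illustrative names, not the real API):

```python
# Sketch of the Experiment 7 treatment: rough draft, one polish pass,
# then at most one retry of the whole draft+polish cycle.
def draft_with_flash(beats: str) -> str:
    return f"rough: {beats}"

def polish_with_pro(draft: str) -> str:
    return draft.replace("rough", "polished")

def evaluate(text: str) -> float:
    # Stub: polished text clears the bar, rough text does not.
    return 7.2 if text.startswith("polished") else 5.5

def write_chapter_two_pass(beats: str, passing: float = 7.0,
                           max_retries: int = 1) -> str:
    text = polish_with_pro(draft_with_flash(beats))
    for _ in range(max_retries):
        if evaluate(text) >= passing:
            break
        text = polish_with_pro(draft_with_flash(beats))  # single retry
    return text

chapter = write_chapter_two_pass("storm hits the harbor")
```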
---
## Experiment Execution Order
Run experiments in this order to minimize dependency conflicts:
1. **Exp 1** (Persona Caching) — independent, 30 min, no risk
2. **Exp 2** (Skip Beat Expansion) — independent, 15 min, no risk
3. **Exp 4** (Adaptive Thresholds) — independent, 1 hr, low risk
4. **Exp 3** (Outline Validation) — independent, 2 hrs, low risk
5. **Exp 6** (Persona Validation) — independent, 2 hrs, low risk
6. **Exp 5** (Mid-gen Consistency) — requires stable Phase 3, 1 hr, low risk
7. **Exp 7** (Two-Pass Drafting) — most invasive change, run last; 3 hrs, medium risk
---
## Success Metrics Definitions
### Cost per Chapter (CPC)
```
CPC = (total_input_tokens × input_price + total_output_tokens × output_price) / num_chapters
```
Measure in both USD and token-count to separate model-price effects from efficiency effects.
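The formula above as a small helper. The token counts and per-token prices below are illustrative numbers, not real model pricing.

```python
# CPC computed in USD from the formula above.
def cost_per_chapter(total_input_tokens: int, total_output_tokens: int,
                     input_price: float, output_price: float,
                     num_chapters: int) -> float:
    usd = total_input_tokens * input_price + total_output_tokens * output_price
    return usd / num_chapters

# Example: 900K input + 300K output tokens across a 30-chapter book.
cpc = cost_per_chapter(
    total_input_tokens=900_000, total_output_tokens=300_000,
    input_price=1.25e-6, output_price=5.0e-6, num_chapters=30,
)
# 900000*1.25e-6 + 300000*5.0e-6 = 2.625 USD; / 30 chapters = 0.0875 USD
```

Running the same function with prices set to 1.0 yields CPC in raw tokens, separating efficiency effects from model-price effects as the text suggests.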
### Human Quality Score (HQS)
Blind evaluation by a human reviewer:
1. Read 3 chapters from variant A and 3 from variant B (same book premise)
2. Score each on: prose quality (1–5), pacing (1–5), character consistency (1–5)
3. HQS = average across all dimensions, normalized to 1–10
### Continuity Error Rate (CER)
After generation, manually review character states and key plot facts across chapters. Count:
- Character location contradictions
- Continuity breaks (held items, injuries, time-of-day)
- Plot event contradictions (character alive vs. dead)
Report as errors per 10 chapters.