# Experiment Design: A/B Tests for BookApp Optimization

**Date:** 2026-02-22
**Status:** Completed — fulfills Action Plan Step 3

---

## Methodology

All experiments follow a controlled A/B design. We hold all variables constant except the single variable under test. Success is measured against three primary metrics:

- **Cost per chapter (CPC):** Total token cost / number of chapters written
- **Human Quality Score (HQS):** 1–10 score from a human reviewer blind to which variant generated the chapter
- **Continuity Error Rate (CER):** Number of plot/character contradictions per 10 chapters (lower is better)

Each experiment runs on the same 3 prompts (one each at short-story, novella, and novel length). Results are averaged across all 3.

**Baseline:** Current production configuration as of 2026-02-22.

---
## Experiment 1: Persona Caching

**Alt Reference:** Alt 3-D
**Hypothesis:** Caching the persona per book reduces I/O overhead with no quality impact.

### Setup

| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Persona loading | Re-read from disk each chapter | Load once per book run, pass as argument |
| Everything else | Identical | Identical |

### Metrics to Measure

- Token count per chapter (to verify savings)
- Wall-clock generation time per book
- Chapter quality scores (should be identical)

### Success Criteria

- Token reduction ≥ 2,000 tokens/chapter on books with sample files
- HQS difference < 0.1 between A and B (no quality impact)
- Zero new errors introduced

### Implementation Notes

- Modify `cli/engine.py`: call `style_persona.load_persona_data()` once before the chapter loop
- Modify `story/writer.py`: accept an optional `persona_info` parameter; skip disk reads when it is provided
- Estimated implementation: 30 minutes
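
A minimal sketch of the treatment arm, with `load_persona_data` and `write_chapter` reduced to stubs (the real functions live in `story/style_persona.py` and `story/writer.py`); a read counter stands in for the I/O cost being saved:

```python
# Treatment arm of Experiment 1: load the persona once per book run and
# thread it through the chapter loop as an argument. Both functions are
# simplified stand-ins for the real pipeline code.

disk_reads = 0

def load_persona_data(folder):
    """Stub for the disk-backed persona loader; counts each read."""
    global disk_reads
    disk_reads += 1
    return {"voice": "wry, close third person", "folder": folder}

def write_chapter(num, persona_info=None):
    """Stub writer: hits the disk only when no cached persona is passed."""
    persona = persona_info if persona_info is not None else load_persona_data("book/")
    return {"chapter": num, "voice": persona["voice"]}

# Control (A): re-read from disk for every chapter.
for n in range(1, 31):
    write_chapter(n)
reads_control = disk_reads  # 30 reads for a 30-chapter book

# Treatment (B): one read, then pass the cached dict down.
disk_reads = 0
persona = load_persona_data("book/")
for n in range(1, 31):
    write_chapter(n, persona_info=persona)
reads_treatment = disk_reads  # 1 read total
```

The quality comparison is unaffected by design: both arms see identical persona content; only the number of loads differs.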

---

## Experiment 2: Skip Beat Expansion for Detailed Beats

**Alt Reference:** Alt 3-E
**Hypothesis:** Skipping `expand_beats_to_treatment()` when the beats already exceed 100 words saves tokens with no quality loss.

### Setup

| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Beat expansion | Always called | Skipped if total beats > 100 words |
| Everything else | Identical | Identical |

### Metrics to Measure

- Percentage of chapters that skip expansion (expected: ~30%)
- Token savings per book
- HQS for chapters that skip vs. chapters that don't
- Rate of beat-coverage failures (chapters that miss a required beat)

### Success Criteria

- ≥ 25% of chapters skip expansion (validating the hypothesis)
- HQS difference < 0.2 between chapters that skip and those that don't
- Beat-coverage failure rate unchanged

### Implementation Notes

- Modify `story/writer.py` `write_chapter()`: add an `if sum(len(b.split()) for b in beats) > 100` guard before calling expansion (note: `len(b)` alone would count characters, not words)
- Estimated implementation: 15 minutes
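
The guard can be sketched as a pure helper (`needs_expansion` is a hypothetical name; in `write_chapter()` it would gate the call to `expand_beats_to_treatment()`):

```python
# Experiment 2 guard: skip the Director's Treatment expansion when the
# beats are already detailed. The threshold is in words, so each beat is
# split on whitespace rather than measured in characters.

def needs_expansion(beats, max_words=100):
    """Return True when the beats are terse enough to warrant expansion."""
    total_words = sum(len(b.split()) for b in beats)
    return total_words <= max_words

terse_beats = ["Hero arrives at the harbor.", "A storm rolls in."]
detailed_beats = ["setup " * 60, "payoff " * 60]  # 120 words total

print(needs_expansion(terse_beats))     # terse outline -> expand
print(needs_expansion(detailed_beats))  # detailed outline -> skip
```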

---

## Experiment 3: Outline Validation Gate

**Alt Reference:** Alt 2-B
**Hypothesis:** Pre-generation outline validation prevents costly Phase 3 rewrites by catching plot holes at the outline stage.

### Setup

| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Outline validation | None | Run `validate_outline()` after `create_chapter_plan()`; block if critical issues found |
| Everything else | Identical | Identical |

### Metrics to Measure

- Number of critical outline issues flagged per run
- Rewrite rate during Phase 3 (did validation prevent rewrites?)
- Phase 3 token cost difference (A vs. B)
- CER difference (did validation reduce continuity errors?)

### Success Criteria

- Validation blocks at least 1 critical issue per 3 runs
- Phase 3 rewrite rate drops ≥ 15% when validation is active
- CER improves by ≥ 0.5 errors per 10 chapters

### Implementation Notes

- Add `validate_outline(events, chapters, bp, folder)` to `story/planner.py`
- Prompt: "Review this chapter plan for: (1) missing required plot beats, (2) character deaths/revivals without explanation, (3) severe pacing imbalances, (4) POV character inconsistency. Return: {issues: [...], severity: 'critical'|'warning'|'ok'}"
- Modify `cli/engine.py`: call `validate_outline()` and log issues before Phase 3 begins
- Estimated implementation: 2 hours
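
The gate can be sketched end to end with a stubbed validator. The real `validate_outline()` calls the model with the review prompt above; the stub below implements a single illustrative continuity rule, but the blocking logic in `cli/engine.py` looks the same either way:

```python
# Experiment 3 treatment arm: validate the chapter plan before Phase 3
# and block on critical issues. The validator here is a stub that only
# checks one continuity rule: a character must not appear after dying.

def validate_outline(chapters):
    issues = []
    dead = set()
    for i, ch in enumerate(chapters, start=1):
        for name in ch.get("appears", []):
            if name in dead:
                issues.append(f"chapter {i}: {name} appears after dying")
        dead.update(ch.get("dies", []))
    return {"issues": issues, "severity": "critical" if issues else "ok"}

plan = [
    {"appears": ["Mara", "Tomas"], "dies": ["Tomas"]},
    {"appears": ["Mara", "Tomas"]},  # continuity break: Tomas is dead
]

report = validate_outline(plan)
if report["severity"] == "critical":
    # In cli/engine.py: log the issues and halt before Phase 3 begins.
    print("blocked:", report["issues"])
```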

---

## Experiment 4: Adaptive Scoring Thresholds

**Alt Reference:** Alt 3-B
**Hypothesis:** Lowering SCORE_PASSING for early setup chapters reduces refinement cost while maintaining quality on high-stakes scenes.

### Setup

| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| SCORE_AUTO_ACCEPT | 8.0 (all chapters) | 8.0 (all chapters) |
| SCORE_PASSING | 7.0 (all chapters) | 6.5 (ch 1–20%), 7.0 (ch 20–70%), 7.5 (ch 70–100%) |
| Everything else | Identical | Identical |

### Metrics to Measure

- Refinement pass count per chapter-position bucket
- HQS per chapter-position bucket (A vs. B)
- CPC for each bucket
- Overall HQS for the full book (A vs. B)

### Success Criteria

- Setup chapters (1–20%): ≥ 20% fewer refinement passes in B
- Climax chapters (70–100%): HQS improvement ≥ 0.3 in B
- Full-book HQS unchanged or improved

### Implementation Notes

- Modify `story/writer.py` `write_chapter()`: accept `chapter_position` (a 0.0–1.0 float)
- Compute the adaptive threshold: `passing = 6.5 + position * 1.0` (linear scaling; this approximates the bucketed values in the table above)
- Modify `cli/engine.py`: pass `chapter_num / total_chapters` to `write_chapter()`
- Estimated implementation: 1 hour
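
The linear schedule can be written as a small helper (`passing_threshold` is a hypothetical name; clamping at SCORE_AUTO_ACCEPT is an added safety assumption so the passing bar can never reach the auto-accept bar):

```python
# Experiment 4 treatment arm: SCORE_PASSING scales linearly with the
# chapter's position in the book, from 6.5 at the opening to 7.5 at the
# climax. SCORE_AUTO_ACCEPT stays fixed at 8.0 in both arms.

SCORE_AUTO_ACCEPT = 8.0

def passing_threshold(chapter_num, total_chapters):
    position = chapter_num / total_chapters  # the 0.0-1.0 chapter_position
    return min(6.5 + position * 1.0, SCORE_AUTO_ACCEPT)

print(passing_threshold(1, 30))   # lenient on early setup chapters
print(passing_threshold(15, 30))  # -> 7.0 at the midpoint
print(passing_threshold(30, 30))  # -> 7.5 on the final chapter
```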

---

## Experiment 5: Mid-Generation Consistency Snapshots

**Alt Reference:** Alt 4-B
**Hypothesis:** Running `analyze_consistency()` every 10 chapters reduces post-generation CER without a significant cost increase.

### Setup

| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Consistency check | Post-generation only | Every 10 chapters + post-generation |
| Everything else | Identical | Identical |

### Metrics to Measure

- CER post-generation (A vs. B)
- Number of issues caught mid-generation vs. post-generation
- Token cost difference (mid-generation checks add ~25K × N/10 tokens for an N-chapter book)
- Generation time difference

### Success Criteria

- Post-generation CER drops ≥ 30% in B
- Issues caught mid-generation prevent at least 1 expensive post-generation ripple propagation per run
- Additional cost ≤ $0.01 per book (all free on Pro-Exp)

### Implementation Notes

- Modify `cli/engine.py`: every 10 chapters, call `analyze_consistency()` on the chapters written so far
- If issues are found: log a warning and optionally pause for user review
- Estimated implementation: 1 hour
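
The snapshot cadence can be sketched with a stubbed `analyze_consistency()` (the real one prompts the model over the chapters written so far):

```python
# Experiment 5 treatment arm: run a consistency snapshot after every
# 10th chapter, in addition to the existing post-generation check.

def analyze_consistency(chapters):
    """Stub: the real check asks the model for contradictions."""
    return []  # no issues found in this sketch

chapters_written = []
snapshot_points = []
for num in range(1, 31):
    chapters_written.append(f"chapter {num}")
    if num % 10 == 0:  # mid-generation snapshot every 10 chapters
        issues = analyze_consistency(chapters_written)
        snapshot_points.append(num)
        if issues:
            print(f"warning after chapter {num}:", issues)

print(snapshot_points)  # -> [10, 20, 30]
```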

---

## Experiment 6: Iterative Persona Validation

**Alt Reference:** Alt 1-C
**Hypothesis:** Validating the initial persona against a sample passage reduces voice-drift rewrites in Phase 3.

### Setup

| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Persona creation | Single-pass, no validation | Generate persona → generate 200-word sample → evaluate → accept if ≥ 7/10, else regenerate (max 3 attempts) |
| Everything else | Identical | Identical |

### Metrics to Measure

- Initial persona acceptance rate (how often does the first-pass persona pass the check?)
- Phase 3 persona-related rewrite rate (rewrites where the critique mentions "voice inconsistency" or "doesn't match persona")
- HQS for the first 5 chapters (voice matters most early on)

### Success Criteria

- Phase 3 persona-related rewrite rate drops ≥ 20% in B
- HQS for the first 5 chapters improves by ≥ 0.2

### Implementation Notes

- Modify `story/style_persona.py`: after `create_initial_persona()`, call a new `validate_persona()` function
- `validate_persona()` generates a 200-word sample and evaluates it with a light version of `evaluate_chapter_quality()`
- Estimated implementation: 2 hours
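
The accept/regenerate loop can be sketched as follows; sample generation and scoring are stubbed with a seeded random scorer, and `validated_persona` is a hypothetical wrapper around `create_initial_persona()` and the light evaluator:

```python
# Experiment 6 treatment arm: generate a persona, score a 200-word
# sample, accept at >= 7/10, otherwise regenerate (max 3 attempts) and
# fall back to the best-scoring candidate.

import random

def create_initial_persona(rng):
    """Stub for the real persona generator."""
    return {"id": rng.randrange(1000)}

def score_sample(persona, rng):
    """Stub for sampling 200 words and scoring them 1-10."""
    return rng.uniform(4.0, 10.0)

def validated_persona(seed=0, max_attempts=3, accept_at=7.0):
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(max_attempts):
        persona = create_initial_persona(rng)
        score = score_sample(persona, rng)
        if score >= accept_at:
            return persona, score  # accepted on this attempt
        if score > best_score:
            best, best_score = persona, score
    return best, best_score  # best of the rejected candidates

persona, score = validated_persona()
```

Falling back to the best rejected candidate (rather than failing the run) is an assumption; the experiment spec only bounds the attempt count.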

---

## Experiment 7: Two-Pass Drafting (Draft + Polish)

**Alt Reference:** Alt 3-A
**Hypothesis:** A cheap rough draft followed by a polished revision produces better quality than iterative retrying.

### Setup

| Parameter | Control (A) | Treatment (B) |
|-----------|-------------|---------------|
| Drafting strategy | Single draft → evaluate → retry | Rough draft (Flash) → polish (Pro) → evaluate → accept if ≥ 7.0 (max 1 retry) |
| Max retry attempts | 3 | 1 (after polish) |
| Everything else | Identical | Identical |

### Metrics to Measure

- CPC (A vs. B)
- HQS (A vs. B)
- Rate of chapters needing a retry (A vs. B)
- Total generation time per book

### Success Criteria

- HQS improvement ≥ 0.3 in B with no cost increase
- OR: CPC reduction ≥ 20% in B with no HQS decrease

### Implementation Notes

- Modify `story/writer.py` `write_chapter()`: add a polish pass using the Pro model after the initial draft
- Reduce `max_attempts` to 1 for the final retry (after polish)
- Requires the Pro model to be available (handled by auto-selection)
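
The two-pass flow can be sketched with stubbed model calls (`draft_with_flash`, `polish_with_pro`, and `evaluate` are hypothetical names for the Flash draft, Pro polish, and chapter-quality calls):

```python
# Experiment 7 treatment arm: cheap rough draft, one expensive polish,
# then at most one retry if the polished chapter still scores below 7.0.

def draft_with_flash(beats):
    return "rough: " + " ".join(beats)

def polish_with_pro(draft):
    return draft.replace("rough", "polished")

def evaluate(text):
    """Stub scorer: rewards the polish pass in this sketch."""
    return 7.4 if text.startswith("polished") else 6.1

def write_chapter_two_pass(beats, passing=7.0, max_retries=1):
    chapter = polish_with_pro(draft_with_flash(beats))
    score = evaluate(chapter)
    for _ in range(max_retries):  # 1 retry in B, vs. 3 attempts in A
        if score >= passing:
            break
        chapter = polish_with_pro(draft_with_flash(beats))
        score = evaluate(chapter)
    return chapter, score

chapter, score = write_chapter_two_pass(["Hero arrives.", "Storm hits."])
```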

---

## Experiment Execution Order

Run the experiments in this order to minimize dependency conflicts:

1. **Exp 1** (Persona Caching) — independent, 30 min, no risk
2. **Exp 2** (Skip Beat Expansion) — independent, 15 min, no risk
3. **Exp 4** (Adaptive Thresholds) — independent, 1 hr, low risk
4. **Exp 3** (Outline Validation) — independent, 2 hrs, low risk
5. **Exp 6** (Persona Validation) — independent, 2 hrs, low risk
6. **Exp 5** (Mid-Generation Consistency) — requires a stable Phase 3, 1 hr, low risk
7. **Exp 7** (Two-Pass Drafting) — 3 hrs, medium risk; the riskiest of the set, so run it last

---
## Success Metrics Definitions

### Cost per Chapter (CPC)

```
CPC = (total_input_tokens × input_price + total_output_tokens × output_price) / num_chapters
```

Measure in both USD and token count to separate model-price effects from efficiency effects.
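
A worked instance of the formula; the per-token prices are placeholder assumptions, not real model pricing:

```python
# CPC in both USD and raw tokens for a hypothetical 30-chapter run.
# The prices below are placeholders, not actual model rates.

def cost_per_chapter(input_tokens, output_tokens, num_chapters,
                     input_price=1.25e-6, output_price=5.00e-6):
    usd = (input_tokens * input_price + output_tokens * output_price) / num_chapters
    tokens = (input_tokens + output_tokens) / num_chapters
    return usd, tokens

usd, tokens = cost_per_chapter(3_000_000, 900_000, 30)
print(f"${usd:.3f}/chapter, {tokens:,.0f} tokens/chapter")
```

Reporting the token figure alongside USD is what lets a price change (same tokens, different dollars) be told apart from an efficiency change (fewer tokens).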

### Human Quality Score (HQS)

Blind evaluation by a human reviewer:
1. Read 3 chapters from treatment A and 3 from treatment B (same book premise)
2. Score each on: prose quality (1–5), pacing (1–5), character consistency (1–5)
3. HQS = average across all dimensions, normalized to 1–10
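
One plausible reading of the normalization step is to average the three 1–5 dimension scores and double the result; the exact mapping to the 1–10 scale is an assumption (doubling maps a 1–5 average onto 2–10):

```python
# HQS aggregation sketch: mean of the three 1-5 dimensions, doubled.
# The doubling is an assumed normalization, not confirmed by the spec.

def hqs(prose, pacing, consistency):
    return (prose + pacing + consistency) / 3 * 2

print(hqs(4, 3, 5))  # -> 8.0
```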

### Continuity Error Rate (CER)

After generation, manually review character states and key plot facts across chapters. Count:
- Character location contradictions
- Continuity breaks (held items, injuries, time-of-day)
- Plot event contradictions (character alive vs. dead)

Report as errors per 10 chapters.
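
The per-10-chapter scaling lets books of different lengths be compared directly; as a one-line sketch:

```python
# CER sketch: raw contradiction count scaled to a per-10-chapter rate.

def cer(error_count, num_chapters):
    return error_count / num_chapters * 10

print(cer(3, 30))  # 3 contradictions over a 30-chapter novel -> 1.0
```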