Paul Welty, PhD AI, WORK, AND STAYING HUMAN

Work log: Phantasmagoria — March 28, 2026

What shipped today

The big theme today was pipeline architecture and AI-driven quality gates. Three major pieces landed.

Decoupled Stage 1 and Stage 2. Previously, generate_release.py --stage 2 would regenerate Stage 1 narratives before generating outcomes, which meant you couldn’t iterate on Stage 2 without risking Stage 1 drift. Now Stage 1 outputs are saved as snapshots in stage1/, and Stage 2 reads from those snapshots. This gives Stage 2 a fixed input: Stage 1 is frozen once approved, and Stage 2 can be re-run freely against a stable narrative foundation.
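A minimal sketch of the snapshot idea, not the actual generate_release.py code. The function names, the one-JSON-file-per-event layout, and the event_id key are all assumptions made for illustration; only the stage1/ directory name comes from the log above.

```python
import json
from pathlib import Path


def save_stage1_snapshot(release_dir: Path, event_id: str, narrative: dict) -> Path:
    """Write an approved Stage 1 narrative into stage1/ (hypothetical layout)."""
    snapshot_dir = release_dir / "stage1"
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    path = snapshot_dir / f"{event_id}.json"
    # sort_keys keeps the file byte-stable when the same narrative is saved twice
    path.write_text(json.dumps(narrative, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_stage1_snapshots(release_dir: Path) -> dict[str, dict]:
    """Read frozen narratives for Stage 2; never regenerates Stage 1."""
    snapshot_dir = release_dir / "stage1"
    if not snapshot_dir.is_dir():
        raise FileNotFoundError(f"no Stage 1 snapshots under {snapshot_dir}")
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(snapshot_dir.glob("*.json"))
    }
```

The point of the design is that the Stage 2 entry point only ever calls the loader, so a missing snapshot is a hard error rather than a silent trigger to regenerate the narrative.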

Split Stage 3 into render (Stage 3) and validate (Stage 4). The old check_c3_to_ship() did both rendering and linting in one pass. Now rendering and validation are separate stages with their own contract gates (check_c3_to_c4() and check_c4_to_ship()). The linter was renamed to “validator” (stellaris_mod_validator.py) to better reflect its role. The split also fixed two renderer bugs: modifier directory output and on_action trigger syntax were both broken.
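Roughly, the two-gate shape looks like the sketch below. The gate names match the ones above, but their signatures, the GateResult type, and the checks inside them are assumptions for illustration; the real contracts are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateResult:
    passed: bool
    problems: list[str] = field(default_factory=list)


def check_c3_to_c4(rendered_files: dict[str, str]) -> GateResult:
    """Hypothetical render contract: at least one file, none of them empty."""
    problems = []
    if not rendered_files:
        problems.append("renderer produced no files")
    problems += [
        f"{name}: empty output"
        for name, text in rendered_files.items()
        if not text.strip()
    ]
    return GateResult(not problems, problems)


def check_c4_to_ship(validation_errors: list[str]) -> GateResult:
    """Hypothetical ship contract: the validator reported no errors."""
    return GateResult(not validation_errors, list(validation_errors))


def run_render_and_validate(
    render: Callable[[], dict[str, str]],
    validate: Callable[[dict[str, str]], list[str]],
) -> GateResult:
    rendered = render()
    gate = check_c3_to_c4(rendered)
    if not gate.passed:
        return gate  # a failed render never reaches the validator
    return check_c4_to_ship(validate(rendered))
```

Keeping the gates separate means a render failure and a validation failure show up as different problems at different stages, instead of one combined pass/fail.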

AI-based choice tension evaluation (Stage 2B+). This is the headline feature: after Stage 2A generates event outcomes, a new Stage 2B+ pass scores each event’s option set for tension on a 1-5 scale. Events scoring below 3 get rejected with specific critique, and Stage 2A regenerates with that feedback, up to 5 retries. The evaluator checks for dominant options (one choice clearly better than all others), reward stacking, tech parity violations, and punished caution. Both test events (The Cartographer’s Obsession and The Weighted Void) passed after 2-3 retries, which suggests the feedback loop catches and fixes real problems, though two events is a small sample.
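The generate-evaluate-retry loop can be sketched as follows. The threshold of 3 and the limit of 5 retries come from the description above; everything else (the Evaluation type, the callable signatures, and the reading that 5 retries means up to 6 attempts in total) is an assumption. The model calls are passed in as plain functions so the loop itself stays testable.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MIN_TENSION = 3   # events scoring below this are rejected
MAX_RETRIES = 5   # retries after the first attempt (assumed interpretation)


@dataclass
class Evaluation:
    score: int      # tension on a 1-5 scale
    critique: str   # specific feedback handed back to the generator


def generate_with_tension_gate(
    generate: Callable[[Optional[str]], dict],
    evaluate: Callable[[dict], Evaluation],
) -> tuple[dict, Evaluation, int]:
    """Return (outcomes, evaluation, retries_used) once the score clears the bar."""
    feedback: Optional[str] = None
    for retries_used in range(MAX_RETRIES + 1):
        outcomes = generate(feedback)       # Stage 2A, with critique if any
        evaluation = evaluate(outcomes)     # Stage 2B+ scoring pass
        if evaluation.score >= MIN_TENSION:
            return outcomes, evaluation, retries_used
        feedback = evaluation.critique      # feed the rejection reason forward
    raise RuntimeError(f"tension still below {MIN_TENSION} after {MAX_RETRIES} retries")
```

Raising on exhaustion, rather than returning the last attempt, keeps a low-tension event from slipping through to rendering unnoticed.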

Key design rules baked into the prompts: max 2 effects per option (down from 3), one premium reward per option (tech OR follow-up OR strong modifier — pick one), follow-up events reframed as gambles rather than free bonuses, and the follow-up used as a balancing lever (fewer other effects on the follow-up option since it already offers “more content”).
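These rules live in the prompts, not in code, but the two countable ones (max 2 effects, one premium reward) could also be checked mechanically. A hedged sketch: the effect dictionaries, the "kind" field, and the kind names are invented for illustration, and it assumes a follow-up event counts as one of the option's effects.

```python
MAX_EFFECTS_PER_OPTION = 2
# Hypothetical labels for the three premium reward types named above
PREMIUM_KINDS = {"tech", "follow_up", "strong_modifier"}


def check_option(effects: list[dict]) -> list[str]:
    """Return rule violations for one option's effects (empty list means OK)."""
    problems = []
    if len(effects) > MAX_EFFECTS_PER_OPTION:
        problems.append(
            f"{len(effects)} effects, max is {MAX_EFFECTS_PER_OPTION}"
        )
    premium = [e["kind"] for e in effects if e["kind"] in PREMIUM_KINDS]
    if len(premium) > 1:
        problems.append(f"multiple premium rewards: {', '.join(premium)}")
    return problems
```

A check like this cannot judge whether a follow-up reads as a gamble or a free bonus; that part still needs the evaluator or a human reviewer.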

Completed

  • Pipeline decoupling: Stage 1 snapshots, Stage 2 reads from frozen narratives
  • Stage 3/4 split: render and validate as separate pipeline stages
  • Renderer fixes: modifier directory, on_action syntax
  • Playtest mode: zeroed out min_planets and min_years gates
  • Stage 2B+ tension evaluation with anti-dominance rules
  • 9 documentation files updated across all changes
  • Created /ship skill for streamlined doc-update-commit-push workflow

Release progress

  • v2: 5/5 closed (complete)
  • v1.5: 18/18 closed (complete)

No open milestones — next release milestone hasn’t been created yet.

Carry-over

  • Issue #249 (Stage 2B dominant option detection) is ready-for-prep — the core 2B+ evaluator shipped today, but the issue may need its spec updated to reflect what actually landed vs. what’s still needed
  • Issue #247 (on_action syntax validation) is ready-for-dev with PR #248 open — needs rebase since main renamed linter to validator
  • Issue #246 (common/ subdirectory validation) is ready-for-dev
  • Generated events under data/releases/celestial_equinox/events/ were regenerated multiple times during testing — not committed. Need human review of final output before committing.
  • Backup files (SUMMARY_pre_stage1_reset_*, events_pre_stage1_reset_*) are untracked, intentionally not committed

Risks

  • The 2B+ evaluator uses the same AI model (Claude Sonnet 4.6) for both generation and evaluation — there’s a risk of shared blind spots where the evaluator doesn’t catch patterns the generator favors
  • Stage 1 narrative balance affects Stage 2 outcomes significantly — if Stage 1 options are structurally asymmetric (e.g., only one has a follow-up), Stage 2A has to work harder to create tension

Flags and watch-outs

  • All pipeline stages use anthropic / claude-sonnet-4-6 via AI_VENDOR/AI_MODEL env vars — this is configurable but hasn’t been tested with other providers
  • The resource_gain_scaled vs modifier_value distinction is subtle: scaled is a one-time percentage hit on stockpile, modifier_value is an ongoing production multiplier. Docs were updated but this remains a common confusion point.
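A worked example of the resource_gain_scaled vs modifier_value distinction, as plain arithmetic. This models only the behavior described in the bullet above (one-time percentage change to a stockpile vs an ongoing production multiplier); the function names and numbers are made up and are not the game's or the pipeline's actual implementation.

```python
def apply_scaled_once(stockpile: float, percent: int) -> float:
    """One-time change: the stockpile shifts by a percentage, then it is over."""
    return stockpile * (100 + percent) / 100


def production_with_modifier(monthly_income: float, percent: int, months: int) -> float:
    """Ongoing change: every month's production is multiplied while it lasts."""
    return monthly_income * (100 + percent) * months / 100


# A -10% scaled hit on a stockpile of 1000 removes 100, once.
after_hit = apply_scaled_once(1000, -10)

# A +10% production modifier on 50/month keeps paying out for as long as it
# lasts: 12 months of modified income versus 12 months of baseline income.
boosted = production_with_modifier(50, 10, 12)
baseline = production_with_modifier(50, 0, 12)
extra_from_modifier = boosted - baseline
```

With these numbers the one-time hit costs 100, while the modifier yields 60 extra over a year and keeps growing with duration, which is why the same percentage means very different things in the two fields.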

Next session

  1. Review generated events — the celestial_equinox events under events/ need human review before committing. Run Stage 2 fresh if needed.
  2. Rebase PR #248 — on_action syntax validation PR needs rebase after linter→validator rename
  3. Close or update #249 — the 2B+ evaluator shipped; decide if the issue scope is satisfied or if more work is needed
  4. Execute #246 — common/ subdirectory validation is ready for dev
  5. Consider creating a v3 milestone — both v1.5 and v2 are fully closed, time to plan the next release
