Paul Welty, PhD AI, WORK, AND STAYING HUMAN

The day we shipped two products and the agents got bored

112 issues across 12 projects. Two new products went from nothing to code-complete MVP in single sessions. And the most interesting signal wasn't the speed — it was the scout that came back empty-handed.

Duration: 8:13 | Size: 7.52 MB

The most interesting thing that happened today wasn’t the two products we shipped. It was the scout that found nothing.

I run autonomous agents that explore codebases looking for improvements — bugs, missing tests, security gaps, dead code, anything worth fixing. They’ve been running for weeks across 15 projects, consistently finding real issues. Billing webhooks that ignore database failures. Security filters with variable name mismatches. Podcast feeds with broken URLs. The agents are patient and systematic in a way humans can’t sustain, and they keep finding things.

Today, one of them came back empty. The paulos scout (the agent that explores the orchestration codebase, the infrastructure that runs everything else) filed five issues, and every single one was a false positive. Tests already existed. Coverage was already comprehensive. The agent couldn’t find anything worth fixing.

That’s the sound of a system reaching steady state. And it tells you something about what autonomous improvement actually looks like at scale: it has a ceiling. The interesting question is what happens when you hit it.

But first, the speed story, because it’s worth telling.

Two products went from empty repositories to code-complete MVPs in single sessions today. Dinly, a family meal planning tool, shipped 20 issues: full household management with an onboarding wizard; a recipe library with Mela import (3,805 real recipes imported with zero failures); weekly voting cycles with a deterministic ranking engine that scores recipes using vote signals, preference data, dietary constraints, and cook history; a meal plan builder; auto-generated shopping lists; and calendar export. One session. From nothing to a working product that a family could actually use tomorrow.
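The post doesn't show Dinly's ranking engine, but a deterministic scorer of this shape, combining vote signals, preference matches, dietary constraints, and cook history, might look like the following sketch. All field names and weights are hypothetical, not taken from the actual product:

```python
from dataclasses import dataclass

@dataclass
class Recipe:
    name: str
    tags: set            # e.g. {"vegetarian", "quick"}
    votes: int = 0       # net up-votes this voting cycle
    times_cooked: int = 0

def score(recipe, household):
    """Deterministic score: higher is better. Weights are illustrative."""
    # Hard constraint: a recipe violating a dietary restriction is out.
    if recipe.tags & household["forbidden_tags"]:
        return float("-inf")
    s = 3.0 * recipe.votes                                 # vote signal
    s += 1.0 * len(recipe.tags & household["liked_tags"])  # preference match
    s -= 0.5 * recipe.times_cooked                         # cook-history fatigue
    return s

def rank(recipes, household):
    # Tie-break on name so the ordering is fully deterministic.
    return sorted(recipes, key=lambda r: (-score(r, household), r.name))

household = {"forbidden_tags": {"peanut"}, "liked_tags": {"quick"}}
recipes = [
    Recipe("Pad Thai", {"peanut", "quick"}, votes=5),
    Recipe("Stir Fry", {"quick"}, votes=2),
    Recipe("Stew", set(), votes=2, times_cooked=4),
]
print([r.name for r in rank(recipes, household)])
```

The determinism matters: given the same votes and history, every rerun produces the same weekly plan, which makes the engine's output auditable rather than a roll of the dice.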

Prakta had an even bigger day — 34 issues across three sessions. From a marketing shell to a complete task intelligence platform with AI-powered onboarding, four cognitive modes, a Claude-powered sequencing engine, energy check-ins, cycle cards with timers, a contextual AI sidebar, end-of-day capture, and a learning model that adjusts to how you actually work. The complete flow works: signup through wrap-up, billing gates access, Stripe checkout, the works.

What makes this possible, and why it matters beyond the flex: the infrastructure for standing up a new product is now a checklist. Create the repo, scaffold the framework, write PRODUCT.md, add to workspaces.toml, create the orchestrate plist, install the launchd agent, create the tmux session, add the bot user, provision labels, create a milestone. Every step documented, every step done before, every step takes minutes. The new product inherits the full fleet infrastructure from minute one — autonomous scouting, orchestrated issue dispatch, cross-project synthesis.
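A checklist like that is one step away from being code. A minimal sketch, with every command a hypothetical stand-in (the real fleet tooling isn't shown in the post), of what "the infrastructure for standing up a new product is now a checklist" might look like as a runner:

```python
# The bootstrap checklist as data: one entry per step named in the post.
# Command strings are invented placeholders, not real fleet tooling.
CHECKLIST = [
    ("create the repo",              "gh repo create fleet/{p} --private"),
    ("scaffold the framework",       "scaffold {p}"),
    ("write PRODUCT.md",             "edit {p}/PRODUCT.md"),
    ("add to workspaces.toml",       "append {p} to workspaces.toml"),
    ("create the orchestrate plist", "render orchestrate-{p}.plist"),
    ("install the launchd agent",    "launchctl load orchestrate-{p}.plist"),
    ("create the tmux session",      "tmux new-session -d -s {p}"),
    ("add the bot user",             "bots add {p}-bot"),
    ("provision labels",             "labels sync {p}"),
    ("create a milestone",           "gh milestone create MVP --repo fleet/{p}"),
]

def bootstrap(product, dry_run=True):
    """Print (or, in a real version, execute) every checklist step."""
    done = []
    for name, cmd in CHECKLIST:
        command = cmd.format(p=product)
        if dry_run:
            print(f"[{product}] {name}: {command}")
        # a real version would shell out here instead of printing
        done.append(name)
    return done

bootstrap("dinly")
```

The point of the data-plus-runner shape is the claim in the paragraph above: once every step is enumerated and mechanical, the marginal cost of a new product is one invocation, not a week of setup.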

The cost of starting something new is approaching zero. Not the cost of building it. But the cost of plugging it into the system that makes building fast. The infrastructure is the force multiplier. Without it, shipping two MVPs in a day is a stunt. With it, it’s a Tuesday.

The third insight is about what I’m calling “the critical miss pattern.” Yesterday, agents working on Phantasmagoria — a Stellaris mod generator — added four new effect types to make generated events more visible and interesting. Planet deposits, technology grants, research options. They wrote the code, defined the effect blocks, closed the issues. Both milestones marked complete.

Today, a different agent ran a scout pass on the same codebase and found something the implementation agents missed: the new effect blocks were defined in the code but never wired into the roller’s pattern structures. Generated events would have looked identical to before the work was done. The effects existed in the code. They just had no path to actually appear in any output.
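That failure mode is easy to reproduce in miniature: an effect registry gains new entries, but the roller's pattern tables still enumerate only the old ones, so the new effects can never appear in any output. A hedged sketch, with names invented rather than taken from Phantasmagoria's actual code, of the check the scout effectively performed:

```python
# Effect builders: suppose the implementation agents added the last three.
EFFECTS = {
    "spawn_fleet":     lambda: "create_fleet = { ... }",
    "add_deposit":     lambda: "add_deposit = ...",      # new, yesterday
    "grant_tech":      lambda: "give_technology = ...",  # new, yesterday
    "research_option": lambda: "...",                    # new, yesterday
}

# Roller patterns: the structures events are actually generated from.
# Nobody updated this table, so the new effects are unreachable.
ROLLER_PATTERNS = [
    ["spawn_fleet"],
    ["spawn_fleet", "spawn_fleet"],
]

def unreachable_effects():
    """Defined effects that no roller pattern references."""
    wired = {name for pattern in ROLLER_PATTERNS for name in pattern}
    return sorted(set(EFFECTS) - wired)

print(unreachable_effects())
```

Every unit test of the individual effect builders passes; only a check that crosses the boundary between the registry and the roller notices that three of them are dead weight.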

The adversarial review pattern working exactly as designed. The agent that builds a thing checks whether it compiles, whether it passes tests, whether it matches the spec. The agent that scouts a thing checks whether it actually works in the context of the whole system. Different agents, different incentives, different things they notice. The builder closed the issue because the code was correct. The scout reopened the question because the system behavior hadn’t changed.

In an agency, this is why QA is a separate role from development. Not because the developer is careless — the code was correct. But because the developer’s frame is the issue spec, and QA’s frame is the product. They’re looking at different things. You need both frames. You need the tension between them.

Fourth: three products are approaching launch simultaneously. Eclectis was formally declared launch-ready today after a comprehensive audit — all pipeline handlers work, all delivery channels functional, billing works, admin tooling solid. Paul wants to do a Product Hunt launch. Prakta and Dinly are code-complete and heading toward deployment. Three products, three different markets, all reaching launchability in the same week.

This creates a new kind of coordination problem. Not “how do I build three things” — the fleet handles that. But “which do I launch first?” That’s a strategy question, not an engineering question. The agents can’t answer it. They can tell you which product is more technically ready. They can’t tell you which market is more receptive, which story is more compelling, which launch has better timing.

This is where taste and strategy still live. The agents build. The human decides what to build and when to ship it to the world. That boundary has been remarkably stable even as the agents get more capable. The building gets cheaper. The deciding doesn’t.

Fifth: Skillexis, stalled for a week. Six consecutive sessions with no code shipped. A migration spec approved on March 12, and nothing since. Today the scout finally broke the pattern — it filed five issues, three of them grindable. The pipeline has work for the first time since the migration was approved. But the migration itself still hasn’t started.

There’s a pattern here that goes beyond one project. Capable planning systems produce plans. Plans feel like progress. They’re not. The migration spec is thorough, well-reasoned, and complete. It’s also been sitting there for six days while the orchestrate loop triages an empty queue. Planning without executing is the organizational equivalent of a heating system that only has a thermostat — all measurement, no warmth.

The fix isn’t “plan less.” The fix is “start before you’re ready.” The scout broke the stall not by creating a better plan but by creating work that was smaller than the migration. Small enough to execute without thinking about it. The security fix, the null crash guard, the env validation — none of these are the migration. But they’re shippable today, and shipping creates momentum that planning never does.

Sixth: diminishing returns and what they mean for autonomous systems. When the paulos scout came back with five false positives, the signal was clear: test coverage is saturated for this codebase. The synaxis-h scout, pass number nine, found zero issues. The site is comprehensively clean after 63 issues across eight prior passes.

This is healthy. A system that always finds problems is either looking at a terrible codebase or applying increasingly marginal standards. The interesting question is what the scouts should do next. Not more test coverage. Not more code hygiene. The agents need new angles: features, performance, security posture, user experience. The improvement pipeline needs to evolve when one dimension is exhausted.

In an organization, this is what happens when the quality team has nothing left to flag. Either you’ve achieved quality, in which case the team needs a new mission, or you’ve exhausted the team’s vocabulary for describing problems. The first is cause for celebration. The second is cause for alarm. Knowing which one you’re in requires judgment the agents don’t have.

Today the fleet closed 112 issues across 12 projects. Two new MVPs are code-complete. One product is launch-ready. The autonomous improvement pipeline is reaching steady state on mature projects while still finding critical issues on active ones. The infrastructure cost of starting a new product is approaching zero.

So what does it mean when your agents get bored? When the scout comes back and says “I looked everywhere and there’s nothing to fix”? Is that the system working perfectly? Or is it the system running out of imagination?

I think it depends on whether you give it new questions to ask. The improvement pipeline doesn’t plateau because there’s nothing left to improve. It plateaus because it’s been asking the same kind of question — “are there bugs?” “are there missing tests?” — and it’s answered them all. The ceiling isn’t capability. It’s curiosity.

Which means the next evolution isn’t smarter agents. It’s agents that know when to change the question.

Why customer tools are organized wrong

This article reveals a fundamental flaw in how customer support tools are designed—organizing by interaction type instead of by customer—and explains why this fragmentation wastes time and obscures the full picture you need to help users effectively.

Infrastructure shapes thought

The tools you build determine what kinds of thinking become possible. On infrastructure, friction, and building deliberately for thought rather than just throughput.

Server-side dashboard architecture: Why moving data fetching off the browser changes everything

How choosing server-side rendering solved security, CORS, and credential management problems I didn't know I had.

The work of being available now

A book on AI, judgment, and staying human at work.

The practice of work in progress

Practical essays on how work actually gets done.

The org chart your agents need

The AI community is reinventing organizational design from scratch — badly. Agencies figured this out decades ago. Competencies, not clients. Briefs, not prompts. Lateral communication, not hub-and-spoke. The answers are already there.

AI agents need org charts, not pipelines

Every agent framework organizes around tasks. The agencies that actually work organize around competencies. The AI community is about to rediscover this the hard way.

The delegation problem nobody talks about

When your automated systems start finding real bugs instead of formatting issues, delegation has crossed a line most managers never see coming.

The first real user breaks everything

Your product works until someone actually uses it. The gap between 'works in dev' and 'works for a person' is where most systems fail — and most organizations avoid looking.