Manual fluency is the prerequisite for agent supervision
You cannot responsibly automate what you cannot do manually. AI agents speed up work for people who already know how to do it. They do not replace the need to learn the work in the first place.
This essay first appeared in my weekly newsletter, Philosopher at Large, where I write once a week about work, learning, and judgment.
Last week I moved my AI pipeline from one project to another. Same code. Same agent architecture. Same everything. It broke in three places within the first hour.
The API key lookup was hard-coded to the first project’s conventions. The codebase map ballooned to 62 megabytes because nobody had set a size limit. A label the pipeline depended on didn’t exist in the new repo, so the entire routing system silently did nothing.
None of those were logic bugs. They were assumption bugs. Things I stopped noticing because they were always true in my environment. I’d built the system, I knew every piece of it, and I still got blindsided by my own invisible assumptions.
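For what an assumption bug looks like in code, here is a minimal, hypothetical sketch. The environment variable name and error handling are my invention for illustration, not the actual pipeline: a lookup that bakes in one project's conventions next to a version that fails loudly when the assumption breaks.

```python
import os

# Hypothetical assumption bug: the key name encodes the first project's
# conventions. In a new project the lookup silently returns None, and
# downstream code "works" until it doesn't.
def get_api_key_fragile():
    return os.environ.get("PROJECT_ALPHA_API_KEY")  # assumption baked in

# Making the assumption explicit turns a silent misfire into a loud failure
# at the point where the invisible assumption actually lives.
def get_api_key(env_var: str = "PROJECT_ALPHA_API_KEY") -> str:
    key = os.environ.get(env_var)
    if key is None:
        raise RuntimeError(
            f"Expected API key in ${env_var}; set it or pass the right env_var."
        )
    return key
```

The fragile version is the one you write when the assumption is always true in your environment; the second is the one you write after the move breaks it.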
Now imagine someone who didn’t build it. Someone who’s never manually triaged an issue, never written a spec by hand, never debugged a failed deployment. They wouldn’t even know where to look. The error messages would be meaningless because the mental model those messages reference doesn’t exist in their head.
That’s the gap we’re racing into. AI agents can now generate complete systems faster than most developers can read the output. Companies are treating agent-assisted development as pure productivity gain. But the gap between what agents can produce and what developers can evaluate is not a training problem or a tooling problem. It’s a judgment problem.
You cannot responsibly automate what you cannot do manually.
Supervision without fluency is not supervision
Steve Yegge, the veteran engineer who spent decades at Amazon and Google, recently described trying Claude Code for the first time. “I used it and was like, ‘oh, I get it. We’re all doomed.’” He wasn’t being dramatic. He was recognizing what happens when the friction that forces understanding disappears.
Two years ago, AI tools were fancy autocomplete. You typed, they suggested, you decided. The human was in the loop on every keystroke. That’s not supervision. That’s collaboration. You couldn’t avoid understanding what was happening because you were doing it alongside the machine.
What changed is autonomy. Current agents take a goal and execute multi-step plans without checking back. They scout your codebase, write specs, implement code, run tests, fix what fails, and commit. All in one pass. The coordination tax dropped to nearly zero.
But the judgment tax didn’t drop at all. Someone still has to decide what to build, whether the output is correct, and when to stop. Except now there’s no friction forcing you to understand what’s happening between the goal and the result. The human nerve endings got severed.
You used to develop intuition about cost and quality through the friction of doing the work. Automate that friction away, and the intuition never forms.
Take spec writing. My pipeline has a prep step where the agent reads an issue, explores the relevant code, and writes a detailed implementation spec. The spec includes file paths, function signatures, acceptance criteria, edge cases. When I review that spec, I’m not reading it like a document. I’m simulating the implementation in my head.
I know what a Hugo partial looks like, so when the spec says “modify the post-teaser partial to add width and height attributes,” I can picture the template, picture the image rendering pipeline, and ask: wait, what about external images that aren’t Hugo resources? Does the fallback path handle this?
Someone without that manual experience reads the same spec and it looks complete. It has file paths. It has acceptance criteria. It mentions edge cases. It checks every box. But they can’t tell the difference between a spec that covers the real complexity and one that covers the obvious complexity while missing the thing that will actually break.
They don’t know which questions to ask because they’ve never hit those walls themselves. The judgment call they can’t make is: “This looks right but I don’t believe it.” That instinct comes from having been wrong in exactly this way before. You can’t install it from a document.
Working is not the same as correct
At OpenAI, roughly 95% of engineers use Codex, often working with fleets of 10 to 20 parallel AI agents. Code reviews that used to take 10 to 15 minutes now take 2 to 3. Sherwin Wu, who leads engineering for OpenAI’s API platform, says “I ship code I don’t read.”
That statement should make you pause. Not because it’s reckless. Wu knows what he’s doing. He’s built the manual fluency that lets him evaluate agent output without reading every line. But that statement, repeated by someone who hasn’t built that fluency, becomes something else entirely.
When you submit a code review approval on agent-generated code you don’t understand, you’re not making an error. You’re performing a ritual. The green checkmark means “I evaluated this and it meets our standards.” If you can’t evaluate it, the checkmark is a false statement.
Everyone downstream treats that checkmark as signal. They make decisions based on it. They skip their own review because yours already happened. That’s what makes it lying. Not the intent to deceive. Most people doing this aren’t malicious. It’s the structural dishonesty. You’re producing an artifact that represents a judgment you did not make. And the entire system around you is built on the assumption that you did.
Working doesn’t mean correct. Working means nobody has exploited the flaw yet. The gap between those two things is where most organizational risk actually lives.
I’ve seen this pattern three times in the last month. First example: two of my projects hit the same failure in the same week. Both had their schema-change tracking fall out of sync because the work was moving faster than the tracking system could handle. People routed around the migration files, applied changes directly, and added guardrails after the fact. The agents were executing correctly. The governance system underneath was built for a tempo that no longer existed.
Nobody caught it because nobody was manually running migrations anymore, so nobody noticed the tracking had drifted.
Second example: I have three separate Hugo templates that all contain the same hardcoded Brevo form URL and Cloudflare Turnstile sitekey. When I set up the first one manually, I understood every line. The second and third were agent-generated from “make it work like the other one.” They work. But they duplicated credentials across three files instead of centralizing them in config, because the agent optimized for “works” not “maintainable.”
I only caught it during a scout pass: a systematic code review I run specifically because I know agent output drifts toward local correctness at the expense of global coherence.
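One check from that kind of scout pass can be sketched in a few lines. Everything here is illustrative rather than my actual tooling: the directory layout, the regular expression, and the heuristic that long quoted literals (form URLs, sitekeys) repeated verbatim across templates are candidates for shared config.

```python
import re
from collections import defaultdict
from pathlib import Path

# Illustrative heuristic: long quoted URLs, or long token-like strings,
# hardcoded in more than one template usually belong in shared config.
LITERAL = re.compile(r'"(https?://[^"]{20,}|[0-9A-Za-z_-]{30,})"')

def find_duplicated_literals(template_dir: str) -> dict[str, list[str]]:
    """Map each suspicious literal to the template files that contain it."""
    seen: dict[str, list[str]] = defaultdict(list)
    for path in Path(template_dir).rglob("*.html"):
        for match in LITERAL.finditer(path.read_text()):
            seen[match.group(1)].append(str(path))
    # Keep only literals that appear in two or more distinct files.
    return {lit: files for lit, files in seen.items() if len(set(files)) > 1}
```

A check like this would have flagged the three copies of the Brevo URL the day the second template was generated, instead of weeks later during a manual review.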
Third example: I have fifteen posts tagged “podcast” that have no audio URL. They were tagged before the automated audio generation hook existed. The hook generates audio on commit for podcast-tagged posts, but these older posts predate it. No agent flagged the inconsistency because each post in isolation looks fine. It has tags, it has content, it renders.
The pattern only becomes visible when you ask a question no agent asks: “Do all podcast-tagged posts actually have audio?” That question comes from knowing the full system, not from reading individual files.
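That whole-system question can be turned into a check. A hedged sketch, assuming YAML-style front matter with `tags:` entries and an `audio:` key; the layout is illustrative, not my real content model:

```python
from pathlib import Path

def posts_missing_audio(content_dir: str) -> list[str]:
    """List podcast-tagged posts whose front matter has no audio key."""
    missing = []
    for post in Path(content_dir).rglob("*.md"):
        text = post.read_text()
        if not text.startswith("---"):
            continue  # no front matter to inspect
        front_matter = text.split("---", 2)[1]
        lines = front_matter.splitlines()
        # Crude front-matter scan: a "- podcast" list item under tags,
        # and any line starting with "audio:".
        tagged_podcast = any(
            "podcast" in line
            for line in lines
            if line.strip().startswith(("tags:", "- "))
        )
        has_audio = any(line.strip().startswith("audio:") for line in lines)
        if tagged_podcast and not has_audio:
            missing.append(str(post))
    return missing
```

Each post passes its local checks; only a scan that holds the tag convention and the audio hook in view at the same time surfaces the drift.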
The AI vampire effect
Simon Willison describes what he calls “the AI vampire.” You use AI to achieve 10x productivity. You work full eight-hour days to impress your employer. The company captures all the value from your enhanced productivity. You receive no proportional compensation increase. You alienate colleagues. You become exhausted.
Yegge calls this the “Dracula effect” and argues that working with AI agents at maximum productivity is physically and mentally draining. He believes companies should expect only about three hours of intensive AI-augmented work per day from engineers; even four hours is more realistic than eight, because the cognitive burden is surprisingly taxing.
Willison puts it this way: “I’ve argued that AI has turned us all into Jeff Bezos, by automating the easy work, and leaving us with all the difficult decisions, summaries, and problem-solving.” Rather than making work easier, AI has transformed every worker into an executive-level decision-maker. That role requires intense mental effort and can only be sustained in short bursts.
But here’s what neither of them says explicitly: that exhaustion only makes sense if you’re evaluating the output. If you’re not evaluating it, if you’re just accepting what the agent produces and passing it along, then you’re not doing the hard work. You’re doing theater.
The exhaustion is the cost of judgment. If you’re not exhausted, you’re probably not judging.
The operating rule
Do it manually first. At least once. Probably three times.
Not because the manual work is valuable. It usually isn’t, and the agent will do it faster. But because the manual pass is where you build the evaluation model. You need to know what “done” looks like, what “almost right” looks like, and what “wrong in a way that passes every automated check” looks like.
Those three things are different, and you can only distinguish them through experience.
After that, automate aggressively. I’m not arguing for Luddism. My entire pipeline runs through agents. Scouting, triaging, spec writing, implementation, QA, review. All of it. But I built every piece of that pipeline by hand first. I triaged hundreds of issues manually before I wrote the triage skill. I wrote dozens of specs before I automated spec generation.
The automation encodes my judgment. If I’d skipped the manual phase, it would encode nothing. Just the shape of judgment without the substance.
Joe McCormick, a principal software engineer at Babylist who lost most of his central vision due to a rare genetic disorder, demonstrates this principle perfectly. He uses Claude Code to build custom Chrome extensions for his specific accessibility needs. He built a Slack image-to-text converter, an AI-powered spell checker, and a link summarization tool in under 25 minutes during a live demonstration.
But here’s what matters: McCormick is a principal engineer. He knows what good code looks like. He knows what a Chrome extension should do. He knows how to evaluate whether the AI-generated solution actually solves his problem. The AI didn’t replace his judgment. It amplified his ability to act on it.
That’s the difference. McCormick can build personal software rapidly because he already knows how to build software. The AI removed the tedious parts. It didn’t remove the need to know what he was building.
The settled thought
We’re not in a speed race. We’re in an evaluation race.
The organizations that win won’t be the ones that automated first. They’ll be the ones whose people can tell the difference between output that’s correct and output that merely looks correct. That skill doesn’t come from prompting. It comes from having done the work manually, slowly, badly at first, until the pattern recognition is in your hands, not just in your tools.
Yegge is right that “engineers are becoming sorcerers.” But sorcery requires knowledge of what you’re summoning. You can’t conjure what you don’t understand. And you can’t evaluate what you’ve never built.
Every time you skip the manual phase, you’re not saving time. You’re borrowing against a judgment debt that compounds quietly until the day it doesn’t. And on that day, you won’t know what went wrong, because you never knew what “right” felt like in the first place.
Agents are useful. They speed up work for people who already know how to do it. They do not replace the need to learn the work in the first place.
Automation without understanding is not productivity. It is abdication.
Further reading
Steve Yegge on AI Agents and the Future of Software Engineering — The Pragmatic Engineer
The AI Vampire — Simon Willison’s Weblog
“Engineers are becoming sorcerers” | The future of software development with OpenAI’s Sherwin Wu — Lenny’s Newsletter
How this visually impaired engineer uses Claude Code to make his life more accessible | Joe McCormick — Lenny’s Newsletter