
Dev reflection - February 20, 2026



Hey, it’s Paul. Friday, February 20, 2026.

I want to talk about the difference between execution and verification. Because something happened this week that made the distinction painfully clear, and I think it matters far beyond software.

Here’s the setup. I ran 87 tasks through automated agents in a single day across five different projects. Features shipped. Code merged. Test counts jumped by nearly 200. On paper, it was one of the most productive days I’ve ever had. And then a real person — Stacy — sat down to actually use one of the products. And things broke. Not because the code was wrong, exactly. The code did what it was supposed to do. But switching workspaces silently failed because nobody had wrapped the call in error handling. Team settings showed random identifier strings instead of people’s names. A database constraint was missing a single value, so an entire feature just… didn’t work.

The agents that built those features executed flawlessly against the specifications they were given. They just couldn’t see what a human clicking through the interface would experience. And that gap — between code that works in isolation and software that works in context — that’s not a technical problem. That’s an epistemological one.


There’s a concept in philosophy of science that I keep coming back to. Pierre Duhem and W.V.O. Quine both argued, in different ways, that you can never test a single hypothesis in isolation. Every test carries with it a web of background assumptions — about your instruments, your environment, your interpretation of results. When something fails, you don’t automatically know which part of the web broke.

That’s exactly what’s happening here, but with work instead of science. An agent writes a feature. The feature passes its tests. But the tests themselves carry assumptions — that the database has certain constraints, that the user arrives from a certain path, that other features behave in certain ways. The agent can’t see the web. It sees its task.

And honestly? This is how most knowledge work has always operated. We’ve just been able to hide it because the person writing the code was also the person using the product, or at least close enough to it. The background assumptions were carried in someone’s head, unarticulated but present. When you separate execution from the person who holds the context, those assumptions become invisible. Not wrong. Invisible.

This is why handoffs fail in organizations. This is why consultants deliver technically correct recommendations that nobody implements. This is why new hires can follow every process document perfectly and still miss the point. The explicit instructions are never the whole story. The web of assumptions matters as much as the task.


So here’s the second thing I noticed. When I looked at what the agents handled well versus what still required me, the pattern was clean. Almost too clean.

Agents crushed mechanical work. Consolidating duplicate code structures, replacing repetitive patterns with helpers, adding timeouts, writing tests, deleting dead files, compressing images. Anything with a clear start state, a clear end state, and a well-defined transformation between them. They also shipped greenfield features — notification preferences, feedback tables, demo routes — as long as the spec was tight.
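
To make "well-defined transformation" concrete, here's a minimal sketch of the kind of change an agent handles reliably. The helper and the endpoint are hypothetical stand-ins, not code from any of these projects: every bare fetch call gets routed through one function that enforces a timeout.

```typescript
// Hypothetical example of a mechanical refactor: every bare fetch()
// call gets replaced with one helper that adds a timeout.
// Clear start state, clear end state, nothing to judge.

async function fetchWithTimeout(
  url: string,
  options: RequestInit = {},
  timeoutMs = 10_000,
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Before: const res = await fetch("/api/workspaces");
// After:  const res = await fetchWithTimeout("/api/workspaces");
```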

But every single production bug required me. Not because the bugs were hard. They weren’t. A missing try-catch. A database query that only fetched one user’s email. A constraint enum that was missing a value. Trivial fixes, five minutes each. The difficulty wasn’t in the fix. It was in knowing the fix was needed. You had to be Stacy, sitting in front of the screen, trying to switch workspaces and watching nothing happen.
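
For a sense of what "trivial fix" means here, this is roughly the shape of the workspace-switching repair. The endpoint and messages are hypothetical, but the point stands: the only thing missing was a catch block that surfaces the failure instead of swallowing it.

```typescript
// Hypothetical sketch of the five-minute fix. The original call had no
// error handling, so a failed request meant the UI silently did nothing.
async function switchWorkspace(workspaceId: string): Promise<void> {
  try {
    const res = await fetch("/api/workspaces/switch", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ workspaceId }),
    });
    if (!res.ok) throw new Error(`switch failed: ${res.status}`);
    window.location.reload(); // reflect the new workspace in the UI
  } catch (err) {
    // The missing piece: surface the failure instead of swallowing it.
    console.error("workspace switch failed", err);
    alert("Couldn't switch workspaces. Please try again.");
  }
}
```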

This maps onto a distinction I think about a lot in organizational work. There’s a difference between tasks that are complicated and tasks that are complex. Complicated tasks have many steps but are ultimately predictable — you can write them down, decompose them, parallelize them. Complex tasks involve emergent behavior, feedback loops, things that only become visible when the whole system is running. Dave Snowden’s Cynefin framework draws this line, and it’s useful here.

The agents are spectacular at complicated work. They might never be good at complex work. Not because of some limitation in AI — maybe that gets solved, maybe it doesn’t — but because complex work requires being embedded in the context. You have to be the user. You have to hold the whole web of assumptions in your head and notice when one of them breaks.


The third thing. I consolidated two sets of scheduled processes this week into single pipelines. Instead of three independent timers firing three independent scripts, I now have one orchestrator that runs the full sequence: scan, search, generate, brief. Same thing on another project: aggregate logs, synthesize, generate audio, publish. One command, full flow.

This felt like progress. And it is. The dependencies are explicit now. The flow is visible. There’s one place to reason about the daily cycle instead of three.

But I also created something I didn’t have before: a single point of failure. When three scripts run independently, one can fail without affecting the others. When they’re a pipeline, a failure in step two means steps three and four don’t happen. And right now, I don’t have good answers for what happens when step two fails. Does the pipeline retry? Skip? Use stale output? I haven’t built that yet.
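
If I do build it, it will probably look something like this. A minimal sketch, assuming each step can declare its own failure policy; none of this exists yet, and the step names are just the ones from the brief pipeline above.

```typescript
// Sketch of a pipeline where each step declares what should happen when
// it fails: retry once, skip it, or fall back to stale output.
// This is a design I'm considering, not code that exists.

type OnFailure = "retry" | "skip" | "use-stale";

interface Step {
  name: string;
  onFailure: OnFailure;
  run: () => Promise<void>;          // does the work, writes its output
  staleOutput?: () => Promise<void>; // e.g. reuse yesterday's file
}

async function runPipeline(steps: Step[]): Promise<void> {
  for (const step of steps) {
    try {
      await step.run();
    } catch (err) {
      console.error(`step "${step.name}" failed`, err);
      if (step.onFailure === "retry") {
        await step.run(); // one retry; if this throws, the pipeline stops
      } else if (step.onFailure === "use-stale" && step.staleOutput) {
        await step.staleOutput(); // fall back to stale output, keep going
      } else if (step.onFailure === "skip") {
        continue; // downstream steps run without this one's output
      } else {
        throw err; // no sensible fallback, so stop here
      }
    }
  }
}

// Usage: runPipeline([scan, search, generate, brief]); each step declares
// its own failure policy instead of inheriting the default of stopping
// everything downstream.
```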

This is a pattern I’ve seen in every organization I’ve ever worked with. Consolidation creates clarity and fragility at the same time. You can see the whole flow, which is great. But you’ve also coupled things that used to be independent. The org chart redesign that puts everything under one VP. The platform migration that moves twelve services into one. The workflow automation that chains six steps into a single trigger. Every one of these trades resilience for legibility.

The question isn’t whether to consolidate. It’s whether you’ve designed for graceful degradation. And most people — most organizations — don’t think about degradation until something degrades.


So here’s where I’ve landed. My role has shifted. Meaningfully. I’m not writing code most of the day anymore. I’m deciding what work to do, configuring how agents execute it, and then verifying the results. Orchestration and verification. The execution itself is handled.

And you’d think that would make the job easier. In some ways it does. But in other ways it’s harder, because the skills required are different. When I was writing code, I understood my own decisions. I could trace my logic. When an agent writes code and I’m reviewing it, I have to reconstruct its reasoning. I have to understand not just what the code does but what the agent thought it was doing and whether those assumptions hold. Debugging someone else’s thinking is harder than debugging your own.

This is the part that I think generalizes. As more knowledge work gets automated — not just software, but analysis, writing, research, design — the human role shifts from doing the work to verifying the work. And verification is a skill that almost nobody has been trained for. We’ve been trained to produce. We’ve been trained to execute. We have not been trained to look at output we didn’t create and assess whether it’s actually good.

Think about what that means for how we hire, how we train, how we evaluate performance. The person who can write the best report isn’t necessarily the person who can tell whether an AI-generated report is trustworthy. The person who writes the best code isn’t necessarily the person who can catch the subtle assumption violation in code an agent wrote. These are different muscles.


I got 87 tasks done in a day. That’s real. That matters. But the work that actually moved the needle — testing with a real user, catching the invisible failures, deciding what to build next, writing the product vision document that no agent could write because it required judgment about what matters — that work took just as long as it always has. Maybe longer, because now I’m also managing the agents.

The leverage is enormous. But leverage doesn't eliminate the load. It moves it. And if you're not paying attention to where it moved, you'll optimize for the wrong thing.

So here’s the question I’m sitting with: if the future of work is orchestration and verification rather than execution, what does mastery look like? Because I don’t think we know yet. And I don’t think the answer is “learn to write better prompts.” I think it’s something closer to learning to see what’s missing. Which, if you think about it, has always been the hardest skill. We just used to be able to hide from it.

We can’t hide anymore.

