Paul Welty, PhD AI, WORK, AND STAYING HUMAN


Your biggest problems are the ones running fine

The most dangerous failures in any system — technical or organizational — aren't the ones throwing errors. They're the ones that appear to work perfectly. And they'll keep appearing to work perfectly right up until they don't.

Duration: 7:16 | Size: 6.67 MB


Hey, it’s Paul. Sunday, March 1st, 2026.


Three different projects today. Three completely different domains. And all three surfaced the same problem — something that had been running fine, without errors, without complaints, for weeks or months, turned out to be wrong. Not broken. Wrong. There’s a difference.

Broken is easy. Broken throws an error. Broken stops the build. Broken sends you an alert at 2 AM. You fix broken things because they demand your attention.

Wrong is quiet. Wrong keeps running. Wrong passes your tests, serves your pages, processes your data. Wrong looks exactly like right until the day the context shifts and you finally see what was there the whole time.


Here’s the first one. A blog I run — a Hugo site — had been silently publishing 2,098 raw images. Full-resolution PNGs, DALL-E outputs, WordPress import artifacts. Every one of them was being served to every visitor who happened to request them. No error. No warning. Hugo was doing exactly what it was designed to do: publish everything in the content directory as a web resource.

The images weren’t referenced anywhere. No template called them. No post linked to them. They just… existed. Publicly. Consuming 1.8 gigabytes of build output. And I didn’t notice because the site loaded fine. The pages I looked at rendered correctly. The images I cared about showed up.

The problem was invisible because I was only looking at what I expected to see. The extra 2,098 images weren’t in my mental model of the site, so they weren’t in my attention. They occupied space, consumed bandwidth, slowed builds — but they never produced an error, so they never produced a reason to look.

Once found, the fix took an afternoon. Deleted the orphans, moved the real images into an asset pipeline, cut the output from 1.8 gig to 82 megabytes. A 95% reduction. The system had been carrying twenty times its necessary weight, and the only reason I found it was because I happened to be in the neighborhood for a different task.
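The orphan hunt itself is mechanical: list every image under the content tree, then check whether any markdown file mentions it. A minimal sketch of that check, with hypothetical file names and a throwaway directory standing in for the real site:

```python
# Hypothetical sketch: find image files under a Hugo content/ tree that no
# markdown file references. The layout and file names here are invented
# for illustration, not the actual site's structure.
import pathlib
import tempfile

def find_orphans(content_dir: pathlib.Path) -> set[pathlib.Path]:
    # Collect every image file in the tree.
    exts = {".png", ".jpg", ".jpeg", ".webp"}
    images = {p for p in content_dir.rglob("*") if p.suffix.lower() in exts}
    # An image counts as referenced if its filename appears in any markdown.
    referenced = set()
    for md in content_dir.rglob("*.md"):
        text = md.read_text(encoding="utf-8", errors="ignore")
        for img in images:
            if img.name in text:
                referenced.add(img)
    return images - referenced

# Tiny demo on a disposable tree: one referenced image, one orphan.
root = pathlib.Path(tempfile.mkdtemp())
(root / "post").mkdir()
(root / "post" / "hero.png").write_bytes(b"")
(root / "post" / "dalle-raw.png").write_bytes(b"")
(root / "post" / "index.md").write_text("![hero](hero.png)")

orphans = find_orphans(root)
print(sorted(p.name for p in orphans))  # ['dalle-raw.png']
```

A filename-substring match is crude (it misses images referenced via shortcodes or templates), but as a first pass it surfaces exactly the kind of file nobody is looking at.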


Second one. A content discovery platform I’m building uses AI to score articles for relevance. The scoring worked like this: collect search results, send them all to Claude in one prompt, ask it to return a JSON array with a relevance score for each one. Clean, efficient, one API call.

This worked great for months. Thirty results? Scored cleanly every time. Fifty? Still fine. A little slower, but reliable.

Then a user needed international content. We added multi-country search — seven countries, same queries. Suddenly we’re sending 232 results in a single prompt and asking for a 232-element JSON array back.

It collapsed. Not gracefully. Not with an error you could catch and retry. The model just couldn’t reliably produce a JSON array that large within its output limits. Elements got dropped. Indices misaligned. Scores attached to the wrong articles. The system was still running — no crashes, no timeouts — but the data coming out was wrong.

Here’s what’s important: the batch scoring approach was always architecturally fragile. It embedded an assumption — that the model could produce arbitrarily large structured outputs — that was never true. At thirty results, the assumption happened to hold. At 232, it didn’t. The approach didn’t break at scale. It was broken at every scale. We just couldn’t see it yet.

The fix was to score each result individually. Simpler. More reliable. More expensive per call, but with a model cheap enough that it doesn’t matter. And the new architecture doesn’t have a hidden ceiling. It works at 30 results, 232 results, or 2,000 results, because it never depends on the model holding the whole collection in its head at once.
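The shape of that fix is simple enough to sketch. The `score_with_model` stub below stands in for the real Claude call, which I'm not reproducing here; the point is the structure: one call per article, so no single response ever has to carry a 232-element JSON array.

```python
# Sketch of per-item scoring. score_with_model is a stand-in for the real
# model call (the actual system calls Claude); everything else is the shape
# of the fix: one article in, one score out, per call.
from typing import Callable

def score_with_model(article: dict) -> float:
    # Stub returning a 0-1 relevance score. A real implementation would
    # send a single article to the model and parse a single number back.
    return 1.0 if "relevant" in article["title"].lower() else 0.0

def score_all(
    articles: list[dict],
    scorer: Callable[[dict], float] = score_with_model,
) -> list[dict]:
    # One call per article: no hidden ceiling on collection size, no
    # index misalignment, no dropped elements.
    return [{**a, "score": scorer(a)} for a in articles]

results = score_all([
    {"title": "Relevant industry news"},
    {"title": "Unrelated gossip"},
])
print([r["score"] for r in results])  # [1.0, 0.0]
```

The per-call output is now a single scalar, which is about the easiest thing a model can produce reliably, and failures are isolated: one bad response costs you one score, not the whole batch.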


Third one. I run a development pipeline where AI agents implement features autonomously on separate branches. Each agent works in its own isolated copy of the codebase. They don’t know about each other. They build their feature, test it, submit it.

With one agent, this is perfect. Clean branch, clean merge, no conflicts.

Today I ran two agents on two different features. Both features needed to modify the same six files — configuration, data models, rendering, tests. Each agent did its work correctly. Each branch was internally consistent. And when it came time to merge both into the main codebase, there were thirty conflicts across six files.

Nobody did anything wrong. Both agents produced correct code. The problem was that the system assumed independence — that features could be developed in parallel without coordination. And at one agent, that assumption held trivially. At two agents touching the same code, it broke.

The fix was manual integration. Read both diffs, understand what each agent changed, apply them in sequence. Not elegant. Not scalable. But it worked for today. The deeper fix — which I haven’t built yet — would require the agents to be aware of each other’s work, or a smarter merge strategy that understands semantic conflicts, not just textual ones.
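Even before a smarter merge strategy exists, the cheapest guardrail is to check for contention up front: compare the file sets each feature is expected to touch and refuse to run the agents in parallel if they overlap. The plan dictionaries below are hypothetical; the real pipeline doesn't do this check yet.

```python
# Hedged sketch: detect file-level contention between planned features
# before launching agents in parallel. The feature names and file lists
# are invented for illustration.
def contention(plans: dict[str, set[str]]) -> dict[frozenset, set[str]]:
    # Return every pair of features whose planned file sets overlap,
    # mapped to the files they share.
    conflicts = {}
    names = list(plans)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = plans[a] & plans[b]
            if shared:
                conflicts[frozenset({a, b})] = shared
    return conflicts

plans = {
    "feature-a": {"config.py", "models.py", "render.py"},
    "feature-b": {"models.py", "render.py", "tests.py"},
}
print(contention(plans))
# {frozenset({'feature-a', 'feature-b'}): {'models.py', 'render.py'}}
```

This only catches textual overlap, not semantic conflict, but it turns the independence assumption from an invisible default into an explicit precondition you test before dispatching work.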


So here’s the pattern. In each case, the problem existed from day one. The orphaned images were published from the first build. The batch scoring assumption was fragile from the first prompt. The independence assumption in the merge pipeline was wrong from the first agent. None of these were introduced by a recent change. None of them were regressions. They were original sins — design decisions that encoded assumptions about scale, and scale hadn’t yet arrived to test them.

This is the thing that makes me uncomfortable. Because if three separate systems, built at different times for different purposes, all harbored silent problems that only became visible under new conditions — how many more are sitting in systems I’m not currently stress-testing?

The answer, obviously, is: a lot. And not just in my systems. In yours too.

Every organization has processes that work at current load. Approval chains that handle current volume. Communication patterns that function at current team size. Reporting structures that make sense at current complexity. And embedded in every one of those processes is a set of assumptions about scale that nobody’s written down and nobody’s tested, because there’s been no reason to. Everything’s running fine.

Running fine is the most dangerous status report in any system. It means nobody’s looking. And nobody’s looking because there’s no signal telling them to look. The signal only arrives when the assumption breaks, and by then you’re not debugging a design decision — you’re fighting a fire.


I don’t have a clean prescription for this. You can’t stress-test everything preemptively. You’d never ship anything. But I think there’s a practice worth cultivating — a habit of asking, when something works: what assumption is this depending on, and what would make that assumption false?

Not as a paranoid exercise. As a genuine curiosity. The batch scoring worked because the result set was small. What if it weren’t? The orphaned images didn’t matter because nobody checked the build size. What if the build got slow? The single-agent pipeline merged cleanly because there was no contention. What if there were two?

You won’t catch everything. But you’ll catch the ones that matter, because you’ll start seeing the assumptions instead of seeing through them.

The things running fine in your system right now aren’t neutral. They’re either genuinely sound or they’re wrong in a way you can’t see yet. And you won’t know which until something changes.

The question is whether you’ll be the one who changes it — on your terms, at your pace — or whether the change finds you.

That’s it for today. Talk to you tomorrow.
