What production teaches you about done
Hey, it’s Paul. January 29th, 2026.
So here’s something I’ve been sitting with. You finish a piece of work. You ship it. Everything looks good. And then production starts teaching you that you weren’t actually done.
This keeps happening. Not because of sloppiness—because of how systems actually mature. You fix the critical path, the thing users hit most often, and you ship it. Then someone takes a different route through your system, and suddenly you discover all the roads you forgot to pave.
I had this experience this week where I’d done what felt like a complete refactor. Moved a bunch of hardcoded behavior into configuration files. Clean separation. Elegant, even. The main flow worked perfectly. Users could sign up, content moved through stages, everything hummed along.
Then someone tried to approve content and advance it to the next stage in a single action. Silent failure. No errors. No exceptions. The system just… didn’t do what it was supposed to do. Content sat there, marked as pending, forever. Logs looked fine. Database looked fine. Nothing happened.
When something throws an error, you can debug it. You have a stack trace, an error message, a place to start. But when a system succeeds at doing nothing? That’s a different problem entirely. The operation completed. It just didn’t trigger the thing that was supposed to happen next.
The bug was subtle. The code set the status to “approved,” then immediately set it to “pending.” By the time the callback checked whether to queue the next job, the status wasn’t “approved” anymore. The precondition was never true. So the job never queued. And nobody knew.
This is worse than a crash. A crash demands attention. Silence lets you think everything’s fine while work quietly piles up, stuck in limbo.
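Here's a minimal sketch of how that kind of bug can look. It isn't the real code: the names (Content, queue_next_job_if_approved, the "next_stage" job) are made up for illustration. The first function reproduces the overwrite, and the second shows one possible way to make the no-op loud instead of silent.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    # Hypothetical model: a status string plus the jobs queued for this item.
    status: str = "draft"
    queued_jobs: list[str] = field(default_factory=list)


def queue_next_job_if_approved(content: Content) -> None:
    # Callback: queues the next job only if the status is "approved" right now.
    if content.status == "approved":
        content.queued_jobs.append("next_stage")


def approve_and_advance_buggy(content: Content) -> None:
    content.status = "approved"
    content.status = "pending"  # overwritten before the callback looks
    queue_next_job_if_approved(content)  # precondition is false, nothing queues


def approve_and_advance_checked(content: Content) -> None:
    content.status = "approved"
    queue_next_job_if_approved(content)  # runs while the precondition holds
    if not content.queued_jobs:
        # Turn "succeeded at doing nothing" into an error someone will see.
        raise RuntimeError("approved content did not queue a next-stage job")
    content.status = "pending"


stuck = Content()
approve_and_advance_buggy(stuck)

moved = Content()
approve_and_advance_checked(moved)
```

In the buggy version nothing raises and nothing is logged, yet stuck.queued_jobs stays empty. The checked version asserts the outcome it expects, so a missed job becomes an exception rather than content sitting in limbo.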
I started thinking about why this happens, and I realized it’s the nature of migration work. You can’t test every path through a system before you ship. You test the critical paths, the ones you know about, the ones users hit constantly. But systems have all these alternate routes—the “approve and advance” button, the “redo this specific piece” action, the edge cases that only matter sometimes.
Half the system was reading from configuration. Half was still hardcoded. And users couldn’t tell which was which. The signup flow worked beautifully. The approve-and-advance flow didn’t. Same system, different paths, completely different behavior.
Incomplete migrations are worse than no migrations. If everything is hardcoded, at least it’s consistent. You know where to look when something breaks. But when half your system follows one pattern and half follows another, you’ve created a maze. You can’t reason about behavior without checking which path you’re on.
This connects to something I keep circling back to: the difference between constraints that help and constraints that hurt.
I was working on a newsletter system this week. The template requires that featured content live in a separate file. You can’t just write it inline where you’re writing the metadata. At first, this felt annoying. Why can’t I just put the content where I’m already working?
But then I realized what the constraint prevents. If you could write content in two places—inline in the newsletter metadata, or in a separate content file—you’d eventually have both. And then you’d have to figure out which one is authoritative. The constraint makes that impossible. Content lives in one place. Period. There’s never ambiguity about where to look.
Compare that to a different pattern I saw: a fallback that fills in a default when configuration is missing. Looks helpful, right? “If the config doesn’t specify which job to run, use this sensible default.” But here’s what that actually does: it hides the fact that your configuration is incomplete. The system keeps working, so you never notice the gap. Then months later, someone changes the default, and suddenly behavior shifts in ways nobody expected.
The fallback was permissive. It said “I’ll figure it out for you.” The newsletter template was restrictive. It said “You have to be explicit.” The permissive approach feels easier in the moment but creates mysterious failures later. The restrictive approach requires understanding the constraint upfront but makes errors obvious when they happen.
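To make the contrast concrete, here's a rough sketch of both styles of lookup. Everything in it is an assumption for illustration: the stage names, the job names, and the DEFAULT_JOB value are hypothetical, not taken from the system described above.

```python
class MissingConfigError(KeyError):
    """Raised when a stage has no explicitly configured job."""


DEFAULT_JOB = "generic_job"  # hypothetical default, for illustration only


def job_for_stage_permissive(config: dict[str, str], stage: str) -> str:
    # Fills the gap silently: an incomplete config behaves like a complete one.
    return config.get(stage, DEFAULT_JOB)


def job_for_stage_strict(config: dict[str, str], stage: str) -> str:
    # Refuses to guess, so a missing entry surfaces the first time it is hit.
    try:
        return config[stage]
    except KeyError:
        raise MissingConfigError(f"no job configured for stage {stage!r}") from None


config = {"draft": "render_job"}  # "review" was never configured
```

With this config, the permissive lookup returns the default for "review" and nobody learns the entry is missing. The strict lookup raises on the same call, which is annoying once and informative every time after.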
I’ve been thinking about this as a design question: when should systems discover what they need, and when should they demand explicit declaration?
There’s a pattern I’ve been using: try explicit configuration first, fall back to auto-discovery if nothing’s specified. So if you tell the system exactly which projects to include, it uses your list. If you don’t specify anything, it scans for projects that match certain criteria and includes those.
This feels elegant. You get control when you want it, convenience when you don’t. But it has its own complexity: now there are two sources of truth. If something unexpected happens, you have to figure out which source is active. Did the system use your explicit list, or did it discover something you didn’t expect?
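One way to keep that question answerable is to have the resolver report which source it used. This is a sketch under assumptions, not the actual implementation: the marker file name "project.toml", the Resolution type, and the directory layout are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass(frozen=True)
class Resolution:
    projects: list[str]
    source: str  # "explicit" or "discovered"


def resolve_projects(
    root: Path,
    explicit: Optional[Sequence[str]] = None,
    marker: str = "project.toml",  # hypothetical marker file
) -> Resolution:
    # Explicit configuration wins. Note that an empty list counts as "not specified".
    if explicit:
        return Resolution(list(explicit), "explicit")
    # Otherwise discover: any direct subdirectory of root containing the marker file.
    found = sorted(
        p.name for p in root.iterdir() if p.is_dir() and (p / marker).is_file()
    )
    return Resolution(found, "discovered")
```

Logging the source field alongside the project list means that when something unexpected shows up, the first question (was this my list or the scan?) already has an answer.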
Git handles this well. It discovers what exists: git status reports new and changed files without you listing them. But commits are explicit. You choose what gets staged and what goes in. The boundary is clear: discovery handles “what exists,” explicit declaration handles “what matters.”
That’s the distinction I keep coming back to. Discovery is good for facts about the world. What files are in this directory? What projects have a certain structure? Those are discoverable. But intent—what should be included, what matters, what you’re trying to accomplish—that needs to be explicit. You can’t discover intent. You have to declare it.
There’s another pattern I’ve been wrestling with: when coupling is useful versus when it’s fragile.
Standard advice says avoid coupling. Don’t create dependencies between systems. Copy-paste is better than the wrong abstraction. Duplication is preferable to tight connections that break when one thing changes.
But I’ve been sharing resources across projects—linking to a single source of truth for certain definitions, so when I improve something in one place, every project gets the improvement automatically. That’s coupling. Twenty-nine projects now depend on one directory existing in one location. If I move it, twenty-nine things break.
For personal infrastructure where I’m the only user, this is fine. I control both ends. I know where everything lives. The coupling gives me leverage: one improvement, twenty-nine beneficiaries.
But at what scale does this become a problem? What signals indicate you’ve crossed the line from “useful shared resource” to “fragile dependency that’s going to bite you”?
I don’t have a clean answer. I think it depends on how many people touch the system, how often the shared resource changes, and how bad the failure is when the dependency breaks. For personal tools, coupling is cheap. For team infrastructure, it gets expensive fast.
Something else I noticed this week. A project I hadn’t touched in months needed a dependency update. The SDK had shipped 125 versions since I last looked at it. The language itself had a new major release, but the libraries I depend on don’t support it yet.
This is the ecosystem maturity problem. Fast-moving protocols and new language versions create work for everyone downstream. The protocol is still finding its shape, which is good for the protocol but expensive for projects using it. You’re constantly evaluating: do I upgrade now and deal with breaking changes, or do I wait and fall further behind?
I ended up pinning to a version ceiling. When the next major version ships, I want to choose when to upgrade, not have it happen automatically. That’s defensive, but it creates a maintenance obligation. I have to actively monitor releases. I have to evaluate the migration path. I have to decide when the cost of staying behind exceeds the cost of upgrading.
There’s no way to avoid this work. You either do it continuously, staying current with every release, or you do it in bursts, catching up after long gaps. Neither approach is free.
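The monitoring side of a version ceiling can be sketched in a few lines. This assumes plain dotted numeric versions like "1.25.0" and does not handle pre-release tags. The function names and version numbers are invented for illustration, and a real project would more likely express the ceiling in its package manager's own constraint syntax.

```python
def parse_version(text: str) -> tuple[int, ...]:
    # Plain dotted numeric versions only, e.g. "1.25.0" -> (1, 25, 0).
    return tuple(int(part) for part in text.split("."))


def allowed_upgrades(current: str, available: list[str], ceiling: str) -> list[str]:
    """Versions newer than `current` and strictly below `ceiling`, oldest first."""
    low, high = parse_version(current), parse_version(ceiling)
    candidates = [v for v in available if low < parse_version(v) < high]
    return sorted(candidates, key=parse_version)


def blocked_by_ceiling(available: list[str], ceiling: str) -> list[str]:
    """Versions at or above the ceiling: the ones that need a deliberate decision."""
    high = parse_version(ceiling)
    blocked = [v for v in available if parse_version(v) >= high]
    return sorted(blocked, key=parse_version)


releases = ["1.3.9", "1.5.0", "1.10.2", "2.0.0"]  # hypothetical release list
safe = allowed_upgrades("1.4.0", releases, ceiling="2.0.0")
needs_review = blocked_by_ceiling(releases, ceiling="2.0.0")
```

Here safe holds the releases that can be taken without crossing the ceiling, and needs_review holds the ones that represent the maintenance obligation: someone has to read the migration notes and decide when to move.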
So where does all this leave me?
Production is where systems teach you what they need to be. You can’t design everything upfront. You can’t anticipate every path through the system. You ship something that works for the critical cases, and then you listen. You watch for silent failures. You notice when constraints help versus hurt. You pay attention to what the system reveals about its own incompleteness.
The question isn’t whether your system is finished. It’s whether you’re listening when it tells you it isn’t.
That’s the work. Not just building, but noticing. Not just shipping, but learning what you shipped doesn’t yet do.
And maybe that’s okay. Maybe systems aren’t supposed to be complete. Maybe they’re supposed to be conversations—between what you intended and what users actually need, between what you built and what production reveals.
The half-built system isn’t a failure state. It’s the natural state. The question is whether you recognize it.