What production teaches you about done

Hey, it’s Paul. January 29th, 2026.

So here’s something I’ve been sitting with. You finish a piece of work. You ship it. Everything looks good. And then production starts teaching you that you weren’t actually done.

This keeps happening. Not because of sloppiness—because of how systems actually mature. You fix the critical path, the thing users hit most often, and you ship it. Then someone takes a different route through your system, and suddenly you discover all the roads you forgot to pave.

I had this experience this week where I’d done what felt like a complete refactor. Moved a bunch of hardcoded behavior into configuration files. Clean separation. Elegant, even. The main flow worked perfectly. Users could sign up, content moved through stages, everything hummed along.

Then someone tried to approve content and advance it to the next stage in a single action. Silent failure. No errors. No exceptions. The system just… didn’t do what it was supposed to do. Content sat there, marked as pending, forever. Logs looked fine. Database looked fine. But nothing happened.

When something throws an error, you can debug it. You have a stack trace, an error message, a place to start. But when a system succeeds at doing nothing? That’s a different problem entirely. The operation completed. It just didn’t trigger the thing that was supposed to happen next.

The bug was subtle. The code set the status to “approved,” then immediately set it to “pending.” By the time the callback checked whether to queue the next job, the status wasn’t “approved” anymore. The precondition was never true. So the job never queued. And nobody knew.

This is worse than a crash. A crash demands attention. Silence lets you think everything’s fine while work quietly piles up, stuck in limbo.
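Here’s roughly what that kind of bug looks like, as a minimal sketch. Everything in it (Content, on_saved, the two approve_and_advance functions, the "next_stage" job name) is made up for illustration and isn’t the real code. The second function shows one possible way to make the silence loud, assuming the callback can run before the status moves on.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    # Hypothetical model: a status plus a list standing in for a job queue.
    status: str = "draft"
    queued_jobs: list = field(default_factory=list)


def on_saved(content: Content) -> None:
    """Callback: queue the next stage only if the content is approved."""
    if content.status == "approved":
        content.queued_jobs.append("next_stage")


def approve_and_advance_buggy(content: Content) -> None:
    content.status = "approved"
    content.status = "pending"  # overwrites the approval before the callback runs
    on_saved(content)           # precondition is false, so nothing is queued


def approve_and_advance_checked(content: Content) -> None:
    content.status = "approved"
    on_saved(content)           # callback sees the approval and queues the job
    content.status = "pending"  # now waiting on the next stage
    if not content.queued_jobs:
        # Turn "succeeded at doing nothing" into a failure someone will see.
        raise RuntimeError("approved content did not queue a next-stage job")
```

The buggy version returns normally and leaves the queue empty, which is exactly why the logs look fine. The checked version asserts the outcome it was supposed to produce, so a missing job becomes an exception instead of a quiet backlog.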

I started thinking about why this happens, and I realized it’s the nature of migration work. You can’t test every path through a system before you ship. You test the critical paths, the ones you know about, the ones users hit constantly. But systems have all these alternate routes—the “approve and advance” button, the “redo this specific piece” action, the edge cases that only matter sometimes.

Half the system was reading from configuration. Half was still hardcoded. And users couldn’t tell which was which. The signup flow worked beautifully. The approve-and-advance flow didn’t. Same system, different paths, completely different behavior.

An incomplete migration can be worse than no migration at all. If everything is hardcoded, at least it’s consistent. You know where to look when something breaks. But when half your system follows one pattern and half follows another, you’ve created a maze. You can’t reason about behavior without checking which path you’re on.

This connects to something I keep circling back to: the difference between constraints that help and constraints that hurt.

I was working on a newsletter system this week. The template requires that featured content live in a separate file. You can’t just write it inline where you’re writing the metadata. At first, this felt annoying. Why can’t I just put the content where I’m already working?

But then I realized what the constraint prevents. If you could write content in two places—inline in the newsletter metadata, or in a separate content file—you’d eventually have both. And then you’d have to figure out which one is authoritative. The constraint makes that impossible. Content lives in one place. Period. There’s never ambiguity about where to look.

Compare that to a different pattern I saw: a fallback that fills in a default when configuration is missing. Looks helpful, right? “If the config doesn’t specify which job to run, use this sensible default.” But here’s what that actually does: it hides the fact that your configuration is incomplete. The system keeps working, so you never notice the gap. Then months later, someone changes the default, and suddenly behavior shifts in ways nobody expected.

The fallback was permissive. It said “I’ll figure it out for you.” The newsletter template was restrictive. It said “You have to be explicit.” The permissive approach feels easier in the moment but creates mysterious failures later. The restrictive approach requires understanding the constraint upfront but makes errors obvious when they happen.
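The difference is small in code, which is part of why the permissive version is tempting. This is a sketch under assumptions: the config is a plain dict from stage name to job name, and DEFAULT_JOB, MissingConfigError, and both function names are hypothetical.

```python
DEFAULT_JOB = "generic_job"  # hypothetical "sensible default"


class MissingConfigError(KeyError):
    """Raised when a stage has no job configured."""


def job_for_stage_permissive(config: dict, stage: str) -> str:
    # Keeps working when the entry is missing, so the gap is never noticed.
    return config.get(stage, DEFAULT_JOB)


def job_for_stage_strict(config: dict, stage: str) -> str:
    # Refuses to guess: an incomplete config fails at the point of use.
    try:
        return config[stage]
    except KeyError:
        raise MissingConfigError(f"no job configured for stage {stage!r}") from None
```

With a config that only knows about a "review" stage, the permissive lookup for "archive" quietly returns the default, while the strict lookup raises and names the missing stage.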

I’ve been thinking about this as a design question: when should systems discover what they need, and when should they demand explicit declaration?

There’s a pattern I’ve been using: try explicit configuration first, fall back to auto-discovery if nothing’s specified. So if you tell the system exactly which projects to include, it uses your list. If you don’t specify anything, it scans for projects that match certain criteria and includes those.

This feels elegant. You get control when you want it, convenience when you don’t. But it has its own complexity: now there are two sources of truth. If something unexpected happens, you have to figure out which source is active. Did the system use your explicit list, or did it discover something you didn’t expect?
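One way to keep the two sources from blurring together is to have the result say which one was used. A minimal sketch, assuming projects are subdirectories and that a marker file identifies them; the "project.toml" marker, Selection, and select_projects are all invented for this example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Selection:
    projects: list
    source: str  # "explicit" or "discovered"


def select_projects(root: Path, explicit: Optional[list] = None,
                    marker: str = "project.toml") -> Selection:
    # An explicit list always wins; discovery never gets mixed into it.
    if explicit:
        return Selection(projects=list(explicit), source="explicit")
    # Otherwise scan for subdirectories containing the marker file.
    found = sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and (p / marker).exists()
    )
    return Selection(projects=found, source="discovered")
```

Logging the source field next to the project list answers the "which source was active?" question without any detective work.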

Git handles this well. It discovers files on its own: you don’t have to list every file in your project for it to notice what’s new or changed. But commits are explicit. Nothing goes in until you stage it. The boundary is clear: discovery handles “what exists,” explicit declaration handles “what matters.”

That’s the distinction I keep coming back to. Discovery is good for facts about the world. What files are in this directory? What projects have a certain structure? Those are discoverable. But intent—what should be included, what matters, what you’re trying to accomplish—that needs to be explicit. You can’t discover intent. You have to declare it.

There’s another pattern I’ve been wrestling with: when coupling is useful versus when it’s fragile.

Standard advice says avoid coupling. Don’t create dependencies between systems. Copy-paste is better than the wrong abstraction. Duplication is preferable to tight connections that break when one thing changes.

But I’ve been sharing resources across projects—linking to a single source of truth for certain definitions, so when I improve something in one place, every project gets the improvement automatically. That’s coupling. Twenty-nine projects now depend on one directory existing in one location. If I move it, twenty-nine things break.

For personal infrastructure where I’m the only user, this is fine. I control both ends. I know where everything lives. The coupling gives me leverage: one improvement, twenty-nine beneficiaries.

But at what scale does this become a problem? What signals indicate you’ve crossed the line from “useful shared resource” to “fragile dependency that’s going to bite you”?

I don’t have a clean answer. I think it depends on how many people touch the system, how often the shared resource changes, and how bad the failure is when the dependency breaks. For personal tools, coupling is cheap. For team infrastructure, it gets expensive fast.

Something else I noticed this week. A project I hadn’t touched in months needed a dependency update. The SDK had shipped 125 versions since I last looked at it. The language itself had a new major release, but the libraries I depend on don’t support it yet.

This is the ecosystem maturity problem. Fast-moving protocols and new language versions create work for everyone downstream. The protocol is still finding its shape, which is good for the protocol but expensive for projects using it. You’re constantly evaluating: do I upgrade now and deal with breaking changes, or do I wait and fall further behind?

I ended up pinning to a version ceiling. When the next major version ships, I want to choose when to upgrade, not have it happen automatically. That’s defensive, but it creates a maintenance obligation. I have to actively monitor releases. I have to evaluate the migration path. I have to decide when the cost of staying behind exceeds the cost of upgrading.
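A version ceiling is just a half-open range: at or above a floor, strictly below the next major version. In pip-style requirement syntax that’s a specifier like `>=1.0,<2.0`. The sketch below shows the same rule in plain Python, assuming simple dotted numeric versions; the function names and version numbers are illustrative, and real tools also handle pre-releases and other formats that this ignores.

```python
def parse_version(text: str) -> tuple:
    # Assumes plain dotted integers such as "1.125.0".
    return tuple(int(part) for part in text.split("."))


def within_ceiling(candidate: str, floor: str, ceiling: str) -> bool:
    """True if floor <= candidate < ceiling."""
    return parse_version(floor) <= parse_version(candidate) < parse_version(ceiling)
```

Under this rule, any 1.x release passes a check against a floor of "1.0.0" and a ceiling of "2.0.0", and "2.0.0" itself does not, so the jump to the next major version stays a deliberate choice.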

There’s no way to avoid this work. You either do it continuously, staying current with every release, or you do it in bursts, catching up after long gaps. Neither approach is free.

So where does all this leave me?

Production is where systems teach you what they need to be. You can’t design everything upfront. You can’t anticipate every path through the system. You ship something that works for the critical cases, and then you listen. You watch for silent failures. You notice when constraints help versus hurt. You pay attention to what the system reveals about its own incompleteness.

The question isn’t whether your system is finished. It’s whether you’re listening when it tells you it isn’t.

That’s the work. Not just building, but noticing. Not just shipping, but learning what you shipped doesn’t yet do.

And maybe that’s okay. Maybe systems aren’t supposed to be complete. Maybe they’re supposed to be conversations—between what you intended and what users actually need, between what you built and what production reveals.

The half-built system isn’t a failure state. It’s the natural state. The question is whether you recognize it.
