
Dev reflection - February 06, 2026


Duration: 6:15 | Size: 5.73 MB


Hey, it’s Paul. Thursday, February 6th, 2026.

I want to talk about the difference between a system that works and a system that’s ready. These aren’t the same thing. The gap between them is where most projects stall out—not from failure, but from a false confidence that comes from watching things function.

Here’s what brought this up. I’ve been integrating Linear as the authoritative source for task state across a bunch of projects. Before this, everything ran through a TODO.md file. Simple text. Tasks lived there. When a skill asked “what should I work on?” it read that file. When something finished, it updated that file.

It worked. For months, it worked fine.

But here’s what I didn’t notice until I started the migration: the system’s model of reality depended entirely on me remembering to update a text file. The actual work state might live in Linear, in GitHub issues, in my head—but the system only knew what I told it through that file. The system was exactly as reliable as my discipline in maintaining it.

This is the pattern I keep seeing in organizations. Not just in code, but everywhere. Personal infrastructure becomes invisible infrastructure. The workarounds you build for yourself, the spreadsheets, the naming conventions, the folder structures, they start as shortcuts. Then they become load-bearing. Then someone else depends on them. By the time you notice, the whole operation runs on assumptions nobody wrote down.

Making Linear authoritative wasn’t just changing where data lives. It was making explicit something that had been implicit. That’s a different kind of work than building features. It’s the work of saying out loud what the system actually believes.
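
To make that concrete, here’s a minimal Python sketch of the shift. The TODO.md parsing mirrors the old setup only in spirit, and the Linear call uses the public GraphQL endpoint, but the checkbox format, the field selection, and the lack of any team or state filtering are assumptions for illustration, not the actual integration.

```python
# A minimal sketch of making the source of truth explicit. The TODO.md format,
# the field selection, and the task formatting are illustrative assumptions.

import os

import requests  # assumes the requests library is installed

LINEAR_GRAPHQL = "https://api.linear.app/graphql"


def tasks_from_todo_md(path="TODO.md"):
    # Old model of reality: whatever I last remembered to write down.
    with open(path) as f:
        return [line[len("- [ ] "):].strip()
                for line in f
                if line.startswith("- [ ] ")]


def tasks_from_linear():
    # New model of reality: ask the system that actually tracks the work.
    # A real integration would filter by team and workflow state.
    query = "{ issues { nodes { identifier title state { name } } } }"
    resp = requests.post(
        LINEAR_GRAPHQL,
        json={"query": query},
        headers={"Authorization": os.environ["LINEAR_API_KEY"]},
    )
    resp.raise_for_status()
    nodes = resp.json()["data"]["issues"]["nodes"]
    return [f'{n["identifier"]}: {n["title"]} ({n["state"]["name"]})' for n in nodes]
```

The point isn’t the few lines of API plumbing. It’s that the second function can’t drift from reality without failing loudly, while the first one drifts silently every time I forget to edit a file.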


Second thing I noticed: tests that fail aren’t the problem. Tests that pass for the wrong reasons are the problem.

One of my projects had 26 test failures this week. Every single one was a mismatch between what the tests expected and what the code actually did. Tests assumed a certain UI flow that didn’t exist anymore. Tests assumed a database column that had been renamed. Tests assumed records would sort by insertion order, but with UUID primary keys, they sort alphabetically. Completely different.
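
That last one is worth seeing in miniature. Here’s a hedged sketch (the Record class and its fields are made up, not the project’s models) of why “order by primary key equals insertion order” quietly stops being true once the key is a UUID, and what asserting against an explicit ordering looks like instead.

```python
# Sketch only: Record stands in for whatever model the tests were querying.

import uuid
from datetime import datetime, timedelta


class Record:
    def __init__(self, title, created_at):
        self.id = str(uuid.uuid4())  # random key: its sort order says nothing about age
        self.title = title
        self.created_at = created_at


base = datetime(2026, 2, 6)
records = [Record(title, base + timedelta(seconds=i))
           for i, title in enumerate(["first", "second", "third"])]

# What the old tests assumed: ordering by primary key reproduces insertion order.
by_primary_key = [r.title for r in sorted(records, key=lambda r: r.id)]

# What a test should assert against: an ordering the schema actually guarantees.
by_created_at = [r.title for r in sorted(records, key=lambda r: r.created_at)]

print(by_primary_key)  # shuffles from run to run
print(by_created_at)   # always ['first', 'second', 'third']
```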

None of these were regressions. The code was fine. The tests were documenting a system that no longer existed.

This is the silent failure pattern that kills projects. Not the loud crashes. Those get fixed immediately. It’s the things that look right from a distance. The homepage serving a TODO document to every visitor because internal docs ended up in the wrong folder. The deployment that fails silently while the CDN serves a stale build from three weeks ago. The test that passes sometimes and fails sometimes, so you assume it’s environmental and ignore it.

Systems that accept anything that looks like valid input will eventually process something invalid. The output will look plausible enough that nobody checks.

This applies way beyond code. Think about the reports that get generated but never read. The metrics that get tracked but never questioned. The processes that run on autopilot because they’ve always run. Every one of these is a test that passes for the wrong reasons: a system producing output that looks correct without anyone verifying that it still measures what matters.

The defense isn’t more validation. It’s clearer boundaries. Internal docs don’t belong in the content directory. Tests shouldn’t rely on insertion order without explicit ordering. Organizational processes shouldn’t run indefinitely without someone asking: is this still solving the problem it was built to solve?


Third thing. I rewrote several roadmaps this week, and they all struggled with the same question: when is a milestone actually done?

The original roadmaps mixed two kinds of milestones. Some measured feature completion: “infrastructure is set up,” “AI task breakdown is implemented.” Others measured capability: “a URL you can send someone,” “the learning loop works end-to-end,” “ten people use it independently.”

These are different kinds of done.

Feature-based milestones are completed when you merge a pull request. Capability-based milestones are completed when someone uses the thing successfully. The first measures readiness to ship. The second measures readiness to matter.

Here’s the trap: optimizing for the first doesn’t get you the second.

I shipped a distribution milestone in about an hour this week. Binary available through Homebrew, anyone can install it. Done. Except there were 31 open questions sitting in the repo that nobody had answered, because nobody had tried to use the tool for real work yet. The plan assumed the product was ready for launch. But “ready for launch” and “ready to matter” are different states.

Dogfooding isn’t testing. Testing verifies that code does what you intended. Dogfooding forces the questions that only surface when someone tries to get real work done. When you use your own tool daily, you discover which features are actually load-bearing and which ones you thought were important but never touch.

A launch milestone that doesn’t include “the builder uses this for real work” is measuring the wrong thing. You’re measuring whether the artifact exists, not whether it solves the problem.


Last thing. How much of what we call technical debt is actually just tooling that solved a problem the system no longer has?

I found a Python build system this week that was generating static pages. It worked. It had been working for a while. But Hugo, the static site generator I was using, already knew how to do everything that Python script did. The script wasn’t solving a problem Hugo couldn’t handle. It was solving a problem I hadn’t realized Hugo solved natively.

When systems accumulate tooling, each tool encodes assumptions about what the other tools can’t do. Sometimes those assumptions were right when the tool was built. Sometimes they were never right. Sometimes they became wrong when something else in the system changed.

The only way to find out which is to delete the tool and see what breaks.
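
If you want a safety net before hitting delete, the experiment can be small. This is a hypothetical sketch, not my actual build: the script name, the Hugo output directories, and the assumption that the legacy script only affects generated pages are all invented for illustration.

```python
# Hypothetical "delete it and see what breaks" check. Script and directory
# names are invented; run the Hugo-only build from a clean checkout so the
# legacy script's generated files are actually gone.

import filecmp
import subprocess

# Baseline: the pipeline as it runs today, legacy script first, then Hugo.
subprocess.run(["python", "build_pages.py"], check=True)
subprocess.run(["hugo", "--destination", "public-with-script"], check=True)

# Candidate: Hugo alone.
subprocess.run(["hugo", "--destination", "public-hugo-only"], check=True)

# Recursive report of every file that exists in only one build or differs
# between them. If nothing meaningful shows up, the script was solving a
# problem Hugo already solves.
filecmp.dircmp("public-with-script", "public-hugo-only").report_full_closure()
```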

This is uncomfortable because deletion feels risky. But keeping tools that encode outdated assumptions is also risky. It’s just a slower, less visible kind of risk. Every unnecessary layer is a place where the system’s model of itself diverges from reality.


So here’s the question I’m sitting with: how do you know when a system is ready? Not feature complete. Ready. Ready to show someone. Ready to charge for. Ready to depend on.

The answer isn’t in the code. It’s in whether the system’s model of reality matches actual reality. The only way to test that is contact with the world. Real users, real work, real friction.

Everything else is just passing tests that might be checking the wrong things.


