Llama 2 avoids errors by staying quiet, GPT-4 gives long, if useless, samples

The article discusses a study by computer scientists at the University of California San Diego on the reliability and robustness of large language models (LLMs) in code generation. The researchers evaluated four code-capable LLMs with an API checker called RobustAPI, gathering 1,208 coding questions from StackOverflow that involve 24 common Java APIs and posing them in three different question formats. The results showed high rates of API misuse across the board, with OpenAI's GPT-3.5 and GPT-4 exhibiting the highest failure rates, while Meta's Llama 2 performed exceptionally well, with a failure rate of under one percent. The study highlights the importance of assessing code reliability and the room for improvement in LLMs' ability to generate clean code.
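To make "API misuse" concrete: checkers like RobustAPI flag code that calls a Java API without the guard the API contract expects. The sketch below is an illustrative example of one such pattern (dereferencing `Map.get` without a null check), not a case drawn from the study itself; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class ApiMisuseDemo {
    // Misuse pattern: Map.get returns null for an absent key, so calling
    // a method on the result directly can throw NullPointerException.
    static int lengthUnsafe(Map<String, String> m, String key) {
        return m.get(key).length(); // NPE if key is absent
    }

    // Guarded usage: check for null before dereferencing the value.
    static int lengthSafe(Map<String, String> m, String key) {
        String v = m.get(key);
        return (v == null) ? 0 : v.length();
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("a", "hello");
        System.out.println(lengthSafe(m, "a"));
        System.out.println(lengthSafe(m, "missing"));
    }
}
```

A generated snippet shaped like `lengthUnsafe` compiles and often runs fine in a demo, which is exactly why this class of error is easy for an LLM to emit and for a reviewer to miss.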
Featured writing
Why customer tools are organized wrong
This article reveals a fundamental flaw in how customer support tools are designed—organizing by interaction type instead of by customer—and explains why this fragmentation wastes time and obscures the full picture you need to help users effectively.
Busy is not a state
We've built work cultures that reward activity, even when nothing actually changes. In technical systems, activity doesn't count—only state change does. This essay explores why "busy" has become the most misleading signal we have, and how focusing on state instead of motion makes work more honest, less draining, and actually productive.
Infrastructure shapes thought
The tools you build determine what kinds of thinking become possible. On infrastructure, friction, and building deliberately for thought rather than just throughput.
Books
The Work of Being (in progress)
A book on AI, judgment, and staying human at work.
The Practice of Work (in progress)
Practical essays on how work actually gets done.
Recent writing
When execution becomes cheap, ideas become expensive
This article reveals a fundamental shift in how organizations operate: as AI makes execution nearly instantaneous, the bottleneck has moved from implementation to decision-making. Understanding this transition is critical for anyone leading teams or making strategic choices in an AI-enabled world.
Dev reflection - February 02, 2026
I've been thinking about what happens when your tools get good enough to tell you the truth. Not good enough to do the work—good enough to show you what you've been avoiding.
Dev reflection - January 31, 2026
I've been thinking about what happens when your tools start asking better questions than you do.
Notes and related thinking
It's going to take a century for artificial intelligence to be able to perform most human jobs. But there are going to be some key developments during the next decade.
Explore how AI will transform jobs in the next decade, from enhancing security to automating coding, reshaping the future of work.
Many businesses are not yet prepared to fully reap the benefits of AI.
Unlock AI's true potential for your business by integrating it into your strategy, boosting productivity, and enhancing customer experiences.
Rose-tinted predictions for artificial intelligence’s grand achievements will be swept aside by underwhelming performance and dangerous results.
Explore the reality of generative AI in 2024 as hype fades, revealing limitations, job displacement, and the need for regulation.