Large language models struggle with generating clean code

Explore how large language models struggle with clean code generation, revealing high API misuse and the need for better reliability assessments.

The article discusses a study on the reliability and robustness of code that large language models (LLMs) generate in response to Java coding questions. The study evaluated four code-capable LLMs, including GPT-3.5 and GPT-4 from OpenAI, and found high rates of API misuse in the generated code. It also stressed that code reliability should be assessed beyond semantic correctness, and that static analysis is needed to ensure full coverage. Llama 2, an open model, performed best, with a failure rate of less than one percent.

Original article: Perhaps AI is going to take away coding jobs of those who trust this tech too much
