The article discusses a study of the reliability and robustness of code that large language models (LLMs) generate in response to Java coding questions. The study evaluated four code-capable LLMs, including OpenAI's GPT-3.5 and GPT-4, and found that they exhibited high rates of API misuse. It also stressed that code reliability must be assessed beyond semantic correctness, and that static analysis is needed to ensure full coverage. Llama 2, an open model, performed best, with a failure rate of less than one percent.
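To make "API misuse" concrete, here is a minimal Java sketch of the kind of gap between semantic correctness and reliability that the summary describes. It is an illustration only and is not taken from the study: the class name `ApiMisuseDemo` and both method names are hypothetical. The two methods return the same value for a non-empty list, so a simple functional test would pass either one, but the first calls `Iterator.next()` without checking `hasNext()` and fails on empty input.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;

// Hypothetical demo class; the names here are assumptions made for illustration.
class ApiMisuseDemo {

    // Misuse: next() is called without a preceding hasNext() check.
    // Works for non-empty lists, throws NoSuchElementException for empty ones.
    public static String firstUnchecked(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.next();
    }

    // Guarded use: hasNext() is checked before next(), and an empty
    // Optional is returned when there is nothing to read.
    public static Optional<String> firstChecked(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.hasNext() ? Optional.ofNullable(it.next()) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> names = List.of("a", "b");
        List<String> empty = List.of();

        // Both variants agree on ordinary input.
        System.out.println(firstUnchecked(names));
        System.out.println(firstChecked(names).orElse("<none>"));

        // Only the guarded variant survives the edge case.
        System.out.println(firstChecked(empty).orElse("<none>"));
        try {
            firstUnchecked(empty);
        } catch (NoSuchElementException e) {
            System.out.println("unchecked variant failed on empty input");
        }
    }
}
```

A test suite that only exercises non-empty lists would judge both methods correct, which is why the summary's point about checking reliability separately from semantic correctness matters.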
Large language models struggle with generating clean code
Digital transformation, including agile and DevOps, across many industries, most recently in higher education. Designed and built the Emory faculty information system. Working in continuing education to improve and expand career-focused learning, especially in workforce development. Expanding the role of innovation and entrepreneurship. Designed, built, and launched the Emory Center for Innovation.