The article discusses a study by computer scientists at the University of California San Diego on the reliability and robustness of large language models (LLMs) in generating code. The researchers evaluated four code-capable LLMs using an API checker called RobustAPI. They gathered 1,208 coding questions from Stack Overflow involving 24 common Java APIs and tested the LLMs with three different question formats. The results showed high rates of API misuse, with OpenAI's GPT-3.5 and GPT-4 exhibiting the highest failure rates. Meta's Llama 2, by contrast, had a failure rate of less than one percent. The study highlights the importance of assessing code reliability, not just correctness, and the room for improvement in LLMs' ability to generate robust code.
Llama 2 avoids errors by staying quiet; GPT-4 gives long, if useless, samples.
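To make "API misuse" concrete: checkers in this vein typically flag code that skips a guard or cleanup step an API contract expects. The sketch below illustrates one classic Java pattern, calling `Iterator.next()` without first checking `hasNext()`; it is a hypothetical example chosen for illustration, not necessarily one of the study's 24 APIs.

```java
import java.util.Iterator;
import java.util.List;

public class ApiMisuseDemo {
    // Misuse would be: items.iterator().next() with no guard, which throws
    // NoSuchElementException on an empty list. The robust version guards
    // next() with hasNext(), as an API-misuse checker would expect.
    static String firstOrNull(List<String> items) {
        Iterator<String> it = items.iterator();
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstOrNull(List.of("a", "b"))); // prints a
        System.out.println(firstOrNull(List.of()));         // prints null
    }
}
```

The point of the study is that generated snippets frequently omit exactly this kind of guard even when the code otherwise compiles and looks plausible.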