There are new insights into how well LLMs can program. Until now it has been difficult to measure how reliably language models work: they often deliver good results, sometimes none, and occasionally wrong ones, and these hallucinations are frequently phrased in a very convincing way. For general language models, benchmarks involving humans have therefore been established, for example LMSYS's Chatbot Arena. In addition, individual quality criteria are measured and then aggregated into a leaderboard.
Review code systematically
For more specialized language models, this can be done more systematically. LLMs that generate program code are particularly well suited, because the code can be checked both syntactically and semantically. Symflower, a provider of automated test generation software, tested exactly this and documented it in a whole blog series. The findings are exciting and give interesting insights into the performance of LLMs.
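A minimal sketch of what such a syntactic check can look like for Java, using the JDK's built-in compiler API; the file name GeneratedTest.java is merely a placeholder for the model output written to disk:

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.ByteArrayOutputStream;

public class CompileCheck {
    public static void main(String[] args) {
        // Requires a JDK; on a plain JRE this returns null.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        ByteArrayOutputStream errors = new ByteArrayOutputStream();

        // "GeneratedTest.java" stands in for the LLM output written to disk.
        // run() returns 0 if the code compiles; anything else means manual rework.
        int result = compiler.run(null, null, errors, "GeneratedTest.java");

        System.out.println(result == 0
                ? "Generated code compiles."
                : "Compilation failed:\n" + errors);
    }
}
```

Whether the compiled tests are also semantically useful is then checked by running them and measuring coverage.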
However, there are some limitations: code is generated only for Java and Go. Other widely used programming languages such as Python and JavaScript have not yet been taken into account, so it remains unclear whether the results can be reproduced there. Given the larger amount of publicly available code in those languages, it is plausible that the results would be even better.
The previous parts of the blog series only had tests generated for simple, "empty" classes. The scenario has now been expanded considerably and become more complex: the LLMs have to create tests for 23 real programming examples (a sketch of what such a generated test might look like follows the list below). The most interesting findings are:
- Only 58 percent of the results were compilable (only ten models exceeded 80 percent), so manual rework is needed. This metric is easy to measure for compiled languages, but will be difficult for Python and JavaScript.
- Some models did not produce any compilable code at all, comparable to a programmer who only writes syntactically incorrect code.
- Most syntax errors were minor and could be fixed instantly with IDE support.
- For Java, three models (GPT-4o, DeepSeek Coder, Claude 3 Opus) always produced compilable code. Unfortunately, no model managed this for Go, which is of course due to the smaller amount of training data.
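To illustrate the task itself: for a hand-written class, the models have to produce a test that compiles and, ideally, covers the behavior. The PriceCalculator class and its test below are invented for illustration and are not taken from the benchmark.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical example class of the kind the models have to write tests for.
class PriceCalculator {
    static double applyDiscount(double price, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent out of range");
        }
        return price * (100 - percent) / 100.0;
    }
}

// What a compilable and semantically useful generated JUnit test could look like.
class PriceCalculatorTest {
    @Test
    void appliesDiscount() {
        assertEquals(90.0, PriceCalculator.applyDiscount(100.0, 10), 1e-9);
    }

    @Test
    void fullDiscountIsFree() {
        assertEquals(0.0, PriceCalculator.applyDiscount(49.99, 100), 1e-9);
    }
}
```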
People remain indispensable
Programmers are still needed in any case. Surprisingly, code generation works much better for Java than for Go. The large amount of training data gives hope that it can also work well for Python and JavaScript. However, metrics are harder to determine there because the code does not need to be compiled, and dynamic typing can lead to additional errors that have to be checked manually.
Different models handle exceptions differently: there is a choice between catching them and letting the test fail. Both strategies are also used by human programmers, so the models learned both.
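A sketch of both strategies in JUnit terms, using an invented method that throws on negative input (the class and method names are assumptions for illustration only):

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertThrows;

class ExceptionStrategies {
    // Hypothetical method under test: rejects negative input with an exception.
    static int squareRootFloor(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative input");
        }
        return (int) Math.sqrt(value);
    }

    // Strategy 1: the generated test treats the exception as expected behavior
    // and catches it (here via assertThrows), so the test passes.
    @Test
    void negativeInputIsRejected() {
        assertThrows(IllegalArgumentException.class, () -> squareRootFloor(-1));
    }

    // Strategy 2: the generated test calls the method directly and does not
    // handle the exception, so the test fails when it is thrown.
    @Test
    void negativeInputUnhandled() {
        squareRootFloor(-1);
    }
}
```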
The ranking of the models is interesting. Compared to the previous test, Symflower has slightly adjusted and optimized the scoring. This process is not finished yet: some models achieve higher coverage, while others produce more compilable tests. The scores will therefore be revised again for the next iteration.
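A deliberately simplified sketch of this trade-off (this is not Symflower's actual formula, just an invented weighted sum): the chosen weights decide whether a model with a higher compile rate or one with higher coverage comes out ahead.

```java
// Hypothetical scoring sketch, not Symflower's actual formula. It only
// illustrates that a score has to balance how many generated tests
// compile against how much coverage the compiling tests achieve.
class ScoreSketch {
    static double score(double compileRate, double coverage,
                        double weightCompile, double weightCoverage) {
        return weightCompile * compileRate + weightCoverage * coverage;
    }

    public static void main(String[] args) {
        // Model A: everything compiles, but coverage is mediocre.
        System.out.println(score(1.0, 0.5, 0.5, 0.5));   // 0.75
        // Model B: only 75 percent compiles, but those tests cover more.
        System.out.println(score(0.75, 0.75, 0.5, 0.5)); // 0.75
        // With equal weights both models tie; shifting the weights changes
        // the ranking, which is why the scores keep being revised.
    }
}
```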
Finally, the article looks at how efficient the generated tests are. In some cases, LLMs use synchronous rather than asynchronous methods, which makes the runtime significantly longer. Due to poor handling of permutations and the associated logging, some tests also produced huge log files. Both problems can prevent entire test suites from running and thus jeopardize the stability of the software.
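The runtime effect of synchronous calls can be illustrated with a small, invented example (not code from the benchmark): the same slow operation executed sequentially and then concurrently via CompletableFuture.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of why a generated test that calls the synchronous
// variant of an API drags out the runtime: the slow operations run one after
// another instead of concurrently.
class SyncVsAsync {
    // Stand-in for a slow call of roughly 100 ms, e.g. I/O.
    static int slowOperation(int input) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return input * 2;
    }

    public static void main(String[] args) {
        List<Integer> inputs = List.of(1, 2, 3, 4, 5, 6, 7, 8);

        long start = System.nanoTime();
        inputs.forEach(SyncVsAsync::slowOperation);  // sequential, roughly 800 ms
        System.out.printf("sync:  %d ms%n", (System.nanoTime() - start) / 1_000_000);

        start = System.nanoTime();
        CompletableFuture.allOf(inputs.stream()
                .map(i -> CompletableFuture.supplyAsync(() -> slowOperation(i)))
                .toArray(CompletableFuture[]::new)
        ).join();                                    // concurrent, considerably faster
        System.out.printf("async: %d ms%n", (System.nanoTime() - start) / 1_000_000);
    }
}
```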
The article explains in detail how the models were selected, which sandboxes were used, and much more. It is insightful reading for anyone who wants to try the testing themselves. The technical description provides an easy introduction to the framework, which is published on GitHub.
(For)