Which LLM is the best at finding real vulnerabilities?
A recent evaluation tested several LLMs on their ability to identify vulnerabilities in code. The models were assessed on accuracy and report quality, with GPT-OSS and Gemma performing particularly well. The results highlighted each model's strengths and weaknesses, especially in precision and in the tendency to report duplicate vulnerabilities.
- Seven LLMs were tested on their ability to find vulnerabilities in a fake banking web application.
- GPT-OSS scored the highest with 19 out of 22 points, identifying 10 of the 13 critical vulnerabilities.
- Gemma, despite having fewer parameters, found 8 critical vulnerabilities and produced a better report than GPT-OSS.
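To make the critical-vulnerability counts easier to compare, here is a minimal sketch that turns them into a recall figure (the share of the 13 must-report vulnerabilities each model found). It uses only the counts stated above; the article's 22-point scoring scheme is not broken down in this summary, so the sketch does not try to reproduce it, and the function name `critical_recall` is hypothetical.

```python
# Counts taken from the summary above: the exercise app has 13
# vulnerabilities that must be reported ("critical").
TOTAL_CRITICAL = 13

# Critical vulnerabilities each model found, per the summary.
found = {"GPT-OSS": 10, "Gemma": 8}


def critical_recall(found_count: int, total: int = TOTAL_CRITICAL) -> float:
    """Hypothetical helper: fraction of must-report vulnerabilities found."""
    if not 0 <= found_count <= total:
        raise ValueError("found_count must be between 0 and total")
    return found_count / total


for model, count in found.items():
    print(f"{model}: {count}/{TOTAL_CRITICAL} = {critical_recall(count):.1%}")
```

This gives roughly 76.9% for GPT-OSS and 61.5% for Gemma. Precision would also require the number of false positives and duplicate findings per model, which the summary does not provide.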
Opening excerpt (first ~120 words):
Which LLM is the best at finding real vulnerabilities (Part 1)? By Jeremie A (lp1)

A few weeks ago, I built a framework that allows me to automatically decompile apps and binaries and audit their code.

I used it to find 500 actual vulns in public apps (that I'm not even sure what to do with), and now I'm using this toolset to try to find the most cost-effective LLM for vulnerability research.

I was teaching a class in Paris when I created this exercise: https://github.com/lp1dev/Mybank_WebSec_Exercise/. The assignment is simple: run and audit the application, write a penetration testing report and send it to me!

The app has a list of 13 vulnerabilities that must absolutely be reported; they are the ones that should (in…
Excerpt limited to ~120 words for fair-use compliance. The full article is available on Medium.