GPT 5.5 xhigh reaches 13.5 percent almost resolved on ProgramBench
GPT 5.5 at the high and xhigh settings solved the first task on ProgramBench, a benchmark of 200 tasks and more than 248,000 tests that evaluates whether models can rebuild programs from scratch. The two settings chose different languages for the task (C vs. Python). GPT 5.5 xhigh reached 13.5 percent almost resolved and 0.5 percent resolved (one task out of 200). It outperformed Opus 4.7 xhigh, which reached 4.5 percent almost resolved, and every tested model scored below 50 percent.
I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%.
It's time to retire evals like GPQA and bring in a new set.
The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵
And, of course, they should be plotted with compute, latency, or cost on the x-axis.
@polynoamial If you like low scores, we should evaluate GPT-5.5 on @NetHack_LE / http://balrogai.com 😅
We built ProgramBench to have a wide range of task difficulties, from terminal utilities to million-line repos that have been built over decades. GPT 5.5 just solved the first (super simple) task, and I expect to see more tasks solved soon:
To give the comms people a second to use the bathroom we will not be doing any ProgramBench-related tweets in the next 15 mins.

Thanks Noam!
GPT 5.5 xhigh is a beast. Fun to see progress on this (very hard) benchmark.