
GPT 5.5 xhigh scores 13.5 percent on ProgramBench


GPT 5.5 xhigh solved the first task on ProgramBench, a benchmark of 200 tasks and more than 248,000 tests that evaluates whether models can rebuild programs from scratch. The high and xhigh settings chose different languages for the task (C and Python). GPT 5.5 xhigh reached 13.5 percent almost resolved and 0.5 percent resolved, ahead of Opus 4.7 xhigh at 4.5 percent almost resolved. All tested models scored below 50 percent.

Original post

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵

8:02 AM · May 12, 2026

I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%.

It's time to retire evals like GPQA and bring in a new set.

Kilian Lieret@KLieret

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵

3:02 PM · May 12, 2026 · 199.3K Views
5:42 PM · May 12, 2026 · 99.5K Views

And, of course, they should be plotted with compute, latency, or cost on the x-axis.

Noam Brown@polynoamial

I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%. It's time to retire evals like GPQA and bring in a new set.

5:42 PM · May 12, 2026 · 99.5K Views
5:43 PM · May 12, 2026 · 6K Views

@polynoamial If you like low scores, we should evaluate GPT-5.5 on @NetHack_LE / http://balrogai.com 😅

7:17 PM · May 12, 2026 · 792 Views

We built ProgramBench to have a wide range of task difficulties, from terminal utilities to million-line repos that have been built over decades. GPT 5.5 just solved the first (super simple) task, and I expect to see more tasks solved soon:

Kilian Lieret@KLieret

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵

3:02 PM · May 12, 2026 · 199.3K Views
3:56 PM · May 12, 2026 · 5.2K Views

To give the comms people a second to use the bathroom we will not be doing any ProgramBench-related tweets in the next 15 mins.

11:53 PM · May 12, 2026 · 1.9K Views

Thanks Noam!

6:20 PM · May 12, 2026 · 4.5K Views

GPT 5.5 xhigh is a beast. Fun to see progress on this (very hard) benchmark.

9:44 PM · May 12, 2026 · 2.6K Views