1d ago

GPT 5.5 xhigh scores 13.5 percent almost resolved on ProgramBench


GPT 5.5 solved the first task on ProgramBench, a benchmark of 200 tasks and more than 248,000 tests that evaluates whether models can rebuild programs from scratch. The high and xhigh variants chose different languages for the task (C vs Python). GPT 5.5 xhigh reached 13.5 percent almost resolved and 0.5 percent resolved, ahead of Opus 4.7 xhigh at 4.5 percent almost resolved. All tested models scored below 50 percent.

Original post

OP#130@OFIRPRESS @KLIERET

Kilian Lieret@KLIERET

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵

8:02 AM · May 12, 2026

Cluster engagement

7 snapshots

Reposted by

AD#796@THEREALADAMG

GM#134@GARYMARCUS

OP#130@OFIRPRESS

QUOTE POST · NB #22 · Noam Brown @POLYNOAMIAL

I love seeing a new eval with such low scores. When we announced GPT-5.5, almost every benchmark had a score above 50%.

It's time to retire evals like GPQA and bring in a new set.

Kilian Lieret@KLieret

3:02 PM · May 12, 2026 · 199.3K Views

5:42 PM · May 12, 2026 · 99.5K Views

REPLY · NB #22 · Noam Brown @POLYNOAMIAL

And, of course, they should be plotted with compute, latency, or cost on the x-axis.

5:43 PM · May 12, 2026 · 6K Views

REPLY · TR #74 · Tim Rocktäschel @_ROCKT

@polynoamial If you like low scores, we should evaluate GPT-5.5 on @NetHack_LE / http://balrogai.com 😅

7:17 PM · May 12, 2026 · 792 Views

QUOTE POST · OP #130 · Ofir Press @OFIRPRESS

We built ProgramBench to have a wide range of task difficulties, from terminal utilities to million-line repos that have been built over decades. GPT 5.5 just solved the first (super simple) task, and I expect to see more tasks solved soon:

Kilian Lieret@KLieret

3:02 PM · May 12, 2026 · 199.3K Views

3:56 PM · May 12, 2026 · 5.2K Views

ORIGINAL POST · OP #130 · Ofir Press @OFIRPRESS

To give the comms people a second to use the bathroom we will not be doing any ProgramBench-related tweets in the next 15 mins.

11:53 PM · May 12, 2026 · 1.9K Views

QUOTE POST · OP #130 · Ofir Press @OFIRPRESS

Thanks Noam!

6:20 PM · May 12, 2026 · 4.5K Views

QUOTE POST · PW #216 · Peter Welinder @NPEW

GPT 5.5 xhigh is a beast. Fun to see progress on this (very hard) benchmark.

Kilian Lieret@KLieret

3:02 PM · May 12, 2026 · 199.3K Views

9:44 PM · May 12, 2026 · 2.6K Views