Interesting fact: if Opus 4.7 is ~35% less token-efficient than 4.6, this suggests its long-context degradation is *vastly worse* than suggested by MRCR or GraphWalks, because as a user I care about the codebase/text, not "tokens" it's broken down into. https://twitter.com/YouJiacheng/status/2044956691540951115
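To make the token-efficiency argument concrete, here is a minimal sketch of the conversion it implies. It assumes "35% less token-efficient" means the same text costs about 1.35× as many tokens in the newer tokenizer. That reading is an assumption, and the function name and `overhead` parameter are hypothetical. If the figure instead means 35% less text per token, the divisor would be `1 / 0.65`.

```python
def equivalent_old_tokens(new_tokens: float, overhead: float = 0.35) -> float:
    """Express a context length in the older model's token units.

    Assumes the same text needs (1 + overhead) times as many tokens in the
    newer tokenizer. This is one reading of the tweet, not a measured value.
    """
    return new_tokens / (1.0 + overhead)


if __name__ == "__main__":
    # How much "old-tokenizer" text fits in each benchmark context size.
    for size in (256_000, 1_000_000):
        print(size, round(equivalent_old_tokens(size)))
```

Under this assumption, a benchmark run at 1M tokens on the newer model covers about as much text as roughly 741k tokens on the older one, so comparing scores at equal token counts understates the gap per unit of text.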
Context: Quoting @YouJiacheng: "But GraphWalks scores also degraded. GraphWalks has 100× 256k problems and 100× 1M problems. So we can calculate 1M subset scores based on 256K&1M and 256k scores. Opus 4.7: BFS@1M = 40.29%, Parents@1M = 56.63%. Opus 4.6 (64k): BFS@1M = 41.2%, Parents@1M = 71.1%. https://x.com/YouJiacheng/status/2044956691540951115/photo/1 https://twitter.com/bcherny/status/2044821479980929082"

@bcherny: "We kept MRCR in the system card for scientific honesty, but we've actually been phasing it out slowly. Two reasons: (1) it's built around stacking distractors to trick the model, which isn't how people actually use long context, and (2) we care more about applied long-context capability than needle-retrieval. Graphwalks is a better signal for applied reasoning over long context, and internally we've seen this model do really well on long-context code. MRCR wasn't included in the Mythos Preview system card for these reasons, but Graphwalks was - that will be the case for future models too."
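The subset calculation described in the quoted tweet can be sketched as below. This assumes the pooled "256K & 1M" score is an unweighted mean over all 200 problems, which is an assumption about how the benchmark aggregates. The function name is hypothetical, and the inputs are made-up placeholders, not reported scores.

```python
def subset_score(combined: float, known: float,
                 n_known: int = 100, n_unknown: int = 100) -> float:
    """Recover the mean score of one subset from a pooled mean.

    combined: mean score over both subsets (e.g. 256k and 1M problems)
    known:    mean score over the subset already reported (e.g. 256k)
    Assumes every problem carries equal weight in the pooled mean.
    """
    total = n_known + n_unknown
    return (combined * total - known * n_known) / n_unknown


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    print(subset_score(combined=70.0, known=90.0))  # 50.0
```

With equal subset sizes this reduces to `2 * combined - known`, so any error in the pooled or 256k score carries into the derived 1M score, doubled in the case of the pooled score.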
| Time | Views | Likes | Bookmarks | RTs | Replies |
|---|---|---|---|---|---|
| 11:00 AM UTC | +98 | - | - | - | - |
| 10:50 AM UTC | +82 | +3 | - | - | - |
| 10:40 AM UTC | +63 | - | - | - | - |
| 10:30 AM UTC | +62 | +2 | +1 | - | - |
| 10:20 AM UTC | +61 | - | - | - | - |
| 10:10 AM UTC | +69 | +1 | - | - | - |
| 10:00 AM UTC | +111 | - | - | - | - |
| 9:50 AM UTC | +12 | - | - | - | - |
| 9:40 AM UTC | +308 | +4 | - | +1 | - |
| 9:30 AM UTC | +196 | +1 | +1 | - | +1 |