The progress on some of these benchmarks has been insane! @AnthropicAI @DarioAmodei May I please ask you to ask Claude for a list of the top 1000 areas of STEM, the top 1000 magazine topics, and the top 500 professions, then for each list item pick a book or set of long articles that is not in the training data, and finally report NLL on all these test sets? If every company reported 2500 fast-to-compute evals of this nature, the public would have a better understanding of the capabilities of each model in their area of study, work, or hobby. Thanks! AI community: Thoughts on how to improve these evals or how to report them? @d_spiegel
Context: Quoting @claudeai: "Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision." https://x.com/claudeai/status/2044785261393977612/photo/1
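To make the proposal in the tweet concrete, here is a minimal sketch of how per-topic NLL could be reported. It assumes each held-out document has already been scored by a model and that the per-token log-probabilities (natural log) are available. The function names, the topic labels, and the toy numbers are all hypothetical and are not taken from any company's eval pipeline.

```python
import math
from collections import defaultdict


def mean_nll(token_logprobs):
    """Average negative log-likelihood per token, in nats."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def bits_per_byte(token_logprobs, text):
    """Total NLL in bits divided by the UTF-8 byte length of the text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    return -sum(token_logprobs) / (math.log(2) * n_bytes)


def report_by_topic(scored_docs):
    """Pool tokens per topic and return {topic: (mean NLL, perplexity)}."""
    pooled = defaultdict(list)
    for topic, logprobs in scored_docs:
        pooled[topic].extend(logprobs)
    return {
        topic: (mean_nll(lps), math.exp(mean_nll(lps)))
        for topic, lps in pooled.items()
    }


# Hypothetical scores: (topic, per-token log-probabilities) for each document.
docs = [
    ("organic chemistry", [math.log(0.5), math.log(0.25)]),
    ("organic chemistry", [math.log(0.5)]),
    ("plumbing", [math.log(0.125)]),
]
report = report_by_topic(docs)
for topic, (nll, ppl) in sorted(report.items()):
    print(f"{topic}: NLL={nll:.3f} nats/token, perplexity={ppl:.2f}")
```

Per-token NLL depends on the tokenizer, so numbers from models with different tokenizers are not directly comparable. Normalizing by bytes, as `bits_per_byte` does, is one common way to put them on the same footing.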
| Time | Views | Likes | Bookmarks | RTs | Replies |
|---|---|---|---|---|---|
| 11:00 AM UTC | +493 | +4 | - | - | - |
| 10:50 AM UTC | +425 | +3 | - | - | - |
| 10:40 AM UTC | +295 | +3 | - | - | - |
| 10:30 AM UTC | +304 | +1 | - | - | - |
| 10:20 AM UTC | +234 | +2 | - | - | +1 |
| 10:10 AM UTC | +211 | +1 | - | - | - |
| 10:00 AM UTC | +212 | +1 | - | - | - |
| 9:50 AM UTC | +15 | +1 | - | - | - |
| 9:40 AM UTC | +341 | +3 | - | - | - |
| 9:30 AM UTC | +131 | - | - | - | - |