Announcement_11
Introducing Agent’s Last Exam, a large-scale benchmark that evaluates AI agents on long-horizon, economically valuable professional tasks. See what we found about the importance of the harness versus the model, and how we confirmed the downgrade of Claude Fable 5.