ScienceAgentBench (ICLR 2025)
102 executable tasks from 44 peer-reviewed papers across 4 disciplines with containerized evaluation
Source attribution
- Awesome AI for Science — github.com/osu-nlp-group/scienceagentbench
Related resources
Benchmark quantifying end-to-end autonomous AI research abilities of LLM agents across 20 tasks from SOTA machine learning papers spanning NLP, code, math, biochemical modelling, and time series forecasting, with normalized score metrics against human SOTA and HuggingFace dataset
Benchmark evaluating AI agents' ability to replicate 20 ICML 2024 Spotlight/Oral papers from scratch, with 8,316 gradable tasks and author-co-developed rubrics
Benchmark evaluating AI agents on 75 curated Kaggle-style ML engineering competitions with reproducible Docker-based grading harness, human baselines, and end-to-end task lifecycle, used as a primary benchmark for autonomous ML research agents (e.g., InternAgent #1 at 36.44%)
Evaluating multimodal autonomous agents in realistic scientific workflows across real scientific software environments (KAlgebra, Celestia, Grass GIS, Lean 4, etc.) with VM-based evaluation infrastructure and agent trajectories
First physics-aligned interactive benchmark for LLM agents in engineering construction, designing rockets/cars/bridges in physics simulator with 3D spatial geometry library
Research coding benchmark curated by scientists with 338 subproblems across 16 subdomains (physics, math, materials, biology, chemistry), evaluating LLMs on realistic scientific programming tasks with gold-standard solutions (NeurIPS 2024)