ScienceAgentBench (ICLR 2025)

Evaluation & Benchmarking

102 executable tasks drawn from 44 peer-reviewed papers across 4 disciplines, with containerized evaluation

Repository: github.com/osu-nlp-group/scienceagentbench

Source attribution

Awesome AI for Science — github.com/osu-nlp-group/scienceagentbench

Related resources

AIRS-Bench (Meta, 2026)

Tool

Evaluation & Benchmarking

Benchmark quantifying the end-to-end autonomous research abilities of LLM agents across 20 tasks drawn from SOTA machine learning papers spanning NLP, code, math, biochemical modelling, and time series forecasting, with scores normalized against human SOTA and an accompanying HuggingFace dataset

PaperBench (OpenAI, 2025)

Tool

Evaluation & Benchmarking

Benchmark evaluating AI agents' ability to replicate 20 ICML 2024 Spotlight/Oral papers from scratch, with 8,316 gradable tasks and author-co-developed rubrics

MLE-Bench (OpenAI, 2024)

Tool

Evaluation & Benchmarking

Benchmark evaluating AI agents on 75 curated Kaggle-style ML engineering competitions, with a reproducible Docker-based grading harness, human baselines, and an end-to-end task lifecycle; used as a primary benchmark for autonomous ML research agents (e.g., InternAgent ranked #1 at 36.44%)

ScienceBoard (ICLR 2026)

Tool

Evaluation & Benchmarking

Benchmark evaluating multimodal autonomous agents on realistic scientific workflows in real scientific software environments (KAlgebra, Celestia, GRASS GIS, Lean 4, etc.), with VM-based evaluation infrastructure and agent trajectories

BuildArena

Tool

Evaluation & Benchmarking

First physics-aligned interactive benchmark for LLM agents in engineering construction, in which agents design rockets, cars, and bridges in a physics simulator with a 3D spatial geometry library

SciCode

Tool

Evaluation & Benchmarking

Research coding benchmark curated by scientists (NeurIPS 2024), with 338 subproblems across 16 subdomains of physics, math, materials science, biology, and chemistry, evaluating LLMs on realistic scientific programming tasks with gold-standard solutions