ProteinGym

Biology & Medicine

Large-scale benchmark suite for protein fitness prediction and design, aggregating 200+ deep mutational scanning assays and clinical variant datasets across diverse protein families and taxa, with standardized zero-shot and supervised leaderboards for variant effect prediction, mutation effect prediction, and protein language model evaluation (OATML & Marks Lab, NeurIPS 2023 Spotlight, Datasets & Benchmarks)

Source attribution

  • Awesome AI for Sciencegithub.com/oatml-markslab/proteingym

Related resources

Comprehensive collection of Chinese medical datasets for AI research

2801 year ago

Therapeutics Data Commons: 66 AI-ready datasets across 22 drug discovery tasks with 29 leaderboards, covering target identification, molecular generation, ADMET prediction, and clinical trial outcomes (Harvard MIMS, NeurIPS 2021/2024)

Curated open dataset collection of 602M+ observational and perturbational single-cell profiles for accelerating virtual cell model creation, integrating Tahoe-100M and scBaseCount data with Google Cloud Marketplace distribution (Arc Institute, 2025-2026)