Find open-source science resources

Toolkit for large-scale whole-slide image processing, supporting 22+ patch encoders (UNI, CONCH, Virchow, H-Optimus-0, etc.), slide encoders (TITAN, GigaPath, PRISM, CHIEF, Madeleine, Feather), tissue segmentation, and multi-GPU inference, with an end-to-end pipeline and smart resume for standardized deployment of computational pathology foundation models (Mahmood Lab, Harvard Medical School, 553+ stars)

TITAN (Nature Medicine 2024)

Multimodal whole-slide pathology foundation model jointly pretrained on H&E histology and diagnostic text reports, enabling zero-shot cancer subtyping, biomarker prediction, and multimodal reasoning across diverse cancer types (Mahmood Lab, 341+ stars)

PLIP (Nature Medicine 2023)

First vision-and-language foundation model for pathology AI, fine-tuned from CLIP on 249K image-caption pairs, enabling open-ended visual-semantic search and zero-shot diagnosis across histopathology (Pathology Foundation, 376+ stars)

CONCH (Nature Medicine 2024)

Vision-language pathology foundation model using contrastive learning on histopathology image-text pairs, enabling zero-shot classification, slide-level retrieval, and multimodal reasoning across diverse cancer types (Mahmood Lab, 494+ stars)

Prov-GigaPath (Nature 2024)

Whole-slide pathology foundation model trained on 1.3 billion image tiles from 171K slides using a LongNet-based architecture to encode gigapixel-scale WSIs for cancer subtyping and biomarker prediction (Microsoft Research & Providence, 601+ stars)

UNI (Nature Medicine 2024)

General-purpose pathology foundation model pretrained on 100K+ diagnostic whole-slide images across 20 major tissue types, achieving state-of-the-art transfer learning across 30+ clinical tasks and serving as a universal feature extractor for digital pathology (Mahmood Lab, 722+ stars)

nilearn

Machine learning and statistical learning for neuroimaging in Python, providing easy-to-use tools for fMRI and MRI analysis including decoding, connectivity estimation, and parcellation with seamless scikit-learn integration (INRIA Parietal team, 1.4K+ stars)

braindecode

Deep learning software to decode EEG, ECG or MEG signals, providing standardized neural network models, preprocessing pipelines, and evaluation workflows for brain-computer interfaces and cognitive neuroscience research (1.2K+ stars, BSD 3-Clause, actively maintained)

TRIBE v2

Meta FAIR's foundation model of vision, audition, and language for in-silico neuroscience, predicting fMRI brain responses to naturalistic multimodal stimuli (video, audio, text) through a unified Transformer architecture mapped to the cortical surface (2026)

Kilosort (Nature Methods 2024)

Fast spike sorting with drift correction for extracellular electrophysiology, using template matching and graph-based clustering to sort spikes from high-density neural probe recordings (MouseLand, 609+ stars)

CEBRA (Nature 2023)

Learnable latent embeddings for joint behavioral and neural analysis, enabling consistent and interpretable mapping of neural activity to behavior across modalities, species, and experiments (EPFL & Harvard, 1K+ stars)

SLEAP

Deep learning-based multi-animal pose tracking and behavior classification, enabling automated quantification of social interactions and collective behavior across species (Nature Methods 2022, 2.2K+ stars)

DeepLabCut

Markerless pose estimation of user-defined features with deep learning for all animals including humans, enabling quantitative behavioral analysis in neuroscience and ethology (Nature Neuroscience 2018, 5.6K+ stars)

AlphaGenome

Google DeepMind's unified DNA sequence foundation model predicting molecular consequences of genetic variants from single-base resolution up to 1 megabase context, jointly outputting thousands of regulatory tracks (RNA expression, splicing, chromatin accessibility, TF binding, contact maps) for human and mouse genomes via a Python client and non-commercial API (2025)

AlphaMissense

Google DeepMind's AlphaFold-derived classifier for proteome-wide missense variant effect prediction, providing pathogenicity scores for all ~71M possible human missense variants and classifying 89% with 90% precision; pre-computed predictions are integrated into Ensembl VEP and UCSC Genome Browser to support clinical variant interpretation (Science 2023)

OpenCRISPR

First open-source AI-generated gene editing systems developed with protein language models, enabling programmable CRISPR-Cas nucleases for synthetic biology and therapeutic genome editing (Profluent, 2024)

DNA Claude Analysis

Interactive personal genome analysis toolkit using Claude Code and Python. Parses raw genotyping data from consumer DNA services and analyzes SNPs across 17 categories including health risks, pharmacogenomics, ancestry, and nutrition, with a terminal-style HTML dashboard.

GenePT

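Gene and cell embeddings for single-cell biology, built from GPT-3.5 text embeddings of gene descriptions rather than from pretraining on expression data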

scBERT

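BERT-style language model pretrained on unlabeled single-cell RNA-seq data and fine-tuned for cell type annotation (Nature Machine Intelligence 2022)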

DNABERT-2 (ICLR 2024)

Efficient foundation model and benchmark for multi-species genome understanding, replacing the k-mer tokenization of DNABERT with byte pair encoding to improve transfer learning across diverse genomic tasks (Northwestern MAGICS Lab, 484+ stars)

DNABERT

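BERT-based DNA language model pretrained on k-mer tokenized human genome sequences, fine-tunable for promoter, splice site, and transcription factor binding site prediction (Bioinformatics 2021)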

Enformer

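Transformer-based model predicting gene expression and chromatin states directly from DNA sequence, integrating long-range regulatory interactions up to 100 kb from the target (DeepMind, Nature Methods 2021)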

ChatSpatial

MCP server enabling spatial transcriptomics analysis via natural language, integrating 60+ methods, including SpaGCN, Cell2location, LIANA+, and CellRank, for the Visium, Xenium, and MERFISH platforms

Cell2Sentence

Framework that teaches large language models the language of biology by converting single-cell transcriptomes into text "cell sentences" of gene names ranked by expression (ICML 2024)

mLLMCelltype

Multi-LLM consensus framework for automated cell type annotation in single-cell transcriptomics, integrating predictions from 10+ large language models with iterative discussion and uncertainty quantification to reduce single-model biases, achieving up to 95% accuracy without reference datasets; available as a CRAN R package and a PyPI Python package with Scanpy/Seurat integration (2025)

CellTypist

Automated cell type annotation tool for single-cell transcriptomics using logistic regression classifiers optimized by stochastic gradient descent and trained on reference atlases, enabling standardized classification across datasets (Wellcome Sanger Institute, Science 2022)

scGPT

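Generative pretrained transformer foundation model for single-cell multi-omics, pretrained on 33M+ cells and fine-tunable for cell type annotation, perturbation prediction, and batch integration (Nature Methods 2024)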

GEARS

Geometric deep learning model predicting transcriptional outcomes of novel single- and multi-gene perturbations using gene–gene knowledge graphs, with 40% higher precision than prior methods on combinatorial perturbation prediction (Stanford, Nature Biotechnology 2024)

scvi-tools

Deep probabilistic framework for single-cell and spatial omics analysis, integrating scVI, scANVI, totalVI and other VAE-based models for batch correction, cell annotation, multi-omics integration, and RNA velocity (scverse/NumFOCUS, Nature Methods 2018/2024)

State (Arc Institute, bioRxiv 2025)

Machine learning model predicting cellular perturbation response across diverse contexts with State Transition (ST) and State Embedding (SE) variants, featuring CLI tooling, PyPI distribution, and Virtual Cell Challenge integration (575+ stars)

Stack

Arc Institute's single-cell foundation model enabling in-context learning at inference time via a novel tabular attention architecture, trained on 150M uniformly-preprocessed cells for generalizing biological effects and generating unseen cell profiles in novel contexts (2025)

Tahoe-x1

Apache 2.0 single-cell foundation model family scaling to 3B parameters, pretrained on 266M cell profiles including perturbation data and released with training, embedding, and downstream benchmarking workflows for disease-relevant single-cell tasks (2025)

scFoundation

100M-parameter foundation model pretrained on 50M+ human single-cell transcriptomes covering ~20,000 genes, achieving state-of-the-art results on gene expression enhancement, drug response prediction, and perturbation prediction (Nature Methods 2024)

Nicheformer

Foundation model jointly trained on single-cell and spatial transcriptomics data, enabling unified representation learning across cellular and tissue spatial contexts for cell type prediction, spatial domain inference, and cross-modal integration (theislab, bioRxiv 2024, 164+ stars)

Geneformer

Single-cell transformer foundation model pretrained on 104M human transcriptomes via masked gene prediction, enabling transfer learning for cell type classification, gene network analysis, and in silico perturbation with limited labeled data (Nature 2023, V2 2024)

LucaOne

Generalized biological foundation model with unified nucleic acid and protein language, integrating DNA/RNA/protein sequences (Nature Machine Intelligence 2025)

CodonFM (NVIDIA)

Family of codon-resolution language models trained on 130 million protein-coding sequences from over 20,000 species, enabling cross-species gene expression prediction and codon-level functional genomics (2025)

Caduceus (ICML 2024)

Bi-directional DNA language model based on the Mamba state space architecture, enabling efficient long-range genomic sequence modeling with linear-time complexity and built-in reverse-complement equivariance; achieves strong performance on chromatin accessibility, enhancer, and promoter prediction benchmarks (Kuleshov Group, Cornell, 500+ stars)

HyenaDNA

Long-range genomic foundation model using subquadratic Hyena operators instead of Transformer attention, enabling context lengths up to 1 million nucleotides for chromosome-scale DNA sequence modeling and downstream genomics tasks (Stanford Hazy Research, NeurIPS 2023, 784+ stars, Apache 2.0)

Nucleotide Transformer

Foundation models for genomics and transcriptomics pretrained on 3,000+ human genomes and 850+ diverse species, enabling chromatin accessibility prediction, splice site detection, and promoter classification across multiple model scales (InstaDeep, NVIDIA & TUM, Nature Methods 2023)

Evo 2

Arc Institute's 40B-parameter genome foundation model trained on 9 trillion nucleotides from all domains of life, supporting 1M base pair context for generalist DNA/RNA/protein prediction and design (Nature 2026)

AIDO.ModelGenerator

GenBio AI's software stack for the AI-Driven Digital Organism, supporting adaptation and finetuning of multiscale biological foundation models across DNA, RNA, protein, structure, and single-cell tasks with reproducible CLIs and pretrained model zoo (2025)

gRNAde

Generative AI framework for inverse design of 3D RNA structure and function using geometric deep learning, learning design rules from 3D structures to capture complex tertiary interactions (pseudoknots, non-canonical base pairs) with expert-level accuracy for designing functional RNAs including aptamers and ribozymes (bioRxiv 2025)

RNA-FM (Nature Methods 2024)

RNA foundation model trained on millions of RNA sequences for generalist RNA sequence understanding, enabling downstream structure prediction, function annotation, and representation learning for non-coding RNAs (ml4bio, 372+ stars)

RhoFold+

End-to-end RNA 3D structure prediction using RNA language model pretrained on 23.7M sequences, outperforming existing methods and human expert groups on RNA-Puzzles and CASP15 (Nature Methods 2024)

mosaic

Composite-objective protein design framework integrating Boltz, AlphaFold2, OpenFold3, ProteinMPNN, and ESM via JAX-based gradient optimization over continuous relaxed sequence space for multi-property binder design (319+ stars, MIT License, 2025)

ImmunoStruct (Nature Machine Intelligence 2025)

Multimodal deep learning framework integrating peptide-MHC protein sequence, structure, and biochemical properties to predict class-I immunogenicity for infectious disease epitopes and cancer neoepitopes with cancer-wildtype contrastive learning, enabling personalized vaccine design (Krishnaswamy Lab, Yale University)

Foldseek

Fast and accurate protein structure search using a learned 3Di structural alphabet (VQ-VAE) that discretizes tertiary interactions into structural tokens, enabling protein-universe-scale structural alignment at sequence-search speeds (4-5 orders of magnitude faster than DALI/TM-align) and underpinning many AI4S tools such as SaProt, ESMAtlas search, and AFDB clustering pipelines (Steinegger Lab, Nature Biotechnology 2023)

DPLM (ByteDance, ICML 2024 / ICLR 2025)

Family of diffusion protein language models with generative and predictive capabilities for protein sequences and structures, including multimodal co-generation, conditional folding, inverse folding, motif scaffolding, and representation learning, with open pretrained weights and training scripts (327+ stars, ICML 2025 Spotlight)

EVOLVEpro