Find open-source science resources

Sequence Processing

A quality control tool for high throughput sequence data.

6013 days ago

Java

The topGO package provides tools for testing GO terms while accounting for the topology of the GO graph. Different test statistics and different methods for eliminating local similarities and dependencies between GO terms can be implemented and applied.

210 months ago

AnVILAz

The AnVIL is a cloud computing resource developed in part by the National Human Genome Research Institute. The AnVILAz package supports end-users and developers using the AnVIL platform in the Azure cloud. The package provides a programmatic interface to AnVIL resources, including workspaces, notebooks, tables, and workflows. The package also provides utilities for managing resources, including copying files to and from Azure Blob Storage, and creating shared access signatures (SAS) for secure access to Azure resources.

0

2 days ago

regionReport

DifferentialExpression

Generate HTML or PDF reports to explore a set of regions such as the results from annotation-agnostic expression analysis of RNA-seq data at base-pair resolution performed by derfinder. You can also create reports for DESeq2 or edgeR results.

9

1 month ago

SeqFu

Sequence Processing

Sequence manipulation toolkit for FASTA/FASTQ files written in Nim.

1273 days ago

Nim

PHAT

Genome Browsers / Gene Diagrams

Point-and-click, cross-platform suite for analysing and visualizing next-generation sequencing datasets.

173 years ago

TypeScript

RFGeneRank

Transcriptomics

Tools to harmonize bulk RNA-seq matrices, optionally apply batch correction, and train cross-validated classification models using ranger, glmnet, or xgboost. Supports leakage-safe feature selection, permutation importance, SHAP-based interpretability, and calibration methods (Platt or isotonic). Provides stability metrics across folds, embeddings (PCA/UMAP), ROC visualization, SHAP dependence plots, and tidy ranked-gene tables for downstream analysis.

0

1 month ago

NVIDIA PhysicsNeMo

Physics-Informed Neural Networks

Open-source framework for building physics-ML models at scale (renamed from Modulus, 2025)

2.8K

2 days ago

pRoloc

ImmunoOncology

The pRoloc package implements machine learning and visualisation methods for the analysis and interrogation of quantitative mass spectrometry data to reliably infer protein sub-cellular localisation.

16

1 month ago

scConform

Builds prediction intervals for cell type annotation using conformal inference and conformal risk control. It provides two main methods. The first gives prediction intervals with coverage guarantees based on standard conformal inference. The second gives hierarchical prediction intervals that are consistent with the cell ontology.

7

1 month ago

MedSAM

Medical AI & Clinical Applications

Universal medical image segmentation foundation model trained on 1.57M image-mask pairs across 10 imaging modalities and 30+ cancer types (Nature Communications 2024)

4.3K

1 year ago

Jupyter Notebook

BiocSingular

Implements exact and approximate methods for singular value decomposition and principal components analysis, in a framework that allows them to be easily switched within Bioconductor packages or workflows. Where possible, parallelization is achieved using the BiocParallel framework.

86 months ago

pathview

Pathways

Pathview is a tool set for pathway-based data integration and visualization. It maps and renders a wide variety of biological data on relevant pathway graphs. All users need to do is supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and renders the pathway graph with the mapped data. In addition, Pathview integrates seamlessly with pathway and gene set (enrichment) analysis tools for large-scale and fully automated analysis.

48

1 year ago

REINVENT

Protein & Drug Discovery

Industrial-grade reinforcement-learning-based generative platform for de novo molecular design with transformer architectures, supporting multi-objective optimization, scaffold decoration, and curriculum learning (AstraZeneca MolecularAI, REINVENT 4, 2024)

Archived

373

1 year ago

gwasurvivr

GenomeWideAssociation

gwasurvivr is a package to perform survival analysis using Cox proportional hazard models on imputed genetic data.

132 years ago

Molecular Transformers

Chemical Synthesis

AI for chemical reaction prediction and synthesis planning

4244 years ago

graphein

Machine Learning

Provides functionality for producing geometric representations of protein and RNA structures, and biological interaction networks.

1.2K

2 days ago

Jupyter Notebook

The Leek group guide to genomics papers

Miscellaneous

Expertly curated genomics papers to get up to speed on genomics, RNA-seq, statistics (used in genomics), software development, and more.

5027 years ago

OmicVerse

Genomics & Bioinformatics

Unified Python framework for bulk, single-cell, and spatial RNA-seq multi-omics analysis with deep learning deconvolution (VAE) and graph neural networks, bridging Bindea, scanpy and squidpy ecosystems (Nature Communications 2024)

1K

1 hour ago

rmelting

BiomedicalInformatics

R interface to the MELTING 5 program (https://www.ebi.ac.uk/biomodels/tools/melting/) to compute melting temperatures of nucleic acid duplexes along with other thermodynamic parameters.

23 years ago

scQTLtools

scQTLtools is a comprehensive R/Bioconductor package that facilitates end-to-end single-cell eQTL analysis, from preprocessing to visualization.

63 months ago

AnVILGCP

The package provides a set of functions to interact with the Google Cloud Platform (GCP) services on the AnVIL platform. The package is designed to use the API calls from the AnVIL package. It coordinates AnVIL workspace functionality with native GCP tools.

0

2 days ago

TargetDecoy

MassSpectrometry

A first step in the data analysis of Mass Spectrometry (MS) based proteomics data is to identify peptides and proteins. In this respect, the huge number of experimental mass spectra typically has to be assigned to theoretical peptides derived from a sequence database. Search engines are used for this purpose. These tools compare each of the observed spectra to all candidate theoretical spectra derived from the sequence database and calculate a score for each comparison. The observed spectrum is then assigned to the theoretical peptide with the best score, which is also referred to as the peptide to spectrum match (PSM). It is of course crucial for the downstream analysis to evaluate the quality of these matches. Therefore, False Discovery Rate (FDR) control is used to return a reliable list of PSMs. The FDR, however, requires a good characterisation of the score distribution of PSMs that are matched to the wrong peptide (bad target hits). In proteomics, the target decoy approach (TDA) is typically used for this purpose. The TDA method matches the spectra to a database of real (targets) and nonsense peptides (decoys). A popular approach to generate these decoys is to reverse the target database. Hence, all the PSMs that match to a decoy are known to be bad hits, and the distribution of their scores is used to estimate the distribution of the bad scoring target PSMs. A crucial assumption of the TDA is that the decoy PSM hits have similar properties to bad target hits, so that the decoy PSM scores are a good simulation of the bad target PSM scores. Users, however, typically do not evaluate these assumptions. To this end we developed TargetDecoy to generate diagnostic plots to evaluate the quality of the target decoy method.

13 years ago

Python for chemoinformatics

2344 years ago

DifferentialEquations.jl

Neural Differential Equations

Julia differential equations suite

3.1K

1 week ago

Julia

simplifyEnrichment

A new clustering algorithm, "binary cut", for clustering similarity matrices of functional terms is implemented in this package. It also provides functions for visualizing, summarizing and comparing the clusterings.

1253 months ago

ai-models (ECMWF)

Climate Modeling

ECMWF's unified framework and command-line tool to run AI-based weather forecasting models (GraphCast, Aurora, Pangu, NeuralGCM, FourCastNet) with operational ECMWF data infrastructure, enabling standardized inference and benchmarking across state-of-the-art meteorological AI systems (ECMWF, 576+ stars)

579

5 months ago

Structstrings

DataImport

The Structstrings package implements the widely used dot bracket annotation for storing base pairing information in structured RNA. Structstrings uses the infrastructure provided by the Biostrings package and derives the DotBracketString and related classes from the BString class. From these, base pair tables can be produced for in-depth analysis. In addition, the loop indices of the base pairs can be retrieved as well. For better efficiency, information conversion is implemented in C, inspired to a large extent by the ViennaRNA package.

55 months ago

ACME

Technology

ACME (Algorithms for Calculating Microarray Enrichment) is a set of tools for analysing tiling array ChIP/chip, DNAse hypersensitivity, or other experiments that result in regions of the genome showing "enrichment". It does not rely on a specific array technology (although the array should be a "tiling" array), is very general (can be applied in experiments resulting in regions of enrichment), and is very insensitive to array noise or normalization methods. It is also very fast and can be applied on whole-genome tiling array experiments quite easily with enough memory.

0

1 month ago

target

Implement the BETA algorithm for inferring direct target genes from DNA-binding and perturbation expression data Wang et al. (2013) <doi: 10.1038/nprot.2013.150>. Extend the algorithm to predict the combined function of two DNA-binding elements from comparable binding and expression data.

56 years ago

HTML

MIRit

MIRit is an R package that provides several methods for investigating the relationships between miRNAs and genes in different biological conditions. In particular, MIRit allows users to explore the functions of dysregulated miRNAs, and makes it possible to identify miRNA-gene regulatory axes that control biological pathways, thus enabling users to unveil the complexity of miRNA biology. MIRit is an all-in-one framework that aims to help researchers in all the central aspects of integrative miRNA-mRNA analyses, from differential expression analysis to network characterization.

23 months ago

CluMSID

Metabolomics

CluMSID is a tool that aids the identification of features in untargeted LC-MS/MS analysis by the use of MS2 spectra similarity and unsupervised statistical methods. It offers functions for a complete and customisable workflow from raw data to visualisations and is interfaceable with the xcms family of preprocessing packages.

104 years ago

scPassport

SingleCell

Stamps Seurat, SingleCellExperiment, and SummarizedExperiment objects with a persistent metadata passport. For Seurat objects the passport is stored in the misc slot; for SingleCellExperiment and SummarizedExperiment objects it is stored in the metadata slot. Tracks animal info, experiment details, lineage (parent/child relationships), RDS registry numbers, processing logs, and custom fields. Includes an interactive Shiny gadget to fill and update the passport, and a read mode to print the full passport to console. The passport persists inside the RDS file with no external files needed.

3

1 month ago

multistateQTL

FunctionalGenomics

A collection of tools for doing various analyses of multi-state QTL data, with a focus on visualization and interpretation. The package 'multistateQTL' contains functions which can remove or impute missing data, identify significant associations, and categorise features into global, multi-state or unique. The analysis results are stored in a 'QTLExperiment' object, which is based on the 'SummarizedExperiment' framework.

12 weeks ago

barcodetrackR

barcodetrackR is an R package developed for the analysis and visualization of clonal tracking data. The required data are sample and tag abundances in matrix form, usually from cellular barcoding experiments, integration site retrieval analyses, or similar technologies.

55 years ago

CC0-1.0

Ruffus

Workflow Managers

Computation pipeline library for Python, widely used in science and bioinformatics.

1754 years ago

markeR

GeneExpression

markeR is an R package that provides a modular and extensible framework for the systematic evaluation of gene sets as phenotypic markers using transcriptomic data. The package is designed to support both quantitative analyses and visual exploration of gene set behaviour across experimental and clinical phenotypes. It implements multiple methods, including score-based and enrichment approaches, and also allows the exploration of expression behaviour of individual genes. In addition, users can assess the similarity of their own gene sets against established collections (e.g., those from MSigDB), facilitating biological interpretation.

102 weeks ago

matter

Infrastructure

Toolbox for larger-than-memory scientific computing and visualization, providing efficient out-of-core data structures using files or shared memory, for dense and sparse vectors, matrices, and arrays, with applications to nonuniformly sampled signals and images.

61

1 month ago

Artistic-2.0

SEMPLR

MotifAnnotation

SEMPLR computes transcription factor binding affinity scores for genomic positions and genetic variants. Scores are computed from SNP Effect Matrices (SEMs) produced by SEMpl. 223 pre-computed SEMs are included with the package or custom sets can be provided. Enrichment can be tested among sets of genomic positions to determine if transcription factor binding events occur more often than expected. Comparing binding affinity scores between alleles can reveal differences in transcription factor binding with genetic variation. This package also includes several visualization functions to view scores both on the motif and variant/position level.

1

1 week ago

NequIP

Materials Discovery

E(3)-equivariant neural network interatomic potentials achieving DFT accuracy with up to 1000× less training data than invariant models, foundational architecture behind MACE and Allegro (Harvard, MIT, Nature Communications 2022)

9144 days ago

mosdef

GeneExpression

This package provides functionality to run a number of tasks in the differential expression analysis workflow. This encompasses the most widely used steps, from running various enrichment analysis tools with a unified interface to creating plots and beautifying table components linking to external websites and databases. This streamlines the generation of comprehensive analysis reports.

0

5 months ago

scDD

ImmunoOncology

This package implements a method to analyze single-cell RNA-seq data using flexible Dirichlet Process mixture models. Genes with differential distributions of expression are classified into several interesting patterns of differences between two conditions. The package also includes functions for simulating data with these patterns from negative binomial distributions.

354 years ago

CalcUS

Simulations

Quantum chemistry web platform that brings together all the tools needed to perform quantum chemistry in a user-friendly web interface.

762 months ago

JavaScript

GENESIS

SNP

The GENESIS package provides methodology for estimating, inferring, and accounting for population and pedigree structure in genetic analyses. The current implementation provides functions to perform PC-AiR (Conomos et al., 2015, Gen Epi) and PC-Relate (Conomos et al., 2016, AJHG). PC-AiR performs a Principal Components Analysis on genome-wide SNP data for the detection of population structure in a sample that may contain known or cryptic relatedness. Unlike standard PCA, PC-AiR accounts for relatedness in the sample to provide accurate ancestry inference that is not confounded by family structure. PC-Relate uses ancestry representative principal components to adjust for population structure/ancestry and accurately estimate measures of recent genetic relatedness such as kinship coefficients, IBD sharing probabilities, and inbreeding coefficients. Additionally, functions are provided to perform efficient variance component estimation and mixed model association testing for both quantitative and binary phenotypes.

442 months ago

FengWu

Climate Modeling

Shanghai AI Lab's deep learning-based global weather forecasting model pushing skillful forecasts beyond 10 days lead, with open-source inference code and pretrained ONNX model weights (arXiv 2023)

1695 months ago

gemma.R

Low- and high-level wrappers for Gemma's RESTful API. They enable access to curated expression and differential expression data from over 10,000 published studies. Gemma is a web site, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles.

10

1 week ago

lipidr

Lipidomics

lipidr is an easy-to-use R package implementing a complete workflow for downstream analysis of targeted and untargeted lipidomics data. Lipidomics results can be imported into lipidr as a numerical matrix or a Skyline export, allowing integration into current analysis frameworks. Data mining of lipidomics datasets is enabled through integration with the Metabolomics Workbench API. lipidr allows data inspection, normalization, and univariate and multivariate analysis, displaying informative visualizations. lipidr also implements a novel Lipid Set Enrichment Analysis (LSEA), harnessing molecular information such as lipid class, total chain length and unsaturation.

332 years ago

immLynx

A comprehensive toolkit that brings popular Python-based immune repertoire analysis tools and Hugging Face protein language models into the R environment. Provides unified interfaces for TCR distance calculations (tcrdist3), sequence generation probability (OLGA), selection inference (soNNia), clustering (clusTCR), protein embeddings (ESM-2), and metaclone discovery (metaclonotypist). Fully compatible with the scRepertoire and immApex ecosystem for single-cell immune repertoire analysis.

2

1 week ago

MIT + file LICENSE

OutSplice

AlternativeSplicing

An easy-to-use tool that can compare splicing events in tumor and normal tissue samples using either a user-generated matrix or data from The Cancer Genome Atlas (TCGA). This package generates a matrix of splicing outliers, denoted by chromosome location, that are significantly over- or underexpressed in tumor samples compared to normal. The package also calculates the splicing burden in each tumor and characterizes the types of splicing events that occur.

1

1 year ago

GPL-2

scider