Open Science Index

Find open-source science resources

Cross-domain directory aggregating tools, AI models, datasets, and research resources from bio.tools, Bioconductor, HuggingFace, curated GitHub awesome-lists, and more.

Filters

Domain

Software: 9
Infrastructure: 8
SingleCell: 4
Protein & Drug Discovery: 3
StatisticalMethod: 3
Annotation: 2
BiologicalQuestion: 2
DNAMethylation: 2
GeneExpression: 2
General Chemistry: 2
Genetics: 2
Genomics & Bioinformatics: 2
(None): 55

Language

R: 55
Python: 36
HTML: 14
Makefile: 6
Jupyter Notebook: 4
C: 2
C++: 2
Java: 2
JavaScript: 2
Web Ontology Language: 2
XSLT: 2
CSS: 1
(None): 18

License

MIT: 19
Artistic-2.0: 13
CC-BY-4.0: 13
NOASSERTION: 11
Apache-2.0: 10
CC0-1.0: 8
GPL-3: 7
GPL-3.0: 6
BSD-3-Clause: 3
AGPL-3.0: 2
CC-BY-3.0: 2
GPL (>= 3): 2
(None): 41

Source(1)

bioregistry: 2418
bioconductor: 2412
awesome-ai-for-science: 363
huggingface: 168
github: 150
awesome-bioinformatics: 126
awesome-python-chemistry: 87
bio.tools: 50
awesome-cheminformatics: 45
awesome-scientific-python: 18

Type

Software tool: 91
Database: 59

150 of 5,662 resources

Showing 1–50

Unified Code for Units of Measure

Unified Code for Units of Measure (UCUM) is a code system intended to include all units of measures being contemporarily used in international science, engineering, and business.

★98 · 10 months ago

qsvaR

The qsvaR package contains functions for removing the effect of degradation in RNA-seq data from postmortem brain tissue. The package is equipped to help users generate principal components associated with degradation. The components can be used in differential expression analysis to remove the effects of degradation.

★0 · 1 month ago

Paper2All

Website & Interactive Content Generation

AI-powered pipeline converting papers into interactive websites, posters, and multimedia presentations with "Let's Make Your Paper Alive!" philosophy

★373 · 7 months ago

Casanovo

Genomics & Bioinformatics

Transformer encoder-decoder for de novo peptide sequencing from tandem mass spectrometry, translating MS/MS spectra directly to peptide sequences without reference databases, enabling identification of novel peptides for immunopeptidomics, antibody repertoires, and metaproteomes (Noble Lab UW, Nature Communications 2024)

★187 · 2 days ago

NanoParticle Ontology

An ontology that represents the basic knowledge of physical, chemical and functional characteristics of nanotechnology as used in cancer diagnosis and therapy.

★1 · 10 years ago

Web Ontology Language

stk

General Chemistry

A library for building, manipulating, analyzing and automatic design of molecules, including a genetic algorithm.

★284 · 4 months ago

DISCO

Protein & Drug Discovery

General multimodal protein design framework enabling DNA-encoding of chemistry for programmable enzyme design and diverse protein generation through diffusion-based generative modeling (190+ stars, Apache 2.0, 2026)

★190 · 1 week ago

spatialDE

SpatialDE is a method to find spatially variable genes (SVG) from spatial transcriptomics data. This package provides wrappers to use the Python SpatialDE library in R, using reticulate and basilisk.

★3 · 1 year ago

The Extensible Observation Ontology

The Extensible Observation Ontology (OBOE) is a formal ontology for capturing the semantics of scientific observation and measurement. The ontology helps researchers add detailed semantic annotations to scientific data, thereby clarifying the inherent meaning of scientific observations.

★33 · 6 years ago

multiHiCcompare

multiHiCcompare provides functions for joint normalization and difference detection in multiple Hi-C datasets. This extension of the original HiCcompare package now allows for Hi-C experiments with more than 2 groups and multiple samples per group. multiHiCcompare operates on processed Hi-C data in the form of sparse upper triangular matrices. It accepts four column (chromosome, region1, region2, IF) tab-separated text files storing chromatin interaction matrices. multiHiCcompare provides cyclic loess and fast loess (fastlo) methods adapted to jointly normalizing Hi-C data. Additionally, it provides a general linear model (GLM) framework adapting the edgeR package to detect differences in Hi-C data in a distance dependent manner.

★10 · 4 years ago

Vitro Application Ontology

Vitro is a full stack framework for building semantic web applications. It is not domain specific.

★115 · 1 week ago

fastreeR

Calculate distances, build phylogenetic trees or perform hierarchical clustering between the samples of a VCF or FASTA file. Functions are implemented in Java-11 and called via rJava. Parallel implementation that operates directly on the VCF or FASTA file for fast execution.

★31 · 3 weeks ago

fmcsR

Cheminformatics

The fmcsR package introduces an efficient maximum common substructure (MCS) algorithm combined with a novel matching strategy that allows for atom and/or bond mismatches in the substructures shared among two small molecules. The resulting flexible MCSs (FMCSs) are often larger than strict MCSs, resulting in the identification of more common features in their source structures, as well as a higher sensitivity in finding compounds with weak structural similarities. The fmcsR package provides several utilities to use the FMCS algorithm for pairwise compound comparisons, structure similarity searching and clustering.

★6 · 10 years ago

SCFA

Subtyping via Consensus Factor Analysis (SCFA) can efficiently remove noisy signals from consistent molecular patterns in multi-omics data. SCFA first uses an autoencoder to select only important features and then repeatedly performs factor analysis to represent the data with different numbers of factors. Using these representations, it can reliably identify cancer subtypes and accurately predict risk scores of patients.

★3 · 3 years ago

scider

scider is a user-friendly R package providing functions to model the global density of cells in a slide of spatial transcriptomics data. All functions in the package are built based on the SpatialExperiment object, allowing integration into various spatial transcriptomics-related packages from Bioconductor. After modelling density, the package allows for several downstream analyses, including colocalization analysis, boundary detection analysis and differential density analysis.

★10 · 4 weeks ago

jvecfor

Drop-in replacement for BiocNeighbors::findKNN using the jvecfor Java library, which builds on the jvector library to leverage the Java Vector API for portable SIMD acceleration across AVX2, AVX-512, and ARM NEON hardware. jvecfor/jvector implements HNSW-DiskANN approximate search and VP-tree exact search. The package achieves approximately 2x speedup over Annoy-based search at n >= 50K cells while returning output structurally identical to BiocNeighbors, making it suitable for seamless integration into existing Bioconductor single-cell workflows. Convenience wrappers delegate shared nearest-neighbor (SNN) and k-nearest-neighbor (KNN) graph construction to the bluster package.

★3 · 3 weeks ago

koinar

MassSpectrometry

A client to simplify fetching predictions from the Koina web service. Koina is a model repository enabling the remote execution of models. Predictions are generated as a response to HTTP/S requests, the standard protocol used for nearly all web traffic.

★53 · 2 weeks ago

Jupyter Notebook

TOP

TOP constructs a transferable model across gene expression platforms for prospective experiments. Such a transferable model can be trained to make predictions on independent validation data with an accuracy that is similar to a re-substituted model. The TOP procedure also has the flexibility to be adapted to suit the most common clinical response variables, including linear response, binomial and Cox PH models.

★0 · 11 months ago

metapod

MultipleComparison

Implements a variety of methods for combining p-values in differential analyses of genome-scale datasets. Functions can combine p-values across different tests in the same analysis (e.g., genomic windows in ChIP-seq, exons in RNA-seq) or for corresponding tests across separate analyses (e.g., replicated comparisons, effect of different treatment conditions). Support is provided for handling log-transformed input p-values, missing values and weighting where appropriate.

★2 · 3 months ago

SPAdes

SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines and the de-facto standard for prokaryotic genome assemblies.

★935 · 2 weeks ago

MsFeatures

The MsFeatures package defines functionality for Mass Spectrometry features. This includes functions to group (LC-MS) features based on some of their properties, such as retention time (coeluting features), or correlation of signals across samples. The package thus groups features, and its results can be used as input for the `QFeatures` package, which aggregates abundance levels of features within each group. This package defines concepts and functions for base and common data types; implementations for more specific data types are expected to be provided in the respective packages (e.g. `xcms`). All functionality of this package is implemented in a modular way, which allows different grouping approaches to be combined and enables re-use in other R packages.

★7 · 3 months ago

rBiopaxParser

DataRepresentation

Parses BioPAX files and represents them in R. At the moment, BioPAX level 2 and level 3 are supported.

★10 · 5 years ago

OutSplice

AlternativeSplicing

An easy-to-use tool that can compare splicing events in tumor and normal tissue samples using either a user-generated matrix or data from The Cancer Genome Atlas (TCGA). This package generates a matrix of splicing outliers, denoted by chromosome location, that are significantly over- or underexpressed in tumor samples compared to normal samples. The package also calculates the splicing burden in each tumor and characterizes the types of splicing events that occur.

★1 · 1 year ago

Annotation Ontology

The Annotation Ontology specification is currently used as input for the activities of the W3C Open Annotation Community Group (http://www.w3.org/community/openannotation/) that works towards a common, RDF-based, specification for annotating digital resources. The Group effort starts by working towards a reconciliation of two proposals that have emerged over the past two years: the Annotation Ontology (http://code.google.com/p/annotation-ontology/) and the Open Annotation Model (http://www.openannotation.org/spec/beta/). Initially, editors of these proposals will closely collaborate to devise a common draft specification that addresses requirements and use cases that were identified in the course of their respective efforts. The goal is to make this draft available for public feedback and experimentation in the second quarter of 2012. The final deliverable of the Open Annotation Community Group will be a specification, published under an appropriate open license, that is informed by the existing proposals, the common draft specification, and the community feedback. [from homepage]

★0 · 11 years ago

Web Ontology Language

CytoML

Uses platform-specific implementations of the GatingML2.0 standard to exchange gated cytometry data with other software platforms.

★35 · 3 months ago

PLSDAbatch

StatisticalMethod

A novel framework to correct for batch effects prior to any downstream analysis in microbiome data based on Projection to Latent Structures Discriminant Analysis. The main method is named “PLSDA-batch”. It first estimates treatment and batch variation with latent components, then subtracts batch-associated components from the data whilst preserving biological variation of interest. PLSDA-batch is highly suitable for microbiome data as it is non-parametric, multivariate and allows for ordination and data visualisation. Combined with centered log-ratio transformation for addressing uneven library sizes and compositional structure, PLSDA-batch addresses all characteristics of microbiome data that existing correction methods have ignored so far. Two other variants are proposed for 1/ unbalanced batch x treatment designs that are commonly encountered in studies with small sample sizes, and for 2/ selection of discriminative variables amongst treatment groups to avoid overfitting in classification problems. These two variants have widened the scope of applicability of PLSDA-batch to different data settings.

★14 · 4 months ago

G4SNVHunter

G-quadruplexes (G4s) are unique nucleic acid secondary structures predominantly found in guanine-rich regions and have been shown to be involved in various biological regulatory processes. G4SNVHunter is an R package designed to rapidly identify genomic sequences with G4-forming propensity and to accurately screen user-provided single nucleotide variants—as well as other small-scale variants such as indels and MNVs—for their potential to destabilize these structures. This allows researchers to then screen these critical variants for deeper study, digging into how they might influence biological functions—think gene regulation, for instance—by impairing G4 formation propensity.

★0 · 12 months ago

regionalpcs

Functions to summarize DNA methylation data using regional principal components. Regional principal components are computed using principal components analysis within genomic regions to summarize the variability in methylation levels across CpGs. The number of principal components is chosen using either the Marchenko-Pastur or Gavish-Donoho method to identify relevant signal in the data.

★4 · 2 years ago

FitSNAP

A Package For Training SNAP Interatomic Potentials for use in the LAMMPS molecular dynamics package.

★186 · 7 months ago

tRNAscanImport

The package imports the result of tRNAscan-SE as a GRanges object.

★2 · 7 months ago

GeneExpressionSignature

This package gives the implementations of the gene expression signature and its distance to each. Gene expression signature is represented as a list of genes whose expression is correlated with a biological state of interest. And its distance is defined using a nonparametric, rank-based pattern-matching strategy based on the Kolmogorov-Smirnov statistic. Gene expression signature and its distance can be used to detect similarities among the signatures of drugs, diseases, and biological states of interest.

★1 · 5 years ago

VCFArray

VCFArray extends the DelayedArray to represent VCF data entries as array-like objects with on-disk / remote VCF file as backend. Data entries from VCF files, including info fields, FORMAT fields, and the fixed columns (REF, ALT, QUAL, FILTER) could be converted into VCFArray instances with different dimensions.

★1 · 7 years ago

asuri

The ASURI (Analysis of SUrvival and patients RIsk prediction based on gene signatures) package discovers marker genes that are related to risk prediction capabilities and to a clinical variable of interest. It uses two main steps, including subsampling glmnet and unicox. The package implements robust functions to discover survival markers related to a clinical phenotype and to predict a risk score, making it possible to study the patient's risk based on the gene signatures. Several plots are provided to visualise the relevance of the genes, the risk score, and patient stratification, as well as a robust version of the Kaplan-Meier curves.

★0 · 3 weeks ago

CNVMetrics

BiologicalQuestion

The CNVMetrics package calculates similarity metrics to facilitate copy number variant comparison among samples and/or methods. Similarity metrics can be employed to compare CNV profiles of genetically unrelated samples as well as those with a common genetic background. Some metrics are based on the shared amplified/deleted regions while other metrics rely on the level of amplification/deletion. The data type used as input is a plain text file containing the genomic position of the copy number variations, as well as the status and/or the log2 ratio values. Finally, a visualization tool is provided to explore resulting metrics.

★4 · 4 months ago

PyLabRobot

Lab Automation & Robotics

Interactive and hardware-agnostic SDK for laboratory automation, enabling programmatic control of liquid handlers, plate readers, and other lab instruments across multiple vendors; foundational infrastructure for self-driving laboratories and AI-driven experimental execution (447+ stars)

★450 · 2 days ago

GOaGO

GO-a-GO annotates Gene Ontology terms that are enriched in a given set of gene pairs. The enrichment is calculated from a permutation test for overrepresentation of gene pairs that are associated with a shared term. Such gene pairs are counted for the original set of gene pairs and compared against randomized sets in which the structure of the pairs is preserved, but the gene identities (including the associated terms) are permuted.

★1 · 2 weeks ago

SciWrite

Scientific Writing & Collaboration

Agent skill for AI-assisted scientific manuscript writing review distilled from Stanford's *Writing in the Sciences* course, performing five sequential editorial audit passes on clarity, voice, structure, consistency, and integrity (2026)

★675 · 1 month ago

sRACIPE

sRACIPE implements a randomization-based method for gene circuit modeling. It allows us to study the effect of both the gene expression noise and the parametric variation on any gene regulatory circuit (GRC) using only its topology, and simulates an ensemble of models with random kinetic parameters at multiple noise levels. Statistical analysis of the generated gene expressions reveals the basin of attraction and stability of various phenotypic states and their changes associated with intrinsic and extrinsic noises. sRACIPE provides a holistic picture to evaluate the effects of both the stochastic nature of cellular processes and the parametric variation.

★6 · 3 months ago

ggseqalign

Simple visualizations of alignments of DNA or AA sequences as well as arbitrary strings. Compatible with Biostrings and ggplot2. The plots are fully customizable using ggplot2 modifiers such as theme().

★0 · 1 year ago

Scholarly Contributions and Roles Ontology

An ontology based on PRO for describing the contributions that may be made, and the roles that may be held by a person with respect to a journal article or other publication (e.g. the role of article guarantor or illustrator).

★1 · 6 years ago

AfterQC

Sequence Processing

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data.

★214 · 6 years ago

immLynx

A comprehensive toolkit that bridges popular Python-based immune repertoire analysis tools and Hugging Face protein language models into the R environment. Provides unified interfaces for TCR distance calculations (tcrdist3), sequence generation probability (OLGA), selection inference (soNNia), clustering (clusTCR), protein embeddings (ESM-2), and metaclone discovery (metaclonotypist). Fully compatible with the scRepertoire and immApex ecosystem for single-cell immune repertoire analysis.

★2 · 1 week ago

Helical

Genomics & Bioinformatics

Unified framework for state-of-the-art pre-trained bio foundation models across genomics and transcriptomics, providing standardized interfaces and pipelines for DNA, RNA, and single-cell models including Evo 2, Geneformer, scGPT, and UCE with streamlined inference, benchmarking, and fine-tuning workflows (213+ stars, 2024-2025)

★215 · 3 weeks ago

VariantFiltering

Filter genetic variants using different criteria such as inheritance model, amino acid change consequence, minor allele frequencies across human populations, splice site strength, conservation, etc.

★4 · 7 months ago

scToppR

scToppR provides an easy-to-use API wrapper for the ToppGene web platform, used for gene ontology and functional enrichment research. The package also integrates visualization tools, making it a convenient tool directly connecting ToppGene to code-based workflows in R. The tool can also easily save results into different formats.

★7 · 1 month ago

CCPlotR

CCPlotR is an R package for visualising results from tools that predict cell-cell interactions from single-cell RNA-seq data. These plots are generic and can be used to visualise results from multiple tools such as Liana, CellPhoneDB, NATMI etc.

★47 · 2 months ago

Awesome LLM Scientific Discovery

📋 Paper Collections & Repositories

LLM papers for scientific discovery

★345 · 6 months ago

ChemFormula

General Chemistry

ChemFormula provides a class for working with chemical formulas. It allows parsing chemical formulas, calculating formula weights, and generating formatted output strings (e.g. in HTML, LaTeX, or Unicode).

★33 · 6 months ago

Equiformer

Machine Learning for Physics

Equivariant graph attention Transformer (ICLR2023)

★282 · 1 year ago

DeepAnalyze

Data Analysis & Visualization

First agentic LLM for autonomous data science with end-to-end pipeline from data to analyst-grade reports

★4.2K · 1 month ago
