Find open-source science resources

Sequence Processing

FASTQ and SAM quality control using Python.

ngs-preprocess

A pipeline for preprocessing short and long sequencing reads, built with Nextflow.

R-Peridot

Customizable pipeline for differential expression analysis with an intuitive GUI.

bcbio-nextgen

Batteries-included genomic analysis pipeline for variant and RNA-Seq analysis, structural variant calling, annotation, and prediction.

Bacannot

A generic but comprehensive bacterial annotation pipeline, built with Nextflow, with nice graphical options for investigating results.

Bactopia

A flexible pipeline, built with Nextflow, for the complete analysis of bacterial genomes.

Awesome-Pipeline

A list of pipeline resources.

Workflow Description Language

Workflow standard developed by the Broad.

SeqWare

Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments.

Workflow library embedded in the Go programming language. It focuses on supporting complex workflow constructs, compiling to a single binary, and providing powerful file naming and comprehensive audit reports for every output.

redun

A Python-based workflow manager.

Cromwell

A Workflow Management System geared towards scientific workflows.

Common Workflow Language

A specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments.

Bpipe

A small language for defining pipeline stages and linking them together to make pipelines.

BigDataScript

A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities.

zindex

Create an index on a compressed text file.

wormtable

Write-once-read-many table for large datasets.

tabix

Table file index.

gsort

Sort genomic files according to a specified order.
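As a rough illustration of what sorting "according to a specified order" means, the sketch below orders BED-like intervals by a user-supplied chromosome order and then by coordinate. It is not gsort's implementation or interface; the function name `sort_by_genome_order` and the example data are hypothetical.

```python
from typing import List, Sequence, Tuple

# A BED-like interval: (chromosome, start, end). Hypothetical type for this sketch.
Interval = Tuple[str, int, int]


def sort_by_genome_order(
    intervals: Sequence[Interval], chrom_order: Sequence[str]
) -> List[Interval]:
    """Sort intervals by the given chromosome order, then by start and end."""
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    missing = {iv[0] for iv in intervals if iv[0] not in rank}
    if missing:
        raise ValueError(f"chromosomes not in the specified order: {sorted(missing)}")
    return sorted(intervals, key=lambda iv: (rank[iv[0]], iv[1], iv[2]))


# The order is deliberately not alphabetical, as in many genome files.
CHROM_ORDER = ["chr2", "chr10", "chr1"]
INTERVALS = [
    ("chr1", 5, 10),
    ("chr10", 100, 200),
    ("chr2", 50, 60),
    ("chr2", 10, 20),
]

sorted_intervals = sort_by_genome_order(INTERVALS, CHROM_ORDER)
for chrom, start, end in sorted_intervals:
    print(chrom, start, end, sep="\t")
```

A plain lexicographic sort would put chr1 first and chr10 before chr2, which is why tools in this space take the order as an explicit input.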

grepq

Fast FASTQ filtering by matching reads against one or more regex patterns.
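The idea of filtering reads by regex can be sketched in a few lines of Python. This is only a conceptual illustration using the standard `re` module, not grepq's implementation or command-line interface; the helper names `parse_fastq` and `filter_reads` and the sample reads are hypothetical.

```python
import re
from typing import Iterable, Iterator, Sequence, Tuple

# One FASTQ record: (header, sequence, separator, quality).
Record = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records, assuming unwrapped single-line sequences.

    Blank lines are skipped; an incomplete trailing record is silently dropped.
    """
    it = (line.rstrip("\n") for line in lines if line.strip())
    for header, seq, sep, qual in zip(it, it, it, it):
        yield header, seq, sep, qual


def filter_reads(records: Iterable[Record], patterns: Sequence[str]) -> Iterator[Record]:
    """Keep records whose sequence matches at least one of the regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    for record in records:
        if any(rx.search(record[1]) for rx in compiled):
            yield record


FASTQ_TEXT = """@read1
ACGTGGGGTTAA
+
IIIIIIIIIIII
@read2
TTTTCCCCAAAA
+
IIIIIIIIIIII
@read3
GATTACAGATTA
+
IIIIIIIIIIII
"""

PATTERNS = [r"G{4}", r"GATTACA"]
matched = [rec[0] for rec in filter_reads(parse_fastq(FASTQ_TEXT.splitlines()), PATTERNS)]
print(matched)  # ['@read1', '@read3']
```

Compiling the patterns once and streaming records keeps memory use flat, which matters for multi-gigabyte FASTQ files.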

grabix

A wee tool for random access into BGZF files.

easy_qsub

Easily submit PBS jobs from a script template. Multiple input files are supported.

csvtk

Another cross-platform, efficient, practical and pretty CSV/TSV toolkit.

CSVKit

Utilities for working with CSV/Tab-delimited files.

bioSyntax

Syntax highlighting for computational biology file formats (SAM, VCF, GTF, FASTA, PDB, etc.) in vim/less/gedit/sublime.

BioNode

Modular and universal bioinformatics. Bionode provides pipeable UNIX command-line tools and JavaScript APIs for bioinformatics analysis workflows.

Bioinformatics One Liners

Git repo of useful single line commands.

Genozip

Compressing

A compressor for common genomic file formats (BAM, CRAM, FASTQ, VCF, etc.).

SRA-Explorer

Downloading

Easily get SRA download links and other information.

GGD

Downloading

Go Get Data: a command-line interface for obtaining genomic data.

Biojava

Java framework for processing biological data.

Biocaml

Biocaml aims to be a high-performance user-friendly library for Bioinformatics.

(Poly)merase

A Go library and command line utility for engineering organisms.

SeqAn

The modern C++ library for sequence analysis.

Rust-Bio

Rust implementations of algorithms and data structures useful for bioinformatics.

Biopython

Freely available tools for biological computing in Python, with included cookbook, packaging and thorough documentation. Part of the [Open Bioinformatics Foundation](http://open-bio.org/). Contains the very useful [Entrez](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html) package for API access to the NCBI databases.

Bioperl

International association of users & developers of open source Perl tools for bioinformatics, genomics and life sciences.

phantasus

Gene expression

Phantasus is a web application for visual and interactive gene expression analysis. It is based on Morpheus, a web-based tool for heatmap visualisation and analysis, which was integrated with an R environment via the OpenCPU API. Aside from basic visualization and filtering methods, it supports R-based methods such as k-means clustering, principal component analysis, and differential expression analysis with the limma package.

MIT

Open Neuroscience Graph

Biosciences

The Open Neuroscience Graph (openneuroscience.org) is an open-access, curated knowledge graph that maps the open science ecosystem in neuroscience as a browsable digital garden. Built from an Obsidian vault and published as a static website using Quartz, the project replaces traditional linear presentation with a networked structure of interlinked Markdown notes. Bidirectional links, full-text search, and an integrated graph visualization allow users to navigate thematic relationships dynamically rather than sequentially. The complete source material is openly available to sustain, replicate, and extend the resource, including all Markdown content, media attachments, Quartz configuration files, and site customizations. Researchers, educators, and open-science practitioners may explore the site directly, download the vault for offline use in Obsidian, or fork the material to build new, derivative knowledge bases. PID=https://doi.org/10.5281/zenodo.20181900

JavaScript

CC-BY-4.0

DigestedProteinDB

Proteomics

DigestedProteinDB provides a scalable computational infrastructure for indexing and querying peptide cleavage data. Designed for seamless integration into high-throughput mass spectrometry pipelines, it enables low-latency searches and advanced filtering of digested protein datasets to accelerate experimental spectra cross-referencing.

plantiSMASH

Transcription factors and regulatory sites

PlantiSMASH is a specialized extension of antiSMASH for the identification and analysis of biosynthetic gene clusters (BGCs) in plant genomes. It supports advanced plant-specific detection rules and features for comparative genomics, visualization, and more.

AGPL-3.0

LifeSoaks

X-ray diffraction

LifeSoaks was designed to find solvent channels in macromolecular structures solved by X-ray crystallography. It predicts their accessibility by molecules through an automated annotation of so-called bottleneck radii. It simplifies the process of manually checking a crystal structure for solvent channels. Bottleneck radii can be calculated for solvent channels and small molecule binding sites. The tool is ideally suited for channel analyses before the actual soaking experiments to select the most promising experimental conditions and crystal forms. LifeSoaks runs fully automated and will finish within seconds to minutes for moderately sized crystals.

StructureProfiler

Structure analysis

Three-dimensional protein structures play a vital role in drug design. Structure-based design necessitates an in-depth examination of the available quality data before using the structure in computational experiments and for method evaluation. StructureProfiler assists in automatically profiling sets of protein-ligand complex structures based on multiple quality indicators, ranging from model characteristics, e.g., the R factor, and active site features, e.g., bond length deviations, to ligand properties such as electron density support and the validity of torsion angles.

EDIAscorer

Structure analysis

The electron density score for individual atoms (EDIA) quantifies the electron density fit of each atom in a crystallographically resolved structure. Multiple EDIA values can be combined using the power mean to compute the EDIAm, i.e., the electron density score for a group of several atoms. It enables users to score a set of atoms, such as a ligand, a residue, or an active site.
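The power-mean combination mentioned above can be sketched as follows. This is a generic power mean, not EDIAscorer's code: the listing does not state the exponent or any offsets used for EDIAm, so the exponent of -2 and the per-atom scores below are assumptions chosen for illustration.

```python
from typing import Sequence


def power_mean(values: Sequence[float], exponent: float) -> float:
    """Generic power mean: (mean of v**p) ** (1/p), for non-zero p.

    With a negative exponent all values must be positive, and low values
    dominate the result.
    """
    if not values:
        raise ValueError("at least one value is required")
    if exponent == 0:
        raise ValueError("exponent 0 (the geometric-mean limit) is not handled here")
    if exponent < 0 and any(v <= 0 for v in values):
        raise ValueError("negative exponents require strictly positive values")
    mean_of_powers = sum(v ** exponent for v in values) / len(values)
    return mean_of_powers ** (1.0 / exponent)


# Hypothetical per-atom scores for a three-atom group: two well-supported
# atoms and one poorly supported atom.
atom_scores = [1.0, 1.0, 0.2]

# Assumed exponent for illustration only.
group_score = power_mean(atom_scores, exponent=-2)
arithmetic = sum(atom_scores) / len(atom_scores)

print(f"power mean (p=-2): {group_score:.3f}")
print(f"arithmetic mean:   {arithmetic:.3f}")
```

With these numbers the power mean is about 0.333 while the arithmetic mean is about 0.733, which shows why a negative-exponent power mean is a natural choice when a single badly fitting atom should pull the group score down.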

PrimerPickr

PrimerPickr is an open-source tool for rational primer design, powered by aggregated public usage data for PCR primers.

CC-BY-NC-ND-4.0

Protoss

Protein interactions

Protoss is a fully automated hydrogen atom placement tool for protein-ligand complexes. It adds missing hydrogen atoms to protein structures and detects reasonable protonation states, tautomeric states, and hydrogen coordinates of both protein and ligand molecules by optimizing the hydrogen bond network.

WarPP

Protein properties

WarPP predicts the position and orientation of water molecules in small-molecule binding sites. It places and scores water molecules in binding sites of crystallographic structures based on EDIAscorer results and interaction geometries as known from experimentally solved protein structures. WarPP was validated on a high-quality set of 1,500 protein-ligand complexes, containing 20,000 crystallographically observed water molecules. It is sufficiently fast for high-throughput analyses. It correctly places water molecules in approx. 80% of the cases. Users can export the predictions as PDB files for, e.g., molecular docking with JAMDA.

GeoMine

Protein interactions

GeoMine enables the automated mining of protein-ligand binding sites. Based on individually designed queries, users can search for spatial interaction patterns in huge collections of protein-ligand complexes and binding pockets. The regularly updated GeoMine database relies on the free database systems SQLite and PostgreSQL. It supports radius-based pockets (based on ligands) and predicted pockets (based on DoGSite3) for query generation. Query management is based on XML for the REST service and on JSON in GUI mode. Its output consists of the query-based superpositions of the matched binding sites and statistics on matching points, distances, and angles.

SIENA

Protein binding sites

SIENA is a software pipeline enabling the fully automated construction of protein structure ensembles from the PDB. Starting with a single query structure, all binding sites with high sequence similarity are extracted from the PDB, aligned, and superimposed. SIENA also handles complicated cases, such as comparing binding sites at protein domain interfaces or within multimeric proteins.

MicroMiner

Protein structure analysis

MicroMiner assists in identifying single-residue substitutions in protein structure databases. It searches protein residue environments with local sequence and structural similarity based on the SIENA methodology. Users can search for structural mutations in the entire PDB, their in-house structure collection, or (subsets of) the AlphaFold Database. They can use the method to explore the mutation landscape of proteins with experimental or predicted structures. MicroMiner can be applied to single domains as well as to protein-protein or protein-ligand interfaces. Several filter options are available to simplify downstream analysis.