Find open-source science resources

GeneExpression

This package gives the implementations of the gene expression signature and its distance to each. Gene expression signature is represented as a list of genes whose expression is correlated with a biological state of interest. And its distance is defined using a nonparametric, rank-based pattern-matching strategy based on the Kolmogorov-Smirnov statistic. Gene expression signature and its distance can be used to detect similarities among the signatures of drugs, diseases, and biological states of interest.

15 years ago

GPL-2

G4SNVHunter

Epigenetics

G-quadruplexes (G4s) are unique nucleic acid secondary structures predominantly found in guanine-rich regions and have been shown to be involved in various biological regulatory processes. G4SNVHunter is an R package designed to rapidly identify genomic sequences with G4-forming propensity and to accurately screen user-provided single nucleotide variants—as well as other small-scale variants such as indels and MNVs—for their potential to destabilize these structures. This allows researchers to then screen these critical variants for deeper study, digging into how they might influence biological functions—think gene regulation, for instance—by impairing G4 formation propensity.

012 months ago

pmp

MassSpectrometry

Methods and tools for (pre-)processing of metabolomics datasets (i.e. peak matrices), including filtering, normalisation, missing value imputation, scaling, and signal drift and batch effect correction methods. Filtering methods are based on: the fraction of missing values (across samples or features); Relative Standard Deviation (RSD) calculated from the Quality Control (QC) samples; the blank samples. Normalisation methods include Probabilistic Quotient Normalisation (PQN) and normalisation to total signal intensity. A unified user interface for several commonly used missing value imputation algorithms is also provided. Supported methods are: k-nearest neighbours (knn), random forests (rf), Bayesian PCA missing value estimator (bpca), mean or median value of the given feature and a constant small value. The generalised logarithm (glog) transformation algorithm is available to stabilise the variance across low and high intensity mass spectral features. Finally, this package provides an implementation of the Quality Control-Robust Spline Correction (QCRSC) algorithm for signal drift and batch effect correction of mass spectrometry-based datasets.

GPL-3

spatialDE

SpatialDE is a method to find spatially variable genes (SVG) from spatial transcriptomics data. This package provides wrappers to use the Python SpatialDE library in R, using reticulate and basilisk.

31 year ago

multiHiCcompare

multiHiCcompare provides functions for joint normalization and difference detection in multiple Hi-C datasets. This extension of the original HiCcompare package now allows for Hi-C experiments with more than 2 groups and multiple samples per group. multiHiCcompare operates on processed Hi-C data in the form of sparse upper triangular matrices. It accepts four column (chromosome, region1, region2, IF) tab-separated text files storing chromatin interaction matrices. multiHiCcompare provides cyclic loess and fast loess (fastlo) methods adapted to jointly normalizing Hi-C data. Additionally, it provides a general linear model (GLM) framework adapting the edgeR package to detect differences in Hi-C data in a distance dependent manner.

104 years ago

jvecfor

SingleCell

Drop-in replacement for BiocNeighbors::findKNN using the jvecfor Java library, which builds on the jvector library to leverage the Java Vector API for portable SIMD acceleration across AVX2, AVX-512, and ARM NEON hardware. jvecfor/jvector implements HNSW-DiskANN approximate search and VP-tree exact search. The package achieves approximately 2x speedup over Annoy-based search at n >= 50K cells while returning output structurally identical to BiocNeighbors, making it suitable for seamless integration into existing Bioconductor single-cell workflows. Convenience wrappers delegate shared nearest-neighbor (SNN) and k-nearest-neighbor (KNN) graph construction to the bluster package.

33 weeks ago

GPL-3

CCPlotR

SingleCell

CCPlotR is an R package for visualising results from tools that predict cell-cell interactions from single-cell RNA-seq data. These plots are generic and can be used to visualise results from multiple tools such as Liana, CellPhoneDB, NATMI etc.

472 months ago

qsvaR

The qsvaR package contains functions for removing the effect of degration in rna-seq data from postmortem brain tissue. The package is equipped to help users generate principal components associated with degradation. The components can be used in differential expression analysis to remove the effects of degradation.

01 month ago

Artistic-2.0

immLynx

A comprehensive toolkit that bridges popular Python-based immune repertoire analysis tools and Hugging Face protein language models into the R environment. Provides unified interfaces for TCR distance calculations (tcrdist3), sequence generation probability (OLGA), selection inference (soNNia), clustering (clusTCR), protein embeddings (ESM-2), metaclone discovery (metaclonotypist). Fully compatible with the scRepertoire and immApex ecosystem for single-cell immune repertoire analysis.

21 week ago

VariantFiltering

Genetics

Filter genetic variants using different criteria such as inheritance model, amino acid change consequence, minor allele frequencies across human populations, splice site strength, conservation, etc.

47 months ago

Artistic-2.0

CytoML

ImmunoOncology

Uses platform-specific implemenations of the GatingML2.0 standard to exchange gated cytometry data with other software platforms.

353 months ago

AGPL-3.0-only

peco

Sequencing

Our approach provides a way to assign continuous cell cycle phase using scRNA-seq data, and consequently, allows to identify cyclic trend of gene expression levels along the cell cycle. This package provides method and training data, which includes scRNA-seq data collected from 6 individual cell lines of induced pluripotent stem cells (iPSCs), and also continuous cell cycle phase derived from FUCCI fluorescence imaging data.

GPL (>= 3)

ideal

ImmunoOncology

This package provides functions for an Interactive Differential Expression AnaLysis of RNA-sequencing datasets, to extract quickly and effectively information downstream the step of differential expression. A Shiny application encapsulates the whole package. Support for reproducibility of the whole analysis is provided by means of a template report which gets automatically compiled and can be stored/shared.

codelink

Microarray

This package facilitates reading, preprocessing and manipulating Codelink microarray data. The raw data must be exported as text file using the Codelink software.

GPL-2

INTACT

Bayesian

This package integrates colocalization probabilities from colocalization analysis with transcriptome-wide association study (TWAS) scan summary statistics to implicate genes that may be biologically relevant to a complex trait. The probabilistic framework implemented in this package constrains the TWAS scan z-score-based likelihood using a gene-level colocalization probability. Given gene set annotations, this package can estimate gene set enrichment using posterior probabilities from the TWAS-colocalization integration step.

GPL-3 + file LICENSE

phantasus

GeneExpression

Phantasus is a web-application for visual and interactive gene expression analysis. Phantasus is based on Morpheus – a web-based software for heatmap visualisation and analysis, which was integrated with an R environment via OpenCPU API. Aside from basic visualization and filtering methods, R-based methods such as k-means clustering, principal component analysis or differential expression analysis with limma package are supported.

Chronos (Amazon Science, NeurIPS 2024)

General Science Models

Pretrained time series foundation model for zero-shot forecasting across diverse scientific and real-world domains; tokenizes continuous time series into discrete bins to train transformer language models on large-scale corpora, achieving strong zero-shot generalization and competitive performance with task-specific supervised models on climate, energy, and health benchmarks (5.3K+ stars, Apache 2.0, 2024-2026)

Unified Code for Units of Measure

Unified Code for Units of Measure (UCUM) is a code system intended to include all units of measures being contemporarily used in international science, engineering, and business.

9810 months ago

HTML

NOASSERTION

Paper2All

Website & Interactive Content Generation

AI-powered pipeline converting papers into interactive websites, posters, and multimedia presentations with "Let's Make Your Paper Alive!" philosophy

3737 months ago

Casanovo

Genomics & Bioinformatics

Transformer encoder-decoder for de novo peptide sequencing from tandem mass spectrometry, translating MS/MS spectra directly to peptide sequences without reference databases, enabling identification of novel peptides for immunopeptidomics, antibody repertoires, and metaproteomes (Noble Lab UW, Nature Communications 2024)

1872 days ago

Apache-2.0

NanoParticle Ontology

An ontology that represents the basic knowledge of physical, chemical and functional characteristics of nanotechnology as used in cancer diagnosis and therapy.

110 years ago

Web Ontology Language

BSD-3-Clause

stk

General Chemistry

A library for building, manipulating, analyzing and automatic design of molecules, including a genetic algorithm.

2844 months ago

DISCO

Protein & Drug Discovery

General multimodal protein design framework enabling DNA-encoding of chemistry for programmable enzyme design and diverse protein generation through diffusion-based generative modeling (190+ stars, Apache 2.0, 2026)

1901 week ago

Apache-2.0

The Extensible Observation Ontology

The Extensible Observation Ontology (OBOE) is a formal ontology for capturing the semantics of scientific observation and measurement. The ontology supports researchers to add detailed semantic annotations to scientific data, thereby clarifying the inherent meaning of scientific observations.

336 years ago

XSLT

Vitro Application Ontology

Vitro is a full stack framework for building semantic web applications. It is not domain specific.

1151 week ago

Java

BSD-3-Clause

SPAdes

Assembly

SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines and the de-facto standard for prokaryotic genome assemblies.

9352 weeks ago

C++

NOASSERTION

Annotation Ontology

The Annotation Ontology specification is currently used as input for the activities of the http://www.w3.org/community/openannotation/'>W3C Open Annotation Community Group that works towards a common, RDF-based, specification for annotating digital resources. The Group effort starts by working towards a reconciliation of two proposals that have emerged over the past two years: the http://code.google.com/p/annotation-ontology/'>Annotation Ontology and the http://www.openannotation.org/spec/beta/'>Open Annotation Model. Initially, editors of these proposals will closely collaborate to devise a common draft specification that addresses requirements and use cases that were identified in the course of their respective efforts. The goal is to make this draft available for public feedback and experimentation in the second quarter of 2012. The final deliverable of the Open Annotation Community Group will be a specification, published under an appropriate open license, that is informed by the existing proposals, the common draft specification, and the community feedback. [from homepage]

011 years ago

Web Ontology Language

FitSNAP

Force Fields

A Package For Training SNAP Interatomic Potentials for use in the LAMMPS molecular dynamics package.

1867 months ago

GPL-2.0

PyLabRobot

Lab Automation & Robotics

Interactive and hardware-agnostic SDK for laboratory automation, enabling programmatic control of liquid handlers, plate readers, and other lab instruments across multiple vendors; foundational infrastructure for self-driving laboratories and AI-driven experimental execution (447+ stars)

4502 days ago

SciWrite

Scientific Writing & Collaboration

Agent skill for AI-assisted scientific manuscript writing review distilled from Stanford's *Writing in the Sciences* course, performing five sequential editorial audit passes on clarity, voice, structure, consistency, and integrity (2026)

6751 month ago

NOASSERTION

Scholarly Contributions and Roles Ontology

An ontology based on PRO for describing the contributions that may be made, and the roles that may be held by a person with respect to a journal article or other publication (e.g. the role of article guarantor or illustrator).

16 years ago

AfterQC

Sequence Processing

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data.

2146 years ago

Helical

Genomics & Bioinformatics

Unified framework for state-of-the-art pre-trained bio foundation models across genomics and transcriptomics, providing standardized interfaces and pipelines for DNA, RNA, and single-cell models including Evo 2, Geneformer, scGPT, and UCE with streamlined inference, benchmarking, and fine-tuning workflows (213+ stars, 2024-2025)

2153 weeks ago

AGPL-3.0

JThomas-CoE/coe-gemma4-biology-mmlu_pro-14b-a4b-q4

by JThomas-CoE

text-generation

Base model: google/gemma-4-26b-it Architecture: MoE — 26B total / ≈4B active parameters (1 shared expert + 8 routed from a pool of 128 per MoE layer, 30 MoE layers) Method: Activation-directed expert surgery — 128 → 64 experts per layer (50% reduction) Quantization: Q4KM (≈9.7 GB on disk) Tags:…

581 week ago

nvidia/geneformer_V2_104M_CLcancer

by nvidia

fill-mask

## Description: Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. This model version was continually pretrained on ~14 million cancer transcriptomes…

135 months ago

Verdugie/STEM-Oracle-27B

by Verdugie

text-generation

# or·a·cle /ˈôrəkəl/ — a source of wise counsel; one who provides authoritative knowledge. From Latin ōrāculum, meaning divine announcement. In computer science, an oracle is a black box that always returns the correct answer — you don't ask it how it knows, you ask and it answers.

1362 months ago

keejkrej/cellpose-cpsam-onnx

by keejkrej

image-segmentation

ONNX export of the Cellpose cpsam (Cellpose-SAM) model for cell segmentation in microscopy images.

03 months ago

darkknight25/deepseek-16b-medical-GPT

by darkknight25

text-generation

darkknight25/deepseek-16b-medical-GPT is a fine-tuned version of deepseek-ai/deepseek-l6b-moe-chat, optimized for medical question answering, reasoning, and clinical summarization using QLoRA and open-access healthcare datasets.

010 months ago

nvidia/AMPLIFY_350M

by nvidia

fill-mask

> [!NOTE] > This model has been optimized using NVIDIA's TransformerEngine > library. Slight numerical differences may be observed between the original model and the optimized > model. For instructions on how to install TransformerEngine, please refer to the > official documentation.

278 months ago

mradermacher/Dans-PersonalityEngine-V1.3.0-24b-i1-GGUF

by mradermacher

For a convenient overview and download list, visit our model page for this model.

38910 months ago

SaltySander/MOSAIC

by SaltySander

011 months ago

nvidia/geneformer_V2_104M

by nvidia

fill-mask

245 months ago

Awesome LLM Scientific Discovery

📋 Paper Collections & Repositories

LLM papers for scientific discovery

3456 months ago

ChemFormula

General Chemistry

ChemFormula provides a class for working with chemical formulas. It allows parsing chemical formulas, calculating formula weights, and generating formatted output strings (e.g. in HTML, LaTeX, or Unicode).

336 months ago

Equiformer

Machine Learning for Physics

Equivariant graph attention Transformer (ICLR2023)

2821 year ago

DeepAnalyze

Data Analysis & Visualization

First agentic LLM for autonomous data science with end-to-end pipeline from data to analyst-grade reports

4.2K1 month ago

skm

NFDI MatWerk Ontology

NFDI-MatWerk aims to establish a digital infrastructure for Materials Science and Engineering (MSE), fostering improved data sharing and collaboration. This repository provides comprehensive documentation for NFDI MatWerk Ontology (MWO) v3.0.0, a foundational framework designed to structure research data and enhance interoperability within the MSE community. To ensure compliance with top-level ontology standards, MWO v3.0.0 is aligned with the Basic Formal Ontology (BFO) and incorporates the modular approach of the NFDIcore mid-level ontology, enriching metadata through standardized classes and properties. The mwo addresses key aspects of MSE research data, including the NFDI-MatWerk community structure, covering task areas, infrastructure use cases, projects, researchers, and organizations. It also describes essential NFDI resources, such as software, workflows, ontologies, publications, datasets, metadata schemas, instruments, facilities, and educational materials. Additionally, mwo represents NFDI-MatWerk services, academic events, courses, and international collaborations. As the foundation for the MSE Knowledge Graph, mwo facilitates efficient data integration and retrieval, promoting collaboration and knowledge representation across MSE domains. This digital transformation enhances data discoverability, reusability, and accelerates scientific exchange, innovation, and discoveries by optimizing research data management and accessibility. (from repository)

12 weeks ago

Makefile

CC0-1.0

Galaxy Training Network

Identifiers in the GTN correspond to training materials in various formats (markdown, slides, video). The users can apply learned concepts directly within the framework via galaxy workflows.

3651 day ago

HTML

CC-BY-4.0

Keylab/COMO

by Keylab