Bioconductor objects, file-shaped.
Aaron Lun's alabaster framework replaces RDS serialization for Bioconductor S4 objects with language-agnostic JSON + HDF5 artifacts — validated against the takane specification, modular on disk, readable from R and Python.
What it solves
RDS files are tightly coupled to the R class hierarchy that produced them. A schema bump in SummarizedExperiment can invalidate every file you've written, force expensive updateObject() calls, and lock the data inside R.
Alabaster splits each object into a directory of standard files — JSON for metadata, HDF5 for arrays — versioned against the takane spec. Result:
- Stable: schema-versioned. The reader handles old artifacts without an updateObject() dance.
- Modular: load only the parts you need. Update one component without rewriting the rest.
- Interoperable: Python reads the same artifacts via dolomite-base. The data outlives the R session.
- Validated: validateObject() runs the takane spec on every save — broken artifacts can't escape into production.
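The core trio from alabaster.base can be exercised on something as small as a DataFrame. A minimal sketch — the temporary path and toy columns are illustrative, not part of any real pipeline:

```r
library(alabaster.base)
library(S4Vectors)

# A toy object; any supported Bioconductor type works the same way.
df <- DataFrame(x = 1:3, y = c("a", "b", "c"))

# saveObject() writes a directory of artifacts, not a single binary blob.
dir <- file.path(tempdir(), "df-artifact")
saveObject(df, dir)

# validateObject() re-runs the takane checks against what's on disk.
validateObject(dir)

# readObject() restores the object from the artifacts.
df2 <- readObject(dir)
stopifnot(identical(df$x, df2$x))
```

The same three calls apply unchanged to a full SingleCellExperiment; dispatch to the right reader is driven by the metadata on disk, not by the calling code.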
The sub-packages we mirror
One alabaster.* package per Bioconductor class family. Install alabaster to pull them all, or pick the ones your pipeline touches.
- alabaster.base: saveObject() / readObject() / validateObject(); DataFrame and base R types.
- alabaster.schemas: JSON schemas (takane) the validator runs each artifact against.
- alabaster.matrix: sparse + dense matrices, HDF5-backed; DelayedArray-friendly.
- alabaster.ranges: IRanges, GenomicRanges, and friends.
- alabaster.se: SummarizedExperiment — the omics workhorse.
- alabaster.sce: SingleCellExperiment — single-cell layers + reduced dims.
- alabaster.mae: MultiAssayExperiment — multi-omics joined on sample.
- alabaster.spatial: SpatialExperiment — spatial coordinates + image data.
- alabaster.sfe: SpatialFeatureExperiment — spatial + sf geometries.
- alabaster.string: Biostrings XStringSets.
- alabaster.vcf: VariantAnnotation VCF objects.
- alabaster.bumpy: BumpyMatrix — per-sample sparse matrices.
- alabaster.files: external file references inside artifacts.
- alabaster: umbrella — depends on every alabaster.* so dynamic dispatch never misses.
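Installation follows the usual Bioconductor route. A quick sketch via BiocManager — the specific sub-package selection below is just an example:

```r
# install.packages("BiocManager")  # if not already present

# Everything at once, via the umbrella package...
BiocManager::install("alabaster")

# ...or only the families your pipeline touches.
BiocManager::install(c("alabaster.base", "alabaster.sce"))
```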
Example — a single-cell RNA-seq study
A typical biostatistics workflow ends with a SingleCellExperiment holding 1M cells × 30K genes, cell-level metadata (donor, treatment, QC flags), gene metadata, and reduced dimensions (PCA, UMAP). Saving this as RDS produces a single fragile binary; saving it with alabaster produces a directory of validated artifacts.
library(SingleCellExperiment)
library(alabaster.sce)

# 1. Build the experiment as usual.
sce <- SingleCellExperiment(
    assays = list(counts = counts_matrix, logcounts = log_matrix),
    colData = DataFrame(donor = donor, treatment = treatment, qc_pass = qc),
    rowData = DataFrame(symbol = symbols, ensembl = ensembl_ids),
    reducedDims = list(PCA = pca_mat, UMAP = umap_mat)
)

# 2. Save as a directory of artifacts (JSON metadata + HDF5 arrays).
saveObject(sce, "/data/cohortA/sce-v1")

# 3. Reload anywhere — same R session, a new one, or another machine.
sce2 <- readObject("/data/cohortA/sce-v1")

# 4. Re-save just the UMAP after re-fitting — the counts matrix is untouched.
reducedDim(sce2, "UMAP") <- new_umap
saveObject(sce2, "/data/cohortA/sce-v2")
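Because every component is itself a valid artifact, sub-objects can be loaded without touching the rest of the tree. A self-contained sketch on a small SummarizedExperiment — the "column_data" subdirectory name follows the takane layout and is an assumption here:

```r
library(alabaster.se)
library(SummarizedExperiment)

# A tiny experiment: 4 genes x 5 samples.
se <- SummarizedExperiment(
    assays = list(counts = matrix(rpois(20, 5), nrow = 4)),
    colData = DataFrame(qc_pass = c(TRUE, TRUE, FALSE, TRUE, TRUE))
)
dir <- file.path(tempdir(), "se-demo")
saveObject(se, dir)

# The colData is a complete artifact in its own subdirectory (assumed
# "column_data" per the takane layout); read it without the assays.
cd <- readObject(file.path(dir, "column_data"))
sum(cd$qc_pass)  # 4
```

This is what "load only the parts you need" means in practice: sample-level filtering decisions can be made before any count matrix leaves disk.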
What you get on disk: a tree of small JSON files describing every component (assays, colData, rowData, reducedDims, metadata), with the heavy numeric arrays in HDF5. Cross-language: a Python pipeline can pick up the same directory via dolomite-base and use it with anndata or scanpy without round-tripping through R.
Why biostats teams adopt this: cohort artifacts live in object storage (S3, GCS), are queryable by metadata before any R is loaded, and survive every Bioconductor release because the format is versioned independently of the R class definitions.
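The "queryable by metadata" part can be as simple as parsing the artifact's JSON descriptor with any JSON tool. A sketch assuming the takane convention of an OBJECT file at the directory root carrying at least a "type" field:

```r
library(alabaster.base)
library(S4Vectors)
library(jsonlite)

# Save a small artifact so there is something to inspect.
dir <- file.path(tempdir(), "meta-demo")
saveObject(DataFrame(x = 1:3), dir)

# Assumed takane convention: a JSON file named "OBJECT" at the root,
# readable by jq, Python, or an S3 lambda without starting R.
meta <- fromJSON(file.path(dir, "OBJECT"))
meta$type  # the takane type string, e.g. "data_frame"
```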
Where it lives in our mirror
Every alabaster.* tarball — release 3.22 and devel 3.23 — is in the bucket alongside everything else from the Bioconductor software channel:
s3://ndexr/bioc/3.22/bioc/src/contrib/alabaster_1.10.0.tar.gz
s3://ndexr/bioc/3.23/bioc/src/contrib/alabaster_1.11.0.tar.gz
... and one tarball per alabaster.* sub-package per release.
Browse live state at repo.ndexr.io (filter by alabaster in the package column). The bucket isn't yet served as an install.packages(repos = ...) endpoint — the tarballs are mirrored, but PACKAGES indexes and a public HTTP frontend are still to do.