alabaster.ndexr.io

Bioconductor objects, file-shaped.

Aaron Lun's alabaster framework replaces RDS serialization for Bioconductor S4 objects with language-agnostic JSON + HDF5 artifacts — validated against the takane specification, modular on disk, readable from R and Python.

What it solves

RDS files are tightly coupled to the R class hierarchy that produced them. A schema bump in SummarizedExperiment can invalidate every file you've written, force expensive updateObject() calls, and lock the data inside R.

Alabaster splits each object into a directory of standard files — JSON for metadata, HDF5 for arrays — versioned against the takane spec. Result:

  • Stable: schema-versioned. The reader handles old artifacts without an updateObject() dance.
  • Modular: load only the parts you need. Update one component without rewriting the rest.
  • Interoperable: Python reads the same artifacts via dolomite-base. The data outlives the R session.
  • Validated: validateObject() runs the takane spec on every save — broken artifacts can't escape into production.
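
As a rough illustration of the save / validate / read cycle described above, here is a minimal sketch using alabaster.base on a small DataFrame. The column values and the temporary path are made up for the example; only saveObject(), validateObject(), and readObject() come from the text above.

```r
library(S4Vectors)
library(alabaster.base)

# Toy sample sheet standing in for real cell-level metadata (hypothetical values).
df <- DataFrame(
  donor     = c("D1", "D1", "D2"),
  treatment = c("ctrl", "drug", "ctrl"),
  qc_pass   = c(TRUE, TRUE, FALSE)
)

# tempfile() returns a fresh path inside the session's temp directory;
# saveObject() creates the artifact directory at that path.
dir <- tempfile("alabaster-demo-")
saveObject(df, dir)

# Run the validator explicitly; it raises an error if the artifact is malformed.
validateObject(dir)

# Read the artifact back and look at what came out.
df2 <- readObject(dir)
print(df2)
```

The same three calls apply to the larger classes listed below, provided the matching alabaster.* package is loaded so the right methods are registered.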

The sub-packages we mirror

One alabaster.* package per Bioconductor class family. Install alabaster to pull them all, or pick the ones your pipeline touches.

alabaster.base

saveObject() / readObject() / validateObject(); DataFrame and base R types.

alabaster.schemas

JSON schemas (takane) the validator runs each artifact against.

alabaster.matrix

Sparse + dense matrices, HDF5-backed; DelayedArray-friendly.

alabaster.ranges

IRanges, GenomicRanges, and friends.

alabaster.se

SummarizedExperiment — the omics workhorse.

alabaster.sce

SingleCellExperiment — single-cell layers + reduced dims.

alabaster.mae

MultiAssayExperiment — multi-omics joined on sample.

alabaster.spatial

SpatialExperiment — spatial coordinates + image data.

alabaster.sfe

SpatialFeatureExperiment — spatial + sf geometries.

alabaster.string

Biostrings XStringSets.

alabaster.vcf

VariantAnnotation VCF objects.

alabaster.bumpy

BumpyMatrix: matrices whose entries are variable-length vectors or data frames.

alabaster.files

External file references inside artifacts.

alabaster

Umbrella — depends on every alabaster.* so dynamic dispatch never misses.

Example — a single-cell RNA-seq study

A typical biostatistics workflow ends with a SingleCellExperiment holding 1M cells × 30K genes, cell-level metadata (donor, treatment, QC flags), gene metadata, and reduced dimensions (PCA, UMAP). Saving this as RDS produces a single fragile binary; saving it with alabaster produces a directory of validated artifacts.

library(SingleCellExperiment)
library(alabaster.sce)

# 1. Build the experiment as usual. The inputs (counts_matrix, log_matrix, donor,
#    treatment, qc, symbols, ensembl_ids, pca_mat, umap_mat) are assumed to exist.
sce <- SingleCellExperiment(
  assays      = list(counts = counts_matrix, logcounts = log_matrix),
  colData     = DataFrame(donor = donor, treatment = treatment, qc_pass = qc),
  rowData     = DataFrame(symbol = symbols, ensembl = ensembl_ids),
  reducedDims = list(PCA = pca_mat, UMAP = umap_mat)
)

# 2. Save as a directory of artifacts (JSON metadata + HDF5 arrays).
saveObject(sce, '/data/cohortA/sce-v1')

# 3. Reload anywhere: same R session, a new one, or another machine.
sce2 <- readObject('/data/cohortA/sce-v1')

# 4. Swap in a re-fitted UMAP (new_umap, assumed to exist) and save the result
#    as a new version; the sce-v1 directory is left exactly as written.
reducedDim(sce2, 'UMAP') <- new_umap
saveObject(sce2, '/data/cohortA/sce-v2')

What you get on disk: a tree of small JSON files describing every component (assays, colData, rowData, reducedDims, metadata), with the heavy numeric arrays in HDF5. Cross-language: a Python pipeline can pick up the same directory via dolomite-base and use it with anndata or scanpy without round-tripping through R.
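
To make the on-disk layout concrete, the sketch below saves a tiny SummarizedExperiment (a stand-in for the full SingleCellExperiment), lists the files, and reads one component on its own. It assumes the current directory-based layout, in which each object directory carries an OBJECT file in JSON and the column metadata sits in a column_data subdirectory; check the list.files() output against your installed version. All values are invented for the example.

```r
library(SummarizedExperiment)
library(alabaster.base)
library(alabaster.se)
library(jsonlite)

# Tiny stand-in experiment: 3 genes x 2 samples (hypothetical values).
se <- SummarizedExperiment(
  assays  = list(counts = matrix(1:6, nrow = 3)),
  colData = DataFrame(donor = c("D1", "D2"))
)

dir <- tempfile("layout-demo-")
saveObject(se, dir)

# Every file written for this object, relative to its directory.
files <- list.files(dir, recursive = TRUE)
print(files)

# The top-level OBJECT file is plain JSON that names the object type.
obj <- fromJSON(file.path(dir, "OBJECT"))
print(obj$type)

# Components are objects in their own right, so one can be read without
# touching the assays. The "column_data" name is an assumption about the layout.
cd <- readObject(file.path(dir, "column_data"))
print(cd)
```

Reading a subdirectory this way is what makes the format modular: a job that only needs the sample sheet never has to open the HDF5 count data.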

Why biostats teams adopt this: cohort artifacts can live in object storage (S3, GCS), their JSON descriptors can be inspected before any R is loaded, and the files stay readable across Bioconductor releases because the format is versioned independently of the R class definitions.

Where it lives in our mirror

Every alabaster.* tarball — release 3.22 and devel 3.23 — is in the bucket alongside everything else from the Bioconductor software channel:

s3://ndexr/bioc/3.22/bioc/src/contrib/alabaster_1.10.0.tar.gz
s3://ndexr/bioc/3.23/bioc/src/contrib/alabaster_1.11.0.tar.gz
... and one tarball per alabaster.* sub-package per release.

Browse live state at repo.ndexr.io (filter by alabaster in the package column). The bucket isn't yet served as an install.packages(repos = ...) endpoint — the tarballs are mirrored but PACKAGES indexes and a public HTTP frontend are still to do.
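
Until a repository endpoint exists, one workable route is to fetch a tarball directly and install it from the local file. The sketch below only builds the object key from the path pattern shown above; mirror_key() is a hypothetical helper written for this example, not part of any package, and it does not check that the object exists or that you have access to it.

```r
# Hypothetical helper: builds the S3 key for a source tarball, following the
# s3://ndexr/bioc/<release>/bioc/src/contrib/<pkg>_<version>.tar.gz pattern.
mirror_key <- function(pkg, version, bioc_release, bucket = "ndexr") {
  sprintf("s3://%s/bioc/%s/bioc/src/contrib/%s_%s.tar.gz",
          bucket, bioc_release, pkg, version)
}

key <- mirror_key("alabaster", "1.10.0", "3.22")
print(key)

# After downloading the tarball with an S3 client of your choice, a local
# source install would look like this (dependencies are not resolved for you):
# install.packages("alabaster_1.10.0.tar.gz", repos = NULL, type = "source")
```

Because repos = NULL skips dependency resolution, the alabaster.* sub-packages and their Bioconductor dependencies need to be installed first, for example from the regular Bioconductor repositories.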