Bioconductor objects, file-shaped.
Aaron Lun's alabaster framework replaces RDS serialization for Bioconductor S4 objects with language-agnostic JSON + HDF5 artifacts — validated against the takane specification, modular on disk, readable from R and Python.
What it solves
RDS files are tightly coupled to the R class hierarchy that produced them. A schema bump in SummarizedExperiment can invalidate every file you've written, force expensive updateObject() calls, and lock the data inside R.
Alabaster splits each object into a directory of standard files — JSON for metadata, HDF5 for arrays — versioned against the takane spec. Result:
- Stable: schema-versioned. The reader handles old artifacts without an updateObject() dance.
- Modular: load only the parts you need. Update one component without rewriting the rest.
- Interoperable: Python reads the same artifacts via dolomite-base. The data outlives the R session.
- Validated: validateObject() runs the takane spec on every save — broken artifacts can't escape into production.
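The core trio from alabaster.base can be exercised on something as small as a DataFrame. A minimal sketch — the temporary path and toy columns are illustrative, not part of any real pipeline:

```r
library(alabaster.base)
library(S4Vectors)

# A toy object; any supported Bioconductor type works the same way.
df <- DataFrame(x = 1:3, y = c("a", "b", "c"))

# saveObject() writes a directory of artifacts, not a single binary blob.
dir <- file.path(tempdir(), "df-artifact")
saveObject(df, dir)

# validateObject() re-runs the takane checks against what's on disk.
validateObject(dir)

# readObject() restores the object from the artifacts.
df2 <- readObject(dir)
stopifnot(identical(df$x, df2$x))
```

The same three calls apply unchanged to a full SingleCellExperiment; dispatch to the right reader is driven by the metadata on disk, not by the calling code.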
The sub-packages we mirror
One alabaster.* package per Bioconductor class family. Install alabaster to pull them all, or pick the ones your pipeline touches.
- alabaster.base: saveObject() / readObject() / validateObject(); DataFrame and base R types.
- alabaster.schemas: JSON schemas (takane) the validator runs each artifact against.
- alabaster.matrix: sparse + dense matrices, HDF5-backed; DelayedArray-friendly.
- alabaster.ranges: IRanges, GenomicRanges, and friends.
- alabaster.se: SummarizedExperiment — the omics workhorse.
- alabaster.sce: SingleCellExperiment — single-cell layers + reduced dims.
- alabaster.mae: MultiAssayExperiment — multi-omics joined on sample.
- alabaster.spatial: SpatialExperiment — spatial coordinates + image data.
- alabaster.sfe: SpatialFeatureExperiment — spatial + sf geometries.
- alabaster.string: Biostrings XStringSets.
- alabaster.vcf: VariantAnnotation VCF objects.
- alabaster.bumpy: BumpyMatrix — per-sample sparse matrices.
- alabaster.files: external file references inside artifacts.
- alabaster: umbrella — depends on every alabaster.* so dynamic dispatch never misses.
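Installation follows the usual Bioconductor route. A quick sketch via BiocManager — the specific sub-package selection below is just an example:

```r
# install.packages("BiocManager")  # if not already present

# Everything at once, via the umbrella package...
BiocManager::install("alabaster")

# ...or only the families your pipeline touches.
BiocManager::install(c("alabaster.base", "alabaster.sce"))
```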
Example — a single-cell RNA-seq study
A typical biostatistics workflow ends with a SingleCellExperiment holding 1M cells × 30K genes, cell-level metadata (donor, treatment, QC flags), gene metadata, and reduced dimensions (PCA, UMAP). Saving this as RDS produces a single fragile binary; saving it with alabaster produces a directory of validated artifacts.
library(SingleCellExperiment)
library(alabaster.sce)

# 1. Build the experiment as usual.
sce <- SingleCellExperiment(
    assays = list(counts = counts_matrix, logcounts = log_matrix),
    colData = DataFrame(donor = donor, treatment = treatment, qc_pass = qc),
    rowData = DataFrame(symbol = symbols, ensembl = ensembl_ids),
    reducedDims = list(PCA = pca_mat, UMAP = umap_mat)
)

# 2. Save as a directory of artifacts (JSON metadata + HDF5 arrays).
saveObject(sce, "/data/cohortA/sce-v1")

# 3. Reload anywhere — same R session, a new one, or another machine.
sce2 <- readObject("/data/cohortA/sce-v1")

# 4. Re-save just the UMAP after re-fitting — the counts matrix is untouched.
reducedDim(sce2, "UMAP") <- new_umap
saveObject(sce2, "/data/cohortA/sce-v2")
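Because every component is itself a valid artifact, sub-objects can be loaded without touching the rest of the tree. A self-contained sketch on a small SummarizedExperiment — the "column_data" subdirectory name follows the takane layout and is an assumption here:

```r
library(alabaster.se)
library(SummarizedExperiment)

# A tiny experiment: 4 genes x 5 samples.
se <- SummarizedExperiment(
    assays = list(counts = matrix(rpois(20, 5), nrow = 4)),
    colData = DataFrame(qc_pass = c(TRUE, TRUE, FALSE, TRUE, TRUE))
)
dir <- file.path(tempdir(), "se-demo")
saveObject(se, dir)

# The colData is a complete artifact in its own subdirectory (assumed
# "column_data" per the takane layout); read it without the assays.
cd <- readObject(file.path(dir, "column_data"))
sum(cd$qc_pass)  # 4
```

This is what "load only the parts you need" means in practice: sample-level filtering decisions can be made before any count matrix leaves disk.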
What you get on disk: a tree of small JSON files describing every component (assays, colData, rowData, reducedDims, metadata), with the heavy numeric arrays in HDF5. Cross-language: a Python pipeline can pick up the same directory via dolomite-base and use it with anndata or scanpy without round-tripping through R.
Why biostats teams adopt this: cohort artifacts live in object storage (S3, GCS), are queryable by metadata before any R is loaded, and survive every Bioconductor release because the format is versioned independently of the R class definitions.
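The "queryable by metadata" part can be as simple as parsing the artifact's JSON descriptor with any JSON tool. A sketch assuming the takane convention of an OBJECT file at the directory root carrying at least a "type" field:

```r
library(alabaster.base)
library(S4Vectors)
library(jsonlite)

# Save a small artifact so there is something to inspect.
dir <- file.path(tempdir(), "meta-demo")
saveObject(DataFrame(x = 1:3), dir)

# Assumed takane convention: a JSON file named "OBJECT" at the root,
# readable by jq, Python, or an S3 lambda without starting R.
meta <- fromJSON(file.path(dir, "OBJECT"))
meta$type  # the takane type string, e.g. "data_frame"
```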
Where it lives in our mirror
Every alabaster.* tarball — release 3.22 and devel 3.23 — is in the bucket alongside everything else from the Bioconductor software channel:
s3://ndexr/bioc/3.22/bioc/src/contrib/alabaster_1.10.0.tar.gz
s3://ndexr/bioc/3.23/bioc/src/contrib/alabaster_1.11.0.tar.gz
... and one tarball per alabaster.* sub-package per release.
Browse live state at repo.ndexr.io (filter by alabaster in the package column). The bucket isn't yet served as an install.packages(repos = ...) endpoint — the tarballs are mirrored, but PACKAGES indexes and a public HTTP frontend are still to do.