Axes in MuData
This notebook introduces the axes interface, which extends MuData beyond multimodal data storage.
Briefly, in the default multimodal storage the modalities (AnnData objects) have observations as a shared axis (axis=0), and the variables are effectively concatenated.
We can imagine a symmetrical storage model where the variables are shared and the observations are concatenated. This is possible with axis=1 provided at MuData creation time.
Beyond that, in some cases we might want to relax the constraints further and assume that both observations and variables are in fact shared. This makes it possible, for instance, to store subsets of features in the same object. As both axes are shared, a convention is used here: axis=-1.
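The three conventions above can be summarised as simple bookkeeping on names: names along a shared axis are merged (union), while the other axis is concatenated. The following is a minimal sketch of that rule using pandas only. It is not MuData's implementation, and the helper `expected_shape` and the toy names are hypothetical, introduced here purely for illustration.

```python
from functools import reduce

import pandas as pd


def expected_shape(names, axis=0):
    """Hypothetical helper: predict (n_obs, n_vars) from per-modality names.

    `names` maps a modality key to a pair (obs_names, var_names).
    """
    if axis not in (0, 1, -1):
        raise ValueError("axis must be 0, 1 or -1")
    obs = [pd.Index(o) for o, _ in names.values()]
    var = [pd.Index(v) for _, v in names.values()]

    def n_union(indexes):
        # Shared axis: identical names collapse into one entry
        return len(reduce(lambda a, b: a.union(b), indexes))

    def n_concat(indexes):
        # Non-shared axis: entries are stacked, whatever their names
        return sum(len(i) for i in indexes)

    n_obs = n_union(obs) if axis in (0, -1) else n_concat(obs)
    n_vars = n_union(var) if axis in (1, -1) else n_concat(var)
    return n_obs, n_vars


# Toy example: same four cells, the second modality holds a subset of features
cells = [f"cell{i}" for i in range(4)]
toy = {
    "a": (cells, ["g1", "g2", "g3"]),
    "b": (cells, ["g1", "g2"]),
}
for axis in (0, 1, -1):
    print(axis, expected_shape(toy, axis))
```

Only axis=-1 gives the shape of the largest modality in this toy case, which is the situation the last section of this notebook deals with.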
Imports
First, install and import mudata and other libraries.
[1]:
! pip install mudata
[2]:
import mudata as md
from mudata import MuData, AnnData
[3]:
import numpy as np
import pandas as pd
np.random.seed(1)
Multimodal: axis=0
As expected, this is the default behaviour.
To illustrate it, let’s prepare some modalities first:
[4]:
n, d1, d2 = 100, 1000, 1500
ax = AnnData(np.random.normal(size=(n,d1)))
ay = AnnData(np.random.normal(size=(n,d2)))
[5]:
# same as:
# mdata = MuData({"x": ax, "y": ay})
mdata = MuData({"x": ax, "y": ay}, axis=0)
mdata
/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mudata/_core/mudata.py:445: UserWarning: Cannot join columns with the same name because var_names are intersecting.
warnings.warn(
[5]:
MuData object with n_obs × n_vars = 100 × 2500
2 modalities
x: 100 x 1000
y: 100 x 1500
As axis=0 corresponds to shared observations, the features should be specific to their modalities. Here, however, the variable names are not unique across modalities, which is what the warning is about:
[6]:
print("ax.var_names: [", ", ".join(ax.var_names.values[:5]) + ", ..., ", ax.var_names.values[d1-1], "]")
print("ay.var_names: [", ", ".join(ay.var_names.values[:5]) + ", ..., ", ay.var_names.values[d2-1], "]")
ax.var_names: [ 0, 1, 2, 3, 4, ..., 999 ]
ay.var_names: [ 0, 1, 2, 3, 4, ..., 1499 ]
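The overlap behind the warning can be checked directly on the name indexes. This is a minimal sketch using pandas only, assuming the default names are the stringified positions 0, 1, 2, ... as printed above; `x_names` and `y_names` are illustrative stand-ins for ax.var_names and ay.var_names.

```python
import pandas as pd

d1, d2 = 1000, 1500

# Stand-ins for the default var_names of the two modalities
x_names = pd.Index([str(i) for i in range(d1)])
y_names = pd.Index([str(i) for i in range(d2)])

# Names that occur in both modalities
shared = x_names.intersection(y_names)
print(f"{len(shared)} of {len(x_names)} names in x also occur in y")
```

Every name of the smaller modality also occurs in the larger one, so a variable cannot be identified by its name alone.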
In real-world workflows we expect to be able to identify features by their (unique) names:
[7]:
ax.var_names = [f"x_var{i+1}" for i in range(d1)]
ay.var_names = [f"y_var{i+1}" for i in range(d2)]
[8]:
mdata = MuData({"x": ax, "y": ay}, axis=0)
mdata
[8]:
MuData object with n_obs × n_vars = 100 × 2500
2 modalities
x: 100 x 1000
y: 100 x 1500
Multidataset: axis=1
Now, AnnData objects can represent e.g. multiple scRNA-seq datasets. When analysing them together, it is convenient to store them in one object. This object can then incorporate annotations such as a joint embedding of the datasets.
[9]:
n1, n2, d = 100, 500, 1000
ad1 = AnnData(np.random.normal(size=(n1,d)))
ad2 = AnnData(np.random.normal(size=(n2,d)))
[10]:
# Cell barcodes are dataset-specific
ad1.obs_names = [f"dat1-cell{i+1}" for i in range(n1)]
ad2.obs_names = [f"dat2-cell{i+1}" for i in range(n2)]
What would happen if we create a MuData without specifying the axis?
mdata = MuData({"dat1": ad1, "dat2": ad2})
mdata
Answer
By default, variables are dataset/modality-specific, so the number of features in MuData will be d + d = 2000. Cells are considered shared, but here obs_names are unique for each dataset, so the number of cells will be n1 + n2 = 600.
UserWarning: Cannot join columns with the same name because var_names are intersecting.
MuData object with n_obs × n_vars = 600 × 2000
2 modalities
dat1: 100 x 1000
dat2: 500 x 1000
Now, if we specify that the variables are the shared axis:
[11]:
mdata = MuData({"dat1": ad1, "dat2": ad2}, axis=1)
mdata
[11]:
MuData object with n_obs × n_vars = 600 × 1000
2 modalities
dat1: 100 x 1000
dat2: 500 x 1000
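With variables shared, a joint embedding has one row per cell across both datasets. The sketch below uses NumPy only and fresh random matrices of the same sizes as above; PCA via SVD is merely a stand-in for whichever joint embedding method is actually used, and the names `X1`, `X2` and `embedding` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, d = 100, 500, 1000

# Stand-ins for the two datasets' matrices (same d features in both)
X1 = rng.normal(size=(n1, d))
X2 = rng.normal(size=(n2, d))

# Shared variables allow stacking along the observations axis
X = np.vstack([X1, X2])
X = X - X.mean(axis=0)  # centre each feature

# First two principal components of the stacked data
U, S, Vt = np.linalg.svd(X, full_matrices=False)
embedding = U[:, :2] * S[:2]
print(embedding.shape)
```

The resulting array has n1 + n2 rows, matching the number of observations of the combined object, which is the shape a joint per-cell annotation needs.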
Different views on one modality: axis=-1
In some workflows, such as those with scVI, AnnData objects typically contain only a selection of features (e.g. genes). Raw counts for all genes are still worth keeping for other analyses.
MuData handles this scenario using the axis=-1 convention.
[12]:
n, d_raw, d_preproc = 100, 900, 300
a_raw = AnnData(np.random.normal(size=(n,d_raw)))
a_preproc = a_raw[:,np.sort(np.random.choice(np.arange(d_raw), d_preproc, replace=False))].copy()
What would happen if we create a MuData with axis=0?
mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=0)
mdata
Answer
With axis=0, cells are (fully) shared (100) and variables are concatenated (1200). As the names for the latter intersect between the AnnData objects, a warning will be displayed.
UserWarning: Cannot join columns with the same name because var_names are intersecting.
MuData object with n_obs × n_vars = 100 × 1200
2 modalities
raw: 100 x 900
preproc: 100 x 300
What would happen if we create a MuData with axis=1?
mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=1)
mdata
Answer
With axis=1, variables are shared (900), while the cells are dataset-specific (200). As the names for the latter are actually the same in both AnnData objects, a warning will be displayed.
UserWarning: Cannot join columns with the same name because obs_names are intersecting.
MuData object with n_obs × n_vars = 200 × 900
2 modalities
raw: 100 x 900
preproc: 100 x 300
What we want is a MuData object of dimensions (100, 900): the cells are the same in both AnnData objects, and the features of one are a subset of the features of the other.
That is what we get when we specify that both axes are shared:
[13]:
mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=-1)
mdata
[13]:
MuData object with n_obs × n_vars = 100 × 900
2 modalities
raw: 100 x 900
preproc: 100 x 300
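The 900 follows from the preprocessed features being a subset of the raw ones: merging the two sets of names adds nothing new. This is a minimal sketch with NumPy and pandas only, assuming default string names for the raw variables; `raw_vars`, `preproc_vars` and `in_preproc` are illustrative names, and the random subset differs from the one drawn above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
d_raw, d_preproc = 900, 300

# Stand-in for a_raw.var_names
raw_vars = pd.Index([str(i) for i in range(d_raw)])

# Pick a random subset of features, as done for a_preproc
keep = np.sort(rng.choice(d_raw, d_preproc, replace=False))
preproc_vars = raw_vars[keep]

# Shared variables: the union contains no names beyond the raw ones
union = raw_vars.union(preproc_vars)

# Boolean mask marking which raw features were kept after preprocessing
in_preproc = raw_vars.isin(preproc_vars)
print(len(union), int(in_preproc.sum()))
```

A mask like `in_preproc` is one way to relate the two views: it tells, for each variable of the combined object, whether the preprocessed modality contains it.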