Axes in MuData#

This notebook introduces the axes interface, which extends MuData beyond multimodal data storage.

Briefly, the default multimodal storage means that the modalities (AnnData objects) have observations as a shared axis (axis=0), and the variables are effectively concatenated.

We can imagine a symmetrical storage model where the variables are shared and observations are concatenated. This is possible with axis=1 provided at MuData creation time.

Moreover, in some cases we might want to relax the constraints further and assume that both observations and variables are shared. This makes it possible, for instance, to store subsets of features in the same object. Since neither 0 nor 1 describes the case where both axes are shared, the convention axis=-1 is used.
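
The three conventions can be summarised with a small mental model. The sketch below is not how MuData is implemented; it only assumes that a shared axis behaves like a union of names and a non-shared axis like a plain concatenation. The helper `expected_shape` is hypothetical and uses only the standard library.

```python
def expected_shape(obs_names, var_names, axis):
    """Rough mental model of the resulting shape for a given axis.

    obs_names / var_names: one list of names per modality.
    Assumption: a shared axis is a union of names, a non-shared
    axis is a concatenation (lengths simply add up).
    """
    def union(lists):
        # dict.fromkeys keeps first-seen order and drops duplicates
        return len(dict.fromkeys(name for names in lists for name in names))

    def concat(lists):
        return sum(len(names) for names in lists)

    if axis == 0:    # observations shared, variables concatenated
        return union(obs_names), concat(var_names)
    if axis == 1:    # variables shared, observations concatenated
        return concat(obs_names), union(var_names)
    if axis == -1:   # both axes shared
        return union(obs_names), union(var_names)
    raise ValueError("axis must be 0, 1 or -1")


# Two objects with the same 100 cells; the second holds a subset of features
cells = [f"cell{i}" for i in range(100)]
genes_all = [f"gene{i}" for i in range(900)]
genes_subset = genes_all[:300]

for axis in (0, 1, -1):
    print(axis, expected_shape([cells, cells], [genes_all, genes_subset], axis))
```

Only the dimensions are modelled here; how annotations are joined along each axis is a separate matter.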

Imports#

First, install and import mudata and other libraries.

[1]:
! pip install mudata
[2]:
import mudata as md
from mudata import MuData, AnnData
[3]:
import numpy as np
import pandas as pd

np.random.seed(1)

Multimodal: axis=0#

As expected, this is the default behaviour.

To illustrate it, let’s prepare some modalities first:

[4]:
n, d1, d2 = 100, 1000, 1500

ax = AnnData(np.random.normal(size=(n,d1)))

ay = AnnData(np.random.normal(size=(n,d2)))
[5]:
# same as:
#   mdata = MuData({"x": ax, "y": ay})
mdata = MuData({"x": ax, "y": ay}, axis=0)
mdata
/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mudata/_core/mudata.py:445: UserWarning: Cannot join columns with the same name because var_names are intersecting.
  warnings.warn(
[5]:
MuData object with n_obs × n_vars = 100 × 2500
  2 modalities
    x:      100 x 1000
    y:      100 x 1500

As axis=0 corresponds to shared observations, the features should be specific to their modalities. The default variable names, however, are not unique across the modalities, which is what the warning is about:

[6]:
print("ax.var_names: [", ", ".join(ax.var_names.values[:5]) + ", ..., ", ax.var_names.values[d1-1], "]")
print("ay.var_names: [", ", ".join(ay.var_names.values[:5]) + ", ..., ", ay.var_names.values[d2-1], "]")
ax.var_names: [ 0, 1, 2, 3, 4, ...,  999 ]
ay.var_names: [ 0, 1, 2, 3, 4, ...,  1499 ]

In real-world workflows we expect to be able to identify features by their (unique) names:

[7]:
ax.var_names = [f"x_var{i+1}" for i in range(d1)]
ay.var_names = [f"y_var{i+1}" for i in range(d2)]
[8]:
mdata = MuData({"x": ax, "y": ay}, axis=0)
mdata
[8]:
MuData object with n_obs × n_vars = 100 × 2500
  2 modalities
    x:      100 x 1000
    y:      100 x 1500
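
Before building the container, it can help to check whether any names appear in more than one modality. The following is a minimal sketch using only the standard library; `shared_names` is a hypothetical helper, not part of mudata, and the name lists mimic the default and renamed var_names from the cells above.

```python
from collections import Counter


def shared_names(names_by_modality):
    """Return the names that occur in more than one modality."""
    counts = Counter(
        name
        for names in names_by_modality.values()
        for name in set(names)  # count each name once per modality
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Default integer-like names overlap between the two modalities
default_names = {
    "x": [str(i) for i in range(1000)],
    "y": [str(i) for i in range(1500)],
}
# Modality-prefixed names do not
renamed = {
    "x": [f"x_var{i+1}" for i in range(1000)],
    "y": [f"y_var{i+1}" for i in range(1500)],
}

print(len(shared_names(default_names)))
print(len(shared_names(renamed)))
```

An empty result for the renamed variables corresponds to the case where no warning is shown.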

Multidataset: axis=1#

AnnData objects can also represent, for example, multiple scRNA-seq datasets. When analysing them together, it is convenient to store them in one object. This object can then incorporate annotations such as a joint embedding of the datasets.

[9]:
n1, n2, d = 100, 500, 1000

ad1 = AnnData(np.random.normal(size=(n1,d)))
ad2 = AnnData(np.random.normal(size=(n2,d)))
[10]:
# Cell barcodes are dataset-specific
ad1.obs_names = [f"dat1-cell{i+1}" for i in range(n1)]
ad2.obs_names = [f"dat2-cell{i+1}" for i in range(n2)]

What would happen if we create a MuData without specifying the axis?

mdata = MuData({"dat1": ad1, "dat2": ad2})
mdata

Answer

By default, variables are dataset/modality-specific so the number of features in MuData will be d + d = 2000. Cells are considered shared but here, obs_names are unique for each dataset, so the number of cells will be n1 + n2 = 600.

UserWarning: Cannot join columns with the same name because var_names are intersecting.

MuData object with n_obs × n_vars = 600 × 2000
  2 modalities
    dat1:   100 x 1000
    dat2:   500 x 1000

Now, if we declare the variables to be the shared axis:

[11]:
mdata = MuData({"dat1": ad1, "dat2": ad2}, axis=1)
mdata
[11]:
MuData object with n_obs × n_vars = 600 × 1000
  2 modalities
    dat1:   100 x 1000
    dat2:   500 x 1000

Different views on one modality: axis=-1#

In some workflows, like the ones with scVI, AnnData objects typically contain only selected features, e.g. genes. Raw counts for all of the genes are still worth keeping for other analyses.

MuData handles this scenario using the axis=-1 convention.

[12]:
n, d_raw, d_preproc = 100, 900, 300

a_raw = AnnData(np.random.normal(size=(n,d_raw)))
a_preproc = a_raw[:,np.sort(np.random.choice(np.arange(d_raw), d_preproc, replace=False))].copy()

What would happen if we create a MuData with axis=0?

mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=0)
mdata

Answer

With axis=0, cells are (fully) shared (100), variables are concatenated (1200). As the names for the latter intersect between AnnData objects, a warning will be displayed.

UserWarning: Cannot join columns with the same name because var_names are intersecting.

MuData object with n_obs × n_vars = 100 × 1200
  2 modalities
    raw:    100 x 900
    preproc:    100 x 300

What would happen if we create a MuData with axis=1?

mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=1)
mdata

Answer

With axis=1, variables are shared (900), while the cells are dataset-specific (200). As the names for the latter are actually the same in both AnnData objects, a warning will be displayed.

UserWarning: Cannot join columns with the same name because obs_names are intersecting.

MuData object with n_obs × n_vars = 200 × 900
  2 modalities
    raw:    100 x 900
    preproc:    100 x 300

What we want is a MuData object with dimensions (100, 900): the cells are the same in both AnnData objects, and the features of one are a subset of the features of the other.

That’s what we achieve when we declare both axes to be shared:

[13]:
mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=-1)
mdata
[13]:
MuData object with n_obs × n_vars = 100 × 900
  2 modalities
    raw:    100 x 900
    preproc:        100 x 300
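
The choice between the three conventions in this notebook follows from which names overlap between the objects. As a rough illustration, the hypothetical helper below (not part of mudata) encodes that reasoning as a simple heuristic, assuming that overlapping names indicate a shared axis.

```python
def suggest_axis(obs_a, var_a, obs_b, var_b):
    """Heuristic: pick an axis convention from overlapping names.

    Assumption: if names overlap along an axis, that axis is shared.
    Returns 0 (shared observations), 1 (shared variables) or
    -1 (both shared). Falls back to 0, the default, when nothing overlaps.
    """
    obs_overlap = bool(set(obs_a) & set(obs_b))
    var_overlap = bool(set(var_a) & set(var_b))
    if obs_overlap and var_overlap:
        return -1
    if var_overlap:
        return 1
    return 0


cells = [f"cell{i}" for i in range(100)]
genes = [f"gene{i}" for i in range(900)]

# Multimodal: same cells, modality-specific features
print(suggest_axis(cells, [f"x_var{i}" for i in range(10)],
                   cells, [f"y_var{i}" for i in range(15)]))

# Multidataset: dataset-specific cells, same features
print(suggest_axis([f"dat1-{c}" for c in cells], genes,
                   [f"dat2-{c}" for c in cells], genes))

# Views on one modality: same cells, subset of features
print(suggest_axis(cells, genes, cells, genes[:300]))
```

Such a heuristic is only as good as the names: with default integer names, unrelated features would look shared, which is one more reason to give variables meaningful, unique names.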