# MuData quickstart


By introducing multimodal data objects (`MuData`) built on top of AnnData, the `mudata` library enriches the Python data analysis ecosystem to enable multimodal data analysis. Be sure to check out tools that take advantage of this data format, such as muon, the Python framework for multimodal omics analysis.

This notebook provides an introduction to multimodal data objects.

```
[1]:
```

```
! pip install mudata
```

```
[2]:
```

```
import mudata as md
from mudata import MuData
```

## Multimodal objects

To see how multimodal objects behave, we will simulate some data first:

```
[3]:
```

```
import numpy as np
np.random.seed(1)
n, d, k = 1000, 100, 10
z = np.random.normal(loc=np.arange(k), scale=np.arange(k)*2, size=(n,k))
w = np.random.normal(size=(d,k))
y = np.dot(z, w.T)
y.shape
```

```
[3]:
```

```
(1000, 100)
```

Creating an `AnnData` object from the matrix will allow us to add annotations to its different dimensions (*“observations”*, e.g. samples, and measured *“variables”*):

```
[4]:
```

```
from anndata import AnnData
adata = AnnData(y)
adata.obs_names = [f"obs_{i+1}" for i in range(n)]
adata.var_names = [f"var_{j+1}" for j in range(d)]
adata
```

```
[4]:
```

```
AnnData object with n_obs × n_vars = 1000 × 100
```

We will go ahead and create a second object with data for the *same observations* but for *different variables*:

```
[5]:
```

```
d2 = 50
w2 = np.random.normal(size=(d2,k))
y2 = np.dot(z, w2.T)
adata2 = AnnData(y2)
adata2.obs_names = [f"obs_{i+1}" for i in range(n)]
adata2.var_names = [f"var2_{j+1}" for j in range(d2)]
adata2
```

```
[5]:
```

```
AnnData object with n_obs × n_vars = 1000 × 50
```

We can now wrap these two objects into a `MuData` object:

```
[6]:
```

```
mdata = MuData({"A": adata, "B": adata2})
mdata
```

```
[6]:
```

```
MuData object with n_obs × n_vars = 1000 × 150
  2 modalities
    A: 1000 x 100
    B: 1000 x 50
```

*Observations* and *variables* of the `MuData` object are global, which means that observations with identical names (`.obs_names`) in different modalities are considered to be the same observation. This also means that variable names (`.var_names`) should be unique across modalities.

This is reflected in the object description above: `mdata` has 1000 *observations* and 150 = 100 + 50 *variables*.

### Variable mappings

Upon construction of a `MuData` object, a global binary mapping is created between *observations* and individual modalities, as well as between *variables* and modalities.

Since all the observations are the same across modalities in `mdata`, all the values in the *observations* mappings are set to `True`:

```
[7]:
```

```
np.sum(mdata.obsm["A"]) == np.sum(mdata.obsm["B"]) == n
```

```
[7]:
```

```
True
```

For variables, the mappings are 150-long vectors. For the `A` modality, for example, the vector has 100 `True` values followed by 50 `False` values:

```
[8]:
```

```
mdata.varm['A']
```

```
[8]:
```

```
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False])
```
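To make the idea of a binary mapping concrete, here is a minimal sketch in plain NumPy. It does not use `mudata` and is not how the library builds its mappings internally; the names `var_a`, `var_b`, `global_vars`, `map_a` and `map_b` are made up for this illustration.

```python
import numpy as np

# Variable names of two hypothetical modalities (same sizes as in the example above)
var_a = np.array([f"var_{j+1}" for j in range(100)])
var_b = np.array([f"var2_{j+1}" for j in range(50)])

# Global variables: modality A's variables followed by modality B's
global_vars = np.concatenate([var_a, var_b])

# One boolean vector per modality: does a global variable belong to it?
map_a = np.isin(global_vars, var_a)
map_b = np.isin(global_vars, var_b)

print(map_a.sum(), map_b.sum(), len(global_vars))
```

Each mapping has one entry per global variable, which is why the vector for `A` shown above is 150 elements long even though the modality itself has only 100 variables.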

### Object references

Importantly, individual modalities are stored as references to the original objects.

```
[9]:
```

```
# Add some unstructured data to the original object
adata.uns["misc"] = {"adata": True}
```

```
[10]:
```

```
# Access modality A via the .mod attribute
mdata.mod["A"].uns["misc"]
```

```
[10]:
```

```
{'adata': True}
```

This is also why the `MuData` object has to be updated in order to reflect the latest changes to the modalities it includes:

```
[11]:
```

```
adata2.var_names = ["var_ad2_" + e.split("_")[1] for e in adata2.var_names]
```

```
[12]:
```

```
print("Outdated variables names: ...,", ", ".join(mdata.var_names[-3:]))
mdata.update()
print("Updated variables names: ...,", ", ".join(mdata.var_names[-3:]))
```

```
Outdated variables names: ..., var2_48, var2_49, var2_50
Updated variables names: ..., var_ad2_48, var_ad2_49, var_ad2_50
```
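The reason an explicit update is needed can be illustrated with a toy container in plain Python. `ToyContainer` is a hypothetical stand-in written for this sketch and is not how `mudata` is implemented; it only mimics the combination of holding references to the modalities while caching global names derived from them.

```python
class ToyContainer:
    """Hypothetical stand-in for illustration only, not mudata's implementation."""

    def __init__(self, mods):
        self.mod = mods  # keeps references to the given objects, no copies
        self.update()

    def update(self):
        # Recompute the cached global variable names from the modalities
        self.var_names = [name for m in self.mod.values() for name in m["var_names"]]


mod_b = {"var_names": ["var2_1", "var2_2"]}
container = ToyContainer({"B": mod_b})

# Change the original object after the container was created
mod_b["var_names"] = ["var_ad2_1", "var_ad2_2"]

stale = list(container.var_names)  # cached names, not yet refreshed
container.update()
fresh = list(container.var_names)  # names after the refresh
print(stale, fresh)
```

Because the container stores a reference, the change to `mod_b` is visible through `container.mod["B"]` immediately, while anything derived and cached at the container level stays outdated until `update()` is called.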

### Common observations

While `mdata` comprises the same observations for both modalities, this is not always the case in the real world, where some data might be missing. By design, `mudata` accounts for these scenarios: there is no guarantee that observations are the same, or even intersecting, for a `MuData` instance.

It's worth noting that other tools might provide convenience functions for some common scenarios of dealing with missing data, such as `intersect_obs()` implemented in muon.

### Rich representation

Some notebook environments, such as Jupyter/IPython, support rich object representations. `mudata` uses this to provide an optional HTML representation that makes it possible to explore `MuData` objects interactively. While the dataset in our example is not the most comprehensive one, here is what it looks like:

```
[13]:
```

```
with md.set_options(display_style = "html", display_html_expand = 0b000):
    display(mdata)
```

*(Rendered HTML output: a collapsible view of `mdata`. At the top level it lists `.obs` (0 elements), `.obsm` (2 elements: the boolean `numpy.ndarray` mappings `A` and `B`) and `.obsp` (0 elements). It is followed by modality `A` (1000 × 100), whose `.uns` holds 1 element (`misc`, a dict with `adata: True`), and modality `B` (1000 × 50), whose slots are all empty.)*

Running `md.set_options(display_style = "html")` will change the setting for the current Python session.

The flag `display_html_expand` has three bits that correspond to

- `MuData` attributes,
- modalities,
- `AnnData` attributes,

and indicate whether the fields should be expanded by default (`1`) or collapsed under the `<summary>` tag (`0`).

### .h5mu files

`MuData` objects were designed to be serialized into `.h5mu` files. Modalities are stored under their respective names in the `/mod` HDF5 group of the `.h5mu` file. Each individual modality, e.g. `/mod/A`, is stored in the same way as it would be stored in an `.h5ad` file.

```
[14]:
```

```
import tempfile
# Create a temporary file
temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".h5mu", prefix="muon_getting_started_")
mdata.write(temp_file.name)
mdata_r = md.read(temp_file.name, backed=True)
mdata_r
```

```
[14]:
```

```
MuData object with n_obs × n_vars = 1000 × 150 backed at '/var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_m8own7bb.h5mu'
  2 modalities
    A: 1000 x 100
      uns: 'misc'
    B: 1000 x 50
```

Individual modalities are backed as well, inside the `.h5mu` file:

```
[15]:
```

```
mdata_r["A"].isbacked
```

```
[15]:
```

```
True
```

The rich representation also reflects the *backed* state of `MuData` objects when they are loaded from `.h5mu` files in read-only mode, and points to the respective file:

```
[16]:
```

```
with md.set_options(display_style = "html", display_html_expand = 0b000):
    display(mdata_r)
```

*(Rendered HTML output: a collapsible view of `mdata_r`. The `MuData` object and both of its modalities, `A` (1000 × 100, with `misc` in `.uns`) and `B` (1000 × 50), are marked as backed at `/var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_m8own7bb.h5mu`. The top-level `.obsm` again holds the two boolean mappings `A` and `B`.)*

## Multimodal methods

Once the `MuData` object is prepared, it is up to multimodal methods to make sense of the data. The simplest and most naive approach is to concatenate the matrices from multiple modalities and then perform, for example, dimensionality reduction.

```
[17]:
```

```
x = np.hstack([mdata.mod["A"].X, mdata.mod["B"].X])
x.shape
```

```
[17]:
```

```
(1000, 150)
```

We can write a simple function to run principal component analysis on such a concatenated matrix. The `MuData` object provides a place to store multimodal embeddings: `MuData.obsm`. This is similar to how embeddings generated on individual modalities are stored, only this time the embedding is saved inside the `MuData` object rather than in `AnnData.obsm`.

```
[18]:
```

```
def simple_pca(mdata):
    from sklearn import decomposition

    # Concatenate the feature matrices of all modalities column-wise
    x = np.hstack([m.X for m in mdata.mod.values()])
    pca = decomposition.PCA(n_components=2)
    components = pca.fit_transform(x)
    # By default, methods operate in-place
    # and embeddings are stored in the .obsm slot
    mdata.obsm["X_pca"] = components
    return
```

```
[19]:
```

```
simple_pca(mdata)
print(mdata)
```

```
MuData object with n_obs × n_vars = 1000 × 150
  obsm: 'X_pca'
  2 modalities
    A: 1000 x 100
      uns: 'misc'
    B: 1000 x 50

In reality, however, having different modalities often means that the features between them come from different generative processes and are not comparable.
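One way to see why naive concatenation can be problematic is to look at what happens when the modalities live on very different scales. The sketch below uses plain NumPy with made-up data (the arrays `a` and `b` and their scales are assumptions for illustration) and computes the first principal axis with an SVD; it is not an integration method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

a = rng.normal(scale=1.0, size=(n, 10))    # modality with small-scale features
b = rng.normal(scale=100.0, size=(n, 5))   # modality with large-scale features

# Naive concatenation followed by centering
x = np.hstack([a, b])
x = x - x.mean(axis=0)

# Principal axes are the right singular vectors of the centered matrix
_, _, vt = np.linalg.svd(x, full_matrices=False)
pc1 = vt[0]

# Share of the (unit-norm) first principal axis that falls on modality B's features
weight_b = np.sum(pc1[10:] ** 2)
print(f"Share of PC1 on modality B: {weight_b:.4f}")
```

In this toy setting the first component is determined almost entirely by the large-scale modality, so the embedding largely ignores the other one.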

This is where special multimodal integration methods come into play. For omics technologies, these methods are frequently referred to as *multi-omics integration methods*. `MuData` objects make it easy to apply new methods to such data, and some of these methods are implemented in muon.