{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Axes in MuData"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scverse/mudata/blob/master/docs/source/notebooks/axes.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/scverse/mudata/master?labpath=docs%2Fsource%2Fnotebooks%2Faxes.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebooks introduces *axes* interface that supercharges MuData to be used beyond multimodal data storage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Briefly, the default multimodal storage means that the modalities (AnnData objects) have observations as a shared axis (`axis=0`), and the variables are effectively concatenated.\n",
    "\n",
    "We can imagine a symmetrical storage model where the variables are shared and observations are concatenated. This is possible with `axis=1` provided at MuData creation time.\n",
    "\n",
    "More than that, in some cases we might want to relax constraints even more and assume that both observations and variables are in fact shared. This allows, for instance, to store subsets of features in the same object. As both axes are shared, a convention is used here, and it is `axis=-1`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, install and import `mudata` and other libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install mudata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mudata as md\n",
    "from mudata import MuData, AnnData"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "np.random.seed(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multimodal: `axis=0`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, this is the default behaviour."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate it, let's prepare some modalities first:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "n, d1, d2 = 100, 1000, 1500\n",
    "\n",
    "ax = AnnData(np.random.normal(size=(n,d1)))\n",
    "\n",
    "ay = AnnData(np.random.normal(size=(n,d2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/mudata/_core/mudata.py:445: UserWarning: Cannot join columns with the same name because var_names are intersecting.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre>MuData object with n_obs × n_vars = 100 × 2500\n",
       "  2 modalities\n",
       "    x:\t100 x 1000\n",
       "    y:\t100 x 1500</pre>"
      ],
      "text/plain": [
       "MuData object with n_obs × n_vars = 100 × 2500\n",
       "  2 modalities\n",
       "    x:\t100 x 1000\n",
       "    y:\t100 x 1500"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# same as:\n",
    "#   mdata = MuData({\"x\": ax, \"y\": ay})\n",
    "mdata = MuData({\"x\": ax, \"y\": ay}, axis=0)\n",
    "mdata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As `axis=0` corresponds to shared observations, the features should be specific to their modalities. The variable names, however, are unique, which the warning is displayed about:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ax.var_names: [ 0, 1, 2, 3, 4, ...,  999 ]\n",
      "ay.var_names: [ 0, 1, 2, 3, 4, ...,  1499 ]\n"
     ]
    }
   ],
   "source": [
    "print(\"ax.var_names: [\", \", \".join(ax.var_names.values[:5]) + \", ..., \", ax.var_names.values[d1-1], \"]\")\n",
    "print(\"ay.var_names: [\", \", \".join(ay.var_names.values[:5]) + \", ..., \", ay.var_names.values[d2-1], \"]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In real-world workflows we expect to be able to identify features by their (unique) names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax.var_names = [f\"x_var{i+1}\" for i in range(d1)]\n",
    "ay.var_names = [f\"y_var{i+1}\" for i in range(d2)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre>MuData object with n_obs × n_vars = 100 × 2500\n",
       "  2 modalities\n",
       "    x:\t100 x 1000\n",
       "    y:\t100 x 1500</pre>"
      ],
      "text/plain": [
       "MuData object with n_obs × n_vars = 100 × 2500\n",
       "  2 modalities\n",
       "    x:\t100 x 1000\n",
       "    y:\t100 x 1500"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mdata = MuData({\"x\": ax, \"y\": ay}, axis=0)\n",
    "mdata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multidataset: `axis=1`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, AnnData objects can represent e.g. multiple scRNA-seq datasets. When analysing them together, it is convenient to store them in one object. This object can then incorporate annotations such as a joint embedding of the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "n1, n2, d = 100, 500, 1000\n",
    "\n",
    "ad1 = AnnData(np.random.normal(size=(n1,d)))\n",
    "ad2 = AnnData(np.random.normal(size=(n2,d)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell barcodes are dataset-specific\n",
    "ad1.obs_names = [f\"dat1-cell{i+1}\" for i in range(n1)]\n",
    "ad2.obs_names = [f\"dat2-cell{i+1}\" for i in range(n2)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What would happen if we create a MuData without specifying the axis?\n",
    "\n",
    "```python\n",
    "mdata = MuData({\"dat1\": ad1, \"dat2\": ad2})\n",
    "mdata\n",
    "```\n",
    "\n",
    "<details>   \n",
    "\n",
    "<summary style=\"display: list-item;\">Answer</summary>\n",
    "    \n",
    "By default, variables are dataset/modality-specific so the number of features in MuData will be `d + d = 2000`.\n",
    "Cells are considered shared but here, `obs_names` are unique for each dataset, so the number of cells will be `n1 + n2 = 600`.\n",
    "\n",
    "\n",
    "```\n",
    "UserWarning: Cannot join columns with the same name because var_names are intersecting.\n",
    "\n",
    "MuData object with n_obs × n_vars = 600 × 2000\n",
    "  2 modalities\n",
    "    dat1:\t100 x 1000\n",
    "    dat2:\t500 x 1000\n",
    "```\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if we point the shared axes to be variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre>MuData object with n_obs × n_vars = 600 × 1000\n",
       "  2 modalities\n",
       "    dat1:\t100 x 1000\n",
       "    dat2:\t500 x 1000</pre>"
      ],
      "text/plain": [
       "MuData object with n_obs × n_vars = 600 × 1000\n",
       "  2 modalities\n",
       "    dat1:\t100 x 1000\n",
       "    dat2:\t500 x 1000"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mdata = MuData({\"dat1\": ad1, \"dat2\": ad2}, axis=1)\n",
    "mdata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Different views on one modality: `axis=-1`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In some workflows, like the ones with [scVI](https://scvi-tools.org/), AnnData objects typically contain only selected features, e.g. genes. Raw counts for all of the genes are still valuable to keep, for other analyses.\n",
    "\n",
    "MuData handles this scenario using the `axis=-1` convention."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "n, d_raw, d_preproc = 100, 900, 300\n",
    "\n",
    "a_raw = AnnData(np.random.normal(size=(n,d_raw)))\n",
    "a_preproc = a_raw[:,np.sort(np.random.choice(np.arange(d_raw), d_preproc, replace=False))].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What would happen if we create a MuData with `axis=0`?\n",
    "\n",
    "```python\n",
    "mdata = MuData({\"raw\": a_raw, \"preproc\": a_preproc}, axis=0)\n",
    "mdata\n",
    "```\n",
    "\n",
    "<details>   \n",
    "\n",
    "<summary style=\"display: list-item;\">Answer</summary>\n",
    "    \n",
    "With `axis=0`, cells are (fully) shared (`100`), variables are concatenated (`1200`). As the names for the latter intersect between AnnData objects, a warning will be displayed.\n",
    "\n",
    "\n",
    "```\n",
    "UserWarning: Cannot join columns with the same name because var_names are intersecting.\n",
    "\n",
    "MuData object with n_obs × n_vars = 100 × 1200\n",
    "  2 modalities\n",
    "    raw:\t100 x 900\n",
    "    preproc:\t100 x 300\n",
    "```\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What would happen if we create a MuData with `axis=1`?\n",
    "\n",
    "```python\n",
    "mdata = MuData({\"raw\": a_raw, \"preproc\": a_preproc}, axis=1)\n",
    "mdata\n",
    "```\n",
    "\n",
    "<details>   \n",
    "\n",
    "<summary style=\"display: list-item;\">Answer</summary>\n",
    "    \n",
    "With `axis=1`, variables are shared (`900`), while the cells are dataset-specific (`200`). As the names for the latter are actually the same in both AnnData objects, a warning will be displayed.\n",
    "\n",
    "\n",
    "```\n",
    "UserWarning: Cannot join columns with the same name because obs_names are intersecting.\n",
    "\n",
    "MuData object with n_obs × n_vars = 200 × 900\n",
    "  2 modalities\n",
    "    raw:\t100 x 900\n",
    "    preproc:\t100 x 300\n",
    "```\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What we want from a MuData object is to be of dimensions `(100, 900)` — cells are the same for both AnnData objects as well as a subset of features.\n",
    "\n",
    "That's what we achieve when we point that both axes are shared:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre>MuData object with n_obs × n_vars = 100 × 900\n",
       "  2 modalities\n",
       "    raw:\t100 x 900\n",
       "    preproc:\t100 x 300</pre>"
      ],
      "text/plain": [
       "MuData object with n_obs × n_vars = 100 × 900\n",
       "  2 modalities\n",
       "    raw:\t100 x 900\n",
       "    preproc:\t100 x 300"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mdata = MuData({\"raw\": a_raw, \"preproc\": a_preproc}, axis=-1)\n",
    "mdata"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}