Mutual Information
Given two discrete random variables $A$ and $B$ with marginal probability distributions $p_A(a)$ and $p_B(b)$ and joint probability distribution $p_{AB}(a,b)$, the two variables are said to be statistically independent if $p_{AB}(a,b) = p_A(a).p_B(b)$, and maximally dependent if they are related by a one-to-one mapping $T$ such that $p_A(a) = p_B(T(a)) = p_{AB}(a, T(a))$. The mutual information $I(A,B)$ quantifies the degree of dependence of $A$ and $B$:
$$ I(A,B) = \sum_{a,b} p_{AB}(a,b) \log \frac{p_{AB}(a,b)}{p_A(a).p_B(b)} $$
Using the notion of entropy from information theory, we also have
$$ I(A,B) = H(A) + H(B) - H(A,B) = H(A) - H(A|B) = H(B) - H(B|A) $$ where $H$ represents the entropy.
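As a concrete illustration (not part of the original derivation), the sketch below estimates $I(A,B)$ from paired intensity samples using a 2-D histogram; the function name, bin count, and use of NumPy are assumptions made here for the example.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate I(A,B) in bits from paired intensity samples a and b."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()              # joint distribution p_AB(a,b)
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal p_A(a)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal p_B(b)
    nz = p_ab > 0                           # skip zero-probability bins (0 log 0 = 0)
    return np.sum(p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz]))
```

Equivalently, the same histogram could be used to compute $I(A,B)$ as $H(A) + H(B) - H(A,B)$.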
Mutual Information Registration Criterion
Given two images $\mathbb{A}$ and $\mathbb{B}$ that are geometrically related by the registration transformation $\boldsymbol{T_\alpha}$ with parameters $\boldsymbol{\alpha}$ such that pixels/voxels $\boldsymbol{p}$ in $\mathbb{A}$ with intensity $a$ physically correspond to pixels/voxels $\boldsymbol{T_\alpha(p)}$ in $\mathbb{B}$ with intensity $b$, we have
$$ a = \mathbb{A}(\boldsymbol{p}), \qquad b = \mathbb{B}(\boldsymbol{T_\alpha}(\boldsymbol{p})) $$
$$ I(A,B) = \sum_{a,b} p_{AB}(a,b) \log \frac{p_{AB}(a,b)}{p_A(a).p_B(b)} $$
The relationship between $a$ and $b$, and therefore their joint probability distribution $p_{AB}(a,b)$ and mutual information $I(A,B)$, depends on $\boldsymbol{T_\alpha}$. According to the mutual information registration criterion, the images are maximally aligned for
$$ \boldsymbol{\alpha^*} = \text{argmax}_{\boldsymbol{\alpha}} \ I(A,B) $$
It should be noted that if both marginal distributions $p_A(a)$ and $p_B(b)$ are independent of the registration parameters $\boldsymbol{\alpha}$, then this criterion effectively reduces to minimizing the joint entropy $H(A,B)$.
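To make the criterion concrete, here is a minimal, hypothetical sketch that exhaustively searches over integer translations $\boldsymbol{\alpha} = (dy, dx)$ of $\mathbb{B}$ and picks the one maximizing $I(A,B)$, reusing the `mutual_information` helper above; practical registration frameworks instead use sub-pixel interpolation, Parzen windowing, and gradient-based optimizers.

```python
import numpy as np
from scipy.ndimage import shift  # assumes SciPy is available

def register_translation(img_a, img_b, search=range(-10, 11)):
    """Toy MI registration: exhaustive search over integer translations of img_b."""
    best_alpha, best_mi = (0, 0), -np.inf
    for dy in search:
        for dx in search:
            moved = shift(img_b, (dy, dx), order=1)   # B(T_alpha(p)) for this alpha
            mi = mutual_information(img_a, moved)     # I(A,B) under T_alpha
            if mi > best_mi:
                best_alpha, best_mi = (dy, dx), mi
    return best_alpha, best_mi                        # alpha* = argmax_alpha I(A,B)
```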
Since mutual information does not rely directly on the intensity values themselves to establish correspondences between images, it is insensitive to intensity permutations and one-to-one intensity transformations, and it can handle positive and negative intensity correlations simultaneously. Moreover, since there are no limiting assumptions specific to the image content of the modalities involved, it is successfully applicable to a wide range of multimodal image registration tasks.
This can be generalized further: mutual information $I(A,B)$ is just one example of the general f-information measures of dependence $f(P \,||\, P_1 \times P_2)$, where $P$ denotes the joint probability distribution $P(A,B)$ and $P_1 \times P_2$ the product of the marginal distributions $P(A).P(B)$. The f-divergence is defined as $$ f(P||Q) = \sum_{i} q_i . f\Big(\frac{p_i}{q_i}\Big) $$
With $p_{ij} = P(i,j)$, $p_{i.} = \sum_j p_{ij}$, and $p_{.j} = \sum_i p_{ij}$, we have
$$ I_{\alpha}(P||P_1 \times P_2) = \frac{1}{\alpha(\alpha - 1)} \bigg[\sum_{i,j} \frac{p^\alpha_{ij}}{(p_{i.}p_{.j})^{\alpha-1}} - 1\bigg] $$
As $\alpha \to 1$, $I_\alpha$ approaches the mutual information measure.
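As an illustrative sketch (an assumption made for this summary, valid for $\alpha \neq 0, 1$), the $I_\alpha$ measure above can be evaluated directly from a joint probability table:

```python
import numpy as np

def f_information_alpha(p_ab, alpha):
    """I_alpha(P || P1 x P2) for a joint probability table p_ab (alpha != 0, 1)."""
    p_i = p_ab.sum(axis=1, keepdims=True)   # row marginals p_{i.}
    p_j = p_ab.sum(axis=0, keepdims=True)   # column marginals p_{.j}
    prod = p_i @ p_j                        # product of marginals P1 x P2
    nz = p_ab > 0                           # zero joint-probability cells contribute 0
    s = np.sum(p_ab[nz] ** alpha / prod[nz] ** (alpha - 1))
    return (s - 1.0) / (alpha * (alpha - 1.0))
```

Evaluating this at $\alpha$ close to 1 should approximate the mutual information (measured in nats).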
This summary was written in Fall 2018 as a part of the CMPT 880 Special Topics in AI: Medical Imaging Meets Machine Learning course.