The authors highlight several shortcomings of contemporary learning-based image registration methods: the inaccuracy of the correspondences provided for training (especially when the deformed subject image differs significantly from the template image), the difficulty of incorporating new image features without retraining from scratch, and the lack of variation in the training image features, owing primarily to the prohibitive computational cost involved. Moreover, the authors note that the best features are ``often learnt only at the template space'', meaning that if the template image is changed, the whole training procedure must be redone.

Let $x^t$ denote an image patch which, when reshaped into a column vector, has length $L$, where $t$ indexes the $T$ image patches drawn from the training MR images: $$ x^t = [x_1^t, x_2^t, \ldots, x_L^t]' $$ Let $W = \{w^i\}_{i = 1,2,\ldots,N}$ denote the set of $N$ filters, where each $w^i$ is a column vector: $$ w^i = [w^i_1, w^i_2, \ldots, w^i_L]' $$
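As a small illustration of this vectorisation step, the sketch below reshapes a 3D patch into a length-$L$ column vector; the patch size and the random stand-in volume are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sketch: vectorising a 3D MR image patch into a length-L column.
rng = np.random.default_rng(0)
volume = rng.standard_normal((64, 64, 64))   # stand-in for an MR image

patch = volume[10:18, 20:28, 30:38]          # an 8x8x8 patch
x_t = patch.reshape(-1)                      # column vector x^t, length L = 512
```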

Unlike classic feature extraction, where a set of filters is hand-designed to extract features, the paper proposes to use an unsupervised learning algorithm known as Independent Subspace Analysis (ISA), an extension of Independent Component Analysis (ICA). In ISA, the responses $s^{t,i}$ (computed as the inner product of the patch and filter vectors, written here as $s^{t,i} = x^t \odot w^i$) do not have to be mutually independent. Instead, the responses are grouped, and each group is called an independent subspace. Responses within a subspace are permitted to have mutual dependencies; what matters is that there are no dependencies between different groups.
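The response-and-grouping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the filter count, patch length, and equal-sized non-overlapping subspaces are all assumptions.

```python
import numpy as np

# Sketch of ISA responses with N filters of length L and equal-sized,
# non-overlapping subspaces (illustrative assumptions).
rng = np.random.default_rng(1)
L, N, subspace_size = 512, 64, 4

W = rng.standard_normal((N, L))     # rows are the filters w^i
x = rng.standard_normal(L)          # a vectorised patch x^t

s = W @ x                           # responses s^{t,i} = <x^t, w^i>

# Group the N responses into N // subspace_size independent subspaces;
# dependencies are allowed within a group but not across groups.
groups = s.reshape(-1, subspace_size)
pooled = np.sqrt((groups ** 2).sum(axis=1))  # one pooled activation per subspace
```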

Let $V$ represent the subspace structure of the responses $s^{t,i}$, where each entry $v_{i,j}$ encodes the association of the basis vector $w^i$ with the $j^{th}$ subspace. The matrix $V$ is fixed and is not updated during training. ISA obtains the optimal set of weights/filters $W$ by finding independent subspaces, i.e., by solving

$$ \hat{W} = \operatorname*{argmin}_{W} \sum_{t=1}^T \sum_{j=1}^M p_j (x^t; W, V), \text{ such that } WW' = I $$ where $M$ is the number of independent subspaces, and $p_j (x^t; W, V)$ is the activation of a particular patch $x^t$ in the $j^{th}$ subspace, given by $$ p_j (x^t; W, V) = \sqrt{\sum_{i=1}^N v_{i,j}\left(x^t \odot w^i\right)^2} $$
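The activation $p_j$ can be computed directly from the definitions above. The sketch below uses a fixed 0/1 matrix $V$ whose entry $v_{i,j}$ assigns filter $w^i$ to subspace $j$; the sizes and the round-robin assignment are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Sketch of the subspace activation p_j(x^t; W, V) with a fixed indicator V.
rng = np.random.default_rng(2)
L, N, M = 512, 64, 16               # patch length, filters, subspaces

W = rng.standard_normal((N, L))     # rows are the filters w^i
x = rng.standard_normal(L)          # a vectorised patch x^t

# Non-overlapping assignment (assumption): filter i belongs to subspace i % M.
V = np.zeros((N, M))
V[np.arange(N), np.arange(N) % M] = 1.0

s_sq = (W @ x) ** 2                 # squared responses (x^t . w^i)^2
p = np.sqrt(V.T @ s_sq)             # p_j = sqrt( sum_i v_{i,j} * s_i^2 )
```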

This is done by hierarchically training a two-layer stacked convolutional neural network: the first layer is trained on image patches at a smaller scale, and the second layer is trained on a transformation of the first layer's responses. The authors state this is advantageous because ``high-level understanding of large-scale image patch can be perceived from the low-level image features detected by the basis filters in the first layer.''
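The stacking idea above can be sketched as follows. This is a hedged outline, not the authors' exact pipeline: layer-1 filters (assumed already trained) respond to small sub-patches tiled over a larger patch, their pooled subspace activations are concatenated, and that vector would serve as the input for layer-2 ISA training. All sizes are illustrative assumptions, and the dimensionality-reduction step the paper applies between layers is omitted.

```python
import numpy as np

# Hedged sketch of the two-layer stacking (illustrative sizes).
rng = np.random.default_rng(3)
L1, N1, M1 = 64, 32, 8              # small-scale patch length, filters, subspaces

W1 = rng.standard_normal((N1, L1))  # layer-1 filters (assumed already trained)

def layer1_features(big_patch):
    """Pool layer-1 subspace activations over sub-patches of a large patch."""
    sub_patches = big_patch.reshape(-1, L1)          # tile into small patches
    s_sq = (sub_patches @ W1.T) ** 2                 # squared responses
    pooled = np.sqrt(s_sq.reshape(len(sub_patches), M1, -1).sum(axis=2))
    return pooled.reshape(-1)                        # concatenated features

big = rng.standard_normal(8 * L1)                    # a large-scale patch
feats = layer1_features(big)                         # input for layer-2 training
```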

This summary was written in Fall 2018 as part of the CMPT 880 Special Topics in AI: Medical Imaging Meets Machine Learning course.