This paper proposes an end-to-end trained, fully convolutional neural network model to process 3D image volumes. Unlike previous works that processed the input volumes slice-wise or patch-wise, the authors propose to use volumetric convolutions. Moreover, a new objective function formulated using the Dice coefficient is proposed as the optimization target, and the authors demonstrate the fast and superior performance of the algorithm on the segmentation of prostate MRI volumes.
Considering the two parts of the network as left and right (as visualized in the original paper), the left part of the network consists of a compression path, while the right part decompresses the signal until the original resolution is restored.
The left side of the network is divided into stages comprising one to three convolutional layers operating at different resolutions. Similar to ResNets, each stage learns a residual function: the input of each stage is added to the output of its last convolutional layer. The authors state that such an architecture converges in a fraction of the time required by a similar network that does not learn residual functions.
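A minimal PyTorch sketch of one such residual stage is given below. This is not the authors' implementation; the class name, padding choice, and channel handling are assumptions made for illustration.

```python
import torch.nn as nn

class ResidualStage(nn.Module):
    """One stage of the compression path: a few 5x5x5 volumetric convolutions
    whose output is summed element-wise with the stage input (residual learning)."""
    def __init__(self, channels, n_convs):
        super().__init__()
        layers = []
        for _ in range(n_convs):
            layers += [nn.Conv3d(channels, channels, kernel_size=5, padding=2),
                       nn.PReLU(channels)]
        self.convs = nn.Sequential(*layers)

    def forward(self, x):
        return self.convs(x) + x  # residual connection: input added to the stage output
```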
The convolutions performed in each stage use volumetric kernels of size $5 \times 5 \times 5$ voxels. With each subsequent stage along the compression path, the resolution of the data is halved by a convolution with kernels of size $2 \times 2 \times 2$ voxels applied with stride 2. This is conceptually similar to the popular max-pooling operations in convolutional neural networks; however, replacing pooling operations with convolutions reduces the memory footprint during training. Parametric ReLU (PReLU) is used to introduce non-linearities throughout the network. Each stage on the left part of the network computes twice the number of feature channels as the previous one.
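A hedged sketch of how such a down-convolution between stages might look (again an illustrative PyTorch fragment, not the authors' code):

```python
import torch.nn as nn

class DownConv(nn.Module):
    """Down-convolution between compression stages: a 2x2x2 kernel with stride 2
    halves the spatial resolution and doubles the number of feature channels,
    taking the place of a max-pooling layer."""
    def __init__(self, in_channels):
        super().__init__()
        self.down = nn.Conv3d(in_channels, 2 * in_channels, kernel_size=2, stride=2)
        self.act = nn.PReLU(2 * in_channels)

    def forward(self, x):
        return self.act(self.down(x))
```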
The right side of the network “extracts features and expands the spatial support of the lower resolution feature maps in order to gather and assemble the necessary information to output a two channel volumetric segmentation.” The feature maps at the last convolutional layer, produced by kernels of size $1 \times 1 \times 1$, have the same spatial size as the input volume and are converted into probabilistic segmentation outputs for the foreground and background regions by applying the softmax function voxel-wise.
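A short sketch of this final step (the channel count and tensor sizes below are made-up values for illustration only):

```python
import torch
import torch.nn as nn

# A 1x1x1 convolution maps the last feature maps to two output channels
# (foreground and background), followed by a voxel-wise softmax.
final_conv = nn.Conv3d(in_channels=32, out_channels=2, kernel_size=1)

features = torch.randn(1, 32, 64, 128, 128)   # (batch, channels, depth, height, width)
logits = final_conv(features)                 # -> (1, 2, 64, 128, 128)
probs = torch.softmax(logits, dim=1)          # softmax over the channel axis = voxel-wise
```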
Similar to U-Net, the features extracted from the early stages of the left part of the network are forwarded to the right part of the network. This recovers fine-grained detail that would otherwise be lost in the compression path, improving the quality of the final contour prediction while also reducing the convergence time of the model.
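In code, this forwarding amounts to saving the feature maps of each left-side stage and merging them with the corresponding up-sampled maps on the right; the fragment below assumes channel-wise concatenation (as suggested by the paper's figure) and uses illustrative tensor shapes.

```python
import torch

left_features = torch.randn(1, 64, 32, 64, 64)   # saved from a compression-path stage
upsampled     = torch.randn(1, 64, 32, 64, 64)   # output of a de-convolution on the right
merged = torch.cat([left_features, upsampled], dim=1)  # fed into the next right-side stage
```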
The Dice coefficient $D$ between the predicted binary volume (voxels $p_i$) and the ground truth binary volume (voxels $g_i$), each containing $N$ voxels, can be written as
$$ D = \frac{2 \sum_i^N p_i g_i}{\sum_i^N p_i^2 + \sum_i^N g_i^2} $$
Computing the gradient with respect to the $j^{th}$ voxel of the prediction, we have $$ \frac{\partial D}{\partial p_j} = 2 \left[\frac{g_j \left(\sum_i^N p_i^2 + \sum_i^N g_i^2\right) - 2 p_j \left(\sum_i^N p_i g_i\right)}{\Big(\sum_i^N p_i^2 + \sum_i^N g_i^2\Big)^2}\right] $$
This formulation removes the need to assign weights to samples of different classes to overcome voxel class imbalance.
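A possible PyTorch rendering of this Dice objective, minimized as $1 - D$; the small epsilon for numerical stability is an addition not present in the paper's formulation.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss following the formula above: probs are the voxel-wise
    foreground probabilities, target is the binary ground-truth volume."""
    p = probs.reshape(probs.shape[0], -1)
    g = target.reshape(target.shape[0], -1)
    numerator = 2.0 * (p * g).sum(dim=1)
    denominator = (p * p).sum(dim=1) + (g * g).sum(dim=1) + eps  # eps is an added assumption
    dice = numerator / denominator
    return 1.0 - dice.mean()   # minimizing 1 - D maximizes the Dice overlap
```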
This summary was written in Fall 2018 as a part of the CMPT 880 Special Topics in AI: Medical Imaging Meets Machine Learning course.