In this paper, the authors propose a fully convolutional neural network architecture for biomedical image segmentation that overcomes the limitations of contemporary algorithms. Unlike other popular algorithms of the time, the proposed network does not suffer from the redundancy arising from overlapping training patches. Moreover, the authors eliminate the trade-off between localization accuracy and the use of context, stating that “good localization and the use of context are possible at the same time”.
The network architecture consists of two paths: a contracting path and a symmetric expansive path. The network has a total of 23 convolutional layers, and in order to ensure a seamless tiling of the output segmentation map, the input size must be selected such that “all 2x2 max-pooling operations are applied to a layer with an even x- and y-size”.
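The even-size constraint can be checked mechanically by tracing a candidate tile size through the network's unpadded convolutions. The following sketch is ours (the function name, and the `depth=4` parameter for the four pooling levels, are our labels, not the paper's):

```python
def unet_output_size(n, depth=4):
    """Trace an n x n input tile through the U-Net's valid (unpadded)
    convolutions; return the output map size, or None if any 2x2
    max-pool would see an odd-sized layer (tiling not seamless)."""
    for _ in range(depth):          # contracting path
        n -= 4                      # two 3x3 valid convs lose 4 pixels
        if n % 2 != 0:
            return None             # odd size before a 2x2 max-pool
        n //= 2                     # 2x2 max-pool with stride 2
    n -= 4                          # two 3x3 convs at the bottleneck
    for _ in range(depth):          # expansive path
        n = 2 * n - 4               # up-conv doubles, two convs lose 4
    return n

print(unet_output_size(572))        # the paper's 572x572 tile -> 388
```

Running this on the paper's 572×572 input tile reproduces the 388×388 output segmentation map shown in its architecture figure, while a size such as 573 is rejected.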
The contracting path consists of the repeated application of two $3 \times 3$ unpadded convolutions, each followed by a rectified linear unit (ReLU) non-linearity, and a $2 \times 2$ max-pooling operation with a stride of 2. At each downsampling step, the number of feature channels is doubled. The expansive path consists of repeated upsampling of the feature map followed by a $2 \times 2$ convolution (“up-convolution”) that halves the number of feature channels. The correspondingly cropped (to compensate for the loss of border pixels at every convolution) feature map from the contracting path is concatenated to this, and two $3 \times 3$ convolutions are performed, each followed by a ReLU non-linearity. At the final layer, a $1 \times 1$ convolution maps each 64-component feature vector to the desired number of classes. The symmetry of the contracting and the expansive path gives the architecture its ‘U’ shape, thus lending the network its name.
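The crop-and-concatenate skip connection can be illustrated with plain numpy. This is a shape-level sketch only (the 64-channel 136×136 and 104×104 sizes below are hypothetical values at one level of the U, chosen for illustration):

```python
import numpy as np

def center_crop(feat, target_hw):
    """Crop a (C, H, W) feature map to (h, w) around its center,
    compensating for border pixels lost to unpadded convolutions."""
    _, H, W = feat.shape
    h, w = target_hw
    top, left = (H - h) // 2, (W - w) // 2
    return feat[:, top:top + h, left:left + w]

# hypothetical shapes at one level: a 64-channel skip feature map from
# the contracting path, and a smaller 64-channel map arriving from the
# up-convolution in the expansive path
skip = np.zeros((64, 136, 136))
up = np.zeros((64, 104, 104))

# crop the skip map to match, then concatenate along the channel axis
merged = np.concatenate([center_crop(skip, up.shape[1:]), up], axis=0)
print(merged.shape)  # (128, 104, 104)
```

The doubled channel count after concatenation is then reduced again by the two $3 \times 3$ convolutions that follow at that level.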
In order to compensate for the limited amount of training data, data augmentation was performed using rotations, translations, elastic deformations, and gray value variations. Smooth deformations were generated using random displacement vectors on a coarse $3 \times 3$ grid, where the displacement vectors were sampled from a Gaussian distribution with a standard deviation of 10 pixels; per-pixel displacements were then computed using bicubic interpolation.
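A minimal numpy-only sketch of this deformation scheme follows. Two simplifications are ours: the coarse displacement field is upsampled with bilinear (not bicubic) interpolation, and the deformed image is sampled with nearest-neighbour lookup; the function name and seed are also ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def elastic_deform(img, sigma=10.0, grid=3):
    """Smoothly deform a 2-D image using per-pixel displacements
    interpolated from a coarse grid of random vectors (sigma in pixels,
    grid size 3x3, as in the paper). Bilinear upsampling and
    nearest-neighbour sampling are simplifications for this sketch."""
    H, W = img.shape
    # random displacement vectors on a coarse grid, one field per axis
    coarse = rng.normal(0.0, sigma, size=(2, grid, grid))

    def upsample(field):
        # separable bilinear upsampling: rows first, then columns
        xs, ys = np.linspace(0, grid - 1, W), np.linspace(0, grid - 1, H)
        rows = np.array([np.interp(xs, np.arange(grid), r) for r in field])
        return np.array([np.interp(ys, np.arange(grid), c) for c in rows.T]).T

    dy, dx = upsample(coarse[0]), upsample(coarse[1])
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.rint(yy + dy), 0, H - 1).astype(int)
    src_x = np.clip(np.rint(xx + dx), 0, W - 1).astype(int)
    return img[src_y, src_x]

deformed = elastic_deform(np.arange(64 * 64, dtype=float).reshape(64, 64))
```

Because the displacement field is interpolated from only nine vectors, the resulting warp is smooth, which is what makes these deformations plausible stand-ins for natural tissue variation.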
The original paper was implemented in Caffe. Stochastic gradient descent with a batch size of 1 was used to optimize the network, and a high momentum (0.99) was used so that “a large number of the previously seen training samples determine the update in the current optimization step.” The loss function being optimized is the pixel-wise softmax over the final feature map combined with the cross-entropy loss function. Kaiming (He) initialization was used for drawing the initial weights, and dropout layers were added at the end of the contracting path.
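The pixel-wise softmax cross-entropy can be written out directly in numpy. This sketch is ours (function name and the optional `weights` argument, which anticipates the per-pixel weight map described below, are our labels):

```python
import numpy as np

def pixelwise_softmax_ce(logits, labels, weights=None):
    """Softmax over the channel axis of a (K, H, W) score map, followed
    by cross-entropy against integer labels of shape (H, W). `weights`
    is an optional (H, W) per-pixel weight map."""
    z = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    H, W = labels.shape
    # pick out the log-probability of the true class at every pixel
    nll = -log_p[labels, np.arange(H)[:, None], np.arange(W)]
    if weights is not None:
        nll = weights * nll
    return nll.mean()

# with uniform logits over 2 classes, every pixel contributes log(2)
loss = pixelwise_softmax_ce(np.zeros((2, 4, 4)), np.zeros((4, 4), dtype=int))
```

With all-zero logits over two classes the softmax is uniform, so the loss is exactly $\log 2$ per pixel, which is a convenient sanity check.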
In order to force the network to learn the small separation borders between touching cells, a large weight is placed in the loss function on the background pixels that separate them. The per-pixel weight map is computed as
$$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{\left(d_1(x) + d_2(x)\right)^2}{2\sigma^2}\right)$$
where $w_c(\cdot)$ is a weight map that balances the class frequencies, $d_1(\cdot)$ denotes the distance to the border of the nearest cell, and $d_2(\cdot)$ the distance to the border of the second nearest cell. The authors empirically chose $w_0 = 10$ and $\sigma = 5$ pixels.
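Given precomputed distance maps, the weight map is a one-line formula; the sketch below is ours (the function name is a label of ours, and computing $d_1$, $d_2$, and the class-balancing map $w_c$ from a ground-truth segmentation is assumed done elsewhere):

```python
import numpy as np

def border_weight(d1, d2, wc, w0=10.0, sigma=5.0):
    """Per-pixel loss weights: the class-balancing term wc plus a
    Gaussian bump that grows large where the nearest and second-nearest
    cell borders are both close (d1, d2 are distance maps in pixels)."""
    return wc + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))

# on a background pixel squeezed between two touching cells
# (d1 = d2 = 0), the weight is wc + w0
w_near = border_weight(np.array([0.0]), np.array([0.0]), np.array([1.0]))
# far from any border the bump vanishes and only wc remains
w_far = border_weight(np.array([50.0]), np.array([50.0]), np.array([1.0]))
```

With the paper's $w_0 = 10$, pixels in the thin gaps between cells thus receive roughly an order of magnitude more weight than ordinary pixels, which is what pushes the network to keep touching cells separated.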
This summary was written in Fall 2018 as part of the CMPT 880 Special Topics in AI: Medical Imaging Meets Machine Learning course.