The semantic segmentation task in computer vision involves partitioning an image into a set of non-overlapping, semantically interpretable regions. This entails assigning a class label to every pixel, making it a dense prediction task. Owing to the massive improvements in image classification performance achieved by CNNs in recent years, several works have successfully repurposed popular image classification CNN architectures for dense prediction tasks. This paper questions that approach and instead investigates whether modules specifically designed for dense prediction can improve segmentation performance even further. Unlike image classification networks, which aggregate multi-scale contextual information through successive downsampling operations to obtain a global prediction, a dense prediction task like semantic segmentation requires “multi-scale contextual reasoning in combination with full-resolution output”. However, increasing the receptive field of the convolution operator comes at the cost of more parameters, and the authors propose the dilated convolution operator to address this. To this end, the paper makes threefold contributions: (a) a generalized form of the convolution operator that accounts for dilation, (b) a multi-scale context aggregation module built on dilated convolutions, and (c) a simplified front-end module that removes “vestigial components” carried over from image classification networks.
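To make the receptive-field-versus-parameters trade-off concrete, here is a small sketch (a hypothetical helper, not code from the paper) that computes the per-axis receptive field of a stack of $3 \times 3$ convolutions whose dilation rate doubles at each layer:

```python
def receptive_field(num_layers, kernel=3):
    """Per-axis receptive field of `num_layers` stacked convolutions
    with dilation 2**i at layer i. Each layer widens the field by
    (kernel - 1) * dilation, so the field grows exponentially with
    depth while the parameter count (kernel**2 weights per layer)
    grows only linearly."""
    rf = 1
    for i in range(num_layers):
        rf += (kernel - 1) * 2 ** i
    return rf

print([receptive_field(n) for n in range(1, 6)])  # [3, 7, 15, 31, 63]
```

With five such layers the receptive field is already 63 pixels wide, yet only five kernels' worth of weights were added.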

A dilated convolution with dilation factor $l$ applies its kernel to input samples spaced $l$ apart; equivalently, it is a standard convolution whose kernel taps are separated by $l - 1$ zeros. The multi-scale context aggregation module has seven convolutional layers: the first six apply $3 \times 3$ convolutions with exponentially increasing dilation rates, and the last is a $1 \times 1$ convolution. This lets the receptive field grow exponentially with depth while the parameter count grows only linearly. The layers' weights use an identity initialization scheme, in which the initial kernel weights are set so that each layer simply passes its input on to the next. Finally, the front-end module is a simplified version of the VGG-16 architecture: the last two pooling and striding layers are removed, as is the padding of intermediate feature maps, and for every removed pooling operation the subsequent convolutions are dilated by a factor of 2.

The front-end module is trained on the training set of the PASCAL VOC 2012 dataset (VOC hereafter), augmented with annotations from an external source. On the VOC test set, it outperforms both FCN-8s and DeepLab by over $5\%$ in mean IoU, and also beats modified versions of DeepLab that use multi-scale inputs and dense CRFs for structured prediction. Training the front-end on a combination of VOC and the Microsoft COCO dataset improves accuracy further. Next, two versions of the context aggregation module (basic and large; the large version has more channels in its deeper layers) are attached to this front-end, and ablation studies are conducted with and without these modules, both alone and in combination with structured prediction models (CRFs and CRF-RNNs). The results show that the context module always improves performance, and the larger module more so.
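The two core ideas above can be sketched in a few lines of NumPy (an assumed minimal illustration, not the authors' implementation): a 1-D dilated convolution whose taps read inputs spaced `dilation` apart, and an identity kernel of the kind the initialization scheme targets, which forwards its input unchanged.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    # Valid-mode cross-correlation (the deep-learning "convolution"):
    # kernel taps are applied to inputs spaced `dilation` apart, so the
    # effective footprint is (len(w) - 1) * dilation + 1 samples.
    span = (len(w) - 1) * dilation + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(len(w)))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(8.0)

# Dilation 2: the 3-tap kernel [0, 1, 0] now picks out x[i + 2].
assert np.allclose(dilated_conv1d(x, np.array([0.0, 1.0, 0.0]), dilation=2), x[2:6])

# Identity initialization: a kernel whose center tap is 1 and whose other
# taps are 0 simply forwards its input, so a freshly initialized context
# module initially passes the front-end's prediction through untouched.
identity = np.array([0.0, 1.0, 0.0])
assert np.allclose(dilated_conv1d(np.pad(x, 1), identity), x)
```

Stacking such layers with dilations 1, 2, 4, ... yields the exponential receptive-field growth the module exploits, and starting from identity kernels means training can only improve on the front-end's output.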
Finally, a comparison with other SOTA models shows that while the proposed large context module with CRF-RNN outperforms all competing models, the large context module alone already outperforms a DeepLab model with a CRF.

This paper presents an elegant solution to the problem of global context aggregation and supports it with extensive evaluations and ablation studies on a popular benchmark. Some details could have been explained more clearly, such as the identity initialization scheme and how the front-end module produces full-resolution segmentation predictions without CRFs. A drawback of this approach, as the authors themselves note, is its reliance on pretraining certain components.

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.