The instance segmentation task in computer vision involves labeling each pixel in an image with a class and an instance label. It can be thought of as a generalization of semantic segmentation, since it requires correctly detecting all the objects in an image while also precisely segmenting each instance. As such, it is a dense prediction task that combines elements from two popular computer vision tasks: semantic segmentation (pixelwise labeling without differentiating between instances) and object detection (localizing objects with bounding boxes). This means instance segmentation inherits the challenges of both parent tasks, such as the difficulty of segmenting small objects and overlapping instances. Recent advances in instance segmentation, driven primarily by the success of R-CNN, have relied upon sequential (cascaded) prediction of segmentation and classification labels. This paper, in contrast, proposes Mask R-CNN, a multi-task prediction architecture that simultaneously detects objects, classifies them, and delineates their fine boundaries within the detected bounding boxes. Mask R-CNN builds upon the massively popular Faster R-CNN model, which was not designed for “pixel-to-pixel alignment between network inputs and outputs”, by adding a mask prediction branch that runs in parallel with the existing classification and bounding box regression branches.

To recap, the Faster R-CNN model comprises two components: a region proposal network (RPN), which produces candidate object bounding box proposals, and a second stage that is essentially the popular Fast R-CNN model. Fast R-CNN uses RoIPool (quantizing the RoI coordinates to the discrete granularity of the feature map, dividing the RoI into quantized spatial bins, and pooling to obtain the final feature values) to extract features from each candidate bounding box, and uses these features for the classification and bounding box regression tasks. Mask R-CNN retains this two-stage architecture with an identical RPN as the first stage, and expands the second stage by adding a mask prediction branch alongside the existing branches, which predicts a binary segmentation mask for each RoI. Accordingly, the loss function is updated to incorporate the new branch: the multi-task loss $L = L_{cls} + L_{box} + L_{mask}$ combines a classification loss, a bounding box regression loss, and a new average binary cross entropy loss computed over the predicted binary mask, with one mask predicted per class. For each RoI, this mask loss is computed only on the mask corresponding to the RoI's ground-truth class; at inference time, the output of the classification branch selects the final output mask, enabling the network to decouple class and segmentation mask prediction. Another important distinction is that Mask R-CNN applies a per-pixel sigmoid instead of a softmax across classes, allowing the network to generate masks for all classes without competition among them (a minimal sketch of this loss appears after this recap).

The quantization in RoIPool introduces misalignments between the RoI and the extracted features, which is harmless for classification but undesirable for predicting pixel-accurate segmentation masks. To preserve alignment, the authors introduce a new layer called RoIAlign, which does away with any quantization of the RoI boundaries or bins, and instead uses bilinear interpolation to compute exact feature values at four regularly spaced sampling locations in each RoI bin, followed by pooling (max or average) to aggregate the results (also sketched after this recap).

For their experiments, the authors use as a backbone (the convolutional architecture that extracts features from the input image) either residual networks (ResNet or ResNeXt) or a feature pyramid network (FPN), a top-down architecture with lateral connections that builds an in-network feature pyramid from a single-scale input. They present their results on the COCO dataset and report the average precision (AP) averaged over IoU thresholds, as well as AP at different object scales. All variants of Mask R-CNN surpass the baselines set by previous state-of-the-art models, including the heavily engineered winning entry of the COCO 2016 challenge, while adding only a small computational overhead ($\sim 20\%$) on top of Faster R-CNN. Next, the authors present a vast range of ablation studies to assess the importance of each component of Mask R-CNN, and conclude that the following choices yield superior performance: an FPN backbone over plain ResNet (and ResNeXt) backbones, sigmoid over softmax activation for segmentation prediction, RoIAlign over RoIPool and RoIWarp, and a fully convolutional network (FCN) over a multi-layer perceptron for segmentation prediction. Finally, the authors demonstrate the generalizability of Mask R-CNN by repurposing it for human pose estimation, where the model predicts a mask for each keypoint type (e.g., left/right shoulder, left/right elbow), and this outperforms the winner of the COCO 2016 keypoint detection challenge despite being a simpler and faster method.
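To make the mask loss concrete, here is a minimal sketch, assuming the mask branch outputs an $m \times m$ logit map per class; the function names and NumPy implementation are my own illustration, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(mask_logits, gt_class, gt_mask, eps=1e-7):
    """Average binary cross entropy over one RoI's mask.

    mask_logits: (num_classes, m, m) raw outputs of the mask branch.
    gt_class:    int, ground-truth class label of this RoI.
    gt_mask:     (m, m) binary ground-truth mask for this RoI.
    """
    # Per-pixel sigmoid: each class's mask is predicted independently,
    # so classes do not compete (unlike a per-pixel softmax would force).
    probs = sigmoid(mask_logits[gt_class])
    # The loss is computed only on the ground-truth class's mask; the
    # other class masks contribute nothing to the loss for this RoI.
    bce = -(gt_mask * np.log(probs + eps)
            + (1 - gt_mask) * np.log(1 - probs + eps))
    return bce.mean()
```

At inference time, the class predicted by the classification branch selects which of the per-class masks is emitted, which is what decouples mask prediction from classification.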
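Similarly, the following sketch contrasts RoIAlign's bilinear sampling with RoIPool's quantization for a single RoI bin; again, this is an illustrative NumPy implementation under assumed names, not the paper's code:

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def roi_align_bin(feat, y0, x0, bin_h, bin_w, samples=2):
    """RoIAlign for one bin: average `samples x samples` (here 2 x 2 = four)
    bilinear samples at regularly spaced points; no coordinate rounding."""
    vals = [bilinear(feat,
                     y0 + (i + 0.5) * bin_h / samples,
                     x0 + (j + 0.5) * bin_w / samples)
            for i in range(samples) for j in range(samples)]
    return np.mean(vals)

def roi_pool_bin(feat, y0, x0, bin_h, bin_w):
    """RoIPool for one bin: round the bin to integer coordinates first
    (this rounding is the source of the misalignment), then max-pool."""
    ys = int(round(y0))
    xs = int(round(x0))
    ye = max(int(round(y0 + bin_h)), ys + 1)
    xe = max(int(round(x0 + bin_w)), xs + 1)
    return feat[ys:ye, xs:xe].max()
```

RoIPool's rounding can shift the pooled features by a fraction of the feature stride, which barely affects classification but visibly degrades the predicted masks; RoIAlign removes the rounding entirely, which the ablations show matters most at large feature strides.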

This paper presents Mask R-CNN, an extension of the hugely popular object detection framework Faster R-CNN that simultaneously predicts exact object boundaries along with the bounding boxes and class labels of objects in an image. The model replaces the RoIPool component of Faster R-CNN with an RoIAlign operation that introduces no misalignments, and the model as a whole benefits from the multi-task training of its classification, detection, and mask branches. The paper proposes a very simple and intuitive solution for obtaining segmentation masks, and the extensive experiments present compelling evidence of its superiority. A recent work has revisited FCN architectures to generalize them for instance segmentation tasks [1].

[1] Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for instance segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), August 2020.

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.