Deep learning has enabled massive strides in visual recognition tasks, including object detection, but most of these advances have been made in 2D object recognition. These improvements rest on a critical omission: objects in the real world exist beyond the $XY$ image plane, in 3D space. While there has also been significant progress in 3D shape understanding, the authors call attention to the need for methods that amalgamate the two tasks: i.e., approaches that (a) work in the real world, where there are far fewer constraints on object count, occlusion, illumination, etc. than in carefully curated datasets, and (b) do so without ignoring the rich 3D information present therein. They build upon the immensely popular Mask R-CNN multi-task framework and extend it with a mesh prediction branch that simultaneously learns to generate a “high-resolution triangle mesh” for each detected object. Whereas previous works on single-view shape prediction rely on post-processing or are limited in the topologies they can represent as meshes, Mesh R-CNN uses two 3D shape representations: voxels and triangle meshes, where the latter are obtained by refining the former.

To recap, Mask R-CNN is a region-based object detector built upon Faster R-CNN which, given an input image, generates a bounding box, class label, and segmentation mask for each object in the image. Mesh R-CNN adds to Mask R-CNN a mesh predictor comprising voxel and mesh refinement branches, whose architecture is similar to that of the box and mask prediction branches. This mesh predictor operates on the convolutional features extracted from an object’s bounding box and predicts its complete 3D shape as a triangle mesh. Just as Mask R-CNN uses RoIAlign to retain alignment between the RoI and the extracted features, the mesh predictor maintains correspondence between the input and the features using region-specific (RoIAlign) and vertex-specific (VertAlign) alignment operators.

Analogous to the mask prediction branch of Mask R-CNN, the voxel branch predicts coarse voxel occupancy probabilities over a 3D grid, giving the object’s shape in 3D; the camera’s intrinsic matrix is used to maintain pixelwise correspondence between the voxel grid and the image. Next, to obtain fine-grained 3D shapes, the authors propose a $\texttt{cubify}$ operation that converts the voxel predictions into a triangle mesh by replacing each occupied voxel with a cuboid triangle mesh, resulting in a “watertight mesh whose topology depends on the voxel predictions”. The mesh refinement branch then processes this ‘cubified’ mesh by refining its vertex positions. Multiple refinement stages are used, each comprising three operations: (a) vertex alignment (VertAlign) to extract an image-aligned feature for each mesh vertex, (b) graph convolutions to aggregate local information along mesh edges, and (c) vertex refinement to update the mesh geometry by moving vertex positions while keeping the mesh topology fixed. A binary cross-entropy loss between predicted and true voxel occupancies trains the voxel branch, whereas the mesh refinement branch is trained with a weighted sum of the following losses, averaged across refinement stages: (a) chamfer and normal distances, which penalize position and normal mismatches between point clouds sampled from the predicted and ground-truth meshes, and (b) an edge loss that acts as a shape regularizer, encouraging high-quality meshes and avoiding degenerate predictions.

The authors present evaluations on ShapeNet, a dataset of textured CAD models of 3D shapes, comparing Mesh R-CNN with and without the shape-regularizing edge loss (called ‘Pretty’ and ‘Best’, respectively) against a variety of methods for single-image shape prediction. All of these methods except Mesh R-CNN can only predict “connected meshes of genus zero”, i.e., shapes without any holes. The ‘Best’ variant outperforms all previous methods, indicating the superiority of the mesh predictor. Ablation studies on the ShapeNet test set and on a subset called the Holes Test Set (objects from ShapeNet with at least one hole) show that (a) Mesh R-CNN is considerably better than competing methods at predicting the topologically diverse shapes required to model holes or disconnected components, (b) using voxel predictions alone, without mesh refinement, leads to poor results, meaning that mesh refinement is what captures fine-grained detail, (c) all ‘Best’ models outperform their respective ‘Pretty’ counterparts, and (d) a refinement module based on a residual architecture is equally effective yet lightweight.
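
To make the mesh predictor pipeline described above more concrete, here is a minimal sketch of the cubify step, a single refinement stage, and the mesh losses, written against PyTorch3D (which provides `cubify`, `vert_align`, `GraphConv`, `chamfer_distance`, and `mesh_edge_loss`). The feature dimensions, loss weights, and the omission of the camera-intrinsics projection are simplifications for illustration, not the paper’s exact configuration.

```python
# Sketch of one mesh-refinement stage and its losses, assuming PyTorch3D ops.
# Shapes, loss weights, and the lack of camera projection are illustrative only.
import torch
import torch.nn as nn
from pytorch3d.ops import cubify, vert_align, GraphConv, sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_edge_loss


class RefinementStage(nn.Module):
    """One stage: VertAlign -> graph convolutions -> per-vertex offset."""

    def __init__(self, img_feat_dim=256, hidden_dim=128, num_gconvs=3):
        super().__init__()
        self.bottleneck = nn.Linear(img_feat_dim + 3, hidden_dim)
        self.gconvs = nn.ModuleList(
            [GraphConv(hidden_dim, hidden_dim) for _ in range(num_gconvs)]
        )
        self.vert_offset = nn.Linear(hidden_dim, 3)

    def forward(self, img_feats, meshes):
        # (a) VertAlign: sample image-aligned features at each vertex location.
        vert_feats = vert_align(img_feats, meshes, return_packed=True)  # (V, C)
        verts = meshes.verts_packed()                                   # (V, 3)
        feats = torch.relu(self.bottleneck(torch.cat([vert_feats, verts], dim=1)))
        # (b) Graph convolutions aggregate information along mesh edges.
        edges = meshes.edges_packed()
        for gconv in self.gconvs:
            feats = torch.relu(gconv(feats, edges))
        # (c) Vertex refinement: move vertices, keep topology fixed.
        offsets = torch.tanh(self.vert_offset(feats))
        return meshes.offset_verts(offsets)


# Usage sketch: cubify coarse voxel scores into a mesh, refine it, compute losses.
voxel_scores = torch.rand(1, 24, 24, 24)    # predicted occupancy probabilities
img_feats = torch.rand(1, 256, 14, 14)      # RoIAligned backbone features
mesh = cubify(voxel_scores, thresh=0.5)     # watertight "cubified" mesh
mesh = RefinementStage()(img_feats, mesh)

# Chamfer + normal losses on sampled point clouds, plus the edge regularizer.
pred_pts, pred_normals = sample_points_from_meshes(mesh, 2000, return_normals=True)
gt_pts, gt_normals = torch.rand(1, 2000, 3), torch.rand(1, 2000, 3)  # stand-ins for
# points/normals sampled from the ground-truth mesh.
loss_cham, loss_norm = chamfer_distance(
    pred_pts, gt_pts, x_normals=pred_normals, y_normals=gt_normals
)
loss = loss_cham + 0.1 * loss_norm + 0.2 * mesh_edge_loss(mesh)  # illustrative weights
```
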
Finally, they evaluate Mesh R-CNN on Pix3D, a much more challenging dataset with 3D models and real-world images. They generate two splits: one by randomly sampling images into train and test partitions, and another by ensuring that no 3D model is shared between the train and test partitions. Since the dataset is not exhaustively annotated, only predictions whose boxes overlap an annotated object with IoU $> 0.3$ are evaluated. The authors observe that (a) Mesh R-CNN outperforms all baselines by at least $14.9\%$ $\mathrm{AP}^{\mathrm{mesh}}$, (b) performance on the second split, which is arguably harder, is worse, highlighting the challenges of shape recognition in the wild, and (c) shape prediction degrades when only one mesh refinement stage is used.
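
The IoU filtering in that protocol is just a standard bounding-box overlap test; the short sketch below shows the kind of check it relies on. The $0.3$ threshold comes from the protocol described above, while the helper name and example boxes are illustrative.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Only predictions overlapping an annotated object (IoU > 0.3) are evaluated;
# the rest are ignored because the image may contain unannotated objects.
pred, gt = (40, 30, 200, 180), (50, 40, 210, 190)
evaluated = box_iou(pred, gt) > 0.3
```
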

This paper presents a novel approach for simultaneous visual perception in 2D and shape inference in 3D from a single image. The method is a clever extension of the Mask R-CNN architecture; the authors present theoretical justifications for their choices and validate their method with extensive experiments. Although quite dense, the paper is easy to follow. Some things that could have been made clearer are (a) more details about the baselines in the ShapeNet experiment and (b) an explanation of the VertAlign operator. Although the paper uses an edge loss to encourage high-quality meshes, no explicit shape prior is used to constrain the model, and adding such constraints would be interesting to study.

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.