Many computer vision tasks require perceiving or interacting with 3D environments and the objects in them, making a strong case for 3D deep learning. However, unlike images, which are commonly represented as regular arrays of pixels, 3D data admits multiple representations, e.g., meshes, point clouds, volumetric and boundary representations, and RGB-D images. Of these, point clouds are arguably the closest to raw sensor data, and their simplicity makes them a canonical 3D representation that is easy to convert to and from other forms. The majority of previous deep learning work on 3D data has used multi-view CNNs (projecting 3D shapes into 2D images), volumetric CNNs (applying 3D CNNs to voxelized shapes), spectral CNNs (on meshes), and fully connected networks (on feature vectors extracted from 3D data). These approaches suffer from several shortcomings, such as data sparsity, high computational cost, difficulty extending to tasks beyond shape classification and to non-isometric shapes, and limited expressiveness of the extracted features. To address these concerns, the authors propose PointNet, a deep neural network architecture that processes point clouds directly for various 3D tasks such as shape classification, part segmentation, and scene understanding, and is robust to corruption and perturbation of the input points.
Point clouds, which are sets of points in a Euclidean space, have three main properties: (a) they are unordered, (b) the points are not isolated, so neighboring points form meaningful local structures, and (c) the learned representation should be invariant to certain transformations of the point set, such as rigid rotations and translations. The PointNet architecture has three core components: (a) a symmetric function to aggregate information from all input points, (b) a module to combine local and global information, and (c) joint alignment networks to align the input points and the point features. Making the model invariant to input permutation necessitates a symmetric function, for which the authors use max pooling. The authors present a theoretical analysis proving that, given a sufficiently large dimension at the max-pooling layer, the network can approximate any continuous set function, and that as long as a certain sparse set of points (the "critical point set") is preserved, small corruptions or extra noise points in the input are unlikely to change the network output. Local and global features are aggregated by concatenating the global feature vector with each per-point feature, enabling the network to exploit both global context and local geometry. Finally, two joint alignment networks (T-Nets) predict affine transformation matrices that align the input points and the point features, with an orthogonality regularizer on the feature transformation matrix to cope with its much higher dimensionality.

The authors evaluate PointNet on ModelNet40 shape classification, ShapeNet part segmentation, and the Stanford 3D semantic parsing dataset. For shape classification, PointNet predicts $k$ scores for the $k$ candidate classes; it is the first work to directly process raw point cloud data, outperforms previous methods that rely on volumetric, image, and mesh inputs, and is within a small margin of the multi-view based method. For 3D segmentation, PointNet predicts $n \times m$ scores for $n$ points and $m$ semantic classes, and outperforms a 3D CNN baseline as well as traditional methods that leverage pointwise geometry features and shape correspondences in mean IoU on the part segmentation task. Evaluation on 3D semantic scene segmentation further shows that PointNet outperforms a baseline trained on hand-crafted features. Control experiments to ascertain each component's contribution demonstrate that (a) max pooling outperforms average pooling and an attention-based weighted sum as the symmetric function of choice, and (b) the input and (regularized) feature transformations improve classification accuracy.

Despite a relatively simple architecture, PointNet is robust to input corruption and noise: classification accuracy drops by only up to 3.8% when as little as 50% of the input points are kept, remains above 80% even with 20% outlier points, and the segmentation mIoU on simulated incomplete point clouds drops by only 5.3%. Visualizing the critical point sets reveals that they summarize the skeletons of the shapes, which explains why input perturbations affect the network performance very little: the global shape signature does not change as long as only non-critical points are lost. Finally, PointNet is much more computationally efficient and smaller in parameter count ($141\times$ fewer FLOPs and $17\times$ fewer parameters than the multi-view baseline). It also scales well, with $O(N)$ space and time complexity in the number of input points, enabling it to process more than one million points per second on a regular GPU.
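To make the pipeline described above concrete, the following is a minimal, hypothetical PyTorch sketch of a PointNet-style network: shared per-point MLPs, a max-pooling symmetric function, concatenation of per-point (local) and global features for segmentation, T-Net alignment modules, and the orthogonality regularizer $\|I - AA^T\|_F^2$ on the feature transform. The class names (`TNet`, `MiniPointNet`), layer widths, and depths are illustrative and intentionally smaller than the paper's configuration (which also uses batch normalization and dropout).

```python
# Minimal PointNet-style sketch (illustrative; simplified relative to the paper).
import torch
import torch.nn as nn


class TNet(nn.Module):
    """Predicts a k x k alignment matrix from a (B, k, N) point/feature tensor."""
    def __init__(self, k):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Conv1d(k, 64, 1), nn.ReLU(),
                                 nn.Conv1d(64, 256, 1), nn.ReLU())
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                nn.Linear(128, k * k))

    def forward(self, x):                               # x: (B, k, N)
        feat = self.mlp(x).max(dim=2).values            # symmetric max over points
        mat = self.fc(feat).view(-1, self.k, self.k)
        eye = torch.eye(self.k, device=x.device).unsqueeze(0)
        return mat + eye                                 # start close to identity


class MiniPointNet(nn.Module):
    def __init__(self, num_classes, num_part_classes):
        super().__init__()
        self.input_tnet = TNet(3)                        # aligns raw xyz input
        self.feat_tnet = TNet(64)                        # aligns per-point features
        self.mlp1 = nn.Sequential(nn.Conv1d(3, 64, 1), nn.ReLU())
        self.mlp2 = nn.Sequential(nn.Conv1d(64, 128, 1), nn.ReLU(),
                                  nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.cls_head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                                      nn.Linear(256, num_classes))
        # segmentation head consumes local (64) + global (1024) features per point
        self.seg_head = nn.Sequential(nn.Conv1d(64 + 1024, 256, 1), nn.ReLU(),
                                      nn.Conv1d(256, num_part_classes, 1))

    def forward(self, pts):                              # pts: (B, N, 3)
        x = pts.transpose(1, 2)                          # (B, 3, N)
        x = torch.bmm(self.input_tnet(x), x)             # align input points
        local = self.mlp1(x)                             # (B, 64, N) per-point features
        A = self.feat_tnet(local)                        # (B, 64, 64) feature transform
        local = torch.bmm(A, local)
        glob = self.mlp2(local).max(dim=2).values        # (B, 1024) symmetric max pool
        cls_scores = self.cls_head(glob)                 # k class scores
        expanded = glob.unsqueeze(2).expand(-1, -1, local.shape[2])
        seg_scores = self.seg_head(torch.cat([local, expanded], dim=1))  # (B, m, N)
        # orthogonality regularizer ||I - A A^T||_F^2 on the feature transform
        eye = torch.eye(A.shape[1], device=A.device)
        reg = ((eye - torch.bmm(A, A.transpose(1, 2))) ** 2).sum(dim=(1, 2)).mean()
        return cls_scores, seg_scores, reg


# Example usage (40 object classes as in ModelNet40, 50 part labels as in ShapeNet parts):
model = MiniPointNet(num_classes=40, num_part_classes=50)
cls_scores, seg_scores, reg = model(torch.rand(8, 1024, 3))  # 8 clouds, 1024 points each
```

Note that the max over the point dimension is what makes the output invariant to the ordering of the input points, and the regularizer keeps the learned $64 \times 64$ feature transform close to orthogonal, as described above.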
This paper proposes PointNet, a deep neural network architecture that consumes raw point clouds for multiple 3D recognition tasks and performs as well as, and often better than, existing approaches that pre-process point clouds into other representations. The method is simple, space- and time-efficient, and robust to input perturbations. The paper is well written and easy to follow, with strong theoretical justification and empirical results supporting PointNet's effectiveness. Certain points could have been clearer, such as the explanation of why a stable ordering does not exist in high-dimensional spaces, and the domain of $f(\cdot)$ in Eqn. 1. Also, although the paper emphasizes using the local geometry of the points, PointNet is not explicitly constrained to learn it.
This summary was written in Fall 2020 as part of the CMPT 757 Frontiers of Visual Computing course.