Polygonal meshes are widely used in computer graphics, robotics, and game development to represent virtual objects and scenes. Existing learning-based methods for 3D object generation have relied on template models and parametric shape families. Progress with deep-learning-based approaches has also been limited because meshes are challenging for deep networks to work with, and recent works have therefore used alternative representations of object shape, such as voxels, point clouds, occupancy functions, and surfaces. These works, however, leave mesh reconstruction as a post-processing step, leading to inconsistent mesh quality. Drawing inspiration from the success of neural autoregressive models applied to sequential raw data (e.g., images, text, and raw audio waveforms), and building on previously proposed components (e.g., Transformers and pointer networks), this paper presents PolyGen, a neural autoregressive generative model for 3D meshes.

Since a mesh can be defined as a collection of 3D vertices and polygonal faces, PolyGen consists of two parts: a vertex model and a face model. The vertex model unconditionally models the 3D mesh vertices, and the face model models the mesh faces conditioned on the input vertices. Both models are autoregressive, and the joint distribution over vertices and faces is factored into a product of conditional distributions, $p(\text{mesh}) = p(\text{faces} \mid \text{vertices})\, p(\text{vertices})$. While 3D meshes are typically represented using collections of triangles, a more compact and consistent representation can be achieved by using polygons of various sizes, also called n-gons (obtained using Blender). Although n-gons might potentially be problematic when their vertices are not planar, the authors find that most of the n-gons produced by the model are either planar or close to planar. Moreover, since triangle meshes are a subset of n-gon meshes, the model can be restricted to triangle meshes if needed.

Owing to its ability to aggregate non-local information, a Transformer decoder is used for the vertex model. The mesh vertices are first quantized with 8-bit uniform quantization, and the model receives as input a flattened sequence of vertex coordinates $(z_i, y_i, x_i)$ followed by a stopping token to handle variable-length sequences. For each input token, three embeddings are used: a coordinate embedding (whether the token is a z, y, or x coordinate), a position embedding (which vertex in the sequence the token belongs to), and a learned value embedding for the quantized coordinate value. The model is trained to maximize the log-probability of the observed data with respect to the model parameters. However, the quadratic cost of attention in Transformers can be problematic for large meshes, and to address this the authors evaluate three variants: a mixture of discretized logistics, a MADE-based decoder, and a Transformer decoder with different vertex embedding schemes.

For the face model, the faces are ordered by their lowest vertex index, flattened, and concatenated with a stopping token. "Contextual embeddings" of the input vertices, obtained with a Transformer encoder, are used in a pointer network: a Transformer decoder outputs a pointer (a distribution over the input vertices) at each step. Using a Transformer encoder is advantageous because it allows "bi-directional information aggregation", in contrast to the LSTM used in the original pointer networks. The decoder is conditioned on the input vertices either through dynamic face embeddings or through cross-attention into the vertex embedding sequence. Masking the predicted logits to avoid invalid predictions led to worse performance, so training was done without masking. Conditioning the mesh generation on a context is done either through (a) learned class embeddings for global features such as class identity, or (b) embeddings obtained from a domain-appropriate encoder: a convolutional neural network with pre-activation residual blocks, using 2D convolutions for images and 3D convolutions for voxels.
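To make the sequence construction and the pointer mechanism described above more concrete, here is a minimal NumPy sketch. The function names, the choice of stopping-token id, the sorting details, and the plain dot-product scoring are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def flatten_vertex_tokens(vertices, n_bits=8):
    """Quantize and flatten mesh vertices into a 1-D token sequence.

    vertices : (V, 3) float array with columns (x, y, z), assumed here to be
               normalized to [0, 1] (a simplifying assumption for this sketch).
    Returns an integer sequence of (z, y, x) triples followed by a stopping token.
    """
    levels = 2 ** n_bits                                    # 8-bit quantization -> 256 bins
    q = np.clip(np.round(vertices * (levels - 1)), 0, levels - 1).astype(np.int64)

    # Sort vertices by z, then y, then x so the sequence has a canonical order.
    order = np.lexsort((q[:, 0], q[:, 1], q[:, 2]))
    q = q[order]

    tokens = q[:, ::-1].reshape(-1)                         # emit coordinates as z, y, x
    stop_token = levels                                     # one id beyond the value range (assumed)
    return np.concatenate([tokens, [stop_token]])


def pointer_distribution(decoder_state, vertex_embeddings):
    """One pointer-network step: score each input vertex against the decoder state.

    decoder_state     : (d,) output of the face decoder at the current step.
    vertex_embeddings : (V, d) contextual vertex embeddings from the encoder.
    Returns a categorical distribution over the V input vertices, i.e. the next
    face index is chosen by "pointing" at a vertex rather than from a fixed vocabulary.
    """
    logits = vertex_embeddings @ decoder_state              # dot-product comparison
    logits = logits - logits.max()                          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()                              # softmax over input vertices
```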
For evaluation, the training set is filtered to discard any meshes with more than 800 vertices or more than 2800 face indices after pre-processing. The unconditional modeling performance of PolyGen is better than that of baselines which assign uniform probability (a) to the entire data domain or (b) to the region of valid predictions, and (c) the output of a mesh compression library. The model is found to be adept at assigning low probability to invalid regions, and, surprisingly, the use of cross-attention in the face model led to degraded performance, which the authors attribute to overfitting. Moreover, the shape statistics of the model's outputs follow a distribution similar to the true data distribution, indicating that the generated meshes largely resemble human-designed meshes. For conditional modeling, the best performance is achieved with voxel conditioning; the authors argue that this is because, unlike images, which can provide an ambiguous view depending on the lighting and the pose of the object, voxels characterize the coarse shape of the object unambiguously. Additionally, the authors observe that (a) conditional models perform worse than unconditional ones because the "conditioning context provides relatively little additional information", and (b) global average pooling in the image- and voxel-conditional models degrades the vertex model's performance but improves the face model's. Finally, a variant of PolyGen evaluated with symmetric Chamfer distance as the reconstruction metric performs comparably to AtlasNet, although the single-prediction performance of the latter is better than that of PolyGen, which the authors explain by the fact that AtlasNet is trained to optimize the evaluation metric directly.
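For reference, below is a minimal NumPy sketch of a symmetric Chamfer distance between two point sets sampled from the predicted and ground-truth surfaces; the surface-sampling step and the exact averaging/summation conventions are assumptions and may differ from the paper's evaluation protocol.

```python
import numpy as np

def symmetric_chamfer(points_a, points_b):
    """Symmetric Chamfer distance between two point clouds.

    points_a : (N, 3) points sampled from one mesh surface.
    points_b : (M, 3) points sampled from the other mesh surface.
    For each point, take the squared distance to its nearest neighbour in the
    other set; average within each direction and sum the two directions.
    """
    # Pairwise squared distances, shape (N, M). Fine for small point sets;
    # large sets would need a KD-tree or batching.
    d2 = np.sum((points_a[:, None, :] - points_b[None, :, :]) ** 2, axis=-1)
    a_to_b = d2.min(axis=1).mean()   # predicted -> reference
    b_to_a = d2.min(axis=0).mean()   # reference -> predicted
    return a_to_b + b_to_a
```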

This paper proposed PolyGen, a neural autoregressive generative model for synthesizing 3D meshes represented as n-gons. Experiments suggest that the model captures shapes accurately and, more importantly, that the shape statistics of the generated outputs are similar to those of human-designed meshes. Although the paper is well written and easy to understand for the most part, the explanation of "Mesh pointer networks" in Section 2.3 is dense and difficult to follow. It was also strange to see such a large portion (92.5%) of the data used for training, leaving only 5% for testing, which might not be sufficient to thoroughly assess the model's generalization performance. The Abstract claims that PolyGen can "produce samples that capture uncertainty in ambiguous scenarios", but the notion of uncertainty is not discussed anywhere in the paper. If all the authors meant was that the outputs are probabilistic, then there is no need to invoke the concept of uncertainty, since the output is always going to be an $\texttt{argmax()}$ over the logits.

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.