Given the importance and ubiquity of indoor spaces in our everyday lives, computational models that can understand, model, and synthesize indoor scenes are of vital importance to industries such as interior design, architecture, gaming, and virtual reality. Previous work towards this goal has relied on constrained synthesis of scenes using statistical priors on object pair relationships, “human-centric relationship priors”, or constraints based on “hand-crafted interior design principles”. Moreover, owing to the difficulty of unconstrained room-scale synthesis of indoor scenes, prior work has either focused on small regions within a room or required additional inputs (a fixed set of objects, manually specified relationships, a natural language description, a sketch, or a 3D scan of the room) as constraints, while deep generative models such as GANs and VAEs struggle to produce multi-modal outputs. Driven by the success of convolutional neural networks (CNNs) in scene synthesis tasks and the availability of large 3D scene datasets, this paper proposes the first CNN-based autoregressive model to design interior spaces: given the wall structure and the type of a room, the model predicts the selection and placement of objects.

To simplify the problem, the authors propose to use an “orthographic top-down” view of the rooms, a view popular amongst architects, which leverages the fact that although the scene is 3D, most objects in a room are placed on the floor, a 2D plane. At a high level, the scene synthesis pipeline consists of (A) extracting the view representation, (B) deciding whether more objects need to be added, (C) selecting the object category and the location to place it in the scene, and (D) placing an instance of the chosen object category in the scene; steps (A) $\to$ (D) are repeated iteratively to generate predictions in an autoregressive manner. Given a 3D scene, (A) its top-down view representation is constructed by encoding pixel-wise masks for the possible object categories for that room type as channels, channel-wise concatenated with masks for depth, inside/outside the room, walls (including doors and windows), the number of objects, and the orientation and category of each object. For (B), predicting whether the model should continue adding objects to the scene, a ResNet-101 model is trained with a binary cross-entropy loss on a dataset containing equal numbers of complete rooms (positive samples) and rooms with “a random number of random objects removed” (negative samples). (C) To compute the probability of adding an object at a particular location, a pixel-wise attention mask is added to the top-down view, and auxiliary labels (‘no object centroid’, ‘existing object centroid’, and ‘outside room’) are added to account for the large unoccupied fraction of the room. In addition to the categorical cross-entropy loss, a global context loss penalizes predictions of categories that are not in the set of categories removed from the scene. Finally, (D) for placing an instance of the chosen object category, only those instances are allowed during synthesis which either come from the same collection as the object(s) already present in the room or co-occur in the same scene “with another object already in the room”. To select the orientation of the instance, a model is trained to predict probabilities for 16 equally spaced discrete orientations in $[0, 2\pi]$, and the orientation with the highest probability is chosen. If no orientation has a probability above 0.5, steps (C) and (D) are repeated to resample, up to a maximum of 20 times (see the sketch below). The entire pipeline takes an average of 4 minutes to synthesize a room.
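The following is a minimal sketch of how steps (A) through (D) could be wired together, reconstructed from the description above rather than from the authors' code. The module and helper names (`render_top_down_view`, `predict_category_location`, `choose_instance`, etc.) are hypothetical stand-ins, and the small convolutional classifier merely plays the role of the ResNet-101 continue predictor.

```python
# Hypothetical sketch of the autoregressive synthesis loop; names and interfaces are
# assumptions based on the paper summary, not the authors' actual implementation.
import math
import torch
import torch.nn as nn

NUM_ORIENTATIONS = 16   # 16 equally spaced orientation bins covering [0, 2*pi)
MAX_RESAMPLES = 20      # resampling budget for a single object insertion


class ContinuePredictor(nn.Module):
    """Stand-in for the ResNet-101 'should we keep adding objects?' classifier,
    trained with binary cross-entropy on complete vs. partially emptied rooms."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, view: torch.Tensor) -> torch.Tensor:
        # Probability that the room still needs more objects.
        return torch.sigmoid(self.net(view))


def synthesize_room(scene,
                    render_top_down_view,       # (A) scene -> (1, C, H, W) channel stack
                    continue_predictor,         # (B) a ContinuePredictor instance
                    predict_category_location,  # (C) view -> (category_id, (row, col))
                    choose_instance,            # (D) pick an instance consistent with the room
                    predict_orientation):       # view, category -> (NUM_ORIENTATIONS,) probs
    """Autoregressively add objects until the continue predictor says the room is complete."""
    while True:
        view = render_top_down_view(scene)                         # (A)
        if continue_predictor(view).item() < 0.5:                  # (B) stop criterion
            break
        for _ in range(MAX_RESAMPLES):
            category, location = predict_category_location(view)   # (C)
            instance = choose_instance(scene, category)             # (D)
            orientation_probs = predict_orientation(view, category)
            confidence, best_bin = orientation_probs.max(dim=-1)
            if confidence.item() > 0.5:                             # accept confident orientations,
                angle = best_bin.item() * 2.0 * math.pi / NUM_ORIENTATIONS
                scene.add_object(instance, location, angle)
                break                                                # otherwise resample (C) and (D)
    return scene
```

In this reading, an unconfident orientation prediction acts as a rejection signal for the whole insertion, which is what allows steps (C) and (D) to be resampled up to 20 times before moving on.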
The authors present an evaluation against a number of heuristics-based approaches, as well as a human perception study with top-performing workers recruited on Amazon Mechanical Turk. A NADE-based baseline that independently generates a set of object categories performs worse than the proposed approach in the human perceptual studies, with the NADE-based model producing outputs that are either too sparse or too crowded for certain room geometries. A comparison against a baseline of learned pairwise priors showed that human raters preferred the proposed model’s outputs, and that even in tightly packed scenes the latter produced outputs that were at worst as bad as the former’s. Finally, a comparison against human-created scenes showed that the human-created scenes were “slightly preferred” for bedrooms and living rooms, with no such preference for offices. Moreover, the model is able to generate rooms with a wide range of functionalities (the presence or absence of certain appliances, offices with different occupancy limits, etc.) even when trained on a small dataset. Failure cases reveal that although the model typically produces plausible scenes, it is “more sensitive to local scene plausibility than global plausibility”.

This paper presents the first deep CNN-based model to synthesize “semantically-enriched” top-down views of indoor scenes with only the room geometry as input. Extensive experiments demonstrate that the model is able to capture, through convolutional priors learned from the data, “common-sense” object arrangement patterns given a room type and room state. Qualitative perceptual evaluation by AMT workers showcases the high fidelity of the model outputs: the synthesized scenes are preferred over the baseline methods’ outputs and, in some cases, even over human-created scenes. The paper is written with great clarity and is easy to follow, with a comprehensive review of prior work in this field and sufficient detail about the design choices of the model as well as the evaluation baselines. Although not exactly a shortcoming, one thing that stands out about the evaluation is the small fraction of the dataset reserved for validation and testing: compared to over 11k images for training, only 250 images are held out for validation and testing combined. This would have been problematic had quantitative evaluation metrics been reported, since results on such a small partition might not be statistically significant. The use of the global context loss is also not well motivated, since the model should arguably be permitted to explore placing all possible objects in the scene instead of being penalized for predictions that deviate from the set of removed objects (see the sketch below for one reading of this loss). This could lead to more diverse outputs, although at the risk of potentially unrealistic object combinations, which could in turn be handled by high-level hand-crafted object relationship priors.
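For reference, the following is one plausible reading of the global context loss being critiqued here, sketched as an extra penalty on top of the categorical cross-entropy; the exact formulation and weighting in the paper may differ, and the function and argument names are assumptions made for illustration.

```python
# Illustrative sketch of a "global context" penalty as described in the summary:
# probability mass assigned to categories that were *not* removed from the training
# scene is penalized in addition to the usual cross-entropy. Names are assumptions.
import torch
import torch.nn.functional as F


def category_loss(logits, target, removed_categories, context_weight=1.0):
    """logits: (B, C) category scores; target: (B,) ground-truth category indices;
    removed_categories: indices of the categories actually removed from the scene."""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=-1)
    allowed = torch.zeros(logits.shape[-1], dtype=torch.bool)
    allowed[removed_categories] = True
    # Penalize the total probability assigned to categories outside the removed set.
    context_penalty = probs[:, ~allowed].sum(dim=-1).mean()
    return ce + context_weight * context_penalty
```

Under this reading, the concern raised above is that the penalty suppresses otherwise sensible categories simply because they did not happen to be removed from that particular training scene.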

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.