Owing to the immense popularity of ray tracing and path tracing rendering algorithms for visual effects, there has been a surge of interest in filtering and reconstruction methods that deal with the noise present in these Monte Carlo renderings. While much prior work assumes large sampling rates (up to thousands of samples per pixel before filtering), even the fastest ray tracers are limited to a few rays per pixel at interactive rates, and such low sampling budgets will remain realistic for the foreseeable future. This paper proposes a learning-based approach for reconstructing global illumination from very low sampling budgets (as low as 1 sample per pixel, or spp) at interactive rates. At 1 spp, the Monte Carlo estimate of indirect illumination is extremely noisy, so the problem is better framed as reconstruction than as denoising. Previous offline and interactive denoisers for Monte Carlo rendering suffer from a trade-off between speed and quality, require user-defined parameters, and scale poorly to large scenes. Inspired by the progress in single-image restoration (denoising) using deep learning, the authors propose a deep-learning-based approach that leverages an encoder-decoder architecture with recurrent connections for improved temporal consistency. The proposed model requires no user guidance, is end-to-end trainable, and is able to exploit auxiliary per-pixel features for improved performance.
The noisy images used as input to the deep neural network are generated with an interactive path tracer. The GPU rasterizes visible surfaces and stores their shading attributes in a geometry buffer (G-buffer), after which an NVIDIA GPU-based ray tracer traces the light paths. The authors use “next event estimation” to improve convergence. For each pixel, the path tracer generates one direct (camera $\to$ surface $\to$ light) and one indirect (camera $\to$ surface $\to$ surface $\to$ light) lighting path. The deep neural network, a denoising autoencoder similar in structure to FlowNet and U-Net, consists of an encoder that progressively subsamples the input and a decoder that upsamples it back to full resolution. This speeds up execution considerably and increases the receptive field of the deeper layers, allowing information to be aggregated from larger neighborhoods. Skip connections from the encoder’s layers to the corresponding layers in the decoder help preserve detail that would otherwise be lost to subsampling. Since temporal consistency is desirable, the authors place fully convolutional recurrent blocks after every encoder stage (a minimal sketch of such a block is given below); this mitigates temporal flickering and yields smoother, more temporally coherent results across frames. The network receives 7 values per pixel: the noisy RGB intensities, the surface normal, depth, and roughness. It is trained with a weighted combination of three loss functions (also sketched below): (a) a spatial $\mathcal{L}_1$ loss that penalizes per-pixel prediction error, (b) a gradient-domain $\mathcal{L}_1$ loss that penalizes the loss of high-frequency detail, and (c) a temporal $\mathcal{L}_1$ loss that penalizes temporal incoherence. The network is fed sequences of 7 consecutive frames, and the per-frame losses are Gaussian-weighted so that later frames are penalized more than earlier ones. The fully convolutional architecture permits training on small patches ($128\times128$ in this case) while allowing evaluation on sequences of arbitrary resolution and length. The network is trained on 3 datasets (SponzaDiffuse, SponzaGlossy, and Classroom) with extensive data augmentation, including rotations, random crops, changes in camera stop, and forward and reverse playback. As a post-processing step, the model applies pixel-scale temporal anti-aliasing to further reduce flickering in the output.

The proposed model outperforms all previous methods on several datasets in terms of RMSE and SSIM. It also exhibits excellent generalization by performing well on previously unseen datasets, despite new and different lighting setups, camera fly-throughs, viewpoints, and materials (such as the high-frequency detail present in foliage). Finally, although the model is trained on 1 spp inputs, it yields superior results on inputs with as many as 256 spp, outperforming all non-offline methods and further demonstrating the generalizability of the approach.
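To make the recurrent mechanism concrete, here is a minimal PyTorch-style sketch of one fully convolutional recurrent block. The layer counts, kernel sizes, and activation below are assumptions for illustration rather than the authors’ published configuration; the key idea shown is that the hidden state is itself a feature map, carried over from the previous frame, concatenated with the current features, and fused by convolutions.

```python
# Sketch of a fully convolutional recurrent block (assumed structure,
# not the paper's exact block): the hidden state from the previous frame
# is concatenated with the current features and fused by convolutions.
import torch
import torch.nn as nn

class RecurrentConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.input_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.fuse_conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x, hidden=None):
        # hidden is None for the first frame of a sequence.
        if hidden is None:
            hidden = torch.zeros_like(x)
        x = self.act(self.input_conv(x))
        h = self.act(self.fuse_conv(torch.cat([x, hidden], dim=1)))
        h = self.act(self.out_conv(h))
        # Return the features passed to the next encoder stage and the new hidden state.
        return h, h
```

Frames of a sequence are processed one at a time, with each encoder stage keeping its own hidden state, so temporal information is aggregated at every scale without explicit motion vectors.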
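The combined training objective can be sketched in the same spirit. The finite-difference gradients, the specific loss weights, and the exact shape of the Gaussian frame weighting below are illustrative assumptions; the sketch only shows how the spatial, gradient-domain, and temporal $\mathcal{L}_1$ terms are weighted and accumulated over a 7-frame sequence.

```python
# Sketch of the three-term loss over a frame sequence (illustrative weights
# and finite-difference gradients; not the authors' exact formulation).
import torch

def l1(a, b):
    return (a - b).abs().mean()

def gradient_l1(pred, target):
    # L1 on horizontal and vertical finite differences to preserve high-frequency detail.
    dx = lambda img: img[..., :, 1:] - img[..., :, :-1]
    dy = lambda img: img[..., 1:, :] - img[..., :-1, :]
    return l1(dx(pred), dx(target)) + l1(dy(pred), dy(target))

def sequence_loss(preds, targets, w_s=0.8, w_g=0.1, w_t=0.1):
    """preds, targets: lists of 7 frames, each a (B, 3, H, W) tensor."""
    n = len(preds)
    # Gaussian-shaped weights that grow toward the last frame, so later frames
    # (where the recurrent state has warmed up) are penalized more heavily.
    idx = torch.arange(n, dtype=torch.float32)
    frame_w = torch.exp(-0.5 * ((idx - (n - 1)) / 2.0) ** 2)
    total = torch.zeros(())
    for i in range(n):
        loss_i = w_s * l1(preds[i], targets[i]) + w_g * gradient_l1(preds[i], targets[i])
        if i > 0:
            # Temporal term: match the frame-to-frame change of the ground truth.
            loss_i = loss_i + w_t * l1(preds[i] - preds[i - 1],
                                       targets[i] - targets[i - 1])
        total = total + frame_w[i] * loss_i
    return total / frame_w.sum()
```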
This is the first work to use a recurrent denoising autoencoder for light transport reconstruction, producing “noise-free and temporally coherent animation sequences with global illumination”. The fully convolutional, end-to-end trainable model removes the need for any user guidance and enables inference on sequences of arbitrary length and resolution. Inference with this model is considerably faster than all but one of the competing methods, and could potentially reach real-time speeds with custom hardware accelerators. The paper is well written, with a detailed description of the model architecture, the training and implementation details, and a comprehensive literature survey, making it an enjoyable read. As the authors note, a useful direction for future work would be to incorporate lens and time data as inputs so the model can also handle motion blur and depth of field.
This summary was written in Fall 2020 as part of the CMPT 757 Frontiers of Visual Computing course.