Interactive Reconstruction of Monte Carlo Image Sequences using a Recurrent Denoising Autoencoder

Owing to the immense popularity of ray-tracing and path-tracing rendering algorithms for visual effects, there has been a surge of interest in developing filtering and reconstruction methods to deal with the noise present in these Monte Carlo renderings. Although offline renderers can afford large sampling rates (up to thousands of samples per pixel before filtering), even the fastest ray tracers are limited to a few rays per pixel at interactive rates, and such low sampling budgets will remain realistic for the foreseeable future. This paper proposes a learning-based approach for reconstruction of global illumination with very low sampling budgets (as low as 1 spp) at interactive rates. At 1 sample per pixel (spp), the Monte Carlo integration of indirect illumination results in very noisy images, and the problem can therefore be framed as reconstruction instead of denoising. Previous works on offline and interactive denoising for Monte Carlo rendering suffer from a trade-off between speed and performance, require user-defined parameters, and scale poorly to large scenes. Inspired by the progress in single image restoration (denoising) using deep learning, the authors propose a deep learning-based approach which leverages an encoder-decoder architecture and recurrent connections for improved temporal consistency. The proposed model requires no user guidance, is end-to-end trainable, and is able to exploit auxiliary pixel features for improved performance. ...
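To see why a 1 spp budget calls for reconstruction rather than mere denoising, a toy Monte Carlo estimator illustrates how noisy single-sample estimates are. The integrand below is a hypothetical stand-in for the rendering integral, not anything from the paper:

```python
import numpy as np

# Toy illustration (not the paper's renderer): Monte Carlo estimates of a
# simple known integral, one estimate per "pixel", showing how noisy a
# single-sample (1 spp) estimate is compared to a 64-sample one.
rng = np.random.default_rng(0)

def mc_estimate(n_samples, n_pixels=10_000):
    """Return one Monte Carlo estimate per pixel using n_samples each.

    Estimates the (stand-in) integral of 3x^2 over [0, 1], whose true
    value is 1, by averaging 3u^2 over uniform samples u.
    """
    u = rng.uniform(size=(n_pixels, n_samples))
    return (3.0 * u**2).mean(axis=1)

noisy_1spp = mc_estimate(1)    # 1 sample per pixel: very noisy
clean_64spp = mc_estimate(64)  # 64 spp: error shrinks as 1/sqrt(n)

print(noisy_1spp.std())   # roughly 8x larger than the 64 spp spread
print(clean_64spp.std())
```

Per-pixel error shrinks only as $1/\sqrt{n}$, so at 1 spp the per-pixel estimates are far from the true value, which is why the paper treats the task as reconstruction of the image rather than removal of mild noise.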

October 12, 2020 · 4 min · Kumar Abhishek

Mesh R-CNN

Although deep learning has enabled massive strides in visual recognition tasks including object detection, most of these advances have been made in 2D object recognition. These improvements, however, rest on a critical omission: objects in the real world exist in 3D space, not just on the $XY$ image plane. While there has also been significant progress in 3D shape understanding tasks, the authors call attention to the need for methods that unify these two lines of work, i.e., approaches which (a) can work in the real world where there are far fewer constraints (as compared to carefully curated datasets) on object count, occlusion, illumination, etc., and (b) can do so without ignoring the rich 3D information present therein. They build upon the immensely popular Mask R-CNN multi-task framework and extend it by adding a mesh prediction branch that simultaneously learns to generate a “high-resolution triangle mesh” for each detected object. Whereas previous works on single-view shape prediction rely on post-processing or are limited in the mesh topologies that they can represent, Mesh R-CNN uses multiple 3D shape representations: 3D voxels and 3D meshes, where the latter is obtained by refining the former. ...
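As rough intuition for the voxel-to-mesh step, the following hypothetical sketch (not Mesh R-CNN's actual "cubify" operator) emits the corner vertices of a unit cube for every occupied voxel of a binary grid; face emission and vertex merging are omitted for brevity:

```python
import numpy as np

# Hypothetical, heavily simplified voxel-to-mesh conversion (not Mesh
# R-CNN's implementation): each occupied voxel of a binary occupancy grid
# contributes the 8 corner vertices of a translated unit cube.
CUBE_CORNERS = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)])

def voxels_to_cube_vertices(occupancy):
    """occupancy: (D, H, W) boolean grid -> ((8 * n, 3) vertices, n cubes)."""
    coords = np.argwhere(occupancy)  # integer coords of occupied voxels
    # Translate a copy of the unit cube to each occupied voxel.
    verts = (coords[:, None, :] + CUBE_CORNERS[None, :, :]).reshape(-1, 3)
    return verts, len(coords)

grid = np.zeros((4, 4, 4), dtype=bool)
grid[1:3, 1:3, 1:3] = True  # a 2x2x2 block of 8 occupied voxels
verts, n_cubes = voxels_to_cube_vertices(grid)
print(verts.shape, n_cubes)  # (64, 3) vertices from 8 cubes
```

A mesh obtained this way is blocky and topology-limited, which motivates the paper's subsequent refinement of the voxel-derived mesh with graph-convolution stages.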

October 5, 2020 · 5 min · Kumar Abhishek

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Several computer vision tasks require perceiving or interacting with 3D environments and objects therein, making a strong case in favor of 3D deep learning. However, unlike images, which are most popularly structured as arrays of pixels, there are multiple 3D representations, e.g., meshes, point clouds, volumetric and boundary representations, RGB-D representations, etc. Of these, point clouds are arguably the closest representation to raw sensor data, and their simplicity makes them a canonical 3D representation, meaning it is easy to convert them to and from other representation forms. The majority of previous work on applying deep learning to 3D data has used 2D CNNs (by projecting the point clouds into 2D images), volumetric CNNs (by applying 3D CNNs on voxelized shapes), spectral CNNs (on meshes), and fully connected networks (by extracting feature vectors from 3D data). These approaches suffer from several shortcomings, such as challenges with data sparsity, high computational cost, inability to extend to tasks beyond shape classification and to non-isometric shapes, and limited expressiveness of extracted features. To address these concerns, the authors propose PointNet, a deep neural network architecture which is able to process point clouds directly for various 3D tasks such as shape classification, part segmentation, and scene understanding, and is also robust to corruption and perturbation of the input points. ...
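PointNet's core trick for consuming an unordered point set directly is a shared per-point MLP followed by a symmetric max-pooling function. A minimal sketch with made-up weights (the real network is far deeper and adds input/feature transforms) shows the resulting permutation invariance:

```python
import numpy as np

# Minimal sketch of PointNet's core idea: apply the same MLP to every
# point independently, then aggregate with max-pooling, a symmetric
# function, so the result is invariant to the ordering of the points.
# Weights are random placeholders, not trained parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)  # shared across all points

def global_feature(points):
    """points: (N, 3) -> (16,) order-invariant global feature."""
    h = np.maximum(points @ W1 + b1, 0.0)  # shared per-point MLP (one ReLU layer)
    return h.max(axis=0)                   # symmetric max-pool over points

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(128)]

# Reordering the points does not change the global feature.
assert np.allclose(global_feature(cloud), global_feature(shuffled))
```

The same max-pooled feature also explains the robustness the summary mentions: dropping points that never attain the per-dimension maximum leaves the global feature unchanged.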

October 5, 2020 · 5 min · Kumar Abhishek

Mask R-CNN

The instance segmentation task in computer vision involves labeling each pixel in an image with a class and an instance label. It can be thought of as a generalization of the semantic segmentation task, since it requires segmenting all the objects in the image while also distinguishing between individual instances. As such, it is a dense prediction task which combines elements from two popular computer vision tasks: semantic segmentation (pixelwise labeling without differentiating between instances) and object detection (detection using bounding boxes). This makes the instance segmentation task vulnerable to challenges from both parent tasks, such as difficulty segmenting small objects and overlapping instances. Recent advances in instance segmentation, driven primarily by the success of R-CNN, have relied upon sequential (cascaded) prediction of segmentation and classification labels. This paper, on the other hand, proposes Mask R-CNN, a multi-task prediction architecture for simultaneously detecting objects, classifying them, and delineating their fine boundaries within the detected bounding boxes. Mask R-CNN builds upon the massively popular Faster R-CNN model, which was not designed for “pixel-to-pixel alignment between network inputs and outputs”, by adding a mask prediction branch for simultaneous segmentation predictions. ...
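The quoted alignment issue is what Mask R-CNN's RoIAlign layer addresses: region features are sampled at fractional coordinates via bilinear interpolation instead of being snapped to integer bins. A simplified sketch (one sample point on a toy feature map; not the paper's implementation):

```python
import numpy as np

# Simplified sketch of the sub-pixel sampling idea behind RoIAlign:
# bilinear interpolation preserves fractional-coordinate information that
# integer quantization (RoIPool-style binning) throws away.
def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D feature map at continuous (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * fmap[y0, x0]
            + (1 - dy) * dx * fmap[y0, x0 + 1]
            + dy * (1 - dx) * fmap[y0 + 1, x0]
            + dy * dx * fmap[y0 + 1, x0 + 1])

fmap = np.arange(16.0).reshape(4, 4)
aligned = bilinear_sample(fmap, 1.5, 2.5)  # averages the 4 neighbours -> 8.5
quantized = fmap[int(1.5), int(2.5)]       # truncation picks fmap[1, 2] -> 6.0
print(aligned, quantized)
```

The small per-coordinate error that quantization introduces is harmless for box classification but noticeably degrades pixelwise masks, which is why the mask branch needs the aligned features.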

September 28, 2020 · 4 min · Kumar Abhishek

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

The optical flow estimation task in computer vision is as follows: given two images $\mathcal{I}_1$ and $\mathcal{I}_2$, estimate, for each pixel in $\mathcal{I}_1$, where it moves to in $\mathcal{I}_2$. This dense pixel correspondence task is a long-standing problem that has remained largely unsolved because of difficulties including, but not limited to, shadows, reflections, occlusions, fast-moving objects, and low-texture surfaces. Traditional approaches for estimating optical flow, which frame it as a hand-crafted optimization problem over the “space of dense displacement fields” between an image pair, with the optimization performed during inference, are limited by the challenges of hand-crafting the optimization objective. Motivated by these traditional optimization-based approaches, this paper proposes an end-to-end differentiable deep learning (DL)-based architecture called RAFT (Recurrent All-Pairs Field Transforms) for estimating the optical flow. The RAFT architecture comprises three main components: (a) a convolutional feature encoder to extract feature vectors from a pair of images, (b) a correlation layer to construct a 4D correlation volume, followed by pooling to produce volumes at multiple lower resolutions, and (c) a gated activation unit based on GRUs to iteratively update a single flow field using values from the correlation volumes. ...
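Component (b), the all-pairs correlation volume, can be sketched in a few lines; the toy shapes and the single pooling level below are illustrative only (the paper builds a multi-level pyramid and feeds local lookups into the GRU update operator):

```python
import numpy as np

# Toy-sized sketch of RAFT's all-pairs correlation volume: a dot product
# between every feature vector of frame 1 and every feature vector of
# frame 2, followed by one level of average pooling on the frame-2 axes.
rng = np.random.default_rng(0)
H, W, D = 6, 8, 32
f1 = rng.normal(size=(H, W, D))  # frame-1 features
f2 = rng.normal(size=(H, W, D))  # frame-2 features

# 4D volume: corr[i, j, k, l] = <f1[i, j], f2[k, l]>
corr = np.einsum('ijd,kld->ijkl', f1, f2)

# Coarser level: 2x2 average pooling over the frame-2 (last two) dims.
corr_pooled = corr.reshape(H, W, H // 2, 2, W // 2, 2).mean(axis=(3, 5))
print(corr.shape, corr_pooled.shape)  # (6, 8, 6, 8) (6, 8, 3, 4)
```

Pooling only the frame-2 dimensions keeps full resolution for the pixel whose flow is being estimated while letting the update operator see large displacements cheaply.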

September 28, 2020 · 4 min · Kumar Abhishek