The surge of interest in augmented and mixed reality applications can, at least in part, be attributed to research on “real-time infrastructure-free” tracking of a camera with the simultaneous generation of detailed maps of the physical scene. While computer vision research has enabled accurate camera tracking and dense scene surface reconstruction through structure-from-motion and multi-view stereo algorithms, these methods are not well suited to real-time applications or to detailed surface reconstruction. There has also been a contemporaneous improvement in camera technology, especially depth cameras based on time-of-flight or structured-light sensing, such as the consumer-grade Microsoft Kinect. The Kinect features a structured-light depth sensor (hereafter, the sensor) and generates an 11-bit $640 \times 480$ depth map at 30 Hz using an on-board ASIC. However, these depth maps are noisy and contain ‘holes’ in regions where no depth reading was possible. This paper proposes a system that processes these noisy depth maps and performs real-time dense simultaneous localization and mapping (SLAM), using every pixel of every $640 \times 480$ frame at 30 Hz (roughly 9 million new point measurements per second), thereby incrementally building a consistent 3D scene model while also tracking the sensor’s full 6 degree-of-freedom motion at every frame. While the paper presents quite an involved description of the method, its key components are briefly summarized here.

The method uses a rigid-body transformation matrix to represent the estimated 6-DOF camera pose and a single, constant camera calibration matrix to map points from the sensor plane onto image pixels. The system comprises four components, described in (A)–(D) below.
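
As a brief notational sketch (the pose and projection conventions follow the paper; the homogeneous-coordinate shorthand is ours), the pose at frame $k$ is the rigid-body transform

$$\mathbf{T}_{g,k} = \begin{bmatrix} \mathbf{R}_{g,k} & \mathbf{t}_{g,k} \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \in \mathbb{SE}(3), \qquad \mathbf{R}_{g,k} \in \mathbb{SO}(3),\; \mathbf{t}_{g,k} \in \mathbb{R}^{3},$$

which maps a point $\mathbf{p}_k$ in the camera frame at time $k$ into the global frame $g$ as $\mathbf{p}_g = \mathbf{T}_{g,k}\,\dot{\mathbf{p}}_k$ (the dot denoting homogeneous coordinates), while a camera-frame point $\mathbf{p}$ projects to the image pixel $\mathbf{u} = \pi(K\mathbf{p})$, where $K$ is the calibration matrix and $\pi(x, y, z) = (x/z,\, y/z)$ performs the perspective division.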

  • (A) Surface measurement: the raw depth maps are filtered with a bilateral filter to reduce noise while preserving depth discontinuities, and the filtered depths are back-projected to obtain a vertex map and a normal map at each time step, followed by a three-level pyramid of vertex and normal maps representing the surface measurement (a back-projection sketch is given after this list). The authors also define a per-pixel binary vertex validity mask indicating whether a depth measurement yields a valid vertex.

  • (B) Mapping as surface reconstruction: given the sensor pose estimated by tracking the new depth frame, the surface measurement is fused incrementally into a single global 3D scene model. This is done with a projective truncated signed distance function (TSDF), which is cheap to compute and trivial to parallelize, and a weighted running average is used to incrementally update each TSDF value (see the per-voxel fusion sketch after this list). Importantly, the raw, rather than bilaterally filtered, depth measurements are used for TSDF fusion to preserve high-frequency detail. Note that, within the truncation region, the fusion of several projective TSDFs from multiple viewpoints converges towards the true SDF.

  • (C) Surface prediction from ray casting the TSDF: unlike previous works that separate tracking from mapping, this system performs the two operations in tandem by “tracking the live depth frame against the globally fused model”, which requires a dense surface prediction obtained by ray casting the TSDF from the current pose estimate. Also, since the TSDF is continuously updated, per-ray skipping is better suited to accelerating the march through empty space than a min/max block acceleration structure, which would need constant rebuilding (see the ray-marching sketch after this list).

  • (D) Sensor pose estimation: the proposed system uses all of the data in a frame to localize the current sensor pose, which is enabled by (a) the high tracking frame rate, which justifies a small-motion assumption between consecutive frames, and (b) the ease of parallelizing the pipeline on GPUs. A multi-scale iterative closest point (ICP) algorithm is used to align the current sensor measurement with the predicted surface (see the ICP sketch after this list).

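To make (A) concrete, the following is a minimal NumPy sketch of the back-projection $V(\mathbf{u}) = D(\mathbf{u})\,K^{-1}\dot{\mathbf{u}}$ that turns a (filtered) depth map into a vertex map, with normals estimated from neighbouring vertices via cross products. The bilateral filter and the three-level pyramid are omitted, the real system runs this per pixel on the GPU, and all function and variable names here are ours rather than the paper's.

    import numpy as np

    def vertex_and_normal_maps(depth, K):
        """Back-project a (filtered) depth map into a vertex map and estimate
        a normal map from neighbouring vertices via cross products.

        depth : (H, W) array of depth values in metres (0 where invalid).
        K     : (3, 3) camera calibration (intrinsics) matrix.
        """
        H, W = depth.shape
        K_inv = np.linalg.inv(K)

        # Pixel grid in homogeneous coordinates: u = (x, y, 1).
        xs, ys = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)

        # Vertex map: V(u) = D(u) * K^{-1} * u  (camera-frame 3D point per pixel).
        vertices = depth[..., None] * (pix @ K_inv.T)

        # Normal map from finite differences of neighbouring vertices:
        # N(u) proportional to (V(x+1, y) - V(x, y)) x (V(x, y+1) - V(x, y)).
        dx = np.zeros_like(vertices)
        dy = np.zeros_like(vertices)
        dx[:, :-1] = vertices[:, 1:] - vertices[:, :-1]
        dy[:-1, :] = vertices[1:, :] - vertices[:-1, :]
        normals = np.cross(dx, dy)
        norm = np.linalg.norm(normals, axis=-1, keepdims=True)
        valid = (depth > 0) & (norm[..., 0] > 0)     # plays the role of the vertex validity mask
        normals = np.where(valid[..., None], normals / np.maximum(norm, 1e-12), 0.0)
        return vertices, normals, valid
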
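For (B), the per-voxel update is a simple weighted running average, $F \leftarrow (W F + w F_{\mathrm{new}})/(W + w)$ with the accumulated weight capped. A minimal sketch follows; the projective association of each voxel with a depth pixel is omitted, and the weight cap of 128 and the helper names are our illustrative choices.

    def truncate_sdf(eta, mu):
        """Truncation: normalised signed distance in [-1, 1] for measurements no
        further than mu behind the surface, else None (no information)."""
        if eta < -mu:
            return None
        return min(eta, mu) / mu

    def fuse_voxel(F, W, eta, mu, w_new=1.0, w_max=128.0):
        """Weighted running average for incremental TSDF fusion of one voxel:
        F <- (W*F + w*F_new) / (W + w),  W <- min(W + w, W_max)."""
        F_new = truncate_sdf(eta, mu)
        if F_new is None:
            return F, W                    # this measurement does not touch the voxel
        F_fused = (W * F + w_new * F_new) / (W + w_new)
        return F_fused, min(W + w_new, w_max)
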
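For (C), a single ray can be marched through the volume with steps of nearly the truncation distance $\mu$ while the sampled TSDF value indicates free space, dropping to a finer step near the surface and interpolating the zero crossing. The sketch below assumes a trilinear TSDF sampler is available; the particular step-size thresholds are our illustrative choices, not the paper's.

    import numpy as np

    def raycast_tsdf(sample_tsdf, origin, direction, mu, t_max):
        """March one ray through the TSDF and return the surface point (zero
        crossing), or None if the ray exits the volume without intersecting.

        sample_tsdf(p) -> normalised truncated signed distance at 3D point p.
        """
        direction = np.asarray(direction, dtype=float)
        direction /= np.linalg.norm(direction)
        fine_step = mu / 8.0
        t, f_prev, t_prev = 0.0, None, None
        while t < t_max:
            f = sample_tsdf(origin + t * direction)
            if f_prev is not None and f_prev > 0.0 >= f:
                # Zero crossing between t_prev and t: linearly interpolate t*.
                t_star = t_prev + (t - t_prev) * f_prev / (f_prev - f)
                return origin + t_star * direction
            f_prev, t_prev = f, t
            # Skip close to mu while deep in free space; fine steps near the surface.
            t += 0.8 * mu if f > 0.99 else fine_step
        return None
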
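For (D), the sketch below shows one linearised point-to-plane ICP step (solving a $6 \times 6$ system for a small rotation and translation), assuming correspondences between the live measurement and the predicted surface have already been established (in the paper this is done by projective data association). The small-angle linearisation and all names are ours; the multi-scale pyramid and the outlier and visibility checks are omitted.

    import numpy as np

    def point_to_plane_icp_step(src_pts, dst_pts, dst_normals):
        """One linearised point-to-plane ICP step: given N associated source points
        s_i (live measurement), destination points d_i and unit normals n_i (from
        the surface prediction), solve for a small twist x = [omega; t] minimising
        sum_i ((R s_i + t - d_i) . n_i)^2 with R ~ I + [omega]_x.
        Returns the 4x4 incremental transform to compose with the current pose."""
        s = np.asarray(src_pts, dtype=float)
        d = np.asarray(dst_pts, dtype=float)
        n = np.asarray(dst_normals, dtype=float)
        A = np.hstack([np.cross(s, n), n])        # (N, 6) rows: [(s_i x n_i)^T, n_i^T]
        b = -np.einsum('ij,ij->i', s - d, n)      # (N,) negated point-to-plane residuals
        x = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations for [omega; t]
        omega, t = x[:3], x[3:]
        # Exact rotation from the small rotation vector omega (Rodrigues' formula).
        theta = np.linalg.norm(omega)
        K = np.array([[0.0, -omega[2], omega[1]],
                      [omega[2], 0.0, -omega[0]],
                      [-omega[1], omega[0], 0.0]])
        R = np.eye(3) if theta < 1e-12 else (
            np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta ** 2 * K @ K)
        T = np.eye(4)
        T[:3, :3], T[:3, 3] = R, t
        return T
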
The proposed tracking and mapping system operates in constant time for a given reconstruction volume. Experiments on a “tabletop scene mounted on a turntable”, scanned for 19 seconds to yield 560 frames, show that the system exhibits convergence properties in both tracking and mapping without any explicit joint global optimization. Even when frames are dropped (simulated by subsampling to only every $8^{\mathrm{th}}$ frame), the system still produces a “drift free result”. Finally, the system scales well with the available computational resources (including GPU memory).

The paper proposes a system for accurate, real-time, high-quality dense volumetric reconstruction of complex room-sized ($\leq 7\,\mathrm{m}^3$) interiors under varying lighting conditions, using noisy depth maps from a handheld consumer-grade depth camera (Microsoft Kinect) and commodity graphics hardware. The method is explained in intricate detail, with justifications for all of the methodological choices, and the presented experiments demonstrate the efficacy and robustness of the approach. Despite its excellent performance, the system admittedly has some shortcomings: (a) it does not scale well to large interiors such as an entire building, owing to its large memory requirements and the drift that arises in “very large exploratory sequences”, and (b) tracking fails when a large planar scene occupies the majority of the field of view, since such geometry does not sufficiently constrain the sensor's 6-DOF motion. A related later work, SLAM++ [1], uses prior information about the objects and structures present in scenes.

[1] R. F. Salas-Moreno et al., “SLAM++: Simultaneous localisation and mapping at the level of objects,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Jun. 2013. [Online]. Available: https://doi.org/10.1109/cvpr.2013.

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.