With the advent of digital photography and the popularity of cloud-based image-sharing websites, publicly accessible photographs of popular cities (and their landmarks) have proliferated across the world. The ability to leverage these photos in a meaningful way is therefore of great interest to the computer vision community, and one research area that stands to benefit immensely is city-scale 3D reconstruction. Traditionally, systems for this task have relied on images and data acquired in a structured manner, which simplifies computation. In contrast, images uploaded to the Internet carry no such constraints, necessitating algorithms that can work on “extremely diverse, large, and unconstrained image collections”. Building on previous research and incorporating ideas from other areas of computer science, this paper proposes a system that constructs large-scale 3D geometry from large, unorganized image collections publicly available on the Internet, with the ability to process more than a hundred thousand images in a day.
To find correspondences between a pair of images, SIFT (scale-invariant feature transform) descriptors are computed and matched using approximate nearest-neighbor search over a $k$-d tree; candidate matches are then verified with a RANSAC-based estimation of the fundamental matrix. However, it is computationally infeasible to generate such matches for all pairs. The authors therefore use two methods to propose edges in a match graph: (a) whole-image similarity and (b) query expansion. For whole-image similarity, SIFT features are clustered into ‘visual words’ and each photo is represented as a sparse histogram of visual words weighted by TF-IDF (term frequency–inverse document frequency). For each image, the 20 most similar images are found, and the top 10 are used to connect the various components in the sparsely connected match graph. The initial set of matches is then expanded by proposing and verifying edges implied by transitivity. To run these steps on a distributed cluster (with one master node responsible for job scheduling), the images are stored on the central master node and shipped to cluster nodes on demand. For proposal and verification of candidate image pairs, two rounds of whole-image similarity and four rounds of query expansion are used. Because of the distributed nature of the task and the data, many potential approaches had to be ruled out: they either did not scale to large datasets, left some nodes idle for long periods, or were limited by the computational cost of graph partitioning. The authors therefore use a “simple greedy bin-packing” algorithm: the master node keeps a list of the images held by each node and, on request from a node, fills a bin with available image pairs (from a fixed-size subset of image pairs) that require no network transfer, and sends it to that node.
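The pairwise matching and verification step described above can be sketched with OpenCV as a stand-in for the authors' own pipeline. This is a minimal illustration, not the paper's implementation; the ratio-test threshold and FLANN parameters are assumed defaults.

```python
import cv2
import numpy as np

def verified_matches(img_a, img_b, ratio=0.7):
    """Match SIFT features between two grayscale images and keep only
    matches consistent with a RANSAC-estimated fundamental matrix."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return []

    # Approximate nearest-neighbor search over k-d trees (FLANN).
    flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 4}, {"checks": 32})
    knn = flann.knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test to discard ambiguous matches.
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    if len(good) < 8:
        return []  # too few matches to estimate a fundamental matrix

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    # RANSAC-based fundamental-matrix estimation; keep only inliers.
    F, inlier_mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC)
    if F is None:
        return []
    return [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
```

A verified pair of images becomes an edge in the match graph; an empty result means the pair is rejected.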
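The whole-image-similarity step can be illustrated with a toy bag-of-visual-words model: quantized descriptor indices are turned into TF-IDF-weighted histograms and images are ranked by cosine similarity. The vocabulary size and data here are illustrative; the paper clusters real SIFT descriptors at far larger scale.

```python
import numpy as np

def tfidf_histograms(word_ids_per_image, vocab_size):
    """word_ids_per_image: list of 1-D arrays of visual-word indices.
    Returns L2-normalized TF-IDF histograms, one row per image."""
    n = len(word_ids_per_image)
    tf = np.zeros((n, vocab_size))
    for i, words in enumerate(word_ids_per_image):
        counts = np.bincount(words, minlength=vocab_size)
        tf[i] = counts / max(counts.sum(), 1)   # term frequency
    df = (tf > 0).sum(axis=0)                   # document frequency
    idf = np.log(n / np.maximum(df, 1))         # inverse document frequency
    hist = tf * idf
    norms = np.linalg.norm(hist, axis=1, keepdims=True)
    return hist / np.maximum(norms, 1e-12)

def top_k_similar(hists, query_idx, k):
    """Indices of the k images most similar to `query_idx` (cosine)."""
    sims = hists @ hists[query_idx]
    sims[query_idx] = -np.inf                   # exclude the query itself
    return np.argsort(sims)[::-1][:k]
```

In the system this ranking supplies the 20 candidate neighbors per image from which verified match-graph edges are proposed.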
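The greedy bin-packing scheduling idea can be sketched as follows. The fallback rule (prefer pairs needing the fewest new transfers) and all names are illustrative assumptions, not the paper's exact policy.

```python
def fill_bin(pending_pairs, node_images, bin_size):
    """Pick up to bin_size image pairs for a requesting node,
    preferring pairs whose images the node already holds (zero transfer).

    pending_pairs: list of (img_a, img_b) pairs still to be verified.
    node_images:   set of image ids already cached on the node.
    Returns (chosen pairs, remaining pending pairs).
    """
    # Pairs that require no network transfer at all.
    free = [p for p in pending_pairs
            if p[0] in node_images and p[1] in node_images]
    chosen = free[:bin_size]
    if len(chosen) < bin_size:
        # Illustrative fallback: pairs needing the fewest new transfers first.
        rest = [p for p in pending_pairs if p not in chosen]
        rest.sort(key=lambda p: (p[0] not in node_images)
                              + (p[1] not in node_images))
        chosen += rest[:bin_size - len(chosen)]
    remaining = [p for p in pending_pairs if p not in chosen]
    return chosen, remaining
```

The master would call this on each node's request, updating its record of which images the node now holds.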
Next, to generate correspondences across multiple images, feature tracks combine all the pairwise matching information: the vertices of a graph represent features from all images, edges connect matching features, and the connected components of this graph form the tracks. After the tracks are generated, a structure-from-motion (SfM) algorithm is run on “each connected component of the match graph to recover the camera poses and a 3D position for every track”. To address the inherent redundancy in Internet photo collections, the authors propose computing a minimal subset of images that captures the scene’s essential geometry, which they call the skeletal set. To perform bundle adjustment, the bottleneck of the reconstruction process, in an incremental manner, the authors developed a solver that uses either a truncated or an exact-step Levenberg–Marquardt (LM) algorithm, depending on the problem size. Finally, the registered images are used to generate dense 3D models with a multi-view stereo (MVS) algorithm: the SfM points are first clustered so that there is a small number of clusters, each below a size threshold, with every SfM point visible from enough images within its cluster; an MVS algorithm then solves for the scene geometry per cluster. On a distributed computing cluster with modest hardware, the authors demonstrate results on three city-scale datasets downloaded from Flickr, with their system able to process 14k input images in under 3 hours. The Rome and Venice datasets, which are essentially collections of landmarks, lead to simpler geometry and shorter SfM times than the Dubrovnik dataset, whose images cover the entire city, making it a harder SfM problem with a larger skeletal set and therefore a longer SfM time.
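The track-generation step, finding connected components over (image, feature) vertices linked by pairwise matches, can be sketched with a small union-find structure. The toy matches in the test are illustrative.

```python
class UnionFind:
    """Disjoint-set forest over arbitrary hashable vertices."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def build_tracks(matches):
    """matches: iterable of ((img_i, feat_i), (img_j, feat_j)) pairs.
    Returns the connected components with at least two features,
    i.e. the feature tracks."""
    uf = UnionFind()
    for a, b in matches:
        uf.union(a, b)
    components = {}
    for v in list(uf.parent):
        components.setdefault(uf.find(v), []).append(v)
    return [sorted(c) for c in components.values() if len(c) > 1]
```

Each returned track lists every (image, feature) observation of one physical scene point, which is exactly what SfM triangulates into a 3D position.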
The paper proposes a system for fast city-scale 3D reconstruction from large collections of unstructured and uncalibrated images downloaded from the Internet. The method is fast and works quite well, as demonstrated by its performance on three large datasets. The paper is extremely well written and easy to follow; while familiarity with the LM algorithm is assumed, more detail would have improved the explanation of bundle adjustment. Although the paper has been immensely impactful and was awarded the ICCV Helmholtz Prize in 2019, it does have a few shortcomings by its own admission, such as holes in generated surfaces, lack of parallelization of the three key steps, inability to leverage image metadata if and when available, sub-optimal distribution of images across cluster nodes, and use of a batch-mode operation instead of an incremental approach.
This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.