ImageNet: A Large-Scale Hierarchical Image Database

The availability of large volumes of data is a key requirement in the development of efficient, robust, and advanced machine-learning-based prediction models. This paper introduces the ImageNet database, “a large scale ontology of images” built upon the hierarchical structure of WordNet, an online lexical database of meaningful concepts. These concepts, described by words or word phrases, are known as synonym sets or synsets. The ImageNet dataset contains 3.2 million labeled images, organized in 12 subtrees and 5247 synsets in total, with an average of 600 images per synset. This makes it one of the largest publicly available image datasets, notable for the number and diversity of its images, the accuracy of its labels, and its hierarchical structure.

Constructing the ImageNet dataset required collecting candidate images from the results of several search engines, where the query set for each synset was expanded by appending words from its parent synsets. These “candidate images”, after duplicates are removed, are presented to human raters on Amazon Mechanical Turk (AMT) for labeling. Along with the images, the AMT users are given the definitions of the target synsets, and they are encouraged to label images despite the presence of other objects, in an attempt to increase diversity. To ensure the accuracy of the labels, the authors (a) have multiple users label the same image and (b) propose an algorithm to determine the level of inter-rater agreement needed for each category.

The resulting dataset is arranged in a “densely populated semantic hierarchy”, has more accurate labels (99.7% precision on average across 80 randomly sampled synsets), and is more diverse than contemporaneous image datasets.

To demonstrate the usefulness of the dataset, the authors perform three sets of experiments. First, they emphasize the benefits of clean, high-resolution images through object recognition experiments, showing that higher resolution helps models learn detailed feature-level information. Second, they show with a tree-max classifier that exploiting ImageNet’s hierarchical structure can substantially improve classification performance without any additional training. Lastly, they perform object localization by learning visual representations of object classes, and propose this as a potential extension of the dataset.
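
The tree-max idea mentioned above can be illustrated with a small sketch. It assumes that an independent classifier has already produced one response per synset for a given image, and that the score of a synset is taken as the maximum response over that synset and all of its descendants, so no retraining is involved. The function name, the toy hierarchy, and the scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def tree_max_scores(
    children: Dict[str, List[str]],
    raw_scores: Dict[str, float],
    root: str,
) -> Dict[str, float]:
    """For every synset under `root`, return the maximum raw classifier
    response found in the subtree rooted at that synset."""
    result: Dict[str, float] = {}

    def visit(node: str) -> float:
        # Start from the node's own response, then fold in its descendants.
        best = raw_scores[node]
        for child in children.get(node, []):
            best = max(best, visit(child))
        result[node] = best
        return best

    visit(root)
    return result


# Hypothetical toy hierarchy and per-synset responses for a single image.
children = {"mammal": ["dog", "cat"], "dog": ["husky", "beagle"]}
raw_scores = {"mammal": 0.10, "dog": 0.30, "cat": 0.20, "husky": 0.90, "beagle": 0.05}

scores = tree_max_scores(children, raw_scores, "mammal")
print(scores["dog"], scores["mammal"], scores["cat"])
```

In this toy case a confident "husky" response raises the score of its ancestors "dog" and "mammal", while "cat" keeps its own score. This is one way to read how a hierarchy could improve recognition of coarse categories using only classifiers that already exist.
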
The authors conclude by outlining their then-future goals for the dataset: extending it to 50 million images and 50k synsets, adding annotations for other computer vision tasks (such as object localization and segmentation), and releasing the dataset publicly for the community to contribute to and benefit from.
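
The label-verification step described earlier in this summary (several AMT users vote on each image, and the amount of agreement required depends on the category) can be illustrated with a simplified sketch. The version below uses fixed, hand-picked requirements per synset, which is an assumption made purely for illustration. The paper proposes its own algorithm for determining how much agreement each category needs, and that algorithm is not reproduced here. All names, thresholds, and votes are hypothetical.

```python
from typing import Dict, List


def accept_images(
    votes: Dict[str, List[bool]],
    min_votes: int,
    min_agreement: float,
) -> List[str]:
    """Return the IDs of images that have at least `min_votes` votes and
    whose fraction of positive votes is at least `min_agreement`."""
    accepted = []
    for image_id, image_votes in votes.items():
        if len(image_votes) < min_votes:
            # Too few raters so far; such an image would need more labels.
            continue
        agreement = sum(image_votes) / len(image_votes)
        if agreement >= min_agreement:
            accepted.append(image_id)
    return accepted


# Hypothetical requirements as (min_votes, min_agreement): a fine-grained
# synset is assumed to need more votes and stricter agreement than a coarse one.
requirements = {"cat": (3, 0.6), "burmese_cat": (5, 0.75)}

# Hypothetical votes: True means the rater says the image shows the synset.
votes = {
    "img1": [True, True, False],
    "img2": [True, False, False],
    "img3": [True, True, True, True, False],
}

accepted_cat = accept_images(votes, *requirements["cat"])
accepted_burmese = accept_images(votes, *requirements["burmese_cat"])
print(accepted_cat, accepted_burmese)
```

With these made-up numbers, the same set of votes passes more images for the coarse synset than for the fine-grained one, which is the intuition behind making the agreement requirement depend on the category.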

The curation and public release of the ImageNet dataset, along with the (now discontinued) annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), have played a pivotal role in transforming AI research over the past decade or so. This paper was published before the widespread use of deep neural networks for computer vision tasks, and now that the dataset has grown to over 14 million images and over 21k synsets, its impact is much more obvious. The paper is well written, with extensive details about the dataset and its advantages over contemporary datasets, backed by empirical evidence. Recent work [1] has, however, raised ethical concerns over some of the labels, some of which can be traced to the WordNet taxonomy [2]; in response, over 600k images were removed from the dataset [3]. Similarly, a very recent work [4] reassessed the reliability of ImageNet labels and showed that a newly collected set of annotations provides a more accurate estimate of models’ performance.

[1] K. Crawford and T. Paglen, “Excavating AI: The politics of images in machine learning training sets,” Sep. 2019. [Online]. Available: https://www.excavating.ai/

[2] V. U. Prabhu and A. Birhane, “Large image datasets: A pyrrhic win for computer vision?” arXiv preprint arXiv:2006.16923, 2020.

[3] Z. Small, “600,000 images removed from AI database after art project exposes racist bias,” Sep. 2019. [Online]. Available: https://hyperallergic.com/518822/600000-images-removed-from-ai-database-after-art-project-exposes-racist-bias/

[4] L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. v. d. Oord, “Are we done with ImageNet?” arXiv preprint arXiv:2006.07159, 2020.

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.