Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

The ability to give robots natural language instructions for carrying out navigational tasks has been a longstanding goal of robotics and artificial intelligence. The task requires visual perception and natural language understanding in tandem, and while advances in visual question answering and visual dialog have enabled models to combine visual and linguistic reasoning, they do not “allow an agent to move or control the camera”. Natural-language-only commands abstract away the visual perception component and are not very linguistically rich. Hand-crafted rendering models, and the environments and simulators built upon them, attempt to address these problems, but they possess a limited set of 3D assets and textures, turning the robot’s challenging open-set problem in the real world into a considerably simpler closed-set problem, which in turn degrades performance on previously unseen environments. Finally, although reinforcement learning has been used to train navigational agents, such agents either do not leverage language instructions or rely on very simple linguistic settings. This paper proposes the Matterport3D Simulator, “a large-scale reinforcement learning environment based on real imagery”, and an associated Room-to-Room (R2R) dataset, with the hope that these will help push forward vision-and-language navigation (VLN) research and improve generalizability to previously unseen environments. ...

November 2, 2020 · 4 min · Kumar Abhishek

ImageNet: A Large-Scale Hierarchical Image Database

The availability of large volumes of data is a key requirement for developing efficient, robust, and advanced machine learning-based prediction models. This paper introduces the ImageNet database, “a large scale ontology of images” built upon the hierarchical structure of WordNet, an online lexical database of meaningful concepts. These concepts, each described by one or more words or word phrases, are known as synonym sets or synsets. The ImageNet dataset contains 3.2 million labeled images, organized into 12 subtrees spanning 5247 synsets, with an average of 600 images per synset, making it one of the largest publicly available image datasets in terms of the number and diversity of its images, the accuracy of its image labels, and its hierarchical structure. ...

September 21, 2020 · 4 min · Kumar Abhishek