The ability to give robots natural language instructions for carrying out navigational tasks has been a longstanding goal of robotics and artificial intelligence. The task requires visual perception and natural language understanding in tandem, and while advances in visual question answering and visual dialog have enabled models to combine visual and linguistic reasoning, they do not “allow an agent to move or control the camera”. Settings that use only natural language commands abstract away the visual perception component and tend not to be linguistically rich. Simulators built on hand-crafted rendering models and synthetic environments try to address these problems, but they possess a limited set of 3D assets and textures, converting the robot’s challenging open-set problem in the real world into a much simpler closed-set problem, which in turn degrades performance in previously unseen environments. Finally, although reinforcement learning has been used to train navigation agents, existing approaches either do not leverage language instructions or rely on very simple linguistic settings. This paper proposes the Matterport3D Simulator, “a large-scale reinforcement learning environment based on real imagery”, and an associated Room-to-Room (R2R) dataset, with the hope that these will push forward vision-and-language navigation (VLN) research and improve generalizability to previously unseen environments.

A quick summary of the Matterport3D dataset, the largest publicly available RGB-D dataset: it contains 10,800 panoramic views of 90 building-scale scenes, with each view constructed from 18 RGB-D images captured from a single 3D position at approximately human eye level. Annotations include a 6 DoF camera pose for every image, globally-aligned and textured 3D meshes, and instance- and class-level semantic labels for regions and objects. The proposed Matterport3D Simulator lets an agent ‘move’ through a scene by “adopting poses coinciding with panoramic viewpoints”; a pose is the 3-tuple $<$3D position, heading, camera elevation$>$, and at each time step the simulator renders the agent’s first-person camera view as an RGB image. At each step, the simulator also outputs the set of (panoramic) viewpoints reachable in the next step. Reachability is encoded as a weighted undirected graph over viewpoints, with edges between robot-navigable viewpoint pairs and edge weights equal to the distance between them; this ensures that no trajectory passes through non-navigable space, and edges are kept under 5m so that motion stays local (sketched in the code below). The action space consists of 6 actions, with ‘stop’ ending an episode, and the agent can move the camera without changing position. The average graph contains 117 viewpoints with an average node degree of 4.1. The simulator itself imposes no restrictions on the agent’s goal, reward, etc.; these are task- and dataset-dependent.

The R2R dataset consists of 21,567 navigation instructions, averaging 29 words each and collected via Amazon Mechanical Turk, for 7,189 visually diverse navigation paths of 10m average length. Unlike many other vision-and-language tasks, R2R has a clear evaluation metric: ‘navigation error’, the shortest-path distance between the agent’s final position and the goal location, with an episode considered successful if the navigation error is under 3m. In keeping with prior work, results are also reported under an “oracle stopping rule”.

The authors provide a sequence-to-sequence (seq2seq) baseline and several others, including a human baseline. The LSTM-based seq2seq model extracts visual features with a ResNet-152 backbone, concatenates them with the word embeddings, and predicts actions using a global, general-alignment attention mechanism. Two training schemes, teacher-forcing and student-forcing, are used to obtain baseline performance, and two learning-free baselines are also reported: RANDOM (the agent always heads in a randomly selected direction) and SHORTEST (the agent always follows the shortest path to the goal). Results show that student-forcing, although slower to train because it explores more of the environment, is more effective than teacher-forcing, and, as expected, both outperform RANDOM. Moreover, performance is better in seen validation environments than in unseen ones despite strong regularization during training, indicating that the learned “visual groundings” may be specific to the seen environments.
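To make the navigation graph and the R2R evaluation convention more concrete, here is a minimal, self-contained Python sketch. It is not the actual Matterport3D Simulator API; `AgentState`, `NavGraph`, and `episode_success` are illustrative names assumed for this summary, but the pose 3-tuple, the 5m edge limit that keeps motion local, and the 3m success threshold follow the paper.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical 3D position: (x, y, z) in metres.
Position = Tuple[float, float, float]


@dataclass
class AgentState:
    """Agent pose, mirroring the paper's <3D position, heading, camera elevation> tuple."""
    viewpoint: str    # id of the panoramic viewpoint (i.e. the 3D position) the agent occupies
    heading: float    # radians, rotation about the vertical axis
    elevation: float  # radians, camera pitch


class NavGraph:
    """Weighted undirected graph over panoramic viewpoints.

    Edges connect robot-navigable viewpoint pairs, weights are the distances
    between them, and edges longer than `max_edge` (5 m in the paper) are
    dropped so that motion stays local.
    """

    def __init__(self, positions: Dict[str, Position],
                 navigable_pairs: List[Tuple[str, str]], max_edge: float = 5.0):
        self.positions = positions
        self.adj: Dict[str, Dict[str, float]] = {v: {} for v in positions}
        for a, b in navigable_pairs:
            d = math.dist(positions[a], positions[b])
            if d <= max_edge:
                self.adj[a][b] = d
                self.adj[b][a] = d

    def reachable(self, viewpoint: str) -> Dict[str, float]:
        """Viewpoints (and their distances) the agent may move to in the next step."""
        return self.adj[viewpoint]

    def shortest_path_distance(self, start: str, goal: str) -> float:
        """Dijkstra over the weighted graph; the basis of the navigation-error metric."""
        dist = {start: 0.0}
        pq = [(0.0, start)]
        while pq:
            d, v = heapq.heappop(pq)
            if v == goal:
                return d
            if d > dist.get(v, math.inf):
                continue  # stale queue entry
            for u, w in self.adj[v].items():
                nd = d + w
                if nd < dist.get(u, math.inf):
                    dist[u] = nd
                    heapq.heappush(pq, (nd, u))
        return math.inf


def episode_success(graph: NavGraph, final: str, goal: str, threshold: float = 3.0) -> bool:
    """R2R success criterion: navigation error (shortest-path distance to goal) under 3 m."""
    return graph.shortest_path_distance(final, goal) < threshold


if __name__ == "__main__":
    positions = {"v0": (0.0, 0.0, 1.5), "v1": (3.2, 0.5, 1.5), "v2": (5.5, 3.0, 1.5)}
    graph = NavGraph(positions, [("v0", "v1"), ("v1", "v2")])
    print(graph.reachable("v1"))                               # both v0 and v2 lie within 5 m
    print(round(graph.shortest_path_distance("v0", "v2"), 2))  # ~6.64 m, via v1
    print(episode_success(graph, final="v1", goal="v2"))       # False: error ~3.4 m >= 3 m
    print(episode_success(graph, final="v2", goal="v2"))       # True: error 0 m
```

The real simulator additionally renders the first-person RGB view at each pose and exposes the discrete action space; those rendering details are omitted from this sketch.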

This paper proposes the R2R dataset and the associated VLN challenge, aiming to spur research on visually-grounded, language-based navigation methods in realistic environments. The authors provide human performance baselines; train, validation, and test splits; and a public evaluation server so that researchers can evaluate their own results. The paper is well written, with sufficient detail about the simulator, the dataset, and the seq2seq baselines, and is easy to follow. The authors admit a few shortcomings: (a) the simulator considers only RGB images and discards the depth maps, and (b) the R2R dataset inherits a bias from the selection bias in the Matterport3D dataset (luxurious, clean, and uncluttered surroundings; very few people or animals in the images; and “commanding viewpoints”).

This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.