The optical flow estimation task in computer vision is the following: given two images $\mathcal{I}_1$ and $\mathcal{I}_2$, estimate, for each pixel in $\mathcal{I}_1$, where it moves to in $\mathcal{I}_2$. This dense pixel-correspondence task is a long-standing problem that has remained largely unsolved because of difficulties including, but not limited to, shadows, reflections, occlusions, fast-moving objects, and low-texture surfaces. Traditional approaches frame optical flow estimation as a hand-crafted optimization problem over the space of dense displacement fields between an image pair, with the optimization performed at inference time; they are limited by the difficulty of hand-crafting an objective that handles all of these effects. Motivated by these traditional optimization-based approaches, this paper proposes an end-to-end differentiable deep learning (DL)-based architecture called RAFT (Recurrent All-Pairs Field Transforms) for estimating optical flow. The RAFT architecture consists of 3 main components: (a) a convolutional feature encoder that extracts feature vectors from a pair of images, (b) a correlation layer that constructs a 4D correlation volume, followed by pooling to produce volumes at multiple lower resolutions, and (c) a GRU-based update operator that iteratively refines a single flow field using values looked up from the correlation volumes.
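To make component (b) concrete, the following is a minimal PyTorch sketch (the function and variable names are hypothetical, not the authors' code) of the all-pairs correlation volume, computed with a single matrix multiplication and then average-pooled to form the multi-resolution pyramid:

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(fmap1, fmap2, num_levels=4):
    """Sketch of a RAFT-style all-pairs correlation pyramid.

    fmap1, fmap2: (B, D, H, W) feature maps from the two images.
    Returns a list of num_levels volumes; level k has shape
    (B*H*W, 1, H/2^k, W/2^k).
    """
    B, D, H, W = fmap1.shape
    f1 = fmap1.view(B, D, H * W)
    f2 = fmap2.view(B, D, H * W)
    # All pairwise inner products in one matrix multiplication.
    corr = torch.matmul(f1.transpose(1, 2), f2)      # (B, HW, HW)
    corr = corr / (D ** 0.5)                         # scale for stability
    # Treat the last two dimensions (positions in image 2) as a 2D map
    # and average-pool them; repeated 2x2 pooling is equivalent to
    # average pooling with kernel sizes 1, 2, 4, and 8.
    corr = corr.view(B * H * W, 1, H, W)
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid
```

Only the last two dimensions are pooled, so the first two (positions in $\mathcal{I}_1$) remain at full resolution; this is what allows the pyramid to capture large displacements without losing small, fast-moving objects.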
Given a pair of images $\mathcal{I}_1$ and $\mathcal{I}_2$, the feature encoder, composed of 6 residual blocks, extracts 256-channel feature maps at $\frac{1}{8}^{\textrm{th}}$ resolution for both images. Additionally, another network with the same architecture, known as the context network, extracts features from $\mathcal{I}_1$ only. The 4D correlation volume is then constructed by taking the inner product between all pairs of feature vectors, which can be computed efficiently with a single matrix multiplication (as in the sketch above). This volume is used to construct a 4-level correlation pyramid by pooling along the last 2 dimensions of the volume with kernel sizes 1, 2, 4, and 8, which captures information about both large and small displacements, including the motion of fast-moving small objects. A lookup operator then generates feature maps from each level of the correlation pyramid, which are concatenated to form a single feature map. An update operator, based on a GRU cell with the fully connected layers replaced by convolutions, takes as input the concatenation of flow, correlation, and context features and predicts the flow update at $\frac{1}{8}^{\textrm{th}}$ resolution of the input image; the flow is then upsampled to the full resolution using a weighted (convex) combination over each coarse pixel's neighborhood. The network is trained using an $\mathcal{L}_1$ loss over the full sequence of predictions with exponentially increasing loss weights (a sketch of this loss follows below).

The authors evaluate the proposed method by first training on FlyingChairs and then FlyingThings, before finetuning on the Sintel and KITTI datasets. Their method surpasses all existing works, and does so with significantly shorter training times. The authors argue that an update operator resembling a $1^{\textrm{st}}$-order descent algorithm constrains the search space, which helps reduce overfitting and allows the network to generalize well to multiple datasets. An exhaustive set of ablation studies is also presented to ascertain the contribution of each component, with the authors concluding that the design choices that worked best were: a GRU block with convolutional layers, weight tying across all instances of the update operator, the use of a context network, feature extraction at a single resolution, a non-zero neighborhood radius for the lookup operator, the presence of correlation pooling, using all pixel pairs to form the correlation volume, and learned upsampling. Finally, the RAFT model has fewer parameters than competing methods, converges faster, and has lower inference time.
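Below is a minimal sketch of the sequence loss described above (names are illustrative; the paper uses a decay factor $\gamma = 0.8$). The weights grow exponentially toward later iterations, so the most refined predictions dominate the loss:

```python
import torch

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """L1 loss over the full sequence of N flow predictions,
    weighted by gamma^(N-1-i) so that later (more refined)
    iterations receive exponentially larger weights."""
    n = len(flow_preds)
    loss = 0.0
    for i, pred in enumerate(flow_preds):
        weight = gamma ** (n - 1 - i)
        loss = loss + weight * (pred - flow_gt).abs().mean()
    return loss
```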
The RAFT model draws inspiration from both traditional and DL-based optical flow estimation methods: its update operator emulates a $1^{\textrm{st}}$-order optimization algorithm operating directly on the flow values, while the features and motion priors are learned rather than hand-crafted. This helps overcome several shortcomings of existing DL-based optical flow estimation methods, and the resulting model is superior to existing approaches in terms of performance, model size, and training time. The paper is written comprehensively and provides sufficient explanation of all the components. The evaluation on multiple datasets is very convincing of the method's superiority, and the exhaustive ablation studies provide valuable insights into the design. Another strong aspect of the paper is the extensive and detailed literature survey that distinguishes its contributions from previous works. A concurrent work [1] used unsupervised learning to achieve results comparable to supervised methods.
[1] R. Jonschkowski, A. Stone, J. T. Barron, A. Gordon, K. Konolige, and A. Angelova, “What matters in unsupervised optical flow,” in Proceedings of the European Conference on Computer Vision (ECCV), August 2020.
This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.