Motion estimation for large displacements and deformations

Large displacement optical flow is an integral part of many computer vision tasks. Variational optical flow techniques based on a coarse-to-fine scheme interpolate sparse matches and locally optimize an energy model conditioned on colour, gradient and smoothness, making them sensitive to noise in the sparse matches, deformations, and arbitrarily large displacements. This paper addresses this problem and presents HybridFlow, a variational motion estimation framework for large displacements and deformations. A multi-scale hybrid matching approach is performed on the image pairs. Coarse-scale clusters formed by classifying pixels according to their feature descriptors are matched using the clusters’ context descriptors. We apply a multi-scale graph matching on the finer-scale superpixels contained within each matched pair of coarse-scale clusters. Small clusters that cannot be further subdivided are matched using localized feature matching. Together, these initial matches form the flow, which is propagated by an edge-preserving interpolation and variational refinement. Our approach does not require training and is robust to substantial displacements and rigid and non-rigid transformations due to motion in the scene, making it ideal for large-scale imagery such as aerial imagery. More notably, HybridFlow works on directed graphs of arbitrary topology representing perceptual groups, which improves motion estimation in the presence of significant deformations. We demonstrate HybridFlow’s superior performance to state-of-the-art variational techniques on two benchmark datasets and report comparable results with state-of-the-art deep-learning-based techniques.

We follow a multi-scale approach, and fine-scale superpixels resulting from the perceptual grouping of pixels contained within the parent coarse-scale cluster form the basis of subsequent processing. Graph matching is performed on the graphs representing the fine-scale superpixels by simultaneously estimating the graph node correspondences based on the first and second-order similarities and a smooth non-rigid transformation between nodes. Graph matching is an NP-hard problem; thus, the graphs' factorization into Kronecker products ensures tractable computational complexity. This process can be repeated at multiple scales to handle arbitrarily large images. At the finest scale, the pixels' feature descriptors are matched based on their $L_2$ distance. Pixel-level feature matching is also performed on clusters that are too small to be subdivided into superpixels. We combine both sets of pixel matches to form the initial sparse motion vectors from which the optical flow is interpolated. Finally, variational refinement is applied to the optical flow. HybridFlow is robust to large displacements and deformations and has a minimal computational footprint compared to deep-learning-based approaches. A significant advantage of our technique is that using multi-scale graph matching reduces the computational complexity from $O(n^2)$ to $\sum_{i=0}^{k} O(k^2)$, where $k$ is always smaller than the superpixel size $|s|$ and significantly smaller than $n$, i.e. $k < |s| \ll n$. Our experiments demonstrate the effectiveness of our technique in optical flow estimation. We evaluate HybridFlow on two benchmark datasets (MPI-Sintel 13, KITTI-2015 14) and compare it against state-of-the-art variational techniques. HybridFlow outperforms all other variational techniques and, on average, gives comparable results with deep-learning-based methods.

Figure. HybridFlow: A multi-scale hybrid matching approach is performed on the image pairs. Uniquely, HybridFlow leverages the strong discriminative nature of feature descriptors, combined with the robustness of graph matching on arbitrary graph topologies. Coarse-scale clusters are formed based on the pixels' feature descriptors and are further subdivided into finer-scale SLIC superpixels. Graph matching is performed on the superpixels contained within the matched coarse-scale clusters. Small clusters that cannot be further subdivided are matched using localized feature matching. Together, these initial matches form the flow, which is propagated by an edge-preserving interpolation and variational refinement.
To summarize, our contributions are:
• A hybrid matching approach that uniquely combines the robustness of feature detection and matching with the invariance to rigid and non-rigid transformations of graph matching. The combination results in high tolerance to large displacements and deformations when compared to other techniques.
• An objective function based on first and second-order similarities for matching graph nodes and edges, which results in improved matching as showcased by our experiments.
• A complete variational framework for estimating optical flow that does not require training and is robust to large displacements and deformations caused by motion in the scene, while providing superior performance to state-of-the-art variational techniques and comparable performance to state-of-the-art deep-learning-based techniques on benchmark datasets.

Related work
Optical flow is a 2D vector field describing the apparent motion of the objects in the scene. This optical flow field can be very informative about the relations between the viewers' motion and the 3D scene. Over the years, many techniques have been proposed following the predominant way of estimating optical flow using variational methods 15 . The optical flow is estimated via optimization of an energy model conditioned on image brightness/colour, gradient, and smoothness. This energy model fails when dealing with large displacements due to motion in the scene because its solution is approximate and locally optimizes the function.
To address this challenge, Anandan 16 proposed a coarse-to-fine scheme. Coarse-to-fine techniques upsample and interpolate the flow from the coarser scales of the pyramid to the finer ones. These techniques can deal with large displacements; however, this comes at the cost of over-smoothing fine structures and failing to capture small-scale and fast-moving objects.
At the same time, researchers explored the integration of feature matching in optical flow estimation. Revaud et al. 17 recently presented one of the most promising variational techniques where a HOG descriptor was used as a feature matching term in the energy function. Their technique can deal with deformations and is robust to repetitive textures. In subsequent work, the authors proposed EpicFlow, which performs a sparse-to-dense interpolation on the correspondences and estimates optical flow while preserving edges 10. Hu et al. 12 built upon this work and proposed a robust interpolation technique to address the sensitivity of EpicFlow to noise in the initial matches by enforcing matching neighbourhood flow in the two images and fitting an affine model to the sparse correspondences. Up to now, this improvement outperformed the previous best, which was based on a coarse-to-fine technique using PatchMatch 11.
More recently, several techniques were proposed based on convolutional neural networks (CNN). These estimate the optical flow in an end-to-end fashion using supervised learning [18][19][20] or unsupervised learning [21][22][23] . One of the recent top-performing CNN-based approaches is SelFlow 24 . SelFlow is a self-supervised learning approach for optical flow that, until lately, produced the highest accuracy among all unsupervised learning methods. The authors achieved this by creating synthetic occlusions from perturbing superpixels. The current state-of-the-art CNN-based technique is RAFT 25 , in which per-pixel features are employed in a deep network architecture of recurrent transforms. RAFT and its variants such as GMA 26 currently achieve the best performance reporting the lowest average endpoint error for all significant optical flow benchmark datasets.
Currently, the average endpoint error (AEE/EPE) reported on Sintel-final for the top-performing deep-learning technique (CRAFT) is 2.424, and for the top-performing variational technique (HybridFlow, ours) it is 5.121: a difference of less than 2.7 pixels over the entire image set of 562 images of 1024 × 436. Although deep learning techniques outperform the variational methods on benchmark datasets for which ground truth is available, they are unusable on real image sequences, which seldom have associated ground truth, so that training and fine-tuning become impossible. Moreover, even in cases where ground truth may be available, training and fine-tuning are time-consuming, offline operations that render these techniques unsuitable in scenarios requiring real-time or interactive performance.
For these reasons, we propose a variational optical flow technique that is independent of the content of the image sequences and does not impose additional requirements for training and fine-tuning. Our method follows a hybrid approach for matching to eliminate errors in the initial sparse matches introduced from large displacements and deformations. HybridFlow leverages the strong discriminative nature of feature descriptors combined with the robustness of deformable graph matching. In contrast to variational state-of-the-art, which employs a regular grid structure in their coarse-to-fine matching scheme, HybridFlow operates at only a single image scale and multiple scales of clustering, eliminating over-smoothing and handling small-scale and fast-moving objects better. More notably, our method does not restrict deformations by enforcing smooth neighbourhood matching but instead employs deformable graph matching, which allows for rigid and non-rigid transformations between neighbouring superpixels.

Graph model and matching
Model. A graph $G = \{P, E, T\}$ consists of nodes $P$ inter-connected with edges $E$. A node-edge incidence matrix $T$ specifies the topology of the graph $G$. The nodes are represented in matrix form as $P = [\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_N] \in \mathbb{R}^{\dim(\vec{p}) \times N}$. Similarly, the edges are represented in matrix form as $E = [\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_M] \in \mathbb{R}^{\dim(\vec{e}) \times M}$. An edge-weight function $w : E \times E \to \mathbb{R}$ assigns weights to edges. Given the above definitions, the incidence matrix is defined as $T \in \{0, 1\}^{N \times M}$, where $T_{(i,k)} = T_{(j,k)} = 1$ if an edge $e_k \in E$ connects the nodes $p_i, p_j \in P$; otherwise the entry is set to 0.
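The incidence matrix defined above can be illustrated with a minimal sketch. The function name `incidence_matrix` and the toy triangle graph are hypothetical and serve only as an illustration of the definition, not as the authors' implementation.

```python
def incidence_matrix(num_nodes, edges):
    """Build the N x M node-edge incidence matrix T of an undirected graph.

    `edges` is a list of (i, j) node-index pairs; column k describes edge e_k.
    """
    T = [[0] * len(edges) for _ in range(num_nodes)]
    for k, (i, j) in enumerate(edges):
        # Both end-points of edge e_k are marked in column k.
        T[i][k] = 1
        T[j][k] = 1
    return T


# Hypothetical toy graph: a triangle over nodes 0, 1, 2.
edges = [(0, 1), (1, 2), (0, 2)]
T = incidence_matrix(3, edges)
for row in T:
    print(row)
```

Each column of `T` contains exactly two ones, one per end-point of the corresponding edge.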
Matching. Matching two graphs $G_1 = \{P_1, E_1, T_1\}$ and $G_2 = \{P_2, E_2, T_2\}$ is an NP-hard problem for which exact solutions can only be found if the numbers of nodes and edges are sufficiently small, e.g. $N, M < 15$. Proposed solutions typically formulate graph matching as a Quadratic Assignment Problem (QAP) and provide an approximation to the solution 27. This requires the calculation of two affinity matrices: $A^P_{1,2} \in \mathbb{R}^{N \times N}$, which encodes the similarities between nodes in $G_1$ and $G_2$, and $A^E_{1,2} \in \mathbb{R}^{M \times M}$, which encodes the similarities between edges in $G_1$ and $G_2$. The functions $\Phi_P : P \times P \to \mathbb{R}$ and $\Phi_E : E \times E \to \mathbb{R}$ measure the similarities between nodes and edges, respectively. Therefore, for two corresponding nodes $p_i \in P_1$ of $G_1$ and $p_k \in P_2$ of $G_2$, the node affinity matrix element is $A^P_{i,k} = \Phi_P(p_i, p_k)$. Similarly, for edges $e_a \in E_1$ of $G_1$ and $e_b \in E_2$ of $G_2$, the edge affinity matrix element is $A^E_{a,b} = \Phi_E(e_a, e_b)$. Given the above definitions, the solution to matching $G_1$ and $G_2$ is equivalent to finding the correspondence matrix $C_{1,2} \in \{0, 1\}^{N_1 \times N_2}$ between the nodes of $G_1$ and $G_2$ given by

$$\arg\max_{C_{1,2}} \; \mathbb{1}_{C_{1,2}}^{\top} \, K \, \mathbb{1}_{C_{1,2}} \qquad (1)$$

where $\mathbb{1}_{C_{1,2}} \in \{0, 1\}^{N_1 \times N_2}$ is the characteristic function, and $K \in \mathbb{R}^{N_1 N_2 \times N_1 N_2}$ is a composite affinity matrix that combines the node affinity matrix $A^P_{1,2}$ and the edge affinity matrix $A^E_{1,2}$. The element $K((p_i p_j)_1, (p_k p_l)_2)$ for the nodes $p_i, p_j \in P_1$, $p_k, p_l \in P_2$, and the edges connecting these nodes, $e_a \in E_1$ and $e_b \in E_2$ respectively, is calculated as

$$K\big((p_i p_j)_1, (p_k p_l)_2\big) = \begin{cases} \Phi_P(p_i, p_k) & \text{if } p_i = p_j \text{ and } p_k = p_l, \\ \Phi_E(e_a, e_b) & \text{if } e_a \text{ connects } p_i, p_j \text{ and } e_b \text{ connects } p_k, p_l, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

An example is shown in Fig. 3. Intuitively, if the two nodes considered in each graph are co-located, i.e. there is no edge connecting them, then the element's value is the similarity of the function $\Phi_P(\cdot, \cdot)$ for the nodes. If the two nodes are different, i.e. there is an edge connecting them, then the element's value is the similarity of the function $\Phi_E(\cdot, \cdot)$ for the connecting edges; otherwise, it is set to 0.
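To make the composite affinity matrix concrete, the following sketch assembles K for two tiny graphs and evaluates the quadratic objective for two candidate correspondence matrices. It is a brute-force illustration under stated assumptions: the helper names, the row-major vectorisation of the correspondence matrix, and the toy affinity values are all hypothetical, and no factorization or path-following optimization is attempted.

```python
def composite_affinity(n1, n2, edges1, edges2, A_P, A_E):
    """Assemble K (size n1*n2 x n1*n2) from node and edge affinities.

    The pair "node i of G1 matched to node k of G2" is indexed as i * n2 + k.
    """
    size = n1 * n2
    K = [[0.0] * size for _ in range(size)]
    # Diagonal entries: co-located nodes, value is the node affinity.
    for i in range(n1):
        for k in range(n2):
            K[i * n2 + k][i * n2 + k] = A_P[i][k]
    # Off-diagonal entries: pairs of nodes joined by an edge in both graphs.
    for a, (i, j) in enumerate(edges1):
        for b, (k, l) in enumerate(edges2):
            # Undirected edges: consider both orientations in each graph.
            for u, v in ((i, j), (j, i)):
                for s, t in ((k, l), (l, k)):
                    K[u * n2 + s][v * n2 + t] = A_E[a][b]
    return K


def objective(C, K):
    """Quadratic score x^T K x, with x the row-major flattening of C."""
    x = [c for row in C for c in row]
    n = len(x)
    return sum(x[r] * K[r][c] * x[c] for r in range(n) for c in range(n))


# Hypothetical toy problem: two graphs with two nodes and one edge each.
A_P = [[1.0, 0.1], [0.2, 0.9]]   # node affinities (made-up values)
A_E = [[0.8]]                    # edge affinity (made-up value)
K = composite_affinity(2, 2, [(0, 1)], [(0, 1)], A_P, A_E)

identity = [[1, 0], [0, 1]]
swapped = [[0, 1], [1, 0]]
print(objective(identity, K), objective(swapped, K))
```

In this toy case the identity assignment scores higher than the swapped one because its node affinities are larger; the edge term contributes equally to both. Exhaustive evaluation like this is only feasible for very small graphs, which is why the text relies on an approximate solver.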
Figure 2 and Algorithm 1 summarize the steps of the proposed technique. HybridFlow is the refined flow resulting from the interpolation of the combined initial flows calculated from the sparse graph matches from superpixels and feature matches of pixels in small clusters, as explained below.

Methods
Perceptual grouping and feature matching. Feature descriptors encode discriminative information about a pixel and form the basis of the perceptual grouping and matching. We conduct experiments with three different feature descriptors: rootSIFT proposed in Ref. 28, DeepLab pretrained on ImageNet, and pretrained encoders with the same architecture as in Ref. 25. As discussed later in the experimental results and "Implementation details" section, the latter descriptor results in the best performance. Next, we cluster pixels based on their feature descriptors to replace the rigid structure of the pixel grid, as shown in Fig. 1b. Specifically, we classify each pixel according to the argmax of its N-dimensional feature descriptor and aggregate the pixels into clusters. Thus, a pixel $p$ is assigned a cluster index $i_p$ given by

$$i_p = \arg\max_{c \in \{1, \ldots, N\}} F_c(p) \qquad (3)$$

where $F_c(p)$ is the $c$-th component of the feature descriptor of $p$. Hence, this results in an arbitrary number of coarse-scale clusters in each image, matched according to their cluster indices. A cluster may be non-contiguous. Since the index is calculated from the feature descriptor as in Eq. (3), it specifies the class of the object and is used during graph matching to match clusters of the same class, as explained in the following section. Pixels contained in clusters with an area of less than 10,000 pixels are matched according to the similarity of their feature descriptors using the sum of squared differences (SSD) with a ratio test. Outliers in the initial matches are removed from subsequent processing using RANSAC, which finds a localized fundamental matrix per cluster.
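The argmax-based grouping and the area threshold described above can be sketched as follows. This is a minimal illustration: the function names, the dictionary-based image representation, and the toy descriptors are hypothetical, and assigning a cluster of exactly the threshold area to the large group is an assumption, since the text does not specify that case.

```python
from collections import defaultdict


def cluster_by_argmax(descriptors):
    """Group pixels by the index of the largest descriptor component.

    `descriptors` maps a pixel (row, col) to a list of N floats.
    Returns a dict: cluster index -> list of pixels (possibly non-contiguous).
    """
    clusters = defaultdict(list)
    for pixel, F in descriptors.items():
        index = max(range(len(F)), key=lambda c: F[c])
        clusters[index].append(pixel)
    return dict(clusters)


def split_by_area(clusters, min_area=10_000):
    """Separate small clusters (pixel-level matching) from large ones
    (further subdivided into superpixels)."""
    small = {i: px for i, px in clusters.items() if len(px) < min_area}
    large = {i: px for i, px in clusters.items() if len(px) >= min_area}
    return small, large


# Hypothetical 2 x 2 image with 3-dimensional descriptors.
descriptors = {
    (0, 0): [0.1, 0.7, 0.2],
    (0, 1): [0.2, 0.6, 0.2],
    (1, 0): [0.9, 0.05, 0.05],
    (1, 1): [0.3, 0.3, 0.4],
}
clusters = cluster_by_argmax(descriptors)
# A tiny threshold is used here only so that the toy example shows both groups.
small, large = split_by_area(clusters, min_area=2)
print(clusters, small, large)
```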
The initial sparse flow resulting from this step consists of the flow calculated from each of the inlier features. Figure 1f shows the initial flow resulting from the sparse feature matching of the pixels contained within all small clusters. The size of pixels is magnified by 10 × 10 for clarity in the visualization.
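The SSD matching with a ratio test used for the small clusters can be sketched as below. The ratio threshold of 0.8 and all names are assumptions made for illustration; the text does not state the threshold, and the RANSAC outlier removal is not included.

```python
def ssd(a, b):
    """Sum of squared differences between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbour in desc2.

    A match is kept only if the best SSD is clearly smaller than the
    second-best SSD (hypothetical threshold `ratio`).
    """
    matches = []
    for i, d in enumerate(desc1):
        scored = sorted((ssd(d, e), j) for j, e in enumerate(desc2))
        if len(scored) < 2:
            continue  # the ratio test needs two candidates
        (best, j), (second, _) = scored[0], scored[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


# Hypothetical descriptors: two distinctive features and one ambiguous feature.
desc1 = [[0.0, 0.0], [5.0, 5.0]]
desc2 = [[0.1, 0.0], [5.0, 5.1], [9.0, 9.0]]
matches = ratio_test_matches(desc1, desc2)
ambiguous = ratio_test_matches([[1.0, 1.0]], [[1.0, 0.0], [1.0, 2.0]])
print(matches, ambiguous)
```

The ambiguous feature is equally close to both candidates, so the ratio test rejects it; each surviving match yields one sparse flow vector as the difference of the matched pixel positions.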
Coarse-scale clusters with an area larger than 10,000 pixels are further clustered by simple linear iterative clustering (SLIC), which adapts k-means clustering to group pixels into perceptually meaningful atomic regions 29. The parameter $\kappa$ is calculated based on the image size and the desired superpixel size and is given by $\kappa = |I| / |s|$, where $|s| \approx 2223$, $s \in S$, and $|I|$ is the size of the image. This restricts the number of the approximately equally-sized superpixels $S$; in our experiments discussed in "Implementation details" section, the optimal value is $\kappa \approx 250$ to 300. For the finer-scale superpixels $S$, a graph is constructed where each node corresponds to a superpixel's centroid, and edges correspond to the resulting Delaunay triangulation, as explained in the following "Graph matching" section.
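As a worked illustration of the relation between image size, superpixel size and superpixel count, the sketch below evaluates κ for the 1024 × 436 Sintel resolution and the superpixel sizes quoted later in the implementation details, and computes superpixel centroids, which serve as graph nodes. The function names and the toy label map are hypothetical; the SLIC clustering itself and the Delaunay triangulation are not reproduced here.

```python
from collections import defaultdict


def num_superpixels(image_size, superpixel_size):
    """kappa = |I| / |s|, rounded to the nearest integer."""
    return round(image_size / superpixel_size)


def superpixel_centroids(labels):
    """Centroid (mean row, mean col) of each superpixel.

    `labels` maps a pixel (row, col) to its superpixel id.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (r, c), sp in labels.items():
        acc = sums[sp]
        acc[0] += r
        acc[1] += c
        acc[2] += 1
    return {sp: (a[0] / a[2], a[1] / a[2]) for sp, a in sums.items()}


image_size = 1024 * 436  # Sintel frame size in pixels
kappas = [num_superpixels(image_size, s) for s in (2232, 1116, 223)]
print(kappas)

# Hypothetical 2 x 2 label map with two superpixels (one per row).
labels = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}
centroids = superpixel_centroids(labels)
print(centroids)
```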
Graph matching. The two sets of superpixels contained in the matched coarse-scale clusters of images $I_1, I_2$ are represented with the graph model described in "Graph model and matching" section. For each superpixel $S$, the nodes $P$ are a subset of all the pixels $p$ in $S$, i.e. $P \subseteq \{p : \forall p \in S \in I\}$. The edges $E$ and topology $T$ of each graph are derived from a Delaunay triangulation of the nodes $P$. The graph is undirected, and the edge-weight function $w$ is symmetrical. The similarity functions $\Phi_P(\cdot, \cdot)$ and $\Phi_E(\cdot, \cdot)$ are also symmetrical; for $p_i, p_j \in P_1$, $p_k, p_l \in P_2$, and edges $e_a \in E_1$, $e_b \in E_2$, the similarity functions are given by Eqs. (4)-(8), where:
• $f : P \to \mathbb{R}^{S}$ is a feature descriptor with cardinality $S$ for a node $p \in P$;
• $C : P \to \mathbb{R}^{6}$ is a function which calculates the 6-vector $\langle \mu_r, \mu_g, \mu_b, \sigma_r, \sigma_g, \sigma_b \rangle$ containing colour distribution means and variances $(\mu, \sigma)$ at $p$, modeled as a 1D Gaussian for each colour channel;
• $d_P : \mathbb{R}^{S} \times \mathbb{R}^{S} \to \mathbb{R}$ is the $L_1$-norm of the difference between the feature descriptors of two nodes in $p_i, p_j, p_k, p_l \in P$;
• $d_E : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is the difference between the angles $\theta_{e_a}, \theta_{e_b}$ of the two edges $e_a \in E_1$, $e_b \in E_2$ to the horizontal axis;
• $d_C : \mathbb{R}^{6} \times \mathbb{R}^{6} \to \mathbb{R}$ is the $L_1$-norm of the difference between the two 6-vectors containing colour distribution information for the two nodes in $p_i, p_j, p_k, p_l \in P$.
$\Phi^{1}_{*}$ signifies first-order similarities, which measure similarities between the nodes and edges of the two graphs. In addition to the first-order similarities $\Phi^{1}_{*}$, the functions in the above equations define additional second-order similarities $\Phi^{2}_{*}$, which have been shown to improve the performance of the matching 30. That is, instead of using only similarity functions that result in small differences between similar gradients/colours and large differences otherwise, i.e. first-order, we additionally incorporate the second-order similarities defined above, which measure the similarity between the two gradients and colours using the distance between their differences 31. For example, the first-order similarity $\Phi^{1}_{gradient}$ calculates the distance between the two feature descriptors in the two graphs, i.e. $\Phi_P(p_i, p_k)$ in Eq. (4), whereas the second-order similarity calculates the distance between the feature descriptor differences of the end-points in each graph, i.e. $\Phi^{2}_{gradient}$ and $\Phi^{2}_{color}$ in Eqs. (4) and (8). A descriptor $f(s_i)$, as defined in Eq. (6), is calculated for each centroid-node representing superpixel $s_i \in S$ as the average of the feature descriptors of all pixels contained within it, $f(s_i) = \frac{1}{|s_i|} \sum_{\forall p \in s_i \subset I} \phi_p$, where $|s_i|$ is the number of pixels in superpixel $s_i$, and $\phi_p$ is the feature descriptor of pixel $p \in s_i \subset I$.
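The difference between first-order and second-order distances, and the averaging that produces a superpixel descriptor, can be illustrated with a small sketch. Only the underlying $L_1$ distances are shown; how these distances are weighted and turned into similarity values is not recoverable from the text, so no such mapping is attempted. All names and descriptor values are hypothetical.

```python
def mean_descriptor(pixel_descriptors):
    """Superpixel descriptor: average of the descriptors of its pixels."""
    n = len(pixel_descriptors)
    dim = len(pixel_descriptors[0])
    return [sum(d[c] for d in pixel_descriptors) / n for c in range(dim)]


def l1(a, b):
    """L1-norm of the difference between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def first_order_distance(f_i, f_k):
    """Distance between a node descriptor in graph 1 and one in graph 2."""
    return l1(f_i, f_k)


def second_order_distance(f_i, f_j, f_k, f_l):
    """Distance between the descriptor differences of the edge end-points
    (p_i, p_j) in graph 1 and (p_k, p_l) in graph 2."""
    diff1 = [x - y for x, y in zip(f_i, f_j)]
    diff2 = [x - y for x, y in zip(f_k, f_l)]
    return l1(diff1, diff2)


# Hypothetical descriptors: graph 2 equals graph 1 plus a constant offset of 0.5,
# e.g. a uniform change in appearance between the two images.
f_i, f_j = [1.0, 2.0], [3.0, 1.0]
f_k, f_l = [1.5, 2.5], [3.5, 1.5]

d1 = first_order_distance(f_i, f_k)
d2 = second_order_distance(f_i, f_j, f_k, f_l)
avg = mean_descriptor([[1.0, 2.0], [3.0, 4.0]])
print(d1, d2, avg)
```

In this toy case the first-order distance is non-zero because of the offset, while the second-order distance vanishes because the differences between end-points are preserved.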
Given the above function definitions, graph matching is solved by maximizing Eq. (1) using a path-following algorithm. K is factorized into a Kronecker product of six smaller matrices which ensures tractable computational complexity on graphs with nodes N, M ≈ 300 32 . Furthermore, robustness to geometric transformations such as rotation and scale is increased by finding an optimal transformation at the same time as finding the optimal correspondences and thus enforcing global rigid (e.g. similarity, affine) and non-rigid geometric constraints during the optimization 33 .
The result is a set of superpixel matches within the matched coarse-scale clusters. Assuming a piecewise rigid motion, we use RANSAC to remove outliers from the superpixel matches. For each superpixel $s$ having at least three matched neighbours, we fit an affine transformation. We only check whether the superpixel $s$ is an outlier, in which case it is removed from further processing. This process is repeated for all small clusters and graph-matched superpixels. We proceed by matching the pixels contained within the matched superpixels based on their feature descriptors. As in "Perceptual grouping and feature matching" section, we remove outlier pixel matches contained in the superpixels using RANSAC to find a localized fundamental matrix.
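The affine consistency check on superpixel matches can be sketched in a simplified form. The sketch below fits an exact affine transformation to exactly three matched neighbours and tests whether the superpixel's own match agrees with it; the RANSAC sampling over larger neighbour sets is omitted, and the tolerance of 3 pixels, the function names, and the toy coordinates are assumptions.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, b):
    """Solve the 3 x 3 linear system A x = b with Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("neighbours are collinear; affine fit is undefined")
    solution = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = b[r]
        solution.append(det3(M) / d)
    return solution


def fit_affine(src, dst):
    """Affine map from three source centroids to their three matches.

    Returns ((a, b, c), (d, e, f)) with x' = a x + b y + c, y' = d x + e y + f.
    """
    A = [[x, y, 1.0] for x, y in src]
    row_x = solve3(A, [p[0] for p in dst])
    row_y = solve3(A, [p[1] for p in dst])
    return tuple(row_x), tuple(row_y)


def is_outlier(src_pt, dst_pt, affine, tol=3.0):
    """True if the match disagrees with the neighbours' affine motion."""
    (a, b, c), (d, e, f) = affine
    px = a * src_pt[0] + b * src_pt[1] + c
    py = d * src_pt[0] + e * src_pt[1] + f
    error = ((px - dst_pt[0]) ** 2 + (py - dst_pt[1]) ** 2) ** 0.5
    return error > tol


# Hypothetical neighbours that all translate by (5, 2).
neighbours_src = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
neighbours_dst = [(5.0, 2.0), (15.0, 2.0), (5.0, 12.0)]
affine = fit_affine(neighbours_src, neighbours_dst)

consistent = is_outlier((4.0, 4.0), (9.0, 6.0), affine)
inconsistent = is_outlier((4.0, 4.0), (30.0, 30.0), affine)
print(affine, consistent, inconsistent)
```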
The initial sparse flow resulting from graph matching consists of flow calculated from every pixel contained in the matched superpixels. Figure 1b shows the result of the clustering of the feature descriptors for the image shown in Fig. 1a. Clusters having a large area are further divided into superpixels. The graph nodes correspond to each superpixel's centroid, and the edges result from the Delaunay triangulation of the nodes, as explained above.

Interpolation and refinement. The combined initial sparse flows (Fig. 1e,f) calculated from sparse feature matching and graph matching, as described above in "Perceptual grouping and feature matching" and "Graph matching" sections respectively, are first interpolated and then refined. For the interpolation, we apply an edge-preserving technique 10. This results in dense flow as shown in Fig. 1g. In the final step, we refine the interpolated flow using variational optimization on the full scale of the initial flows, i.e. no coarse-to-fine scheme, with the same data and smoothness terms as used in Ref. 10. The final result is shown in Fig. 1h.

Experimental results
In this section, we report on the evaluation of HybridFlow on benchmark datasets and compare it with state-of-the-art variational optical flow techniques. In "Application: large-scale 3D reconstruction" section, we present two applications of the proposed technique on large-scale image-based reconstruction where ground truth is unavailable. Specifically, we use large-scale aerial imagery and airborne Full-Motion Video (FMV).

Datasets and evaluation metrics. We evaluate HybridFlow on the two widely used benchmark datasets for motion estimation:
• MPI-Sintel 13: a synthetic data set for the evaluation of optical flow derived from the open source 3D animated short film, Sintel. It includes image sequences with large displacements, motion blur, and non-rigid motion.
• KITTI-2015 14: a real data set captured with an autonomous driving platform. It contains dynamic scenes of real world conditions and features large displacements and complex 3D objects.
The quantitative evaluation is performed in terms of the average endpoint error (EPE) for MPI-Sintel, and the percentage of optical flow outliers (FI) for KITTI-2015.
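The two metrics can be sketched as follows. The average endpoint error is the mean Euclidean distance between estimated and ground-truth flow vectors. For the outlier percentage, the sketch assumes the commonly used KITTI-2015 convention, which the text does not restate: a pixel counts as an outlier when its endpoint error exceeds both 3 pixels and 5% of the ground-truth flow magnitude. Function names and the toy flow values are hypothetical.

```python
import math


def endpoint_errors(flow_est, flow_gt):
    """Per-pixel Euclidean distance between estimated and true flow vectors."""
    return [math.hypot(u - ug, v - vg)
            for (u, v), (ug, vg) in zip(flow_est, flow_gt)]


def average_epe(flow_est, flow_gt):
    """Average endpoint error over all evaluated pixels."""
    errors = endpoint_errors(flow_est, flow_gt)
    return sum(errors) / len(errors)


def outlier_percentage(flow_est, flow_gt, abs_tol=3.0, rel_tol=0.05):
    """Percentage of pixels whose error exceeds both tolerances
    (assumed KITTI-2015 convention)."""
    errors = endpoint_errors(flow_est, flow_gt)
    bad = 0
    for err, (ug, vg) in zip(errors, flow_gt):
        if err > abs_tol and err > rel_tol * math.hypot(ug, vg):
            bad += 1
    return 100.0 * bad / len(errors)


# Hypothetical flows for four pixels, given as (u, v) vectors.
flow_est = [(1.0, 0.0), (10.0, 0.0), (0.0, 0.0), (100.0, 0.0)]
flow_gt = [(1.0, 0.0), (6.0, 0.0), (0.0, 4.0), (104.0, 0.0)]

epe = average_epe(flow_est, flow_gt)
fl = outlier_percentage(flow_est, flow_gt)
print(epe, fl)
```

The last pixel has an error of 4 pixels but a ground-truth magnitude of 104 pixels, so under the assumed convention it falls within the 5% relative tolerance and is not counted as an outlier.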
Implementation details. The proposed approach was implemented by Q. Chen in Python. All experiments were run on a workstation with an Intel i7 processor. We extract the feature descriptors using the approach introduced in RAFT 25. Perceptual grouping using SLIC superpixels is performed using the method in Ref. 29. We factorize graphs into Kronecker products as presented in Ref. 32 and perform deformable graph matching following the approach in Ref. 33. Finally, we interpolate the combined initial flows from sparse feature matching and graph matching using the edge-preserving interpolation and variational refinement in EpicFlow 10.
Superpixel size. We empirically determined the optimal size of the superpixels, which subsequently determined the number of superpixels $\kappa$ as defined in "Perceptual grouping and feature matching" section.

Initial coarse-scale clustering. The initial coarse-scale clusters are formed by clustering the pixels' feature descriptors. This is a crucial part of the process, which increases robustness to large displacements. As shown in Fig. 4c, using SLIC superpixels on the entire image results in a near-rigid rectangular pixel grid and consequently failures in graph matching. This is evident from the mismatching of the dark red circles in the middle of the right image. Our experiments show that an irregular pixel grid based on feature descriptors increases the robustness in the presence of large displacements and deformations.

Figure 4. … Figures (a,b)), $|s| = 2232$ (200 superpixels, Figures (c,d)), $|s| = 1116$ (5 clusters subdivided into 80 superpixels, Figures (e,f)) and $|s| = 223$ (5 clusters subdivided into 400 superpixels, Figures (g,h)), respectively. The figures show the colour-coded graph node matches using only graph matching as explained in "Graph matching" section.

On real data (KITTI-2015).

Failure cases. Graph matching is robust to texture variations, illumination variations, and deformations. However, erroneous matches can be introduced when large occluded areas fall inside the convex graph, as shown in the example in Fig. 4c. Mismatches in the graph matching can lead to the wrong matching of the finer-scale superpixels, and consequently, significant errors in the optical flow. This is clearly evident from the results in Table 1 for Sintel and KITTI-2015, where for the non-occluded test sets, HybridFlow outperforms all state-of-the-art variational methods and matches the performance of deep-learning techniques such as ScopeFlow.

Application: large-scale 3D reconstruction
The motivation for our work is large-scale 3D reconstruction from airborne images. In particular, we focus on full-motion video (FMV) and large-scale aerial imagery, typically captured by a UAV/helicopter and an airplane, respectively. Deep learning techniques are not applicable since they have a fixed input size. Thus, a very high-resolution image must be scaled down to typically less than 1K × 1K to be used as input to the network. This significant reduction in resolution leads to low-resolution optical flow and significantly lower-fidelity 3D models.
Most notably, there is no ground truth dataset for real scenarios to train the deep learning models. On the other hand, the state-of-the-art variational methods considered in this work also impose restrictions on the input image size. For example, RicFlow and EpicFlow use a hierarchical structure employed by DeepMatching, which on an 8GB GPU can only handle 1K × 1K resolutions. HybridFlow can handle arbitrary-sized resolutions with a low memory footprint. In this section, we present the results of the application of HybridFlow on the use case of large-scale 3D reconstruction from airborne images. We reiterate that there is no ground truth data for training models in such scenarios, and the resolutions can be significantly higher than 1K × 1K.

Table 1. For MPI-Sintel results, EPE-noc is the EPE on the non-occluded areas, and EPE-occ is the EPE on occluded areas. s0-10 is the EPE for pixels whose motion speed is between 0 and 10 pixels, similarly for s10-40 and s40+; d0-10 is the endpoint error over regions between 0 and 10 pixels apart from the nearest occlusion boundary, similarly for d10-60 and d60-140. For the KITTI-2015 test-set non-occluded pixels, FI-bg is the percentage of optical flow outliers for the background, FI-fg is the percentage of optical flow outliers for the foreground, and FI-all/Est is the percentage of outliers averaged over all non-occluded ground truth pixels. Best performance shown in bold.

In contrast, we reformulate the reconstruction as a single-step process. Using HybridFlow allows us to triangulate the dense matches directly, without MVS as a post-processing step, therefore achieving faster reconstructions. We design a specialized off-memory, on-disk data structure for storing the matches. As shown in Fig. 7, at every image, we keep a tensor with layers containing pixel-level matches to subsequent images based on the HybridFlow.
Unmatched pixels in the second image are stored in the tensor data structure for the second image, which contains layers with pixel-level matches to the third image and onwards. The data structure can scale up dynamically to arbitrary-sized datasets (subject to the disk limits) and allows for efficient outlier removal and validation, i.e. multiple pixels in the same image cannot be matched to the same pixel in the following image. A simple look-up at a fiber of the tensor gives the matches for that pixel in all subsequent images. Hence, reconstruction is reduced to traversing all fibers in each tensor and triangulating to get a 3D position.
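The layered match structure and the fiber look-up described above can be sketched in memory as follows. This is a simplified illustration, not the on-disk structure used in the paper: the class name `MatchTensor`, the dictionary-based layers, and the policy of discarding every match that collides on the same target pixel are assumptions.

```python
from collections import Counter


class MatchTensor:
    """Matches from one image to each subsequent image, one layer per image.

    Layer k maps a pixel of this image to its match in image
    `image_index + 1 + k`. In-memory stand-in for the on-disk structure.
    """

    def __init__(self, image_index):
        self.image_index = image_index
        self.layers = []

    def add_layer(self, matches):
        """Append a layer, dropping many-to-one matches.

        Assumed validation policy: if several pixels map to the same target
        pixel, all of those matches are discarded.
        """
        hits = Counter(matches.values())
        layer = {src: dst for src, dst in matches.items() if hits[dst] == 1}
        self.layers.append(layer)
        return layer

    def fiber(self, pixel):
        """All matches of `pixel` in subsequent images, as (image, pixel) pairs."""
        track = []
        for k, layer in enumerate(self.layers):
            if pixel in layer:
                track.append((self.image_index + 1 + k, layer[pixel]))
        return track


# Hypothetical matches from image 0 to images 1 and 2.
tensor0 = MatchTensor(0)
kept = tensor0.add_layer({(0, 0): (1, 1), (0, 1): (1, 1), (2, 2): (3, 3)})
tensor0.add_layer({(2, 2): (4, 4)})

track = tensor0.fiber((2, 2))
print(kept, track)
```

Each non-empty fiber is one multi-view track; triangulating the pixel together with its matches along the fiber would give one 3D point.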
We demonstrate the effectiveness of HybridFlow on large-scale reconstruction from images and present results on two different types of datasets: full-motion video and large-scale aerial imagery. We followed the single-step process described above, employing the dynamic tensor-shaped data structure for the efficient processing of the matches calculated by HybridFlow.
Full-motion video. Full-motion video (FMV) is typically captured by a helicopter at an oblique aerial angle so that the rooftops and the facades of the buildings are visible in the images. The ground sampling density is significantly higher than that of a satellite image, i.e. on the order of a few centimetres, and can vary according to the aircraft's flight height, depending on the area it is flying over.
We ran experiments on a full-motion video dataset containing images taken from a helicopter circling an area containing a few mockup buildings. Our test dataset contains 71 images with a resolution of 1280 × 720 and unknown camera calibrations or EXIF information. We report results using (i) the single-step reconstruction using HybridFlow matches, (ii) the same single-step reconstruction using EpicFlow matches, and (iii) the state-of-the-art incremental SfM techniques Bundler 38, VisualSFM 39, and COLMAP 40.
Perhaps the most popular feature extraction method used in SfM is SIFT 41. COLMAP 40 uses a modified version called RootSIFT 28 for extracting and matching features in each image. The first comparison focuses on the density of the matches. Figure 8c shows the SIFT matches, Fig. 8d the RootSIFT matches, Fig. 8e the EpicFlow matches, and Fig. 8f the HybridFlow matches for the input images shown in Fig. 8a,b. The latter two show the matches as colour-coded optical flows for visualization clarity, since drawing the individual matches would cover the entire image. Table 2 presents the total number of matches per technique. As expected, SIFT and RootSIFT have the lowest number of matches since they only extract scale-space extrema. On the other hand, the dense optical flow technique EpicFlow yields eight times fewer matches than HybridFlow.
The reconstruction can serve as a proxy for the accuracy of the matches in cases where ground truth is not available. We proceed with the evaluation of the reconstruction in terms of the reprojection error. Figure 9 shows the reconstructed point cloud of (a) COLMAP's sparse (SfM) reconstruction, (b) COLMAP's dense (MVS) reconstruction, (c) our single-step reconstruction using HybridFlow matches, and (d) our single-step reconstruction using EpicFlow matches. The reconstructed point clouds are rendered from the same viewpoint and with the same camera intrinsics. Our single-step method with HybridFlow achieves the highest number of reconstructed points in the lowest time per point, while the reprojection error is comparable with COLMAP for almost 60× more points.
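The reprojection error used as the evaluation proxy above can be sketched with a simple pinhole camera. The sketch assumes that the 3D point is already expressed in the camera's coordinate frame (no rotation or translation) and uses made-up intrinsics for a 1280 × 720 image; function names and values are hypothetical.

```python
import math


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)


def reprojection_error(point, observation, fx, fy, cx, cy):
    """Pixel distance between the projected point and its observed match."""
    u, v = project(point, fx, fy, cx, cy)
    return math.hypot(u - observation[0], v - observation[1])


def mean_reprojection_error(points, observations, fx, fy, cx, cy):
    """Average reprojection error over all reconstructed points."""
    errors = [reprojection_error(p, o, fx, fy, cx, cy)
              for p, o in zip(points, observations)]
    return sum(errors) / len(errors)


# Hypothetical intrinsics for a 1280 x 720 image.
fx = fy = 1000.0
cx, cy = 640.0, 360.0

points = [(0.1, -0.05, 2.0), (0.0, 0.0, 5.0)]
observations = [(693.0, 339.0), (640.0, 360.0)]

single = reprojection_error(points[0], observations[0], fx, fy, cx, cy)
mean_err = mean_reprojection_error(points, observations, fx, fy, cx, cy)
print(single, mean_err)
```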

Figure 7.
On-disk dynamic tensor-shaped data structure. For each image, we store a tensor with layers containing pixel-level matches to subsequent images based on the HybridFlow. Unmatched pixels in the second image are stored in the tensor data structure for the second image, which contains layers with pixel-level matches to the third image and onward. A fiber is shown in blue. Each cell contains the match of that pixel, i.e. the top right corner in all subsequent images. Reconstruction is reduced to triangulating the matches contained within each fiber.

Large-scale aerial imagery. Figure 10a shows an example of large-scale aerial imagery capturing a downtown urban area. The resolution of 6600 × 4400 is considered average amongst large-scale aerial imagery, since some of the larger resolutions can reach sizes of up to 14,000 × 12,000. Deep learning techniques can be applied only by (i) rescaling the image to the fixed input size expected by the neural network, or (ii) tiling the image, calculating flows per tile, and then merging the results. In the first case, rescaling reduces the resolution and subsequently the final number of reconstructed points. Furthermore, essential details such as cars and trees are completely removed. In the latter case, there is no one-to-one mapping between tiles. For example, a tile may contain areas appearing in two or more different tiles in the second image. Furthermore, the deep optical flow techniques always return a match for every pixel. That means that even if an area is not present in a tile, it will nevertheless be matched to another area in the second image. For these reasons, deep learning techniques cannot be applied in these use cases.
Competing variational methods such as RicFlow 12 and EpicFlow 10 cannot be applied either, since the hierarchical structure employed by DeepMatching 17 can only handle 1K × 1K resolutions on an 8GB GPU. In contrast, HybridFlow is the only top-performing variational method that can handle arbitrary-sized images such as large-scale aerial imagery. Figure 10a,b shows two consecutive images capturing a downtown urban area, each having a resolution of 6600 × 4400. HybridFlow handles images of this size, as shown in Fig. 10c, whereas deep learning techniques and the competing state-of-the-art variational methods cannot be applied for the reasons explained above. Figure 10d shows the resampled image from Fig. 10b using the HybridFlow matches in Fig. 10c and the matched pixels in Fig. 10a. Figure 10e shows a render of the reconstructed point cloud for the downtown urban area generated using 320 images of the same size.

Figure 9. The reconstruction serves as a proxy for the accuracy of the matches. We calculate and compare reprojection errors for the techniques shown in Table 2. (a) Shows COLMAP's sparse (SfM) reconstruction, (b) shows COLMAP's dense (MVS) reconstruction 40, (c) shows our single-step reconstruction using dense matches from EpicFlow 10, and (d) shows our single-step reconstruction with HybridFlow. HybridFlow produces 60× more matches than COLMAP and 47× more matches than EpicFlow. The reprojection error is comparable with COLMAP (for 60× more points) while the runtime is less than half.

Conclusion
We addressed the problem of large displacement optical flow and presented a hybrid approach based on sparse feature matching using feature descriptors and graph matching, named HybridFlow. In contrast to the state of the art, it does not require training, and the use of sparse feature matching is robust and can scale up to arbitrary image sizes. This makes our technique applicable in use cases such as reconstruction or object tracking where ground truth is unavailable and processing must be performed in interactive time. We match initial coarse-scale clusters based on a clustering of context features. We employ graph matching to match perceptual groups clustered using SLIC superpixels within each initial coarse-scale cluster, and perform pixel matching on smaller clusters. Based on the combined feature matches and the graph-node matches, we calculate the initial flow, which is interpolated using an edge-preserving interpolation and refined using variational refinement. The proposed technique has been evaluated on two benchmark datasets (Sintel, KITTI), and we compared it with the current state-of-the-art variational optical flow techniques. We show that HybridFlow surpasses all other state-of-the-art variational methods in non-occluded test sets. Specifically, for Sintel, HybridFlow has the lowest overall EPE, while for KITTI, it gives comparable results.