Multi-Scale Warping for Video Frame Interpolation

A novel video interpolation network to improve the temporal resolutions of video sequences is proposed in this work. We develop a multi-scale warping module to interpolate intermediate frames robustly for both small and large motions. Specifically, the proposed multi-scale warping module deals with large motions between two consecutive frames using coarse-scale features, while estimating detailed local motions by exploring fine-scale features. To this end, it takes multi-scale features from the encoder and estimates kernel weights and offset vectors for each scale. Finally, it synthesizes multi-scale warping frames and combines them to obtain an intermediate frame. Extensive experimental results demonstrate that the proposed algorithm outperforms state-of-the-art video interpolation algorithms on various benchmark datasets.


I. INTRODUCTION
The objective of video frame interpolation is to synthesize intermediate frames between two consecutive video frames. Video sequences with low temporal resolutions suffer from blur artifacts, temporal jittering, and motion aliasing, which degrade the visual experience. Thus, video frame interpolation is essential for generating visually pleasant videos with high frame rates. It can be applied to various fields, such as frame rate up-conversion [1], [2], slow motion generation [3], frame recovery in video streaming [4], [5], and novel view interpolation [6]. Many attempts have been made to interpolate intermediate frames, but it is still difficult to generate high-quality intermediate frames due to challenging factors such as occlusions and fast motions.
Most video frame interpolation algorithms include two processes: motion estimation and frame interpolation. They estimate motions to determine corresponding pixel positions in input frames and then interpolate an intermediate frame based on the correspondence matching information. In this regard, accurate motion estimation is required to obtain visually plausible intermediate frames.
(The associate editor coordinating the review of this manuscript and approving it for publication was Gangyi Jiang.)
Recently, flow-based video interpolation methods [7]-[10] have employed deep-learning-based optical flow techniques [11], [12] to yield reliable intermediate frames. However, these flow-based methods are vulnerable to optical flow errors. Moreover, they require a large number of network parameters and additional training data to learn the optical flow networks.
Attempts have been made to obtain intermediate frames without explicit optical flow information. For example, some video frame interpolation algorithms [13], [14] derive pixel-wise adaptive convolution filters to perform the interpolation. Since they do not estimate motion vectors explicitly, they need large kernel sizes (e.g., 51 × 51) to cover large reference regions in case of large motions. Thus, obtaining the pixel-wise weights of these large kernels requires considerable memory and computation time. Also, these algorithms may fail to find matching positions properly when motions are larger than the kernel size. Inspired by deformable convolution [15], recent video frame interpolation algorithms [16], [17] estimate kernel weights and offsets simultaneously. They use the offsets as pseudo-motions to determine matching locations, so they can reduce the kernel size and exploit motion information without additional optical flow networks. However, these algorithms still have limitations on large motions, since they extract only fine-scale features to obtain kernel weights and offsets.
In this paper, we propose a novel video interpolation network, which consists of a multi-scale warping module based on deformable convolution. The proposed algorithm extracts coarse-scale features, as well as fine-scale features, to obtain multi-scale kernels and offsets for video frame interpolation. Then, the multi-scale warping module interpolates an intermediate frame between two input frames by performing deformable convolution based on those multi-scale kernels and offsets. The proposed multi-scale warping module deals with large motions between two consecutive frames using coarse-scale features while estimating detailed local motions by exploring fine-scale features. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art video interpolation algorithms on various benchmark datasets. This paper is organized as follows: Section II reviews related work. Section III describes the proposed algorithm, and Section IV discusses its experimental results. Finally, Section V concludes this paper.

II. RELATED WORKS
Many video frame interpolation algorithms use motion information to determine reference pixels for interpolation in two consecutive frames. In the past, conventional motion-compensated frame interpolation algorithms focused on estimating accurate motion vectors between two consecutive frames and then interpolated an intermediate frame by halving those motion vectors. To obtain reliable motion vectors for frame interpolation, motion vector refinement algorithms [23]-[28] have been developed. Choi et al. [23] used adaptive block sizes for motion estimation to represent complex motions around object boundaries. Huang and Nguyen [24] proposed a multistage motion vector refinement algorithm, which analyzes the distribution of residual energies to update unreliable motion vectors with varying block shapes and sizes. Jacobson et al. [25] used saliency and segmentation to refine motion vectors. Jeong et al. [26] formed multiple motion hypotheses for each pixel using various parameters such as block sizes and directions. They then solved a labeling problem to determine optimal parameters. Zhang et al. [27] modeled pixel intensities across neighboring frames as a differentiable function and developed a motion estimation scheme based on polynomial approximation. Choi et al. [28] proposed a MAP-based motion vector field refinement algorithm, which iteratively updates the motion vector of each block.
With the advance of deep-learning-based optical flow techniques, some flow-based video frame interpolation methods [7]-[9] employ existing optical flow techniques [11], [12] to warp two consecutive frames to generate intermediate frames. Niklaus and Liu [7] estimated bi-directional optical flow using PWC-Net [12] and performed the forward warping using the estimated optical flow to generate initial intermediate frames. MEMC-Net [8] proposed an adaptive warping layer, which uses optical flow [11] as offsets to decide matching positions. DAIN [9] refined optical flow [12] using depth information [29] to deal with occlusions. Niklaus and Liu [10] proposed softmax splatting to forward-warp frames based on the off-the-shelf optical flow method [12]. However, these flow-based methods are prone to optical flow errors, and some of them require additional training data for optical flow estimation and additional training time.
Instead of using off-the-shelf optical flow methods, some flow-based algorithms design end-to-end video frame interpolation networks, which extract motion information and perform motion-based frame warping jointly. DVF [30] is an encoder-decoder network to predict 3D optical flow across space and time in a video sequence. Liu et al. [31] proposed a cycle consistency loss for intermediate frame warping and improved the performance of DVF. Jiang et al. [3] developed an end-to-end convolutional neural network, which estimates bi-directional motions to interpolate intermediate frames. Also, BMBC [32] consists of a bilateral motion network to estimate intermediate motions and a dynamic filter generation network to obtain intermediate frames.
Instead of using motion information explicitly, the kernel-based approach interpolates intermediate frames by convolving input images with learnable kernel weights. Ada-Conv [14] uses pixel-wise convolution filters of a large size for frame interpolation, but it requires a large number of parameters to provide those pixel-wise coefficients. To reduce the memory usage, SepConv [13] performs separable convolution by dividing each 2D kernel into 1D horizontal and 1D vertical kernels. However, these algorithms [13], [14] cannot deal with motions larger than kernel sizes.
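The memory saving of separable kernels can be illustrated with a minimal NumPy sketch. It assumes that the 2D kernel at a pixel is represented as the outer product of a vertical and a horizontal 1D kernel; the 51-tap size follows the example in Section I, and all variable names and the random values are illustrative only, not taken from [13].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 51  # kernel size, following the 51 x 51 example in Section I

# One pair of 1D kernels for a single output pixel (random values, illustration only).
k_vertical = rng.standard_normal(n)
k_horizontal = rng.standard_normal(n)

# The implied 2D kernel is the outer product of the two 1D kernels.
kernel_2d = np.outer(k_vertical, k_horizontal)

# Applying the separable pair to a patch gives the same response as the full 2D kernel.
patch = rng.standard_normal((n, n))
full_response = np.sum(kernel_2d * patch)
separable_response = k_vertical @ patch @ k_horizontal

coeffs_full = n * n       # coefficients per pixel for a full 2D kernel
coeffs_separable = 2 * n  # coefficients per pixel for the separable pair
```

Under this assumption, the number of coefficients to predict per pixel drops from 2601 to 102, but the receptive field, and hence the largest motion that can be handled, is still bounded by the kernel size.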
Recently, attempts have been made to estimate kernel weights and reference positions simultaneously [16], [17], [33]. Peleg et al. [33] formulated the motion estimation as a classification problem and performed convolutions with trained kernel weights based on the classified motions. Inspired by deformable convolution [15], DSepConv [16] and AdaCoF [17] adopt convolutional neural networks to produce offsets to decide reference positions in input images. With the estimated offsets, they can provide reliable interpolation results using a small kernel size.

III. PROPOSED ALGORITHM
Figure 1 shows the structure of the proposed network for video frame interpolation. The proposed network takes two successive video frames $I_t$ and $I_{t+1}$, where $t$ is a frame index, and produces an intermediate frame $\tilde{I}_{out}$. The feature extractor yields multi-scale features, and the two multi-scale warping modules generate warped results from $I_t$ and $I_{t+1}$, respectively. Finally, the network combines the two warped results, $\tilde{I}_t$ and $\tilde{I}_{t+1}$, based on pixel-wise learnable weights to obtain the intermediate frame $\tilde{I}_{out}$.

1) FEATURE EXTRACTOR
The feature extractor takes a stack of $I_t$ and $I_{t+1}$ and produces multi-scale features for video frame interpolation. As shown in Figure 1, we design the feature extractor based on the U-net architecture [34], which is composed of an encoder, a decoder, and skip connections. The encoder contains six convolution blocks, each of which includes three sets of a 3 × 3 convolution layer with the ReLU activation. Also, each convolution block except the first one performs average pooling to extract high-level features. From the output of the encoder, the decoder yields multi-scale features for video interpolation. In the decoder, there are four sets of an up-sample block and a convolution block. Each up-sample block contains an up-sample layer with factor 2 and a 3 × 3 convolution layer with the ReLU activation. Then, from the four up-sample blocks, we extract multi-scale features $\mathcal{F} = \{F_1, \ldots, F_4\}$, where $F_l$ is the output feature of the $l$th up-sample block. The specification of the multi-scale feature extractor is summarized in Table 1.
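As a rough illustration of the scales involved, the following sketch computes the spatial resolution at the output of each up-sample block. It assumes that every average pooling halves the spatial resolution, which is not stated explicitly above; Table 1 remains the authoritative specification, and the function name is hypothetical.

```python
def feature_resolutions(height, width, num_encoder_blocks=6, num_upsample_blocks=4):
    """Return the (height, width) of each decoder feature, from coarsest to finest.

    Assumption: every encoder block except the first halves the resolution
    (stride-2 average pooling), and every up-sample block doubles it.
    """
    num_poolings = num_encoder_blocks - 1
    h = height // 2 ** num_poolings
    w = width // 2 ** num_poolings
    resolutions = []
    for _ in range(num_upsample_blocks):
        h, w = h * 2, w * 2
        resolutions.append((h, w))
    return resolutions


# Example with the 256 x 256 training crop size.
scales = feature_resolutions(256, 256)
```

Under these assumptions, the first up-sample block yields the coarsest feature and the fourth yields the finest one, which is still below the input resolution.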

2) MULTI-SCALE WARPING MODULE
The multi-scale warping module warps an input image to obtain an intermediate frame. We formulate the warping process as the convolution with kernel weights. For reliable warping, motion information to determine reference (or matching) positions in the input image is required. Therefore, the proposed multi-scale warping module performs deformable convolution [15] to convolve the input image with kernel weights on reference positions. Figure 2 shows the structure of the proposed multi-scale warping module, which contains 12 sub-networks to yield multi-scale kernel weights and offsets for the image warping. Each feature in $\mathcal{F}$ is fed into three sub-networks, which generate kernel weights, horizontal offsets, and vertical offsets, respectively. Using all multi-scale features in $\mathcal{F} = \{F_1, \ldots, F_4\}$, we obtain four sets of kernel weights, horizontal offsets, and vertical offsets. The offsets estimated at coarse scales can cover large motions, whereas small but detailed motions can be estimated at fine scales.
More specifically, for each feature $F_l$ at the $l$th scale, three convolution blocks produce kernel weights $K^l \in \mathbb{R}^{H_l \times W_l \times k^2}$, horizontal offsets $U^l \in \mathbb{R}^{H_l \times W_l \times k^2}$, and vertical offsets $V^l \in \mathbb{R}^{H_l \times W_l \times k^2}$, respectively. Here, $H_l \times W_l$ is the spatial resolution of the $l$th scale, and the kernel size $k$ is set to 5 in this work. Then, $K^l$, $U^l$, and $V^l$ are up-sampled to have the same spatial resolution $H \times W$ as an input image via the up-sample blocks.
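Let us consider the warping process of an input image $I$ to compute the interpolation result at pixel $(x, y)$ using $K^l$, $U^l$, and $V^l$. Specifically, let $K^l_{xy} \in \mathbb{R}^{k \times k}$, $U^l_{xy} \in \mathbb{R}^{k \times k}$, and $V^l_{xy} \in \mathbb{R}^{k \times k}$ denote the kernel weights, horizontal offsets, and vertical offsets for pixel $(x, y)$, respectively. Then, the interpolation for pixel $(x, y)$ at the $l$th scale is obtained by the warping process $\mathcal{W}$, which is expressed as

$$\tilde{I}^l(x, y) = \sum_{i} \sum_{j} K^l_{xy}(i, j) \times I\big(x + i + U^l_{xy}(i, j),\; y + j + V^l_{xy}(i, j)\big), \tag{1}$$

where the sums run over the $k \times k$ kernel indices $(i, j)$.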
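The single-scale warping described above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that are not specified in the text: the kernel indices $(i, j)$ are centered on the pixel, fractional sampling positions are resolved by bilinear interpolation, positions outside the image are clamped to the border, and the image has a single channel. The function names are hypothetical.

```python
import numpy as np


def bilinear_sample(image, xs, ys):
    """Sample a single-channel image at fractional positions with border clamping."""
    h, w = image.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = xs - x0
    wy = ys - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def deformable_warp(image, weights, offsets_x, offsets_y, k=5):
    """Warp an (H, W) image with per-pixel kernel weights and offsets of shape (H, W, k*k)."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    r = k // 2  # assumption: kernel indices run from -r to r
    out = np.zeros((h, w))
    idx = 0
    for j in range(-r, r + 1):      # vertical kernel index
        for i in range(-r, r + 1):  # horizontal kernel index
            sample_x = xs + i + offsets_x[..., idx]
            sample_y = ys + j + offsets_y[..., idx]
            out += weights[..., idx] * bilinear_sample(image, sample_x, sample_y)
            idx += 1
    return out
```

With all weight placed on the center tap and zero offsets, the sketch reproduces the input image; a constant horizontal offset of one pixel on that tap shifts the image content by one pixel, which is how the offsets act as pseudo-motions.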
For each scale $l$, the multi-scale warping module obtains $\tilde{I}^l$ via Eq. (1). Warping results at fine scales tend to preserve detailed local motions, while those at coarse scales represent large motions between consecutive frames reliably. To take advantage of both coarse and fine scale information, the multi-scale warping module produces the warping result $\tilde{I}$ by combining $\{\tilde{I}^l\}_{l=1}^{4}$ with learnable adaptive weights $\{\alpha_l\}_{l=1}^{4}$ through a fusion layer as

$$\tilde{I} = \sum_{l=1}^{4} \alpha_l \tilde{I}^l, \tag{2}$$

where the adaptive weights in $\{\alpha_l\}_{l=1}^{4}$ are constrained to sum to 1. In other words, $\sum_{l=1}^{4} \alpha_l = 1$.
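A minimal sketch of such a fusion step is given below. The text only states that the adaptive weights sum to 1; enforcing this with a softmax over unconstrained parameters is an assumption made here for illustration, and the function and argument names are hypothetical.

```python
import numpy as np


def fuse_scales(warped, logits):
    """Combine per-scale warping results with weights that sum to 1.

    warped: array of shape (L, H, W), one warping result per scale.
    logits: array of shape (L,), unconstrained fusion parameters
            (assumption: a softmax maps them to the adaptive weights).
    """
    shifted = logits - np.max(logits)  # subtract the maximum for numerical stability
    alpha = np.exp(shifted) / np.sum(np.exp(shifted))
    fused = np.tensordot(alpha, warped, axes=1)  # weighted sum over the scale axis
    return fused, alpha
```

With equal parameters, the fused result reduces to the plain average of the four warping results; training can then shift the weights toward the scales that are more reliable.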

B. VIDEO FRAME INTERPOLATION
As shown in Figure 1, the proposed algorithm employs two multi-scale warping modules. The first multi-scale warping module performs the warping forwardly from $I_t$ to the intermediate frame. This forwardly warped frame is denoted by $\tilde{I}_t$. On the other hand, the second multi-scale warping module produces the backwardly warped frame $\tilde{I}_{t+1}$ from $I_{t+1}$. Note that the two multi-scale warping modules have different network parameters to estimate the kernel weights, horizontal offsets, and vertical offsets for performing the warping forwardly and backwardly, respectively. Finally, we reconstruct the intermediate frame $\tilde{I}_{out}$ by combining the two warping results as

$$\tilde{I}_{out} = O \odot \tilde{I}_t + (1 - O) \odot \tilde{I}_{t+1}, \tag{3}$$

where $O \in \mathbb{R}^{H \times W}$ is a learnable weight map for combining the two warping results effectively and $\odot$ is the Hadamard product. Here, $0 \le O(x, y) \le 1$. A weight $O(x, y) > 0.5$ indicates that the warping result from $I_t$ is more reliable than that from $I_{t+1}$ at pixel $(x, y)$. When a pixel is occluded in one of the two frames $I_t$ and $I_{t+1}$, a higher weight can be assigned to the other frame through the weight map $O$. To obtain $O$, we add a sub-network to the 4th up-sample block in the decoder. This sub-network contains an up-sample block with the sigmoid activation to satisfy the constraint $0 \le O(x, y) \le 1$, as done in [17].
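The role of the weight map can be illustrated with a short NumPy sketch. It assumes that the two warping results are blended as a per-pixel convex combination, with weight $O$ on the frame warped from $I_t$ and $1 - O$ on the frame warped from $I_{t+1}$, and that $O$ is obtained by applying a sigmoid to an unconstrained map. The function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def blend_frames(warped_t, warped_t1, occlusion_logits):
    """Blend two warped frames with a per-pixel weight map.

    warped_t, warped_t1: arrays of shape (H, W), warped from I_t and I_{t+1}.
    occlusion_logits: array of shape (H, W), unconstrained sub-network output.
    """
    o = sigmoid(occlusion_logits)  # keeps every weight between 0 and 1
    blended = o * warped_t + (1.0 - o) * warped_t1
    return blended, o
```

A logit of zero gives equal trust to both frames, whereas a large positive logit at a pixel makes the output follow the frame warped from $I_t$, as expected when that pixel is occluded in $I_{t+1}$.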

C. IMPLEMENTATION DETAILS
The proposed algorithm is trained in an end-to-end manner using a loss function that combines a color loss $L_c$ and a smoothness loss $L_s$. Here, $L_c$ is the color loss between the predicted intermediate frame $\tilde{I}_{out}$ and the ground-truth $I_{gt}$. For the color loss, we use the Charbonnier function [35],

$$L_c = \rho(\tilde{I}_{out} - I_{gt}),$$

where $\rho(x) = (x^2 + \epsilon^2)^{1/2}$ and $\epsilon = 0.001$. Also, we define the smoothness loss $L_s$ to encourage neighboring pixels to have similar motions, which is given by

$$L_s = \sum_{l=1}^{4} \big\{ \rho(\nabla_x(P_{avg}(K^l \odot U^l))) + \rho(\nabla_y(P_{avg}(K^l \odot U^l))) + \rho(\nabla_x(P_{avg}(K^l \odot V^l))) + \rho(\nabla_y(P_{avg}(K^l \odot V^l))) \big\}, \tag{6}$$

where $\odot$ is the Hadamard product and $P_{avg}$ is the average pooling function along the channel axis. Also, $\nabla_x$ and $\nabla_y$ denote partial derivatives in the horizontal and vertical directions, respectively.
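The two loss terms can be sketched in NumPy as follows. The sketch makes several assumptions that the text does not specify: the Charbonnier values are averaged over all pixels, the partial derivatives are approximated by forward differences, and the smoothness terms are summed over the scales. How the color and smoothness losses are weighted against each other is not recoverable from the text and is therefore left out. All function names are hypothetical.

```python
import numpy as np

EPSILON = 0.001  # value of epsilon given in the text


def charbonnier(x, eps=EPSILON):
    """Charbonnier penalty rho(x) = sqrt(x^2 + eps^2)."""
    return np.sqrt(x ** 2 + eps ** 2)


def color_loss(predicted, ground_truth):
    """Charbonnier color loss (assumption: mean over all pixels)."""
    return np.mean(charbonnier(predicted - ground_truth))


def smoothness_term(weights, offsets):
    """Smoothness penalty for one offset field of shape (H, W, k*k)."""
    pooled = np.mean(weights * offsets, axis=-1)  # average pooling along the channel axis
    grad_x = pooled[:, 1:] - pooled[:, :-1]       # forward difference, horizontal
    grad_y = pooled[1:, :] - pooled[:-1, :]       # forward difference, vertical
    return np.mean(charbonnier(grad_x)) + np.mean(charbonnier(grad_y))


def smoothness_loss(weights_per_scale, offsets_x_per_scale, offsets_y_per_scale):
    """Sum the smoothness penalties of both offset fields over all scales."""
    total = 0.0
    for k_l, u_l, v_l in zip(weights_per_scale, offsets_x_per_scale, offsets_y_per_scale):
        total += smoothness_term(k_l, u_l) + smoothness_term(k_l, v_l)
    return total
```

Because of the $\epsilon$ term, both losses stay differentiable at zero and never fall below a small positive floor, even for a perfect prediction or a perfectly smooth offset field.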
In each convolution layer, ReLU is used as the activation function. Batch normalization is not employed, since we use a small batch size of 4. We train the proposed network using the AdaMax optimizer [36] with hyper-parameters β 1 = 0.9 and β 2 = 0.999. The learning rate is initialized to 0.001 and then halved every 20 epochs. Training runs for 70 epochs on an RTX 2080Ti GPU. The proposed network takes about 0.2 seconds to obtain an intermediate frame of size 1280 × 720.
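The learning rate schedule can be written as a simple step decay. The sketch below assumes that epochs are counted from zero, so that the rate is halved at epochs 20, 40, and 60; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, step=20, factor=0.5):
    """Step-decay schedule: multiply the rate by `factor` every `step` epochs.

    Assumption: epochs are counted from 0.
    """
    return base_lr * factor ** (epoch // step)


# Learning rate at every epoch of a 70-epoch run.
schedule = [learning_rate(epoch) for epoch in range(70)]
```

Under this convention, the final ten epochs of the 70-epoch run use one eighth of the initial learning rate.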

1) DATASETS AND METRICS
For training, we use Vimeo90K [37] as the training set, which contains 73,171 triplets of frames of resolution 448 × 256. The triplets in the training set are randomly cropped to 256 × 256 and then randomly flipped horizontally or vertically for data augmentation. For the evaluation, we use the same test sets as the state-of-the-art AdaCoF [17]: the 12 sequences in Middlebury [9] and randomly sampled sequences from the UCF101 and DAVIS datasets. For quantitative assessment of video frame interpolation, we use the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) metrics.
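The augmentation and the PSNR metric can be sketched as follows. The sketch assumes 8-bit frames with a peak value of 255, a flip probability of 0.5 for each direction, and one shared crop and flip for all three frames of a triplet; none of these details are given in the text, and the function names are hypothetical.

```python
import numpy as np


def augment_triplet(frames, rng, crop=256):
    """Randomly crop and flip a frame triplet of shape (3, H, W, C).

    The same crop and flips are applied to all three frames so that
    the motion between them stays consistent.
    """
    _, h, w, _ = frames.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = frames[:, top:top + crop, left:left + crop]
    if rng.random() < 0.5:   # assumption: flip probability of 0.5
        out = out[:, :, ::-1]  # horizontal flip (width axis)
    if rng.random() < 0.5:
        out = out[:, ::-1]     # vertical flip (height axis)
    return out


def psnr(predicted, ground_truth, max_value=255.0):
    """Peak signal-to-noise ratio in dB (assumption: peak value of 255)."""
    diff = predicted.astype(np.float64) - ground_truth.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

For example, a uniform error of one gray level on an 8-bit frame gives a mean squared error of 1 and thus a PSNR of $20 \log_{10} 255$ dB.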

A. ABLATION STUDIES
First, we analyze the efficacy of multi-scale features for obtaining multi-scale kernel weights and offsets. In this test, we measure the video frame interpolation performances by modifying the number of multi-scale features as follows.
• 1) Warping with $F_4$ (baseline)
• 2) Warping with $F_3$ and $F_4$
• 3) Warping with $F_2$, $F_3$, and $F_4$
We set the first case as the baseline, since it uses only the finest-scale feature similarly to AdaCoF. As listed in Table 2, the warping with $F_3$ and $F_4$ provides only a small performance gain on Middlebury and even degrades performance on UCF101, as compared with the baseline. In contrast, the warping with three-scale features ($F_2$, $F_3$, and $F_4$) achieves performance gains on all three datasets. Finally, the proposed algorithm, which uses all four multi-scale features, provides the best performance on all datasets with large margins over the baseline (0.38 dB on Middlebury and 0.45 dB on DAVIS). This indicates that the proposed multi-scale features are effective for the video frame interpolation task.
Next, we conduct another ablation study by varying kernel sizes $k \in \{3, 5, 7\}$ in both training and test for predicting kernel weights and offset vectors. Table 3 shows that larger kernel sizes ($k = 5$ and $k = 7$) yield better performance than a small kernel size ($k = 3$). This is because a large kernel size can cover a large reference region in general. However, the best performance is achieved at $k = 5$, not at $k = 7$. This indicates that the proposed multi-scale warping module effectively deals with large motions even with $k = 5$ by exploiting features at coarse scales. We hence use the kernel size of 5 in the proposed algorithm.
[...] and DAVIS datasets. The baseline yields blurry interpolation results in fast-moving objects, such as pizza dough in the 2nd row, runner in the 5th row, and dolphin tails in the 6th row. Also, the baseline fails to faithfully reconstruct fine details of objects, such as building in the 1st row, lion in the 3rd row, and helmet in the 4th row. In contrast, the proposed algorithm provides visually pleasing interpolation results on those fast-moving objects as well.
These comparisons indicate that the proposed multi-scale warping module is capable of dealing with large motions effectively and reconstructing fine details between consecutive frames.
Table 4 compares the proposed algorithm with existing video frame interpolation algorithms (Phase [38], MIND [39], SepConv-L 1 [13], DVF [30], SuperSlomo [3], and AdaCoF [17]) on the Middlebury, UCF101, and DAVIS datasets. The scores of the existing algorithms in Table 4 are from [17]. The proposed algorithm outperforms the existing algorithms with large PSNR gains on all three datasets. Notice that the proposed algorithm surpasses AdaCoF as well, which indicates that the proposed multi-scale warping module is effective for video frame interpolation.
Figure 4 compares the proposed algorithm with the existing algorithms on Middlebury qualitatively. The proposed algorithm reconstructs the shape of the right ball in the 1st row and the shadow of the trunk in the 2nd row more faithfully to the ground-truth than the other algorithms do. Figure 5 compares interpolation results on the UCF101 dataset. In the 1st row, the proposed algorithm and AdaCoF faithfully synthesize the arc shape of the athlete's movement. In the 2nd row, the proposed algorithm and SuperSlomo reconstruct the pitcher's legs more similarly to the ground-truth than the other algorithms. These results demonstrate that the proposed algorithm reconstructs fast-moving regions robustly. Finally, Figure 6 shows qualitative results on the DAVIS dataset. From the 1st to the 3rd rows, videos contain fast-moving objects, such as a football player, a wheel, and a car. It is observed that the proposed algorithm yields visually pleasing interpolation results, especially in the helmet number and the shape of the car, while the other algorithms generate blurry interpolation results. Also, in the 4th row, the detailed texture of the reptile is faithfully reconstructed by the proposed algorithm.

C. VISUALIZATION OF MULTI-SCALE OFFSETS
The proposed multi-scale warping modules produce multi-scale kernel weights $\{K^l\}_{l=1}^{4}$ and multi-scale offsets $\{U^l, V^l\}_{l=1}^{4}$. We visualize these offsets to show the effectiveness of the multi-scale warping modules. For this purpose, we obtain a flow map $F^l$ for each scale $l$ by combining offsets with kernel weights, which defines a flow vector $F^l(x, y)$ at each pixel $(x, y)$. Then, we combine all multi-scale offsets using the weights $\{\alpha_l\}_{l=1}^{4}$ in Eq. (2) to obtain a mixed flow map $F^m$, which is given by

$$F^m = \sum_{l=1}^{4} \alpha_l F^l.$$

[...] Figure 7(c). On the contrary, $F^m$ in Figure 7(d) has high responses on those regions. This indicates that multi-scale offsets represent large motions robustly. Also, by comparing two warping results in Figures 7(e) and (f), we observe that the warping result using all multi-scale features is more faithful than that using $F_4$ only. Similar results can be observed in a runner, a motorbike, and a skater in the 2nd, 3rd, and 4th rows, respectively.
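The construction of such flow maps can be sketched in NumPy. Since the exact definition of the per-scale flow vector is not recoverable from the text, the sketch assumes that it is the kernel-weighted sum of the horizontal and vertical offsets at each pixel; the mixed flow map is then formed as the weighted sum of the per-scale flow maps. The function names are hypothetical.

```python
import numpy as np


def flow_map(weights, offsets_x, offsets_y):
    """Per-scale flow map of shape (H, W, 2) from (H, W, k*k) weights and offsets.

    Assumption: the flow vector at a pixel is the kernel-weighted sum of its offsets.
    """
    flow_x = np.sum(weights * offsets_x, axis=-1)
    flow_y = np.sum(weights * offsets_y, axis=-1)
    return np.stack([flow_x, flow_y], axis=-1)


def mixed_flow_map(flow_maps, alphas):
    """Weighted sum of per-scale flow maps using the fusion weights."""
    return sum(alpha * flow for alpha, flow in zip(alphas, flow_maps))
```

The magnitude of each flow vector can then be rendered as an image, so that regions with large estimated motion appear as high responses.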

V. CONCLUSION
In this paper, we developed a video interpolation network based on the multi-scale warping modules, which can deal with both large and small motions robustly. The proposed network extracts multi-scale features and estimates the set of kernel weights and offset vectors at each scale. It then performs the warping according to the scales and combines multi-scale warping results using learnable weights to obtain intermediate frames. Experimental results demonstrated that the proposed algorithm outperforms state-of-the-art video interpolation algorithms on various benchmark datasets.