
1 Introduction

Reference-based super-resolution (RefSR) methods [2] utilize an extra high-resolution (HR) reference image to help super-resolve a low-resolution (LR) image that shares a similar viewpoint. Benefiting from the high-resolution details in the reference image, RefSR usually yields competitive performance compared to single-image SR (SISR). While RefSR has been successfully applied to light-field reconstruction [1,2,3] and giga-pixel video synthesis [4], it remains a challenging and unsolved problem, due to the parallax and the huge resolution gap (8×) between the HR reference image and the LR image. Essentially, transferring the high-frequency details from the reference image to the LR image is the key to the success of RefSR. This leads to the two critical issues in RefSR, i.e., image correspondence between the two input images and high-resolution synthesis of the LR image.

Fig. 1. Left: the ‘patch matching + synthesis’ pipeline of [2]; middle: the proposed end-to-end CrossNet; right: result comparisons.

In the initial work of [1], to establish image correspondences between the two inputs, gradient features on down-sampled patches of the HR reference are used for patch-based matching, while patch averaging is used for image synthesis. However, the oversimplified, down-sampled correspondence estimation of [1] does not exploit high-frequency information for matching, and the synthesis step does not utilize a high-resolution image prior for better fusion. To address these two limitations, a recent work [2] replaces the gradient feature of [1] with features learned by a convolutional neural network (CNN) to improve matching accuracy, and then proposes an additional CNN that utilizes a state-of-the-art single-image super-resolution (SISR) algorithm [5] for patch synthesis. However, the ‘patch matching + patch synthesis’ scheme of [1, 2] is fundamentally limited. Firstly, the adopted sliding averaging blurs the output image and causes grid artifacts. Moreover, patch-based synthesis is inherently incapable of handling the non-rigid image deformation caused by viewpoint changes. To introduce non-rigid deformation into patch-based algorithms, [3] enriches the reference images by iteratively applying non-uniform warping before patch synthesis. However, directly warping between the low- and high-resolution images is inaccurate. In addition, such iterative combination of patch matching and warping introduces a heavy computational burden, e.g., around 30 min to synthesize an image.

In this paper, we propose CrossNet, an end-to-end convolutional neural network based on the idea of ‘warping + synthesis’ for reference-based image super-resolution. We discard the idea of ‘patch matching’ and replace it with ‘warping’, which enables the design of an ‘Encoder-Warping-Decoder’ structure, as shown in Fig. 1. This structure contains two encoders to extract multi-scale features from the LR and reference images respectively. We take advantage of the warping module originating from the spatial transformer network (STN) [6] and integrate it into our HR reference image encoder. Compared with patch matching based methods, warping naturally supports non-rigid deformation, overcoming the parallax challenge in RefSR. Moreover, we extract multi-scale features in the encoder and then perform multi-scale spatial alignment using warping, as shown in Fig. 1. The introduced multi-scale features capture complementary scale information from the two images, which helps to alleviate the huge resolution gap in RefSR. Finally, the decoder aggregates features to synthesize the HR output. Overall, our model is fully end-to-end trainable and does not require pretraining the flow estimator.

Extensive experiments show the superior performance of CrossNet (around 2 dB–4 dB gain) compared to state-of-the-art SISR and RefSR methods, on different datasets with large/small viewpoint disparities and at different scales. Our trained model also generalizes to external datasets, including the Stanford light field dataset, while retaining high-frequency details. More importantly, CrossNet is efficient: it generates a \(320\times 512\) image within one second, while [1, 2] and [3] take 86.3 s, 105.0 s and around 30 min, respectively, to perform the same task.

2 Related Work

2.1 Single-Image Super-Resolution

The single-image super-resolution (SISR) problem aims to super-resolve an LR image without additional references. Nevertheless, SISR is closely related to the reference-based super-resolution (RefSR) problem. In the early days, approaches based on adaptive sampling [7, 8] were applied to SISR. However, such approaches did not utilize the statistics of natural images. In contrast, model-based approaches design image priors that help to super-resolve image-specific patterns. Such works usually utilize edge priors [9], the total variation model [10], the hyper-Laplacian prior [11], sparsity priors [12,13,14,15], or exemplar patches [16, 17].

More recently, the SISR problem has been cast as a supervised regression problem, which tries to learn a mapping function from LR patches to HR patches. These works rely on a variety of learning techniques including nearest-neighbor search [18, 19], decision trees [20], random forests [21], simple functions [22, 23], Gaussian process regression [24], and deep neural networks.

With the increasing model capacity of deep neural networks, SISR performance has improved rapidly. Since the first deep learning-based SR method [25], a large number of works have been proposed to further improve SISR performance. For example, Dong et al. [26] and Shi et al. [27] improved the efficiency of SISR by computing features in the low-resolution domain. Kim et al. [28] proposed a 20-layer deep network for predicting the bicubic upsampling residual. Ledig et al. [5] proposed a deep residual network with adversarial training for SISR. Lai et al. [29] reconstructed the sub-band residuals using a multi-stage Laplacian network. Lim et al. [30] improved [5] by introducing a multi-scale feature extraction residual block for better performance. Because of the impressive performance of the MDSR network from [30], we employ MDSR as a sub-module for LR image feature extraction and RefSR synthesis.

2.2 Reference-Based Super-Resolution

Recent works such as [1,2,3, 31,32,33] use additional reference images from different viewpoints to help super-resolve the LR input, which forms a new kind of SR method called RefSR. Specifically, Boominathan et al. [1] used a DSLR-captured high-definition image as reference, and applied a patch-based synthesis algorithm using non-local means [19] to super-resolve low-definition light-field images. Wu et al. [33] improved this algorithm by employing patch registration before the nearest-neighbor search, and then applied dictionary learning for reconstruction. Wang et al. [3] iterated the patch synthesis step of [1] to enrich the exemplar database. Zheng et al. [34] decomposed images into sub-bands by frequency and applied patch matching for high-frequency sub-band reconstruction. Recently, Zheng et al. [2] proposed a deep learning-based approach for cross-resolution patch matching and synthesis, which significantly boosts the accuracy of RefSR. However, patch-based synthesis algorithms are inherently incapable of handling the non-rigid image deformation that is often caused by irregular foreground shapes. In such cases, patch-based synthesis causes issues such as blocky artifacts and blurring. Although sliding windows [1, 2] or iterative refinement [3] mitigate these difficulties to some extent, such strategies usually introduce heavy computational cost. In contrast, our fully convolutional network achieves more than 100 times speedup compared to existing RefSR approaches, making the model applicable to real-time applications.

2.3 Image/Video Synthesis Using Warping

Our task is also related to image/video synthesis tasks that use additional images from other viewpoints or frames. Such tasks include view synthesis [35, 36], video denoising [37], super-resolution [37,38,39], and interpolation or extrapolation [40, 41]. To solve this type of problem, deep neural networks based on the “warping and synthesis” design have recently been proposed. Specifically, the additional images are backward warped to the target image using the estimated flow. Afterwards, the warped image is used for image/frame synthesis by an additional synthesis module. We follow this “warping and synthesis” pipeline. However, our approach differs from existing works in the following ways: (1) instead of the common practice of performing warping in the image domain at the pixel scale [35, 36, 40, 41], our approach performs multi-scale warping in the feature domain, which accelerates model convergence by allowing the flow to be globally updated at higher scales; (2) after the warping operations, a novel fusion scheme is proposed for image synthesis. Our fusion scheme differs from existing synthesis practices, which include image-domain early fusion (concatenation) [36, 40] and linear combination of images [35, 41].

Fig. 2. Network structure of our proposed CrossNet.

3 Approach

Our reference-based super-resolution scheme, named CrossNet, is based on a fully convolutional cross-scale alignment module that spatially aligns the reference image information to the LR image domain. Along with the cross-scale alignment module, an encoder-decoder structure is proposed to directly synthesize the RefSR output in an end-to-end, fully convolutional fashion. The entire network is shown in Fig. 2. In Sect. 3.1, we introduce the design and properties of the fully convolutional cross-scale alignment module. In Sect. 3.2, the end-to-end network structure is described, followed by the image synthesis loss function in Sect. 3.3.

3.1 Fully Convolutional Cross-Scale Alignment Module

Since the reference image is captured from a different viewpoint than the LR image, it is necessary to perform spatial alignment. In [1,2,3], such correspondence is estimated by matching every LR patch with its surrounding reference patches. However, such sparsely sampled and non-rigidly upsampled correspondence can easily fail around regions with varying depth or disparity.

Cross-Scale Warping. We propose cross-scale warping to perform non-rigid image transformation. Compared to patch matching, we do not assume the depth plane to be locally constant. Our cross-scale warping operation considers a pixel-wise shift vector field V:

$$\begin{aligned} I_o = warp(y_{Ref}, V), \end{aligned}$$
(1)

which assigns a specific shift vector to each pixel location, thereby avoiding blocky and blurry artifacts.
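For concreteness, below is a minimal sketch of the backward warping in Eq. (1), implemented with PyTorch's `grid_sample`; the paper's implementation used Theano, and the function name and exact interpolation settings here are our assumptions.

```python
import torch
import torch.nn.functional as F

def backward_warp(ref, flow):
    """Backward-warp `ref` (B, C, H, W) with a per-pixel shift field `flow` (B, 2, H, W)."""
    b, _, h, w = ref.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float().to(ref.device)      # (2, H, W)
    coords = base.unsqueeze(0) + flow                                # per-pixel shift V
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(ref, grid, mode='bilinear', align_corners=True)
```

The bilinear sampling makes the warp differentiable, which is what allows the flow estimator to be trained end-to-end from the synthesis loss alone.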

Cross-Scale Flow Estimator. As shown at the top of Fig. 2, given an upsampled LR image and its corresponding reference image, we adopt the widely used FlowNetS [42] as our flow estimator to generate cross-scale correspondences at multiple scales. To further improve FlowNetS, we replace its final \({\times }4\) bilinear upsampling layer with two \({\times }2\) upsampling modules, where each \({\times }2\) upsampling module contains a deconvolution layer followed by a skip connection. This additional upsampling procedure allows the modified model to predict a much finer flow field. The modified flow estimator generates the multi-scale flow fields as follows:

$$\begin{aligned} \{V^{(3)}, V^{(2)}, V^{(1)}, V^{(0)}\} = {Flow}(I_{LR\uparrow }, I_{REF}), \end{aligned}$$
(2)

where \(I_{REF}\) denotes the reference image, and \(I_{LR\uparrow }\) denotes the LR image (\(I_{LR}\)) upsampled by a representative single-image SR (SISR) approach [30]:

$$\begin{aligned} \begin{array}{rl} I_{LR\uparrow }&{}= SISR(I_{LR}). \\ \end{array} \end{aligned}$$
(3)

More discussion on the choice of flow estimator is presented in Sect. 4.3.
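As a rough illustration of the modified upsampling stage, the sketch below shows one plausible form of a \({\times }2\) upsampling module (a deconvolution that doubles the flow resolution, refined by a skip connection from encoder features); the channel counts and exact wiring are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class Up2x(nn.Module):
    """One assumed x2 upsampling module appended to FlowNetS: a stride-2 deconvolution
    followed by a skip connection that predicts a finer 2-channel flow field."""
    def __init__(self, in_ch, skip_ch, mid_ch=64):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, mid_ch, kernel_size=4, stride=2, padding=1)
        self.predict_flow = nn.Conv2d(mid_ch + skip_ch, 2, kernel_size=3, padding=1)

    def forward(self, x, skip):
        up = torch.relu(self.deconv(x))          # double the spatial resolution
        fused = torch.cat((up, skip), dim=1)     # skip connection from same-resolution features
        return self.predict_flow(fused)          # finer flow field
```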

3.2 End-to-End Network Structure

Patch matching calculates pixel-wise flow using a sliding-window scheme, which is computationally expensive compared with the proposed fully convolutional network for cross-scale flow field estimation.

Using cross-scale warping as the key component for spatial alignment, we propose an end-to-end network for RefSR synthesis. Our network contains an LR image encoder that extracts multi-scale feature maps from the LR image \(I_L\), a reference image encoder that extracts and aligns the reference image feature maps at multiple scales, and a decoder that performs multi-scale feature fusion and synthesis using the U-Net [43] structure. Figure 2 summarizes the structure of our proposed CrossNet. The major modules, i.e., the encoders, the flow estimator and the decoder, are elaborated as follows.

LR Image Encoder. Given the LR image \(I_L\), we design an LR image encoder to extract feature maps at 4 scales. Specifically, we utilize the SISR approach in Eq. 3 to upsample the LR image. After that, we convolve the upsampled image with 64 filters (of size \(5 \times 5\)) with stride 1 to extract the feature map at scale 0. We then repeatedly convolve the feature map at scale \(i-1\) (for \(0 < i \le 3\)) with 64 filters (of size \(5 \times 5\)) with stride 2 to extract the feature map at scale i. These operations can be represented as

$$\begin{aligned} \begin{array}{rl} F^{(0)}_{LR} &{}= \sigma (\varvec{W}^{(0)}_{LR} *I_{LR\uparrow } + \varvec{b}^{(0)}_{LR}), \\ F^{(i)}_{LR} &{}= \sigma (\varvec{W}^{(i)}_{LR} *F^{(i-1)}_{LR} + \varvec{b}^{(i)}_{LR}) {\Downarrow _2},~ i = {1,2,3}, \end{array} \end{aligned}$$
(4)

where \(F^{(i)}_{LR}\) is the LR feature map at scale i, \(\sigma \) stands for the rectified linear unit (ReLU) activation function [44], \(*\) denotes convolution, and \(\Downarrow _2\) denotes 2D sampling with stride 2.
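A minimal PyTorch sketch of the encoder in Eq. 4 follows (filter sizes and strides as stated above; the module name and padding choice are illustrative assumptions).

```python
import torch.nn as nn

class LREncoder(nn.Module):
    """Multi-scale LR encoder per Eq. (4): 64 filters of size 5x5, stride 1 at scale 0
    and stride 2 at scales 1-3."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.conv0 = nn.Conv2d(in_ch, 64, kernel_size=5, stride=1, padding=2)
        self.down = nn.ModuleList(
            [nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=2) for _ in range(3)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, lr_up):
        feats = [self.relu(self.conv0(lr_up))]        # F_LR^(0) at full resolution
        for conv in self.down:
            feats.append(self.relu(conv(feats[-1])))  # F_LR^(1..3), halved each time
        return feats
```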

Note that resorting to independent SISR approaches (such as [30]) to encode the LR image has two advantages. First, SISR approaches that have been validated on large-scale external datasets help the LR image encoder generalize better to unseen scenes. Second, new state-of-the-art SISR approaches can easily be integrated into our system to improve performance without changing the network structure.

Reference Image Encoder. Given the raw reference image \(I_R\), a 4-scale feature extraction network with exactly the structure of Eq. 4 is used to sequentially extract the reference image features \(\{F^{(0)}_{REF},F^{(1)}_{REF},F^{(2)}_{REF},F^{(3)}_{REF}\}\) at multiple scales. We allow the reference feature extractor and the LR feature extractor to learn different weights, which helps the two sets of features complement each other.

After that, we perform a backward warping operation on the reference image features \(F^{(i)}_{REF}\) using the cross-scale flow \(V^{(i)}\) in Eq. 2, generating the spatially aligned features \(\hat{F}^{(i)}_{REF}\):

$$\begin{aligned} \hat{F}^{(i)}_{REF} = warp(F^{(i)}_{REF}, V^{(i)}),~ i = {0,1,2,3}. \end{aligned}$$
(5)

More discussions on the multi-scale warping are presented in Sect. 4.3.
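Reusing the `backward_warp` sketch from Eq. (1), the per-scale alignment of Eq. (5) reduces to a simple loop (assuming each flow \(V^{(i)}\) is expressed in pixels at its own scale):

```python
def warp_reference_features(ref_feats, flows):
    """Align each reference feature map F_REF^(i) with its same-scale flow V^(i), Eq. (5)."""
    return [backward_warp(f, v) for f, v in zip(ref_feats, flows)]
```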

Decoder. After extracting the LR image features and the warped reference image features at different scales, a U-Net like decoder is proposed to perform fusion and SR synthesis. Specifically, the warped features and the LR image features at scale i (for \(0 \le i \le 3\)), together with the decoder feature from the preceding (coarser) scale (if any), are concatenated and passed through a deconvolution layer with 64 filters (of size \(4 \times 4\)) and stride 2 to generate the decoder features,

$$\begin{aligned} \begin{array}{rl} F^{(3)}_{D} &{}= \sigma (\varvec{W}^{(3)}_{D} \star (F^{(3)}_{LR}, \hat{F}^{(3)}_{REF}) + \varvec{b}^{(3)}_{D}), \\ \\ F^{(i)}_{D} &{}= \sigma (\varvec{W}^{(i)}_{D} \star (F^{(i+1)}_{LR}, \hat{F}^{(i+1)}_{REF}, F^{(i+1)}_{D}) + \varvec{b}^{(i)}_D),~ i = {2,1,0}, \end{array} \end{aligned}$$
(6)

where \(\star \) denotes the deconvolution operation.

After generating the decoder feature at scale 0, three additional convolution layers with filter size \(5 \times 5\) and filter numbers \(\{64, 64, 3\}\) are added to perform post-fusion and generate the SR output,

$$\begin{aligned} \begin{array}{lll} F^{}_{1} &{} = &{}\sigma (\varvec{W}^{}_{1} *F^{(0)}_{D} + \varvec{b}^{}_{1}), \\ F^{}_{2} &{}=&{} \sigma (\varvec{W}^{}_{2} *F^{}_{1} + \varvec{b}^{}_{2}), \\ I_p &{}=&{} \sigma (\varvec{W}^{}_{p} *F^{}_{2} + \varvec{b}^{}_{p}). \end{array} \end{aligned}$$
(7)
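The sketch below illustrates one spatially consistent reading of the decoder in Eqs. (6)–(7): deconvolutions (64 filters, \(4\times 4\), stride 2) progressively fuse the LR and warped reference features with the decoder feature from the coarser scale, and three \(5\times 5\) convolutions produce the output. The exact scale wiring and channel bookkeeping are our assumptions rather than a verified reproduction of the released model.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """U-Net style decoder sketched after Eqs. (6)-(7); scale 0 is full resolution, scale 3 is 1/8."""
    def __init__(self):
        super().__init__()
        self.dec3 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)   # 1/8 -> 1/4
        self.dec2 = nn.ConvTranspose2d(192, 64, 4, stride=2, padding=1)   # 1/4 -> 1/2
        self.dec1 = nn.ConvTranspose2d(192, 64, 4, stride=2, padding=1)   # 1/2 -> 1/1
        self.relu = nn.ReLU(inplace=True)
        self.post = nn.Sequential(                                         # Eq. (7) post-fusion
            nn.Conv2d(192, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 5, padding=2))   # (Eq. (7) also applies sigma to the last layer; omitted here)

    def forward(self, lr_feats, ref_feats):
        # lr_feats / ref_feats: lists of 64-channel maps indexed by scale 0..3.
        d = self.relu(self.dec3(torch.cat((lr_feats[3], ref_feats[3]), dim=1)))
        d = self.relu(self.dec2(torch.cat((lr_feats[2], ref_feats[2], d), dim=1)))
        d = self.relu(self.dec1(torch.cat((lr_feats[1], ref_feats[1], d), dim=1)))
        return self.post(torch.cat((lr_feats[0], ref_feats[0], d), dim=1))
```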

3.3 Loss Function

Our network can be directly trained to synthesize the SR output. Given the network prediction \(I_{p}\), and the ground truth high-resolution image \(I_{HR}\), the loss function can be written as

$$\begin{aligned} \mathcal {L} = \frac{1}{N} \sum _{i=1}^{N} \sum _{s} \rho (I_{HR}^{(i)}(s) - I_{p}^{(i)}(s) ), \end{aligned}$$
(8)

where \(\rho (x)=\sqrt{x^2+0.001^2}\) is the Charbonnier penalty function [45], N is the number of samples, and i and s iterate over training samples and spatial locations, respectively.
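A one-line sketch of this loss (averaging over pixels as well as samples, which only rescales Eq. (8)):

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty of Eq. (8), averaged over samples and spatial locations."""
    return torch.sqrt((target - pred) ** 2 + eps ** 2).mean()
```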

4 Experiment

4.1 Dataset

The representative Flower dataset [46] and Light Field Video (LFVideo) dataset [41] are used here. The Flower dataset [46] contains 3343 light-field images of flowers and plants captured with a Lytro ILLUM camera, where each light-field image has \(376 \times 541\) spatial samples and \(14 \times 14\) angular samples. Following [46], we extract the central \(8 \times 8\) grid of angular samples to avoid invalid images, and randomly divide the dataset into 3243 images for training and 100 images for testing. The LFVideo dataset [41] contains real-scene light-field images captured with a Lytro ILLUM camera. Similar to the Flower dataset, each light-field image has \(376 \times 541\) spatial samples and \(8 \times 8\) angular samples. There are in total 1080 light-field samples for training and 270 light-field samples for testing.

For model training on these two datasets, the LR and reference images are randomly selected from the \(8 \times 8\) angular grid. For testing, the LR images at angular positions \((i, i), 0 < i \le 7\) and reference images at position (0, 0) are selected to evaluate the RefSR algorithms. As our model requires the input size to be a multiple of 32, the images from the two datasets are cropped to \(320 \times 512\) for training and validation.

To validate the generalization ability of CrossNet, we also test it on images from the Stanford Light Field dataset [47] and the Scene Light Field dataset [48], where we apply our trained model in a sliding-window fashion, with window size \(512\times 512\) and stride 256, to produce the SR result for the entire image. More details are presented in the generalization analysis in Sect. 4.2.
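A sketch of this sliding-window inference is shown below; the averaging of overlapping windows and the handling of edge windows are our assumptions, as the text only specifies the window size and stride, and `sr_fn` is a placeholder for the trained CrossNet.

```python
import numpy as np

def _offsets(length, win, stride):
    # Window start offsets covering the full extent, including a final edge window.
    offs = list(range(0, max(length - win, 0) + 1, stride))
    if offs[-1] != max(length - win, 0):
        offs.append(max(length - win, 0))
    return offs

def sliding_window_sr(sr_fn, lr_up, ref, win=512, stride=256):
    """Run sr_fn on win x win crops of (H, W, C) inputs and average overlapping outputs."""
    h, w, c = lr_up.shape
    out = np.zeros((h, w, c), dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    for y in _offsets(h, win, stride):
        for x in _offsets(w, win, stride):
            out[y:y + win, x:x + win] += sr_fn(lr_up[y:y + win, x:x + win],
                                               ref[y:y + win, x:x + win])
            weight[y:y + win, x:x + win] += 1.0
    return out / np.maximum(weight, 1.0)
```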

4.2 Evaluation

We train CrossNet for 200K iterations on the Flower and LFVideo datasets for \({\times }4\) and \({\times }8\) SR respectively. The learning rates are initially set to 1e−4 and 7e−5 for the two datasets respectively, and decay to 1e−5 and 7e−6 after 150K iterations. Adam [49] is used as the optimizer with \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\). In comparison to CrossNet, we also test the latest RefSR algorithms SS-Net [2] and PatchMatch [1], and representative SISR approaches including SRCNN [25], VDSR [28] and MDSR [30].
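For reference, the optimizer settings above correspond roughly to the following configuration (shown for the Flower setting; `model` is assumed to be a CrossNet instance and the scheduler is stepped once per iteration):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Decay 1e-4 -> 1e-5 after 150K iterations (a factor-10 drop), per the schedule described above.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 0.1 if step >= 150_000 else 1.0)
```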

We evaluate the results using three image quality metrics: PSNR, SSIM [50], and IFC [51]. Table 1 shows quantitative comparisons for \({\times }4\) and \({\times }8\) RefSR under two parallax settings, where the reference images are sampled at position (0, 0) while LR images are sampled at positions (1, 1) and (7, 7). As Table 1 shows, the proposed CrossNet outperforms previous approaches considerably under various settings, including small/large parallax, different upsampling scales and different datasets, achieving a 2 dB–4 dB gain in general.

Table 1. Quantitative evaluation of the state-of-the-art SISR and RefSR algorithms, in terms of PSNR/SSIM/IFC for scale factors \({\times }4\) and \({\times }8\) respectively.

For better comparison, we also visualize the PSNR performance under different parallax settings in Fig. 3. As expected, RefSR approaches such as CrossNet, PatchMatch and SS-Net outperform SISR approaches owing to the high-frequency details provided by the reference images. However, RefSR results deteriorate as the parallax increases, because correspondence search is more difficult under large parallax. In contrast, the performance of SISR approaches exhibits a ‘U-shape’ across views, i.e., at the corners of the LF grid, with disparities (1, 1) and (7, 7), SISR performs slightly better. This is probably because the easily super-resolved invalid region becomes larger at the corners. Finally, the proposed CrossNet consistently outperforms the remaining approaches under different disparities, datasets and scales.

Fig. 3. The PSNR measurement under different parallax settings: the reference images are selected at the (0, 0) LF grid position, while the LR images are selected at positions \((i,i), 0 < i \le 7\) of the LF grid.

Figure 4 presents visual comparisons of CrossNet with SISR approaches (SRCNN, VDSR, MDSR) and RefSR approaches (PatchMatch and SS-Net) under the challenging \({\times }8\) scale setting. Benefiting from the reference image, RefSR approaches show competitive results compared to the SISR methods, with high-frequency details explicitly retained. Among them, the proposed CrossNet provides even finer details, resembling those in the ground truth image. More visual comparisons are shown in the supplementary material and the supplementary video.

Generalization: To further estimate the cross-dataset generalization capacity of our model, we report results on the Stanford light field dataset (Lego Gantry) [47] and the Scene Light Field dataset [48]. The former contains light-field images captured with a Canon Digital Rebel XTi mounted on motorized Mindstorms motors on the Lego gantry, and the images of the latter are also captured with a standard DSLR camera on a motorized stage. Under such equipment settings, the captured light-field images of the two datasets have much larger parallax than those captured by Lytro ILLUM cameras. This parallax discrepancy between datasets poses difficulty to our trained model, as it was not specifically trained with large parallax.

Fig. 4. Visual comparison for \({\times }8\) RefSR on LFVideo (1, 1), LFVideo (3, 3), and Flower (1, 1). Our approach is compared against SRCNN [25], VDSR [28], MDSR [30], PatchMatch [1], and SS-Net [2].

To handle these two datasets, we employ a parallax augmentation procedure during training, which randomly offsets the reference input by \([-15,15]\) pixels both horizontally and vertically. We take the model parameters pre-trained on the LFVideo dataset (in Sect. 4.2) as initialization, and re-train CrossNet on the Flower dataset for 200K iterations to achieve better generalization. We use 7e−5 as the initial learning rate, and decay it by factors of 0.5, 0.2 and 0.1 at 50K, 100K and 150K iterations.
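A minimal sketch of this parallax augmentation follows; the text only specifies the offset range, so the zero-padded border handling is our assumption.

```python
import numpy as np

def random_parallax_shift(ref, max_shift=15):
    """Offset the (H, W, C) reference image by a random amount in [-max_shift, max_shift]
    pixels both horizontally and vertically, padding uncovered borders with zeros."""
    dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
    h, w = ref.shape[:2]
    shifted = np.zeros_like(ref)
    ys, yd = (slice(dy, h), slice(0, h - dy)) if dy >= 0 else (slice(0, h + dy), slice(-dy, h))
    xs, xd = (slice(dx, w), slice(0, w - dx)) if dx >= 0 else (slice(0, w + dx), slice(-dx, w))
    shifted[yd, xd] = ref[ys, xs]
    return shifted
```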

Tables 2 and 3 compare, in terms of PSNR, our re-trained model with PatchMatch [1] and SS-Net [2] for \({\times }8\) RefSR on the Stanford light-field dataset and the Scene Light Field dataset respectively. It can be seen that our approach outperforms the remaining approaches under different parallax settings on the Stanford dataset. On average, our approach outperforms the competitive SS-Net by 1.79–2.50 dB on the Stanford light-field dataset and by 2.84 dB on the Scene Light Field dataset.

Table 2. \({\times }8\) super-resolution experiment on the Stanford light field dataset [47].
Table 3. \({\times }8\) super-resolution experiment on the Scene light field dataset [48].

Efficiency: It is worth mentioning that the proposed CrossNet generates a \(320\times 512\) image for \({\times }8\) RefSR within 1 s, i.e., 0.75 s for SISR preprocessing using the MDSR [30] model, and 0.12 s to synthesize the final output. In contrast, PatchMatch [1] takes 86.3 s to run in Matlab 2016 with GPU parallelization, while SS-Net [2] takes on average 105.6 s on a GPU. The above running times are measured on a machine with 8 Intel Xeon CPUs (3.4 GHz) and a GeForce GTX 1080 GPU, while the model inference of our CrossNet and SS-Net [2] is implemented in Python with the Theano deep learning package [52].

4.3 Discussions

One may be concerned that our loss is designed for image synthesis and does not contain explicit terms for flow estimation. However, since correctly aligned features are extremely informative for the decoder to reconstruct high-frequency details, our model in fact learns to predict the optical flow that aligns feature maps in an unsupervised fashion. To validate the effectiveness of the flow learned through feature alignment, we visualize the intermediate flow fields at all scales in Fig. 5(d), where the flow predictions at scales \(0,1,2,3 ({\times }1,{\times }2,{\times }4,{\times }8)\) are reasonably coherent, yet noisy flow predictions are observed at scales \(4,5 ({\times }16,{\times }32)\), because the flows at scales 4 and 5 are not used for feature-domain warping.

Fig. 5. Flow visualization and comparison for samples #1 and #99 in the Flower \({\times }8\) testing set. (a) the HR image; (b)(c)(d) flow visualizations of PatchMatch [1], SS-Net [2], and our approach, respectively. In (d), the flow is visualized at scales \({\times }1,{\times }2,{\times }4\) (row 1), and \({\times }8,{\times }32,{\times }64\) (row 2).

In addition to the multi-scale feature warping module proposed in this paper, we investigate a single-scale image warping counterpart which warps the reference image before it is fed to the image encoder for feature extraction. This counterpart is inspired by the common practice in [35, 41] of performing image warping before synthesis. More concretely, the single-scale counterpart warps the image using the flow at scale 0: \(\hat{I}_{REF} = warp(I_{REF}, V^{(0)}).\) After that, a reference image encoder with the same structure is used to extract features from the warped reference image. Without changing the structure of the encoder and decoder, this image warping counterpart, CrossNet-iw, has the same model size as CrossNet.
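Schematically, the CrossNet-iw counterpart differs from CrossNet only in where the single warp is applied; the sketch below uses placeholder module names and the `backward_warp` helper from Eq. (1).

```python
def crossnet_iw_forward(lr_up, ref, flow_estimator, encoder_lr, encoder_ref, decoder):
    """CrossNet-iw: warp the reference *image* once with the scale-0 flow, then encode.
    CrossNet instead encodes first and warps the *features* at every scale."""
    flows = flow_estimator(lr_up, ref)            # multi-scale flows, Eq. (2)
    warped_ref = backward_warp(ref, flows[0])     # single warp in the image domain
    lr_feats = encoder_lr(lr_up)
    ref_feats = encoder_ref(warped_ref)           # features of the already-warped image
    return decoder(lr_feats, ref_feats)
```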

We train both CrossNet-iw and CrossNet following the same procedure as in Sect. 4.2. We also adopt a pretraining strategy for CrossNet-iw: we pretrain its flow estimator on the image warping task for 100K iterations and then apply joint training for another 100K iterations, resulting in the CrossNet-iw-p model. Figure 6 shows the PSNR convergence curves on the training set for \({\times }8\) RefSR on the Flower and LFVideo datasets. Our CrossNet converges faster than the CrossNet-iw counterparts. At the end of training, CrossNet outperforms CrossNet-iw by 0.20 dB and 0.27 dB on the training sets. Table 4 shows the RefSR accuracy on the test sets for three representative viewpoints. CrossNet outperforms CrossNet-iw, especially in the small parallax setting. This is reasonable because training uses randomly sampled pairs from the LF grid, which are mostly small-parallax pairs.

Fig. 6. Convergence analysis of our feature-domain warping scheme (CrossNet) versus image warping schemes. Our model (CrossNet, red) converges faster than the image-domain warping counterpart (CrossNet-iw), with or without pre-training. (Color figure online)

Table 4. Ablation study to evaluate the effectiveness of multi-scale feature warping.

As our method relies on the cross-scale flow estimator, it is also important to study the flow prediction capacities of different flow estimators. For this purpose, we train FlowNetS and our modified model (FlowNetS+) on the Flower and LFVideo datasets to warp the reference images to the ground truth images, given the reference and LR images as input. As shown in Table 5, while FlowNetS+ contains 2% more parameters than FlowNetS, the additional upscaling layers of FlowNetS+ noticeably improve the warping precision on both the Flower dataset [46] and the LFVideo dataset [41], as they help to generate a finer flow field. In addition, we observe that by plain warping, FlowNetS+ already achieves performance comparable to SS-Net [2], as shown by the SS-Net (\({\times }8\)) row in Table 1.

Table 5. Quantitative evaluation and the parameter sizes comparison, using different flow estimators to warp the reference image. The LR images are located at angular position (3, 3).

5 Conclusion

Aiming at the challenging large-scale (\(8{\times }\)) super-resolution problem, we propose an end-to-end reference-based super-resolution network named CrossNet, whose input is a low-resolution (LR) image and a high-resolution (HR) reference image sharing a similar viewpoint, and whose output is the super-resolved (\(4{\times }\) or \(8{\times }\)) LR image. The pipeline of CrossNet is fully convolutional, consisting of encoders, cross-scale warping, and a decoder. Extensive experiments on several large-scale datasets demonstrate the superior performance of CrossNet (around 2 dB–4 dB gain) compared to previous methods. More importantly, CrossNet achieves a speedup of more than 100 times over existing RefSR approaches, making the model applicable to real-time applications.