Design of Network Cascade Structure for Image Super-Resolution

Abstract: Image super-resolution is an important field of computer vision research. Mainstream image super-resolution techniques use deep learning to mine deeper features of the image, which are then used for image restoration. However, most of these models are trained on images at a single scale and do not consider the relationships between images at different scales. To exploit the information of images at different scales, we design a cascaded network structure, the cascaded super-resolution convolutional neural network, which contains three cascaded FSRCNNs. Since each sub-FSRCNN processes an image at a specific scale, our network can exploit three image scales simultaneously and use the information from all three. Experiments on multiple datasets confirm that the proposed network achieves better performance for image SR.


Introduction
Single image super-resolution (SR) is an important problem in the field of computer vision research. Given a low-resolution (LR) image, the purpose of single image SR is to recover the corresponding high-resolution (HR) image. Currently, single image SR methods are generally divided into two categories: sparse coding-based methods [1][2][3][4] and deep learning-based methods [5][6][7][8].
Most sparse coding-based methods assume that each pair of patches from the LR and HR images has similar coding coefficients in the patch space. Therefore, HR image patches can be reconstructed as a linear combination of a few LR image patches. Yang et al. [1][2] learned a pair of dictionaries D_L and D_H for LR and HR image patches, respectively. The final reconstructed HR image can be obtained through the learned dictionary D_H and the coding coefficient α, which is encoded using the learned dictionary D_L [9][10][11]. Existing sparse coding-based methods mainly focus on how to learn better dictionaries [12][13]. For example, Zhao et al. [12] designed a graph-based transfer robust sparse coding for image representation, and Ahmed et al. [13] proposed a new multiple dictionary learning strategy. Deep learning-based methods aim to directly learn an end-to-end mapping from LR images to HR images [5][6][7][8]. Dong et al. [7] first applied a convolutional neural network (CNN) to single image SR, proposing the super-resolution convolutional neural network (SRCNN), which achieved better restoration performance. However, SRCNN only learns the mapping from bicubic-interpolated LR images (not the original LR images) to HR images, so the computational cost of the network increases quadratically with the size of the HR image [8]. In addition, SRCNN adopts a 5 × 5 convolution kernel to learn the nonlinear mapping, which limits the learning ability of the network in such a setting. To solve these problems, researchers [14][15][16] proposed approaches such as the larger super-resolution convolutional neural network (SRCNN-ex), the fast super-resolution convolutional neural network (FSRCNN), the very deep super-resolution network (VDSR), the Laplacian pyramid super-resolution network (LapSRN) and the enhanced deep super-resolution network (EDSR) to improve the structure of SRCNN. In addition, Christian et al.
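The coupled-dictionary pipeline described above can be sketched in a few lines. The following is only a minimal illustration under simplifying assumptions (ISTA as the sparse solver, toy dictionaries), not Yang et al.'s actual implementation:

```python
import numpy as np

def ista(y, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||y - D a||^2 + lam*||a||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)
        a = a - grad / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

def reconstruct_hr_patch(y_lr, D_lr, D_hr, lam=0.1):
    """Encode the LR patch over D_lr, then decode with D_hr (coupled dictionaries)."""
    a = ista(y_lr, D_lr, lam)              # coding coefficient alpha from D_L
    return D_hr @ a                        # HR patch reconstructed with D_H
```

The key assumption of these methods is visible in `reconstruct_hr_patch`: the same coefficient vector is shared between the LR and HR dictionaries.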
[17] exploited generative adversarial networks (GAN) [18] to handle the image super-resolution problem and proposed the super-resolution generative adversarial network (SRGAN). GAN differs from the convolutional neural network (CNN) mainly in its goal: GAN aims to generate more realistic HR images that are more consistent with human perception, while the purpose of CNN is to faithfully restore the high-frequency information of images. Many improved networks based on GAN have been proposed in recent years [19][20]. These networks have solved the above problems to some extent and achieved better restoration performance.
However, most of these models are trained on images at a single scale and do not consider the relationships between images at different scales. For example, if we set the scale factor to 4 during training, FSRCNN only utilizes the image information at that scale and does not use the complementary information residing in other scales. If we can simultaneously use the information of images at different scales in the training process, the quality of the recovered images can be further improved.
Based on the above concerns, we design a cascaded super-resolution convolutional neural network (CSRCNN), which consists of three cascaded FSRCNNs. Since each sub-FSRCNN processes an image at a specific scale, our network can simultaneously exploit three image scales and use the information of all three during training. For each FSRCNN, we set the scaling factor to 2, which doubles the size of the input image. Since images at different scales are trained together, the learned network can exploit more image information residing across scales. In addition, we use the L1 loss function to make the texture and edges of the reconstructed image clearer. Experiments on multiple datasets show that our proposed network achieves better performance for image SR.

Deep Learning for Image Super-Resolution
The purpose of image super-resolution is to reconstruct a high-resolution image from a given low-resolution one. Dong et al. [7] first applied a convolutional neural network in the field of image super-resolution and proposed the super-resolution convolutional neural network (SRCNN), which achieved impressive performance for image restoration. At present, deep learning is widely used in image super-resolution, and various network structures have been developed, such as deep networks with residual learning [14], Laplacian pyramid structures [15], residual blocks [21], and residual dense networks [22]. Besides supervised and unsupervised learning [23][24][25][26], reinforcement learning [27][28][29] has also been introduced to solve the image super-resolution problem in recent years. The literature [30] provides a systematic survey of image super-resolution.

Fast Super-Resolution Convolutional Neural Networks
Dong et al. [8] proposed the fast super-resolution convolutional neural network (FSRCNN), aiming to accelerate SRCNN. Compared with SRCNN, FSRCNN directly learns the mapping from the original LR image to the HR image by introducing a deconvolution layer at the end of the network. In addition, FSRCNN adds a shrinking layer and an expanding layer at the beginning and end of the mapping layers to enhance the representation capability of the non-linear mapping [31]. In the non-linear mapping layers, FSRCNN adopts smaller filters (size 3 × 3) to reduce the computational cost of the network. The whole network is designed as a compact hourglass-type CNN structure, which includes five parts: feature extraction, shrinking, non-linear mapping, expanding, and deconvolution. The shrinking layer uses filters of size 1 × 1 to reduce the LR feature dimension from d to s, where d is the feature dimension of the LR image after feature extraction and s is the number of filters (s ≤ d). The expanding layer, the inverse operation of the shrinking layer, uses d filters of size 1 × 1 to maintain consistency with the shrinking layer. To avoid the "dead features" [32] caused by zero gradients in ReLU, the authors use PReLU [33] as the activation function after each convolution layer. Thus, a complete FSRCNN with m mapping layers can be represented as Conv(5, d, 1) - PReLU - Conv(1, s, d) - PReLU - m × Conv(3, s, s) - PReLU - Conv(1, d, s) - PReLU - DeConv(9, 1, d).
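This hourglass structure can be sketched in PyTorch. The following is a minimal illustration using the hyper-parameters d = 56, s = 12, m = 4 reported for FSRCNN, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class FSRCNN(nn.Module):
    """Hourglass FSRCNN sketch: feature extraction, shrinking, m mapping
    layers, expanding, and a deconvolution that upscales by `scale`.
    d, s, m follow the notation in the text."""
    def __init__(self, scale=2, d=56, s=12, m=4):
        super().__init__()
        layers = [nn.Conv2d(1, d, 5, padding=2), nn.PReLU(d),   # feature extraction
                  nn.Conv2d(d, s, 1), nn.PReLU(s)]              # shrinking (1x1)
        for _ in range(m):                                      # non-linear mapping (3x3)
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU(d)]             # expanding (1x1)
        self.body = nn.Sequential(*layers)
        # deconvolution enlarges H and W by `scale`
        self.deconv = nn.ConvTranspose2d(d, 1, 9, stride=scale,
                                         padding=4, output_padding=scale - 1)

    def forward(self, x):
        return self.deconv(self.body(x))
```

Note how the 1 × 1 shrinking/expanding pair keeps the costly 3 × 3 mapping in the low-dimensional space of s channels.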

Network Structure
We propose a cascaded super-resolution convolutional neural network (CSRCNN) framework in this section. Our proposed network consists of three cascaded FSRCNNs, where each FSRCNN can process a specific scale image. Fig. 1 shows the structure of our CSRCNN.
In each sub-FSRCNN, we set the scaling factor to 2 so that the size of the input image is doubled. We assume that all images have the same width-to-height ratio.
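The cascade itself can be sketched as follows. Here SubNet is a deliberately simplified, hypothetical stand-in for a ×2 FSRCNN (a real implementation would plug in the full sub-network); the sketch only illustrates how the three stages are chained and how the intermediate ×2 and ×4 outputs are collected:

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """Simplified stand-in for one x2 FSRCNN stage (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.PReLU(16),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),  # doubles H and W
        )

    def forward(self, x):
        return self.body(x)

class CSRCNN(nn.Module):
    """Three cascaded x2 stages: the output of each stage feeds the next,
    and all intermediate results (x2, x4, x8) are returned."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([SubNet() for _ in range(3)])

    def forward(self, x):
        outputs = []
        for stage in self.stages:
            x = stage(x)
            outputs.append(x)
        return outputs
```

Returning every intermediate output is what lets a loss be attached to each scale during training.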

Loss Function
Our network includes three cascaded FSRCNNs. For each subnetwork, the output of the previous network is the input of the next cascaded subnetwork. We use L1 loss functions as the sub-loss functions of all three subnetworks, and the sum of these sub-loss functions forms the total loss function of the whole network. For subnetwork k, the loss function is defined as

L_k = (1/N) Σ_{i=1}^{N} || F_k(x_i^k) − y_i^k ||_1, (2)

where x_i^k and y_i^k are the i-th input and ground-truth image at stage k, F_k denotes the k-th sub-FSRCNN, and N is the number of training samples. In Eq. (2), we select the L1 loss function to replace the MSE loss function. The main reason is that the L1 loss makes the texture and edges of the reconstructed image clearer, while the MSE loss tends to lose high-frequency information, such as texture and edges, during reconstruction [34]. As Fig. 1 illustrates, the LR image and the output of each subnetwork serve as the input to the next, each sub-FSRCNN doubles the size of its input image, and the total loss of the whole network is the sum of the three L1 sub-losses.
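The total loss can be sketched as the sum of per-stage L1 losses. This is a minimal PyTorch illustration, assuming the cascade returns one output per stage and that ground-truth images are available at each scale:

```python
import torch
import torch.nn.functional as F

def cascade_l1_loss(outputs, targets):
    """Sum of L1 sub-losses, one per cascade stage.

    outputs/targets: lists of tensors, one pair per scale (x2, x4, x8)."""
    return sum(F.l1_loss(o, t) for o, t in zip(outputs, targets))
```

Because each sub-loss is attached to its own stage, every sub-FSRCNN receives a direct gradient signal at its own scale.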

Assessment Process
According to the scale ratio of the LR image to the HR image, we assign LR images to different stages of the cascaded network during the evaluation process. When an LR image is input into FSRCNN-k, it is first resized to the input shape expected by that network, r_k W × r_k H. For example, a test image of size W/3 × H/3 is resized to W/2 × H/2 by bicubic interpolation, and the resized image is fed into FSRCNN-2. Each sub-network doubles the size of its input image, so all images are finally enlarged to uniform HR images [35]. Fig. 2 shows the entire evaluation process of our network.
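The routing rule can be sketched as a small helper. This is only an illustrative reading of the assessment process (the rounding rule and return values are our assumptions, not taken from the paper): it maps an LR image to the nearest power-of-two fraction of the HR size from below and reports how many ×2 stages it still needs:

```python
import math

def route_lr_image(lr_w, lr_h, hr_w, hr_h, n_stages=3):
    """Choose the cascade entry point for a LR image (illustrative helper).

    The LR image is resized (bicubic in the paper) to hr/2**k, then passes
    through k x2 stages. Assumes a shared width-to-height ratio.
    Returns (resized_w, resized_h, k)."""
    ratio = hr_w / lr_w                                  # e.g. 3 for a W/3 input
    # number of x2 stages needed, clamped to the cascade depth
    k = min(max(int(math.floor(math.log2(ratio))), 1), n_stages)
    return hr_w // 2 ** k, hr_h // 2 ** k, k
```

Under this reading, a W/3 × H/3 image is resized up to W/2 × H/2 and needs one remaining doubling, matching the example in the text.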

The Difference between Our Model and FSRCNN
Our network contains three cascaded FSRCNNs, where each FSRCNN processes an image at a specific scale. For each sub-FSRCNN, we use the L1 loss function, whereas the original FSRCNN uses the MSE loss function. In addition, FSRCNN only considers a single scale, while our network can simultaneously train on images at different scales thanks to the cascaded structure. In the test phase, our network generates multiple intermediate results in a single prediction pass, while FSRCNN only produces a single-scale result.

Dataset
Training and testing datasets: Following SRCNN and FSRCNN, we combine the 91-image dataset and the General-100 dataset to train our network. In particular, the General-100 dataset [8] contains 100 bmp-format images ranging in size from 710 × 704 (large) to 131 × 112 (small). All of these images are of good quality, with clear edges and few smooth regions. Following the setting used in [8], we adopt data augmentation [36], including scaling and rotating images. 1) Scaling: each image is down-sampled with factors 0.9, 0.8, 0.7 and 0.6. 2) Rotation: each image is rotated by 90, 180 and 270 degrees. For testing, we use the Set5 [37], Set14 [38] and BSD200 [39] datasets. All testing images are cropped according to the model structure so that the output image size of the model is an integer.
Training samples: A cascaded structure is adopted in our network. To form LR/HR sub-image pairs, we first down-sample the original training images by different scale factors n, then crop the LR images and the real HR images obtained in the sampling process.
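The augmentation schedule above (scales 0.9-0.6, rotations 90/180/270) can be sketched as follows. Nearest-neighbour subsampling stands in for proper image resampling here, so this is only an illustration of the schedule, not the exact preprocessing:

```python
import numpy as np

def augment(img):
    """Return the original image plus its scaled (0.9, 0.8, 0.7, 0.6)
    and rotated (90, 180, 270 degrees) variants."""
    out = [img]
    h, w = img.shape[:2]
    for f in (0.9, 0.8, 0.7, 0.6):
        # nearest-neighbour subsampling as a stand-in for bicubic down-scaling
        ys = np.linspace(0, h - 1, int(h * f)).astype(int)
        xs = np.linspace(0, w - 1, int(w * f)).astype(int)
        out.append(img[np.ix_(ys, xs)])
    out += [np.rot90(img, k) for k in (1, 2, 3)]   # 90, 180, 270 degrees
    return out
```

Each original image therefore yields eight training images before LR/HR pair extraction.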

Training Details
Training strategy: In our experiment, we combine the 91-image dataset and the General-100 dataset [8] for training. The 91-image dataset is used to train a network from scratch. Then, we add the General-100 dataset for fine-tuning when the training is saturated. With this strategy, training converges much earlier compared with training on both datasets from the start.
The choice of learning rate: The learning rate plays a very important role in the performance of our network. For FSRCNN, the learning rates of the convolution layers and the deconvolution layer are set to 10^-3 and 10^-4 respectively, and the learning rates of all layers are halved during fine-tuning. These learning rates are static in each iteration, whereas we adopt a dynamic strategy to update the learning rates of the convolution and deconvolution layers in our network. This strategy allows our network to use better learning rates in different iterations.
The choice of network initialization: Network initialization has a great influence on network training; a good initialization can largely reduce the training time of the network. In our network, we select PReLU [33] as the activation function and use MSRA initialization [33]. MSRA initialization draws weights from a zero-mean Gaussian with variance 2/n, which keeps the input/output variance of each layer's neurons consistent.
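MSRA initialization can be sketched in PyTorch as below. The layer sizes are illustrative; PyTorch's built-in `nn.init.kaiming_normal_` implements the same scheme:

```python
import torch
import torch.nn as nn

def msra_init(module):
    """MSRA (He) initialization: zero-mean Gaussian with variance 2/n,
    where n is the number of input connections of the layer; biases zeroed."""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        fan_in = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
        module.weight.data.normal_(0.0, (2.0 / fan_in) ** 0.5)
        if module.bias is not None:
            module.bias.data.zero_()

# illustrative layer sizes, paired with PReLU as in the text
net = nn.Sequential(nn.Conv2d(1, 56, 5, padding=2), nn.PReLU(56))
net.apply(msra_init)
```

The 2/n variance compensates for the halved activation variance introduced by rectifier nonlinearities, which is why it is the standard pairing with PReLU.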

Experiments for Different Loss Functions
In this section, we conduct experiments on three datasets to evaluate the effect of different loss functions on our network when the upscaling factor is 3. We compare the MSE, Huber, Charbonnier and L1 loss functions, and the results are listed in Table 1. The delta of the Huber loss is 0.9 and the parameter of the Charbonnier loss is 0.0001. From Table 1, we can see that our cascaded network achieves better performance with the L1 loss. Fig. 3 shows the convergence curves of these four loss functions on the Set5 dataset with upscaling factor 3. From Fig. 3, we first observe that our cascaded network obtains the best average test PSNR values with the L1 loss, followed by the Charbonnier loss, while the Huber loss gives the worst average test PSNR values. In addition, after 160 iterations the PSNR curves of all four loss functions become stable; with the Charbonnier and MSE losses, the network shows small fluctuations. We also find that the initial PSNR value is low when the L1 loss is used; as the number of iterations increases, the PSNR value gradually increases and eventually stabilizes.
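The four losses compared above can be sketched as follows (Huber delta 0.9; we interpret the Charbonnier parameter 0.0001 as its epsilon, an assumption on our part):

```python
import torch
import torch.nn.functional as F

def charbonnier_loss(pred, target, eps=1e-4):
    """Charbonnier loss, a smooth, differentiable approximation of L1:
    mean of sqrt((pred - target)^2 + eps^2)."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def compare_losses(pred, target, delta=0.9):
    """Evaluate the four losses compared in the text on one prediction."""
    return {
        "MSE": F.mse_loss(pred, target).item(),
        "Huber": F.huber_loss(pred, target, delta=delta).item(),
        "Charbonnier": charbonnier_loss(pred, target).item(),
        "L1": F.l1_loss(pred, target).item(),
    }
```

For large residuals, MSE grows quadratically while L1 and Charbonnier grow linearly, which is one intuition for why the latter preserve sharp edges better.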

Evaluation for the Number of Cascaded FSRCNNs
In this section, we evaluate the performance of our network versus the number of cascaded FSRCNNs. Since the scaling factor of each FSRCNN is set to 2, cascading more than three FSRCNNs would make the input image to the network too small, so that the effective information contained in the image would be very limited. Therefore, in our model, we only cascade three FSRCNNs. The compared networks are denoted Net-1, Net-2 and Net-3: Net-1 contains one FSRCNN with an upscaling factor of 4, and Net-2 contains two cascaded FSRCNNs, each with an upscaling factor of 2. We use PSNR as the evaluation criterion and test on three datasets for 4× SR. Tab. 2 lists the results of our model as the number of FSRCNNs varies from 1 to 3. From Tab. 2, we can observe that Net-3 obtains the highest PSNR value on all testing datasets, and Net-2 obtains higher results than Net-1. This indicates that cascading multiple basic models achieves better performance than a single basic model. In addition, the improvement in PSNR gradually shrinks as the number of cascaded FSRCNNs increases; one possible reason is that the extra information provided by small-scale LR images is limited.
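For reference, PSNR as used in this evaluation can be computed with its standard definition over 8-bit images:

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR is a log of the inverse MSE, a fixed dB gain corresponds to a multiplicative reduction in reconstruction error.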

Comparison with Popular SR Algorithms
To verify the reconstruction performance of our proposed CSRCNN, we compare our model with three popular SR algorithms: bicubic interpolation, SRCNN [7], and FSRCNN [8]. We carry out extensive experiments on three datasets: Set5 [37], Set14 [38] and BSD200 [39]. We adopt two image quality metrics to evaluate the SR images, i.e., PSNR and SSIM [40]. Tab. 3 reports quantitative comparisons for ×2, ×3 and ×4. From Tab. 3, we can observe that among all compared methods, our proposed CSRCNN achieves the best results on the three evaluation datasets for ×2, ×3 and ×4. Fig. 4 shows the convergence curves of CSRCNN on the three evaluation datasets with upscaling factor 3. The PSNR values of the proposed CSRCNN achieve improvements of 0.51 dB, 1.04 dB and 0.46 dB over FSRCNN on the Set5 dataset for ×2, ×3 and ×4, respectively. One possible reason for the 1.04 dB improvement over FSRCNN at factor 3 is that we use bicubic interpolation to resize the LR image before feeding it into the network. Considering that our network is made up of three cascaded FSRCNNs, where each FSRCNN has an upscaling factor of 2, we also made a quantitative comparison for ×8; the results are shown in Tab. 4. Our network achieves higher PSNR and SSIM values than FSRCNN due to the cascaded design.
Figure 5: Visual comparisons on the Set5 and Set14 datasets with upscaling factor 3. The first row is the "butterfly" image from the Set5 dataset, and the second row is the "lenna" image from the Set14 dataset.
Fig. 5 shows the visual comparisons on the Set5 and Set14 datasets with upscaling factor 3. As can be seen, our method accurately reconstructs details of the image, such as texture and edges.

Conclusion
In this paper, we proposed a more efficient network for mining image information at different scales, named the cascaded super-resolution convolutional neural network (CSRCNN). We adopted a cascaded network structure to mine image information at different scales, and used the L1 loss function for each subnetwork during training, which makes the texture and edges of the reconstructed image clearer. Finally, we explored the effect of the number of cascaded FSRCNNs on the performance of our network. Extensive experiments show that the proposed method achieves satisfactory SR performance.