Superresolution Reconstruction of Video Based on Efficient Subpixel Convolutional Neural Network for Urban Computing

Video surveillance is an important data source for urban computing and intelligence. The low resolution of many existing video surveillance devices limits the efficiency of urban computing and intelligence, so improving the resolution of surveillance video is an important task. In this paper, video resolution is improved by learning-based superresolution reconstruction. Unlike the superresolution reconstruction of static images, the superresolution reconstruction of video can exploit motion information. However, there are few studies in this area so far. To fully exploit motion information for video superresolution, this paper proposes a reconstruction method based on an efficient subpixel convolutional neural network in which optical flow is introduced into the deep learning network. Fusing the optical flow features between successive frames compensates for information missing in individual frames and yields high-quality superresolution results. In addition, to improve the superresolution, a subpixel convolution layer is added after the deep convolutional network. Finally, experimental evaluations demonstrate the satisfying performance of our method compared with previous methods and other deep learning networks; our method is also more efficient.


Introduction
Superresolution reconstruction generates high-resolution results from low-resolution images using reconstruction models. In contrast to hardware upgrades, algorithm-based image reconstruction can generate high-resolution images efficiently and at low cost. Because large numbers of image and video samples are now available, more and more researchers focus on image reconstruction.
Recently, a number of superresolution reconstruction methods have been proposed. For example, Farsiu et al. [1] proposed a regularized reconstruction method for solving the ill-posed inverse problem. A fast superresolution reconstruction algorithm was proposed for pure translational motion and space-invariant blur [2]. The principle of neighborhood embedding [3] is the assumption that the local spatial structure of the low-resolution image block and that of the high-resolution image block are similar. The low-resolution image block represents the low-dimensional data, and the high-resolution image block represents the high-dimensional space. Therefore, the low-resolution image can be linearly represented and mapped to the high-dimensional image block to reconstruct the high-resolution image. Yu and Zhang proposed an improved glowworm swarm optimization algorithm for superresolution reconstruction of video images [4]. In line with the characteristics of superresolution reconstruction, the algorithm redefines the swarm input, the glowworm's luciferin, and the location update equation, and sets the criterion for the optimization objective function.
Although the above methods can solve some problems of superresolution reconstruction, they have drawbacks. In the neighborhood embedding method, for example, the number k of neighboring image blocks selected for each low-resolution image block is specified manually, so the reconstruction quality depends on this choice and underfitting or overfitting may occur. Therefore, we propose a new superresolution reconstruction method based on optical flow and an efficient subpixel convolutional neural network (ESPCN), which addresses these defects of traditional reconstruction methods.
The rest of the paper is organized as follows. Related work is discussed in Section 2. The methodology is described in Section 3. The details of our method are presented in Section 4. Section 5 presents the experiments and result analysis. Finally, the conclusions are outlined in Section 6.

Related Works
There are three kinds of superresolution methods: interpolation-based, reconstruction-based, and learning-based methods.
Interpolation-based methods estimate the missing pixels by using the prior information in the original low-resolution image. The key to the interpolation operation is to establish the mapping between the low-resolution and high-resolution images. In general, interpolation methods fall into three categories: nearest-neighbor interpolation [5], bilinear interpolation [6], and bicubic interpolation [7]. However, interpolation-based methods produce low-quality reconstruction results that commonly suffer from texture noise.
Reconstruction-based methods model the degradation process of the image samples and generate a high-resolution image by inverse transformation. Establishing the degradation model is the key step; with it, multiple images can be fused to reconstruct a high-resolution image. There are three typical reconstruction methods: iterative back projection [8], projections onto convex sets (POCS) [9], and maximum a posteriori (MAP) estimation [10]. In 2011, Zhang et al. improved the POCS-based method for sequential frame images [11] and applied the model to image sequences under panoramic transformation. This method realizes the geometric transformation of the sequence frames accurately and improves the speed of reconstruction. In 2012, Wallach et al. improved the MAP method [12], increasing the contrast and resolution of the reconstructed image by fusing global information.
Learning-based methods learn the correlation between low-resolution and high-resolution images. These methods comprise three phases: feature extraction, feature learning, and reconstruction. Typical methods include neighbor embedding [3] and sparse coding [13][14][15]. The representative method based on sparse representation is the superresolution image reconstruction method [16] proposed by Yang et al. in 2008. High- and low-resolution dictionaries are trained on HR and LR images, so that every low-resolution image to be reconstructed can obtain a sparse representation from the dictionaries. However, the style and edges of the reconstructed image need to be similar to those of the data set used in learning, and the quality and efficiency of reconstruction are not satisfactory.
In recent years, deep learning-based methods have demonstrated excellent performance on superresolution reconstruction [17,18]. Building on the sparse coding method, Dong et al. first proposed a superresolution method using a deep convolutional network, termed the superresolution convolutional neural network (SRCNN) [17]. A 3-layer convolutional neural network (CNN) was designed to learn the mapping from low-resolution to high-resolution images. To accelerate the reconstruction process, Dong et al. further proposed a fast superresolution convolutional neural network (FSRCNN), based on an hourglass-shaped CNN that has more layers than previous methods but fewer parameters. The advantage of FSRCNN is that it removes the need to magnify the input image beforehand, which increases the efficiency of the reconstruction process.
Shi et al. proposed an efficient subpixel convolutional neural network (ESPCN) [18] that extracts features in the low-resolution domain and replaces the classic bicubic upsampling operation with an efficient subpixel convolution. ESPCN is characterized by a subpixel convolution that generates a feature map with r² channels. Through periodic activation in the subpixel convolution, the feature map of size H × W × r² is rearranged to generate a high-resolution image of size rH × rW × 1. The image features are extracted by the hidden layers of the deep architecture. In practice, ESPCN is more efficient than other methods and is therefore better suited to real-time tasks. Talab et al. used ESPCN and a CNN for super-low-resolution face recognition [19].
In order to expand the receptive field for image reconstruction, Kim et al. proposed a very deep superresolution reconstruction method (VDSR) that increased the receptive field from 13 × 13 to 41 × 41 [20]. The sparsity property was used to accelerate the convergence process. In addition, VDSR could generate superresolution results at multiple scales. Compared with previous methods, VDSR could capture more image details and thus improve the performance of superresolution reconstruction.
Ledig et al. introduced the generative adversarial network (GAN) to superresolution reconstruction and proposed the superresolution generative adversarial network (SRGAN) [21]. SRGAN comprises a generative block and an adversarial block, where the first is used for reconstruction and the second is used to classify the quality of the reconstructed images. Although SRGAN did not achieve competitive PSNR scores, its results were close to real images and visually satisfying.
Sun et al. introduced an updated residual network (ResNet) into superresolution reconstruction [22]. The number of residual layers was increased from 16 to 32, expanding the model scale to generate better results.

Methodology
To improve the speed and quality of superresolution reconstruction, this paper proposes a new method based on optical flow and ESPCN. The motion information between frames is considered in the reconstruction process to improve the reconstruction quality. Optical flow describes the motion of an object over a very short time interval [23].

Wireless Communications and Mobile Computing
Optical flow is suitable for motion estimation between video frames, and it performs well in accuracy and effectiveness compared with other motion estimation methods. Because of the subpixel convolution layer, ESPCN has very little additional computational cost compared with other deep learning methods, so it can be used in real time to improve the efficiency of urban computing and intelligence. Specifically, the optical flow feature is extracted and combined with the image as the input to the network. Finally, the subpixel convolutional layer is applied to produce a high-resolution image from the feature map.

The framework of our method is shown in Figure 1. There are three phases: motion information estimation, information fusion, and reconstruction.

Optical flow is defined as the instantaneous velocity of the pixels, including those of moving objects [24][25][26]. For a video sequence, the optical flow is extracted by estimating the changes between two frames. The correlation between successive frames is estimated to identify the changes along the time axis. The optical flow feature is represented as a two-dimensional vector that describes the rate of intensity change.
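As a minimal illustration of how a two-dimensional flow vector can be estimated from the changes between two frames, the sketch below solves the brightness-constancy equation Ix·u + Iy·v = −It in the least-squares sense for one global displacement. This Lucas-Kanade-style estimator is an assumption chosen for brevity; the paper does not state which optical flow algorithm it uses, and the names `global_flow` and `synthetic_frame` are hypothetical.

```python
import numpy as np


def global_flow(prev, curr):
    """Least-squares estimate of a single (u, v) displacement for a patch.

    Solves Ix*u + Iy*v = -It (brightness constancy) over interior pixels.
    Only valid for small, sub-pixel-scale motion of smooth image content.
    """
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    iy, ix = np.gradient(prev)            # derivatives along rows (y) and columns (x)
    it = curr - prev                      # temporal derivative between the two frames
    inner = (slice(1, -1), slice(1, -1))  # drop the border, where differences are one-sided
    a = np.stack([ix[inner].ravel(), iy[inner].ravel()], axis=1)
    b = -it[inner].ravel()
    (u, v), *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(u), float(v)


def synthetic_frame(dx, dy, size=64, period=32.0):
    """Smooth test pattern translated by (dx, dy) pixels (illustration only)."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float64)
    k = 2.0 * np.pi / period
    return np.sin(k * (xs - dx)) + np.cos(k * (ys - dy))


# The second frame is the first one moved by 0.5 px in x and -0.25 px in y.
u, v = global_flow(synthetic_frame(0.0, 0.0), synthetic_frame(0.5, -0.25))
print(f"estimated flow: u={u:.3f}, v={v:.3f}")
```

A dense flow field would repeat the same least-squares solve in a small window around every pixel; the global version here only shows the underlying constraint.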
Assume there are L layers in the ESPCN network. The first L − 1 layers can be expressed as follows:

$$f^{1}(I^{LR}; W_{1}, b_{1}) = \phi(W_{1} * I^{LR} + b_{1}),$$

$$f^{l}(I^{LR}; W_{1:l}, b_{1:l}) = \phi(W_{l} * f^{l-1}(I^{LR}) + b_{l}),$$

in which $W_l$ and $b_l$ are the weight and bias of each layer, respectively; the value of $l$ is between 1 and $L-1$; $W_l$ is a two-dimensional convolution tensor of size $n_{l-1} \times n_l \times k_l \times k_l$; $n_l$ is the feature dimension of the $l$th layer; $n_0$ is the channel number $C$; $k_l$ is the convolution kernel size of the $l$th layer; $\phi$ is the activation function; and the last layer of the network, $f^L$, maps the low-resolution image to the high-resolution image $I^{SR}$.
The subpixel convolution was first proposed in ESPCN, where the convolutional kernel is used to activate different parts of the image. As the kernel shifts, the subpixels are periodically activated according to their locations, which can be defined mathematically as follows:

$$I^{SR} = f^{L}(I^{LR}) = PS(W_{L} * f^{L-1}(I^{LR}) + b_{L}),$$

where $f^{L}(\cdot)$ and $f^{L-1}(\cdot)$ are the mapping functions of layers $L$ and $L-1$, SR and LR denote superresolution and low resolution, $W_L$ is the kernel of layer $L$, $b_L$ is the bias, and $PS(\cdot)$ is the periodic shuffling operator that maps the feature map to the high-resolution image:

$$PS(T)_{x,y,c} = T_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ C \cdot r \cdot \mathrm{mod}(y,r) + C \cdot \mathrm{mod}(x,r) + c},$$

where $x$ and $y$ locate the pixels in the high-resolution image; the size of the kernel $W_L$ is $n_{L-1} \times r^{2}C \times k_L \times k_L$. Since no nonlinearity is applied to the output of the last convolutional layer, the subpixel convolution operates on the low-resolution image when $k_L = k_s / r$ and $\mathrm{mod}(k_s, r) = 0$, where $k_s$ is the size of the equivalent subpixel convolution filter.

We use the mean squared error (MSE) as the training objective. It is computed as follows:

$$\ell(W_{1:L}, b_{1:L}) = \frac{1}{r^{2}HW} \sum_{x=1}^{rH} \sum_{y=1}^{rW} \left( I^{HR}_{x,y} - f^{L}_{x,y}(I^{LR}) \right)^{2},$$

where $I^{HR}$ denotes the high-resolution reference image.

We use the tanh function as the activation function. Introducing nonlinearity into the network with the tanh function maps the output of each layer to the input of the next layer and enhances the expressive ability of the neural network. The tanh function is defined as

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

The tanh function is a hyperbolic function with fast convergence.
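The periodic rearrangement performed by the subpixel layer can be sketched in NumPy as follows. The function takes a feature map of shape H × W × (r²·C) and returns an image of shape rH × rW × C. The channel ordering used here (output pixel (x, y, c) is read from channel C·r·mod(y, r) + C·mod(x, r) + c of low-resolution location (⌊x/r⌋, ⌊y/r⌋)) is our reading of the ESPCN convention and should be treated as an assumption; deep learning frameworks may order the channels differently. The name `pixel_shuffle` is hypothetical.

```python
import numpy as np


def pixel_shuffle(t, r):
    """Rearrange an (H, W, r*r*C) feature map into an (r*H, r*W, C) image.

    out[x, y, c] = t[x // r, y // r, C*r*(y % r) + C*(x % r) + c]
    """
    h, w, channels = t.shape
    if channels % (r * r) != 0:
        raise ValueError("channel count must be a multiple of r*r")
    c = channels // (r * r)
    # Split the channel axis into (y % r, x % r, c), matching the index rule above.
    t5 = t.reshape(h, w, r, r, c)
    # Move the x-offset next to the row axis and the y-offset next to the column axis.
    t5 = t5.transpose(0, 3, 1, 2, 4)
    return t5.reshape(h * r, w * r, c)


# Usage: a 2 x 3 feature map with r = 2 and C = 1 becomes a 4 x 6 image.
features = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
image = pixel_shuffle(features, r=2)
print(image.shape)
```

Because the operation is only a reshape and a transpose, it adds almost no computation on top of the last convolution, which is consistent with the low cost attributed to the subpixel layer.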
Moreover, the subpixel convolution eliminates the initial interpolation process, so a satisfying reconstruction result can be generated at low cost. This architecture is well suited to the superresolution reconstruction of video with a large number of frames.

Algorithm
The phases of the superresolution reconstruction are shown in Figure 2. Five successive frames are selected as the input for our deep convolutional network. We define the time index of the third frame as n, so the time indices of the previous two frames are n − 2 and n − 1, and those of the following two frames are n + 1 and n + 2. The motion feature estimation is based on frame n; the optical flow calculation is performed on frames n − 2, n − 1, n + 1, and n + 2 to generate the feature maps for these frames. Then, these feature maps are combined with the original frames to obtain the input block of the deep convolutional network, where the subpixel convolutional layer is applied to reconstruct the superresolution images [27].

The deep convolutional network has a 4-layer structure that comprises 3 convolutional layers and 1 subpixel convolutional layer. We convert the color space of the original frame from the RGB space to the YCbCr space. The first layer comprises 64 kernels of size 5 × 5 × 15, which generate 64-channel feature maps. The second layer comprises 32 kernels of size 3 × 3, which generate 32-channel feature maps. The third layer comprises 3 × r² kernels of size 3 × 3, which generate the 3 × r² feature maps. The superresolution reconstruction is realized by the last subpixel convolutional layer.
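The selection of the five-frame window around frame n and the resulting 15-channel input can be sketched as below. This is only an illustration of the bookkeeping: it assumes the 15 channels arise from 5 frames × 3 color channels, it repeats the first or last frame at the ends of the sequence (the text does not say how sequence boundaries are handled), and it omits the optical-flow-based fusion step entirely. The names `temporal_window` and `stack_window` are hypothetical.

```python
import numpy as np


def temporal_window(frames, n, radius=2):
    """Return frames n-radius .. n+radius; edge frames are repeated (assumption)."""
    last = len(frames) - 1
    indices = [min(max(i, 0), last) for i in range(n - radius, n + radius + 1)]
    return [frames[i] for i in indices]


def stack_window(window):
    """Concatenate H x W x 3 frames along the channel axis."""
    return np.concatenate(window, axis=-1)


# Ten dummy 4 x 6 color frames; frame i is filled with the value i.
frames = [np.full((4, 6, 3), float(i)) for i in range(10)]
block = stack_window(temporal_window(frames, n=5))
print(block.shape)  # five 3-channel frames stacked into one block
```

In the method described above, the neighboring frames would first be combined with their optical flow features before stacking; the sketch keeps the frames unchanged so that only the windowing is shown.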
Specifically, we assume that the size of the frames is H × W, where H is the height of the frame and W is the width. Accordingly, the dimension of the input RGB frames is H × W × 3. In our model, five successive frames are jointly

Database.
Training data is collected from the Xiph database which includes 10 video sequences with the length of 2695 frames [28]. The resolution of the frame is 144 × 176.
The training data shown in Figure 3 cover different contents of urban video surveillance, such as a news broadcast, cars, a patrol boat, plants, and a rugby game. For the training video sequences, video segments are extracted and degraded by adding Gaussian noise and downsampling, which generates low-quality, low-resolution samples. These samples are then input into the model. Note that the training input of the convolutional neural network is an image block composed of five consecutive frames after motion estimation. The output of the convolutional neural network is a high-resolution image. The error between the output and the reference frame is used to update the network. Finally, the superresolution reconstruction model is obtained after the network converges.

The testing database shown in Figure 4 is the VideoSet4 database with 4 videos: calendar, building, foliage, and walk [29]. The building video includes 34 frames with the size of 704 × 576; the calendar video includes 41 frames with the size of 720 × 576. In the calendar sequence, the frames contain a large number of characters and regular textures, which can be used to evaluate the reconstruction of edge details. The foliage video shows traffic scenes including vehicles and streets. The resolution of the original frame is 720 × 480. Moreover, the contents of the foliage video are quite complicated, so it can be used to evaluate the robustness of our method against dynamic noise. The walk video shows walking pedestrians; it comprises 47 frames with the size of 720 × 480. This video is characterized by slowly moving objects and can thus be used to evaluate the performance of our method for motion estimation.
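The degradation used to create low-resolution training samples (downsampling plus Gaussian noise) could be sketched as follows. The details are assumptions, since the text does not specify them: block averaging is used as the downsampling filter, noise is added after downsampling, and the noise standard deviation `sigma` and the name `degrade` are hypothetical.

```python
import numpy as np


def degrade(hr, r=2, sigma=2.0, seed=0):
    """Create a low-resolution, noisy frame from a high-resolution frame.

    hr: array of shape (H, W) or (H, W, C) with values in [0, 255].
    r: integer downsampling factor; sigma: noise standard deviation.
    """
    rng = np.random.default_rng(seed)
    hr = np.atleast_3d(hr).astype(np.float64)
    h = hr.shape[0] // r * r              # crop so both sides are multiples of r
    w = hr.shape[1] // r * r
    c = hr.shape[2]
    # Average every r x r block (assumed downsampling filter).
    lr = hr[:h, :w].reshape(h // r, r, w // r, r, c).mean(axis=(1, 3))
    lr = lr + rng.normal(0.0, sigma, size=lr.shape)
    return np.clip(lr, 0.0, 255.0)


# A 144 x 176 color frame, the training resolution stated in the text, at scale 2.
frame = np.full((144, 176, 3), 100.0)
low = degrade(frame, r=2)
print(low.shape)
```

Each degraded frame is paired with its original high-resolution frame, which serves as the reference when the training error is computed.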

Parameter Setting.
The input block is the combination of the motion estimation and the original images, and it is 15-dimensional. With respect to the network architecture, the first layer consists of 64 kernels with the size of 5 × 5 × 15 and generates 64-dimensional feature maps. The second layer consists of 32 kernels, which are applied to the feature maps extracted by the first layer and generate 32-dimensional feature maps. The third layer comprises 3 × r² kernels with the size of 3 × 3 and generates 3 × r²-dimensional feature maps. Finally, the subpixel convolutional layer is used to reconstruct the superresolution images. The learning rate is set to 0.001, the maximum number of epochs is 100, and the batch size is 32.
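To make the layer dimensions concrete, the following sketch tabulates the output shape and the number of trainable parameters of each layer for a given scale factor r. It assumes that every kernel spans all input channels of its layer, that each kernel has one bias term, that the second layer uses 3 × 3 kernels, and that padding preserves the H × W spatial size; the text does not state the padding or bias conventions. The function name `espcn_layer_summary` is hypothetical.

```python
def espcn_layer_summary(h, w, r, in_channels=15):
    """List (name, output shape, parameter count) for the 4-layer network."""
    layers = [
        ("conv1", 5, in_channels, 64),   # 64 kernels of 5 x 5 x 15
        ("conv2", 3, 64, 32),            # 32 kernels of 3 x 3 x 64
        ("conv3", 3, 32, 3 * r * r),     # 3*r^2 kernels of 3 x 3 x 32
    ]
    summary = []
    for name, k, c_in, c_out in layers:
        params = k * k * c_in * c_out + c_out   # weights plus one bias per kernel
        summary.append((name, (h, w, c_out), params))
    # The subpixel layer only rearranges values, so it has no parameters.
    summary.append(("subpixel", (h * r, w * r, 3), 0))
    return summary


for name, shape, params in espcn_layer_summary(72, 88, r=2):
    print(f"{name:9s} output={shape} parameters={params}")
```

Under these assumptions and r = 2, the three convolutional layers hold 24064, 18464, and 3468 parameters respectively, so the model stays small enough for the real-time use discussed earlier.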

Experimental Comparisons.
The comparison is divided into two parts. The first compares traditional methods with the proposed method. The second compares superresolution with motion estimation and without motion estimation. The data for experimental evaluation are selected from VideoSet4. The PSNR and time cost are selected as the metrics. The PSNR is calculated as follows:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right),$$

in which MAX is the maximum possible pixel value of the image and MSE is the mean squared error.
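The PSNR metric can be computed with a few lines of NumPy. The sketch below uses the standard definition, 10·log10(MAX² / MSE), and assumes 8-bit images so that MAX defaults to 255; the function name `psnr` is hypothetical.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")               # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Usage: compare a reference frame with a slightly perturbed copy.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(48, 64)).astype(np.float64)
rec = np.clip(ref + rng.normal(0.0, 2.0, size=ref.shape), 0.0, 255.0)
print(f"PSNR: {psnr(ref, rec):.2f} dB")
```

Higher values indicate that the reconstructed frame is closer to the reference frame.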

The First Comparison.
The traditional methods include bicubic interpolation, POCS, and sparse coding-based superresolution reconstruction, while the compared deep learning-based method is SRCNN. Figures 5 and 6 qualitatively present the results for frame #13 in the walk video and frame #8 in the building video. The results obtained by the traditional methods, i.e., the sparse coding and POCS-based methods, are more blurred than the results obtained by our method. For example, the contours of the pedestrians and the texture of the baby carriage can be clearly distinguished in our reconstructed images. Moreover, a significant advantage of our method can be demonstrated in Figure 5. In Figure 5, the windows of the building generate many complicated textures, which cause periodic noise in the results obtained by the compared methods. However, this effect is largely removed by our method owing to the use of the motion features. Table 1 quantitatively shows the PSNR scores obtained by the different methods. Table 1 shows that our method generates better reconstruction results with respect to the PSNR scores. Compared with the traditional reconstruction methods, the advantage of our method is that motion feature learning improves the correctness of the reconstruction. Although both SRCNN and our method are based on convolutional operations, our method achieves the best results. Unlike SRCNN, our method estimates and introduces motion information, which is the reason for its better performance.

The Second Comparison.
The two models, with and without motion estimation, are trained on the same video training set. Table 2 shows the PSNR of the four videos of VideoSet4 reconstructed by the two models. At a magnification factor of two, the algorithm without motion estimation of adjacent frames is generally about 5 dB higher than bicubic interpolation. The average increase of our method is about 0.12 dB compared with SRCNN (without motion estimation). Figure 7 shows the reconstruction quality of the 19th frame of the foliage sequence in VideoSet4 under different algorithms. The biggest difference between the ESPCN model with motion estimation and the ESPCN model without motion estimation is found for moving objects, where adding motion estimation produces a better reconstruction; this is more obvious for fast-moving objects. In Figure 7(d), the moving black car in the middle of the view shows obvious distortion and a halo. In Figure 7(e), the reconstructed image using motion estimation, the edge of the black car is relatively straight and clear. This is because the new model takes into account the motion information before and after the frame, so it captures the temporal coherence of the video better than a model that uses the information of a single frame only.

The same phenomenon is shown in Figure 8. In the 31st frame of the calendar video sequence, the characters on the calendar in the bicubic interpolation image in Figure 8(c) are blurred. In Figure 8(d), although the single-frame superresolution method sharpens the character edges to a certain extent, there is some dislocation between the smaller characters. In Figure 8(e), the reconstructed image of the model combined with motion estimation, the edges of the smaller characters are more clearly separated and the degree of deformation is lower. The difference between foliage and calendar is that foliage mainly reflects the motion of the object itself, while calendar reflects the relative motion of the object caused by the camera motion. This shows that motion estimation is effective for both kinds of motion.

Conclusions
To improve the performance of video reconstruction, this paper proposes a deep convolutional network-based reconstruction method in which motion information is extracted and introduced. Moreover, the proposed method uses subpixel convolution, which can significantly speed up the reconstruction process. Experimental results demonstrate that our method generates better reconstruction results than previous methods, SRCNN, and ESPCN. The proposed method can be deployed as an embedded program on hardware architectures composed of integrated circuit chips, such as a digital signal processor (DSP) or a system on a programmable chip (SOPC). Therefore, our method can be configured in the front-end devices of video surveillance. In the future, it will be used as a form of edge intelligence and provide a feasible way of reducing the computing load of the central system of urban computing, and we will evaluate the contribution of other types of motion information to superresolution reconstruction.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.