Progressive U-Net residual network for computed tomography images super-resolution in the screening of COVID-19

Thin-slice computed tomography (CT) examination plays an important role in screening patients with suspected and confirmed coronavirus disease 2019 (COVID-19). Improving the resolution of COVID-19 CT images therefore has important clinical value for diagnosis and condition assessment. However, existing single-image super-resolution (SISR) methods mainly enlarge the receptive field of the convolution kernels by deepening and widening the network, and they process spatial and channel-domain features of differing importance equally, so a large share of computing resources is wasted on unimportant features. To address the limited practicality of existing models, extract features better, and reduce the number of parameters, we propose a progressive U-Net residual network (PURN) for COVID-19 CT image super-resolution (SR). First, we design a dual U-Net module (DUM) that efficiently extracts features from low-resolution (LR) COVID-19 CT images. Second, the DUM applies the up-block three times and then the down-block three times in order to learn the interdependence between high-resolution (HR) and LR images more efficiently. Finally, a local skip connection is introduced in the DUM, and a global long skip connection is introduced in the reconstruction layer, to further enrich the information flow to the reconstructed HR image. Experimental results show that our algorithm effectively improves the SR reconstruction of COVID-19 CT images, restores their detailed features more sharply, and greatly improves the practicability of the algorithm.


Introduction
From January to March 2020, a novel coronavirus pneumonia epidemic occurred in Wuhan, Hubei province, and other regions, seriously endangering human health and spreading on a large scale. As a result, there were over 80,000 confirmed cases, over 4,000 deaths, and over 1,000 critically ill patients in China. For imaging examination of coronavirus disease 2019 (COVID-19), multi-slice spiral computed tomography (CT) shows unique advantages and brings good news to COVID-19 patients. However, for atypical patients, patients with small lesions, and patients with a high body weight, it is often necessary to increase the tube voltage and tube current to ensure image quality, which greatly increases the patient's radiation dose (Ai et al., 2020; Algaissi et al., 2020; Alshammari et al., 2020; Ayinde et al., 2020). With the continuous advancement and optimization of technology, imaging technology has progressed by leaps and bounds (Barry et al., 2020), and a growing body of research has applied deep convolutional neural networks to the SR reconstruction of medical CT images.
In recent years, scholars have made considerable progress in SR research on natural images and developed a series of reconstruction algorithms. According to their basic idea, current SR reconstruction techniques can be divided into interpolation-based (Keys, 1981), reconstruction-based (Ge et al., 2016), and learning-based methods (Qiu et al., 2020, 2021; Tai et al., 2017; Wang et al., 2015). Interpolation-based methods generally estimate the unknown pixels between adjacent pixels through basis functions to fill the unknown points in the HR image. Although this approach is efficient and intuitive, it cannot guarantee the accuracy of the estimated pixels, so the reconstructed image often suffers from artifacts and blur. Reconstruction-based methods build on signal processing theory, taking LR images as constraints and using prior information about the images to estimate the HR images. However, because prior information is lacking, the HR images reconstructed by this type of method often miss details. To overcome the limitations of these two families, researchers introduced the idea of learning into the SR field. SR reconstruction algorithms based on instance learning (Lu et al., 2018), sparse coding (Yang et al., 2010), and fixed neighborhood regression (Chang et al., 2004) were proposed in succession, raising the SR reconstruction quality to another level. Dong, Loy, He et al. (2016) pioneered the application of convolutional neural networks to image SR and proposed an image SR method based on a convolutional neural network (SRCNN). The SRCNN uses three convolutional layers to perform feature extraction, non-linear mapping, and image reconstruction, and obtains a reconstruction quality that is significantly better than that of traditional algorithms.
To speed up reconstruction, deconvolution was combined with the original algorithm, yielding fast image super-resolution reconstruction based on deconvolution (FSRCNN). As research deepened, scholars found that deeper networks can achieve better SR results, but increasing the number of layers often leads to gradient dispersion (vanishing gradients) or gradient explosion, which prevents the network from converging. Kim et al. (2016a) borrowed the idea of the residual network (ResNet), used a deep residual network to accelerate network learning, and proposed a super-resolution reconstruction algorithm based on a deep convolutional neural network (VDSR).
After that, Kim et al. (2016b) considered the problem of parameter scale and proposed the deep recursive convolutional network (DRCN), which uses a recursive structure to share parameters within the network and thereby reduces the difficulty of training; the authors also used skip connections and integration strategies to further improve performance. Subsequently, Shi et al. (2016) proposed an efficient sub-pixel convolutional neural network (ESPCN), which takes LR images as input and uses a sub-pixel convolutional layer at the back end of the network to implicitly map LR images to HR images, effectively reducing computational complexity and improving reconstruction efficiency. Lai et al. (2017) proposed the Laplacian pyramid network (LapSRN), which introduces the idea of the Laplacian pyramid into deep learning; the experimental results demonstrate the superiority of step-by-step upsampling. In addition, the residuals predicted at each level are supervised during training, which further improves performance. Lim et al. (2017) proposed the enhanced deep residual network for single-image super-resolution (EDSR) by removing the redundant modules from the literature and using the L1 norm as the loss function. The residual channel attention network (RCAN) was later proposed; its channel attention mechanism selects the feature channels that carry rich information. The above network structures are mostly feed-forward, ignoring the interdependence of HR and LR images and the error introduced when upsampling LR images.
Although the above learning-based algorithms achieve good SR results on natural images, problems remain when they are applied to CT images. The single-scale network structure of VDSR cannot fully extract the main features of CT images, and there are only three layers of networks. The CT images recovered by the SRCNN are not sharp enough, and the algorithm complexity is relatively high, so it cannot meet the speed requirements of medical image processing. Although the FSRCNN is an improvement, its network is still not deep enough to extract the detailed information in medical images. Moreover, one network can only be applied to one magnification factor, which limits the application of the above two algorithms in the field of medical imaging.
In response to the above problems, we start from the characteristics of CT images, combine them with the idea of residual learning, and propose the progressive U-Net residual network (PURN). By building a dual U-Net module, we extract the various features of LR CT images more fully, introduce residual learning, and adopt the Adam optimization algorithm to speed up the convergence of the model. In addition, we prepare the training set at multiple magnification factors in the early stage of training so that one model supports several scaling factors at the same time.
The main contributions of this paper are as follows: (1) We propose a progressive U-Net residual network for image SR reconstruction, and comparative experiments on benchmark datasets verify that the proposed algorithm improves significantly over the most advanced SISR methods. (2) We design a dual U-Net model, which strengthens feature supervision from the shallow layers, promotes network convergence, and effectively improves the quality of image reconstruction. (3) We propose a local and global residual network structure, which effectively alleviates the vanishing gradient problem; the local residual block transfers the rich original details of the image to the subsequent feature layers so that richer detail features can be extracted.

Residual learning
When training a very deep network, the initialization parameters are very close to zero, so gradient dispersion (vanishing gradients) easily occurs when the network back-propagates to update its parameters. As a result, deepening the network not only fails to improve performance but can even make it worse. In response to this problem, He et al. (2016) proposed the residual network (ResNet), which uses the idea of residual learning to alleviate gradient dispersion. The main idea is to add a direct connection channel to the network, allowing a certain proportion of the previous network output to be retained.
However, learning an identity mapping directly is difficult. To avoid learning the parameters of an identity mapping, the ResNet uses the network structure described in Figure 1, namely $H(x) = F(x) + x$. This can be rewritten as $F(x) = H(x) - x$, where $F(x)$ is the residual term. When the residual term is $F(x) = 0$, the identity mapping $H(x) = x$ is obtained directly. Learning $F(x) = 0$ is easier than learning the identity mapping $H(x) = x$.
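The relation $H(x) = F(x) + x$ can be illustrated with a minimal pure-Python sketch. The names `residual_block` and `residual_fn` are hypothetical, and the residual branch here is an arbitrary function on a list of numbers rather than the convolutional layers used in ResNet.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where F is the residual branch.

    `residual_fn` stands in for the learned layers (an assumption for
    illustration); it must return a list of the same length as `x`.
    """
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    # A residual branch that has learned F(x) = 0.
    return [0.0 for _ in x]


def small_branch(x):
    # A residual branch that adds a small correction to the input.
    return [0.5 * v for v in x]


features = [1.0, -2.0, 4.0]
identity_out = residual_block(features, zero_branch)
corrected_out = residual_block(features, small_branch)
```

With the zero branch the block reduces to the identity mapping, which is the case the text describes as easy to learn; any non-zero branch only has to model the correction on top of the input.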

Up-block and down-block module
The predefined upsampling network first upsamples the original LR image to the target HR space using bicubic interpolation and then feeds it into a convolutional neural network for deep feature extraction. Representative networks include the SRCNN and VDSR, as shown in Figure 2(a). Although this type of method is relatively simple to operate, feature extraction takes place in the HR space, which greatly increases the training time of the overall network. In addition, because the image is upsampled with the bicubic algorithm, noise in the LR image can lead to inaccurate feature information, which strongly affects the reconstruction results of the whole network. To solve the problems of the predefined upsampling network, the single upsampling network removes the upsampling module of the predefined structure and performs all mapping transformations in the LR space; a sub-pixel convolution layer or deconvolution layer then reconstructs the image in the HR space. Its representative network is the ESPCN, which greatly reduces the time and space complexity of the overall network, as shown in Figure 2(b). However, such networks have a problem of their own: the mapping between HR and LR images is determined only by the last layer, so a single simple convolutional layer has to capture it, which cannot fully express the feature mapping information between the HR and LR images. To solve the problems of both the predefined and the single upsampling networks, we designed a feature extraction network based on the ideas of U-Net and residual learning. As Figure 2(c) shows, the network is not fixed in either the LR space or the HR space. Instead, features are repeatedly transformed between the HR and LR spaces: they are sampled in the low-resolution space by a convolution layer and then projected to the high-resolution space by a deconvolution layer, in order to learn the nonlinear relationship between LR and HR.

U-Net
The U-Net network structure was proposed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the 2015 IEEE International Symposium on Biomedical Imaging (ISBI) competition (Ronneberger et al., 2015). It is composed of a contracting sub-network and an expanding sub-network that together form a U shape, hence the name U-Net. The U-Net network is shown in Figure 1. The U-Net first extracts feature information by downsampling through convolution and pooling, then upsamples through transposed convolution and crops the earlier low-level feature maps so that they can be fused for precise localization. This process is repeated until the output feature map is obtained, and the segmentation map is finally produced by the activation function.
Although U-Net has the advantages of few parameters and a simple network structure, it is shallow compared with networks such as ResNet and VGG. Therefore, in this work we improve the traditional U-Net model by combining it with the residual network, and propose the progressive U-Net residual network.

Methodology
The reconstruction algorithm in this paper applies a deep convolutional network so that feature extraction operates directly on the LR image, and the increase in depth improves the accuracy of reconstruction. The PURN method is shown in Figure 3. It consists of six groups of upsampling modules and six groups of downsampling modules, which form a dual U-Net network structure, together with skip connections.

Shallow feature extraction block
The shallow feature extraction block consists of a convolutional layer and an activation layer and is used to extract features from the input LR image. With $Y$ and $F_0$ denoting the input and output of the block, the feature extraction can be represented by Equation (1).
where $\ast$ denotes the convolution operation, $W$ denotes the convolution kernel, $b$ denotes the bias, and $F_0$ denotes the feature extracted from the LR input $Y$. To prevent the feature map from becoming smaller after each convolution, zero-padding is applied before every convolution so that the output of each convolution keeps the original size.
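The effect of zero-padding on the feature map size can be illustrated with a minimal pure-Python sketch. The function name `conv2d_same`, the ReLU activation, and the 3×3 kernel of ones are assumptions made for illustration; the text does not specify the kernel size or the activation used in this block.

```python
def conv2d_same(image, kernel, bias=0.0):
    """Zero-padded 2D convolution followed by ReLU (assumed activation).

    `image` is a nested list (H x W) and `kernel` an odd-sized square
    nested list. As in deep learning libraries, the kernel is applied
    without flipping (cross-correlation).
    """
    height, width = len(image), len(image[0])
    size = len(kernel)
    pad = size // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = bias
            for i in range(size):
                for j in range(size):
                    rr, cc = r + i - pad, c + j - pad
                    # Positions outside the image count as zeros (padding).
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = max(0.0, acc)
    return out


lr_patch = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
box_kernel = [[1, 1, 1],
              [1, 1, 1],
              [1, 1, 1]]
shallow_features = conv2d_same(lr_patch, box_kernel)
```

The output has the same 3×3 shape as the input, so stacking many such layers does not shrink the feature map.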

Dual U-Net module
Building on the U-Net network, we design a dual U-Net module (DUM) that deepens the network and improves the quality of image reconstruction. More skip connections can be added in the middle of the network, which better combine the feature information of the LR image and extract more detailed features, as shown in Figure 4. In addition, the ResNet converges quickly and can reduce the amount of model data; it also makes the model easier to train, which helps prevent model degradation, vanishing gradients, and a loss that fails to converge.

Up- and down-block module
In the up-block and down-block modules, we use an iterative up-and-down sampling network. Instead of extracting image features with ordinary convolutional layers, as the mainstream algorithms above do, we extract features continuously during upsampling and downsampling, as shown in Figure 5(a,b). We first use upsampling to map the image to the HR space and add the projection error from LR to HR to the HR feature map; we then use downsampling to map the image from the HR space back to the LR space, where the projection error from HR to LR is added to the LR feature map. Through this alternating use of upsampling and downsampling, the mapping between the LR and HR images is continuously adjusted and their non-linear relationship is optimized, which improves the final LR-to-HR reconstruction. Throughout the feature extraction module, our PURN algorithm alternates multiple convolutional and deconvolutional layers to transform image information between the HR and LR spaces. Every time an image is projected to the HR space through up-projection or to the LR space through down-projection, some error is introduced relative to the original image; by continuously optimizing these errors, the mapping between HR and LR images becomes more accurate. Based on this idea, our PURN algorithm focuses on learning the direct mapping between the HR and LR spaces. For convenience, the up-block operation is called upsampling and the down-block operation is called downsampling.
The main function of upsampling is to learn the reconstruction error that arises when the image is projected from the LR space to the HR space (Mei et al., 2020). First, the LR feature map is deconvolved to obtain an HR feature map. Then, a convolution operation is performed on this HR feature map to obtain a second LR feature map. Subtracting the two LR feature maps yields their difference, which represents important feature information lost during feature extraction. This difference map is deconvolved and added to the HR feature map obtained in the first step. This completes one upsampling operation, in which the projection error from the low-resolution space to the high-resolution space is learned.
A convolution operation is performed on the upsampled feature map $I^h_0$ to obtain the downsampled feature map.
We calculate the high-frequency residual between the two LR feature maps $I^l_0$ and $I^l_1$ and perform a deconvolution operation on this error.
We add $E_r$ to the HR feature map obtained from the first deconvolution to obtain the final upsampling result.
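Written out, the up-block steps described so far can take the following form. This is a sketch inferred from the symbol definitions in this section rather than a verbatim formula: the sign of the residual, the separate residual kernel $U'$, and the output name $I^h_1$ are assumptions.

```latex
\begin{align}
I^h_0 &= (I^l_0 \ast U)\uparrow_s \\
I^l_1 &= (I^h_0 \ast D)\downarrow_s \\
E_r   &= \big((I^l_0 - I^l_1) \ast U'\big)\uparrow_s \\
I^h_1 &= I^h_0 + E_r
\end{align}
```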
where $I^h_0$ represents the upsampled image obtained after deconvolution, $I^l_0$ represents the original LR image, $\ast$ represents the convolution operator, $U$ represents the upsampling convolution kernel, $\uparrow_s$ represents the upsampling operator, $I^l_1$ represents the downsampled image obtained after the convolution operation, $D$ represents the downsampling convolution kernel, $\downarrow_s$ represents the downsampling operator, and $E_r$ represents the feature map obtained after deconvolution of the high-frequency residual of the two LR images. The downsampling process mirrors the upsampling operation and is mainly used to learn the projection error that arises when the HR image is projected into the LR space. In this process, the image is first transformed between the HR and LR spaces by a convolutional layer and a deconvolution layer, the difference is computed, and the difference feature map is then merged into the LR feature map, from which the corresponding downsampling feature mapping information is obtained.
We perform a convolution operation on the image to obtain the downsampled feature map.
A deconvolution is then performed to obtain the upsampled image.
We calculate the high-frequency residual of the feature maps $I^l_1$ and $I^l_2$ and perform a convolution operation on it.
We add $E_r$ to the LR feature map $I^l_1$ obtained after the first convolution to get the result of downsampling.
where $I^l_1$ represents the LR image obtained after downsampling, $D$ represents the convolution kernel used for downsampling, $I^l_2$ represents the image obtained after upsampling, $U$ represents the convolution kernel used for upsampling, and $E_r$ represents the feature map obtained after convolution of the high-frequency residual of the two images.
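The error-feedback idea behind the up-block and down-block can be illustrated with a toy one-dimensional sketch. Everything here is an assumption for illustration: fixed nearest-neighbour upsampling and average pooling stand in for the learned deconvolution and convolution layers, the `gain` parameters mimic an imperfect learned kernel, the residual is taken as the input minus its re-projection, and all function names are hypothetical.

```python
def upsample(x, s, gain=1.0):
    """Nearest-neighbour upsampling by factor s (stand-in for deconvolution)."""
    return [gain * v for v in x for _ in range(s)]


def downsample(x, s, gain=1.0):
    """Average pooling by factor s (stand-in for strided convolution)."""
    return [gain * sum(x[i:i + s]) / s for i in range(0, len(x), s)]


def up_block(lr, s, up_gain=1.0, down_gain=1.0):
    h0 = upsample(lr, s, up_gain)             # project LR features to HR
    l1 = downsample(h0, s, down_gain)         # project back to LR
    err = [a - b for a, b in zip(lr, l1)]     # LR-space projection error
    e_r = upsample(err, s, up_gain)           # lift the error to HR
    return [a + b for a, b in zip(h0, e_r)]   # corrected HR features


def down_block(hr, s, up_gain=1.0, down_gain=1.0):
    l1 = downsample(hr, s, down_gain)         # project HR features to LR
    h2 = upsample(l1, s, up_gain)             # project back to HR
    err = [a - b for a, b in zip(hr, h2)]     # HR-space projection error
    e_r = downsample(err, s, down_gain)       # bring the error down to LR
    return [a + b for a, b in zip(l1, e_r)]   # corrected LR features


# An up kernel that loses 10% of the signal: plain upsampling gives 0.9 and 1.8,
# while the error feedback recovers values close to 1.0 and 2.0.
hr_out = up_block([1.0, 2.0], 2, up_gain=0.9)
# The same effect for a lossy down kernel.
lr_out = down_block([1.0, 1.0, 2.0, 2.0], 2, down_gain=0.9)
```

In both calls the corrected output is closer to the ideal value than the single projection alone, which is the sense in which the projection error carries useful information.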

Sub-pixel convolutional upsampling layer
We reduce the dimensionality of the output of the preceding deep feature extraction block to the same dimension as the initial LR input, adaptively fuse and retain the previously learned features, perform global feature fusion, and reconstruct an HR image.
where $Y$ denotes the initial input image, $R$ denotes the high-frequency feature information learned by the nonlinear mapping, $F_d$ denotes the operation of the local feature fusion layer, $G_{up}$ denotes the upsampling operation, and $H_{high}$ denotes the reconstructed HR image.
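The rearrangement step of a sub-pixel convolutional layer, often called pixel shuffle, can be sketched in pure Python as follows. The function name `pixel_shuffle` is hypothetical, and the channel ordering follows the convention of PyTorch's `nn.PixelShuffle`; the text does not state which layout the PURN implementation uses, so this is an assumption.

```python
def pixel_shuffle(features, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r channels supplies the r x r block of output pixels
    that replaces one input pixel.
    """
    channels_in = len(features)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    height, width = len(features[0]), len(features[0][0])
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                src = features[c * r * r + i * r + j]
                for y in range(height):
                    for x in range(width):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1x1 channels become one 2x2 image for a scaling factor of 2.
lr_channels = [[[1]], [[2]], [[3]], [[4]]]
hr_image = pixel_shuffle(lr_channels, 2)
```

Because the convolutions before this step all run at LR size, the enlargement itself costs only a memory rearrangement.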

Dataset and implementation details
In the experiment, we use the public dataset DIV2K (Gao & Grauman, 2017) as the training dataset. It includes 800 training images, 100 validation images, and 100 test images, and its images cover rich scenes with abundant edge and texture details. To train the model, we use the 800 training images. In order to avoid underfitting during training, the 800 images are augmented by rotations of 90°, 180°, and 270° and by horizontal flipping, which expands the training data to 8 times its original size, so that the model has enough training data and can handle image reconstruction at different tilt angles. For testing, we use three standard benchmark datasets: Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2010), and Urban100. We use the peak signal-to-noise ratio (PSNR) (Wang et al., 2004) and structural similarity (SSIM) as the objective criteria for evaluating prediction results.
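The eightfold expansion follows from combining four rotations with an optional horizontal flip. A minimal pure-Python sketch on nested lists is given below; the function names are hypothetical, and a real training pipeline would apply the same operations to image arrays or tensors.

```python
def rotate90(image):
    """Rotate a 2D nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D nested list left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the 8 variants: rotations of 0/90/180/270 degrees, each with and without a flip."""
    variants = []
    current = image
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


sample = [[1, 2],
          [3, 4]]
augmented = augment(sample)
```

For an image without symmetries all eight variants are distinct, so 800 training images yield 6,400 training samples.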
In addition, we implement the PURN algorithm in Python with PyTorch 1.8.1 and CUDA 11.1, and train and evaluate it through extensive experiments on an NVIDIA GeForce RTX 3080 GPU under the Ubuntu 18.04 operating system. The feature sizes of the PURN network for COVID-19 CT image SR are shown in Table 1.
where $i$ represents each pixel, $H_{high}$ represents the reconstructed HR image, and $F_{high}$ represents the original HR image. CT imaging examination is one of the main methods of COVID-19 diagnosis. Its value lies in detecting lesions, judging their nature, and assessing the severity of the disease in order to facilitate clinical classification. CT scanning also offers HR imaging and multiplane reconstruction and observation, and the early CT manifestations of COVID-19 have certain characteristics. Therefore, chest CT examination plays a critical role in early diagnosis and treatment and is suitable for imaging screening of patients with high clinical suspicion. In order to build a COVID-19 diagnosis model based on deep learning, we tested it on the public COVID-CT dataset (COVID-19 CT images: https://github.com/UCSD-AI4H/COVID-CT).
The COVID-CT (Qiu et al., 2019) dataset contains 349 CT slices from 216 COVID-19 patients and 463 non-COVID-19 CT slices. Its medical images were mainly collected from websites and publications, namely from 760 COVID-19-related preprints, most of which come from medRxiv and bioRxiv.

Comparison with the state of the art
Our PURN method integrates residual learning and combines it with U-Net to improve the quality of the predicted images. To verify the effectiveness of the proposed algorithm, we compare and analyze its prediction results against those of several mainstream algorithms on different datasets (Jiang et al., 2020). The experimental results on the four public datasets Set5, Set14, BSD100, and Urban100 are shown in Table 2. The mainstream algorithms used for comparison are Bicubic (Lu et al., 2018), A+ (Timofte et al., 2015), SCN (Timofte et al., 2015), SRCNN (Chang et al., 2004), FSRCNN (Kim et al., 2016a), VDSR (Dong, Loy, He et al., 2016), DRCN (Shi et al., 2016), LapSRN (Lai et al., 2017), and DRRN (Kim et al., 2016b). We compare the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and subjective visual quality of these state-of-the-art algorithms at scaling factors of 2, 3, and 4 on the different datasets.
Table 2 shows that our PURN algorithm achieves excellent performance on the public test datasets. In particular, at a scaling factor of 2, PURN achieves the second-best PSNR on the Urban100 dataset and the best PSNR on the Set5 and Set14 datasets. The improvement over other state-of-the-art algorithms is most pronounced on the Set14 dataset, with an increase of 0.46 dB relative to DRRN. In addition, our PURN method reaches the best value of the SSIM index.
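For reference, PSNR is a logarithmic function of the mean squared error (MSE) between the reference and reconstructed images. The sketch below uses the standard definition on nested lists of pixel values; the function name `psnr` and the 8-bit peak value of 255 are assumptions, and the evaluation in this paper may differ in details such as the color channel used or border cropping.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_errors = [
        (ref - rec) ** 2
        for ref_row, rec_row in zip(reference, reconstructed)
        for ref, rec in zip(ref_row, rec_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[0, 0], [0, 0]]
off_by_one = [[1, 1], [1, 1]]
score = psnr(reference, off_by_one)  # MSE of 1 on an 8-bit scale
```

Because of the logarithm, a PSNR gain of 0.46 dB corresponds to an MSE that is lower by a factor of about 10^0.046 ≈ 1.11, that is, roughly 10% less squared error.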
In addition, to compare the detailed information more clearly, we zoomed in on some areas of the reconstructed results; the experimental results are shown in Figure 6. The images reconstructed by Bicubic and SRCNN show hardly any contours or other detailed information. Although the reconstruction results of the FSRCNN, VDSR, and LapSRN algorithms are slightly better, some of the detailed information is still missing. The reconstruction results of the DRCN and DRRN algorithms offer a better visual experience, but compared with the algorithm in this paper they still lack sharp edge information. Therefore, in a comprehensive comparison with the above eight algorithms, our model extracts the original features of LR COVID-19 CT images more accurately and comprehensively and reconstructs their detailed texture more clearly and sharply. At the same time, the edge information of the reconstructed image has good continuity and no artifacts, and the reconstructed image is the closest to the real image. We also compare the performance and running time of our PURN method and other state-of-the-art methods when reconstructing a CT image. Figure 7 shows that the average running time of our PURN on the COVID-19 CT dataset for 4× enlargement is 0.39 s, the same as VDSR, while its PSNR is 0.19 dB higher than that of VDSR. Therefore, our PURN method obtains the highest PSNR values at a favorable running time, which can meet the needs of real-time application in actual scenarios.

Conclusion
In this paper, we propose the Progressive U-Net Residual Network (PURN) for COVID-19 CT image super-resolution. To address the loss of details and the blur artifacts in the SR reconstruction of COVID-19 CT images, we design a dual U-Net module (DUM) to extract LR image features, which alleviates the underutilization of feature information and the loss of high-frequency information during learning. At the same time, we make full use of the difference between feature images so that more useful high-frequency information is available when reconstructing the predicted image. The experimental results demonstrate the superiority of our PURN method in both the PSNR and SSIM indicators, and the reconstructed COVID-19 CT images contain more abundant detail. Furthermore, our method can help experts find lesions more accurately and improve the accuracy of clinical diagnosis, and it provides new ideas for theoretical research on medical image SR reconstruction methods. In future work, we plan to develop an algorithm that performs SR at arbitrary scales, which is closer to actual application scenarios, and to integrate medical image registration and multi-frame image SR algorithms to further improve the accuracy of SR.

Disclosure statement
No potential conflict of interest was reported by the author(s).