FA-GAN: Fused attentive generative adversarial networks for MRI image super-resolution

Highlights
• A fused attentive generative adversarial network framework is proposed for MR image super-resolution.
• A combination of channel attention and self-attention is used to calculate the weight parameters of the input features.
• Spectral normalization is introduced to make the discriminator network more stable.
• The proposed FA-GAN method is superior to state-of-the-art reconstruction methods.


Introduction
Image super-resolution refers to the reconstruction of high-resolution images from low-resolution images (Dong et al., 2016). High resolution means that the pixels in the image are denser and can display finer details (Gholipour et al., 2010). These details are very useful in practical applications such as satellite imaging and medical imaging, where targets can be better identified and important features found in high-resolution images (Zeng et al., 2019; Wang et al., 2019, 2020). High-resolution (HR) MRI images can provide fine anatomical information, which is helpful for clinical diagnosis and accurate decision-making (Wang et al., 2018; Manjon et al., 2010). However, acquiring them requires not only expensive equipment but also a long scanning time, which brings challenges to image data acquisition. Therefore, further applications are limited by slow data acquisition and imaging speed (Jafari-Khouzani, 2014; Rueda et al., 2013).
Super-resolution (SR) is a technique for generating a high-resolution (HR) image from a single low-resolution (LR) image or a group of them, which can improve the visibility of image details or restore them (Tourbier et al., 2015; Dong et al., 2016; Shi et al., 2018a). Without changing hardware or scanning components, SR methods can significantly improve the spatial resolution of MRI (Mahmoudzadeh and Kashou, 2014; Luo et al., 2017). Generally, there are three approaches to image SR in MRI: interpolation-based, reconstruction-based, and machine learning-based (Shi et al., 2018b; Jog et al., 2014).
The interpolation-based SR techniques assume that an area in the LR image can be extended to the corresponding area by using a polynomial or an interpolation function with a smoothness prior (Shi et al., 2016; Seeliger et al., 2018). The advantages of interpolation-based super-resolution reconstruction are simplicity and high real-time performance; the disadvantage is that it is too simple to make full use of the prior information of MR images. In particular, super-resolution reconstruction based on a single MR image has an obvious shortcoming: it yields a blurred version of the corresponding HR reference image (Huang et al., 2017; Armanious et al., 2019).
The reconstruction-based SR methods solve an optimization problem incorporating two terms: a fidelity term, which penalizes the difference between the degraded SR image and the observed LR image, and a regularization term, which promotes sparsity and the inherent characteristics of the recovered SR signal (Goodfellow et al., 2014; Luo et al., 2017). The performance of these techniques becomes suboptimal, especially in high-frequency regions, when the input data become too sparse or the model becomes even slightly inaccurate (Quan et al., 2018; Ledig et al., 2017a; Huang et al., 2017). These shortcomings limit reconstruction-based SR methods to small magnification factors: they may work well for magnifications below 4 but degrade at larger ones.
Machine learning techniques, particularly deep learning (DL)-based SR approaches, have recently attracted considerable attention because of their state-of-the-art performance in SR for natural images. Most recent algorithms rely on data-driven deep learning models to reconstruct the required details for accurate super-resolution (Liu et al., 2018; Latif et al., 2018). Deep learning-based methods aim to automatically learn the relationship between input and output directly from the training samples (Dong et al., 2014a; Zhang et al., 2018). At the same time, deep learning has also played a vital role in CT/PET image reconstruction, such as PET image reconstruction from the sinogram domain (Hu et al., 2020; Häggström et al., 2019).
With the development of deep learning, the Generative Adversarial Network (GAN) proposed by Goodfellow et al. has recently been demonstrated to perform well in image transformation and super-resolution imaging. Sanchez et al. applied the standard super-resolution GAN (SRGAN) framework to generate brain super-resolution images (Ramachandran et al., 2017). Most GAN-based image generation models are constructed from convolutional layers. Convolutions process information in local neighborhoods; however, using only convolutional layers is inefficient for establishing long-range dependencies in images (Miyato et al., 2018; Iandola et al., 2016).
It is difficult to learn dependencies across an image with a small convolution kernel. However, if the convolution kernel is too large, the model's performance will suffer. Increasing the kernel size does expand the receptive field, but it inevitably increases the complexity of the model (Quan et al., 2018; Ledig et al., 2017b). Zhang et al. proposed the Self-Attention Generative Adversarial Network (SAGAN) with attention-driven, long-range dependency modeling for image generation tasks (Liu et al., 2018).
In previous work on reconstruction problems, deep learning-based methods have two major issues (Latif et al., 2018). First, they treat each channel-wise feature equally, but different feature maps contribute differently to the reconstruction task. Second, the limited receptive field of a convolutional layer may lose contextual information from the original images, especially the high-frequency components that carry valuable details such as edges and texture. Therefore, a channel-attention module is designed to filter out useless features and enhance informative ones, so that the model parameters in shallower layers are updated mostly where relevant to a given task. To the best of our knowledge, this is the first work to employ channel-wise attention for the MRI reconstruction problem (Dong et al., 2014b; Ting and Xiao, 2019). Combining the ideas of MR reconstruction and image super-resolution, some researchers work on recovering HR images directly from low-resolution under-sampled k-space data (Shi et al., 2018c; Xu et al., 2018; Huang et al., 2021; Hu et al., 2021).
In this paper, a fused attentive generative adversarial network (FA-GAN) is proposed for generating super-resolution MR images from low-resolution ones. The novelty of this work can be summarized as follows: 1) the local fusion feature block, consisting of three-pass networks with different convolution kernels, is proposed to extract image features at different scales and thereby improve the reconstruction of SR images; 2) the global feature fusion module, including the channel attention module, the self-attention module, and the fusion operation, is designed to enhance the important features of the MRI image, so that the super-resolution image is more realistic and closer to the original image; 3) spectral normalization (SN) is introduced into the discriminator network, which not only smooths and accelerates the training of the deep neural network but also improves its generalization performance.

Methodology
The proposed neural network model is designed to first learn the image and then inversely map the LR image to the reference HR image (Zhu et al., 2019; Bello et al., 2019). This model takes only LR images as input to generate SR images. The operation can be defined as

I^LR = f(I^HR), (1)

where I^LR, I^HR ∈ ℝ^(m×n) are the LR and HR MRI images of size m × n, respectively, and f: ℝ^(m×n) → ℝ^(m×n) denotes the down-sampling process that creates an LR counterpart from an HR image.
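As a concrete illustration of the down-sampling operator f, the sketch below is a toy stand-in (the function name `degrade` is ours): it block-averages the HR image and resamples back to the same m × n grid with nearest-neighbour repetition, whereas the experiments use bicubic interpolation.

```python
import numpy as np

def degrade(hr: np.ndarray, scale: int = 2) -> np.ndarray:
    """Toy stand-in for f: average-pool the HR image by `scale`, then repeat
    pixels back to the original m x n grid, yielding a blurred LR counterpart
    of the same size (the paper uses bicubic interpolation instead)."""
    m, n = hr.shape
    # average over scale x scale blocks
    lr = hr[:m - m % scale, :n - n % scale].reshape(
        m // scale, scale, n // scale, scale).mean(axis=(1, 3))
    # nearest-neighbour upsample back to the HR grid size
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

hr = np.arange(16.0).reshape(4, 4)
lr = degrade(hr, scale=2)
print(lr.shape)  # (4, 4): same grid as the HR image, but blurred
```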

SR network with GAN
The network output is passed through a series of upsampling stages, each of which doubles the input image size. The output then passes through a convolution stage to obtain the resolved image. Depending on the desired scaling, the number of upsampling stages can be changed. The adversarial min-max problem is defined by

min_G max_D  E_{I^HR ~ p_train(I^HR)} [log D(I^HR)] + E_{I^LR ~ p_G(I^LR)} [log(1 − D(G(I^LR)))].

The framework of the proposed FA-GAN network is shown in Fig. 1. The whole model takes the down-sampled low-resolution magnetic resonance image as input, extracts features through the LFFB module, and generates the enlarged image through convolution and up-sampling. Finally, the GFFB module fuses the detailed features to generate a super-resolution magnetic resonance image. During training, HR references guide the optimization of the model parameters. Moreover, spectral normalization (SN) is introduced into the discriminator network to stabilize the training of the GAN.
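The stacked 2× upsampling stages described above can be sketched as follows; sub-pixel convolution (PixelShuffle) is a common realization, but the paper does not specify the exact layers, so the channel count and kernel size here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UpsampleStage(nn.Module):
    """One 2x upsampling stage: conv -> PixelShuffle -> ReLU.
    Channel count (64) and kernel size are illustrative, not the
    paper's exact configuration."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)  # folds 4C channels into 2x spatial size
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

# stacking stages doubles the size each time; two stages give 4x SR
up4 = nn.Sequential(UpsampleStage(), UpsampleStage())
x = torch.randn(1, 64, 16, 16)
out = up4(x)
print(out.shape)  # torch.Size([1, 64, 64, 64])
```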

Local fusion feature block (LFFB)
Different from previous designs (Fu et al., 2019), the local fusion feature block consists of three-pass networks using different convolution kernels, as shown in Fig. 2. In this way, the information flowing between the bypasses can be shared, which allows our network to extract image features at different scales. The operation can be defined as

F_LFFB(x) = F[C_{s1×s1}(x), C_{s2×s2}(x), C_{s3×s3}(x)],

where C_{s×s} denotes the s-scale feature extractor. The proposed s-scale feature extractor consists of three convolution layers with s × s kernels and one intermediate ReLU activation layer. The operation F[·] denotes concatenation followed by a 1 × 1 convolution, which is mainly designed to fuse features quickly and reduce the computational burden.
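A minimal PyTorch sketch of the LFFB follows. The 3/5/7 kernel sizes for the three branches are assumptions (the excerpt does not list them); the fusion F[·] is the concatenation plus 1 × 1 convolution described above.

```python
import torch
import torch.nn as nn

class LFFB(nn.Module):
    """Sketch of the local fusion feature block: three parallel branches with
    different kernel sizes (3/5/7 here, an assumption), concatenated and
    fused by a 1x1 convolution."""
    def __init__(self, channels: int = 64, kernels=(3, 5, 7)):
        super().__init__()
        def branch(s):
            # C_{s x s}: three s x s conv layers with one intermediate ReLU
            p = s // 2
            return nn.Sequential(
                nn.Conv2d(channels, channels, s, padding=p),
                nn.Conv2d(channels, channels, s, padding=p),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, s, padding=p),
            )
        self.branches = nn.ModuleList(branch(s) for s in kernels)
        # F[.]: concatenation followed by a 1x1 conv to fuse and reduce channels
        self.fuse = nn.Conv2d(channels * len(kernels), channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 64, 32, 32)
y = LFFB()(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])
```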

Global feature fusion block (GFFB)
The global feature fusion module includes three parts, namely the channel attention module, the self-attention module, and the fusion operation. Through these modules, the important features of the MRI image can be enhanced, so that the super-resolution image is more realistic and closer to the original image (Fig. 3).
(1) Channel-Attention Module. In this paper, a lightweight channel attention mechanism is introduced, which selectively emphasizes informative features and restrains less useful ones via a one-dimensional vector computed from global information. As illustrated in Fig. 4, global average pooling is first used to extract global information across the spatial dimensions H × W. This is followed by a dimension-reduction layer with reduction ratio r, a ReLU activation, a dimension-increase layer, and a sigmoid activation that generates the channel weights. The two dimension-scaling layers are implemented as fully connected layers. The final recalibrated output is obtained by rescaling the input features with these weights.
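A minimal sketch of this channel-attention recalibration, with an assumed reduction ratio r = 16:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel-attention module: global average pooling, an FC
    bottleneck with reduction ratio r, and a sigmoid gate that rescales each
    input channel (r = 16 is an assumed default)."""
    def __init__(self, channels: int = 64, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # dimension-reduction layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # dimension-increase layer
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # global average pooling -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w                     # recalibrate the input features

x = torch.randn(2, 64, 8, 8)
out = ChannelAttention()(x)
print(out.shape)  # torch.Size([2, 64, 8, 8])
```

Because the gate is a sigmoid, each output channel is a damped copy of the input channel, never an amplified one.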
(2) Self-Attention Module. The role of the self-attention module is to replace the traditional convolutional feature map with a self-attention feature map.
After the convolution operation, the convolutional feature maps pass through three branches f(x), g(x), and h(x), each a 1 × 1 convolution, and the spatial size of the feature map is unchanged. f(x) and g(x) change the number of channels, while the output of h(x) keeps the number of channels unchanged. H and W represent the height and width of the feature map, and C represents the number of channels. The output of f(x) is transposed and multiplied by the output matrix of g(x); normalizing with a softmax yields an [H*W, H*W] attention map. Multiplying the attention map by the output of h(x) gives an [H*W, C] feature map, and a 1 × 1 convolution reshapes the output to [H, W, C] to obtain the final feature map. The structure of the self-attention module is shown in Fig. 5.
The attention weights are computed as

β_{j,i} = exp(s_{ij}) / Σ_{i=1}^{N} exp(s_{ij}),  with  s_{ij} = f(x_i)^T g(x_j),  f(x) = w_f x,  g(x) = w_g x,

where β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region. Here, C is the number of channels and N is the number of feature locations of the features from the previous hidden layer. The output of the attention layer is o = (o_1, o_2, …, o_N), where

o_j = w_v ( Σ_{i=1}^{N} β_{j,i} h(x_i) ),  h(x_i) = w_h x_i.

In the above formulation, w_g, w_f, w_h, and w_v are learned weight matrices, which are implemented as 1 × 1 convolutions.
Besides, we further multiply the output of the attention layer by a scale parameter and add back the input feature map. Therefore, the final output is given by

y_i = γ o_i + x_i,

where γ is a learnable scalar initialized to 0. Introducing a learnable γ lets the network first rely on the information of the local neighborhood and then gradually learn to assign more weight to non-local information.
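The whole self-attention module, including the learnable γ, can be sketched as follows. The C/8 channel bottleneck for f and g follows the SAGAN convention and is an assumption here, not stated in the excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Sketch of the self-attention module: f, g, h, v are 1x1 convolutions,
    the [H*W, H*W] attention map is softmax(f^T g), and a learnable gamma
    (initialized to 0) blends the attention output with the input."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, 1)  # C/8 bottleneck: SAGAN convention
        self.g = nn.Conv2d(channels, channels // 8, 1)
        self.h = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, hgt, wid = x.shape
        n = hgt * wid
        f = self.f(x).view(b, -1, n)                              # (b, C/8, N)
        g = self.g(x).view(b, -1, n)                              # (b, C/8, N)
        h = self.h(x).view(b, c, n)                               # (b, C, N)
        beta = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)  # (b, N, N), sums to 1 over i
        o = self.v(torch.bmm(h, beta).view(b, c, hgt, wid))       # attention-weighted features
        return self.gamma * o + x                                 # gamma starts at 0

x = torch.randn(1, 64, 8, 8)
out = SelfAttention(64)(x)
print(out.shape)  # torch.Size([1, 64, 8, 8])
```

At initialization γ = 0, so the module is an identity mapping, matching the design intent of starting from local information.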

(3) Fusion Operation
A. Direct Connection. The direct connection function can be implemented by adding the two terms directly:

F_i = α R_i + β Y_i,

where i is the index of a feature, R represents the output of channel attention, and Y represents the output of self-attention. Both α and β are set to 0.5 as the preset value.

B. Weighted Connection. Compared with the direct connection, the weighted connection introduces competition between R and Y. Besides, it can easily be extended to a softmax form, which is more robust and less sensitive to trivial features. To avoid introducing extra parameters, we calculate the weights from R and Y themselves. The weighted connection function is represented as

F_i = (e^{R_i} R_i + e^{Y_i} Y_i) / (e^{R_i} + e^{Y_i}).

(4) Loss Function

The loss function estimates the difference between the value generated or fitted by the model and the real value, that is, the difference between the reconstructed MRI and the original MRI. The smaller the loss, the stronger the model. To improve the quality of model reconstruction, we propose using perceptual loss, pixel loss, and adversarial loss as the combined loss function of the generator. Perceptual loss mimics human visual differences, and pixel loss is the difference between pixels in the image domain.
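The two fusion variants can be sketched directly. The softmax weighting below is one plausible parameter-free form computed from R and Y, consistent with the description but not a formula quoted from the paper.

```python
import numpy as np

def direct_connection(R, Y, alpha=0.5, beta=0.5):
    """Direct connection: a fixed weighted sum of the two attention outputs."""
    return alpha * R + beta * Y

def weighted_connection(R, Y):
    """Weighted (softmax) connection: per-feature weights computed from R and
    Y themselves, so no extra parameters are introduced. This exact softmax
    form is a plausible reading of the text, not quoted from the paper."""
    wr = np.exp(R) / (np.exp(R) + np.exp(Y))  # competition between R and Y
    return wr * R + (1.0 - wr) * Y

R = np.array([0.0, 2.0])
Y = np.array([1.0, 0.0])
d = direct_connection(R, Y)    # [0.5, 1.0]
w = weighted_connection(R, Y)  # leans toward the larger response at each index
print(d, w)
```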
In the following, we describe the choices for the content loss l^SR_X and the adversarial loss l^SR_Gen. This paper uses the Euclidean distance between VGG features, which is more relevant to human perception, as the content loss:

l^SR_VGG = (1 / (W_{i,j} H_{i,j})) Σ_{x=1}^{W_{i,j}} Σ_{y=1}^{H_{i,j}} ( φ_{i,j}(I^HR)_{x,y} − φ_{i,j}(G_{θG}(I^LR))_{x,y} )²,

where φ_{i,j} denotes the feature extracted from the j-th convolutional layer before the i-th max-pooling layer, and W_{i,j}, H_{i,j} are the dimensions of the feature maps. The adversarial loss is based on the discriminator probabilities for the samples produced by the generator:

l^SR_Gen = Σ_{n=1}^{N} − log D_{θD}(G_{θG}(I^LR)),

where D_{θD}(G_{θG}(I^LR)) represents the probability that the discriminator judges the image generated by the generator to be an original magnetic resonance image.
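The combined generator objective can be sketched as below. The weighting coefficients and the stand-in feature extractor replacing the VGG network φ are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

def generator_loss(sr, hr, disc_prob, feat, lam_adv=1e-3, lam_pix=1.0):
    """Sketch of the combined generator loss: pixel loss + feature (content)
    loss + adversarial loss. `feat` stands in for the VGG extractor phi_{i,j};
    the lam_* weights are illustrative assumptions."""
    pixel_loss = nn.functional.mse_loss(sr, hr)                # image-domain MSE
    content_loss = nn.functional.mse_loss(feat(sr), feat(hr))  # feature-domain MSE
    adv_loss = -torch.log(disc_prob + 1e-8).mean()             # -log D(G(I_LR))
    return lam_pix * pixel_loss + content_loss + lam_adv * adv_loss

feat = nn.Conv2d(1, 8, 3, padding=1)  # stand-in feature extractor, not VGG
sr = torch.rand(1, 1, 16, 16)
hr = torch.rand(1, 1, 16, 16)
loss = generator_loss(sr, hr, disc_prob=torch.tensor([0.4]), feat=feat)
print(loss.item() > 0)  # True
```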

Datasets and metrics
All experiments use a Tesla V100-SXM2 GPU and four different MRI datasets to train and test the model. We randomly selected 50 samples in 3D format for training, of which 40 samples were used as the training set (3200 2D MRI slices) and 10 samples as the validation set (960 2D MRI slices). All low-resolution images in the experiments were obtained by bicubic interpolation. To ensure fairness, we conducted two independent tests to verify the performance of the proposed FA-GAN model: the first randomly selected 10 samples as the test set (960 two-dimensional MRI images) and computed the average quantitative indices, and the second selected a two-dimensional MRI with obvious features from the test sets. Optimization used the Adam algorithm with a momentum parameter of 0.9. The FA-GAN networks were trained with a learning rate of 0.0001, and training the model takes about 10 h.
The experiments use three evaluation criteria for the reconstructed image: peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Fréchet Inception Distance (FID). PSNR is defined as

PSNR = 10 log₁₀( MAX² / MSE ),  MSE = (1 / (M N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (x(i, j) − y(i, j))²,

where x represents the original image, y the super-resolution reconstruction, i, j the coordinate position of a pixel, M, N the size of the image, and MAX the peak pixel value. SSIM can be defined as

SSIM(x, y) = ((2 μ_x μ_y + C₁)(2 σ_xy + C₂)) / ((μ_x² + μ_y² + C₁)(σ_x² + σ_y² + C₂)),

where μ_x and μ_y represent the means of images x and y, σ_x² and σ_y² their variances, σ_xy their covariance, and C₁ and C₂ constants used to maintain stability. FID is expressed as

FID = ||μ_r − μ_g||² + Tr( Σ_r + Σ_g − 2 (Σ_r Σ_g)^{1/2} ),

where Tr represents the sum of the elements on the diagonal of a matrix, μ is the mean, Σ is the covariance, x_r represents a real image, and x_g a generated image.
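The PSNR definition, and a simplified single-window version of SSIM, translate directly into code (standard practice computes SSIM over local windows and averages; the global form below is a sketch):

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """PSNR from the MSE between the original x and the reconstruction y."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Single-window (global) SSIM; the usual implementation averages this
    statistic over local windows, so this is a simplified sketch."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

x = np.random.RandomState(0).uniform(0, 255, (32, 32))
print(round(psnr(x, x + 1.0), 2))   # MSE = 1 -> 10*log10(255^2) ≈ 48.13
print(round(ssim_global(x, x), 4))  # identical images -> 1.0
```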

Experimental results
In the experiments, the comparison methods were run with their optimal parameters so as to compare the best reconstruction performance of each. Figs. 6-8 show two-dimensional super-resolution MR images reconstructed with different GAN-based algorithms. Because the visual differences are small, we chose a two-dimensional MRI with prominent features and zoomed in on a specific area. From Figs. 6-9, compared with the other three GAN-based methods, the FA-GAN-based reconstruction algorithm produces a clearer texture structure in the details. Owing to the combination of the self-attention module and the channel attention module, the proposed FA-GAN method preserves richer high-frequency texture detail under a large scale factor, retaining more detailed structural information and the fine outline of the MR image. The reconstructed image has clear texture details, and most aliasing artifacts are effectively suppressed even at 4× super-resolution.
Tables 1-4 show the average values of the two quantified indicators, PSNR and SSIM, over 3200 two-dimensional MRI slices reconstructed with different algorithms. Table 1 presents the cardiac super-resolution MRI reconstruction performance of the different methods, Table 2 shows the brain results, Table 3 provides the knee results, and Table 4 shows the MM-WHS results with different GAN-based methods. The reconstruction performances of these GAN-based methods are listed in terms of PSNR and SSIM at three magnifications of 2, 4, and 8 times. As shown in Tables 1-4, the proposed FA-GAN method achieves the highest PSNR and SSIM among the four GAN-based reconstruction methods, followed by SA-SR-GAN, CA-SR-GAN, and SRGAN. From the tables, the FA-GAN method improves the average PSNR of the reconstructed images by about 0.44-4.85 and the SSIM by about 0.0003-0.0044, especially for cardiac MR images with 2× super-resolution reconstruction. The proposed FA-GAN thus clearly improves the reconstruction performance in terms of PSNR and SSIM. Table 5 reports the super-resolution MRI reconstruction performance of the different GAN-based methods in terms of FID. As shown in Table 5, FA-GAN effectively reduces the FID; a lower FID means that the reconstructed SR MR images are closer to the real high-resolution MR images, i.e., that their quality is higher.

Discussion
To demonstrate the effect of each component, we carried out seven ablation experiments on the local feature fusion block (LFFB), channel attention (CA), and self-attention (SA). Removing the local feature fusion block reduces our model to a network similar to SRGAN but with the attention block. The results confirm that making full use of the local feature fusion block significantly improves performance. One possible reason is that fusing hierarchical features improves the information flow and eases the difficulty of training. We can conclude from Table 6 that the proposed FA-GAN model with all components achieves the best performance. The integration of the local and global feature fusion blocks not only gains 1-2 dB in PSNR but also yields much better visual quality in image details than the variants with only some of the components, as shown in Figs. 6-9.
According to the ablation results in Table 6, the CA and LFFB modules together play the most important role in super-resolution MR image reconstruction and affect the reconstruction performance markedly. The effect of the SA module is relatively small: removing it degrades the reconstruction quality only slightly. Table 7 shows the reconstruction performance under different connection modes; the weighted connection clearly achieves better results, so we use a weighted connection in our method.
For the selection of the parameters α and β, we performed three sets of comparative experiments. As shown in Table 8, the experimental results indicate that α = 0.5 and β = 0.5 are the optimal values.
In this paper, spectral normalization (SN) is introduced into the discriminator network to stabilize the training of the GAN by limiting the Lipschitz constant of the discriminator. Compared with other normalization techniques, spectral normalization requires no additional hyperparameter tuning (the spectral norm of all weight layers is set to 1). Fig. 10 shows the effect of SN on FA-GAN: the loss value decreases steadily and the whole training process becomes more stable. Fig. 11 illustrates the training loss of SR image reconstruction with the four different GAN-based methods at 4× magnification. The loss of the FA-GAN method decreases monotonically as the iterations increase, while the losses of the other methods decrease in waves, which indicates that the proposed FA-GAN combined with spectral normalization makes training more stable.
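In PyTorch, applying spectral normalization to a discriminator's weight layers is a one-line wrapper and, as noted above, introduces no extra hyperparameters. The small discriminator below is only a shape-level illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Wrap each weight layer with spectral normalization, which constrains the
# spectral norm of every wrapped weight matrix to 1.
disc = nn.Sequential(
    nn.utils.spectral_norm(nn.Conv2d(1, 32, 3, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    nn.utils.spectral_norm(nn.Conv2d(32, 64, 3, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.utils.spectral_norm(nn.Linear(64 * 4 * 4, 1)),
    nn.Sigmoid(),  # probability that the input is a real HR image
)

x = torch.randn(2, 1, 16, 16)
out = disc(x)
print(out.shape)  # torch.Size([2, 1])
```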

Conclusion
This paper proposed a new method for super-resolution magnetic resonance image reconstruction using fused attention based generative adversarial networks (FA-GAN). Two different attention mechanisms are integrated into the SRGAN framework to capture important features. Compared with the SRGAN framework, the proposed FA-GAN method reconstructs super-resolution images with higher PSNR and SSIM and lower FID, and the reconstructed SR images preserve image details much closer to the real high-resolution image. In future work, the proposed FA-GAN method could be used to reconstruct super-resolution MR images with 7 T-like resolution from 3 T MR equipment, improving the resolution of MR images without changing the hardware.

Author contributions
MJ, MZ, LW, XY, JZ, YL, PW, JH and GY conceived and designed the study, contributed to data analysis, contributed to data interpretation, and contributed to the writing of the report. MJ, JH and GY contributed to the literature search. MJ, MZ, LW, XY, JZ, YL, and PW contributed to data collection. MJ, MZ, LW, XY, JZ, YL, and PW performed data curation and contributed to the tables and figures. All authors contributed to the article and approved the submitted version.

Declaration of Competing Interest
The authors report no declarations of interest.