Image super-resolution reconstruction based on multi-scale dual-attention

Image super-resolution reconstruction improves resolution by learning the inherent features and attributes of images. However, existing super-resolution models suffer from missing details and distorted natural textures, producing blurred, overly smooth results after reconstruction. To address these problems, this paper proposes a Multi-scale Dual-Attention based Residual Dense Generative Adversarial Network (MARDGAN), which uses multi-branch paths to extract image features and obtain multi-scale feature information. This paper also designs a channel and spatial attention block (CSAB), which is combined with an enhanced residual dense block (ERDB) to extract multi-level deep feature information and enhance feature reuse. In addition, the multi-scale feature information extracted along the three branch paths is fused with global features, and sub-pixel convolution is used to restore the high-resolution image. Experimental results show that MARDGAN scores higher on objective evaluation indexes than other methods across multiple benchmark datasets and produces better subjective visual quality. The model can effectively use the original image information to restore super-resolution images with clearer details and stronger authenticity.


Introduction
In the field of image processing, super-resolution (SR) reconstruction is a technology that uses image processing algorithms (Haris et al., 2020) to convert low-resolution images into high-resolution images. In recent years, single image super-resolution (SISR) has been widely used in practical applications such as improving the clarity of pictures in multimedia (Alshammri et al., 2022), improving the accuracy of medical images in diagnosis (Li et al., 2021, 2023), and improving the resolution of satellite remote sensing images (Huang & Jing, 2020). Video super-resolution (VSR) requires aligning and fusing complementary inter-frame information into keyframes to improve keyframe reconstruction, and many network models (Chan & Yu, 2021; Choi et al., 2021; Jo & Oh, 2018) are available to address problems such as blurring in VSR. However, VSR is more complex than SISR, and existing methods remain difficult to implement and extend. Therefore, the focus of this paper is SISR.
The majority of currently available image super-resolution reconstruction algorithms are based on interpolation (Zhang & Wu, 2006), reconstruction (Zhang et al., 2012) or deep learning (Meng et al., 2020; Wang et al., 2022; Yang et al., 2021; Zhou et al., 2020). Interpolation-based methods are simple, but their reconstruction quality is limited. Reconstruction-based methods are slow and require a great deal of prior knowledge. With the rapid development of artificial intelligence, deep-learning-based image super-resolution reconstruction has become a major research field in recent years. The mapping relationship between low-resolution (LR) and high-resolution (HR) images is used to train a super-resolution reconstruction model, which then recovers texture details and other features of the image. Deep learning combined with big data (Gai et al., 2016; Qiu et al., 2021) and cloud computing (Li et al., 2016; Qiu et al., 2020) can further optimise related image processing work.
The super-resolution convolutional neural network (SRCNN) (Khan et al., 2020) was the first to apply deep learning to super-resolution reconstruction and achieves better results than interpolation- and reconstruction-based methods. However, SRCNN is not only slow to train but also requires image pre-processing. Shi et al. (2016) proposed the pixel-rearrangement-based ESPCN, which uses efficient sub-pixel convolution layers to achieve different magnification factors and obtains better reconstruction while reducing network parameters. Kim et al. (Wang et al., 2021) proposed the residual-learning-based VDSR network, adopting residual connections between layers so that information flows better, thus avoiding the vanishing gradient problem. However, during super-resolution reconstruction, the hierarchical features of the low-resolution image are often not fully utilised, which leads to blurred reconstructed images.
To solve the above problems, Zhang et al. (2018) proposed the residual dense network (RDN). This network establishes a continuous memory mechanism by learning the low-frequency information in LR images. By retaining the feature information of previous network layers, the feature information of the LR image is fully learned. However, the rich features also contain a significant amount of irrelevant data, which can impact reconstruction quality. Based on RDN, Zhang et al. (2018) introduced the channel attention mechanism into super-resolution and built the RCAN network model. Adaptive learning of the features between individual channels enables the network to disregard irrelevant data and concentrate more on informative features. Although these models achieve good results on objective image quality criteria, they may produce artefacts, over-smoothing and unrecoverable texture details. Moreover, they tend to use deeper and more complex network structures, which makes training more difficult; deep networks also bring gradient instability, network degradation and other problems. More efficient SR networks can be built by reducing network depth and computation and by designing more efficient modules.
In this paper, we propose a multi-scale dual-attention-based residual dense generative adversarial network (MARDGAN). CSAB and ERDB are combined to form the deep residual dense attention module (DRDAM), which serves as the generative network's fundamental building block. First, DRDAMs of different scales are constructed along different paths for multi-scale feature extraction and propagation. Second, a bottleneck layer fuses the feature information extracted by the DRDAMs, and global residual learning (GRL) combines the shallow features with the fused features to produce a more effective feature representation. Finally, sub-pixel convolution is used to reconstruct the high-resolution image. The discriminative network judges whether generated images are real or fake; the generative and discriminative networks then play an adversarial game that pushes the generator to produce more realistic and clearer images. At the same time, spectral normalisation (SN) is added to prevent gradient explosion.
To summarise, our primary contributions are as follows: (1) The channel and spatial attention block (CSAB) calculates channel and spatial weights locally and establishes the interdependence of channel and spatial features to recover the detailed features of the image.
(2) The enhanced residual dense block (ERDB) fully extracts all layered features of the original LR image and enhances feature propagation to restore high-quality images. Spectral normalisation (SN) is added to improve the stability of the network.
(3) We propose MARDGAN for accurate SISR reconstruction, extracting and sharing multiple levels of feature information in a multi-scale parallel manner. It ensures maximum information flow between modules to restore image texture features, and restores high-resolution images better than several existing network models.
The structure of this paper is as follows. Section 1 introduces the research background and significance of image super-resolution reconstruction. Section 2 reviews related work on super-resolution reconstruction. Section 3 describes the network structure and loss functions in detail. Section 4 presents the experimental process and comparisons. Section 5 concludes the paper.

Related work
Convolutional neural networks can extract richer and more abstract image semantic features by increasing the depth and width of the network, but this also brings problems such as high computational cost and difficult model training. Therefore, how to efficiently extract target features (Hu et al., 2019) and use effective features for super-resolution reconstruction (Wang et al., 2020) has become a hot research topic, and an important concept in feature extraction is the receptive field. Low-level layers have small receptive fields and represent local information well, but lose global information. As the network deepens, the receptive field of higher layers grows larger: the semantic representation of the image strengthens, but the resolution of the feature map decreases. In addition, images contain objects of different sizes (Chen et al., 2021) with different semantic features (Zhang et al., 2022). Therefore, efficiently extracting and fusing semantic information at all levels is very important. Qin et al. (2020) proposed a multi-scale feature fusion residual network (MSFFRN), which fully utilises image features at different levels for SISR. MSFFRN has multiple interleaved paths, with convolution kernels of different sizes designed to adaptively extract and fuse image features at different scales. This helps to fully mine the local features of the image, but increases computational complexity. Qiu et al. (2021) proposed a multiple improved residual network (MIRN), which combines the output features of two adjacent residual blocks with the overall input features and makes full use of the adjacent residual information of the convolution-layer features within each residual block. The lack of correlation between features is addressed by connecting residual blocks through multi-level skip connections, so feature information can be shared and reused. However, the methods above still produce smoothed and blurred texture details in the reconstructed SR images. Compared with other networks, clearer (Daihong et al., 2022) and more realistic samples can be obtained with generative adversarial networks (GAN). Ledig et al. (2017) proposed SRGAN, the first model to combine generative adversarial networks with super-resolution reconstruction. A GAN-based super-resolution model can learn how images degrade and thus make the generated images look more real. However, GANs cannot make full use of images' shallow features and occasionally suffer training instability, which can significantly reduce the expressiveness and performance of the generative network.

Method
Super-resolution image reconstruction using deep neural networks can achieve good results, but problems such as excessive image smoothing, vanishing gradients and exploding gradients remain. Therefore, MARDGAN is proposed to solve real-world super-resolution, with the main purpose of improving the overall perceptual quality of SR images. Following the idea of Inception networks (Szegedy et al., 2016), convolutions at different scales are applied to the input image. Compared with previous single-kernel super-resolution approaches, the proposed algorithm extracts more details from low-resolution images, which helps to restore high-quality images.

Network structure
At present, most super-resolution reconstruction networks still use single-scale convolution kernels to extract the underlying feature information of images, which ignores many details of low-resolution images, whereas multi-scale feature extraction can effectively extract feature information at different levels (Huang et al., 2022; Meng et al., 2022). Figure 1 shows the proposed generative network. For shallow feature extraction, the low-resolution image is convolved with 3 × 3, 5 × 5 and 7 × 7 kernels on three separate paths to create distinct feature maps. Feature reuse is then enhanced by using 2 CSABs and 16 ERDBs in each DRDAM to extract the image's deeper texture and structure information. In particular, CSAB enhances high-frequency feature information, while ERDB exchanges feature information between different depths of the network to strengthen feature reuse.
Common feature fusion operations include feature mean fusion, feature weighted fusion and feature concatenation. In the global feature fusion step of this paper, a concatenation operation first fuses the features from the different paths into one output feature. A bottleneck layer is then used to reduce the dimension of the combined features, lowering the number of network parameters. Before image reconstruction, GRL integrates the shallow features of the input image with the global features via a long skip connection, which is particularly helpful for network training. To increase training efficiency and preserve more detail, only the feature map produced by the 3 × 3 convolution is added to the feature map after global feature fusion.
Image reconstruction first requires enlarging the input image to the desired size, that is, up-sampling, and then reconstructing the image. Previously, deconvolution was the most popular up-sampling method, as used in DenseNet-based models (Zhang et al., 2019), but deconvolution often introduces artificial patterns. In this work, we use sub-pixel convolution (Zhihong et al., 2019) for up-sampling: the original feature map is first convolved to expand the number of channels, and the convolved feature map is then rearranged into a specific layout to obtain a larger map, realising the magnification process (He et al., 2019).
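The channel-to-space rearrangement at the heart of sub-pixel convolution can be illustrated with a minimal NumPy sketch; the preceding channel-expanding convolution is omitted, and only the reshuffle step is shown:

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).

    A preceding convolution expands the channel count by r*r; this
    reshuffle turns those extra channels into spatial resolution.
    """
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r^2"
    c = c_r2 // (r * r)
    # (C, r, r, H, W) -> (C, H, r, W, r) -> (C, H*r, W*r)
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)
```

For example, with r = 2 a (4, 2, 2) map becomes a (1, 4, 4) image, with each group of four channels interleaved into a 2 × 2 pixel neighbourhood.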
In the discriminator network, SN is used to stabilise training. Since the concept of classification is introduced, the discriminative network judges whether a received image is a real image or one produced by the generator. Figure 2 shows the discriminator's structure. Eight convolution layers with 3 × 3 kernels extract features, with the number of channels doubling as the depth of the convolution layers increases. To avoid dead neurons, the LeakyReLU activation function is used, and SN is added in layers 2-8 to stabilise fluctuations in the network parameters and avoid gradient explosion. Finally, a Sigmoid function produces a one-dimensional tensor reflecting how real the image appears.

Channel and spatial attention block (CSAB)
The CSAB designed in this paper differs from the traditional CBAM (Woo et al., 2018). As shown in Figure 3, it uses dual branch paths that enter the channel and spatial attention modules at the same time to obtain attention weight coefficients and extract feature information at different levels. Resources and features are allotted to each convolutional channel simultaneously across tasks. Overall, the implementation is straightforward but efficient. The feature extraction map is then created by merging the two modules' feature maps.

SPP is commonly used in image restoration and image segmentation and is essentially a set of average pooling layers. In super-resolution reconstruction, deep networks require a very large receptive field, which makes training slow. In contrast, by adjusting the size and stride of the sliding window, SPP requires only one feature computation and can convert a feature map of any size into a fixed-size feature vector. As shown in Figure 5, a convolution operation is performed on an image of arbitrary size to obtain the corresponding feature map. SPP then pools this feature map at three distinct scales: 4 × 4, 2 × 2 and 1 × 1. The label 16 × 64-d in the figure indicates that the feature map is divided into 16 blocks of 64 channels each. The partitioned feature maps are then fused by taking the average value of each block, producing a 21 × 64 matrix, which is flattened into a one-dimensional vector when fed into the fully connected layer.
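The fixed-length output of SPP described above (16 + 4 + 1 = 21 bins per channel, regardless of input size) can be sketched in NumPy; this is an illustrative implementation, not the paper's code:

```python
import numpy as np

def adaptive_avg_pool(fmap: np.ndarray, out_size: int) -> np.ndarray:
    """Average-pool an (H, W, C) map into an (out_size, out_size, C) grid."""
    h, w, c = fmap.shape
    pooled = np.zeros((out_size, out_size, c))
    for i in range(out_size):
        for j in range(out_size):
            # Block boundaries cover the map evenly, like adaptive pooling.
            h0, h1 = (i * h) // out_size, ((i + 1) * h) // out_size
            w0, w1 = (j * w) // out_size, ((j + 1) * w) // out_size
            pooled[i, j] = fmap[h0:h1, w0:w1].mean(axis=(0, 1))
    return pooled

def spp(fmap: np.ndarray, levels=(4, 2, 1)) -> np.ndarray:
    """Spatial pyramid pooling: pool at each level and stack the bins,
    yielding a fixed (16 + 4 + 1 = 21) x C matrix for any input H, W."""
    parts = [adaptive_avg_pool(fmap, s).reshape(-1, fmap.shape[-1]) for s in levels]
    return np.concatenate(parts, axis=0)
```

A 13 × 9 × 64 feature map and a 40 × 50 × 64 feature map both produce a 21 × 64 output, which is what makes the subsequent fully connected layer size-independent.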

Enhanced residual dense block (ERDB)
In this work, we designed the ERDB, drawing on the idea of densely connected convolutional networks (DCCN) (Huang et al., 2017). In ERDB, a continuous memory mechanism is created by allowing information to be exchanged between network layers in a deep structure through residual and dense connections. The residual dense block in this paper has four layers with a total of ten connections. By concatenating all feature maps of the same size, each layer receives input from all previous layers and passes its feature maps to all subsequent layers, which reduces the vanishing gradient problem and boosts feature propagation and reuse. This continuous storage function allows the feature states of each layer to be transmitted throughout the network. As shown in Figure 6, the ERDB consists of convolutional layers, spectral normalisation layers and activation layers, iterated several times with the PReLU activation function and 3 × 3 convolution kernels. During training, data are often normalised to speed up model convergence, sometimes even within hidden layers; in this paper, PReLU activations and spectral normalisation accelerate convergence. The stability of the GAN is enhanced by multiplying the main path by a constant between 0 and 1 before adding the residual connection.
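The dense connectivity and scaled residual connection described above can be illustrated with a toy sketch. Matrix products stand in for 3 × 3 convolutions, ReLU for PReLU, and the scaling constant `beta` is an assumed placeholder, not the paper's value:

```python
import numpy as np

def erdb_sketch(x, layer_ws, fuse_w, beta=0.2):
    """Toy sketch of ERDB's dense connectivity (not the full conv module).

    Each of the four 'layers' receives the concatenation of the block
    input and every earlier layer's output, matching the 4-layer /
    10-connection pattern in the text (1 + 2 + 3 + 4 = 10).
    """
    feats = [x]
    connections = 0
    for w in layer_ws:
        inp = np.concatenate(feats, axis=-1)  # dense concatenation of all predecessors
        connections += len(feats)             # one connection per predecessor
        feats.append(np.maximum(inp @ w, 0.0))
    fused = np.concatenate(feats, axis=-1) @ fuse_w  # local feature fusion (1x1-conv stand-in)
    return x + beta * fused, connections             # scaled residual connection
```

With four layers the connection count is exactly ten, and the output keeps the input's shape so the block can be stacked, just as ERDBs are chained in the generator.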
Numerous experiments (Huang et al., 2017) have demonstrated that removing the BN layer can improve training performance and reduce computational complexity in a variety of super-resolution reconstruction tasks. During GAN training, the discriminator may reach an ideal state early and always distinguish real from fake images; as a result, it cannot provide effective gradient information to the generator, which leads to gradient explosion and non-convergence. To solve these problems, SN is applied to the parameter matrices of the discriminator so that they satisfy a Lipschitz constraint. The Lipschitz condition limits how drastically the function's gradient can change. Consequently, during neural network optimisation the function is smoother, parameter updates are more stable, and gradient explosion is less likely.
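Spectral normalisation bounds a layer's Lipschitz constant by dividing its weight matrix by its largest singular value, which can be estimated cheaply by power iteration; a minimal sketch:

```python
import numpy as np

def spectral_norm(w: np.ndarray, n_iter: int = 100) -> float:
    """Estimate the largest singular value of weight matrix `w` by power
    iteration, as used in spectral normalisation. Dividing w by this
    value bounds the layer's Lipschitz constant by 1."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    return float(u @ w @ v)  # u^T W v converges to the top singular value
```

In practice one or two iterations per training step suffice, since the weights change slowly; normalising with `w / spectral_norm(w)` yields a matrix whose largest singular value is 1.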

Loss function
The loss function is an optimisation target for the reconstruction accuracy of each pixel as well as for the overall composition. The generative network's total loss function is therefore defined as

$$L_G = L_{percep} + \lambda L_{gen} + \eta L_{pixel}, \qquad (1)$$

where $\lambda$ and $\eta$ are weighting coefficients. The first term in Equation (1) is the perceptual loss, which constrains the generated image against the original. The VGG-19 network is used to measure the difference between the HR and SR feature maps, preventing the generated image from deviating significantly from the real image. The perceptual loss is defined as

$$L_{percep} = \frac{1}{W_{i,j} H_{i,j} C_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \sum_{c=1}^{C_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y,c} - \phi_{i,j}(G(I^{LR}))_{x,y,c} \right)^2,$$

where $W_{i,j}$, $H_{i,j}$ and $C_{i,j}$ are the width, height and number of channels of the VGG feature map, $\phi_{i,j}$ is the VGG feature map obtained before the $i$-th pooling layer and after the $j$-th convolution, and $G(I^{LR})$ is the HR image reconstructed by the generator from the low-resolution image.
The second term in Equation (1) is the generation loss: the discriminator should consider the generated image real, so the distribution of generated images is pushed towards that of real images. It is defined as

$$L_{gen} = -\mathbb{E}_{z \sim p_z(z)}\left[\log D(G(z))\right],$$

where $\mathbb{E}_{z \sim p_z(z)}$ is the expectation over the random noise distribution and $G(z)$ is the generated sample. The third term in Equation (1) is the pixel loss, the sum of the absolute values of the pixel differences between the predicted and real images, which also prevents distortion of the reconstructed image due to excessive smoothing. It is defined as

$$L_{pixel} = \mathbb{E}\left[\left\| I^{HR} - G(x) \right\|_1\right],$$

where $\mathbb{E}$ is the mathematical expectation and $G(x)$ is the high-resolution image that the generator produces from the low-resolution input $x$.
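A minimal sketch of how the three loss terms might be combined; the weights `lam` and `eta` are illustrative placeholders (not the paper's values), and a plain feature-space MSE stands in for the VGG-19 feature extractor:

```python
import numpy as np

def pixel_loss(hr, sr):
    """L1 pixel loss: mean absolute difference between real and generated images."""
    return np.mean(np.abs(hr - sr))

def generation_loss(d_out):
    """Generator loss -log D(G(z)); d_out holds the discriminator's
    sigmoid outputs on generated images."""
    return -np.mean(np.log(d_out + 1e-12))

def perceptual_loss(feat_hr, feat_sr):
    """MSE between feature maps of the HR and SR images, normalised by
    the feature map size (here a stand-in for VGG-19 features)."""
    return np.mean((feat_hr - feat_sr) ** 2)

def total_loss(hr, sr, feat_hr, feat_sr, d_out, lam=1e-3, eta=1e-2):
    """Weighted sum of the three terms in the total generator loss."""
    return (perceptual_loss(feat_hr, feat_sr)
            + lam * generation_loss(d_out)
            + eta * pixel_loss(hr, sr))
```

When the generated image matches the reference exactly and the discriminator outputs 1, every term vanishes, which is the fixed point adversarial training pushes towards.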

Experimental setup
The public dataset COCO2014 (Song et al., 2021), a large and rich object recognition, segmentation and captioning dataset collected primarily from complex daily scenes, serves as the training set. The test sets are Set5 (Wang et al., 2018), Set14 (Xu et al., 2020) and BSD100 (Liu & Chen, 2021). These three datasets are currently popular single-image super-resolution benchmarks, ranging from natural images to specific objects and containing 5, 14 and 100 images of different scenes, respectively. Peak signal-to-noise ratio (PSNR) (Wang et al., 2018), structural similarity (SSIM) (Liu et al., 2021) and learned perceptual image patch similarity (LPIPS) (Zhang et al., 2018) are used as evaluation indexes on the Y channel.
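PSNR on the Y (luma) channel can be computed as follows; this sketch assumes BT.601 luma weights and 8-bit images with a peak value of 255:

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Luma (Y) channel from an RGB image in [0, 255], ITU-R BT.601 weights."""
    return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 1 grey level gives `psnr = 20 * log10(255) ≈ 48.13 dB`, a useful sanity check when reproducing reported scores.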

Experimental process
Before training, the images in the dataset are randomly cropped to 96 × 96, and bicubic interpolation is used for down-sampling with a scaling factor of r = 4 to generate the LR images required for training. The LR images and the original HR images are then fed into the model, with the training data augmented by random horizontal flips. In the generative network, 2 CSABs and 16 ERDBs are used, as described in the network structure.
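The pre-processing pipeline above (random 96 × 96 crop, ×4 down-sampling, random horizontal flip) can be sketched as follows; simple box down-sampling stands in here for the bicubic interpolation used in the paper:

```python
import numpy as np

def make_training_pair(img: np.ndarray, patch: int = 96, scale: int = 4, rng=None):
    """Cut a random `patch` x `patch` HR crop from an (H, W, C) image,
    downsample it by `scale` to get the LR input, and apply a random
    horizontal flip to both (the flip is shared since LR derives from HR)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = img.shape
    y = rng.integers(0, h - patch + 1)
    x = rng.integers(0, w - patch + 1)
    hr = img[y:y + patch, x:x + patch].copy()
    if rng.random() < 0.5:                 # random horizontal flip
        hr = hr[:, ::-1]
    s = patch // scale
    lr = hr.reshape(s, scale, s, scale, -1).mean(axis=(1, 3))  # box downsample
    return lr, hr
```

With the defaults, each call yields a 24 × 24 LR patch paired with its 96 × 96 HR target.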

Loss analysis
To ensure that the direction of the subsequent training stages is correct, the loss function measures the difference between the target values and the network's outputs at each iteration during backpropagation. Figure 7 shows how MARDGAN's loss functions change on the COCO2014 dataset. The figure shows that the training process of MARDGAN is very unstable: the generation-loss and discrimination-loss curves oscillate strongly, which shows that the struggle between the discriminator and the generator is fierce and makes the images produced by the generator more realistic. In contrast, the perceptual loss gradually converges, indicating that the texture details of the image are well recovered. Generally speaking, generative adversarial networks are harder to train than traditional networks but can achieve better results.

Quantitative evaluations
We compare the proposed model with state-of-the-art SISR methods, namely Bicubic, RCAN (Zhang et al., 2018), RFANet (Liu et al., 2020), DRSR (Yang et al., 2021), ESRGAN (Wang et al., 2018), WDSRGAN (Sun et al., 2020) and FBSRGAN (Wang et al., 2022), to confirm the proposed network's reconstruction capability. From Table 1, the average PSNR and SSIM of MARDGAN over the three test sets are 2.139 dB and 0.049 higher than ESRGAN, 0.396 dB and 0.013 higher than WDSRGAN, and 0.319 dB and 0.012 higher than FBSRGAN. The average PSNR and SSIM of the proposed MARDN model are 1.375 dB and 0.048 higher than RCAN, 0.826 dB and 0.032 higher than RFANet, and 0.69 dB and 0.026 higher than DRSR. This is because the proposed network obtains deep feature information through three paths with different receptive fields. CSAB focuses on the more important features in the image and ignores useless information, while spectral normalisation prevents artefacts in the generated images. Traditional channel and spatial attention mechanisms struggle to learn the high-frequency information of HR images and thus cannot efficiently restore high-quality SR images. Traditional residual and dense networks may also accumulate redundant information when concatenating all generated feature maps, making the model overly complex. This paper combines residual and dense networks more closely and proposes the new ERDB structure, which fully extracts local image features while providing a continuous storage function for the network. The features of each layer are propagated backward, and local feature fusion adaptively learns the more effective features among previous and current features, so the network can be deepened while preserving earlier gradients. Therefore, the objective evaluation indexes obtained by the model are higher than those of the state-of-the-art SISR methods.

The PSNR and SSIM of the MARDGAN model are lower than those of the pre-trained MARDN model, because adversarial training does not directly optimise PSNR and SSIM. However, until a Nash equilibrium is reached, the generative and discriminative networks continue to compete until the generated image closely resembles the original. Excessively increasing PSNR and SSIM values can harm the visual smoothness and perceived quality of the generated image, whereas a GAN can produce high-quality images more in line with human perception. Table 2 shows that the LPIPS of MARDGAN is the lowest on the three test sets, indicating that the generated images are visually closest to the real images.
Figure 8 shows the performance and complexity of each model. As shown in Figure 8(a), FBSRGAN has more parameters than the other models despite being a considerable advance over earlier techniques. DRSR achieves good performance with few parameters, but is still difficult to train because of its complex architecture. In contrast, our MARDN and MARDGAN extract image features better by combining CSAB and ERDB: with fewer parameters, their performance exceeds several existing methods and the reconstructed image quality is better. For a more intuitive comparison, Figure 8(b) shows the trade-off between runtime and performance. It is clear that with little runtime, our approach produces better PSNR values.

Qualitative evaluations
To compare the effects more easily, we take one image each from the Set5, Set14 and BSD100 datasets and reconstruct them at super-resolution with the different SR algorithm models.
Figure 9 shows the reconstruction results of each SR algorithm on the image named "Head" from the Set5 dataset, magnified four times; the boy's eyes and nose are shown in a close-up. As depicted in the figure, the image reconstructed by bicubic interpolation is very hazy. The images reconstructed by RCAN and RFANet are very smooth with blurred texture, and DRSR is only slightly better. The reconstructions of ESRGAN and WDSRGAN still have defects around the nose, and FBSRGAN does not recover the eyelashes satisfactorily. MARDN achieves good objective scores but remains imperfect in texture detail recovery. In contrast, MARDGAN reconstructs the image texture more accurately: it reconstructs the texture details of the spots on the bridge of the boy's nose well and performs well both overall and in detail. In Figure 10, MARDGAN reconstructs the texture details of the baboon's whiskers without artefacts. In Figure 11, the glass part of the window shows clearer and finer details, and the reconstruction quality is better than that of the other models.

Ablation experiment
We modify the base model one component at a time and compare the results to see how each part of the proposed MARDGAN affects the output. Figure 12 depicts the overall visual comparison. The original image is in the first column, and each subsequent column represents a different model variant; a detailed discussion follows.
The column c images in Figure 12 are less over-smoothed than those of column b, with no artefacts. This is because ERDB propagates features more effectively than the residual block (RB) in SRResNet (Blau et al., 2018) and utilises all of the original image's hierarchical features; replacing the BN layer with the SN layer eliminates artefacts, improves the restored image quality, and improves model stability. Column d shows higher-quality reconstruction and better detail recovery than column c, because CSAB has a strong capability to capture semantic information and makes better use of the high-frequency features of low-resolution images while suppressing unnecessary features. The images in column e are clearer and richer in texture than those in column d, thanks to the generative adversarial training of the GAN, which gives the images a more lifelike appearance, helps recover more intricate texture features, and produces higher-quality images.
The rows in Table 3 correspond to the columns in Figure 12; the contribution of each module is analysed via objective evaluation metrics. Table 3 shows that compared with column b, the average PSNR and SSIM of column c improve by 1.267 dB and 0.034, respectively, and LPIPS decreases by 0.017. Compared with column c, the average PSNR and SSIM of column d improve by 1.060 dB and 0.044, and LPIPS decreases by 0.015. In contrast, the average PSNR and SSIM of column e are 0.441 dB and 0.024 lower than those of column d, while LPIPS is 0.006 lower. In summary, although SN, ERDB and CSAB improve PSNR and SSIM, the generated images still have defects; adding GAN training not only yields generated images with higher quality, better texture and better agreement with human visual perception but also achieves a better LPIPS index.

Conclusion
In this paper, a Multi-scale Dual-Attention based Residual Dense Generative Adversarial Network (MARDGAN) is proposed to address unrecoverable image details and insufficient use of feature information in low-resolution images. In particular, the enhanced residual dense block (ERDB) continuously transmits the state of each layer so that the main network can concentrate on learning high-frequency information. In addition, the proposed use of spectral normalisation significantly enhances the stability of model training. The channel and spatial attention block (CSAB), which adaptively assigns weights to feature maps by considering the interdependence between channel and spatial features, was also developed to enhance the utilisation of inherent image features.
According to the experimental findings, this approach extracts more in-depth feature information from low-resolution images to produce high-resolution images with improved texture clarity and quality. In terms of PSNR, SSIM and LPIPS, this method outperforms the previous methods and has superior visual quality. With task-specific training, our method has great application potential in areas requiring highly reliable SR images, such as enhancing mural clarity, satellite remote sensing imaging and medical imaging. However, the number of parameters in this method is still large, which greatly increases training time and may even cause over-fitting. In future research, we will continue to optimise the current network architecture, preferring a lighter network to reduce training time and improve the quality of super-resolution reconstructed images.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This project was supported in part by the Natural Science Basis Research Plan in Shaanxi Province of China under Grant [2023-JC-YB-517] and the High-level Talent Introduction Project of Shaanxi Technical College of Finance and Economics [2022KY01].

Figure 3 .
Figure 3. Structure of the channel and spatial attention block.

Figure 4 .
Figure 4. Channel attention and spatial attention: (a) channel attention and (b) spatial attention.
Figure 4(a) depicts the channel attention. The channel feature map is generated by spatial pyramid pooling (SPP), and the channel attention weight coefficient is generated by a Sigmoid function after the fully connected and activation layers. A new scaled feature is created by multiplying this weight by the input feature. Spatial attention is shown in Figure 4(b): average pooling and maximum pooling produce two different features along the channel axis, which are then concatenated; the remaining operations are the same as for channel attention.

Figure 8 .
Figure 8. Performance and model complexity comparison: (a) PSNR vs parameters and (b) PSNR vs runtimes.

Table 1 .
PSNR(dB)/SSIM evaluation results of different methods on different datasets.

Table 2 .
LPIPS evaluation results of different methods on different datasets.Note: Bold font is the best value for each column.

Table 3 .
Results of evaluation index used by different modules on different datasets.