MAMIQA: No-Reference Image Quality Assessment Based on Multiscale Attention Mechanism With Natural Scene Statistics

No-Reference Image Quality Assessment (NR-IQA) aims to evaluate the perceptual quality of an image in a way that is consistent with human perception. Many recent studies use Transformers, whose self-attention mechanism assigns different weights to different regions of an image, simulating the perception of the human visual system (HVS). However, the quadratic computational complexity of self-attention is time-consuming and expensive. Meanwhile, the image resizing applied during feature extraction discards full-resolution quality information. To address these issues, we propose a lightweight attention mechanism that uses decomposed large-kernel convolutions to extract multiscale features, together with a novel feature enhancement module to simulate the HVS. We further compensate for the information loss caused by image resizing with supplementary features from natural scene statistics. Experimental results on five standard datasets show that the proposed method surpasses the state of the art (SOTA) while significantly reducing computational costs.

With the development of deep learning, several NR-IQA methods [3], [4], [5], [6], [7], [8] based on Convolutional Neural Networks (CNNs) have been proposed. The main limitation of these methods is that they only accept fixed-size inputs, so images must be resized during preprocessing. For the IQA task, resizing alters the quality of the image itself and thus biases the final prediction.
Neuroscience research [9] has demonstrated that when the human visual system (HVS) estimates the quality of an image, some regions receive higher attention. The HVS can therefore be simulated by introducing an attention mechanism that assigns different weights to different regions of the image. Transformers have been applied to many machine vision tasks [10], [11], [12], [13], [14], [15], and several NR-IQA methods also use them to implement the attention behavior of the HVS. TRIQ [16] extracts features with a CNN and implements self-attention with the Transformer encoder, handling image inputs of different resolutions. TranSLA [17] introduces saliency information to guide the Transformer's self-attention while incorporating a gradient map to supply local information. TReS [18] employs a hybrid of a CNN and Transformer self-attention to extract local and non-local features from the input image. However, recent research [19] argues that treating a two-dimensional image as a one-dimensional sequence is unreasonable and poses several challenges. First, the self-attention mechanism imposes a quadratic computational complexity on images, which is time-consuming and expensive for high-resolution inputs; solutions with such high computational and storage requirements cannot be easily deployed on mobile or edge devices [20]. Second, Transformer-based models [16], [17], [18] usually use a CNN as the feature extractor, which still resizes the image and hence deviates from the original image quality.
To solve the above-mentioned issues, we propose a new NR-IQA method based on a multiscale attention mechanism, called MAMIQA, which consists of two branches: one branch mimics the HVS and the other captures original-image features to compensate for the information loss due to image resizing. For the first branch, we use decomposed large-kernel convolutions to assign different attention to different regions of the image. By decoupling the large-kernel convolution into a depth-wise convolution, a depth-wise dilation convolution, and a channel convolution, we implement the attention mechanism with a lower computational overhead than Transformers. Moreover, the HVS views images at multiple scales [21]. To simulate this behavior, we extract features at multiple scales and further propose a feature enhancement module (FEM) to enrich the local fine-grained details and global semantic information of the multiscale features. For the second branch, traditional natural scene statistics (NSS) [22] are used. The advantage of NSS methods [23], [24], [25] is that there is no limitation on the input image size, i.e., the original image features can be extracted without resizing or cropping the image. Therefore, NSS is adopted to extract original-image features and compensate for the information loss due to image resizing. The main contributions of this letter are as follows:
- We propose a novel NR-IQA method based on a lightweight attention mechanism, which mimics the HVS and improves image quality assessment performance.
- We adopt natural scene statistics to extract features from the original-sized image, compensating for the information loss caused by image resizing or cropping.
- We extract multiscale features and propose an efficient feature enhancement module (FEM) to improve the multiscale feature representation of the model.
- We enable low-complexity IQA, surpassing the performance of Transformer-based methods at a much lower computational cost.
Fig. 1 depicts the overall framework of the proposed MAMIQA, a two-branch network with a multiscale attention branch and an NSS branch.

II. PROPOSED METHOD
For the multiscale attention branch, we extract attention features at four scales with the multiscale attention module. The features at each scale are then fed into our proposed feature enhancement module (FEM) for feature enhancement, and global average pooling (GAP) is finally applied to obtain the enhanced multiscale features.
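As a rough illustration (not the authors' exact configuration), the sketch below shows how the four stage feature maps could be enhanced and pooled into a single multiscale vector; the stage channel widths and the FEM placeholder are assumptions.

```python
import torch
import torch.nn as nn

class MultiscaleAggregation(nn.Module):
    """Sketch of the multiscale attention branch head: each of the four stage
    feature maps is passed through a feature enhancement module (FEM), globally
    average-pooled, and the pooled vectors are concatenated. Channel widths and
    the FEM placeholder are illustrative, not the paper's exact values."""

    def __init__(self, stage_channels=(64, 128, 320, 512), fem=nn.Identity):
        super().__init__()
        self.fems = nn.ModuleList([fem() for _ in stage_channels])
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, stage_features):
        # stage_features: list of 4 tensors, each of shape (B, C_i, H_i, W_i)
        pooled = []
        for feat, fem in zip(stage_features, self.fems):
            enhanced = fem(feat)                           # feature enhancement (FEM)
            pooled.append(self.gap(enhanced).flatten(1))   # (B, C_i) after GAP
        return torch.cat(pooled, dim=1)                    # concatenated multiscale vector
```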
For the NSS branch, we use BRISQUE-based NSS features [24]. First, we calculate the mean subtracted contrast normalized (MSCN) coefficients of the image, i.e., the locally normalized luminance coefficients. Since the MSCN coefficients of a pristine image follow a Gaussian distribution, whereas those of a distorted image deviate from this statistical regularity [22], we can extract features by quantifying this deviation, which we capture with a generalized Gaussian distribution (GGD). Likewise, the products of adjacent MSCN coefficients are known to follow a statistical regularity for natural images, and the deviation of these product coefficients is captured with an asymmetric generalized Gaussian distribution (AGGD). The NSS features are extracted at full and half scale and fed into two fully connected layers with PReLU activation, since PReLU strengthens the nonlinear relationship between the layers and thus accelerates training.
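To make the statistics concrete, the following sketch computes MSCN coefficients and a GGD fit by moment matching, as in the standard BRISQUE formulation. The Gaussian window parameter is the commonly used value, and the AGGD terms over products of neighboring coefficients are omitted for brevity, so this is a simplification rather than the full 36-dimensional BRISQUE feature vector.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.special import gamma

def mscn_coefficients(image, sigma=7 / 6, eps=1.0):
    """MSCN (locally normalized luminance) coefficients of a grayscale float image.
    sigma follows the 7x7 Gaussian window commonly used in BRISQUE; eps avoids
    division by zero for flat regions."""
    mu = gaussian_filter(image, sigma)                                    # local mean
    var = gaussian_filter(image * image, sigma) - mu * mu
    sigma_map = np.sqrt(np.abs(var))                                      # local std
    return (image - mu) / (sigma_map + eps)

def ggd_features(mscn):
    """Estimate GGD shape and variance of the MSCN map by moment matching
    (a simplified BRISQUE-style sketch)."""
    sigma_sq = np.mean(mscn ** 2)
    rho = sigma_sq / (np.mean(np.abs(mscn)) ** 2 + 1e-12)                 # moment ratio
    alphas = np.arange(0.2, 10.0, 0.001)
    r_alpha = gamma(1 / alphas) * gamma(3 / alphas) / gamma(2 / alphas) ** 2
    alpha = alphas[np.argmin((r_alpha - rho) ** 2)]                       # best-matching shape
    return alpha, sigma_sq
```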
We use a three-layer fully connected MLP head to fuse the above features and predict the perceptual quality. For each batch of images during training, the regression loss is minimized with the mean absolute error (MAE) as the loss function, which yields stable training and prevents gradient explosion [26]:
$$\mathcal{L}_{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| q_i - s_i \right|,$$
where $q_i$ is the predicted quality score of the $i$-th image, $s_i$ is its corresponding subjective quality score (ground truth), and $N$ is the number of images in the batch.
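A minimal sketch of the fusion head and the MAE objective is shown below; the feature dimensions and hidden-layer sizes are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class QualityHead(nn.Module):
    """Sketch of the fusion stage: the pooled multiscale attention features and
    the NSS features (after their FC layers) are concatenated and regressed to
    a single quality score by a three-layer MLP head."""

    def __init__(self, attn_dim=1024, nss_dim=64, hidden=(512, 128)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(attn_dim + nss_dim, hidden[0]), nn.ReLU(inplace=True),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(inplace=True),
            nn.Linear(hidden[1], 1),
        )

    def forward(self, attn_feat, nss_feat):
        return self.mlp(torch.cat([attn_feat, nss_feat], dim=1)).squeeze(1)

# The regression loss is the mean absolute error between q_i and s_i over the batch.
criterion = nn.L1Loss()
# loss = criterion(predicted_scores, subjective_scores)   # hypothetical usage
```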

A. Multiscale Attention Module (MAM)
The key to the attention mechanism is producing the attention map, which indicates the importance of different regions; therefore, we need to learn the dependencies between regions. There are two approaches to capture such dependencies. The first is the self-attention mechanism, whose drawbacks have been listed above. The second is to establish correlations using large-kernel convolutions; however, applying attention directly with large kernels incurs a high computational overhead and a large number of parameters. To solve this problem, we adopt the attention network of [19] as the backbone of the multiscale attention module, where a large-kernel convolution is decomposed to capture the dependencies among different regions. As shown in Fig. 2, a large-kernel convolution can be divided into three components: a depth-wise convolution, a depth-wise dilation convolution, and a channel convolution. This decomposition captures feature dependencies at a lower computational cost and with fewer parameters. The input features are first downsampled to obtain $F_1$; then $L$ groups of operations are stacked in sequence to extract features. The operations of the $l$-th group can be described as
$$\mathrm{LKA}(X) = f_{1\times 1}\big(f_{DW\text{-}D\text{-}Conv}(f_{DW\text{-}Conv}(X))\big) \otimes X,$$
$$\hat{F}_l = F_l + f_{1\times 1}\big(\mathrm{LKA}(f_{GELU}(f_{1\times 1}(f_{BN}(F_l))))\big),$$
$$F_{l+1} = \hat{F}_l + f_{FFN}\big(f_{BN}(\hat{F}_l)\big),$$
where $f_{GELU}$ denotes the GELU activation, $f_{BN}(\cdot)$ batch normalization, $f_{1\times 1}(\cdot)$ a 1 × 1 convolution, $f_{DW\text{-}Conv}(\cdot)$ and $f_{DW\text{-}D\text{-}Conv}(\cdot)$ the depth-wise convolution and depth-wise dilation convolution, respectively, $f_{FFN}(\cdot)$ the convolutional feed-forward network, and ⊗ the element-wise product. The number of groups $L$ for the four stages is {3, 3, 12, 3}. Finally, layer normalization is applied at the end of each stage.
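For concreteness, the sketch below implements the decomposed large-kernel attention in the spirit of [19]. The kernel sizes (5 × 5 depth-wise followed by 7 × 7 depth-wise with dilation 3, approximating a 21 × 21 kernel) are the commonly used decomposition and may differ from the paper's exact choice.

```python
import torch
import torch.nn as nn

class DecomposedLargeKernelAttention(nn.Module):
    """Sketch of large-kernel attention decomposed into a depth-wise convolution,
    a depth-wise dilation convolution, and a 1x1 (channel) convolution; the
    resulting attention map re-weights the input by an element-wise product."""

    def __init__(self, dim):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                   dilation=3, groups=dim)
        self.channel_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        attn = self.channel_conv(self.dw_d_conv(self.dw_conv(x)))  # attention map
        return attn * x                                            # element-wise re-weighting
```

In the full module, L such groups per stage are combined with batch normalization, GELU, and a convolutional feed-forward network, as in the group equations above.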

B. Feature Enhancement Module (FEM)
The human visual system views images at multiple scales [21]. To help the network better understand the content of distorted images, features at four scales are extracted with the multiscale attention module, then enhanced, and finally processed by global average pooling to stitch the multiscale features together. To strengthen the local and global feature representation, and inspired by the Inception module in [40], we propose a feature enhancement module (FEM). As shown in Fig. 3, the FEM consists of parallel 1 × 1, 3 × 3, and 5 × 5 convolutional layers with different receptive fields, together with an average pooling layer; convolutional kernels of different sizes extract information at different scales. The average pooling reduces the feature dimension, removes redundant information, and fuses multi-dimensional features to extract denser features. In addition, we replace the single 5 × 5 convolution with two stacked 3 × 3 convolutional layers, which cover the same receptive field with fewer parameters and lower network complexity.
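The sketch below follows this Inception-style layout; the branch widths and the 1 × 1 fusion at the end are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureEnhancementModule(nn.Module):
    """Sketch of the FEM: parallel 1x1, 3x3, and stacked 3x3 branches (the
    stacked pair covers the same receptive field as a 5x5 kernel with fewer
    parameters), plus an average-pooling branch; branch outputs are
    concatenated and fused by a 1x1 convolution."""

    def __init__(self, in_ch, branch_ch=None):
        super().__init__()
        branch_ch = branch_ch or in_ch // 4
        self.b1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1))
        self.pool = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, branch_ch, 1))
        self.fuse = nn.Conv2d(4 * branch_ch, in_ch, 1)

    def forward(self, x):
        out = torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)
        return self.fuse(out)
```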

III. EXPERIMENTS
For the multiscale attention branch, the distorted images are randomly cropped into 224 × 224 patches, and random horizontal and vertical flips are applied to augment the training samples. For the NSS branch, we use the original-size image as input. The proposed MAMIQA is implemented in PyTorch and trained on Ubuntu 16.04 with a TITAN RTX GPU. We use the Adam optimizer with a weight decay of 5 × 10⁻⁴; the batch size is set to 64 and the learning rate to 2 × 10⁻⁵. To validate the performance of the proposed method, we conduct experiments on five publicly available IQA datasets: three synthetically distorted datasets, LIVE II [41], TID2013 [42], and CSIQ [43], and two authentically distorted datasets, LIVE-C [44] and KonIQ-10K [45]. Two widely used metrics, the Spearman rank-order correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC), are adopted to measure the agreement of IQA models with the ground-truth subjective quality. Table I compares our method with other methods on the five IQA datasets, where the first three rows are FR-IQA methods and the remaining rows are NR-IQA methods; the results of the competing methods are taken from the original papers. Our model achieves superior PLCC and SRCC. Specifically, it is 4.2% (PLCC) and 3.8% (SRCC) higher than MS-GMSD (the best FR-IQA method) on TID2013, and 1.8% (PLCC) and 1.7% (SRCC) higher on LIVE. Compared with other NR-IQA models, our model gains 1.3% (PLCC) and 1.6% (SRCC) on CSIQ over DBCNN (the second best). Table II reports cross-dataset experiments that further compare our method with SOTA methods: all methods were trained on one dataset and tested on three other datasets without any fine-tuning or parameter adjustment. The proposed method achieves the best results in 8 of the 12 tested cases and competitive results in the remaining 4, showcasing its strong generalization ability.
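For reference, the two correlation metrics can be computed as in the sketch below. Note that PLCC is often preceded by a nonlinear logistic mapping of the predictions; that step is omitted here, so this is only a plain-correlation sketch.

```python
import numpy as np
from scipy import stats

def evaluate(pred_scores, mos):
    """SRCC and PLCC between predicted quality scores and subjective quality (MOS)."""
    pred, mos = np.asarray(pred_scores), np.asarray(mos)
    srcc = stats.spearmanr(pred, mos)[0]   # rank-order correlation
    plcc = stats.pearsonr(pred, mos)[0]    # linear correlation
    return srcc, plcc
```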

A. Performance Evaluation
We further compare the complexity of our model with two FR-IQA methods (PieAPP and DISTS) and three NR-IQA methods: a CNN-based method (ResNet-50), a Transformer-based method (TReS), and StairIQA. As can be seen from Table IV, the proposed method offers improved performance at a lower complexity. In particular, compared to the Transformer-based model (TReS), the proposed method has roughly the same number of parameters but 35% lower computational complexity and much higher performance, which validates the efficiency of the proposed lightweight solution.
[Table I: Comparison of the proposed method with SOTA algorithms on three synthetic-distortion datasets and two authentic-distortion datasets; the top two results are shown in bold.]
[Table II: SRCC results for cross-dataset experiments.]
[Table III: Ablation experiments on different components.]
[Table IV: Comparison of SRCC results and complexity with competitive methods.]
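As a rough way to obtain such complexity figures, the sketch below counts trainable parameters and measures average inference latency on a dummy input; computing FLOPs/MACs as reported in Table IV would additionally require a dedicated profiler, which is not shown here.

```python
import time
import torch

def model_complexity(model, input_size=(1, 3, 224, 224), device="cpu"):
    """Trainable parameter count (millions) and average inference time (ms)
    on a dummy input; assumes the model accepts a single image tensor."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    x = torch.randn(*input_size, device=device)
    model.eval().to(device)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(10):
            model(x)
        elapsed = (time.perf_counter() - start) / 10
    return params / 1e6, elapsed * 1000
```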

B. Ablation Study
To analyze the effectiveness of the feature enhancement module and the NSS features, an ablation experiment was conducted to verify the influence of each component of the proposed model. We constructed the following settings: 1) a model containing only the multiscale attention module backbone; 2) a model combining the multiscale attention module with the feature enhancement module (FEM); 3) the full proposed model. Table III reports the results on the LIVE, CSIQ, TID2013, LIVE-C, and KonIQ-10K datasets. The results show that adding the FEM and the NSS features each improves performance, for both synthetic and authentic distortions, indicating the effectiveness of the proposed modules.

IV. CONCLUSION
In this letter, we proposed a lightweight IQA method named MAMIQA. A multiscale attention branch captures attention via decomposed large-kernel convolutions and enhances the feature representation with a novel feature enhancement module, while an NSS branch extracts supplementary features to compensate for the information loss caused by image resizing and cropping. The experimental results showed that the proposed model achieves a significant performance improvement over the SOTA. In particular, it estimates quality better than recent Transformer-based methods while requiring significantly less computational power.