Multi-Scale Blind Image Quality Predictor Based on Pyramidal Convolution

Abstract: Traditional image quality assessment (IQA) methods use hand-crafted features to predict the image quality score, and they do not perform well in many scenes. Since deep learning has advanced many computer vision tasks, many IQA methods have begun to use deep convolutional neural networks (CNNs). In this paper, a CNN-based multi-scale blind image quality predictor is proposed that extracts more effective multi-scale distortion features through pyramidal convolution. The predictor performs two tasks: a distortion recognition task and a quality regression task. For the first task, the image distortion type is obtained by the fully connected layer. For the second task, the image quality score is predicted jointly with the distortion recognition process. Experimental results on three widely used IQA datasets show that the proposed method outperforms previous traditional algorithms in quality prediction and distortion recognition.


Introduction
With the rapid development of digital technology, digital media such as electronic photo albums, video streams and video websites have become ubiquitous in people's lives. As display devices improve, people are more concerned about the quality of images. However, digital images may be degraded during acquisition, storage and transmission, which lowers their perceived quality. To predict the quality score of the received images, image quality assessment (IQA) has become increasingly important in low-level computer vision.
According to who finally judges the image, IQA methods can be categorized into subjective and objective methods. In subjective IQA, the image score is given by human observers. Although this yields reliable and accurate scores, collecting the mean opinion score (MOS) or differential mean opinion score (DMOS) for each image is laborious and time-consuming. Hence, it is crucial to design a computer algorithm that predicts the image score automatically.
Generally, objective IQA methods can be divided into full-reference (FR), reduced-reference (RR) and no-reference (NR) IQA according to the availability of reference image information. FR-IQA methods, such as PSNR, SSIM [1], MS-SSIM [2], FSIM [3] and VIF [4], use both the reference image and the distorted image to obtain the final score. RR-IQA methods [5][6] use only partial reference information. Although full-reference quality prediction has improved greatly in recent years, reference information is unavailable in many realistic settings, such as images captured in the wild. Hence, to make an IQA algorithm usable in real-world scenes, it must predict image quality without the reference. This is the goal of NR-IQA, which is also called blind IQA (BIQA). Compared with FR- and RR-IQA methods, NR-IQA is applicable in more scenes because no reference information is needed, but for the same reason predicting the quality score precisely is more challenging.
Commonly, traditional objective NR-IQA methods use hand-crafted features [7][8][9] extracted from the distorted images to predict quality. Many methods first extract Natural Scene Statistics (NSS) based features and then map them to the quality score through support vector regression (SVR). Designing these hand-crafted features requires considerable human effort, yet the resulting improvement is limited. Deep learning has advanced many computer vision tasks, e.g., object detection [10], image segmentation [11] and image enhancement [12], and deep neural networks applied to IQA [13][14] have likewise boosted quality prediction performance. Generally, a convolutional neural network (CNN) extracts the distortion features hidden in the distorted image, and these features are then fed into fully connected (FC) layers that regress the final quality score. Compared with traditional IQA methods, deep learning-based IQA methods can extract more effective distortion features, and the network parameters are updated automatically through backpropagation, without manually designed features.
Although convolutional neural networks facilitate the extraction of effective features, the standard square convolutional layer is weak at handling multi-scale features. To solve this problem, a multi-scale blind image quality predictor based on pyramidal convolution is proposed, which focuses on extracting multi-scale distortion features for quality regression. Unlike the standard square convolutional layer, pyramidal convolution [15] processes the input with multiple convolutional kernels. Hence, one standard square convolutional layer and three pyramidal convolution layers are adopted in our network to learn the complicated relationship between multi-scale distortion features and the predicted quality score. Humans can easily judge the distortion type when they see a distorted image. To mimic this behavior of the human visual system (HVS), a distortion recognition task is added to enhance the learning ability of the quality prediction task. The distortion recognition task is realized by a fully connected layer that maps the feature maps to n output nodes, where n denotes the number of distortion types. The contributions of the proposed blind image quality predictor are summarized as follows: (1) A multi-scale blind image quality predictor is proposed to map distortion features to the quality score; it is trained end to end without the need to design hand-crafted features. (2) To mimic the behavior of the HVS, a distortion recognition task is proposed to assist the quality prediction task, and pyramidal convolution is adopted in our network to enable multi-scale feature extraction. (3) Experiments conducted on three widely used IQA datasets demonstrate the effectiveness of the proposed method.

Related Works
FR-IQA methods require the full reference image, and the quality score is obtained by comprehensively comparing the distorted image with the corresponding distortion-free image. Compared with RR- and NR-IQA methods, FR-IQA methods are relatively mature. The simplest FR-IQA measure is the mean squared error (MSE), which averages the squared pixel-wise differences between the distorted image and the reference image. The peak signal-to-noise ratio (PSNR) is a closely related measure of the difference between the distorted image and the corresponding distortion-free image. Although these two methods are simple to implement and were widely used in the early stage, their predictions are not consistent with subjective scores. With further study of the human visual system (HVS), many novel methods were proposed. Wang et al. [1] proposed SSIM (structural similarity) to mimic the HVS, and it has become the most representative FR-IQA method. SSIM compares the luminance, contrast and structural information of the distorted and reference images and achieves good results in quality prediction. Many researchers then made a series of improvements to the original SSIM. Wang et al. [16] proposed MS-SSIM (multi-scale structural similarity), which supplies multi-scale features by taking more viewing conditions into account. Chen et al. [17] proposed GSSIM (gradient-based structural similarity), which considers gradient information when extracting the features.
Before machine learning and deep learning were applied to IQA, the dominant approach relied on NSS features extracted from the image, which are used to distinguish distorted images from pristine ones. Generally, no-reference image quality assessment methods can be classified into hand-crafted-feature-based methods and learning-based methods. Among the hand-crafted methods, Wang et al. [18] proposed a quality measure for JPEG-compressed images. Saad et al. [19] proposed BLIINDS-II, which works in the DCT domain by extracting contrast and structure features. Mittal et al. [20] proposed NIQE, which uses a multivariate Gaussian model for the prediction task.
In learning-based methods, distortion features are extracted by a deep neural network instead of being elaborately designed by hand. Kang et al. [13] designed a network with only one convolutional layer and two pooling layers for quality regression. To augment the training samples, images are cropped into 32 × 32 pixel patches that are fed to the network. They later extended the network by adding a second task for distortion recognition [14]. Bosse et al. [21] used a deeper network with ten convolutional layers and max-pooling layers to extract features, and proposed a weighting strategy to estimate the influence of each patch on the final score. Kim et al. [22] enriched the training data by generating an error map in the first training stage and then used the pre-trained model for quality regression in the second stage. Although these methods achieve good results in quality prediction, challenges remain. The distortion information in an image is multi-scale rather than single-scale, and single-scale convolutional layers cannot extract multi-scale distortion features effectively.
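As a concrete reference for the two simplest full-reference measures mentioned above, the following minimal sketch computes MSE and PSNR for grayscale images stored as nested lists. The function names and the 8-bit peak value of 255 are illustrative assumptions, not part of the proposed method.

```python
import math


def mse(reference, distorted):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    total = 0.0
    count = 0
    for ref_row, dist_row in zip(reference, distorted):
        for r, d in zip(ref_row, dist_row):
            total += (r - d) ** 2
            count += 1
    return total / count


def psnr(reference, distorted, peak=255.0):
    """PSNR in dB; `peak` is the maximum pixel value (255 assumes 8-bit images)."""
    err = mse(reference, distorted)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / err)
```

Both measures depend only on pixel-wise differences, which is one reason their scores can disagree with subjective judgments.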

Method Description
The overall architecture of the proposed blind image quality predictor is shown in Fig. 1. To effectively extract the distortion features of distorted images, a multi-task blind image quality predictor is proposed to solve the NR-IQA problem. The proposed method contains two tasks: (1) a distortion recognition task and (2) a quality prediction task. Given a distorted image $I$, we crop patches from $I$ to form the group $\{P_i\}$, $i = 1, 2, \cdots, N$. Before training, local normalization is used to preprocess each image patch $P_i$, and the locally normalized patch $P_i'$ is then fed into the network to learn the distortion type and the final quality score. The details of the proposed method are described as follows.

Model Architecture
Motivated by [13], the proposed network uses a convolutional neural network to extract distortion features. The network consists of four convolutional layers, namely one standard convolutional layer and three pyramidal convolution layers, followed by three fully connected layers that map the feature maps to the quality score and the distortion type. The detailed architecture of the proposed network is shown in Tab. 1. The first convolutional layer, with kernel size 1, expands the channels of the feature maps to match the following pyramidal convolution layers. Then three pyramidal convolution layers with the three kernel sizes 3, 5 and 7 extract the multi-scale distortion features, and the output features are 128 × 8 × 8. After that, max pooling and min pooling layers reduce the feature maps to 128 × 1 × 1. Finally, three fully connected layers followed by PReLU [23] map the extracted features to the predicted distortion type and quality score. Also following [13], dropout with probability 0.5 is applied after the first FC layer to avoid overfitting.
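The pooling step that reduces each 8 × 8 feature map to a single value can be illustrated with a small sketch. The text does not state how the max-pooled and min-pooled vectors are combined before the FC layers; concatenating them, as done in the hypothetical `pooled_descriptor` below, is an assumption.

```python
def global_max_min_pool(feature_maps):
    """Reduce C feature maps (each an H x W list of rows) to two length-C vectors.

    Returns (max_vector, min_vector): one maximum and one minimum per channel.
    """
    max_vec = [max(max(row) for row in channel) for channel in feature_maps]
    min_vec = [min(min(row) for row in channel) for channel in feature_maps]
    return max_vec, min_vec


def pooled_descriptor(feature_maps):
    """Assumption: the two pooled vectors are concatenated into one descriptor."""
    max_vec, min_vec = global_max_min_pool(feature_maps)
    return max_vec + min_vec
```

For 128 feature maps of size 8 × 8, each pooled vector has 128 entries, which matches the 128 × 1 × 1 shape given above.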

Image Preprocess
The human visual system (HVS) is insensitive to changes in the low-frequency band. Moreover, image distortion mainly affects the high-frequency information of the image and has little impact on the low-frequency information. Hence, to mimic the HVS and make training more stable, input image patches are preprocessed before training. In this step, local normalization is applied to the input image, following [13]. Given an image patch $P$, the intensity value at pixel $(i, j)$ is denoted $P(i, j)$, where $i$ and $j$ denote the horizontal and vertical location within the image patch. The local normalization process is summarized as follows:

$$P'(i, j) = \frac{P(i, j) - \mu(i, j)}{\sigma(i, j) + C} \quad (1)$$

$$\mu(i, j) = \frac{1}{(2M+1)(2N+1)} \sum_{m=-M}^{M} \sum_{n=-N}^{N} P(i+m, j+n) \quad (2)$$

$$\sigma(i, j) = \sqrt{\frac{1}{(2M+1)(2N+1)} \sum_{m=-M}^{M} \sum_{n=-N}^{N} \big(P(i+m, j+n) - \mu(i, j)\big)^2} \quad (3)$$

where $C$ denotes a small positive constant that prevents division by zero, and $M$ and $N$ indicate the size of the normalization window. As suggested in [13], we set $M = N = 3$ to achieve the best performance.
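The local normalization step can be sketched in a few lines: each pixel has the mean of its neighborhood subtracted and is divided by the neighborhood's standard deviation plus a constant. In this sketch the window size 3 is treated as a half-width, the window is clipped at the patch border, and the default `c=1.0` is a placeholder; all three are assumptions rather than details given in the text.

```python
import math


def local_normalize(patch, half_window=3, c=1.0):
    """Locally normalize a grayscale patch given as a list of rows.

    half_window: neighborhood extends this many pixels in each direction (assumed).
    c: small positive constant that prevents division by zero (placeholder value).
    """
    height, width = len(patch), len(patch[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Neighborhood values, clipped at the patch border (assumption).
            values = [
                patch[y][x]
                for y in range(max(0, i - half_window), min(height, i + half_window + 1))
                for x in range(max(0, j - half_window), min(width, j + half_window + 1))
            ]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            out[i][j] = (patch[i][j] - mean) / (math.sqrt(var) + c)
    return out
```

A constant patch maps to all zeros, which reflects the intent of the step: flat, low-frequency content is suppressed and local variation is kept.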

Pyramidal Convolution
As shown in Fig. 2, the distortion information hidden in an image varies over a wide range, from shallow to deep. Hence, the standard convolutional layers used in [13][14] cannot handle multi-scale distortion features well. To solve this problem, pyramidal convolution [15] is adopted in the proposed network to extract more effective distortion features. As described in [15], pyramidal convolution contains a pyramid of kernels whose levels differ in kernel size and depth, which allows it to extract different levels of the distortion information hidden in the image.
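The core idea, applying kernels of several sizes to the same input and stacking the results, can be shown with a toy single-channel sketch. It is not the layer of [15]: the learned weights are replaced by fixed averaging kernels, and the varying kernel depth is omitted. The helper names are hypothetical.

```python
def conv2d_same(image, kernel):
    """2D correlation with zero padding so the output has the input's size."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    y, x = i + u - pad, j + v - pad
                    if 0 <= y < h and 0 <= x < w:  # outside pixels count as zero
                        acc += kernel[u][v] * image[y][x]
            out[i][j] = acc
    return out


def box_kernel(size):
    """Averaging kernel used as a stand-in for learned weights (assumption)."""
    weight = 1.0 / (size * size)
    return [[weight] * size for _ in range(size)]


def pyramidal_conv(image, kernel_sizes=(3, 5, 7)):
    """One output map per kernel size; the maps are stacked as channels."""
    return [conv2d_same(image, box_kernel(k)) for k in kernel_sizes]
```

Because every level uses "same" padding, the three output maps share the input's spatial size and can be stacked directly, while each one summarizes a different neighborhood extent.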

Loss Functions
During backpropagation, a well-designed loss function can not only accelerate the convergence of the network but also improve the accuracy of quality prediction. To achieve the best training effect, we design a mixed loss function that combines two loss functions: a quality prediction loss $L_q$ and a distortion recognition loss $L_d$. Compared with the large datasets designed for object detection, IQA datasets are too small for training deep learning-based IQA methods. To solve this problem, the input image is divided into 64 × 64 pixel patches to augment the training dataset. Each cropped image patch $P_i$ takes the quality score of the original image $I$ as its label. During testing, the quality score of a distorted image is calculated by averaging the predictions for all patches cropped from it:

$$Q(I) = \frac{1}{N} \sum_{i=1}^{N} f(P_i; \theta) \quad (4)$$

where $P_i$ denotes the i-th patch of the distorted image, $N$ denotes the number of image patches cropped from the image, $f(\cdot)$ indicates the network of the proposed method, and $\theta$ denotes the network parameters.
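The patch-level labeling and the test-time averaging described above can be sketched as follows. The callable `score_patch` stands in for the trained network and is a hypothetical placeholder, as are the function names.

```python
def patch_labels(patches, image_score):
    """Training labels: every patch inherits the score of its source image."""
    return [(patch, image_score) for patch in patches]


def predict_image_score(patches, score_patch):
    """Test-time score: the mean of the per-patch predictions.

    score_patch: callable mapping one patch to a predicted score (placeholder
    for the trained network).
    """
    if not patches:
        raise ValueError("at least one patch is required")
    return sum(score_patch(p) for p in patches) / len(patches)
```

Averaging gives every patch the same weight in the image-level score.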
During training, the goal of the network is to narrow the gap between the predicted scores and the ground truth scores. The loss function measures the difference between the predicted image quality score $f(P_i; \theta)$ and the ground truth score $y_i$. For the image quality prediction task, we adopt the commonly used objective function $L_q$, in which $M$ denotes the number of images, $f(\cdot)$ indicates the network of our proposed method and $y_i$ denotes the i-th ground truth score. For the distortion recognition task, cross entropy loss is adopted as the loss function, and it can be described as:

$$L_d = -\log\left(\frac{\exp(x_i)}{\sum_{j=1}^{D} \exp(x_j)}\right) \quad (6)$$

where $i$ indexes the labeled distortion type and $j$ runs over all distortion types, $x$ denotes the input vector, and $D$ denotes the number of distortion types, e.g., $D = 5$ for the LIVE dataset and $D = 6$ for the CSIQ dataset.
Finally, to achieve the best performance on both quality prediction and distortion recognition, the mixed loss function is defined as:

$$L = \lambda_1 L_q + \lambda_2 L_d \quad (7)$$

where $\lambda_1$ and $\lambda_2$ denote the weight factors of the quality prediction loss $L_q$ and the distortion recognition loss $L_d$, respectively. To balance training and keep the two loss terms on the same order of magnitude, we set $\lambda_1 = \lambda_2 = 1$.
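A minimal sketch of the mixed loss is given below. The exact form of the quality objective is not recoverable from the text, so mean absolute error is assumed here; the cross-entropy term follows the usual softmax formulation, and both weights default to 1. All function and parameter names are illustrative.

```python
import math


def l1_quality_loss(predicted, targets):
    """Assumed quality objective: mean absolute error over the batch."""
    return sum(abs(p - t) for p, t in zip(predicted, targets)) / len(targets)


def cross_entropy(logits, true_class):
    """-log softmax(logits)[true_class], computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[true_class]


def mixed_loss(predicted, targets, logits_batch, classes, w_q=1.0, w_d=1.0):
    """Weighted sum of the quality loss and the mean distortion-recognition loss."""
    quality = l1_quality_loss(predicted, targets)
    recognition = sum(
        cross_entropy(logits, c) for logits, c in zip(logits_batch, classes)
    ) / len(classes)
    return w_q * quality + w_d * recognition
```

With equal weights, neither task dominates as long as the two terms have similar magnitudes, which is the stated reason for choosing both weights equal to 1.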

Training of the Network
All training patches are cropped from the distorted images with a size of 64 × 64 pixels, and the stride is set to 64 pixels. Our method is implemented in PyTorch [24] on an NVIDIA GTX 1080. We use Adam [25] with a learning rate of $10^{-4}$ to train our network. Every 100 epochs, the learning rate is decayed by a factor of 0.1. In addition, the momentum factor, weight decay factor and batch size are set to 0.9, $10^{-4}$ and 128, respectively.
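The patch grid and the step-wise learning-rate schedule can be sketched without any framework. Two assumptions are made: partial patches at the right and bottom borders are discarded, and "decreased by 0.1" means the rate is multiplied by 0.1 every 100 epochs. The function names are illustrative.

```python
def patch_origins(height, width, patch=64, stride=64):
    """Top-left corners of all full patches; border remainders are dropped (assumed)."""
    return [
        (top, left)
        for top in range(0, height - patch + 1, stride)
        for left in range(0, width - patch + 1, stride)
    ]


def learning_rate(epoch, base_lr=1e-4, step=100, gamma=0.1):
    """Step decay: multiply the base rate by gamma once per `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

With a stride equal to the patch size, the patches do not overlap, so a 128 × 192 image yields six training patches.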

Datasets
To test the performance of quality prediction and distortion recognition, three widely used synthetic-distortion IQA databases, LIVE [26], TID2013 [27] and CSIQ [28], are chosen for the experiments. The details of the three databases (e.g., the numbers of reference images, distorted images and distortion types) are tabulated in Tab. 2.
The LIVE [26] database contains 779 distorted images generated from 29 pristine images in a laboratory environment. The distorted images cover five distortion types (JP2K, JPEG, WN, GBLUR and FF) at 7 to 8 degradation levels. In addition, it provides a Differential Mean Opinion Score (DMOS) for each distorted image, ranging from 0 to 100. A higher DMOS indicates worse image quality.
The TID2013 [27] database contains 3000 distorted images generated from 25 pristine images. It covers 24 distortion types at five degradation levels, which makes it the richest synthetic-distortion IQA database in terms of distortion types. Unlike the LIVE database, it provides a Mean Opinion Score (MOS) for each distorted image, ranging from 0 to 9. A lower MOS indicates lower image quality.
The CSIQ [28] database contains 866 distorted images generated from 30 pristine images. Each reference image is degraded by six distortion types at 4 to 5 degradation levels: JPEG, JPEG2000, Gaussian blurring, Gaussian pink noise, Gaussian white noise and contrast change. As in the LIVE database, a DMOS is provided for each distorted image, ranging from 0 to 1. A higher value indicates worse visual quality.

Performance Criteria
To conduct the experiments, we choose two widely used metrics to evaluate each IQA algorithm: the Spearman Rank Order Correlation Coefficient (SROCC) and the Pearson Linear Correlation Coefficient (PLCC). PLCC measures the linear correlation between the labeled quality scores and the quality scores predicted by the network, and it is formulated as:

$$PLCC = \frac{\sum_{i} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i} (y_i - \bar{y})^2} \sqrt{\sum_{i} (\hat{y}_i - \bar{\hat{y}})^2}} \quad (8)$$

where $y_i$ denotes the labeled quality score of the i-th image and $\hat{y}_i$ denotes the predicted quality score of the i-th image. $\bar{y}$ denotes the mean of the ground truth image quality scores, and $\bar{\hat{y}}$ indicates the mean of the predicted quality scores.
SROCC measures the prediction monotonicity and is defined as:

$$SROCC = 1 - \frac{6 \sum_{i=1}^{N} (v_i - p_i)^2}{N(N^2 - 1)} \quad (9)$$

where $N$ denotes the number of images, $v_i$ indicates the rank of the ground truth score $y_i$ among the ground truth scores, and $p_i$ denotes the rank of the predicted score $\hat{y}_i$ among the predicted scores.
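Both criteria are short to compute directly. The sketch below implements PLCC from its definition and SROCC from the rank-difference formula; the rank helper assumes there are no tied scores, which is a simplification. The function names are illustrative.

```python
import math


def plcc(truth, predicted):
    """Pearson linear correlation between labeled and predicted scores."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, predicted))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_t * var_p)


def ranks(values):
    """Ranks 1..n in ascending order; assumes no ties (simplification)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def srocc(truth, predicted):
    """Spearman rank-order correlation via the rank-difference formula."""
    n = len(truth)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(truth), ranks(predicted)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A monotonic but nonlinear prediction, such as the squares of the true scores, keeps SROCC at 1 while PLCC drops below 1, which is why the two criteria are reported together.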

Experimental Results on Single Dataset
To verify the consistency between the model predictions and human subjective evaluation, we conduct single-dataset evaluations on the three synthetic-distortion IQA databases: LIVE [26], TID2013 [27] and CSIQ [28]. In each experiment, the database is randomly divided into two groups: 80 percent of the reference images and their corresponding distorted images form the training set, and the rest form the testing set. The selection process is completely random. This procedure is repeated ten times to reduce the bias caused by a particular split, and the median SROCC and PLCC are reported as the final results. To better evaluate the quality prediction and distortion recognition performance of the proposed method, several standard IQA methods are chosen for comparison, including BLIINDS-II [19], DIIVINE [29], IL-NIQE [30], CORNIA [31], and two deep learning-based methods, CNN [13] and CNN++ [14]. The SROCC and PLCC results on the three datasets are shown in Tab. 3, with the best results marked in boldface. As Tab. 3 shows, our proposed method achieves the best results on all three IQA databases, reaching (0.962, 0.963), (0.732, 0.761) and (0.843, 0.852), respectively. Our method also outperforms the deep learning-based methods CNN and CNN++ on the LIVE dataset. Two conclusions can be drawn from the experimental results: (1) The pyramidal convolution layers introduced in our network extract multi-scale distortion features more effectively than standard convolutional layers, which improves the prediction results. (2) Multi-task training better simulates the human visual system by predicting the image quality score together with the distortion type, which improves prediction accuracy.
Because distortion recognition ability is important for the accuracy of quality prediction, we also test the accuracy of the proposed method in distortion recognition. In this experiment, we compare our method with BLIINDS-II [19], BRISQUE [32], CORNIA [31] and CNN++ [14]. The classification accuracy is tabulated in Tab. 4. The proposed method achieves the highest accuracy, 96.2%, which indicates that it can reliably recognize the distortion type.
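The evaluation protocol, splitting by reference image so that no image content appears in both sets and taking the median over ten random splits, can be sketched as follows. The rounding of the 80 percent boundary, the fixed seed and the `evaluate` callback (which would train a model and return SROCC or PLCC) are assumptions made for illustration.

```python
import random
import statistics


def split_by_reference(reference_ids, train_fraction=0.8, rng=None):
    """Randomly split the reference images; distorted images follow their reference."""
    if rng is None:
        rng = random.Random()
    ids = sorted(set(reference_ids))
    rng.shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))  # rounding rule is an assumption
    return set(ids[:n_train]), set(ids[n_train:])


def median_over_splits(reference_ids, evaluate, repeats=10, seed=0):
    """Repeat the random split and report the median of the returned metric.

    evaluate: callable (train_ids, test_ids) -> float, a placeholder for
    training the model and computing SROCC or PLCC on the test set.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_ids, test_ids = split_by_reference(reference_ids, rng=rng)
        scores.append(evaluate(train_ids, test_ids))
    return statistics.median(scores)
```

For the 29 reference images of LIVE, this rule gives 23 training and 6 testing references per split.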

Experimental Results on Different Distortion Types
A good IQA algorithm should predict quality well not only across mixed distortion types but also for each individual distortion type. In this section, to verify the prediction ability of IQA methods on different distortion types, experiments are conducted on the individual distortion types of the LIVE database. In this experiment, every model is trained and tested on each distortion type separately. We choose four NR-IQA methods, BLIINDS-II [19], DIIVINE [29], CORNIA [31] and CNN [13], to compare with our method. The SROCC and PLCC values are shown in Tab. 5 and Tab. 6. As these tables show, the proposed method achieves the highest prediction accuracy for the JP2K, BLUR and FF distortion types, with results of (0.953, 0.962), (0.972, 0.979) and (0.911, 0.919), respectively. In summary, compared with CNN and CNN++, the proposed method handles the different distortion types well.

Conclusion
In this paper, a multi-scale blind image quality predictor based on pyramidal convolution is proposed to solve the NR-IQA problem. It comprises two tasks: a quality prediction task and a distortion recognition task. With the introduction of the distortion recognition task, the accuracy of quality prediction is further improved. In addition, to enhance the learning ability of the network, pyramidal convolution is adopted in the backbone feature extractor to extract multi-scale features. Extensive experiments on three widely used IQA databases, LIVE, TID2013 and CSIQ, demonstrate the effectiveness of the proposed method for quality prediction and distortion recognition.

Funding Statement:
The author(s) received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest.