Magnetic Resonance Image Quality Assessment by Using Non-Maximum Suppression and Entropy Analysis

An investigation of diseases using magnetic resonance (MR) imaging requires automatic image quality assessment methods able to exclude low-quality scans. Such methods can also be employed to optimize the parameters of imaging systems or to evaluate image processing algorithms. Therefore, in this paper, a novel blind image quality assessment (BIQA) method for the evaluation of MR images is introduced. It is observed that the result of filtering using non-maximum suppression (NMS) strongly depends on the perceptual quality of an input image. Hence, in the method, the image is first processed by the NMS with various levels of acceptable local intensity difference. Then, the quality is efficiently expressed by the entropy of a sequence of extrema numbers obtained with the thresholded NMS. The proposed BIQA approach is compared with ten state-of-the-art techniques on a dataset containing MR images and subjective scores provided by 31 experienced radiologists. The Pearson, Spearman, and Kendall correlation coefficients and the root mean square error for the method assessing images in the dataset were 0.6741, 0.3540, 0.2428, and 0.5375, respectively. The extensive experimental evaluation of the BIQA methods reveals that the introduced measure outperforms related techniques by a large margin as it correlates better with human scores.


Introduction
Advances in imaging have attracted significant attention from medical specialists because of the role that the quality of displayed content plays in diagnosis [1][2][3]. The quality of Magnetic Resonance (MR) images depends on the hardware used, software techniques, as well as human errors involving [...] The extrema represent a set of filtered versions of an input image. Then, entropy is used for quality prediction.
The major contributions of this work are a novel method for the quality assessment of MR images and a comprehensive evaluation of the measure against the state-of-the-art IQA techniques on a dataset of MR images assessed by a large group of experienced radiologists.
The remainder of this paper is organized as follows. In Section 2, the approach is introduced. Then, in Section 3, it is evaluated against the related BIQA methods. Finally, in Section 4, the paper is concluded.

Proposed Image Quality Measure
In the introduced method, ENMIQA, an input image I is filtered to determine the pixels that represent local intensity extrema. To determine which pixels should be selected, the NMS operation [21,22] is performed. However, to provide a more thorough examination than simply selecting pixels whose intensity is greater or lesser than that of their surrounding neighbors, a sequence of intensity thresholds T = [1, 2, ..., S], S ∈ Z+, is introduced in this work. The NMS uses the threshold t ∈ T to indicate the local extrema. Consequently, for each threshold t, image I is represented by the number of found local extrema I(t). This can be written as:
$$I(t) = \sum_{a=1}^{M} \sum_{b=1}^{N} T(a, b, t), \tag{1}$$
where a pair (a, b) denotes the pixel location within an image of the size M × N and T(a, b, t) is a test in which the NMS is calculated using the proposed threshold t. The test is defined by Equation (2), in which (i, j) ∈ {(0, 1), (0, −1), (1, 0), (−1, 0)}. The pair of indices (i, j) forms the neighborhood of 3 × 3 pixels around the location (a, b). Finally, a sequence of sums I(T) = [I(t = 1), I(t = 2), ..., I(t = S)] is obtained. Then, it is divided by the image size to normalize the values.
To determine the quality of the input image I, the entropy of I(T) is calculated. Entropy is the fundamental concept of Shannon information theory [23,24]. It is usually considered in the framework of measure theory. Assuming that a space X with a probabilistic measure µ and a countable partition $\mathcal{P}$ of X are given [25], the entropy h is:
$$h(\mathcal{P}) = \sum_{P \in \mathcal{P}} s(\mu(P)), \tag{3}$$
where s: [0, 1] → [0, ∞) can be expressed as s(x) = −x log x for 0 < x ≤ 1 and s(0) = 0. Note that entropy equals zero if and only if there exists P ∈ $\mathcal{P}$ such that µ(P) = 1. If X contains R elements, then $\mathcal{P} = \{P_1, ..., P_R\}$. Furthermore, if µ is based on the counting measure, then Equation (3) has the following form:
$$h(\mathcal{P}) = -\sum_{i=1}^{R} k_i \log k_i, \tag{4}$$
where $k_i = m_i / m$, and $m_i$ and m are the numbers of elements in $P_i$ and X, respectively. Entropy defined by Equation (4) reaches its maximum for the uniform distribution of the measure µ on the family $\mathcal{P}$.
Entropy defined in this way refers to the amount of information on (X, µ) introduced by P. Consequently, the inversely proportional relationship between entropy and information is often applied in practice.
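The two properties of Equation (4) stated above (zero entropy when one cell of the partition holds all elements, and maximum entropy for the uniform distribution) can be checked numerically. The following is a minimal sketch, not the authors' code; the function name `counting_entropy` and the use of the natural logarithm are assumptions, since the text does not fix the base of the logarithm.

```python
import math

def counting_entropy(counts):
    """Entropy of Equation (4): h = -sum(k_i * log(k_i)), with k_i = m_i / m.

    `counts` holds the cell sizes m_i; the natural logarithm is an assumption.
    """
    m = sum(counts)
    if m <= 0:
        raise ValueError("counts must contain at least one element in total")
    h = 0.0
    for m_i in counts:
        if m_i > 0:                      # s(0) = 0 by definition
            k_i = m_i / m
            h -= k_i * math.log(k_i)
    return h

uniform = counting_entropy([5, 5, 5, 5])       # uniform case, equals log(4)
degenerate = counting_entropy([20, 0, 0, 0])   # one cell holds everything, equals 0
skewed = counting_entropy([17, 1, 1, 1])       # lies strictly between the two
```

With R = 4 cells, the uniform case gives log 4, the degenerate case gives 0, and any other distribution falls in between.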
In this paper, entropy analysis is used for the IQA of two-dimensional MR images. In such a context, it can be employed for measuring disorder. In MR scans of internal organs, single isolated impulses with higher or lower intensity relative to their local neighborhood are common in distorted images. Thus, the greater the value of the threshold t in the NMS, the greater the probability that the detected intensity irregularities are disorders that decrease the quality of an image. The observed discriminative capabilities of entropy regarding images of different qualities justify its use for the IQA of MR images. In this work, Equation (4) is directly used as a quality measure, assuming that a set X is expressed as {(I, t), t ∈ T} and T determines the partition of X. The main computational steps of the method are shown in Figure 1. As shown, the proposed method determines more extrema in images with more distortions.
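The computational steps described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the published Matlab implementation: the exact form of the test T(a, b, t) is assumed here to be "the pixel is above all considered neighbors by at least t, or below all of them by at least t"; border pixels are skipped; and the normalized sequence is rescaled to sum to one before Equation (4) is applied. The names `is_extremum`, `extrema_counts`, `sequence_entropy`, and `enmiqa_sketch` are hypothetical.

```python
import math

# Offsets of the full 3 x 3 neighbourhood (assumption; a 4-neighbour list also works).
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                (0, 1), (1, -1), (1, 0), (1, 1)]

def is_extremum(img, a, b, t, offsets):
    """Assumed test T(a, b, t): the pixel exceeds every neighbour by at least t,
    or lies below every neighbour by at least t."""
    centre = img[a][b]
    neigh = [img[a + i][b + j] for i, j in offsets]
    is_max = all(centre - v >= t for v in neigh)
    is_min = all(v - centre >= t for v in neigh)
    return is_max or is_min

def extrema_counts(img, S, offsets=NEIGHBOURS_8):
    """Return [I(1), ..., I(S)], each divided by the image size M * N."""
    rows, cols = len(img), len(img[0])
    counts = []
    for t in range(1, S + 1):
        n = sum(
            is_extremum(img, a, b, t, offsets)
            for a in range(1, rows - 1)      # border pixels skipped (assumption)
            for b in range(1, cols - 1)
        )
        counts.append(n / (rows * cols))
    return counts

def sequence_entropy(values):
    """Equation (4) applied to the sequence rescaled to sum to one (assumption)."""
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in values if v > 0)

def enmiqa_sketch(img, S=30):
    return sequence_entropy(extrema_counts(img, S))

# Toy input: a flat 7 x 7 image and a copy with two isolated impulses.
flat = [[100] * 7 for _ in range(7)]
noisy = [row[:] for row in flat]
noisy[2][2] = 130    # impulse 30 levels above its surroundings
noisy[4][4] = 95     # impulse 5 levels below its surroundings

flat_counts = extrema_counts(flat, 10)     # no extrema at any threshold
noisy_counts = extrema_counts(noisy, 10)   # two extrema for t <= 5, one afterwards
flat_score = enmiqa_sketch(flat, S=10)
noisy_score = enmiqa_sketch(noisy, S=10)
```

In the toy example, the weaker impulse stops being counted once t exceeds its contrast of 5, which illustrates how the sequence I(T) encodes the strength of isolated irregularities.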

Results and Discussion
In this section, a dataset that contains MR images with associated subjective scores is introduced. Then, the performance of ENMIQA against ten state-of-the-art related methods is evaluated using a typical methodology and discussed. Finally, the influence of parameters of ENMIQA on its performance is provided.

Experimental Data
The introduced ENMIQA and related techniques are evaluated on a dataset that contains MR images and subjective scores collected in tests with human subjects. The dataset consists of 70 T2-weighted MR images (T2w) extracted from the lumbar and cervical spine, brain, hip, knee, and wrist sequences in axial, sagittal, and coronal planes. The sequences were obtained for a group of 51 patients aged 27-41 years (26 men and 25 women). The study protocol was designed according to the guidelines of the Declaration of Helsinki and the Good Clinical Practice Declaration Statement. Data safety was ensured by removing the personal details from images. Written approval for conducting the study was obtained from the Ethics Committee of Jagiellonian University (no. 1072.6120.15.2017). To produce images with different quality for the IQA purposes, shortened sequences were acquired using Process Analytical Technology (PAT) I software (Siemens) and employing the GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA) 3 in which 25% of the echoes were acquired with 60% signal reduction regarding the original acquisition mode [26,27]. Then, images with distortion types that were not present in all examined body parts were rejected. The obtained dataset is characterized in Table 1. There are 15, 9, and 11 image pairs captured in sagittal, axial, and coronal planes, respectively. The size of the images ranges from 192 × 320 to 512 × 512. The subjective scores for images were collected from a group of 31 experienced radiologists with more than six years of diagnostic reading residency. Each radiologist assessed two images of the same part of the body at once, spending a minute on the assessment of the pair. The images were scored from 1 to 5, with a higher score associated with better quality. The examination was repeated until all images in the dataset were assessed. Then, scores for images were averaged and the mean opinion score (MOS) was obtained.
The number of radiologists that took part in the subjective tests was large enough to ensure that personal quality preferences do not impair the MOS. However, the number of images in the database depended on the number of medical professionals and the time spent on the examination. Exemplary images from the dataset can be seen in Figure 3.

Evaluation Methodology
According to the popular protocol for the performance evaluation of IQA measures, objective scores Q for images in a database are compared with subjective scores (i.e., MOS) S collected for them in tests with human subjects. Typically, four criteria are used to characterize an IQA measure [28]: Pearson Linear Correlation Coefficient (PLCC), Spearman Rank order Correlation Coefficient (SRCC), Kendall Rank order Correlation Coefficient (KRCC), and Root Mean Square Error (RMSE). The PLCC and RMSE are calculated for the vector $Q_p$ obtained via a nonlinear mapping between objective scores Q and subjective scores S using fitted parameters of the regression model β = [β1, β2, ..., β5], i.e.,
$$Q_p = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp(\beta_2 (Q - \beta_3))} \right) + \beta_4 Q + \beta_5.$$
The PLCC is obtained as:
$$\mathrm{PLCC} = \frac{\bar{Q}_p^{T} \bar{S}}{\sqrt{\bar{Q}_p^{T} \bar{Q}_p \, \bar{S}^{T} \bar{S}}},$$
where $\bar{Q}_p$ and $\bar{S}$ are mean-removed vectors. The SRCC is calculated as:
$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{m} d_i^2}{m (m^2 - 1)},$$
where $d_i$ is the difference between the ranks of the i-th image in the two vectors of scores and m denotes the number of images in the dataset. The KRCC is obtained as:
$$\mathrm{KRCC} = \frac{m_c - m_d}{0.5\, m (m - 1)},$$
where $m_c$ and $m_d$ are the numbers of concordant and discordant pairs, respectively. The RMSE, in turn, is obtained as:
$$\mathrm{RMSE} = \sqrt{\frac{(Q_p - S)^{T} (Q_p - S)}{m}}.$$
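The four criteria can be computed in a few lines. The sketch below is illustrative only: it assumes the commonly used five-parameter logistic form for the nonlinear mapping, omits the fitting of β (which would require a nonlinear least-squares routine), and uses made-up toy scores. All function names are hypothetical.

```python
import math

def logistic_map(q, beta):
    """Five-parameter logistic mapping (assumed form) applied to one score q."""
    b1, b2, b3, b4, b5 = beta
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (q - b3)))) + b4 * q + b5

def plcc(x, y):
    """Pearson correlation of two equally long score vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    return num / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            r[order[k]] = avg
        pos = end + 1
    return r

def srcc(x, y):
    """Spearman coefficient via rank differences d_i (exact when there are no ties)."""
    m = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (m * (m * m - 1))

def krcc(x, y):
    """Kendall coefficient from concordant and discordant pairs."""
    m = len(x)
    conc = disc = 0
    for i in range(m):
        for j in range(i + 1, m):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (0.5 * m * (m - 1))

def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy data (made up): objective scores Q and subjective scores S for five images.
Q = [0.2, 0.5, 0.9, 0.4, 0.7]
S = [1.5, 2.0, 4.5, 3.0, 3.5]
identity_beta = (0.0, 1.0, 0.0, 1.0, 0.0)   # placeholder, NOT fitted parameters
Qp = [logistic_map(q, identity_beta) for q in Q]

toy_srcc = srcc(Q, S)   # one swapped pair of ranks
toy_krcc = krcc(Q, S)   # nine concordant pairs, one discordant pair
toy_plcc = plcc(Qp, S)
toy_rmse = rmse(Qp, S)
```

The rank-based criteria (SRCC, KRCC) are computed on the raw scores Q, whereas PLCC and RMSE use the mapped scores, which is why only the latter two depend on β.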

Comparative Evaluation
The ENMIQA is compared against the following ten related BIQA measures: SNRTOI [18], BPRI [29], ILNIQE [30], QENI [31], SISBLIM [32], metricQ [33], SSEQ [34], SINDEX [35], MEON [36], and DEEPIQ [37]. The SNRTOI [18] was implemented by the authors of this paper, while the other methods were run using their publicly available Matlab implementations. All compared methods, similarly to ENMIQA, do not require training. However, MEON and DEEPIQ represent recently introduced deep learning approaches and are already trained by their authors. The ENMIQA was run with S = 30 in the experiments, and the other measures used their default parameters. In cases in which a method was designed to process color images, three identical channels were used as an input. The performance of the methods and their approaches to image quality modeling and prediction are shown in Table 2. As reported, the measure introduced in this paper, ENMIQA, outperforms related techniques by a large margin in terms of all four performance indices. Depending on the considered index, it is followed by SISBLIM (PLCC and RMSE) and DEEPIQ (SRCC and KRCC). To show the performance of the measures for images of body parts largely represented in the database, the PLCC calculated for their subsets is reported in Figure 4. Here, ENMIQA obtains a greater PLCC than the remaining methods for images of the lumbar and cervical spine, knee, shoulder, and wrist. It is slightly worse than BPRI for brain images. Interestingly, it seems that the recently introduced BPRI is suitable for such images, despite being the second-worst technique regarding the entire database and the fourth-best technique in the ranking based on the individual body parts.
The worse results obtained by methods designed for the assessment of natural images, as well as by complex deep learning approaches, can be justified by the specifics of MR images in which a large portion of the area is covered by organs or tissue while the background is usually dark and may contain noise. In natural images, such empty or nearly empty spaces are seldom found. Furthermore, popular BIQA methods are often trained to recognize typical distortion types (e.g., BPRI, ILNIQE, MEON, SSEQ, or DEEPIQ). Interestingly, methods trained on images contaminated with Gaussian noise can, to some extent, correctly predict the quality of MR images since Gaussian noise manifests itself in magnitude images as a Rician distribution of pixel intensities [38]. This is confirmed by the weaker performance of the SNRTOI, which, being an SNR derivative, is often used by radiologists as supporting information on the captured images. The reported results for other methods seem to justify the need for the development of measures designed for the IQA of MR images. To evaluate the statistical significance of the obtained errors in the prediction of IQA methods, hypothesis tests based on the prediction residuals of each IQA measure after non-linear mapping were conducted using the F-statistic [28]. The F-test is based on an assumption of the Gaussianity of residuals and determines whether the two compared sample sets come from the same distribution, based on the ratio of their variances. The test is often used for the comparison of IQA measures [28]. Therefore, the Jarque-Bera (JB) statistic was first used to determine whether the residuals come from a normal distribution [39]. In the JB test, the null hypothesis is that the vector of residuals of an NR measure follows a normal distribution, while the alternative hypothesis is that it does not. Since the null hypothesis was not rejected at the 5% significance level for any of the compared measures, the F-statistic could be reliably employed.
In the F-test, the null hypothesis is that the vectors of residuals of two IQA measures come from the same distribution with the same variance and are statistically indistinguishable (95% confidence). The alternative hypothesis is that the vectors are statistically distinguishable and have different variances. Before the calculation of the F-statistic, a vector of residuals of a measure was used to fit a normal distribution and 1000 samples were drawn from it. The tests revealed that the residual variance of ENMIQA is statistically smaller than those of all compared IQA methods with confidence greater than 95%. This is also indicated by the ratio in all cases. The obtained JB statistics for measures and ratios of the residual variances of algorithms to the ENMIQA are presented in Table 3.
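The two statistics involved in this procedure can be sketched as follows. This is a simplified illustration with made-up residuals, not a reproduction of the reported tests: it does not implement the resampling of 1000 values from a fitted normal distribution, it does not compute F-distribution critical values, and it relies on the asymptotic chi-square distribution (two degrees of freedom) of the JB statistic, which may differ from the small-sample critical values used by statistical packages. The function names are hypothetical.

```python
import math

def jarque_bera(residuals):
    """JB = n/6 * (skewness^2 + (kurtosis - 3)^2 / 4), from biased sample moments."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

def variance_ratio(res_a, res_b):
    """Ratio of unbiased sample variances, var(res_a) / var(res_b)."""
    def var(r):
        mu = sum(r) / len(r)
        return sum((v - mu) ** 2 for v in r) / (len(r) - 1)
    return var(res_a) / var(res_b)

# For a chi-square distribution with 2 degrees of freedom the survival function
# is exp(-x / 2), so the critical value at level alpha is -2 * log(alpha).
JB_CRITICAL_5PCT = -2.0 * math.log(0.05)

# Made-up residual vectors of a compared measure and of a reference measure.
res_other = [-2.0, -1.0, 0.0, 1.0, 2.0]
res_reference = [-1.0, -0.5, 0.0, 0.5, 1.0]

jb_other = jarque_bera(res_other)
normality_not_rejected = jb_other < JB_CRITICAL_5PCT
ratio = variance_ratio(res_other, res_reference)   # above 1: reference has smaller variance
```

A ratio above one means that the residuals of the compared measure vary more than those of the reference; whether the difference is significant would still have to be decided against the F-distribution for the given sample sizes.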

Computational Complexity
The computational complexity of ENMIQA depends on the size of the processed image (N × M), the length of the sequence of thresholds S, and the size of the neighborhood used for the NMS (k = 3 × 3). Therefore, its computational complexity is of the order of $O(NMSk^2)$.
The introduced dataset was used to analyze the computational complexity of methods in terms of the average time taken to assess an image. The methods were run on a 2.2 GHz Intel Core CPU with 8 GB RAM using Matlab 2019b environment. Table 4 reports obtained timings. As shown, ENMIQA is slower than MEON, SINDEX, and SNRTOI, but it is faster than the remaining seven measures. The fastest methods (i.e., SINDEX and SNRTOI) are characterized by inferior IQA performance, and taking into account the results for more promising techniques, the introduced ENMIQA is relatively fast and provides the superior quality prediction of MR images.

Influence of Parameters
The ENMIQA is governed by the sequence of thresholds T = [1, 2, ..., S], S ∈ Z+, used by the non-maximum suppression. Therefore, it is worth determining how stable its performance is for various S. Here, S is the greatest threshold in the sequence and also indicates its length. The PLCC performance of the method on the entire database, with S ranging from 5 to 100 in steps of 5, is shown in Figure 5a. The previously introduced evaluation methodology was applied to the entire dataset to allow a coherent comparison with the already reported results of other IQA methods (see Section 3.3). S can be set between 20 and 60 without a visible drop in the prediction performance. Since ENMIQA exhibits stable performance across these values of S, the choice of S = 30 in the experiments is justified. The non-maximum suppression selects a pixel with the extreme value, taking into account its eight neighbors and the threshold t. Since a pixel has 8 neighbors, it is reasonable to use its full neighborhood (the size of 8). However, the suppression can be modified to accept a smaller number of neighboring pixels that are used to indicate the local extrema (see Equation (2)). Therefore, in Figure 5b, the impact of the number of neighbors on the PLCC results of ENMIQA is shown. Here, if the number of neighbors used while determining the local extrema is lower than 8, the performance of the method visibly deteriorates. Hence, the entire pixel neighborhood should be considered by ENMIQA with the NMS. Interestingly, even with a smaller neighborhood, the approach still offers promising performance.

Conclusions
In this work, a new BIQA measure for the evaluation of MR images is proposed. The method uses non-maximum suppression with a sequence of thresholds to detect local intensity extrema in MR images. A relationship between the number of extrema and entropy is investigated. Consequently, a new measure is introduced and experimentally validated against ten representative BIQA techniques on a database that contains MR images assessed by a large group of experienced medical professionals. The experimental comparison reveals that ENMIQA outperforms the state-of-the-art measures by a large margin in terms of four performance criteria, confirming its suitability for the quality prediction of MR images.
To facilitate the replicability of the reported findings, as well as the applicability of the measure, the Matlab code of ENMIQA and the dataset are available at http://marosz.kia.prz.edu.pl/ENMIQA.html.

Conflicts of Interest:
The authors declare no conflict of interest.