Image Quality Assessment for Magnetic Resonance Imaging

Image quality assessment (IQA) algorithms aim to reproduce the human's perception of the image quality. The growing popularity of image enhancement, generation, and recovery models instigated the development of many methods to assess their performance. However, most IQA solutions are designed to predict image quality in the general domain, with the applicability to specific areas, such as medical imaging, remaining questionable. Moreover, the selection of these IQA metrics for a specific task typically involves intentionally induced distortions, such as manually added noise or artificial blurring; yet, the chosen metrics are then used to judge the output of real-life computer vision models. In this work, we aspire to fill these gaps by carrying out the most extensive IQA evaluation study for Magnetic Resonance Imaging (MRI) to date (14,700 subjective scores). We use outputs of neural network models trained to solve problems relevant to MRI, including image reconstruction in the scan acceleration, motion correction, and denoising. Our emphasis is on reflecting the radiologist's perception of the reconstructed images, gauging the most diagnostically influential criteria for the quality of MRI scans: signal-to-noise ratio, contrast-to-noise ratio, and the presence of artifacts. Seven trained radiologists assess these distorted images, with their verdicts then correlated with 35 different image quality metrics (full-reference, no-reference, and distribution-based metrics considered). The top performers -- DISTS, HaarPSI, VSI, and FID-VGG16 -- are found to be efficient across three proposed quality criteria, for all considered anatomies and the target tasks.


Introduction
Image quality assessment (IQA) is a research area occupied with constructing accurate computational models to predict the perception of image quality by human subjects, the ultimate consumers of most image processing applications Wang (2011).
The growing popularity of image enhancement and image generation algorithms increases the need for a quality assess-tributions of thousands of images instead of gauging them individually. The popular new DB IQA methods include such metrics as Inception Score Salimans et al. (2016), FID Heusel et al. (2017), KID Bińkowski et al. (2018), MSID Tsitsulin et al. (2020), and many others. Despite being widely used, the DB metrics were neither included in the recent large scale general domain reviews Athar and Wang (2019); Manap and Shao (2015), nor in the medical onesMason et al. (2020).
IQA measures are applied to estimate the quality of image processing algorithms and systems. For example, when several image denoising and restoration algorithms are available to recover images distorted by blur and noise contamination, a perceptual objective IQA could help pick the one that generates the best perceptual image quality after the restoration. To do that reliably, Image Quality Metrics (IQMs) need to show a high correlation with the perceptual estimates of the quality reported by human subjects for a given image processing algorithm. However, IQA algorithms are often evaluated on non-realistic distortions, such as added noise or artificial blur-ringAthar and Wang (2019); Manap and Shao (2015); Mason et al. (2020). Such a discrepancy between the synthetic evaluation and the practical use may cause misleading results.
While most metrics are designed to predict image quality in the general domain, Magnetic Resonance Imaging (MRI) provides gray-scale data with the content and style noticeably different from the natural images. Hence, the applicability of the IQMs in the MRI domain must be validated.
Moreover, IQMs trained on natural images attempt to describe the overall perception of the quality of an entire scene. On the contrary, an MRI scan can be perceived as high-quality when specific characteristics, responsible for the scan's value, are deemed adequate. Those are the characteristics that assist expert radiologists in making their diagnostic decisions, including perceived level of noise (signal-to-noise ratio, SNR), perceived soft tissue contrast (contrast-to-noise ratio, CNR), and the presence of artifacts. These specific quality criteria are, unfortunately, coupled. For example, some denoising algorithms tend to introduce additional blurring (lowering of CNR) in exchange for increased SNR, and some motion correction approaches tend to introduce noticeable artifacts. Therefore, a more detailed evaluation of IQM's ability to express separate MRI quality criteria is required.

Related Work
The evaluation of metrics for IQA in the domain of natural images started from the early task-specific works that considered FR methods to characterize color displays and half-toning optimization methods Ahumada (1993).
More recent task-specific studies explored IQA for the images of scanned documents Kumar et al. (2013) and screen content Yang et al. (2015). Likewise, fused images Wang et al. (2021), smartphone photographs Fang et al. (2020), remote sensing data Ieremeiev et al. (2020), and climate patterns Baker et al. (2019) demanded the development of targeted IQA approaches. Historically, many of these works have been focusing on the quality degradation caused by the compression algorithms Avadhanam and Algazi (1999); Allen et al. (2007); Triantaphillidou et al. (2007); Baker et al. (2019), with relatively small datasets appearing publicly for the IQ evaluation. However, the small dataset size and the excessive re-use of the same test sets have led to the promotion of the IQMs poorly generalizable to the unseen distortions.
This was recognized as a major problem, stimulating the emergence of large-scale studies Athar and Wang (2019); Ding et al. (2021). Among the large-scale evaluations, the majority compared multiple FR metrics, ranging from just a handful Sheikh et al. (2006); Larson and Chandler (2010); Zhang et al. (2012); Mohammadi et al. (2014) to several dozens Athar and Wang (2019); Pedersen (2015) of IQMs analyzed on popular datasets.
The medical domain stands out from the others by a special sense of what is deemed informative and acceptable in the images Cavaro-Menard et al. (2010). Resulting from years of training and practice, the perception of medical scan quality by adept radiologists relies on a meticulous list of anatomyspecific requirements, on their familiarity with particular imaging hardware, and even on their intuition.
Given the majority of IQMs were not designed for the healthcare domain, some recent works were dedicated to the niche. One small-scale study considered a connection of IQA of natural and medical images via signal-to-noise (SNR) estimation Li et al. (2018). Others assessed common FR IQMs using nonexpert raters Rajagopal (2015); Chow et al. (2016). Sufficient for the general audience, these methods proved incapable of reflecting the fine-tuned perception of the radiologists Keshavan et al. (2019).
On the other hand, Crow et al. argue that NR IQA are preferable for assessing medical images because there may be no perfect reference image in the real-world medical imaging Chow and Paramesran (2016). To address this issue, several recent studies also propose new NR IQMs for MRI image quality assessment Jang et al.

Image Quality Metrics Considered
In this work, we evaluate the most widely used and publicly available general-purpose FR, NR, and DB IQMs to find the best algorithms for the quality assessment on arguably the most important MRI-related image-to-image tasks: scan acceleration, motion correction, and denoising. Instead of modeling the disrupted images, we use outputs of trained neural networks and compare them with the clean reference images from the fastMRI dataset.
For the comparison, we collect 14,700 ratings from 7 trained radiologists to evaluate the quality of reconstructed images based on three main criteria of quality: perceived level of noise (SNR), perceived soft tissue contrast (CNR), and the presence of artifacts, making this work the most comprehensive study of MRI image quality assessment to date.

Medical Evaluation
The key goal of this study is to evaluate popular selected IQMs on MRI data. Previous works Mason et al. (2020); Renieblas et al. (2017) evaluated the ability of certain IQMs to assess overall quality of data after to various types of artificial distortions 1 . However, in practice, the overall image quality rating may be insufficient due to its ambiguity: e.g., one could not truly interpret the reasons for poor or good scoring. At the same time, asking the medical experts these general questions may be challenging because of many factors, ranging from the specifics of certain clinical workflows to personal preferences.
In this work, we aspire to solve these problems by proposing the following study. First, we evaluate IQMs with regard to their ability to reflect radiologists' perception of the quality of distorted images, comparing them to the fully-sampled artifact-free ones. We range the metrics based on three IQ criteria that are crucial for making clinical decisions: perceived level of noise (SNR), perceived level of contrast (CNR), and the presence of artifacts. Second, instead of corrupting images with artificial perturbations, for the first time in the community, we validate these metrics using the actual outputs of deep learning networks trained to solve common MR-related tasks. As such, the artifacts originate from the imperfect solutions to the common real-world problems of motion correction, scan acceleration, and denoising.
A group of trained radiologists rated the quality of distorted images compared to the clean reference images on a scale from 1 to 4 for the three IQ criteria. Unlike the five-point Likert scale, the simplified scale balances the descriptiveness of the score with the noise in the votes of the radiologists. Our mock experiments showed that the respondents considered the selection between too many options difficult, with the five-point scale having a diluted difference between the options; whereas, the three-point scale was deemed insufficient.
After the evaluation, the aggregated results were compared with the values of selected IQA algorithms to identify the top performers -the metrics that correlate the highest with the radiologists' votes.

Image Library Generation
As a data source, we use the largest publicly available repository of raw multi-coil MRI k-space data -the FastMRI dataset, containing the knee and the brain scans Knoll et al. (2020); Zbontar et al. (2019). The knee subset of FastMRI contains 1,500 fully sampled MRIs acquired with a 2D protocol in the coronal direction with 15 channel knee coil array on 3 and 1.5 Tesla Siemens MRI machines. The data consists of approximately equal number of scans acquired using the proton density weighting with (PDFS) and without (PD) fat suppression pulse sequences with the pixel size of 0.5 mm × 0.5 mm and the slice sickness of 3 mm.
The knee subset is divided into 4 categories: train (973 volumes), validation (199 volumes), test (118 volumes), and challenge (104 volumes). Only the multi-coil scans were selected for this study, omitting the single-coil data. The brain subset includes 6,970 1.5 and 3 Tesla scans collected on Siemens machines using T1, T1 post-contrast, T2, and FLAIR acquisitions. Unlike the knee subset, this data are of a wide variety of reconstruction matrix sizes. For the purpose of de-identification, authors of the dataset limited the data to only 2D axial images, and replaced k-space slices 5 mm below the orbital rim with zero matrices. The brain subset is divided into 6 categories: train (4,469 volumes), validation (1,378 volumes), test 4× (281 volumes), test 8× (277 volumes), challenge 4× (303 volumes), and challenge 8× (262 volumes).
Starting with the clean knee and brain data, we fist generate images corrupted with three types distortions: scan acceleration, motion, and noise. The examples of the distorted images are presented in Fig. 1. After that, we train two reconstruction models for each type of distortions using PyTorch Paszke et al. (2019) for 2 and 4 epochs respectively. The reduced training time was a conscious choice, enabling the model to produce some reconstruction errors. More specifically, we interrupted the training when the 90% of the loss plateau is reached, which allows for good performing models with imperfections

Reference
Scan Acceleration Motion Noise Fig. 1. Distortions introduced to initial artefact-free scans during training and inference. Using the raw k-space data of the reference images, we undersample them with the acceleration factor of 4, impose rigid motion of a moderate amplitude, and introduce mild Gaussian noise. Note how the distortions differ from those in the Natural Images, on which the common IQMs were developed. We adjusted brightness for viewer's convenience.
we wanted to test for 2 . Finally, we use the trained models to reconstruct the corrupted images in the validation subset of the fastMRI, from which we generate the labeling dataset.

Scan Acceleration
Scan-acceleration data are generated from the ground truth images by undersampling the k-space data. To train the model, we selected only T1 weighted scans (T1, T1-PRE and T1-POST) from the train category of the FastMRI brain data. The same subset of data was used for training of motion correction and denoising models. The k-space data were subsampled using a Cartesian mask, where k-space lines are set to zero in the phase encoding direction. The sampled lines are selected randomly, with the total sampling density depending on the chosen acceleration rate. Following the data generation process from the FastMRI challenge Knoll et al. (2020), all masks are fully sampled in the central area of k-space (the low frequencies). For the 4× accelerated scans, this corresponds to 8%, and for the 8× acceleration, it equals to 4%. Besides making the reconstruction problem easier to solve, such lines allow computing the low-pass filtered versions of the images for assessing the coil sensitivity maps.
To compensate for the undersampling, we used the 2019 FastMRI challenge winner Adaptive-CS-Net model Pezzotti et al. (2020). Based on the Iterative Shrinkage-Thresholding Algorithm (ISTA) Beck and Teboulle (2009), this model consists of several trainable convolutional multi-scale transform blocks between which several prior knowledge-based computations are implemented. For scalability reasons and without substantially impacting the reconstruction results, in this study, we trained a simplified light-weight version of the Adaptive-CS-Net model. The resulting model consists of only 10 trainable blocks and 267k parameters. Unlike the full Adaptive-CS-Net model with three MRI-specific physics-inspired priors, the simplified version has only one prior module between the reconstruction blocks -the soft data consistency step. Specifically, the update for the block B i+1 in the simplified Adaptive-CS-Net model is defined as follows: 2 It is a standard way to broaden the image distribution from which the samples are drawn for evaluation and voting (e.g., see Effland et al. (2019)). where x i denotes the i-th estimate of reconstruction, U andÛ are the multi-scale transform and its inverse that consist of 2D convolutions and a nonlinearity in the form of Leaky-ReLU. The feature maps produced at the different scales are thresholded using the soft-max function soft(·) 3 , parameterized by a learned parameter λ s, f s for each feature channel f s and scale s. In Eq. 1, the soft data consistency step e i is defined as follows: where F and F −1 denote Fourier transform and its inverse, My is the data measured with the sampling mask M.
We trained the simplified Adaptive-CS-Net model using RM-Sprop optimizer Tieleman and Hinton (2012) to minimize L1 loss function between the reconstruction estimate and the ground truth image obtained from the fully sampled data. We used a step-wise learning rate decay of 10 −4 and the batch size of 8 to reconstruct the data for various acceleration factors (from 2× to 8×).

Motion Correction
The in-plane motion artifacts, including rigid translation and rotation, were introduced into the Fourier-transformed data following the procedure described in Sommer et al. (2020). For each input image, the assumed echo-train length of the turbo spin-echo readout was chosen randomly in the 8-32 range. Similarly, the assumed extent of zero-padding in k-space was chosen randomly in the range of 0-100. The motion trajectories (translation/rotation vectors as a function of scan time) were generated randomly to simulate the realistic artifacts. In this study we utilized the protocol for "sudden motion" simulation. Here, the subject is assumed to lie still for a large part of the examination, until a swift translation or rotation of the head occurs. The time point of the sudden motion was taken randomly as a fraction of the total scan time in the range of one-third to seven-eighths. The maximum magnitude of the motion was chosen randomly from the range of [1,4] pixels for the translation and [0.5,4.0] degrees for the rotation artifacts. The center of rotation was also varied randomly in the range of [0,100] pixels in each direction. These parameter ranges were selected empirically to generate a large variety of realistic artifacts and were used consistently in the training and in the validation runs.  Fig. 2. Formation of Labeling and Re-labeling datasets for annotation (left) and the content of each group of images for the medical evaluation and labeling by radiologists in the Labeling dataset (right). Starting from the clean validation data, we first generate corrupted data with the acceleration artefacts, the motion artefacts, and the Gaussian noise. Then, we reconstruct the corrupted data using trained neural network models and randomly select scans to form labeling and re-labeling pairs for the experts to grade.
To compensate for the motion artefacts of various extent, we trained U-Net models Ronneberger et al. (2015) with 209k parameters. While more advanced architectures exist, we found the basic U-Net to be more than sufficient for the scope of the proposed IQA study, as it is enough to capture imperfections which are often generated by deep learning models.
The model received the motion corrupted data as the input and learned to predict the motion artefacts in a residual manner, i.e., the output of the model was a predicted image of motion in the input data. The model was trained to minimise L1 loss between the ground-truth and the predicted residual with Adam Kingma and Ba (2015) optimizer using the step-wise learning rate decay of 10 −4 and the batch size of 8. Preserving the same nature of artifacts, we trained our models for a range of amplification factors (from 1 to 3). For that, throughout the training, the motion amplitude was scaled by the amplification factor, yielding a consistently diverse appearance of the motion artefacts that could be met in practice.

Denoising
In our study, noisy magnitude images are generated from the complex k-space data with the Gaussian distribution taken as the representative noise model. Below, the standard deviation of the Gaussian noise is reported for a region of interest in the background of the magnitude image, as proposed in Sijbers et al. (2007). The parameters of the noise distribution for each volume are drawn from the last slice of this volume. Then, the Gaussian noise with the estimated distribution parameters is generated, scaled by an amplification factor, and added to all images of the volume. We used the amplification factor of 2 for the training and the amplification factors of 1, 2, and 3 for the test data generation to enrich the variety of the tested image qualities in the resulting dataset.
To compute the denoised images, we trained DnCNN models Dong et al. (2015) with 556k parameters on the brain multicoil train data using the RMSprop optimizer Tieleman and Hinton (2012) and a step-wise learning rate decay of 10 −4 with the batch size of 8. Similarly to the other tasks considered herein, we are not looking for the most powerful denoising algorithms but consider a very commonplace model DnCNN instead, merely to rank the modern IQA metrics for the specific task of denoising.

Final Dataset for Labeling
We started the formation of the labeling dataset from the clean volumes from the validation subsets of brain and knee FastMRI datasets; hence, these scans were not used to train the artefacts correction models. In total, both validation subsets contain 1,577 volumes, resulting in 28,977 images: 199 knee volumes with 7,135 slices and 1,378 brain volumes with 21,842 slices. In each brain volume, the lower 2 and top 3 slices were discarded to restrict the analysis to clinically relevant parts of the scan. In each knee volume, the first 3 slices were discarded for the same reason. To limit the number of data points and decrease the overall variability of data types, we selected only T1-weighted (T1, T1-PRE and T1-POST) brain volumes and proton-density weighted without fat suppression (PD) knee volumes.
The data generation pipeline is summarized in Fig. 2.
Using the selected subset of clean validation data, we simulated images for the reconstruction: • For the scan acceleration task, we simulated acceleration artefacts for undersampling rates of 2×, 4×, and 6×, following the data generation process from the FastMRI challenge; • For the motion correction task, we simulated motion artefacts of three different strengths using the rigid motion simulation framework described above; • For the denoising task, we simulated Gaussian noise with amplification factors of 1, 2 and 3 using the noise generation procedure described above.
After that, all generated corrupted data were reconstructed using the reconstruction models trained for the corresponding tasks. Note that we deliberately generated a fraction of data with parameters different from the ones used to train the reconstruction models. We found this approach yields various levels of artefacts typically appearing after the reconstruction process.
From the large pool of reconstructed images, we select 100 pairs of images (clean -reconstructed) for each task (scan acceleration, motion correction, denoising) and each anatomy (knee, brain), evenly distributing the data to represent each reconstruction parameter (e.g., the acceleration rate for the scan acceler-ation task). This strategy results in the labeling dataset of 600 pairs of images in total (3 tasks × 2 anatomies).
To reach the goal labeling dataset size, we utilized the following data selection procedure: 1. Compute values of IQMs for all reconstructed images (for NR IQMs) or image pairs (for FR IQMs); 2. Normalize each IQM value to [0, 1]; 3. Compute variance between IQM values for all items; 4. Sort all items by the value of variance; 5. Select 25% of data for each task-anatomy combination from the data items with the highest variance, assuming that items with the biggest disagreement between IQMs are the most informative; 6. Select the rest 75% of data pseudo-randomly (preserve distribution of reconstruction parameters) to avoid introducing any bias from the variance computation.
Lastly, we deliberately duplicated 100 of the 600 prepared items for the purpose of verification of radiologists' selfconsistency, resulting in 700 image pairs to be labelled by each radiologist.

Experiment Setup
Within the paradigm of the model observer framework Barrett et al. (1993), the quality of a medical image can be defined as how well a clinical task (e.g., diagnostics) can be performed on it He and Park (2013). This means that the perfect MRI IQM would be some task-based score, such as the diagnostic accuracy. However, such a metric is difficult to implement due to a great diversity of diagnostic outcomes that radiologists deal with in practice. Because of that, the convention is to use a subjective estimation of the overall diagnostic value instead Mason et al. (2020).
However, we argue that a single score is not sufficient to reflect the abundance of anatomies, pathologies, and artefactual cases that the radiologists work with. Instead, we propose to subdivide the score of the overall diagnostic quality into three main criteria that can be important for a clinical practitioner to make their decision: i) perceived level of noise, ii) perceived level of soft-tissue contrast, and iii) presence of artefacts.

Subjective Evaluation
Seven trained radiologists with 7 to 20 years of experience took part in this study. The participants were asked to score pairs of reconstructed-reference images using three main IQ criteria. For each image pair and each criterion, radiologists scored the perceived diagnostic quality of the reconstructed image compared to the ground-truth using a four-point scale: not acceptable (1), weakly acceptable (2), rather acceptable (3), and fully acceptable (4). The four-point scale was selected over the five-point Likert scale, previously used in Mason et al. (2020).
Each participant performed the labelling individually using a dedicated instance of the Label Studio Tkachenko et al. (2020) software accessible via a web interface. The experts were asked to make all judgments about the image quality with regard to a particular diagnostic task that they would normally perform in their practice (e.g., the ability to discriminate relevant tissues, the confidence in using the image to detect a pathology, etc.). The interface provided additional functionality of scaling (zooming) the images to closer mimic the real-life workflow. The pairs of images were displayed in a random order until all pairs were labelled. Participants had an opportunity to re-label the pairs they have already scored at any point until the experiment is finished.
During the main part of the experiment, each participant labelled 600 pairs of images based on the 3 quality criteria, resulting in 4,200 annotated pairs and 12,600 labels in total. The results of the main labelling session were used for further evaluation of the IQMs. After finishing the main part of the experiment, the participants were asked to additionally label 100 randomly selected pairs from the same dataset, yielding additional 2,100 labels. The results of this additional re-labelling were used to evaluate the self-consistency of each annotator.

Metrics Computation
Unlike FR and NR IQMs, designed to compute an imagewise distance, the DB metrics compare distributions of sets of images. This makes them less practical for traditional IQA, the goal of which is to compute a score for a given image pair. Moreover, the need to have sets of images hinders the votebased evaluation via the mean subjective opinion scores.
To address these problems, we adopt a different way of computing the DB IQMs. Instead of extracting features from the whole images, we crop them into overlapping tiles of size 96 × 96 with stride = 32. This pre-processing allows us to treat each pair of images as a pair of distributions of tiles, enabling further comparison. The other stages of computing the DB IQMs are kept intact.

Data Analysis
Here, we adapt the analysis of the scoring data proposed in Sheikh et al. (2006) to the multiple IQ criteria. The voting scores for each scoring criteria are not analyzed in their raw format. Instead, they are converted to z-scores (averaged and re-scaled from 0 to 100 for each radiologist to account for their different scoring): where µ mk and σ mk are the mean and the standard deviation of the difference scores of the m th radiologist on the k th scoring criteria, and D nmk are the difference scores for n th degraded image defined as follows: In Eq. (4), s mk,ref is the raw score of the m th radiologist on the k th scoring criteria for the reference image corresponding to the n th degraded image, and s nmk is the raw score of the m th radiologist on the n th degraded image on the k th scoring criteria. Note that in this study, the radiologists were asked to perform pair-wise comparison between degraded and reference images. Hence, it is possible to treat the raw labelling scores as the difference scores D nmk .   Table 1 for numerical values.

MR-artefacts
After standardizing the expert votes by Eq. (3), their correlation statistics with each IQM were computed in the form of SRCC and KRCC coefficients, defined as follows: where d i is the difference between the i-th image's ranks in the objective and the subjective ratings and n is the number of observations.
where (x 1 , y 1 ), ..., (x n , y n ) are the observations: the objective and the subjective score pairs. We use SRCC as the main measure of an IQM performance, due to the non-linear relationship between the subjective and the objective scores 4 .
The sizes of each batch of data are described in Fig. 2 (right). 4 The non-linear relationship is evident in Fig. 4 below.
A non-linear regression was performed on the IQM scores according to the quality Q to fit the subjective votes: where x are the original IQM scores and β 1 , . . . , β 5 are the fitting coefficients. Table 1 summarize the correlation study between the radiologists' scores and the IQM values for the three proposed evaluation criteria. The figures also show the results for the natural image domain. Top 4 performers in each category are marked in bold. The best and the worst examples of the reconstructions, as judged by different metrics, are presented in Fig. 5, and the aggregate scores for the top-performing metrics in each application in Fig. 6.

Discussion
The visual inspection of the outputs of the models in Fig. 5 makes it evident how the top metrics are superior in reflecting Anatomy knee brain Fig. 4. Relationship between processed subjective scores and IQM values for 3 evaluation criteria, 3 target tasks, and 2 anatomies (600 annotated image pairs in total). The solid lines are fits, plotted using the non-linear regression (7) on the subsets of images split by the tasks. The top 4 metrics (along with PSNR and SSIM, as the most commonplace) are shown in the decreasing order left to right, using SRCC to gauge the performance.

Worst Reconstruction
Reference Fig. 5. The best and the worst reconstruction-reference pairs according to different metrics (their values are shown in yellow). Note how the top 4 metrics (first four columns) reflect the actual reconstruction quality better than PSNR and SSIM (which are prone to misjudging a simple shift of brightness or a blur). The brightness is adjusted for viewer's convenience.
the actual reconstruction quality over the conventional PSNR and SSIM. The latter are known to misjudge shifts of brightness or a blur, indicating high quality for the bad images, whereas the more advanced FR and DB IQMs correlate with the visual perception and the subjective scores ably. Henceforth, out of the 35 metrics considered, we only discuss the best ones, according to their rank in the correlation study (VSI, HaarPSI, DISTS, FID VGG16 ) and the widely used PSNR and SSIM.
As the key observation in the first systematic study of the DB metrics, we affirm that the choice of the feature extractor plays a crucial role. In particular, the correlation scores show that the Inception-based features are almost always worse than those from VGG16 (except for the MSID metric). Moreover, we see that, despite having been designed for the evaluation of realism of generative models data, FID shows competitive SRCC scores, thus, becoming a new recommended metric for the MRI image assessment tasks. The non-linear relationship between the subjective and the objective scores, seen in Fig. 4, portrays intricate behavior with evident dependence on the anatomy and the target task, as well as a clear clustering of the points, instrumental for selecting a proper metric in a particular application. Notable are the generally lower IQM correlation scores when the difficulty of the reconstruction routine increases (compare trends in the scan acceleration data to those in the more complex denoising and the motion correction models). Also, the evaluation values for the knee reconstruction are generically lower, which could be caused by the greater variety of anatomical structures present in the knee data, as well as the more strict pertinent medical evaluation criteria Keshavan et al. (2019). Fig. 6 aggregates the outcomes per each task, anatomy, and evaluation criteria studied in our work, with the relation be- . Aggregate relationship between the objective and the subjective scores for 3 evaluation criteria (rows), 2 anatomies (columns), and 3 tasks: scan acceleration ( ), denoising ( ), and motion correction ( ). The IQMs are ordered by decreasing average SRCC for the artefacts criterion on the brain data. This order is kept throughout all results for consistency. Note the tendency of the metrics to perform poorly in some task-anatomy combinations, e.g., in denoising the brain data.
tween the subjective and the objective scores highlighting the differences in the average performance of the top metrics. Notably, these selected IQMs have the highest correlation with expert judgment in the scan acceleration task. However, all metrics equally struggle reflecting the opinion of the radiologists in denoising and, sometimes, in motion correction tasks, especially on brain data. We also observe that some metrics perform consistently in terms of all three evaluation criteria and all tasks for given anatomy. For instance, GMSD and DISTS, despite not being of the highest SRCC rank overall, still show consistently high correlation scores on knee data, which proffers both of them as universal choices for the IQA in orthopedic applications. On the other hand, HaarPSI consistently rates the highest for both anatomies in the scan acceleration task, an instrumental fact to know when a single machine is used to scan various body parts or when the pertinent cross-anatomy inference Belov et al. (2021) is performed.

Natural vs. MRI Images
A frequent IQA-related question is how generalizable are the performance benchmarks across different datasets and image domains. To study that, we analyzed the applicability of all 35 IQMs considered herein both in the MRI and the natural image (NI) domains (Table 1). For the latter, the popular TID2013 Ponomarenko et al. (2015) and KADID-10k Lin et al. (2019) datasets of NIs were used. Fig. 3 illustrates the effect of the shift between the NI and the MRI domains, featuring an expected drop of the correlation values for most metrics 5 . However, the domain shift affects the ranks of the IQMs differently. Some top NI metrics, such as MDSI and MS-GMSD, naturally take lower standings in the MRI domain; however, others, such as HaarPSI and VSI, remain well-correlated with the radiologists' perception of quality. Further examples of IQMs robust to the domain shift are DISTS and FID VGG16 .

Labeling Discrepancies and Self-Consistency Study
Another IQA-related question encountered in survey-based studies is the trustworthiness of the votes themselves. Given that only reputable radiologists were engaged in our labeling routine, we have no grounds for doubting their annotations as far as the domain knowledge is concerned. Therefore, feasible discrepancies among their votes can be assumed to originate either from such factors as the study design, its duration, and fatigue, or from a previous experience which sometimes forms a posteriori intuition and, allegedly, influences the experts to make decisions different from the others.
While the latter is too subjective and difficult to regulate, the former could be controlled. We put effort to simplify the user experience and allowed the radiologists to approach the labeling assignment in batches at their own pace. The average lead time spent labeling a pair of images 6 , an arguable indicator of the scrupulousness of an annotator, is plotted in Fig. 7, where we also summarize the results of the self-consistency study. The study reports Weighted Cohen's Kappa scores, computed between the votes provided in the main and in the additional re-labeling experiments on the same data. Interestingly, there is no significant correlation between self-consistency and the labeling time, placing other factors mentioned above, such as individual experience, at the forefront.
In Fig. 7, the Weighted Cohen's Kappa values correspond to moderate to substantial consistency of scoring (according to Viera and Garrett (2005)). Given the sufficiently trustworthy labeling, the spread of the correlation scores for the modern IQA metrics in Fig. 6, and the non-trivial correlation patterns in Fig. 4, one can conclude that the optimal MRI metric is yet to be devised.
Besides a blunt umbrella metric aggregating the topperforming predictions (e.g., those of VSI, HaarPSI, DISTS, and FID VGG16 ), the future effort should be dedicated to additional forays into modeling MRI-specific perception of the radiologists and to interpreting their assessment using formalized rules taken from the medical textbooks. Such interpretatable metrics will be especially in demand, given the recent appearance of the MRI sampling approaches aimed towards optimizing downstream tasks Razumov et al. (2021), including the recently annotated FastMRI dataset Zhao et al. (2021). Another line of future work could be 'borrowed' from the NI domain, where the abundance of data has led to the emergence of several NR IQMs. Although, in our study, all such metrics (classic BRISQUE Mittal et al. (2012) and the more recent PaQ-2-PiQ Ying et al. (2020) and MetaIQA Zhu et al. (2020)) showed equally mediocre performance compared to the other IQMs, we 6 We discarded 5% of the shortest and the longest lead times to account for erroneous clicks and breaks between the labeling sessions. Mean Lead Time, sec Fig. 7. Correlation between the subjective scores in labelling and relabelling sessions on the same data, with each column/color corresponding to an individual radiologist. This plot shows scoring self-consistency of the experts and the average time spent labelling one pair of images. Apparently, the time spent on labelling is not the major factor affecting the self-consistency of experienced radiologists.
believe their value in the MRI domain is bound to improve with the growth of available data.

Conclusion
This manuscript reports the most extensive study of the image quality metrics for Magnetic Resonance Imaging to date, evaluating 35 modern metrics and using 14,700 subjective votes from experienced radiologists. The top performers -DISTS, HaarPSI, VSI, and FID VGG16 -are found to be efficient across three proposed quality criteria, for all considered anatomies and the target tasks.

Appendix. Labelling User Interface
During the labeling experiment, the participants were asked to score pairs of reconstructed-reference images presented to them side-by-side in a web interface of the Label Studio Tkachenko et al. (2020). The web interface is shown in Fig.  8.
The labeling was done using three main IQ criteria: the presence of artifacts, the perceived level of noise, and the perceived level of soft-tissue contrast. The participants were able to select their answers using the mouse pointer or some keys on the keyboard. During the quality assessment process, the participants were able to zoom images, re-label previously labeled examples, pause and divide their evaluation session into as many labeling rounds as they wished. All labeling results were continuously saved on a remote server to eliminate the possibility of data loss. After the labeling process is finished, the participants were offered the last chance to fix the scoring of the borderline examples. Fig. 8. Web interface of the Label Studio software released to the expert radiologists to perform the labeling. The participants selected their answers using the proposed scale from 1 to 4, rating the images based on each proposed IQA criteria.