Applying machine learning classifiers to automate quality assessment of paediatric dynamic susceptibility contrast (DSC-) MRI data

Objective: Investigate the performance of qualitative review (QR) for assessing dynamic susceptibility contrast (DSC-) MRI data quality in paediatric normal brain and develop an automated alternative to QR. Methods: 1027 signal–time courses were assessed by Reviewer 1 using QR. 243 were additionally assessed by Reviewer 2 and % disagreements and Cohen’s κ (κ) were calculated. The signal drop-to-noise ratio (SDNR), root mean square error (RMSE), full width half maximum (FWHM) and percentage signal recovery (PSR) were calculated for the 1027 signal–time courses. Data quality thresholds for each measure were determined using QR results. The measures and QR results trained machine learning classifiers. Sensitivity, specificity, precision, classification error and area under the curve from a receiver operating characteristic curve were calculated for each threshold and classifier. Results: Comparing reviewers gave 7% disagreements and κ = 0.83. Data quality thresholds of: 7.6 for SDNR; 0.019 for RMSE; 3 s and 19 s for FWHM; and 42.9 and 130.4% for PSR were produced. SDNR gave the best sensitivity, specificity, precision, classification error and area under the curve values of 0.86, 0.86, 0.93, 14.2% and 0.83. Random forest was the best machine learning classifier, giving sensitivity, specificity, precision, classification error and area under the curve of 0.94, 0.83, 0.93, 9.3% and 0.89. Conclusion: The reviewers showed good agreement. Machine learning classifiers trained on signal–time course measures and QR can assess quality. Combining multiple measures reduces misclassification. Advances in knowledge: A new automated quality control method was developed, which trained machine learning classifiers using QR results.


INTRODUCTION
Dynamic susceptibility contrast (DSC-) MRI provides estimates of perfusion in the brain, 1 by imaging the passage of a gadolinium-based contrast agent using a dynamic T2 or T2*-weighted imaging sequence. 2 The contrast agent causes local changes in T2 and T2*, which dynamically alter the MR signal intensity. 3 Analysis of the resulting signal-time courses, associated with each pixel, can produce estimates of cerebral blood volume (CBV), cerebral blood flow (CBF) and vascular mean transit time (MTT). 4 Perfusion can also be measured with dynamic contrast-enhanced (DCE-) MRI, arterial spin labelling (ASL) and intravoxel incoherent motion (IVIM). However, DSC-MRI offers better signal-to-noise ratio (SNR) and contrast-to-noise ratio, and a faster acquisition time. 5

BJR
Powell et al

Measurement of perfusion can be used to indicate health in a range of diseases. In paediatrics, it is used to assess brain tumours, which are the leading cause of cancer-related mortality in children, 6 as well as diseases which affect neurovasculature. Most paediatric brain tumour patients have a gadolinium injection to allow for post-contrast T1-weighted imaging, so the DSC-MRI acquisition can be carried out during this injection, providing information that would otherwise not be available. 7 CBV and CBF values from DSC-MRI acquisitions in paediatric patients have been used to predict long-term survival 8 and have been shown to correlate with tumour grade. 9,10 These applications require accurate CBV and CBF values; it is therefore important to ensure that the DSC-MRI signal-time courses they are estimated from are of good quality. 11

DSC-MRI is prone to motion and susceptibility artefacts, which degrade the quality of acquired data. 11 The scanner and acquisition protocol for DSC-MRI also commonly vary from centre-to-centre, which affects the signal-time courses produced 12,13 and the SNR of the CBV maps, 14 limiting the clinical applicability of the technique. 15 For example, the field strength of the scanner and acquisition parameters such as repetition time (TR), echo time (TE), voxel volume and flip angle may vary between centres. These factors affect the SNR of the acquired data, whilst TR dictates the temporal resolution. 16 In brain tumour patients, breakdown of the blood-brain barrier (BBB) can lead to contrast agent extravasation, where the contrast agent leaks into the extravascular extracellular space (EES). 17 Contrast extravasation can lead to T1-weighted contamination of the signal-time courses and underestimation of the CBV values, or T2*-weighted effects leading to overestimation of CBV values. 17 These contamination effects can be reduced either by administering a pre-bolus of contrast agent, 18 or by using a low flip angle during the acquisition, to reduce the T1 weighting of the DSC-MRI sequence. 11 Recent research has shown that the application of leakage correction is essential when using a low flip angle, single-bolus protocol. 7,19,20

Currently, the ASFNR recommendation for quality control (QC) of DSC-MRI data is to assess signal-time courses by eye, using qualitative review (QR). This involves assessing signal-time courses for the presence of artefacts, including magnetic susceptibility (the response of a material to a magnetic field, which can result in signal loss 21 ) and motion; for appropriate signal drop, indicating the quality of bolus administration; and for noise spikes in the signal-time curve, suggesting that any such time points should be removed. 11 There can be discordance between reviewers, and one DSC data set contains thousands of signal-time courses, so it is not practical to assess the quality of all signal-time courses manually. In practice this means that a subset of the signal-time courses is used to assess the quality of a whole data set. An automated process based on QR that assesses signal-time course quality on a voxel-wise basis, and thereby provides an assessment of the overall quality of a data set, is therefore desirable.
Previous work has shown that it is possible to define statistical thresholds and apply these to quantitative measures calculated from DSC-MRI signal-time courses to assess data quality. 22 Machine learning (ML) classification can be used to train models to make predictions based on features extracted from a data set. 23 This has many applications in medical imaging. For example, it has been used in pneumonia detection, 24,25 detection and classification of Covid-19, 26 diagnosing colorectal cancer, 27 and assisting in planning rehabilitation care for stroke patients. 28 ML has also been applied to DSC-MRI data for several applications, but so far it has not been used for assessing data quality. For example, it has been used in place of standard analysis techniques for DSC-MRI to provide a quicker and more robust method to estimate CBF values from raw signal-time course data, 29 to predict survival in glioma patients, 30 and to classify tumour type. 31-36 Therefore, ML could be applied to features extracted from DSC-MRI signal-time courses to determine data quality. Any new method for assessing data quality should be established in normal brain before it is applied to diseased tissues. Undertaking such a study is a challenge in children due to ethical constraints, but an appropriate alternative is to use paediatric patients undergoing DSC-MRI scans for brain tumours and to select signal-time courses from slices of brain which do not contain tumour.
The objectives of this paediatric study are: to assess the discordance in QR between two reviewers, to use this QR to determine thresholds of quantitative measures of data quality and to investigate whether QR and quantitative thresholds could be used to develop an automated QC process for assessing overall data quality of a paediatric data set.

Patient data
For this study, a data set containing 25 paediatric patients, acquired at 4 UK centres, was used. The data were gathered from the Children's Cancer and Leukaemia Group (CCLG) functional imaging of tumours database. 37 23 of the data sets were acquired pre-diagnosis, and 2 of the data sets were acquired post-diagnosis. One of the post-diagnosis patients underwent a biopsy and chemotherapy, and the other underwent a surgical resection. The acquisition protocols used are summarised in Table 1.

QR of patient data
QR of 1027 signal-time courses, extracted from 25 patients, was performed. A large number of patients were used to ensure a range of acquisition protocols and artefacts were included. Artefacts were those observed in normal clinical practice when scanning patients. Table 1 summarises how many signal-time courses came from each acquisition protocol. Signal-time courses were randomly selected from pre-defined regions within each patient, which included: grey matter (GM), white matter (WM), the edge of the brain, the edge of the ventricles and the cerebellum. All signal-time courses were selected from slices which did not contain any tumour, by selecting supratentorial signal-time courses from patients with infratentorial tumours and infratentorial signal-time courses from patients with supratentorial tumours. Tumour diagnosis information was obtained from the CCLG database. All signal-time courses were assessed using QR by Author 1 (PhD student with 3 years' experience), and a randomly selected subset of 243 signal-time courses was additionally assessed by Author 2 (Clinical scientist with 8 years' experience). QR to assess data quality was carried out using the guidance from the ASFNR recommendations. 11 This involved assessing whether a clear signal drop was present and the level of noise within the baseline and the rest of the signal. Signal-time courses were then given a score of 1 (accepted) or 0 (rejected) based on this assessment. For the subset of signal-time courses reviewed by two reviewers, the scores from Author 2 were considered to be the ground truth and were used for the determination of thresholds and in the training of ML classifiers.
The percentage disagreement between the two reviewers and the Cohen's κ for interrater reliability were calculated for the subset of 243 signal-time courses assessed by both reviewers. All statistical analysis was carried out in R (R Foundation for Statistical Computing, Vienna, Austria, v. 3.5.0).
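The two agreement statistics described above can be illustrated with a short Python sketch. The study itself used R; the function names and the example scores below are hypothetical, and the code assumes binary scores (1 = accepted, 0 = rejected) from exactly two reviewers.

```python
def percent_disagreement(a, b):
    """Percentage of items on which two reviewers gave different scores."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two reviewers giving binary (0/1) scores."""
    n = len(a)
    if n != len(b) or n == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each reviewer's marginal acceptance rate.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical scores for ten signal-time courses (not study data).
reviewer_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
reviewer_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(percent_disagreement(reviewer_1, reviewer_2))
print(cohens_kappa(reviewer_1, reviewer_2))
```

Kappa corrects the raw agreement for the agreement expected by chance, which is why it is reported alongside the percentage of disagreements.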

Calculating the quantitative measures of quality
Signal drop-to-noise ratio (SDNR), root mean square error (RMSE), full width half maximum (FWHM) and percentage signal recovery (PSR) were used as quantitative measures of signal-time course quality. These were calculated for each of the 1027 signal-time courses which had previously undergone QR. SDNR was calculated using equation 1, with the signal drop defined as the difference between the mean baseline and mean of the first pass minima and the two adjacent dynamics. This is a similar measure to the contrast-to-noise ratio applied in work by Digernes et al, except it is calculated from the signal-time course instead of the relaxation rate curve. 39

$$\mathrm{SDNR} = \frac{\text{Signal Drop}}{\text{Standard Deviation in Baseline}} \qquad (1)$$
RMSE was calculated by fitting a version of the simplified γ variate function, 40 shown in equation 2, to the first pass of the signal-time course.
Where y(t) is the fit, t is the time, c is the average baseline signal, and α, β and K are shape coefficients. The RMSE value from this fit was normalised to the area of the first pass.
The FWHM was calculated as the width of the first pass (in seconds) at half the signal drop. The PSR was calculated from equation 3, with T2* recovery defined as the difference between the mean post-bolus signal and the mean of the first pass minima and the two adjacent dynamics. 11

Table 1 note: In the Sequence column, GE-EPI = Gradient Echo - Echo Planar Imaging, and sPRESTO = Sensitivity Encoded (SENSE) Principles of Echo-Shifting with a Train of Observations. 38

To calculate these measures, it was necessary to define the dynamics where the baseline ended and the post-bolus started. The baseline end was determined by calculating the moving mean (with sliding window of three) and cumulative mean of the signal-time course, starting from the first dynamic, and finding the dynamic where the means diverged. The start of the post-bolus was determined using the same process but starting from the last timepoint. The first pass was defined as the region between the end of the baseline and the start of the post-bolus. Figure 2 shows an example signal-time course and the features used to calculate the quantitative measures.
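The SDNR, FWHM and PSR definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function names are hypothetical. The indices `baseline_end` and `post_bolus_start` are taken as inputs because the text does not state the exact criterion for when the moving and cumulative means have diverged. The linear interpolation used for FWHM, and the form assumed for equation 3 (T2* recovery divided by the signal drop, expressed as a percentage), are also assumptions.

```python
from statistics import mean, stdev


def _crossing(times, signal, i, j, level):
    """Time at which the signal crosses `level` between dynamics i and j."""
    if signal[j] == signal[i]:
        return times[i]
    frac = (level - signal[i]) / (signal[j] - signal[i])
    return times[i] + frac * (times[j] - times[i])


def quality_measures(signal, times, baseline_end, post_bolus_start):
    """Return (SDNR, FWHM in seconds, PSR in %) for one signal-time course.

    baseline_end: index of the last baseline dynamic (assumed given).
    post_bolus_start: index of the first post-bolus dynamic (assumed given).
    """
    baseline = signal[: baseline_end + 1]
    post_bolus = signal[post_bolus_start:]
    # Minimum of the first pass, i.e. between baseline and post-bolus.
    first_pass = range(baseline_end + 1, post_bolus_start)
    i_min = min(first_pass, key=lambda i: signal[i])
    # Mean of the minimum and its two adjacent dynamics.
    trough = mean(signal[i_min - 1 : i_min + 2])
    signal_drop = mean(baseline) - trough
    sdnr = signal_drop / stdev(baseline)  # equation 1
    # Assumed form of equation 3: recovery relative to the drop, in percent.
    psr = 100.0 * (mean(post_bolus) - trough) / signal_drop
    # FWHM: width of the dip at half the signal drop.
    half = mean(baseline) - signal_drop / 2.0
    left = i_min
    while left > 0 and signal[left] < half:
        left -= 1
    right = i_min
    while right < len(signal) - 1 and signal[right] < half:
        right += 1
    t_left = _crossing(times, signal, left, left + 1, half)
    t_right = _crossing(times, signal, right - 1, right, half)
    return sdnr, t_right - t_left, psr


# Synthetic example (not study data): 4 baseline, 5 first-pass, 3 post-bolus.
signal = [100.0, 102.0, 98.0, 100.0, 80.0, 50.0, 40.0, 50.0, 80.0, 90.0, 90.0, 90.0]
times = [float(t) for t in range(len(signal))]  # 1 s per dynamic
print(quality_measures(signal, times, baseline_end=3, post_bolus_start=9))
```

The sketch uses the sample standard deviation of the baseline and does not handle a flat baseline or a dip that never returns to the half-drop level; a real implementation would need to guard against both.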
Thresholds from QR
Quality thresholds for SDNR, RMSE, FWHM and PSR were determined using the QR results from the 1027 signal-time courses that underwent QR (Figure 1). Thresholds were determined using k-fold cross-validation (CV), with k = 10. Data are separated into k equally sized folds, with (k-1) folds used as training data, and the remaining fold used as testing data, from which the performance metrics are calculated. This process is repeated until all folds have been used as testing data. 41 The separation of signal-time courses into folds was stratified to ensure an even distribution of accepted and rejected signal-time courses. For each fold, threshold values were determined from the training data. Sensitivity, specificity, precision, classification error and area under the curve (AUC) from a receiver operating characteristic (ROC) curve were calculated as performance metrics, by applying the thresholds to the testing data. Mean thresholds and performance metrics were calculated by averaging across the folds.
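The stratified k-fold procedure described above can be sketched in plain Python. The function names, the round-robin assignment and the fixed random seed are illustrative assumptions; the text does not describe how the folds were constructed beyond stratification.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, balancing the QR labels."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for value in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as test data."""
    for j, test in enumerate(folds):
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, test


# Hypothetical labels: 70 accepted and 30 rejected signal-time courses.
labels = [1] * 70 + [0] * 30
folds = stratified_folds(labels, k=10)
for train, test in train_test_splits(folds):
    accepted = sum(labels[i] for i in test)
    print(len(train), len(test), accepted)
```

In this example each test fold holds 7 accepted and 3 rejected courses, so every fold preserves the overall acceptance rate.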
SDNR and RMSE quality thresholds were determined using sensitivity vs specificity plots. For each fold, the SDNR threshold was varied over each SDNR value within the training data, and the sensitivity and specificity were calculated from applying the threshold to the training data and comparing to the QR results. The optimal threshold was the value where sensitivity equalled specificity. This process was repeated for the RMSE values.
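The threshold search described above can be sketched as follows. The function names are hypothetical. Because sensitivity and specificity rarely coincide exactly on finite data, the sketch picks the candidate value where their absolute difference is smallest; that tie-breaking rule is an assumption. SDNR is treated as "higher is better" and RMSE as "lower is better".

```python
def sens_spec(values, qr, threshold, higher_is_better=True):
    """Sensitivity and specificity of a threshold against QR scores (1/0)."""
    tp = fn = tn = fp = 0
    for v, y in zip(values, qr):
        passed = v >= threshold if higher_is_better else v <= threshold
        if y == 1:
            tp += passed
            fn += not passed
        else:
            fp += passed
            tn += not passed
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(values, qr, higher_is_better=True):
    """Candidate value at which |sensitivity - specificity| is smallest."""
    def gap(t):
        sens, spec = sens_spec(values, qr, t, higher_is_better)
        return abs(sens - spec)
    # Every observed value in the training data is tried as a threshold.
    return min(sorted(set(values)), key=gap)


# Hypothetical training data (not study data).
sdnr = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
qr = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(balanced_threshold(sdnr, qr))

rmse = [0.01, 0.02, 0.03, 0.04]
qr_rmse = [1, 1, 0, 0]
print(balanced_threshold(rmse, qr_rmse, higher_is_better=False))
```

Both classes must be present in the training data, otherwise sensitivity or specificity is undefined; stratified folds help to ensure this.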
Upper and lower thresholds of quality were determined for both FWHM and PSR. For each fold, the parameter values from the training data were ordered in ascending value. The signal-time courses with the smallest and largest FWHM (or PSR) values that passed QR were identified. These values were used as the lower and upper thresholds, respectively, with any signal-time course with an FWHM or PSR between the two thresholds classed as good quality.
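A minimal Python sketch of this two-sided threshold follows. The function names and example values are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
def range_thresholds(values, qr):
    """Lower and upper thresholds: smallest and largest value that passed QR."""
    passed = [v for v, y in zip(values, qr) if y == 1]
    if not passed:
        raise ValueError("no signal-time course passed QR")
    return min(passed), max(passed)


def within_range(value, lower, upper):
    """True if the value lies between the two thresholds (inclusive)."""
    return lower <= value <= upper


# Hypothetical FWHM values in seconds with their QR scores (not study data).
fwhm = [2.0, 3.0, 8.0, 19.0, 25.0]
qr = [0, 1, 1, 1, 0]
lower, upper = range_thresholds(fwhm, qr)
print(lower, upper)
print(within_range(10.0, lower, upper), within_range(25.0, lower, upper))
```

By construction, every QR-accepted course in the training data falls inside the range, while rejected courses whose values lie between the extremes are also passed, so a threshold of this kind tends to favour sensitivity over specificity.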
Combining quantitative measures using ML
ML classification was carried out using the ML toolbox in Matlab (The MathWorks, MA, 2019a). 42 Classification was carried out using the data set of 1027 signal-time courses used for determining thresholds from QR. SDNR, RMSE, FWHM and PSR values were used as predictors for classifier training and the QR scores (1 = passed QR, 0 = failed QR) were used as the target outputs. Hyperparameter optimisation was applied for each classifier and k-fold CV with k = 10 was used. As previously, the k-fold validation was stratified to ensure that there was an even distribution of accepted and rejected signal-time courses in each fold, but the centre or data set the signal-time course came from was not taken into consideration, as the aim of this work is to apply the final classifier to a wide range of patient data. The classifiers used were binary tree, support vector machine (SVM), ensemble, random forest, and logistic regression. These classifiers were selected to ensure that a wide range of classification methods was applied to the data, which helps to ensure that the optimal ML classifier is chosen. The average sensitivity, specificity, precision, classification error and AUC were calculated for each classifier. Further details on the ML, including the hyperparameter optimisation and its results, can be found in the appendix.
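The performance metrics reported for each classifier can be computed from the QR scores and the predicted scores of each test fold, then averaged. The sketch below is illustrative only: the function names and example scores are hypothetical, it is written in Python rather than the Matlab used in the study, and AUC is omitted because it requires classifier scores rather than binary predictions.

```python
def performance(qr, predicted):
    """Sensitivity, specificity, precision and classification error (%)."""
    pairs = list(zip(qr, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "classification_error_pct": 100.0 * (fp + fn) / len(pairs),
    }


def mean_over_folds(per_fold):
    """Average each metric across the per-fold result dictionaries."""
    keys = per_fold[0].keys()
    return {k: sum(m[k] for m in per_fold) / len(per_fold) for k in keys}


# Hypothetical QR scores and classifier predictions for two test folds.
fold_1 = performance([1, 1, 1, 1, 0, 0, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
fold_2 = performance([1, 1, 1, 0, 0], [1, 1, 1, 0, 0])
print(mean_over_folds([fold_1, fold_2]))
```

Here QR acceptance is the positive class, so a false negative is a QR-accepted course that the classifier rejects.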

Application to patient data
Each of the thresholds of the quantitative measures of quality and the best performing ML classifier were applied to signal-time courses obtained from one slice of patient data acquired using the acquisition protocol described in row 3 of Table 1. A quality map was created for each method, showing which voxels had passed QC and which had failed.

RESULTS
QR of patient data shows good agreement between reviewers, and that there is a region of uncertainty where it is difficult to classify signal-time courses
(Table 2, Table 3)
Figure 5 shows example signal-time courses where there were disagreements between the QR results and the SDNR threshold for acceptance of quality. Out of the three signal-time courses that passed the SDNR threshold but failed QR, all three passed the FWHM and PSR thresholds, and one passed the RMSE threshold. All three signal-time courses failed QR because of issues with the post-bolus signal, which were not picked up by SDNR. Out of the three signal-time courses that passed QR but failed the SDNR threshold, one passed the RMSE threshold and two passed the FWHM and PSR thresholds.
Combining quantitative measures to assess data quality reduces classification error and ML classifiers offer an automated method to do this
Table 4 summarises the average performance measures for each of the ML classifiers. The classifier with the lowest classification error was the random forest, producing sensitivity, specificity, precision, classification error and AUC of 0.94, 0.83, 0.93, 9.3% and 0.89, respectively. Figure 6 shows an example of the confusion matrix and the ROC curve from the best performing fold of the best performing classifier. Figure 7 shows examples of the disagreements between the QR and the ML results. Details of the hyperparameter optimisation and its results for the best performing classifier can be found in the appendix.
When applied to one slice of patient data ML passed more signal-time courses than the SDNR threshold
Figure 8 shows the quality maps produced by applying the QR-derived thresholds for each of the described measures, and the random forest ML classifier, to one slice of patient data. Blue pixels represent signal-time courses that passed the respective QC method, whilst orange pixels represent those that failed. Table 5 summarises the percentage of signal-time courses that passed each method.

DISCUSSION
Our study shows that although QR can be used to assess data quality, there is a range of signal-time courses which are difficult to classify. Automated quality control methods using simple metrics can be developed using the results of QR. Combining multiple metrics using ML results in fewer signal-time course misclassifications than using individual metrics. However, selecting a set of metrics to fully describe a signal-time course and all of its potential artefacts is challenging. We applied the automated QC methods to paediatric data in this work; however, they are also applicable to adult data.
The signal-time courses assessed by two reviewers show low discordance, with a low percentage of disagreements and a Cohen's κ value of 0.83, which shows excellent agreement. 43 The two reviewers found it harder to agree on whether to pass signal-time curves from data sets acquired at 1.5 T, due to their reduced SDNR. For the entire subset, the ranges of SDNR and RMSE values for signal-time courses where there were disagreements between the reviewers, and for signal-time courses that were within the region of uncertainty, are both large. This shows that a signal-time course with a large SDNR is not guaranteed to be good quality, as other factors, such as a distorted first pass, may also affect quality.

The automated QC methods presented in this work assess data quality on a voxel-wise basis, which allows for more data to be assessed than in QR. A typical DSC-MRI data set will have tens of thousands of signal-time courses. Assessing all of them by eye is not possible, so QR generally involves assessing a small subset of the whole data set. Our automated QC methods are ML classifiers, trained using signal-time courses from "normal brain", which could lead to the exclusion of pathology-related low perfusion signal-time courses, which may be clinically useful. Large amounts of WM, which is suggested by consensus guidelines as the tissue to use for normalising rCBV values, 19 may also be excluded as it is less perfused than GM. A better approach may be to use voxel-wise QC to give an overall assessment of data quality, which could then be used to decide whether a data set is of sufficient overall quality to be included in a study. An alternative to this would be to assess the quality of an average signal-time course from a data set. This would be quicker than a voxel-wise analysis but could produce misleading results, because an average signal-time course will "average out" noise and regions of artefacts from the data.
The SDNR threshold defined the minimum SDNR for data to be accepted and was the best-performing individual measure, giving the most similar results to QR. This is expected as SDNR defines how visible the signal drop is, which is a key part of assessing data quality by QR. 11 However, SDNR is reduced in low perfused tissues, such as white matter or some low-grade tumours, and this may lead to the exclusion of some of these signal-time courses, discarding clinically useful information. The RMSE threshold gave poorer performance across all the performance measures compared to SDNR. The FWHM and PSR thresholds gave AUC values comparable to the other quantitative measures and a similar classification error to the RMSE threshold. Both resulted in very good sensitivity but poor specificity.

Figure 5. A demonstration of the disagreements between the QR results and the SDNR threshold. The left column (a, c, e) shows signal-time courses that passed QR but failed the SDNR threshold. The right column (b, d, f) shows signal-time courses that passed the SDNR threshold but failed QR. QR, qualitative review; SDNR, signal drop-to-noise ratio.

The Bag method was selected by the hyperparameter optimisation.
Multiple factors affect DSC-MRI data quality, and a single measure cannot cover them all. Figure 5 illustrates the difficulties of trying to classify signal-time course quality purely on SDNR.

Difficulty in defining a single threshold for each quantitative measure shows the need for combined measures to assess data quality. This agrees with work by Akella et al 22 where multiple quantitative measures, including the failure rate of fitting a gamma-variate to the first pass, mean FWHM, and mean PSR, were calculated from the signal-time courses in each data set and used to determine data quality. Cut-off values were calculated using a 99% one-sided confidence interval. Data sets that did not fall within the cut-off values for at least one metric were classed as poor quality. 22 Our work presented here differs in that thresholds for quality are determined using QR results instead of confidence intervals. Combining measures using ML classifiers leads to improved classification error compared to individual thresholds, as shown in Table 4. The random forest classifier gave the lowest classification error, but offers only a minor improvement in performance measures compared to the other classifiers. Therefore, any of the classifiers tested would be suitable.
The ML classifier offers improved sensitivity, classification error and AUC, compared to the SDNR threshold. There is little change in specificity and precision, suggesting that the main improvement in performance comes from a reduction in the number of false negatives, with little change in the number of false positives. Consequently, when the quality control methods are applied to a patient data set, the ML classifier passes a higher percentage of signal-time courses than the SDNR threshold, as shown in Figure 8 and Table 5. The ML classification error (9.3%) is also similar to the percentage of disagreements in QR between reviewers (7%), which implies that it is as accurate as QR, at least for these data sets. The lack of reduction in the number of false positives is likely due to the current quantitative measures not being able to identify all the artefacts that DSC-MRI is susceptible to.
The ML classifiers were trained using k-fold validation. Stratified k-fold validation was used to ensure that there was an even distribution of accepted and rejected signal-time courses in each fold. The centre and data set the signal-time courses originated from were not considered, so the split was applied on a signal basis rather than a subject basis. This is because the aim of this work is to train a classifier which can be applied to a wide range of patient data, so it needs to be capable of handling data from different centres acquired with different acquisition parameters. Therefore, splitting the data on a subject basis could bias the classifier and reduce its performance.
Currently, the results of qualitative review are applied to a series of quantitative measures which are calculated from the signal-time courses. If a convolutional neural network (CNN) were used in place of the ML classifier, then it could be trained using the signal-time courses directly, rather than calculating measures from the signal-time courses. However, currently there are not enough data to train a CNN-type model. This is something that could be investigated in the future once more data have been acquired.
The use of a pre-bolus or single-bolus injection protocol in paediatric data affects the SDNR of the signal-time courses by reducing the signal drop. In adults, a pre-bolus of contrast agent may be given in addition to a full dose of contrast agent, increasing the overall SNR. However, in paediatrics, the European Society for Paediatric Oncology (SIOPE) recommends that paediatric patients should only receive a single-dose of contrast agent, due to concerns over gadolinium deposition. 44 Splitting a single-dose in order to give a pre-bolus will therefore cause a reduction in SDNR in the DSC-MRI acquisition.
Most DSC-MRI studies use the ASFNR recommendation of QR to assess data quality. 11 Automated QC using statistical thresholds has previously been presented by Akella et al. 22 Our method differs as the thresholds and ML classifier are trained on the results of QR. An alternative way to assess data quality is for a radiologist to assess the quality of the perfusion maps produced. This could either mean assessing the diagnostic quality of the perfusion maps, 45,46 assessing the presence of susceptibility artefacts, 47 or assessing the visibility of a certain region of the brain. 48 However, these methods are not automated and risk artefacts being misinterpreted as pathology.
In order to calculate the metrics presented in this study, it is necessary to establish the end of the baseline signal in each signal-time course. There are established methods for determining the end of the baseline, e.g. Carroll et al present a method which uses adaptive thresholds calculated from the standard deviation of the pre-contrast signal, defining a set number of time points from which to calculate the adaptive thresholds. 49 The method we have presented is better suited to a multicentre data set with variable injection protocols, where the number of dynamics in the baseline may vary between centres.
There are some limitations to this study. Firstly, the patient training data set does not include every type of artefact, such as susceptibility artefacts or insufficient dynamics to capture the full passage of the contrast agent. There may be cases where the classifier misclassifies a signal-time course with an artefact appearing for the first time. The training data set is also made up of signal-time courses acquired with specific and consistent acquisition protocols. Any changes in acquisition protocol will affect the signal values and therefore the quantitative measures, e.g. PSR can vary with acquisition protocol. 13 Thresholds would therefore need to be recalculated at different centres.
Secondly, whilst these methods were tested on signal-time courses from slices of brain that did not contain tumour or other definite pathology, they may still not be "normal tissue". The quantitative measures may differ in diseased tissue or tissue that has been exposed to treatments such as radiotherapy, and so the thresholds and methods presented may not be suitable for all circumstances. For example, PSR has been used to exclude signal-time courses with unusual post-bolus signals. However, PSR will be affected by contrast agent leakage due to the breakdown of the blood-brain barrier in tumours. In this case, a leakage correction method, such as the Boxerman-Schmainda-Weisskoff method, 50 could be applied prior to quality assessment. In this study, leakage correction was not applied as no tumour signal-time courses were included.
Finally, although the classifier offers an automated QC method, it is still based on QR so still has an element of subjectivity to it.

CONCLUSIONS
QR of individual signal-time courses by two reviewers showed good agreement on the signal-time courses they assessed. ML classifiers trained on QR results offer an automated method to assess the quality of an entire data set. Although SDNR was a good indicator of quality, using only a single measure to determine data quality risks misclassification of signal-time courses. Combining SDNR with RMSE, FWHM and PSR improves classification, and achieves a misclassification rate similar to the discordance rate of QR. We have shown that ML classifiers trained on QR can be used to assess quality of DSC-MRI signal-time courses obtained from normal brain in this paediatric data set.

ACKNOWLEDGMENT
This work was funded by EPSRC through a studentship from the Sci-Phy-4-Health CDT (EP/L016346/1) and the National Institute for Health Research (NIHR) via a research professorship (RP-R2-12-019). Also, the work has been partially funded by the

CONFLICTS OF INTEREST
The authors declare that there are no conflicts of interest.