Deep learning models for triaging hospital head MRI examinations

The growing demand for head magnetic resonance imaging (MRI) examinations, along with a global shortage of radiologists, has led to an increase in the time taken to report head MRI scans in recent years. For many neurological conditions, this delay can result in poorer patient outcomes and inflated healthcare costs. Potentially, computer vision models could help reduce reporting times for abnormal examinations by flagging abnormalities at the time of imaging, allowing radiology departments to prioritise limited resources into reporting these scans first. To date, however


Introduction
Magnetic resonance imaging (MRI) is fundamental to the diagnosis and management of a range of neurological conditions (Atlas, 2009). Maximising the clinical and economic benefit of head MRI examinations, however, relies on the timely reporting of scans by radiologists. Delays in reporting lead to delays in treatment; for many abnormalities (e.g., acute stroke, tumour, haemorrhage etc.), this results in poorer patient outcomes (increased morbidity and mortality) and inflated healthcare costs (Adams et al., 2005).
It is, therefore, concerning that a marked increase in the time taken to report head MRI scans has been seen in recent years. In the UK, for example, reporting times have increased year-on-year since 2012 (NHS, 2019), with 98% of radiology departments currently unable to fulfil their imaging reporting requirements within contracted hours (RCR, 2017); a similar picture is also seen in China, Australia, and across Europe. Ultimately, this increase can be traced to the growing demand for head MRI examinations, along with a global shortage of radiologists, although backlogs created as a result of the global COVID-19 pandemic are also expected to exacerbate the problem in the coming years (NHS, 2021), putting a growing number of patients at risk.
Potentially, computer vision models could help reduce reporting times for abnormal examinations by flagging abnormalities at the time of imaging. This would allow radiology departments to prioritise limited resources into reporting these scans first, thereby expediting intervention by the referring clinical team. Deep learning methods generally, and convolutional neural networks (CNNs) in particular, show considerable promise for this purpose, having achieved remarkable success on a range of medical imaging tasks in recent years (De Fauw et al., 2018; Titano et al., 2018; Ardila et al., 2019; McKinney et al., 2020). However, developing a model which is 'fit for purpose' in real-world hospital settings presents a number of key challenges. We assert that a clinically-useful model should:

i) Be sensitive to the full gamut of abnormalities likely to be seen in a hospital setting. These range from subtle but important vascular abnormalities, such as venous sinus thrombosis and aneurysmal subarachnoid haemorrhage, to more conspicuous findings like tumours and infarcts. The model should also be sensitive to important extra-cranial abnormalities such as orbital and sinonasal masses.

ii) Distinguish between changes which in a hospital setting are considered 'normal for age' and those considered 'excessive for age'. For example, involutional atrophic changes (i.e., volume loss) are often observed as part of normal ageing (Golomb et al., 1993), but can indicate early-onset neurodegenerative conditions in younger patients. Likewise, small foci of increased signal intensity on T2-weighted images scattered throughout the cerebral white matter occur naturally in the ageing brain (LeMay, 1984; De Leeuw et al., 2001) but are considered abnormal in younger patients. Classification should therefore be 'age-conditional' in order to correctly classify these common findings.

iii) Provide slice-wise and voxel-wise visualisations of the image regions which most influenced its predictions; this helps to engender trust in the model as well as enable human (e.g., radiologist) review of triage decisions.

iv) Be robust to variations in scanner vendors, imaging protocols, patient populations and minor imaging artefacts (e.g., motion and 'ghosting' artefacts etc.) in order to ensure generalisability beyond the training dataset, including to data obtained at different hospitals.

v) Be optimised for use with those scans which form the basis of routine clinical examinations, rather than rely on more advanced sequences which are performed only during targeted imaging protocols.
To date, there has been no demonstration of a model which satisfies each of these conditions. Ultimately, this can be ascribed to the difficulty of obtaining large, clinically-representative labelled datasets for model training (Hosny et al., 2018). In recent years, however, breakthroughs in natural language processing (NLP) have made it feasible to derive accurate radiological labels from free-text radiology reports (Vaswani et al., 2017; Devlin et al., 2018; Wood et al., 2021b), enabling the automatic conversion of archived hospital examinations into labelled datasets suitable for supervised learning (Wood et al., 2022).
In this work, we build on these breakthroughs and present a deep learning framework for triaging hospital head MRI examinations.
The main contributions of our work are as follows:

1) Using a state-of-the-art Transformer-based neuroradiology report classifier, 70,206 head MRI examinations from two UK hospital networks were classified as 'normal' or 'abnormal', generating a large labelled dataset for model training. Separately, two neuroradiologists labelled 800 examinations by interrogating the actual images, generating a 'reference-standard' test set for model evaluation. We then trained and tested CNN-based computer vision models on different subsets of these data, and demonstrate accurate classification of axial T2-weighted and axial diffusion-weighted scans, with good generalisability between hospitals and robustness to imaging artefacts.

2) We demonstrate an automated interpretability framework, based on 'smooth guided backpropagation', which provides automatic slice-wise and voxel-wise visualisations of the image regions which most influenced model predictions.

3) We introduce 'estimated noise correction', a form of noise correction suitable for training computer vision models on NLP-labelled medical images where an a priori estimate of class-conditional label noise is available.

4) To quantify the impact that our model would have on reporting times, we performed a simulation study using historical timestamped data from two UK hospital networks. We show that our best model would considerably reduce the reporting times of abnormal examinations over a one-year period, demonstrating feasibility for use in a clinical triage environment.
This work is a significant extension of our recent conference paper (Wood et al., 2021a). Improvements include: i) training additional models using diffusion-weighted scans to investigate the added value of using these scans, along with T2-weighted images, for abnormality detection; ii) generating a larger labelled dataset for model training and simulation (70,206 versus 54,115 examinations); iii) development of an automated interpretability framework to enable real-time review of model decisions; iv) additional experiments, including dataset size ablation studies and image artefact sensitivity analyses; v) additional dataset information.

Related work
A number of studies have sought to develop computer vision models for detecting abnormalities in head MRI scans. These fall into one of two categories depending on whether a labelled dataset was used for model training (supervised learning) or not (unsupervised learning).

Supervised learning
The closest studies to ours are Gauriau et al. (2021) and Nael et al. (2021). Gauriau et al. presented a deep learning framework, based on CNNs, for triaging head MRI examinations on the basis of axial T2-FLAIR scans. A dataset of examinations from 3 hospitals was manually labelled for model training and testing. However, classification performance was modest (AUC = 0.83, sensitivity = 78%, specificity = 69% for the best performing single-hospital model); this was likely due to the small training dataset size (n_training = 1987 scans). A large reduction in performance was also seen when testing on scans from a separate hospital (AUC = 0.7, ΔAUC = 0.13); this was likely due to marked differences between the various datasets (e.g., one dataset contained female patients only, and included paediatric subjects [age range = 1-99 years], whereas another dataset contained female and male patients, but no paediatric subjects [age range = 20-111 years]). Furthermore, no form of model interpretability was provided, and patient age was not considered when labelling or classifying scans. In other words, this study did not satisfy conditions ii), iii) and iv) outlined above.
Nael et al. recently presented a CNN-based framework for triaging head MRI examinations on the basis of multiple MRI sequences. A dataset of hospital scans was manually labelled (including pixel-level annotation) for model training (n = 9779 examinations), and acceptable performance and generalisability were achieved (AUC = 0.91, ΔAUC = 0.03, sensitivity = 83%, specificity = 86%). Like Gauriau et al. (2021), however, age-appropriate changes were considered 'abnormal', with the result that only 16% of examinations were labelled 'normal', severely limiting the downstream impact of report prioritization. Furthermore, extracranial abnormalities were ignored. This study therefore also did not satisfy all of the conditions outlined previously, in particular i) and ii).
Importantly, neither study reported the most important metric for assessing the feasibility of automated triage, namely the impact on reporting times (Kelly et al., 2019).

Unsupervised learning
Owing to the limited availability of labelled training data, unsupervised learning methods have also attracted considerable interest. The basic idea common to these studies is to train a generative model (e.g., a generative adversarial network (GAN) (Baur et al., 2020a; Han et al., 2021), a variational autoencoder (VAE) (Chen and Konukoglu, 2018; You et al., 2019; Baur et al., 2020b; Zimmerer et al., 2018; Zimmerer et al., 2019; Kobayashi et al., 2020), or a combination of the two (Baur et al., 2018)) to learn the manifold of normal anatomical variability, and at test time detect abnormalities by looking for outliers in either the latent feature space or the reconstruction loss. Superficially, unsupervised methods are attractive because healthy images are all that are required for model training, and these can be obtained from open-access research databases such as the UK Biobank or the US Human Connectome Project (HCP). However, research cohorts are unrepresentative of clinical populations; the HCP, for example, contains scans of participants aged between 22-37 years only (Van Essen et al., 2013), while the Biobank data contains scans of participants aged between 40-69 years only (Bycroft et al., 2018). When applied in clinical settings, it is likely that models trained on these scans will flag older patients as 'abnormal', not because an abnormality is present but because the patient is an outlier relative to the training distribution (healthy 90-year-old brains look 'different' from healthy 60-year-old brains, e.g., due to age-appropriate volume loss). Research scans also undergo quality control; scans which do not adhere to strict imaging protocol parameters or are degraded in some way are often excluded (Jack et al., 2008). Hospital scans, by contrast, are more heterogeneous, and it is likely that unsupervised models will erroneously flag hospital-grade scans as abnormal simply because they are 'out of distribution' (e.g., due to differences in magnetic field strength and homogeneity, image resolution, presence of patient motion and 'ghosting' artefacts etc.).
Of course, these limitations could be addressed by training directly on a large, clinically-representative cohort of healthy images obtained from a hospital database; however, this would necessitate distinguishing scans as 'normal' or 'not normal' (i.e. 'abnormal') prior to model training, at which point supervised learning could be applied instead.

Data
All 81,936 adult (≥ 18 years old) head MRI examinations performed at King's College Hospital NHS Foundation Trust (KCH) and Guy's and St Thomas' NHS Foundation Trust (GSTT) between 2008-2019 were obtained for this study. The MRI scans were performed on Signa 1.5 T HDX (General Electric Healthcare, Chicago, US), Aera 1.5 T (Siemens, Erlangen, Germany), Avanto 1.5 T (Siemens, Erlangen, Germany), Ingenia 1.5 T (Philips Healthcare, Eindhoven, Netherlands), Intera 1.5 T (Philips Healthcare, Eindhoven, Netherlands) or Skyra 3 T (Siemens, Erlangen, Germany) scanners (Table A1 in Appendix A). The text of the corresponding 81,936 radiology reports produced by expert neuroradiologists (UK consultant grade; US attending equivalent) was extracted from the Computerised Radiology Information System (CRIS) (Healthcare Software Systems, Mansfield, UK). These reports were largely unstructured and typically comprised 5-10 sentences of image interpretation, along with comments regarding the patient's clinical history and recommended actions for the referring doctor. All data were de-identified. The UK National Health Research Authority and Research Ethics Committee approved this retrospective study.
To maximise clinical utility, we sought to develop a model optimised for use with the MRI sequence(s) most commonly acquired during hospital head MRI examinations. At KCH and GSTT (two large and representative NHS hospital networks), axial T2-weighted scans were performed in > 90% of examinations. This is in line with what is seen in the United States (ACR, 2019). The next most common sequence was axial diffusion-weighted imaging, performed in > 70% of examinations, while more advanced sequences (e.g., contrast-weighted and susceptibility-weighted scans) were obtained in under 10% of examinations. We therefore elected to focus on abnormality detection using axial T2-weighted scans, and as a secondary goal we sought to investigate the added value of using diffusion-weighted scans (when available) with an ensemble model. Therefore, only those examinations which included an axial T2-weighted scan were included for model training and testing. To ensure that the training and test sets reflected the heterogeneity of examinations seen in routine clinical practice, no reported examinations were excluded on the basis of image quality.
T2-weighted scans used the manufacturers' standard fast/turbo spin echo based sequence with echo train lengths (speed-up factors) in the range 13 to 24. Diffusion-weighted imaging used the manufacturers' product echo planar imaging sequence.
Further dataset information is provided in Table 1 and Appendix A where sex and age distributions are further described.

Dataset labelling
Dataset labelling was performed using a state-of-the-art Transformer-based neuroradiology report classifier (Wood et al., 2020a; Wood et al., 2021b). This model was trained using a dataset of 5000 neuroradiology reports from KCH which had been manually labelled by a team of 5 expert neuroradiologists (UK consultant grade; US attending equivalent) as either radiologically 'normal' or 'abnormal', following comprehensive pre-determined criteria (Appendix B). Briefly, findings which could generate a downstream clinical intervention were labelled as 'abnormal' (referral for case discussion at a multi-disciplinary team meeting was included as an intervention). The 'abnormal' category included findings which were 'excessive for age' (e.g., excessive volume loss, excessive hyperintensities on T2-weighted images); this was possible because the radiology reports made this distinction. All other examinations, including those with age-appropriate changes, were labelled 'normal'. The model achieved near-perfect classification performance on a hold-out set of 500 manually-annotated KCH radiology reports (AUC = 0.992) and generalised to an external hold-out test set of 500 reports from GSTT (AUC = 0.990, ΔAUC = 0.002) (Fig. C1, Table C1 in Appendix C). For further information about the development of this model, see Wood et al. (2020a), Wood et al. (2020b), Wood et al. (2021b).

Table 1
Training, testing, and simulation datasets. The patient age distribution is given as mean ± standard deviation, with age range (min-max) given in parentheses. 'Abnormal' refers to the fraction of abnormal examinations in each dataset. Training and simulation dataset labels were derived from the corresponding radiology reports using a dedicated neuroradiology report classifier (Wood et al., 2020a; Wood et al., 2021b); test set labels were derived by two neuroradiologists who interrogated the actual images, with a consensus classification decision made with a third neuroradiologist when there was a discrepancy. Manual assignment was the reference standard.

Once validated, the model assigned binary labels (i.e., 'normal' or 'abnormal') to all 62,834 head MRI examinations with axial T2-weighted scans performed at the two sites between 2008-2018 for computer vision model development, and to all 7372 outpatient examinations with axial T2-weighted scans performed between 2018-2019 for use in a simulation study (Fig. 1). For computer vision model evaluation, a test set of 800 examinations with 'reference-standard' labels was generated by randomly sampling 40 examinations from each site for each year between 2008-2018. Two neuroradiologists labelled these scans as 'normal' or 'abnormal', applying the same framework used for report labelling but interrogating the actual images. All available sequences within a head MRI examination were interrogated when generating these reference-standard labels. Initial agreement between the two neuroradiologist labellers was 94.9% (Fleiss' kappa = 0.87), so a consensus classification decision with a third neuroradiologist was made in 5.1% of cases. Importantly, this test set contained more than 90 classes of morphologically distinct abnormalities.

Image pre-processing
Minimal pre-processing of head MRI scans was performed. Axial T2-weighted or axial diffusion-weighted scans of arbitrary resolution and dimensions, stored as Digital Imaging and Communications in Medicine (DICOM) files, were converted into NIfTI format, resampled to a common voxel size (1 mm³) (to overcome variations in slice thickness and slice spacing seen in routine clinical practice), cropped or padded to 180 mm × 180 mm × 180 mm, and then down-sampled to a final 3D array of shape 120 × 120 × 120 for deep learning. The cropping/padding step ensured that the aspect ratios of the raw scans were preserved after down-sampling. These pre-processing steps are visualised in Fig. D1 in Appendix D. The intensity of each image was normalised by subtracting the image mean and dividing by the image standard deviation. Computationally expensive pre-processing steps (e.g., bias-field correction, spatial registration of any form, including realignment of angled scans) which limit real-time clinical utility were avoided. Skull-stripping was also avoided in order to enable detection of important extra-cranial abnormalities. All pre-processing was performed using open-access Python-based libraries: pydicom (Mason et al., 2020) was used to load DICOM files; dcm2niix (Li et al., 2016) was used to convert DICOM files to NIfTI format; NiBabel (Brett et al., 2020) and numpy (Harris et al., 2020) were used to load and manipulate NIfTI files; Project MONAI (MONAI, 2020) was used to resample, resize and normalise each image. Scripts to enable readers to reproduce these pre-processing steps are available at https://github.com/MIDIconsortium/Neuroimage_preprocessing.
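The crop/pad and intensity-normalisation steps can be sketched as follows (a minimal numpy illustration, assuming resampling to 1 mm isotropic voxels has already been performed; the actual pipeline uses the pydicom/dcm2niix/NiBabel/MONAI tools listed above, and the function names here are illustrative):

```python
import numpy as np

def crop_or_pad(vol, target=180):
    """Centre-crop or zero-pad each axis to `target` voxels (1 mm isotropic),
    preserving the aspect ratio of the raw scan."""
    out = np.zeros((target,) * 3, dtype=vol.dtype)
    src, dst = [], []
    for s in vol.shape:
        if s >= target:                       # crop: take the central window
            start = (s - target) // 2
            src.append(slice(start, start + target))
            dst.append(slice(0, target))
        else:                                 # pad: place the volume centrally
            start = (target - s) // 2
            src.append(slice(0, s))
            dst.append(slice(start, start + s))
    out[tuple(dst)] = vol[tuple(src)]
    return out

def normalise(vol):
    """Zero-mean, unit-variance intensity normalisation."""
    return (vol - vol.mean()) / vol.std()

# Example: a hypothetical 230 x 230 x 150 resampled scan
scan = np.random.rand(230, 230, 150).astype(np.float32)
cube = normalise(crop_or_pad(scan))           # (180, 180, 180) cube
```

A final down-sampling step (e.g., MONAI's resize transform) would then reduce the cube to 120 × 120 × 120 for the network.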

Computer vision models
Our models were based on the DenseNet121 convolutional network architecture (Huang et al., 2017), which consists of an initial block of 64 convolutional filters (kernel size = [7 × 7 × 7], stride = 2) and a 'max pooling' layer (kernel size = [3 × 3 × 3], stride = 3), followed by four 'densely connected' convolutional blocks. Each dense block consists of alternating point-wise (kernel size = [1 × 1 × 1]) and volumetric (kernel size = [3 × 3 × 3]) convolutions which are repeated 6, 12, 24 and 16 times in the four blocks, respectively. Between the dense blocks are 'transition layers' which consist of a point-wise convolution (kernel size = [1 × 1 × 1]) and an average pooling layer (kernel size = [2 × 2 × 2], stride = 2). Global average pooling is applied to the output of the fourth dense block, resulting in a 1024-dimensional feature vector; this is then concatenated with the patient's age, which had been normalised by subtracting the training set mean and dividing by the training set standard deviation, and passed through a fully-connected layer to generate prediction probabilities for the two classes (Fig. 2, Fig. E1 in Appendix E). Providing the age as input allowed the model to make age-conditional predictions, e.g., to discriminate between radiological findings which are 'excessive for age' and those 'commensurate for age'.
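The age-conditioning mechanism, concatenating the normalised age with the pooled feature vector before the final fully-connected layer, can be sketched in isolation (a minimal numpy illustration with random stand-in weights and made-up training-set age statistics, not the trained PyTorch model):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a 1D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical stand-ins for the trained components.
W = rng.standard_normal((2, 1025)) * 0.01  # fully-connected layer: 1024 features + 1 age
b = np.zeros(2)
train_age_mean, train_age_std = 55.0, 18.0  # illustrative training-set statistics

def classify(features_1024, age_years):
    """Age-conditional prediction: normalise the age and concatenate it with
    the globally-pooled CNN features before the final linear layer."""
    age_norm = (age_years - train_age_mean) / train_age_std
    x = np.concatenate([features_1024, [age_norm]])   # 1025-dim input
    return softmax(W @ x + b)                         # [P(normal), P(abnormal)]

p = classify(rng.standard_normal(1024), age_years=72)
```

Because the age enters before the classification layer, the same imaging features can yield different predictions at different ages, which is what allows 'excessive for age' findings to be separated from age-appropriate ones.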
Because our neuroradiology report classifier is not a perfect model (i.e., it achieves AUC < 1), some small fraction (∼4%, see Table C1 in Appendix C) of the training images will be erroneously labelled 'normal' when in fact they should be labelled 'abnormal', and vice versa. Recent studies have shown that this 'label noise' can significantly impact the performance of deep learning models (Karimi et al., 2020; Zhang et al., 2021). Following Sukhbaatar et al. (2014) and Patrini et al. (2017), we added a 'noise-correction' layer to our network. To motivate this, we note that the probability that a given image x with true (but unknown) label y* will be assigned a noisy label ỹ can be written as:

p(ỹ = j | x) = Σ_i p(ỹ = j | y* = i) p(y* = i | x) = Σ_i T_ij p(y* = i | x),    (1)

where T is a 2 × 2 'transition matrix', T_ij = p(ỹ = j | y* = i), with diagonal elements that specify the probability of correct labelling and off-diagonal elements which specify the probability of label 'flipping'. In Eqn. (1), p(y* = i | x) is the distribution which we desire our computer vision model to learn (i.e., the probability distribution of the true label, conditioned on input image x), whereas p(ỹ = j | x) is the distribution that is actually learned (e.g., by the baseline model) as a result of maximizing the cross entropy between the noisy labels ỹ and the model predictions. We can force the model to learn the true distribution p(y* = i | x) by weighting the predicted probabilities during training by the corresponding elements of T implied by Eqn. (1), an operation which can conveniently be recast as a matrix multiplication between the 2 × 2 matrix T and the 2 × 1 softmax output. At test time, when reference-standard image labels are available, T is set to the identity matrix (I_2) to enable predictions on the basis of p(y* = i | x). In general, T is unknown and must be learned as part of model training. As described in Sukhbaatar et al. (2014), however, this often results in T converging to I_2, in which case the baseline and 'noise-corrected' models are identical. In the case of label errors resulting from imperfect text classification, however, an accurate estimate of T is provided by the confusion matrix which is typically generated as part of NLP model validation. We refer to this as 'estimated noise correction', and pursue this strategy in our experiments.
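A minimal sketch of the noise-correction layer (a numpy illustration; the entries of T here are made up, whereas in 'estimated noise correction' they come from the report classifier's confusion matrix): during training the softmax output is multiplied by the transition matrix, and at test time the identity is used instead:

```python
import numpy as np

# Illustrative transition matrix: T[i, j] = p(noisy label j | true label i).
# Diagonal = probability of correct labelling, off-diagonal = label 'flipping'.
T = np.array([[0.96, 0.04],
              [0.04, 0.96]])

def softmax(z):
    """Numerically stable softmax over a 1D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def noise_corrected_probs(logits, training=True):
    """During training, map the network's estimate of p(y*|x) to p(y~|x)
    via T, so that cross-entropy against the noisy labels drives the
    network towards the clean distribution; at test time use the
    network's output directly (T replaced by the identity)."""
    p_clean = softmax(logits)
    return T.T @ p_clean if training else p_clean   # p(y~=j|x) = sum_i T_ij p(y*=i|x)

p_train = noise_corrected_probs(np.array([2.0, -1.0]))
p_test = noise_corrected_probs(np.array([2.0, -1.0]), training=False)
```

Note that the corrected training-time probabilities are pulled slightly towards uniform relative to the clean prediction, reflecting the chance that the observed label was flipped.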

Model interpretability
To scrutinise model predictions, guided backpropagation was performed (Springenberg et al., 2014). Guided backpropagation works by computing the derivative of the model predictions and 'back-propagating' this signal to the input image, although unlike standard backpropagation, only positive error signals are back-propagated. In this way, guided backpropagation highlights image regions which, if changed slightly, would alter the model's predictions. Guided backpropagation was preferred to other popular interpretability methods such as i) occlusion sensitivity analysis, since repeated forward passes through the model are computationally expensive (i.e., it can take hours to generate high resolution 3D saliency maps), precluding 'real-time' interpretability; ii) layer-wise relevance propagation (Bach et al., 2015), as the underlying assumptions (e.g., 'relevance conservation') are not met for networks with skip connections; and iii) class-activation mapping (CAM) methods (e.g., GradCAM (Selvaraju et al., 2017)), since the size of the final DenseNet convolutional map for an input image of size (120 × 120 × 120) is (3 × 3 × 3), so that the resolution of the resulting heatmap when up-sampled to (120 × 120 × 120) is too coarse to be informative.
When applied to 3D images, guided backpropagation returns gradient arrays of the same dimensionality as the input images. Scrolling through these saliency maps looking for (often subtle) 'hot-spots' would therefore be too time-consuming to permit real-time review of triage decisions. Instead, we sought to present a 1D saliency 'line-out' which shows salience as a function of slice number by computing the maximum gradient per slice. By taking the argmax of this 1D function, the 2D heatmap for the most important slice can also be automatically displayed. Following Smilkov et al. (2017), we suppress spuriously high gradients which can occur due to image noise by repeating the guided backpropagation procedure 50 times, each time adding Gaussian noise (mean = 0, standard deviation = 0.1) to the input image, and then using the mean of this gradient array.
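The slice-wise 'line-out' reduction can be sketched as follows (a minimal numpy illustration on a synthetic gradient volume; in the actual framework the gradients come from smooth guided backpropagation through the trained network):

```python
import numpy as np

def saliency_lineout(grad_vol, axis=2):
    """Reduce a 3D gradient map to a 1D slice-wise profile (maximum
    absolute gradient per slice along `axis`), and return the index of
    the most influential slice for automatic 2D heatmap display."""
    other_axes = tuple(a for a in range(3) if a != axis)
    lineout = np.abs(grad_vol).max(axis=other_axes)
    return lineout, int(np.argmax(lineout))

# Hypothetical smoothed gradient volume with a 'hot-spot' on slice 64.
grads = np.random.default_rng(1).normal(0.0, 0.01, (120, 120, 120))
grads[40:60, 40:60, 64] += 1.0
lineout, top_slice = saliency_lineout(grads)   # top_slice == 64
```

In practice the input `grad_vol` would be the mean over 50 noise-perturbed guided-backpropagation passes, as described above.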

Experiments
A number of models were trained using different subsets of the available NLP-labelled data. In each case, the data were split into training (85%) and validation (15%) sets, ensuring that no patient appearing in the training set appeared in the validation or test sets. Final model evaluation was always performed on a test set of images of unseen patients with reference-standard labels assigned by neuroradiologists on the basis of manual image inspection. Our DenseNet implementation was a modification of that provided by Project MONAI (available at https://docs.monai.io/en/latest/_modules/monai/networks/nets/densenet.html) and all modelling was performed with PyTorch 1.7.1 (Paszke et al., 2019). The ADAM optimizer (Kingma and Ba, 2014) was used with an initial learning rate of 1 × 10⁻⁴, which was reduced by a factor of 10 after every 5 epochs without validation loss improvement. Confidence intervals were generated by repeating this procedure 5 times for each model using different training/validation splits (test sets remained fixed). DeLong's test (DeLong et al., 1988) was used to test the statistical significance of differences in AUC scores between different models. To investigate the sensitivity of model predictions to image artefacts common in clinical settings, synthetic motion and 'ghosting' artefacts were created following Shaw et al. (2018), using code available at https://torchio.readthedocs.io/transforms/augmentation.html (Pérez-García et al., 2021).
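The learning-rate schedule described above can be sketched in pure Python (a simplified illustration mirroring the behaviour of a reduce-on-plateau scheduler with factor = 0.1 and patience = 5, such as PyTorch's `ReduceLROnPlateau`; this is not the authors' training loop):

```python
class PlateauLR:
    """Divide the learning rate by 10 after `patience` consecutive epochs
    with no validation-loss improvement."""
    def __init__(self, lr=1e-4, patience=5, factor=0.1):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:              # improvement: reset the counter
            self.best, self.bad_epochs = val_loss, 0
        else:                                 # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor        # decay and start counting again
                self.bad_epochs = 0
        return self.lr

sched = PlateauLR()
losses = [0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]  # validation loss stalls after epoch 2
lrs = [sched.step(l) for l in losses]         # drops to 1e-5 on the final epoch
```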

Simulation study
To quantify the impact that our model would have in a real clinical setting, we performed a retrospective simulation study using all out-patient examinations performed at KCH and GSTT between 1 January 2018 and 31 December 2018 (in-patient examinations were excluded because at KCH and GSTT in-patient head MRI examinations, which often contain abnormal images, are mandated to be reported within hours every day of the year, so that a triage system for these examinations would likely have negligible impact). Briefly, the simulation proceeded by stepping through each day and, using the original acquisition timestamp, showing the scans performed on that day to our trained abnormality classifier model. The model's output (i.e., the predicted image category) was then used to decide where in the dynamic reporting queue to insert each examination (Fig. 3). Note that we use the class

Table 2
Classification performance (mean AUC ± 95% CI) for the baseline and 'noise-corrected' axial T2-weighted models when trained/tested on different combinations of the available data. Accurate classification (AUC > 0.9) was achieved for all training/testing combinations, and 'noise correction' led to an improvement for all train/test splits (p < 0.05).

Table 3
Performance metrics for the best axial T2-weighted model (trained and tested on all available scans from both sites) which was used for the simulation study.

Model               AUC            Sensitivity    Specificity    F1 score
Axial T2-weighted   0.941 ± 0.003  0.910 ± 0.02   0.831 ± 0.02   0.908 ± 0.02

(i.e., 'normal' or 'abnormal') rather than the predicted probability to decide where in the queue to insert each scan, to avoid easy-to-identify but less urgent abnormalities systematically jumping ahead of clinically urgent but difficult-to-classify abnormalities in the queue. In this way, the predicted class and the time already spent in the queue are used to determine reporting order. Once the day's scans were added to this queue, the first N scans at the front of the queue (where N is fixed by the number of scans historically reported that day) were removed from the front of the queue, and the modelled 'prioritized report delay' (i.e., the difference between the historical 'acquisition time' and our modelled 'report timestamp') for these scans was recorded. At the end of the one-year period, the modelled 'prioritized report delay' for each examination was compared to the historical reporting times. This simulation was inspired by the work of Annarumma et al. (2019), which was performed in the context of triaging chest radiographs, and we made use of code which these authors have made available at https://github.com/WMGDataScience/chest_xrays_triaging/blob/master/reporting_delays_simulation/simulate_reporting.py. Our modified code, which is optimised for use with head MRI scans, is available at https://github.com/MIDIconsortium/Prioritization_simulation.
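The queue-insertion rule can be sketched as follows (a simplified illustration of the logic described above, not the authors' released simulation code; the dictionary field names are hypothetical). Predicted-abnormal scans join behind other abnormal scans but ahead of all predicted-normal scans; within each class the queue is first-in, first-out, so time already spent in the queue breaks ties:

```python
def insert(queue, exam):
    """Insert an exam into the reporting queue by predicted class:
    abnormal ahead of normal, FIFO within each class."""
    if exam["pred"] == "abnormal":
        # position just before the first predicted-normal exam (or at the end)
        i = next((k for k, e in enumerate(queue) if e["pred"] == "normal"), len(queue))
        queue.insert(i, exam)
    else:
        queue.append(exam)

queue = []
for exam in [{"id": 1, "pred": "normal"}, {"id": 2, "pred": "abnormal"},
             {"id": 3, "pred": "normal"}, {"id": 4, "pred": "abnormal"}]:
    insert(queue, exam)

# Report today's N = 2 scans from the front of the queue.
reported = [queue.pop(0)["id"] for _ in range(2)]   # -> [2, 4]
```

In the full simulation this loop runs day by day over a year of timestamped examinations, with N fixed each day by the historical reporting throughput.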

Results
Accurate classification from axial T2-weighted scans (AUC > 0.9) was observed for all training/testing combinations (Table 2, Fig. 4). However, noise correction led to a small but statistically significant improvement in all cases (p < 0.05). When trained on scans from only a single hospital, the models generalized to scans from the other hospital (ΔAUC ≤ 0.02) (Fig. 4).
Table 3 shows additional performance metrics for the best axial T2-weighted model (trained on scans from both sites) when operating at the point indicated in Fig. 4. Table 4 shows the impact that this model would have had if it was used to suggest the order in which out-patient examinations were reported at KCH and GSTT in 2018. At both hospitals, the reduction in reporting times for abnormal examinations, as well as the increase in reporting times for normal examinations, was statistically significant (p < 0.001) (Fig. 5). Saliency maps for the axial T2-weighted model generated by smooth guided backpropagation demonstrate accurate localisation, both across and within slices, of a range of morphologically distinct abnormalities (Fig. 6a-6f). These maps are also sensitive to multiple, distinct findings present in the same scan (Fig. 6g), and to more diffuse pathologies (e.g., atrophy) (Fig. F1 in Appendix F). Model predictions were also robust to the presence of moderate image artefacts (ΔAUC = 0.01), as were the corresponding saliency maps (Fig. 7). Our interpretability framework also passed important statistical randomization tests (Adebayo et al., 2018) (Appendix G).
By training additional axial T2-weighted models using randomly sampled subsets of the available training data, we observed that our abnormality detection framework is operating in an asymptotically optimal data regime (Fig. 8); in other words, only minimal improvement can be expected from further increasing the training dataset size.

Fig. 4. Mean receiver-operating characteristic curve for axial T2-weighted 'noise-correction' models (1) trained/tested using images from both sites (purple), (2) trained on KCH, tested on GSTT (teal), and (3) trained on GSTT, tested on KCH (blue). The operating point of model (1) which was used for the simulation study is also indicated (dotted grey).

To investigate the impact of providing patient age as input, a noise-corrected model without age concatenation was trained and tested on scans pooled from both sites. Compared with the noise-corrected, age-conditional model trained and tested on the same subset of scans, a statistically significant reduction in performance (p < 0.05) was observed (Table 5).
To investigate the impact of using axial diffusion-weighted scans for classification, a model was trained and tested using all available scans pooled from both sites (n_training = 48,849, n_testing = 614). Accurate classification was observed (AUC = 0.901 ± 0.003) (Table 6) using diffusion-weighted scans alone, although this was considerably worse than an axial T2-weighted model trained and tested on the same subset of examinations (0.941 ± 0.003) (p < 0.001). However, an ensemble model, which averaged the predictions of the diffusion-weighted model and the best performing T2-weighted model (from Table 3) when both scans were available, and used only the predictions of the best T2-weighted model when diffusion-weighted scans were unavailable, outperformed (p < 0.05) the best T2-weighted model alone (AUC = 0.948 ± 0.003) (Table 7). Examples from the test set which were misclassified using T2-weighted scans alone, but correctly classified by the ensemble model on the basis of the diffusion-weighted predictions, are shown in Fig. 9. Table 8 shows the impact that this ensemble model would have had if it had been used to suggest the order in which outpatient examinations were reported at KCH and GSTT in 2018. A small further decrease in reporting times for abnormal scans compared to that achieved by the best axial T2-weighted model is seen (5.0 days versus 5.3 days at KCH, 13.7 days versus 14.1 days at GSTT) (Table 8).
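The ensemble rule described above can be written down directly. This is an illustrative sketch (the `ensemble_probability` name is ours); both models are assumed to output a probability of the 'abnormal' class:

```python
def ensemble_probability(p_t2, p_dwi=None):
    """Average the T2-weighted and diffusion-weighted predicted
    probabilities when both scans are available for an examination;
    otherwise fall back to the T2-weighted prediction alone.
    """
    if p_dwi is None:
        return p_t2
    return 0.5 * (p_t2 + p_dwi)
```

The fallback branch matters in practice because diffusion-weighted scans are not acquired for every examination, so the ensemble degrades gracefully to the T2-weighted model rather than failing.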

Discussion
The background to our study is the year-on-year increase in the time taken to report head MRI scans around the world; this delay causes increased morbidity and mortality, and can inflate the cost of treatment. An automated triage tool could reduce reporting times for abnormal examinations by identifying abnormalities at the time of imaging, enabling radiology departments to prioritize limited resources into the reporting of abnormal scans. This in turn could facilitate earlier intervention by the referring clinical team, likely improving clinical outcomes and reducing healthcare costs. To this end, we have developed a deep learning-based computer vision framework for identifying abnormalities using axial T2-weighted and axial diffusion-weighted MRI scans. Our models were trained at scale using retrospective data from two large UK hospital networks, and demonstrate accurate classification performance and good generalisability between hospitals. A simulation study demonstrated that our best performing model would considerably reduce the mean reporting time for abnormal outpatient examinations at GSTT and KCH, demonstrating feasibility as an automated triage tool.

Fig. 5. Report prioritization simulation results for KCH (top) and GSTT (bottom). Historical outpatient reporting delays (dashed lines) are compared with what would have been observed (solid lines) if our model had been used to prioritize the reporting of abnormal scans at the two sites. To test for statistical significance, the null hypothesis distribution was generated by repeating the simulation 1000 times, assigning a random priority to each examination (blue). At both sites, a statistically significant (p < 0.001) reduction in reporting times for abnormal examinations (solid red) compared with what was observed historically (dashed red) was seen. The corresponding increase in normal reporting times was also significant (p < 0.001). Note that the y-axis values are for the null hypothesis distribution only.
A number of key strengths of our study can be identified. (1) Our use of a dedicated neuroradiology report classifier enabled us to generate a large, clinically-representative labelled dataset for model training. This in turn enabled the full gamut of abnormalities likely to be encountered in a real-world clinical setting to be seen by our model during training. Importantly, the dataset also contained patients of diverse ethnicity covering the full adult lifespan (18-99 years), as well as a range of scanner vendors and acquisition parameters. (2) By using scans collected from two separate hospitals, we were able to investigate the generalisability of our models to out-of-sample data by restricting training to scans from one hospital only and testing on scans from the other. Importantly, minimal reduction in classification performance was observed in this case, suggesting that our models may be suitable for use in neuroradiology departments beyond those considered in this study. (3) Unlike previous studies, our models are optimised for use with raw, clinical-grade axial T2-weighted and axial diffusion-weighted scans. An important consequence of avoiding image pre-processing (e.g., bias-field correction, skull-stripping and spatial registration) is that faster classification can be achieved; our framework is able to load axial T2-weighted or axial diffusion-weighted scans of arbitrary resolution and dimensions stored as DICOM files and ultimately return a classification prediction in under 5 s. A further benefit of using raw scans is that our models are able to detect important extra-cranial abnormalities (e.g., orbital and sinonasal masses) which would otherwise be occluded as part of skull-stripping. (4) Our framework provides slice-wise and voxel-wise visualisations of image regions which most influenced the model's predictions, engendering a natural form of interpretability and enabling real-time review of triage decisions. Importantly, this visualisation can be performed automatically, i.e., the most salient slice(s) can be computed and displayed without the need to scroll through each image looking for 'hot spots' in the saliency maps. (5) We have put our model into clinical context through a retrospective simulation, quantifying the impact that our model would have on reporting times at two real-world hospital sites. In contrast, previous abnormality detection studies in the context of head MRI have sought to demonstrate efficacy using classification metrics common to the machine learning literature (e.g., AUC, sensitivity, specificity, F1-score). However, none of these measures ultimately reflect what is most important to patients, namely whether the use of the model would lead to a reduction in reporting times for abnormal scans (the so-called 'AI chasm' (Kelly et al., 2019)).

Table 6
Performance of a diffusion-weighted model. Accurate classification was observed using diffusion-weighted scans alone, although this was considerably worse than an axial T2-weighted model trained and tested on the same subset of examinations.

Accurate classification was observed using axial T2-weighted scans alone. This is important as these scans are the most commonly acquired sequence for detecting pathology in clinical settings around the world. At KCH and GSTT, for example, two large and representative UK NHS hospital networks, the only examinations that don't include axial T2-weighted scans are pre-surgery plans and recalls for contrast imaging (∼7% of examinations),
and these patients will already have undergone routine imaging anyway. In other words, all examinations amenable to triage in our healthcare system will include an axial T2-weighted scan. Nonetheless, axial diffusion-weighted scans are also commonly performed during routine examinations, and we have shown that a small improvement in classification can be achieved by incorporating these scans through an ensemble model which averages the predictions of the T2-weighted and diffusion-weighted models. Diffusion-weighted scans appeared to be particularly useful for distinguishing between mild small vessel disease ('normal') and acute infarction ('abnormal'), both of which can occasionally appear the same on T2-weighted scans when the infarct is very small.

A consequence of AI-assisted triage is that the time to report normal head MRI examinations will be increased. This may present issues for the few false negative errors which our model makes, so it was important to understand the nature of these errors. Our team of neuroradiologists determined that mistakes primarily occurred with findings which are most naturally described in terms of a 'spectrum', but which we had elected to binarize to enable supervised learning. For example, 'minor', 'mild' or 'modest' small vessel disease (SVD) was considered 'normal', whereas 'moderate' or 'severe' SVD was considered 'abnormal' (Fazekas et al., 1987). In most cases, our model was able to correctly classify SVD; however, equivocal cases (e.g., 'mild-to-moderate', which had been labelled 'abnormal' to encourage model sensitivity) were sometimes misclassified. Likewise, equivocal cases involving atrophy and enlarged perivascular spaces were sometimes misclassified. Given the degree of subjectivity involved in these particular scenarios, however, these errors are highly likely to have a negligible clinical impact. Nonetheless, as future work we plan to investigate the use of regression, rather than binary classification, to model these particular abnormalities.
A limitation of our framework is that abnormalities which are not visualisable on either T2-weighted or diffusion-weighted scans will not be detected. For example, microhaemorrhages and blood breakdown products are sometimes only visible on gradient-echo or susceptibility-weighted images. However, in an outpatient setting these abnormalities almost never require expedited management when seen in isolation; furthermore, these sequences are not typically part of routine head examinations, so this is not of clinical relevance. A further limitation is that some abnormalities in our 'abnormal' category might require more urgent intervention than others. As part of future work, we plan to develop a third category of 'emergency diagnoses' to finesse the triage process further. However, UK NHS hospitals require that all emergency MRI scans be reported within hours. Furthermore, CT is the 'workhorse' of emergency diagnoses. Therefore, the benefit of a third category in our healthcare system is likely to be modest.
In conclusion, we have presented an interpretable, CNN-based head MRI abnormality classifier trained on a large dataset of axial T2-weighted and axial diffusion-weighted head MRI scans from two UK hospital networks which had been automatically labelled using a Transformer-based neuroradiology report classifier. We demonstrated accurate, interpretable, robust and generalizable classification on a hold-out set of scans labelled by a team of neuroradiologists, and have shown that the model would reduce the time to report abnormal examinations at both hospital networks, demonstrating feasibility as an automated triage tool.

Appendix B. Binary abnormality definitions
Abnormal is defined as one or more of the abnormalities described below.
Normal is defined as no abnormality described below.

Granular (specialised) abnormality definitions

Small vessel disease
(Fazekas et al., 1987) gives a classification system for white matter lesions (WMLs), summarised as:
1 Mild - punctate WMLs: Fazekas I
2 Moderate - confluent WMLs: Fazekas II
3 Severe - extensive confluent WMLs: Fazekas III
To create a binary categorical variable from this system, if the small vessel disease is described as "unsure", "normal" or "mild", the report is categorized as normal, as this never requires treatment for cardiovascular risk factors. However, if there is a description of moderate or severe WMLs, the report is categorized as abnormal, as these cases sometimes require treatment for cardiovascular risk factors.
Included as normal are descriptions of scattered non-specific "white matter dots" or "foci of signal abnormality" (unless a more diffuse or specific pathology is implied) and small vessel disease described as "minor", "minimal" or "modest".
Conversely, cases in which small vessel disease is described as "mild to moderate", "confluent", or "beginning to confluence" are treated as abnormal.
Genetic small vessel disease, in particular Cerebral Autosomal Dominant Arteriopathy with Subcortical Infarcts and Leukoencephalopathy (CADASIL), is considered abnormal.
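The binarisation rule above can be expressed as a simple lookup. This is an illustrative sketch only (the descriptor sets and the `binarise_svd` name are ours, summarising the definitions above); the study's actual labels were produced by the Transformer-based report classifier, not by keyword matching:

```python
# Descriptors mapped to 'normal', per the definitions above.
NORMAL_SVD = {"unsure", "normal", "mild", "minor", "minimal", "modest"}
# Descriptors mapped to 'abnormal'; equivocal 'mild to moderate' is
# deliberately labelled abnormal to encourage model sensitivity.
ABNORMAL_SVD = {"mild to moderate", "moderate", "severe", "confluent",
                "beginning to confluence"}

def binarise_svd(descriptor: str) -> str:
    """Map a small-vessel-disease descriptor from a report to the
    binary label used for training."""
    d = descriptor.lower().strip()
    if d in ABNORMAL_SVD:
        return "abnormal"
    if d in NORMAL_SVD:
        return "normal"
    raise ValueError(f"unrecognised SVD descriptor: {descriptor!r}")
```

Checking the abnormal set first ensures that multi-word equivocal descriptors such as "mild to moderate" are never confused with "mild".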

Mass
All of the following intracranial masses are categorized as abnormal:
- Neoplasms (tumours)
  • Intra-axial, including all primary and secondary neoplasms
  • Extra-axial, including all primary and secondary neoplasms
  • Pituitary adenomas included
  • Lipomas included
- Tumour debulking or partial resection, as this implies residual tumour (note: these are labelled as both "encephalomalacia" and "mass" abnormalities)
- Ependymal, subependymal or local meningeal enhancement (non-surgical) in the context of a history of an aggressive infiltrative tumour
- Abscess
- Cysts
  • Retrocerebellar cyst (mega cisterna magna not included)
  • Arachnoid cysts
  • Pineal cysts and choroid fissure cysts
  • Rathke cleft cysts
- Focal cortical dysplasia, nodular grey matter heterotopia, subependymal nodules and subcortical tubers
- Chronic subdural haematoma or hygroma (i.e., cerebrospinal fluid (CSF) equivalent)
- Perivascular spaces: normal unless giant
- MRI examinations for stereotactic surgical planning alone may have very brief reports. In these scenarios it is typically evident from the clinical information provided that there is a mass, e.g., surgical planning for glioblastoma.
Note that findings that may typically have minimal clinical relevance when confirmed by a neuroradiology expert are included in this category, e.g., arachnoid cyst. The rationale is that such a finding might generate a referral to a multidisciplinary team meeting for clarification of its clinical relevance. We consider a referral to a multidisciplinary team meeting to be a clinical intervention, and we aim to ensure that any findings that generate a downstream clinical intervention are included.

Vascular
All of the following are categorized as abnormal for vascular:
- Aneurysm
  • including coiled aneurysms, regardless of whether there is a residual neck or not
- Arteriovenous malformation
- Arteriovenous dural fistula
- Cavernoma
- Capillary telangiectasia
- Chronic / non-specific microhaemorrhages
- Petechial haemorrhage
- Developmental venous anomaly
- Venous sinus thrombosis
- Vasculitis, if associated with vessel changes such as luminal stenosis or vessel wall enhancement
- Arterial occlusion / flow void abnormality or absence
- Venous sinus tumour invasion (this is labelled as both "vascular" and "mass" abnormalities)
- Arterial stenosis (not included if a constitutional / normal variant)
Vascular-like findings which are considered normal include descriptions of sluggish flow, flow-related signal abnormalities (unless they raise the suspicion of thrombus) and vascular fenestrations.
Note that findings that may typically have minimal clinical relevance when confirmed by a neuroradiology expert are included in this category, e.g., developmental venous anomaly. The rationale is that such a finding might generate a referral to a multidisciplinary team meeting for clarification of its clinical relevance. We consider a referral to a multidisciplinary team meeting to be a clinical intervention, and we aim to ensure that any findings that generate a downstream clinical intervention are included.

Encephalomalacia
All of the following are categorized as abnormal for encephalomalacia:
- Gliosis
- Encephalomalacia
- Cavity
- Post-operative tissue changes / appearances are included as encephalomalacia
- Tumour debulking or partial resection, as this implies residual tumour (note: these are labelled as both "encephalomalacia" and "mass" abnormalities)
- Chronic infarct / sequelae of infarct
- Chronic haemorrhage / sequelae of haemorrhage (with / without haemosiderin staining)
- Cortical laminar necrosis
Encephalomalacia-like findings which are considered normal, unless there is a clear description of related parenchymal injury, include craniotomy, burr-holes, posterior fossa decompression, and 3rd ventriculostomy.

Acute stroke
All of the following are categorized as abnormal for acute stroke:
- Acute / subacute infarct (if demonstrating restricted diffusion)
  • Include if there are other descriptors indicating a subacute nature, such as swelling, even though restricted diffusion has normalised
- If a single ischaemic event has both diffusion-restricting and non-restricting elements, then this is labelled as an "acute stroke" abnormality (rather than an "encephalomalacia" abnormality)
- Parenchymal post-operative restricted diffusion secondary to retraction injury
- Mitochondrial Encephalopathy with Lactic Acidosis and Stroke-like episodes (MELAS), if associated with restricted diffusion
- Hypoxic ischaemic injury, if associated with restricted diffusion
- Vasculitis, if associated with acute / subacute infarct
- "Mature", "established", "chronic" or "old" infarcts without other descriptors are labelled as "encephalomalacia" abnormalities

White matter inflammation
All of the following are categorized as abnormal for white matter inflammation:
- Multiple sclerosis (MS), including when some plaques show cavitation (low T1 signal)
- Cerebellar ectopia
- Brain herniation (e.g., through a craniectomy defect)
- Clear evidence of intracranial hypertension (e.g., prominent optic nerve sheaths AND intrasellar subarachnoid herniation)
  • Isolated intrasellar subarachnoid herniation / empty sella is labelled normal
  • Isolated tapering of dural venous sinuses is labelled normal
- Clear evidence of intracranial hypotension (e.g., pituitary enlargement AND pachymeningeal thickening)
  • If subdural collections are present, these are also labelled as "mass"
- Cerebral oedema or reduced CSF spaces from parenchymal swelling
- Absent or hypoplastic structures, such as agenesis of the corpus callosum
- Meningeal thickening or enhancement, for example in the context of neurosarcoid or vasculitis
- Enhancing or thickened cranial nerves
- Infective processes primarily involving the meninges or ependyma (i.e., ventriculitis or meningitis)
- Encephalitis if primarily involving the cortex (herpes simplex virus (HSV) / autoimmune encephalitis)
- Excessive or unexpected basal ganglia or parenchymal calcification
- Optic neuritis involving the intracranial segments of the optic nerves, or chiasmitis
- Adhesions / webs
- Pneumocephalus
- Colpocephaly
- Superficial siderosis
- Ulegyria
- Focal areas of signal intensity (FASIs) / unidentified bright objects (UBOs)
- Basal ganglia / thalamic changes in the context of metabolic abnormalities
- Neurovascular conflict fulfilling conditions of nerve distortion AND nerve root entry zone involvement
- Band heterotopia and polymicrogyria
- Hypophysitis
- Seizure-related changes
- Amyotrophic lateral sclerosis (ALS)

Appendix E. DenseNet121 architecture
Appendix G
(Adebayo et al., 2018) suggested that interpretability methods should pass two important statistical randomization tests (or 'sanity checks', as the authors call them):
i) The model parameter randomization test. Saliency maps generated by a trained model should differ substantially from those generated by an untrained (i.e., randomly initialized) network.
ii) The data randomization test. Saliency maps generated by a trained model should differ substantially from those generated by the same model trained on a copy of the dataset with randomly permuted labels.
In other words, saliency maps should be sensitive to model and data randomization. In the context of natural (i.e., non-medical) 2D computer vision, (Adebayo et al., 2018) found that several popular interpretability methods (including guided backpropagation and guided GradCAM) were less sensitive to model and data randomization than other methods (including SmoothGrad). However, it is unclear whether these results carry over to computer vision tasks involving medical images, or how combinations of these individual methods (e.g., smooth guided backpropagation) perform.
To investigate this, we performed randomization tests using our axial T2-weighted model. Fig. G1 compares the saliency lineouts and maps generated by our trained axial T2-weighted model (Table 3 in the main manuscript) with those generated by a copy of the same model with the weights of the final densely connected layer replaced by random values sampled from a uniform distribution (similar results are seen when the weights of the entire network are randomized). Fig. G2 compares the saliency lineouts and maps generated by our axial T2-weighted model with those generated by an architecturally-identical model trained on the same dataset but with randomly permuted labels. Clearly, substantial differences can be seen in both cases; in particular, the randomized saliency maps resemble simple edge detectors, whereas our optimized saliency maps are sensitive only to the underlying pathology. Our interpretability framework therefore clearly passes these 'sanity checks', and shows promise for enabling real-time review of model decisions.
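The model parameter randomization test can be illustrated on a toy network. The sketch below uses a tiny two-layer model with plain input-gradient saliency as a stand-in for the full DenseNet121 and smooth guided backpropagation; the function names and shapes are ours and purely illustrative:

```python
import numpy as np

def saliency(w1, w2, x):
    """Input-gradient saliency for a toy two-layer network
    f(x) = w2 @ relu(w1 @ x): |d f / d x| = |w1.T @ (w2 * relu'(w1 @ x))|.
    Stand-in for smooth guided backpropagation on the real model."""
    h = w1 @ x
    return np.abs(w1.T @ (w2 * (h > 0)))

def randomisation_test(w1, w2, x, seed=0):
    """Model parameter randomization test (Adebayo et al., 2018):
    compare the saliency map of the 'trained' network with that of a
    copy whose final layer is replaced by uniform random weights.
    A low correlation means the saliency method is sensitive to the
    model parameters, i.e., it passes the sanity check."""
    rng = np.random.default_rng(seed)
    w2_random = rng.uniform(-1.0, 1.0, size=w2.shape)
    s_trained = saliency(w1, w2, x)
    s_random = saliency(w1, w2_random, x)
    return float(np.corrcoef(s_trained, s_random)[0, 1])
```

The data randomization test follows the same pattern, except the second saliency map comes from a model retrained on permuted labels rather than one with randomized weights.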

Fig. 1 .
Fig. 1. Flow chart showing the data sets used for training, validating, and testing our models, as well as the data sets used for the simulation study. KCH (top), GSTT (bottom). To ensure that the training and test sets reflected the heterogeneity of examinations seen in routine clinical practice, no reported examinations were excluded on the basis of image quality. For the triage simulation study, however, in-patient examinations were excluded because at KCH and GSTT in-patient head MRI examinations, which often contain abnormal images, are mandated to be reported within hours, so that a triage system for these examinations would likely have negligible impact.

Fig. 2 .
Fig. 2. Baseline classification model and 'noise-correction' classification model. Both networks perform visual feature extraction using a 3D DenseNet121, and concatenate this with the patient's age in order to generate class probabilities. The 'noise-correction' model includes an additional layer which modifies the predictions during training to enable learning of the true, rather than the noisy, label distribution.
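One common way to realise a noise-correction layer of the kind described in Fig. 2 is a label-noise transition matrix applied to the model's predicted distribution during training. This is a generic noise-adaptation sketch under that assumption; the paper's exact formulation may differ:

```python
import numpy as np

def noisy_prediction(p_true, transition):
    """Apply a label-noise transition matrix T, where T[i, j] is the
    probability that true class i is recorded as (noisy) label j, to the
    model's predicted distribution over true labels. During training the
    loss is computed against this noisy distribution; at test time the
    extra layer is dropped and p_true is used directly.
    """
    return p_true @ transition
```

For a binary 'normal'/'abnormal' problem, `transition` is a 2 × 2 row-stochastic matrix; as it approaches the identity, the layer has no effect and the noisy and true label distributions coincide.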

Fig. 3 .
Fig. 3. Automated triaging of head MRI examinations. A computer vision-based triage tool would suggest the order in which head MRI examinations are reported by a radiologist by inserting images in real-time into a dynamic reporting queue, based either on the predicted likelihood of being abnormal (shown) or on the predicted category and time spent in the queue (what we do in this study). Earlier reporting of abnormal scans would expedite intervention by the referring clinical team, with the expectation of improving patient outcomes and lowering healthcare costs.

Fig. 6 .
Fig. 6. Smooth guided backpropagation enables 'slice-wise' and 'voxel-wise' visualisations of image regions which most influence the model's predictions. When applied to scans from the hold-out test set, accurate localisation of a range of morphologically distinct abnormalities, both across and within slices, is observed (a-f); sensitivity to multiple findings in a single scan is also seen (g). a) Mass in the right cerebellar hemisphere, consistent with a solitary metastasis. b) Chronic infarct (lacune) in the left corona radiata. c) Right temporal extra-dural haematoma. d) Extensive infarction involving both cerebellar hemispheres. e) T2 hyperintensity within the left operculum; associated cortical volume loss (excessive for age). f) Left cerebellar extra-axial T2 hyperintensity in keeping with a collection. g) Glioma in the left parietal lobe, with evidence of chronic right cerebellar damage, and moderate small vessel disease. (Left/right are 'radiological left/right'.)

Fig. 7 .
Fig. 7. Saliency maps are robust to moderate levels of image degradation due to patient motion and 'ghosting' artefacts. Left: original images and corresponding saliency maps. Right: the same images corrupted by synthetic patient motion (top and middle) and 'ghosting' (bottom) artefacts, along with corresponding saliency maps.

Fig. 8 .
Fig. 8. Dataset size ablation study. Test set performance (AUC) as a function of training dataset size for our axial T2-weighted model using scans pooled from both sites. Error bars show 95% confidence intervals. Optimal performance is achieved for training dataset sizes ≥ 30,000 scans.

Fig. 9 .
Fig. 9. Examples from the test set which were misclassified using T2-weighted scans alone (left), but correctly classified by an ensemble model which also incorporates diffusion-weighted scans (right). These examples show that diffusion-weighted images can be useful for distinguishing between mild small vessel disease ('normal') and acute infarction ('abnormal'), both of which can occasionally appear the same on T2-weighted scans when an infarct is very small.

Fig. A1 .
Fig. A1. Distribution of ages across the datasets used in this study, including when grouped by sex.

Fig. C1 .
Fig. C1. Receiver operating characteristic (ROC) curve for the neuroradiology report classifier described in (Wood et al., 2020a; Wood et al., 2021b), evaluated on 500 radiology reports from KCH (indigo) and 500 radiology reports from GSTT (teal), each of which had been manually labelled by a team of 5 neuroradiologists. Included is the area under the receiver operating characteristic curve (AUC). The model generalised to reports from GSTT (ΔAUC = 0.002) despite being trained only on reports from KCH, demonstrating that it can reliably be used to label examinations at both sites.

Fig E1 .
Fig. E1. DenseNet121 architecture used in this study. Also shown are the output sizes at each internal layer of the network for an input image of size (120 × 120 × 120).

Fig. F1
Fig. F1. Saliency maps generated by smooth guided backpropagation are sensitive to diffuse pathologies such as atrophy, and importantly are sensitive to both symmetric and asymmetric changes within the same brain, the pattern of which is highly pertinent in determining an imaging-based disease differential diagnosis. a) A subject from the test set described as having generalised atrophy in excess for age, but with more marked atrophy of the left hippocampus and around the intraparietal sulcus, as well as enlargement of the lateral ventricles. Frontotemporal dementia or cortico-basal degeneration may be considered in the differential diagnosis. b) A subject from the test set described as having generalised prominence of the ventricles and sulci, as well as asymmetrical (right greater than left) volume loss within the amygdala and hippocampi. Alzheimer's disease may be considered in the differential diagnosis. In both cases, the slice-wise saliency lineouts appear more 'spread-out' than was the case for the localised pathologies in Fig. 6 in the main text, but are modulated by peaks which correspond to regions of more marked atrophy. (Left/right are 'radiological left/right'.)

Fig. G1 .
Fig. G1. Model parameter randomization test. Saliency maps and lineouts are highly sensitive to model randomization. Left: saliency lineouts and maps generated using our trained axial T2-weighted model. Right: saliency lineouts and maps for the same images generated using the same model except that the weights in the final densely connected layer had been randomized. Also included is the randomized saliency map for the slice deemed most important by the non-randomized model. A marked sensitivity to model parameter randomization is seen.

Fig. G2 .
Fig. G2. Data label randomization test. Saliency maps and lineouts are highly sensitive to randomization of training labels. Left: saliency lineouts and maps generated using our axial T2-weighted model trained on the correct image labels (i.e., 'normal' or 'abnormal'). Right: saliency lineouts and maps for the same images generated using an architecturally identical model trained on the same dataset but with randomly permuted labels. Also included is the randomized saliency map for the slice deemed most important by the non-randomized model. A marked sensitivity to data randomization is seen.

Table 4
Results of the retrospective simulation study, demonstrating the impact that our axial T2-weighted model would have on reporting times for outpatient examinations at KCH and GSTT in 2018. Data are mean delay ± standard deviation.

Table 5
Comparison of model performance with and without age as input. Performance metrics for the age-conditional model are reproduced from Table 4.

Table 8
Results of the retrospective simulation study, demonstrating the impact that our ensemble model would have on reporting times for outpatient examinations at KCH and GSTT in 2018. Data are mean delay ± standard deviation.
The ensemble averaged the predictions of the best axial T2-weighted model and the diffusion-weighted model (when diffusion scans were available) from Table 6. The ensemble model outperformed the axial T2-weighted model alone (p < 0.05).