VASARI-auto: Equitable, efficient, and economical featurisation of glioma MRI

Introduction
Contemporary brain tumour care relies upon joint multi-disciplinary teams spanning clinical, oncological, surgical assessment, histopathology, and radiology (Louis et al., 2021).
Neuroradiology plays a vital role for these patients, not merely in the initial diagnosis and triaging to services but also in post-treatment follow-up, where many patients are monitored for several years. Across all subspecialties, our understanding of neuro-oncology is increasingly recognised to be challenged by the marked heterogeneity of brain tumours (Ruffle et al., 2023b). Though there is no established solution to this heterogeneity, it is a problem that could arguably only be addressed with richer patient-personalised information, catalytic for data-driven decision-making. To understand this heterogeneity, we require robust systems that illuminate disease variation from one patient to another.
The VASARI (Visually AcceSAble Rembrandt Images) MRI feature set is a quantitative scoring system designed to facilitate accurate and reliable imaging descriptions of adult gliomas (TCIA, 2020), initially developed in 2010 as part of The Cancer Genome Atlas (TCGA) initiative from the Repository for Molecular BRAin Neoplasia DaTA (REMBRANDT) study (Gusev et al., 2018). It uses controlled and predefined terminology to define hallmark characteristics of glioma, including location, proportions of constituent components (such as oedema, enhancing and nonenhancing tumour), and other associated features such as cortical, ependymal, or deep white matter involvement. VASARI was intended to yield more consistent imaging interpretations, irrespective of rater, centre or imaging approach (TCIA, 2020). Indeed, it has shown promise towards better standardisation of care in adult glioblastoma, with multiple studies consistently demonstrating reasonable inter-observer agreement across constituent VASARI features beyond what could be expected from conventional reporting (Gemini et al., 2023; Park et al., 2021; Setyawan et al., 2024). It has also been used with clinical and genomic data to effectively predict tumour histological grade, progression, mutation status, risk of recurrence and overall patient survival, implying a broader clinical utility (Jain et al., 2014; Nicolasjilwan et al., 2015; Peeken et al., 2019; Peeken et al., 2018; Setyawan et al., 2024; Wang et al., 2021; Zhou et al., 2017). Though initially developed for adult glioblastoma, the VASARI feature set has been trialled in several novel clinical contexts, including in paediatric brain tumours (Biswas et al., 2022) and rarer neuroepithelial malignancies (Li et al., 2023), where it has shown potential as a clinical aid.
However, despite good evidence to support implementing the VASARI feature set as a clinical tool, it can be prohibitively time-consuming. Some studies report manual segmentation times of 20-40 minutes per case (Deeley et al., 2011; Wan et al., 2020). In a resource-limited and overstretched healthcare system (NHS, 2024), such a time constraint inevitably obstructs translation into real-world care.
Though the task is complex, it is theoretically deliverable by machine vision. Over the last few decades, lesion segmentation has formed a cornerstone of innovation across neuro-oncology (Lu et al., 2021; Peng et al., 2021; Ruffle et al., 2023a; Xue et al., 2020), medical imaging (Lenchik et al., 2019; Suetens et al., 1993), biomedical engineering (Ashburner and Friston, 2005), and machine and deep learning (Menze et al., 2015). The ability to segment an anatomical or pathological lesion in three dimensions confers the ability to evaluate it quantitatively, moving beyond visual qualitative assessment, with greater richness and fidelity than conventional two-dimensional measurements, which have repeatedly been shown to be often spurious and inconsistent between radiologists (Dempsey et al., 2005; McNitt-Gray et al., 2015; Zhao et al., 2009), and with greater sensitivity to the heterogeneity of the underlying pathological patterns (Mandal et al., 2020). Enabling radiological image segmentation opens many possibilities for downstream innovation in neuro-oncological healthcare and research, ranging from standardisation of care, clinical stratification, outcome prediction, and response assessment to treatment allocation and risk quantification, many of which have already shown great promise. The underlying goal is to enhance the individual fidelity of data-driven decision-making, facilitating better patient-centred care (Rajpurkar et al., 2022; Topol, 2019), a remit especially warranted in neuro-oncology (Louis et al., 2021).
Given this, we developed VASARI-auto, an automated VASARI feature set labelling tool (Figure 1). VASARI-auto requires only patient lesion segmentations as input, a design choice made to maximise patient confidentiality. We herein illustrate its high performance, efficiency, equity, and downstream survival predictive utility in a multi-site patient cohort that is large for its kind, with real-world healthcare provider simulations illustrating tangible added value that can enhance clinical neuro-oncology workflows.

[Figure caption, beginning missing] ... were acquired for all participants. By random assignment of 100 glioblastoma, IDH-wt cases, two experienced consultant neuroradiologists reviewed imaging and recorded VASARI features and were timed doing so. In parallel, we developed VASARI-auto, an automated software to determine VASARI features. We derived these features using VASARI-auto from both semi-supervised hand-annotated lesion masks from a separate group of neuroradiologists and using a previously published and openly available tumour segmentation model, herein referred to as 'TumourSeg'. Lesional tissue is colour-coded as orange for enhancing tumour, purple for nonenhancing tumour, and pink for perilesional signal change.

We subsequently undertook multiple downstream evaluations of both neuroradiologist VASARI labelling and that from VASARI-auto, evaluating: 1) agreement between neuroradiologists, between software, and between neuroradiologist and software; 2) equity calibration to determine if neuroradiologist and software labelling were equitably performant for all ages and sexes; 3) a simulated economic analysis determining the cost to undertake labelling with neuroradiologists or VASARI-auto based on real-world clinical workloads; and 4) the use of these data to predict patient overall survival. Neuroradiologists were blinded to all software development and evaluations from VASARI-auto, and likewise, software developers were blinded to all neuroradiologist labelling until the final downstream evaluation stage.
We first contacted the corresponding authors of both datasets to clarify which participant imaging was part of The Brain Tumour Segmentation Challenge (BraTS) (Baid et al., 2021), since BraTS data were used to initially train the adopted tumour segmentation model (herein referred to as 'TumourSeg') (Ruffle et al., 2023a; Ruffle, 2023), and we therefore needed to prevent any possibility of an information leak. We excluded any such cases from the data pool. Further, we subsampled to study only patients with a confirmed molecular diagnosis of glioblastoma, IDH-wt (Louis et al., 2021), for which VASARI featurisation was initially developed. Each patient dataset included volumetric and brain-extracted T1, T2, FLAIR, and post-contrast T1-weighted MRI sequences (Table 1).
In all patients, age, sex, overall survival (in days), and a lesion segmentation were available.
Separately undertaken by the original UCSF-PDGM and UPenn-GBM repository authors, each patient neuroimaging set first underwent automated segmentation using an ensemble model consisting of the prior top-scoring BraTS challenge algorithms, which was then manually corrected by a group of annotators with varying experience and approved by one of two neuroradiologists with more than 15 years of attending experience each (Bakas et al., 2022b;Calabrese et al., 2022).

Ethical approval
UCSF-PDGM data collection followed relevant guidelines and regulations and was approved by the UCSF institutional review board with a waiver for consent (Bakas et al., 2022b). For UPenn-GBM, collection, analysis, and release of the data were approved by the Institutional Review Board at the University of Pennsylvania Health System (UPHS), and informed consent was obtained from all participants (Bakas et al., 2022b).

Neuroimaging
Neuroradiologist VASARI-featurisation
From this glioblastoma, IDH-wt cohort, we drew a random sample of 100 unique patients. The selection of 100 from 1172 patients for manual neuroradiologist labelling was programmatically randomised to avoid selection bias. Our choice of n=100 was guided by a balance of reasonable statistical power and how time-consuming manual annotation of these scans can be for neuroradiologists.
Each patient was randomly assigned to one of two consultant neuroradiologists with 15 and 8 years of experience in neuro-oncology, who quantified VASARI features from neuroimaging.
Both radiologists had prior experience with VASARI criteria, though an initial calibration meeting was also undertaken to ensure consistency. Neuroradiologists quantified all VASARI features except those requiring diffusion or non-brain-extracted sequences. The time taken to derive VASARI features in each patient case was recorded. Using a random number generator, we drew a random integer between 10 and 15 (which yielded 13) and accordingly randomly allocated 13 cases to both neuroradiologists to ascertain inter-rater agreement.
Neuroradiologists were blinded to all software and model development.
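The sampling scheme described in this section can be illustrated with a minimal sketch. It assumes only the numbers stated in the text (1172 eligible patients, 100 sampled, a duplicate count drawn between 10 and 15); the function name, case identifiers, rater labels, and seed are hypothetical, and this is not the study's own randomisation code.

```python
import random

def draw_review_sample(patient_ids, n_sample=100, dup_range=(10, 15), seed=0):
    """Pick cases for manual review, assign each to one of two raters,
    and choose a random number of them to be scored by both raters."""
    rng = random.Random(seed)                      # seeded for reproducibility
    sample = rng.sample(patient_ids, n_sample)     # sampling without replacement
    assignment = {pid: rng.choice(("rater_1", "rater_2")) for pid in sample}
    n_dup = rng.randint(*dup_range)                # bounds are inclusive
    duplicates = rng.sample(sample, n_dup)         # cases reported by both raters
    return sample, assignment, duplicates

# Hypothetical identifiers standing in for the 1172 eligible patients.
patient_ids = [f"case_{i:04d}" for i in range(1172)]
sample, assignment, duplicates = draw_review_sample(patient_ids)
```

Fixing the seed makes the draw auditable, which is one simple way to document that case selection was not influenced by the investigators.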

Tumour segmentation
We used a previously published tumour segmentation model (TumourSeg) for all cases, described in significant detail elsewhere (Ruffle et al., 2023a). In brief, this model is a high-resolution convolutional 3D U-Net implemented with nnU-Net (Isensee et al., 2021), a pipeline with proven high performance in semantic segmentation across a range of microscopic and macroscopic tasks (Antonelli et al., 2022; Isensee et al., 2020; Isensee et al., 2021). The model was trained on the BraTS training dataset of 1251 participants with 5-fold cross-validation, with additional external evaluation on cases acquired at the National Hospital for Neurology and Neurosurgery (Ruffle et al., 2023a). We ensured that none of the BraTS data used in model training were evaluated in this downstream task to prevent the possibility of an information leak. We compared the segmentation performance of TumourSeg to hand-annotated labels provided by the original dataset curators: quantitatively by the Dice-Sørensen coefficient and qualitatively by a neuroradiologist's visual review.
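For readers less familiar with the Dice-Sørensen coefficient, the following is a minimal sketch of how it could be computed per lesional compartment and for the whole tumour. The integer label codes are an assumption made for illustration (the source does not state them), as is the convention of scoring two empty masks as perfect agreement.

```python
import numpy as np

# Hypothetical label codes; the source does not specify them.
LABELS = {"enhancing": 1, "nonenhancing": 2, "perilesional": 3}

def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)

def dice_by_compartment(pred_labels, truth_labels):
    """Dice for each labelled compartment plus the whole-tumour mask."""
    pred_labels = np.asarray(pred_labels)
    truth_labels = np.asarray(truth_labels)
    scores = {name: dice_coefficient(pred_labels == code, truth_labels == code)
              for name, code in LABELS.items()}
    # Whole tumour: a single mask covering every abnormal voxel.
    scores["whole_tumour"] = dice_coefficient(pred_labels > 0, truth_labels > 0)
    return scores

# Toy one-dimensional example (illustrative values only).
pred = np.array([0, 1, 1, 2, 3, 3])
truth = np.array([0, 1, 2, 2, 3, 0])
scores = dice_by_compartment(pred, truth)
```

The same functions apply unchanged to three-dimensional label volumes, since the computation only counts voxels.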

Nonlinear registration with enantiomorphic normalisation
Having segmented lesions in native space, MRI sequences and lesion segmentation masks were nonlinearly registered to 1mm MNI space with Statistical Parametric Mapping (SPM) using enantiomorphic correction (Ashburner and Friston, 1999;Nachev et al., 2008).The advantage of enantiomorphic correction is that the risks of registration errors secondary to a lesion are minimised by leveraging a given patient's normal structural neuroanatomy on the unaffected contralesional hemisphere (Nachev et al., 2008).A neuroradiologist manually reviewed all imaging data at multiple stages of the data pre-processing.
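The principle behind enantiomorphic correction can be sketched in a few lines: lesioned voxels are temporarily replaced with the signal from the mirror-image location in the opposite hemisphere before the registration is estimated. This is a conceptual illustration only and not the SPM implementation; it assumes the volume is already roughly aligned so that the midsagittal plane sits at the centre of the left-right array axis, and the function and parameter names are hypothetical.

```python
import numpy as np

def enantiomorphic_fill(image, lesion_mask, lr_axis=0):
    """Replace lesioned voxels with their left-right mirrored counterparts.

    Assumes the midsagittal plane lies at the centre of `lr_axis`.
    """
    image = np.asarray(image, dtype=float)
    lesion = np.asarray(lesion_mask, dtype=bool)
    mirrored = np.flip(image, axis=lr_axis)   # reflect across the mid-plane
    filled = image.copy()
    filled[lesion] = mirrored[lesion]         # borrow contralateral signal
    return filled

# Toy 4x2 "image" with a single lesioned voxel (illustrative values only).
image = np.arange(8, dtype=float).reshape(4, 2)
lesion_mask = np.zeros((4, 2), dtype=bool)
lesion_mask[0, 0] = True
filled = enantiomorphic_fill(image, lesion_mask)
```

A limitation of this simple mirroring is that lesions crossing the midline would be filled with tissue that is itself abnormal, which is one reason manual review of the registered output remains worthwhile.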

VASARI-auto software development
We developed a fully automated pipeline, 'VASARI-auto', to derive VASARI features from lesion masks. Lesion masks could be of any source, whether manually traced, from an openly available tumour segmentation model, or from other lesion segmentation tools. VASARI-auto required data to be held in MNI registered space (prototyped at a 1 mm³ volumetric resolution, but deployable at any). We pooled neuroanatomical atlases for all brain lobes, as well as the brainstem, insula, thalamus, corpus callosum, internal capsule, ventricles, and cortex, for the derivation of location-based features. For each case, VASARI-auto loaded the multi-channel tumour segmentation (with separate labels for enhancing tumour, nonenhancing tumour, and perilesional signal change) and, following pre-existing VASARI reporting standards (TCIA, 2020), derived the following: F1 - tumour location; F2 - side of tumour epicentre; F4 - enhancement quality; F5 - proportion enhancing; F6 - the proportion of nonenhancing tumour; F7 - the proportion of necrosis; F9 - multifocal/multicentric lesional status; F11 - the thickness of the enhancing margin; F14 - the proportion of oedema; F19 - ependymal invasion; F20 - cortical involvement; F21 - deep white matter invasion; F22 - whether nonenhancing tumour crossed the midline; F23 - whether enhancing tumour crossed the midline; and F24 - the presence of satellite lesions. Notably, whilst initial tumour segmentation harnesses a trained, validated, and open-sourced deep learning segmentation model, VASARI-auto requires only mathematical derivation of features from the 3-dimensional lesion mask. No non-deterministic or nonlinear inferential statistics are involved, and the results are mathematically deterministic and reproducible.
Our code did not quantify a few VASARI features that require either non-brain-extracted data (F25 - calvarial remodelling) or the original MRI sequences (F10 - T1/FLAIR ratio; F12-13 - definition of the enhancing and nonenhancing margin; F18 - pial invasion; and F16 - haemorrhage), the reason being that we wished to develop an automated tool immediately usable with irrevocably anonymised lesion segmentation data without the requirement for raw volumetric neuroimaging. We similarly did not quantify F17 - diffusion changes, since DWI was not available for many cases in the external data, which was beyond our control.
We also did not quantify F8 - the presence of cysts, since most brain tumour segmentation models rely on BraTS lesion labels of enhancing tumour, nonenhancing tumour, and perilesional signal change, with no distinction for cysts. Therefore, we felt any attempt to model cyst presence would be liable to confabulation. We similarly did not quantify F3 - eloquence, for lack of appropriate brain masks to model it robustly; moreover, we did not wish to detract from the gold standard of a neurosurgeon's electrical stimulation assessment for eloquent-sparing resections (Ritaccio et al., 2018).
The requirements to run VASARI-auto are given below in the software subsection. We also recorded the time to quantify VASARI features with VASARI-auto, both with pre-generated lesion masks and when paired with TumourSeg (Ruffle et al., 2023a).
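To make concrete what deterministic derivation from a lesion mask can look like, the sketch below computes compartment proportions, the side of the lesion epicentre, and midline crossing from a labelled volume. It is an illustration under stated assumptions rather than the VASARI-auto code: the label codes are hypothetical, the left-right axis is assumed to be the first array axis with indices increasing from left to right, and the midline is taken as the central plane of the volume.

```python
import numpy as np

# Hypothetical label codes; the source does not specify them.
LABELS = {"enhancing": 1, "nonenhancing": 2, "perilesional": 3}

def compartment_proportions(seg):
    """Fraction of all lesional voxels occupied by each compartment."""
    seg = np.asarray(seg)
    total = np.count_nonzero(seg)
    if total == 0:
        return {name: 0.0 for name in LABELS}
    return {name: np.count_nonzero(seg == code) / total
            for name, code in LABELS.items()}

def side_of_epicentre(seg, lr_axis=0):
    """Side of the lesion's centre of mass relative to the central plane."""
    seg = np.asarray(seg)
    coords = np.argwhere(seg > 0)
    if coords.shape[0] == 0:
        return None
    centre = coords[:, lr_axis].mean()
    mid = (seg.shape[lr_axis] - 1) / 2
    if centre < mid:
        return "left"       # assumes index 0 is the leftmost slice
    if centre > mid:
        return "right"
    return "midline"

def crosses_midline(seg, code, lr_axis=0):
    """True if voxels with this label lie on both sides of the central plane."""
    seg = np.asarray(seg)
    xs = np.argwhere(seg == code)[:, lr_axis]
    if xs.size == 0:
        return False
    mid = (seg.shape[lr_axis] - 1) / 2
    return bool((xs < mid).any() and (xs > mid).any())

# Toy 6x4x4 volume (illustrative values only).
seg = np.zeros((6, 4, 4), dtype=int)
seg[0, 0, 0] = LABELS["nonenhancing"]
seg[0:2, 1, 1] = LABELS["enhancing"]
seg[1:5, 2, 2] = LABELS["perilesional"]
props = compartment_proportions(seg)
side = side_of_epicentre(seg)
```

Because each function is a pure count or comparison over voxels, repeated runs on the same mask return identical values, which is the sense in which such features are reproducible.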

Reporting agreement
Quantitatively, we compared agreement in all VASARI featurisation between 1) consultant neuroradiologists, 2) consultant neuroradiologists and VASARI-auto, and 3) VASARI-auto using the source semi-supervised and neuroradiology-reviewed segmentations and VASARI-auto using TumourSeg (Ruffle et al., 2023a; Ruffle, 2023). The neuroradiologist's label was always taken as the ground truth against which VASARI-auto would be assessed. Agreement was quantified by Cohen's Kappa (Pedregosa et al., 2011), linearly weighted for non-Boolean VASARI features. We also quantified the balanced accuracy in VASARI featurisation between consultant neuroradiologists (the ground truth) and VASARI-auto (the prediction), as well as the balanced accuracy between VASARI-auto using the source semi-supervised and neuroradiology-reviewed segmentations (the ground truth) and VASARI-auto using TumourSeg (the prediction) (Ruffle et al., 2023a).
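The two agreement metrics named above can be written out explicitly. The sketch below is a dependency-free illustration of unweighted and linearly weighted Cohen's Kappa and of balanced accuracy; the study itself cites scikit-learn (Pedregosa et al., 2011) for these, so this code is a didactic stand-in and not the analysis code. Ordered categories are assumed to be ranked by their sorted label values.

```python
from collections import Counter

def cohen_kappa(a, b, weights=None):
    """Cohen's Kappa; pass weights="linear" for ordered categories."""
    labels = sorted(set(a) | set(b))
    idx = {lab: i for i, lab in enumerate(labels)}
    n, k = len(a), len(labels)
    observed = Counter((idx[x], idx[y]) for x, y in zip(a, b))
    row = Counter(idx[x] for x in a)
    col = Counter(idx[y] for y in b)

    def disagreement(i, j):
        if weights == "linear":
            return abs(i - j)            # penalty grows with category distance
        return 0 if i == j else 1        # unweighted: any mismatch counts once

    obs = sum(disagreement(i, j) * c for (i, j), c in observed.items()) / n
    exp = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k)) / n ** 2
    if exp == 0:
        return float("nan")              # undefined when only one category occurs
    return 1.0 - obs / exp

def balanced_accuracy(truth, pred):
    """Mean per-class recall over the classes present in the ground truth."""
    recalls = []
    for lab in set(truth):
        n_lab = sum(1 for t in truth if t == lab)
        hits = sum(1 for t, p in zip(truth, pred) if t == lab and p == lab)
        recalls.append(hits / n_lab)
    return sum(recalls) / len(recalls)

# Toy ratings on a three-level ordinal feature (illustrative values only).
rater_a = [0, 0, 1, 1, 2, 2]
rater_b = [0, 1, 1, 1, 2, 0]
kappa_unweighted = cohen_kappa(rater_a, rater_b)
kappa_linear = cohen_kappa(rater_a, rater_b, weights="linear")
bal_acc = balanced_accuracy(rater_a, rater_b)
```

Linear weighting penalises a two-category disagreement twice as heavily as a one-category disagreement, which is why it suits graded features such as proportion bands, while Boolean features are unaffected by the choice.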
Qualitatively, in post hoc analyses, we also undertook a case-based review of 1) the results from TumourSeg, with direct comparison to the neuroradiologist hand annotation, and 2) the results of VASARI-auto with direct comparison to the VASARI featurisation of separate neuroradiologists.

Equity calibration
We quantified software and reporting patient equity (Abramoff et al., 2023;Carruthers et al., 2022) for all analysis steps.For tumour segmentation, we compared model performance by the Dice coefficient across all lesional compartments (enhancing tumour, nonenhancing tumour, perilesional signal change, and whole tumour [a single mask for all areas of abnormality]) for male and female sex and for all decades of age included in the cohort (20-90 years).We similarly compared Cohen's Kappa agreement metrics across male and female sex and all decades of age.
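A stratified comparison of this kind amounts to grouping a per-case metric by sex and by decade of age before comparing the groups. The following minimal sketch shows only the grouping step; the record fields and values are hypothetical, and the statistical test used to compare groups is not stated in the text, so none is implemented here.

```python
from collections import defaultdict
from statistics import mean

def decade_of_age(age):
    """Map an age in years to a decade band such as '60-69'."""
    low = int(age // 10) * 10
    return f"{low}-{low + 9}"

def stratified_means(records, metric="dice"):
    """Mean of a per-case metric grouped by sex and by decade of age."""
    by_sex, by_decade = defaultdict(list), defaultdict(list)
    for rec in records:
        by_sex[rec["sex"]].append(rec[metric])
        by_decade[decade_of_age(rec["age"])].append(rec[metric])
    return ({k: mean(v) for k, v in by_sex.items()},
            {k: mean(v) for k, v in by_decade.items()})

# Illustrative records only; these are not study data.
records = [
    {"sex": "F", "age": 34, "dice": 0.9},
    {"sex": "M", "age": 38, "dice": 0.8},
    {"sex": "F", "age": 61, "dice": 0.7},
]
mean_by_sex, mean_by_decade = stratified_means(records)
```

The same grouping can be applied to any per-case or per-feature metric, for instance Cohen's Kappa computed within each stratum.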

Efficiency, economic and workforce analysis
We statistically compared the time required to record VASARI features between 1) consultant neuroradiologists, 2) VASARI-auto with tumour segmentations already supplied, and 3) VASARI-auto with TumourSeg.
Next, we undertook an economic and workforce analysis, simulating neuro-oncology workload across the UK. Every week, a neuro-oncology multidisciplinary team (MDT) meeting is held to discuss all referrals, ongoing cases, and management plans, which includes a neuroradiological review of all cases. We reviewed the last three years of neuro-oncology MDT lists (2020-2023) and quantified the minimum and maximum number of cases discussed each week, which ranged from 30 to 75. We determined the minimum and maximum pay scales for consultants in the National Health Service workforce as of March 2024, which vary depending on years of service (NHS-Employers, 2023). We similarly quantified power consumption costs to run a reasonably powerful computer (1200 kilowatt hour (kWh)), based on UK energy tariffs as of March 2024 (sust-it.net, 2024). We curated a list of all UK neuro-oncology centres (n=40), kindly provided by the British Society of Neuro-Oncology, to simulate UK-wide neuro-oncology workloads.
Having derived these data, we simulated the next three years (2024-2027) of MDT clinical workload at each centre. A random number of MDT cases was simulated weekly using the minimum-maximum caseload observed through 2020-2023. We then simulated a random choice of neuroradiology consultant allocated to present a given week's neuro-oncology MDT, with their salary drawn randomly from the NHS consultant pay scales. We then randomly simulated the time taken to quantify VASARI features across all cases, where time per case was drawn from a random uniform distribution informed by the time taken for neuroradiologists to quantify all 100 cases in our earlier analysis. From this, we quantified the workload and financial cost if each patient had undergone VASARI featurisation by a neuroradiologist. We similarly quantified the time and power expense if VASARI-auto, or VASARI-auto with TumourSeg, had undertaken featurisation. We repeated this process over five iterations to ensure model stability and robustness to outliers.
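The structure of this simulation can be sketched as a simple Monte Carlo loop for a single centre. The weekly caseload range (30-75), the three-year horizon, and the five iterations follow the text; the bounds on seconds per case and on hourly pay are placeholder values chosen purely for illustration, since the source does not report them, and the function name is hypothetical.

```python
import random

def simulate_centre(weeks=156, cases_range=(30, 75),
                    secs_per_case=(120.0, 600.0),   # placeholder bounds
                    hourly_pay=(45.0, 65.0),        # placeholder GBP per hour
                    seed=0):
    """Simulate three years of weekly MDT workload for one centre."""
    rng = random.Random(seed)
    total_cases, total_hours, total_cost = 0, 0.0, 0.0
    for _ in range(weeks):
        n_cases = rng.randint(*cases_range)          # cases listed this week
        pay = rng.uniform(*hourly_pay)               # presenting consultant's rate
        hours = sum(rng.uniform(*secs_per_case) for _ in range(n_cases)) / 3600
        total_cases += n_cases
        total_hours += hours
        total_cost += hours * pay
    return total_cases, total_hours, total_cost

# Five iterations with different seeds, mirroring the repeated simulation.
runs = [simulate_centre(seed=s) for s in range(5)]
mean_hours = sum(r[1] for r in runs) / len(runs)
```

Replacing the per-case time distribution with the measured software runtime, and the salary with an energy tariff, would give the corresponding estimate for automated featurisation.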

Survival prediction
Lastly, we fitted linear regression models seeking to predict patient overall survival (OS) (in days) from VASARI features. These were in the formulation $\mathrm{OS} \sim 1 + F_1 + F_2 + \dots + F_n$, where $F_n$ denotes each VASARI feature. We fitted separate models using VASARI features quantified from 1) consultant neuroradiologists, 2) VASARI-auto using the source semi-supervised and neuroradiology-reviewed segmentations, and 3) VASARI-auto using TumourSeg (Ruffle et al., 2023a; Ruffle, 2023), from which we compared the quality of fit. We derived each feature's variance inflation factor to adjust for potential multicollinearity and excluded those whose value exceeded 10. Although the sample is large for its kind (n=100), it is relatively small in absolute terms; we therefore deliberately chose not to use nonlinear or machine learning models, nor to partition data into train and test datasets, either of which would be highly liable to overfit in such an instance. Since our task here is to benchmark the utility of features derived by neuroradiologists compared to our developed machinery, using nonlinear complex models that are liable to overfit would arguably be inappropriate in this setting.
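The modelling steps in this section, an ordinary least squares fit with an intercept, its R², and screening of features by variance inflation factor, can be sketched as follows. This is an illustration with NumPy under stated assumptions and not the study's analysis code: exclusion is applied in a single pass at the stated threshold of 10 (the text does not say whether exclusion was iterative), and the synthetic data serve only to exercise the functions.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares fit of y ~ 1 + X; returns coefficients and R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # intercept column first
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return beta, r2

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), regressing feature j on all other features."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, r2 = fit_ols(others, X[:, j])
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(vifs)

def drop_collinear(X, threshold=10.0):
    """Single-pass exclusion of features whose VIF exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    keep = variance_inflation_factors(X) <= threshold
    return X[:, keep], keep

# Synthetic features: the third is a near-copy of the first.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
x3 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2, x3])
X_kept, keep = drop_collinear(X)
beta, r2 = fit_ols(x2[:, None], 2.0 + 3.0 * x2)
```

In the synthetic example the two near-duplicate columns both exceed the threshold and are removed together; an iterative scheme that drops one feature at a time would retain one of them, which is a design choice the analyst would need to make explicitly.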

Analytic compliance
All analyses were performed and reported following international TRIPOD and PROBAST-AI guidelines (Collins et al., 2021).

Code, model, and data availability
The software for VASARI-auto shall be openly available upon publication at https://github.com/jamesruffle/vasari-auto. All patient data utilised in this article are freely and openly available (Bakas et al., 2022b; Calabrese et al., 2022).

Compute
All experiments were performed on a 64-core Linux workstation with 256 GB of RAM and an NVIDIA 3090Ti GPU.

Cohort
The brain tumour patient cohort included 56 male and 44 female participants, with a mean age ± standard deviation of 61 ± 13.49 years. The mean overall survival was 436 ± 462.21 days. Seventy-four participants were included from UPenn-GBM, and 26 from UCSF-PDGM. There were no significant differences in age, sex, or survival between participants at either site, indicating a well-standardised and representative multi-site sample.

Segmentation
A comparison of TumourSeg tumour segmentation (Ruffle et al., 2023a; Ruffle, 2023) with the externally curated semi-supervised labels showed a mean Dice segmentation performance of 0.95 ± 0.05 for the whole tumour, 0.89 ± 0.07 for the enhancing tumour, 0.86 ± 0.11 for the nonenhancing tumour, and 0.91 ± 0.06 for the perilesional signal change. A visual overlay of lesion segmentations on the brain showed no spatial discrepancy (Figure 2). There was no significant difference in segmentation performance between male and female sexes or across decades of age, indicating an equitable tumour segmentation model (Figure 3).
In contrast, when treating VASARI-auto derived from the external semi-supervised lesion labels as the ground truth, VASARI-auto accuracy using the tumour segmentation model was much more stable, with a mean accuracy of 97.40 ± 0.03%. Case-based examples are shown in Figure 5.

[Figure caption, beginning missing] ... neuroradiologist reporters (green), between neuroradiologists and VASARI-auto (orange), and between VASARI-auto with and without using TumourSeg (purple) shows inter-rater variability between neuroradiologists but quantitatively higher agreement and stability between both VASARI-auto methods. B) Accuracy between neuroradiologist VASARI reporting (the ground truth, GT) and VASARI-auto (orange), and between VASARI-auto with and without using TumourSeg (purple). VASARI-auto is generally performant compared to neuroradiologists, although some discrepancies are evident due to diverging definitions of what is referred to as a nonenhancing tumour and what is oedema. VASARI-auto across-model comparison is highly accurate. Abbreviations: nCET, non-contrast-enhancing tumour; WM, white matter.

[Figure caption, beginning missing] ... between Neuroradiologist #1, Neuroradiologist #2, and VASARI-auto. For each case, the time taken to record is listed (in seconds), followed by a selection of VASARI features that are colour-coded depending on whether there is full concordance between both neuroradiologists and VASARI-auto (green), partial concordance between VASARI-auto and one neuroradiologist (orange), or discordance (red).

Efficiency
The use of VASARI-auto in VASARI featurisation, regardless of whether used in isolation or when paired with the tumour segmentation model, was significantly faster per case compared to consultant neuroradiologists (p<0.0001) (Figure 6). The mean time to quantify was 3.03 ± 0.59 seconds with VASARI-auto alone; it was significantly longer, but notably still efficient, at 15.47 ± 1.56 seconds (95% CI) using VASARI-auto with TumourSeg (p<0.0001). In comparison, the mean time to quantify was 317.46 ± 96.89 seconds (i.e., 5.29 minutes) with consultant neuroradiologists.

Simulated workforce analysis
The simulated workforce analysis forecast that, over 2024-2027, a cumulative total of 8150 ± 168 cases would require discussion across the weekly neuro-oncology MDT meetings of a single centre (Figure 6). For VASARI featurisation to be undertaken in all cases, this would demand 744.43 ± 15.54 consultant neuroradiologist workforce hours, equating to £39,373.37 ± £864.22 in salary remuneration for hours worked. In contrast, quantifying VASARI features with VASARI-auto for all cases over three years would require 8.30 ± 0.15 hours of computing time (time comparison p<0.0001), equating to approximately £3.65 ± 0.12 for power costs (cost comparison p<0.0001). If combined with tumour segmentation, this time and expense would rise slightly to 34.89 ± 0.70 hours of computing time and £15.17 ± 0.55 for power costs (both of which remained significantly less than with neuroradiologist labelling). Time taken and costs remained significantly greater for featurisation by neuroradiologists compared to VASARI-auto (p<0.0001).
We scaled this up to all 40 neuro-oncology centres across the UK. For VASARI featurisation to be undertaken in all UK cases, this would demand 29,777.39 consultant neuroradiologist workforce hours, equating to £1,574,935 in salary remuneration for hours worked. In contrast, quantifying VASARI features with VASARI-auto for all cases over three years would require 331.95 hours of computing time, equating to approximately £145.85 for power costs. If combined with tumour segmentation, this time and expense would rise slightly to 1394.42 hours of computing time and £606.75 for power costs. Both time taken and cost were significantly greater for featurisation by neuroradiologists compared to VASARI-auto with or without the addition of TumourSeg (all p<0.0001).

Performance equity
A critical performance measure of any automated tool in healthcare is invariance across patient background characteristics (Carruthers et al., 2022). We compared reporting agreement between 1) neuroradiologists, 2) neuroradiologists and VASARI-auto, and 3) VASARI-auto when applied to the externally curated tumour segmentations or with TumourSeg (Ruffle et al., 2023a), with respect to patient age and sex, using Cohen's Kappa (Figure 7). There was no evidence of reporting inequity between neuroradiologists or between neuroradiologists and VASARI-auto (allowing for the more limited distribution of demographics for those patients double-reported by neuroradiologists). Similarly, agreement between VASARI-auto using manually traced or model-derived lesion segmentations was equally performant across patient age and sex, all of which indicates the equity of VASARI-auto and the tumour segmentation model.

Survival prediction
The clinical utility of any feature is ultimately determined by its downstream predictive, prescriptive, or inferential power. Fidelity in overall survival prediction using VASARI features was qualitatively similar whether using feature sets derived by consultant neuroradiologists, from VASARI-auto applied to the semi-supervised and neuroradiology-reviewed segmentations, or from VASARI-auto paired with TumourSeg (Figure 8). Quantitatively, the best linear regression fit was achieved with VASARI-auto using the semi-supervised and neuroradiology-reviewed segmentations (R² = 0.245), followed closely by VASARI-auto using TumourSeg (R² = 0.227), with slightly weaker performance when using the consultant neuroradiologist-labelled VASARI features (R² = 0.205).
Feature-wise, F21 deep white matter invasion was significantly associated with poorer overall survival (p=0.028). A greater proportion of enhancing tumour, a parietal location, and multifocality all trended towards poorer overall survival, albeit non-significantly (p=0.173, p=0.109, and p=0.131, respectively). Full model coefficients are provided in the supplementary material.

[Figure caption, beginning missing] ... (orange), or C) VASARI-auto combined with TumourSeg (purple). X-axes illustrate the actual survival, whereas y-axes illustrate predicted survival. There is highly similar qualitative performance in survival prediction regardless of whether a neuroradiologist or VASARI-auto labels it, although quantitatively, the R² is higher with both VASARI-auto assessments.

Discussion
We present VASARI-auto, an automated system for deriving VASARI features from glioma imaging using tumour segmentations alone. Our evaluation shows high accuracy, greater consistency than the inter-rater agreement between neuroradiologists, and equitable performance across age and sex. We show VASARI-auto could save time and resources within each radiology department, equating over three years to 771 hours of consultant neuroradiologist time or ~£40,000 (>$50,000) in NHS finance terms, given the workload of a neuro-oncology centre such as ours. Scaled across the UK, the saving is anticipated to be more than £1.5 million ($1.9 million). Framed differently, such software would enable these workforce hours to be reallocated to other areas of unmet clinical need. We furthermore show that patient survival forecasting is non-inferior when using these automated models, demonstrating the preservation of feature fidelity.

Adding value with AI-assisted practice
Despite being well-validated in research to provide well-structured information on the imaging appearances of glioma, presenting an opportunity for quantitative tumour surveillance, the VASARI feature set is seldom used in clinical practice. The causes for this are multifactorial but are likely a combination of high clinical workload (VASARI is time-consuming to record) and a lack of sufficient neuroradiology training and experience.
Our software substantially lowers the barriers to adopting VASARI scoring while maintaining fidelity and assuring patient equity. Its introduction provides a means of extracting more detailed patient-personalised information, aiming to improve clinical care at a very modest cost in either time or financial terms. This is particularly pertinent in the UK, where the number of radiologists per 100,000 population, only 7, is one of the lowest in Western Europe (Piorkowska et al., 2017): such decision support tools add high value and may even free up an already overstretched workforce to work in other clinical areas.
A critical measure of the value of any feature, automated or manual, is downstream utility, such as survival prediction. Our analysis shows non-inferior (indeed, quantitatively higher) predictive fidelity in using VASARI-auto features over those curated by consultant neuroradiologists. Demonstrating non-inferiority in software that is resource-cheap, contrasted with a time-consuming process for experienced neuroradiologists, is essential: software that provides inferior care to the current clinical standard adds little value, regardless of any efficiency or cost saving.

Maximising reporting consistency
Clinicians' opinions-whether radiologists or others-often differ.This is to be expected: diseases are typically heterogeneous.Patients, too, are heterogeneous: a successful treatment approach for one might not be suitable for another (Rajpurkar et al., 2022).However, a model capable of absorbing heterogeneity can yield a quantitative description that exhibits consistency across the population concerning a critical decision.From follow-up monitoring of tumours, we know that conventional two-dimensional measurements can be highly inconsistent between radiologists (Dempsey et al., 2005;McNitt-Gray et al., 2015;Zhao et al., 2009), motivating the pursuit of alternative approaches.Though we find it unlikely that a radiologist's work will be replaced entirely by software, harmonising human domain expertise with software-driven quantitative analytics seems inevitable to advance the clinical status quo.Notably, many comparisons could be drawn between this viewpoint and the commercial sector, where substantially greater AI development is currently undertaken.While car manufacturers increasingly navigate towards AI-assisted driving, steering wheels are unlikely to be removed anytime soon.
What are the characteristics of an optimal approach?The ideal would be to absorb all variation irrelevant to the task.For example, where measurements are undertaken by manual annotation-which one should note is the currently adopted clinical practice globally, despite their empirically observed limitations (Dempsey et al., 2005;McNitt-Gray et al., 2015;Zhao et al., 2009)-this is trivial to ameliorate using automated software and relatively simple mathematics.We exemplify this here, showing only modest inter-rater agreement between highly experienced neuroradiologists that can be stabilised and standardised with automated methods.Particularly pertinent examples are in deriving VASARI features (or, for that matter, any other radiological feature outside the scope of this article) that are ultimately quantitative.
Where the quality of a lesion segmentation is validated, as we show in our comparison between source segmentations and those from the segmentation model, deriving precise proportions of lesional compartments, such as enhancing tumour, nonenhancing tumour, and perilesional signal change, reduces to a simple operation of compartmental ratios. This is especially true for quantitative features that are harder to quantify intuitively, such as the thickness of enhancing tumour. Gliomas, particularly glioblastoma, are highly variable in their appearance. One part of an enhancing margin (if any) might be considerably thicker than another: how do we measure this? We would argue that the wrong answer (although commonplace in clinical reporting) would be to hedge an approximation between the lower and upper limits. Instead, a more robust solution is a simple mathematical derivation operating on a lesion segmentation (Ruffle et al., 2023a). However, the difficulty one faces, as is evidenced in these works, is where ambiguity in the ground truth (namely, what is nonenhancing tumour and what is oedema) compounds an assessment regardless of whether it is derived by an experienced neuroradiologist or by software.
This problem is unlikely to be solved by clinical experience, status quo imaging techniques, or software alone, but rather by innovation across all three.
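The compartmental ratios discussed above can be illustrated with a minimal sketch. It assumes a lesion segmentation flattened to a sequence of integer voxel labels; the label codes (0 for background, 1 for enhancing tumour, 2 for nonenhancing tumour, 3 for perilesional signal change) and the function name are hypothetical choices made for illustration and are not taken from VASARI-auto itself.

```python
from collections import Counter

# Hypothetical label codes, assumed for illustration only.
LABELS = {
    1: "enhancing tumour",
    2: "nonenhancing tumour",
    3: "perilesional signal change",
}


def compartment_proportions(voxel_labels):
    """Return each compartment's share of all lesional voxels."""
    # Count only lesional voxels; background (0) is ignored.
    counts = Counter(v for v in voxel_labels if v in LABELS)
    total = sum(counts.values())
    if total == 0:
        # No lesional voxels: report zeros rather than dividing by zero.
        return {name: 0.0 for name in LABELS.values()}
    return {name: counts[code] / total for code, name in LABELS.items()}


# Toy, made-up segmentation of ten voxels.
example = [0, 0, 1, 1, 2, 3, 3, 3, 3, 0]
proportions = compartment_proportions(example)
```

Because the proportions are computed directly from voxel counts, repeated runs on the same segmentation return identical values, which is the sense in which such a derivation is standardised.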

Maximising performance equity
Healthcare should be equitable, which extends to any such tool at our clinical disposal (Abramoff et al., 2023; Carruthers et al., 2022). Artificial intelligence is one of the fastest-growing domains across research, industry, and society, with many purported applications across medicine. Yet equitable calibration, to ensure that software brings benefits to all, is relatively rarely quantified. For these reasons, we assess performance equity, not only of VASARI-auto but also of the adopted tumour segmentation model. Though confined here to age and sex, the approach can be scaled through representation learning to encompass any characteristic (Carruthers et al., 2022).
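As a minimal sketch of how a subgroup performance comparison of this kind might be laid out, the example below computes a Dice coefficient per case and averages it within each group. It assumes each segmentation is represented as a set of lesional voxel indices; the function names and the toy cases are hypothetical and do not reproduce the analysis reported in this study.

```python
from collections import defaultdict
from statistics import mean


def dice(pred, truth):
    """Dice coefficient between two sets of lesional voxel indices."""
    if not pred and not truth:
        # Both empty: treated as perfect agreement (a convention, assumed here).
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def mean_dice_by_group(cases):
    """cases: iterable of (group, predicted_voxels, reference_voxels)."""
    scores = defaultdict(list)
    for group, pred, truth in cases:
        scores[group].append(dice(pred, truth))
    return {group: mean(values) for group, values in scores.items()}


# Toy, made-up cases for illustration only.
cases = [
    ("female", {1, 2, 3}, {2, 3, 4}),
    ("female", {5, 6}, {5, 6}),
    ("male", {1, 2, 3, 4}, {1, 2, 3, 4}),
    ("male", {7, 8}, {8, 9}),
]
by_sex = mean_dice_by_group(cases)
```

The same grouping key could equally be a decade of life or any other characteristic; a formal equity assessment would additionally compare the full distributions rather than the means alone.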

Limitations
Our study has limitations. First, although drawn from a larger cohort of 1172, we utilise a sample of 100 patients with glioma who have undergone comprehensive clinical VASARI featurisation by experienced neuroradiologists. Although large for the domain, further validation should be undertaken at a greater scale to evaluate broader generalisability. This sample, however, is carefully curated and includes imaging from two major US medical centres, for which we could evaluate the performance of the tumour segmentation model and VASARI-auto, both separately and taken together. Second, we could not incorporate all features of VASARI in the software, specifically those that required structural neuroimaging, additional sequences beyond those provided by external repositories, or where variability/confabulation could occur. This decision was deliberate, for we wished to develop software that did not use patient-identifiable data, heterogeneous sequence data (in time and place), or computationally intensive processing pipelines. Whilst this choice precludes assessment of some of the VASARI features, the current pipeline requires merely a lesion segmentation, which we consider a strength. Moreover, VASARI-auto can be undertaken in a privacy-preserving setting, with trivial computing requirements, and from data acquired on any MRI scanner. The appeal here is that rather than being limited to specific MRI scanners or specialist centres, the framework is immediately scalable to any centre, even with limited hardware resources. Future work should, however, expand upon this to include these remaining features.
Third, the extent of the VASARI-auto featurisation pipeline was gated by the availability of widely used lesion compartment labels, namely enhancing tumour, nonenhancing tumour, and oedema (Baid et al., 2021). Therefore, our software could not provide VASARI data on haemorrhagic change because no label exists in the source data (Bakas et al., 2022b; Calabrese et al., 2022): it cannot learn what it has not been taught (Ruffle et al., 2023a). There are evolving opinions across neuro-oncology as to what may be oedema and what is nonenhancing tumour: it is for this reason that we use the terminology of 'perilesional signal change' in discussing the segmentation pipeline. Moreover, it is the reason for lower agreement between VASARI-auto and neuroradiologist reporting for the lesion proportion features, since the software is guided by the status quo where such perilesional signal is referred to as oedema, though some radiologists might instead label it as nonenhancing tumour. These viewpoints are changing because what was classically referred to as oedema has been shown to contain tumour cells under biopsy (Barajas et al., 2012). This ambiguity will impact model performance, for we are gated by ground truth labels that discretise nonenhancing tumour and perilesional signal change based on structural MRI sequences, despite there being no gold standard test to confirm whether tumoural cells are definitively present within the signal abnormality. Though no 'silver bullet' imaging technique currently exists to remedy this, advanced imaging techniques (including diffusion and perfusion) may help to resolve it in the future (Alsulami et al., 2023; Soni et al., 2018; Wurtemberger et al., 2022). In any case, it should be stressed that the values themselves are of lesser importance here; of far greater importance is that lesion segmentation yields a more robust and standardised assessment across a cohort of patients, likely the reason for stronger performance in downstream survival prediction.
Fourth, since our accuracy metric is quantified from the neuroradiologist label as its ground truth, it would not be appropriate to claim superior accuracy of VASARI-auto over a neuroradiologist here. However, we can quantify consistency between neuroradiologists, between the software and the radiologist, and between different input types of the software, which is akin to quantifying uncertainty. To that end, we illustrate that the derivation of VASARI features such as the thickness of the enhancing tumour margin and the presence of satellite lesions was radically inconsistent between neuroradiologists, whereas these features were far more reproducible between VASARI-auto runs.
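Consistency of this kind can be summarised with Cohen's kappa, the agreement statistic reported in Figure 7. The following is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance; the function name and the toy ratings of a binary feature are hypothetical and for illustration only.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("Both raters must label the same, non-empty set of cases.")
    # Observed agreement: fraction of cases given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / n**2
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so return 1.0 by convention (an assumption here).
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy, made-up ratings of a binary feature (1 = present, 0 = absent).
reader_1 = [1, 1, 0, 0]
reader_2 = [1, 0, 0, 0]
kappa = cohens_kappa(reader_1, reader_2)
```

The same function applies unchanged to any pairing considered here: two neuroradiologists, a neuroradiologist and the software, or two software runs on different input segmentations.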
Finally, our economic cost analysis assumes a VASARI feature set is undertaken in all cases assigned to the neuro-oncology MDT. This is an upper bound: VASARI is seldom used because of the time and level of professional training it demands. Furthermore, given the sharp rise in the volume of medical imaging undertaken for patients globally (Piorkowska et al., 2017; Smith-Bindman et al., 2008), it is likely that the hours and cost incurred for radiologists to featurise these cases are a significant under-representation. However, precisely to that point, one should consider the economic analysis as highlighting a gain in healthcare value at negligible time or financial cost.

Conclusions
VASARI-auto can characterise glioma efficiently, effectively, and equitably. The use of VASARI-auto-derived features in predicting patient survival is non-inferior to the use of those manually curated by experienced consultant neuroradiologists. Translation to the clinical frontline with an automated derivation of these features may enhance existing radiology practice with negligible cost to an imaging department, serving as a decision support tool to provide healthcare providers with more information to facilitate standardised, equitable, and more personalised patient care.

Figure 2. Tumour segmentation equitable calibration. A-B) Heatmap of tumour location derived from semi-supervised external neuroradiologist hand segmentations (A) and from segmentation model TumourSeg (B) shows the two to be highly similar, indicative of spatially equitable intracranial performance. C-D) Box and whisker (C) and radar (D) plots depict tumour segmentation model performance by Dice coefficient across whole tumour (WT), enhancing tumour (ET), nonenhancing tumour (NET), and perilesional signal change (PS), illustrating that tumour segmentation is equally performant across both male and female patients (C), and across all decades of life (D).

Figure 3. Randomised case-based review of tumour segmentation. Randomly selected sample of 16 patients of different ages and sexes, with their contrast-enhanced T1-weighted imaging and the TumourSeg model result overlaid. Correctly segmented lesional voxels are colour-coded as orange for enhancing tumour (ET), purple for nonenhancing tumour (NET), and pink for perilesional signal change (PS). Any misclassified voxels (whether false positive, false negative, or correctly lesional but the wrong tissue class) are colour-coded in black.

Figure 5. Randomised case-based review of VASARI featurisation. A) A randomly selected sample of 5 patients of different ages and sexes, with their contrast-enhanced T1-weighted imaging and the TumourSeg model result overlaid. Correctly segmented lesional voxels are colour-coded as orange for enhancing tumour (ET), purple for nonenhancing tumour (NET), and pink for perilesional signal change (PS). B) Sample VASARI featurisation comparison

Figure 6. Efficiency, economic and workforce planning analysis. A) The time taken for a neuroradiologist to derive a VASARI feature set for a single patient (green) is substantially higher than with either VASARI-auto (orange) or using VASARI-auto paired with TumourSeg (purple). B) Simulated economic and workforce analysis, where weekly neuro-oncology multidisciplinary team (MDT) workload is drawn from a random uniform distribution based upon the last three years of workload at our centre. Thin individual lines represent different simulation runs to emulate the forty different UK neuro-oncology centres, with thicker lines representing the epoch mean. Cumulative financial cost (dashed line) and time taken (solid

Figure 7. VASARI featurisation equitable calibration. A-C) Radar plots showing Cohen's kappa agreement aligned to male (green) or female (orange) patient sex across all decades of life. A) Inter-rater agreement between neuroradiologists, B) between neuroradiologists and VASARI-auto, and C) between VASARI-auto using manually traced and software-derived lesion segmentations.

Figure 8. Downstream inference with patient outcome prediction. Results of linear regression predicting overall patient survival in days using VASARI features derived by A) neuroradiologists (green), B) VASARI-auto from the semi-supervised external segmentations