Single-timepoint low-dimensional characterization and classification of acute versus chronic multiple sclerosis lesions using machine learning

Multiple sclerosis (MS) is a chronic inflammatory and neurodegenerative disease characterized by the appearance of focal lesions across the central nervous system. The discrimination of acute from chronic MS lesions may yield novel biomarkers of inflammatory disease activity which may support patient management in the clinical setting and provide endpoints in clinical trials. On a single timepoint and in the absence of a prior reference scan, existing methods for acute lesion detection rely on the segmentation of hyperintense foci on post-gadolinium T1-weighted magnetic resonance imaging (MRI), which may underestimate recent acute lesion activity. In this paper, we aim to improve the sensitivity of acute MS lesion detection in the single-timepoint setting, by developing a novel machine learning approach for the automatic detection of acute MS lesions, using single-timepoint conventional non-contrast T1- and T2-weighted brain MRI. The MRI input data are supplemented via the use of a convolutional neural network generating "lesion-free" reconstructions from original "lesion-present" scans using image inpainting. A multi-objective statistical ranking module evaluates the relevance of textural radiomic features from the core and periphery of lesion sites, compared within "lesion-free" versus "lesion-present" image pairs. Then, an ensemble classifier is optimized through a recursive loop seeking consensus both in the feature space (via a greedy feature-pruning approach) and in the classifier space (via model selection repeated after each pruning operation). This leads to the identification of a compact textural signature characterizing lesion phenotype. On the patch-level task of acute versus chronic MS lesion classification, our method achieves a balanced accuracy in the range of 74.3-74.6% on fully external validation cohorts.


Introduction
Multiple sclerosis (MS) is a chronic immune-mediated neurodegenerative disease affecting the central nervous system. Its pathological hallmark is the accumulation of demyelinated lesions or plaques , detectable on conventional T2-weighted magnetic resonance imaging (MRI) as areas of white matter hyperintensity (WMH) relative to the normal-appearing white matter (NAWM) ( Traboulsee and Li, 2006 ). The delineation of WMHs therefore reveals the spatial distribution of MS Abbreviations: BBB, blood brain barrier; CNN, convolutional neural network; CSF, cerebro-spinal fluid; DMT, disease-modifying therapy; EDSS, expanded disability status scale; GAN, generative adversarial network; Gd + , gadolinium enhancement; ML, machine learning; MRI, magnetic resonance imaging; MS, multiple sclerosis; NAWM, normal-appearing white matter; NET2, new or enlarging T2 MS lesion; RFE, recursive feature elimination; ROI, region of interest; TA, texture analysis; WMH, white matter hyperintensity.
damage but does not discriminate lesions that have recently formed and may be undergoing active demyelination (known as acute ) from lesions that have been present for some time (known as chronic 1 ). The detection and quantification of acute lesions is relevant to the clinical management of patients with MS, in whom evidence of recent disease activity may guide treatment decisions such as switching anti-inflammatory disease-modifying therapies (DMTs) and may support short-and longterm disease prognostication ( Wattjes et al., 2021 ). In clinical trials evaluating MS DMTs, evidence of recent acute lesion activity is also used to define inclusion criteria and enrichment strategies, while measurements of ongoing acute lesion activity are commonly used as endpoints ( Calabresi et al., 2014 ;Kapoor et al., 2018 ;Wattjes et al., 2021 ). The formation of new lesions is generally accompanied by a transient period of blood brain barrier (BBB) breakdown and acute inflammatory activity ( Kappos et al., 1999 ). Following the intravenous administration of a gadolinium chelate contrast agent, areas of BBB breakdown can be detected as hyperintensities on post-contrast T1-weighted MRI. The delineation of areas of gadolinium enhancement (Gd + ) thus allows for the detection of acute lesions at a single timepoint; however, due to the relatively short half-life of BBB disruption and Gd + persistence ( Cotton et al., 2003 ;Guttmann et al., 2016 ), acute lesion detection based on this method severely underestimates acute pathology. This method is also invasive and potentially nephrotoxic ( Perazella, 2008 ). Repeated gadolinium-based contrast agent administrations may also result in gadolinium accumulation in the human brain, the clinical implications of which are currently poorly understood ( Guo et al., 2018 ).
Beyond Gd + status, acute lesions can also be detected as focal areas of new or substantially enlarging T2-weighted (NET2) lesions via the comparison of a present T2-weighted MRI scan with a previously acquired reference scan ( Altay et al., 2013 ). Notably, when a previous MRI scan is available from up to 24 weeks prior, NET2 counts are generally 3-fold to 5-fold higher than the single-timepoint detection of Gd + lesion counts ( Hauser et al., 2017 ;Moraal et al., 2010 ). The requirement of a reference scan to identify NET2 can, however, delay diagnosis or treatment decisions and incurs an economic burden ( Kobelt et al., 2017 ). Furthermore, by construction, the identification of NET2 only allows detection of the subset of acute lesions that emerge from the NAWM and fails to detect acute activity within existing WMHs. Since acute lesions may also form in pre-existing lesion locations (as demonstrated by the detection of Gd + foci within prior WMHs), some degree of acute lesion activity may be missed using NET2 detection techniques. In particular, acute lesions occurring in areas of pre-existing WMH which are captured outside of their Gd + phase will thus be missed under the conventional definition of "MRI-acute " lesions.
The hypothesis that "MRI-acute " lesions may underestimate the extent of pathologically "active " demyelination is further supported by the high prevalence of "active " plaques observed in pathology studies reporting 70-80% of patients with at least one "active " plaque between age 35-50 years-old ( Frischer et al., 2015 ), which exceeds the prevalence of "MRI-acute " lesions generally reported in age-matched clinical trial populations -albeit MS trials typically involve an acute lesion enrichment selection bias inherent to disease activity eligibility criteria.

Objectives
These limitations highlight the need for a single-timepoint acute lesion detection method that could identify the totality of recent acute MS lesion activity (including gadolinium-negative lesions in areas of preexisting lesions). This work seeks to augment the sensitivity of crosssectional methods of acute lesion detection in MS by capturing MRI biomarker signatures identified within ground truth examples of both Gd + and NET2 lesions, from non-contrast conventional T1-and T2weighted MRI only. In this work, ground truth NET2 lesions were defined based on the comparison of two T2-weighted MRI scans acquired at most 24 weeks apart.
Using intensity-based radiomic features from the core and periphery of lesion sites, compared with their counterparts on a "lesion-free " synthetic image, we identify patterns of MRI signals that are discriminative of recent acute MS lesion activity and reflective of ongoing demyelination. For notational compactness, we will denote this set of radiomic features enriched with lesion-free information as " -radiomics ". The relevant patterns of -radiomics are interpreted by an ensemble of machine learning (ML) algorithms trained to discriminate acute from chronic MS lesions. The resulting ensemble predicts the label (i.e., acute/chronic) of an individual MRI voxel using MRI signals aggregated across a cubic patch centered on that voxel.
The methodological contributions of this work are as follows: • Using a convolutional neural network (CNN) for image inpainting, we generate synthetic "lesion-free " T1-and T2-weighted brain MRI scans from original "lesion-present" scans, and compare textural information contained in "lesion-present " versus "lesion-free " scans. • We design a patch sampling strategy combined with a simple adaptive region of interest (ROI) definition rule, which allows us to segment a "core " and "periphery " regions from lesion sites that may correspond to either focal or confluent MS lesions. • We present an end-to-end greedy iterative feature-pruning algorithm for feature selection (i.e., imaging biomarker discovery) exploiting textural characteristics within lesions and their periphery, coupled with an ensemble classifier optimization module applied at each feature pruning step.

Texture analysis in MS
Several studies have investigated the utility of texture analysis (TA) on MRI in MS. It is hypothesized that pathological processes in MS induce structural changes expressed on the nanometer to micrometer scale, which manifest as voxel pattern changes on conventional MRI images ( Zhang et al., 2011 ). It follows that the detection of textural biomarkers associated with specific types of biological activity may support the identification of MS lesions from single-timepoint conventional MRI. The feasibility of this approach is further strengthened by previous work showing that TA on MRI is relatively unaffected by the image acquisition variables ( Mayerhoefer et al., 2009 ;Savio et al., 2010 ) and as such offers robust biomarkers suitable for use in routine clinical care of patients with MS, where image acquisition parameters may vary.
Specifically, TA has been investigated for its ability to discriminate MS plaques from the NAWM using conventional cross-sectional T1and T2-weighted MRI Zhang et al., 2008 ), to discriminate Gd + lesions from chronic lesions using T2-weighted MRI ( Michoux et al., 2015 ;Yu et al., 1999 ), and for the precise delineation of Gd + lesions using post-contrast T1-weighted MRI ( Karimaghaloo et al., 2013 ). Furthermore, TA has been leveraged to characterize acute lesions, by spatially distinguishing a demyelinating core from a border of inflammation ( Drabycz and Mitchell, 2008 ), measuring tissue injury in acute foci ( Zhang et al., 2009 ) and by predicting the persistence of T1 black holes beyond acute onset ( Zhang et al., 2011 ). Building upon these prior studies, our work leverages radiomics analysis, which consists of mining a large number of quantitative imaging variables ( Zwanenburg et al., 2020 ). Radiomics analysis has been suggested to hold great promise in the transition towards large-scale medical imaging studies, whereby the abundance of visual evidence can be leveraged by ML techniques to address clinical challenges ( Gillies et al., 2016 ;Lambin et al., 2012 ) -in this context, our work also builds upon prior applications of ML to augment lesion segmentation in MS including for WMHs ( Fartaria et al., 2016 ;Zeng et al., 2020 ;Zhong et al., 2014 ), Gd + ( Coronado et al., 2021 ;Gaj et al., 2021 ;Narayana et al., 2020 ), and NET2 lesions ( Elliott et al., 2013 ;Salem et al., 2018 ).

Prior work
In this study, NET2 lesions are defined as those WMHs that are less than 24 weeks old. As such, the task of acute versus chronic MS lesion classification can be seen as a binarized version of the task of MS lesion age estimation. In this context, Sweeney et al. ( Sweeney et al., 2021 ) recently developed a ML-based algorithm for MS lesion age prediction

MRI data
Brain MRI scans from three large-scale double-blind phase 3 pivotal trials were retrospectively analyzed, including T1-and T2-weighted brain MRI scans from ADVANCE (1512 subjects with relapsing-remitting MS; NCT00906399), ASCEND (886 subjects with secondary progressive MS; NCT01416181), and DECIDE (1841 subjects with relapsingremitting MS; NCT01064401). Study details and MRI acquisition parameters for these trials have been reported previously ( Calabresi et al., 2014 ;Kapoor et al., 2018 ;Kappos et al., 2015 ) and are summarized in Table 1 . For each participant, one baseline study and a series of followup studies were conducted; each subject could thus contribute multiple observations to our analysis. Each study was associated with at least one T2-weighted MRI scan as well as T1-weighted MRI scans pre-and post-gadolinium injection. For ADVANCE, the follow-up MRI scans were acquired at 24-, 48-, and 96-weeks post-baseline; for ASCEND at 24, 48, 72, 96, 108, and 156 weeks, and for DECIDE at 24 and 96 weeks. A subset ( n = 142) of participants from ADVANCE had follow-up scans every 4 weeks up to 24 weeks post-baseline. In total, 11,267 pairs of T1-and T2-weighted brain MRI scans were analyzed in this study.

Ground truth
In the context of the original clinical trial, MRI scans were analyzed by a central reading center (NeuroRx Research, Montreal, QC, Canada), as follows: Gd + lesions were manually segmented at each timepoint of MRI acquisition by consensus of two trained experts while WMHs were segmented using a semi-automated method including a manual verification and correction step ( Francis, 2010 ). This process yielded two binary maps per study, denoting WMHs and Gd + lesions respectively (see Fig. 1 ).
From the WMH masks, the NET2 lesion mask detected in a present scan relative to a prior reference scan was determined automatically by considering the difference between the sets of WMH voxels detected across these time-points, where artifactual differences due to segmentation variability or imperfect registration were automatically excluded. These NET2 lesion masks were manually reviewed and corrected where necessary (see Fig. 2 ). Importantly, a new WMH was labeled as NET2 if it was detected via the comparison of scans acquired at most 24 weeks apart. Ultimately, acute lesions were defined as the union between the Gd + and NET2 masks, within the bounds of the WMH mask. Conversely, chronic lesions were defined as the non-acute section of the WMH mask (see Fig. 3 ).

MRI pre-processing
Across all scans, spatial coherence was enforced via 6-parameter rigid registration (translation and rotation in 3D space) to the 2009a Non-linear Symmetric atlas ( Fonov et al., 2009 ) in the ICBM-152 reference space, followed by a resampling operation to isotropic voxel spacing (1 × 1 × 1 mm) via trilinear interpolation, yielding final scan dimensions of 256 × 256 × 180 voxels. Additionally, each scan underwent N3 bias-field correction ( Sled et al., 1998 ), followed by an intensity standardization procedure enforcing that NAWM voxels should have zero mean and unit-variance. The NAWM was defined as the set of white matter voxels without WMH; to mitigate the risk of contamination of the normalization parameters by peri -lesion abnormalities, NAWM voxels located within 2 mm of a WMH were excluded from the computation of the normalization parameters.

Data supplement via MS lesion inpainting
Our dataset of T1-and T2-weighted brain MRI scans was supplemented with synthetic lesion-free equivalents of each scan, generated via a CNN-based inpainting model. This model performs an image-toimage translation task whereby the MRI intensity profile within the  Definition of ground truth NET2 lesion mask. A new WMH was labeled as acute if it arose within the previous 24weeks. The WMH region is first segmented in each timepoint of T2-weighted MRI scan acquisition. For each post-baseline timepoint (e.g., 24 weeks post-baseline), the NET2 mask is the set of voxels that are labeled as WMH at that timepoint and were not labeled as WMH in a previous timepoint − 1 acquired at most 24 weeks prior to . After their automatic detection, NET2 masks were manually corrected by trained MRI readers. Abbreviations: Gd + , gadolinium enhancement; MRI, magnetic resonance imaging; NET2, new or substantially enlarging T2-weighted; w, week; WMH, white matter hyperintensity.
WMH region of an input MRI scan is replaced with a synthetic inpainting mimicking normal-appearing tissue. During both training and inference, WMH masks were isotropically dilated by 1 mm, which ensures that both the lesion foci and any immediate peri -plaque abnormality are inpainted.

Architecture
The 2D inpainting method proposed by Yu et al. ( Yu et al., 2019 ) was adapted to the 3D setting via multi-view ensembling. Specifically, we trained one model per anatomical view (axial, coronal, sagittal) and aggregated predictions across views via voxel-wise averaging at test time. Each 2D inpainting model was based on a generative adversarial network (GAN) ( Goodfellow et al., 2020 ) composed of two encoder-decoder generator blocks referred to as the coarse and refinement blocks (see Fig. 4 ), followed by a discriminator block. Each generator block implemented gated convolutions ( Liu et al., 2018 ) to restrict the encodingdecoding process to information contained outside of the region to be inpainted; thus, each generator block produces a synthetic full-brain MRI image, whereby the intensity profile in the inpainted region is predicted using information outside of that region. Specifically, the refinement generator block included an attention module guiding the encoding process. The contextual attention module originally used by Yu et al.  ) was replaced with a recursive self-attention module. This module allows for MRI signals distant from the lesion site to be integrated into the prediction of a hypothetical normal-appearing tissue profile replacing lesion tissue ( Zhang et al., 2019 ). The combined use Fig. 3. Axial (A), coronal (B), and sagittal (C) views of a T2-weighted brain MRI scan randomly selected from the ADVANCE cohort, showing the acute (red) and chronic (blue) ground truth segmentation maps. We observe isolated chronic and acute foci, as well as contiguous acute and chronic lesions. Abbreviations: Gd + , gadolinium enhancement; MRI, magnetic resonance imaging; NET2, new or substantially enlarging T2weighted; WMH, white matter hyperintensity.

Fig. 4.
Architecture of our CNN model for MS lesion inpainting, adapted from ( Yu et al., 2019 ). During training, free-form synthetic lesion masks are created outside of the WMH masks while during inference, WMH masks are fed into the model. The generator block consists of a sequence of coarse and refinement autoencoders. The output of the generator block ( "Refined result ") is the inpainted "lesion-free " image extracted during inference. During training, the output of the generator block is subsequently fed into a discriminator model, which encodes it into a set of latent features. In parallel, the discriminator model also encodes a randomly selected, non-inpainted MRI image containing healthy tissue within the white matter region delimited by the free-form synthetic lesion mask. For each latent feature (i.e., neuron of the last feature map), the probability for that feature to be derived from a real (non-inpainted) image is estimated. This estimated probability is leveraged to train the model, via the discriminator loss term . Overall, a high-performing generator can "fool " the discriminator into believing it is creating real images. Abbreviations: GAN, generative adversarial network; MRI, magnetic resonance imaging; MS, multiple sclerosis; WMH, white matter hyperintensity. of a discriminator loss (which encourages realistic inpainting profiles) and recursive attention mechanism (which facilitates the integration of MRI signals collected across brain locations) differentiates our approach from other state-of-the-art methods for MS lesion inpainting from brain MRI ( Manjón et al., 2020 ).
Furthermore, anatomical consistency was imposed by ensembling models built on four reference anatomical spaces. These spaces were defined by randomly selecting four T1-weighted MRI scans from the baseline studies of the subset of ADVANCE participants exhibiting a low MS lesion volume ( ≤ 100 m{m 3 ). Each scan thus selected defines one reference anatomical space, and the process of affine registration to each reference space is illustrated in the appendix in Fig. 16 . In total, 12 models were trained, corresponding to one model per anatomical view (axial, coronal, and sagittal) for each anatomical template. During inference, each input MRI scan was first projected to each one of the four reference spaces (see Fig. 16 ), which was achieved using a linear registration algorithm implemented within the CE-marked and FDA-approved ART-Plan software (TheraPanacea, Paris, France) ( Ferrante et al., 2019 ). The syn-thetic full-brain MRI images produced in each one of the four reference spaces were subsequently projected back to the source space and predictions were aggregated via voxel-wise averaging in the native space. A sample inpainting result is shown in Fig. 9 as part of the Results section.

Optimization
A subset of the participants, 80%, pooled from the ADVANCE and AS-CEND trials, were randomly selected for training the inpainting model; the remaining 20% were used for validation. A set of 80,000 2D slices were extracted from the white matter of all training scans across the axial, sagittal, and coronal planes. Prior to slice extraction, each MRI scan underwent intensity normalization based on histogram matching ( Nyú and Udupa, 1999 ) performed separately on T1-and T2-weighted scans, using intensity statistics from the brain region and excluding lesion areas (and their associated 2 mm peri -plaque margin). Histogram-matched intensities were then mapped linearly to the range (-1 to 1) via min-max normalization.
Let denote a sample in a set of 2D images extracted from a collection of 3D T1-or T2-weighted brain MRI scans and let denote a binary mask defining the region to be inpainted in sample . The mask was randomly generated via the procedure originally proposed in ( Yu et al., 2019 ) and designed to simulate brush strokes, whereby stroke length and brush width parameters were optimized to produce free-form shapes reproducing the visual aspect of MS lesions geometry. Since the task of the inpainting model is to produce a lesion-free tissue profile replacing an MS lesion, the mask was constrained to white matter regions located at least 5 mm away from any WMH. Then, was preprocessed such that voxels from contained within were set to a constant value of 0 to avoid contributing to the feedforward pass.
We seek to produce a predicted inpainting ̂ that minimizes the L1 distance between and ̂ in both the inpainted region as well as (trivially) in the non-inpainted region. As is fed into our network, we denote the output of the coarse generator block as ̂ and the output of the refinement generator block as ̂ . The reconstruction loss in the masked inpainting region and the loss in the unmasked region are expressed below, where ⊙ denotes voxel-wise multiplication. The network was further supervised by minimizing the GAN loss defined below, where denotes the discriminator loss function. The total model loss is defined as total ∶= + + ; empirical results motivated the choice of = = = 1 , as proposed in ( Yu et al., 2019 ). Model parameters were trained via the Adam optimizer with a learning rate of 1 × 10 -4 , using a batch size of eight. Training was stopped early if the global L1 loss on the validation set did not decrease for two consecutive epochs. The model was implemented in PyTorch and took 14 h to train on a GTX-1080 GPU.

Motivation for a patch-based approach
In early stages of MS, brain WMH regions can generally be separated into foci via connected-component analysis, whereby one prediction (acute or chronic) can be produced for each discrete lesion. In contrast, in later stages of MS, the confluence of multiple lesions accumulated over time makes it difficult to isolate individual lesions ( Dworkin et al., 2018 ). This problem precludes a lesion-level approach and instead prompts a voxel-level analysis. Thus, in this work, the class of each WMH voxel is predicted using information contained within a cubic patch of dimensions 15 × 15 × 15 mm centered on that voxel, whereby patch dimensions were selected by computing the smallest possible patch size satisfying the condition that at least 90% of all acute lesions in our dataset could be fully contained within the patch. Patchlevel information is extracted across the original and inpainted T1-and T2-weighted MRIs; each patch is associated with the WMH mask detected in that patch (see Fig. 6 ).

Patch sampling strategy
Patches were extracted from randomly sampled voxel locations across the WMH regions of all available brain scans, following a spatially uniform distribution (see Fig. 5 ). This was constrained by a set of patch exclusion criteria eliminating small ( < 9 m m 3 ) lesion foci as well as patches on the boundary between acute and chronic MS lesions. In the specific case of patches extracted from discrete lesion foci that could be fully contained within the patch, the center of the patch was corrected to co-localize with the center of mass of those foci. These exclusion and correction rules are further detailed in the appendix and favor patches that are non-equivocal examples of either acute or chronic MS activity, which reduces label noise and as such facilitates learning. During model validation and testing, these exclusion criteria were maintained. For patches satisfying these criteria, the label of the patch was that of its central voxel.

Class balancing across patches
The resulting dataset of patches was class-balanced by undersampling the majority chronic class. Specifically, the set of chronic patches extracted across the pooled population of participants was such that the count of patches across participants would reflect the extent of chronic lesion burden across those participants. As an example, if the chronic lesion volume detected in participant is equal to twice the volume of chronic lesion detected in participant , then the final set of chronic patches will contain twice as many chronic patches from participant than patches from participant . In addition, the distribution of lesion volumes contained within chronic patches was matched to the distribution of lesion volumes in acute patches. This ensures that the acute versus chronic discrimination process is driven by differences in textural patterns of MRI intensity, as opposed to variations in lesion volume that may otherwise be captured through texture features via volume-confounding effects ( Jensen et al., 2021 ). Key statistics of the resulting class-balanced datasets of patches are presented in Table 2 as part of the Results section.

Regions of interest: core and periphery
In each patch, two regions of interest (ROIs) were segmented: the core and periphery . Specifically, we define the core of the patch as the section of the WMH contained within the focus region, as illustrated in Fig. 6 , whereby the focus region is defined as a binary ball of radius 4 mm centered on the patch. 2 Consequently, the core ROI is equivalent to the lesion mask for patches centered on focal WMH components of maximum 3D diameter ≤ 8 mm (in the ADVANCE cohort, approximately 70% of acute MS lesions satisfy this criterion). For larger WMH components, the focus region can be used to focus on the relevant subsection of the larger lesion component. Thus, the core ROI can be used to dynamically adapt to different spatial extents of WMH and supports the use of our method on both discrete and confluent MS lesions.
In addition to the core ROI, we defined the periphery ROI of the patch as the set of voxels around the core that are at a distance ≤ 3 mm away from the edge of the core . Importantly, the periphery ROI does not systematically co-localize with the peri -plaque region. Indeed, the nature of the tissue contained within the periphery ROI may vary depending on the location of the patch within the brain, relative to other MS lesions.

Table 2
Number of patches extracted from the ADVANCE, ASCEND, and DECIDE trials, stratified by lesion label. Acute patches are further stratified by lesion age and whether the lesion co-localized with Gd + . The acute and chronic datasets of patches were class-balanced. Notes: An interstudy period of 12 weeks occurred only in the ASCEND trial between 96-and 108-weeks post-baseline, while an interstudy period of 4 weeks occurred only among a subset of participants in the ADVANCE trial, from baseline to 24 weeks post-baseline. It is thus not possible to identify which NET2 lesions from ASCEND and DECIDE emerged within 4 weeks prior to scan acquisition, which is indicated as "NA ". A lesion detected as new via the comparison of scans taken more than 24 weeks apart, but showing gadolinium enhancement at the latest timepoint, was considered acute. Abbreviations: NA, not applicable; NET2, new or substantially enlarging T2-weighted. The periphery ROI may thus contain both peri -plaque tissue (which may include white matter tissue and/or CSF from CSF-rich locations such as the ventricles) as well as lesion tissue from neighboring (and possibly confluent) lesions (see Fig. 6 ). By allowing the periphery ROI to include neighboring lesion tissue (as opposed to restricting it to the nonlesion peri -plaque tissue), we circumvent the issue of an empty periphery, which would otherwise occur when a patch is sampled within the core of a large confluent lesion mass. Overall, the core and periphery ROIs represent two non-overlapping binary masks adapted to the WMH contained in each patch (see Fig. 6 ).

Radiomics computation
A set of 88 textural and first order radiomic features ( Van Griethuysen et al., 2017 ) were extracted separately from the core and periphery ROIs for each MRI sequence (T1-and T2-weighted) as well as for their inpainted counterparts, yielding a set of 704 radiomic features per patch. The intensity discretization for computing texture features was performed using a fixed bin size (as opposed to a fixed bin count), which has been shown to improve reproducibility in TA from MRI ( Duron et al., 2019 ). Bin width was set to 0.4, which yielded approximately 30 intensity bins per ROI and was motivated by prior studies that have shown good reproducibility for a bin count of 32 on MRI ( Carré et al., 2020 ). No patch-level intensity normalization was applied; we instead relied on the scan-level intensity normalization scheme described in Section 5.1 "MRI Pre-Processing ".

Biomarker identification and classification pipeline
The dataset of pooled participants from the ADVANCE trial was split 80:20 into training and validation sets, while ensuring that the distribution of classes in each subset was balanced. Patches extracted from participants from the ASCEND and DECIDE trials constituted two independent testing sets. Radiomic features were normalized via z-scoring, whereby normalization parameters were computed from the training set and subsequently applied to the training, validation, and testing sets. The training set was used as input to a classification pipeline identifying the optimal combination of a feature space and a classification model for the successful discrimination of acute versus chronic patches. This classification pipeline was inspired by ( Chassagnon et al., 2020 ) and consists of an initial feature ranking pipeline, followed by a continuous loop of recursive feature elimination and ensemble classification model optimization (see Fig. 7 ). The feature ranking, ensemble classification, and modified recursive feature elimination (mod-RFE) modules are defined in the following sections.

Feature ranking
Let  denote the set of = 704 radiomic features extracted from each patch sample. A feature selection algorithm was designed to reduce the dimensionality of this input space by removing irrelevant and/or redundant variables. This has the potential to reduce computation time, improve classification performance and generalizability via the elimination of noisy features, and facilitate the interpretation of the final model ( Cai et al., 2018 ;Kuhn and Johnson, 2013 ). Specifically, we perform feature selection via two distinct steps. First, a feature ranking module orders the radiomic features from most to least predictive. Then, starting from a feature subspace  comprising the top-K most-relevant features, a modified recursive feature elimination (mod-RFE) process is conducted, whereby the size of the feature subspace is iteratively decremented by eliminating the least useful features at each step.
The feature ranking step is supported by a supervised scoring metric ( ) denoting the prevalence of a feature (indexed by ∈ [ 1 , 2 , … , ] ) from  and defined as the sum of two metrics inspired by the filter and embedded methods of feature selection. The filter method selects features based on their score in univariate statistical tests for their correlation with the prediction target. In this context, we may measure the linear dependence between a feature and the patch label (e.g., using point biserial correlation) or their monotonic relationship (e.g., using Kendall's Tau) as well as more complex relationships (via informationtheoretic measures such as mutual information). However, this univariate filter method is limited in that it may select important but correlated (and therefore redundant) predictors. In contrast, the embedded method is a multivariate approach traditionally implemented through the application of classification models endowed with built-in feature selection properties, such as 1 -regularized linear models. Whilst the multivariate embedded method is more robust to the presence of correlation across features, it is also limited to the detection of linear relationships between features and the patch label. It may also select features in a model-dependent fashion and yield feature scores (i.e., model weights) that are dependent on the choice of regularization strategy and magnitude. Importantly, to tackle the task of feature ranking , as opposed to feature selection , we adapt the traditional embedded paradigm by considering the coefficients learned by 2 -regularized linear models and inspect the relative weight given to each feature, rather than select the subset of features with non-zero weights. Similarly, for the filter metric, features were ranked based on their absolute correlation score with the target, rather than selected based on a p-value significance threshold. By combining both a filter and an embedded method to feature selection, we aim to mitigate the respective limitations of each approach and combine their strengths.
To evaluate the filter and embedded metrics, we randomly generated 100 subsets from the training dataset of patches, whereby each subset contained 80% of all training patches, without repeating elements. To compute the embedded metric, we trained a variety of linear and tree-based ML classifiers on each one of the 100 generated subsets to discriminate acute from chronic patches. These classifiers included a linear Support Vector Machine with L2 regularization = 0 . 25 (chosen via cross-validation over the training set using subject-level splits, using balanced classification accuracy as objective measure), Logistic Regression with L2 regularization = 1 . 0 ( idem ), as well as tree-based models, including a Decision Tree of depth 3 optimizing the Gini impurity, and boosted ensembles of decision trees, including AdaBoost with 30 boosting rounds and a learning rate of 1.0, and XGBoost with 30 boosting rounds and a learning rate = 1 . 0 . For each one of the 100 subsets, model coefficients were extracted from each trained linear classifier and feature importance values were extracted from each tree-based classifier (e.g., Gini impurity for the Decision Tree, or gain for XGBoost). These were used to rank all features from highest to lowest relative importance. From this ranking, the 5% features with the highest score were selected. The embedded metric score of each selected feature was then incremented by 1. This procedure was repeated for each one of the 100 subsets and is illustrated in Fig. 8 . Ultimately, the embedded metric score for feature was defined as the number of splits in which was selected, summed across linear and tree-based models. Consequently, the embedded metric score is a natural number comprised between 0 (i.e., was never selected) and 500 (i.e., was selected by each of the 5 classification models in each one of the 100 splits).
Similarly, for each one of the 100 subsets, linear and non-linear correlation metrics were evaluated to quantify the degree of association between each radiomic feature and the patch label. Metrics included Point Biserial correlation, Kendall's , ANOVA F-value, 2 , and mutual information. For each metric, features were ranked in ascending correlation magnitude. The filter metric score of a feature was then defined as the number of splits in which was selected across correlation metrics. Again, the filter metric score is a natural number comprised between 0 (i.e., was never selected) and 500 (i.e., was selected by each of the five correlation metrics in each one of the 100 splits). Importantly, by selecting a fixed number of features for each split and for each scoring method, we allow for conceptually different feature importance metrics to be transformed to a common space of natural numbers in the range [0:500]. The resulting embedded and filter scores are then each mapped to [ 0 ∶ 1 ] via min-max normalization across features, yielding relative  metrics which are ultimately summed together to give the prevalence score ( ) . By summing together min-max normalized metrics (instead of summing up ranks across metrics, for example), the definition of ( ) conserves the magnitude of relative discrepancies between features as measured by the embedded and filter metrics.
Finally, features were ranked using their prevalence score. The top-K most informative features could then be retrieved to construct the set of features  . We specifically enforced that, for each radiomic variable retrieved in this way, the measures extracted from both the original and inpainted MRI data should be selected. This was enforced by iterating from the best-ranked variable to the worst-ranked variable and populating the K -dimensional subspace of selected features with both the current-rank variable as well as its original or inpainted counterpart, via a "tag-along " strategy. This ensures that, although original and inpainted MRI features were scored separately using , each set  consists of pairs of original and inpainted texture radiomics such that  defines an -radiomics signature.

Classification
The ensemble classification module then yields, from a dataset containing a subset  of top-radiomic features, an ensemble classifier   discriminating acute from chronic patches. Specifically,   was constructed by ensembling five base classifiers. The set of candidates from which these five models were selected included K-Nearest Neighbor, Linear Support Vector Machine (SVM), Polynomial SVM, Radial Basis Function SVM, Decision Tree, Random Forest, Adaboost, XGBoost, HistGradBoost, Gaussian Naive Bayes, Quadratic Discriminant Analysis and Multi-Layer Perceptron.
First, the hyperparameters of each model taken from the candidates' pool were tuned via an extensive deterministic grid search, crossvalidated three times using subject-level splits. Second, tuned models from the pool were evaluated via 5-fold cross-validations. Models were ranked in ascending order of balanced classification accuracy and the top-5 were selected, excluding models exhibiting a train-validation performance gap exceeding 35% in balanced accuracy. Third, the predic-tions of the top-5 classifiers were combined under different ensembling strategies including hard voting, accuracy-weighted hard voting (whereby the vote of each base model was weighted by its crossvalidated validation accuracy), soft voting, accuracy-weighted soft voting, and stacking. In stacking, a logistic regressor was trained to map the probabilistic output of the top-5 base estimators to the patch class (acute/chronic). The best-performing ensemble model   was selected via evaluation on the validation set from the ADVANCE trial, using balanced classification accuracy as the objective metric.

Recursive feature elimination
The ensemble classification module was applied for different values of from 4 up to , which yielded a series of trained classifiers {   4 , … ,   } . Among this set of classifiers and their associated feature spaces, the smallest feature space size * associated with classifier   * yielding a validation accuracy comprised within a reasonable margin of the best accuracy achieved across all trained classifiers was selected as the initial feature space for the mod-RFE pipeline (i.e., the second stage of feature selection). Importantly, the computational complexity of mod-RFE exceeds an arithmetic sum up to the size of the initial feature space, thus initializing mod-RFE with * features, rather than features, improves the tractability of this last feature selection stage.
Specifically, our mod-RFE module integrates the ensemble classification module within a continuous optimization loop of greedy RFE. Starting from a feature space of size * , the classification module first yields an optimal classifier   * . Then, we iteratively loop over all pairs of radiomic features ( , ) in  * , where and denote features corresponding to a radiomic variable measured from original and inpainted MRI data, respectively. In each iteration from this loop, we trained   * on a feature space of size − 2 in which ( , ) were removed. The performance of each newly derived classification model trained in the absence of ( , ) was evaluated via 5-fold crossvalidation. This was repeated for each of the * ∕2 pairs of features in  * . After * ∕2 experiments, the pair of features whose removal caused the least decrease in balanced accuracy was removed, yielding a feature space of size * − 2 , and a new classifier   * −2 was defined. This process was repeated recursively, pruning the feature space by two features in each iteration, terminating at a feature space of size 2.
To limit the computational cost of this approach, a random subset of 20% of the samples from the training set of patches was selected at each recursive loop, from which the impact of the removal of each features pair ( , ) was evaluated. Finally, the trace of validation accuracy against feature space sizes was plotted in Stage 2 of Fig. 10 , from which an -radiomics signature was selected via visual inspection, to identify the smallest feature space size maintaining near-optimal classification accuracy on the validation set.

Code and data availability statement
Requests for data should be submitted via the Biogen Clinical Data Request Portal ( www.biogenclinicaldatare quest.com). To gain access, data requestors will need to sign a data-sharing agreement. Data are made available for 1 year on a secure platform. The code for this paper is proprietarily owned by Biogen and cannot be shared.

Ethics statement
All patients provided written informed consent to participate in the clinical studies, which included consenting to future use of their study data for medical and pharmaceutical research, such as this post-hoc analysis.

MS lesion inpainting
A sample inpainting result is shown in Fig. 9 . The performance of our inpainting model is evaluated both quantitatively and qualitatively in the Appendix.

Patch sampling
Using the sampling strategy described in Section 5.3.2 "Patch Sampling Strategy ", we extracted a set of 40,160 patches (see Table 2 ) taken from 11,267 timepoints of MRI scan acquisition collected across 3,924 participants with MS.

Classification performance
The set of 704 radiomic features were ranked and a series of ensemble classifiers were constructed for values of ranging from 4 up to 704 (see Stage 1 of Fig. 10 ). The optimal initial for initializing mod-RFE was determined as = 200 (for > , we observed an undesirable increase in approximation error and little to no increase in validation accuracy). The results of the mod-RFE pipeline (see Stage 2 of Fig. 10 ) led to the selection of a compact -radiomics signature of 32 features comprising 16 unique variables each extracted from both original and inpainted MRI data. This signature contained 11 T2-based and 5 T1-based variables (nine of which were periphery-based and seven were corebased). The ensemble classifier optimized in this feature space combined a polynomial SVM of degree 3 and regularization parameter = 20 . 0 , a radial basis function SVM with = 10 . 0 , a multi-layer perceptron with one hidden layer comprising 100 units with Rectified Linear Unit (ReLu) activation, trained via Adam optimization with a learning rate of 0 . 001 and constrained with L2 regularization strength = 0 . 1 , XGBoost optimizing the binary logistic loss via gradient-boosted decision trees of maximum depth 7 with 0.1 learning rate and 600 boosting rounds, and HistGradBoost optimizing the binary logistic loss via gradient-boosted decision trees (unconstrained in depth) with a minimum of 20 samples per leaf, with shrinkage coefficient 0.2 and L2 regularization strength 5.0. Detailed benchmarking results for all classification models evaluated in this paper are reported in Table 8 in the appendix.

Control experiments 6.4.1. Prediction from lesion location or volume
In addition, a set of control experiments were designed to generate reference results contextualizing the performance of our classifier. To assess the effects of any potential spatial bias between acute and chronic patch samples ( experiment 1 ), we applied our classification pipeline to predict the ground truth class of each patch given the 3D location of its central voxel. We achieved a balanced accuracy of 62 . 5% on the validation set from ADVANCE (compared to 75 . 8% with our -radiomicsbased approach), 63 . 4% on the testing set from ASCEND (compared to 74 . 6% ), and 61 . 3% on the testing set from DECIDE (compared to 74 . 3% ). Furthermore, to validate the efficacy of the volume-matching step performed between acute and chronic patches ( experiment 2 ), we attempted to classify patches using the volume of their core and periphery regions. As expected, classification accuracy was close to chance ( 50% ) across the ADVANCE ( 56 . 7% ), ASCEND ( 52 . 2% ), and DECIDE ( 53 . 9% ) trials, thereby demonstrating appropriate elimination of the ROI volume bias.

Prediction without inpainting features
Furthermore, to assess the benefit of including inpainting features ( experiment 3 ), our classification pipeline was applied to the restricted subset of radiomic features extracted from the original MRI scans. To ensure a fair comparison, a feature space of 32 radiomic features was selected via mod-RFE. The resulting model achieved a balanced accuracy of 70 . 1% on the validation set from ADVANCE (compared to 75 . 8% with inpainted information), 71 . 4% on ASCEND (compared to 74 . 6% ), and 71 . 2% on DECIDE (compared to 74 . 3% ). The radiomics model without inpainted features is additionally compared against our proposed -radiomics pipeline via receiver operating curves, as shown in Fig. 12 .

Prediction using a simple convolutional neural network
Lastly, we compared our -radiomics ML classifier against a simple convolutional neural network (CNN) trained to discriminate acute from chronic patches ( experiment 4 ) using the same input data as our proposed model (T1-weighted, T2-weighted, and inpainted T1-and T2weighted patches concatenated together with the WMH mask patch). The CNN architecture consisted of two convolutional blocks followed by two fully-connected layers. The convolutional blocks comprised eight and 16 filter kernels of dimensions 3 × 3 × 3 with stride 1 and 1-voxel padding, each followed by a leaky ReLu activation function and a maxpooling block of kernel size 2. The fully-connected layers contained 32 hidden units and one output unit respectively, and were connected via a leaky ReLu activation function, followed by a batch-normalization layer and a drop-out layer with = 0 . 4 . The model thus contained a total of 37,457 trainable parameters. A sigmoid activation function was applied to the output node, and the model optimized the binary cross-entropy loss using the Adam optimizer with a learning rate of 0 . 001 . The resulting CNN was cross-validated by generating 20 train/validation splits from the original training set on the basis of 80:20. It achieved a balanced accuracy of 74 . 3% ( ±0 . 7 ) on the validation set from ADVANCE (compared to 75 . 8% with our -radiomics approach), 74 . 9% ( ±0 . 7 ) on ASCEND (compared to 74 . 6% ), and 74 . 3% ( ±0 . 8 ) on DECIDE (compared to 74 . 3% ). Overall, our proposed -radiomics pipeline performed similarly to this simple CNN, as demonstrated via the receiver operating curves shown in Fig. 13 . The results of each control experiment are summarized together in Table 5 .

Full-Brain prediction
Our patch-based classification framework may be adapted to the full-brain dense acute MS lesion segmentation task by independently predicting the label of each WMH voxel ( acute or chronic ) using information contained in its surrounding patch. Sample predictions showing true positive, false negative and false positive examples are shown in Fig. 14 . In addition, the performance of our classification framework in the context of brain scans containing no ground truth acute MS lesion activity is illustrated in Fig. 15 , showing varying degrees of acute "over-prediction ". Future histopathology-MRI correlation analysis will be needed to determine whether some of these areas of apparent "overprediction " may nevertheless contain foci of pathologically "active " demyelination which are currently missed by the ground truth conventional MRI definition of "acute " lesions (the limitations of which were summarized in the Introduction section).

Discussion
We have developed an ensemble ML algorithm discriminating acute from chronic MS lesion patches with accuracy in the range 74.3-74.6% using a compact set of 32 -radiomic features. This algorithm demonstrated good generalization properties across different MS disease stages (relapsing-remitting MS [RRMS] in ADVANCE/DECIDE versus secondary progressive MS [SPMS] in ASCEND).

Interpretation of radiomic features
The potential interpretation of the 32 selected radiomic features that distinguished acute from chronic lesions was explored. In line with prior work ( Zhang et al., 2008( Zhang et al., , 2009, our results suggest that acute MS lesions are associated with a coarser texture on T2-weighted MRI relative to chronic MS lesions. 3 Among the selected radiomic features, coarseness is reflected by variables such as the coarseness (as extracted from the Neighboring Gray Tone Difference Matrix, NGTDM) as well as the normalized inverse difference moment (as extracted from the Gray Level Cooccurrence Matrix, GLCM) measured in the core of T2-weighted patches. Previous studies have associated increased coarseness in MS lesions on T2-weighted MRI with higher levels of demyelination, axonal damage, and inflammation consistent with acute pathology, which suggests that our method has content validity ( Zhang et al., 2013 ). Furthermore, it has been suggested that the increased coarseness in T2-weighted MRI texture in acute MS foci may be associated with infiltration by inflammatory cells (including macrophages, lymphocytes, and glial cells), loss of oligodendrocytes, recruitment of undifferentiated oligodendrocyte progenitors, phagocytosis of myelin proteins by macrophages ( Brück et al., 1995 ;Pittock and Lucchinetti, 2007 ;Prineas et al., 1993 ;Zhang et al., 2009 ), and acute axonal pathology ( Tedeschi et al., 2002 ). Fig. 13. Receiver-Operating Curves comparing our proposed approach (leveraging -radiomics via an ensemble of ML classifiers) versus a simple CNN, showing no statistically significant difference in performance. The 95% confidence interval (CI, shaded) was computed by fitting 20 different models on 20 different train/validation splits from the training set of patches, and applying each resulting trained model to each test set.

Table 5
Key global classification performance statistics across the ADVANCE, ASCEND, and DECIDE trials, for different input feature spaces in the context of 4 control experiments. In addition, many of the features discriminating acute from chronic lesions were found to be in the periphery of the patch, which is consistent with prior results demonstrating the presence of rich textural biomarkers within the peri -plaque tissue space in MS ( Zhang et al., 2013 ). We observed a greater degree of T1-weighted signal hypointensity in the periphery of acute patches, along with a greater degree of T2-weighted signal inhomogeneity in this region, as measured via the joint entropy (GLCM) and gray level variance (Gray Level Run-Length Matrix) variables. This may be consistent with clearance of debris at the lesion site, which occurs predominantly at the periphery of an MS lesion ( Zhou et al., 2019 ), as well as partial demyelination in the tissue surrounding acute foci. Nonetheless, it is important to recognize that the periphery region as defined in this study may not systematically overlap with the NAWM surrounding a focal MS lesion. Instead, it may intersect with varying degrees of WMH, depending on the proximity of that MS foci to other regions associated with MS damage. In particular, the periphery of patches extracted from confluent lesions will primarily contain neighboring lesion tissue and in contrast will contain little to no peri -plaque tissue. Consequently, we cannot exclude that the increased T2-weighted inhomogeneity and decreased median T1-weighted signal intensity observed in the periphery of acute patches could also reflect the presence of WMHs in the vicinity of acute foci.

Sensitivity to gadolinium-enhancing lesions
The sensitivity of our ensemble classifier to the detection of acute lesions at different ages was investigated by using the presence of Gd + as a marker of lesion age. Since the timescale for persistence of BBB disruption, as measured via Gd + , has been reported to be around 1.5 to 3 weeks ( Cotton et al., 2003 ;Guttmann et al., 2016 ), then Gd + lesions can be expected to have emerged on average within 1.5 to 3 weeks prior to scan acquisition. Across all cohorts, we achieved a significantly higher ( +10% improvement on average) sensitivity to Gd + than nonenhancing NET2 lesions. Intuitively, since NET2 lesions are most distinguishable from chronic lesions shortly after their inception, before they may slowly transition towards a chronic lesion profile over time ( Guttmann et al., 1995 ;Meier et al., 2007 ;Meier and Guttmann, 2003 ;Rovira et al., 2013 ), we may indeed expect that more recent NET2 may harbor maximal textural discrimination versus chronic lesions. Nevertheless, our classifier was still able to accurately discriminate acute from chronic lesions when the former was not Gd + , emphasizing that the distinguishing features selected by our algorithm are not restricted to the early-acute stage of new lesion formation.

Fig. 14.
Panel showing axial slices sampled from three full-brain prediction maps derived from three participants from the ASCEND test cohort, obtained by independently predicting the label of each WMH voxel using information contained in its surrounding patch. This panel illustrates, from top to bottom, an example of true positive acute MS lesion detection, an example of false negative (acute lesion incorrectly predicted as chronic) and an example of false positive (chronic lesion incorrectly predicted as acute). In each row we display, from left to right, (A) axial slice of T2-weighted brain MRI scan, (B) ground truth map of acute (red) and chronic (blue) MS lesions, and (C) probabilistic prediction map generated via the application of our patch-based classifier, where blue indicates a low predicted probability of acute MS lesion and red indicates a high predicted probability of acute MS lesion.

Location-driven classification
In experiment 1 , the significant improvement in performance ( +12% ) observed in the -radiomics classification relative to the location-driven control demonstrates the ability of our -radiomics model to detect local textural biomarkers specifically associated with acute versus chronic MS lesion activity, beyond location-dependent textural clues revealing the anatomical context of each patch.

Benefit of inpainting features
In experiment 3 , we observed a consistent 4% increase in classification accuracy under the inclusion of inpainted information, relative to using original MRI data only. Importantly, it should be observed that, since the inpainting model was trained from non-lesion white matter tissue from MS-diagnosed patients, and since the totality of the white matter in the brain of MS patients is abnormal relative to healthy subjects ( Elliott et al., 2021 ), we expect our inpainting model to reproduce MS-associated white matter abnormalities, rather than generate a truly "healthy " white matter tissue profile. We hypothesize that the observed performance gain may result from the improved detectability of anatomical landmarks within the inpainted image, relative to the original MRI scans. In particular, the presence of a hyperintense T2-weighted signal in the periphery of a patch taken from the original MRI scan could denote either proximity to a WMH, or proximity to a region rich in cerebrospinal fluid (CSF), such as the ventricles. In contrast, after removing WMHs via inpainting, the presence of a hyperintense T2-weighted signal in the periphery of a patch acts as a specific marker of proximity to CSF-rich brain regions. This may allow for a better spatial contextualization of each patch, potentially supporting the construction of a location-dependent decision rule. Additionally, it should be observed that lesion inpainting transfers information from various brain locations into the inpainted region (via attention mechanisms) and as such expands the receptive field offered to our classifier beyond the core and periphery ROIs.

Comparison against CNN
Lastly, in experiment 4 we observe that our -radiomics classifier outperforms (on DECIDE) or matches (on ASCEND) the accuracy reached by a simplistic CNN classifier. With regards to the unsophisticated nature of the CNN evaluated in this study, these results are promising for future applications of deep learning techniques to lesion classification in MS. In fact, the CNN-based approach is particularly advantageous with respect to timing consideration, as it accelerates inference by a factor of 100 relative to our -radiomics approach (approximately 300 patches/second on GTX-1080 GPU for the CNN, relative to 3 patches/second on AMD EPYC 7742 64-Core Processor CPU for our approach).
Nonetheless, it should be noted that in contrast with the "black-box " characteristics of a CNN-based approach, our radiomics-based method retains a higher degree of interpretability, owing to our ability to associate specific radiomic features with their biological correlates. These associations may be leveraged to independently justify acute MS lesion predictions. As an example, some MS lesions may be predicted as acute owing to their abnormally high T2-weighted signal in the periphery ROI, suggestive of peripheral edema, while others may similarly be predicted as acute because of their high T2-weighted coarseness in the core ROI, suggestive of infiltration of inflammatory cells. Fig. 15. Panel showing axial slices sampled from three full-brain prediction maps derived from three participants from the ASCEND test cohort for which no ground truth acute MS lesion was detected, obtained by independently predicting the label of each WMH voxel using information contained in its surrounding patch. This panel illustrates, from top to bottom, examples of small, medium and severe acute over-prediction. In each row we display, from left to right, (A) axial slice of T2weighted brain MRI scan, (B) ground truth map of acute (red) and chronic (blue) MS lesions, and (C) probabilistic prediction map generated via the application of our patch-based classifier, where blue indicates a low predicted probability of acute MS lesion and red indicates a high predicted probability of acute MS lesion.

Population bias: clinical trial participants
This study has several limitations. Importantly, models were trained and evaluated on a population of participants pooled from the placebo and treatment groups of randomized controlled clinical trials, and while treatment effects may alter the textural properties of acute and chronic MS lesions relative to a natural history population, this was not explicitly investigated. Furthermore, although the inclusion criteria for these trials spanned the MS spectrum (see Table 1 ), which allows our algorithm to model population variability, some differences are nonetheless likely to exist between the population considered in this study and the distribution of patients typically encountered in routine clinical settings.

Under-Representation of variability in MRI acquisition parameters
Furthermore, the acquisition parameters used to generate the T1and T2-weighted MRI brain scans leveraged in this study were constrained (as verified via dummy runs ) to similar ranges across all trials (see Table 1 ), yielding highly standardized images that may underrepresent the variability of image contrasts and noise conditions encountered across real-world practice. Notwithstanding, it should be noted that the compactness of the selected feature space, the inherent robustness of radiomic features to MRI acquisition parameters, and the use of an ensemble learning strategy endow our classification framework with robustness properties.

Limitations associated with a patch-based approach
Another important limitation of this work is its focus on a patchlevel classification task guided by a handcrafted patch sampling strategy, which differs from the brain-level task of acute MS lesion detection and/or segmentation. While our approach may be adapted to the dense segmentation setting as discussed in Section 6.5 "Full-Brain Prediction ", it nonetheless lacks a mechanism to ensure consistency of predictions across voxels within a discrete MS lesion component, or to enforce spatial smoothness of predictions across a confluent MS lesion mass (see Fig. 14 and Fig. 15 ).
Furthermore, we have chosen to facilitate model training by rejecting ambiguous patch samples (see Section 5.3.2 "Patch Sampling Strategy ") and artificially class-balancing our dataset (see Section 5.3.3 "Class Balancing across Patches "). This is not representative of the brain-level task, where substantial class imbalance exists (see Table 1 ), and ambiguous patches cannot be avoided. To deal with class imbalance, our framework should be adapted to include classification models designed to perform well under these conditions. Additionally, we may shift our optimization objective focus from balanced accuracy (preferred in this work for its interpretability) to the F1-score metric (better suited for imbalanced datasets where positive class occurrences are rare). We hypothesize that this may contribute to reducing the occurrence of false positives, which are illustrated in Figs. 14 and 15 .

Statistical testing
Due to the high computational cost of training our proposed pipeline, the results reported here were collected across a unique run generated from a single subject-level train/validation/test split. The lack of global cross-validation limits our ability to statistically compare different classification experiments, beyond the ROC curves shown in Figs. 12 and 13 .

Translation into clinical practice
This work tackles the task of acute versus chronic MS lesion classification in the cross-sectional setting and without contrast-enhanced T1-weighted MRI. As such, it is suited for estimating the volume and spatial distribution of acute lesion burden in MS patients for whom only one timepoint of MRI scan acquisition is available and/or for whom safety concerns related to nephrotoxicity of gadolinium injection may be relevant. In the context of a diagnostic MRI scan acquisition, our proposed method may augment the information available to the clinician, by increasing the sensitivity to acute lesion detection beyond the delineation of Gd + foci, and as such may support the characterization of dissemination of MS lesions in both time and space on a single scan ( Thompson et al., 2018 ). Furthermore, the detection of a high acute lesion burden from a diagnostic scan may provide motivation for selecting a treat-to-target treatment strategy, involving initiation of potent DMTs early in disease course, which may ultimately improve patient outcomes.
In the context of a clinical trial, our method may also support the implementation of inclusion/exclusion criteria based on more comprehensive measures of acute lesion burden estimated from a single screening MRI session. Lastly, our proposed algorithm may be able to detect foci of acute MS lesion activity occurring in areas of pre-existing T2 lesion and captured outside of their gadolinium enhancement phase, and as such may improve the sensitivity to acute MS lesion detection beyond current methods. All these potential claims would be subject to full clinical validation either in retrospective or prospective cohorts.

Future work
Future work should primarily focus on evolving the voxel-level MS lesion classifier to a brain-level segmentation that would augment current methods and potentially eliminate the need for gadolinium injection. This may involve integrating the set of independent voxel predictions produced by our model into a spatially smooth and consistent voxel-level prediction map, constrained within the bounds of the WMH mask. Smoothness may be enforced by post-processing the raw voxellevel prediction map generated by our current framework, via Gaussian smoothing and/or techniques derived from Markov Random Fields.
In the context of future deep learning efforts, further work may evaluate the association between textural radiomic features and latent patch embeddings produced via the feature representation mechanism inherent to CNNs. Furthermore, attention-based frameworks, or visual explanation tools such as Grad-CAM ( Selvaraju et al., 2017 ), may be used to investigate the potential relationship between those input regions to which the CNN prediction is sensitive, and the "core " or "periphery " ROIs as defined in this work. We may also investigate the possibility of constructing a CNN producing a dense segmentation map from a multichannel input consisting of T1-and T2-weighted MRI scans along with voxel-level feature maps computed for each one of the 32 selected radiomic features. This approach offers several benefits, such as decreased inference time, and would naturally enforce spatial consistency by producing voxel-level predictions in a globally dependent fashion. Such a model may also be informed with patient-level demographic and/or clinical disease variables.
Lastly, and most importantly, future efforts should focus on correlating predicted acute lesion burden with subsequent clinical disease outcomes. Indeed, a thorough clinical validation supported by patient-level metrics derived from a voxel-level map of predicted acute MS lesion activity may support the integration of an automatic acute MS lesion detection tool into clinical practice, by allowing inference to be drawn at the level of individual patients.

Conclusions
We have developed a ML-based ensemble classifier that can discriminate acute from chronic MS lesions using unenhanced cross-sectional T1weighted and T2-weighted scans without the use of a previous comparative reference scan and/or gadolinium. The model leveraged a compact set of 32 -radiomic features encoding textural patterns associated with acute versus chronic MS lesion activity. The model achieved 75 . 8% balanced accuracy on a validation set of RRMS subjects, which was main-tained on independent test datasets comprising data from both SPMS ( 74 . 6% accuracy) and RRMS ( 74 . 3% accuracy) populations.

Code and data availability statement
Requests for data should be submitted via the Biogen Clinical Data Request Portal ( www.biogenclinicaldatare quest.com). To gain access, data requestors will need to sign a data-sharing agreement. Data are made available for 1 year on a secure platform. The code for this paper is proprietarily owned by Biogen and cannot be shared.

Funding
This work was supported by Biogen .

Disclosures of Competing Interest
Caba, Liu, Jiang, Gafson, Fisher and Belachew are employees and shareholders of Biogen. Cafaro and Lombard are employees and shareholders of Therapanacea. Elliott is an employee of NeuroRx Research. Arnold receives consulting fees from Biogen, Celgene, Frequency Therapeutics, Genentech, Merck, Novartis, Race to Erase MS, Roche, and Sanofi-Aventis, Shionogi, Xfacto Communications, grants from Immunotec and Novartis, and an equity interest in NeuroRx Research. Paragios is an employee of Therapanacea, employee of CentraleSupélec, Université Paris-Saclay, French Ministry of Higher Education and Research; holds stock options in Arterdrone and TheraPanacea; and receives compensation for editorial services from Elsevier.
NeuroRx Research (Montreal, QC, Canada) for the evaluation of MRI scans. Excel Medical Affairs provided editorial assistance in copyediting and styling the manuscript per journal requirements.

A1.1. Definitions
Patches containing WMHs that did not exceed a volumetric threshold of 9 mm 3 (equivalent to 3 voxels in the native 1 × 1 × 3 mm spacing) were ignored. This heuristic was enforced to facilitate TA, as radiomics analysis is sensitive to outlier effects for small ROI sizes ( Jensen et al., 2021 ). Intuitively, textures characterized by patterns showing regularity over a distant spatial scale are also more difficult to detect in small ROIs.
Furthermore, let denote the voxel location defining the center of a patch and let ( ) be the label of voxel (acute or chronic). Then, the volumetric proportion of the section of WMH contained in the focus region of the patch that is of class ( ) is enforced to always exceed 80% . This heuristic guarantees that our dataset of patches does not contain ambiguous samples situated at the interface between acute and chronic tissue. On average, this exclusion criteria rejected 4 . 5% ( ±6 . 3% ) of the WMH voxels detected in ADVANCE scans, 2 . 6 % ( ±3 . 3% ) in ASCEND, and 4 . 7% ( ±6 . 4% ) in DECIDE.
Lastly, when a sampled voxel location was contained within a small discrete lesion, then that sampled location was corrected such as to colocalize with the center of mass of the foci. Specifically, a lesion was considered as "small" if it could be fully contained within the cubic patch of dimensions 15 × 15 × 15 mm. This encourages the selection of patches centered on lesion foci, rather than on their border. In contrast, for WMH components exceeding the patch size, any voxel from the core or border of the lesion could be uniformly sampled. As previously specified, these exclusion criteria constrained the generation of both the training set of patches, as well as the validation/testing sets.

A1.2. Implications
On the full-brain voxel-level inference task in which our classification model could be applied to every voxel contained within the WMH area, we expect our classifier to reproduce the classification performance reported in this paper only within the set of WMH voxels that satisfy the above-mentioned rules. In contrast, we make no claim regarding the performance of our classification model within areas consisting of voxel locations that would fail to meet our patch inclusion criteria. Fig. 16

A2.2. Qualitative Evaluation
Our approach was visually evaluated against prior methods for MS lesion inpainting such as FSL (Oxford center for Functional MRI of the Brain [FMRIB]'s Software Library) ( Battaglini et al., 2012 ) and SFL (Automatic Lesion Segmentation of Multiple Sclerosis [SALEM] Lesion Filling) ( Valverde et al., 2014 ), as shown in Fig. 17 . It matches or exceeds the state-of-the-art while offering improvesuppd versatility (unlike ( Zhang et al., 2020 ), our model does not require the availability of healthy brain scans): in particular, unlike FSL and SFL our method is free from artifacts and identifiable margins around inpainted regions. Although our model does not explicitly enforce an edge prior guiding the reconstruction of anatomical structures such as the ventricles, unlike the most recently-published deep learning method based on edge prior constraints ( Zhang et al., 2020 ); qualitative results demonstrate that we achieve continuity in the reconstruction of the periventricular space. Furthermore, a V-net segmentation model ( Milletari et al., 2016 )  Abbreviations: PSNR, peak signal-to-noise ratio.
trained to detect MS lesions was unable to detect any lesion after application of the inpainting algorithm. The effect of extending the inpainting mask beyond the boundaries of the WMH mask ( Francis, 2010 ) was investigated. For a 1-mm margin, peri -lesional abnormalities remained visible outside of the inpainting mask, resulting in diffuse hyperintensity occasionally "bleeding into" the inpainted region. Conversely, extending the margin to 3 mm results in a large volumetric extent of tissue to be inpainted, which reduces the quality of inpainting in the core of the inpainting mask owing to the increasing distance to the non-masked region where valid information is present. The effect of different margin thicknesses is illustrated in Fig. 18 . As previously reported, in the final inpainting model a margin of 1 mm was preferred, based on extensive observations suggesting that a 1-mm margin yields better inpainting results.

A2.3. Quantitative Results
The evaluation procedure for inpainting evaluation established in ( Zhang et al., 2020 ) was replicated, using peak signal-to-noise ratio (PSNR) as an evaluation metric. The quality of continuity of anatomical structures was quantified by applying an edge detection algorithm to both the original and lesion-free images, and subsequently comparing the resulting edge maps. A Canny Edge filter with = 0 . 8 (as in Zhang et al., 2020 ) was applied to a 5-mm periphery around each inpainted region, both within the original and inpainted images. The F1 score between the resulting binary edge maps is reported in Table 6 .
Trivially, due to the presence of several MS lesions in the baseline MRI, pre-lesion ground truth is typically not available, which complicates not only model training but also model evaluation. To tackle this issue, we sampled a lesion mask from a given subject and superimposed on the brain scan of a different subject selected such that did not co-localize with any lesion in . MSE, PSNR, and edge detection analyses were conducted by comparing the inpainted region contained within in , across 30 subjects pooled from our test set, sampled equally between the ADVANCE and ASCEND populations of participants. In Table 6 , we report the PSNR and F1 edge detection scores measured on T1-weighted MRI (FSL and SLF methods only support T1weighted MRI inpainting).

A2.4. Ablation Study
Lastly, an ablation study was conducted to evaluate the effect of ensembling over models trained on different anatomical templates and views. The L1 error (voxel-wise sum of absolute intensity differences) averaged across the T1-and T2-weighted MRI scans of 150 cases sampled equally from the ADVANCE and ASCEND validation sets was computed. The L1 error is expressed relative to the range of intensity values in each scan (-1 to 1) in Table 7 . Ensembling across atlases and views reduces the relative L1 error by approximately 25% , with the largest improvement being attributed to the integration of information across anatomical planes.

A3. Classification Benchmarking in the Ablated Feature Space
Tables 8 and 9 Fig. 16. Overview of our multi-atlas approach to MS lesion inpainting. Each 3D brain MRI scan is registered by an affine method to 4 common anatomical templates. For each anatomical template (and for each anatomical plane) one inpainting model is trained (yielding a total of 12 models). During inference, inpainted brains from difference atlas-specific models are mapped back to the source space by inverting the forward affine transformation, and predictions are aggregated across these models via averaging. Fig. 17. Comparison of inpainting results on T1-hypointense lesion masks extended by a 1 -mm margin, showing, from left to right, axial slices from the original T1-weighted brain MRI, the lesion mask (red), SFL inpainting, FSL inpainting, and our inpainting. Abbreviation: MRI, magnetic resonance imaging. Fig. 18. Comparison of results of inpainting on T2-weighted MRI with peri -WMH margin extensions of 1 mm and 3 mm. Extended WMH masks are shown in red for margins of 1 mm and 3 mm in B and C, leading to the associated inpainted results shown in D and E, respectively. Abbreviation: MRI, magnetic resonance imaging.  Table 8 Classification benchmarking results showing the balanced classification accuracy obtained on the training and internal validation sets (used for hyperparameter tuning, different from the independent validation set from ADVANCE) for each base estimator within the ablated feature space (bold: five best-performing algorithms selected for ensembling). In each cell, we indicate the mean and standard deviation in accuracy across all five cross-validations runs.