Machine-learning-based prediction of disability progression in multiple sclerosis: An observational, international, multi-center study

Background: Disability progression is a key milestone in the disease evolution of people with multiple sclerosis (PwMS). Prediction models of the probability of disability progression have not yet reached the level of trust needed to be adopted in the clinic. A common benchmark to assess model development in multiple sclerosis is also currently lacking.

Methods: Data of adult PwMS with a follow-up of at least three years from 146 MS centers, spread over 40 countries and collected by the MSBase consortium, were used. With basic inclusion criteria for quality requirements, this represents a total of 15,240 PwMS. External validation was performed and repeated five times to assess the significance of the results. The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines were followed. Confirmed disability progression after two years was predicted, with a confirmation window of six months. Only routinely collected variables were used, such as the expanded disability status scale, treatment, relapse information, and MS course. To learn the probability of disability progression, state-of-the-art machine learning models were investigated. The discrimination performance of the models is evaluated with the area under the receiver operating characteristic curve (ROC-AUC) and under the precision-recall curve (AUC-PR), and their calibration via the Brier score and the expected calibration error. All our preprocessing and model code are available at https://gitlab.com/edebrouwer/ms_benchmark, making this task an ideal benchmark for predicting disability progression in MS.

Findings: Machine learning models achieved a ROC-AUC of 0·71 ± 0·01, an AUC-PR of 0·26 ± 0·02, a Brier score of 0·1 ± 0·01 and an expected calibration error of 0·07 ± 0·04. The history of disability progression was identified as being more predictive of future disability progression than the treatment or relapse history.
Conclusions Good discrimination and calibration performance on an external validation set is achieved, using only routinely collected variables. This suggests machine-learning models can reliably inform clinicians about the future occurrence of progression and are mature for a clinical impact study.


Introduction
Multiple sclerosis (MS) is a chronic autoimmune disease of the central nervous system [1]. A recent census estimated that more than 2·8 million people are currently living with MS [2]. It causes a wide variety of symptoms such as mobility problems, cognitive impairment, pain and fatigue. Importantly, the rate of disability progression is highly variable among people with MS (PwMS) [3]. This heterogeneity makes the personalization of care difficult, and prognostic models are thus of high relevance for medical professionals, as they could contribute to better individualized treatment decisions. Indeed, a more aggressive treatment could be prescribed in case of a negative prognosis. Moreover, surveys indicate that PwMS are interested in their prognosis [4], which could help them with planning their lives.
There is a large amount of literature on prognostic MS models [5,6,7,8,9]. Some prognostic models are or were at some point available as web tools. However, with the exception of Tintore et al. [9], which focuses on conversion to MS, none have been integrated into clinical practice and no clinical impact studies have been performed [5,6]. Because MS is a complex chronic disease that is often treated within a multidisciplinary context, the performance of a prognostic model studied in isolation from its clinical context gives limited information on its clinical relevance [10,11]. Recent systematic reviews have highlighted several methodological issues within the current literature [5,6], such as the lack of calibration or a possible significant bias in the cohort selection. Moreover, the investigated data sets are rarely made available. They furthermore often contain variables that are not routinely collected within the current clinical workflow (e.g. neurofilament light chain).

Importantly, and in contrast with the available literature on disease progression models for MS (except for one model to predict relapses [12]), our data pre-processing pipeline and our models check all the boxes of the TRIPOD checklist. Our work therefore provides an important step towards the integration of artificial intelligence (AI) models in MS care.

Data source
In this multi-center international study, we used data of people with MS from 146 centers spread over 40 different countries, collected in the MSBase registry [13] as of September 2020. All data were prospectively collected during routine clinical care, predominantly at tertiary MS centers [14]. The MSBase data set can be requested through MSBase, which facilitates data sharing agreements with each individual center.

Inclusion criteria
The inclusion criteria for the initial extraction of the data from MSBase were at least 12 months of follow-up, age 18 years or older, deceased or living, and a diagnosis of relapsing-remitting (RR) MS, secondary progressive (SP) MS, primary progressive (PP) MS or clinically isolated syndrome (CIS). This initial data set contained a total of 44,886 patients.
In order to ensure data quality, some patients and visits were removed from that cohort. Exclusion criteria include:

• Visits of the same patient that happened on the same day but had different expanded disability status scale (EDSS) values were removed. All duplicate visits with the same EDSS for the same visit date were removed (i.e., only one of the visits was retained). Visits from before 1970 were discarded.
• Patients with the CIS MS course at their last visit were discarded. For those patients the relevant question is whether or not they will progress to confirmed MS, which is a different question from the one investigated in this work.
The complete list of exclusion criteria is available in Supplementary Section Appendix D. These criteria resulted in a total number of 40,827 patients in the cohort. The details of the cohort construction are represented graphically in Figure 1.

Definition of disability progression
Confirmed disability progression was defined based on the EDSS measurements. EDSS was scored by accredited scorers (Neurostatus certification was required at each center) and was calculated based on the functional system scores.
Because assessing progression requires a baseline EDSS value to compare with, predictions were made at visit dates where an EDSS measurement was recorded. In our notation, t_0 denotes the time of the visit at which the prediction is made and the baseline EDSS is thus written as EDSS_{t=0}. Motivated by the non-linearity of the EDSS, unconfirmed disability progression (w_u = 1) after two years (t = 2y) is defined as follows [15]:

w_u = 1  ⟺  EDSS_{t=2y} − EDSS_{t=0} ≥ 1.5  if EDSS_{t=0} = 0,
            EDSS_{t=2y} − EDSS_{t=0} ≥ 1.0  if 0 < EDSS_{t=0} ≤ 5.5,    (1)
            EDSS_{t=2y} − EDSS_{t=0} ≥ 0.5  if EDSS_{t=0} > 5.5,

where EDSS_{t=2y} represents the last recorded EDSS before t = 2 years.

The EDSS suffers from inter- and intra-rater variability [16]. The actual state of the patient also fluctuates, e.g. because of recent relapses from which the patient could still (partly) recover. We therefore study disability progression (w_c) confirmed for at least six months. Progression is confirmed if all EDSS values measured within six months after the progression event and the first EDSS measurement after two years lead to the same worsening target w_u = 1 according to Eq. 1. EDSS measurements within one month after a relapse are not taken into account for confirming disability progression [15].
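The label computation can be sketched as follows. This is a minimal illustration assuming the baseline-dependent worsening thresholds of the progression definition in [15] (1.5 points from a baseline EDSS of 0, 1.0 point from 1.0–5.5, 0.5 points above 5.5); the function names are ours, not the pipeline's:

```python
def progressed(edss_baseline: float, edss_followup: float) -> bool:
    """Unconfirmed disability progression (w_u = 1) after two years.

    The required EDSS increase depends on the baseline score,
    reflecting the non-linearity of the scale (assumed thresholds).
    """
    delta = edss_followup - edss_baseline
    if edss_baseline == 0.0:
        return delta >= 1.5
    if edss_baseline <= 5.5:
        return delta >= 1.0
    return delta >= 0.5


def confirmed_progression(edss_baseline: float, confirmation_edss: list) -> bool:
    """Confirmed progression (w_c): the first EDSS after two years and all
    EDSS values within the six-month confirmation window (relapse-free
    visits only) must all satisfy w_u = 1 against the same baseline."""
    return len(confirmation_edss) > 0 and all(
        progressed(edss_baseline, e) for e in confirmation_edss
    )
```

For example, a patient at baseline EDSS 2.0 whose follow-up EDSS values are 3.0 and 3.5 counts as confirmed progression, while a drop back to 2.5 within the confirmation window does not.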

Definition of clinical episodes
For each patient, all visits can potentially represent a valid EDSS baseline for a progression episode. More generally, it is possible to divide the available clinical history of a patient into multiple (potentially overlapping) episodes for which a disability progression label can be computed. Each episode therefore receives its own confirmed progression label (w_c), as shown in Figure 2. Extracting several episodes per patient significantly increases the number of data points in the study.
We define two cohorts of patients: one with a minimum of three EDSS measurements and one with a minimum of six EDSS measurements in the last 3.25 years. The predictor variables included, among others, the location of the first symptom (i.e., supratentorial, optic pathways, brainstem or spinal cord).

Except for Mitoxantrone and Cyclophosphamide, we assumed that only one DMT was administered at the same time. MRI variables were not included due to high missingness. Indeed, the lesion counts were available in less than 1·7% of the clinical episodes. The variable indicating whether the MRI was normal, abnormal MS typical, or abnormal MS atypical was judged as not informative enough.
The above variables were then grouped in three feature sets: static, dynamic (summary statistics of the clinical history) and longitudinal [17]. These represent increasing quantities of information regarding the clinical history of patients. More details regarding the variables used and the grouping can be found in Supplementary Section Appendix H.

Models
The disability progression was framed as a classification problem. The following models were used to predict disability progression: a temporal attention model with continuous temporal embeddings [18], a Bayesian neural network and a multi-layer perceptron. The development of the models followed the TRIPOD guidelines for reporting prognostic models [19]. The checklist can be found in Supplementary Figure E.8.
The multi-layer perceptron model is a neural network architecture that takes as input the static and dynamic features set, represented as a fixed length vector. The model is composed of five hidden layers of dimension 128.
The Bayesian neural network has a similar architecture to the multi-layer perceptron, but provides uncertainty estimates on the weights of the last hidden layer by incorporating MC dropout [20]. This should confer better generalization capabilities as well as better calibration.

The temporal attention model relies on a transformer architecture [18]. In contrast to the previous models, this architecture is able to handle the longitudinal feature set, as it is able to process the whole clinical time series. Each visit is encoded as a fixed-length vector along with a mask for missing features and a continuous temporal embedding. This temporal embedding allows for arbitrary time differences between measurements, and is therefore especially suited for clinical time series, where irregular sampling is most common. The static and the dynamic feature sets were included in the model as extra temporal features that are repeated over the patient history. Two temporal attention layers with dimension 128 were used.
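A continuous temporal embedding in this spirit can be sketched with sinusoidal features of the real-valued visit time, analogous to transformer positional encodings. The dimension and frequency ladder below are illustrative assumptions, not the exact parameterization used in the models:

```python
import numpy as np

def temporal_embedding(t: np.ndarray, dim: int = 128,
                       max_period: float = 10_000.0) -> np.ndarray:
    """Map real-valued visit times (e.g. days since baseline) to a
    fixed-length vector. Because t is continuous, arbitrary gaps
    between visits are handled naturally.
    t: shape (n_visits,); returns shape (n_visits, dim)."""
    half = dim // 2
    # geometric ladder of frequencies, from 1 down to 1/max_period
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# three irregularly spaced visits (in days)
emb = temporal_embedding(np.array([0.0, 37.5, 402.0]))
```

Two visits that are close in time receive similar embeddings, which is what lets the attention layers reason about irregular sampling.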

Evaluation
The data set was split into 60% for training, 20% for validation and 20% for testing. The validation data was used to optimize the hyperparameters of the models. Post-hoc calibration methods (Platt scaling [21] and isotonic regression [22]) were fitted on the validation set and the performance evaluated on the test set. The test set was not seen during model training and hyperparameter optimization. To produce a measure of uncertainty of the performance of the models, the procedure of splitting the data and training the models was repeated five times, corresponding to five splits. As the data set consists of patients from different centers, we split the data set such that the validation and test sets represent an external validation: patients from the same center were therefore assigned to the same set (training, validation or test).
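A center-grouped split in this spirit can be sketched as follows. This is a simplified illustration: the 60/20/20 proportions are applied at the center level, whereas in practice one may also account for per-center patient counts:

```python
import numpy as np

def center_split(center_ids, seed=0):
    """Split episodes 60/20/20 at the *center* level: all episodes from a
    given center land in the same set, so that the validation and test
    sets act as external cohorts."""
    rng = np.random.default_rng(seed)
    centers = np.unique(center_ids)
    rng.shuffle(centers)
    n = len(centers)
    train_c = set(centers[: int(0.6 * n)].tolist())
    val_c = set(centers[int(0.6 * n): int(0.8 * n)].tolist())
    idx = np.arange(len(center_ids))
    in_train = np.array([c in train_c for c in center_ids])
    in_val = np.array([c in val_c for c in center_ids])
    return idx[in_train], idx[in_val], idx[~in_train & ~in_val]
```

Repeating this with five different seeds mirrors the five-split procedure used to estimate the variability of the performance.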
Discrimination was evaluated using the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (AUC-PR). Calibration was evaluated numerically using the Brier score and the expected calibration error (ECE) with 20 bins. Calibration was evaluated visually using reliability diagrams.
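The expected calibration error with equal-width bins can be sketched as below (the Brier score is simply `np.mean((y_prob - y_true) ** 2)`). This is a generic implementation, not the paper's exact code:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=20):
    """ECE: average |mean predicted probability - observed event rate|
    over equal-width probability bins, weighted by bin occupancy."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # equal-width bins; clip so that a probability of exactly 1.0
    # falls into the last bin rather than a bin of its own
    bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece
```

A perfectly calibrated model has zero gap in every occupied bin; a model that predicts 0.9 for events that never occur gets an ECE of 0.9.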

medRxiv preprint doi: https://doi.org/10.1101/2022.09.08.22279617; this version posted September 11, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Figure 2: Problem Setup. (a) For each patient episode, the available data for prediction consists of the baseline data and the longitudinal clinical data in the observation window. Disability progression (w_c) is assessed based on the difference between the EDSS at time t = 0 and two years later (t = 2y), as defined in Equation 1. (b) Based on the available historical clinical data (in the observation time window), we aim at training a model able to predict the risk p(w_c) of disability progression at a two-year horizon.


Participants
A flowchart of the patient inclusion for the final cohort is shown in Figure 1. The requirements on data quality and availability led to final cohorts of 15,240 and 25,246 patients for the three and six EDSS measurement criteria, respectively. Basic characteristics of the final cohorts are shown in Table 1.

Model development
The inclusion criteria and pre-processing of the raw data resulted in 283,115 episodes from 26,426 patients in the 3-visits cohort. 11·64% of those episodes represent a progression event, hence showing a mild imbalance. We addressed this imbalance by re-weighing each sample proportionally to its label occurrence. The code for training the models and the final models are publicly available at https://gitlab.com/edebrouwer/ms_benchmark.
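Such re-weighing can be sketched as giving each class the same total weight in the training loss. This is a generic inverse-frequency illustration, not necessarily the exact scheme used:

```python
import numpy as np

def class_balance_weights(labels):
    """Weight each episode inversely to its label frequency so that the
    minority (progression) class contributes as much to the loss as the
    majority class."""
    labels = np.asarray(labels)
    freq = np.bincount(labels) / len(labels)   # per-class frequency
    w = 1.0 / freq[labels]                     # inverse-frequency weight
    return w / w.mean()                        # normalize: average weight 1
```

With ~11.6% positive episodes, each progression episode then weighs roughly 7.6 times as much as a non-progression episode.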

Model performance
The performance of the models is reported in Tables 2 to 4. A visual illustration of the discrimination performance is shown in Figure 3. The attention-based model reaches a ROC-AUC of 0·71 ± 0·01 and an area under the precision-recall curve of 0·26 ± 0·02, with a calibration error of 0·07 ± 0·04 on the external test cohort.
To assess the reliability of those results on specific sub-groups of patients, we also report the performance for each different MS course at time of prediction (Table 3) and different baseline EDSS (EDSS_{t_0}) (Table 4). The relapsing-remitting (RR) category shows a performance similar to the full cohort. The smaller primary progressive and secondary progressive groups, on the other hand, suffer from low sample size, resulting in a decreased discrimination performance. The same effect is observed when we segment the performance by disability severity, with the group of higher severity showing a lower discrimination performance. In the supplementary material, we also show a segmentation of the results by the medical center of origin of the patients (Figure A.5), indicating a higher variability of the results for small centers.
The calibration of the different models can be assessed from the Brier score and the expected calibration errors in the results tables. In Figure 4, we also show the calibration plot of the longitudinal attention model on the test set.

Table 1: Summary statistics of the cohort of interest after extraction from MSBase (Extracted Cohort) and after patient and sample selection (Final Cohort). For all variables the value at the last recorded visit was used. a: mean ± standard deviation; b: median (quartiles); c: % missing data.


Feature importance
The importance of the different variables used in our models is investigated. Table 5 shows the results of a permutation importance test on the MLP model, assessing the loss in discrimination performance when a variable is shuffled over the test set [24]. Table 5 ranks the features in decreasing order of importance. The most important variables relate to the history of disability progression, which was found to be more predictive of future progression than the treatment or relapse history.
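A permutation importance test of this kind can be sketched generically: shuffle one column at a time on the test set and record the drop in the chosen score. Here an arbitrary `score_fn` stands in for the trained MLP's discrimination metric:

```python
import numpy as np

def permutation_importance(score_fn, X, y, seed=0):
    """Importance of feature j = drop in score when column j is shuffled,
    which breaks its association with the target while preserving its
    marginal distribution."""
    rng = np.random.default_rng(seed)
    base = score_fn(X, y)
    drops = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, j])              # shuffle one column in place
        drops.append(base - score_fn(X_perm, y))
    return np.array(drops)
```

An uninformative feature yields a drop of (approximately) zero, while shuffling a feature the model relies on degrades the score markedly.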

Figure 3: Visual representation of the discrimination performance: the ROC curve, the precision-recall curve and the distribution of the estimated probability of progression per group, obtained with the temporal transformer model.

Discussion
The models investigated in this study provide a significant advance towards deploying AI in clinical practice in MS. After validation of the results in a clinical impact study, they have the potential to let the MS field benefit from the advantages of advanced predictive modelling capabilities.
Our work confirms that predicting disability progression of MS patients is feasible. Importantly, it can be achieved with variables that are collected as part of routine clinical care. Despite MS progression being inherently stochastic, we show that relevant historical clinical data can lead to high discrimination performance combined with good calibration (Figure 4), which is crucial in healthcare applications. This points towards a readiness of this model to be tested in a clinical impact study. Our external validation procedure ensures generalizability within centers participating in MSBase, as it allows estimating the inter-center variation of the performance.

However, the models also suffer from limitations. First of all, several countries with good quality MS registries were not included because they are not part of the MSBase initiative. Since treatment decisions can be country-specific to a significant degree [25], this can result in a difference in performance of the proposed models in countries not included in this data set. Yet, a clinical impact study in MS centers participating in MSBase would not suffer from such external validity problems.
Second, our inclusion criteria require patients with good follow-up (at least one yearly visit with EDSS measurement), so stable patients that do not visit regularly cannot benefit from this model. Importantly, because we rely on at least three years of historical clinical data, our model cannot be applied to patients who were recently diagnosed with MS. This task for newly-diagnosed patients would require the design of a dedicated model.
Third, the progression target that we defined in this work cannot realistically fully capture the complexity of the disease, and progression in MS cannot be summarized by EDSS only. Despite its imperfections, this metric has proven clinically useful, striking a good balance between abstraction and expressiveness. Our work therefore builds upon those concepts and inherits their flaws and advantages.

Despite these imperfections, our models could potentially help patients in the planning of their lives and provide a baseline for further research. An emphasis was placed on reproducibility, in an attempt to provide a strong benchmark for this important task. Thanks to the clinically-informed pre-processing pipeline, researchers can easily extend the current models or propose their own, to continuously improve disease progression prediction. Extensions to our method could include treatment recommendation or the inclusion of other biomarkers available in a specific center.

Conclusion
In this work, we developed and externally validated machine learning models for predicting disability progression of people with MS. The performance achieved by these models, along with the availability of the predictors they rely on, implies that a clinical impact study is feasible. Such an impact study would provide important information regarding how patients use such model predictions, and how medical professionals interact with and use such predictions.
Our clinically-informed data processing pipeline and task definition allow the machine learning community to contribute meaningfully in improving such prediction models.

The data set used in this study is available upon request to the MSBase principal investigators included in the study. MSBase operates as a single point of contact to facilitate the data sharing agreements with the individual data custodians.


[19] Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement, Circulation 131 (2015).



Appendix A. Per-center validation
In Figure A.5, we plot the ROC-AUC of individual centers in the test set against the size of the center. We observe that as the size of the centers grows, the performance converges to the average ROC-AUC. As the size of centers shrinks, the variability in performance increases, which is statistically expected due to low sample size.

Appendix B. Calibration curves
In Figure B.6, we show the calibration curves of the different models on the test set (fold, i.e. train-test split, 0). Calibration was performed using Platt scaling [21]. We observe good calibration for all models. The discrepancy with the ideal line (dotted) in the higher-score regime can be explained by the lower number of data points in that region, leading to more variance.
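Platt scaling fits a one-dimensional logistic regression mapping raw validation scores to calibrated probabilities. A minimal gradient-descent sketch (the actual implementation may use a library solver):

```python
import numpy as np

def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * s + b) on held-out validation scores by
    gradient descent on the log loss (Platt scaling)."""
    a, b = 1.0, 0.0
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                     # d(logloss)/d(logit)
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

def apply_platt(scores, a, b):
    s = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(a * s + b)))
```

The parameters (a, b) are fitted on the validation set only, so the test-set calibration metrics remain an honest external evaluation.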

Appendix C. Calibration per clinical group
In Figure C.7, we show the prevalence of progression within different clinical subgroups of patients (observed proportion) and the average probability of progression in the subgroup as given by the different models. This is a way to assess calibration performance for different subgroups.

Appendix D. Exclusion criteria
• Patients whose diagnosis date or age at first symptoms (i.e., MS onset date) was missing or with invalid formatting were removed.
• Patients whose MS course or sex was not available were removed.

• Patients whose date of MS diagnosis, birth, MS onset, start of progression, clinic entry or first relapse was later than the extraction date were discarded.


• All visits whose visit date had an invalid format or was after the extraction date were discarded.

• Visits of the same patient that happened on the same day but had different expanded disability status scale (EDSS) values were removed. All duplicate visits with the same EDSS for the same visit date were removed (i.e., only one of the visits was retained). Visits from before 1970 were discarded.

• Patients with the CIS MS course at their last visit were discarded. For those patients the relevant question is whether or not they will progress to confirmed MS, which is a different question than the one investigated in this work.

Appendix E. TRIPOD checklist

The design of the algorithms carefully followed the TRIPOD checklist, as shown in Figure E.8. All points are checked or are deemed not applicable in our study. The following items deserve specific comment:

• 6b. Report any actions to blind assessment of the outcome to be predicted.

• 11. Provide details on how risk groups were created, if done. No risk groups were identified in this study.
• 14b. This can only be done for statistical models. However, we report measures of variable importance in Section 3.4.
• 17. Model updating. The models proposed here are not updates of previous iterations but rather their first development.
Note also that no sample size calculations were performed; the size of this retrospective data set was fixed.

Appendix F. Full comparison with other machine-learning models
In this section, we report more detailed performance results of the proposed models, along with other machine-learning architectures that were considered. More information about all considered architectures can be found in Appendix G.



Performance of all models. Two cohorts are considered: patients with at least 3 visits with EDSS in the last 3.25 years and patients with at least 6 visits with EDSS in the last 3.25 years. Low severity patients are defined as having EDSS ≤ 5.5 at baseline, while high severity patients are defined as having EDSS > 5.5 at baseline.

Two cohorts are considered: patients with at least 3 visits with EDSS in the last 3.25 years and patients with at least 6 visits with EDSS in the last 3.25 years.

Appendix G. Machine Learning Models Details
Appendix G.1. Bayesian Neural Networks

Introduced by Gal et al. [20], Monte Carlo Dropout is an approximate ensemble method for Bayesian Neural Networks (Bayesian NN or BNN). The authors prove that a neural network with dropout layers and L2-regularization approximates the predictive posterior distribution of a Gaussian process for a given data set. The resulting ensemble is in general well-calibrated [26], and more accurate than its non-Bayesian counterpart due to the regularizing effect. The Bayesian NN implemented in this work is identical to the baseline neural network previously introduced, with a few key differences. First, dropout [27] is applied between every layer. Second, the logits of the network are modeled as Gaussian distributions, rather than point estimates.




Third, the loss function is a modified cross-entropy loss that samples from the aforementioned logit distributions [26]. While Kendall et al. [26] define a loss that allows capturing aleatoric uncertainty, we simplify it further to work for our binary classification. We use Monte Carlo integration to approximate the distribution.
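Such a Monte-Carlo-sampled binary cross-entropy can be sketched as follows, assuming the network outputs a mean µ and a log-variance log σ² for the logit (as described below); the sample count and function signature are illustrative:

```python
import numpy as np

def sampled_bce_loss(mu, log_var, y, n_samples=100, seed=0):
    """Monte-Carlo binary cross-entropy: the network outputs a Gaussian
    over the logit (mu, log_var); the likelihood is averaged over sampled
    logits *before* taking the log (cf. Kendall & Gal)."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)
    logits = mu + sigma * rng.standard_normal(n_samples)  # sampled logits
    p = 1.0 / (1.0 + np.exp(-logits))                     # sampled probabilities
    likelihood = np.mean(p ** y * (1.0 - p) ** (1.0 - y))
    return -np.log(likelihood)
```

A confident, correct logit (large µ with the right sign, small σ) gives a loss near zero, while a confident, wrong one is heavily penalized.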

The resulting network has a way of expressing epistemic (model-bound) uncertainty by using dropout, creating a distribution over the model weights.
On the other hand, the resulting BNN also has a way of expressing aleatoric uncertainty, by modelling the output of the network as a Gaussian distribution in logit space. We define the network to have two outputs, µ and log σ². Instead of learning the variance directly, the log variance is learnt to constrain the variance to be positive. Our loss function for a single input, with T the number of Monte Carlo samples and y the true label, is defined as follows:

L = −log( (1/T) Σ_{t=1}^{T} p_t^y (1 − p_t)^{1−y} ),  where p_t = sigmoid(µ + σ · ε_t) and ε_t ∼ N(0, 1).

Appendix G.2. DeepMTP

The DeepMTP framework was introduced by Iliadis et al. [28] as a unified approach for multi-target prediction (MTP) problems. Applications that fall under the umbrella of MTP are concerned with the simultaneous prediction of multiple target variables. Even though this work focuses on the prediction of a single binary variable (patient progresses or not), we are able to use the DeepMTP framework by applying a multi-task trick. To achieve this, we select one categorical feature from the available data set (country) and create multiple targets (or tasks), thus forming a multi-task learning problem. By doing this, the goal becomes the prediction of the progression of a patient depending on the country they are residing in. Even though the natural benchmark comparison for a multi-task problem is a collection of models trained on subsets of the original data set belonging to a single country (single-task models), we believe that this is out of the scope of this work. For this reason, after prediction, we collapse the tasks and create a single prediction that is comparable with the other methods tested in this paper.

In terms of architecture, DeepMTP uses a two-branch architecture that is flexible enough to be adapted to the different MTP prediction settings. In this specific multi-task case, the first branch encodes the same features as all the other methods, and the second uses a one-hot encoded vector that maps to a given country. Both branches are composed of one or more fully connected layers, and their outputs are combined using a dot product. The single value resulting from the dot product is the output of the entire network (progression or not).
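The two-branch scoring can be sketched as follows; the dimensions, activations and random weights are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative dimensions (assumptions, not the paper's settings)
n_features, n_countries, d = 32, 10, 16

W_x = rng.normal(size=(n_features, d)) * 0.1   # instance branch (patient features)
W_c = rng.normal(size=(n_countries, d)) * 0.1  # target branch (one-hot country)

def two_branch_score(x, country_onehot):
    """Dot product of the two branch embeddings gives the progression logit."""
    u = np.tanh(x @ W_x)               # instance embedding
    v = np.tanh(country_onehot @ W_c)  # target embedding
    return float(u @ v)

x = rng.normal(size=n_features)        # one episode's feature vector
c = np.eye(n_countries)[3]             # one-hot country id
logit = two_branch_score(x, c)
```

Collapsing the tasks afterwards amounts to reading off the score for each patient's own country, which is what makes the output comparable with the single-task models.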

Appendix G.3. Factorization Machines
In 2010, Rendle introduced Factorization Machines (FM) as a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models [29]. Factorization Machines model interactions between features using factorized parameters. The prediction function of degree two, meaning that pairwise interactions represent the highest degree of interaction considered, is given by:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

with model parameters w_0 ∈ R, w ∈ R^n and V ∈ R^{n×k}, and the dot product ⟨v_i, v_j⟩ between the rows v_i and v_j of V modelling the interaction between the i-th and j-th feature. Factorization Machines are more widely applicable than regular factorization models such as matrix factorization, as they naturally include features as well and learn to map them into a lower-dimensional latent factor space. This behaviour explains why FM can be surprisingly successful when working with categorical features (e.g. country, sex), even under high sparsity. Also, thanks to their linear time complexity, they are often applied to large real-world recommendation data sets [30].
In this work, we used stochastic gradient descent with adaptive regularization as a learning method [31,30]. A Python implementation is available at https://github.com/godpgf/pylibfm. The main hyper-parameter to tune is the size k of the latent factor space.
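The degree-two prediction function above can be illustrated with a few lines of Python. This sketch is independent of the pylibfm implementation used in the paper; the toy parameters are made up. It also shows the standard rewriting of the pairwise term, 0.5 Σ_f [(Σ_i v_{i,f} x_i)² − Σ_i (v_{i,f} x_i)²], which gives the model its linear O(k·n) evaluation cost.

```python
def fm_predict(x, w0, w, V):
    """Degree-2 Factorization Machine prediction in O(k*n) time.

    x: feature vector (length n), w0: global bias, w: linear weights
    (length n), V: n x k matrix with one latent factor row per feature.
    """
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same model, evaluated with the explicit O(n^2) sum over feature pairs."""
    n = len(x)
    dot = lambda vi, vj: sum(a * b for a, b in zip(vi, vj))
    return (w0 + sum(w[i] * x[i] for i in range(n))
            + sum(dot(V[i], V[j]) * x[i] * x[j]
                  for i in range(n) for j in range(i + 1, n)))
```

Both evaluations are mathematically identical; only the linear-time form is practical for the high-dimensional sparse inputs FM is typically applied to.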

Except for Mitoxantrone and Cyclophosphamide, we assumed that only one DMT was administered at the same time. This implies that if a new DMT was started, the administration of the previous DMT was considered to have ended, even if no end date was registered in the data. Mitoxantrone and Cyclophosphamide can be administered in combination with another DMT. Indeed, they are induction DMTs and are thus expected to have a long-term effect. Therefore, only the start dates of these two DMTs were recorded. They were coded by a separate category: highly active induction DMTs. Alemtuzumab and Cladribine are also induction DMTs. In contrast to Mitoxantrone and Cyclophosphamide, they are not combined with other DMTs. If a new DMT was started, it was assumed that they were no longer effective, and the start date of the new DMT was taken as the end date of Alemtuzumab or Cladribine.
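These end-date rules can be sketched as a small imputation function. This is an illustrative reading of the rules above under stated assumptions, not the repository's actual preprocessing code: the record layout (dicts with `name`, `start`, `end`) and the function name are made up for the example.

```python
def impute_dmt_ends(dmt_records):
    """Close open-ended DMT episodes when a new DMT is started.

    dmt_records: list of dicts with keys 'name', 'start', 'end' (end may be
    None), sorted by start date. Mitoxantrone and Cyclophosphamide can run
    in combination with other DMTs, so they are left untouched (only their
    start dates are kept). For Alemtuzumab and Cladribine, the start of the
    next DMT always becomes their end date; for other DMTs, it is used as
    the end date only when no end date was registered.
    """
    combo_ok = {"Mitoxantrone", "Cyclophosphamide"}
    induction_no_combo = {"Alemtuzumab", "Cladribine"}
    sequential = [r for r in dmt_records if r["name"] not in combo_ok]
    for prev, nxt in zip(sequential, sequential[1:]):
        if prev["end"] is None or prev["name"] in induction_no_combo:
            prev["end"] = nxt["start"]
    return dmt_records

records = impute_dmt_ends([
    {"name": "Interferon",   "start": 2000.0, "end": None},
    {"name": "Mitoxantrone", "start": 2001.0, "end": None},
    {"name": "Alemtuzumab",  "start": 2002.0, "end": None},
    {"name": "Fingolimod",   "start": 2004.0, "end": None},
])
# Interferon ends at 2002.0, Alemtuzumab at 2004.0; Mitoxantrone keeps no end.
```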

Appendix H.2. Grouping of the included clinical variables
The static feature set contains variables available at t = 0 without taking into account possible previous values. Categorical variables are encoded as indicator variables. For example, sex is encoded as female yes / no and male yes / no. If data can be missing, the category "unknown" is added. EDSS and the KFS scores were treated as continuous variables, even though they are categorical. The variables of the static feature set are: Sex, Age (years), Age at MS onset (years), Disease duration (years), MS course at t = 0 (RRMS, SPMS, PPMS), EDSS at t = 0, Last used DMT at t = 0, Use of induction DMT at t = 0, all KFS scores at t = 0, education status, first symptom (supratentorial, optic pathways, brainstem, spinal cord or missing), time of prediction (years since 1990), time of diagnosis (years since 1990).
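The indicator encoding with an explicit "unknown" category can be sketched as follows; this is a generic illustration (function name and category lists are made up), not the paper's preprocessing code.

```python
def one_hot(value, categories):
    """Indicator encoding with an explicit 'unknown' category for missing data.

    Returns one 0/1 indicator per known category plus one for 'unknown'.
    Any value that is None or not in the category list maps to 'unknown'.
    """
    if value is None or value not in categories:
        value = "unknown"
    return [1 if value == c else 0 for c in categories + ["unknown"]]

# Example: MS course at t = 0, with possible missing values.
courses = ["RRMS", "SPMS", "PPMS"]
one_hot("SPMS", courses)  # [0, 1, 0, 0]
one_hot(None, courses)    # [0, 0, 0, 1]
```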

The dynamic feature set adds information about behaviour before t = 0 (longitudinal information) to the static data set. It contains variables that are hand-engineered from the longitudinal variables: number of visits in the last 3.25 years, the minimum and maximum in the whole history (t ≤ 0) of the EDSS and all KFS variables, mean and standard deviation over the last 3.25 years of the EDSS and all KFS scores, oldest EDSS and KFS score measured in the last 3.25 years, relapse rate over the whole history (number of relapses divided by the follow-up period since the first clinical visit), time since the last relapse (years), presence of high-efficacy DMT in the past, disease duration until a first DMT was administered, disease duration until a high-efficacy DMT was administered, time spent on a DMT during the disease duration (ratio of time on a DMT to the time since MS onset), and time since the last Fampridine administration.

The variables time since the last relapse, disease duration until a DMT was administered, disease duration until a high-efficacy DMT was administered, and time since the last Fampridine administration are transformed according to a 1/(1 + t) scaling, with t the actual time. If no time can be defined because, e.g., no DMT has ever been administered, the transformed variable is 0. If t < 0, which can happen because of erroneous dates in the data set, the transformed variable is set to 1.
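The 1/(1 + t) scaling and its two special cases can be written as a small helper; the function name is illustrative, but the three rules are exactly those described above.

```python
def time_transform(t):
    """1/(1 + t) scaling of a 'time since/until event' feature (t in years).

    - No event ever observed (t is None) -> 0 (event infinitely far away).
    - Negative t (erroneous dates)       -> 1 (treated as 'just happened').
    - Otherwise, recent events map close to 1 and distant events close to 0.
    """
    if t is None:
        return 0.0
    if t < 0:
        return 1.0
    return 1.0 / (1.0 + t)

time_transform(None)  # 0.0  (e.g. no DMT ever administered)
time_transform(-2.0)  # 1.0  (erroneous date in the data set)
time_transform(1.0)   # 0.5  (event one year ago)
```

This maps an unbounded, possibly undefined time onto the bounded interval [0, 1], which is convenient for the downstream models.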

The longitudinal feature set contains the dates and values for the following variables: all measured EDSS values and KFS scores, relapse occurrence (encoded as a binary variable set to 1 when a relapse occurs), relapse position (brainstem, pyramidal tract or other), cumulative relapse count, MS course, DMT administration (start and end dates), induction DMT administration (start date), and Fampridine administration. The timing of measurements is expected to be informative [25,17,32].

Appendix I. Extra information on target definition
Importantly, if progression w = 1 cannot be confirmed because there are no EDSS measurements after two years that can be used for confirmation, it is not a valid target and no label can be derived. If progression cannot be confirmed because an EDSS used for confirmation leads to w = 0, it counts as no disability progression (w_c = 0). If there is no disability progression (w = 0), no confirmation is needed to make it a valid target. Note that even with confirmation for at least six months, around 20% of progression events are expected to regress after more than five years [33]. However, disability progression that lasts several years is a relevant outcome for the person with MS.
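The three labelling cases can be sketched as one function. This is a minimal reading of the rules above: the underlying EDSS progression criterion is abstracted into boolean inputs, and the function name and signature are illustrative, not the repository's code.

```python
def confirmed_label(progressed_at_2y, confirmation_edss_progressed):
    """Derive the confirmed-progression label w_c for one episode.

    progressed_at_2y: w, whether progression was observed within two years
        (True/False), or None if it could not be assessed at all.
    confirmation_edss_progressed: one boolean per EDSS measurement in the
        confirmation window, True if that measurement still satisfies the
        progression criterion.

    Returns 1, 0, or None (invalid target, no label can be derived).
    """
    if progressed_at_2y is None:
        return None                        # progression could not be assessed
    if not progressed_at_2y:
        return 0                           # w = 0 needs no confirmation
    if not confirmation_edss_progressed:
        return None                        # w = 1 but no EDSS to confirm it
    if all(confirmation_edss_progressed):
        return 1                           # sustained progression: w_c = 1
    return 0                               # a confirming EDSS fails: w_c = 0

confirmed_label(True, [True, True])   # 1
confirmed_label(True, [True, False])  # 0
confirmed_label(True, [])             # None (not a valid target)
```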
Episodes were considered valid if they met the following criteria. The time at which the predictions were made had to be after 1990 (t_0 > 1990, Jan 1st).

This ensured that we had a cohort of patients from decades when disease-modifying therapies (DMTs) were available [34].
All variables measured at visits after 1970 were used to perform the prediction. We further required a minimum number of EDSS measurements in the last three years and three months of the observation window. In this study, two cohorts are investigated, one with a minimum of three EDSS measurements, the other with a minimum of six EDSS measurements. This excluded patients with a less than yearly (respectively twice-yearly) EDSS follow-up frequency. The three additional months were chosen to allow for some margin in when the yearly visit was planned. The patient should not be classified as being in a clinically isolated syndrome (CIS) at t = 0.
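The minimum-frequency inclusion criterion amounts to counting EDSS measurements inside the 3.25-year window before the prediction time. The sketch below is illustrative (function name and date representation are made up), not the repository's preprocessing code.

```python
def in_cohort(edss_dates, t0, min_measurements, window_years=3.25):
    """Check the minimum-EDSS-frequency inclusion criterion for one patient.

    edss_dates: times (in years) at which an EDSS score was recorded.
    t0: prediction time; only measurements in (t0 - window_years, t0] count.
    """
    n_in_window = sum(1 for t in edss_dates if t0 - window_years < t <= t0)
    return n_in_window >= min_measurements

# A patient with yearly follow-up over the last three years is kept in the
# three-EDSS cohort but excluded from the six-EDSS cohort.
dates = [2015.0, 2016.0, 2017.0, 2018.0]
in_cohort(dates, t0=2018.0, min_measurements=3)  # True
in_cohort(dates, t0=2018.0, min_measurements=6)  # False
```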
To summarize our target of disability progression in words: the patient will experience a disability progression event somewhere in the next two years. This event is sustained for at least six months and at least until two years after the time the prediction is made.


• Guillermo Izquierdo received speaking honoraria from Biogen, Novar-Serono, Roche, Biogen idec, Sanofi Genzyme and Pendopharm; has received grant support from Genzyme and Roche; and has received research grants for his institution from Biogen idec, Sanofi Genzyme and EMD Serono.
• Tomas Kalincik served on scientific advisory boards for BMS, Roche, Janssen, Sanofi Genzyme, Novartis, Merck and Biogen, and on the steering committee for the Brain Atrophy Initiative by Sanofi Genzyme; received conference travel support and/or speaker honoraria from WebMD Global, Eisai, Novartis, Biogen, Sanofi-Genzyme, Teva, BioCSL and Merck; and received research or educational event support from Biogen, Novartis, Genzyme, Roche, Celgene and Merck.
