Modeling train route decisions during track works

model are modest. Results indicate that a substantial amount of unobserved construction site heterogeneity is present, which Random Forest cannot capture either.


Introduction
The construction schedule is an important factor in minimizing the impact of construction-based capacity restrictions in order to guarantee smooth rail operations. To optimize this process, the aim of this paper is to provide an empirical basis for supporting the train route schedulers by providing likely outcomes based on current observations for five different scheduling alternatives: Cancellation of the train schedule at the beginning (KA), in the middle (KM) and at the end (KE) of the itinerary of the train service, detour (U) and delay/ahead of time (V). Using 39 train-, construction site-, and infrastructure attributes, statistical models are trained for large datasets from the German Railway system (DB).
The envisaged optimization pipeline is schematically described in Fig. 1: Once the affected trains are known in step (1), based on the predictions of the statistical model, prospective rules and regulations (2) are created on an annual basis. These rules are then included in the optimization process (3) of the construction schedule (for an introduction to the automatic construction schedule process, see discussions in Dahms et al., 2019). If a new schedule is approved by the operators concerned (4), it will be adopted (5); if not, there will be a feedback loop re-examining the input predictions. The goal of this paper is related to step (2), trying to better understand the choice behavior of the nationwide operating train route schedulers and to accurately predict their choices for the subsequent annual planning horizons. As the network is constantly evolving and interdependencies start to occur as soon as first real-time choices are made, the forecasting models are not intended to be used in the short term, i.e. for the actual daily operation. 1 Since the construction site schedules are created on an annual basis (see discussions in Dahms et al., 2019), they should rather support train route schedulers' decision-making in a longer-term planning horizon.
The main contribution of this paper is to provide a detailed empirical analysis of the different factors affecting the choices of train route schedulers, a topic that has not yet been addressed in the field of transportation research. While providing little in terms of methodological advances, the main value added of this paper should be seen from a techno-practical perspective, providing insightful analyses and tools relevant to train route schedulers. A comprehensive investigation of the importance of the different train-, construction site-, and infrastructure attributes is crucial to understand their behavior, including the potential presence and amount of unobserved construction site heterogeneity. While a better understanding of the decision-making process is mostly important from a management and communication point of view, from a technical point of view the main goal is to make accurate predictions about decisions in the construction site schedule for new construction sites that enter the optimization process. To investigate both requirements, the choices on how to deal with a train when it is affected by a construction site are modeled using two conceptually different classification approaches: a traditional econometric and a machine learning approach.
Machine learning approaches are increasingly used, especially for prediction purposes (for a review of different methods, see e.g. Kotsiantis, 2007), and recent research has also made successful attempts to combine them with traditional econometric models (e.g. Sifringer et al., 2018; Yang et al., 2018). In the field of transportation research, the applications vary from travel behavior and mode choice prediction (e.g. Cantarella and de Luca, 2005; Omrani, 2015; Ke et al., 2017; Sun et al., 2018; Lhéritier et al., 2019; Zhao et al., 2020), travel incident prediction (e.g. Nassiri et al., 2014; Brown, 2015), travel time and flow prediction (e.g. Vanajakshi and Rilett, 2007; Zhang and Haghani, 2015; Xie et al., 2020) to purpose and mode imputation (e.g. Montini et al., 2014; Feng and Timmermans, 2016) as well as pattern recognition for travel mode and route choice prediction (for a comprehensive literature review, see also Cheng et al., 2019; Pineda-Jaramillo, 2019). Many of them have reported a superior prediction accuracy (PA) of Random Forest (RF; e.g. Breiman, 2001; Liaw and Wiener, 2002; Cutler et al., 2012) compared to other machine learning classifiers (for a comprehensive performance overview for different datasets, see also e.g. Caruana and Niculescu-Mizil, 2006), and have demonstrated that RF outperforms traditional discrete choice models such as Multinomial Logit (MNL), often substantially, in some cases showing improvements in PA of more than 20%-points (e.g. Hagenauer and Helbich, 2017; Cheng et al., 2019; Zhao et al., 2020).
Discrete choice models are based on the well-founded concept of utility maximization when modeling the choice of an alternative (e.g. McFadden, 1986), facilitating the interpretation of results due to a transparent specification of the functional form. The discrete choice models start with a basic specification, the MNL model, which is extended by accounting for unobserved heterogeneity and correlated alternatives, the nested error component Mixed Logit model (MIXL; e.g. Hensher and Greene, 2003; Walker et al., 2007; Train, 2009). Therefore, an important strength of discrete choice models is related to the possibility to explicitly account for the panel structure. While recent papers discuss the superior PA of machine learning models also for panel data (Zhao et al., 2020; Chen, 2021), they do not explicitly recognize the dependency structure of repeated observations. Also, one should note that all attributes describe the train-, construction site-, and infrastructure characteristics, which are invariant across alternatives. Thus, including random error components is an obvious way to account for unobserved heterogeneity at the construction site level, while random taste parameters cannot be estimated by definition (Train, 2009). Given the already large number of parameters (39 attributes × (5 − 1) alternatives + 4 constants = 160) to be estimated even in the most basic MNL

Data preparation and description
Data are generated based on the German train scheduling software MakSi FM (Makro Simulation Fahrplan Modifikation), to which the responsible train route schedulers at DB prospectively (i.e. for every planning horizon in advance, which typically covers one year) add rules on how trains are affected when they enter a construction site. DB was collaborating with the Institute for Transport Planning and Systems (IVT) of the Swiss Federal Institute of Technology (ETH) to investigate the key attributes and their influences on the choices of train route schedulers.
The data preparation mainly involved the merger of the train-, construction site-, and infrastructure attributes, where the relevant variables and their manifestations (i.e. exact calculations, available choice alternatives, discrete vs. continuous value ranges, etc.) were created in a continuous exchange with professionals from DB. Given that the main goal is to provide an annual forecasting model for different planning horizons, three datasets were created for the years 2020, 2021 and 2022. In subsequent analyses, 2020 is used as a training dataset (estimation sample), whereas 2021 and 2022 are used as test datasets (holdout sample). 3 As shown in Table 1, each dataset contains more than 150,000 observations of about 2000 construction sites.
A construction site thus typically involves multiple choices for different trains, as illustrated in Fig. 2. For all three datasets it shows highly right-skewed distributions with an average of about 80 choice observations per construction site (note that the train attributes may vary for each construction site, while by definition the construction site- and infrastructure attributes are invariant).
The relevant choice dimension involves five alternatives 4:
• KA: Cancellation of the train schedule at the beginning of the itinerary of the train service (also includes total cancellation of the train schedule)
• KM: Cancellation of the train schedule in the middle of the itinerary of the train service
• KE: Cancellation of the train schedule at the end of the itinerary of the train service
• U: The train is redirected; it makes a detour
• V: The train passes through the construction site with a delay or ahead of time

As shown in Fig. 2, in all datasets U exhibits the highest relative choice frequency of more than 45%, followed by V (about 15%) and the three cancellation alternatives KA, KM and KE.
A complete list of train-, construction site-, and infrastructure attributes is shown in the Appendix, Table A.1, including a short description of each variable (for summary statistics, see Appendix, Table A.2). Together with the categorical attributes such as train type, type of the start/end yard of the itinerary of the train service, regulation at the construction site and track standard (which are re-coded as dummy variables), in total there are 39 explanatory variables (excluding the reference categories due to identification issues) 5 that are used in subsequent analyses.
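The re-coding of categorical attributes can be illustrated with a minimal sketch. It assumes the rule stated in the footnotes (categories with a share below 2% in the training dataset are added to the reference category); the function name `dummy_code`, the category labels and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter

def dummy_code(values, reference, min_share=0.02):
    """Dummy-code a categorical variable, omitting the reference category.

    Categories with a share below `min_share` get no column of their own,
    which is equivalent to adding them to the reference category.
    """
    counts = Counter(values)
    n = len(values)
    kept = sorted(
        cat for cat, count in counts.items()
        if cat != reference and count / n >= min_share
    )
    rows = [[1 if value == cat else 0 for cat in kept] for value in values]
    return kept, rows

# Hypothetical train types: "Rare" has a share of 1% and is folded
# into the reference category "Regional".
train_types = ["Regional"] * 60 + ["Freight"] * 39 + ["Rare"] * 1
columns, rows = dummy_code(train_types, reference="Regional")
print(columns)
```

Dropping one category per variable avoids perfect collinearity with the alternative-specific constants, which is the identification issue mentioned above.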
Train-related attributes include the different train types, mass (only available for freight trains) and length of the train, when and how much a train is affected by the construction site and information on the itinerary of the train service. Attributes related to the construction sites mainly include information on the type of work and when it takes place, as well as the regulation at the construction site, while infrastructure attributes mainly include information on the track operation, technological standards including number of detours, effective capacity and edge betweenness 6 of the construction site.
To get a first idea of the dependencies within the training dataset, 7 Fig. 3 shows a correlation matrix of the choice and explanatory variables. It already indicates the sign and magnitude of effects on the choice of each alternative: The strongest correlations are exhibited by variables related to the size of the train (Mass_Freighttrain_1000t and Length_Train_1000m), indicating that longer and heavier (freight) trains have a lower chance of a cancellation (negative correlations with KA, KM and KE) but a higher chance of being redirected (positive correlation with U). A similar pattern is found for the number of detours. The opposite pattern is found for regional passenger trains (most pronounced for Traintype_Regional_SBahn), exhibiting a higher chance of a cancellation while a redirection becomes less likely. On the other hand, if there is a total route closure (Regulation_Total_Closure), the chance of a cancellation increases, while the chance of a delay (V) strongly decreases.
3 Based on preliminary investigations, merging the 2020 and 2021 datasets into one large training dataset has been shown to be not decisive in terms of PA for the 2022 dataset (see also discussions in Section 4.4 and results in Table 5).
4 The original choice dimension involved one additional alternative (D: The train passes through the construction site without any changes in the timetable and construction schedule). This alternative was removed after discussions with experts from DB, since it exhibited a very low choice frequency of 2% and was considered not relevant for construction site scheduling.
5 Note that categories of categorical variables with a share below 2% in the training dataset are not modeled explicitly (i.e. added to the reference categories) due to estimation issues in the MNL and MIXL model. This includes all categories denoted with a star in Table A.2.
6 Edge betweenness is a measure of the relative importance of the track within the German rail network (Freeman, 1978).
7 Note that the dependencies are very similar in the two test datasets.
B. Schmid et al.
Important for the subsequent interpretation of effects are the correlations within attributes, as further discussed in Section 3.5. Fig. 3 reveals dependencies within explanatory variables such as Mass_Freighttrain_1000t and Length_Train_1000m, which as expected exhibit a strong and positive correlation of +0.85. Both variables are also strongly correlated with the number of available detours (i.e. such trains are typically operated on routes with more options for diversion) and the different train types (freight trains typically are heavier and longer, while especially regional passenger trains are smaller and shorter). It also shows that mainly external (i.e. not belonging to DB) freight trains (Traintype_Extern_Freight) typically follow longer itineraries of the train service (positive correlations with Start_Traveltime_h and End_Traveltime_h) and exhibit a higher chance to leave or enter Germany (LeavesOrEntersGermany), and that heavier and longer trains (mainly including external freight trains) are more strongly associated with a start or end in a freight or marshaling yard. Furthermore, shift work operation (Shiftwork) is positively associated with construction during night time (Construction_Night_Only), while construction sites over multiple days (Log_Days_Train_Affected) also exhibit a longer duration of uninterrupted operation of construction work (Construction_Cont_1000h). Most of the remaining correlations are small to moderate (|c| < 0.5). As discussed in subsequent sections, these interdependencies are important for a better understanding of the results of the different modeling approaches.

Modeling framework
From a technical point of view, the main goal of the models is to make accurate predictions about future decisions in the construction site schedule, i.e. to maximize the out-of-sample PA. From a management and communication point of view, however, there is also a strong requirement for better understanding the choice behavior of the train route schedulers in the sense of which attributes are most important and how they affect the choices quantitatively.
We want to stress at this point that a profound comparison between discrete choice and machine learning models is not targeted, as it has been done in previous research already (e.g. comprehensive work has been done by Zhao et al., 2020). The main goal is to use both techniques in a complementary and supporting way, and in subsequent analyses put the focus on one method when it outperforms the other (see e.g. discussions in Chen, 2021). We therefore propose a pragmatic approach by making use of the advantages of the different modeling techniques depending on the specific application -either as an input tool for the optimization process (maximizing PA) or to implement policy relevant management decisions (maximizing behavioral insights).

Multinomial Logit (MNL) and Mixed Logit (MIXL) model
The utility function of alternative $i \in \{\mathrm{KA}, \mathrm{KM}, \mathrm{KE}, \mathrm{U}, \mathrm{V}\}$ and construction site $n \in \{1, 2, \dots, N\}$ in each choice situation for train $t \in \{1, 2, \dots, T_n\}$ is given by

$$U_{i,n,t} = \mathrm{ASC}_i + \beta_i' X_{n,t} + \eta_{i,n} + \varepsilon_{i,n,t} \tag{1}$$

where V (delay/ahead of time) is the reference alternative for identification purposes. The utility function $U_{i,n,t}$ includes the following components:
• $\mathrm{ASC}_i$: Alternative-specific constant (ASC parameter)
• $X_{n,t}$: Vector of train-, construction site-, and infrastructure attributes (note that for a given construction site $n$, train attributes may vary in $t$)
• $\beta_i$: Alternative-specific parameter vector of train-, construction site-, and infrastructure attributes
• $\eta_{i,n} \sim N(0, \sigma_i^2)$: Unobserved random error component in the utility function (e.g. Bhat, 1995; Walker et al., 2007) related to construction site $n$
• $\varepsilon_{i,n,t}$: Remaining IID extreme value type I error term

With the data structure (panel data) described in Section 2, a major issue is the violation of distributional assumptions regarding the error terms (i.e. independently and identically distributed; IID; see e.g. Train, 2009). Observations are not independent and unobserved factors, especially at the construction site level, 8 may come into play that, if not properly accounted for, may lead to a bias in parameter estimates (e.g. Greene and Hensher, 2007; Baltagi, 2008), potentially affecting the interpretation and results of behavioral outputs. Therefore, the MIXL models also include the alternative-specific random error components $\eta_{i,n}$. One should note that the MIXL may not be substantially better in terms of PA than the MNL model, since for out-of-sample forecasts the unobserved components are, by definition, unknown. Nevertheless, this additional layer of complexity may help to get more informative parameter estimates and improve the general picture of which attributes are important and how sensitive results are with respect to changes in the underlying modeling assumptions.
The probability $P(\cdot)$ that alternative $i$ among the full set of available alternatives $j \in \{\mathrm{KA}, \mathrm{KM}, \mathrm{KE}, \mathrm{U}, \mathrm{V}\}$ for train $t$ passing by construction site $n$ is chosen is given by

$$P(i \mid X_{n,t}, \Omega) = \frac{\exp(V_{i,n,t})}{\sum_{j} \exp(V_{j,n,t})}$$

where $V_{i,n,t}$ is the systematic component of utility (not to be confused with alternative V) and $\Omega$ is the vector of all model parameters to be estimated. Models are estimated with the R package mixl (Molloy et al., 2021), a specialized software tool for estimating flexible choice models on large datasets. The initial MIXL model uses 100 Sobol draws to simulate the choice probabilities (e.g. Train, 2009). Although this number is not large enough to guarantee stability in parameter estimates (Walker and Ben-Akiva, 2002), given the large training dataset it is the maximum number with a still feasible computation time. 9 Therefore, in an additional effort, a reduced/compressed dataset is created and a model only including influential effects is estimated, using 2000 Sobol draws to guarantee stability (RMIXL model), which is mainly used to calculate unbiased behavioral measures such as MPE and E. Cluster-robust (by construction site ID) 10 standard errors are obtained by using the Eicker-Huber-White sandwich estimator (e.g. Zeileis, 2006).
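To make the logit choice probability concrete, the following minimal sketch computes it for a single train and construction site in the plain MNL case (no random error components). All parameter values, the two-attribute vector `x` and the function names are hypothetical illustrations, not estimates from the paper; the reference alternative V has its constant and coefficients fixed to zero.

```python
import math

def systematic_utility(asc, beta, x):
    """Linear-additive utility: constant plus weighted attributes."""
    return asc + sum(b * xi for b, xi in zip(beta, x))

def mnl_probabilities(utilities):
    """Logit probabilities from a dict {alternative: systematic utility}."""
    v_max = max(utilities.values())  # shift by the maximum for numerical stability
    exp_v = {alt: math.exp(v - v_max) for alt, v in utilities.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}

# Hypothetical (ASC, coefficient vector) per alternative.
params = {
    "KA": (-0.5, [0.8, -1.2]),
    "KM": (-1.0, [0.3, -0.4]),
    "KE": (-0.7, [0.2, -0.9]),
    "U": (1.0, [0.5, 0.6]),
    "V": (0.0, [0.0, 0.0]),  # reference alternative
}
x = [1.0, 0.4]  # hypothetical attribute values

utilities = {alt: systematic_utility(asc, beta, x) for alt, (asc, beta) in params.items()}
probs = mnl_probabilities(utilities)
for alt, p in probs.items():
    print(f"{alt}: {p:.3f}")
```

In the MIXL case, the same expression would be evaluated conditional on draws of the error components and averaged over the draws; that simulation step is omitted here.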

Random Forest (RF) model
The Random Forest (RF) approach is used to classify the choice of alternative $i \in \{\mathrm{KA}, \mathrm{KM}, \mathrm{KE}, \mathrm{U}, \mathrm{V}\}$ using binary recursive partitioning (e.g. Liaw and Wiener, 2002; Cutler et al., 2012). It is particularly efficient for very large and high-dimensional datasets and if one of the main goals is to obtain a high PA (e.g. Archer and Kimes, 2008; Qi, 2012). RF is relatively easy to use and mainly requires two hyperparameters to be defined (Liaw and Wiener, 2002): the number of trees in the forest and the number of variables in the random subset of explanatory variables for which the best split is chosen. However, a more fine-grained choice of additional hyperparameters is necessary to avoid over-fitting and increase computational efficiency. Therefore, in a preliminary effort, different combinations of hyperparameters were tested using the R package tuneRanger (Probst et al., 2019) for the number of variables per split, the minimum number of observations in the terminal nodes and the fraction of the training dataset used for training of the trees, which we did for a varying number of trees. A good performance (using a parsimonious setting) was found for 18 variables per split, a training fraction of 0.86 and a minimum of 2 observations in the terminal nodes. In a second effort, different combinations of the number of trees and the maximal number of terminal nodes were investigated. Results have shown that after 250 trees and

Artificial Neural Network (ANN) model
The Artificial Neural Network (ANN) approach is used to classify the choice of alternative $i$ using a three-layer, 12 inter-connected, feed-forward network, aiming to minimize the error between observed and predicted choices (e.g. Bishop et al., 1995; Olden et al., 2004; Cantarella and de Luca, 2005). The connection weights are trained using error back-propagation and a logistic activation function with a maximum of 1500 iterations. The input layer consists of 39 neurons (i.e. one for each attribute), the second (hidden) layer consists of a number of neurons to be defined (see below) and the third layer is the output layer relating to the choice. Based on grid search techniques, multiple combinations of the number of neurons in the second layer and the weight decay parameter (both are related to the degree of over-fitting) were tested (e.g. Krogh and Hertz, 1992; Smith, 2018), showing a high out-of-sample PA (2021 test dataset) with 19 neurons and a weight decay parameter of 0.05. Models are trained using the R package caret (Kuhn, 2008).

8 Examples are missing construction site and/or infrastructure attributes that may require specific regulations of trains, leading to unobserved preferences for certain choice alternatives.
9 The MIXL model was estimated on the ETH supercluster Euler using 48 cores with a total CPU time of 12,326 h; see also Table 4 for CPU time comparisons across all estimated models.
10 Note that for numerical issues, construction sites with >150 observations were assigned to a new construction site ID.
11 Note that increasing the number of trees and/or the maximal number of terminal nodes might even reduce the PA in the test dataset due to overfitting. The hyperparameter setting in the RF models used in subsequent analyses is summarized in Table A.4. After all, the differences in PA were small, supporting the consensus that RF in general is not very sensitive to the exact values of hyperparameters (Liaw and Wiener, 2002), which stands in stark contrast to other machine learning classifiers such as SVM and ANN (e.g. Deng et al., 2019).
12 Adding an additional hidden layer (i.e. four layers in total), given any specification of hyperparameters, failed with respect to model convergence.

Support Vector Machine (SVM) model
The Support Vector Machine (SVM) approach is used to classify the choice of alternative $i$ by mapping the input vector of train-, construction site-, and infrastructure attributes into a high-dimensional feature space using a non-linear kernel function, finding a hyperplane that optimally separates different classes by maximizing the margin between them (e.g. Cortes and Vapnik, 1995; Cervantes et al., 2020). As discussed in Lameski et al. (2015), when the classes are not linearly separable, the choice of the cost parameter $C$ governs how much mis-classification errors are penalized in the training dataset. Using a Gaussian radial basis function (RBF) kernel, the parameter $\gamma$ governs the flexibility of the decision boundary that separates the classes. Based on grid search techniques, multiple combinations of the hyperparameters were tested, showing high out-of-sample PA (2021 test dataset) for a $C$ parameter of 10 and a $\gamma$ parameter of 0.01. Models are trained using the R package e1071 (Meyer et al., 2019).
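For orientation, the roles of the cost and kernel parameters can be read off the standard textbook form of the soft-margin SVM for the binary case. The notation below ($w$, $b$, $\xi_m$, $y_m$, $\phi$, $M$) is generic and assumed for illustration; it is not taken from the paper.

```latex
% Soft-margin SVM (binary case): C penalizes the slack variables \xi_m,
% i.e. the mis-classification errors in the training dataset.
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{m=1}^{M} \xi_m
\quad \text{s.t.} \quad
y_m \left( w' \phi(x_m) + b \right) \ge 1 - \xi_m, \qquad \xi_m \ge 0

% Gaussian RBF kernel: a larger \gamma makes the kernel more local,
% allowing a more flexible decision boundary.
K(x, x') = \exp\left( -\gamma \, \lVert x - x' \rVert^2 \right)
```

A larger cost parameter tolerates fewer training errors at the expense of a narrower margin, while a larger kernel parameter increases the risk of over-fitting, which is why the two are tuned jointly.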
Compared to the RF approach, both the SVM and ANN models were more sensitive to the choice of tuning hyperparameters, although the range of acceptable values could be narrowed down after some preliminary trials. However, it took substantially more computing power to estimate the models, especially for the SVM, as presented in Table 4.

Variable importance (VI)
When investigating the discrete choice and machine learning approaches with respect to the interpretation of results, several important issues that are related to the ranking in VI, MPE/E and PA have to be considered. In the discrete choice model, if the explanatory variables are correlated, the parameter estimates may still be unbiased and so are the resulting marginal effects and elasticities, allowing a ceteris paribus (all else equal) interpretation. 13 The linear-additive utility function in Eq. (1) implies a separable and equal meta-level importance of each variable $x_k$ that is weighted by exactly one parameter for each except the reference alternative (i.e. $\beta_{i,k}$), no matter if $x_k$ is discrete or continuous. This rather restrictive setting is one of the main explanations for the lower PA of traditional econometric models compared to machine learning classifiers (e.g. Gevrey et al., 2003; Hagenauer and Helbich, 2017; Paredes et al., 2017). Clearly, without further adjustments, possible higher-order interactions and non-linear relationships are not captured (see e.g. Hillel et al., 2019). If they are to be captured, they have to be parametric, i.e. explicitly specified as polynomial or dummy effects, or using other non-linear transformations, which would make model specification cumbersome, especially for high-dimensional datasets such as the current one. However, the main advantage is that VI has a very clear interpretation: Ceteris paribus, attribute $k$ contributes more to the utility of alternative $i$ if $x_k$ and/or $\beta_{i,k}$ is higher (in absolute values). A simple and intuitive measure of VI in discrete choice models can be defined as the total average utility partworth (for the concept of utility partworth, see e.g. Goldberg et al., 1984; Kuhfeld, 2010), given by

$$\mathrm{VI}_k = \sum_{i} |\hat{\beta}_{i,k}| \cdot \overline{|x_k|}$$

where the absolute value of the estimated parameter $|\hat{\beta}_{i,k}|$ is multiplied with the sample mean of the absolute values of the corresponding variable, $\overline{|x_k|}$, and summed up over all alternatives.
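The total average utility partworth can be computed in a few lines. The sketch below assumes that the estimated coefficients are available per alternative and attribute; the coefficient values, the data columns and the function name `partworth_importance` are hypothetical.

```python
def partworth_importance(beta_hat, x_columns):
    """Total average utility partworth per attribute.

    beta_hat:  {alternative: [coefficient for each attribute]}
               (the reference alternative is simply left out)
    x_columns: one list of observed values per attribute
    """
    mean_abs_x = [sum(abs(v) for v in col) / len(col) for col in x_columns]
    return [
        sum(abs(coefs[k]) for coefs in beta_hat.values()) * mean_abs_x[k]
        for k in range(len(x_columns))
    ]

# Hypothetical estimates for two alternatives and two attributes.
beta_hat = {"KA": [0.5, -2.0], "U": [-1.5, 1.0]}
x_columns = [[1.0, -1.0, 2.0, 0.0], [0.1, 0.3, 0.2, 0.2]]

vi = partworth_importance(beta_hat, x_columns)
shares = [100.0 * v / sum(vi) for v in vi]  # normalized importance in %
print([round(s, 1) for s in shares])
```

Because the measure multiplies coefficients by the mean absolute attribute value, it does not depend on the scale in which an attribute is coded, which makes attributes measured in different units comparable.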
In the RF model, the VI of attribute $k$ can be defined as the mean decrease in the Gini index (MDG; e.g. Louppe et al., 2013) if the split in node $j$ occurs for attribute $k$, summed up over all split nodes and averaged over all trees, which is given by

$$\mathrm{VI}_k^{\mathrm{MDG}} = \frac{1}{B} \sum_{b=1}^{B} \sum_{j \in b:\, v(j) = k} \frac{n_{b,j}}{n} \, \Delta_{b,j}$$

where $B$ is the number of trees, $\Delta_{b,j}$ is the impurity decrease after the split in node $j$ of each tree $b$, $n_{b,j}/n$ is the proportion of samples reaching node $j$ and $v(j)$ is the attribute on which the node is split. An alternative measure is given by the mean decrease in accuracy (MDA; e.g. Liaw and Wiener, 2002), which is given by

$$\mathrm{VI}_k^{\mathrm{MDA}} = \frac{1}{B} \sum_{b=1}^{B} \left( e_{b,k} - e^{*}_{b,k} \right)$$

where for each tree $b$, $e_{b,k}$ is the error rate (share of incorrect predictions obtained by majority votes) with random permutation of $k$ and $e^{*}_{b,k}$ is the error rate without permutation of $k$ in the out-of-bag (OOB) sample. In the ANN model, the VI of attribute $k$ can be obtained by assigning the output connection weights of each neuron in the hidden layer to components related to each input feature (Gevrey et al., 2003):

$$\mathrm{VI}_k^{\mathrm{ANN}} = \sum_{h} Q_{k,h}$$

where $Q_{k,h}$ is the normalized connection weight (in absolute values) of input $k$ to neuron $h$ in the hidden layer. To make all VI measures comparable, they are normalized [in %] by dividing by $\sum_k \mathrm{VI}_k$. 14

13 Note that if correlation is present, the standard errors of parameter estimates are inflated (similar to a linear regression model; see e.g. Farrar and Glauber, 1967), leading to an increase in type II errors (i.e. falsely accepting the null hypothesis). However, this is not directly related to the actual values of attribute weights $\beta_{i,k}$.
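The building block of the Gini-based measure is the weighted impurity decrease of a single split. The following sketch shows that quantity for one node together with the normalization to percent; the function names and the toy labels are hypothetical, and a full RF implementation would additionally sum over all nodes split on the same attribute and average over trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of
    samples that reach the split node."""
    n = len(parent)
    decrease = (
        gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )
    return (n / n_total) * decrease

def normalize_percent(vi):
    """Scale importance values so that they sum to 100%."""
    total = sum(vi.values())
    return {name: 100.0 * value / total for name, value in vi.items()}

# Toy node: 10 of 20 samples reach it and the split separates U from V perfectly.
parent = ["U"] * 6 + ["V"] * 4
decrease = weighted_gini_decrease(parent, ["U"] * 6, ["V"] * 4, n_total=20)
print(round(decrease, 2))
print(normalize_percent({"attribute_a": 3.0, "attribute_b": 1.0}))
```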

Estimation results: MNL and MIXL models
Three models are presented in the Appendix in Table A.3. CONST is a Multinomial Logit model just including the alternative-specific constants (ASC) to reproduce the relative choice frequencies ("market shares") and MNL is a Multinomial Logit model including all train-, construction site-, and infrastructure attributes. 15 MIXL is a nested error component Mixed Logit model accounting for unobserved heterogeneity at the construction site level and, after several previous investigations, nesting the three cancellation alternatives KA, KM and KE. This is done by adding one additional shared random error component (on top of the alternative-specific error components) to the three cancellation alternatives, mimicking a nested Logit structure by allowing for shared unobserved correlation patterns between alternatives (e.g. Brownstone and Train, 1999; Walker et al., 2007). A likelihood ratio test strongly rejected the null (alternative-specific error components only) in favor of the nested model (one additional parameter with an increase in log-likelihood of 268 units), indicating that there is a significant correlation present among the three cancellation alternatives.
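The size of the reported log-likelihood gain can be put into perspective with a short calculation. The sketch below assumes the standard chi-square reference distribution with one degree of freedom (one additional parameter); the function name is hypothetical, and only the gain of 268 units is taken from the text.

```python
import math

def likelihood_ratio_test_1df(ll_gain):
    """LR statistic and p-value for one additional parameter.

    For one degree of freedom, the chi-square survival function
    equals erfc(sqrt(x / 2)).
    """
    lr = 2.0 * ll_gain
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value

lr, p = likelihood_ratio_test_1df(268.0)  # log-likelihood gain reported above
print(f"LR = {lr:.0f}, p-value = {p:.3g}")
```

For comparison, the critical value of the chi-square distribution with one degree of freedom at the 1% level is about 6.63, so a test statistic of 536 lies far in the rejection region.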
The increase in goodness of fit is substantial when comparing the MNL with the CONST model (increase in $\rho^2$ by 0.37 to 0.51; see also Table 2 for the improvements in PA), clearly indicating that the train-, construction site-, and infrastructure attributes have substantial explanatory power. Many of them exhibit highly significant and substantial effects, which are discussed in more detail in Sections 4.3 and 4.5 where the total partworths, MPE and E are calculated, allowing a more intuitive and easier interpretation of VI, effect size and direction. To illustrate just one example in Table A.3, Regulation_Total_Closure shows a highly significant ($p < 0.01$) and positive effect on all cancellation and the detour alternatives relative to the delay (V) alternative and the reference category Regulation_Track_Signal.
Including the random error components in the MIXL model again improves the goodness of fit substantially (increase in $\rho^2$ by 0.18) by only estimating six additional parameters, indicating that the amount of unobserved heterogeneity at the construction site level is substantial. It is also notable that most parameters of the observable attributes exhibit qualitatively (i.e. sign and relative magnitude) similar results when comparing the MNL and MIXL model, although differences are present and the effects of certain attributes may change their direction and importance in explaining behavior. To illustrate just one extreme example in Table A.3, in the MNL model Log_Days_Train_Affected exhibits a significant and negative effect on alternative KM, which in the MIXL model becomes significant and positive (similar for Infra_Bidirect_Line_Op). This shows that for a confident evaluation of parameter robustness it may be beneficial to have models with different fundamental assumptions at hand.
Since the parameter estimates may not be stable in the MIXL, as indicated by the sometimes diverging effects between the MNL and MIXL, we use different models estimated on compressed datasets (since the current MIXL model with only 100 draws was extremely cumbersome to estimate) to calculate the behavioral indicators such as MPE and E, as further discussed in Section 4.5. Nevertheless, since the current MIXL model is based on the full dataset including all available information and the full set of variables, we use it for the subsequent evaluation of PA and VI.

Prediction accuracy (PA)
The prediction accuracy (PA), i.e. the share of correctly predicted choices for the CONST, MNL, MIXL, RF, SVM and ANN model, is presented in Table 2 for the 2021 and 2022 test datasets. This is a conservative validation approach, since predicting for new planning horizons not only involves new construction sites, but also may be affected by novel corporate conditions, regulations and other (unobserved) factors. We use two different methods to calculate the PA: The economist method uses a probabilistic calculation by sampling the choices according to the alternative-specific probabilities, better replicating the relative choice frequencies (''market shares'') of each alternative, while the optimizer method 16 assumes that the alternative with the highest probability is always chosen (see discussions in Train, 2009). Although the latter assumption does not take into account the probabilistic distribution among the alternatives and misses the point of having imperfect information about the decision-making process, the optimizer method is often used in practice and therefore also reported in subsequent analyses. 29.2%; 2022: 29.3%), adding the train-, construction site-, and infrastructure attributes improves the PA by more than 24%-points. When compared to the MNL model, results indicate that accounting for the panel structure in the MIXL improves the PA by obtaining more informative parameter estimates (see also e.g. Thiene et al., 2017).
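The difference between the two PA calculations can be illustrated with a small sketch. The predicted probabilities, the observed choices and all function names below are hypothetical; only the two decision rules (sampling according to the probabilities versus always picking the most likely alternative) follow the description above.

```python
import random

def optimizer_prediction(probs):
    """Optimizer method: the alternative with the highest probability."""
    return max(probs, key=probs.get)

def economist_prediction(probs, rng):
    """Economist method: sample one alternative according to its probability."""
    alternatives = list(probs)
    weights = [probs[alt] for alt in alternatives]
    return rng.choices(alternatives, weights=weights, k=1)[0]

def prediction_accuracy(predicted, observed):
    """Share of correctly predicted choices."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

# Hypothetical predicted probabilities and observed choices (three alternatives only).
cases = [
    ({"U": 0.60, "V": 0.30, "KA": 0.10}, "U"),
    ({"U": 0.50, "V": 0.40, "KA": 0.10}, "V"),
    ({"U": 0.20, "V": 0.10, "KA": 0.70}, "KA"),
    ({"U": 0.45, "V": 0.35, "KA": 0.20}, "U"),
]
observed = [choice for _, choice in cases]

rng = random.Random(42)  # fixed seed so that the sampled predictions are reproducible
pa_optimizer = prediction_accuracy([optimizer_prediction(p) for p, _ in cases], observed)
pa_economist = prediction_accuracy([economist_prediction(p, rng) for p, _ in cases], observed)
print(pa_optimizer, pa_economist)
```

With sampling, the predicted shares of the alternatives approach the average predicted probabilities in large samples, whereas the optimizer rule tends to over-predict the dominant alternative.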
When the main goal is to achieve a high PA, results indicate that the RF model is clearly superior also when compared to the ANN and SVM models (see also e.g. Hagenauer and Helbich, 2017; Cheng et al., 2019; Zhao et al., 2020), increasing the PA by more than 4%-points when compared to the second-best ANN model. Among the machine learning classifiers, we therefore focus our attention on the RF model in subsequent analyses. Nevertheless, it should be noted that the advantage of the RF approach seems moderate when compared to the discrete choice models given its strong ability to account for non-linear relationships and higher-order interactions, which is further discussed in Section 4.4.
All models show a consistent decrease in accuracy over the two planning horizons. Given this inter-annual heterogeneity, for a practical application it is important that the models are updated as soon as new data is available to mitigate this lack of explanatory power. Merging the 2020 and 2021 datasets into one large training dataset with 71% of all observations is investigated using the RF model, [18] increasing the PA by 3.6%-points to 59.9% in the 2022 test dataset. Since this PA is still below the one for 2021 based on the 2020 training dataset (60.8%), but higher than the one for 2022 based on the 2021 dataset (58.6%), data pooling and model updating are recommended on a yearly basis. Table 3 shows the distribution of the correctly and incorrectly predicted choices for each choice alternative in the RF model. [19] This adds important insights on the alternative-specific model performance that may inform train route schedulers in their practical application and evaluation of alternatives. The elements on the diagonal are the alternative-specific prediction accuracies (which sum up to the values presented in Table 2), while the off-diagonal elements show where the model fails to predict correctly. Furthermore, the table shows which alternatives are over- and underestimated. Both test datasets show a similar pattern in all these domains.
The relative performance of U is highest (2021: 35.45/44.98 = 78.8%), while that of KA (2021: 4.25/10.25 = 41.5%) is lowest. This pattern becomes even more pronounced in 2022, where the relative performance of KA drops to 31.2%. Thus, the model has problems in predicting KA correctly, which in 2022 is mainly attributed to a wrong classification of KE instead (3.16%). A possible explanation for this low performance may be the definition of alternatives (see also Section 2), where total cancellation of the train schedule is also part of KA. The relative performance of the other three alternatives KM, KE and V is around 50% in both years.
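The relative performance quoted above is the correctly predicted share of an alternative (diagonal element of Table 3) divided by its observed share. A minimal sketch reproducing the two 2021 values given in the text (the function name is chosen for illustration):

```python
def relative_performance(correct_share, observed_share):
    """Correctly predicted share of an alternative relative to its observed share, in percent."""
    return 100.0 * correct_share / observed_share

# 2021 values as quoted in the text (both shares in percent of all observations).
u_2021 = relative_performance(35.45, 44.98)
ka_2021 = relative_performance(4.25, 10.25)

print(f"U: {u_2021:.1f}%, KA: {ka_2021:.1f}%")
```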
When comparing the observed and predicted relative choice frequencies, KA and KE are also the alternatives that are underestimated most strongly in 2021 (KA: 7.45-10.25 = -2.8%; KE: 10.95-14.14 = -3.5%), while U is overestimated in both years. It shows that this is mainly attributed to a wrong classification of U where KE (2021: 3.70%; 2022: 3.19%) and V (2021: 7.00%; 2022: 6.09%) are predicted instead. Together with U being wrongly classified when V would be correct (2021: 5.58%; 2022: 6.53%), alternatives U and V are in the cluster with the highest absolute classification error. Notably, the RF model also performs better in predicting the aggregated market shares than the MNL and MIXL models, although one of their main focuses is to reproduce them (e.g. Ben-Akiva and Lerman, 1985; McFadden, 1986). The average market share prediction error for all five alternatives is 2.7% (2021) and 2.9% (2022) in the RF model, and 3.3% (2021) and 3.6% (2022) in the MIXL model, a result similar to that found in Zhao et al. (2020). Results indicate that the train route schedulers should be careful when making their final decisions based on the model predictions. Specifically, if the model predicts e.g. a detour (U), one should keep in mind that delay (V) is the most common misclassification and that more detailed consideration of these two alternatives may be needed. One reason may be shortcomings in the data: if e.g. the capacity is reduced due to construction and one train is delayed while the other is diverted, the choice of which train receives which consequence might to some extent be arbitrary. Nevertheless, together with the results presented in Section 4.3 and later in Section 4.5, this analysis can serve as a very useful tool in choosing the ''best'' alternative in an informed way.

[17] There are two important differences observable between the economist and optimizer method: (i) In terms of PA, the economist method is more pessimistic than the optimizer method, which holds for all different models (except for the CONST model, where the optimizer method is not meaningful anyway) and (ii) according to the economist method, the MIXL always outperforms the MNL (and vice versa for the optimizer method). A possible explanation is that the MNL model assumes independence within construction sites (note that it is solely trained based on the observed attributes), while the MIXL takes the dependencies into account, therefore (for a given construction site) exhibiting more homogeneous probabilities of likely outcomes (i.e. by putting less relative weight on the observed attributes). This negatively affects the prediction of the choice according to the highest probability, which is solely based on observed attributes, while it better reproduces the alternative-specific probability distributions.

[18] While still feasible for the MNL model, the MIXL model could not be estimated with such a large number (400,652) of observations. However, as shown in Table 5, the PA even decreases in the MNL model when using a merged training dataset.

[19] Note that the distributions in the other models look very similar to those in the RF model, though exhibiting a consistently lower PA. As shown for the MIXL in the Appendix, Table A.6, the model performs particularly badly in correctly predicting KM.

Variable importance (VI)
Variable importance (VI) of each attribute is calculated according to Section 3.5 for the discrete choice models using the total utility partworth, as well as for the best performing machine learning classifier, the RF model, using both metrics, the mean decrease in node impurity (MDG) and the mean decrease in accuracy (MDA). Starting with the MNL model, Fig. 4 shows that the top nine variables already account for 83.7% of the total utility partworth, with the top five accounting for more than 66%. Given the modeling features of the different approaches, one may assume that from a behavioral perspective, the MIXL would give the most accurate results, since it accounts for unobserved construction site heterogeneity, an issue that has been shown to increase the model fit enormously. Nevertheless, the top nine VI in the MIXL are similar to the MNL model, though a bit less pronounced (the top nine variables contain 77.8% of total utility partworth). Results clearly indicate that only a few among the full set of variables are actually important in explaining the behavior of train route schedulers, and that many do not explain much; a good example of the Pareto principle.
In both the MNL and MIXL model, the most important variable is Start_Traveltime_h, which is mainly related to the very strong and negative effect on KA as shown in Table A.3 and the relatively high mean value of that attribute as shown in Table A.2. Regulation_Total_Closure is the second most important variable in both models. After that, both rankings include the same variables End_Traveltime_h, Length_Train_1000m and Effective_Capacity, but in slightly different order. The sixth most important variable is Traintype_Regional_SBahn, again in both models. Then, while in the MNL model certain train types exhibit a higher importance, construction site and infrastructure attributes play a more important role in the MIXL model. Importantly, however, note that all train types taken together (since train type is a categorical variable and was split into dummy variables for estimation) would, for all models and metrics, exhibit a relatively high importance (14.5%, 11.9%, 7.3% and 14.4% for the respective VI measures; the complete list of VI for all measures and variables is shown in the Appendix in Table A.5). The two VI metrics in the RF case, although based on exactly the same model, show different rankings and relative importance values, but the same five top attributes as in the discrete choice models again appear in the top nine. VI according to the MDG is most comparable to the choice models, with the top nine attributes accounting for 87.1% of total VI. Notably, it shows a slightly lower gradient, thus VI is more dispersed among the top nine. Length_Train_1000m now is the most important variable (24.3%), followed by Start_Traveltime_h and End_Traveltime_h. The top nine include different attributes than in the choice models, with Edge_Betweenness now becoming the fifth, Detours the sixth, Mass_Freighttrain_1000t the seventh and Construction_Cont_1000h the eighth most important variable.
The MDA finally shows a very distinct pattern in the sense that the top nine variables only account for 58.2% of total VI; thus, variables classified as weak by the MDG as well as the discrete choice models exhibit a higher and more uniformly distributed importance. Also, Length_Train_1000m, which is the most important attribute according to the MDG, now is on rank 7. [20] As discussed in Section 3.5, there may be a bias in VI in the RF model in favor of continuous variables for both the MDG and MDA metric, indicated by the occurrence of Edge_Betweenness, Construction_Cont_1000h and others in the top nine variables, which in the discrete choice models (assuming linear relationships) rank very low (e.g. in the MIXL: Edge_Betweenness on rank 18; Construction_Cont_1000h on rank 24; see Appendix, Table A.5).
Finally, a correlation analysis of the different VI measures according to Table A.5 shows that [...] and [...] are very strongly related (+0.96), while [...] and [...] are not so much (+0.72). Furthermore, the correlation between [...] and [...] is also rather low (+0.68), while for [...] and [...] it is only slightly higher (+0.74). [21] Results indicate that a critical investigation of different VI measures is necessary to get a grounded idea of actual VI. While the discrete choice models use intuitive though rather simplistic assumptions, the RF model may capture non-linearities and higher-order interactions in a way the former cannot match. The big question remains how the train route schedulers actually process the available information into a choice. Discussions with experts from DB revealed that e.g. certain train types, especially regional (fewer possibilities for detours, higher chance of cancellation in the middle of the itinerary of the train service) and freight (detours are possible more easily) trains, are very important attributes (clearly speaking in favor of [...]), while the train length, although important for the actual technical feasibility of an alternative, may also be seen as a proxy of the train type. Since train type is correlated with train length (see Fig. 3), train length absorbs the effects of train type in the RF model, making train type less important than in the choice models. This suggests that, from a behavioral perspective, the choice models obtain the most plausible results.

[20] For the sake of completeness, VI is also reported for the ANN model using [...], as shown in the Appendix, Fig. A.1 and Table A.5. It should be noted that the VI pattern and ranking differ remarkably between the RF and discrete choice models. It is most comparable to the MDA given its uniformly distributed VI and exhibits four top features that are also included in the MNL, MIXL and RF models. Interestingly, Regulation_Total_Closure is not part of the top nine features, which is rather questionable from a behavioral point of view.

[21] [...] exhibits the lowest correlations among all methods, which is highest with [...] (+0.66).
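The two summary statistics used in this comparison, the cumulative share of the top-ranked variables and the correlation between two VI measures, can be sketched as follows. The VI values below are hypothetical placeholders and not those of Table A.5; function and variable names are chosen for illustration.

```python
from math import sqrt

def top_k_share(vi, k):
    """Share of total VI accounted for by the k most important variables."""
    ranked = sorted(vi.values(), reverse=True)
    return sum(ranked[:k]) / sum(ranked)

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical VI values (in percent) for four attributes under two measures.
vi_partworth = {
    "Start_Traveltime_h": 30.0,
    "Regulation_Total_Closure": 20.0,
    "Length_Train_1000m": 10.0,
    "Edge_Betweenness": 2.0,
}
vi_mdg = {
    "Start_Traveltime_h": 12.0,
    "Regulation_Total_Closure": 10.0,
    "Length_Train_1000m": 24.0,
    "Edge_Betweenness": 9.0,
}

names = sorted(vi_partworth)
share_top2 = top_k_share(vi_partworth, 2)
corr = pearson([vi_partworth[n] for n in names], [vi_mdg[n] for n in names])
```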

Data and model simplification strategy
We have shown in the previous sections that the model fit of the MIXL increases substantially and that accounting for unobserved heterogeneity has an influence on parameter estimates and PA. This model includes all information available in the data, which makes it extremely cumbersome to estimate, even though it has a relatively simple structure with linear-additive effects. While it serves as a benchmark model for VI and PA comparisons with the MNL and machine learning models, one should be aware that, due to the low number of draws, parameter estimates may not be stable enough to obtain meaningful behavioral indicators such as MPE and E. We therefore first obtain a subset of the original dataset, based on which the models are simplified and restructured. This allows us to improve the model specification (i.e. accounting for non-linear relationships of continuous attributes) with a sufficiently high number of random draws while keeping the computational costs manageable.
Based on the results of the initial MNL model shown in the Appendix, Table A.3, we first exclude those attributes that exhibit no significant (p < 0.1) effects on any alternative, which includes Doubletrack, Regulation_Track_Command and Trackstandard_P_160. Based on the correlation analysis shown in Fig. 3, we also exclude Mass_Freighttrain_1000t to avoid multicollinearity issues. Furthermore, based on additional investigations to simplify the model structure and the alternative-specific probability plots from the RF model as shown in the Appendix, Fig. A.2, Start_Traveltime_h, End_Traveltime_h and Detours are discretized into three categories each: Start_Traveltime_h and End_Traveltime_h are recoded into levels < 1 h (reference category), 1 to < 4 h and ≥ 4 h; Detours is recoded into levels 0 (reference category), 1 and ≥ 2 detours.
A k-means cluster analysis (e.g. Wu, 2009) is applied to this simplified dataset to reduce the number of observations. Notably, the above described discretization of continuous variables and the exclusion of four attributes further help to get more homogeneous observations within the clusters. Based on the original 2020 training dataset with 151,901 observations, we created two new training datasets with 30,000 (compression ≈ 5×) and 15,000 (compression ≈ 10×) observations, respectively, by randomly drawing one observation from each cluster and re-weighting it in the subsequent choice models proportional to the number of observations in this cluster. [22] In contrast to a related procedure described in van Cranenburgh and Bliemer (2019), our method uses zero prior information of the parameter values by maximizing the information content in the compressed datasets. Table A.7 in the Appendix presents the results of five choice models. RFULL is a Multinomial Logit model using the simplified, yet uncompressed dataset (151,901 choice observations), and excludes all insignificant (p > 0.1) [23] parameters; RCOMP is a Multinomial Logit model using the compressed dataset (compression ≈ 10× in all subsequent models) with 15,000 choice observations; [24] RSIMP again excludes all effects that became insignificant due to the data compression (p > 0.1); RBOX uses Box-Cox transformations [25] of continuous variables (e.g. Spitzer, 1982) to capture potential non-linear effects; and RMIXL adds the same random components as in the initial Mixed Logit model, but now using 2000 draws. Table 4 provides an overview of the computation time (CPU hours) for all estimated models. As mentioned above, using the original 2020 training dataset with 151,901 observations, 166 parameters and random error components was at the limit of computational feasibility, which made data and model simplification necessary to improve the model specification.
Computation time of the RMIXL model could be substantially reduced by a factor of 35, while using 20 times more draws and accounting for non-linearities. Computation time was not an issue in the remaining models. Especially the RF models performed extremely well, exhibiting CPU hours well below one, while the ANN and SVM models required substantially more (up to factor 18 and 28, respectively) computation time than the RF model.

[22] In both datasets, the distribution of weights is very similar and highly right-skewed. 15,000 observations training dataset: Min. [...]

[...] (2019)). Given our model simplification strategy, this subsequently leads to a more parsimonious model specification in the RSIMP model.

[25] We have tried different specifications such as polynomial effects up to degree three, but the gains in the log-likelihood were in a similar range as in the Box-Cox specification. However, solid statistical inference would have been impossible due to the extremely high correlations among variables. In any case, developing competing functional forms as the RF model naturally provides is, from a practical point of view, not feasible.
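The compression step described above (cluster the observations, draw one observation per cluster at random, weight it by the cluster size) can be sketched as follows. This is a minimal illustration with a plain Lloyd-type k-means on hypothetical two-dimensional data; the clustering variables, the initialization and the software actually used are not specified in this section and are assumptions here.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random observations as initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster runs empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def compress(points, k, seed=0):
    """One randomly drawn observation per cluster, weighted by cluster size."""
    labels = kmeans_labels(points, k, seed=seed)
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    rng = random.Random(seed)
    return [(rng.choice(members), len(members)) for members in clusters.values()]

# Hypothetical data: 200 observations with two attributes, compressed about 10x.
data_rng = random.Random(42)
points = [(data_rng.random(), data_rng.random()) for _ in range(200)]
compressed = compress(points, k=20)
```

Each element of `compressed` is a pair (index of the drawn observation, weight); the weights sum to the original number of observations, so a weighted log-likelihood keeps the scale of the full dataset.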
The applied data compression method proves to be a valid approach to scale down the size of the dataset, producing, in most cases, very similar coefficient signs and magnitudes, as shown in Table A.7. When comparing the RFULL and RCOMP model, only six out of 103 coefficients are substantially different (the intervals $\hat{\beta} \pm 2$ SE do not overlap). When comparing the RCOMP and the parsimonious RSIMP model (24 parameters less), the latter still exhibits an AICc (Akaike Information Criterion corrected for finite sample size; the smaller the value, the better the model fit; see e.g. Wagenmakers and Farrell (2004)) that is lower by about 10 units. Adding the Box-Cox transformations in the RBOX model only marginally improves the model fit: Although the AICc decreases by 68 units (10 additional parameters), the relative improvement is not substantial. Only two effects significantly deviate from a linear relationship (i.e. where $\hat{\lambda} \neq 1$; $p < 0.05$): The effect of Effective_Capacity on the utility of KM is positive logarithmic ($\hat{\lambda} = -0.07$), while the effect of Edge_Betweenness on the utility of U is negative reciprocal ($\hat{\lambda} = -0.4$). Finally, the model fit of the RMIXL increases substantially, again underpinning the substantial amount of unobserved heterogeneity present (AICc decreases by more than 2700 units). [26] All coefficients in the RBOX and RMIXL model are consistent (same sign) and in most cases exhibit the same relative magnitudes and significance levels. Importantly, with 2000 draws the coefficients of the RMIXL model are stable, serving as a solid basis to calculate the MPE and E.
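For reference, the two ingredients of this comparison can be sketched in a few lines. The Box-Cox form below is the standard one-parameter transformation, under which an exponent of one is linear (up to a shift) and an exponent approaching zero is logarithmic; whether the models use exactly this variant is an assumption. The AICc follows the usual finite-sample correction of the AIC. Function names are chosen for illustration.

```python
from math import log

def box_cox(x, lam):
    """Standard one-parameter Box-Cox transformation for x > 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return log(x)  # limiting case for an exponent of zero
    return (x ** lam - 1.0) / lam

def aicc(log_likelihood, n_params, n_obs):
    """Akaike Information Criterion with finite-sample correction."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

linear_case = box_cox(2.0, 1.0)       # equals x - 1
near_log_case = box_cox(2.0, -0.07)   # close to log(2) for an exponent near zero
example_aicc = aicc(log_likelihood=-100.0, n_params=2, n_obs=10)
```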
Models are tested for the 2021 and 2022 datasets, confirming the validity of the data and model simplification strategy by showing only minor PA differences (see Table 5). Nevertheless, it indicates that none of the choice models are able to outperform the initial MIXL model (see Table 2; focusing on the economist method; 2.2%-points for 2021 and 0.5%-points for 2022 higher accuracy than in the RMIXL model). While adding the non-linear effects (RBOX) does not add substantial explanatory power, adding the random components increases the PA by about 2.7%-points in both test datasets, highlighting the benefit of properly accounting for unobserved heterogeneity. Furthermore, combining the 2020 and 2021 datasets into one large training dataset does not improve the PA for the 2022 test dataset in the RFULL model (it even decreases by 0.7%-points), based on which we decided to solely use the 2020 dataset for estimation.
Finally, the RRF model (RF model using the same training dataset as the RFULL model) is still better in terms of PA than the RMIXL model by 3.4%-points (2021) and 3%-points (2022), but this difference is relatively small given the high flexibility of the RF model. Also, when using the initial dataset and model specifications, the PA of the RF model never outperformed the corresponding choice models by more than 6.1%-points (see Table 2). While some recent studies (e.g. Hagenauer and Helbich, 2017; Cheng et al., 2019; Zhao et al., 2020) have found a considerably higher PA of RF compared to MNL/MIXL models of more than 20%-points, this is not the case in the current application. After all, this makes us confident that the discrete choice models are well-specified and that a considerable amount of the train route schedulers' choice behavior is construction site specific and unobserved, which the RF model cannot capture either.

Marginal probability effects (MPE) and elasticities (E)
The marginal probability effects (MPE; discrete attributes) and arc-elasticities (E; continuous attributes) (e.g. Winkelmann and Boes, 2006) are presented in Table 6 for the RSIMP, RBOX and RMIXL model (see also Appendix, Table A [...]), where $P$ is the average (simulated in case of the RMIXL) alternative-specific predicted probability before the change and $P^{*}$ the one after the change, conditional on the estimated parameters in $\hat{\beta}$. One main advantage is that both indicators have a clear interpretation (i.e. they are not interpreted relative to the reference alternative as the estimated parameters in the discrete choice models are) and thus can be compared between different model types. For the RF model, however, no MPE and E are presented. As also discussed in Zhao et al. (2020), previous investigations have shown that the values are completely misleading. For example, in the case of train types (discrete) and train length (continuous), the latter would essentially absorb most of the effects of the former due to the correlation structure and its higher value range.
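To make the two indicators concrete, the following sketch computes them for a small multinomial logit model. It assumes that the MPE is the change in the average predicted probability (in %-points) when a dummy attribute is switched from 0 to 1 for all observations, and that the elasticity is the percentage change in the average predicted probability for a 1% increase of a continuous attribute. The exact definitions used for Table 6 are not recoverable from this section, and all coefficients, attribute names and data below are hypothetical.

```python
from math import exp

def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities from alternative-specific utilities."""
    top = max(utilities.values())
    weights = {alt: exp(v - top) for alt, v in utilities.items()}
    total = sum(weights.values())
    return {alt: w / total for alt, w in weights.items()}

def average_probabilities(rows, constants, betas):
    """Average predicted probability per alternative over all observations."""
    sums = {alt: 0.0 for alt in constants}
    for row in rows:
        utilities = {
            alt: constants[alt]
            + sum(betas[alt].get(name, 0.0) * value for name, value in row.items())
            for alt in constants
        }
        for alt, p in mnl_probabilities(utilities).items():
            sums[alt] += p
    return {alt: s / len(rows) for alt, s in sums.items()}

def marginal_probability_effect(rows, constants, betas, attribute):
    """Change in average probability (%-points) when a dummy switches from 0 to 1."""
    before = average_probabilities([{**r, attribute: 0.0} for r in rows], constants, betas)
    after = average_probabilities([{**r, attribute: 1.0} for r in rows], constants, betas)
    return {alt: 100.0 * (after[alt] - before[alt]) for alt in constants}

def elasticity(rows, constants, betas, attribute, pct_change=1.0):
    """Percentage change in average probability per 1% increase of the attribute."""
    factor = 1.0 + pct_change / 100.0
    before = average_probabilities(rows, constants, betas)
    changed = [{**r, attribute: r[attribute] * factor} for r in rows]
    after = average_probabilities(changed, constants, betas)
    return {
        alt: 100.0 * (after[alt] - before[alt]) / before[alt] / pct_change
        for alt in constants
    }

# Hypothetical two-alternative model (U vs. V) and three observations.
constants = {"U": 0.0, "V": 0.5}
betas = {"U": {"closure": 1.5, "length": 0.8}, "V": {"closure": -1.0}}
rows = [
    {"closure": 0.0, "length": 0.3},
    {"closure": 1.0, "length": 0.6},
    {"closure": 0.0, "length": 1.2},
]

mpe_closure = marginal_probability_effect(rows, constants, betas, "closure")
e_length = elasticity(rows, constants, betas, "length")
```

Because the probabilities sum to one, the MPE values sum to zero across alternatives, which is a useful sanity check when reading a table of such effects.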
The focus now lies on the most important attributes as presented in Section 4.3 for the initial MNL and MIXL model. First of all, one should note that variables with a high VI do not necessarily exhibit higher MPE and E (and vice versa). VI also depends on the actual distribution of $x$ in the data, while the MPE/E only depend on $\hat{\beta}$ (and, of course, the difference between $x$ and $x^{*}$). This is why a variable such as e.g. Infra_Bidirect_Line_Op (55.5% of observations; see Table A.2) is much more important than Traintype_Regional_Train (3.0% of observations; see Table A.2) according to all four VI measures (see Appendix, Table A.5), while the latter variable exhibits much higher MPE. [27] The MPE and E in Table 6 are in most cases consistent between the different models (same signs and similar magnitudes). In case of the RMIXL model, we now are confident that the results are stable, allowing us to better focus on the actual differences/improvements compared to the simpler models without unobserved heterogeneity. Importantly, we cannot see an overall pattern in which direction (decreasing vs. increasing MPE/E) the indicators in the RMIXL change compared to the simpler models. Also note that in the RSIMP and RBOX model, indicators are, in all cases, very similar. Accounting for non-linearities, which was done in a pragmatic way and is far less sophisticated than the non-parametric mechanism in the RF model, did not affect results substantially. Assuming that the RMIXL model provides the behaviorally most plausible results, we take it as the benchmark in the following discussion.
Start_Traveltime_h shows the expected negative effect on alternative KA, indicating that for values between one and four hours the effect peak among the two discretized variables is already reached (5.16%-points decrease in the probability of KA; similar as for values greater than four hours). Intuitively, the higher the travel time from the departure station to the construction site, the lower is the probability of a train cancellation at the beginning of the itinerary of the train service. However, the second category shows stronger positive effects on KM, while the third category exhibits stronger positive effects on U and negative effects on V. In both categories, the RMIXL shows slightly smaller values than the simpler models.
A similar, though reversed pattern is found for End_Traveltime_h on KE, decreasing the probability by 7.94%-points, while increasing the probability of KA. The effects on U and V follow the same pattern as for Start_Traveltime_h, while for the second category, the effect on KM is essentially zero.
If the route is closed (Regulation_Total_Closure), as expected V (i.e. the train is passing through the construction site with a delay/ahead of time) becomes very unlikely (−36.25%-points), while the probabilities of all other alternatives increase (strongest for U; +26.48%-points). The effects of the three models exhibit the same signs and magnitudes.
If Length_Train_1000m increases by 1%, the probabilities of U and V increase by 0.09% and 0.07%, respectively, while the probabilities of KA, KM and KE decrease by 0.08%, 0.34% and 0.72%, respectively. There is also evidence that accounting for unobserved heterogeneity results in a visibly larger value for KE (0.17%-points higher E than in the simpler models). As discussed in Section 4.3, Length_Train_1000m may be a proxy for train type, where longer trains typically are freight or intercity express trains that either get canceled at the beginning of the itinerary of the train service, or can be redirected more easily or with a higher priority than regional passenger trains (for which finding a suitable route is often impossible). The effects can also be explained by infrastructure considerations, where the size is important: Canceling a long train at the beginning may allow placing it in its home storage depot, while in the middle or at the end of the itinerary of the train service, it may be more of an organizational issue.
If Effective_Capacity increases by 1%, the probability of V decreases by 0.24%, while the probabilities of KA and U increase by 0.29% and 0.06%, respectively. Discussions with experts from DB indicate that at first glance these results are technically counterintuitive (the probability of V decreases, while for KA and KE it increases), though they can be explained by spurious correlations with the construction site complexity. A high capacity comes along with a denser and more complex route network, for which the type of work often requires more severe train service restrictions than in construction sites with low capacity, reducing the possibility that a train can pass through. Notably, this somewhat unexpected effect decreases substantially in the RMIXL model, by almost a factor of two. After all, affected trains then tend to be canceled at the beginning of the itinerary of the train service.
Of special interest are the effects of train types where e.g. Traintype_Regional_SBahn is the sixth most important variable in the initial MNL and MIXL model. In the RMIXL, the probability of U decreases by 25.95%-points, while mainly the probability of KM and V increase by 19.25%-points and 6.58%-points, respectively. All four regional passenger train types show this strong and consistent pattern with varying degrees. This is very plausible, since due to geographical constraints it is often very difficult to redirect such trains as discussed above, while the chance of getting canceled especially in the middle of the itinerary of the train service increases. Also, they rather pass through with a delay (V) than are redirected (U). Especially for Traintype_Regional_SBahn, the RMIXL provides substantially different values than the simpler models, where the negative effect on U and the positive effect on V both amplify (≈ 4%-points absolute increase for both alternatives in the RMIXL). Freight trains, on the other hand, show an opposite pattern where mainly the probability of KA decreases, while the probability of U increases. The main explanation is that those trains typically travel much longer distances facilitating a detour and are much more flexible in the time table (freight trains do not carry passengers and often travel during night time).
Finally, Shiftwork exhibits a positive effect on V (+6.82%-points), while the probabilities of all three cancellation alternatives and U decrease, showing that, as expected, it allows a more uninterrupted operation of rail traffic compared to regular construction work. Again, the indicators are slightly amplified in the RMIXL model, underpinning the usefulness of properly accounting for unobserved construction site heterogeneity.

[27] Note that the infrastructure for bi-direct line operation (Infra_Bidirect_Line_Op) and Doubletrack attributes were removed in the final model specifications (and therefore are not included in Table 6), since they did not exhibit any significant effect in the initial phase of model development (see Section 4.4).

Conclusions
For analyzing the choices of train schedulers in construction sites, discrete choice (MNL and MIXL) and Random Forest (RF), Artificial Neural Network (ANN) and Support Vector Machine (SVM) models are trained and tested on large datasets that involve prospective observations for five alternatives: Cancellation of the train schedule at the beginning (KA), in the middle (KM) and at the end (KE) of the itinerary of the train service, detour (U) and delay/ahead of time (V). The main goal is to improve the workflow of the train route schedulers by providing likely outcomes based on past observations for these five different construction scheduling alternatives.
In terms of prediction accuracy, results indicate that the RF (2021: 60.8%; 2022: 58.6%) is superior to the MIXL model by about 6%-points, indicating that the effective gain of RF is not as large as it is in other recent studies, while the MNL model performs worst. The other two machine learning approaches tested, ANN and SVM, performed notably worse than the RF approach, [28] but still better than the discrete choice models. While ANN and SVM may shine when it comes to pattern recognition, image processing, speech or signal decryption (e.g. Wang, 2003; Ma and Guo, 2014), their benefits are limited in the current application, also in terms of computation time. After all, we decided to focus on the RF approach, since it has been shown to be a very efficient machine learning classifier for our large and high-dimensional datasets, and on the discrete choice approach, since it is the most promising candidate for behavioral investigations. [29] An important conclusion is that none of the models is able to predict with a considerably high accuracy, questioning the benefits of our analyses for a blind incorporation in the envisaged optimization framework. A PA of about 60% is not exciting, but certainly better than the PA of a model that just reproduces the relative choice frequencies (29%). In a practical application, the optimization framework could still benefit from e.g. ranked predictions according to the difference of the highest (= predicted choice; if the optimizer method was used) and second highest probability, informing the train route schedulers about the uncertainty of a specific forecast. Although results indicate that there are improvements possible regarding the data quality, the general trends and importance measures of attributes are assuredly of high practical value.
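The ranking idea mentioned above, ordering forecasts by the gap between the highest and the second highest predicted probability, can be sketched as follows. The train identifiers and probabilities are hypothetical, and the function names are chosen for illustration.

```python
def prediction_margin(probs):
    """Predicted alternative and its margin over the second most likely one."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (best_alt, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_alt, best_p - second_p

def rank_by_uncertainty(predictions):
    """Forecasts sorted from most uncertain (smallest margin) to most certain."""
    rows = [(train_id, *prediction_margin(probs)) for train_id, probs in predictions.items()]
    return sorted(rows, key=lambda row: row[2])

# Hypothetical model output for two trains.
predictions = {
    "train_A": {"KA": 0.05, "KM": 0.05, "KE": 0.05, "U": 0.45, "V": 0.40},
    "train_B": {"KA": 0.02, "KM": 0.03, "KE": 0.05, "U": 0.80, "V": 0.10},
}

ranking = rank_by_uncertainty(predictions)
for train_id, predicted, margin in ranking:
    print(f"{train_id}: predicted {predicted}, margin {margin:.2f}")
```

In this example both trains are predicted as U, but the forecast for train_A is listed first because V is almost as likely, which is the situation where a scheduler may want to take a closer look.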
Also, we want to emphasize that our validation approach is twofold conservative, since (i) we do not just predict on new construction sites, but (ii) also on completely new planning horizons -given the basic requirement of creating the construction schedule on an annual basis. Not surprisingly, many studies have reported high PA for much more homogeneous samples, and often did not strictly separate observations from the same ID for model validation in panel datasets (e.g. Zhao et al., 2020). In the current application, both issues are highly relevant for an objective model assessment. Finally, our conservative validation approach may lead to a deterioration of the actual machine learning benefits, being not substantially better in forecasting than the discrete choice models.
Among several train-, construction site-, and infrastructure attributes, different metrics are used to evaluate the importance of a variable. While for the discrete choice models, the concept of total utility partworth is applied, the RF uses the standard metrics such as the mean decrease in node impurity and accuracy of a given variable. The top nine attributes correspond to about 80% of variable importance and five of them are listed in common. Those include the travel time from the departure station to the construction site, total or line closure, travel time from the construction site to the terminus, length of the train and effective line capacity.
There is a substantial amount of unobserved construction site heterogeneity present, which the RF model cannot capture either. Also, when calculating the marginal probability effects and elasticities, properly accounting for heterogeneity repeatedly demonstrates its usefulness in obtaining more accurate behavioral indicators. Whether the improvements in prediction accuracy and behavioral insights are in balance with the substantial increase in computational costs depends on the specific application and the available computing infrastructure; this trade-off may become increasingly relevant in future research with very large and high-dimensional datasets.
Although it is a weakness for interpretation purposes, the ability to capture non-linear functional relationships and higher-order interactions can be seen as the main advantage of the RF model when it comes to predictive performance. Overall, the results suggest that both the traditional econometric and the machine learning approach should be considered when trying to make informed decisions. While the former is shown to be superior from a behavioral point of view, the latter has clear advantages when it comes to forecasting. Both approaches should be seen as complementary tools, and applying methods with different fundamental assumptions ultimately provides a more complete picture of the underlying problem. In the current case, we propose to use the RF approach as an input tool for the optimization framework, since it better reproduces the actual choices, while the results of the discrete choice models should be considered when making policy-relevant management decisions.
A remaining issue to discuss is which metric, the optimizer or the economist method, should be applied in the optimization framework. Since the model provides alternative-specific probabilities, it would in any case be advisable to make use of all available information of a specific prediction (e.g. to assess its uncertainty), even if the main interest was just in the most likely outcome (optimizer method). The advantage of the economist method would come into play if the optimization framework considered all alternatives as potential candidates according to the alternative-specific probability distribution. For example, the optimization framework could simulate the choices multiple times and re-iterate the feedback process (see Fig. 1), potentially achieving an improved outcome in the overall train route schedule. This may also better account for interactions between train route decisions (e.g. when the capacity in a construction site has been reached, the remaining options could be evaluated again using the corresponding probability distributions). Finally, which metric should be put into operation may crucially depend on other factors that are hard to evaluate a priori. For example, as discussed with the experts from DB, apart from the ease of practical implementation (speaking in favor of the optimizer method), the computing resources available to the optimization framework may become a crucial factor as well.

28 The moderate performance of the ANN and SVM relative to the RF model may be related to an improved ability of the latter to capture non-linear relationships and higher-order interactions.

29 Certainly, we do not claim that no other method exists that may outperform our results, either in terms of PA or any other metric, though we have tried to justify our modeling decisions using relevant literature and preliminary investigations.

Variable: Description
[...]: Freight train between terminals and/or loading streets
Traintype_Regional_Rail: Regional rail passenger train
Traintype_Regional_Express: Regional express passenger train
Traintype_Regional_Train: Regional passenger train
Traintype_Regional_SBahn: Urban express passenger train
Traintype_Diverse: Diverse train types with share <2%
[...]: Mass of freight trains [1000 t]
Length_Train_1000 m: Total length of the train [km]
Train_WE_Only: Train is affected by construction site only at weekends
Log_Days_Train_Affected: Log of # days where train is affected by the construction site
LeavesOrEntersGermany: Itinerary of the train service starts or ends abroad (i.e. not in Germany)
Start_NA: Missing value or unknown start yard of the itinerary of the train service (reference)
Start_Freightyard: Start of the itinerary of the train service is a freight yard
Start_Marshallingyard: Start of the itinerary of the train service is a marshaling yard
Start_Junction: Start of the itinerary of the train service is a (sidestep) junction
Start_Traveltime_h: Travel time from the departure station to the construction site [h]
End_NA: Missing value or unknown end yard of the itinerary of the train service (reference)
End_Freightyard: End of the itinerary of the train service is a freight yard
End_Marshallingyard: End of the itinerary of the train service is a marshaling yard
End_Junction: End of the itinerary of the train service is a (sidestep) junction
[...]: Track standard M 230 (mixed traffic; = 230 km/h)
Trackstandard_G_50_120: Track standard G 50 or G 120 (freight traffic; = 50 or 120 km/h)
Trackstandard_P_160: Track standard P 160 (passenger traffic; = 160 km/h)
Trackstandard_P_230_300: Track standard P 230 or 300 (passenger traffic; = 230 or 300 km/h)
Trackstandard_R_80: Track standard R 80 (regional passenger traffic; = 80 km/h)
Trackstandard_R_120: Track standard R 120 (regional passenger traffic; = 120 km/h)

[...] the coming years. Also, taking into account spatial information of the construction sites in the discrete choice models (e.g. McMillen, 1992; Smirnov, 2010; Bhat et al., 2016) could improve the behavioral insights and prediction accuracy, since currently the location of construction sites is only incompletely represented by the number of detours and edge betweenness. An important methodological improvement could also be achieved if the data incorporated the fact that scheduling decisions depend on the choices made for other trains (e.g. in the case where one out of two trains needs to be canceled and the other one can be redirected). 30 Finally, a dynamic discrete choice model that continuously updates the parameter estimates (e.g. using Bayesian methods) would be an interesting methodological advance for short-term/real-time forecasting, such that the model would account for construction site specific, unobserved heterogeneity as soon as the first observations are available for a new construction site (e.g. Revelt and Train, 2000).
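The contrast between the optimizer and the economist method discussed in this section can be made concrete with a small sketch. It reflects one possible reading of the text, not the implementation used with DB: the probabilities are invented, and `optimizer_choice`, `economist_draws` and `renormalize` are hypothetical helpers. The optimizer method keeps only the most likely alternative, whereas the economist method treats every alternative as a candidate in proportion to its predicted probability, here by simulating repeated draws.

```python
import random
from collections import Counter


def optimizer_choice(probs):
    """Optimizer method: keep only the most likely alternative."""
    return max(probs, key=probs.get)


def economist_draws(probs, n_draws, seed=0):
    """Economist method: simulate choices according to the full distribution."""
    rng = random.Random(seed)
    alternatives = list(probs)
    weights = [probs[alt] for alt in alternatives]
    return Counter(rng.choices(alternatives, weights=weights, k=n_draws))


def renormalize(probs, unavailable):
    """Re-evaluate the remaining options once some alternatives are ruled out."""
    remaining = {alt: p for alt, p in probs.items() if alt not in unavailable}
    total = sum(remaining.values())
    return {alt: p / total for alt, p in remaining.items()}


# Invented prediction for one train at one construction site.
probs = {"KA": 0.05, "KM": 0.10, "KE": 0.05, "U": 0.45, "V": 0.35}

best = optimizer_choice(probs)                 # single most likely outcome
draws = economist_draws(probs, n_draws=1000)   # simulated choice frequencies
fallback = renormalize(probs, unavailable={"U"})  # assumes U is no longer feasible
```

Repeated simulation multiplies the number of schedules the optimization framework has to evaluate, which illustrates why the available computing resources may become a deciding factor between the two methods.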

Data availability
The data that has been used is confidential.

Appendix
See Figs. A.1, A.2 and Tables A.1-A.7.

30 Note that we consider the interactions between scheduling decisions for different trains and the construction site heterogeneity to be different matters. The interaction for train scheduling decisions is primarily a matter of capacity (e.g. if there is insufficient capacity for two trains to pass the construction site, train A can only use the track if train B is canceled/redirected and vice versa) or specific scheduling preferences (e.g. connecting certain freight yards at least once a day). The construction site heterogeneity is caused by actual differences such as the type of construction or geological characteristics. We believe that the model captures both effects to the best possible extent given the data and that the construction site characteristics are exogenous to the possible interactions between scheduling decisions.

**Cluster-robust (by construction site ID) standard error: < 0.05. ***Cluster-robust (by construction site ID) standard error: < 0.01.