Distribution-Free Location-Scale Regression

Abstract We introduce a generalized additive model for location, scale, and shape (GAMLSS) next of kin aiming at distribution-free and parsimonious regression modeling for arbitrary outcomes. We replace the strict parametric distribution formulating such a model by a transformation function, which in turn is estimated from data. Doing so not only makes the model distribution-free but also allows to limit the number of linear or smooth model terms to a pair of location-scale predictor functions. We derive the likelihood for continuous, discrete, and randomly censored observations, along with corresponding score functions. A plethora of existing algorithms is leveraged for model estimation, including constrained maximum-likelihood, the original GAMLSS algorithm, and transformation trees. Parameter interpretability in the resulting models is closely connected to model selection. We propose the application of a novel best subset selection procedure to achieve especially simple ways of interpretation. All techniques are motivated and illustrated by a collection of applications from different domains, including crossing and partial proportional hazards, complex count regression, nonlinear ordinal regression, and growth curves. All analyses are reproducible with the help of the tram add-on package to the R system for statistical computing and graphics.


Introduction
Location-scale regression has its roots in two-sample comparisons, where one extends the location model for some distribution function under treatment F (y − µ) by adding a scale parameter σ to the location shift µ, that is F ((y − µ)/σ), in comparison to the distribution function F (y) under no treatment.One of the earliest contributions is Lepage's test (Lepage 1971), which is essentially a combination of the Wilcoxon and Ansary-Bradley statistics.Generalized additive models for location, scale, and shape (GAMLSS, Rigby and Stasinopoulos 2005;Stasinopoulos and Rigby 2007) can be motivated as a generalization of the two-sample location-scale model to the regression setup, i.e., with covariate-dependent location and scale parameters, µ(x) and σ(x), and also potentially other parameters ν(x) and τ (x) describing skewness and kurtosis.Thus, GAMLSS allow explanatory variables to affect multiple moments of a variety of parametric distributions and can be understood as an early forerunner of "distributional" regression models (Kneib et al. 2023).
For a continuous response variable Y with explanatory variables X = x, GAMLSS are characterized by a parametric distribution D with typically no more than four parameters µ(x) for location, σ(x) for scale, ν(x) for skewness, and τ (x) for kurtosis.For the simplest case assuming a normal distribution D = N(µ(x), σ(x) 2 ) for the conditional response of Y ∈ R, the model can be written in terms of the conditional mean µ(x) and standard deviation σ(x) as with Φ being the standard normal cumulative distribution function.Without relying on such a prior assumption of a parametric distribution D, Tosteson and Begg (1988, Equation 1) introduced a distribution-free location-scale ordinal regression model in the context of receiver operating characteristic (ROC) analysis for ordinal responses Y ∈ {y 1 < y 2 < • • • < y K } formulated as a conditional cumulative distribution function with intercept thresholds ϑ k depending on the kth response category, parameters ϑ k ≤ ϑ k+1 being monotonically non-decreasing.The model features two model terms, µ(x) and σ(x), and is defined by a cumulative distribution function F .Tosteson and Begg (1988) discuss normal (F = Φ) and logit models (F = logit −1 ) in more detail.The latter corresponds to the "non-linear" odds model discussed in McCullagh (1980, Section 6.1), which was later extended in terms of "partial proportional odds models" (Peterson and Harrell 1990).A very attractive feature of such models is their distribution-free nature and easily comprehensible covariate-dependence through location-scale parameters µ(x) and σ(x).However, they lack the broad applicability of the GAMLSS family, for example to censored, bounded or mixed discrete-continuous responses.Inspired by location-scale ordinal regression our primary aim is to develop a distribution-free and parsimonious flavor of GAMLSS.
We propose a generalization of location-scale ordinal regression by introducing a smooth parsimonious parameterization of the intercept thresholds in terms of a transformation function, allowing to estimate distribution-free location-scale models for continuous, discrete, and potentially censored or truncated outcomes in a unified maximum likelihood framework (Hothorn et al. 2018).This framework of location-scale transformation models allows one to model the impact of explanatory variables on the location and the dispersion of the response distribution, without relying on distributional assumptions.We demonstrate the practical merits of such an approach by applications of (1) maximum-likelihood estimation in stratified models and models for crossing or partially proportional hazards, (2) novel location-scale regression trees, (3) transformation models with smooth non-linear location-scale parameters for growth-curve analysis, and discuss (4) model selection issues arising in these contexts.

Model
For univariate and at least ordered responses variables Y ∈ Ξ we propose to study regression models describing the conditional distribution function of Y given explanatory variables X = x as The model is characterized by (i) a monotonically increasing transformation function h : Ξ → R depending on parameters ϑ ∈ R P , (ii) a cumulative distribution function F : R → [0, 1] of some random variable with log-concave Lebesgue density on the real line, (iii) a covariatedependent location parameter µ(x) ∈ R, and (iv) a covariate-dependent scale parameter σ(x) ∈ R + .The model is distribution-free in the sense that a unique transformation function h exists for every baseline distribution P(Y ≤ y | X = x 0 ), i.e., a distribution conditional on explanatory variables x 0 with µ(x 0 ) = 0 and σ(x 0 ) = 1.In this case, the transformation function is given by h(y | ϑ) = F −1 (P(Y ≤ y | X = x 0 )).Conditional distributions arising from changing x 0 to x are linear in h on the scale of the link function Unknowns to be estimated are the parameters ϑ defining the transformation function, the location function µ(x), and the scale function σ(x), whereas F is chosen a priori.The applicability to ordered, count, or continuous outcomes possibly under random censoring, its distribution-free nature, and the location-scale formulation allowing simple interpretation of the impact explanatory variables have on the response' distribution shall be discussed in the following.Figure 1

Interpretation
In all models (2), positive values of the location parameter µ(x) correspond to larger values of the response and smaller values of the scale parameter σ(x) are associated with smaller variability of the response and thus result in more "concentrated" conditional distributions (Figure 1).Fitted models can conveniently be inspected on the scale of the conditional distribution, survival, density, (cumulative) hazard, odds, or quantile functions.
Statements beyond these general facts and interpretation of µ(x) and σ(x) in particular depend on the specific choice of F .Suitable choices for F include inverses of common link functions, such as ≡ 1, the model reduces to well-established regression models.For F = cloglog −1 , one obtains a proportional hazards model, a proportional reverse-time hazards model is defined by F = loglog −1 , and a proportional odds model given by the choice is the conditional mean of the h-transformed response.An overview on these models and interpretation of µ(x) is available from Hothorn et al. (2018, Table 1).
Under certain circumstances, these simple ways of interpretation carry over to location-scale models of form (2). Consider Cox' proportional hazards model (F = cloglog −1 ) for a continuous survival time.A change from x to x is reflected by the difference µ(x) − µ(x) on the scale of the log-hazard functions conditional on x and x, respectively.The introduction of a scale parameter σ(x) to this model does not affect this form of interpretation as long as σ(x) = σ(x), owing to the fact that in model (2) µ(x) is not multiplied with σ(x) −1 .For a proportional odds model, µ(x) − µ(x) is the vertical difference between the two conditional log-odds functions.Therefore, if model interpretation on these scales is important for certain explanatory variables, one should try to omit these variables from the scale term.
Another form of model interpretation can be motivated from probabilistic index models (Thas et al. 2012), which describe the impact of a transition from x to x by the probabilistic index P(Y ≤ Ỹ | x, x).This probability can be derived from transformation models (Sewak and Hothorn 2023).for example, the probabilistic index has a simple form, with two independent draws from this model, the first, Y , conditional on x and the second, Ỹ , conditional on x.Especially in cases where an explanatory variable affects both the location term µ(x) but also the scale term σ(x), the probabilistic index may serve as a comprehensive measure to describe the impact of changes in the covariate configuration on the response's distribution.

Parameterization
We in general express the transformation function in terms of P basis functions h(y | ϑ) = a(y) ⊤ ϑ.For absolute continuous responses Y ∈ Ξ ⊆ R, the transformation function h can be conveniently parameterized in terms of a polynomial in Bernstein form h(y | ϑ) = a Bs,P −1 (y) ⊤ ϑ (McLain and Ghosh 2013;Hothorn et al. 2018).The basis functions a Bs,P −1 (y) ∈ R P are specific beta densities (Farouki 2012) and it is straightforward to obtain derivatives and integrals of h(y | ϑ) = a Bs,P −1 (y) ⊤ ϑ with respect to y and, under suitable constraints, a monotonically increasing transformation function h (Hothorn et al. 2018).For count responses Y ∈ {0, 1, 2, . . .} this transformation function is evaluated for integer values only, i.e., h(⌊y⌋ | ϑ) (Siegfried and Hothorn 2020).For ordered categorical responses A non-parametric version assigns one parameter to each unique value of the outcome in the same way.In all cases, monotonicity of h can be implemented by the constraints ϑ p ≤ ϑ p+1 , p ∈ 1, . . ., P − 1 (Hothorn et al. 2018).

Likelihood
From model (2), the log-likelihood contribution ℓ i (ϑ, µ(x i ), σ(x i )) of an observation (y i , x i ) with y i ∈ R given as a function of the unknown parameters ϑ, µ(x i ), and σ( Evaluating this expression requires the Lebesgue density f = F ′ and the derivative h ′ (y | ϑ) = a ′ (y) ⊤ ϑ of the transformation function with respect to y.For a discrete, left-, right-or interval-censored observation ( ȳi , ȳi ] the exact log-likelihood contribution For observed categories y k , the datum is specified by (y k−1 , y k ] and for counts y i ∈ N it is ( ȳi , ȳi ] = (y i − 1, y i ].For random right-censoring at time t i it is given by ( ȳi , ȳi ] = (t i , ∞) and for left-censoring at time t i by ( ȳi , ȳi ] = (0, t i ].For the important special case of i = 1, . . ., N independent realizations from model (2) with linear location term µ(x) = x ⊤ β and log-linear form for the scale term σ(x) −1 = exp(x ⊤ γ), the unknown parameters ϑ, β, and γ can be estimated simultaneously by maximizing the corresponding log-likelihood under suitable constraints Score functions and Hessians as well as conditions for likelihood-based inference can be derived from the expressions given in Hothorn et al. (2018) for models defined in terms of ϑ and β.

Model selection
Model selection in this framework can be performed by including an L 0 penalty in the likelihood implied by model ( 2) using an adaptation of the sequencing-and-splicing technique suggested by Zhu et al. (2020).
Here, s ∈ {1, . . ., 2J} denotes a fixed support size and ∥•∥ 0 denotes the L 0 norm.The parameters of the transformation function ϑ remain unpenalized.When the support size s is unknown, s is tuned by minimizing a high-dimensional information criterion (SIC).The procedure is summarized in Algorithm 1.Further information on the choice of the initial active set and the inclusion of unpenalized parameters is given in the Supplementary Material C.
Algorithm 1: Best subset selection for location-scale transformation models.

Inference for applications
Motivated by applications from different domains we detail the estimation of location-scale transformation models, including important aspects of model evaluation, interpretation and testing.The wide range of applications of the model framework is further exemplified by contrasting it to established location-scale models.
In Section 3.1.1we outline the estimation of location-scale transformation models from the perspective of a stratified model.Section 3.1.2presents the application of location-scale models to survival data in the presence of crossing hazards, further introducing a locationscale alternative to the commonly used log-rank test.Interpretability of location-scale transformation models is exemplified in Section 3.1.3assessing seasonal and annual patterns of deer-vehicle collisions.The estimation of non-linear, tree-based location-scale transformation models is discussed for self-reported orgasm frequencies of Chinese women in Section 3.2.Inspired by the GAMLSS framework, Section 3.3 describes the estimation of a distribution-free version of additive models featuring smooth covariate-dependent location and scale terms.
The application of model selection in this framework is exemplified in Section 3.4.

Stratification
Haslinger et al. (2020) reported data from measuring postpartum blood loss Y ∈ R + during 676 vaginal deliveries and 632 caesarean sections at the University Hospital Zurich, Switzerland.Aiming to contrast blood loss during vaginal deliveries or cesarean sections the conditional distributions can be estimated by stratification, for example.
In the following we estimate such a stratified model with two separate transformation functions h(y | delivery = vaginal) = a Bs,15 (y) ⊤ ϑ vaginal and h(y | delivery = cesarean) = a Bs,15 (y) ⊤ ϑ cesarean as polynomials in Bernstein form.In a similar spirit, a location-scale transformation model with transformation function a Bs,15 (y) ⊤ ϑ for vaginal deliveries and transformation function exp(γ)a Bs,15 (y) ⊤ ϑ−β for cesarean sections can be defined.Transformation functions of both models were defined on the probit scale.The stratified, distribution-and model-free approach estimates 2 × P = 32 parameters whereas the distribution-free locationscale model consists of only P + 2 = 18 parameters, providing a lower-dimensional alternative to stratification.Owing to Weierstrass' theorem, polynomials in Bernstein form with sufficiently large order P −1 can approximate any continuous function on an interval and therefore the location-scale model does not make assumptions about the transformation function h and thus the distribution of blood loss for vaginal deliveries.However, the location and scale terms govern the discrepancy between blood loss distributions for the two modes of delivery, and thus this approach can be characterized as being distribution-free but not model-free.
Due to the practical challenges in measuring blood loss in the hectic environment of a delivery ward, interval-censored observations were reported and the corresponding interval-censored negative log-likelihood ( 4) is minimized by Augmented Lagrangian Minimization (Madsen et al. 2004) to estimate the parameters ϑ, β, and γ simultaneously.Visual inspection of distribution and density functions as well as the in-sample log-likelihoods in Figure 2 shows that the two models are practically identical.The two estimated conditional distribution functions cross around 1,000ml, which is only possible due to estimation of two separate transformation functions h(y | delivery) or via the inclusion of the delivery mode dependent scale term exp(γ).
We can also compute the probabilistic index here, which indicates that a randomly selected woman having a vaginal delivery has a probability of 0.71 (95% confidence interval: 0.68−0.74)for a lower blood loss compared to a randomly selected woman undergoing a cesarean section.

Crossing hazards
In the following we re-analyze a two-arm randomized controlled trial of 90 patients with gastric cancer (Schein and Gastrointestinal Tumor Study Group 1982).Trial patients received either chemotherapy and radiotherapy (intervention group) or chemotherapy alone (control group).
The non-parametric Kaplan-Meier estimates of the survivor functions of both groups in Non-proportional hazards are a common violation of a standard model assumption in survival analysis necessitating tailored models to express differences in survival times T ∈ R + by interpretable parameters.We suggest a location-scale transformation model of the form For γ = 0, the model reduces to a proportional hazards model with log-hazard ratio β.A distribution-free version can be implemented by choosing a polynomial in Bernstein form for the transformation function h(t | ϑ) = a Bs,6 (t) ⊤ ϑ.A Weibull model corresponds to a loglinear transformation function h(t | ϑ) = ϑ 1 +ϑ 2 log(t), which was introduced as a special case in the GAMLSS framework (Rigby and Stasinopoulos 2005, Table 1) and later investigated in more detail by Burke and MacKenzie (2017) and Peng et al. (2020) using the equivalent parameterization ϑ 1 + exp(ϑ 2 ) log(t) for the control and ϑ 1 + exp(ϑ 2 + γ) log(t) + β for the intervention group.Further extensions of the Weibull location-scale model were studied in Burke et al. (2020b) and a non-parametric approach, leaving h completely unspecified, was theoretically discussed by Zeng and Lin (2007) and Burke et al. (2020a).
Model parameters for both models were estimated by maximizing the likelihood defined by (3) for death times and likelihood (4) for right-censored observations.The non-parametric Kaplan-Meier estimates in Figure 3  The individual score contributions are defined as Note that the first element of r i is the log-rank score for the ith individual and a bivariate linear test statistic is simply the sum of the scores in the intervention group.After appropriate standardization, maximum-type statistics or quadratic forms can be used to obtain p-values from the asymptotic or approximate permutation distribution (Hothorn et al. 2006).The log-rank test alone does not lead to a rejection at 5% (p-value = 0.638) but the bivariate test does (p-value = 0.002 for the maximum-type and p-value = 0.001 for the quadratic form).

Partial proportional hazards
We analyze a time series of daily deer-vehicle collisions (DVCs) involving roe deer that were documented over a period of ten years (2002 -2011) in Bavaria, Germany (Hothorn et al. 2015).In total, 341,655 DVCs were reported over 3,652 days, with daily counts 16 ≤ Y ≤ 210.
As a benchmark, we fitted a location transformation model were additionally estimated.Because the year does not affect the scale term, parameters β year are interpretable as log-hazard ratios common to all days 1, . . ., 365 within a year.In this sense, the model is a partial proportional hazards model.As in Section 3.1.2,we study a distribution-free version (h in Bernstein form) and a more restrictive Weibull model (log-linear h) which, for counts, is applied to the greatest integer ⌊y⌋ less than or equal to the cut-off point y.The correct interval-censored likelihood (4) for count data was used in the three cases to estimate the unknown model parameters ϑ, γ, β year , β weekday(d) and β simultaneously (Siegfried and Hothorn 2020) by Augmented Lagrangian Minimization (Madsen et al. 2004).
Assessing the temporal changes in DVC risk across a year, all three models are displayed on the quantile scale in Figure 4    animal activity.The plots further indicate that for the location transformation model large median values are associated with larger dispersion.This is not the case for the other two models, indicating a certain degree of underdispersion.The median annual risk pattern is very similar in all three models, however, the distribution-free location-scale transformation model reveals smaller variance compared to the other two models.
The partial proportional hazards location-scale transformation model further allows investigation of the general trend of DVCs over a decade.From the log-hazard ratios β year we computed multiple comparisons of hazard ratios comparing subsequent years, with multiplicity control.Table 1 is in line with an increasing DVC risk from 2002 to 2004, followed by a plateau in 2005 and 2006, a further risk increase in 2007, and then plateauing in the remaining years.

Location-scale transformation trees
Pollet and Nettle (2009) analyzed the self-reported orgasm frequency of 1,533 Chinese women with current male partners.The ordinal outcome Y was reported in terms of ordered categories: never < rarely < sometimes < often < always.To assess the effect of explanatory variables on the distribution of reported orgasm frequencies, we re-analyze this data using a tree-structured location-scale transformation model.Explanatory variables included in the model are: partner income, partner height, duration of the relationship, respondents age (Rage), difference between education and wealth between both partners, the respondents education (Reducation: no school < primary school < lower-middle school < upper-middle school < junior college < university), health, happiness (Rhappy: very unhappy < not too unhappy < relatively happy < very happy) and place of living (Rregion).
We apply a modification of the transformation tree induction algorithm by Hothorn and Zeileis (2021) to estimate the location-scale transformation tree: (i) Fit an unconditional transformation model, (ii) assess the correlation of model scores and explanatory variables, (iii) find an appropriate binary split in the explanatory variable strongest correlated to the scores, (iv) proceed recursively.The novelty here is that location-scale trees pay attention to bivariate location-scale scores exclusively, instead of the P scores for the transformation parameters ϑ (as in Hothorn and Zeileis 2021).As in Section 3.1.2,the unconditional model is fitted in each node of the tree by optimizing the likelihood The bivariate score contributions are defined by Permutation tests are then applied to assess the association between the jth explanatory variable based on a quadratic form collapsing the linear test statistic , where g j (x i ) is a Q(j)-dimensional vector representing the jth explanatory variable of the ith subject.The bivariate score allows the tree to detect location and scale effects, for the model in Figure 5 on the logit scale.A p-value is computed for all j = 1, . . ., J explanatory variables and the variable with minimum p-value is selected for splitting.
The location-scale transformation tree (Figure 5) indicates that higher orgasm frequencies were in general reported from higher educated, happier, and younger females.In this subgroup, the coastal south region was associated with a tendency to higher reported orgasm frequencies compared to the rest of China.

Transformation additive models for location and scale
The head-circumference growth chart obtained from the Dutch growth study (Fredriks et al. 2000) is one of the standard examples in the GAMLSS literature.The top panel of Figure 6 shows the head-circumference quantiles for boys conditional on age obtained from fitting a GAMLSS with Box-Cox-t distribution, featuring four model terms µ(age), σ(age), ν(age) and τ (age) (reproducing Figure 16 in Stasinopoulos and Rigby 2007).In our re-analysis, we replace the four parameter Box-Cox-t GAMLSS with a distribution-free transformation additive model for location and scale (TAMLS) featuring a conditional distribution function In contrast to the GAMLSS, there is no need to assume a specific parametric distribution in the TAMLS and only two instead of four smooth terms have to be estimated.In this model, the transformation parameters ϑ can be understood as nuisance parameters.We employ the Rigby and Stasinopoulos (RS) algorithm (Rigby and Stasinopoulos 2005) developed for GAMLSS to estimate the two smooth terms µ(age) and σ(age) in our TAMLS.For a given likelihood depending on a location and scale term, this algorithm allows estimation of these two terms in a structured additive way.We shield the more complex formulation of our model from the RS algorithm by setting-up a profile likelihood which, under the hood, estimates the nuisance parameters ϑ given µ and σ controlled by the RS algorithm.More specifically, for candidate functions µ and σ, the profile likelihood over ϑ is given by We used log-likelihood contributions (3) in this specific application.Augmented Lagrangian Minimization (Madsen et al. 2004) was used to estimate ϑ given µ(•) and σ(•).The penalized profile likelihood was optimized with respect to the two functions µ(•) and σ(•) in Step 2a(i) of the RS algorithm (Appendix B, Rigby and Stasinopoulos 2005).The in-sample log-likelihood of the four term Box-Cox-t GAMLSS is slightly larger than the one of the distribution-free TAMLS, but the conditional quantile sheets obtained from the two models are very close and hardly distinguishable for boys older than 2.5 years (Figure 6).
Models assuming additivity of multiple smooth terms for the location effect µ(x) = J j=1 m j (x) and the scale effect −2 log(σ(x)) = L l=1 s l (x) can be fitted by maximizing the same profile likelihood using the RS algorithm or L 2 boosting (for GAMLSS, Mayr et al. 2012).In this sense, transformation models introduce a novel distribution-free member to the otherwise strictly parametric GAMLSS family.

Model selection
In the following we aim to assess the effect of explanatory variables on the medical demand by the elderly, i.e., number of physician visits Y = 0, 1, 2, . . .for patients aged 66 or older, using a sample from the United States National Medical Expenditure Survey conducted in 1987 and 1988 (Deb and Trivedi 1997).For such applications, location-scale transformation models (2) are especially attractive for parameter interpretation when linear location and scale terms are considered, and variables of special interest are present in the location term only (Section 2.1).If continuous explanatory variables x are present in the model, a parameter identification issue arises which has previously been discussed in the GAMLSS context (Rigby et al. 2019, page 60).In a location-scale model, Location and scale parameter estimates, β and γ, from applying the two estimation procedures, maximum likelihood (ML) or best subset selection (BSS), to a location-scale transformation model including the following explanatory variables: Health (poor < average (baseline) < excellent), Sex (female (baseline), male), Insurance (no (baseline), yes), Chronic (number of chronic conditions) and School (number of years of education).

BCT GAMLSS
Variables which were dropped when applying best subset selection are indicated by -.
the intercept, which is implicit in the transformation function h(y | ϑ) = h(y | θ) − β 0 , must not be multiplied with the scale term and an explicit intercept must be added to the location term, changing the model to The two models are not equivalent, but adding β 0 to h leads to an unidentified parameter when γ is close to zero and omitting β 0 leads to different model fits when a constant is added to a continuous explanatory variable (e.g., when defining a suitable baseline distribution).However, for the sake of interpretability we aim to drop variables from the scale term whenever possible and therefore apply the L 0 penalty (detailed in Section 2.4, implemented in package tramvs (Kook 2023)) on γ to the likelihood of model ( 5).
Applying the two estimation procedures, maximum likelihood and best subset selection, to a location-scale transformation model (with F = cloglog −1 ) estimating the effect of selfperceived health status (Health), sex (Sex), insurance coverage (Insurance), and the number of chronic conditions (Chronic) and years of education of patients (School) on the frequency of physician visits, allows for a head-to-head comparison of the parameter estimates (Table 2).In the best subset location-scale transformation model the variables Health, Sex and School are dropped from the scale term allowing to interpret their effects in terms of (log-)hazard ratios.
For the variable Sex, for example, the corresponding exp(− β) = 1.1333 can be interpreted as hazard ratio comparing the hazards of male patients to the hazards of female patients, all other variables being equal.Tosteson and Begg (1988) introduced the notion of distribution-free location-scale regression in the context of ROC analysis.While they were able to estimate a corresponding model for ordinal responses, they contemplated that for models (2), "there is, as yet, no software for accommodating continuous test results, which are common outcomes for laboratory tests" (Tosteson and Begg 1988).With the introduction of a smooth transformation function and corresponding software implementation in the tram add-on package (Hothorn et al. 2023), we address this long-standing issue.We derive likelihood and score contributions for all response types and discuss suitable inference procedures for various functional forms.

Discussion
In a broader context, we contribute a new distribution-free member to the rich family of location-scale models.The model is unique in the sense that data analysts do not have to commit themselves to a parametric family of distributions before fitting the model.The flexibility of our approach comes from the pair of location and scale terms allowing interpretability of conditional distributions on various scales, including proportional odds or hazards.Despite the distribution-free nature, we parameterize the model such that simple maximum-likelihood estimation for all types of responses becomes feasible.Therefore, our implementation handles arbitrary responses, including bounded, mixed discrete-continuous, and randomly censored outcomes, in a native way.Among other diverse applications, our flexible approach can help to generalize Weibull location-scale models previously studied as a model for crossing hazards using GAMLSS, allows for over-or underdispersion to be explained by covariates in complex count regression models, adds a notion of dispersion to regression trees for complex responses, provides means to reduce the complexity of growth-curve models, and has important applications in ROC analysis (Sewak and Hothorn 2023).
Special care with respect to parameter interpretability is needed when formulating the model.Parameters in linear location terms are interpretable as log-odds or log-hazard ratios for as long as there is no corresponding scale parameter.Thus, model selection becomes vitally important should the data analyst be interested in direct parameter interpretation.A novel approach to best subset selection was presented and empirically evaluated.Model interpretation is possible on other scales (e.g., probabilistic indices or conditional quantiles), yet constitutes probably the biggest challenge of location-scale transformation models.
All models discussed here are "distributional" in the sense that they formulate a proper distribution function.Via appropriate constraints, our software implementation ensures that fitted models also directly correspond to conditional distribution functions.This feature allows straightforward parametric bootstrap implementations.Alternative suggestions for location-scale ordinal regression do not necessarily lead to estimates which can be interpreted on the probability scale (Cox 1995;Tutz andBerger 2017, 2020).
Algorithmically, we stand on the shoulders of giants, because only minor modifications to well-established algorithms were necessary for enabling parameter estimation.We didn't fully explore all possibilities here, and for example location-scale transformation forests building on Hothorn and Zeileis (2021) or functional gradient boosting for this class (Hothorn 2020b) are interesting algorithms for smooth interaction modelling in potentially high-dimensional covariate spaces.

D. Simulation
The algorithms employed for estimating stratified models (Section 3.1.1),crossing hazards models (Section 3.1.2),partial proportional hazards models (Section 3.1.3),location-scale transformation trees (Section 3.2), and transformation additive models for location and scale (Section 3.3) have all been demonstrated to work well in a plethora of publications.The best subset selection algorithm applied to linear location-scale transformation models is novel and thus a more thorough investigation of its empirical performance was called for.
Data were generated from a linear location-scale transformation model with X j ∼ U[0, 1], j = 1, 2 and corresponding baseline distribution Algorithm 2: Best subset selection for location-scale transformation models with mandatory covariates.
As a benchmark method, full maximum likelihood estimation of the model parameters was chosen.The results in Figure S8, comparing the location and scale parameter estimates, β and γ, to the true parameter values, corroborate that both model parameters can be precisely estimated, with the corresponding variance staying constant along all magnitudes of the true parameter values.
To contrast the best subset selection algorithm, additional variables X j ∼ U[0, 1], j = 3, . . ., 10 with β j = γ j = 0 were added to the location and scale term of model ( 7).The corresponding model parameters, estimated using the best subset selection algorithm, are shown in Figure S9.Similar to the benchmark method, the parameter estimates, β and γ, employing the best subset selection algorithm, indicate that the parameters can be well estimated, however with some distortion towards zero (Figure S9,top).Assessing the estimates corresponding to the uninformative covariates "location x j " and "scale x j " (with true value β j = γ j = 0), reveal that in the best subset model the uninformative variables are mostly correctly negated, especially for configurations with larger values of β or γ (Figure S9, bottom).
illustrates the flexibility of the location-scale transformation model on the scale of the link function, i.e., F −1 (P(Y ≤ y | X = x)), and of the conditional distribution function P(Y ≤ y | X = x) for different values of µ(x) and σ(x).

Figure 1 :
Figure 1: Location-scale transformation model.The transformation (left) and cumulative distribution function (right) are shown for the baseline configuration (i.e., µ(x) = 0 and σ(x) = 1) and different values of the location parameter µ(x) and of the scale parameter σ(x).

Figure 2 :
Figure 2: Stratification.Distribution (top) and density (bottom) of postpartum blood loss conditional on delivery mode estimated by the stratified transformation model (left) and location-scale transformation model (right).In addition, the empirical cumulative distribution function is shown in the top row, in-sample log-likelihoods are given in the bottom row.

Figure 3 :
Figure 3: Crossing hazards.The survivor functions of the two groups estimated by the non-parametric Kaplan-Meier method (step function) are shown along the estimates from the location-scale Weibull model (left) and the distribution-free location-scale transformation model (right).
are overlaid with survivor functions obtained from the Weibull model (log-linear h, left panel) and the distribution-free location-scale model (h being a polynomial in Bernstein form, right panel).Both models show crossing survivor curves and the more flexible model appears to have better fit.However, in the context of a randomized trial, a test for the null of equal survivor curves is more important than model fit.The likelihood ratio tests lead to a rejection of the null hypothesis at 5% in either model (p-value = 0.034 for the Weibull model and p-value = 0.011 for the distribution-free model).The bivariate Wald-test, proposed byBurke and MacKenzie (2017) for crossing-hazards problems, also leads to a rejection with p-value = 0.032.An alternative location-scale test can be motivated in analogy to the log-rank test.The bivariate permutation score test for γ and β for testing the null γ = β = 0 is defined based on the unconditional model P(T > t) = exp[− exp{h(t | ϑ)}], that is, the model fitted under the constraint γ = β = 0. Thus, the likelihood contribution of the ith subject is ℓ i ϑ, β, exp(γ)−1 = ℓ i (ϑ, 0, 1).The maximum-likelihood estimator is θ = arg max ϑ N i=1 ℓ i (ϑ, 0, 1), subject to ϑ p ≤ ϑ p+1 , p ∈ 1, . . ., P − 1.

Figure 4 :
Figure 4: Partial proportional hazards.Three annual quantile functions (0.25, 0.50, and 0.75th quantile) for DVCs (for a hypothetical Monday in 2002) estimated by three transformation models of increasing complexity.The in-sample log-likelihoods of the corresponding models are given in the panels.

Figure 5 :
Figure 5: Location-scale transformation tree.Female orgasm frequency in heterosexual relationships as a function of questionnaire variables reported by the female respondent.

Figure 6 :
Figure 6: Transformation additive models for location and scale (TAMLS).Conditional quantiles of head circumference along age estimated by the Box-Cox-t GAMLSS (BCT GAMLSS, top panel) and the TAMLS (bottom panel).The former model comprises four and the latter model two smooth terms.
FigureS7: Re-analysis ofTosteson and Begg (1988).ROC curves for ultrasound by primary tumor site estimated by a location-scale transformation model (left) and the original Figure9ofTosteson and Begg (1988) (right).

Figure S8 :
Figure S8: Simulation results.Location parameter estimates β (first row) and scale parameter estimates γ (second row) employing full maximum likelihood estimation are shown along all configurations of β and γ.The true value of the corresponding parameter is indicated by the dashed horizontal line.

Figure S9 :
Figure S9: Simulation.Location parameter estimates β (first row) and scale parameter estimates γ (second row) employing the subset selection algorithm are shown along all configurations of β and γ (top).The estimates corresponding to the uninformative covariates employing the best subset selection algorithm are shown in the bottom.The true value of the corresponding parameter is indicated by the dashed horizontal line.

Table 1 :
Partial proportional hazards.Estimates and corresponding simultaneous 95% confidence intervals (CI) of multiplicative changes in hazards by year.Hazard ratios smaller one indicate increasing DVCs when comparing two subsequent years.