A transformation perspective on marginal and conditional models

Summary Clustered observations are ubiquitous in controlled and observational studies and arise naturally in multicenter trials or longitudinal surveys. We present a novel model for the analysis of clustered observations where the marginal distributions are described by a linear transformation model and the correlations by a joint multivariate normal distribution. The joint model provides an analytic formula for the marginal distribution. Owing to the richness of transformation models, the techniques are applicable to any type of response variable, including bounded, skewed, binary, ordinal, or survival responses. We demonstrate how the common normal assumption for reaction times can be relaxed in the sleep deprivation benchmark data set and report marginal odds ratios for the notoriously difficult toe nail data. We furthermore discuss the analysis of two clinical trials aiming at the estimation of marginal treatment effects. In the first trial, pain was repeatedly assessed on a bounded visual analog scale and marginal proportional-odds models are presented. The second trial reported disease-free survival in rectal cancer patients, where the marginal hazard ratio from Weibull and Cox models is of special interest. An empirical evaluation compares the performance of the novel approach to general estimation equations for binary responses and to conditional mixed-effects models for continuous responses. An implementation is available in the tram add-on package to the R system and was benchmarked against established models in the literature.


Introduction
In the context of the analysis of dependent data or clustered observations, many statistical approaches for fitting conditional and marginal models have been studied.Generalised mixed-models (GLMMs, Stroup 2012) condition on unobservable random effects and allow interpretation of covariate effects among subjects sharing the same value of such a random effect.Conditional models typically assume a specific random effects distribution and thus induce a joint distribution from which marginal distributions can be derived either analytically or by numerically integrating over random effects.Models formulating marginal covariate associations without requiring a model for the joint or conditional distribution can be estimated by solving generalised estimating equations (GEEs, Zeger et al. 1986Zeger et al. , 1988)).Later, marginalised multilevel models (Heagerty 1999;Heagerty and Zeger 2000) were introduced providing a likelihood-based approach to estimate marginal coefficients in the framework of a conditional model.Gory et al. (2021) contribute a model definition allowing parameters estimated in a conditional model to be interpreted in a marginal fashion, and McGee and Stringer (2022) discuss marginal additive models for potentially non-linear population-averaged associations, starting from a generalised additive mixed-model framework.
Marginal predictive distributions with interpretable parameters are easy to derive from normal linear mixed-effects models (LMMs) and binary probit GLMMs as well as from some frailty models using a copula representation (Goethals et al. 2008).Regression coefficients in other generalised linear mixed-effects or frailty models for non-normal responses (binary logistic, Poisson, or Cox normal frailty models, for example) only have a conditional interpretation, that is, given unobservable normal random effects.It is possible to obtain the marginal covariate effect by integrating out the normal random effects, however, the simple interpretability of the fixed-effects regression coefficients is then lost.In contrast, marginal models allowing a marginal interpretation of effects cannot be defined in an unambiguous way without the specification of a joint distribution and we refer to Lee and Nelder (2004) and Muff et al. (2016) for a broader discussion of these issues.
Herein, we address the problem of formulating and estimating linear transformation models for the joint distribution of cluster-correlated observations arising, for example, in multi-centre trials or when a subject is repeatedly examined over time.In contrast to many methods in the mixed-effects and frailty literature primarily aiming at explanation, that is, inference for regression coefficients conditional on random effects in the presence of correlated observations, the focus of this paper is on inference for marginal distributions which can be derived analytically from this novel joint model.
Transformation models for correlated observations have been studied mostly in the survival analysis context.Parameter estimation is typically performed by non-parametric maximum likelihood estimation (Cai et al. 2000;Zeng et al. 2017) where the transformation function is only allowed to jump at distinct observed event times and is conceptually understood as a infinite-dimensional nuisance parameter.Lin et al. (2017) even go a step further and propose a maximum rank correlation estimator for estimating the regression coefficients along with the mean of random effects while refraining from specifying neither the transformation function nor a distribution for the random effects.While this model is extremely general, interpretation of the parameters is unclear and predictive distributions cannot be derived.Transformation models estimated in a fully parametric way (McLain and Ghosh 2013;Hothorn et al. 2014Hothorn et al. , 2018;;Klein et al. 2022) are practically as flexible as semiparametrically estimated models yet technically much easier to handle.Therefore, various flavours of conditional mixed-effects transformation models have been suggested (Manuguerra and Heller 2010;Garcia et al. 2019;Tang et al. 2018;Sun and Ding 2021;Tamási et al. 2022).
The results presented herein allow the formulation and estimation of models for the joint multivariate distribution of clustered non-normally distributed responses.The marginal distributions obtained from the joint distribution of observations in the same cluster feature directly interpretable marginal effects.Analytic formulae of the log-likelihood and the corresponding score function for absolute continuous and potentially non-normal responses observed without censoring are available.For censored and discrete observations, evaluation of the likelihood requires evaluation of multivariate normal probabilities, however, in low dimensions such that the approximation error can be made arbitrarily small in reasonable time.Applications from different domains presented in Section 3 highlight that the methodology helps to unify models and inference procedures for clustered observations across traditionally compartmentalised sub-disciplines of statistics.An empirical evaluation of marginal effects Table 1: Interpretation of the linear predictor x β under different inverse link functions F .In practice, many models are known with respect to the link function F −1 , which we report accordingly.We denote the baseline (x β = 0) cumulative distribution function by F (h(y)) and the conditional cumulative distribution function by and distributions estimated by this novel approach is presented in Section 4.

Methods
Linear transformation models for the conditional distribution function of some univariate and at least ordered response Y ∈ Ξ given a configuration x of covariates X are defined by three objects: An "inverse link function" F , a linear predictor x β with regression coefficients β excluding an intercept, and a monotone non-decreasing transformation, or "intercept", function h.Only β and h are unknowns to be estimated from data whereas F defines the scale linearity of the effects is assumed upon.This model class covers many prominent regression models, such as normal, log-normal, Weibull, or Cox models for absolute continuous responses, binary models with different link functions, proportional odds and hazards cumulative models for ordered responses, and many less well-known or even novel models (Hothorn et al. 2018).A summary of the possible inverse link functions F and of the corresponding interpretation of the linear predictor x β is given in Table 1.This flexible modelling framework covers probabilistic index models (Thas et al. 2012), i.e. models allowing interpretation of the effect size in terms the probability that a randomly selected subject has an outcome greater than the one of another randomly selected subject, given that the covariate values for both subjects are known (for instance, whether a subject received treatment or not).We illustrate the usefulness of this quantity in Sections 3.3 and 3.4.
For clustered or longitudinal data, we observe multiple values of the response Y for each observational unit (clusters or subjects) whose interdependencies are not reflected in model (1).Adding, in analogy to GLMMs, a random effects term u r to the linear predictor in (1) defines mixed-effects transformation models When the random effects follow a specific bridge distribution to F , that is, the normal distribution for F = Φ, the stable distribution when F = cloglog −1 (Aalen et al. 2008), or the distribution derived by Wang and Louis (2003) for F = logit −1 , marginal distributions can be derived.Neither the likelihood nor marginal distributions, and thus a marginal interpretation of β, are available in closed form when the model is formulated differently, especially when normal random effects are coupled with F = Φ (Tamási et al. 2022).

Joint Transformation Models
To address these issues, we present a novel transformation model for the joint distribution that provides simple analytic expressions for marginal predictive distributions of the form (1).
In this setup, i = 1, . . ., N independent observational units, each consisting of N i correlated observations of the response , are available for estimating the joint distribution.While refraining to specify a certain parametric joint multivariate distribution for Y i , we assume that probabilities on the scale of a suitable transformation of Y i can be evaluated using a multivariate normal distribution whose structured covariance matrix captures the correlations between the transformed elements of Y i .The aim of this paper is to simultaneously estimate the transformation, regression coefficients, and the structured covariance from data using models which emphasise predictive distributions and parameter interpretability.
The non-decreasing transformation function h : Ξ → R is applied element-wise to the response vector ensuring that the same transformation is applied to all N i observations.Together with Y i , one observes a corresponding matrix of full rank containing treatment assignment or covariates whose corresponding regression coefficients β are of interest.In addition, the design of the experiment is described by a matrix We exclusively study setups with simple cluster assignment encoded in this matrix (U i = (1) N i ,1 ) or longitudinal data (u ij = (1, t ij ) indicating that Y ij for the ith subject was observed for at time t ij ).We propose to study models for the joint distribution function of Y i given X i and U i of the form Here, Φ 0 N i ,Σ i (γ) (•) is the distribution function of an N i -dimensional normal random vector with mean vector zero and structured covariance matrix as defined by the random effects design matrix and an unstructured Cholesky factor Λ(γ) ∈ R R×R depending on unknown variance parameters γ ∈ R R(R+1) /2 ; I N i denotes the N i × N i identity matrix.We isolate the square roots of the diagonal elements of Σ i (γ) in the matrix a scaling factor to the transformation function h and regression coefficients β which is instrumental for the derivation of marginal distributions.In the longitudinal setup, Λ = γ 1 0 γ 2 γ 3 and the covariance Σ i (γ) j, depends on the observation times t ij and t i .
The key component is the shifted transformation h N i (y) − X i β modelling the impact of the regression coefficients on the transformed scale.
The transformation function h, the regression coefficients β, and the variance parameters γ are unknowns to be estimated from data.In (2), Φ −1 N i (p) = (Φ −1 (p 1 ), . . ., Φ −1 (p N i )) applies the quantile function Φ −1 of the standard normal element-wise to some vector of probabilities p = (p 1 , . . ., p N i ) ∈ (0, 1) N i .Furthermore, F : R → R is an a priori defined cumulative distribution function of some absolute continuous distribution with log-concave density f ; F N i and f N i are the element-wise applications of F and f , respectively.
For absolute continuous responses Y i ∈ R N i , model (2) implies that the latent variable defined as an element-wise transformation of the observations Y i follows a multivariate normal distribution ).The model is distribution-free in the sense that for a baseline configuration (with X i β = 0 N i ), such a transformation into multivariate normality exists for all marginal distributions (Klein et al. 2022).The model does, however, impose a certain correlation structure through the choice of U i .An example for the joint distributions induced by increasing correlations among bivariate repeated measurements with skewed marginal distributions is given Figure 1.
The key aspect of an implementation of model ( 2) is the parameterisation of the transformation function as is the matrix of evaluated basis functions a : Ξ → R P .Choices of basis functions a are problem-specific and several options are discussed in Section 3 and, in more detail, in Hothorn et al. (2018) and Hothorn (2020).

Connection to Normal Linear Mixed-effects Models
We first consider the special case F = Φ, where the transformation of Y i simplifies to Model (2) contains the LMM as a special case.In its standard notation, the LMM reads The matrices X i and U i are typically referred to as "fixed effects" and "random effects" design matrices in the literature.This model can be reformulated as a model for the joint multivariate distribution with linear basis functions a(y) = (y, −1) and parameters ϑ = (σ −1 , ασ −1 ) , and finally fixed effects β = σ −1 β.
Using this notation, the conditional distribution function of some element y ∈ Ξ of Y , conditional on x, u, and unobservable random effects R = r, is such that both marginal distributions follow the χ 2 9 law.For γ 1 = 0, observations within a cluster are independent, and their correlation increases with increasing values of γ 1 .
The marginal distribution of some element y ∈ Ξ of Y , which is still conditional on x and u but integrates over the random effects R, can be obtained from the joint multivariate normal (4) as The shrunken marginal fixed effects β/ u Λ(γ)Λ(γ) u + 1 were also described by Wu and Wang (2019) in a Bayesian implementation of this model.Understanding the LMM as special case of a transformation model allows to relax the normality assumption for Y i by introducing non-linear transformation functions h(y) = a(y) ϑ defined by a non-linear basis a (Hothorn et al. 2018).Section 3.1 contains a comparison of the two models.Probit GLMMs for binary responses Y ∈ Ξ = {0, 1} can also be understood as a special case of a transformation model with intercept h(0) = α and h(1) = ∞.Several implementations of such GLMMs are compared empirically to an implementation motivated from a transformation model perspective in Section 3.2.

Distinction from Generalised Mixed-effects and Frailty Models
Two important extensions of the LMM include GLMMs and frailty models.For binary responses , the logistic GLMM has the conditional, given normal random effects r, interpretation In survival analysis with Y ∈ Ξ = R + , a Weibull normal frailty model leads to the conditional interpretation A normal frailty Cox model replaces the log-linear transformation function of the Weibull model with a smooth logcumulative hazard function h(y).All three models are special cases of mixed-effects transformation models (tramME).
Assuming normal random effects u, neither model can be understood in terms of model ( 2) and two main difficulties are associated with these types of models assuming additivity of the fixed and random effects on the log-odds ratio or log-hazard ratio scales.First, unlike in the LMM (LMM), there is no analytic expression for the marginal distribution and thus a marginal interpretation of the fixed effects β is difficult.Second, evaluation of the likelihood typically relies on a Laplace approximation of the integral with respect to the random effects' distribution and problems with this approximation have been reported, for example by Ogden (2015).The novel multivariate transformation model for clustered observations based on (2) addresses both of these issues as shall be explained in the next subsections.

Transformation Models with Marginal Interpretation
Simple analytic expressions for the marginal distribution are available (also for F = Φ), independent of the choice of the basis function a, noting that the variance of the jth element of Z i (Equation 3) is The Gaussian copula distribution of Equation 2 directly implies the marginal distribution function in form of a marginal transformation model (mtram): (mtram) In this model, the fixed effects β divided by u Λ(γ)Λ(γ) u + 1 are directly interpretable given U = u, for example as log-odds ratios (F = logit −1 ) or log-hazard ratios (F = cloglog −1 ).Because Λ(γ)Λ(γ) is positive semidefinite, there might be a reduction in effect size when comparing the fixed effects β from formula (2) to the marginal effects β/ u Λ(γ)Λ(γ) u + 1 from model (mtram).For repeated measurements with u = 1 we get a constant reduction by 1/ γ 2 1 + 1.In longitudinal models , the marginal effect at time For positively correlated random intercepts and random slopes (i.e.γ 2 > 0), the marginal effect always decreases over time.

The Likelihood Function
For parameters ϑ, β, and γ, the log-likelihood contribution i (ϑ, β, γ) of the ith subject or cluster is based on the transformation where dz is the integral over the N i -dimensional multivariate normal density φ N i with mean zero and covariance Σ i .The structure of Σ i (γ) can be exploited to dramatically reduce the dimensionality of the integration problem.Applying the procedure by Marsaglia (1963), one can reduce this N i -dimensional integral to an R-dimensional integral over the unit cube (see Appendix A).
For continuous observations y ∈ R N i , it is common practice (Section 5, Lindsey 1999) to approximate this log-likelihood by a log-density evaluated at the observations y i : It should be noted that the exact log-likelihood function (6) does not require the precision matrix Σ i (γ) −1 to be computed.In the above approximation, log N i is the element-wise natural logarithm and f N i the element-wise density of F .A (y i ) denotes the matrix of evaluated derivatives a of the basis function a.The log-likelihood of Equation 7 is derived in Appendix B.
Using either log-likelihood, we obtain simultaneous maximum-likelihood estimates for all model parameters from ( θN , βN , γN ) = arg max Some models require additional constraints on ϑ to be implemented (Hothorn et al. 2018).
Analytic score functions for all model parameters ϑ, β, and γ are available (see Appendix B).Score functions for the discrete or censored likelihood ( 6) and the observed Fisher information matrices for both likelihoods are obtained numerically.The full parameterisation of h allows application of standard results for likelihood asymptotics (van der Vaart 1998) to independent observations (Hothorn et al. 2018).Because the model ( 2) is a special case of the multivariate transformation model of Klein et al. (2022) where the transformation function h and the fixed effects β are constrained to be the same for all "coordinates" of the random vector Y i (that is, observations in the same cluster).Therefore, model (2) benefits from the same asymptotic results reported by Klein et al. (2022).

Applications
In this section, we discuss four potential applications of marginally interpretable transformation models.Data, numerical details, and code reproducing the results are available from the Online Appendix (Barbanti and Hothorn 2022).We start with two head-to-head comparisons where model (mtram) suggested here can be estimated by already existing software implementations of mixed-effects probit models for the purpose of validating the implementation of model (mtram) in the add-on package tram (Hothorn et al. 2022) to the R system for statistical computing.

Non-normal Mixed-effects Models
The average reaction times to a specific task over several days of sleep deprivation are given for i = 1, . . ., N = 18 subjects (Belenky et al. 2003).The data are often used to illustrate LMMs with correlated random intercepts and slopes of the form (LMM) Because the reaction times can hardly be expected to follow a symmetric distribution, we consider the non-normal conditional and marginal transformation model where a monotonically increasing transformation function h(y) is allowed to deviate from linearity.Such probit-type mixed-effects models have been studied before, e.g. by merging a Box-Cox power transformation h with a grid-search over REML estimates (Gurka et al. 2006), a conditional likelihood (Hutmacher et al. 2011), or a grid-search maximising the profile likelihood (Maruo et al. 2017).Recently, Tang et al. (2018) and Wu and Wang (2019) proposed a monotone spline parameterisation of h in a Bayesian context.
We parameterise h(y) = a(y) ϑ in terms of a monotonically increasing polynomial in Bernstein form of order six (Hothorn et al. 2018).The conditional transformation model (Tamási et al. 2022) can be estimated by maximising a Laplace approximation to the log-likelihood (Tamási and Hothorn 2021) simultaneously with respect to all parameters ϑ, β, and γ.Direct optimisation of the log-likelihood (7) for the marginal transformation model (mtram) leads to identical results (log-likelihood −859.55), because the conditional and marginal models are identical for F = Φ and the Laplace approximation is very accurate in this case.For F = Φ, conditional and marginal transformation models differ, and numerical integration with respect to the normal random effects is required when marginal distributions shall be obtained from a conditional model.In contrast, the marginal transformation model (mtram) provides a closed-form expression for marginal distributions for all choices of F .With F = logit −1 , the log-likelihood of the marginal model increases slightly (−860.6377).
The daily marginal distribution functions of normal and non-normal models are compared to the daily marginal empirical cumulative distributions in Figure 2. Especially for short reaction times early in the experiment, the non-normal transformation models seem to fit the data better than the normal linear model.Between the probit transformation model and the logistic marginal transformation model, only minor discrepancies can be observed.

Binary Marginal Models
For a binary response y ∈ {0, 1}, the transformation h(y) = α reduces to a scalar intercept.Thus, maximisation of the discrete log-likelihood (6) provides an alternative to commonly applied approximations, such as Laplace or Adaptive Gauss-Hermite Quadrature, for fitting conditional mixed-effects models.In addition, the possibility to interpret parameters marginally also for F = Φ asks for a comparison to generalised estimation equations (GEEs).
We first compared different implementations of binary probit mixed-effects models for the notoriously difficult to handle toe nail data (Backer et al. 1998) for which quasi-separation issues have been reported (Sauter and Held 2016).The ordinal response measuring toe nail infection was categorised to two levels.We were interested in binary probit models featuring fixed main and interaction effects β 1 , β 2 , and β 3 of treatment (itraconazole vs. terbinafine) and time.Subject-specific random intercept models and models featuring correlated random intercepts and slopes were estimated by the glmer function from package lme4 (Bates et al. 2015), by the glmmTMB function from package glmmTMB (Brooks et al. 2017), and by direct maximisation of the exact discrete log-likelihood (6) given in Appendix A.
The estimated model parameters, along with the discrete log-likelihood (6) evaluated at these parameters, are given in Table 2.For the random intercept models, AGQ, the Laplace approximation in glmmTMB, and the discrete log-likelihood gave the same results, the Laplace approximation implemented in package lme4 seemed to fail.It was not possible to apply the AGQ approach to the random intercept / random slope model.The two implementations of the Laplace approximation in packages lme4 and glmmTMB differed for the random intercept but not for the random intercept / random slope model.The log-likelihood obtained by direct maximisation of (6) resulted in the best fitting model with the least extreme parameter estimates.Computing times for all procedures were comparable.
In a second step, a marginal transformation model with logit link was compared to marginal odds ratios obtained from a GEE.We refitted published GEE models for this data (SAS results in Chapter 10, Molenberghs and Verbeke 2005) and noticed substantial differences indicating numerical instabilities for this dataset (see Online Appendix).The monthly multiplicative treatment effect on the odds ratio scale was 0.91 (95% confidence interval 0.83 − 1.00) when a logistic GEE with unstructured working correlation was estimated.The logistic transformation model estimated the same parameter as 0.94 (95% confidence interval 0.89 − 0.99).Molenberghs and Verbeke (2005, p. 211) reported a GEE-based marginal odds ratio of 0.89 (95% confidence interval 0.81 − 0.98, with model-based standard errors and exp-transformed Wald intervals).The performance of GEEs and marginal transformation models are compared against ground truth in a simulation experiment in Section 4. and random intercept/random slope (RI + RS) models were estimated by the Laplace (L) and Adaptive Gauss-Hermite Quadrature (AGQ) approximations to the likelihood (implemented in packages lme4 and glmmTMB).In addition, the exact discrete log-likelihood ( 6) was used for model fitting and evaluation (the in-sample log-likelihood ( 6) for all models and timings of all procedures are given in the last two lines).

Models for Bounded Responses
The fixed effects are interpretable as log-odds ratios, conditional on random effects.The data are presented in the top panel of Figure 3.A transformation model (mtram) with F = logit −1 featuring a transformation function h(y) = a(y) ϑ defined by a polynomial in Bernstein form of order six on the unit interval, and correlated random intercept and random slope terms (u = (1, t) for times t = 0, 7, 12 weeks) is visualised by means of the corresponding marginal distribution functions in the bottom panel of Figure 3. Similar to the results reported earlier (Manuguerra and Heller 2010), the model highlights more severe pain in the active treatment group at baseline.A positive treatment effect can be inferred after 7 weeks which seemed to level-off when subjects were examined after 12 weeks.It is important to note that these results have a marginal interpretation and that the model does not assume a specific distribution of the response, such as a Beta distribution for example.
From the marginally interpretable transformation models, relevant quantities, like the probabilistic index, can be derived (Online Appendix, Barbanti and Hothorn 2022).In this application, the marginal probabilistic index is the probability that, for a randomly selected patient in the treatment group, the neck pain score at time t is higher than the score for a subject in the placebo group randomly selected at the same time point.We obtain a probability of 0.72 (95% confidence interval [0.58; 0.83]) at baseline, 0.29 (95% confidence interval [0.17; 0.43]) after 7 weeks and 0.38 (95% confidence interval [0.24; 0.54]) after 12 weeks.

Marginally Interpretable Survival Models
The CAO/ARO/AIO-04 randomised clinical trial (Rödel et al. 2015)  apy for rectal cancer patients to the same therapy using fluorouracil only.Patients were randomised in the two treatment arms by block randomisation taking the study centre, the lymph node involvement (negative vs. positive), and tumour grading (T1-3 vs. T4) into account.The primary endpoint was disease-free survival, defined as the time between randomisation and non-radical surgery of the primary tumour (R2 resection), locoregional recurrence after R0/1 resection, metastatic disease or progression, or death from any cause, whichever occurred first.The observed responses are a mix of exact dates (time to death or incomplete removal of the primary tumour), right-censoring (end of follow-up or drop-out), and intervalcensoring (local or distant metastases).The conditional hazard ratio 0.79 (0.64, 0.98) was reported as obtained from a Cox mixed-effects model with normal random intercepts and without stratification fitted to right-censored survival times (Rödel et al. 2015).This means that a rectal cancer patient treated with the novel combination therapy benefits from a 21% risk reduction compared to a patient from the same block treated with fluorouracil only.
We were interested in estimating a marginally interpretable treatment effect (acknowledging the fact that patients enrolled into the trial were not a random sample from all rectal cancer patients) based on a marginally interpretable stratified (with respect to lymph node involvement and tumour grading) Weibull model for clustered observations (blocks) in the presence of interval-censored survival times.This model can be formulated by (mtram) choosing F = cloglog −1 , a(y) = (1, log(y)) , u = being the block indicator, and variance parameter γ 1 (corresponding to the correlation structure of a random intercept only model) as well as a treatment parameter β (comparing the novum to fluorouracil only).Stratification was implemented by strata-specific parameters ϑ for each of the four strata.It should be noted that this model is not equivalent to a classical Weibull normal frailty model.
A confidence interval for the marginal hazard ratio exp(β/ γ 2 1 + 1) was computed by simulating from the joint normal distribution of ( β, γ1 ).With a relatively small γ1 = 0.15 (with standard error 0.13), this resulted in a marginal hazard ratio of 0.80 (95% confidence interval [0.65; 0.98]), meaning that rectal cancer patients treated with the combination therapy benefit from a 20% risk reduction on average.By relaxing the Weibull assumption (log-linear transformation h) to a Cox proportional hazards model (nonlinear transformation h), we obtain a hazard ratio of 0.78 (95% confidence interval [0.64; 0.96]) and a marginal probabilistic index of 0.56 (95% confidence interval [0.51; 0.61]), meaning that over all study centres, a randomly selected patient receiving Oxaliplatin has a 56% probability of staying disease-free longer than a randomly selected patient receiving the standard treatment only, given that they both have the same lymph node involvement and tumour grading.

Empirical Evaluation
Practitioners interested in inference for marginal effects will likely apply some form of GEE estimation when analysing a binary response, or might integrate over random effects in a conditional mixed-effects model for more complex response distributions.In this section, we assess the quality of likelihood-based marginal transformation inference (model mtram) in comparison to GEEs for binary responses and to mixed-effects models for continuous responses.
We report the mean-squared errors (MSEs) along with mean widths and coverages of 95% confidence intervals for µ p , p = 1, 2, 3 based on 10 000 simulation iterations in Table 3.

Binary Responses
Binary responses were generated by dichotomisation of the continuous response at the overall median.We fitted logistic GEEs with exchangeable working correlation structure and computed estimates and confidence intervals for all three marginal parameters µ p , p = 1, 2, 3.
Results are shown in the first block of Table 3.In addition, marginal transformation models were fitted to these binary responses.Joint maximum-likelihood estimates of γ 1 and β were computed from which we derived estimates and confidence intervals for the marginal effects µ p , p = 1, 2, 3. We drew 10 000 samples from the asymptotic joint normal distribution of γ 1 and β to derive confidence intervals for µ p , p = 1, 2, 3 in each simulation iteration.These results in the second block of Table 3 are practically equivalent to the results reported for GEEs.For µ 2 and µ 3 , the coverage of confidence intervals computed from model (2) were slightly closer to the nominal 95% level.

Continuous Responses
Marginal transformation models fitted to data on the original scale, i.e. without dichotomisation of the response, performed better in terms of smaller MSEs and confidence interval widths (third block in Table 3).The coverage remained close to the nominal level.
In addition, we compared marginal transformation models for continuous responses to two mixed-effects models: A normal linear mixed-effects model (LMM) and a conditional logistic mixed-effects transformation model (tramME, Tamási and Hothorn 2021).Unlike GEEs, these two additional competitors are misspecified and one has to integrate over normal random effects to obtain a marginal distribution given a specific configuration of x.For the normal linear mixed-effects model, the marginal distribution is again normal.Numerical integration was used to obtain marginal distributions from the tramME model.
For a marginal transformation model ( 2), a conditional logistic mixed-effects transformation model with the same model complexity in terms of parameters for the transformation function and for the shift parameters, and a normal linear mixed-effects model, we derived the marginal distribution conditional on x = (.5, .5, .5)for 100 simulation iterations and present the difference F (y | x) − F (y | x) of the true and estimated marginal distribution functions for all three procedures in Figure 4.The normal linear mixed-effects model (LMM) lead to biased marginal distributions, simply because the model is not able to adapt to the skewness of the marginal distributions.The results for the marginal (mtram) and conditional (tramME) transformation models were surprisingly similar, especially for smaller values of γ 1 .For γ 1 = 0 and thus independence measurements, results are expected to be identical.For γ 1 = 3, and thus very large correlations among the five repeated measurements, the estimated marginal distribution functions obtained from tramME seemed to be slightly more biased than the marginal distribution functions obtained from the marginal transformation model.
This impression is also supported in Figure 5, where the integrated mean-squared error of the • f (y | x)dy is presented for the conditional logistic mixed-effects transformation model (tramME) and the marginal transformation model (2).For γ 1 < 2, the two procedures performed very similar, for larger correlations the misspecified tramME model exhibited slightly larger descrepancies between true and estimated marginal distribution function.Of course, it is not possible to derive marginal effects and corresponding confidence intervals from such numerically obtained marginal distributions.

Discussion
There is a difference between a marginal and a marginally interpretable model.A marginal model, for example defined by generalised estimation equations (Zeger et al. 1988), does not specify the joint distribution.A marginally interpretable model is a model for the joint or conditional (given random effects) distribution from which one can infer the marginal distribution (Lee and Nelder 2004).The models proposed here follow the latter approach with the important distinctive feature that very simple expressions for the marginal distribution function are available.Thus, there is no need to apply numerical integration to the joint or conditional model formulation.In our view, model (mtram) is especially attractive because it allows the interpretation of scaled regression coefficients as marginal effects acting on the marginal predictive distribution in terms of a log-odds ratio or a log-hazard ratio, for example.The Gaussian copula approach for obtaining marginally interpretable models has gained some interest in the last years (Zhang et al. 2021;Masarotto and Varin 2012), however, the simple framework of transformation models allows estimation for a wide range of responses without encountering computational burdens or challenges that other methods typically do.Naturally, the questions arises which model is to be preferred: a marginal, a conditional, or a marginally interpretable one?In this case, the "right" model is not the model which most closely reflects the data generating process, which is usually unknown, but rather the model that allows the user to answer the research question at hand by interpreting the estimated parameters, as McGee and Stringer (2022)  is that besides allowing for interpretation of the fixed-effects on a marginal level, they also yield valid models for the whole marginal distribution (1) of the response given the covariates.An advantage of marginalised multilevel models (Heagerty and Zeger 2000) over marginal transformation models is that the former models are parameterised in terms of marginal effects of interest, whereas effect shrinkage is part of the latter models.The distribution-free nature, general applicability to all types of responses, and the relative computational simplicity are, in our opinion, attractive features of transformation models compared to marginalised multilevel models.
The models and estimation procedures introduced here are limited by some practical and some conceptual constraints.Response-varying regression coefficients β(y) define distribution regression models (Foresi and Peracchi 1995;Chernozhukov et al. 2013) where corresponding mixed-effects models have been presented recently (Garcia et al. 2019).This would be relatively straightforward to implement in the framework presented here, in fact, stratification in Weibull models was parameterised in a similar way.A mix of continuous and censored observations within one cluster would require to compute the likelihood by partial integration over an N i -dimensional normal, this is currently not implemented.On a more conceptual level, it seems impossible to implement multilevel models for discrete or censored responses, because the likelihood ( 7) is only defined for contributions by independent clusters.

B. Likelihood and Score Function: Continuous Case
The joint probability of y i ∈ R N i is given by: To simplify the notation, we define z() as in (5): We can derive the corresponding joint density for an arbitrary F : The resulting log-likelihood contribution (7) for the ith observation is given by: The score function for all model parameters ϑ, β, and γ can be derived based on the results of Stroup (2012) as applied to normal linear mixed-effects models by Wang and Merkle (2018).
With the M = R(R+1) /2 unique elements γ = (γ 1 , . . ., γ M ) of the lower Cholesky factor Λ(γ) we get The derivative of Λ(γ) with respect to an element γ m of γ is a matrix of zeros with the exception of a single one at the position of γ m .Moreover, we compute: A(y i ) + A (y i ).

Figure 2 :
Figure 2: Sleep deprivation.Marginal distribution of reaction times, separately for each day of study participation.The grey step-function corresponds to the empirical cumulative distribution function, the blue line to the marginal cumulative distribution of the normal linear mixed-effects model (8), estimated by the lmer function from package lme4(Bates et al. 2015), the solid yellowish line to the probit transformation model (9), and the dotted yellowish line to the logistic marginal transformation model.

Figure 3 :
Figure 3: Neck pain.Pain trajectories of 90 subjects under active treatment or placebo evaluated at baseline, after 7 and 12 weeks (top) and marginal distribution functions of neck pain at the three different time points (bottom).These results were obtained from model (mtram) using F = logit −1 and a polynomial in Bernstein form h(y) on the unit interval.

Figure 4 :
Figure 4: Difference between the true and estimated marginal distribution functions for a normal linear mixed-effects model (LMM), a conditional logistic mixed-effects transformation model (tramME) and a marginal transformation model (mtram).For mixed-effects models, the marginal distribution function was computed by integrating out the random effects (analytically for LMM and numerically for tramME).

Figure 5 :
Figure 5: Integrated mean-squared error between the true marginal distribution function and the estimated marginal distribution function for a conditional logistic mixed-effects transformation model (tramME) and a marginal transformation model (mtram).

Table 2 :
Toe nail data.Binary probit models featuring fixed intercepts α, treatment effects β 1 , time effects β 2 , and time-treatment interactions β 3 are compared.Random intercept (RI) Manuguerra and Heller (2010)a randomised two-arm clinical trial comparing a novel neck pain treatment to placebo.Neck pain levels of 90 subjects were assessed at baseline, after 7, and after 12 weeks (complete trajectories are available for 84 subjects) on a visual analog scale.Manuguerra and Heller (2010)proposed a mixed-effects model for such a bounded response.

Table 3 :
Simulations.MSE, widths and coverages of 95% confidence intervals for three marginal effects.For dichotomised binary responses, results obtained from GEEs can be directly compared to results from marginal transformation models (first two blocks).The last block reports results of marginal transformation models fitted to continuous responses.