Monte Carlo modified profile likelihood in models for clustered data

The main focus of the analysts who deal with clustered data is usually not on the clustering variables, and hence the group-specific parameters are treated as nuisance. If a fixed effects formulation is preferred and the total number of clusters is large relative to the single-group sizes, classical frequentist techniques relying on the profile likelihood are often misleading. The use of alternative tools, such as modifications to the profile likelihood or integrated likelihoods, for making accurate inference on a parameter of interest can be complicated by the presence of nonstandard modelling and/or sampling assumptions. We show here how to employ Monte Carlo simulation in order to approximate the modified profile likelihood in some of these unconventional frameworks. The proposed solution is widely applicable and is shown to retain the usual properties of the modified profile likelihood. The approach is examined in two instances particularly relevant in applications, i.e. missing-data models and survival models with unspecified censoring distribution. The effectiveness of the proposed solution is validated via simulation studies and two clinical trial applications.


Introduction
Clustered data, either cross-sectional or longitudinal observations which may be arranged in groups, are nowadays encountered in all applied areas. Their main characteristic is the unobserved heterogeneity across clusters, with the consequence that units within a cluster might be correlated. How to deal with such correlation depends strongly on the purpose of the study.
When the interest is on group-specific effects and also on estimation of the intra-cluster correlation, a frequent practice is to assume a random effects model. In this setting, cluster-specific covariate effects depend on unobservable latent variables, named random effects. Such a modelling strategy leads to the so-called conditional approach. Alternatively, if the interest centers on comparing the response variable of units across groups, it is preferable to adopt so-called marginal models, where the clustering structure is ignored for estimation of the regression coefficients, and it is only employed to ensure correct inference on the standard errors. These specifications are usually estimated using generalized estimating equations (Liang and Zeger, 1986). Generally, interpretation of the regression coefficients in conditional and marginal models is different. Therefore the corresponding estimates are not directly comparable (see, for instance, Lee andNelder, 2004, andAgresti, 2015, Chapter 9). In this paper, we will focus mainly on the conditional approach.
Random effects models require to assume some suitable underlying distribution for the random effects, and even their incorrelation with the covariates in the model (Lancaster, 2000). The latter unrealistic hypothesis frequently drives the decision to opt for a more flexible fixed effects approach, a choice particularly popular in the econometric literature. Fixed effects models capture the heterogeneity among clusters via the inclusion of nuisance parameters, one for every group. These are also referred to as incidental parameters (Lancaster, 2000), since each of them appears only in the distribution of the observations in one given cluster. Under fixed effects models, as well as under marginal models, inferential results are free from any assumption on the probabilistic distribution regarding the dependence structure within clusters. In addition, if also relevant for the analysis, the fixed effects strategy allows to quantify the heterogeneity across groups by comparing the estimates of the cluster-specific parameters.
The increased robustness of the fixed effects solution over the random effects one in the conditional approach is balanced by the drawback that when clusters have small to moderate size, likelihood inference is prone to suffer from the incidental parameters problem (Neyman and Scott, 1948). Such problem depends on the fact that the bias of the profile score function for the parameter of interest increases along with the dimension of the nuisance component in the model (see, e.g., McCullagh and Tibshirani, 1990), invalidating usual asymptotic results if the number of groups, N , is much larger than the single group size, T . Reliable inference on the parameter of interest needs thus to be carried out via alternative pseudo-likelihoods which are unaffected by this issue.
Exact pseudo-likelihoods leading to extremely accurate conclusions are only available in specific model classes (Severini, 2000, Chapter 8). Correcting the profile likelihood for the presence of incidental parameters represents instead a more general strategy. Among the several adjustments found in the literature, a prominent position is held by the modified profile likelihood (MPL) of Barndorff-Nielsen (1980, 1983. Specifically,  proved its inferential superiority with respect to the ordinary profile likelihood within the (T × N )-asymptotic setting that characterizes clustered data with independent units.
Under the same frequentist paradigm, another possible approach to avoid the incidental parameters problem is the integrated likelihood for the component of interest , where elimination of the fixed effects is achieved by integration in an appropriate parametrization. De Bin et al. (2015) have shown that this function is asymptotically equivalent to the MPL and enjoys analogue properties in the two-index asymptotics for clustered data.
The first formulation of the MPL involves statistical quantities which are easily obtainable only for exponential or group family models (Pace and Salvan, 1997, Chapters 5, 7). Such computational difficulties can be overcome by the approximate MPL owed to Severini (1998) just within a limited range of statistical problems.  use Monte Carlo simulation in order to compute Severini's version of the MPL when its exact calculation, al-though possible in principle, is especially tedious, given the assumed dependence structure of the data. The same complication arises under the nonstationary autoregressive model for normally distributed observations, discussed in (De Bin et al., 2015, Example 4.3) as an example of application for the integrated likelihood. In the Supplementary material, it is possible to see how the MPL can be conveniently computed via Monte Carlo simulation even in that setting.
The aim of this paper is to extend the Monte Carlo approach to situations where the analytical calculation of the MPL is not only tedious, but can also be infeasible due to peculiar modelling and/or sampling hypotheses. We illustrate the potential usefulness of the method in two frameworks highly relevant for applications. In particular, the first considers binary regression with nonignorable missing response, while the second deals with survival data with unspecified censoring distribution. In both cases, although not directly covered by the theory in , the usual good inferential properties of the MPL are empirically confirmed, suggesting the proposed Monte Carlo solution as the default choice in applications.
The structure of the paper is as follows. Section 2 introduces the basic notation and defines the profile and modified profile log-likelihoods in models for clustered data. The procedure for computing the MPL through Monte Carlo approximation is detailed in Section 3, and then its use is illustrated by means of simulations in different nonstandard contexts for grouped observations. In particular, datasets with possibly missing binary response are considered by Section 4, whereas Section 5 is dedicated to survival data with unspecified censoring mechanism. The issues related to the calculation of the MPL are ascribable to the model complexity implied by the incompleteness of the data in the first case, and to the lack of parametric assumptions on the distribution of the censoring times in the second. In both frameworks we also consider an application to clinical trial data that illustrates the practical effectiveness of the method. Main results are summarised and commented in Section 6, which also mentions potential developments of the present work.

Profile and modified profile likelihood
For clustered observations y it subdivided in N groups of sizes T i , suppose the parametric statistical model Y it |X it = x it ∼ p(y it |x it ; ψ, λ i ) , i = 1, . . . , N, t = 1, . . . , T i , which accommodates also dynamic specifications where the index t runs over consecutive time periods and the temporal evolution of the dependent variable is explained by including in the pdimensional vector of covariates x it responses previously recorded in the same cluster (see Section S2 of the Supplementary material). The global parameter is θ = (ψ, λ), where ψ ∈ Ψ ⊆ IR k denotes the component of interest and λ = (λ 1 , . . . , λ N ) ∈ Λ indicates the vector containing the incidental parameters. Note that, here and henceforth, in order to avoid clutter we omit the transpose symbol acting on vectors unless such an omission could result in ambiguity. In the following, the assumptions of balanced groups and scalar nuisance components, i.e. T i = T and dim(λ i ) = 1 for each i = 1, . . . , n, respectively, shall be used without loss of generality, for the sake of notational simplicity only. Under the hypothesis of independent groups, the log-likelihood function about θ can be expressed by with l i (ψ, λ i ) = T t=1 log p(y it |x it ; ψ, λ i ) being the log-likelihood contribution for the ith cluster. Let us define the full maximum likelihood (ML) estimate for model (1) asθ = (ψ,λ) = arg max θ l(θ). Standard likelihood inference on the parameter of interest is typically based on the profile log-likelihood whereλ iψ is the constrained ML estimate of λ i for fixed ψ obtained, under standard regularity conditions, by equating to zero the score and solving for λ i (i = 1, . . . , N ). Givenλ ψ = (λ 1ψ , . . . ,λ N ψ ), the full constrained ML estimate for fixed ψ is denoted byθ ψ = (ψ,λ ψ ). The general expression taken by the logarithmic version of the MPL is where the modification term M (ψ) serves to remedy the effect of replacing the unknown nuisance parameter λ with the estimateλ ψ in the profile likelihood. Such plug-in effect typically translates in bias of the profile score function. The expression of M (ψ) largely corrects this bias, making the MPL much closer to a proper likelihood (McCullagh and Tibshirani, 1990;Diciccio et al., 1996). The independence hypothesis among clusters implies the additive form M (ψ) = N i=1 M i (ψ). By using Severini's formulation of the MPL (Severini, 1998), the ith summand in the modification term equals In (5), j λ i λ i (θ) = −∂ 2 l i (ψ, λ i )/(∂λ i ∂λ i ) is evaluated at the constrained ML estimateθ ψ and I λ i λ i (θ;θ ψ ) is an approximation of a term involving sample space derivatives in the original Barndorff-Nielsen's MPL. In particular, I λ i λ i (θ;θ ψ ) = E θ 0 l λ i (θ 0 )l λ i (θ 1 ) θ 0 =θ,θ 1 =θ ψ indicates the scalar expected value calculated with regard to the full ML estimateθ of the product of partial score functions defined in (3) evaluated at two different points in the parameter space, i.e.θ andθ ψ . In contrast to the original formulation introduced by Barndorff-Nielsen (1980, 1983, Severini's variant of M (ψ) is computable even when the conditional probability density or mass function of y it given x it does not belong to full exponential or composite group families.  gives sufficient conditions under which inferences on ψ conducted via the profile or the modified profile likelihood are adequate when dealing with independent clustered data. In more detail, usual results apply in the (T × N )-asymptotics for quantities based on l P (ψ) if N = o(T ), while it suffices that N = o(T 3 ) to achieve reliable conclusions using the MPL in (4). This explains why the employment of l M (ψ) should be preferred in the presence of highly stratified datasets where the quantity of groups is much larger than the amount of observations per group.

Monte Carlo modified profile likelihood
Analytical computation of (5) is fairly simple in a number of widespread statistical models (see, e.g., Sartori, 2006). However, the expected value I λ i λ i (θ;θ ψ ) cannot be readily obtained in circumstances that demand special assumptions to correctly model the relevant aspects of the phenomenon under study. Sometimes its exact calculation is too cumbersome, sometimes infeasible.
One convenient expedient to compute Severini's MPL even when the necessary expectation is not available in closed form foresees to approximate I λ i λ i (θ;θ ψ ) by the following empirical quantity based on R Monte Carlo replicates: where l r λ i (·) is the score (3) computed on observations y r it of the rth sample (r = 1, . . . , R) randomly generated under the ML fit of model (1), thus setting (ψ, λ) = (ψ,λ). It is worth mentioning that such a strategy only requires to derive the score function l λ i (θ) and to simulate from the assumed distribution of the data, with no additional fitting. Indeed,θ andθ ψ in (6) are the estimates derived from the observed dataset. This makes the approximation far less expensive than a standard bootstrap from a computational standpoint. Moreover, the execution time is not particularly influenced by the value of T and the number of replications R usually does not need to exceed 500 for a reasonably accurate estimation of ψ, as attested by preliminary sensitivity analyses.
The principal quality of this Monte Carlo solution is its broad applicability.  already experimented it, proving its competitiveness with econometric inferential methods in the estimation of dynamic fixed effects models for binary panel data. Here, we propose to adopt and extend the same technique in order to investigate the superiority of l M (ψ) with respect to ML procedures in alternative scenarios. To this end, we will consider regression models for missing binary data (Section 4) and survival models for right-censored data (Section 5), two practically relevant settings in which explicit formulation of (5) is either computationally involving or impossible due to the particular modelling framework. Specifically, in the survival analysis case we will use a fitted semiparametric model for generating the Monte Carlo samples to calculate (6), making inference robust with respect to a possibly misspecified censoring distribution.
For ease of reference, from now on Severini's version of the MPL obtained by making use of Monte Carlo simulation will be called Monte Carlo MPL (MCMPL). The corresponding loglikelihood function is l M * (ψ) = l P (ψ) + M * (ψ), where the modification term takes the form with I * λ i λ i (θ;θ ψ ) defined in (6). Generally, both l P (ψ) and l M * (ψ) are maximized numerically. Of course, the higher the dimension of ψ, the larger the number of iterations for the numerical optimization could be. Nevertheless, for fixed R, the overall computational effort required by l M * (ψ) increases linearly with the number of iterations and, in our experience, hardly becomes too costly with respect to l P (ψ).

Regression models for missing binary data 4.1 Introduction
The lacking registration of some data is the rule rather than the exception in quantitative research analysis. Rubin (1976) developed the first basic classification of data still in use today: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). While the first two categories are associated with an ignorable mechanism of missingness, when data are MNAR the probability of missing observations also depends on values that are unobserved, and thus the supposed model must take into account the missingness process for providing valid results (Little and Rubin, 2002, Section 15.1).
Among the various approaches proposed to deal with the nonignorable incompleteness of the data, selection models and pattern-mixture models play a major role (Fitzmaurice et al., 2008, Chapter 18). Let us consider independent possibly missing clustered observations y it and define the corresponding missingness indicators M it such that M it = 1 if y it is unobserved and M it = 0 otherwise (i = 1, . . . , N, t = 1, . . . , T ). From a likelihood viewpoint, the joint distribution of Y it and M it in some global parametrization ϕ has to be specified. Following the classical formulation of selection models, we shall assume a marginal distribution for Y it depending on the parameter θ and a conditional distribution of M it given Y it = y it depending on γ, so that with ϕ = (θ, γ). Computationally speaking, in moderately complex models for incomplete datasets, maximization of the log-likelihood function incorporating all the available information is often quite an arduous task. Indeed this function, named observed log-likelihood, involves integrals or summations over the distribution of the missing data which can be hardly tractable. It is well-known that the EM algorithm (Dempster et al., 1977) is a possibly advantageous strategy for ML estimation whenever data either are partially not observed or may be viewed as such. This method is pervasive in the literature of missing data, and many extensions to the original version have been posited in the years (Little and Rubin, 2002, Section 8.5). Optimization problems in likelihood inference may also be solved by numerical iterative algorithms different from the EM. For example, we recall the Nelder-Mead simplex method (Nelder and Mead, 1965) applied by  and  in presence of arbitrarily MNAR clustered observations, and the popular Newton-Raphson algorithm employed by both Parzen et al. (2006) and Sinha et al. (2011).
A universally optimal solution to maximize the log-likelihood in studies with incomplete observations is impossible to prescribe. It is yet important to point out that, regardless of the selected technique, nonignorable missing-data models need to be fitted with special care because the available information may be insufficient to estimate all parameters (Ibrahim et al., 2001).

Setup and Monte Carlo modified profile likelihood
We focus here on possibly missing clustered binary observations. Adopting the typical factorization of selection models defined in (7), for independent data y it one can write the marginal probability mass function with F (·) a suitable cumulative distribution function (CDF) and β = (β 1 , . . . , β p ) vector of regression parameters, whereas the conditional model for the missingness indicator introduced in Section 4.1 may be expressed by where ζ it ∈ (0, 1). Notice that, since covariates are considered given and entirely observed, in writing the two distributions we neglect the conditioning on the p-vector x it for succinctness. A general formulation for ζ it is where G(·) is a CDF and γ 1 = (γ 11 , . . . , γ 1p ). The parameter of primary interest in the joint model described by (8)-(10) is the regression coefficient β ∈ IR p , and the incidental parameters are grouped in λ = (λ 1 , . . . , λ N ) ∈ IR N , so that θ = (β, λ) ∈ Θ ⊆ IR p+N . In the binary regression with the indicator of missingness as response, the coefficients are γ = (γ 1 , γ 2 ) ∈ Γ ⊆ IR p+1 , thus the overall parameter is given by ϕ = (θ, γ) ∈ Φ ⊆ IR 2p+N +1 . To simplify reference, let us gather the parameters common to all clusters in one vector and denote it by ψ = (β, γ) ∈ IR 2p+1 . We finally stress that expression (10) does not contemplate the presence of an intercept, either common or cluster-specific, in the model for M it in order to avoid identifiability issues during the fitting phase (see, e.g., the discussion in Parzen et al., 2006, Section 6).
According to the assumption about the missing-data mechanism, it is possible to identify different relations between the probability of missingness and the variables in the study. Such relations, in their turn, translate into constraints on the model parameters (Parzen et al., 2006). Here, since covariates are nonrandom, from specification (10) follows that data can be either MCAR, when γ 2 = 0, or MNAR otherwise (Baker, 1995).
Models like (8) for complete datasets were already investigated in Bellio and Sartori (2003), who showed how to analytically derive Severini's MPL in order to consistently estimate β when N is much larger than T . The presence of missing values, however, creates trouble in the explicit calculation of the adjustment term. The expectation therein should be evaluated with regard to the joint distribution p Y,M (y it , m it ;φ), taking also the missing-data mechanism into account, but the correct way of doing so is not without ambiguity. More specifically, in the light of the arguments made by Kenward and Molenberghs (1998), one expects to be allowed to neglect the missingness process only when data are MCAR.
Consider now the most general MNAR framework and, for the sake of clarity, denote by y obs the vector of observed entries in the dataset y = (y it ) and by y mis the vector of remaining missing elements. As highlighted in (Little and Rubin, 2002, Section 6.2), the actual data consist of y obs and of the vector containing the indicators of missingness, m = (m it ). The observed loglikelihood about ϕ is obtained by summing over all possible values of y mis the joint probability mass function of Y = (Y obs , Y mis ) and M , so that l i (ϕ) = log y mis p Y y obs , y mis ; θ p M |Y m|y obs , y mis ; γ has the global ML estimateφ as maximizer and can be decomposed in the N cluster-specific contributions taking the form . . . , N ). The score function (3) in the global parametrization ϕ equals here where f it = f it (θ) = ∂F (λ i + β T x it )/∂λ i . Then, differentiating one more time with respect to λ i and changing the sign of the obtained derivative lead to The constrained estimateλ iψ which solves the equation l λ i (ϕ) = 0 can be found numerically and its substitution for λ i (i = 1, . . . , N ) in (11) permits to obtain the MNAR profile log-likelihood, l P (ψ) = N i=1 l i P (ψ). Definedφ ψ = (ψ,λ ψ ), the same replacement in formula (13) over the unconditional sampling distribution, using the terminology of Kenward and Molenberghs (1998), is not obvious. Indeed, the joint distribution of (Y it , M it ) was not specified directly, but divided in the two factors (8) and (9). The mean of the product of scores should then be calculated with respect firstly to p M |Y (m it |y it ;γ) and secondly to p Y (y it ;θ), with sufficiently intricate computational steps. Viceversa, the Monte Carlo solution presented in Section 3 may be applied quite plainly. Particularly, the approximation (6) in the MNAR case takes the form where l r λ i (·) is the score (12) of the rth partially observed sample y r it (r = 1, . . . , R) obtained in two stages: first, a complete dataset y r,C it is simulated under model (8) with θ =θ and second, some entries in this dataset are deleted and considered missing according to the specification (9) with MNAR probability ζ it = ζ it (γ) = G(γ T 1 x it +γ 2 y r,C it ). Note thatψ = (θ,γ) is the global maximizer of the MNAR profile log-likelihood l P (ψ) which also takes the missingness process into consideration. Therefore, the average of score products over the R incomplete samples properly estimates the unconditional expectation required.
Before proceeding, it seems worthwhile making a few more comments about the general formula (11). Supposing an ignorable MCAR missing-data mechanism by imposing γ 2 = 0 in (10) yields clearly to ζ 0 it = ζ 1 it = ζ it = G(γ T 1 x it ), and hence (11) simplifies to Since our interest is only on the parameter β, and ζ it does not carry any useful information about it, we can rely on the equivalent function which is the ordinary group-related log-likelihood in binary regression computed only on the recorded data. Indeed, when the missingness mechanism is MCAR, a complete-case analysis discarding units with missing values is unbiased, as the wholly observed cases are basically a random sample from the reference population (Little and Rubin, 2002, Section 3.2). For this specific model, it is also fully efficient because θ and γ are distinct, provided that the full parameter space is Φ = Θ × Γ (Little and Rubin, 2002, p. 120). This means that likelihood inference can be conducted disregarding the process which generates the missing observations.
As a major implication for our study, the expected value involved in Severini's MPL may be derived from the conditional distribution of Y it given M it = 0. Specifically, it can be shown (Bellio and Sartori, 2003) that such expectation has the closed-form expression where estimatesθ = (β,λ) andθ β = (β,λ β ) descend from ordinary ML inference on the parameter of interest β via the MCAR profile log-likelihood l P (β) based on (15). Furthermore, inasmuch as under the hypothesis of ignorable missingness it is possible to use the function l(θ) with components like that in (15), the general Monte Carlo approximation reported in (14) admits to be reformulated in the MCAR case as where l r λ i (θ) = t: y it ∈y obs (y r it − π it )f it /{π it (1 − π it )} is the score of the incomplete sample y r it simulated by the two-step procedure above, but with an important difference: nowθ results from the maximization of l(θ), whileγ =γ 1 is obtained by a separate ML fit of the binary regression based on (10) subject to the constraint γ 2 = 0, with the missingness indicator as dependent variable and the vector of covariates x it as unique predictor. Below, the utility of Monte Carlo approximation in the presence of incomplete data will be evaluated through simulation experiments referring to binary regression with different missingness processes. Specifically, objects of comparison shall be the unadjusted profile log-likelihood, either the MCAR l P (β) or the MNAR l P (ψ), the modification proposed by Severini l M (β) that ignores the missing values and is analytically computed by formula (16), and the MCMPL that accounts for some presumed missingness mechanism. In order to avoid confusion, its logarithmic MCAR variant employing the estimate (17) will be denoted by l M * (β), whereas l M * (ψ) shall indicate the MNAR MCMPL with habitual expectation approximated by (14).

Logistic regression: simulation studies
The following analyses are performed supposing a logistic link between the mean of the response and the predictors, meaning F (·) = logit −1 (·) in model (8), along with G(·) = logit −1 (·) in the expression for the probability of missingness (10), where logit −1 (·) denotes the CDF of the logistic random variable. Pairing these assumptions with that of an MCAR mechanism brings about the equality whose right-hand side does not depend on the parameter of interest. Hence the only part of Severini's modification term relevant to estimating β is log|j λλ θ β |/2, and one can write As the single cluster contribution to the profile log-likelihood l P (β) equals (15) with π it replaced by logit −1 (λ i + β T x it ), it is simple to show that in such a setting the score related to the ith incidental parameter equals thus the expression of the MCAR Monte Carlo estimate I * λ i λ i (θ;θ β ) follows immediately from the previous formula and (17). Loosely speaking, if observations are MCAR, l M (β) takes the same form as in general logistic regression for clustered data with no missing values, yet is computed only on the complete units. Its numerical maximization can then be automatically implemented in the R software (R Core Team, 2017) exploiting the code of the current version of the package panelMPL (Bellio and Sartori, 2015), which can handle binary regression with logit or probit links.
For the reasons discussed above, one analytical formulation of Severini's MPL is not immediately obtainable when missingness of the data is hypothesized to be nonignorable. On the contrary, M * (ψ) can be calculated via Monte Carlo simulation through (14) simply by recalling that in expressions (12) and (13) one has In the MNAR scenario, the functions l P (ψ) and l M * (ψ) are optimized numerically using a quasi-Newton method (in the R function nlminb). Standard errors of the parameters' estimates are calculated using the second numerical derivative of the functions at their maxima. Notice that in the MNAR case the argument ψ = (β, γ) of the objective functions to be optimized has dimension equal to 2p + 1, whereas in the MCAR case the argument β is only p-dimensional. The higher complexity in the maximization problem is reflected by longer execution times and by some numerical instabilities, especially due to the estimation of γ and its variance, which however are not of direct interest.
It is worth recalling that, as is common practice for binary longitudinal regression, the optimization stage needs to be anticipated by the omission of non-informative groups (Bellio and Sartori, 2003) from the sample under analysis. In missing-data situations, whatever the supposed mechanism, the clusters which cannot contribute to estimate β are those with y obs it = 0 or y obs it = 1 for every t = 1, . . . , T and those which are totally unobserved, i.e. where y it = y mis it for each t = 1, . . . , T (i = 1, . . . , N ).
The two principal simulation experiments are recognisable according to the model used to select the missing values in the experimental datasets of dimensions T = 4, 6, 10 and N = 50, 100, 250. In both scenarios, we consider p = 1 and the covariate x it is simulated by means of independent draws from the N (−0.35, 1) distribution, with intercepts λ i (i = 1, . . . , N ) also independently generated as N (−0.35, 1). The global parameters in (8) and (10) for generating the S = 2000 samples with MCAR observations are set equal to β = 1, γ 1 = 2.5 and γ 2 = 0. Instead, simulation of MNAR data is carried out with β = 1, γ 1 = 5, γ 2 = 1. Such settings were chosen in order to observe a percentage of missing observations in the resulting datasets varying between 35% and 40%.
This simulation setup is taken from a conditional model, with a random effects specification. Such a choice also allows the comparison with inference from a marginal model with generalized estimating equations (GEE). Indeed, although the two approaches are not directly comparable, if the random effects model is correctly specified the corresponding coefficient of x it in the marginal model would approximately equal β m = β/ 1 + σ 2 λ /c 2 , where c = 1.7 and σ 2 λ is the variance of the random effects' distribution (Agresti, 2015, Section 9.4.1). Here σ 2 λ = 1, and therefore β m = 0.862. In addition, results for the case where λ i = T t=1 x it /T + u i , with  u i ∼ N (0, 1), are made available in the Supplementary material. In that setting, incidental parameters are correlated with the covariate, thus a random effects model would be wrongly specified and the comparison with GEE unfeasible. Instead, the MCMPL approach guarantees the same qualitative results under both frameworks. Tables 1, 2 and 3 report the performance of the compared inferential functions in respect of bias (B), median bias (MB), root mean squared error (RMSE) and median absolute error Table 2: Inference on β = 1 in the logistic regression for MCAR longitudinal data. The compared methods are the MNAR profile log-likelihood l P (ψ), the MNAR MCMPL l M * (ψ) computed with R = 500 and GEE. Results based on a simulation study with 2000 trials.
where β is the value of the regression coefficient used to simulate the S datasets,β s is its ML estimate on the sth sample (s = 1, . . . , S) and x (s) denotes the sth element in the vector of order statistics (x (1) , . . . , x (S) ). The empirical standard deviation (SD) of the estimates is also  In addition, the ratio SE/SD, where SE stands for the average over simulations of likelihoodbased estimated standard errors, and empirical coverages of 0.95 Wald confidence intervals (CI) for β are shown. Note that the large values of N examined here ensure adequacy of the quadratic approximation around the maximum of the various log-likelihoods, hence the generally more accurate coverages derived by inversion of the likelihood ratio statistic would be practically identical.
Behaviours of the likelihoods built under the correct MCAR hypothesis are shown in Table  1. The latter attests the inadequacy of inference on β deriving from the profile likelihood in this incidental parameters setting. The introduction of the modification term, either explicitly calculated or approximated by Monte Carlo simulation with R = 500, conspicuously refines the point estimation and actual coverage of confidence intervals. In particular, as happens for complete stratified data, the bias of the ML estimator is of order O(1/T ) regardless of the value of N , as opposed to that of its modified version which is of order O(1/T 2 ). On the contrary, for fixed T , confidence intervals become less precise as N increases, since standard deviations get smaller. The most important evidence supplied here by Table 1 is the absence of the need to take the MCAR mechanism into consideration when computing the MPL. Indeed, the performance of l M (β) is essentially identical to that of l M * (β) for all the sample sizes considered. This finding confirms what argued by Kenward and Molenberghs (1998).
Inference on the same MCAR datasets can also be made via l P (ψ) and l M * (ψ), which assume a general nonignorable model of missingness. Moreover, GEE provides a further alternative for inference, given that its consistency is guaranteed under MAR, and therefore MCAR, mechanisms (Agresti, 2015, Section 9.6.4). Experimental outcomes of such analysis are presented in Table 2. Despite some undercoverage of Wald intervals when T = 4, the global accuracy of the MNAR MCMPL is considerable and definitely higher than that of the corresponding unmodified profile likelihood. The latter proves to be more reliable than its MCAR counterpart in Table 1, while l M * (ψ) is generally superior in terms of bias but inferior in terms of coverage to l M (β) and l M * (β), which efficiently avoid unnecessary estimation of the missingness parameters. However, as will be seen in Table 3, l M * (ψ) balances this loss of efficiency with its robustness to the underlying missingness mechanism. Fit of the model via GEE was implemented through the gee library in R Carey et al. (2015), specifying either an independence, exchangeable or unstructured within-cluster correlation. The quasi-likelihood approach seems to work very well under the MCAR assumption. It is yet important to bear in mind that MPL and GEE are estimating two distinct models here (conditional and marginal, respectively, with different true parameter value). Note that Table 2 reports the most favorable results for GEE, obtained assuming independence of observations and using non-robust standard errors of the estimates. Table 3 refers to the second experiment based on datasets generated with MNAR observations. Classical inference through the MNAR profile log-likelihood is found imprecise, as expected. The most interesting simulation outcome concerns the pattern of inferential results reached by the two versions of the MPL considered. Indeed, for any given number of clusters, as T increases the accuracy of l M (β) deteriorates both in terms of bias and of confidence intervals' coverage, whereas that of l M * (ψ) improves. The fact that the MPL by Severini leads to worse results for large T may seem counterintuitive at first. In fact, this makes sense since incompleteness of the data is more perceived in larger groups and thus the harmful impact of the wrong MCAR assumption reveals itself as T grows. Nevertheless, apart from some numerical instabilities that may occur occasionally when T is small (mainly with T = 4), the MCMPL ensures better inference on β than its analytical version based on the wrong model. In this setting, l M * (β) is still found equivalent to l M (β) and therefore is not shown in the table. Finally, the GEE method based on independent observations and non-robust standard errors proves, as expected, to be inconsistent when data are MNAR (Agresti, 2015, Section 9.6.4), suffering from severe bias in all simulation setups, with accuracy getting worse as the number of clusters N Table 4: Estimates and related standard errors (in parenthesis) in the logistic regression for the toenail data with missing response. The methods considered are MCAR profile loglikelihood l P (β), Severini's exact MCAR MPL l M (β), MCAR MCMPL l M * (β), MNAR profile log-likelihood l P (ψ) and MNAR MCMPL l M * (ψ), computed with R = 500. In outline, the Monte Carlo strategy is particularly convenient in this missing-data scenario. It allows indeed to easily calculate the MNAR MCMPL which appears robust to the missingness mechanism, where the price to pay for this robustness is only a minor loss in efficiency. As proved by Section S3 of the Supplementary material, the same consideration can be made when a probit link function is used in model (8).

Application to a toenail infection study
The solution detailed in Section 4.2 can be applied to the toenail data, carefully described in Molenberghs and Verbeke (2005, Section 2.3). This dataset was collected upon a two-armed clinical trial in N = 294 patients treated for toenail infection and followed-up at T = 7 time occasions. The outcome variable codes whether the infection was severe (y it = 1) or not (y it = 0), for t = 1, . . . , T . Each patient is here identified as a cluster, containing 7 observations at the different follow-up visits. This is a clear example of situations where the number of clusters (N = 294) is much higher than the cluster size (T = 7). Two covariates were also recorded: number of months from the first visit and the oral treatment, A or B, used. Note that only the former is a time-varying covariate, as every subject was randomly assigned to only one treatment for the whole duration of the study. The main interest was to understand how the percentage of severe infections evolved over time and if such evolution was affected by the treatment. Due to a variety of reasons, the response is missing at some time points for several subjects. The percentage of missing response in the sample is 7.29%. Although this value is lower than those considered in the simulation studies, the results below indicate a perceivable difference between inference based on the various approaches.
Model (8) is fitted to the data based on the assumption of the logistic link where x 1it ∈ {0, 1, 2, 3, 6, 9, 12} measures the time in months from the first visit of the ith patient and x 2it is the interaction term between time and treatment. We remark that the treatment is not included in the regression because the presence of the fixed effects prevents those coefficients referred to covariates with no within-cluster variation from being identified. According to the mechanism of missingness supposed in (9), inference about β = (β 1 , β 2 ) can be conducted by either MCAR or MNAR methods. Results are partially shown in Table 4. Notice that the output obtained by using the MCAR MCMPL l M * (β) is not reported, being basically indistinguishable from that of l M (β). Whatever the hypothesis about the missingness process, both standard and modified profile likelihood functions detect a strongly significant decrease over time in the percentage of severe infections among subjects who receive oral treatment A, with a smaller estimated effect given by the MPL. For what concerns β 2 , which represents the difference in the evolution of infection between the two treatment arms, the conclusion is less clear. This is in line with previous analyses neglecting the missing data problem (Molenberghs and Verbeke, 2005, Section 10.3). Contrary to the ML fits, the use of the MPL suggests no effect of oral treatment B in improving the recovery process with respect to treatment A, at a 5% significance level. However, the p-value is well below 0.1 if the data are assumed MNAR. This last hypothesis seems indeed quite realistic, as the probability of missing a visit for one patient is likely to depend on his current toenail infection status. The estimate of the parameter associated with the response y it in the specification (9)-(10), γ 2 , equals −∞ for both l P (ψ) and l M * (ψ). This indicates the occurrence of separation in the data available for estimating the MNAR probability of missingness. Particularly, one obvious interpretation is that a patient with a severe toenail infection is much more motivated to undergo the scheduled visit than a patient who has healed. We note however that separation in the estimation of γ 2 does not affect the estimates of the remaining parameters, as usually happens in binary regression without missing data. These considerations, along with the robustness confirmed by the simulations of Section 4.3, further make the MNAR MCMPL the most reliable inferential tool in this example.
5 Survival models for right-censored data

Setup and background
Let independent clustered failure timesỹ it ≥ 0 be realizations of the random variables Y it such that where Since observations may be right-censored, the sample actually consists of realizations of the pair Y it , ∆ it , where Y it = min Y it , C it with C it random censoring time and ∆ it event indicator being ∆ it = 1 if Y it ≤ C it and ∆ it = 0 otherwise. The censoring mechanism is hypothesized to be independent and non-informative, meaning that each C it is unrelated to the other survival or censoring times and its continuous distribution does not depend on θ. We also suppose that the density of C it is the same in all N groups.
Under this scenario, inferential solutions to the incidental parameters problem also need to cope with the presence of censored data. In the past years, the application of the MPL has been experimented only to a limited extent because its computation is not straightforward in regression frameworks like (19) with general censoring scheme. The technique proposed by Pierce and Bellio (2006) to overcome such complications relies on Monte Carlo simulations as well, but targets fully parametric settings where the distribution of censoring is completely defined. Pierce and Bellio (2015) considered also higher-order asymptotics for semiparametric Cox regression. In that case, an adjustment to the likelihood ratio statistic was obtained either by implementation of a parametric bootstrap employing a reference censoring mechanism or by simulation. However, not only their proposal considers inference on scalar parameters of interest, but also it does not usually improve on the partial likelihood.
Model (19) can be viewed as an extension of the scenarios on which Cortese and Sartori (2016) focused. Therein, the use of Severini's frequentist integrated likelihood for estimating ψ was found to be superior to random effects models with seriously misspecified frailty distribution. However, the computational effort implied by their approach is remarkably sensitive to the number of predictors in the study, and indeed they only consider cases with p = 0 or p = 1. In addition, the authors specify some parametric distribution of C it , whereas here we prefer to avoid such a restriction which might affect the inferential results. On the one hand, our choice relaxes the assumptions of the analysis, but on the other, it prevents the term (5) in Severini's MPL from being exactly calculated. In what follows, the Monte Carlo strategy presented in Section 3 will be shown general and flexible enough to tackle this difficulty.

Monte Carlo modified profile likelihood
Consider the observed couple y it , δ it introduced in the previous section. If the censoring times c it are independent realizations of a continuous random variable with generic distribution , then data are drawn from the joint density where ϕ = (θ, ς) and, in the interests of conciseness, dependence on covariates is omitted. The distribution of C it is independent of the parameter θ and does not vary across clusters, thus the log-likelihood function about θ based on the whole dataset y it , δ it (i = 1, . . . , N, t = 1, . . . , T ) can be formulated by Starting from the previous expression, the profile log-likelihood for ψ and the score in this setting can be derived following the general definitions in (2) and (3).
The first quantity to be computed in (5), j λ i λ i (θ ψ ), is typically obtainable with ease even with right-censored data. On the contrary, exact calculation of the expected value in (5) should be carried out with reference to a fully specified model, i.e. the joint probability density function (20) comprising also the distribution of the censoring times. However, in order to avoid unnecessary assumptions, we will not specify a parametric form for p C it (c it ; ς) and S C it (c it ; ς), as instead done by Cortese and . Indeed, such an assumption can be avoided for calculating the MPL via the Monte Carlo solution reported in Section 3, because estimation of the censoring distribution can be implemented nonparametrically, making the resulting approximation more robust.
With an unspecified density of C it , it is still possible to simulate the Monte Carlo samples (y r it , δ r it ) (r = 1, . . . , R) on which (6) is based. Censoring times are not available for units with an observed failure, but they can be simulated by bootstrap techniques. The procedure is explained in the sequel. First, failure timesỹ r it are generated from the ML fit of model (19). Second, new censoring times c r it are determined by performing the conditional bootstrap described in Algorithm 3.1 of Davison and Hinkley (1997, p. 85). In particular, if the original indicator δ it equals zero we set c r it = c it , otherwise we draw c r it from the conditional distribution of C it |C it > y it computed as where S C it (·) is the Kaplan-Meier nonparametric estimator of the survival function of C it . Precisely, each c r it corresponding to δ it = 1 is found as the unique solution c to the equation , with u r it ∼ U (0, 1). Finally, for i = 1, . . . , N and t = 1, . . . , T , the observed survival times are y r it = min(ỹ r it , c r it ) and hence the new event indicators are defined as δ r it = 1 ifỹ r it ≤ c r it and δ r it = 0 ifỹ r it > c r it .

Weibull model
As an illustration, assume now the Weibull distribution for the random variables Y it . Consequently, in model (19) the probability density function can be expressed as where η it = e −(λ i +β T x it ) controls the scale of the distribution. The interest is on estimating the common shape parameter ξ > 0 and the regression coefficients in β = (β 1 , . . . , β p ) ∈ IR p , while treating the vector of group-related intercepts λ = (λ 1 , . . . , λ N ) ∈ IR N as nuisance. We shall is the baseline hazard parametrically modeled and shared by all groups, whereas h 0i (ỹ it ; ξ, λ i ) = h 0 (ỹ it ; ξ)e −ξλ i can be seen as the equivalent for the ith cluster (i = 1, . . . , N ). Thus (22) is a stratified proportional hazards model, and its logarithmic transformation coincides with a so-called accelerated failure time model (see, for instance, Cortese and Sartori, 2016, Section 6).
Denoting the number of failures recorded in the ith group by δ i· = T t=1 δ it (i = 1, . . . , N ) allows to write (21) under the Weibull model as Furthermore, the score in formula (3) equals and the relating cluster-specific constrained ML estimate is explicitly found aŝ The profile log-likelihood function for ψ presented in (2) has the expression and its maximizerψ = (ξ,β) can be obtained numerically.
For what concerns the computation of the MPL, changing sign to the derivative of (25) with regard to λ i gives In the second summand of (5), the usual expectation can be estimated via Monte Carlo by whereη it = exp −(λ i +β T x it ) , δ r i· = T t=1 δ r it and (y r it , δ r it ) (r = 1, . . . , R) are the simulated datasets generated via the procedure described at the end of Section 5.2.
The simulation results in the following section will shed light on the possibility to solve the incidental parameters problem using the MPL under the Weibull model for clustered time-toevent data with unspecified censoring distribution. Specifically, the studies will examine on a comparative basis the profile log-likelihood l P (ψ) in (27) and the MCMPL l M * (ψ) depending on the approximation (28). A comparison with a stratified Cox regression, which models nonparametrically h 0i (ỹ it ; ξ, λ i ) in (23), will also be considered.

Simulation studies
Two experiments of S = 2000 simulations are conducted to study inference on ψ in the Weibull model for right-censored observations presented in Section 5.3. The within-group size and the number of clusters in the artificial samples are set equal to T = 4, 6, 10 and N = 50, 100, 250, respectively. The regression model includes p = 2 covariates. The first, x 1it , in each ith group (i = 1, . . . , N ) is obtained by imposing x 1it = 0 for t = 1, . . . , T /2 and x 1it = 1 for t = T /2 + 1, . . . , T . The second, x 2it , is drawn from the standard normal distribution. The common shape parameter is chosen as ξ = 1.5 and the vector of regression coefficients as β = (−1, 1), while each cluster-related intercept is independently sampled as λ i ∼ N (0.5, 0.5 2 ). Failuresỹ it are simulated via the Weibull density function (22). The censoring times c it can be obtained by random generation from the distribution Exp(ς), where the parameter is selected in such a way as to control the overall proportion P c of censored data. In detail, given the quantities above and for a certain P c , ς is fixed to the value solving the equation where ̺ = (θ, ς) and p C it (y; ς) = ςe −ςy . Then, in each of the S fictitious datasets, observations y it , δ it stem from the usual definitions of censored failures and event indicators, i.e. y it = min(ỹ it , c it ) and δ it = 1 whenỹ it ≤ c it , otherwise δ it = 0 (i = 1, . . . , N, t = 1, . . . , T ). The first series of simulations considers data with censoring probability P c = 0.2, the second relates to situations with higher proportion of censored observations, namely P c = 0.4. Inferences from the profile likelihood and from the MCMPL on ψ are investigated as done in Section 4.3. Notice that, before proceeding to maximize the two functions for every simulated dataset, noninformative clusters with only censored failure times must be discarded from the study. Indeed, (26) shows thatλ iψ is not finite if δ i· = 0 and hence the ith group does not make any contribution to estimating ψ (i = 1, . . . , N ). Both estimatesψ andψ M * = ξ M * ,β M * are found by joint numerical optimization of l P (ψ) and l M * (ψ), respectively.
It is well-known that the MPL can lead to both a location and a curvature adjustment of the profile likelihood. These imply, respectively, a correction of the bias and of the standard errors of the corresponding estimates. Typically, both effects are present. But there are instances in which only the curvature adjustment is needed for some components of ψ (Bellio and Sartori, 2006, Section 3.3). This is the case in the current example. Indeed, the presence of many nuisance parameters does not imply a bias in the estimation of the regression coefficients. For Table 5: Inference on ξ = 1.5 in the stratified Weibull regression model for right-censored survival data and probability of censoring P c = 0.2. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500. Results based on a simulation study with 2000 trials.  Tables 5 and 6. The accuracy of l M * (ψ) is extremely good for all unknown quantities and diverse dimensions of the data, yet inferential conclusions on ξ drawn via l P (ψ) are found quite misguided. Table  5 testifies how the Monte Carlo modification is capable not only of greatly reducing the severe empirical bias of the ML estimator, but also of correcting the excessively low actual Wald coverages derived by the profile likelihood. Indeed, these can also be ascribed to the supplied standard errors ofξ, prominently downward biased for smaller T , independently of N . Estimated variability ofξ M * is, conversely, much more trustworthy.
Performances of the two inferential tools under examination in the second simulation study are summarized by Table 6, for what concerns the shape parameter. The reported indexes prove the convenience of l M * (ψ) even when a greater amount of data is subject to censoring. When P c = 0.4 the empirical bias ofξ M * remains systematically lower than that ofξ, reaching negligible values when T and N increase. In contrast, the imprecise point estimation provided by l P (ψ) is especially critical when the within-group size is smaller and stays basically constant as N grows, coherently with the existing theoretical knowledge for models without censoring . All the empirical coverage probabilities based on the MCMPL are very close to the nominal level, while those based on the profile likelihood are well below it, even for the aforementioned unreliable estimated standard errors ofξ.
Both l P (ψ) and l M * (ψ) are invariant under reparametrizations. Hence we can also consider inference on the relative risks referred to the two covariates, which are typically the measures of main interest in survival analysis. Under the Weibull model (22), such quantities are expressed Table 6: Inference on ξ = 1.5 in the stratified Weibull regression model for right-censored survival data and probability of censoring P c = 0.4. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500. Results based on a simulation study with 2000 trials. by RR 1 = e −ξβ 1 = e 1.5 and RR 2 = e −ξβ 2 = e −1.5 . Alternatively, these relative risks can be estimated by fitting a stratified Cox proportional hazards regression, where a separate baseline hazard function is supposed for each group. The function coxph in the R package survival (Therneau, 2015) performs such analysis. In Tables 7 and 9 for RR 1 and Tables 8 and 10 for RR 2 , we compare results from the fit of the Weibull regression via the profile likelihood and via MCMPL with those obtained assuming the semiparametric survival model. Reported 0.95 Wald coverages related with the Cox specification descend from the confidence intervals for the relative risks returned by summary.coxph. The profile likelihood under the Weibull model performs very poorly in estimating the relative risks, as a result of the imprecise ML inference provided on the shape parameter ξ. On the contrary, l M * (ψ) proves to be extremely accurate in terms of both point and interval estimation. Empirical coverages derived through the partial likelihood of Cox are generally the closest to the nominal level, however this is due to the larger variability of the obtained estimates with respect to those descending from the MCMPL. Indeed, the latter exhibits the lowest RMSE and implies a gain in efficiency over the semiparametric approach, which in turn is more robust. A thorough comparison between the outcomes of the two experiments above may be helpful to check whether and how the incidence of censored data in the sample affects the accuracy of the statistical techniques employed. Particularly, l P (ψ) appears to suffer more than l M * (ψ) from a high censoring probability. Indeed, in making inference on ξ via the profile likelihood, only the coverages when N = 50 are slightly more adequate with P c = 0.4. The same pattern is observed with regard to the estimated relative risks. On the contrary, conclusions descending from the MPL and the partial likelihood for the Cox model look less impacted by the percentage of observations subject to censoring.
The empirical findings in this example are substantially in accordance with those relating to   (2016). Nonetheless, there exist three important motivations to prefer the MCMPL approach illustrated in Section 5.2. Firstly, it is far less computationally expensive, as the effort entailed by the numerical integration to calculate Severini's integrated likelihood in the regression setting is considerable. Secondly, its basic procedure easily lends itself to encompass the bootstrap for nonparametric estimation of the censoring mechanism, protecting against misspecification risks. And thirdly, it can handle different distributions of the failure times Y it , such as logNormal or Gamma, whereas the method of Cortese and Sartori (2016) demands to derive ad hoc formulae for finding a suitable reparametrization of the model .

Application to an HIV clinical trial
We now employ the proposed methodology in the analysis of a dataset from one clinical trial conducted to compare the time to death under two different treatments for Mycobacterium  Data are the observed survival times and the corresponding event indicators, i.e. the realizations (y it , δ it ) from the pairs (Y it , ∆ it ), recorded along with the treatment used (x it = 0, 1, respectively, in treatment groups "Tx 1" and "Tx 2"), for a total of 69 patients enrolled by N = 11 different medical centers. While such a number of clinics (i.e., groups) is small compared with the simulation settings of the previous section, the interesting aspect here is the relatively moderate amount of patients (i.e., cluster sizes T i , i = 1, . . . , 11) followed by most of the centers. Moreover, very few or no events of death are observed in each group (only 5 people died in group "Tx 1", and 13 in "Tx 2"), with a global proportion of censored observations equal to 74%. We remark that the simulation-based evidence attested that ordinary inferential techniques especially suffer from a high censoring probability. Fitting the Weibull model (23) via the profile likelihood returns estimatesξ = 1.150 (s.e. 0.236) andβ = −1.012 (s.e. 0.520). Exploiting the invariance property, one can say that the Table 9: Inference on the relative risk RR 1 = e 1.5 in the stratified regression for right-censored survival data and probability of censoring P c = 0.4. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500 under the Weibull model and the partial log-likelihood l Cox under the Cox proportional hazards model. Results based on a simulation study with 2000 trials. The study performed through the MCMPL does not produce the same significant results in support of the first treatment. The estimates areξ M * = 1.051 (s.e. 0.221) andβ M * = −0.981 (s.e. 0.564), implying an hazard ratio equal to e −ξ M * β M * = 2.806. Testing H 0 : β = 0 by means of the MPL ratio statistic leads now to a more dubious conclusion, since the p-value equals 0.049, and such uncertainty is reflected by the 0.95 confidence interval for β obtained by inversion of the same quantity, namely (-2.400, -0.004). A similar interval for ξ is instead (0.672, 1.538). Figure 1 provides a graphical representation of the inferential discrepancy existing between the two contrasted methods, even in this case with a moderate value of N .  Inference on the relative risk can also be made by fitting a stratified Cox proportional hazards regression by means of the R function coxph, introduced in Section 5.4. Specifically, under this model the estimated risk ratio is equal to 1.953 (s.e. 1.054), with corresponding 0.95 confidence interval (0.641, 5.950) returned as output. The analogue Wald confidence regions for the Weibull relative risk based on the profile likelihood and MCMPL, resulting upon application of the Delta method, are instead (2.697, 3.702) and (2.378, 3.234), respectively. Thus, in this specific example, the interval obtained under the semiparametric Cox model is significantly wider than both those obtained assuming a Weibull distribution for the survival times, since there is a substantial price to be paid for the weaker assumption with such a moderate sample size.

Discussion
This work shows how to exploit Monte Carlo simulation for widening the field of application of the MPL ( Barndorff-Nielsen, 1980, 1983. Severini (1998) made a first step in this direction, yet his approximation is still not approachable enough to deal with the today's degree of modelling sophistication for clustered units. Our solution, introduced in Section 3, helps to fill such a gap in accessibility and to solve the incidental parameters problem even when the experimental design imposes quite complex assumptions on the analysis. The suggested procedure is easy, implementable in broad generality and reasonably fast, and for these reasons could be thought of as the default choice in applications. Indeed, an extended version of the package panelMPL is currently under development and will also handle the automatic computation of the MCMPL for the model classes discussed here.
Section 4 addresses issues in inferences on the parameter of interest related to the presence of missing values in binary grouped data. In this case analytical calculation of Severini's MPL is practicable but is not a simple task, while its approximation can be done by means of a simple two-step procedure to simulate the required Monte Carlo samples. Results of simulation studies are presented here for the logistic regression scenario and for the probit regression in the Supplementary material. In the analysis of MCAR observations Monte Carlo simulation is found unnecessary to compute the MPL, as the inferential precision of the MCAR MCMPL appears equivalent to that of the analytical MPL which disregards the missing data. Remarkably, the MNAR MCMPL sets an example of robustness to the ignorable incompleteness of the data. When the true mechanism of missingness is nonignorable, the MNAR MCMPL proves to be generally more accurate than Severini's function, especially if T is not too small, for any N .
Justifications for this outcome are given in Section 4.3. To explore the usefulness of the Monte Carlo strategy when computing the MPL on partially MAR or partially missing always at random data for inference about ψ (Little et al., 2017) could be another possible direction of research.
Clustered survival times subject to right-censoring are discussed in Section 5. Under the Weibull regression model with group-related intercepts, our proposed approximation to the MPL is made necessary by the lack of distributional assumptions on the random censoring mechanism. An explicit calculation of Severini's modification requires full parametric specification of the density for the censoring times, whereas the Monte Carlo strategy enables to estimate it nonparametrically, using a conditional bootstrap (Davison and Hinkley, 1997, Algorithm 3.1). Experimental outcomes examined in Section 5.4 corroborate the theory pertaining to inference in the standard two-index asymptotic setting. Estimation of the parameter of interest via the MCMPL is preferable to that via the profile likelihood in every relevant respect and is not affected by the proportion of censored data in the sample. Finally, the computational burden demanded by existing alternative statistical procedures (Cortese and  is much heavier than that of the solution adopted here. The potential room for future applications is vast, thanks to the generality of the methodology presented. One instance is given by semiparametric regression models where the incidental nuisance parameters are expressed as unknown real-valued functions (He and Severini, 2014). Bellio, R. and N. Sartori (2015). panelMPL: Modified profile likelihood estimation for fixed-effects panel data models. http://ruggerobellio.weebly.com/software.html.
Cortese, G. and N. . Integrated likelihoods in parametric survival models for highly clustered censored data. Kenward, M. G. and G. Molenberghs (1998

S1 Introduction
Additional simulation results concerning possible applications of the methodology presented in Section 3 are available here. Specifically, Section S2 deals with the autoregressive model for nonstationary normally-distributed panel data, where computation of the MPL is quite cumbersome. The probit regression model with missing response is instead considered in Section S3, which complements the analysis for the logistic link shown in Section 4.3 of the paper. For both link functions, results obtained within the fixed effects simulation setup are also reported. Finally, Section S4 contains supplementary outcomes of simulations for the stratified Weibull regression under right censoring, described in Section 5.4.

S2.1 Setup and background
Let us consider the nonstationary version of the first-order autoregressive model for normal response with y 0 = (y 10 , . . . , y N 0 ) vector of unrestricted given initial conditions. The parameter of interest is ψ = (ρ, σ 2 ) ∈ IR × IR + and λ = (λ 1 , . . . , λ N ) ∈ IR N denotes the nuisance component of fixed effects. The lack of stationarity of the stochastic process Y it in each ith group (i = 1, . . . , N ) implies the temporal variation of its mean or its autocovariance function, i.e. the covariance of the response with itself at pairs of time points. As a consequence, the autoregressive parameter ρ is left free to equal or exceed unity and the fixed vector y 0 does not need to meet any specific requirement, so that the log-likelihood function is expressed by conditioning on it. In order to facilitate the presentation, both exogenous covariates and further lagged responses y i,t−l (l > 1) are excluded from the set of model regressors; however, no additional difficulties would be encountered in applying the proposed methodology otherwise. The incidental parameters problem occurring in the analogue stationary AR(1) model has been addressed in the statistical literature several times. Particularly, Cruddas et al. (1989) proved that a marginal likelihood for ψ exists and is equivalent to the MPL introduced by Barndorff-Nielsen (1980, 1983. Also econometricians showed interest in this issue and one latest proposition to improve ML inference in fixed effects dynamic models for stationary panel data is given by the bias-corrected estimator of Dhaene and Jochmans (2016), specially tailored for macroeconomic settings with N = O(T ).
Here, a great deal of attention is paid to the nonstationarity assumption of model (S1). Analytical derivation of I λ i λ i (θ ψ ;θ) in this case would be possible but quite tedious. Instead Monte Carlo approximation dramatically reduces the amount of effort demanded to use Severini's modification. Moreover, we are specifically concerned with datasets where T is much smaller than N , i.e. with situations where l P (ψ) exhibits its worst performance. Estimation of ψ under these conditions was already investigated in the past. For example, inference in autoregressions of order l was thoroughly examined in Dhaene and Jochmans (2014), who obtained an adjusted profile log-likelihood through integration of a recentered score function. For l = 1, their solution is essentially the same as that found in Lancaster (2002, Section 3). From a purely Bayesian perspective, the latter proposed a strategy to integrate out the incidental parameters from the likelihood in order to derive a marginal posterior density with consistent mode for ψ. Such approach is in substance equivalent to that adopted by De Bin et al. (2015) to make inference on ρ via the frequentist integrated likelihood of .

S2.2 Monte Carlo modified profile likelihood
When groups are supposed independent, the log-likelihood of model (S1) conditioned on the initial vector y 0 is and the score function takes the form Solution for λ i to the equation l λ i (θ) = 0 delivers the constrained ML estimate of λ i depending just on the autoregressive parameter whereȳ i = T t=1 y it /T andȳ i,−1 = T −1 t=0 y it /T . The profile log-likelihood is then obtained by replacement of λ i withλ iρ in expression (S2) for each i = 1, . . . , N .
The first part in Severini's modification term is immediately available, since j λ i λ i (θ ψ ) = T /σ 2 . Yet, the derivation of I λ i λ i (θ ψ ;θ) requires more elaboration. The ML estimate of λ i equalŝ Therefore, by adding and subtracting the same quantity ρȳ i,−1 , one getŝ The equation above lets us express the score evaluated at the constrained ML estimate in a convenient way. In particular, we start writing where the second equality holds because we simultaneously sum to and subtract from the bracketed part bothλ i andρy i,t−1 . Now, since manipulating (S5) leads tô by substitution of the latter expression in (S6) it is not hard to obtain Then, the necessary expected value is a linear function of ρ, and specifically with E 1 = Eθ l 2 λ i (θ) and E 2 = Eθ Y i,−1 l λ i (θ) . Note that expectation (S7) is computed with reference to the distribution p(y it |x it ;ψ,λ i ).
Although possible in principle, the analytical calculation of E 1 and E 2 is not straightforward. Conversely, estimating I λ i λ i (θ ψ ;θ) via Monte Carlo simulation represents an easily implementable solution. Based on what illustrated in Section 3, the empirical mean for approximating (S7) becomes here where y r it (i = 1, . . . , N, t = 1, . . . , T ) is generated by model (S1) with (ψ, λ) = (ψ,λ), but the starting vector is kept unchanged, namely y r 0 = y 0 for each r = 1, . . . , R. We acknowledge that, in this specific case, (S7) could be alternatively obtained through analogue Monte Carlo approximations to the expected values E 1 and E 2 , which need to be derived just once because they involveθ only. Nevertheless, for the reasons exposed in Section 3, the general approach detailed by (S8) is not much more costly in terms of computing effort.

S2.3 Computational aspects
The estimateθ can be written in closed form by applying the ordinary least squares method to the linear autoregression with normally distributed errors corresponding to (S1). As a consequence, the ML estimate of σ 2 is expressed bŷ where formulations forρ andλ i follow directly from (S4). On the contrary, maximization of l M * (ψ) for finding the estimateψ M * = (ρ M * ,σ 2 M * ) usually has to be performed by means of numerical algorithms, and estimated standard errors are derived using the second derivative of the function at its maximum. Under this particular scenario, it is more convenient to derivê σ 2 M * by evaluation of the explicit constrained estimatê atρ M * , i.e. the scalar solution to the optimization problem with objective function l ρ M * (ρ) = l M * ρ,σ 2 ρ,M * . Observe that also l P (ψ) can be further profiled in order to get l ρ P (ρ) = l P (ρ,σ 2 ρ ), whereσ 2 ρ takes the form equivalent to (S9), but with estimatesρ andλ i replaced by ρ andλ iρ as in (S3), respectively.
According to expression (S7), for values of the autoregressive parameter beyond a certain threshold depending onρ the expectation I λ i λ i (θ ψ ;θ) is negative and l M (ψ) is not computable, similarly to what happens with the integrated likelihood of De Bin et al. (2015). In its turn, the approximation I * λ i λ i (θ ψ ;θ) can be smaller than or equal to zero for not too large values of ρ. A potentially undefined modification term poses a problem for the numerical optimization of l ρ M * (ρ). In addition, as will emerge from the figures available in Section S2.4, the MCMPL is found to reach its global maximum as ρ → +∞ for any sample size, in accordance with the various functions for inference on ψ studied in Lancaster (2002), Jochmans (2014) andDe Bin et al. (2015), respectively. On such grounds, we choose to maximize l ρ M * (ρ) by performing a one-dimensional search in a real bounded interval Υ through the algorithm implemented by the R function optimize. Specifically, adopting the same notation of Lancaster (2002), Υ = (−ρ l , ρ u ) with ρ l = ρ u = 1.5, since in usual applications the autoregressive parameter is hardly observed to lie outside these extremes. The estimate resulting from the local maximization of l M * (ψ) is then uniquely defined asψ M * = ρ M * ,σ 2 M * , whereρ M * = arg max ρ∈Υ l ρ M * (ρ) andσ 2 M * =σ 2 ρ M * ,M * . A careful discussion about the conditions under which consistency of this local maximizer is achieved is beyond the scope of the present work. We refer to Dhaene and Jochmans (2014) for further details on the topic.

S2.4 Simulation studies and numerical examples
In this section, the accuracy of the MCMPL in drawing inferences on ψ under the nonstationary normal AR(1) model is assessed with regard to that of the standard profile likelihood through a Table S1: Inference on ρ = 0.5 in the nonstationary AR(1) model for panel data. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500. Results based on a simulation study with 2000 trials. The two simulation setups differ only in the true value of the autoregressive parameter used to generate the samples from model (S1): in the first ρ = 0.5, while in the second ρ = 0.9. The conditional variance of the response variable is σ 2 = 1 and the fixed effects are independently drawn from a N (1, 1) distribution, following the example of Lancaster (2002). In every simulated dataset, all N initial observations in the vector y 0 are fixed equal to zero with no loss of generality, as this is equivalent to interpret each y it as y it − y i0 and each λ i as λ i − y i0 (1 − ρ) (i = 1, . . . , N, t = 1, . . . , T ) (Lancaster, 2002). Lastly, the number of Monte Carlo replicates employed to compute l M * (ψ) is R = 500. Inferential results of the first study for ρ and σ 2 are displayed in Tables S1 and S2, respectively, using indexes defined in Section 4.3. Similar comments as in  can be made. In all configurations, no significant differences between bias and median bias of the same estimator are observed, but the improvement determined by using the MCMPL in this sense is remarkable. Consistently with the theory for independent units , the bias does not vary with N but decreases as T increases, whereas the root mean squared error depends on both indexes. Empirical coverage probabilities of 0.95 Wald confidence intervals based on l M * (ψ) are generally accurate for both components of interest, with larger departures from the nominal level occurring when T = 4. Such conspicuous refinements to the poor interval estimation supplied by l P (ψ) mainly stem from bias reduction. Yet some correction in curvature also takes place, being SE/SD for the MCMPL typically closer to one than for the ordinary profile likelihood.
Tables S3 and S4 illustrate results of the simulation experiment run with a true value of ρ approaching the boundaries of the stationary region (−1, 1), particularly ρ = 0.9. Relative Table S2: Inference on σ 2 = 1 in the nonstationary AR(1) model for panel data with ρ = 0.5. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500. Results based on a simulation study with 2000 trials. Perhaps here the general improvements originating from the employment of l M * (ψ) are somewhat milder than when the autoregressive parameter is farther away from nonstationariety. This observation can be referred both to bias and, mostly, to empirical coverages of confidence intervals for ρ. Nonetheless, the quality of MPL-based inference remains unquestionably higher than that reached through standard ML techniques. Figure S1 shows l ρ P (ρ) and l ρ M * (ρ), as defined in Section S2.3, in their relative version. Specifically, the plots are referred to samples generated from model (S1) with ρ = 0.5, 0.9, T = 4 and N = 250, 1000. All panels confirm the results of simulations discussed so far. The maximum of the profile log-likelihood is significantly smaller than the true value of the autoregressive parameter, corresponding to the vertical line. Because of this and the accentuated curvature of l ρ P (ρ), such value never belongs to the 0.95 confidence region defined by inversion of the profile likelihood ratio statistic and marked by the horizontal line. Conversely, the local maximization of the MCMPL yields to adequate point and interval estimations of ρ. The unusual trend of l ρ M * (ρ), whose global maximizer lies at infinity, was already anticipated in Section S2.3. The absence of restrictions on the initial conditions in y 0 causes in fact the MCMPL to be re-increasing, sometimes already in the stationary parameter region (Dhaene and Jochmans, 2014).

S3 Probit regression model with missing response
Suppose now that specifications (8)-(10) in Section 4.2 of the paper hold with F (·) = Φ(·) and G(·) = logit −1 (·), where Φ(·) is the CDF of the standard normal random variable. Even Table S3: Inference on ρ = 0.9 in the nonstationary AR(1) model for panel data. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500. Results based on a simulation study with 2000 trials.
Under these hypotheses, the ith score function may be expressed as and j λ i λ i (θ) is readily derived by changing sign to its first derivative with respect to λ i . Using (S10) and (S11), it is possible to obtain both l M (β) in closed form and l M * (β) as described in (17). In the present probit framework, the formula of the standard profile log-likelihood l P (β) follows in fact from (15) with π it = Φ(λ i + β T x it ). When we conjecture that incompleteness of the data originates from a nonignorable process, Monte Carlo simulation serves to approximate the unconditional expected value I λ i λ i (φ;φ ψ ), whose exact formulation is hardly retrievable. The expression of l M * (ψ) in the probit setting may be obtained by multiple substitution of Φ(λ i + β T x it ) for π it , φ(λ i + β T x it ) for f it and −(λ i + β T x it )φ(λ i + β T x it ) for f ′ it in equations (11)-(14) of the paper.  The numerical optimization methods employed for the various functions correspond to those of the logistic case. We also recall that exclusion of the non-informative clusters by the dataset must take place prior to the fitting phase.
The basic structure of the studies performed in Section 4.3 is held unchanged, yet here a probit link between the response and the unique predictor is considered. In the first experiment, missing observations are chosen according to an MCAR mechanism with γ 1 = 2.5; in the second, the true missingness generation process is MNAR with γ 1 = 5 and γ 2 = 1. The covariate is again simulated from the N (−0.35, 1) distribution. Exploiting the well-known relation between the logistic and normal distributions (Amemiya, 1981), in order to obtain data and quantity of informative groups comparable to the logistic setting, the complete simulated samples are generated by fixing β = 1/1.6 = 0.625 and intercepts λ i (i = 1, . . . , N ) are independently generated as N (−0.22, 0.39) The corresponding value of the parameter for x it under a marginal model with generalized estimating equations (GEE) is β m = β/ 1 + σ 2 λ , where σ 2 λ is the variance of the random effects' distribution (Agresti, 2015, Section 9.4.1). Here σ 2 λ = 0.39, and therefore β m = 0.625/ √ 1.39 = 0.53.
Tables S5 and S6 summarize the usual measures of estimation accuracy for β based on S = 2000 simulations of the study regarding MCAR data. Relative behaviours of the three MCAR log-likelihoods illustrated by Table S5 do not significantly differentiate from those viewed in Table 1 of Section 4.3 for the logit link. The defective performance of l P (β) is greatly corrected by the adjustment proposed by Severini, from any relevant inferential perspective and for all possible couples (T, N ) with T = 4, 6, 10 and N = 50, 100, 250. Results achieved by l M (β) and l M * (β) are still very similar, thanks to the validity of the MCAR hypothesis. Inferences on β drawn using the MNAR functions and the GEE method for the same MCAR samples are displayed by Table S6. In this probit regression setting the empirical properties of l M * (ψ) are much more favourable than those of the corresponding unmodified function. This is well reflected by all bias and Wald coverage indicators. We observe that, contrary to expectations, for fixed T the bias of the ML estimator decreases with N . In addition, the MNAR log-likelihoods seem to supply better point estimation but less trustworthy confidence intervals compared to their MCAR counterparts. Comments pertaining the performance of the approach based on GEE are along the same lines as those made about Table 2 in Section 4.3 of the paper, dedicated to the logistic regression. Even with the probit link, l M * (ψ) is more reliable than GEE as the group size grows up and generally succeeds in detecting the underlying ignorable missingness process, which represents a reduced form of the full MNAR model presupposed by that MCMPL.
An account of the last simulation experiment is given in Table S7, which is referred to incomplete datasets with MNAR units. Similarly to what emerged by Table 3 in the paper under the MNAR logistic framework, l M * (ψ) appears to retain higher inferential precision than the analytical MPL which ignores the missingness model, the only exception being the case with T = 4. Thus, when groups are not extremely small, taking the true nonignorable missingdata mechanism into consideration via Monte Carlo simulation is practically translated into improved bias and coverage properties of the estimator based on the MNAR MCMPL. Note that the performance of the MCAR MCMPL is not displayed, like for the logit link function,  since it does not significantly differ from that of Severini's l M (β). Furthermore, as expected, the GEE approach is found inconsistent when fitting a binary regression with MNAR response. In closing, the use of Monte Carlo approximation to compute the MNAR MPL proves to be generally recommendable in the presence of binary missing observations even under the probit link assumption, thanks to its robustness to the true mechanism of missingness. Finally, we consider a slightly different simulation setting, in which the incidental parameters are generated in such a way that they are intrinsically correlated with the covariate in the model. Specifically, we assume λ i = T t=1 x it /T + u i , where u i ∼ N (0, 1) (i = 1, . . . , N ). In this case, a random effects model would not be correctly specified and thus neither a corresponding marginal model would be available for comparing inference based on GEE. Tables S8-S10 show for the logistic link function that results based on the fixed effects approach and achieved via the MCMPL are, as expected, as accurate as those obtained under the alternative simulations in Section 4.3 of the paper. The same pattern of performance can be found in Tables S11-S13 concerning the probit specification.

S4 Weibull regression model for right-censored data
Tables S14-S17 summarize inference on β = (β 1 , β 2 ) obtained through the simulation studies described in Section 5.4 of the paper. In particular, Tables S14 and S15 refer to the first experiment considering an average censoring probability P c = 0.2, while Tables S16 and S17 refer to the case P c = 0.4. The results attest the sufficient adequacy of l P (ψ) in drawing conclusions about β. However, due to better estimation of the standard errors ofβ M * = (β 1M * ,β 2M * ), the MCMPL is still superior in terms of appropriateness of confidence intervals. Hence, one can conclude that the Monte Carlo adjustment is valuable to further improve the quality of standard ML procedures even when making inference on the regression coefficients under the Weibull modelling framework of Section 5.3.     Table S14: Inference on β 1 = −1 in the stratified Weibull regression model for right-censored survival data and probability of censoring P c = 0.2. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500. Results based on a simulation study with 2000 trials.  Table S16: Inference on β 1 = −1 in the stratified Weibull regression model for right-censored survival data and probability of censoring P c = 0.4. The compared functions are the profile log-likelihood l P (ψ) and the MCMPL l M * (ψ) computed with R = 500. Results based on a simulation study with 2000 trials.