Tweedie Compound Poisson Models with Covariate-Dependent Random Effects for Multilevel Semicontinuous Data

Multilevel semicontinuous data occur frequently in medical, environmental, insurance and financial studies. Such data are often measured with covariates at different levels; however, these data have traditionally been modelled with covariate-independent random effects. Ignoring dependence of cluster-specific random effects and cluster-specific covariates in these traditional approaches may lead to ecological fallacy and result in misleading results. In this paper, we propose Tweedie compound Poisson model with covariate-dependent random effects to analyze multilevel semicontinuous data where covariates at different levels are incorporated at relevant levels. The estimation of our models has been developed based on the orthodox best linear unbiased predictor of random effect. Explicit expressions of random effects predictors facilitate computation and interpretation of our models. Our approach is illustrated through the analysis of the basic symptoms inventory study data where 409 adolescents from 269 families were observed at varying number of times from 1 to 17 times. The performance of the proposed methodology was also examined through the simulation studies.


Introduction
Random effects models have been widely used in the analysis of multilevel data in various fields of applications. Random effects have traditionally been assumed to be covariateindependent in these models; however, such assumption may lead to ecological fallacy [1] in the interpretation of cluster or sub-cluster level covariates. The ecological fallacy occurs if the misleading inference is made about covariate effects when group level covariates are associated with individual observations directly. To account for correlation between clusterspecific random effects and cluster-specific covariates, models with covariate-dependent random effects have been introduced to handle clustered binary data [2], zero-inflated count data [3], discrete-survival data [4] and survival data [5,6]. Modeling of multilevel semicontinuous data has recently become an area of active research [7][8][9]; however, the cluster-specific and sub-cluster-specific random effects are assumed covariate-independent so far in the literature. Multilevel semicontinuous data with cluster-specific covariates are frequently observed in medical, environmental, economic and insurance studies. An example of three-level semicontinuous data with cluster or sub-cluster level covariates is the basic symptoms inventory (BSI) study data [10] where 409 adolescents from 269 independent HIV infected parents were observed at varying number of times from 1 to 17 times. The purpose of this study was to evaluate the effectiveness of an intervention. The intervention was designed based on social learning theory that focused primarily on skill building to reduce the risk from dangerous sexual practices and substance abuse [10]. At baseline, patients and adolescents were randomly assigned to the intervention (132 patients and 203 youth) or standard care (137 patients and 206 youth). The effect of intervention on adolescents was measured by the Global severity index (GSI). GSI is a composite index that based on 53 psychiatric indices which are an indicator of abnormal mental behavior [11]. Response variable of GSI scores is right-skewed with excessive number of exact zeros. Furthermore, parental, child-specific and visit-specific covariates were observed for these 409 adolescents; therefore, accounting for the covariates at their relevant levels would be more appropriate.
In this paper, we introduce a Tweedie compound Poisson model with covariatedependent cluster-specific and sub-cluster-specific random effects for three-level semicontinuous data. Unlike two-part models [8,9], our model characterizes zero and positive components in an integral way instead of separately. In addition, the skewness can be flexibly modeled by the power index parameter of Tweedie family of compound Poisson models. Furthermore, our approach consolidates conditional and marginal modeling. The estimation of our model has been developed based on the orthodox best linear unbiased predictors (BLUP) of random effects given the data as an extension of [12]. Explicit expressions derived for BLUP predictors of random effects predictors in this paper facilitate computation and interpretation of our models. To the best of our knowledge, this is the first time a compound Poisson model is developed with covariate-dependent random effects. Tweedie compound Poisson models are often used to characterize insurance claims, rainfalls, pollutants and health data in practice.
The rest of the paper is organized as follows. After introducing our compound Poisson mixed model with covariate-dependent random effects, we discuss prediction of random effects and model estimation in Sections 2 and 3, respectively. Our approach is illustrated through the analyses of the basic symptoms inventory data in Section 4. The simulations and conclusion are presented in Sections 5 and 6.

The Model
The zero-inflated continuous data are usually referred to as semicontinuous data. More specifically, we only consider nonnegative semicontinuous data in this paper. Let Y ijk represent the semicontinuous responses recorded from the kth (k = 1, 2, . . . , n ij ) observation of the jth (j = 1, 2, . . . , J i ) sub-cluster of the ith (i = 1, 2, . . . , I) independent cluster. Then the response vector can be expressed as Y = (Y 111 , . . . , Y 11n 11 , . . . , Y I J1 , . . . , Y I Jn I J ) . We consider the cluster level random effect U i for the response of the ith cluster and the sub-cluster level random effect V ij for the response from the jth sub-cluster of the ith cluster. Let W = (U , V ) denote the vector of the random effects where U = (U 1 , . . . , U i , . . . , U I ) and V = (V 1 , . . . , V j , . . . , V J I ) with V j = (V j1 , . . . , V jj , . . . , V jJ I ) respectively. Our model is based on the following three assumptions: Assumption 1. Cluster level random effects U 1 , . . . , U i , . . . , U I are independently distributed with gamma distribution with mean µ i and variance σ 2 . That is E(U i ) = µ i and Var(U i ) = σ 2 , where µ i = exp Z i β (1) with Z i be the cluster level covariate vector of the ith cluster.

Assumption 3.
Given the cluster and sub-cluster level random effect W = (U , V ) , the responses Y ijk follow compound Poisson distribution which can be expressed as where the explicit expression for c p (y ijk ; ρ 2 ) are given by [13] but are immaterial in our derivation in the following sections. This Tweedie family is also known as the power-variance family since (3) , where Z ijk represents observation level covariates.
The index parameter is restricted to the range 1 < p < 2 for the semicontinuous response since only this sub-family of Tweedie distributions has a support of the response Y ijk on left-closed interval [0, +∞). In addition, gamma distributed random effects have been traditionally used in the multiplicative models for multilevel data.

Moment Structure
The main focus of this section is to present the moment structures of the compound Poisson mixed model with covariate-dependent random effects discussed in the previous subsection. The moments of the model can be obtained after some algebraic calculations by methods of conditioning on random effects. The unconditional mean and covariance are presented here to facilitate the parameter estimation. As U i independently follows gamma distribution with mean µ i and variance σ 2 , the moment structure of U i can be expressed as where δ (s,i) = 1, if s = i and 0 otherwise. Following Assumption 2, the conditional mean and variance of V ij |U = u can be expressed as E(V ij |U) = µ ij U i and Var(V ij |U) = τ 2 µ 2 ij U i . This implies that the unconditional mean of V ij is and the covariance of V st and V ij is After some extensive algebra, Cov[U s , Y ijk ] and Cov[V st , Y ijk ] can be expressed as respectively. Following Assumption 3, the conditional mean and variance of E[Y ijk |W ] can be simplified as E[Y ijk |W ] = µ ijk V ij and Var[Y ijk |W ] = ρ 2 µ p ijk V ij , respectively. Thus the unconditional mean of the response variable Y ijk can be calculated by conditioning on the vector of random effects W, which can be expressed as where The moments structures described above will be used to develop the estimating equations for the regression and random effect parameters in next section.
In the literature, marginal and conditional inferences refer to the transformed linearity in regression parameters under appropriate link function for marginal and conditional means, respectively [14][15][16]. Our approach consolidates both conditional and marginal modeling interpretations under one model since both the conditional mean and the marginal mean of our model are clearly linear in regression parameters under the log-link.

Estimation of Parameters
In this section, we discuss the prediction of random effects and estimation of regression and random effects parameters.

Best Linear Unbiased Predictors of Random Effects
As in [12], the orthodox BLUP predictors of cluster level random effects Ui can be expressed in matrix form as follows Furthermore, the explicit expression of BLUP for cluster level random effects U i are given by Similarly the BLUP predictors of the sub-cluster level random effects V ij can be calculated as which can be simplified as The BLUP predictors for cluster and sub-cluster random effects given in Equations (7) and (9) are positive with explicit linear combinations of semicontinuous responses. These BLUP predictors provide the best prediction of the random effects among the class of all linear functions of the observed responses.

Estimation of Regression Parameters
In this section, we first consider the estimation for the regression parameters in the case of known random effect parameters. To do that, following [12], we differentiate the partially observed 'joint' log-likelihood of the Tweedie mixed model for the data and random effects with respect to β = (β (1) , β (2) , β (3) ) and yield the partially observed 'joint' score function. Then we replace the random effects with their BLUP predictors to obtain an unbiased estimating function for the regression parameters β. To be specific, the score function for cluster level regression parameters can be achieved by the following score function which can be modified after including the BLUP predictors of the cluster level random effect as where the ith component corresponds to the ith independent cluster. Similarly the subcluster level and observation level score functions can simplified as ψ (2) and ψ (3) Without loss of generality, we assume that the matrix of the covariates X = (Z i , Z ij , Z ijk ) is of full rank. Let the global estimating function be defined by As in [12,17], a global matrix expression can be obtained for this ψ(β). Furthermore, a simple relationship among sensitivity matrix S(β) = E β ∂ψ(β)/∂β , vari- can be obtained. Since the joint density for (Y, U) forms an exponential family in regression parameter β, the proof in [17] applies to ψ(β) as long as the score functions are linear in both U and Y. However, this linearity is clear since the score functions are in fact the right hand side of (10)- (12) with U(β) being replaced by U. A component-wise proof can also be found in [18]. Thus we have the following results: For Tweedie compound Poisson models with covariate-dependent random effects, the predicted global score function ψ(β), sensitivity matrix S(β) and variability matrix V(β) can be expressed as follows: Following [12], under mild regularity conditions, the solution of the estimating equation ψ(β) = 0 is consistent and asymptotically normal with asymptotic mean β and asymptotic variance given by the inverse of the sensitivity matrix S(β) = E β {∂ψ(β)/∂β} as the subjects are assumed to be independent. In addition, this estimating function ψ(β) = 0 is optimal in the sense that it attains the minimum asymptotic covariance for the estimator of β within the class of all linear functions of Y [12]. As in [12], this estimation equation ψ(β) = 0 can be solved iteratively using the scoring algorithm, where the value of β is updated following As the asymptotic variance matrix of regression parameter vectorβ is given by the inverse of the negative sensitivity matrix −S −1 (β), the asymptotic variance for each component of regression parameter vectorβ is given by the corresponding diagonal element of its asymptotic variance matrix, −S −1 (β); therefore, the estimated standard errors of β are obtained as the square root of diagonal elements of its asymptotic variance matrix, −S −1 (β).

Estimation of Random Effects Parameters
The dispersion parameters are now assumed to be unknown. The unknown dispersion parameter is estimated using the adjusted Pearson estimator [12]. The adjusted Pearson estimate has some advantages; one of them is that the Pearson estimator is adjusted by bias correction. We may thus estimate σ 2 by the following adjusted Pearson estimator: where the explicit expression for d(i) can be expressed as The expression of τ 2 is written aŝ where The expression for ρ 2 is written asρ As the clusters are assumed to be independent, standard estimating function theory can be applied to demonstrate the consistency and asymptotic normality for both regression and random effects parameters of our models as the number of clusters tends to infinity. In fact, this proof has actually been shown in the Chapter 5 of [18]. In addition, our estimation algorithm iterates among updating regression parameters through the scoring method, predicting random effects via BLUP predictor by (6) and (9), updating dispersion parameters through the moment estimators given in (14)-(16).

Analysis of Basic Symptoms Inventory Study Data
In this section, we apply our proposed covariate dependent semicontinuous model to the BSI study data.

Model Specification
In the BSI study, 409 adolescents from 269 independent HIV infected parents were followed over time between 1995 and 2002. Adolescents were observed at varying number of times from 1 to 17 times. Therefore, the data are of a hierarchical structure with observations nested within adolescents and adolescents further nested by their parents. The response of interest is the measurement of global severity index (GSI) scores with covariates at observation, adolescent and parent levels. A complete list of these variables with descriptions is given in Table 1. The BSI data were analyzed in [19] as longitudinal normal data ignoring parental level; however, there is a large proportion of exact zeros in the GSI response as shown in Figure 1; therefore, a three level model for semicontinuous data is more appropriate. Let Y ijk represent the GSI for observation k from the adolescent j of the ith parent. The random effects for parent i and adolescent j of parent i are denoted by U i and V ij , respectively. The three-level Tweedie models with covariate-dependent random effects (TMCDRE) is then specified as follows.
with µ ijk = exp(z ijk β (3) ). In (17), z ijk represents observation level covariates and V ij |U * = u * follows where µ ij = exp(z ij β (2) ) with z ij represents sub-cluster level covariates. In (18), cluster level random effects U i are assumed to be independent with where µ i = exp(z i β (1) ) with z i represents cluster level covariates.

Analysis Results
First, we obtained a maximum likelihood estimatep from Tweedie's compound Poisson model using the R package "tweedie" Version 2.07 (Dunn, 2010) in the presence of the previous described covariates. The estimated index parameters p was 1.55. For p = 1.55, we applied our approach to estimate regression and random effect parameters based on the model specified above. The regression and random effect parameters are presented in Table 2. To compare our proposed model with Tweedie compound Poisson models with covariate-independent random effects, we also presented analysis results based on the conventional Tweedie mixed model (CTMM) in Table 2. Our results in Table 2 show that the effects of covariates at observation level on the GSI score did not change materially in terms of direction, significance and magnitude. Second, all covariates at adolescent level had positive effects on the GSI score with race factor clearly insignificant and gender factor highly significant. The interesting phenomenon is that effect of baseline age on the GSI score changed from significant to insignificant at the significance level of 0.05 when all covariates were considered at their relevant levels. Third, again the effects of covariates at parent level on the GSI score did not change materially in terms of direction, significance and magnitude. Our results demonstrate that analysis results may change materially when all covariates were associated at their relevant levels. We have not observed any dramatic change in this analysis. This might partly be explained by the nature of involved covariates in this study since no covariates on health status were available. Another interesting phenomenon is that the variation at the adolescent level was clearly shifted to that at the parent level when all covariates were associated at their relevant levels. A finding of practical importance is that the intervention was not effective whether covariates were associated at their relevant levels in the analysis or not.

Simulation Studies
In this section, we compare the performance of the proposed models (TMCDRE) with the conventional Tweedie mixed model (CTMM) through simulation studies to assess the effects of associating all covariates were considered at their relevant levels. To replicate data in realistic conditions, we simulated data based on analysis results given in Table 2 with respective arrangements of covariates under the TMCDRE and CTMM. The estimated regression parameters (β) and random effect parameters (σ 2 , τ 2 , ρ 2 ) under the TMCDRE and CTMM in Table 2 were taken as true values for the respective models. The simulation study under the TMCDRE and CTMM were done separately in Sections 5.1 and 5.2.

Simulating Data from the Tweedie Compound Poisson Model with Covariate-Dependent Random Effects
In this section, we first generated clustered semicontinuous data from the TMCDRE model in Table 2. We then analysed the simulaed based on the TMCDRE and CTMM models. Our objective here was to compare the performance of TMCDRE and CTMM when the true data were generated from TMCDRE. The data generation procedure is given below.
• Finally, we generate n ij samples (Y ij1 , . . . , Y ijn ij and n ij varies from 1 to 17) for each v ij We carried out 400 simulation runs using the procedure discussed above. Estimated bias, simulated standard errors (Sim SE) and estimated standard errors (Est SE) of the regression and random effect parameters using TMCDRE and CTMM techniques are presented in Table 3. Parameter estimates of cluster level covariates are less biased for TMCDRE than for CTMM. Parameter estimate of the sub-cluster level continuous covariate Adolescent base age is less biased for TMCDRE than for CTMM. TMCDRE produced less biased estimates for observation level binary covariates Spring and Summer than CTMM. All dispersion parameters are overestimated by the TMCDRE, whereas only σ 2 and τ 2 are overestimated by CTMM. The simulation study shows that ρ 2 is underestimated by the CTMM approach. Dispersion parameters are more biased for CTMM than for TMCDRE. Table 3 shows the Sim SE and Est SE. Clearly, differences between Sim SE and Est SE of parameter estimates of observation level covariates (True month, Spring and Summer) are smaller for TMCDRE technique than those from the CTMM technique. Differences between Sim SE and Est SE for regression parameter estimates of all sub-cluster level covariates (Adolescents base age, Hispanic, Gender) and cluster level covariates (Treatment and Parent base age) are less for the TMCDRE than for CTMM. The only exception is with the covariate Gender.Pr. As expected, TMCDRE performed better than for CTMM when data were simulted from the TMCDRE model.

Simulating Data from the Tweedie Compound Poisson Model with Covariate-Independent Random Effects
In this section, we first generate clustered semicontinuous data using the conventional Tweedie Mixed Model (CTMM) method. Then, we analyze these simulated data by using TMCDRE and CTMM methods. Our objective is to compare TMCDRE and CTMM's performances when the true data were generated based on CTMM. The data generation technique is discussed below: • We will generate 269 samples (u 1 , . . . , u 269 ) using Gamma(1, σ 2 ). • In the second step, we will generate n i samples (v i1 , . . . , v in i and n i varies from 1 to 5) for each u i using Gamma(u i , τ 2 u i −1 ). • Finally, we will generate n ij samples (Y ij1 , . . . , Y ijn ij and n ij varies from 1 to 17) for each The fixed effect and random effect parameter estimates in Table 2 using CTMM technique (β = (−1.15, −0.047, 0.086, −0.020, 0.044, 0.122, 0.401, 0.0149, 0.230, −0.019), σ 2 = 0.10, τ 2 = 0.59, and ρ 2 = 0.53) are used as true parameter values. Similar to Section 5.1, we carried out 400 simulation runs using the procedure discussed above. Estimated bias, simulated stadard errors (Sim SE) and estimated standard errors (Est SE) of the regression and random effect parameters using TMCDRE and CTMM techniques presented in Table 4.
Our simulated results indicate that although data are generated from CTMM, parameter estimates from TMCDRE for observation level continuous covariate True month is less biased than the corresponding parameter estimate from CTMM. TMCDRE parameter estimate for the cluster level binary covariate Parent gender is also less biased than the corresponding parameter estimate from CTMM. CTMM gives less biased estimates for all other covariates than the TMCDRE. Dispersion parameters σ 2 and ρ 2 are overestimated by both conventional and covariate-dependent approach, whereas τ 2 is underestimated by CTMM and TMCDRE. Dispersion parameters are more biased for TMCDRE than those from CTMM are. Table 4 also indicates that the differences between Sim SE and Est SE for parameter estimate of observation level binary covariate (Spring) are smaller for TMCDRE than the CTMM. One important feature of TMCDRE is that covariates are separated in their corresponding level. The simulation study shows that the differences between Sim SE and Est SE of parameter estimates corresponding to binary covariates (Hispanic, Gender) in sub-cluster level are smaller for TMCDRE than for CTMM. Differences between Sim SE and Est SE for all other covariate are less for CTMM than TMCDRE. In other words, when data were generated from the CTMM, CTMM performed better than TMCDRE in general; however, TMCDRE still produced better estimates for some covariates than the CTMM.

Conclusions
In this paper, we have introduced a three-level Tweedie compound Poisson model with covariate-dependent random effects for semicontinuous data. As random effects at different levels are likely to be associated with the covariates at relevant levels, this new model enables us to account for such association in our analysis. Our analysis of BSI data demonstrate that analysis results may change materially when all covariates were associated at their relevant levels. Our simulation study has shown that Tweedie compound Poisson model with covariate-dependent random effects performs better than Tweedie compound Poisson model with covariate-independent random effects when random effects at different levels are associated with the covariates at relevant levels. Although we also compared performances between models with covariate-dependent and covariate-independent random effects when random effects at different levels are not associated with the covariates at relevant levels, such no association situation is far less likely in practice. Therefore, Tweedie compound Poisson model with covariate-dependent random effects is more appropriate for analysis of multilevel semicontinuous data with covariates at different levels.
Tweedie compound Poisson model with covariate-dependent random effects becomes far more complex than Tweedie compound Poisson model with covariate-independent random effects; however, we still obtained explicit expression for random effects predictors at different levels. These explicit expressions may facilitate interpretation of clustering effects at cluster and sub-cluster levels.
Models with covariate-dependent random effects have been introduced to handle multilevel binary, zero-inflated and survival data in the literature. Our model for multilevel semicontinuous data has enriched this class of models. Furthermore, Tweedie models with covariate-dependent random effects have also been developed in [18,20] for multilevel continuous and count data.