Estimating equations and diagnostic techniques applied to zero-inflated models for panel data

Abstract: Many practical studies analyze semi-continuous longitudinal data using, for example, ZAIG (Zero-Adjusted Inverse Gaussian) or BEZI (Beta Zero-Inflated) models as marginal distributions. We develop herein estimating equations analogous to Liang and Zeger’s estimating equation of independence and related diagnostic techniques for regression models, with zero-inflated response variable and panel data structure. A simulation study to evaluate some properties of the estimators obtained from the estimating equations and two applications with real data are presented.


Introduction
Semi-continuous response variables appear in many practical situations. In finance, for instance, losses in loans (zero loss means no default and a positive loss means no payment); in medicine when it is necessary to measure the concentration of a certain substance that may or may not be present in blood. In these cases, the response variable y can be described as where C is a positive random variable (in this paper, continuous) and B is a Bernoulli variable with parameter ν. The probability distributions attributed to model y are called zero-inflated. Some particularly interesting distributions are the Zero-Adjusted Inverse Gaussian distribution, named as ZAIG distribution, e.g. [5,7], and the Beta Zero-Inflated distribution, named as BEZI distribution [14]. The ZAIG is a zero-inflated distribution, where C follows an inverse normal distribution. In the case of BEZI, the random variable C follows a beta distribution. Ridout et al. [17] describe other inflated distributions for counting data.
In cross section studies, estimates of the regression model parameters for ZAIG or BEZI responses may be obtained by the library gamlss [19] for R 1 .
The authors in [4] developed generalized estimating equations for zero inflated random variables. In their proposal, there was a regression model for the probability of zero and for the mean of the continuous part of the distribution. The method assumes that C belongs to the class of exponential dispersion models (see [8], for instance) and when it follows a log-normal distribution. Dobbie and Welsh [2] developed estimating equations for zero-inflated counting data.
Other advances in the study of regression models for zero-inflated distributions, in the presence of dependence among the observations and when C is a discrete variable, may be found in [13], who developed random effect models for zero-inflated counting data. Multivariate distributions with zero-inflated marginal distributions may be found in [3].
In this paper we developed diagnostic techniques for regression models, with zero-inflated response variable and panel data structure, applying estimating equations. We consider cases with homogeneous and heterogeneous dispersion parameters. These techniques are based on [20] who considered estimating equations for longitudinal continuous data with probability distribution in the exponential family and under a homogeneous dispersion parameter. This paper is organized as follows. The next section presents basic concepts of estimating functions. In Section 2 we propose a general model for zero-inflated panel data analysis, analogous to Liang and Zeger's independent estimating equations, whose application to ZAIG and BEZI response variable is presented in Sections 2.2 and 2.3, respectively. Section 2.1 describes an interactive method for estimating parameters. In Section 3 we propose some diagnostics measures.
Some properties of the estimators obtained by the proposed estimating equations are evaluated by a simulation study in Section 4. Finally, we analyze a data set with the proposed methodology and then we present our concluding remarks.

General model
By definition, any measurable function ψ of the data and of the parameters of interest (θ) is an estimating function. Let y i = (y i1 , . . . , y iT ) ⊤ , i = 1, . . . , n, be a sample of independent random vectors 2 . Assume that ψ i = ψ i (y i ; θ), i = 1, . . . , n, are estimating functions. The concept of estimating function may be extended to the sample by Ψ(y; Under general regularity conditions, e.g. [18], we may prove that the estimator θ, obtained from Ψ(θ) = 0 is consistent and that √ n(θ − θ) is asymptotically normal with null mean vector and covariance matrix given by the inverse of In the presence of zero-inflated data, let y i = (y i1 , . . . , y iT ) ⊤ be a sample of independent random vectors of the ith experimental unit, i = 1, . . . , n, with where C it is a continuous random variable with a regular probability density function given by f it = f (y it ; µ it , σ it ), with a position parameter µ it and a second parameter given by σ it (by convenience, it will be called a dispersion parameter, although it may indicate any other distribution characteristic).
Consider the existence of three column vectors of fixed covariates x it , d it and q it of dimensions p, q and r, respectively, such as where β, γ and δ are parametric vectors and g 1 , g 2 and g 3 are continuous, invertible and double differentiated link functions, with i = 1, . . . , n and t = 1, . . . , T . The probability density function of y it is given by where I y∈A is an indicator variable that assumes the value 1 if y belongs to the set A. The right side of equation p it may be factorized in one term that just depends on ν it and another one that depends on µ it and σ it , which uses the continuous part of the dependent variable (see [11]). We propose the use of an estimating function for θ = (β ⊤ , γ ⊤ , δ ⊤ ) ⊤ that is identical to the score function obtained in the case of independence among y ij , i = 1, . . . , n and j = 1, . . . , T (full independence). In practical terms, the point estimators of the parameters are identical to those obtained under full independence, but the standard errors must be corrected by equation (1) for the presence of dependence among observations of the same sample unit. This result is similar to the one obtained by [10] for independence estimating equations.
Under full independence, the likelihood function for θ would be given by Consequently, the logarithm of the likelihood function of θ = (β ⊤ , γ ⊤ , δ ⊤ ) ⊤ would be given by where From (2) the estimating equation, which coincides with the score function obtained in case of full independence, is given by where , if y it = 0 , i = 1, . . . , n and t = 1, . . . , T .
Estimating equations and diagnostic techniques applied to zero-inflated models 1645 Assuming the existence of an structure of dependence among the components of y i , from (1), we have √ n(θ−θ) where ∆(·) is an operator that, when applied to a vector, creates a diagonal matrix with the vector elements at the main diagonal. Besides,u i ,ṁ i andu iσ are T -dimensional vectors whose components are given, respectively, bẏ Moreover, E (u itσ ) may also be written as and E (ṁ it ) is defined by the same following way.

Estimation of the parameters
We may obtain an estimate ofθ by the following iterative procesŝ where m = 0, 1, 2, . . . indicates the step of the iterative process. Equation (4) may be presented as reweighted least square iterative process with weight matrix W i and a modified dependent variable z i given bŷ

ZAIG response variable
Herein, we will apply the results of this section to the case of the ZAIG response variable. From (2), the log-likelihood in case of full independence would be given by When y it > 0, we have and with From (6) and (7), the estimating function (3) is given by For a ZAIG response, under heterogeneity of the dispersion parameter, we Estimating equations and diagnostic techniques applied to zero-inflated models 1647

BEZI response variable
Herein, we analyze a BEZI response variable case. The corresponding log-likelihood function as described in equation (2), under independence of all observations, should be expressed by When y it ∈ (0, 1), one can say and where ψ(·) is a digamma function, .
From (8) and (9), the estimating function (3) is given by . . , n and t = 1, . . . , T . According to (3), the estimating function for BEZI response variable models, under heterogeneity of the dispersion parameter is given by and ψ ′ (·) is a trigamma function.

Diagnostic techniques
Based on [20], this section presents some diagnostic measures to detect leverage points, influential points and outliers for the models proposed in Section 2. Equation (5) may be written aŝ where In this case, M represents a design matrix andŴ and z assume the role of a weight matrix and a dependent variable, respectively. From (10), whereŴ 1/2 z may be seen as a response vector [15], the ordinary residual is given by where Matrix H i has characteristics similar to those of hat matrix from a linear model. Therefore, the main diagonal elements may be used to detect the presence of leverage points [20,15].
To identify outlier observations, a new residual is defined by standardizing the ordinary residual described in (11). Under the model described in (3), one may consider that Cov( Thus, the standardized residual for observation y it is given by where e (it) is a T -dimensional vector assuming the value 1 in the position related to y it and 0 otherwise, and c it is the t-th element of the main diagonal of C i (θ), with i = 1, . . . , n and t = 1, . . . , T . This result is not a simple extension of [20]. For repeated measures regression models [20], the Cook distance used to detect influential points is given by where h it is the t-th element of the main diagonal of H i , with i = 1, . . . , n and t = 1, . . . , T . Under homogeneity of the dispersion parameter, assume r = 1; otherwise, r is the dimension of the parametric vector δ.
Graphically, the plot of h it versus i− where h it represents the t-th element of the main diagonal of H i , i = 1, . . . , n and t = 1, . . . , T − may be useful to identify leverage points. Widely discussed in the Statistics literature from [15], a relative high value of the Cook distance may indicate an influential point; to identify these points, do (DC) it versus the index i, i = 1, . . . , n and t = 1, . . . , T . At last, relative high values of (r P D ) it , i = 1, . . . , n and t = 1, . . . , T , may indicate outliers.

Simulation study
A simulation study was conducted in order to evaluate the quality of the estimators obtained with the estimating equations proposed in this paper. The behavior of the estimators was examined under different sample sizes (n = 50, 100, 500), response vector sizes (T = 2, 3, 5, 10) and five different degrees of dependence.
Normal copulas (e.g. [21]) were used to generate multivariate data with (0, 1) uniform marginal distributions. Let c it be the generated copula value for the individual i in time t, the multivariate vector with ZAIG (or BEZI) marginal distribution was given by doing where y it is the simulated value for individual i in time t and Q it is the quantile of order c it of the respective ZAIG (or BEZI) model.
The copulas were obtained from multivariate normal distributions with correlation coefficients ρ (ρ = 0.0, 0.2, 0.5, 0.8 and 0.9). In this context, ρ is a measure of the degree of dependence among the generated vector components. This procedure was done with the help of the packages copula [6] and gamlss [19] from R.
Each combination of n, T and ρ was simulated 10,000 times. The model used in the simulation considered one independent variable, x it , affecting ν it , µ it and σ it , i = 1, . . . , n, t = 1, . . . , T , where x it were obtained by generating independent uniform distributions in the range (−1, 1).

For ZAIG models
with β 0 = γ 0 = δ 0 = 0, β 1 = 2 and γ 1 = 1 and δ 1 = 0.4. The choice of δ 1 = 0.4 minimized the amount of cases that failed to achieve convergence in the estimation process. The lack of convergence was sometimes due to extreme outliers in the data set. The value chosen for β 1 allows that the probability of zero response may vary between 12% and 88% roughly. For BEZI models The relative mean absolute error (RMAE), obtained by dividing the observed mean absolute error by the parameter value, and the relative square root of the mean square error (RMSE), obtained by dividing the square root of the observed mean square error by the parameter value, were used to evaluate the simulation results. As the conclusion based on these two indicators were similar, only the RMAE results are presented in this section.
RMAE, in general, decrease, as n and T increase, for ZAIG and BEZI models. That is expected by the consistency of the estimators.
To make the conclusions of the study of the behavior of the errors due the dependency degree clear, for each combination of n and T , the ratio between RMAE for a fixed value of ρ and for ρ = 0 was calculated. Figures 1 and 2 illustrate these results. It may be seen that the effect of the correlation is low for small values of T , and it increases as T and ρ increase, for all n. This is more visible for values of ρ greater than 0.5. This conclusion is the same for ZAIG and BEZI models.

Application to the real data
In this section, traffic-related death rates in the southeastern cities of Brazil, between 2000 and 2002, will be analyzed. The dependent variables are the squareroot of the annual mortality rate of traffic-related accidents per 100 thousand inhabitants (Ratesq) and the proportion of traffic-related deaths among all causes of death (Proportion).
We will build different models for each dependent variable; nevertheless, all models will use the following set of independent variables: Estimating equations and diagnostic techniques applied to zero-inflated models 1651

Propmasc proportion of men in the city's population in 2000;
Prop2029 proportion of inhabitants aged 20-29 years; and EHDI education index of the human development index of a municipality in 2000. Table 3 shows some descriptive statistics for the response variables, there is a high incidence of zero values. By comparing the information for 2000, 2001 and  2002, we notice, in both variables, a small variation of the proportion of municipalities with zero values, in the average values and in the standard deviation. This may be a suggestion time effect insistence. Due to some data problems, the sample sizes are slightly different.
The ZAIG distribution was used to model the square root of the trafficrelated death rate and the BEZI distribution to model the percentage of trafficrelated deaths. In both cases a model with heterogenous dispersion parameter was considered.
Let   The estimates of θ were obtained from the GAMLSS library available in software R; the standard error corrections and the diagnostic measures were obtained from a macro developed for R package.

Square-root of the annual traffic-related death rate
The following model was proposed in order to model the square root of the traffic-related death rate: Estimating equations and diagnostic techniques applied to zero-inflated models 1655  The parameters estimates are presented in Table 4. The corrected results, considering the dependence between the observations, are presented in the last three columns of the  comparing these columns, it is possible to notice the correction effect. The most important differences were observed in the intercept of µ it (γ) models and, for the coefficient of Propmasc, in the σ it (δ) model. In Figure 3, we may find the diagnostic measure graphs (projection matrix, Cook's distance and standardized residual) for each parameter vector of the regression models: β, γ and δ. We may see that the city ofÁlvaro de Carvalho (SP) appears as a high leverage point for the three parameter vectors. The identification of the city ofÁlvaro de Carvalho as a high leverage point may be due the fact that it presents discrepant values for Propmasc (59.94, when the mean is 50.68 and the standard deviation is 1.25) and Prop2029 (24.95, when the mean is 16.54 and the standard deviation is 1.48).
For β, the observation for the city of Itabira (MG), in 2002, appears as an influential point and as an outlier. Its observed value for Ratesq is zero and the model forecasts a low probability of assuming this value (0.45%); the estimated proportion of cities with actual zero value for this variable is 63.58%.

Estimating equations and diagnostic techniques applied to zero-inflated models 1657
For δ, Barra do Turvo (SP), in 2000, appears as an influential point and as an outlier. This city presents high values for the response variables in the three years considered in the analysis (the highest in 2002, the second highest in 2000 and the third highest in 2001).
The observations for the city of Serra da Saudade (MG) and Josenópolis (MG), for γ, in 2001, are highlighted in Cook's distance and standardized residual graphs. We did not find any explanation for this fact.

Proportion of traffic-related deaths
The regression models for the proportion of traffic-related deaths are and The parameter estimates may be found in Table 5. As we have seen previously, the correction effect occurs both increasing and decreasing the standard errors.   Three state capitals (out of four) -São Paulo, Rio de Janeiro and Belo Horizonte -appear as leverage points in the mean modeling. This may be explained by high value of the variable Lnpop. In 2000, the mean population value of the cities included in the analysis were 12.2 thousand inhabitants, while the population of these capitals were 10.5 million, 5.9 million and 2.2 million, respectively.
Alvaro de Carvalho was identified as a leverage point, just like in the last section. Itirapina, Riolândia and Rosana were identified as high leverage points too; the first two presented a high value for the variable Propmasc (Itaparina: 55.67 and Riolândia: 55.76). Rosana showed low value for Propurb, 0.26) (the mean was 0.70 and the standard deviation was 0.21), and high value for EHDI 0.91 (the mean was 0.82 and the standard deviation was 0.06); this behavior is unexpected since the correlation between these variables is 0.72. Nova Serrana was identified as a high leverage point, but we did not find any explanation for this fact.

Estimating equations and diagnostic techniques applied to zero-inflated models 1659
The city of Itabira (MG), in 2002, also appears as an influential point and as an outlier to BEZI model. The value for Proportion in 2002 is zero, but the model also forecasts a low probability of assuming this value, 0.42%.
Mathias Lobato is detected as outlier and as influent for γ and β. It is a city with a low population (Lnpop=8.0, while the mean population is 9.4 with standard deviation 1.2) and, in 2000, the observed proportion of traffic-related deaths was extremely high: 50%. Besides its value for Propurb was 90%, what is unexpected for a city with 3,642 inhabitants. The correlation between Lnpop and Propurb is 0.49.

Concluding remarks
In this paper we proposed estimating equations for regression models for zeroinflated random variables. In particular, we focused on BEZI and ZAIG distributions. The estimating equations are similar to the score function obtained in the full independence case. In practical terms, the method provides a correction for the standard errors of the parameter estimators.
The results are directly applicable to the analysis of any zero-inflated semicontinuous response variable. In particular, to zero-inflated Gamma (ZIG), zeroinflated log-normal (ZILN) and zero-inflated truncated Pareto (ZITPo), see [12] and [1] for further details about these distributions.
Furthermore we proposed diagnostic techniques to identify outliers, leverage points and Cook's influence points, which is an advance compared to other studies cited in this paper.
The simulation study suggests that the estimating errors decrease as the sample size and the size of the response vector grow and increase when the dependence degree among the response vector components is high, mainly for large samples and response vector dimensions.
We applied the methods to two data sets to get a correct value for the standard error estimates of regression model parameters, in the presence of dependency among observations of the same sample unit. The standard errors obtained from Fisher's information acquired under the assumption of independence among all observations would lead to wrong inferences.