Modeling the incidence of citrus canker in leaves of the sweet orange variety ‘Pera’

Citrus canker, caused by the bacterium Xanthomonas citri subsp. citri, is one of the most important diseases of citrus. The use of resistant genotypes plays an important role in the management and control of the disease and is the most environmentally sustainable approach to disease control. Citrus canker incidence was recorded in an experiment on nine genotypes of the sweet orange variety ‘Pera’ grafted on four rootstocks. The experiment was started in 2010 and the incidence of citrus canker on the leaves was recorded on a quarterly basis. The incidence data from the experiment were analyzed using a zero-inflated Beta regression model (RBIZ), which is the appropriate method to describe data with large numbers of zeros. Based on the residual analysis, the data fit the model well. The discrete component of the explanatory variable, rootstock, was not significant as a factor affecting the onset of disease, in contrast with the continuous component, genotype, which was significant in explaining the incidence of citrus canker.


Introduction
According to Gonçalves-Zuliani, Nunes, Zanutto, Filho, and Nocchi (2015), citrus canker caused by Xanthomonas citri subsp.citri (Xcc, Schaad et al., 2006) is an important disease in many citrusproducing regions of the world (Gottwald et al., 2002).Xcc can cause disease in many commercial varieties of citrus, specifically sweet orange (Citrus sinensis L. Osbeck), resulting in significant economic losses to the producer.In addition to direct yield loss, the lesions caused by citrus canker can preclude the marketing of fresh fruit.In Brazil, citrus canker has been studied by several authors including Carvalho et al. (2015); Gonçalves-Zuliani et al. (2015) and Braido et al. (2015).The search for genotypes resistant to citrus canker is the most attractive way to control this disease as it has least environmental impact (Gonçalves-Zuliani et al., 2015).Gonçalves-Zuliani et al. (2015) aimed to evaluate resistance to citrus canker in nine sweet orange genotypes of the variety 'Pera' grafted on Rangpur lime, sunki tangerine, Cleopatra mandarin and rustic orange.According to these authors, rootstock influenced the tolerance of the canopy to plant disease; in general, the rootstocks that induced less prolific growth in the scion showed a lower incidence of disease compared with the rootstocks that induced more vigorous growth.In this case, the effect was estimated from the total count of leaves and diseased leaves; thus, the resulting data set had a high proportion of zeros, that is, the data were zeroinflated.For analysis, Gonçalves-Zuliani et al. (2015) used a non-parametric statistic and later applied Tukey's HSD test, without taking into account the large proportion of zeros in the data or the correlation due to repeated measures.The conclusion was that grafted genotypes of sweet orange showed a range in the incidence of diseased leaves, with those scions on Rangpur lime rootstock being the most susceptible to the pathogen, possibly due to increased canopy vigor due to this rootstock.
In the current experiment, the objective was to evaluate the functional relationship of dependent variables to one or more predictor variables (Neter, Kutner, Nachtsheim, & Wasserman, 1996).Many regression models have been proposed in statistical modeling, but the most common is the classic regression model or the linear model.In this case, the relationship between variables is described by a linear function assuming independence and normality of errors.These models, however, are inappropriate when the dependent variable is a rate, ratio, or fraction with the records contained in one of these limited ranges: ([0, 1), (0, 1] and [0, 1]).In these cases, the estimates obtained by using the classical regression model may exceed these limits.Thus, it is recommended that the response variables be transformed to avoid this discrepancy to allow the estimates to conform to the linear model.However, data transformation can make the interpretation of the model parameters difficult in relation to the original response.
An alternative method to fit a regression model for continuous variables is to assume a probability distribution to describe the data.The Beta distribution is appropriate for modeling binary, zero-inflated data (0, 1).Several reports describe the fitting of regression models for response variables using the Beta distribution.For example, Ferrari and Cribari-Neto (2004) developed a regression model where the response variable was described by the Beta distribution and the average response was described by a linear predictor using a linking function.The authors reparameterized the Beta distribution to facilitate their interpretation by indexing using mean and dispersion parameters.Consequently, the model is useful when the dependent variable has its values in the interval (0, 1) and they are related to other predictor variables.Furthermore, other researchers such as Paolino (2001) have used the Beta regression to compare the estimates obtained using Beta regression models and linear regression with and without transformation of the dependent variable.The estimates of the Beta distribution have significant advantages over the linear model in those cases where the response variable assumes values in the interval (0, 1).Pereira, Souza, and Cribari-Neto (2014) have effectively used the inflated Beta regression model to assess the administrative efficiency of Brazilian municipalities to compare the performance of each region with respect to the management of public resources.
However, there are situations in which the variable of interest assumes values at one or both ends of the range, i.e., [0, 1), (0, 1] and [0, 1].In these cases, the use of the Beta distribution is not feasible and a mixture model using discrete and continuous distributions has been recommended.The discrete distribution estimates the probability mass at zero, one, or both, and the continuous distribution describes the continuous component of the data.This type of model is known as an inflated model.Ospina and Ferrari (2010) introduced the Beta distribution inflated distributions that mix the Beta distribution with the Bernoulli distribution to estimate the mass of probability of zero, one, or both.This family of distributions is formed by the parameterization of the Beta distribution and by the distribution that will describe the discrete component.
We propose an alternative analysis to the report by Gonçalves-Zuliani et al. (2015), considering an evaluation of genotype response due to the presence of correlations between the repeated observations and the large proportion of zeros in these data.The objective of this study was to apply an inflated regression model to describe the incidence of citrus canker on leaves of different genotypes of sweet orange, variety Pera, taking into account the different rootstocks and their effects on the susceptibility of the sweet orange genotypes to citrus canker.Thus, the same data set as that described by Gonçalves-Zuliani et al. (2015) was used here.
Approximately two-year-old plants were assessed quarterly to determine the incidence of citrus canker on the leaves.Gonçalves-Zuliane et al. ( 2015) selected 10 plants per genotype and four branches from each plant.We only used one of the evaluations to perform the analyses proposed.The total leaf number and the number of diseased leaves were counted on each branch such that the total sample size was n = 360.

Statistical analysis
The percentage incidence of canker-infected leaves on the different genotypes of Pera sweet orange was analyzed using the Beta distribution, which has been widely used for modeling when data are in the range (0, 1).However, because the incidence of citrus canker can take values in the interval [0, 1), it was necessary to make some changes in the analysis.Thus, the distribution that was used was the inflated Beta distribution following the modification suggested by Ferrari and Cribari-Neto (2004).
The parameterization of the Beta distribution suggested by Ferrari and Cribari-Neto (2004) to describe a random variable Y restricted in range (0, 1) has a density function given by: Thus, Y follows a Beta distribution function with the average parameter μ, precision φ and is denoted by Y B(μ, φ).The respective mean and variance of Y in this parameterization are E[Y]=μ and , where Var(μ)= μ(1-μ) is the variance function.
The cumulative distribution function using the mixture model of the Beta distribution with degeneration at zero, one or both is given by: {c} {c} BI (y; , , ) I (y) (1 )F(y; , ) in which I A (y) is an indicator function that assumes the value 1 if y ∈ A, or 0 otherwise, and A is the set of elements for the value y = c; F(.; μ, φ) is the cumulative function of the Beta distribution; α=P(y=c) is the parameter of the mixture distribution 0<α<1.As BI {c} has a mass point at y = c, it cannot be considered completely continuous.Note that with a probability α, the variable Y is selected from a degenerate distribution at c, and when the probability is (1-α), the variable is selected from a Beta distribution.The probability density function of the variable Y is given by the value generated by the mixture model and is written in the following form: where 0<α<1, , φ>0, and f(.; μ, φ) is the density function in Equation ( 1) (Ferrari and Cribari-Neto, 2004).If α>0, the probability mass distribution at Beta point y = c is exceeded, i.e., the probability of observing y = 0 or y = 1 is α=P(y=c).Note that the first term of the distribution shown in Equation (3) depends on (α), and the second term depends on (μ, φ) because it involves the continuous part of the response variable (Ospina & Ferrari, 2010).
The expected mean and the variance of Y following the inflated Beta distribution are given by: E[Y]=αc+(1-α)μ and  3) for Y in the interval [0, 1) is nominated an inflated Beta distribution at point zero (BIZ), and denoted by =Y BIZ(α, μ, φ).In this analysis, the case will be discussed based on the observed values in the range [0, 1) (0 ≤ y < 1).
On account that Y 1 , ..., Y n are independent random variables, in which Y 1 , i=1, ..., n, follows the distribution at the inflated Beta c point (c = 0 or c = 1) as in Equation (2), i.e., Y t BI {c} (α t , μ t , φ), the inflated Beta regression model (RBI c ) is defined by the systematic components:  in which Z t1 , ..., Z tM and X t1 , ..., X tm are observations from known regression variables with M+m<n.For discrete components of the inverse probit link function α t =Φ{ζ t }, and for the continuous model, the inverse of the log link function is μ t =exp{η t }.We note that μ t is the conditional average of y t to y∈(0, 1) and φ is the dispersion parameter that can be variable or constant for all of the observations.In the model RBI c , the estimate of the parameters vector θ=(γ T , β T , φ) T can be calculated using the maximum likelihood method whose function is given by: However, as these estimators do not have a closed form, they may be obtained by maximizing the log-likelihood function using a nonlinear optimization algorithm, such as a Newton algorithm or a quasi-Newton algorithm (Ferrari & Cribari-Neto, 2004).We used the package gamlss (Generalized Additive Models for Location, Scale and Shape) from the statistical program R to obtain point estimates for the parameters of the RBIZ model.
Various measures of "goodness of fit" can be evaluated, and the model of fit assessment can be based on the estimated values for the maximum likelihood from the sample.One of these goodness of fit values is the pseudo R 2 of McFadden (1973), given by: where l θ is the log-likelihood function of the fitted model and l 0 is a function of the log-likelihood of the null model, or the model without the regression structure.Based on Louviere, Hensher, and Swait (2000), the model has the best goodness of fit when the value of ρ 2 ranges from 0.2 to 0.4.Domencich and McFadden (1975) ran simulations to compare the range of ρ 2 with multiple correlation coefficients of the range (R) and found that the range of ρ 2 (0.2 to 0.4) is equivalent to an R range of 0.7 to 0.9.Based on Ospina and Ferrari (2012), the residual analysis of a regression model inflated with zeros should be divided into two parts: the first evaluates residuals of discrete components (r D pt ) and the continuous (r C pt ) model.These authors propose that the residuals be Pearson standardized based on Fisher's iterative algorithm scores to estimate the parameters.Second, the authors defined a random quantile residual as the global residual of the model using the information from the two components.To verify the assumptions about the residuals as well as the presence of atypical observations of the discrete component of the inflated Beta regression model, Ospina and Ferrari Acta Scientiarum.Agronomy Maringá, v. 39, n. 1, p. 1-8, Jan.-Mar., 2017 (2012) suggested a standardized version of the Pearson residual.For the continuous component, they followed Espinheira, Ferrari, and Cribari-Neto (2008) to set a weighted version of the common residual of Fisher's iterative algorithm score used to estimate the parameter β when φ is fixed.A more detailed discussion can be found in Ospina and Ferrari (2012).

Results and discussion
Initially, an exploratory analysis of canker incidence was performed for each genotype and rootstock.All zeros were removed from the data set.Based on these data, the genotype Arapongas developed the highest incidence of citrus canker (Figure 1).Considering rootstock effects, the 'Caipira' orange (Figure 2) rootstock appeared to support the lowest incidence of citrus canker in the scion (regardless of scion genotype).However, the distribution of the variable incidence is asymmetric -there is a high proportion of zeros in these data, Figure 3. Approximately 85.28% of these data are zeros, which justifies the use of the zero-inflated model.
In Figure 3, the incidence of citrus canker can be described by an inflated Beta distribution.Therefore, an RBIZ model was used and adjusted with Y i BIZ(α, μ, φ) as Equation ( 2), including the covariates rootstock and genotype in the model as follows:   (5) These equations represent sub-models and are the discrete and continuous components, respectively, in the model described by Equation (2).To fit these models, we used the gamlss module in R, which iteratively maximizes the loglikelihood function using the RS algorithm, which is a generalization of the algorithm used by Rigby and Stasinopoulos (2005).The estimates of the RBIZ model and their standard errors are presented in Table 2.We calculated the pseudo R 2 of McFadden (1973) to assess the suitability of the model, applying the likelihood ratio test.The value of the estimated pseudo R 2 is 2 ˆ0.385 ρ = , indicating that the model goodness of fit was appropriate.The likelihood ratio (LR) test provided evidence to reject the null hypothesis (H 0 :(γ 0 =...=γ 11 =0); (β 0 =...=β 11 =0)), at the usual level of significance of 5%, with the test statistic and LR = 51.003;p-value = 0.0024.Therefore, at least one of the parameters in the model was significant.
However, some parameters in the two submodels were not significant, (Table 2).Therefore, we used the Akaike information criterion (AIC) to select the most appropriate reduced model (using stepGAIC in module gamlss in R).
The observed estimates and their standard errors for the reduced model are presented in Table 3.For the sub-model Probit(α) (4), the estimates of the discrete component regression parameters were significant for the variable rootstock.The estimates for the sub-model Log(μ) (5), the continuous component of the model, and the parameters were significant for all genotypes of the sweet orange variety Pera.
Based on the Probit(α), sub-model ( 6), the rootstock 'Caipira' orange induced the most resistance in the scion to citrus canker.'Caipira' orange was followed by Cleopatra tangerine, Sunki tangerine and rangpur lime in increasing incidence of canker on the scion.
The Log( μ ), sub-model ( 7), indicated that the genotype Arapongas had the lowest incidence of citrus canker, followed by genotypes IAC and G58.
The genotypes EEL and Bianchi were more susceptible to citrus canker.With the exclusion of observations 291, 305, 332, and 340, the estimates do not differ from those that include all of the observations.Therefore, they should not be excluded from the analysis.

Conclusion
In this study the 'Caipira' orange rootstock resulted in the greatest resistance of the scion to citrus canker.Using one of the evaluations we obtained similar results to those reported by Gonçalves-Zuliani et al. (2015).Furthermore, it was possible to detect the genotypes that were most vulnerable to citrus canker and those rootstocks that imparted greater resistance using a smaller number of plants, with the modeling assumptions providing more reliable results.Thus, the zero-inflated Beta regression model is the most appropriate for modeling data with an excessive proportion of zeros.Besides that, this model saved time and resources.
= α + − α μ estimates the response of the inflated Beta model.The distribution shown in Equation (

Figure 1 .
Figure 1.Box-plot of the incidence of citrus canker on leaves of different genotypes of sweet orange Pera (Citrus sinensis).

Figure 2 .
Figure 2. Box-plot of the incidence of citrus canker on leaves of scions of sweet orange Pera (Citrus sinensis) on four different rootstocks.

Figure 3 .
Figure 3. Frequency of incidence of citrus canker on leaves of the sweet orange variety Pera (Citrus sinensis).
fit, Figure4, show that (a) the residuals were randomly scattered around zero, (b) the points were in the range from -3 to 3, and (c) and (d) the residual distribution function approximated the normal.

Figure 4 .
Figure 4. Measures of model fit.(a) fitted values versus quantile residuals; (b) index versus quantile residuals; (c) histogram of residual frequency; (d) quantile-plot.To check the outliers of the RBIZ model, we reviewed the residuals (r D pt ) and (r C pt ).For the discrete component (Probit(α)), observations 291, 305, 332 and 340 exceeded the range of -3 to 3 and are outliers (Figure 5 (a) and (b)); (Figure 5 (c) and (d)) represent the continuous component.These observations are the same as the outliers in the box-plot, Figures 1 and 2. In Figure 5 (c) and (d), the points are in the range from -3 to 3.With the exclusion of observations 291, 305, 332, and 340, the estimates do not differ from those that include all of the observations.Therefore, they should not be excluded from the analysis.

Figure 5 .
Figure 5. Model fit in relation to outliers.(a) and (b) represent the discrete component; (c) and (d) represent the continuous component.

Table 3 .
Estimates and standard errors of the regression using the zero-inflated Beta model.

Table 2 .
Estimates and standard errors of the zero-inflated Beta regression model to predict the incidence of citrus canker on sweet orange leaves.