Linear and Non-Linear Regression Models Assuming a Stable Distribution

In this paper, we present some computational aspects for a Bayesian analysis involving stable distributions. It is well known that, in general, there is no closed form for the probability density function of a stable distribution. However, the use of a latent or auxiliary random variable facilitates obtaining any posterior distribution when related to stable distributions. To show the usefulness of the computational aspects, the methodology is applied to linear and non-linear regression models. Posterior summaries of interest are obtained using the OpenBUGS software.


Introduction
A wide class of distributions that encompasses the Gaussian one is given by the class of stable distributions.This large class defines location-scale families that are closed under convolution.This distribution family (see for instance Samorodnitsky & Taqqu 1994) is described by four parameters α, β, δ and γ.The α ∈ (0, 2] parameter defines the "fatness of the tails", and when α = 2 this class reduces to Gaussian distributions.The β ∈ [−1, 1] is the skewness parameter and for β = 0 there are symmetric distributions.The location and scale parameters are, respectively, δ ∈ (−∞, ∞) and γ ∈ (0, ∞) (see Lévy 1924).
The difficulty associated with stable distributions S α (β, δ, γ) is that, in general, there is no simple closed form for their probability density functions.However, it is known that the probability density functions of stable distributions are continuous (Gnedenko & Kolmogorov 1968, Skorohod 1961) and unimodal (Ibragimov & Chernin 1959, Kanter 1976).Also the support of all stable distributions is given in (−∞, ∞), except for α < 1 and |β| = 1 when the support is [δ, ∞) for β = 1 and (−∞, δ] for β = −1 (see page 12 in Nolan 2015).The characteristic function ϕ(•) of a stable distribution is given by log(ϕ(t)) = iδt − γ α |t| α [1 + iβ sign(t) tan( πα 2 )(|γ t| 1−α − 1)], for α = 1 iδt − γ|t|[1 + iβ 2 π sign(t) log(γ|t|)], for α = 1, (1) where i = √ −1 and the sign(•) function is given by ( Although this is a good class for data modeling in different areas, there are difficulties in obtaining parameter estimates with a classical inference approach due to the lack of closed form expressions for the probability density functions.An alternative is the use of Bayesian methods.However, the computational cost can be further exacerbated in assessing posterior summaries of interest. A Bayesian analysis of stable distributions is introduced by Buckle (1995) using Markov Chain Monte Carlo (MCMC) methods (also see Achcar, Lopes, Mazucheli & Linhares 2013).The use of Bayesian methods with MCMC simulation can have great flexibility by considering latent variables (see, for instance, Damien, Wakefield & Walker 1999, Tanner & Wong 1987), where samples of latent variables are simulated in each step of the Gibbs or Metropolis-Hastings algorithms.parameters α, β, δ and γ.This theorem establishes that a stable distribution for a random variable Z defined in R − {0}, is obtained as the marginal of a bivariate distribution for the random variable Z itself and an auxiliary random variable Y .This variable Y is defined in the interval (−0.5, a α,β ), when Z ∈ (−∞, 0), and in (a α,β , 0.5), when Z ∈ (0, ∞).The quantity a α,β is given by where b α,β = β π 2 min{α, 2 − α}.The joint probability density function for random variables Z and Y is given by (5) and Z = X−δ γ , for γ = 0 and α = 1.From the bivariate density (4), Buckle (1995) shows that the marginal distribution for the random variable Z is S α (β, 0, 1).Usually, the computational costs of obtain posterior summaries of interest using MCMC methods is high for this class of models, which could give some limitations for practical applications.Another problem could be the simulation algorithm convergence.In this paper, we propose the use of a software that is popular and free available to obtain the subsequently summaries of interest: the OpenBUGS software.
The paper is organized as follows: in Section 2, we present a Bayesian analysis for stable distributions; in Section 3 we introduce the use of stable distributions for linear regression models while non-linear regression models are considered in Section 4; in Section 5, we introduce some numerical illustrations; finally, in Section 6, we present some concluding remarks.

A Bayesian Analysis for General Stable Distributions
Let us assume that x i , for i = 1, . . ., n, is a random sample of size n, where 1).Assuming a prior joint distribution for α, β, δ and γ, given by π 0 (α, β, δ, γ), Buckle (1995) shows that the posterior joint distribution for the parameters α, β, δ and γ is given by where respectively, the observed and non-observed data vectors.Notice that the multivariate distribution in expression ( 6) is given in terms of x i and the latent variables y i , and not in terms of z i and y i (there is the Jacobian γ −1 , multiplied by the right-hand-side of expression ( 4)).
It can be observed that when α = 2, θ = 2 and b α,β = 0.In this case, we have a Gaussian distribution with mean δ and variance 2γ 2 .
Since we are using the OpenBUGS software to simulate samples from the joint posterior distribution we are not presenting all the full conditional distributions needed for the Gibbs sampling algorithm.This software only requires the data distribution and prior distributions for the interested random quantities.This gives great computational simplification to determine posterior summaries of interest as shown in the applications in Section 5.

Linear Regression Models Assuming Stable Distributions
Consider a random variable X related to a controlled variable V given by the linear relationship where • the random variable X i represents the response for the i-th unit associated with an experimental value of the independent or explanatory variable v, which is assumed to have a fixed value (a common regression model assumption).In this way, x i it is an observation of X i ; • the variables ε 1 , ε 2 , . . ., ε n are considered as components of unknown errors and are unobserved random variables.Assume that these random variables ε i , for i = 1, 2, . . ., n, are independent and identically distributed with normal distribution N (0, σ 2 ε ); • the parameters d 0 and d 1 are unknown.
From the above assumptions, we have normality for the responses, that is, In this way X i has a normal distribution with mean d 0 + d 1 v i and common variance σ 2 ε .Usually we get estimators for the regression parameters using the least squares approach or standard maximum likelihood methods (see, for instance, Draper & Smith 1981or Seber & Lee, 2003).
Standard generalization for the linear model ( 8) is given in the presence of k independent or explanatory variables, that is, a multiple linear regression model given by From the normality assumption for the error ε i in (10), the random variable X i has a normal distribution with mean In practical applications, we need to check if the above assumptions are satisfied.As such, we consider graphical approaches to verify if the model residuals satisfy the above assumptions.
In the presence of outliers or discordant observations, we could have a large impact on the estimators obtained for the regression model given by (10), which could invalidate the inferences obtained.In this situation, we could use nonparametric regression models or assume more robust probability distributions for the data.One possibility is to assume that the random variable X in ( 10) or ( 8) has a stable distribution S α (β, δ, γ).
If this is the case, we assume that the response is x i in the linear regression model ( 10), for i = 1, . . ., n, has a stable distribution , 0, 1), where the location parameter δ i of the stable distribution is related to the explanatory variables by a linear relation given by, Assuming a joint prior distribution for α, β, d and γ, where Buckle (1995) or Achcar, Achcar & Martinez (2013) show that the posterior joint distribution for parameters α, β, d and γ, is given by, where ) and y = (y 1 , y 2 , . . ., y n ) are respectively the observed and non-observed data vectors.

Non-Linear Growth Regression Models Assuming Stable Distributions
Growth curves are included in a class of nonlinear models widely used in biology to model different problems such as the population size or biomass (in population ecology and demography, for population growth analysis) or the individual body height or biomass (in physiology, for growth analysis of individuals).A growth curve is an empirical model of the evolution of a quantity over time.Growth curves are employed in many disciplines besides biology, particularly in statistics, subject in which there is a large amount of literature on this subject related to nonlinear models.Under a more probabilistic and mathematical statistics approach, growth curves are often modeled as being a continuous stochastic process.
Standard classical inference methods to obtain point or interval estimates for the parameters of growth curves are presented within the nonlinear modeling methodology.
Nonlinear regression methodology is similar to linear regression methodology; it is, a modeling approach to relate a response X to a vector of covariates, v = (v 1 , . . ., v k ) , where v denotes the transpose of the vector v. Different from linear models, nonlinear regression is characterized by the fact that the prediction equation depends nonlinearly on one or more unknown parameters.Different from the linear regression methodology , and often used for building a purely empirical model, nonlinear regression methodology is usually employed when there is some physical reason which implies that the relationship between the response and the predictors follows a particular functional form.
A nonlinear regression model has the general form, where X i are the responses, for i = 1, . . ., n; f is a known function of the covariate vector; v i = (v i1 , . . ., v ik ) is a vector of k covariates or independent variables; θ = (θ 1 , . . ., θ p ) is the vector of p parameters and ε i are random errors.The errors ε i are usually assumed to be uncorrelated and normally distributed with mean zero and constant variance.
The most popular criterion to estimate the p parameter vector θ in the nonlinear model ( 13) is to find estimates for the parameters (via nonlinear least squares) which minimize the sum of squared errors, given by, n i=1 , follow a normal distribution, then the least squares estimator for θ is also the maximum likelihood estimator.
Usually, nonlinear regression estimates must be computed by iterative procedures using optimization methods to minimize the sum of squared errors given by the expression (14).It is important to point out that the definition of nonlinearity is related to the unknown parameters and not to the relationship between the covariates and the response.As an example, X = β 0 + β 1 v + β 2 v 2 + ε is considered as a linear model (see, for instance, Bates & Watts 1988, Ratkowsky 1983, Seber & Wild 1989).
A popular iterative technique to find the least squares estimator of nonlinear models is the Gauss-Newton algorithm.The Gauss-Newton algorithm increments the working estimate θ at each iteration by an amount equal to the coefficients from the linear regression of the current residuals u i , defined as u i = x i − f (v i , θ), on the current gradient matrix X.
If the function f (•) in ( 13) is continuously differentiable in θ, then it can be linearized locally as where V 0 is the n × p gradient matrix with elements ∂f (v i , θ 0 )/∂θ j and θ 0 is a vector of initial values for the iterative procedure.This leads to the Gauss-Newton algorithm for estimating θ, where e is the vector of working residuals e If the errors ε i are independent and normally distributed N (0, σ 2 ε ), then the Gauss-Newton algorithm is an application of Fisher's method of scoring.This algorithm is implemented in many existing statistical softwares such as, R, Minitab (version 16) or SAS.
If X is of full column rank in a neighborhood of the least squares solution, then it can be shown that the Gauss-Newton algorithm converges to the solution from a sufficiently good starting value.Though, with practical applications, there is no guarantee, that the algorithm will converge from values further from the solution.Some improvement of the Gauss-Newton algorithm are given in the literature, such as the Levenberg-Marquart damping algorithm (see, for instance, Seber & Wild 1989).
Standard inferences for the parameters of nonlinear models are obtained from the asymptotical normality of the least squares estimators θ with mean θ and variance-covariance matrix σ 2 ε (V V) −1 , where the variance σ 2 ε is usually estimated by, It is important to point out that since most asymptotic inference for nonlinear regression models is based on the linear models analogy, and since this inference is only approximated as the actual model differs from a linear model; various measures of nonlinearity have been proposed in the literature to verify how good linear approximations are likely to be in each case.One class of measurement focuses on curvature (intrinsic curvatures) of the function f (•) and it is based on the size of the second derivatives of f (see, for instance, Bates and Watts, 1980).Another class of measurement is the intrinsic curvature defined by the residuals u i = x i − f (v i , θ) or parameters effect curvatures (see, for instance, Bates & Watts 1988).
In many applications, the systematic part of the response is known to be monotonic increasing in v, where v might represent time or dosage.Nonlinear regression models with this property are called growth models.The simplest growth model is the exponential growth model defined as where θ = (θ 1 , θ 2 ) .
There are several growth models introduced in the literature.In this paper we consider the following special cases All these different models could be considered to be model growth curves.Very often we have physical interpretations for the use of a particular model.In others we try different growth models to decide which one best fits the data.
In many situations the usual normality assumption for the errors in (13) will not be appropriate.For instance, this can be the case when we have discordant observations which could have an impact on the obtained inferences.In this way, we could assume more robust distributions for the data, like the stable distribution.
We point out that the use of asymptotical inference results may not be accurate depending on the sample sizes and on the intrinsic curvature of the function f (•) in ( 13).In those cases, we propose the use of stable distributions for the response X i in (13).
Let us assume that the response x i , for i = 1, . . ., n, in the nonlinear regression model ( 13) has a stable distribution , where the location parameter δ i of the stable distribution is related to the explanatory variables by a nonlinear relation given by, Assuming a joint prior distribution for α, β, d and γ, where d = (d 0 , d 1 , d 2 , . . ., d k ) given by π 0 (α, β, d, γ), Buckle (1995) shows that the joint posterior distribution for parameters α, β, d and γ, is given by the general form (12).

An Example of a Simple Linear Regression Model
In Table 1, we have a data set related to an industrial experiment, where x denotes the response and v denotes an explanatory variable associated with each response considering n = 15 observations (see Johnson & Bhattacharyya 1980).From a preliminary data analysis, we see that a linear regression model ( 8) is suitable for this data set.The estimated regression straight line obtained by minimum squares estimates and by using the software MINITAB (version 16) is given by x i = 0.603 + 0.0444 v i , where the regression parameter d 1 is statistically different from zero (p-value < 0.05).From standard residuals graphs we can verify that the required assumptions (residuals normality and homoscedastic variance are satisfied).
Under a Bayesian approach, Table 2 shows the posterior summaries of interest assuming the linear regression model defined by (8) with a stable distribution for the response.Using the OpenBugs software we assume the following prior distributions: α ∼ U(0, 2), β ∼ U(−1, 0), d 0 ∼ N (0, 1), d 1 ∼ N (0, 1) and γ ∼ U(0, 3).It can be observed that we are assuming approximately non-informative priors for the parameters of the model.This procedure will be used in the other Bayesian analysis.We further assume independence among the random quantities.We also assume a uniform U(−0.5, 0.5) distribution for the latent variable Y i , for i = 1, 2, . . ., 15.We simulated 800,000 Gibbs samples, with a "burn-in-sample" of 300,000 samples that were discarded to eliminate the effects of the initial values in the iterative simulation procedure.We took a final sample of size 1,000 (every 500 th sample chosen from the 500,000 samples).Gibbs sampling algorithm convergence was monitored from standard trace plots of the simulated samples.
In Table 2, we also have the sum of absolute values (SAV) for the differences between the observed and fitted mean values, given by, In Table 2, we also show the posterior summaries of the regression model ( 8) assuming a normal N (0, σ 2 ε ) distribution for the error and the following priors for the parameters of the model: d 0 ∼ N (0, 1), d 1 ∼ N (0, 1) and ζ = 1/σ 2 ε ∼ U(0, 3).In this case, we simulated 55,000 Gibbs samples taking a "burn-in-sample" of 5,000 using the OpenBUGS software.We take a final sample of size 1,000 (every 50 th sample chosen from the 50,000 samples).From results show in Table 2, we can observe similar results, assuming both normality or stable distribution for the data.In this case, we conclude that we do not need to assume a stable distribution for the data, since the results are very similar to the results obtained from the normality assumption for the errors.Besides, the computational cost of using stable distributions is very high (see Achcar, Achcar & Martinez 2013).Figure 1(a) shows the plots of observed and fitted means considering both models versus samples.The graphs in Figure 1(a), manifest similar fit for both models (linear regression model assuming both normality and stable distributions).
Revista Colombiana de Estadística 39 (2016) 109- 128It can be observed that we have SAV = 4.496 when we assume a stable distribution, and SAV = 4.504 when we assume a normal distribution.Hence, we have very close results.
Let us now consider the presence of an outlier or discordant response (considered as a measurement error) by replacing the 15 th response (equal to 2.8) in Table 1 by the value 8.0.Table 3, presents the posterior summaries that were obtained assuming the same priors and the same simulation procedure as in Table 2. Figure 1(b), show the plots of observed and fitted means considering both models versus samples in the presence of one outlier.This figure, demonstrates that the model with a stable distribution is very robust in terms of the presence of the outlier, given similar inference results that were obtained without its presence (see results in Table 2).Table 3 also reports the estimated regression parameters with normal error and we can observe how strongly the results are affected by the presence of this outlier.It can be seen that we have SAV = 4.623 assuming a stable distribution (a value very close to the SAV values given in Table 2, without the presence of an outlier) and SAV = 6.394 assuming a normal distribution.

An Example With a Growth Non-Linear Model
In Table 4, shows a data set, reported by Bache, Serum, Youngs & Lisk (1972), which is related to the concentration of polychlorinated biphenyl (PCB) residues in a series of lake trout from Cayuga Lake, NY, USA.The ages of the fishes were accurately known, since they were annually stocked as yearlings and distinctly marked as to the year class.Every fish was mechanically chopped, ground, thoroughly mixed, and 5-gram samples were taken.The samples were treated and PCB residues in parts per million (ppm) were estimated using column chromatography.From a preliminary data analysis, we can see that a nonlinear regression model ( 13) is suitable for the data set both in the original and transformed scales (see Figure 2).We assume standard classical approach for nonlinear models considering the eigth growth models introduced in Section 4. The responses are in the logarithmic scale log(PCB), and we use the software MINITAB (version 16).Table 5 presents the estimation results for each assumed growth model, and we observe that for some growth models we need a large number of iterations for the Gauss-Newton algorithm convergence.q q q q q q q q q q q q q q q q q q q q q q q q q q q q 2 4 6 8 10 12 0 5 10 15 20 25 30 Age PCB q q q q q q q q q q q q q q q q q q q q q q q q q q q q 2 4 6 8 10 12  Figure 3, shows the observed log(PCB) plots and fitted models versus the observations.From these plots, we can see that the most of the assumed models give a reasonable fit for the data (models 1, 2, 4, 5, 6 and 7 given in Table 5).Only models 3 and 8 are not suitable for the data.It is important to note that the sum of absolute values for the differences (observed -estimated mean) for models 1, 2, 4, 5, 6 and 7 are given, respectively, by 11.4412, 11.1411, 12.0196, 11.0047, 11.0209 and 11.0310.That is, those values are very similar to each other, and indicate a similar fit.Considering models 3 and 8, the values for these differences are given, respectively by, 15.5568 and 23.4592 (also see Figure 3).We can observe that the obtained classical estimates shown in Table 5, could be very unstable depending on the initial values used in the iterative algorithm.It is usually difficult to obtain good initial empirical values to be used in the iterative procedure.We also point out that the estimates standard errors (given in parenthesis in the right panel of Table 5) could be very large as is observed in this table.
Using a Bayesian approach, Table 6, shows the posterior summaries of interest considering the eight growth models 1-8 introduced in Section 4 with a normal distribution for the error (see expression ( 13)) and the OpenBUGS software assuming the following prior distributions: θ j ∼ U(0, 1), j = 1, 2, 3, 4 and ζ = 1/σ 2 ∼ G (1,1), where G(a, b) denotes a gamma distribution with mean a/b and variance a/b 2 .We simulated 10,000 Gibbs samples, with a "burn-in-sample" of 1,000 samples that were discarded to eliminate the effects of the initial values in the iterative simulation procedure and we took a final sample of 900 (every 10 th sample choosen from the 9,000 samples).Gibbs sampling algorithm convergence was monitored from the standard trace plots of the simulated samples.From the results shown in Table 6, we observe that the posterior standard deviation for each estimated parameter of the growth model is very small.It is important to point out that these posterior summaries obtained using MCMC methods are very accurate and do not depend on approximations as they do when using the classical approach.Figure 4 shows observe that we have a good fit for some of the data models.To decide on the best statistical model we could also use the selection Bayesian Deviance Information Criterion (DIC) introduced by Spiegelhalter, Best, Carlin & van der Linde (2002).This criterion is especially useful in problems in which samples of the posterior distribution for the parameters of the model have been simulated using Markov Chain Monte Carlo (MCMC) methods.
Define the deviation as where θ is the vector of unknown parameters in the model, L(θ) is the likelihood function and C is a constant that does not need to be known in the models comparison.The DIC criterion is given by Overall, model 5 (Gompertz) gives the smallest value for the sum of absolute values for the differences (observed -estimated mean) and smaller DIC (49.47) value.We could consider this model to be the best fitted model.
We now assume a stable distribution of the responses considering the Gompertz growth curve model (model 5) introduced in Section 4. Using a Bayesian approach, Table 7 shows the posterior summaries of interest assuming the nonlinear regression model 5.It considers the regression model (13) for the location parameter of the stable distribution given by, In the Bayesian analysis for this model, we use the OpenBugs software assuming the following prior distributions: α ∼ U(0, 2), β ∼ U(−1, 0), d j ∼ N (0, 1), j = Figure 5, shows the plot of the posterior means assuming a Gompertz growth model (model 5) with a normal error against the observations, together with the plot of the location posterior against observations assuming a stable distribution.We observe similar results.We also observe that the sum of absolute values for the differences (observed -estimated mean) for model 5 with a stable distribution is given by 12.1437, that is, similar to the obtained value assuming normal errors (11.0915).Let us now consider the presence of one outlier or discordant response (considered as a measurement error) replacing the 5 th PCB response (2.0) value in Table 4 by the value 200.0.This high value is necessary as otherwise the effect can not be observed.Table 8, shows the obtained posterior summaries assuming the same priors and same simulation procedure as in Table 7. Figure 6, shows the plots of observed and fitted means considering both models versus samples.From the graphs in Figure 6, we can see that the model with a stable distribution is very   robust in term of the presence of the outlier, given similar inference results that are obtained without this outlier (see results in Table 7).Table 8 also shows that the estimated regression parameters with normal error are strongly affected by the presence of this outlier.It can be seen that we have sum of absolute values for the differences (observed -estimated mean) equals to 11.8064 and assuming a stable distribution (a value very close to the respective one without the presence of an outlier), while the sum of absolute values for the differences (observed -estimated mean) is equal to 16.1205 assuming a normal distribution.

Some Concluding Remarks
The use of stable distributions could be a good alternative, especially using Bayesian methods with MCMC simulation technique.This class of distributions is very robust in the presence of discordant observations.The presence of outliers or discordant observations, is often, due to measurement errors very common in practical applications of linear or nonlinear regression analysis.In the presence of these discordant observations, the usual obtained classical inferences on the regression parameters or in the predictions under the usually assumption of errors normality and homoscedastic variance could be greatly affected, which could imply in wrong inference results.In this way, the use of stable distributions could be a good alternative, since this distribution has a great flexibility to fit for the data.With the use of Bayesian methods and MCMC simulation algorithms it is possible to obtain inferences for the model despite the nonexistence of an analytical form for the density function, as was shown in this work.It is important to point out that the computational effort in the sample simulations for the joint posterior distribution of interest can be largely simplified using standard free available software like the OpenBUGS software.
In the illustrative examples that have been shown, we observed that the use of data augmentation techniques (see, for instance, Damien et al. 1999) is the key to obtaining a good performance for the MCMC simulation method for applications using stable distributions.
We emphasize that the use of OpenBUGS software does not require large computational time to obtain posterior summaries of interest, even when the simulation of a large number of Gibbs samples is needed to achieve algorithm convergence.These results could be of great interest for researchers and practitioners when dealing with non Gaussian data, as in the applications presented here.

Figure 1 :
Figure 1: Observed and fitted means values, considering both models: (a) samples; (b) presence of one outlier.

Figure 2 :
Figure 2: Observed values against age for PCB and log(PCB).

Figure 3 :
Figure 3: Observed log(PCB) values against estimated means considering the eight fitted models.

Figure 4 :
Figure 4: Log(PCB) values that were observed against posterior means considering eight fitted models.
where D( θ) is the deviation evaluated at the posterior mean θ = E(θ|data) and n D is the effective number of parameters of the model, given by n D = D − D( θ), with D = E(D(θ)|data) being the posterior deviation that measures the quality of the data fit for the model.Smaller values of DIC indicate better models.It should be noted that these values can be negative.

Figure 5 :
Figure 5: Log(PCB) values that were against posterior means considering the Gompertz growth model, with Normal error and assuming a stable distribution.

Figure 6 :
Figure 6: Log(PCB) values that were against posterior means considering the Gompertz growth model with Normal error, assuming a stable distribution (presence of one outlier).

Table 1 :
An industrial experiment data set.

Table 4 :
Lake trout data set.

Table 5 :
Classical estimates for the growth models.

Table 6 :
Bayesian estimates for the growth models assuming a normal error.

Table 8 :
Bayesian estimates for the Gompertz growth model assuming a stable distribution for the responses (presence of one outlier).