Robust mixture regression based on the skew t distribution

In this study, we propose a robust mixture regression procedure based on the skew t distribution to model heavy-tailed and/or skewed errors in a mixture regression setting. Using the scale mixture representation of the skew t distribution, we give an Expectation-Maximization (EM) algorithm to compute the maximum likelihood (ML) estimates of the parameters of interest. The performance of the proposed estimators is demonstrated by a simulation study and a real data example.


Introduction
Mixture regression models are used to investigate the relationship between variables which come from some unknown latent groups. These models were first introduced by Quandt (1972) and Quandt and Ramsey (1978) as switching regression models, and they are widely used in areas such as engineering, genetics, biology, econometrics and marketing. The parameter estimation of a mixture regression model is usually based on the normality assumption of the error terms. It is well known that the estimators based on the normality assumption perform well when the error distribution is normal, but they are very sensitive to departures from normality (outliers, heavy-tailedness, skewness). To deal with such departures, robust mixture regression procedures have been proposed. Some of these works can be summarized as follows. Markatou (2000) and Shen et al. (2004) used a weight function to estimate the parameters robustly in mixture regression models. Bashir and Carter (2012) used the S-estimation method for the mixture linear regression model. Bai (2010) and Bai et al. (2012) proposed a robust estimation procedure based on M-regression estimation to estimate the parameters of the mixture linear regression model. Wei (2012) and Yao et al. (2014) explored the mixture regression model based on the t distribution, which is an extension of the mixture of t distributions studied by Peel and McLachlan (2000). Further, Zhang (2013) studied the robust mixture regression model using the Pearson Type VII distribution, and Song et al. (2014) proposed a robust estimation procedure for mixture regression models using the Laplace distribution as the error distribution. As they point out, the robust mixture regression estimation procedure based on the Laplace distribution can be regarded as the application of least absolute deviation (LAD) regression estimation to mixture regression models.
Liu and Lin (2014) proposed a mixture regression model based on the skew normal distribution. Also, Pereira et al. (2012) studied the performance of the estimation procedure for mixtures of skew normal distributions.
In this paper, we propose a robust mixture regression procedure based on the skew t distribution to deal efficiently with heavy-tailedness and skewness in the mixture regression model setting. This is an extension of the mixture of skew t distributions proposed by Lin et al. (2007) to mixture regression models. We use the skew t distribution that results from the scale mixture of the skew normal distribution, introduced by Gupta et al. (2002), Gupta (2003) and Azzalini and Capitanio (2003). The scale mixture representation of the skew t distribution makes it easy to implement an EM algorithm to obtain the ML estimates of the parameters of interest in the mixture regression model. See also the works by Doğru and Arslan (2014) and Doğru (2015) on the mixture regression model based on the skew t distribution.
The paper is organized as follows. In Section 2, we give the basic definition of the mixture regression model. In Section 3, we present the robust mixture regression results based on the skew t distribution. In Sections 4 and 5, we give a simulation study and a real data example to compare the performance of the proposed estimation procedure with the estimation procedures obtained from the normal, t and skew normal (Liu and Lin (2014)) distributions. The paper ends with a conclusion section.

Mixture regression model
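The model setting for a general mixture of linear regression models can be formulated as follows. Let $\mathbf{x}$ be a $p$-dimensional vector of explanatory variables, $y$ be the response variable and $z$ be a latent class variable independent of $\mathbf{x}$. Suppose that, given $z = i$, the response variable $y$ depends on the explanatory variable $\mathbf{x}$ in a linear way
$$y = \mathbf{x}'\boldsymbol{\beta}_i + \epsilon_i, \quad i = 1, 2, \ldots, g, \qquad (1)$$
where $\boldsymbol{\beta}_i = (\beta_{i1}, \ldots, \beta_{ip})'$ is the unknown vector of regression parameters and $g$ is the number of components in the mixture regression model. The random errors $\epsilon_i$ and $\mathbf{x}$ are assumed to be independent. In the literature, it is often assumed that the random errors $\epsilon_i$ have distributions from the location-scale family with zero means and scale parameters $\sigma_i^2$. Suppose that $\pi_i = P(z = i \mid \mathbf{x})$, $i = 1, \ldots, g$, denote the mixing probabilities with $\sum_{i=1}^{g} \pi_i = 1$. Then the conditional density function of $y$ given $\mathbf{x}$ has the form
$$f(y; \mathbf{x}, \boldsymbol{\Theta}) = \sum_{i=1}^{g} \pi_i f_i(y; \mathbf{x}'\boldsymbol{\beta}_i, \sigma_i^2, \nu_i), \qquad (2)$$
where $f_i(\cdot)$ is the density function of the $i$th component with some shape parameters $\nu_i$ (e.g. degrees of freedom for the t distribution) and $\boldsymbol{\Theta} = (\pi_1, \ldots, \pi_g, \boldsymbol{\beta}_1', \ldots, \boldsymbol{\beta}_g', \sigma_1^2, \ldots, \sigma_g^2, \nu_1, \ldots, \nu_g)'$ is the unknown parameter vector. This model is called a g-component mixture regression model.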
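The ML estimation method is used to estimate the unknown parameter vector $\boldsymbol{\Theta}$ in model (2). Let $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ be a given sample. Then the ML estimate is obtained by maximizing the following log-likelihood function with respect to $\boldsymbol{\Theta}$:
$$\ell(\boldsymbol{\Theta}) = \sum_{j=1}^{n} \log\left( \sum_{i=1}^{g} \pi_i f_i(y_j; \mathbf{x}_j'\boldsymbol{\beta}_i, \sigma_i^2, \nu_i) \right).$$
However, it should be noted that the ML estimators cannot be obtained explicitly. The EM algorithm (Dempster et al., 1977) is used to find the ML estimates.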
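As a minimal sketch of how the mixture density (2) and the log-likelihood can be evaluated, the following Python code assumes normal component densities (the classical case, not the skew t case developed later). The function names and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math


def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(y, x, pis, betas, sigmas):
    """Mixture regression density at (x, y), assuming normal components."""
    total = 0.0
    for pi_i, beta_i, sigma_i in zip(pis, betas, sigmas):
        mean_i = sum(b * xk for b, xk in zip(beta_i, x))  # x' beta_i
        total += pi_i * normal_pdf(y, mean_i, sigma_i)
    return total


def log_likelihood(xs, ys, pis, betas, sigmas):
    """Sum over observations of the log mixture density."""
    return sum(
        math.log(mixture_density(y, x, pis, betas, sigmas))
        for x, y in zip(xs, ys)
    )


# Toy data (made up): each x is (1, covariate), so beta = (intercept, slope).
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
ys = [0.1, 1.9, -2.2]
toy_loglik = log_likelihood(
    xs, ys, pis=[0.5, 0.5], betas=[(0.0, 1.0), (0.0, -1.0)], sigmas=[1.0, 1.0]
)
```

Because the logarithm of a sum does not separate over components, this function has no closed-form maximizer, which is why an EM algorithm is used.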

Robust mixture regression based on the skew t distribution
In this section, we use the skew t distribution to model possibly skewed and heavy-tailed errors in the mixture regression model. By doing so, we obtain more robust estimators for the parameters of the mixture regression model. We use the Azzalini type skew t distribution (Azzalini and Capitanio 2003, Gupta et al. 2002, Gupta 2003) with the following density function
$$f_{ST}(y; \mu, \sigma^2, \lambda, \nu) = \frac{2}{\sigma}\, t_{\nu}(\eta)\, T_{\nu+1}\left( \lambda \eta \sqrt{\frac{\nu + 1}{\nu + \eta^2}} \right), \quad \eta = \frac{y - \mu}{\sigma},$$
where $\lambda$ is the skewness parameter, $t_{\nu}(\cdot)$ is the probability density function (pdf) of the t distribution with $\nu$ degrees of freedom and $T_{\nu+1}(\cdot)$ is the cumulative distribution function (cdf) of the t distribution with $\nu + 1$ degrees of freedom.
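The density can be sketched numerically as follows. This sketch assumes the standard Azzalini-type form, (2/sigma) t_nu(eta) T_{nu+1}(lambda eta sqrt((nu+1)/(nu+eta^2))) with eta = (y - mu)/sigma. It uses only the Python standard library, so the t cdf is approximated by Simpson integration; the function names and the `steps` parameter are illustrative assumptions.

```python
import math


def t_pdf(x, nu):
    """Student t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, steps=2000):
    """Student t cdf: 0.5 plus the Simpson integral of the pdf from 0 to x.

    steps must be even; h is signed, so negative x is handled as well.
    """
    if x == 0.0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, nu) + t_pdf(x, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 + total * h / 3.0


def skew_t_pdf(y, mu, sigma, lam, nu, steps=2000):
    """Azzalini-type skew t density (sigma is the scale, not the variance)."""
    eta = (y - mu) / sigma
    arg = lam * eta * math.sqrt((nu + 1) / (nu + eta * eta))
    return 2.0 / sigma * t_pdf(eta, nu) * t_cdf(arg, nu + 1, steps)


# With lam = 0 the density reduces to the symmetric t density.
symmetric_value = skew_t_pdf(0.7, 0.0, 2.0, 0.0, 4.0)
skewed_value = skew_t_pdf(0.7, 0.0, 2.0, 3.0, 4.0)
```

The argument passed to the cdf is bounded by |lambda| sqrt(nu + 1), so a fixed number of Simpson steps is adequate for moderate skewness values.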
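In the mixture regression model (2), assume that the errors $\epsilon_i$ have a skew t distribution with zero location, and with $\sigma_i^2$, $\lambda_i$ and $\nu_i$ as the scale, skewness and degrees of freedom parameters, respectively. Contrary to the symmetric case, the mean $E(\epsilon_i) \neq 0$. For the skew t distribution, $E(\epsilon_i)$ only affects the intercept. Thus, when we estimate the intercept we take this into account and correct the estimated intercept by using the estimate of $E(\epsilon_i)$. In order to estimate the unknown parameters we should maximize the following log-likelihood function
$$\ell(\boldsymbol{\Theta}) = \sum_{j=1}^{n} \log\left( \sum_{i=1}^{g} \pi_i f_{ST}(y_j; \mathbf{x}_j'\boldsymbol{\beta}_i, \sigma_i^2, \lambda_i, \nu_i) \right),$$
where $f_{ST}(\cdot)$ is the skew t density and $\boldsymbol{\Theta} = (\pi_1, \ldots, \pi_g, \boldsymbol{\beta}_1', \ldots, \boldsymbol{\beta}_g', \sigma_1^2, \ldots, \sigma_g^2, \lambda_1, \ldots, \lambda_g, \nu_1, \ldots, \nu_g)'$. However, the maximizer of this log-likelihood function cannot be obtained explicitly, so an EM-type algorithm should be used to estimate the unknown parameters.

The EM algorithm can be implemented as follows. Let $z_{ij}$ be the latent variables such that
$$z_{ij} = \begin{cases} 1, & \text{if the $j$th observation is from the $i$th component}, \\ 0, & \text{otherwise}, \end{cases}$$
where $i = 1, \ldots, g$ and $j = 1, \ldots, n$. To simplify the EM algorithm we use the stochastic representation of a skew t distributed random variable given by Azzalini and Capitanio (2003) (see Appendix for more details). This stochastic representation yields the following hierarchical formulation in terms of the conditional distributions
$$y_j \mid \gamma_j, \tau_j, z_{ij} = 1 \sim N\left( \mathbf{x}_j'\boldsymbol{\beta}_i + \delta_{\lambda_i} \gamma_j,\ \frac{(1 - \delta_{\lambda_i}^2)\, \sigma_i^2}{\tau_j} \right),$$
$$\gamma_j \mid \tau_j, z_{ij} = 1 \sim TN\left( 0, \frac{\sigma_i^2}{\tau_j}; (0, \infty) \right),$$
$$\tau_j \mid z_{ij} = 1 \sim \text{Gamma}\left( \frac{\nu_i}{2}, \frac{\nu_i}{2} \right),$$
where $TN(\cdot)$ denotes the truncated normal distribution, and $\delta_{\lambda_i} = \lambda_i / \sqrt{1 + \lambda_i^2}$. Then, regarding $\mathbf{z}$, $\boldsymbol{\gamma}$ and $\boldsymbol{\tau}$ as missing data, the complete data log-likelihood function for $\boldsymbol{\Theta}$ given $(\mathbf{y}, \mathbf{z}, \boldsymbol{\gamma}, \boldsymbol{\tau})$ can be written as
$$\ell_c(\boldsymbol{\Theta}) = \sum_{j=1}^{n} \sum_{i=1}^{g} z_{ij} \left[ \log \pi_i + \log \phi\left( y_j; \mathbf{x}_j'\boldsymbol{\beta}_i + \delta_{\lambda_i} \gamma_j, \frac{(1 - \delta_{\lambda_i}^2)\, \sigma_i^2}{\tau_j} \right) + \log f_{TN}\left( \gamma_j; 0, \frac{\sigma_i^2}{\tau_j} \right) + \log f_{G}\left( \tau_j; \frac{\nu_i}{2}, \frac{\nu_i}{2} \right) \right],$$
where $\phi$, $f_{TN}$ and $f_G$ denote the normal, truncated normal and gamma densities of the hierarchical formulation.

Further, based on the theory of the EM algorithm, the conditional expectation of the complete data log-likelihood function given the observed data and the current parameter estimate $\hat{\boldsymbol{\Theta}}^{(k)}$ should be calculated. That is, we have to find the conditional expectation
$$Q(\boldsymbol{\Theta} \mid \hat{\boldsymbol{\Theta}}^{(k)}) = E\left( \ell_c(\boldsymbol{\Theta}) \mid \mathbf{y}, \hat{\boldsymbol{\Theta}}^{(k)} \right).$$
To get this conditional expectation the following expectations should be obtained: $E(z_{ij} \mid y_j, \hat{\boldsymbol{\Theta}}^{(k)})$, $E(\tau_j \mid y_j, \hat{\boldsymbol{\Theta}}^{(k)})$, $E(\gamma_j \tau_j \mid y_j, \hat{\boldsymbol{\Theta}}^{(k)})$, $E(\gamma_j^2 \tau_j \mid y_j, \hat{\boldsymbol{\Theta}}^{(k)})$ and $E(\log(\tau_j) \mid y_j, \hat{\boldsymbol{\Theta}}^{(k)})$. After some straightforward algebra, expressions for these expectations are obtained (see Appendix). Then, the EM algorithm to obtain the parameter estimates for the mixture regression model based on the skew t distribution can be given as follows.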
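1. Set an initial parameter estimate $\hat{\boldsymbol{\Theta}}^{(0)}$ and a tolerance for the convergence criterion.
2. E step: To carry out the E step, we have to find the conditional expectation of the complete data log-likelihood function given the current parameter values $\hat{\boldsymbol{\Theta}}^{(k)}$. This can be done by computing the five conditional expectations listed above at $\hat{\boldsymbol{\Theta}}^{(k)}$ for $i = 1, \ldots, g$ and $j = 1, \ldots, n$. After finding these conditional expectations we get the objective function $Q(\boldsymbol{\Theta} \mid \hat{\boldsymbol{\Theta}}^{(k)})$ to be maximized at the M step of the EM algorithm.
3. M step 1: Maximize $Q(\boldsymbol{\Theta} \mid \hat{\boldsymbol{\Theta}}^{(k)})$ with respect to a first block of the unknown parameters, holding the remaining parameters fixed at their current values, to obtain the $(k+1)$th values for that block.
4. M step 2: Using the new values gained in M step 1, solve the corresponding estimating equations to obtain new estimates for the remaining parameters.
5. Repeat the E and M steps until the convergence criterion is satisfied, that is, until $\| \hat{\boldsymbol{\Theta}}^{(k+1)} - \hat{\boldsymbol{\Theta}}^{(k)} \|$ falls below the prespecified tolerance.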
Note that, to simplify the computation of the conditional expectations in the E step, we use an approximate estimate in the simulation study and the real data example.
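To illustrate the E step and M step cycle, the following sketch implements an EM algorithm for the simpler normal-error mixture of simple linear regressions (the MixregN baseline), not the skew t algorithm of this section. The skew t version adds the conditional expectations of the latent scale variables and the updates for the skewness and degrees of freedom. All names, starting values and the simulated data are illustrative assumptions.

```python
import math
import random


def normal_pdf(y, mean, sigma):
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def em_normal_mixreg(xs, ys, pis, betas, sigmas, tol=1e-6, max_iter=500):
    """EM for a g-component mixture of simple linear regressions."""
    n, g = len(xs), len(pis)
    for _ in range(max_iter):
        # E step: posterior probability that observation j is from component i.
        resp = []
        for x, y in zip(xs, ys):
            dens = [pis[i] * normal_pdf(y, betas[i][0] + betas[i][1] * x, sigmas[i])
                    for i in range(g)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: weighted updates of mixing proportions, lines and scales.
        new_pis, new_betas, new_sigmas = [], [], []
        for i in range(g):
            ws = [r[i] for r in resp]
            b0, b1 = weighted_line_fit(xs, ys, ws)
            rss = sum(w * (y - b0 - b1 * x) ** 2 for w, x, y in zip(ws, xs, ys))
            new_pis.append(sum(ws) / n)
            new_betas.append((b0, b1))
            new_sigmas.append(math.sqrt(rss / sum(ws)))
        # Stop when the largest change in a regression coefficient is below tol.
        change = max(abs(a - b)
                     for nb, ob in zip(new_betas, betas) for a, b in zip(nb, ob))
        pis, betas, sigmas = new_pis, new_betas, new_sigmas
        if change < tol:
            break
    return pis, betas, sigmas


# Simulated two-component data (hypothetical coefficients, not the paper's design).
rng = random.Random(0)
xs, ys = [], []
for _ in range(400):
    x = rng.gauss(0.0, 1.0)
    if rng.random() < 0.5:
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 0.5)
    else:
        y = -1.0 - 2.0 * x + rng.gauss(0.0, 0.5)
    xs.append(x)
    ys.append(y)

pis_hat, betas_hat, sigmas_hat = em_normal_mixreg(
    xs, ys, [0.5, 0.5], [(0.5, 1.0), (-0.5, -1.0)], [1.0, 1.0]
)
```

The sketch stops on the change in the regression coefficients only, which is a simplification of a norm over the full parameter vector.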

Simulation Study
In this section, we give a simulation study to show the performance of the proposed estimator obtained from the skew t distribution (MixregST), and we compare it with the estimators obtained from the normal (MixregN), t (Mixregt) and skew normal (MixregSN) distributions in terms of bias and mean square error (MSE).
We generate the data $\{(\mathbf{x}_j, y_j),\ j = 1, \ldots, n\}$ from the two-component mixture regression model used by Bai et al. (2012). We take the following error distributions:
Case I: $\epsilon \sim N(0, 1)$, the standard normal distribution.
Case II: $\epsilon \sim t_3$, the t distribution with 3 degrees of freedom.
Case III: $\epsilon$ follows a contaminated normal distribution.
Case IV: $\epsilon$ follows a skew t distribution.
Case V: $\epsilon \sim N(0, 1)$, the standard normal distribution with outliers.
We use Case I to compare the estimators with the traditional MLE (MixregN) when the error terms are normally distributed and there are no outliers. Case II is an example of a heavy-tailed error distribution. The distribution in Case III is used to create outliers; it is often considered in the literature as an outlier model. Case IV examines the behavior of the estimators when the error term is both skewed and heavy-tailed. Case V tests how well the estimators deal with high leverage points; in this case, a fraction of the observations is replaced by high leverage points. The simulation study is run for two sample sizes with a fixed number of replicates. The simulation study and the real data example are conducted using MATLAB 2013a.
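The two comparison criteria, bias and MSE over replicates, can be sketched as below, together with a draw from a contaminated normal error model of the kind used to create outliers. The contamination proportion `eps` and the standard deviation `wide_sd` are placeholder assumptions, as are the function names; they are not the settings of this study.

```python
import random


def contaminated_normal(rng, eps=0.05, wide_sd=5.0):
    """Draw from (1 - eps) N(0, 1) + eps N(0, wide_sd^2).

    eps and wide_sd are hypothetical placeholder values.
    """
    sd = wide_sd if rng.random() < eps else 1.0
    return rng.gauss(0.0, sd)


def bias_and_mse(estimates, true_value):
    """Bias and mean square error of replicated estimates of one parameter."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, mse


# Toy replication study: the sample mean as an estimator of the location 0
# under contaminated normal errors (200 replicates of size 100).
rng = random.Random(1)
replicate_means = []
for _ in range(200):
    sample = [contaminated_normal(rng) for _ in range(100)]
    replicate_means.append(sum(sample) / len(sample))

toy_bias, toy_mse = bias_and_mse(replicate_means, 0.0)
```

In a study like the one described here, `estimates` would hold the replicated estimates of one regression coefficient from one of the fitting procedures.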

Real Data Example
In this section we analyze the tone perception data set (Cohen (1984)) to further illustrate the performance of the mixture regression estimates based on the skew t distribution on a real data set. In the tone perception experiment of Cohen (1984), a pure fundamental tone was played to a trained musician. Electronically generated overtones, determined by a stretching ratio, were added. This ratio is between the adjusted tone and the fundamental tone. In the experiment, 150 trials were performed by the same musician. The aim of this experiment was to find out how the tuning ratio affects the perception of the tone and to decide whether either of two musical perception theories was reasonable (see Cohen (1984) for more detail). This data set has also been analyzed by Yao et al. (2014) and Song et al. (2014) to test the performance of the mixture regression estimates based on the t and Laplace distributions, respectively. Figure 1 shows the scatter plot and the histogram of the perceived tone ratio. These plots clearly show two groups in the data and also indicate non-normality. We use this data set to compare the performance of the estimators with and without outliers. Figure 2 presents the scatter plots with the fitted regression lines obtained from the MixregN, Mixregt, MixregSN and MixregST procedures for the tone perception data set. We summarize the ML estimates and some information criteria in Table 3. Note that in the real data example we assume that the degrees of freedom takes the same fixed value in both groups. We tried other values of the degrees of freedom and obtained similar results. We observe that MixregST gives a better fit than the other mixture regression models in terms of the Akaike information criterion (AIC) (Akaike (1973)) and the Bayesian information criterion (BIC) (Schwarz (1978)) values.

Next, we add ten pairs of outliers to the data set. These outliers can be considered as high leverage points. By adding these points we would like to see the performance of the estimators against high leverage points. Figure 3 displays the scatter plots of the data set with the fitted regression lines obtained from the MixregN, Mixregt, MixregSN and MixregST procedures. We give the ML estimation results in Table 4. We see that MixregN and MixregSN are drastically affected by the high leverage points. On the other hand, the estimators based on the t and the skew t distributions (Mixregt and MixregST) fit the majority of the data without being influenced by the high leverage points. Also, MixregST gives the best results in terms of the information criteria. Note that the estimates, including the estimates of the skewness parameters, are very similar with and without outliers (see Tables 3 and 4).
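The two information criteria used for the comparison follow the usual definitions, AIC = -2 log L + 2k and BIC = -2 log L + k log n, where k is the number of free parameters and n the sample size. The sketch below uses made-up log-likelihood values and parameter counts purely for illustration; they are not the values reported in Tables 3 and 4.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical fitted models: name -> (maximized log-likelihood, free parameters).
fits = {
    "model_a": (-100.0, 7),
    "model_b": (-90.0, 9),
}
n_obs = 150  # number of trials in the tone perception data

criteria = {
    name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in fits.items()
}
best_by_bic = min(criteria, key=lambda name: criteria[name][1])
```

BIC penalizes extra parameters more heavily than AIC once n exceeds about 7, so a model with additional shape parameters has to improve the log-likelihood enough to offset the penalty.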

Conclusions
In this paper we have proposed a robust mixture regression procedure based on the skew t distribution. We have given an EM algorithm to compute the proposed estimators for the mixture regression model. We have carried out a simulation study to compare the performance of the estimators based on the skew t distribution with that of the estimators obtained from the normal, the t and the skew normal distributions. The simulation results confirm that, when heavy-tailedness and skewness are present, the proposed estimators behave better than their counterparts. We have also given a real data example to further illustrate the capability of the proposed estimators in dealing with outliers and/or high leverage points in the data. Likewise, for the real data our proposed estimators show superiority over the estimators based on the normal, t and skew normal distributions.
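
Appendix
If a random variable $Y$ has the skew t distribution ($Y \sim ST(\mu, \sigma^2, \lambda, \nu)$) with location parameter $\mu$, scale parameter $\sigma^2 \in (0, \infty)$, skewness parameter $\lambda$ and degrees of freedom $\nu$, it has the following stochastic representation (Azzalini and Capitanio (2003))
$$Y = \mu + \sigma \frac{Z}{\sqrt{\tau}},$$
where $Z \sim SN(\lambda)$ and $\tau \sim \text{Gamma}(\nu/2, \nu/2)$ are independent and $SN(\cdot)$ denotes the skew normal distribution. Further, $Z$ has the following stochastic representation, which was given by Azzalini (1986, p.201) and Henze (1986, Theorem 1):
$$Z = \delta_{\lambda} |u_1| + \sqrt{1 - \delta_{\lambda}^2}\, u_2, \quad \delta_{\lambda} = \frac{\lambda}{\sqrt{1 + \lambda^2}},$$
where $u_1$ and $u_2$ are independent standard normal random variables, so that $|u_1|$ has a truncated normal distribution. Setting $\gamma = \sigma |u_1| / \sqrt{\tau}$, this stochastic representation can be used to get the following conditional distributions:
$$Y \mid \gamma, \tau \sim N\left( \mu + \delta_{\lambda} \gamma,\ \frac{(1 - \delta_{\lambda}^2)\, \sigma^2}{\tau} \right), \quad \gamma \mid \tau \sim TN\left( 0, \frac{\sigma^2}{\tau}; (0, \infty) \right), \quad \tau \sim \text{Gamma}\left( \frac{\nu}{2}, \frac{\nu}{2} \right).$$
These conditional distributions help us to carry out the steps of the EM algorithm. By Proposition 2 of Lin et al. (2007), the conditional expectations of $\tau$, $\gamma \tau$, $\gamma^2 \tau$ and $\log(\tau)$ given $y$ can be obtained. These conditional expectations are used in the EM algorithm given in Section 3.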
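The stochastic representation can be checked by simulation. The sketch below draws skew t variates as mu + sigma * Z / sqrt(tau), with Z = delta * |u1| + sqrt(1 - delta^2) * u2 and tau gamma distributed with shape nu/2 and rate nu/2, and compares the sample mean with the known mean of the skew t distribution, mu + sigma * delta * sqrt(nu/pi) * Gamma((nu-1)/2) / Gamma(nu/2) for nu > 1. The function names and the parameter values are illustrative assumptions.

```python
import math
import random


def delta(lam):
    """delta = lambda / sqrt(1 + lambda^2)."""
    return lam / math.sqrt(1.0 + lam * lam)


def sample_skew_t(rng, mu, sigma, lam, nu):
    """One draw from the skew t distribution via its stochastic representation."""
    d = delta(lam)
    u1 = abs(rng.gauss(0.0, 1.0))  # half-normal (truncated normal) part
    u2 = rng.gauss(0.0, 1.0)
    z = d * u1 + math.sqrt(1.0 - d * d) * u2  # skew normal variate
    # gammavariate takes (shape, scale); scale 2/nu corresponds to rate nu/2.
    tau = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return mu + sigma * z / math.sqrt(tau)


def skew_t_mean(mu, sigma, lam, nu):
    """Mean of the skew t distribution (requires nu > 1)."""
    ratio = math.exp(math.lgamma((nu - 1.0) / 2.0) - math.lgamma(nu / 2.0))
    return mu + sigma * delta(lam) * math.sqrt(nu / math.pi) * ratio


rng = random.Random(2024)
mu, sigma, lam, nu = 0.0, 1.0, 3.0, 5.0
draws = [sample_skew_t(rng, mu, sigma, lam, nu) for _ in range(50000)]
sample_mean = sum(draws) / len(draws)
theoretical_mean = skew_t_mean(mu, sigma, lam, nu)
```

With zero location the mean is nonzero whenever lambda is nonzero, which is why the intercept estimate in the mixture regression model has to be corrected for the mean of the error term.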