Skew-normal median regression model with applications

Skewed (asymmetric) data appear in many research fields, including finance, economics, climate science, environmental science, engineering, the social sciences and biomedicine. Little research has been carried out on median regression models that investigate the influence of covariates on distributional characteristics. Since the median captures the "medium" level of skewed data, the main objective of this paper is to construct a median regression model for the analysis of skew-normal data. We develop an expectation-maximization (EM) algorithm, facilitated by the Newton-Raphson algorithm, to estimate the regression coefficients and the other model parameters. The proposed methods are evaluated by Monte Carlo simulations and illustrated by a real data analysis.


Introduction
Data collected in practice are often skewed (asymmetric). Skewed distributions offer an alternative tool for analysing such data, since they carry an additional shape parameter that controls the direction of the asymmetry of the density. If the skewness in the observations is ignored, statistical inference based on symmetric distributions may lead to biased or even misleading conclusions. In recent years, many skewed distributions, including the skew-normal distribution [1] and the skew-t-normal distribution [2], have been proposed for modeling such observations.
The median is the value at the "medium" level of the population, and it is robust to outliers. For skewed distributions in particular, the median represents the center of the data better than the mean. It has important practical significance in economics, the social sciences and other disciplines. For example, many countries use the median to describe the per capita income of residents, because the median is almost unaffected by changes at the high and low ends of the income distribution.
The multivariate skew-normal distribution was developed in [3]-[5]. In particular, [6] investigated classical and Bayesian approaches for the skew-normal nonlinear regression model. For measurement error data, [7] studied parameter estimation in the skew-normal linear measurement error model, and [8] considered parameter estimation in the skew-normal linear mixed measurement error model. For mixture skewed data, [9] proposed skew-normal mixture regression models. For variable selection, [10] investigated simultaneous variable selection in joint location and scale models of the skew-normal distribution, and [11] proposed variable selection in joint location, scale and skewness models based on the skew-normal distribution.
However, the above statistical inference methods are limited to location models. Little research has been carried out on median regression models that investigate the influence of covariates on distributional characteristics. The main objective of this paper is to construct a median regression model for the analysis of skew-normal data that captures the "medium" level. The rest of the paper is organized as follows. In Section 2, we introduce the skew-normal median regression model. In Section 3, we develop an EM algorithm, facilitated by the Newton-Raphson algorithm, to estimate the regression coefficients and parameters. In Section 4, we conduct Monte Carlo simulations to evaluate the finite sample performance of the proposed method. In Section 5, the body mass index data of Australia are analyzed to illustrate the proposed methods. A discussion is given in Section 6.

Skew-Normal Median Regression Model
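A random variable $Y$ is said to have a skew-normal distribution with location parameter $\mu$, scale parameter $\sigma > 0$ and skewness parameter $\lambda$, denoted by $Y \sim SN(\mu, \sigma^2, \lambda)$, if its probability density function (pdf) is [1]

$$f(y \mid \mu, \sigma^2, \lambda) = \frac{2}{\sigma}\,\phi\!\left(\frac{y-\mu}{\sigma}\right)\Phi\!\left(\lambda\,\frac{y-\mu}{\sigma}\right), \quad y \in \mathbb{R}, \tag{1}$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the pdf and cumulative distribution function (cdf) of the standard normal distribution, respectively. When $\lambda = 0$, the skew-normal distribution reduces to $N(\mu, \sigma^2)$. The distribution is right skewed if $\lambda > 0$ and left skewed if $\lambda < 0$.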
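As an illustrative sketch (not part of the original paper), the density in (1) can be evaluated with the Python standard library alone. The helper names `norm_pdf`, `norm_cdf`, `sn_pdf` and `integrate` are hypothetical, and the parameter values are arbitrary. The two checks confirm that $\lambda = 0$ recovers the normal density and that the density integrates to approximately one.

```python
import math

def norm_pdf(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sn_pdf(y, mu=0.0, sigma=1.0, lam=0.0):
    """Skew-normal density SN(mu, sigma^2, lam) evaluated at y."""
    z = (y - mu) / sigma
    return 2.0 / sigma * norm_pdf(z) * norm_cdf(lam * z)

def integrate(f, a, b, steps=4000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

# With lam = 0 the skew-normal density coincides with the N(mu, sigma^2) density.
normal_gap = abs(sn_pdf(0.3, 1.0, 2.0, 0.0) - norm_pdf((0.3 - 1.0) / 2.0) / 2.0)

# Total probability mass for an arbitrary right-skewed example (lam = 3).
mass = integrate(lambda y: sn_pdf(y, 1.0, 2.0, 3.0), -30.0, 30.0)
print(normal_gap, mass)
```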
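The expectation and variance of the random variable $Y$ are given by

$$E(Y) = \mu + \sigma\,\mu(\lambda), \qquad \mathrm{Var}(Y) = \sigma^2\left[1 - \frac{2\lambda^2}{\pi(1+\lambda^2)}\right], \tag{2}$$

where $\mu(\lambda) = \sqrt{2/\pi}\,\lambda/\sqrt{1+\lambda^2}$. From [12], we know that the mode of (1) can be expressed as

$$\mathrm{Mode}(Y) = \mu + \sigma\, m(\lambda), \tag{3}$$

where $m(\lambda)$ denotes the mode of the standard skew-normal distribution $SN(0, 1, \lambda)$. The mean, mode and median approximately satisfy the following quantitative relationship:

$$|\mathrm{Mean} - \mathrm{Mode}| \approx 3\,|\mathrm{Mean} - \mathrm{Median}|.$$

Thus, we obtain

$$\mathrm{Median}(Y) \approx \mu + \frac{\sigma}{3}\left[m(\lambda) + 2\mu(\lambda)\right]. \tag{4}$$

The aim of this paper is to build a regression model for $\mathrm{Median}(Y)$ that captures the medium level of the skew-normal distribution. We consider the following median regression model:

$$\mathrm{Median}(Y_i) = x_i^{\top}\beta, \quad i = 1, \ldots, n, \tag{5}$$

where $x_i = (x_{i1}, \ldots, x_{ip})^{\top}$ is the vector of covariates and $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ is an unknown vector of median regression coefficients. From (4), we have

$$\mu_i = x_i^{\top}\beta - \frac{\sigma}{3}\left[m(\lambda) + 2\mu(\lambda)\right]. \tag{6}$$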
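The empirical relation between the mean, mode and median used above can be checked numerically. The following is a rough sketch under stated assumptions, not the authors' code: it works with the standard skew-normal $SN(0, 1, \lambda)$, finds the mode by ternary search (the density is unimodal), and finds the median by bisection on a numerically integrated cdf. All function names are hypothetical and $\lambda = 3$ is an arbitrary choice.

```python
import math

def sn_pdf(z, lam):
    """Density of the standard skew-normal SN(0, 1, lam)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return 2.0 * phi * Phi

def sn_mean(lam):
    """Mean of SN(0, 1, lam): sqrt(2/pi) * lam / sqrt(1 + lam^2)."""
    return math.sqrt(2.0 / math.pi) * lam / math.sqrt(1.0 + lam * lam)

def sn_mode(lam, lo=-5.0, hi=5.0, iters=200):
    """Mode located by ternary search on the unimodal density."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if sn_pdf(a, lam) < sn_pdf(b, lam):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

def sn_cdf(z, lam, lower=-12.0, steps=2000):
    """cdf by composite Simpson integration of the density on [lower, z]."""
    h = (z - lower) / steps
    total = sn_pdf(lower, lam) + sn_pdf(z, lam)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * sn_pdf(lower + k * h, lam)
    return total * h / 3.0

def sn_median(lam, lo=-5.0, hi=5.0, iters=60):
    """Median (up to numerical error) by bisection on the cdf."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sn_cdf(mid, lam) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = 3.0
# Median implied by |Mean - Mode| = 3 |Mean - Median|: (2 * Mean + Mode) / 3.
approx_median = (2.0 * sn_mean(lam) + sn_mode(lam)) / 3.0
exact_median = sn_median(lam)
print(approx_median, exact_median)
```

For this choice of $\lambda$ the two values agree to roughly two decimal places, which illustrates that the relation is an approximation rather than an identity.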

MLEs of Parameters Via an EM Algorithm
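Let $y_i$ be the realization of $Y_i \sim SN(\mu_i, \sigma^2, \lambda)$, $i = 1, \ldots, n$, let $Y_{\mathrm{obs}} = \{y_i\}_{i=1}^{n}$ denote the observed data, and let $\theta = (\beta^{\top}, \sigma^2, \lambda)^{\top}$ denote the vector of parameters in the median regression model. The observed-data log-likelihood function of $\theta$ for the model is given by

$$\ell(\theta \mid Y_{\mathrm{obs}}) = c - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2 + \sum_{i=1}^{n}\log\Phi(\tau_i), \tag{7}$$

where $c$ is a constant and $\tau_i = \lambda(y_i - \mu_i)/\sigma$. The EM algorithm [13] can be used to calculate the MLEs of $\theta$, since the standard skew-normal distribution can be represented as a mixture of the standard normal distribution and the standard truncated normal distribution.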
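A minimal sketch of the observed-data log-likelihood follows, assuming the location parameters $\mu_i$ are supplied directly instead of being computed from the regression coefficients. The function name `sn_loglik` and the toy numbers are hypothetical and are not taken from the paper's data. The constant is taken as $c = (n/2)\log(2/\pi)$, which follows from the skew-normal density.

```python
import math

def norm_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sn_loglik(y, mu, sigma2, lam):
    """Observed-data log-likelihood for independent Y_i ~ SN(mu_i, sigma2, lam)."""
    n = len(y)
    sigma = math.sqrt(sigma2)
    c = 0.5 * n * math.log(2.0 / math.pi)  # constant free of the parameters
    quad = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    log_Phi = sum(math.log(norm_cdf(lam * (yi - mi) / sigma))
                  for yi, mi in zip(y, mu))
    return c - 0.5 * n * math.log(sigma2) - quad / (2.0 * sigma2) + log_Phi

def sn_logpdf(yi, mi, sigma2, lam):
    """Log of the skew-normal density, used only as a cross-check."""
    sigma = math.sqrt(sigma2)
    z = (yi - mi) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return math.log(2.0 / sigma * phi * norm_cdf(lam * z))

# Hypothetical toy values, for illustration only.
y = [21.3, 24.8, 19.9, 27.5]
mu = [22.0, 23.0, 21.0, 25.0]
loglik = sn_loglik(y, mu, sigma2=4.0, lam=1.5)
direct = sum(sn_logpdf(yi, mi, 4.0, 1.5) for yi, mi in zip(y, mu))
print(loglik, direct)
```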

A mixture representation of the skew-normal distribution
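The random variable $Y \sim SN(\mu, \sigma^2, \lambda)$ has the following stochastic representation (SR):

$$Y \overset{d}{=} \mu + \sigma\left(\frac{\lambda}{\sqrt{1+\lambda^2}}\,R + \frac{1}{\sqrt{1+\lambda^2}}\,Z\right), \tag{8}$$

where $R \sim TN(0, 1; 0, \infty)$ and $Z \sim N(0, 1)$ are independent. This representation will be used in the EM algorithm.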
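The stochastic representation also gives a simple way to simulate skew-normal data. The sketch below assumes the standard form of the representation, in which $Y = \mu + \sigma(\delta R + \sqrt{1-\delta^2}\,Z)$ with $\delta = \lambda/\sqrt{1+\lambda^2}$, $R$ a standard normal truncated to $(0, \infty)$ and $Z$ an independent standard normal. The function name `sample_sn`, the seed and the parameter values are hypothetical.

```python
import math
import random

def sample_sn(mu, sigma, lam, rng):
    """Draw one value from SN(mu, sigma^2, lam) via the stochastic representation."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    r = abs(rng.gauss(0.0, 1.0))   # R ~ TN(0, 1; 0, inf), i.e. a half-normal draw
    z = rng.gauss(0.0, 1.0)        # Z ~ N(0, 1), independent of R
    return mu + sigma * (delta * r + math.sqrt(1.0 - delta * delta) * z)

rng = random.Random(2024)
mu, sigma, lam = 1.0, 2.0, 3.0
draws = [sample_sn(mu, sigma, lam, rng) for _ in range(100_000)]

# Compare the sample mean with the theoretical mean of the skew-normal.
sample_mean = sum(draws) / len(draws)
theory_mean = mu + sigma * math.sqrt(2.0 / math.pi) * lam / math.sqrt(1.0 + lam * lam)
print(sample_mean, theory_mean)
```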

The EM algorithm
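Based on the mixture representation (8), for each $Y_i = y_i$ we introduce a latent variable $R_i = r_i$ to form the complete data $Y_{\mathrm{com}} = Y_{\mathrm{obs}} \cup Y_{\mathrm{mis}} = \{y_i, r_i\}_{i=1}^{n}$, where $Y_{\mathrm{mis}} = \{r_i\}_{i=1}^{n}$ denotes the missing data and $R_1, \ldots, R_n \overset{\mathrm{iid}}{\sim} TN(0, 1; 0, \infty)$. From (10), the surrogate function of $\theta$ can be constructed as

$$Q(\theta \mid \theta^{(t)}) = E\{\ell(\theta \mid Y_{\mathrm{com}}) \mid Y_{\mathrm{obs}}, \theta^{(t)}\} = -n\log\pi - \frac{n}{2}\log\sigma^2 + \frac{n}{2}\log(1+\lambda^2) - \frac{1+\lambda^2}{2\sigma^2}\sum_{i=1}^{n}\left[(y_i-\mu_i)^2 - \frac{2\sigma\lambda}{\sqrt{1+\lambda^2}}\,(y_i-\mu_i)\,\hat{r}_i^{(t)} + \sigma^2\,\widehat{r_i^2}^{(t)}\right], \tag{13}$$

where $\theta^{(t)}$ denotes the $t$-th approximation of the MLE $\hat{\theta}$, and $\hat{r}_i^{(t)} = E(R_i \mid Y_{\mathrm{obs}}, \theta^{(t)})$ and $\widehat{r_i^2}^{(t)} = E(R_i^2 \mid Y_{\mathrm{obs}}, \theta^{(t)})$ are conditional expectations whose closed forms involve $\tau_i$ evaluated at $\theta^{(t)}$.

(1) For a given sample size, the results for different values of $\lambda$ are similar in terms of MSE, showing that the performance of the parameter estimation method is not sensitive to the value of $\lambda$.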
(2) For a given $\lambda$, the MSEs of the MLEs $\hat{\beta}$, $\hat{\sigma}^2$ and $\hat{\lambda}$ decrease as $n$ increases, indicating that the parameter estimation method performs well.
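The Monte Carlo pattern behind statements about MSE and sample size can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the paper's simulation: it uses the sample median of standard skew-normal data as a simple stand-in estimator in place of the EM-based MLEs, and the replication count, sample sizes, seed and function names are all hypothetical.

```python
import math
import random
import statistics

def sn_pdf(z, lam):
    """Density of the standard skew-normal SN(0, 1, lam)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return 2.0 * phi * Phi

def sn_median(lam, lo=-5.0, hi=5.0, iters=50, steps=2000, lower=-12.0):
    """True median by bisection on a Simpson-integrated cdf."""
    def cdf(z):
        h = (z - lower) / steps
        total = sn_pdf(lower, lam) + sn_pdf(z, lam)
        for k in range(1, steps):
            total += (4 if k % 2 else 2) * sn_pdf(lower + k * h, lam)
        return total * h / 3.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_sn(lam, rng):
    """One draw from SN(0, 1, lam) via the half-normal plus normal mixture."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    return (delta * abs(rng.gauss(0.0, 1.0))
            + math.sqrt(1.0 - delta * delta) * rng.gauss(0.0, 1.0))

def mse_of_sample_median(n, lam, true_median, reps, rng):
    """Average squared error of the sample median over reps replications."""
    sq_errors = []
    for _ in range(reps):
        sample = [sample_sn(lam, rng) for _ in range(n)]
        sq_errors.append((statistics.median(sample) - true_median) ** 2)
    return sum(sq_errors) / reps

rng = random.Random(7)
lam = 3.0
true_median = sn_median(lam)
mse_by_n = {n: mse_of_sample_median(n, lam, true_median, 300, rng)
            for n in (50, 200, 800)}
print(mse_by_n)
```

The same loop structure applies to any estimator: replace the sample median with the estimator of interest and the true median with the corresponding true parameter value.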

Application to Body Mass Index Data of Australia
In this section, we illustrate the proposed skew-normal median regression model using the body mass index (BMI) data for 102 male and 100 female athletes collected at the Australian Institute of Sport [14]. The BMI data consist of the response variable $Y$ (the BMI, in weight/height$^2$) and four covariates: $x_1$ (the red cell count), $x_2$ (the plasma ferritin concentration), $x_3$ (the body fat percentage) and $x_4$ (the lean body mass). We are interested in establishing the relationship between the BMI $Y$ and these covariates.
Figure 1 shows the histogram of $Y$, indicating that $Y$ approximately follows a skew-normal distribution. Therefore, we consider the skew-normal median regression model for the BMI data. Applying the EM algorithm proposed in Section 3.2 to this model, we obtained the MLEs of the parameters reported in the second column of Table 2. ... 5.1835, respectively. The normal distribution has kurtosis 3 and skewness 0, whereas the kurtosis of the athletes' BMI is greater than 3 and its skewness is not 0, so a normal distribution would fit these skewed data poorly. These results can provide a valuable reference for practitioners.
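The comparison with the normal benchmark (skewness 0, kurtosis 3) relies on sample skewness and kurtosis. A minimal sketch of the moment-based versions of these statistics is given below. The function name is hypothetical, the paper does not state which estimator it used, and the values in `bmi_like` are made-up numbers for illustration, not the Australian Institute of Sport data.

```python
import math

def sample_skewness_kurtosis(values):
    """Moment-based sample skewness and (non-excess) kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Hypothetical BMI-like values with a long right tail (NOT the real data).
bmi_like = [20.1, 21.4, 22.0, 22.3, 22.9, 23.5, 24.1, 25.0, 26.8, 31.2]
skew, kurt = sample_skewness_kurtosis(bmi_like)
print(skew, kurt)
```

A clearly positive skewness together with a kurtosis away from 3 is the kind of evidence that motivates a skewed model over a normal one.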

Conclusion
The main aim of this paper was to establish a median regression model that captures the "medium" level of the skew-normal distribution. Our simulation studies and real data analysis showed that the proposed model and method are useful in practice.
Several extensions are left for future work. A natural one is to move from the skew-normal distribution to other skewed distributions, e.g., the skew-t-normal and skew-Laplace-normal distributions. Another is to consider mixture regression models with multivariate outputs. Finally, an interesting direction is to extend the proposed models to partially linear models.