Review on circular-linear regression models

Classical linear statistical methods are no longer appropriate for circular data, because such data are governed by direction or angle. Because circular data may appear as the dependent variable, the classical linear regression model has been remodelled into the circular-linear regression model over the past few decades. It is important to acknowledge the characteristics of circular data, as they affect both the descriptive and the inferential parts of a statistical analysis. Given the growing body of literature on this issue, this paper reviews circular-linear regression models by highlighting and exploring their benefits and limitations.


Introduction
Circular data are data that represent points on the circumference of the unit circle, measured in degrees or radians [1]. Over the past century, circular data have been used in a wide range of areas such as physics, psychology, meteorology, geology, paleontology, biology and astronomy. Circular data can be described by different distributions, such as the point, uniform, cardioid, triangular, von Mises (VM), wrapped normal (WN) and wrapped Poisson (WP) distributions. The VM distribution is the most common distribution for circular data, as it takes the role held by the normal distribution in standard linear statistics. In fact, it is shaped like the normal distribution, except that its tails are truncated. Unlike linear data, circular data have no well-defined origin, because they lie on the circumference of a circle or on the surface of a sphere. This difference in data characteristics can influence the descriptive and inferential parts of a statistical analysis [2]. Therefore, classical linear statistical methods are no longer appropriate for circular data.
Hence, since circular data may appear as the dependent variable, the development of circular-linear regression models has grown rapidly over the past decades. However, the applicability of the proposed regression models may differ depending on the data generation process, the data distribution, the sample size, the parameters involved in the link function and the presence of outliers in the data set [3]. These aspects must be taken into consideration when choosing the regression model to apply. Therefore, the aim of this paper is to review circular-linear regression models by highlighting and exploring their benefits and limitations. Section 2 gives an overview of the association between circular and linear variables. Section 3 gives an overview of the development of circular-linear regression models based on the VM distribution. Section 4 discusses the benefits and limitations of other types of circular-linear regression models.
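To illustrate why linear summaries fail for angles, the following minimal sketch compares the arithmetic mean with the mean direction obtained from the sine and cosine components. The sample angles and the function names are hypothetical and chosen only for illustration.

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction in degrees, reduced to [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

def mean_resultant_length(angles_deg):
    """Length of the mean resultant vector, between 0 and 1."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    return math.hypot(c, s)

# Hypothetical directions clustered around north (0 degrees).
angles = [350.0, 10.0, 355.0, 5.0]

arithmetic_mean = sum(angles) / len(angles)  # points south, away from the data
circ_mean = circular_mean_deg(angles)        # lies at north, up to rounding
resultant = mean_resultant_length(angles)    # close to 1: tightly clustered

print(arithmetic_mean, circ_mean, resultant)
```

The arithmetic mean depends on where the origin is placed, whereas the mean direction does not, which is the property that motivates dedicated circular methods.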

Linear-Circular Association
It is important to understand the difference between regression and correlation analysis. Regression expresses a relationship in the form of an equation, whereas correlation analysis determines whether an association exists between variables and measures its strength. The first step in developing a regression model is to investigate the association between the linear and circular variables. The Pearson product-moment correlation was introduced by [4] in 1900. It indicates the degree of linearity in a scatter diagram, in particular the strength and direction of a linear relationship between variables. Note that it does not measure the slope or steepness of the regression line. However, the correlation analysis of [4] applies to the association between linear variables only.
Consequently, the need for a correlation analysis between linear and circular variables has caught the attention of researchers. Several types of correlation analysis are discussed in this section. The first correlation coefficient for linear and circular variables was proposed by [5] in 1976 under the assumption that the association is approximately sinusoidal, so that the scatter plot of the linear variable against the circular variable should be distributed around a sine wave [5,6]. The correlation coefficient between the linear and circular variables, $R^2_{x\theta}$, is defined as
$$R^2_{x\theta} = \frac{r_{xc}^2 + r_{xs}^2 - 2\, r_{xc}\, r_{xs}\, r_{cs}}{1 - r_{cs}^2},$$
where $r_{xc}$, $r_{xs}$ and $r_{cs}$ are the Pearson correlations between $x$ and $\cos\theta$, between $x$ and $\sin\theta$, and between $\cos\theta$ and $\sin\theta$, respectively. The value of $R^2_{x\theta}$ lies in the range $[0, 1]$, and a higher value implies a stronger association between the linear variable $X$ and the circular variable $\Theta$. A value of zero indicates no association of this kind between $X$ and $\Theta$. The value of $R^2_{x\theta}$ is invariant under a change of scale and origin of $X$ and under a change of zero direction or sense of rotation for $\Theta$. If $X$ and $\Theta$ are independent and $X$ is normally distributed, then $(n-3)R^2_{x\theta}/(1-R^2_{x\theta})$ has an $F$ distribution with $2$ and $n-3$ degrees of freedom. For a test of the null hypothesis of independence, $R^2_{x\theta}$ can also be used as the test statistic of a randomization test. A p-value smaller than the significance level $\alpha$ rejects independence, which indicates an association between $X$ and $\Theta$.
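The coefficient and the randomization test described above can be sketched as follows. This is an illustration only: it assumes the usual form of the coefficient built from the three Pearson correlations of $x$, $\cos\theta$ and $\sin\theta$, and the simulated data, function names, seed and number of permutations are all hypothetical choices.

```python
import math
import random

def pearson(u, v):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def linear_circular_r2(x, theta):
    """Linear-circular coefficient from correlations with cos and sin."""
    c = [math.cos(t) for t in theta]
    s = [math.sin(t) for t in theta]
    rxc, rxs, rcs = pearson(x, c), pearson(x, s), pearson(c, s)
    return (rxc ** 2 + rxs ** 2 - 2.0 * rxc * rxs * rcs) / (1.0 - rcs ** 2)

def randomization_p_value(x, theta, n_perm=999, seed=0):
    """Permute x against theta and count coefficients at least as large."""
    rng = random.Random(seed)
    observed = linear_circular_r2(x, theta)
    shuffled = list(x)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if linear_circular_r2(shuffled, theta) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Simulated data (assumption): x varies sinusoidally with theta plus noise.
rng = random.Random(42)
n = 40
theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
x = [2.0 + 3.0 * math.cos(t - 1.0) + rng.gauss(0.0, 0.3) for t in theta]

r2 = linear_circular_r2(x, theta)
f_stat = (n - 3) * r2 / (1.0 - r2)  # compared with F(2, n - 3) under normality
p_value = randomization_p_value(x, theta)
print(r2, f_stat, p_value)
```

The randomization test avoids the normality assumption on $X$ that the $F$ approximation requires, at the cost of recomputing the coefficient for each permutation.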
In addition, [5] developed a rank-based analogue of the correlation coefficient between linear and circular variables. This rank correlation coefficient follows the logic of Spearman's rank correlation coefficient by replacing the linear values $x_i$ and the associated circular values $\theta_i$ with their ranks and uniform scores, respectively. This leads to a less restrictive assumption for the linear-circular association, namely that there is only one maximum and one minimum. Firstly, the data vectors ...
[7] suggested that a restriction is needed when the circular variable is the independent variable, which was later presented in [8]. This case is not discussed here, as it is outside the scope of this paper. Hence, it is crucial to analyze the association between linear and circular variables before developing the regression model, and the correlation analysis should take into account the nature of the regression model being contemplated, that is, whether the circular variable is the independent or the dependent variable.

Circular-Linear Regression Models based on Von Mises (VM) Distribution
The development of a circular-linear regression model can vary depending on the data generation process, that is, on the distribution of the circular variable and on the association between the linear and circular variables. Several approaches can be used to generate circular distributions, such as perturbation, wrapping, transformation and projected or offset distributions. Table 1 gives an overview of some basic circular-linear regression models in which the circular variable follows the von Mises (VM) distribution, along with their benefits and limitations.
The study of circular-linear regression models began in 1969, when [9] proposed a regression model to predict a circular response variable from linear predictors. The circular data in [9] are assumed to follow the von Mises distribution. The model of [9] was developed for the biological field, specifically for the analysis of vectorcardiogram projections, in which the circular variable is the time interval elapsed from the start of the QRS loop to the observed point, given the duration of the QRS loop, and the linear variable is the duration of the heartbeat measured in beats per minute. The circular-linear regression model has the link function
$$\mu = \mu_0 + \sum_i \beta_i x_i,$$
where $\mu$ is the mean direction of the circular response variable, $\mu_0$ is the mean level, $\beta_i$ is the regression coefficient and $x_i$ is the linear independent variable. [9] provided a method for calculating the maximum likelihood estimates (MLEs) of the model parameters, which is also equivalent to least squares, and offered some approximate methods of inference for large values of the concentration parameter $\kappa$ [10].
Nonetheless, [11] indicated that the likelihood function in [9] has infinitely many equally high peaks, which leads to ambiguously defined maximum likelihood estimates (MLEs). These peaks occur because the circular mean response $\mu$, conditional on the linear variable $X = x$, is a curve winding in an infinite number of spirals up the surface of an infinite cylinder. This is one of the forms of the so-called "barber's pole" model, and it is the form used in the regression model of [9]. This weakness is also addressed by [9], who believes that employing additional information in further research, such as the magnitude of the vector, hemodynamic data and the position of any particular vector on the surface of the sphere as a function of the age of the individual, will give more elaborate and realistic models with a more precise and meaningful description of the data.
Subsequently, [11] extended the model of [9] to a circular-linear regression with a single linear predictor. The regression model was applied in meteorology, with wind direction as the circular variable, influenced by wind speed as the linear variable. The proposed model is developed from a maximum entropy conditional distribution and uses a completely specified cumulative distribution function as the link function. The regression model is outlined as
$$\mu = \mu_0 + 2\pi F(x),$$
where $F$ is the completely specified cumulative distribution function of the linear variable. Although the circular-linear regression model with conditional centering is not a multiplicative model in the parameters, the response $\theta$ completes just a single spiral as $x$ increases through its range, as praised by [12]. This single spiral of the response variable is another class of the so-called "barber's pole", in contrast to the model of [9], which is a curve winding in an infinite number of spirals. However, the model of [11] covers only the case of a single linear predictor [13].
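The ambiguity in the likelihood of the model in [9] can be checked numerically. The sketch below evaluates the part of the VM log-likelihood that depends on the slope, $\kappa \sum_i \cos(\theta_i - \mu_0 - \beta x_i)$, and shows that shifting $\beta$ by multiples of $2\pi/d$ leaves it unchanged. It assumes, for illustration only, a single predictor recorded on a grid with spacing $d$. The simulated data and all names are hypothetical.

```python
import math
import random

def vm_loglik_kernel(mu0, beta, kappa, x, theta):
    """Part of the von Mises log-likelihood that depends on mu0 and beta.

    The normalising term -n*log(2*pi*I0(kappa)) is omitted because it does
    not depend on the regression parameters.
    """
    return kappa * sum(math.cos(t - mu0 - beta * xi) for xi, t in zip(x, theta))

# Simulated data (assumption): predictor on a grid with spacing d.
rng = random.Random(1)
d = 0.5
x = [d * i for i in range(12)]
theta = [(0.3 + 0.8 * xi + rng.gauss(0.0, 0.2)) % (2.0 * math.pi) for xi in x]

# Shifting beta by k * 2*pi/d changes every beta*x_i by a multiple of 2*pi,
# so the cosine terms, and hence the likelihood, are unchanged.
period = 2.0 * math.pi / d
peaks = [vm_loglik_kernel(0.3, 0.8 + k * period, 2.0, x, theta) for k in range(-2, 3)]

# A slope that is not such a shift gives a clearly lower value.
off_peak = vm_loglik_kernel(0.3, 0.8 + 1.0, 2.0, x, theta)
print(peaks, off_peak)
```

Each shifted slope corresponds to a mean direction that winds around the cylinder a different number of times between observations, which is why the data alone cannot distinguish between them.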
Accordingly, [12] generalized the model of [9] to the case of multiple linear predictors, allowing the response variable to have its own concentration parameter and using a link function that maps the real line to the unit circle [13,14]. The model of [12] was used to analyse the distances moved by small blue periwinkles, Nodilittorina unifasciata, a type of small marine snail, after they had been transplanted down shore from the height at which they normally live. The circular-linear regression model is given as
$$\mu = \mu_0 + g(\beta^{T} x),$$
where $g$ is a link function mapping the real line to the circle, for example $g(u) = 2\arctan(u)$, which avoids the non-identifiability of the MLEs in the model of [9]. [12] take into account the dependence of both the mean direction and the dispersion on the distance moved. However, the multimodality with very narrow and sharp modes produced by the proposed model makes the likelihood function difficult to optimize [15].
Over the years, [9], [11] and [12] have become the main references for authors developing new methods for statistical analysis, such as regression models, outlier detection methods and correlation analysis. As discussed earlier, the models of [11] and [12] are extensions of the model introduced by [9], and the model suggested by [11] has been extended by [12]. Next, [13] adapted the model of [9] to a semiparametric Bayesian model for circular-linear regression. Furthermore, the extension of [9] and [11] has produced a new model based on nonnegative trigonometric sums by [16]. Meanwhile, the asymmetric circular-linear regression model for environmental data by [14], the inverse circular-linear and linear-circular regression by [18] and the application of Bayesian hidden Markov modelling by [19] are extensions based on [9], [11] and [12]. Besides, [15] presented the projected multivariate circular-linear regression model based on [12]. The model of [9] is used by [20] and [21] to propose new outlier detection methods such as DMCE and COVRATIO. In addition, the COVRATIO statistic is also used by [22] based on the models of [9] and [12].
[23] presented the mean circular error (MCE) as an outlier detection method based on the model of [9]. Later, the models of [9], [11] and [12] were used by [24] to apply COVRATIO, while [25] used DMCE. [26] developed the COVRATIO statistic for the multiple circular regression model based on the model of [9]. In addition, the DMCE circular statistic was applied by [27] using the model of [9]. Lastly, the model of [11] was used for correlation analysis by [28].
Ultimately, [9] is the main contributor to the development of circular-linear regression models under the assumption that the circular data follow the VM distribution. With the same distributional assumption for the circular variable, the development was continued by [11] and [12], each with their own modification of the link function to address the limitations for a specific problem. To sum up, the choice of link function is influenced by the distribution of the circular data, which has produced several different approaches to developing circular-linear regression models. Moreover, the link function also differs depending on whether the circular response has one linear predictor or a set of linear covariates. Consequently, the models developed by [9], [11] and [12] have been extended in further statistical work, such as new models, outlier detection methods and correlation analysis.

Table 1. Circular-linear regression models based on the VM distribution, with their benefits and limitations.

Model [11]
Limitation: This model is only for the case of a single linear predictor.
Further work: [28] developed a new method for correlation analysis.

Model [12]
Circular variable: Direction in which the small blue periwinkles moved.
Linear variable: Distance travelled by the small blue periwinkles.
Benefit: Extended the model of [9] to the case of multiple linear predictors using a link function mapping the real line to the unit circle.
Limitation: The likelihood function of the proposed model is very difficult to optimize because of multimodality with very narrow and sharp modes and the unidentifiability of the parameters.
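A link function that maps the real line to the circle, as used by [12], can be sketched as follows. The choice $g(u) = 2\arctan(u)$ is an assumption made here for illustration, as are the function names and the example values. The point of the sketch is that the mean direction stays within one turn of $\mu_0$ however large the linear predictor becomes.

```python
import math

def link(u):
    """Map the real line monotonically into (-pi, pi); assumed form 2*atan(u)."""
    return 2.0 * math.atan(u)

def mean_direction(mu0, beta, x):
    """Mean direction mu0 + g(beta'x), reduced to [0, 2*pi)."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return (mu0 + link(eta)) % (2.0 * math.pi)

# The link is bounded and increasing, so the mean direction completes at most
# a single turn as the linear predictor runs over the whole real line.
grid = [-1e6, -10.0, -1.0, 0.0, 1.0, 10.0, 1e6]
link_values = [link(u) for u in grid]

# Hypothetical example with two linear predictors.
example = mean_direction(mu0=1.0, beta=[0.5, -0.2], x=[3.0, 4.0])
print(link_values, example)
```

By contrast, a mean direction that is linear in the predictor wraps around the circle once for every increase of $2\pi$ in the linear predictor, which is the source of the equally high likelihood peaks in the model of [9].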

Other Types of Circular-Linear Regression Model
The VM distribution has long been the foundation of circular-linear regression model development and has been widely used since it was originally introduced by [9]. Nevertheless, [8] raised the concern that the VM distribution is inappropriate for analysing circular data with multimodality or skewness. On top of that, the cumulative distribution function of the VM distribution does not have a closed-form expression. Moreover, [14] pointed out that, in practice, the circular distributions encountered are usually asymmetric. As a result, the development of circular-linear regression models has been extended by other authors to several new approaches, such as nonparametric smoothing, multimodal distributions, asymmetric distributions and projected or offset distributions, which are discussed in Table 2.
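Because the VM cumulative distribution function has no closed form, it has to be evaluated numerically in practice. The following minimal sketch integrates the VM density with the trapezoidal rule, using a power series for the modified Bessel function $I_0$. The function names, the number of series terms and the number of integration steps are illustrative assumptions, not a recommended implementation.

```python
import math

def bessel_i0(kappa, terms=60):
    """Modified Bessel function I0 via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        if k > 0:
            term *= (kappa / 2.0) ** 2 / (k * k)
        total += term
    return total

def vm_pdf(theta, mu, kappa):
    """von Mises density with mean direction mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def vm_cdf(theta, mu, kappa, steps=2000):
    """Probability mass on [mu - pi, theta], for theta in [mu - pi, mu + pi]."""
    norm = 2.0 * math.pi * bessel_i0(kappa)  # computed once for the whole sum
    a = mu - math.pi
    h = (theta - a) / steps
    f = lambda t: math.exp(kappa * math.cos(t - mu)) / norm
    total = 0.5 * (f(a) + f(theta))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

mu, kappa = 1.0, 2.0
half = vm_cdf(mu, mu, kappa)            # symmetry about mu gives one half
full = vm_cdf(mu + math.pi, mu, kappa)  # the density integrates to one
print(half, full)
```

The same numerical integration is needed whenever quantiles or tail probabilities of a VM model are required, which is one practical cost of the distribution noted above.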
[13] implemented a semiparametric Bayesian approach, which gives flexibility in the distributional assumptions of the model and increases robustness to prior misspecification. This is especially beneficial when the likelihood surface is ill behaved. On the other hand, prior knowledge is needed to select starting values for the optimization algorithm so that it locates the global maximum, and such information is not available in this case. Likewise, [16] extended the models of [12] and [11] to multimodal distributions by using nonnegative trigonometric sums, which are well suited to modelling data that exhibit multimodality or skewness in the joint density. The approach is therefore applicable to a broad range of dependence patterns, from independence to very complex dependence structures. In addition, [18] proposed an inverse regression, which predicts the corresponding value of the independent variable after a value of the dependent variable has been observed. The inverse circular-linear regression model for asymmetric distributions is based on the likelihood method, and using asymmetric generalized von Mises (AGVM) errors in the inverse circular-linear regression problem is recommended when one targets a small mean squared error (MSE). Lastly, [19] applied a Bayesian framework that is able to overcome the identifiability issues and computational problems that arise in the classical setting when combining projected normal (PN) and Gaussian distributions. However, the models proposed by [14], [18] and [19] are for one linear independent variable only.
The model introduced by [13] has been extended to new models using the Bayesian approach by [29] for the semiparametric linear-circular regression model, by [30] for the basic circular-linear regression model and by [31] for asymmetric multimodal regression. A new outlier detection method was proposed by [32] based on the models of [13] and [16]. Later, the model of [16] was extended by [33] to a nonparametric copula using wind direction data for circular-linear and linear-circular regression models, by [34] to the semi-wrapped Gaussian and by [35] to environmental contours using a direct sampling method for the circular-linear regression model of [34]. Next, [36] introduced meta-analysis for the linear-circular regression model, and [37] used iteration order manipulation to develop a circular-linear regression model based on the model of [18]. Moreover, the hidden Markov approach of [19] is applied in the models of [38] and [39] using wind direction data.
Clearly, circular variables can follow several distributions other than the VM distribution. Hence, circular-linear regression models continue to be studied and extended to further statistical analyses, such as new models and outlier detection methods. Although certain approaches may outperform others, the behaviour and pattern of the circular data affect the appropriateness of the regression model.

Conclusion
In conclusion, the circular-linear regression model has been frequently transformed over time. The statistical models differ from one another according to the data generation process for the circular variable, the sample size, the link function, the parameters involved in the link function and the presence of outliers in the data set. The most important step before developing the regression model is to determine the association between the linear and circular variables in order to produce an appropriate model. The VM distribution is the most commonly used distribution for predicting a circular response, as first introduced by [9] in his model. As discussed, the infinitely many equally high peaks occur because the curve winds in an infinite number of spirals up the surface of an infinite cylinder, and employing additional information in further research, such as the magnitude of the vector, hemodynamic data and the position of any particular vector on the surface of the sphere as a function of the age of the individual, will give more elaborate and realistic models with a more precise and meaningful description of the data. [11] applied another form of the so-called "barber's pole", in which the response completes only a single spiral as the independent variable increases. However, this model is suitable for one predictor only. [12] introduced a regression model with multiple predictors, whose link function avoids the non-identifiability of the MLEs in the model of [9]. Nevertheless, the multimodality with very narrow and sharp modes produced by the proposed model makes the likelihood function difficult to optimize.
Gradually, with the recognition that circular data in practice can be asymmetric, the development of circular-linear regression models has expanded to other types of distributions, such as nonparametric, asymmetric, multimodal and projected normal (PN) or offset distributions. SAS, Maple 6 and CODA are the most popular statistical analysis tools that have been used by researchers for analyses that include circular data. R, Python and Matlab are other examples of statistical tools that can be explored and that are frequently used in industry. For further work, the numerous circular-linear regression models already developed can be extended and adapted to generate new models based on different distribution types and approaches. Outlier detection methods such as DMCE, COVRATIO and MCE can be applied to, or introduced for, established circular-linear regression models. New correlation analyses can also be proposed that take into account the nature of the circular data involved, that is, whether the circular variable is the dependent or the independent variable. To summarize, the model of [11] is the simplest and most suitable model when the focus of the research is on other statistical inferences, such as confidence intervals and outlier detection, rather than on developing a new regression model.