A new modified ridge-type estimator for the beta regression model: simulation and application

Abstract: The beta regression model (BRM) has become a popular tool for assessing the relationships among chemical characteristics. In the BRM, when the explanatory variables are highly correlated, the maximum likelihood estimator (MLE) does not provide reliable results. In this study, we therefore propose a new modified beta ridge-type (MBRT) estimator for the BRM to reduce the effect of multicollinearity and improve estimation. We first show analytically that the new estimator outperforms the MLE as well as two other well-known biased estimators, i.e., the beta ridge regression estimator (BRRE) and the beta Liu estimator (BLE), under the matrix mean squared error (MMSE) and mean squared error (MSE) criteria. The performance of the MBRT estimator is then assessed using a simulation study and an empirical application. The findings demonstrate that the proposed MBRT estimator outperforms the MLE, BRRE and BLE in fitting the BRM with correlated explanatory variables.


Introduction
Regression analysis is one of the most important statistical tools, with applications in chemometrics and many other fields [39][40][52][53][54]. Various types of regression models are available in the literature, e.g., the linear model, the non-linear model, generalized linear models and the nonparametric regression model [36,39]. Choosing a suitable regression model is the most important task for obtaining accurate and reliable results [37,39], and this choice depends on the nature and distribution of the response variable [37,38].
In many research areas, there are situations in which the response variable is restricted to the interval [0, 1], such as rates and proportions. The classical solution is to transform the response variable so that it is mapped from the interval [0, 1] onto the real line (ℝ). In this situation (which is common when analyzing chemical, environmental, or biological data), the logistic regression model is considered, where the log-odds are used as the response variable in a linear regression model. Another common transformation is the inverse of a suitable cumulative distribution function, which also maps a response variable in [0, 1] onto the real line. A well-known example of the latter is the probit model. Such solutions have several drawbacks; for example, the model parameters cannot be easily interpreted in terms of the original response variable. Another shortcoming is that proportion measures typically show asymmetry, and hence any inference based on the normality assumption can be misleading, especially for small samples.
As a remedy to the aforementioned problems, Ferrari and Cribari-Neto [49] proposed a beta regression model (BRM) for a continuous response variable ($y$) with support in the open interval (0,1). This model is based on the assumption that the response variable follows the beta distribution. Further, the model can accommodate asymmetries and heteroscedasticity. Later developments have expanded the model so that it is also possible to include covariates for modeling dispersion [45]. Generally, the maximum likelihood estimator (MLE) is used to estimate the unknown regression coefficients of the BRM.
It is a common assumption in the multiple BRM that the regressors are not linearly correlated with one another. In practice, however, explanatory variables may be inter-correlated, which causes the problem of multicollinearity [48]. In the presence of multicollinearity, the variance of the MLE becomes inflated, and inference based on this estimator may not be reliable. Further consequences of multicollinearity are wider confidence intervals and an increased probability of type-II error in hypothesis tests on the unknown parameters [18]. Many biased estimators have been introduced to combat multicollinearity in the linear regression model (LRM), such as the Stein estimator [11], the ridge regression estimator (RRE) [2], the improved ridge estimator [7], the contraction estimator [23], the modified RRE [6], the Liu estimator (LE) [21], the Liu-type estimator [22], the mixed ridge estimator [51], and the modified Liu-type estimator [25]. Among them, the RRE, initially proposed by Hoerl and Kennard [2], is the most common and attractive method. This method was further developed and applied to chemical data by Vigneau et al. [12]. There are several methods to estimate the shrinkage parameter (see, e.g., [9,28,39,41]). The treatment of multicollinearity was extended from the LRM to the generalized linear model (GLM) by Segerstedt [8]. For instance, Månsson and Shukur [16] introduced some ridge parameters for the logistic regression model, and Månsson and Shukur [17] established a Poisson RRE. Amin et al. [29] proposed a James-Stein estimator for the Poisson regression model, and Amin et al. [39] examined the performance of the inverse Gaussian ridge regression estimator. Later, Amin et al. [30] recommended some methods for estimating the shrinkage parameter based on the gamma RRE. Recently, Qasim et al. [31] introduced the RRE for the BRM.
Liu [21] introduced a new estimator, subsequently known as the Liu estimator (LE). The main advantage of the LE is that the estimator is a linear function of the biasing Liu parameter d, whereas the RRE is a non-linear function of the ridge parameter k. This leads to a more stable shrinkage of the estimated coefficients, and for this reason researchers have often preferred the LE to the conventional RRE. For the vast literature on the LE for the LRM, we refer readers primarily to Liu [21]. This method has also been further developed and applied to chemical data by Alheety and Kibria [25,32], Kibria [10], Li and Yang [51], and Akram et al. [33]. However, the literature on the LE for the GLM is very limited. For example, Månsson et al. [18] suggested some shrinkage parameters for the Poisson Liu regression estimator, Månsson et al. [19] introduced an LE for the logit regression model, and Månsson [20] recommended some Liu parameters for the negative binomial regression model. Qasim et al. [34] developed and adopted some new shrinkage parameters for the LE in the gamma regression model. Recently, Karlsson et al. [47] introduced the LE for the BRM. Another estimator that efficiently tackles the problem of multicollinearity is the modified ridge-type estimator (MRTE), on which the literature is very limited. We refer to the following studies: Lukman et al. [3] introduced the MRTE for the LRM, Lukman et al. [24] introduced some new ridge parameters for the LRM, and Lukman et al. [4] proposed a modified ridge-type logistic estimator. The available literature shows that no such study exists for the BRM. These methods have been studied so extensively because non-orthogonal regressors are the most common situation when analyzing chemical, environmental, or biological data.
This article therefore adapts the MRTE to the BRM, names the resulting estimator the MBRTE, and derives its theoretical properties to determine its effectiveness. In addition, we provide a theoretical comparison of the proposed MBRTE with other estimation methods, i.e., the MLE, RRE, and LE, in the sense of the matrix mean squared error (MMSE) and mean squared error (MSE) criteria.
The paper unfolds as follows: we define the model of interest and the estimation procedures for the BRM in Section 2. The theoretical comparison of the proposed estimator with other estimation methods is also outlined in that section. The selection of appropriate biasing parameters is presented in Section 3, whereas the layout of the Monte Carlo simulation and its findings are discussed in Section 4. An empirical application is outlined in Section 5. Finally, the paper ends with some concluding remarks.

The beta regression model
Suppose that $y_1, y_2, \dots, y_n$ are observations of a random variable $Y$ that follows a beta distribution with parameters $a, b > 0$, denoted $\mathrm{Beta}(a, b)$. The beta probability density function is

$$f(y; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, y^{a-1}(1-y)^{b-1}, \quad 0 < y < 1, \tag{1}$$

where $\Gamma(\cdot)$ is the gamma function and $a, b > 0$. The mean and variance of $Y$ are $E(Y) = \frac{a}{a+b}$ and $\mathrm{Var}(Y) = \frac{ab}{(a+b)^2(a+b+1)}$, respectively. Ferrari and Cribari-Neto [49] defined a parameterization for developing a regression model for beta-distributed responses based on Eq (1). Setting $\mu = \frac{a}{a+b}$ and $\phi = a+b$, and re-parameterizing Eq (1) through $a = \mu\phi$ and $b = \phi - \mu\phi$, the density function of $Y$ can be expressed as

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1}, \quad 0 < y < 1, \tag{2}$$

where $Y \sim \mathrm{Beta}(\mu, \phi)$ is limited to the interval (0,1), $\mu$ $(0 < \mu < 1)$ denotes the mean of the response variable and $\phi$ $(\phi > 0)$ is the precision parameter. Under the new parameterization, the mean and variance of $Y$ are $E(Y) = \mu$ and $\mathrm{Var}(Y) = \mu(1-\mu)/(1+\phi)$. The reciprocal of the precision, $\phi^{-1}$, is called the dispersion parameter. For fixed $\mu$, the variance of the response decreases as $\phi$ increases. The BRM is obtained by assuming that $y_i \sim \mathrm{Beta}(\mu_i, \phi)$, $i = 1, \dots, n$, with the link function defined as

$$g(\mu_i) = x_i'\beta = \eta_i,$$

where $x_i'$ is the ith row of $X$, which is an $n \times p$ data matrix of non-stochastic explanatory variables, $\beta = (\beta_1, \beta_2, \dots, \beta_p)'$ is a $p \times 1$ vector of unknown regression coefficients, $\eta_i$ is the linear predictor and $g(\cdot)$ is the link function of the BRM, which is strictly monotonic and twice differentiable such that $g(\cdot): (0,1) \to \mathbb{R}$. Different link functions may be used for fitting the BRM, for instance the logit, probit, log-log, complementary log-log, and Cauchy link functions. Among these, the most commonly used is the logit link, i.e., $g(\mu) = \log\left(\frac{\mu}{1-\mu}\right)$, which was suggested by Ferrari and Cribari-Neto [49]. The mean function of the response variable with the logit link is

$$\mu_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)},$$

where $\mu_i$ is the mean function of the response variable. Since $\mu_i$ depends on $\eta_i$ and $\eta_i$ is a function of $\beta$, the means $\mu_1, \dots, \mu_n$ are functions of $\beta$.
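The mean-precision reparameterization can be checked numerically. The following is a minimal sketch in plain Python; the function names and the labels `a`, `b` for the two classical shape parameters are illustrative choices, not notation taken from the paper.

```python
def beta_shapes(mu, phi):
    """Map a mean mu in (0, 1) and a precision phi > 0 to the shape pair (a, b)."""
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    return mu * phi, (1.0 - mu) * phi


def beta_mean_var(a, b):
    """Mean and variance of Beta(a, b) in the classical parameterization."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


mu, phi = 0.3, 10.0
a, b = beta_shapes(mu, phi)
mean, var = beta_mean_var(a, b)

# Variance written directly in the mean-precision form mu * (1 - mu) / (1 + phi).
reparam_var = mu * (1.0 - mu) / (1.0 + phi)
print(mean, var, reparam_var)
```

Both variance expressions agree up to rounding, and raising `phi` with `mu` fixed shrinks the variance, which is why $\phi$ is read as a precision.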
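The log-likelihood function of Eq (2) is given by

$$\ell(\beta, \phi) = \sum_{i=1}^{n}\Big\{\log\Gamma(\phi) - \log\Gamma(\mu_i\phi) - \log\Gamma((1-\mu_i)\phi) + (\mu_i\phi-1)\log y_i + ((1-\mu_i)\phi-1)\log(1-y_i)\Big\}. \tag{4}$$

Let $\hat\beta_{MLE}$ be the MLE of $\beta$. We are only interested in estimating the parameter vector $\beta$. Generally, the MLE is used to estimate the unknown regression coefficients $\beta_j$, $j = 1, \dots, p$. The score function $U(\beta) = \partial\ell/\partial\beta$ has jth component

$$U_j(\beta) = \phi\sum_{i=1}^{n}\left(y_i^* - \mu_i^*\right)\frac{1}{g'(\mu_i)}\,x_{ij}, \tag{5}$$

where $y_i^* = \log\{y_i/(1-y_i)\}$, $\mu_i^* = \psi(\mu_i\phi) - \psi((1-\mu_i)\phi)$ with $\psi(\cdot)$ the digamma function, $1/g'(\mu_i) = \mu_i(1-\mu_i)$ under the logit link, and $x_{ij}$ is the ith value of the jth covariate, $j = 1, \dots, p$. Since Eq (5) is non-linear in $\beta$, its solution is obtained using Fisher's scoring iterative procedure

$$\beta^{(r+1)} = \beta^{(r)} + \left\{I(\beta^{(r)})\right\}^{-1}U(\beta^{(r)}), \tag{6}$$

where $U(\beta)$ is the first derivative of Eq (4) and $r = 0, 1, 2, \dots$ indexes the iterations, which are carried out until convergence. The expected second derivative of Eq (4) with respect to $\beta$ gives

$$I(\beta) = -E\left[\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta'}\right] = \phi X'WX, \tag{7}$$

where $W = \mathrm{diag}(w_1, \dots, w_n)$ with $w_i = \phi\{\psi'(\mu_i\phi) + \psi'((1-\mu_i)\phi)\}/\{g'(\mu_i)\}^2$. Both $U(\beta)$ and $I(\beta)$ are computed at $\beta^{(r)}$. After some simplification, the estimation algorithm takes the form of iteratively reweighted least squares:

$$\beta^{(r+1)} = \left(X'W^{(r)}X\right)^{-1}X'W^{(r)}z^{(r)}, \tag{8}$$

where $z = \eta + W^{-1}T(y^* - \mu^*)$ and $T = \mathrm{diag}\{1/g'(\mu_1), \dots, 1/g'(\mu_n)\}$. Since $\beta^{(r)}$ converges to $\hat\beta_{MLE}$ as $r \to \infty$, the final form of the MLE is

$$\hat\beta_{MLE} = \left(X'\hat{W}X\right)^{-1}X'\hat{W}\hat{z}, \tag{9}$$

where $\hat W$ and $\hat z$ are estimated at the final iteration. Eqs (8) and (9) give a feasible estimator of the unknown regression parameters; for more details, see Roozbeh and Arashi [42]. Both $\hat W$ and $\hat z$ are evaluated using the Fisher scoring iterative procedure. The MMSE and MSE can be found by considering $\alpha = Q'\beta$, estimated by $\hat\alpha = Q'\hat\beta_{MLE}$, and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p)$, which is equivalent to $Q'(X'\hat{W}X)Q$, where $Q$ is the orthogonal matrix whose columns are the eigenvectors of $X'\hat{W}X$; i.e., $Q = (q_1, \dots, q_p)$, where $q_j$ is the jth eigenvector of $X'\hat{W}X$, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$ are the eigenvalues of $X'\hat{W}X$, and $\alpha_j$, $j = 1, \dots, p$, is the jth element of $\alpha$. The covariance and MMSE of $\hat\beta_{MLE}$ are then

$$\mathrm{Cov}(\hat\beta_{MLE}) = \phi^{-1}\left(X'\hat{W}X\right)^{-1}, \qquad \mathrm{MMSE}(\hat\beta_{MLE}) = \phi^{-1}Q\Lambda^{-1}Q'. \tag{10}$$

Therefore, the scalar MSE of $\hat\beta_{MLE}$ is

$$\mathrm{MSE}(\hat\beta_{MLE}) = \phi^{-1}\,\mathrm{tr}\left(Q\Lambda^{-1}Q'\right) = \frac{1}{\phi}\sum_{j=1}^{p}\frac{1}{\lambda_j}, \tag{11}$$

where $\lambda_j$ is the jth eigenvalue of $X'\hat{W}X$. When the regressors are correlated, the matrix $X'\hat{W}X$ is ill-conditioned, so some eigenvalues are small and the MSE of the MLE is inflated.
Multicollinearity is a severe issue in applied research that leads to high variance, wider confidence intervals and unstable parameter estimates. Ferrari and Cribari-Neto [49] proposed the BRM for modelling rates and proportions. The BRM is estimated by the MLE, and it is difficult to draw inference based on the MLE in the presence of multicollinearity. To circumvent this issue, we introduce a shrinkage estimator in the next section.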
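A short numerical illustration shows how a single small eigenvalue inflates the scalar MSE of the MLE. This is a sketch that assumes the scalar MSE has the form $\phi^{-1}\sum_j 1/\lambda_j$; the eigenvalues and the value of `phi` are made up for illustration.

```python
def mse_mle(eigenvalues, phi):
    """Scalar MSE of the MLE: (1 / phi) * sum of reciprocal eigenvalues."""
    return sum(1.0 / lam for lam in eigenvalues) / phi


phi = 10.0
well_conditioned = mse_mle([2.0, 1.0, 0.5], phi)
ill_conditioned = mse_mle([2.0, 1.0, 0.001], phi)  # one near-zero eigenvalue

# The near-zero eigenvalue dominates the sum and blows up the MSE.
print(well_conditioned, ill_conditioned)
```

Only the smallest eigenvalue differs between the two calls, yet it accounts for almost all of the second MSE.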

A beta ridge regression estimator
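Qasim et al. [31] proposed a beta ridge regression estimator (BRRE), which generalizes the estimator of Hoerl and Kennard [2] and is defined as

$$\hat\beta_{BRRE} = R_k\,\hat\beta_{MLE}, \tag{12}$$

where $R_k = (X'\hat{W}X + kI)^{-1}(X'\hat{W}X)$, $k$ $(k > 0)$ is the ridge shrinkage parameter and $I$ is the identity matrix of order $p \times p$. If $k \to 0$, then $\hat\beta_{BRRE} = \hat\beta_{MLE}$. The bias vector and covariance matrix of Eq (12) can be defined as

$$\mathrm{Bias}(\hat\beta_{BRRE}) = -k\,Q\Lambda_k^{-1}\alpha, \tag{13}$$

$$\mathrm{Cov}(\hat\beta_{BRRE}) = \phi^{-1}Q\Lambda_k^{-1}\Lambda\Lambda_k^{-1}Q', \tag{14}$$

so that the MMSE is

$$\mathrm{MMSE}(\hat\beta_{BRRE}) = \phi^{-1}Q\Lambda_k^{-1}\Lambda\Lambda_k^{-1}Q' + k^2\,Q\Lambda_k^{-1}\alpha\alpha'\Lambda_k^{-1}Q', \tag{15}$$

where $\Lambda_k = \mathrm{diag}(\lambda_1 + k, \lambda_2 + k, \dots, \lambda_p + k)$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p) = Q'(X'\hat{W}X)Q$, where $Q$ is the orthogonal matrix whose columns are the eigenvectors of $X'\hat{W}X$. Hence, the scalar MSE of the BRRE is obtained by applying the $\mathrm{tr}(\cdot)$ operator to Eq (15):

$$\mathrm{MSE}(\hat\beta_{BRRE}) = \frac{1}{\phi}\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2} + k^2\sum_{j=1}^{p}\frac{\alpha_j^2}{(\lambda_j + k)^2}, \tag{16}$$

where $\alpha = Q'\beta$, estimated by $\hat\alpha = Q'\hat\beta_{MLE}$, and $k$ $(k > 0)$ is the ridge parameter suggested by Hoerl and Kennard [14]. For the BRRE, it is computed by differentiating Eq (16) with respect to $k$ and equating the result to zero, which yields the estimator of Eq (17).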
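The variance-bias trade-off behind the BRRE can be seen from its scalar MSE. The sketch below assumes the canonical form $\phi^{-1}\sum_j \lambda_j/(\lambda_j+k)^2 + k^2\sum_j \alpha_j^2/(\lambda_j+k)^2$; the eigenvalues, the $\alpha_j$ values, `phi` and the trial value of `k` are hypothetical numbers chosen only to mimic an ill-conditioned design.

```python
def mse_brre(lams, alphas, phi, k):
    """Scalar MSE of the beta ridge estimator in canonical (eigenvalue) form."""
    variance = sum(l / (l + k) ** 2 for l in lams) / phi
    squared_bias = k ** 2 * sum(a ** 2 / (l + k) ** 2 for l, a in zip(lams, alphas))
    return variance + squared_bias


lams = [2.0, 1.0, 0.001]    # hypothetical eigenvalues, one close to zero
alphas = [0.5, 0.5, 0.5]    # hypothetical canonical coefficients
phi = 10.0

mse_at_zero = mse_brre(lams, alphas, phi, k=0.0)    # k = 0 reproduces the MLE
mse_shrunk = mse_brre(lams, alphas, phi, k=0.01)    # a small positive k
print(mse_at_zero, mse_shrunk)
```

A small positive `k` adds a little squared bias but removes most of the variance contributed by the smallest eigenvalue, so the total MSE drops well below the value at `k = 0`.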

A beta Liu regression estimator
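Karlsson et al. [47] introduced another estimation method for the BRM, called the beta Liu estimator (BLE), as

$$\hat\beta_{BLE} = L_d\,\hat\beta_{MLE}, \tag{18}$$

where $L_d = (X'\hat{W}X + I)^{-1}(X'\hat{W}X + dI)$ and $d \in [0,1]$ is the Liu parameter. If $d \to 1$, then $\hat\beta_{BLE} = \hat\beta_{MLE}$. The bias vector and covariance matrix of Eq (18) can be respectively defined as

$$\mathrm{Bias}(\hat\beta_{BLE}) = (d-1)\,Q\Lambda_I^{-1}\alpha, \tag{19}$$

$$\mathrm{Cov}(\hat\beta_{BLE}) = \phi^{-1}Q\Lambda_I^{-1}\Lambda_d\Lambda^{-1}\Lambda_d\Lambda_I^{-1}Q'. \tag{20}$$

Thus, the MMSE of the BLE is

$$\mathrm{MMSE}(\hat\beta_{BLE}) = \phi^{-1}Q\Lambda_I^{-1}\Lambda_d\Lambda^{-1}\Lambda_d\Lambda_I^{-1}Q' + (d-1)^2\,Q\Lambda_I^{-1}\alpha\alpha'\Lambda_I^{-1}Q', \tag{21}$$

where $\Lambda_d = \mathrm{diag}(\lambda_1 + d, \lambda_2 + d, \dots, \lambda_p + d)$ and $\Lambda_I = \mathrm{diag}(\lambda_1 + 1, \lambda_2 + 1, \dots, \lambda_p + 1)$. Finally, the scalar MSE of the BLE can be defined as

$$\mathrm{MSE}(\hat\beta_{BLE}) = \frac{1}{\phi}\sum_{j=1}^{p}\frac{(\lambda_j + d)^2}{\lambda_j(\lambda_j + 1)^2} + (d-1)^2\sum_{j=1}^{p}\frac{\alpha_j^2}{(\lambda_j + 1)^2}. \tag{22}$$

The optimum value of $d$ is obtained by differentiating Eq (22) with respect to $d$ and setting the entire expression equal to zero. We obtain

$$d = \frac{\sum_{j=1}^{p}\left(\alpha_j^2 - \phi^{-1}\right)/(\lambda_j + 1)^2}{\sum_{j=1}^{p}\left(\phi^{-1} + \lambda_j\alpha_j^2\right)/\left\{\lambda_j(\lambda_j + 1)^2\right\}}. \tag{23}$$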
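Because the scalar MSE of the BLE is quadratic in $d$, its minimizer has a closed form. The sketch below assumes the canonical form $\phi^{-1}\sum_j (\lambda_j+d)^2/\{\lambda_j(\lambda_j+1)^2\} + (d-1)^2\sum_j \alpha_j^2/(\lambda_j+1)^2$ and obtains `d_opt` by setting the derivative of the whole sum to zero; all numeric inputs are hypothetical.

```python
def mse_ble(lams, alphas, phi, d):
    """Scalar MSE of the beta Liu estimator in canonical (eigenvalue) form."""
    variance = sum((l + d) ** 2 / (l * (l + 1.0) ** 2) for l in lams) / phi
    squared_bias = (d - 1.0) ** 2 * sum(
        a ** 2 / (l + 1.0) ** 2 for l, a in zip(lams, alphas)
    )
    return variance + squared_bias


def d_opt(lams, alphas, phi):
    """Stationary point of mse_ble in d (the minimizer, since the MSE is convex in d)."""
    num = sum((a ** 2 - 1.0 / phi) / (l + 1.0) ** 2 for l, a in zip(lams, alphas))
    den = sum(
        (1.0 / phi + l * a ** 2) / (l * (l + 1.0) ** 2) for l, a in zip(lams, alphas)
    )
    return num / den


lams = [2.0, 1.0, 0.001]    # hypothetical eigenvalues
alphas = [0.5, 0.5, 0.5]    # hypothetical canonical coefficients
phi = 10.0

d_star = d_opt(lams, alphas, phi)
print(d_star, mse_ble(lams, alphas, phi, d_star), mse_ble(lams, alphas, phi, 1.0))
```

Setting `d = 1.0` reproduces the MSE of the MLE. Note that `d_opt` is not clipped here, so for other inputs it can fall outside [0, 1].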

Construction of the proposed estimator
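For the LRM, Lukman et al. [3] proposed the following MRTE by augmenting the RRE:

$$\hat\beta_{MRTE}(k, d) = \left(X'X + k(1+d)I\right)^{-1}X'y. \tag{24}$$

In this study, we propose an MRTE for the BRM, called the MBRT estimator (MBRTE), defined as

$$\hat\beta_{MBRTE}(k, d) = \tilde{R}\,\hat\beta_{MLE}, \tag{25}$$

where $\tilde{R} = (X'\hat{W}X + k(1+d)I)^{-1}(X'\hat{W}X)$, $k$ $(k > 0)$ and $d$ $(0 < d < 1)$ are the two biasing parameters of the MBRTE, and $I$ is the identity matrix of order $p \times p$. The bias vector and covariance matrix of the MBRTE are respectively given as

$$\mathrm{Bias}[\hat\beta_{MBRTE}(k, d)] = (\tilde{R} - I)\beta, \tag{26}$$

$$\mathrm{Cov}[\hat\beta_{MBRTE}(k, d)] = \hat\phi^{-1}\tilde{R}\Lambda^{-1}\tilde{R}'. \tag{27}$$

Thus, the MMSE and scalar MSE of the MBRTE are respectively given as

$$\mathrm{MMSE}[\hat\beta_{MBRTE}(k, d)] = \phi^{-1}\tilde{R}\Lambda^{-1}\tilde{R}' + (\tilde{R} - I)\beta\beta'(\tilde{R} - I)',$$

$$\mathrm{MSE}[\hat\beta_{MBRTE}(k, d)] = \frac{1}{\phi}\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k(1+d))^2} + k^2(1+d)^2\sum_{j=1}^{p}\frac{\alpha_j^2}{(\lambda_j + k(1+d))^2}. \tag{30}$$

Proof. The proof of Theorem 2.3 can be seen in Appendix C.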
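The scalar MSE of the MBRTE can be evaluated in the same canonical form as the other estimators. The sketch below assumes that the estimator shrinks through the combined quantity $k(1+d)$, so that its MSE is $\phi^{-1}\sum_j \lambda_j/(\lambda_j+k(1+d))^2 + k^2(1+d)^2\sum_j \alpha_j^2/(\lambda_j+k(1+d))^2$; all numeric inputs are hypothetical.

```python
def mse_mbrte(lams, alphas, phi, k, d):
    """Scalar MSE of the modified beta ridge-type estimator (canonical form)."""
    c = k * (1.0 + d)  # combined shrinkage amount k(1 + d)
    variance = sum(l / (l + c) ** 2 for l in lams) / phi
    squared_bias = c ** 2 * sum(a ** 2 / (l + c) ** 2 for l, a in zip(lams, alphas))
    return variance + squared_bias


lams = [2.0, 1.0, 0.001]    # hypothetical eigenvalues, one close to zero
alphas = [0.5, 0.5, 0.5]    # hypothetical canonical coefficients
phi = 10.0

mse_mle_value = mse_mbrte(lams, alphas, phi, k=0.0, d=0.5)   # no shrinkage
mse_mbrte_value = mse_mbrte(lams, alphas, phi, k=0.02, d=0.5)
print(mse_mle_value, mse_mbrte_value)
```

With `k = 0` the expression reduces to the MSE of the MLE, and with `d = 0` it reduces to the ridge form with parameter `k`; the second parameter gives extra freedom in how much shrinkage is applied.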

Selection of the biasing parameters
Following the approach used for the RRE and many other estimators by authors such as Hoerl and Kennard [2] and Qasim et al. [34], appropriate values of the biasing parameters are needed for practical purposes. The optimal values of k and d are determined for the proposed estimator. In determining the optimal value of k, d is held fixed. The optimal value of k is obtained by differentiating Eq (30) with respect to k and equating the result to zero, which gives

$$k = \frac{1}{\phi(1+d)\alpha_j^2}. \tag{32}$$

For practical purposes, $\phi$ and $\alpha_j^2$ are replaced with their unbiased estimates, $\hat\phi$ and $\hat\alpha_j^2$, respectively. Consequently, Eq (32) becomes

$$\hat{k} = \frac{1}{\hat\phi(1+d)\hat\alpha_j^2}. \tag{33}$$

Furthermore, the optimal value of d can be derived by differentiating Eq (29) with respect to d for fixed k, which gives

$$d = \frac{1}{\phi k\alpha_j^2} - 1. \tag{35}$$

For practical purposes, $\phi$ and $\alpha_j^2$ are again replaced with their unbiased estimates, $\hat\phi$ and $\hat\alpha_j^2$, respectively. Consequently, Eq (35) becomes

$$\hat{d} = \frac{1}{\hat\phi\hat{k}\hat\alpha_j^2} - 1. \tag{36}$$

The estimators of the parameters d and k are obtained iteratively as follows:
Step 1: Obtain an initial estimate $\hat{d}$ of d.
Step 2: Obtain $\hat{k}$ using Eq (33), based on $\hat{d}$ as computed in Step 1.
Step 3: Estimate $\hat{d}$ using Eq (36) with the value of $\hat{k}$ obtained in Step 2.
Step 4: If the value of $\hat{d}$ obtained in Step 3 is not between 0 and 1, use the initial estimate from Step 1.
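The four steps can be sketched as a small routine. This is an illustration under stated assumptions rather than the authors' implementation: it uses the componentwise optima $k_j = 1/\{\phi(1+d)\alpha_j^2\}$ and $d_j = 1/(\phi k\alpha_j^2) - 1$ that follow from minimizing the scalar MSE, it takes the Step 1 starting value as a user-supplied argument `d_init`, and it reduces the componentwise values to a single number with a hypothetical aggregation function `agg` (the minimum by default).

```python
def select_k_d(alpha_hat, phi_hat, d_init, agg=min):
    """Iterative choice of (k, d); d_init and agg are illustrative assumptions."""
    # Step 2: k from the current d, one candidate per component, then aggregated.
    k_hat = agg([1.0 / (phi_hat * (1.0 + d_init) * a ** 2) for a in alpha_hat])
    # Step 3: d from the k obtained in Step 2.
    d_hat = agg([1.0 / (phi_hat * k_hat * a ** 2) - 1.0 for a in alpha_hat])
    # Step 4: fall back to the initial value when d leaves the interval (0, 1).
    if not 0.0 < d_hat < 1.0:
        d_hat = d_init
    return k_hat, d_hat


alpha_hat = [0.5, 0.4, 0.3]   # hypothetical estimates of the canonical coefficients
phi_hat = 10.0                # hypothetical precision estimate
k_hat, d_hat = select_k_d(alpha_hat, phi_hat, d_init=0.2)
print(k_hat, d_hat)
```

With the minimum as the aggregation rule, Step 3 returns the starting value of d, because the same component attains both minima; other rules, such as a mean over components, generally move d away from the starting value and can trigger the Step 4 fallback.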

Results Monte Carlo simulation
This section briefly describes how the data are generated and which factors are varied in the design of the simulation experiment. It also presents the criterion used to assess the performance of the proposed estimator and to compare it with the competing estimators.

Simulation layout
The numerical results were obtained using the following BRM:

$$\log\left(\frac{\mu_i}{1-\mu_i}\right) = \beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}, \quad i = 1, \dots, n, \tag{37}$$

where $x_{ij}$ are the covariates and $\beta_j$ are the regression coefficients. The slope parameters are restricted so that $\sum_{j=1}^{p}\beta_j^2 = 1$ (see, e.g., Kibria [9] for more details). The intercept $\beta_0$ is set to zero. The BRM in Eq (37) is defined using the logit link function. The n observations of the response variable are generated as $y_i \sim \mathrm{Beta}(\mu_i, \phi)$, $i = 1, \dots, n$, in each Monte Carlo replication. The correlated regressors are generated following McDonald and Galarneau [15] as

$$x_{ij} = (1-\rho^2)^{1/2}z_{ij} + \rho z_{i(p+1)}, \quad i = 1, \dots, n;\ j = 1, \dots, p, \tag{38}$$

where $z_{ij}$ are independent standard normal pseudo-random numbers and $\rho$ controls the degree of correlation between any two regressors. The performance of the proposed estimators is assessed under different conditions, such as the degree of correlation, which is taken to be $\rho = 0.$ […] The estimated MSE is computed as

$$\mathrm{MSE}(\hat\beta) = \frac{1}{R}\sum_{r=1}^{R}(\hat\beta_r - \beta)'(\hat\beta_r - \beta), \tag{39}$$

where $(\hat\beta_r - \beta)$ is the difference between the estimated and true parameter vectors in the rth replication and R is the number of replications. All computations are performed in the R programming language using the betareg() function of the betareg package.
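The regressor-generation scheme and the simulated MSE criterion can be sketched in a few lines. The paper's computations were done in R; the plain-Python version below is only an illustration, and the sample size, number of regressors, value of $\rho$ and seed are arbitrary choices rather than the paper's settings.

```python
import math
import random


def generate_regressors(n, p, rho, rng):
    """x_ij = sqrt(1 - rho^2) * z_ij + rho * z_i(p+1), with z standard normal."""
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p + 1)]
        rows.append([math.sqrt(1.0 - rho ** 2) * z[j] + rho * z[p] for j in range(p)])
    return rows


def simulated_mse(estimates, beta):
    """Average squared distance between the replicated estimates and the true beta."""
    total = 0.0
    for est in estimates:
        total += sum((e - b) ** 2 for e, b in zip(est, beta))
    return total / len(estimates)


rng = random.Random(2022)   # arbitrary seed for reproducibility
X = generate_regressors(n=200, p=4, rho=0.95, rng=rng)
print(len(X), len(X[0]))
```

Each column has unit variance by construction, and any two columns share the common term $\rho z_{i(p+1)}$, so their theoretical correlation is $\rho^2$. `simulated_mse` expects one estimated coefficient vector per replication.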

Results and discussion
The estimated MSEs of the BRM under the different estimation methods are reported in Tables 1-3. For evaluation purposes, we vary several factors to observe the performance of the proposed MBRTE. General comments on the findings of the simulation study are given below.
From the results reported in Tables 1-3, one can see that the proposed MBRTE performs well in the sense of attaining the minimum MSE among the estimation methods under study.
The results also demonstrate that multicollinearity has a severe impact on the estimated MSEs of the estimators under study. This impact is smaller for our proposed estimator, which indicates that the MBRTE behaves robustly in the presence of high but imperfect multicollinearity. It should be noted that the MLE is the most negatively affected estimator when the regressors are correlated with one another. In addition, our proposed estimator outperforms the traditional MLE, BRRE and BLE in reducing the effect of collinearity among the regressors in the BRM.
Increasing the sample size (n) decreases the MSE values of the BRM for all the estimators under study; that is, the sample size is inversely related to the estimated MSEs.
Increasing the number of explanatory variables increases the simulated MSE values of the BRM estimators. Again, the MLE is the most severely affected estimator in this situation. Examining the performance of the estimators with respect to the number of regressors, we conclude that the proposed MBRTE remains a reliable estimation method compared with the other estimation methods.
Hence, the findings of the simulation are compatible with the theoretical results. In summary, we suggest using the MBRTE in the presence of moderate to strong multicollinearity, because it considerably decreases the simulated MSEs.

Application: Heat treating test data
To evaluate the performance of the proposed method, we consider the heat treating test data. This dataset is taken from Montgomery and Runger [56]. This dataset consists of one response and five explanatory variables, where the response variable (y) represents the PITCH that denotes the product introduction to customer heart whose meaning is the quality of a sound governed by the rate of vibrations producing it or the level of something. The regressors include: $x_1$ = furnace temperature (Temp), $x_2$ = carbon concentration, $x_3$ […] The estimated coefficients and their respective MSEs for the BRM estimators are reported in Table 4. The estimated coefficients of the MLE, BRRE, BLE, and MBRTE are computed using Eqs (9), (12), (18) and (25), respectively, while the MSEs of these estimators are computed using Eqs (11), (16), (22) and (30), respectively. The ridge parameter used in the scalar MSE of the BRRE is computed using Eq (17) and is found to be $\hat{k} = 0.0044$, the Liu parameter of the BLE is calculated using Eq (23) and is found to be $\hat{d} = 0.8948$, and the biasing parameters of the MBRTE are computed using Eqs (33) and (36) and are found to be $\hat{k} = 0.0033$ and $\hat{d} = 0.5135$, respectively. The results in Table 4 show that the MSE of the proposed MBRTE is substantially smaller than that of the MLE as well as the other biased estimators. The results also confirm that the MLE is the most negatively affected estimator when the regressors are highly correlated with one another, as shown by its inflated MSE. The BRRE and BLE show a smaller MSE than the MLE, but these two biased estimators are less reliable than our proposed MBRTE. Further, we also consider the standard errors (SEs) of the estimators considered in the study. Table 4 clearly demonstrates that the proposed MBRTE attains the smallest SEs while the MLE attains the largest SEs of the regression estimates.
Further, given its large SEs and MSE, the MLE is clearly not the better estimation method.
We use another criterion, cross validation (CV), applied to the real-life data set to assess the proposed method. The average validation errors under the CV method are also given in Table 4. In the CV criterion, we divide the training set into K = 5 equal-size subsets, or folds. For each k = 1, …, 5, we fit our prediction function on all points except those in the kth fold and evaluate the validation error on the points in the kth fold. For more details on CV, we refer readers to [26,43-45]. CV is used to examine the predictive performance of the estimators comprehensively. Table 4 reports the CV values of all the estimators considered in the study. The results indicate that the proposed MBRTE performs better than the MLE, BRRE, and BLE. So both criteria, i.e., MSE and CV, show that the proposed estimator performs consistently better than its competitors.
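The K-fold procedure described above can be sketched generically. In the sketch below, `fit` and `validation_error` are hypothetical callbacks standing in for fitting a BRM estimator on the training indices and scoring it on the held-out fold; the toy response values and the sample-mean "model" are placeholders, not the heat treating data.

```python
def k_fold_indices(n, n_folds):
    """Split indices 0..n-1 into n_folds contiguous folds of near-equal size."""
    sizes = [n // n_folds + (1 if i < n % n_folds else 0) for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n, n_folds, fit, validation_error):
    """Average validation error over the folds."""
    errors = []
    for held_out in k_fold_indices(n, n_folds):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit(train)                       # fit on all points but the fold
        errors.append(validation_error(model, held_out))
    return sum(errors) / len(errors)


y = [0.2, 0.4, 0.6, 0.8, 0.5, 0.3, 0.7, 0.1, 0.9, 0.5]   # toy proportions


def fit_mean(train):
    """Placeholder 'model': the sample mean of the training responses."""
    return sum(y[i] for i in train) / len(train)


def squared_error(model, held_out):
    """Mean squared prediction error on the held-out fold."""
    return sum((y[i] - model) ** 2 for i in held_out) / len(held_out)


cv_error = cross_validate(len(y), 5, fit_mean, squared_error)
print(cv_error)
```

The folds here are contiguous and unshuffled; in practice the indices are usually permuted first so that the fold membership does not depend on the ordering of the data.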
Hence, both the simulation results and the empirical application consistently support our proposed estimator, and we recommend that practitioners use the MBRTE instead of the BRRE and BLE for estimating the unknown parameters of the BRM whenever the regressors are highly multicollinear.

Conclusions
In this paper, we proposed a new modified beta ridge-type (MBRT) estimator. We showed analytically that the new estimator is superior to the standard MLE as well as to other well-known biased estimators, i.e., the BRRE and BLE. A simulation study was conducted to compare the performance of the proposed estimator with the available estimators. In the simulation experiment, we considered several factors in order to monitor the behaviour of the proposed estimator from every perspective. Based on the simulation findings, the proposed estimator performed better than the existing estimators in the sense of a smaller MSE and can therefore be recommended. Finally, the benefit of the proposed estimator was shown by an empirical application in which the MBRTE for the BRM performed considerably better, in terms of smaller MSE and CV, than the MLE and the other biased estimators. We therefore suggest that researchers use the MBRTE whenever they fit the BRM and face the issue of multicollinearity.