USING BETA REGRESSION MODELING IN MEDICAL SCIENCES: A COMPARATIVE STUDY

Abstract: Beta regression (BR) models provide an adequate approach for modeling continuous outcomes restricted to the interval (0, 1). The BR model assumes that the dependent variable follows a beta distribution and that its mean is related to a set of explanatory variables through a linear predictor with unknown coefficients and a link function. The BR model also includes a dispersion parameter. This paper describes the BR model along with its properties. Furthermore, different link functions of the BR model are compared through a real-life medical application.


INTRODUCTION
The BR model is treated as a type of generalized linear model (GLM) because the beta distribution is a member of the exponential family of distributions. A GLM constructed under the assumption that the dependent variable follows a beta distribution is referred to as the BR model. Beta regression was formally introduced in political science and has many applications in medical sciences. It is a suitable alternative to traditional linear regression when the dependent variable follows a beta distribution rather than a normal distribution. The beta distribution can be parameterized by its mean and variance, like the normal distribution. However, unlike the normal distribution, the variance of a beta distribution is a function of its mean and a 'precision' parameter, which measures how tightly the observed data are clustered around the mean.
The usual practice has been to transform the data so that the transformed response, say $\tilde{y}$, assumes values on the real line, and then to apply a standard linear regression analysis. A commonly used transformation is the logit function, $\tilde{y} = \log\left(\frac{y}{1-y}\right)$, where log is the natural logarithm. However, this approach has some disadvantages:
1- The regression parameters are interpretable in terms of the mean of $\tilde{y}$, and not in terms of the mean of y (given Jensen's inequality).
2- Regressions involving data from the unit interval, such as rates and proportions, are typically heteroskedastic: they display more variation around the mean and less variation as we approach the lower and upper limits of the standard unit interval.
3- The distributions of rates and proportions are typically asymmetric, and thus Gaussian-based approximations for interval estimation and hypothesis testing can be quite inaccurate in small samples.
Ferrari and Cribari [1] proposed a regression model for continuous variates that assume values in the standard unit interval, such as rates, proportions, or concentration indices. Since the model assumes that the response is beta distributed, they called their model the BR model. In their model, the regression parameters are interpretable in terms of the mean of y (the variable of interest) and the model is natural [2].
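The first disadvantage of the logit transformation can be illustrated numerically. The sketch below is not part of the original analysis: it simulates right-skewed proportions (the Beta(2, 8) draws and the helper names are assumptions chosen for illustration) and shows that back-transforming the mean of logit(y) does not recover the mean of y.

```python
import math
import random

def logit(y):
    """Map a proportion in (0, 1) to the real line."""
    return math.log(y / (1.0 - y))

def inv_logit(t):
    """Map a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def back_transformed_mean(ys):
    """Inverse logit of the average logit: the quantity a linear
    regression on the transformed scale describes, not the mean of y."""
    return inv_logit(sum(logit(y) for y in ys) / len(ys))

random.seed(0)  # fixed seed so the illustration is reproducible
# Simulated right-skewed proportions (illustrative only, not the study data).
ys = [random.betavariate(2.0, 8.0) for _ in range(5000)]

mean_y = sum(ys) / len(ys)
print("mean of y:", round(mean_y, 4))
print("inverse logit of mean logit(y):", round(back_transformed_mean(ys), 4))
```

The two printed values differ because the logit is nonlinear, which is the point of the Jensen's inequality remark. The direction of the gap depends on where the data sit in the unit interval.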

Let y be a random variable following a beta distribution, $y \sim B(p, q)$, where p and q are shape parameters with $p, q > 0$. The probability density function (PDF) of y is given as follows:

$$f(y; p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\, y^{p-1} (1-y)^{q-1}, \quad 0 < y < 1, \tag{1}$$

where $\Gamma(\cdot)$ is the gamma function. The mean and variance of y are

$$E(y) = \frac{p}{p+q}, \qquad Var(y) = \frac{pq}{(p+q)^2 (p+q+1)}.$$

Figure 1 shows that the beta distribution is highly flexible and able to accommodate varying degrees of skewness, heteroskedasticity, and asymmetry (which make normalizing transformations impossible) [3]. However, previous applications of the beta distribution did not involve situations in which the response variable is modeled as a function of exogenous variables until the BR model was proposed by [1].
4 ABONAZEL, SAID, TAG-ELDIN, ABDEL-RAHMAN, KHATTAB
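The density and moment formulas for the beta distribution can be checked numerically. The following is a minimal sketch, assuming the standard density with shape parameters p and q. The function names and the midpoint-rule integrator are illustrative choices, not part of the paper.

```python
import math

def beta_pdf(y, p, q):
    """Beta density Gamma(p+q) / (Gamma(p) Gamma(q)) * y^(p-1) * (1-y)^(q-1),
    evaluated through log-gamma for numerical stability."""
    if not 0.0 < y < 1.0:
        return 0.0
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_norm + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y))

def beta_mean(p, q):
    """E(y) = p / (p + q)."""
    return p / (p + q)

def beta_var(p, q):
    """Var(y) = p q / ((p + q)^2 (p + q + 1))."""
    return p * q / ((p + q) ** 2 * (p + q + 1.0))

def midpoint_integral(f, n=20000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h

p, q = 2.0, 5.0  # illustrative shape parameters
total = midpoint_integral(lambda y: beta_pdf(y, p, q))
mean_num = midpoint_integral(lambda y: y * beta_pdf(y, p, q))
var_num = midpoint_integral(lambda y: (y - mean_num) ** 2 * beta_pdf(y, p, q))
print("integral of the density:", total)
print("mean: numeric", mean_num, "closed form", beta_mean(p, q))
print("variance: numeric", var_num, "closed form", beta_var(p, q))
```

The midpoint rule is adequate here because the chosen shape parameters are both greater than one. For shape parameters below one the density is unbounded at an endpoint and a more careful integrator would be needed.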

LITERATURE REVIEW
Various beta regression models have been proposed as a function of location and precision parameters [4][5][6]. Paolino [4] demonstrated that maximum likelihood estimation for proportions data using the beta distribution provides more accurate and precise results than the OLS approach.
Kieschnick and McCullough [5] identified different specifications for variates observed on the [0, 1] interval and recommended that regression models based on the beta distribution be used to model such data. Smithson and Verkuilen [6] presented maximum likelihood regression models assuming that the response variable is conditionally beta distributed. Location and variance were modeled using continuous and categorical variables, taking the heteroscedasticity problem into account. Cribari and Vasconcellos [7] analyzed the finite-sample behavior of three bias corrections to the maximum likelihood estimators of the beta distribution parameters, while Vasconcellos and Cribari [8] proposed a new class of regression models for beta-distributed response variables in which the beta distribution's parameters are related to the regression parameters and covariates. They also highlighted the bias of the maximum likelihood estimator in small samples.
Simas et al. [9] extended the beta regression model proposed by [1] by allowing the regression structure to be nonlinear. They defined bias-corrected estimators by deriving formulas for the second-order biases of the maximum likelihood estimators. Cepeda [10] proposed joint mean and variance beta regression models and applied the Bayesian estimation method. Bayer et al. [11] proposed the beta regression control chart (BRCC). Ferrari and Pinheiro [12] developed two modified likelihood ratio tests to test restrictions on the beta regression. Bayer and Cribari [13] proposed three Bartlett-corrected likelihood ratio tests for fixed dispersion of the beta regression.
Also, Bayesian estimation of the BR model was proposed by [14]-[16].
Many studies have used the BR model in medical applications. Swearingen et al. [17] modeled ischemic stroke lesions using beta regression because of their highly skewed distribution. Cepeda et al. [10] modeled meteorological data assuming a beta distribution, with both the mean and precision parameters being modeled. Yellareddygari et al. [18] proposed a BR model for predicting the development of pink rot in potato tubers during storage and compared the BR model with the linear model on real data. Aktaş and Unlu [19] applied the BR model and described its properties on well-being index data in Turkey. Gayawan et al. [20] adopted the beta regression model to examine covariate effects on the child mortality index in Nigeria.
Another group of studies proposed and developed the inflated BR model [21]-[25]. Regarding model selection criteria, Bayer and Cribari [11] proposed a model selection criterion for beta regression with varying dispersion, focusing on the selection of covariates for both the mean and dispersion sub-models. Espinheira et al. [26] also proposed model selection criteria that consider residuals, leverage, and influential points for both linear and nonlinear systematic components.

Beta Regression Model
The BR model introduced by Ferrari and Cribari [1] considered the precision parameter to be constant across all observations. Nevertheless, assuming constant precision could lead to a substantial loss in efficiency of the estimators [27]. In BR with varying dispersion, the precision parameter is allowed to vary across observations and is modeled through covariates, unknown parameters, and a link function. The BR model assumes that the response variable follows a beta distribution, a family of continuous probability distributions defined on the interval (0, 1) with two positive shape parameters (p and q in (1)) that control the shape of the distribution on the unit interval. Ferrari and Cribari [1] defined a regression structure for beta-distributed responses based on a parameterization that differs from (1). Let $\mu = \frac{p}{p+q}$ and $\phi = p + q$, so that $p = \mu\phi$ and $q = (1-\mu)\phi$. Under this new parameterization, the beta density in (1) can be written as:

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1} (1-y)^{(1-\mu)\phi-1}, \quad 0 < y < 1,$$

where $y \sim B(\mu, \phi)$, $0 < \mu < 1$, and $\phi > 0$. Then $E(y) = \mu$ and $Var(y) = \frac{\mu(1-\mu)}{1+\phi}$, where $\phi$ is known as the precision parameter and $\phi^{-1}$ is the dispersion parameter. Let $\{y_1, \ldots, y_n\}$ be independent response variables, where each $y_i$ ($i = 1, \ldots, n$) follows the beta density with mean $\mu_i$ and precision $\phi_i$. The model is obtained by assuming that the mean of $y_i$ can be written as

$$g(\mu_i) = x_i^\top \beta = \eta_i,$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_k)^\top$ is a vector of unknown regression parameters, $x_i = (1, x_{i1}, \ldots, x_{ik})^\top$ contains the observations on k covariates, and $g(\cdot)$ is a link function. We can define a dispersion sub-model to account for possibly varying dispersion:

$$h(\phi_i) = z_i^\top \gamma,$$

where $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_q)^\top$ is a vector of unknown regression parameters, $z_i = (z_{1i}, \ldots, z_{qi})^\top$ are observations on q covariates (q < n − k), and $h(\cdot)$ is a link function. Note that under our parametrization, one can use for the dispersion sub-model the same link functions that are typically used for the mean sub-model, namely logit, probit, clog-log, and log-log. For details on these links, see Table 1.
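The mean-precision parameterization can be made concrete with a short sketch. It assumes the relations p = μφ and q = (1 − μ)φ between the shape parameters and the mean μ and precision φ. The function names are illustrative.

```python
def shapes_from_mean_precision(mu, phi):
    """Convert (mu, phi) to the shape parameters (p, q) = (mu*phi, (1-mu)*phi)."""
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    return mu * phi, (1.0 - mu) * phi

def mean_precision_from_shapes(p, q):
    """Convert shape parameters to (mu, phi) = (p/(p+q), p+q)."""
    if p <= 0.0 or q <= 0.0:
        raise ValueError("need p > 0 and q > 0")
    return p / (p + q), p + q

def beta_variance(mu, phi):
    """Var(y) = mu (1 - mu) / (1 + phi): larger precision means smaller variance."""
    return mu * (1.0 - mu) / (1.0 + phi)

# Same mean, increasing precision: the variance shrinks.
for phi in (2.0, 8.0, 50.0):
    p, q = shapes_from_mean_precision(0.25, phi)
    print(f"phi={phi}: p={p}, q={q}, variance={beta_variance(0.25, phi):.5f}")
```

Holding μ fixed while φ grows shows why φ is read as a precision parameter: the distribution concentrates around its mean.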

Link Functions and Estimation
Different link functions can be considered for the BR model, as in other generalized linear models [28]-[30]. When the parameters of interest lie in the continuous interval (0, 1), possibilities include symmetric and asymmetric link functions [31], the Box-Cox transformation link function [32], the Gosset link function [33], the Pregibon link function, and the generalized logit function [34]. Canterle and Bayer [35] proposed $g(\mu) = \eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. The inverse link $g^{-1}(\cdot)$ is also called the mean function. Commonly employed link functions and their inverses are shown in Table 1. The identity link simply returns its argument unaltered, so that $\eta = g(\mu) = \mu$ and $\mu = g^{-1}(\eta) = \eta$.
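The four links compared later in the paper can be written down explicitly. Table 1 is not reproduced in this text, so the sketch below assumes the textbook definitions of the logit, probit, clog-log, and log-log links. In particular, the sign convention used for the log-log link is an assumption and may differ between software packages. The dictionary name `LINKS` is illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used by the probit link

# Each entry maps a link name to (g, g_inverse), where eta = g(mu)
# and mu = g_inverse(eta). Definitions are the usual textbook forms.
LINKS = {
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
    "probit": (
        lambda mu: _STD_NORMAL.inv_cdf(mu),
        lambda eta: _STD_NORMAL.cdf(eta),
    ),
    "cloglog": (
        lambda mu: math.log(-math.log(1.0 - mu)),
        lambda eta: 1.0 - math.exp(-math.exp(eta)),
    ),
    "loglog": (
        lambda mu: -math.log(-math.log(mu)),
        lambda eta: math.exp(-math.exp(-eta)),
    ),
}

# Compare the mean implied by the same linear predictor under each link.
for name, (g, g_inv) in LINKS.items():
    print(name, [round(g_inv(eta), 4) for eta in (-2.0, 0.0, 2.0)])
```

The logit and probit links are symmetric around μ = 0.5, while the clog-log and log-log links are asymmetric, which is why the choice of link can change the fit for skewed proportions.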
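Some processes require that we model the scale parameter as a function of covariates, as in the dispersion sub-model. The log-likelihood function for the BR model is:

$$\ell(\beta, \gamma) = \sum_{i=1}^{n} \ell_i(\mu_i, \phi_i),$$

$$\ell_i(\mu_i, \phi_i) = \log\Gamma(\phi_i) - \log\Gamma(\mu_i\phi_i) - \log\Gamma((1-\mu_i)\phi_i) + (\mu_i\phi_i - 1)\log y_i + ((1-\mu_i)\phi_i - 1)\log(1 - y_i).$$

It is possible to show that this BR model is a regular model, since all regularity conditions are satisfied. Furthermore, with an invertible reparameterization, one can guarantee that the ML estimates are unique [36]-[37]. A particularly useful link function is the logit link, in which case we can write $\mu = \frac{e^{\eta}}{1 + e^{\eta}}$, where $\eta$ is the linear predictor.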
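The log-likelihood can be evaluated directly from the mean-precision form of the beta density. The sketch below assumes shape parameters μφ and (1 − μ)φ for each observation and a logit link for the mean. The function names are illustrative, and no optimizer is included, so this only evaluates the function that an ML routine would maximize.

```python
import math

def beta_loglik(y, mu, phi):
    """Sum of log beta densities with per-observation mean mu_i and
    precision phi_i, i.e. shape parameters mu_i*phi_i and (1-mu_i)*phi_i."""
    total = 0.0
    for y_i, mu_i, phi_i in zip(y, mu, phi):
        p = mu_i * phi_i
        q = (1.0 - mu_i) * phi_i
        total += (math.lgamma(phi_i) - math.lgamma(p) - math.lgamma(q)
                  + (p - 1.0) * math.log(y_i)
                  + (q - 1.0) * math.log(1.0 - y_i))
    return total

def logit_mean(x_row, beta):
    """Mean under the logit link: mu = exp(eta) / (1 + exp(eta)), eta = x'beta."""
    eta = sum(b * x for b, x in zip(beta, x_row))
    return 1.0 / (1.0 + math.exp(-eta))

# Toy data (made up for illustration): an intercept plus one covariate.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [0.30, 0.45, 0.70]
beta = (-1.0, 0.8)   # hypothetical coefficients
phi = [20.0] * len(y)  # constant precision, also hypothetical

mu = [logit_mean(row, beta) for row in X]
print("fitted means:", [round(m, 4) for m in mu])
print("log-likelihood:", beta_loglik(y, mu, phi))
```

A varying-dispersion model would replace the constant `phi` list with values obtained from its own linear predictor and link.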

Model Selection Criteria
Model comparison can be made via information criteria such as the Akaike information criterion (AIC) introduced by [36] or the Bayesian information criterion (BIC). The AIC has the well-known flaw of being dependent on sample size: it tends to capitalize on chance and thus favors more complex models when the sample size is large. For this reason, many prefer the BIC [6]. The two criteria are

$$AIC = -2\,\ell(\hat{\theta}) + 2k, \qquad BIC = -2\,\ell(\hat{\theta}) + k \log(n),$$

where $\ell(\hat{\theta})$ is the log-likelihood of whatever model was fitted, k is the number of parameters estimated, and n is the number of observations.
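Both criteria are simple functions of the maximized log-likelihood. The sketch below assumes the standard definitions AIC = −2ℓ + 2k and BIC = −2ℓ + k log(n). The candidate log-likelihood values are hypothetical and are not results from this study.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2*loglik + 2*k (smaller is better)."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*loglik + k*log(n) (smaller is better)."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical fits: (log-likelihood, number of estimated parameters).
candidates = {"model_a": (12.4, 4), "model_b": (13.1, 6)}
n = 20  # sample size of the same order as in the application

for name, (ll, k) in candidates.items():
    print(name, "AIC =", round(aic(ll, k), 3), "BIC =", round(bic(ll, k, n), 3))
```

For n larger than about 7, log(n) exceeds 2, so the BIC penalizes each extra parameter more heavily than the AIC does.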
To assess the goodness of fit and the predictive capacity of a generalized regression model, pseudo-$R^2$ indices have been developed that are intended as analogs of the $R^2$ used with the ordinary least squares (OLS) estimator. One such index is a penalized version of the BR pseudo-R-squared used in [1]. A simple candidate is a proportional reduction of error (PRE) statistic based on log-likelihoods, computed from $\ell_{null}$, the log-likelihood of the null model as defined earlier, and $\ell_{model}$, the log-likelihood of whatever model was fitted.

REAL-LIFE APPLICATION
The dataset used in this application was obtained from Abdelmohsen et al. [38]. The data were collected for a master's thesis in pediatric surgery conducted at Al-Azhar University in Egypt, which recorded pre- and post-operative data for children aged one to twelve years who were suffering from a congenital obstruction at the connection between the kidney and the ureter, a condition called ureteropelvic junction obstruction. The sample consists of twenty patients divided into two groups (ten patients in each group) to compare two minimally invasive techniques that use surgical laparoscopy to widen this obstruction.

Descriptive Statistics
STATA software version 14.2 is used to analyze the data. Tables 3 and 4 provide labels and descriptive statistics for the study variables.

Dependent variable
Post-Split Function (PO.SF): proportion (percent) after surgery.

Independent variables
Groups
Table 4 indicates that all variables have small variation, as the coefficient of variation (CV) is less than one for every variable, which suggests that the data do not contain outliers. The Jarque-Bera test is used to test normality. Since the p-values of all variables exceed 0.05, normality is not rejected for any variable, so the t-test is used to compare the two groups. There are no significant differences in the PO.SF, PRE.SF, GTI, and Pain variables between the two groups, while there is a significant difference between the two groups for the HOS variable.
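The two descriptive checks used here, the coefficient of variation and the Jarque-Bera statistic, can be sketched as follows. This is an illustration under stated assumptions, not the Stata computation used in the paper: the CV uses the sample standard deviation, the Jarque-Bera statistic uses moment-based skewness and kurtosis, and the p-value relies on the asymptotic chi-square distribution with two degrees of freedom, which can be rough for a sample of twenty. The input values are made up.

```python
import math

def coefficient_of_variation(xs):
    """Sample standard deviation divided by the mean."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return sd / mean

def jarque_bera(xs):
    """Return (JB, p_value) with JB = n/6 * (S^2 + (K - 3)^2 / 4).

    S and K are the moment-based skewness and kurtosis. The p-value uses
    the chi-square(2) survival function, which equals exp(-JB / 2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)

values = [0.41, 0.45, 0.38, 0.52, 0.47, 0.44, 0.50, 0.39, 0.46, 0.43]  # made-up proportions
jb_stat, p_value = jarque_bera(values)
print("CV:", round(coefficient_of_variation(values), 4))
print("Jarque-Bera:", round(jb_stat, 4), "p-value:", round(p_value, 4))
```

A p-value above 0.05 means that normality is not rejected, which is weaker than showing that the variable is normally distributed.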

Diagnostic Tests
To test for multicollinearity in the model, we used the correlation matrix and the variance inflation factor (VIF). As shown in Table 5, the model has no multicollinearity problem, because all correlation coefficients are less than 0.8 and the VIF values for all variables are less than 5 [6, 35, 39, 40].
Some authors have introduced new estimators to reduce the impact of the multicollinearity problem in the BR model [30, 39, 40, 41]. To test for heteroskedasticity, we used the Breusch-Pagan-Cook-Weisberg (BPCW) test.
There is no heteroskedasticity problem (P-value = 0.8635 > 0.05). The linear regression model is non-significant, and all of its variables are non-significant. Therefore, the BR model is the more suitable model for this medical application. To choose the best link function for the model, we used goodness-of-fit criteria, as shown in Table 7. Table 7 compares the BR model under four link functions (Logit, Probit, Clog-log, Log-log). We found that the clog-log link has the smallest values of BIC and AIC, and the largest values of the log-likelihood and pseudo-$R^2$. Thus, we concluded that the BR model with the clog-log link is the best model for analyzing our data.

CONCLUSION
Beta regression assumes that the response variable is beta-distributed. The BR model has mean and dispersion parameters, and the mean is related to a set of independent variables through a linear predictor with unknown coefficients and a link function. In this paper, we studied the BR model with different link functions (Logit, Probit, Clog-log, and Log-log) and applied it in the medical field. We used selection criteria to compare the different BR models and to choose the best model for analyzing the data. We used real data from medical research to compare two minimally invasive techniques, surgical laparoscopy and laparoscopy-assisted surgery, for widening the ureter. The results indicated that the linear regression model is not suitable for this application, unlike the BR model with the clog-log link, which provided the best results.