Vasicek Quantile and Mean Regression Models for Bounded Data: New Formulation, Mathematical Derivations, and Numerical Applications

: The Vasicek distribution is a two-parameter probability model with bounded support on the open unit interval. This distribution allows for different and ﬂexible shapes and plays an important role in many statistical applications, especially for modeling default rates in the ﬁeld of ﬁnance. Although its probability density function resembles some well-known distributions, such as the beta and Kumaraswamy models, the Vasicek distribution has not been considered to analyze data on the unit interval, especially when we have, in addition to a response variable, one or more covariates. In this paper, we propose to estimate quantiles or means, conditional on covariates, assuming that the response variable is Vasicek distributed. Through appropriate link functions, two Vasicek regression models for data on the unit interval are formulated: one considers a quantile parameterization and another one its original parameterization. Monte Carlo simulations are provided to assess the statistical properties of the maximum likelihood estimators, as well as the coverage probability. An R package developed by the authors, named vasicekreg , makes available the results of the present investigation. Applications with two real data sets are conducted for illustrative purposes: in one of them, the unit Vasicek quantile regression outperforms the models based on the Johnson-SB, Kumaraswamy, unit-logistic, and unit-Weibull distributions, whereas in the second one, the unit Vasicek mean regression outperforms the ﬁts obtained by the beta and simplex distributions. Our investigation suggests that unit Vasicek quantile and mean regressions can be of practical usage as alternatives to some well-known models for analyzing data on the unit interval.


Introduction
Modeling of bounded data has been extensively discussed in the literature to infer some statistical indicators such as mortality rates, recovery rates, as well as economic and risk measures. In diverse areas of knowledge, researchers are faced with continuous variables expressed as indices, proportions, rates, and ratios, among others. When there is an interest in explaining the relationship between these variables and some covariates, it is necessary to use a regression model that, in general, offers a summary of the mean of the response variable conditioned to the covariates. However, since the mean is clearly affected by extreme values in the data, inference from mean regression models is also affected by this situation.
When the response variable is skew-distributed, using a robust modeling is more suitable than a mean regression. Some works related to skew distributions are presented in [1][2][3]. For this reason, the quantile regression [4] can be proposed as an alternative modeling.
If the interest is in evaluating the effect of covariates on different quantiles of the distribution of the response variable, including the median, the quantile regression models are also appropriate. For a parametric analysis, the model is constructed considering a distribution supported in the same range as the response variable.
Although many other quantile regression models to adjust the limited data are already available, it is clear that none of them are a panacea for the diverse problems that a practitioner confronts. In this sense, it is important to propose and evaluate new alternatives. For example, the Vasicek (VASI) distribution, proposed in [24], is an alternative probability distribution bounded on the (0, 1) interval. This distribution has two parameters, which are defined on such as interval, where one of them is equal to the model mean. Hence, unlike other models whose parameters do not allow an interpretation direct, the VASI distribution has this advantage since the mean is one of its parameters. This also permits us to consider the distribution mean in the presence of covariates. The VASI distribution is typically employed to model economic data on the (0, 1) interval. The probability density function (PDF) of such a distribution can take different forms, such as U-shaped, J-shaped, or uniform [25]. Note that the literature on the VASI distribution is rather scarce.
Based on the above, the objective of our investigation is to introduce an alternative to the parametric quantile regression models for skewed distributed response variables restricted to the (0, 1) interval by considering the VASI distribution. Then, we extend the range of applications of this distribution. This paper is organized as follows. Section 2 introduces the VASI distribution parameterized either in terms of its quantile or its mean. Directions regarding the maximum likelihood (ML) estimation method in both quantile and mean parameterizations are presented in Section 3. In this section, diagnostics based on residuals are also considered. In Section 4, we provide two Monte Carlo simulation analyses to assess some statistical properties of the ML estimators, such as bias and mean squared error, in a regression-based functional form through the logit link function. The potential and competitiveness of the proposed models are shown in Section 5 through two applications involving real-world data. Section 6 presents some final considerations about the proposed model. Finally, in Appendix A, we characterize other distributions that appear in this paper.

The Vasicek Distribution
In this section, we introduce the VASI distribution parameterized either in terms of its quantile or its mean. A random variable Y with bounded support on the (0, 1) interval is VASI distributed [24] if its PDF, cumulative distribution function (CDF), and quantile function (QF) are written, respectively, as where 0 < (y, α, θ, τ) < 1, whereas Φ and Φ −1 denote, respectively, the standard normal CDF and QF. Note that θ is the shape parameter of the VASI distribution, and its mean and variance are, respectively, E(Y) = α and Var(Y) where Φ 2 is the bivariate standard normal PDF with correlation θ [26] (p. 900) calculated from the double integral stated as As presented in [27], the PDF defined in (1) is unimodal with mode computed from , when θ < 0.5; monotone, when θ = 0.5; and U-shaped, when θ > 0.5. Figure 1 shows the PDF defined in (1) for different values of θ and α. Observe that the distribution possesses a symmetry property, that is, F(y; α, θ) = 1 − F(1 − y; α, θ).  Note that we can assess the effect of covariates on the mean of the distribution of Y through some appropriate link function. Moreover, from the QF formulated in (2), we can re-parameterize α in terms of the τ-th quantile, for τ ∈ (0, 1), considering a one-to-one transformation (α, θ) → (µ, θ), as where µ is, for a fixed and known value τ, the τ-th quantile of the distribution of Y. Figure 2 shows the PDF stated in (1) re-parameterized and assuming different values for µ, θ, and τ. We consider the first decile (τ = 0.10), the first quartile (τ = 0.25), the median (τ = 0.50), the third quartile (τ = 0.75), and the ninth decile (τ = 0.90) for τ. The dotted vertical lines in Figure 2 indicate where the parameter α is located according to the values of τ. Observe that F(y; µ, θ, τ) = 1 − F(1 − y; µ, θ, τ).  In summary, assuming that Y is VASI distributed, we can fit its regression models for estimating both the mean and quantile of the response variable, for a fixed τ-level. Its desirable properties may distinguish it from some other distributions proposed in the literature for data on the (0, 1) interval.

Maximum Likelihood Estimation and Residuals Analysis
In this section, we describe the ML estimation considering the parameterizations (α, θ), (µ, θ), (α(x), θ(z)), and (µ(x), θ(z)), where x and z are values of the covariate vectors. Some directions on residuals analysis are also provided here.
Let Y = (Y 1 , . . . , Y n ) be a random sample of size n from a VASI distribution with unknown parameters α and θ and consider that y = (y 1 , . . . , y n ) are its observed values. Based on the PDF stated in (1), the corresponding log-likelihood function is proportional to The ML estimates of α and θ are obtained maximizing the log-likelihood function stated in (4) by solving the score equations U α = ∂ (α, θ; y)/∂α = 0 and U θ = ∂ (α, θ; y)/∂θ = 0 written, respectively, as where φ denotes the standard normal PDF. By solving the equation U θ = 0, obtained from (6), in α, we have while θ is calculated numerically considering α = ξ(θ) in the formula stated (6). The estimated asymptotic standard errors (SEs) of α and θ are computed by the corresponding inverse Hessian matrix. Such a matrix is obtained by the second-order derivative of the expression given in (4) with respect to α and θ. Note that Now, by considering the re-parameterization given in (3), the log-likelihood function is proportional to such that are the score equations for µ and θ, respectively. Note that, concomitant with y i , for i ∈ {1, . . . , n}, we can also observe covariate vectors x i and z i . Then, we are interested in evaluating the effects of these covariates on (α, θ) or (µ, θ). For ML estimation, we have the observations y 1 , . . . , y n from n VASI distributed independent random variables Y 1 , . . . , Y n . Now, assume the equations expressed as which link both η i and ζ i with a linear combination of the values of the covariates x i = (1, x i1 , . . . , x ip ) and z i = (1, z i1 , . . . , z iq ) , respectively. Note that ω i can be the mean or the τ-th quantile of Y. We assume that g 1 and g 2 are strictly monotonic, twice differentiable functions that map the mean α i or the τ-th quantile µ i and θ i to the real line. Suitable choices of g 1 and g 2 are the inverse CDF of the logistic (logit link), standard normal (probit link), minimum extreme value (complementary log-log link), maximum extreme value (log-log link), and Cauchy (Cauchit link) distributions. It is important to note that x and z can be identical or they could be subsets of each other.
To obtain the parameter estimates, we need the first-order and second-order partial derivatives of the logarithm of the likelihood function. For the response i, y i namely, with i ∈ {1, . . . , n} and, as mentioned, ω being the mean or the τ-th quantile of Y, the loglikelihood function is given by , such that the score equations are stated as where Θ = (β , δ ) , β = (β 0 , . . . , β p ) is a (p + 1) × 1 parameter vector associated with an n × (p + 1) covariate matrix X and δ = (δ 0 , . . . , δ q ) is a (q + 1) × 1 parameter vector associated with an n × (q + 1) covariate matrix Z. Considering the full log-likelihood function, we have the score vectors for Θ, written, respectively, as with "diag" denoting an n × n diagonal matrix and For the Hessian matrix, we have the expressions stated as The ML estimates ( β 0 , . . . , β p ) and ( δ 0 , . . . , δ q ) can be obtained, for instance, through general-purpose optimization algorithms such as Nelder-Mead, quasi-Newton, and conjugategradient, available in the optim function of R [28]. As an alternative, to model (α, θ) or (µ, θ), conditional on covariates, we create two generalized additive model for location, scale, and shape (GAMLSS) frameworks that can be used directly in the gamlss function of the gamlss package [29][30][31] of R. An advantage of having a GAMLSS structure is that all the parameters of the distribution can be modeled as linear, nonlinear, or smooth functions of covariates. In addition, we have available all the resources for the statistical modeling process within the gamlss package of R as model selection and diagnostics.
Considering some regularity conditions ( [32] pp. 118-119), note that Θ∼ N p+1 (Θ, (I(Θ)) −1 ), with I(Θ) being the expected information matrix that can be obtained as Observe that approximate confidence intervals may be reached employing the expression given in (7), while, for obtaining the information matrix defined in (8), we can utilize the observed information matrix defined by whose elements established in (9) may be calculated from the results above presented, evaluated at Θ = Θ.
To test the hypotheses H 0 : Θ = Θ 0 versus H 1 : Θ = Θ 0 , with Θ = (β , δ ) , as mentioned, we may employ the Wald and likelihood ratio tests. The Wald [33] and likelihood ratio statistics considering the observed information matrix [34] are, respectively, stated as As n → ∞, these statistics converge to a random variable with χ 2 distribution and r degrees of freedom (DF), χ 2 r namely, with r being the number of parameters under H 0 , which is rejected, at a significance level , if the statistic computed according to (10) or (11) is greater than χ 2 r,1− , which represents the 100(1 − )th χ 2 r quantile. Diagnostics and goodness-of-fit are important because they reveal the strengths and weaknesses of a proposed model. As pointed out in [35]: "Goodness-of-fit is concerned with assessing the validity of models involving statistical distributions, an essential and sometimes forgotten aspect of the modeling exercise. One can only speculate on how many wrong decisions are made due to the use of an incorrect model". Within the gamlss package, we may utilize the fitted normalized quantile residual [36] defined as If the model is correctly specified, the residual r i stated in (12) has an approximate standard normal distribution. Alternatively, defining r i = − log(1 − F(y i ; ω i , θ i )), we have the estimated Cox-Snell residual [37]. An important property of the Cox-Snell residual is that, if the model selected fits the data, r i follows the standard exponential distribution.
Note that, from a gamlss object, we can use these residuals to construct theoretical quantile versus empirical quantile (QQ) plots with simulated envelopes [38]. The main advantage of this simulation technique is its ease of interpretation without imposing any assumption on the residual distribution [39].

Simulation Studies
In this section, we report the results of two Monte Carlo simulation analyses employed to state the bias and root mean-squared error (RMSE) of the ML estimators of the VASI quantile regression coefficients. In addition, we provide the coverage probability (CP) of the 95% confidence interval (CP 95% ) using the asymptotic normality of such ML estimators.
Here, we assume τ ∈ {0.10, 0.25, 0.50, 0.75, 0.90}; sample sizes n ∈ {20, 50, 80, 110, 140, 170, 200}; and θ ∈ {0.25, 0.50, 0.75}, inserted on two regression frameworks formulated as: (i) logit(µ i ) = δ 0 + δ 1 z i1 for δ 0 = 1.0, δ 1 = 2.0 and z i1 ∼ N(µ = 0, σ = 1); and (ii) logit(µ i ) = δ 0 + δ 1 z i1 + δ 2 z i2 for δ 0 = 1.0, δ 1 = 1.0, δ 2 = 0.5, z i1 ∼ Bernoulli(p = 0.5) and z i2 ∼ N(µ = 0, σ = 1). For each (τ, n, θ) and both regressions, with the covariates being kept constant in the simulations, M = 10,000 replicates were generated using the SAS Data-Step, while the estimates were calculated with the quasi-Newton method of PROC SAS/NLMIXED [40]. Response values, for (τ, n, θ) and the covariates, are obtained from the QF given by where u i ∼ U(0, 1) and The bias, RMSE, and CP were calculated, respectively, as Tables 1-3 and Tables 4-6 present the results for the first and second regression models, respectively. Such tables report a small bias when estimating θ and δ for all settings considered in this study. In addition, the estimated RMSE is small and quickly approaches zero as n, the sample size, increases. Larger values of the bias and RMSE are detected as the quantiles are distant from τ = 0.5, either from the left or right, that is, the values τ ∈ {0.1, 0.25, 0.75, 0.9}. Moreover, for all scenarios, the CPs tend towards the nominal confidence coefficient, that is, 95%, as n increases from n = 20 to n = 200, with n ∈ {20, 50, 80, 110, 140, 170, 200}. The Monte Carlo simulation results for the mean and the same regression structures give us evidence that the coefficients are also well estimated according to the bias, RMSE, and CP, that is, once again, the bias and RMSE decrease as the sample size increases, whereas the CPs approach the confidence coefficient considered, that is, 95%, as expected. These results are not shown here and can be obtained from the authors upon request.    Observe that the empirical results based on the Monte Carlo simulations for large samples are coherent with the asymptotic theoretical results presented at the end of Section 3 and just before the diagnostic analysis.

Applications
In the next subsections, we present two real applications in order to show the potentiality of the proposed quantile and mean regression models, taking the VASI distribution as a baseline. For comparison purposes, in addition to the VASI quantile regression model, we also consider the Johnson SB (JOSB), Kumaraswamy (KUMA), unit-logistic (ULOG), and unit-Weibull (UWEI) quantile regression models, whose PDF, CDF, and QF are defined in Appendix A. The VASI mean regression model is compared with the beta and simplex regression models. Note that this section does not seek to provide all analyses for link function and variable selection, regression diagnostics, or parameter interpretation. Indeed, we are suggesting here the use of the VASI regression model as an alternative to other models available in the literature and not as a full regression analysis.

The VASI Quantile Regression Model
This application considers a real data set first reported in [41], which was also analyzed in [7]. This data set contains 298 observations about the body fat proportion of patients in a public hospital located in Curitiba, Paraná, Brazil. The fat proportion at android, arms, gynecoid, legs, and body are related to five response variables, and the data set is formed by two continuous and two categorical covariates. Note that the continuous covariates refer to the age (in years) and body mass index (bmi) (in kg/m 2 ) of the individuals, while the categorical covariates are related to gender (female or male) and ipaq (sedentary (S), insufficiently active (I), or active (A)). Observe that the ipaq is a questionnaire that permits the estimation of weekly time spent on physical activities of moderate and strong intensity, in different aspects of daily life, such as transportation, leisure, housework, and work, as well as the time spent in passive activities [42].
The sample contains information from 150 women and 148 men. Patients have a mean age of 46 years old and a standard deviation of 19.88 years old, while the average BMI is 24.72 kg/m 2 with a standard deviation of 3.15 kg/m 2 . The ipaq questionnaire classified individuals as follows: 76 individuals as insufficiently active, 60 sedentary, and 162 active. According to [41], the data set has one outlier, which consists of the individual #158 with the following characteristics: female, 49 years old, with BMI = 29.3 kg/m 2 and fat proportion in the arms equal to 0.196; in other words, this patient has a high BMI but low fat proportion in the arms.
For the response variable "proportion of fat in the arms", we assume that µ i is modeled as where • x i1 is the (age i − 46.00) with 46.00 being the average age; • x i2 is the (BMI i − 24.72) with 24.72 being the average BMI; • x i3 is an indicator covariate, in which x i3 = 0 for female or x i3 = 1 for male; • x i4 is an indicator covariate, in which x i4 = 0 for ipaq = S or x i4 = 1 for ipaq = I; • x i5 is an indicator covariate, in which x i5 = 0 for ipaq = S or x i5 = 1 for ipaq = A. Table 7 presents the values of the ML estimates and their corresponding SEs for the indicated models. The δ 0 is the estimate for a female with age of 46 years old, BMI = 24.72 kg/m 2 , and ipaq = S. The parameter estimates for δ 1 and δ 2 indicate that the arm fat proportion is larger for older individuals with larger BMI. In contrast, the estimates for δ 3 , δ 4 , and δ 5 have a negative influence on the arm fat proportion, indicating that this proportion is smaller for insufficiently active and active men, respectively. Consequently, the fat proportion is larger for women and sedentary individuals. As expected, θ does not vary with the quantiles, since it does not depend on covariates. From the variation in the parameters for different values of τ, we can conclude that the estimated arm fat proportion depends on the quantiles. This conditional quantile variation rate, expressed by the estimated regression coefficients, is illustrated in Figure 3. We can also see that, from a statistical point of view, all covariates are significant at 5%, since they do not contemplate the value of zero in their respective confidence intervals. Table 7. ML estimates and SEs for the body fat proportion data.  Table 7. Cont. To support this inference, the model's appropriateness should be stated. With this aim, we provide the QQ plots with envelopes considering the Cox-Snell and quantile residuals in Figure 4, from which we can conclude that the model presented a good fit, since all the observations are inside the simulated envelopes. Furthermore, Table 8 reports the values of the Akaike information criterion (AIC) employed to compare diverse models. We can conclude by this criterion that the quantile regression model considering the VASI distribution was the one that presented the best fit, as it had the smallest value. Note that, when the quantile level increases, the value of the AIC decreases. The largest estimated log-likelihood values have been obtained at τ = 0.90, that is, the left skew distribution of the response variable has been captured successfully by this quantile level.   ML estimates for the model parameters may be calculated using the vasicekreg package by means of the following R codes: library(gamlss) library(vasicekreg) data(bodyfat, package = "vasicekreg") fit <-lapply(c(0.10, 0.25, 0.50, 0.75, 0.90), function(Tau) { tau <<-Tau; gamlss(arms~age + sex2 + ipaq1 + ipaq2, data = bodyfat,trace = FALSE, family = VASIQ(mu.link = "logit", sigma.link = "logit")) }) sapply(fit, coef)

ML Estimate SE
Probit, complement log-log, and cauchit link functions could also be used through the argument mu.link and sigma may be employed depending on covariates through the argument sigma.formula. The vasicekreg package is available online at https:// cloud.r-project.org/web/packages/vasicekreg/index.html (accessed on 24 March 2022) and through it we can also consider, beyond quantile regression, mean regression and all the functionality of the gamlss package. For the other fitted models, the corresponding ML estimates were calculated with the unitquantreg package, which is under development and available online at https://github.com/AndrMenezes/unitquantreg (accessed on 24 March 2022).

The VASI Mean Regression Model
From Section 2, we observe that E(Y) = α, with Y VASI distributed. Hence, we can easily assess the effect of covariates on the mean of the response variable. Thus, in addition to the quantile regression model, we also propose a regression model considering the mean of the VASI distribution. As an alternative to the VASI regression model for the mean, the beta [43] and simplex [44] regression models are also considered.
In what follows, we illustrate the potential of the VASI mean regression model considering a political data set extracted from the baquantreg package of R [45]. The response corresponds to the proportion of votes in the 2014 presidential election in Brazil, obtained by the elected president, Dilma Rousseff, in the Minas Gerais and Piaui states. First, for each state, we considered the Human Development Index (HDI) in 2010, centered at the mean, as covariate and a regression structure given by where n = 843(222) for Minas Gerais (Piaui) and HDI + i = (HDI i − 0.6479). Table 9 reports the ML estimates, estimated SEs, and the values of AIC for the three fitted models. The estimate of δ 0 indicates that the logit of the vote proportion for the state of Minas Gerais is 0.6012 and for the state of Piaui is 0.7885 when HDI is 0.6479. Furthermore, since δ 1 is obtained as a negative value, there is an opposite linear relation between the vote proportion response and HDI covariate for both Brazilian states. Hence, we can conclude that, when HDI increases, a smaller proportion of votes has been received in both Brazilian states. This fact is more accentuated in the state of Minas Gerais since this value is smaller. The AIC values indicate that the VASI mean regression model is the best when compared to the beta and simplex models. We also consider a second mean regression structure, where "state" is a covariate, and we insert covariates into the parameter θ, as Table 10 reports the ML estimates, estimated SEs, and the AIC values. The estimate of δ 0 indicates that the logit of the proportion of votes is 0.5984 for the reference state (Minas Gerais) and HDI is centered at its mean value. The estimate of δ 1 indicates the variation that occurs in the proportion of votes from the reference state when there is a change of one unit in the HDI. From δ 2 , we can conclude that the proportion of votes is smaller for larger HDI values. The parameter estimate β 0 indicates the logit of the shape parameter of the distribution for the reference state. The parameter estimate β 1 indicates the variation that occurs in the logit of the shape parameter of the distribution for the reference state. Based on the AIC values, we can conclude that the VASI regression model considering the mean is the best fit to the data, when compared to the beta and simplex models. Once a VASI model is fitted, it is important to assess its adequacy by examining the residuals. Through the plot function, applied to objects vasi, beta, or simp, we have available four plots for checking the normalized quantile residuals. The four plots are residuals against the fitted values of the parameter α, residuals against an index, a kernel density estimate of the residuals, and the QQ plot of the residuals.

Concluding Remarks
There are various alternative classes of regression models for proportions and rates. Despite having great flexibility and taking different forms, the Vasicek distribution has been neglected in the analysis of data in the unit interval, both in the presence and in the absence of covariates. In this paper, two regression models considering the Vasicek distribution for response variables restricted to the unit interval were presented. The first one was formulated considering a re-parameterization in the parameter α of the Vasicek distribution. The second one was proposed stating the original parameterization in terms of the distribution mean.
In both models proposed, the covariates were related to the quantile or mean of the response variable by the logit link function. Through a Monte Carlo simulation study, we found that, for both models, the parameters are well estimated in terms of bias and root mean squared error. It was also possible to observe that the coverage probabilities tend towards the nominal confidence coefficient as the sample size increases.
The Vasicek quantile regression model was applied to a real-world data set, in which the response variable was the fat proportion measured in the arms. When compared with another four models existing in the literature, that is, those based on the Johnson SB, Kumaraswamy, and unit-Weibull distributions, the model that best fitted the data was the Vasicek quantile regression, according to the values of the Akaike information criterion. The Vasicek regression model considering the mean of the response variable was applied to a politics data set and compared with the beta and simplex regression models. The model was built considering covariates in the parameter representing the mean and also in the shape parameter of the distribution. By the criteria considered, the Vasicek regression model offered the best fit to the data. In the end, the results revealed that the Vasicek distribution has great potential for analyzing data restricted to the unit interval in the presence of covariates. The analysis using both regression models (quantile and mean) was facilitated by means of a package made available in the R software, which was implemented by the authors of this paper.

Acknowledgments:
The authors would also like to thank the Editors and Reviewers for their constructive comments which led to improve the presentation of the manuscript.

Conflicts of Interest:
The authors declare that they have no conflicts of interest.

Appendix A. Other Distributions
In this appendix, we characterize the other distributions considered in the numerical examples.