Regression models for binary response applied to data on neonatal deaths in newborns

The present study presents binary data modeling regarding 1.6% of neonatal deaths in 3,448 newborns from an epidemiological and observational study with a cross-sectional design, involving the retrospective analysis of 4,293 medical records of high-risk pregnant women followed in a gestational outpatient clinic from September 2012 to September 2017. Different symmetric and asymmetric link functions were considered by means of Bayesian inference. The support of more accurate inferences regarding the parameters of the model will provide biological interpretations that are more reliable and consistent with the reality. The model that presented, significantly, the lowest value for the deviance information criterion (DIC = 398.8), was the binomial with power logit (PL) link function, whose median posterior value estimated and significant for the parameter asymmetry was  = 0.25 (0.14;1.17). This significance is observed in all other models of the power family, however with very different values and significantly higher DIC values, indicating less parsimonious models. The Bayesian methodology proved to be flexible. Additionally, the results show that such model shows an accuracy = 97.4% and area under the ROC curve AUC = 89.4% in the prediction of neonatal deaths based on the weight of children at birth. Specifically, for 2.500g, a value predicted in the medical literature for low weight, the model predicts a probability of 1.43%.


Introduction
Mathematical models that seek to relate variables from biological events are commonly used in different areas of knowledge for the adjustment of observed data. In particular, regression models with linear or nonlinear predictors, whose parameters provide biological explanations, are of greater interest.
The regression models are associated to probabilistic models, in general, focused on the response variable (dependent). Among the situations of data analysis in which the answers, being dichotomic (success '1' and failure '0'), besides the presence of explanatory (co)variable(s), a binomial regression model is the proper choice. However, the uncritical choice of an appropriate link function may cause differences in the model settings, as well as in decision-making related to the research objective. The method of estimation of the parameters is also an important point to be considered, since each one has its assumptions that allow or not a greater flexibility in the modeling in the data analysis.
In cases where there are differences in the frequencies of 1's and 0's in the response variable, Chen, Dey and Shao (1999) mention that an asymmetric link function is more adequate.
Bazán, Romeo and Rodrigues (2014) present a simulation study considering different sample sizes for dichotomous responses in a regression model whose binding functions were probit and probit asymmetric. The results obtained under the Bayesian approach, evidence the second as the most parsimonious model. Additionally, they present new generalized link functions of these classes considering as probability density function probit (normal/gaussian) power and its reverse probit power. They report in the discussions that they found difficulties in obtaining the parameter estimates by means of frequentist procedures, which justified the use of the Bayesian model. Achcar, Coelho-Barros and Cordeiro (2013) have already presented generalizations of these distributions in other powers, that is, as special cases of a distribution called beta-normal (Eugene & Famoye, 2002). Bazán et al. (2014) mention that it is possible to unify these models through the distribution of Kumaraswamy (Cordeiro & Castro, 2011), but do not address the details.
It is worth mentioning that neonatal deaths are those that occurred from zero to 28 days after birth, the majority considered preventable (Brasil, 2014).
In 2015, in a global context, of the 5.9 million deaths in children under five, 2.7 million (46%) occurred in the neonatal period, causing a great impact on infant mortality (Liu et al, 2016). In the present study, the neonatal mortality rate was 16.2 deaths/1,000 live births (LB), totaling 1.6% of neonatal deaths, which is lower than the overall stillbirth rates in the year considered (19 deaths/1,000 LB) (Pan American Health Organization [PAHO/OMS], 2017).
The objective of the present study was to model data on neonatal deaths in neonates from high-risk gestation, considering different attachment link functions under the Bayesian approach due to their flexibility and, therefore, to promote more accurate inferences regarding the parameters of the model, providing biological interpretations that are more solid and consistent with reality.

Material and methods
In situations of data analysis in which the response Y = (Y 1 , Y 2 ,..., Y n )', a vector × 1 of independent random variables binary (dichotomous), in addition to the presence of explanatory (co)variable(s) represented by a vector of covariates, x i = (x i1 , x i2 ,..., x ik )' a × 1, a model of binomial regression can be used, such that Y i , is the observed value of the event of interest to the individual i, where i = 1, 2,..., n. Therefore, X will denote a × ( + 1) design matrix with rows x' i , and = ( 0 , 1 , . . . , )′ a k+1 vector of regression coefficients.
Response Y will assume '1' if the event occurred and '0' otherwise, thus the distribution of Y is Bernoulli, in a way that ( = 1) = (probability of occurrence), ∼ ( ), = ( | ) = ( ) = ( ′ ), indicating a model with link function = ′ (corresponding linear predictor) which represents the probability of occurrence of the event to the individual i, such that ′ represents the effects component of the regression model ( ) = ( ′ ) + , : vector (k+1) of effects of random errors associated with each observation.
All basic functions have their power and reverse power forms and are also interesting alternatives of this family (Bazán et al., 2016).

Reverse Power
Reverse power cauchy RPC For the application of the presented models, a database from an epidemiological, observational study with a crosssectional design was used, involving the retrospective analysis of 4,293 pregnant women followed by the high-risk prenatal outpatient clinic linked to a philanthropic hospital and accredited by the Unified Health System (SUS) of the South of Brazil, from September 2012 to September 2017 in the maternity of the referred hospital. A number of 3,448 pregnant women were eligible and composed the population of this study with their respective newborns.
Data collection was performed by a group of researchers from the State University of Maringá (UEM) between November 2016 and October 2017, through analysis of pregnant women's records, maternity birth records and SISREG (Consultation Regulation System of the Ministry of Health) to complement the data of the medical record.
In the modeling, neonatal death information (Y = 1) was considered as a response to a single explanatory variable, the weight (g) of newborns from high-risk pregnancies.
The research complied with the Directives and Norms Regulating Research Involving Humans of the National Health Council (Resolution CNS 466/2012) and was approved with the exemption of the signed Free Informed Consent Term (TCLE), justified by the fact of the use of data under opinion Nº. 2.287.476.
The subsequent posterior marginal distributions for the parameters were obtained through the BRugs package (Thomas, 2005) program R (R Development Core Team, 2020). For each parameter, initial frequentist values (models with basic link functions) were generated and 1,100,000 simulations were generated in an MCMC process (Monte Carlo Markov Chain), with a sample discard of 100,000 initial values. The final sample, taken with jumps of 100, contains 10,000 values generated. The convergence of the chains was verified through the coda package of program R, by the criterion of Heidelberger and Welch (1983).
Bayesian estimates of the parameters in the considered models, such as mean and median (used when asymmetry was observed in the distributions), standard deviation and 95% high probability density (HPD) intervals were obtained.
All computational procedures were performed on a machine with a 3.4 GHz Intel Core i7-3777 processor and 8GB of RAM and the processing time (PT), in minutes, was recorded.
As criteria for the selection of models, the Deviance Information Criterion (DIC -Spiegelhalter, Best, Carlin & Linde (2002)) was used. Lower DIC values indicate more parsimonious models and, for two models considered, A and B, so that = | − |, if D <5, it concludes by the equivalence of parsimonies. Although the DIC is a questioned measure in situations where the posterior distribution estimated is asymmetric in the parameters, since it uses the posterior average in its calculations, it was used for the selection of models even when the distribution of some parameters was asymmetric (Table 2), due to the easiness to obtain the DIC directly through the package used in R.
In addition, the test of adherence to the binomial distribution of Hosmer and Lemeshow (1989) was carried out through the ResourceSelection package of the R program and, for comparison of adjusted models, an evaluation was carried out under the data observed through their respective measures of predictive evaluation (Powers, 2011). For this purpose, the sensitivity, specificity, accuracy, positive predictive value (VPP) and negative predictive value (VPN) were calculated considering the observed prevalence (Huayanay et al., 2019;Lemonte & Bazán, 2018) and, finally, the area under the curve (AUC) ROC (Receiver Operating Characteristic) by means of the Epi package (Carstensen, Plummer, Laara, & Hills, 2020) of program R.

Results and discussion
The results of the Bayesian inferences under the parameters of the models with the linkage functions considered in Table 1, when applied to the neonatal death data as a function of the weight (g) of newborns from gestation of high risk, are presented in Table 2.
After analysis of convergence in the chains and tests of adherence to the binomial distribution of Hosmer-Lemeshow (H-L), we observed in all the models analyzed negative and significant values (0 ⊄ HPD95%) for 1 , parameter that influences the rate of increase/decrease in probability of occurrence of the event of interest, death as a function of the weight (g) of newborns.
The PL and PCLL models presented significantly the lowest DIC's (398.8 and 400.7, respectively), however with processing times 48.5 and 75.9 minutes, respectively, leading to the choice of the model with a power logit link function (p-value H-L = 0.998 and therefore a good fit), whose estimated value (median posterior) for the asymmetry parameter was  = 0.25 (0.14;1.17) and therefore significant, since the zero value is not contained in its respective HPD95% range. This characteristic is observed in all other models of the power family, however with very distinct values and higher DIC's values, significantly, in addition to higher processing time (PT). Bazán et al. (2014) present an application in data from the literature on the occurrence of menarche as a function of age in 3,918 girls. They conclude based on Bayesian models and in the previously mentioned criteria, that the RPP model is the best to represent such association. In addition, the link functions commonly used in the literature are analyzed in addition to the Power logit and its reverse versions, as well as the generalized extreme value (Wang & Dey, 2010). Bazán et al. (2016) using a sample of 4,000 car insurers, conclude by means of different selection criteria, including the DIC, that a model based on the Cauchy distribution with reverse power Cauchy link function (RPC) was the most parsimonious to predict if a customer hires a complete insurance plan for his or her automobile, depending on the gender, driving area, vehicle use, marital status, age and seniority in the company. According to the authors, such modeling aims to select potential customers of this plan.
Acta Scientiarum. Technology, v. 43, e45642, 2021  Anyosa (2017), in intensive simulation studies developed to study the accuracy and efficiency in the estimated parameters, reports that the binary regression models with power and reverse power link functions are better than those traditionally used, and illustrates them of an application to the educational data on the adequate performance of the spanish language and mathematics students regarding the evaluations of the sixth year of primary education (equivalent to the sixth grade of elementary education in Brazil) of the Peru educational system in 2014. The probabilistic sample contained 13,259 students between 11 and 13 years old. It concludes that the model chosen was the binary regression model RPL and that factors such as school management, school zone, gender and spanish language performance are all important in the analysis of the performance level in mathematics.
All adjusted models show values of accuracy equal to or greater than 97% and, with the exception of the Cauchy model (C), have AUC higher than 83% (Table 3)  The adjustments (Figure 1) present, respectively, the adjusted curves of the models with basic link power, power and reverse power functions considered. It is easy to observe that from the cut-off point recommended in the medical literature, that is, weight of 2,500g, the probability of neonatal death is practically zero, when Pr(Y = 1|Weight = 2,500g) = 0.0143 (Table 4) which represents 1.43% of neonatal deaths for babies weighing 2,500g. This prevalence in the observed data is 1.60%. It should be noted that the newborn (NB) is considered to have low birth weight (LBW) when it weighs less than 2,500g and can also be classified as very low birth weight (VLBW), which includes NB's with less than 1,500g, infants with extremely low birth weight (ELBW), which are NB's less than 1,000g (Glass et al., 2015). Therefore, the findings of this study show that the lower the weight of the newborn, the greater the probability of neonatal death. Corroborating, a review study between 1986 and 2004, with several large population-based cohort studies, including approximately 14,700 ELBW infants from North America, Western Europe, the United Kingdom, Australia, and Japan, found that ELBW infants present high risk of mortality (30-50%) (Glass et al., 2015).
It is known that LBW is the most important predictor of neonatal mortality, and is considered the main predictor of neonatal mortality with evidence of greater severity the lesser the newborn weight (Lansky et al., 2014;Demitto et al., 2017). It emphasizes the importance of more effective statistical models in research and possible interventions of health professionals and public policies to minimize neonatal mortality.

Conclusion
The Bayesian methodology proved to be flexible when applying to binary data considering different link functions in the linear predictor. The best (more parsimonious) adjustment to neonatal death data as a function of the weight of children at birth was the binomial model with the power logit link function.
The use of more accurate statistical models can provide professionals in the area with the basis for better decision-making to minimize the occurrence of neonatal deaths.