Binary regression model with misclassification and Berkson-type measurement error with Student-t distribution

In this article, we introduce a regression model tailored for fitting binary data affected by misclassification in the response variable and Berkson-type measurement error in the covariate. The conventional assumption of a normal distribution for measurement error may inadequately represent atypical observations present in the dataset. To address this limitation, our model employs the Student-t distribution for the Berkson-type measurement error, allowing more robust modeling of these atypical observations. We use the cumulative distribution function of the Student-t distribution as the link function, which improves our ability to capture heavy-tailed behavior in the data. Model parameters are estimated via the maximum likelihood method. We conduct a Monte Carlo simulation study to assess the impact of measurement errors and misclassification. Additionally, we apply the proposed model to a real-world dataset of survivors of the atomic bombings in Japan, showcasing its suitability in practical scenarios. Our findings highlight the robustness and flexibility of this model in handling binary regression scenarios involving measurement errors and misclassification.


INTRODUCTION
In regression models applied to binary data, it is typical to encounter datasets where certain covariates remain unobserved, leading to biased estimates. Conventional binary regression models operate under the assumption that the observed binary responses are devoid of misclassification, and the independent variables are free from measurement error. However, practical scenarios often involve measurement errors and misclassification, contributing to potential biases and imprecisions in the estimated regression coefficients.
To address these challenges, researchers have proposed various methods to account for measurement error in the covariates and misclassification in the outcome variable of binary regression models. One approach to tackle measurement error is to assume a known distribution for the measurement errors and estimate the regression parameters using likelihood-based methods. This approach, known as the classical measurement error model, has been extensively studied and applied in various fields, including epidemiology, social sciences, applied social sciences, and environmental sciences. By explicitly modeling the measurement error, it becomes possible to obtain more reliable estimates of the true underlying relationships between variables (Carroll et al., 2006).
Another approach is to model misclassification in the response variable, where the observed binary response may not accurately represent the true underlying response due to misclassification errors. This can be addressed by estimating the probabilities of misclassification and adjusting the model accordingly (Ekholm and Palmgren, 1982). Carroll et al. (1984) analyzed data from a prospective study on the development of cardiovascular diseases presented in Kannel and Gordon (1968) and demonstrated the impact of measurement errors in binary regression. Burr (1988) investigated measurement errors in Berkson-type covariates in the field of bioassays, employing the probit link function.
In the context of binary response variables, measurement error models become even more challenging due to the presence of misclassification. Several researchers have proposed models for this setting. Roy et al. (2005) developed a measurement error model for misclassified binary responses, where the independent variable is subject to Berkson-type measurement error following the normal distribution. To address departures from normality in measurement errors, Bolfarine and Lachos (2006) considered structural measurement errors following a skew-normal distribution and adopted the probit link function. They employed both classical and Bayesian approaches for parameter estimation, utilizing Markov chain Monte Carlo techniques. Liu and Zhang (2017) conducted a Monte Carlo simulation study with the logistic regression model, employing the logit link function within the classical framework, to demonstrate the presence of non-ignorable biases in parameter estimates when misclassification is disregarded. Bazán et al. (2014) used skew-probit link functions, which deviate from the probit link function through a flexible asymmetry parameter, under a Bayesian approach.
In this article, we introduce a regression model designed to handle both measurement error and misclassification in binary data. Berkson-type measurement error occurs when an independent variable is not directly observed but is derived from a surrogate variable along with measurement error (Roy et al., 2005; Burr, 1988). The conventional assumption of a normal distribution for measurement error often fails to adequately represent unusual observations within the dataset. To address this limitation, we present a flexible modeling framework integrating the Student-t distribution (Lange et al., 1989) to handle the measurement error component.
Moreover, our model incorporates the cumulative distribution function (cdf) of the Student-t distribution as the link function. This link function connects the linear predictor to the probabilities of the binary response. By employing the cdf of the Student-t distribution as a link function, we improve our ability to capture heavy-tailed behavior in the data and thereby improve the model's overall performance.
For estimating model parameters, we use the maximum likelihood method, relying on the optimx (Nash & Varadhan, 2011) library within the R software (R Core Team, 2021). Under the usual regularity conditions, this method yields consistent and asymptotically efficient estimators. Additionally, we conduct a Monte Carlo simulation study to evaluate how measurement errors and misclassification impact parameter estimation and prediction accuracy. This study provides insights into the model's robustness under various scenarios involving complex data structures.
To showcase the practical application of our proposed model, we applied it to a real dataset of survivors of the atomic bombings in Japan. Our analysis of this dataset underscores the model's suitability for capturing the features of the data and demonstrates its robustness in handling real-world complexities.
The remainder of the article is organized as follows. In Section 2, we present the proposed regression model that incorporates misclassification and measurement error. We estimate the model parameters numerically using the maximum likelihood method. Section 3 provides a simulation study to evaluate the performance of the maximum likelihood estimators. In Section 4, we apply the proposed model to a real dataset of survivors of the atomic bomb attacks in Japan. Finally, Section 5 presents concluding remarks.

THE MODEL
The probit model, a binary regression model using the probit link function, assumes the response variable follows a Bernoulli distribution. It models the relationship between predictors and the probability of the outcome. The probit link function is the cdf of the standard normal distribution, transforming the linear predictor into a probability and ensuring a smooth and symmetric relationship between predictors and the probability of success.
Using the cdf offers interpretability advantages. Coefficients describe the change in the latent standard normal index (the z-score) for a one-unit change in the predictor, which gives the probit model a direct latent-variable interpretation. It assumes errors follow a standard normal distribution, typically reasonable for large sample sizes due to the central limit theorem, resulting in reliable estimates and accurate inference.
However, the probit model is just one option among others (e.g., logit, cloglog) for modeling binary responses. The choice of link function relies on factors like the research context, data characteristics, and specific research questions. Researchers often compare different link functions, selecting the one that best fits the data and yields the most meaningful results.
We present a regression model tailored for binary data with Berkson-type measurement errors in covariates, where the error follows a Student-t distribution and the cdf of the Student-t distribution serves as the link function. Furthermore, we account for misclassification in the response variable.
In linear models with binary responses, the estimated means represent proportions. Various link functions are employed to transform the linear predictor, mapping values from the real line to the interval [0, 1]. Consequently, a binary regression model can be defined as
$$Y_i \sim \text{Bernoulli}(p_i), \qquad p_i = F_G(\beta_0 + \beta_1 w_i),$$
where $F_G(\cdot)$ is the link function, $Y_i$ is the binary response variable with Bernoulli distribution and parameter $p_i$, $\beta = (\beta_0, \beta_1)^T$ is the model parameter vector, and $w_i$ is the predictor variable, $i = 1, \ldots, n$. We consider the link function to be the cdf of a random variable $G_i$ following the Student-t distribution (Lange et al., 1989) with location parameter 0, scale parameter 1 and $v$ degrees of freedom. Thus,
$$G_i \sim T(0, 1, v),$$
which implies that
$$p_i = P(Y_i = 1) = P(G_i \le \beta_0 + \beta_1 w_i) = F_G(\beta_0 + \beta_1 w_i).$$
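As a small illustration of the Student-t link described above, the following Python sketch maps a linear predictor to a success probability through the Student-t cdf. The article reports using R; Python with SciPy is used here only for illustration, and the function name `t_link_prob` and all numeric values are assumptions, not taken from the article.

```python
import numpy as np
from scipy.stats import norm, t as student_t

def t_link_prob(w, beta0, beta1, nu):
    """P(Y = 1 | w) under a Student-t cdf link with nu degrees of freedom."""
    eta = beta0 + beta1 * np.asarray(w, dtype=float)  # linear predictor
    return student_t.cdf(eta, df=nu)

# Hypothetical predictor values and coefficients.
w = np.array([-2.0, 0.0, 2.0])
p_t = t_link_prob(w, beta0=0.0, beta1=1.0, nu=3.0)
p_probit = norm.cdf(w)  # probit link on the same linear predictor
```

With few degrees of freedom the Student-t link approaches 0 and 1 more slowly than the probit link, which is the heavy-tail behavior the model relies on.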

• Naive model (M1)
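For the naive model, which assumes an absence of misclassification and measurement error, we consider the parameter vector to be estimated as $\theta_1 = (\beta_0, \beta_1, v)^T$. The probability of observing $y_i$, denoted as $P(Y_i = y_i)$, for $i = 1, \ldots, n$, is given by
$$P(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}, \qquad p_i = F_G(\beta_0 + \beta_1 w_i).$$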
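The naive model can be fitted by numerically minimizing the negative Bernoulli log-likelihood. The sketch below is one possible Python version, assuming the Student-t link $p_i = F_G(\beta_0 + \beta_1 w_i)$; the article itself reports using optimx in R. The function `negloglik_m1`, the log-parameterization of the degrees of freedom, the simulated data, and the choice of the Nelder-Mead optimizer are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import t as student_t

def negloglik_m1(theta, y, w):
    """Negative log-likelihood of M1; theta = (beta0, beta1, log nu)."""
    beta0, beta1, log_nu = theta
    p = student_t.cdf(beta0 + beta1 * w, df=np.exp(log_nu))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))

# Hypothetical data generated from the naive model itself.
rng = np.random.default_rng(1)
w = rng.normal(size=2000)
g = student_t.rvs(df=4, size=2000, random_state=rng)
y = (g <= -0.5 + 1.0 * w).astype(float)

start = np.array([0.0, 0.5, np.log(10.0)])
fit = minimize(negloglik_m1, start, args=(y, w), method="Nelder-Mead")
beta0_hat, beta1_hat, nu_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
```

Optimizing over $\log v$ keeps the degrees of freedom positive without needing box constraints.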

• Model considering measurement error effects (M3)
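In the regression model with Berkson-type measurement error, the predictor variable $X_i$ is not directly observed. Instead, $X_i$ is obtained as the sum of its surrogate $\omega_i$ and a measurement error $\delta_i$, $i = 1, \ldots, n$, with $\delta_i \sim T(0, \sigma^2, v)$ and $X_i \sim T(\omega_i, \sigma^2, v)$. Specifically, we have
$$X_i = \omega_i + \delta_i.$$
Assuming that the random variables $X_i$ and $G_i$ are univariate and independent random variables that constitute the random vector $(X_i, G_i)^T$ with bivariate Student-t distribution, where $X_i \sim T(\omega_i, \sigma^2, v)$ and $G_i \sim T(0, 1, v)$, $i = 1, \ldots, n$, then we can define the random variable $Q_i = G_i - \beta_1 X_i$, which also follows a Student-t distribution (Branco & Dey, 2001; Lin, 1972). For the regression model with binary response and Berkson-type measurement error (8)-(9), considering
$$p_i = P(Y_i = 1 \mid \omega_i) = P(G_i \le \beta_0 + \beta_1 X_i),$$
we have
$$p_i = P(G_i - \beta_1 X_i \le \beta_0) = F_Q(\beta_0),$$
where $G_i$ is the latent variable associated with the link function, $Q_i = G_i - \beta_1 X_i$ follows a Student-t distribution with location parameter $-\beta_1 \omega_i$, scale parameter $1 + \beta_1^2 \sigma^2$ and $v$ degrees of freedom, and $F_Q$ is the cdf of $Q_i$.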
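A model for binary data with Berkson-type measurement error, with the parameter vector $\theta_3 = (\beta_0, \beta_1, \sigma^2, v)^T$, has the log-likelihood function given by
$$\ell(\theta_3) = \sum_{i=1}^{n} \left\{ y_i \log F_Q(\beta_0) + (1 - y_i) \log\left[1 - F_Q(\beta_0)\right] \right\},$$
where $F_Q(\beta_0)$ depends on $i$ through the location parameter $-\beta_1 \omega_i$ of $Q_i$.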
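The Berkson-error likelihood can be sketched in Python as follows. The sketch assumes that the second argument of $T(\cdot, \cdot, v)$ is a squared scale, so that $Q_i = G_i - \beta_1 X_i$ has location $-\beta_1 \omega_i$ and squared scale $1 + \beta_1^2 \sigma^2$, which gives $F_Q(\beta_0) = F_T\big((\beta_0 + \beta_1 \omega_i)/\sqrt{1 + \beta_1^2 \sigma^2}\big)$. The function names and numeric values are illustrative, not taken from the article.

```python
import numpy as np
from scipy.stats import t as student_t

def prob_m3(omega, beta0, beta1, sigma2, nu):
    """P(Y = 1 | omega) under Berkson error with a Student-t link (assumed form)."""
    scale = np.sqrt(1.0 + beta1**2 * sigma2)  # scale of Q = G - beta1 * X
    eta = beta0 + beta1 * np.asarray(omega, dtype=float)
    return student_t.cdf(eta / scale, df=nu)

def loglik_m3(theta, y, omega):
    """Log-likelihood of M3; theta = (beta0, beta1, sigma2, nu)."""
    beta0, beta1, sigma2, nu = theta
    p = np.clip(prob_m3(omega, beta0, beta1, sigma2, nu), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))

# Hypothetical surrogate values and responses.
omega = np.array([-1.0, 0.5, 2.0])
y = np.array([0.0, 1.0, 1.0])
p_no_error = prob_m3(omega, 0.0, 1.0, 0.0, 4.0)    # sigma2 = 0: no measurement error
p_with_error = prob_m3(omega, 0.0, 1.0, 1.0, 4.0)  # sigma2 = 1
ll = loglik_m3((0.0, 1.0, 1.0, 4.0), y, omega)
```

Setting $\sigma^2 = 0$ recovers the naive model, while a larger $\sigma^2$ pulls the probabilities toward 0.5, which is the attenuation that measurement error induces.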

• Model incorporating both measurement error and misclassification (M4)
We describe a model that integrates both measurement error and misclassification, building upon the foundations laid by the M2 (6) and M3 (11) models. We consider the probabilities $\epsilon_0$ and $\epsilon_1$ of misclassification.
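Given the parameter vector $\theta_4 = (\beta_0, \beta_1, \epsilon_0, \epsilon_1, \sigma^2, v)^T$ and denoting the unobserved (true) binary response as $Y_i^{*}$, the observed binary response as $Y_i$, and the observed variable as $\omega_i$, $i = 1, \ldots, n$, the log-likelihood function is expressed as
$$\ell(\theta_4) = \sum_{i=1}^{n} \left\{ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right\}, \qquad \pi_i = P(Y_i = 1 \mid \omega_i) = \epsilon_0 + (1 - \epsilon_0 - \epsilon_1) F_Q(\beta_0),$$
where $F_Q(\beta_0)$ depends on $i$ through the location parameter $-\beta_1 \omega_i$ of $Q_i$.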
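A minimal Python sketch of the combined model follows. It assumes the convention $\epsilon_0 = P(Y_i = 1 \mid Y_i^{*} = 0)$ and $\epsilon_1 = P(Y_i = 0 \mid Y_i^{*} = 1)$, and the same squared-scale form $1 + \beta_1^2 \sigma^2$ used for the measurement-error component. The function names and numeric values are hypothetical.

```python
import numpy as np
from scipy.stats import t as student_t

def prob_m4(omega, beta0, beta1, eps0, eps1, sigma2, nu):
    """P(observed Y = 1 | omega) with Berkson error and misclassification."""
    scale = np.sqrt(1.0 + beta1**2 * sigma2)
    p_true = student_t.cdf((beta0 + beta1 * np.asarray(omega, float)) / scale, df=nu)
    # Observed success: false positive on a true 0, or correct report of a true 1.
    return eps0 * (1.0 - p_true) + (1.0 - eps1) * p_true

def loglik_m4(theta, y, omega):
    """Log-likelihood of M4; theta = (beta0, beta1, eps0, eps1, sigma2, nu)."""
    pi = np.clip(prob_m4(omega, *theta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(pi) + (1 - y) * np.log1p(-pi))

# Hypothetical surrogate values and observed responses.
omega = np.linspace(-3.0, 3.0, 7)
y = np.array([0, 0, 0, 1, 0, 1, 1], dtype=float)
pi = prob_m4(omega, -0.5, 1.0, 0.05, 0.10, 0.25, 4.0)
ll = loglik_m4((-0.5, 1.0, 0.05, 0.10, 0.25, 4.0), y, omega)
```

The observed-response probability is confined to the interval $[\epsilon_0, 1 - \epsilon_1]$, so misclassification flattens the fitted curve at both ends.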

SIMULATION STUDY
We performed a series of Monte Carlo simulations to examine the effects of misclassification and measurement errors on the coefficient estimates of regression models. In each scenario, we generate 500 Monte Carlo samples with measurement error following a Student-t distribution and/or with misclassification. Below, we provide a description of the simulation study, mirroring the approach taken by Roy et al. (2005).
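One way to generate a single sample of this kind is sketched below in Python. It is an illustration under stated assumptions rather than the article's simulation code: the measurement error and the latent link variable are drawn from a bivariate Student-t through a shared chi-square mixing variable, the responses are flipped with probabilities $\epsilon_0$ (true 0 reported as 1) and $\epsilon_1$ (true 1 reported as 0), and all parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import t as student_t

def simulate_m4(omega, beta0, beta1, sigma2, nu, eps0, eps1, rng):
    """Draw (x, y_true, y_obs) with Berkson t error and misclassification."""
    n = omega.shape[0]
    u = rng.chisquare(nu, size=n) / nu              # shared mixing variable
    z = rng.standard_normal((n, 2))
    delta = np.sqrt(sigma2) * z[:, 0] / np.sqrt(u)  # Berkson measurement error
    g = z[:, 1] / np.sqrt(u)                        # latent link variable
    x = omega + delta                               # unobserved true covariate
    y_true = (g <= beta0 + beta1 * x).astype(int)
    flip = rng.random(n)
    y_obs = np.where(y_true == 1, flip >= eps1, flip < eps0).astype(int)
    return x, y_true, y_obs

rng = np.random.default_rng(2024)
omega = np.full(200_000, 0.5)  # a single surrogate value, for checking
x, y_true, y_obs = simulate_m4(omega, -1.0, 1.0, 0.25, 4.0, 0.05, 0.10, rng)

# Theoretical probabilities implied by the assumed model.
p_true = student_t.cdf((-1.0 + 1.0 * 0.5) / np.sqrt(1.0 + 1.0**2 * 0.25), df=4.0)
p_obs = 0.05 + (1.0 - 0.05 - 0.10) * p_true
```

The empirical frequencies `y_true.mean()` and `y_obs.mean()` can then be compared with `p_true` and `p_obs` as a sanity check on the generator.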
Tables 1-3 display the outcomes of the simulations, featuring mean values and standard errors (SE) for the fitted model parameters. The results are derived from 500 Monte Carlo samples, each comprising 10,000 observations, considering the presence of measurement and/or classification errors. Additionally, for the sake of comparison, fits were obtained using the probit link function (Roy et al., 2005) for models M1, M2, M3, and M4. This involved substituting the cdf of the Student-t distribution (T) with that of the normal distribution (N). Throughout all scenarios, $\sigma^2$ is assumed to be known.
In Table 1, showcasing simulation results where the data incorporated only misclassification, we note smaller biases in fits employing the Student-t distribution than in fits with the normal distribution. The same pattern holds when comparing models M2 and M1. The superiority of the M2 model with Student-t becomes more pronounced with increasing probabilities of misclassification, as expected. Notably, the Student-t degrees of freedom estimated with the M1 model are considerably smaller than those estimated with M2, emphasizing the need for a distribution with heavy tails, particularly since M1 does not account for misclassification. Additionally, it is worth mentioning that the SEs of M2 are larger due to the incorporation of additional parameters into the model.

Table 1. The mean and SE of model parameters for M1 and M2 are derived from 500 Monte Carlo samples, each comprising 10,000 observations. The data were generated with misclassification and without measurement error in three distinct scenarios.
In Table 2, we present simulation results based on data generated to incorporate measurement error. The models using the Student-t distribution stand out, demonstrating superior performance with the smallest biases. Notably, as we increase $\sigma^2$, all models exhibit a noticeable rise in bias in estimating $\beta_1$. This trend is similarly observed for $v$ when employing the Student-t distribution.

Table 2. The mean and SE of model parameters for M1 and M3 are calculated from 500 Monte Carlo samples, each comprising 10,000 observations. The datasets were generated to include measurement error and exclude misclassification in three distinct scenarios.

Table 3. The mean and SE of parameters for models M1, M2, M3, and M4 are computed from 500 Monte Carlo samples, each with a size of 10,000, generated to incorporate both measurement error and misclassification.
Table 3 highlights that models using the Student-t distribution generally yield superior results, exhibiting smaller biases across most scenarios. Notably, estimates of the parameter $\beta_1$ in models M1 and M3, under the normal distribution, displayed the largest biases. Additionally, under the Student-t distribution, the estimated values of $v$ are consistently below 2, emphasizing the requirement for a distribution with heavy tails and rendering fits with the normal distribution inappropriate. The M4 model under the Student-t distribution, on the whole, delivered accurate estimates for all parameters.

APPLICATION
The dataset under examination in this analysis pertains to survivors of the atomic bombings conducted by the United States on the cities of Hiroshima and Nagasaki in Japan. Those who survived or resided in nearby areas experienced the effects of radiation exposure, leading to health issues, including cancer. The data used in this study, as sourced from Sposto et al. (1992), originate from a research initiative commenced 5 years after the atomic bombings. The primary objective of this study was to assess the impact of radiation exposure on cancer-related deaths. The cohort consisted of 86,520 survivors of the attacks, categorized into exposed and non-exposed groups based on their proximity to the bomb blast (< 2 km, 2 to 10 km). These survivors were monitored from 1950 to 1985.
Table 4 presents information on radiation exposure dose, mean radiation exposure dose, number of cancer deaths, number of deaths from other causes, and the proportion of cancer deaths among the 31,037 individuals studied. Measurement errors in radiation doses depend on location and biological reasons, as individuals can absorb different amounts of radiation despite having the same exposure conditions. Radiation exposure dose is measured using dosimetry, which quantifies the radiation doses to which an individual (or living being) may be exposed. Data were collected on various types of cancer, including lung, mouth, intestine, breast, and prostate, among others. However, the radiation doses absorbed by the intestine at the time of exposure were selected as the reference dose. We applied the four studied models to fit the dataset, considering the cdf of both the Student-t (T) and normal (N) distributions as link functions in each model. For models involving measurement error, the surrogate variable $\omega_i$ represents the average dose observed for each category, while the variable $X_i$ represents the true dose. Thus, we make the assumption that $X_i \sim T(\omega_i, \sigma_i^2, v)$, where $\sigma_i^2 = c\,\omega_i^2$ and $c = 0.5$ (Roy et al., 2005), $i = 1, \ldots, 31{,}037$. The focal point of this application is to assess the adequacy of the M4 model with Student-t in describing the data. Table 5 provides the estimated values of the parameters for models M1, M2, M3, and M4, along with the corresponding SEs and the p-values obtained through Wald statistics.
Table 5. Model comparison: parameter estimates, SEs, AIC and BIC criteria for M1, M2, M3, and M4 fitted to data from atomic bomb survivors in Hiroshima and Nagasaki, Japan.
In the results presented in Table 5, we observed that the estimated values of parameters for models M1 and M3 are notably similar. The Wald statistics indicate that the estimated degrees of freedom under the Student-t distribution are not significant; the estimates are, in fact, close to zero, suggesting that a distribution with heavy tails gives a more appropriate fit. In the case of the M2 model, the estimated value of $\epsilon_1$ is not significant when considering the normal distribution at a 0.05 significance level, but it becomes significant with the Student-t distribution. Similarly, for the M4 model with the Student-t distribution, $\epsilon_1$ is significant, but the degrees of freedom, as per the Wald test, are not, indicating the need for a fit with a heavy-tailed distribution.
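For reference, a two-sided Wald p-value of the kind reported alongside the estimates can be computed from an estimate and its standard error as sketched below in Python. The helper name `wald_p_value` and the numbers are hypothetical and are not values from Table 5; the sketch assumes the usual asymptotic normal approximation.

```python
from scipy.stats import norm

def wald_p_value(estimate, se, null_value=0.0):
    """Two-sided Wald p-value under an asymptotic normal approximation."""
    z = (estimate - null_value) / se
    return 2.0 * norm.sf(abs(z))

# Hypothetical estimate and standard error, for illustration only.
p_small_effect = wald_p_value(estimate=0.8, se=1.0)
p_large_effect = wald_p_value(estimate=3.0, se=1.0)
```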
According to the AIC criterion, the most suitable models are M3 and M4 under the normal distribution. Notably, the M3 model with the normal distribution also presents the lowest BIC. Given the observed existence of classification and measurement errors alongside a distribution with heavy tails, we propose the use of the M4 model for this application.
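The two information criteria follow the standard definitions $\mathrm{AIC} = 2k - 2\ell$ and $\mathrm{BIC} = k \log n - 2\ell$, where $\ell$ is the maximized log-likelihood, $k$ the number of estimated parameters, and $n$ the sample size. A short Python sketch, with hypothetical function names and input values rather than figures from Table 5, is:

```python
import math

def aic(loglik, k):
    """Akaike information criterion for a fit with k estimated parameters."""
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion for a fit on n observations."""
    return k * math.log(n) - 2.0 * loglik

# Hypothetical maximized log-likelihood, parameter count, and sample size.
aic_value = aic(loglik=-100.0, k=3)
bic_value = bic(loglik=-100.0, k=3, n=100)
```

Because the BIC penalty grows with $\log n$, it favors models with fewer parameters more strongly than AIC at large sample sizes.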
In Figure 1, we present graphs featuring randomized quantile residuals, as proposed by Dunn and Smyth (1996). These residuals tend to converge to the standard normal distribution when the model parameters are estimated consistently (Pereira & Russo, 2019). Additionally, we include their simulated confidence bands, constructed at a 0.95 confidence level using the hnp (de Andrade et al., 2017) library, for both the fit M1 (N) and the more appropriate fit M4 (T). Notably, the simulated envelope graph under the M1 model and the normal distribution appears unsuitable for the dataset. In such cases, the preferred option is fitting the M4 model with the Student-t distribution, characterized by heavy tails and accounting for both measurement error and misclassification.
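For a Bernoulli response, the randomized quantile residual of Dunn and Smyth (1996) draws a uniform value between the fitted cdf just below and at the observed response, then applies the standard normal quantile function. The Python sketch below illustrates the idea; it does not reproduce the hnp envelopes, and the function name and simulated data are assumptions.

```python
import numpy as np
from scipy.stats import norm

def randomized_quantile_residuals(y, p, rng):
    """Randomized quantile residuals for Bernoulli responses with P(Y=1) = p."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    lower = np.where(y == 1, 1.0 - p, 0.0)  # cdf just below the observed value
    upper = np.where(y == 1, 1.0, 1.0 - p)  # cdf at the observed value
    u = rng.uniform(lower, upper)
    return norm.ppf(u)

# Hypothetical fitted probabilities and responses drawn from them.
rng = np.random.default_rng(7)
p = rng.uniform(0.1, 0.9, size=20_000)
y = (rng.random(20_000) < p).astype(int)
r = randomized_quantile_residuals(y, p, rng)
```

When the probabilities `p` are correct, `r` behaves like a standard normal sample, which is what a normal plot with a simulated envelope checks visually.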

CONCLUDING REMARKS
In this study, we adopt the assumption that the measurement error in the covariate follows a Student-t distribution and that the binary response is subject to misclassification. Using the cdf of the Student-t distribution as a link function, the M4 model presented in this article proves valuable for modeling the mean of a binary response with both classification and measurement errors in the covariate. This model is particularly suited for fitting data where the measurement error does not follow a normal distribution.
Parameter estimation was performed using the maximum likelihood method with the R software (R Core Team, 2021) and the optimx (Nash & Varadhan, 2011) library.
Simulations demonstrate the superiority of models considering some form of error and employing the Student-t distribution compared to the M1 model and models with the normal distribution, especially when $\epsilon_0$, $\epsilon_1$, and $\sigma^2$ are involved. Generally, models M2 and M4 exhibit higher SEs due to their greater number of parameters.
In the final application, we employed data concerning the health effects on survivors of the atomic bombings in Hiroshima and Nagasaki in 1945, revealing a Berkson-type measurement error with a distribution featuring heavy tails. Among the models studied, it was observed that M1 and M3 provide similar estimates of coefficients $\beta_0$ and $\beta_1$. However, the M4 model with the Student-t distribution yields notably different parameter estimates from the other models. Importantly, considering the measurement error and misclassification observed in the data, the M4 model with the Student-t distribution emerges as the most suitable choice, supported by Wald statistics and simulated envelope graphs.
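For the naive model (M1), the likelihood function is
$$L(\theta_1) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$
and is based on the responses $y = (y_1, \ldots, y_n)^T$ and the predictors $w = (w_1, \ldots, w_n)^T$, with $p_i = F_G(\beta_0 + \beta_1 w_i)$. The log-likelihood function is then obtained as
$$\ell(\theta_1) = \sum_{i=1}^{n} \left\{ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right\},$$
which represents the log-likelihood function for the M1 model. Considering (2) and (3), we have
$$\ell(\theta_1) = \sum_{i=1}^{n} \left\{ y_i \log F_G(\beta_0 + \beta_1 w_i) + (1 - y_i) \log\left[1 - F_G(\beta_0 + \beta_1 w_i)\right] \right\}.$$
• Model incorporating misclassification (M2)
Let $Y_i^{*}$ represent the unobserved or true binary response, and $Y_i$, $i = 1, \ldots, n$, denote the observed binary response. We assume that the probabilities $\epsilon_0$ and $\epsilon_1$ of misclassification (Roy et al., 2005) are
$$\epsilon_0 = P(Y_i = 1 \mid Y_i^{*} = 0), \qquad \epsilon_1 = P(Y_i = 0 \mid Y_i^{*} = 1).$$
Considering $Y_i$ with a Bernoulli distribution parameterized by $\pi_i$, we have
$$\pi_i = P(Y_i = 1) = \epsilon_0 + (1 - \epsilon_0 - \epsilon_1) F_G(\beta_0 + \beta_1 w_i).$$
A regression model for binary data with misclassification, where the parameter vector is denoted as $\theta_2 = (\beta_0, \beta_1, \epsilon_0, \epsilon_1, v)^T$, given the predictors $w$, is represented by the log-likelihood function
$$\ell(\theta_2) = \sum_{i=1}^{n} \left\{ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right\},$$
where $y = (y_1, \ldots, y_n)^T$ are the observed responses.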
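The misclassification model can be sketched in Python as below, assuming the convention that $\epsilon_0$ is the probability of reporting 1 when the true response is 0 and $\epsilon_1$ the probability of reporting 0 when the true response is 1. The function names and values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import t as student_t

def prob_m2(w, beta0, beta1, eps0, eps1, nu):
    """P(observed Y = 1 | w) under misclassification with a Student-t link."""
    p_true = student_t.cdf(beta0 + beta1 * np.asarray(w, dtype=float), df=nu)
    return eps0 + (1.0 - eps0 - eps1) * p_true

def loglik_m2(theta, y, w):
    """Log-likelihood of M2; theta = (beta0, beta1, eps0, eps1, nu)."""
    pi = np.clip(prob_m2(w, *theta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(pi) + (1 - y) * np.log1p(-pi))

# Hypothetical predictor values and observed responses.
w = np.array([-50.0, 0.0, 50.0])
y = np.array([0.0, 1.0, 1.0])
pi = prob_m2(w, 0.0, 1.0, 0.05, 0.10, 4.0)
ll = loglik_m2((0.0, 1.0, 0.05, 0.10, 4.0), y, w)
```

For extreme predictor values the observed probability levels off near $\epsilon_0$ and $1 - \epsilon_1$ instead of 0 and 1, which is how the likelihood carries information about the misclassification rates.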

Table 4. Number of cancer and non-cancer deaths among the atomic bomb survivors in Hiroshima and Nagasaki corresponding to 10 dose categories.