Estimating Hazard Function and Survival Analysis of Tuberculosis Patients in Erbil city

Abstract: The study aimed to estimate the effects of prognostic factors on tuberculosis (TB) survival. Two models (the logistic regression model and the Cox regression model) were studied in the survival analysis. The Kaplan-Meier estimator was applied to estimate the hazard function, and Kaplan-Meier curves were used to show the risk of dying for the factors in the tuberculosis data set. The data were obtained from the Kurdistan Regional Government, Iraq/Ministry of Health/General Directorate of Health, Hawler/Chest and Respiratory Disease Center, and cover all tuberculosis patients registered from 11th January 2015 through 23rd November 2019 and followed up by the hospital until 14th April 2020. The Kaplan-Meier results indicate that, for the X-ray result factor, the TB category has the highest estimated mean time until death; the Kaplan-Meier curves clearly indicate that the risk of dying increases with time, especially after 15 months. The logistic regression model identifies Gender, Chest symptoms, Type of patient, Site of TB and Transpupillary thermotherapy (TTT-outcome) as the prognostic factors that influence tuberculosis survival. The Cox regression model identifies Age group, Gender, Site of TB and TTT-outcome as the factors that have an impact on tuberculosis survival. The logistic regression model was selected as the best model for the study data by comparing the two models with two criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The results were obtained using the statistical packages MATLAB and SPSS V.25.

Tuberculosis (TB) is a contagious disease caused by infection with Mycobacterium tuberculosis (MTB) bacteria. The bacteria that cause tuberculosis spread from one person to another through tiny droplets released into the air when an infected person coughs, sneezes, speaks or sings; people nearby breathe in these bacteria and become infected. Scientific analysis, which depends on efficient statistical methods with quantitative measurements and parameters, must be used to study these characteristics. Using logistic regression and Cox regression models, this study aimed to identify the important variables that influence tuberculosis disease. This model is called the logistic regression model.

Maximum Likelihood Estimation Method for Parameters:
The regression coefficients are usually estimated by maximum likelihood. Unlike linear regression with normally distributed residuals, there is no closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process such as Newton's method must be used instead. The maximum likelihood estimates of the coefficients are therefore determined numerically by maximizing the log-likelihood: taking the first derivatives of the log-likelihood and setting them equal to zero gives equations that are nonlinear in the parameters, so the solution is found with the Newton-Raphson iterative method, which after a few iterations produces appropriate estimates of the parameters. (Mawlood, 2019: 708)
2.1.2. Evaluating the Performance of the models:
In linear regression analysis, ordinary least squares is used to fit the model, and t tests, F tests and residuals are used to assess the coefficients and the model. The situation is different in logistic regression, where the approximate chi-square test, z tests and the likelihood ratio test are used.
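The Newton-Raphson iteration described for the maximum likelihood estimates can be sketched as follows. This is a minimal illustration for a model with an intercept and a single covariate, written in plain Python; the function name `fit_logistic_newton` and the small data set are hypothetical and are not taken from the study, whose estimates were produced with SPSS and MATLAB.

```python
import math

def fit_logistic_newton(x, y, max_iter=25, tol=1e-10):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fitted probabilities at the current coefficient values
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: first derivatives of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix
        w = [pi * (1.0 - pi) for pi in p]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        # Newton step: solve (information matrix) * delta = score
        d0 = (c * g0 - b * g1) / det
        d1 = (a * g1 - b * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical, non-separable toy data (not from the tuberculosis data set)
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(x, y)
print(b0, b1)
```

At convergence the score vector is numerically zero, which is the condition obtained by setting the first derivatives of the log-likelihood to zero.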

To test whether the parameters are equal to zero or not, a chi-square test is used, based on the difference between the estimated log-likelihoods of the two models. The test statistic is given by:
$\chi^2 = -2\,(\ln L_0 - \ln L_1)$ …(4)
Where $\ln L_0$ represents the logarithm of the likelihood function of the reduced model, which contains only the intercept parameter, and $\ln L_1$ is the logarithm of the likelihood function of the model with the explanatory variables. (Archer & Lemeshow, 2006: 98)
The test decides between two statements: the explanatory variables are included in the model, or the explanatory variables are not included in the model.
The Hosmer-Lemeshow goodness-of-fit measure, on the other hand, is useful for unreplicated data sets or data sets with just a few repeated observations. In this test the observations are grouped according to their estimated probabilities. The resulting test statistic is approximately chi-square distributed with g - 2 degrees of freedom, where g is the number of groups (generally chosen to be between 5 and 10, depending on the sample size). (Hosmer & Lemeshow, 2000)
Each Wald statistic is compared to a chi-square distribution with 1 degree of freedom. Wald statistics are simple to compute, but their reliability is questionable, particularly for small samples. For data that yield large estimates of the coefficient, the standard error is often inflated, resulting in a lower Wald statistic, and therefore the explanatory variable may be wrongly assumed to be unimportant in the model. Likelihood ratio tests are generally considered preferable.
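The two tests discussed here can be sketched numerically. The snippet assumes the squared form of the Wald statistic, (B / S.E.)^2, referred to a chi-square distribution with 1 degree of freedom; the coefficient 0.5 and standard error 0.25 are hypothetical values chosen for illustration. The likelihood ratio example uses the two -2 Log Likelihood values reported later in the text for the Cox model (1940.362 and 1438.661).

```python
import math

def wald_statistic(b, se):
    # Squared ratio of a coefficient to its standard error
    return (b / se) ** 2

def chi2_pvalue_df1(stat):
    # Upper-tail probability of a chi-square variable with 1 degree of
    # freedom, using P(Z^2 > s) = erfc(sqrt(s / 2)) for standard normal Z
    return math.erfc(math.sqrt(stat / 2.0))

def likelihood_ratio_statistic(neg2ll_reduced, neg2ll_full):
    # Drop in -2 log-likelihood when the explanatory variables are added
    return neg2ll_reduced - neg2ll_full

w = wald_statistic(0.5, 0.25)      # hypothetical B and S.E.
p = chi2_pvalue_df1(w)
lr = likelihood_ratio_statistic(1940.362, 1438.661)
print(w, p, lr)
```

A p-value for the likelihood ratio statistic would need the chi-square distribution with as many degrees of freedom as added parameters, which is left out of this sketch.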

2-2. Survival analysis:
The study of time-to-event data is known as survival analysis. Such data describe the length of time between a time origin and an endpoint of interest. Individuals could be tracked from birth until the onset of a disease, or the recovery period after a disease diagnosis could be studied. Data collected prospectively in time, such as data from a prospective cohort study or a clinical trial, are typically analyzed using survival analysis techniques. (Xin, 2011: 68)
Survival analysis can be used to analyze health-care use in the field of public health. Since the health-care system represents a society's political and economic structure and is concerned with fundamental philosophical issues such as life, death, and quality of life, such an analysis is particularly important for both planners and scholars. (Liu, 2012: 12)
Survival analysis usually deals with data on the times of events in an individual's life history. It models the time it takes for events to occur; the typical event is death, from which the name 'survival' analysis is derived. Let T be a random variable representing the survival time, with probability density function f(t) and cumulative distribution function $F(t) = \Pr(T \le t)$. The survival function S(t) is defined as: (Fox, 2014: 66)
$S(t) = \Pr(T > t) = 1 - F(t)$ …(7)
The hazard function, denoted h(t), gives the failure rate for the survival time. It is defined as the probability of failure during a small period of time, per unit time, given that the individual has remained alive until time t (the conditional failure rate): (Wienke, 2011: 88)
$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$
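The density, survival and hazard functions determine one another. As a short worked illustration, the relations are written out below together with the exponential distribution with rate $\lambda$, which is an assumed example and not a model fitted in this study:

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = -\frac{d}{dt}\ln S(t), \\
S(t) &= \exp\left(-\int_0^t h(u)\,du\right), \\
\text{exponential case: } f(t) &= \lambda e^{-\lambda t},\quad S(t) = e^{-\lambda t}
\;\Rightarrow\; h(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda .
\end{aligned}
```

In the exponential case the hazard is constant in time, which is why it serves as the simplest reference model.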

2-2-1. Censoring:
Censoring is a type of missing data problem in which the time to event is not recorded, for reasons such as the study being terminated before all recruited participants have experienced the event of interest, or a subject leaving the study before experiencing the event. Censoring is common in survival research. When information about a subject's survival time is incomplete, the observation is censored. Interval censoring applies when certain transition times are not observed but are known to fall within a given time interval. The onset of dementia, for example, is latent, but when longitudinal data are available the onset can be placed within the time period between two successive observations. (Hout, 2017: 67)
2-2-2. Kaplan-Meier estimator to estimate hazard function:
The Kaplan-Meier (KM) method is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958). The survival probability at time $t_i$, $S(t_i)$, is calculated as follows:
$S(t_i) = S(t_{i-1})\left(1 - \frac{d_i}{n_i}\right), \quad S(t_0) = 1$
Where,
$t_i$ is the duration of study at point i,
$d_i$ is the number of deaths at point i,
$n_i$ is the number of individuals at risk just prior to $t_i$.
Each factor $1 - d_i/n_i$ is the probability of an individual surviving to the end of a time interval given that the individual was alive at the beginning of the interval. The hazard rate at time t, conditional on surviving up to or beyond time t, is described by the instantaneous hazard function h(t) [also known as the hazard rate, conditional failure rate, or force of mortality].
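The product-limit calculation can be sketched in a few lines of plain Python. The function name `kaplan_meier` and the five observations are hypothetical; an event indicator of 1 marks an observed death and 0 marks a censored time. Censored subjects leave the risk set without contributing a factor to the product.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] at each distinct time with at least one death."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every subject (death or censored) tied at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional survival probability 1 - d_i / n_i
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Deaths and censored subjects both leave the risk set
        n_at_risk -= removed
    return curve

# Hypothetical survival times in months with event indicators
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

For these toy data the estimate drops at months 1, 2 and 3 to 0.8, 0.6 and 0.3; the censored observation at month 4 does not change it.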
Since h(t) is a rate rather than a probability, its units are 1/t. The cumulative hazard function $\hat{H}(t)$ is the integral of the hazard rates from time 0 to t, which reflects the accumulation of hazard over time; mathematically, it quantifies the number of times the failure event would be expected to occur in a given time span if the event were repeatable. As a result, it is more reliable to think of hazards in terms of rates rather than probabilities. The cumulative hazard is calculated using Peterson's (1977) method as follows:
$\hat{H}(t) = -\ln \hat{S}(t)$
(Korosteleva, 2008: 125)

Regarding the site of TB, the estimated mean time until death is 17.248 for PTB and 16.283 for EP. For the X-ray result factor, the estimated mean time until death is 16.297 for NA, 12.625 for normal and 19.130 for TB, with 95% confidence intervals of (15.493-17.100) for NA, (8.029-17.221) for normal and (17.360-20.901) for TB.

3-3-1. Kaplan Meier Curve:
The plot of the KM curves for each group of interest is an important part of survival analysis. In our study the most important curve is the hazard function curve.
Our results are usually interpreted with the plot of the cumulative hazard functions for the different groups (i.e., the two outcome conditions), "Alive" and "Death".
From Table (4) we can see that the result of the Hosmer and Lemeshow test is statistically significant because the p-value is less than 0.05; moreover, this means that our data fits the model.
3-4-2. Variables in the equation for the logistic regression:
The most important of all outputs of the logistic regression analysis is the Variables in the Equation table.
Table (5): Variables in the equation
Table (5), variables in the equation, provides the parameter estimates (the coefficients of the model) (B), their standard errors (S.E.), the Wald statistic (used to test statistical significance) with its degrees of freedom and p-value, and the odds ratio (Exp (B)) for each variable category. P-values less than α = 0.05 are statistically significant; otherwise they are not.
B: These are the values for the logistic regression equation for predicting the dependent variable from the independent variables. They are in log-odds units. Similar to OLS regression, the prediction equation is
$\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
where p is the probability of the outcome being modeled (statue1 in this study). Expressed in terms of the variables used in this study, the logistic regression equation is obtained by substituting the B values of Table (5).

The interpretation of B values:
- Age group: For every one-unit increase in the Age group score, we expect a 0.020 increase in the log-odds of statue1, holding all other independent variables constant.
- Gender: Gender is one of the factors affecting the risk of death from tuberculosis. A one-unit change in Gender (male or female) is associated with a 1.159 increase in the log-odds of statue1, holding all other independent variables constant.
- Chest Symptoms: For every one-unit increase in the Chest Symptoms score, we expect a 2.218 increase in the log-odds of statue1, holding all other independent variables constant.
- Type of patient: For every one-unit increase in the Type of patient score, we expect a 0.682 increase in the log-odds of statue1, holding all other independent variables constant.
- Site of TB: For every one-unit increase in the Site of TB score, we expect a 3.616 increase in the log-odds of statue1, holding all other independent variables constant.
- X-ray result: For every one-unit increase in the X-ray result score, we expect a 0.300 decrease in the log-odds of statue1, holding all other independent variables constant.
- TTT-outcome: For every one-unit increase in the TTT-outcome score, we expect a 2.717 decrease in the log-odds of statue1, holding all other independent variables constant.
The model-building process involves seven covariates (Age group, Gender, Chest Symptoms, Type of patient, Site of TB, X-ray result, TTT-outcome). In our study there are 159 event cases, which is the number of deaths, and 629 censored cases, which are patients who are still alive; if the event has not occurred, the case is said to be censored.
Table (6): Omnibus tests of model coefficients
In Table (6), the omnibus tests of model coefficients, the chi-square result is statistically significant, which means that the explanatory variables are included in the model. Furthermore, the -2 Log Likelihood decreases by 501.701 from its value before adding the explanatory variables, and the -2 Log Likelihood after adding the explanatory variables is 1438.661.
Table (7) shows the coefficients (B), standard errors (SE), Wald test values and degrees of freedom. The quantities Exp (B) are called hazard ratios (HR). A value of B greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the i-th covariate increases, the event hazard increases and thus the length of survival decreases.
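The relation between a Cox coefficient B and its hazard ratio can be checked with a short calculation. The sketch below uses three of the B values quoted in the interpretation that follows; the helper names `hazard_ratio` and `percent_change` are introduced here for illustration only.

```python
import math

# Cox regression coefficients B as quoted in the text for Table (7)
coefficients = {"Gender": -0.337, "Site of TB": -0.452, "TTT-outcome": 1.828}

def hazard_ratio(b):
    # Exp(B): multiplicative change in the hazard per one-unit increase
    return math.exp(b)

def percent_change(b):
    # Percentage change in the hazard implied by the hazard ratio
    return (math.exp(b) - 1.0) * 100.0

for name, b in coefficients.items():
    print(name, round(hazard_ratio(b), 3), round(percent_change(b), 1))
```

A hazard ratio below one corresponds to a percentage decrease of (1 - HR) × 100, so a ratio of 0.714 is a decrease of about 28.6%, not 71.4%.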

The interpretation of Exp (B):
- Age group: Age is one of the factors affecting the risk in tuberculosis disease. The hazard ratio is Exp (0.103) = 1.109, which is an increase in the risk of death with Age group. The significance value 0.013 is less than α = 0.05, so Age group has a significant effect on tuberculosis survival and increases the hazard.
- Gender: Gender is one of the factors affecting the risk of death in tuberculosis disease. The hazard ratio is Exp (-0.337) = 0.714, which is a decrease in the risk of death between the two Gender categories (male or female). The p-value is 0.043, so the effect of Gender on the hazard is statistically significant.
- Chest Symptoms: The value of Exp (B) for Chest Symptoms is 0.661, which is a decrease in the risk of death between patients who have and do not have chest symptoms. The p-value is 0.150, so there is no significant effect on tuberculosis survival.
- Type of patient: The estimated hazard ratio is Exp (-0.013) = 0.987, which is a 1.3% decrease in the risk of death after adjustment for the other explanatory variables in the model. The p-value is 0.848, so it is not statistically significant.
- Site of TB: The estimated hazard ratio is Exp (-0.452) = 0.636, a decrease in the hazard, with a p-value of 0.043, which is statistically significant.
- X-ray result: The estimated hazard ratio is Exp (-0.108) = 0.898, which is a 10.2% decrease in the risk. However, the p-value of 0.525 is not statistically significant.
- TTT-outcome: The estimated hazard ratio is Exp (1.828) = 6.220, an increase in the hazard, with a p-value of 0.000, which is statistically significant, and the 95% confidence interval for the hazard ratio is included.
In this model, x is the vector of all the fixed covariates (Age group, Gender, Chest Symptoms, Type of patient, Site of TB, X-ray result, TTT-outcome). Three factors (Chest Symptoms, Type of patient and X-ray result) are not accepted by the above model because their score statistics have significance values greater than 0.05, so they are not significant. The Cox-PH model with the significant factors is as follows:
3-6. Comparing models:
There are many measures for comparing two or more survival models. Here the models are compared using the Akaike information criterion and the Bayesian information criterion, which are the most widely used criteria for comparing models in survival analysis.
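As a sketch of how the two criteria are computed, the snippet below applies AIC = -2 ln L + 2K and BIC = -2 ln L + k ln N to the -2 Log Likelihood value 1940.362 reported in the text, with N = 788 cases. Taking k = 7 (one parameter per covariate) is an assumption made here for illustration; the function names are likewise hypothetical.

```python
import math

def aic(neg2_log_lik, k):
    # AIC = -2 ln L + 2K
    return neg2_log_lik + 2 * k

def bic(neg2_log_lik, k, n):
    # BIC = -2 ln L + k ln N
    return neg2_log_lik + k * math.log(n)

# -2 Log Likelihood reported in the text; k = 7 is an assumed parameter count
cox_aic = aic(1940.362, 7)
cox_bic = bic(1940.362, 7, 788)
print(round(cox_aic, 1), round(cox_bic, 1))
```

Under these assumptions the results are about 1954.4 and 1987.0, in line with the Cox regression values reported in Table (8). The model with the smaller value of each criterion is preferred.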

2-1. Logistic Regression Model:
Logistic regression analysis is one of the important statistical methods that can be used in many areas of life. When the response variable is binary or categorical, the relationship takes the form of the logistic distribution function, and the explanatory variables can be quantitative, qualitative, mixed, ordinal or binary. What distinguishes the logistic regression model from the linear regression model is that the outcome in logistic regression is binary or dichotomous. (Hosmer & Lemeshow, 2000: 1)
Let the response variable Y be binary, and assume that P(Y = 1) depends on a vector of predictor variables $x = (x_1, \dots, x_k)$. The goal is to model:
$\pi(x) = P(Y = 1 \mid x)$ …(1)
Because Y is a binary variable, modeling $E(Y \mid x)$ means modeling $P(Y = 1 \mid x)$. If $\pi(x)$ is modeled as a linear function of the explanatory variables, the estimated probability values may fall outside the interval [0, 1]. To avoid this, the logistic function can be used; it assumes that:
$\pi(x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}$ …(2)
so that $\pi(x)$ lies in [0, 1] and satisfies the probability condition. Applying the logit transformation to $\pi(x)$ gives another function:
$\ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$ …(3) : 327)
Similar to McFadden's R squared, Cox-Snell's R squared uses the likelihood of the selected model and of an intercept-only model fitted to the same data (McFadden's R squared uses the log-likelihood). In this case:
$R^2_{CS} = 1 - \left[\frac{L(\text{intercept-only model})}{L(\text{specified model})}\right]^{2/n}$ …(5)
Where n is the number of observations.
The Wald statistic is used to test the significance of individual coefficients in the model and is calculated as follows:
$W_j = \left(\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\right)^2$ …(6)
It tests whether each coefficient is equal to zero or not: $H_0: \beta_j = 0$ against $H_1: \beta_j \ne 0$.
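The logistic function and its logit transformation described in this section can be illustrated with a short sketch. The function names and the linear predictor values are hypothetical; the point is only that the logistic function maps any real-valued linear predictor into the interval (0, 1) and that the logit is its inverse.

```python
import math

def logistic(eta):
    # Maps a linear predictor eta = b0 + b1*x1 + ... to a probability
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    # Inverse transformation: log-odds of a probability p in (0, 1)
    return math.log(p / (1.0 - p))

# Hypothetical linear predictor values, not estimates from the study
for eta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = logistic(eta)
    print(eta, round(p, 4), round(logit(p), 4))
```

A linear model for the probability itself could return values below 0 or above 1 for the same inputs, which is the problem the logistic form avoids.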

The estimated survivor function and hazard function from the life-table method and the estimated survivor function from the Kaplan-Meier method can be plotted to produce graphics. In these graphics, each estimated statistic is used as the vertical axis, and the study time is used as the horizontal axis. Based on the equations for the cumulative hazard function, the analyst can also produce a cumulative hazard plot. (Guo, 2010: 88)
2-3. The Cox proportional hazards model:
The Cox proportional hazards regression model is the most convenient way to build regression models for survival data (time-to-event outcomes) based upon the values of given covariates. The Cox (1972) proportional hazards (PH) model has been an extremely popular regression model in the analysis of survival data during the last decades. (Balakrishman & Rao, 2004: 186)
$h(t \mid x) = h_0(t)\exp(\beta^{T} x)$ …(11)
where $h_0(t)$ is an unspecified baseline hazard function free of the covariates x. The covariates act multiplicatively on the hazard. The corresponding survival functions are related as follows:
$S(t \mid x) = [S_0(t)]^{\exp(\beta^{T} x)}$
Clearly, the exponential and Weibull models are special cases. (Ekman, 2017: 86)
One subject's hazard is a multiplicative replica of another's; comparing subject j to subject m, the model is stated as: (Mawlood, 2019: 711)
$\frac{h(t \mid x_j)}{h(t \mid x_m)} = \exp\left(\beta^{T}(x_j - x_m)\right)$
A parametric regression model based on the exponential distribution: (Fox, 2014: 69)
$\log h_i(t) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$ …(12)
Let $h(t \mid x)$ denote the hazard rate at time t for an individual with covariate values $x = (x_1, \dots, x_p)$:
$h(t \mid x) = h_0(t)\exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)$ …(13)
Here p is the total number of covariates, $\beta$ is the constant proportional effect of treatment, and $h_0(t)$ is called the baseline hazard; it is the hazard for the respective individual when all independent variable values are equal to zero. (Schmidt & Witte, 1998: 210)
In the proportional hazards model, estimating the baseline hazard $h_0(t)$ and $\beta$ would require maximizing the likelihood function for the observed data simultaneously with respect to $h_0(t)$ and $\beta$. A more popular approach, proposed by Cox, uses a partial likelihood function that does not depend on $h_0(t)$. (Lokeshmaran, 2013: 23)
2-4. Measures of the Model Selection:
In this study two measures have been used for selecting the best model. Comparing models simply involves calculating the measures for each model; the model with the lowest value is chosen as the best model. (Lee & Wang, 2003: 230)
2-4-1. Akaike's Information Criterion:
The Akaike Information Criterion (AIC) compares the quality of a set of statistical models to each other. Akaike's Information Criterion is calculated as follows:
$AIC = -2\ln(L) + 2K$ …(14)
Where K is the number of model parameters (the number of variables in the model plus the intercept). The log-likelihood $\ln(L)$ is a measure of model fit and is usually obtained from statistical output. (Moore, 2016: 81)
2-4-2. The Bayesian Information Criterion:
The Bayesian information criterion (BIC) is one of the most widely known and pervasively used tools in statistical model selection. BIC is computed for each of the models, and the model corresponding to the minimum value of BIC is selected.
$BIC = -2\ln(L) + k\ln(N)$ …(15)
Where L is the value of the likelihood, N is the number of recorded measurements, and k is the number of model parameters. Comparing models with the Bayesian information criterion simply involves calculating the BIC for each model; the model with the lowest BIC is chosen as the best model. (Lee & Wang, 2003: 231)
3. Application part:
3-1. Introduction:
In this section we deal with the collected data and their analysis. In medicine and health, the most important analysis is the analysis of time to event, which is the time from the entry of a case into a study until the case has a particular outcome. Survival analysis techniques have been presented to analyze

Figure (1): Kaplan Meier curve of age group
In Figure (1) the vertical axis represents the cumulative hazard and the horizontal axis represents the time to event. It is clear from the hazard curve that the risk of dying increases with time, sometimes stabilizing and then increasing again, and rising dramatically after 15 months. The risk of dying clearly never decreases. For the age groups 5-14 and 45-54 the risk of dying is greater than for the other age groups, and the 45-54 group has the highest risk of dying.
We can write the logistic regression equation with just the significant variables:
The -2 Log Likelihood equals 1940.362 in the omnibus tests of model coefficients before adding the explanatory variables to the model.
3-5-1. Omnibus Tests of Cox Model Coefficients:
Omnibus Tests of Cox Model Coefficients are used to verify that the new model (with explanatory variables included) is an improvement over the baseline model.
$\beta$ is the corresponding vector of regression coefficients for the fixed covariates.

The Kaplan-Meier estimator has been used to estimate the hazard function, and two models (the logistic model and the Cox PH model) were used for the survival data. All the corresponding results are given, and the two main methods, the Cox model and the logistic model, are compared. To identify the best survival model for our tuberculosis data, two statistical measures (AIC and BIC) were used.
3-2. Data collection:
The data for this study of tuberculosis were collected from the Kurdistan Regional Government, Iraq/Ministry of Health/General Directorate of Health, Hawler/Chest and Respiratory Disease Center, Hawler. The data consist of 788 cases collected during a 5-year period, from 11th January 2015 through 23rd November 2019, covering all tuberculosis patients followed up by the hospital until 14th April 2020. Of those patients, 159 died during the study and 629 survived or are still alive. Survival time is measured in months.
Table (1): Classification table
Columns: Variable names, Classifications, N, No. of Alive, No. of Death

3-3. Application of Kaplan-Meier:
The Kaplan-Meier method is a nonparametric technique for estimating time-to-event data (the survivorship function). Ordinarily it is used to analyze death as an outcome, but it may be used effectively to analyze time to any endpoint. It is also used to estimate the hazard function and to compare two different study populations.
Table (2): Means for survival

Table (2) presents the KM results for all factors applied to the data set of 788 cases. For the Age group factor, ages 25-34 have the highest estimated mean time until death and ages 5-14 have the lowest. For chest symptoms, the estimated mean time until death for patients who have chest symptoms, 17.703 with 95% confidence interval (15.259-20.146), is greater than for those who do not, 16.160 with 95% confidence interval (15.298-17.022). For the type of patient factor, the largest estimated mean time until death is for N(S-)N(D), 18.267 with 95% confidence interval (16.364-20.170), and the lowest is for D(S+), 8.250 with 95% confidence interval (6.977-9.523). Regarding Gender, the estimated mean time until death for Female, 16.997 with 95% confidence interval (15.645-18.348), is greater than for Male, 16.732 with 95% confidence interval (15.765-17.699).

3-5. The application of Cox Regression:
The Cox PH model provides an estimate of the effect of treatment on survival after adjustment for other explanatory variables.

Table (7): Variables in the equation

Table (8): Comparing models with AIC and BIC

Table (8) shows the AIC and BIC values used to compare the two models (the Cox regression model and the logistic regression model) and to select the most suitable model for our tuberculosis data. AIC and BIC were computed for each model, and the model with the minimum values of AIC and BIC is selected. The results show that the logistic regression model is the best model for our tuberculosis study data because its AIC of 1106.4 and BIC of 1139.1 are lower than the AIC of 1954.4 and BIC of 1987 for the Cox regression model.
Conclusions:
From the analysis of the tuberculosis data and the outcomes of the practical part, the following conclusions have been drawn: