A Bayesian Analysis of Sub-Samples in Survival Analysis

A Bayesian analysis of sub-samples was performed in order to estimate predictors of survival that were based on exponential and Weibull models. Analysis was done by using data followed up over a five-year period on a random sample of 4001 children under five years of age and their mothers from the Maseru District of Lesotho. In the process of constructing both models, the presence of censored observations and truncation at the age of five years were taken into account. Bayesian inference with Gibbs sampling was used in order to obtain marginal posterior densities for each predictor variable and the intercept. The Weibull model gave a better fit in comparison with the exponential model.


Introduction
The study was based on a 5-year-long follow-up study of 4, 001 children who were born in the Maseru District of Lesotho [1]. All children and their mothers in the study were followed up over the five-year period of study. During the 5-year follow-up period, 393 of the 4001 children had died. Data on children in the study were collected from clinical records and their mothers by health professionals from the Health Ministry of Lesotho.
The objectives of the study were A. to investigate the relationship between the lifetime of children and 9 dichotomous predictor variables that affect the lifetime of children using specially constructed exponential and Weibull models to account for truncation and censored observations [2], and B.
To demonstrate how the drawing of subsamples considerably simplifies the daunting task of estimating parameters from a joint posterior density involving several predictor variables using Matlab programing [3] Bayesian inference [4] and Gibbs sampling [5,6].
An attempt was made to analyze the relationship between the lifetime of children and 9 predictor variables that affect their survival using conventional methods such as maximum likelihood estimators from a number of distributions and the normal and log-normal models [7][8][9]. However, the estimated models failed to fit the data well. This was confirmed by the use of widely used diagnostic procedures in the case of maximum likelihood estimators and hazard function plots in the case of the lognormal model. Conventional methods failed because I.
The error terms were not distributed normally with mean 0 and constant variance regardless of the large sample size, II. The data included censored observations and the lifetime variable had to be truncated at t=5 years. As a remedy, special Weibull and exponential models were constructed to account for censored observations and truncation at the age of 5 years.
Parameters were estimated using Bayesian inference and Gibbs sampling by drawing several subsamples. The use of subsamples instead of the complete sample considerably reduced the time taken for estimation of parameters using Matlab programming and led to more or less similar point estimates of parameters at a cost of inflating the variance of estimation slightly.
In keeping with principles of Bayesian inference, the posterior marginal density was obtained as the product of the likelihood and the prior. For the exponential model, uniform priors were used to estimate the parameters 0 1 observations. For the Weibull model, it was assumed that σ is unknown, and the non-informative prior 1 ( ) p σ σ ∝ was used to estimate it. All prior and likelihood densities used to obtain posterior densities were constructed taking the presence of censored observations and truncation at t=5 years into account. To facilitate computation, analysis was done by drawing 50 subsamples of size 400 each drawn from the complete sample. From the joint posterior densities, marginal densities were obtained. Point estimates and 95% credibility intervals were obtained for each parameter in the study. The Weibull model gave more realistic estimated probabilities than the exponential model.
The following items will be discussed in the order shown below: The list of variables of study and their levels.
II. The drawing of sub-samples from the complete sample.
III. The exponential model and results obtained from the exponential model.
IV. The Weibull model and results obtained from the Weibull model.

V.
Comparison of results from the exponential and Weibull models.

The drawing of subsamples
Data analysis was done on a PC using Bayesian inference, Gibbs sampling and MATLAB programming. Simulations resulting from the use of Gibbs sampling with a sample of size 4001 cases and 10 variables took very long, and it was necessary to shorten the time needed for analyses by drawing several subsamples of size 400 (10% of the complete sample size) and derive the posterior distributions based on the subsamples using the Gibbs sampler. Besides, one major objective of the study was to demonstrate that the Gibbs sampler does simplify the estimation of parameters using Bayesian inference considerably. It is true that the posterior distribution of a parameter based on a large sample has a smaller variance or spread than the posterior distribution of a parameter based on a small sample. Realizing this however, a procedure of drawing subsamples of size 400 out of 4001 cases was carried out. This procedure was repeated several times as in the bootstrap sampling-resampling process [5]. Analytical results are therefore based on posterior densities derived from the subsamples.
In what follows next, the efficiency of the procedure of drawing subsamples from the complete sample will be demonstrated using the Weibull model to estimate the parameter µ . Assume that ~( , ) T WEI µ σ where T denotes the lifetime of the child. The probability density function of T is given as follows: For the parameters µ and σ , it can be shown that p µ ∝ where ( ) p µ is a non-informative reference prior [10]. The likelihood function of µ is given as follows: In Figure 1, the posterior distribution for a given sample 400 is drawn from the sample, the posterior distribution µof looks like the figure denoted by (2) in Figure 1. The variance of (2) is of course larger than the variance of (1). Subsamples of size 400 are drawn repeatedly from the sample 50 times, and then the average of the 50 posterior distributions is obtained.

Biostatistics and Biometrics Open Access Journal
The resulting average posterior distribution of µ looks like the figure denoted by (3) in Figure 1. Posteriors (1) and (3) in Figure 1 give almost the same means, which means that Bayes estimates of µ are about the same as estimates obtained from the complete sample. (The true value of µ is -1.). Even if posterior densities based on subsamples have larger spreads than posterior densities based on the complete sample, the process of drawing subsamples does not affect estimates of the expected values of parameters significantly. The standard deviation of the posterior distribution based on the complete sample is approximately 4 times smaller than the standard deviation of the posterior distribution based on the mean of 50 posterior densities corresponding to the 50 subsamples drawn for the experiment. As it turned out, it was easier to do analyses based on the 50 subsamples drawn from the compete sample. Furthermore, results from the subsamples were more or less as good as results from the complete sample.

The exponential model and results
Let T denote the lifetime of the child. Let 1 9 ,...., x x denote the 9 dichotomous predictor variables that affect the lifetime T. If the lifetime T follows the exponential distribution, the pdf and cdf of T will be given as follows: 3)

The effect of censored observations
At the time of data collection, 3608 of the 4001children in the study were still alive. We thus assume a random sample of size n observations on the lifetimes 1 2 , , ..., n t t t where most of the t's are censored observations of Type I. Let the symbol be used to indicate whether the child is dead or alive.

The likelihood function
Let the pdf and cdf of T be denoted by

The effect of truncation at t = 5 years
Let 1 p denote the probability that the child dies before the age of five. Then, the distribution of T will have a spike of length 2 1 1 p p = − at t = 5 since the data reveals no information about children with ages older than 5 years. In this study, 1 p is fixed. It is assumed that 1 0.135 p = or the under-five mortality rate of Lesotho. Thus, After truncation at t=5 years, the pdf of T becomes: After truncation at t=5 years, the likelihood function for 5 t ≤ becomes: Using Bayesian inference, the posterior density is a function of the product of the prior density and the likelihood function. For the exponential model, the uniform prior and the likelihood function given in (3.7) are multiplied with each other to obtain the joint posterior density. From the joint posterior density, the marginal densities corresponding to each parameter are obtained. Point estimates and 95% credibility intervals are given for each parameter of study.
For any age t such that 0 5 t < ≤ years, the expected lifetime of the child at birth, E(T), is approximated by the expression shown in (3.8): The estimated probability that the child survives more than t years at birth, is denoted by ( ) T S t where 0 5 t < ≤ , and is approximated by the following expression shown in (3.9): The probability of surviving more than 1 year at birth is denoted by ( 1) T S t= , and is obtained by using t=1 in the expression for ( ) T S t in (3.9). Doing so, we obtain the following simplified expression for ( 1) respectively for the exponential model using estimated results shown in Table 1 above. The 9 predictor variables mentioned in the expression for In Case 1, the child is exposed to minimum risk as each of the 9 predictor variables is fixed at low level. Case 2 carries more risk than Case 1 as 5 of the 9 predictor variables are fixed at low level. Case 3 carries more risk than Case 2 as 4 of the 9 predictor variables are fixed at low level. Case 4 carries maximum risk as none of the 9 predictor variables is fixed at low level. For the The result in (3.12) shows that when all 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.6604 years. When 5 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.6211 years. When 4 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.4322 years. When none of the 9 predictor variables is kept at its low level, the expected lifetime of the child at birth reduces to 4.3841 years. Using the expression in  Table 1.
For the exponential model, the probability of surviving more than 1 year, ( 1) For the children in the study, the probability of surviving more than 0 years is given by the expression in (3.14): The fact demonstrated in (3.15) shows that the survival probability function introduced earlier in (3.10) is well defined.

The weibull model and its results
Let T denote the lifetime of the child. ~( , ) T WEI λ β . The pdf of T is given by In (4.7),

The effect of censored observations
We are interested in the estimation of the parameters

The likelihood function
Due to the presence of censored observations, the following likelihood is used: (4.14) Using the posterior distribution shown in (4.14), each marginal posterior density corresponding to the explanatory variables used in the study can be obtained. The posterior density in (4.14) takes into account the presence of censored observations and truncation at t=5 years. The expected lifetime at birth is obtained as follows: By using the Gibbs sampler, the marginal posterior densities of the parameters were obtained by simulation. From the simulated posterior densities, mean values were obtained and used as Bayes estimates. Computer programs were written in Matlab. The Weibull model gave the estimates of regression coefficients and 95% credibility intervals shown below in Table  2.  In Case 1, the child is exposed to minimum risk as each of the 9 predictor variables is fixed at low level. Case 2 carries more risk than Case 1 as 5 of the 9 predictor variables are fixed at low level. Case 3 carries more risk than Case 2 as 4 of the 9 predictor variables are fixed at low level. Case 4 carries maximum risk as none of the 9 predictor variables is fixed at low level. For the 4 cases discussed above, values of can be summarized as shown below in (4.23): The result in (4.24) shows that when all 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.97 years. When 5 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 2.73 years. When 4 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 1.98 years. When none of the 9 predictor variables is kept at its low level, the expected lifetime of the child at birth reduces to 0.89 years.
Using the expression in (4.20), the expected lifetime E(T) at birth of any child can be predicted for any possible values of 1 2 9 , ,....., x x x and values of t such that 0 5 t < ≤ using values of the estimated regression coefficients ,...., β β β shown in Table 2. It should be noted that the ( ) 5 E T ≤ years as the pdf of T is truncated at t=5 years. The estimated probability of surviving more than t years from birth for 0 5 t < ≤ is given by:    (when all 9 explanatory variables are at their high levels) In Case 1, the child is exposed to minimum risk as each of the 9 predictor variables is fixed at low level, and hence the probability of survival should be a maximum. Case 2 carries more risk than Case 1 as 5 of the 9 predictor variables are fixed at low level. As a result, the survival probability of Case 2 should be smaller than that of Case 1. Case 3 carries more risk than Case 2 as 4 of the 9 predictor variables are fixed at low level (Table 3). Thus, the survival probability of Case 3 should be smaller than that of Case 2. Case 4 carries maximum risk as none of the 9 predictor variables is fixed at low level, and hence the survival probability of Case 4 should be the smallest of the 4 probabilities. For the 4 cases discussed above, values of are obtained as follows:

Comparison of Results
From Table 4, it can be seen that estimates from the Weibull model have larger magnitudes (in absolute value) than estimates from the exponential model. Besides, all estimates (except for the intercept) from the Weibull model are positive, while 3 estimates from the exponential model are negative even if level 1 (the high level) of each of the 9 predictor variables in the study contributes to the probability of death. This fact shows that estimates from the Weibull model are larger in magnitude and more realistic than estimates from the exponential model. Table 5 shows that, for the Weibull model, as the magnitude of risk increases, the expected lifetime at birth decreases over an evenly spread and more realistic range of lifetime. Although the same trend is observed in the exponential model, the variation is limited from 4.66 years to 4.38 years only. This shows that the Weibull model gives a more realistic result. Table 6 shows that, for the Weibull model, as the magnitude of risk increases, the probability of surviving more than a year decreases. For the exponential model however, the magnitude of risk and the probability of surviving more than a year are directly related. This shows that the Weibull model gives a better fit and more realistic estimates in comparison with the exponential model in survival analyses procedures.