Estimation of Sojourn Time and Transition Probability of Lung Cancer for Smokers using the PLCO Data

Objectives: The goal of this study is to investigate the time durations in the disease-free state and the preclinical state of lung cancer for male and female smokers, using lung cancer data from the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Methods: We applied a modified likelihood function to the lung cancer data to obtain maximum likelihood estimates and to make Bayesian inference for the transition probability from the disease-free to the preclinical state and for the sojourn time distribution. The data were stratified by age and gender for smokers in the periodic screening program. A scaled Beta distribution was used for the transition probability density function, and a Weibull distribution was used to model the sojourn time in the preclinical state. Results: The epidemiological estimate of screening sensitivity is 0.649 for males and 0.68 for females. The transition probabilities differ between males and females: for males the transition probability increases monotonically up to age 80, while for females it has a single maximum at age 72.5. For males, the maximum likelihood estimate of the mean sojourn time is 1.82 years, and the Bayesian posterior mean and median sojourn times are 1.50 and 1.48 years, respectively. For females, the corresponding maximum likelihood estimate, posterior mean and posterior median are 1.84, 1.74 and 1.79 years, respectively. The Bayesian posterior mean lifetime risks of developing lung cancer are 12.0% for male smokers and 6.8% for female smokers. Conclusion: Our estimates show that male smokers are more susceptible to lung cancer, because they have a higher lifetime risk and a higher transition probability density than female smokers of the same age. Once they enter the preclinical state, male smokers have a shorter mean sojourn time than female smokers, meaning that they develop clinical symptoms of lung cancer more quickly.
Citation: Wang D, Levitt B, Riley T, Wu D (2017) Estimation of Sojourn Time and Transition Probability of Lung Cancer for Smokers using the PLCO Data. J Biom Biostat 8: 360. doi: 10.4172/2155-6180.1000360


Introduction
Lung cancer is the leading cause of cancer death in the world. Based on the GLOBOCAN 2012 [1] estimates, there were about 1.825 million new lung cancer cases worldwide in 2012 and about 1.59 million deaths from lung cancer, of which 1.099 million were men and 0.491 million were women. In the United States, based on the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program data, lung cancer is the second most common form of cancer and the leading cause of cancer death [2]. It was estimated that there would be 224,390 new cases in 2016, around 13.3% of all new cancer cases, and 158,080 lung cancer deaths in 2016, about 26.5% of all cancer deaths [2]. Approximately 6.5% of men and women will be diagnosed with lung and bronchus cancer at some point during their lifetime, based on the SEER 2011-2013 data [2]. Lung cancer is more common in men than in women [2]. Smoking is widely recognized as the leading cause of lung cancer: about 80% of lung cancer deaths result directly from smoking [3]. Despite the very serious prognosis of lung cancer, some people with earlier stage lung cancers are cured.
The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial is a multicenter randomized controlled trial (RCT) evaluating screening programs for the four kinds of cancer. The purpose is to determine whether each specific screening modality can reduce mortality from a specific cancer, e.g., PLCO-Lung is to check whether screening with chest X-ray can reduce mortality from lung cancer [4,5]. Secondary objectives of the PLCO are to assess screening sensitivity, specificity, incidence, etc. The trial started in 1993 and ended enrollment in 2001; about 77,500 men and 77,500 women aged 55 to 74 who had no previous history of any PLCO cancer were enrolled at ten screening centers across the US. The PLCO data collection was completed in 2009, so the PLCO data are existing data. These data are available to the authors without participants' identifiers for the development of new statistical methods, and the study was exempted from IRB review under NIH rules, since no human subjects were directly involved. Participants in the PLCO-Lung cancer screening were randomized to either the study arm or the control arm: people in the study arm were offered four annual chest X-rays, with a follow-up time of up to 10 years; people in the control arm had usual care (no screening) and were followed for 13 years. There were 70,618 subjects who received at least one chest X-ray, of whom 70,560 were between age 55 and 74 at the first screen. Based on their gender and smoking status, participants in the study group can be separated into four cohorts: male smokers, male never-smokers, female smokers, and female never-smokers. This study focuses on the four annual chest X-ray (CXR) screenings for lung cancer for male and female smokers, stratified by age. The number of male smokers who participated in the initial screening exam is 21,335, with an average age of 62.7; the corresponding number for female smokers is 14,257, with an average age of 62.1.
Based on the natural history of tumor growth, each cancer patient is assumed to pass through three states: the disease-free state $S_0$, the preclinical state $S_p$, in which an asymptomatic individual unknowingly has a disease that a screening exam can detect, and the clinical state $S_c$, in which the disease manifests itself in clinical symptoms. This progressive disease model, first used by Zelen and Feinleib [6], is denoted by $S_0 \to S_p \to S_c$ (Figure 1).
The transition probability is the probability density function of the time spent in the disease-free state $S_0$; it provides important information on the age at which people move from the disease-free state to the preclinical state. However, it is difficult to estimate without proper modeling. The sojourn time is the time spent in the preclinical state $S_p$. If a person enters the preclinical state $S_p$ at age $t_p$ and clinical symptoms present later at age $t_c$, then $T_p = t_c - t_p$ is the sojourn time in the preclinical state. The nature of data collection in a screening program makes it impossible to observe the onset of either $S_p$ or $S_c$. Therefore, estimation of the sojourn time is also difficult without proper modeling. Usually, a longer sojourn time means that the disease is easier to catch by screening exams. If a person is offered a screening exam at time $t$ within the interval $(t_p, t_c)$ and cancer is diagnosed, then $L = t_c - t$ is the lead time (Figure 1). The screening sensitivity is the probability that the screening exam is positive, given that an individual is in the preclinical state $S_p$.
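The two durations defined above can be illustrated with a short Python sketch. The function names and the ages used are hypothetical and chosen only for illustration; in practice $t_p$ and $t_c$ are not observed and must be inferred through the model.

```python
def sojourn_time(t_p: float, t_c: float) -> float:
    """Time spent in the preclinical state: T_p = t_c - t_p."""
    if t_c < t_p:
        raise ValueError("clinical onset cannot precede preclinical onset")
    return t_c - t_p


def lead_time(t_p: float, t_c: float, t_screen: float) -> float:
    """Lead time L = t_c - t for a case detected by a screen at age t in (t_p, t_c)."""
    if not (t_p < t_screen < t_c):
        raise ValueError("the screen must fall inside the preclinical interval")
    return t_c - t_screen


# Hypothetical person: enters S_p at 63, would show symptoms at 65,
# and is detected by a screening exam at 64.
print(sojourn_time(63.0, 65.0))      # 2.0
print(lead_time(63.0, 65.0, 64.0))   # 1.0
```

A screen taken outside the interval $(t_p, t_c)$ cannot produce a lead time, which is why the second function rejects such ages.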
The screening sensitivity, the sojourn time distribution and the transition probability are the three key parameters in screening modeling, since all other estimations (such as the lead time distribution and probability of over-diagnosis) can be expressed as functions of the three key parameters. Therefore, accurate estimation of the three key parameters is important in cancer screening. Our goal is to provide accurate statistical inference for the distribution of sojourn time and the transition probability from the disease-free to the preclinical state for smokers using the PLCO-Lung cancer screening data, and we will use a new conditional likelihood function to achieve this.

Methods
We let $\beta(t)$ be the screening sensitivity at age $t$, $q(x)$ be the probability density function (PDF) of the sojourn time, and $w(t)$ be the PDF of the time duration in the disease-free state. Inspired by Wu et al. [7], a new conditional likelihood method for estimating the sojourn time distribution and the transition probability density was developed and applied to the PLCO-Lung data for two cohorts: male smokers and female smokers. Data from each cohort include the total number of participants at each screening exam, the number of screen-detected cases, and the number of interval cases between two consecutive exams. These data were stratified by participants' age $t_0$ at study entry, which ranged from 55 to 74 (inclusive) in this study.
This study aims to estimate accurately the time durations in the disease-free state and the preclinical state, which will provide critical information for oncologists and clinicians. To achieve this, we first estimate the screening sensitivity $\beta(t)$. Based on previous lung cancer screening data analyses [8,9] and input from lung cancer radiologists, sensitivity does not depend on age in lung cancer screening. Hence the sensitivity was estimated by the epidemiologic approach: the total number of screen-detected cases divided by the sum of screen-detected cases and interval cases [10].
This gives $\beta_0 = 0.649$ for male smokers and $\beta_0 = 0.680$ for female smokers, which were used for $\beta(t)$ in the likelihood function.
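As a minimal sketch of the epidemiologic estimate, the ratio can be computed as below. The function name and the case counts are hypothetical placeholders, chosen so that the ratio reproduces 0.649; they are not the PLCO counts.

```python
def epidemiologic_sensitivity(screen_detected: int, interval_cases: int) -> float:
    """Screen-detected cases divided by screen-detected plus interval cases."""
    total = screen_detected + interval_cases
    if total <= 0:
        raise ValueError("at least one case is required")
    return screen_detected / total


# Hypothetical counts, not taken from the PLCO data.
beta_0 = epidemiologic_sensitivity(screen_detected=649, interval_cases=351)
print(round(beta_0, 3))  # 0.649
```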
For each gender in the PLCO screening data, based on the initial age $t_0$, we developed a new conditional likelihood function $L(\cdot \mid t_0)$. This likelihood function differs from the previous likelihood in Wu et al. [7], since it is conditional on there being no clinical cancer at or before the initial exam, which matches the enrollment criteria of the mass screening study. Here $D_{k,t_0}$ is the probability that an individual will be diagnosed at the $k$-th scheduled exam, given that he or she is in the preclinical state $S_p$, and $I_{k,t_0}$ is the probability of being an incident case within the $k$-th screening interval $(t_{k-1}, t_k)$, with $K = 4$, since there were four annual screening exams in the PLCO lung cancer study group. Also, $n_{k+1,t_0} = 0$ and $k = 1, 2, 3, 4$.
Here $Q(z) = \int_z^{\infty} q(x)\,dx$ is the survivor function of the sojourn time in the preclinical state $S_p$.
Appropriate parametric functions for $w(t)$ and $q(x)$ were carefully chosen. Instead of the log-normal distribution for $w(t)$, a scaled Beta distribution was used, where $t$ is the age at screening, $a$ and $b$ are the parameters of the Beta distribution, and $w_0$ is the lifetime risk of developing lung cancer for male or female smokers, a variable to be estimated. Based on results from SEER, the transition from the disease-free state to the preclinical state occurs between 20 and 80 years of age. Hence we let $t_L = 20$ and $t_U = 80$, meaning that the transition from the disease-free state to the preclinical state would happen in the age interval (20, 80) if one develops clinical cancer. In this model, $w_0$, $a$ and $b$ are the parameters to be estimated.
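A scaled Beta sub-density on $(t_L, t_U)$ can be sketched in Python as follows. The closed form used here, $w(t) = w_0 \, (t - t_L)^{a-1} (t_U - t)^{b-1} / [B(a,b)\,(t_U - t_L)^{a+b-1}]$, is the standard way to rescale a Beta density and is an assumption of this sketch rather than a formula quoted from the paper. The parameter values and the helper `integrate` are illustrative only.

```python
import math

T_L, T_U = 20.0, 80.0  # age limits stated in the text


def transition_density(t, w0, a, b, t_l=T_L, t_u=T_U):
    """Assumed scaled Beta sub-density; its total mass over (t_l, t_u) is w0."""
    if not (t_l < t < t_u):
        return 0.0
    width = t_u - t_l
    z = (t - t_l) / width  # age rescaled to (0, 1)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1.0) * math.log(z) + (b - 1.0) * math.log(1.0 - z) - log_beta
    return w0 * math.exp(log_pdf) / width


def integrate(f, lo, hi, n=20000):
    """Midpoint rule, adequate for this smooth illustrative integrand."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Illustrative parameter values, not estimates from the paper.
w0, a, b = 0.1, 5.0, 2.0
total_mass = integrate(lambda t: transition_density(t, w0, a, b), T_L, T_U)
print(round(total_mass, 6))  # approximately 0.1, i.e. the lifetime risk w0
```

Under this assumed form, $w(t)$ integrates to $w_0$ rather than to 1, which is what allows $w_0$ to be read as a lifetime risk. With $a > 1$, a value $b > 1$ gives an interior mode at $t_L + (t_U - t_L)(a-1)/(a+b-2)$, while $b < 1$ makes the density rise all the way to $t_U$.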
We used the Weibull distribution to model the sojourn time in the preclinical state, where $x$ is the sojourn time and $\alpha$ and $\lambda$ are positive parameters to be estimated.
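The following Python sketch shows one common Weibull parametrization, with survivor function $Q(x) = \exp(-\lambda x^{\alpha})$ and density $q(x) = \alpha \lambda x^{\alpha-1} \exp(-\lambda x^{\alpha})$. Whether the paper uses exactly this parametrization is an assumption here; under it, the mean sojourn time is $\lambda^{-1/\alpha}\,\Gamma(1 + 1/\alpha)$. The parameter values are illustrative and are not the estimates in Table 1.

```python
import math


def sojourn_survivor(x, alpha, lam):
    """Q(x) = exp(-lam * x**alpha), under the assumed parametrization."""
    return math.exp(-lam * x ** alpha) if x > 0 else 1.0


def sojourn_pdf(x, alpha, lam):
    """q(x) = alpha * lam * x**(alpha - 1) * exp(-lam * x**alpha)."""
    if x <= 0:
        return 0.0
    return alpha * lam * x ** (alpha - 1.0) * math.exp(-lam * x ** alpha)


def mean_sojourn_time(alpha, lam):
    """Closed-form mean: lam**(-1/alpha) * Gamma(1 + 1/alpha)."""
    return lam ** (-1.0 / alpha) * math.gamma(1.0 + 1.0 / alpha)


# With alpha = 1 the model reduces to an exponential with mean 1 / lam.
print(mean_sojourn_time(1.0, 0.5))  # 2.0
# Illustrative shape and scale values (not from the paper).
print(mean_sojourn_time(1.5, 0.5))
```

The shape parameter controls the behavior near zero: for $\alpha < 1$ the density is unbounded as $x \to 0$, whereas for $\alpha > 1$ it starts at zero and has an interior mode.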
In summary, $w_0$, $a$, $b$, $\alpha$ and $\lambda$ are the parameters to be estimated using the new likelihood function.

Results
Both maximum likelihood estimates (MLE) and Bayesian posterior samples were used to make inferences for the five unknown parameters in the model, i.e., $\theta = (w_0, a, b, \alpha, \lambda)$. Theoretically, the first parameter has a domain of (0, 1) and the last four have a domain of (0, ∞). The practical meaning of these parameters limits them to a finite range. The ranges were identified as: $0.01 < w_0 < 0.99$, $1.01 < a < 20$, $0.5 < b < 10$, $0.1 < \alpha < 5$, $0.1 < \lambda < 2$.
Markov Chain Monte Carlo (MCMC) was used to generate posterior random samples from the joint posterior distribution of the parameters, using non-informative priors, for Bayesian inference. The posterior simulation was partitioned into three sub-chains, and Gibbs sampling was used to sample the posteriors for $w_0$, $(a, b)$ and $(\alpha, \lambda)$ separately. The MCMC implementation followed a procedure similar to that in the Appendix of Wu et al. [7]. The MLE and Bayesian posterior estimates of $\theta$ for the PLCO data are shown in Table 1, for both male and female smokers. The posterior mean and median are close to the MLEs, especially for the female group. For the male group, the largest difference is in the estimate of the parameter $\alpha$ of the sojourn time distribution: the MLE is less than 1 (0.970), while the posterior mean and median are 1.852 and 1.389, respectively. This causes a different shape of the sojourn time distribution near zero and a large difference in the mean sojourn time (MST) estimates, compared with the results for the female group.
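The three-block sampling scheme can be sketched as a blockwise random-walk Metropolis sampler (Metropolis-within-Gibbs) with flat priors on the parameter ranges given earlier. This is a generic illustration, not the authors' implementation: the step size, the function names, and in particular `toy_log_lik`, which stands in for the conditional log-likelihood, are all hypothetical.

```python
import math
import random

# Parameter ranges from the text; flat (non-informative) priors on each.
BOUNDS = {"w0": (0.01, 0.99), "a": (1.01, 20.0), "b": (0.5, 10.0),
          "alpha": (0.1, 5.0), "lam": (0.1, 2.0)}
BLOCKS = [("w0",), ("a", "b"), ("alpha", "lam")]  # updated one block at a time


def in_bounds(theta):
    return all(BOUNDS[k][0] < v < BOUNDS[k][1] for k, v in theta.items())


def blockwise_metropolis(log_lik, start, n_iter, step=0.05, seed=0):
    """Random-walk Metropolis updates applied to each block in turn."""
    rng = random.Random(seed)
    theta = dict(start)
    current = log_lik(theta)
    samples = []
    for _ in range(n_iter):
        for block in BLOCKS:
            proposal = dict(theta)
            for name in block:
                lo, hi = BOUNDS[name]
                proposal[name] += rng.gauss(0.0, step * (hi - lo))
            if not in_bounds(proposal):
                continue  # flat prior has zero density outside the box: reject
            candidate = log_lik(proposal)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                theta, current = proposal, candidate
        samples.append(dict(theta))
    return samples


def toy_log_lik(theta):
    """Hypothetical stand-in for the real log-likelihood (independent Gaussians)."""
    centre = {"w0": 0.1, "a": 5.0, "b": 2.0, "alpha": 1.5, "lam": 0.5}
    scale = {"w0": 0.02, "a": 0.5, "b": 0.3, "alpha": 0.2, "lam": 0.1}
    return -0.5 * sum(((theta[k] - centre[k]) / scale[k]) ** 2 for k in centre)


start = {"w0": 0.1, "a": 5.0, "b": 2.0, "alpha": 1.5, "lam": 0.5}
samples = blockwise_metropolis(toy_log_lik, start, n_iter=2000)
posterior_mean_w0 = sum(s["w0"] for s in samples) / len(samples)
print(round(posterior_mean_w0, 3))
```

Because the proposal is symmetric and the prior is flat inside the box, the acceptance ratio reduces to the likelihood ratio; replacing `toy_log_lik` with the actual log-likelihood would leave the sampler itself unchanged.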
Another issue for the male cohort is that the MLE of the transition density parameter $b$ is less than 1 (0.903), while the Bayesian posterior mean and median are greater than 1 (1.056 and 1.014, respectively). Even though the values are close, this causes a different trend in the transition density curve as it approaches 80 years of age (see the first graph in Figure 2). Since our study focused on the age interval between 55 and 74, the results from these two methods agree closely.
The estimated probability density curves $w(t)$ based on the MLE and on the posterior median (with 95% confidence band) are plotted in Figure 2. The posterior median transition probability varies from $1.24 \times 10^{-3}$ to $6.04 \times 10^{-3}$ for males aged 55-74. This means that, in every 1000 people, 1.24 to 6.04 people per year make a transition from the disease-free state to the preclinical state of lung cancer, depending on their age, whereas the corresponding numbers for females are 0.97 to 3.22 per 1000. The transition probability is not a monotone function of age for females, with a single maximum at age 72.5, whereas for males it tends to increase all the way to 80 years of age. Female smokers have a much lower probability of entering the preclinical state than male smokers. This is also reflected in the much lower estimated $w_0$ for females (Bayesian median 0.066) than for males (Bayesian median 0.117), because $w_0$ indicates the lifetime risk of lung cancer over all ages.
The sojourn time probability density $q(x)$ is shown in Figure 3. The probability densities are clearly concentrated within 2 years for both genders. The posterior mean sojourn time (MST) is 1.50 years for males, with a posterior median of 1.48 years and a 95% highest posterior density (HPD) interval of (1.06, 2.05). The posterior MST for females is 1.74 years, with a posterior median of 1.79 years and a 95% HPD interval of (1.10, 2.25). The MLEs of the MST are 1.82 and 1.84 years for males and females, respectively. Thus the MST for females appears longer than the MST for males by either the MLE or the Bayesian estimate, meaning that females may have a longer sojourn time in the preclinical state.
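An HPD interval is commonly approximated from posterior draws by taking the shortest interval that contains the required fraction of the sorted samples. The Python sketch below uses this generic estimator, which is reasonable for unimodal posteriors. It is not necessarily the procedure used for the intervals above, and the gamma-distributed draws are hypothetical stand-ins for posterior MST samples.

```python
import math
import random


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the sorted draws."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("no samples given")
    k = min(n, max(1, math.ceil(mass * n)))  # draws the interval must cover
    # Slide a window of k consecutive order statistics; keep the narrowest.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Hypothetical right-skewed posterior draws for a mean sojourn time.
rng = random.Random(1)
draws = [rng.gammavariate(9.0, 0.2) for _ in range(5000)]
lower, upper = hpd_interval(draws)
print(f"95% HPD interval: ({lower:.2f}, {upper:.2f})")
```

For a skewed posterior, the HPD interval is shorter than the equal-tailed interval built from the 2.5% and 97.5% quantiles, which is one reason to report it.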

Discussion and Conclusion
We applied a new likelihood function to the PLCO data and obtained the maximum likelihood estimates and Bayesian estimates of the key parameters in lung cancer screening for smokers. We used an epidemiological method to estimate the sensitivity for the study; the sensitivity is 0.649 for males and 0.68 for females.
The NCI's Cooperative Early Lung Cancer Group conducted an important study in 1984 on the sensitivity, specificity, and predictive values of chest X-ray (CXR) in the early detection of lung carcinoma. The NCI trials demonstrated that the sensitivity of CXR ranges from 0.54 to 0.84, with an average of 0.69 [11]. Our simple epidemiological estimate of the sensitivity is compatible with their result. Jang et al. [12] studied Johns Hopkins Lung Project (JHLP) data with CXR and obtained an estimated sensitivity of 0.568. Kim et al. [13] studied the efficacy of dual lung cancer screening by CXR and sputum cytology using JHLP data; the study showed that the screening procedure improved from 79.93% with X-ray only to 85.34% when the screening exams were combined with cytology. Ten Haaf et al. [14] used individual-level data from the National Lung Screening Trial (NLST) and the PLCO trial to estimate the screening sensitivity for different stages of lung cancer. According to their results, the sensitivities of CXR for non-small cell carcinoma at the earlier stages (IA-IIB) are below 50%, whereas the sensitivity of CXR for small cell carcinoma at stage IV could reach 97.31%.
For smokers in the PLCO-Lung study, the MLE of the mean sojourn time (MST) is about 1.82 years for males, and the Bayesian posterior mean is 1.50 years, with a 95% highest posterior density (HPD) credible interval of (1.06, 2.05) years. For females, the MLE of the MST is about 1.84 years, and the Bayesian posterior mean is 1.74 years, with a 95% HPD credible interval of (1.10, 2.25) years. For the Mayo Lung Project study [15], whose study design is similar to that of this study, the MST was 2.2 years for male smokers. Liu [8] studied the NLST data for lung cancer screening with CT scans and estimated a mean sojourn time of 1.44 years for males and 1.62 years for females. Using data from the Lung Cancer Screening Program at the Memorial Sloan-Kettering Cancer Center (MSKC-LCSP), Chen et al. [9] obtained an MST of about 3.35 years for male smokers. Chien et al. [16] summarized several MST estimates from different low-dose spiral CT studies, ranging from 1.38 to 3.86 years. Our MST estimates (1.48-1.84 years) are within this range. In contrast, ten Haaf et al. [14] estimated a higher MST for both genders: between 3.09 and 5.32 years for males and between 3.35 and 6.01 years for females, depending on the type of carcinoma.
The transition probability from the disease-free to the preclinical state increases all the way to age 80 for male smokers, while it peaks around age 72.5 for females. We compared this result with the SEER database. The "SEER Cancer Stat Fact Sheets" [2] show that the probability of developing lung cancer has a single maximum between ages 65 and 74 for both genders. Our results for females agree with this, but our results for males do not. The transition density from the NLST [8] is a unimodal sub-density with a mode around age 70 for both genders.
Lung cancer is more common in men than in women. Overall, the chance that a man will develop lung cancer in his lifetime is about 7.19% (1 in 14); for a woman, the risk is about 6.04% (1 in 17) [17]. These numbers include both smokers and non-smokers. The risk is higher for smokers and lower for non-smokers. Our estimated posterior mean of $w_0$ was 11.95% for male smokers and 6.82% for female smokers; these values are reasonable, because they are both higher than the corresponding values for the general population. This is the first time that the lifetime risk was treated as a variable in the model. The risk for male smokers is 66.2% higher than that for the general male population (from 7.19% to 11.95%), and the risk for female smokers is 12.9% higher than that for the general female population (from 6.04% to 6.82%). These results indicate that the risk of developing lung cancer is much higher for male smokers than for female smokers. Villeneuve and Mao [18] studied the lifetime probability of developing lung cancer by smoking status for Canadians. They found that 172 per 1,000 male current smokers will eventually develop lung cancer; among female current smokers the figure was 116 per 1,000. Our estimated $w_0$ values for both genders are lower than their results.
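The relative increases quoted above follow directly from the lifetime risks, as this short Python check shows. The function name is illustrative; the inputs are the percentages stated in the text.

```python
def relative_increase(baseline: float, value: float) -> float:
    """Relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# General-population risk versus estimated posterior mean of w_0, in percent.
print(f"{relative_increase(7.19, 11.95):.1%}")  # 66.2% (male smokers)
print(f"{relative_increase(6.04, 6.82):.1%}")   # 12.9% (female smokers)
```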
Our estimates show that male smokers are more susceptible to lung cancer, because they have a higher lifetime risk and a higher transition probability than female smokers. Once they enter the preclinical state, male smokers seem to have a shorter mean sojourn time than female smokers, meaning that their tumors seem to develop more quickly into the clinical disease state. The key parameters obtained from this study are also important because other quantities of interest, such as the lead time distribution and the percentage of overdiagnosis, are functions of these key parameters. Our future work on estimating long-term outcomes will use the parameter estimates from this paper.