Risk of lung cancer due to external environmental factor and

Lung cancer is among the cancers with the fastest-growing incidence and mortality worldwide, and it poses an extremely serious threat to human life and health. Evidence shows that external environmental factors, such as smoking and radiation exposure, are key drivers of lung cancer. It is therefore urgent to explain, both experimentally and theoretically, the mechanism by which external environmental factors raise lung cancer risk. However, how these factors affect lung cancer risk remains an open issue. In this paper, we summarize the main mathematical models of cancer that involve gene mutations, and review their application to the mechanism of lung cancer and to lung cancer risk due to external environmental exposure. In addition, we apply the models described here to epidemiological data to analyze the influence of external environmental factors on lung cancer risk. The results indicate that radiation can significantly increase the mutation rate of cells, in particular the mutation rate of stability genes, which leads to genomic instability. These studies not only offer insight into the relationship between external environmental factors and human lung cancer risk, but also provide theoretical guidance for the prevention and control of lung cancer.


Introduction
Cancer is caused by the accumulation of mutations in multiple genes, accompanied by selective advantages [1]. During tumor development, cells acquire a number of hallmarks through mutation and selection. They include, but are not limited to: tissue invasion and metastasis, limitless replicative potential, resistance to programmed cell death (apoptosis), tumor-promoting inflammation, sustained angiogenesis, insensitivity to growth-inhibitory signals, self-sufficiency in growth signals, reprogramming of energy metabolism, evasion of immune destruction, and genomic instability and mutation [2][3][4]. With advances in DNA technology, high-throughput DNA sequencing has revolutionized the understanding of cancer, and previously undetected mutational signatures have been discovered in the evolution of cancer. The evidence suggests that tumors contain abundant genetic alterations, such as mutations in TP53 (p53), KRAS and APC, chromosomal aberrations and so on [5][6][7]. However, most of them are passenger mutations that confer no selective advantage on the cell clone [8,9]. Passenger mutations are not the reason the tumor exists; the onset of neoplasia is driven by driver gene mutations that confer a selective advantage on the cell clone [10]. Historically, much cancer research has focused on the driver gene mutations with selective advantage that drive the initiation and progression of tumors. Comprehensive sequencing efforts revealed that a solid tumor involves two to eight such mutations in normal cells [11]. These discoveries motivate researchers to investigate how many driver genes are mutated in a given human cancer.
Although the outcomes of biological research are eye-opening, reconciling this biological knowledge with epidemiological or clinical observations still poses a significant challenge. To address this issue, biologically based mathematical models have been proposed to study the fundamental processes that lead from healthy cells to a tumor, and they are an efficient way to study the mechanism of some cancers. In a pioneering study, Armitage and Doll proposed the multistage model to describe the development of tumors, and found a linear relationship between the logarithm of age and the logarithm of risk for many cancers [12]. However, the model ignored the clonal expansion of cells carrying gene mutations. In response, Armitage and Doll presented a model with two gene mutations that confer a selective advantage on cells [13]. In their model, normal cells mutate into premalignant cells that undergo clonal expansion; the mutated cells proliferate exponentially and then mutate further into a malignant tumor cell. Nevertheless, they did not give the exact biological meaning of the model. Later, Knudson examined 48 cases of retinoblastoma and proposed the hypothesis that retinoblastoma results from two gene mutations in normal stem cells [14]. This work led to the discovery of the gene RB, the earliest known tumor suppressor gene. In addition, Moolgavkar et al. developed a model with two mutation events to describe the age-specific incidence or mortality data of most cancers [15][16][17][18][19]. In recent years, this model has been widely applied to analyze the risk of various cancers due to external environmental factors.
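The log-log relationship of the multistage model can be illustrated with a short sketch. It assumes the usual small-rate approximation in which the hazard of a k-stage model behaves like c·t^(k−1), so the slope of log(rate) against log(age) is k − 1. The ages, the constant `C` and the stage count `K_STAGES` are synthetic illustration values, not data from any registry.

```python
import math

def fit_loglog_slope(ages, rates):
    """Least-squares slope and intercept of log(rate) against log(age)."""
    xs = [math.log(a) for a in ages]
    ys = [math.log(r) for r in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic rates generated from h(t) = C * t**(K_STAGES - 1); illustrative only.
K_STAGES = 6
C = 1e-12
ages = [30, 40, 50, 60, 70, 80]
rates = [C * a ** (K_STAGES - 1) for a in ages]

slope, intercept = fit_loglog_slope(ages, rates)
estimated_stages = slope + 1  # slope of the log-log line is k - 1
```

On real incidence data the points only approximately follow a line, so the estimated number of stages is a rough guide rather than an exact count.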
Many models incorporating more biological knowledge were subsequently developed on the basis of the model with two gene mutations, such as models with multiple pathways and gene regulation and models with more than two events [20][21][22][23]. Genetic instability caused by mutations in stability genes is an important hallmark in the development of cancer. Accordingly, models with genetic instability have been proposed to account for its role in tumorigenesis [24][25][26][27][28][29][30]. Hazelton et al. presented a longitudinal biologically based model of lung cancer that considers individual smoking histories, the probability of a tumor in lung tissue, lung cancer mortality, CT screening detection and other factors [31]. Zhang and Simon described models with between two and six hits that consider the clonal expansion of all mutated cells [32,33]. Their results showed that the model with three hits fitted female breast cancer incidence data better than the two-stage model. Moreover, between two and fourteen gene mutations were found to be required for female breast cancer [34]. With the development of DNA sequencing technology, a large number of gene alterations have been identified at base-pair resolution. This information on carcinogenesis pathways should be considered in cancer models. Tomasetti et al. presented an approach, combining epidemiological data with genome sequencing information, to estimate the number of driver gene mutations in lung and colon adenocarcinomas [35]. The results suggested that the model with three driver gene mutations was optimal for lung cancer.
Recently, a mechanistic model with molecular driver pathways was developed to predict the risk of lung adenocarcinoma; it indicated that radiation mainly affects the pathway involving transmembrane receptor mutants, whereas smoking primarily influences the pathway involving transducer mutants [36].
In this paper, we mainly discuss the application of cancer models with gene mutations to the analysis of cancer risk, for example, how to measure the risk of lung cancer due to external factors (such as smoking and radiation exposure) using the mathematical framework and epidemiological data. Some important extensions of the cancer model and the relevant references are reviewed first. In Section 2, we describe the clonal expansion models with two driver gene mutations, with more than two driver gene mutations, and with genomic instability, together with their applications in cancer studies. In Section 3, the effect of external environmental exposure on lung cancer risk is analyzed with the models of Section 2. In Section 4, we summarize the relevant conclusions and point out the shortcomings of the study as well as some perspectives for cancer research.

The model with two gene mutations
Evidence suggests that at least two driver gene mutations are required to produce malignant cells in most human cancers [1]. Thus, the model with two driver gene mutations is widely used to study the pathogenesis of various cancers; it treats tumorigenesis as the final result of two rate-limiting driver gene mutations in healthy cells. The detailed model is shown in Figure 1. The meanings of the parameters in the model are as follows: $\nu_0(t)$ is the mutation rate at which a healthy cell becomes a cell with a gene mutation at time $t$; $b(t)$ is the birth rate at which a cell with a gene mutation divides into two identical daughter cells at time $t$; $d(t)$ is the death rate at which a cell with a gene mutation dies at time $t$; $\nu_1(t)$ is the mutation rate at which a premalignant cell with a gene mutation becomes a malignant cell that multiplies indefinitely at time $t$.

Figure 1. The model with two mutations in driver genes. $\nu_0(t)$ and $\nu_1(t)$ represent the gene mutation rate per healthy cell and per premalignant cell at time $t$, respectively. $b(t)$ and $d(t)$ represent the birth rate and death rate of a premalignant cell at time $t$, respectively.
The model assumes that once a malignant cell is generated in the tissue, a malignant tumor is detected with probability one after a suitable incubation period. The incubation time from the first malignant cell to the clinically detected tumor, $T_{\mathrm{lag}}$, is often treated as a constant [21,33,37]. The quantities of interest are $p(t)$, the probability that a malignant cell has appeared by time $t$, and the hazard function $h(t)$. They are related by [38]

$$h(t) = \frac{p'(t)}{1 - p(t)}.$$

In this model, let $X_0(t)$, $X_1(t)$ and $X_2(t)$ denote the numbers of healthy cells, premalignant cells and malignant cells at time $t$, respectively. The number of healthy cells, $X_0(t)$, usually follows one of three growth patterns: a Gompertz curve, a logistic curve or a constant. Therefore, $X_0(t)$ is usually treated as a deterministic growth curve in the assessment of cancer risk [39].
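The three growth patterns mentioned for the healthy cell population can be written as small deterministic functions. This is a minimal sketch using common textbook parameterizations; the function names and the arguments `n0`, `capacity` and `rate` are hypothetical and are not taken from the cited references.

```python
import math

def constant_cells(t, n0):
    """Constant population: X_0(t) = n0 for all t."""
    return n0

def logistic_cells(t, n0, capacity, rate):
    """Logistic growth from n0 at t = 0 towards the carrying capacity."""
    return capacity / (1.0 + (capacity / n0 - 1.0) * math.exp(-rate * t))

def gompertz_cells(t, n0, capacity, rate):
    """Gompertz growth from n0 at t = 0 towards the carrying capacity."""
    return capacity * math.exp(math.log(n0 / capacity) * math.exp(-rate * t))

# Illustrative values only: 1e3 cells at birth, growing towards 1e7 cells.
N0, CAPACITY, RATE = 1e3, 1e7, 0.5
adult_logistic = logistic_cells(100.0, N0, CAPACITY, RATE)
adult_gompertz = gompertz_cells(100.0, N0, CAPACITY, RATE)
```

Any of these curves can be passed to a hazard calculation as the deterministic $X_0(t)$; for adult tissues the constant choice is often the simplest.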
For $e < t$, the following probability generating functions are defined in order to solve for the hazard function, According to Theorem 5A in [40] and the Kolmogorov forward equation, the derivative of the probability generating function $\Phi$ with respect to $t$ is written as and thus The conditional expectation in Eq (2.6) can be written as $E[X_1(t)]$ for cancers that are rare diseases, that is, $p(t) \approx 0$. Then, the hazard function can be given by From Eq (2.2), we can obtain that By differentiating Eq (2.3), $E[X_1(t)]$ can be given by Thus, However, the approximation of $h(t)$ mentioned above is poor when the probability of a tumor is high. Hence, the exact solution for $h(t)$ should be used when a cancer is not a rare disease. By differentiating Eq (2.3) with respect to $x_1$ and Eq (2.5), the conditional expectation, The solution of the conditional expectation in Eq (2.10) is not easy to implement. To address this issue, several papers discussed the solution of the conditional expectation, $E(\cdot \mid \cdot)$. Clewell et al. gave the mathematical foundation for an approximation of $\mathrm{Var}[X_1(t) \mid X_2(t) = 0]$ that is easy to implement for time-varying model parameters [41,42]. In addition, they found that the exact solution of the hazard function, $h(t)$, can be obtained after a slight modification when the parameters are constant in time. Crump et al. gave the numerical solution of the hazard function in the model with two gene mutations using the Kolmogorov backward equations [43].
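The rare-disease approximation described above can be sketched in the notation of this section. The sketch assumes that a mutation leaves the parent premalignant cell in place, which is the usual convention in two-stage clonal expansion models; the arrangement and numbering of the displayed equations in the original derivation may differ.

```latex
\begin{align*}
\frac{d}{dt}E[X_1(t)] &= \nu_0(t)X_0(t) + \bigl(b(t)-d(t)\bigr)E[X_1(t)], \qquad E[X_1(0)] = 0,\\
E[X_1(t)] &= \int_0^t \nu_0(s)X_0(s)\exp\Bigl(\int_s^t \bigl(b(u)-d(u)\bigr)\,du\Bigr)\,ds,\\
h(t) &\approx \nu_1(t)\,E[X_1(t)]
      = \nu_1(t)\int_0^t \nu_0(s)X_0(s)\exp\Bigl(\int_s^t \bigl(b(u)-d(u)\bigr)\,du\Bigr)\,ds.
\end{align*}
```

For constant parameters and $X_0(t) = N$ this reduces to $h(t) \approx \nu_0\nu_1 N\,(e^{(b-d)t}-1)/(b-d)$, which grows without bound and therefore overestimates the hazard at ages where the tumor probability is no longer small.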
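When the number of healthy cells is constant, $X_0(t) = N$, and the model parameters are constant in time, the hazard function of the model can be written as

$$h(t) = \frac{\nu_0 N}{b}\,\frac{pq\left(e^{-qt} - e^{-pt}\right)}{q e^{-pt} - p e^{-qt}}, \qquad p, q = \frac{1}{2}\left[-(b - d - \nu_1) \mp \sqrt{(b - d - \nu_1)^2 + 4 b \nu_1}\right].$$

For time-varying parameters, time is usually divided into several subintervals with constant parameters in order to solve for the hazard function.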
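The constant-parameter case can be checked numerically with a short sketch. It uses the closed form that is standard for two-stage clonal expansion models, written with the roots p and q of x² + (b − d − ν₁)x − bν₁ = 0. As an independent check, the same hazard is obtained from h(t) = ν₀N·u(t), where u(t) is the probability that one premalignant cell present at time 0 has produced a malignant cell by time t. The function names and parameter values are hypothetical illustrations, not fitted values.

```python
import math

def two_mutation_hazard(t, nu0, nu1, b, d, n_cells):
    """Closed-form hazard for constant parameters and X_0(t) = n_cells."""
    gamma = b - d - nu1
    root = math.sqrt(gamma * gamma + 4.0 * b * nu1)
    p = 0.5 * (-gamma - root)  # negative root
    q = 0.5 * (-gamma + root)  # positive root
    decay = math.exp(-(q - p) * t)  # equals exp(-q t) / exp(-p t)
    # Algebraically equal to
    # (nu0 N / b) p q (e^{-qt} - e^{-pt}) / (q e^{-pt} - p e^{-qt}),
    # rearranged so that no large exponentials are formed.
    return (nu0 * n_cells / b) * p * q * (decay - 1.0) / (q - p * decay)

def two_mutation_hazard_ode(t, nu0, nu1, b, d, n_cells, steps=20000):
    """Numerical check: h(t) = nu0 * N * u(t), where u solves
    u' = -b u^2 + (b - d - nu1) u + nu1 with u(0) = 0 (classical RK4)."""
    def rhs(u):
        return -b * u * u + (b - d - nu1) * u + nu1

    u = 0.0
    dt = t / steps
    for _ in range(steps):
        k1 = rhs(u)
        k2 = rhs(u + 0.5 * dt * k1)
        k3 = rhs(u + 0.5 * dt * k2)
        k4 = rhs(u + dt * k3)
        u += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return nu0 * n_cells * u

# Illustrative parameter values (per year); not estimates from any data set.
params = dict(nu0=1e-7, nu1=1e-6, b=1.0, d=0.9, n_cells=1e7)
h_closed = two_mutation_hazard(60.0, **params)
h_numeric = two_mutation_hazard_ode(60.0, **params)
```

For piecewise-constant parameters the same differential equation can be integrated interval by interval, which is one way to realize the subinterval approach mentioned above.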

The model involving the number of gene mutations more than two
The maximum number of driver gene mutations in a given tumor is important for the diagnosis and treatment of some cancers [5,11]. For this purpose, the model with more than two driver gene mutations was built to estimate the number of driver gene mutations in the process from healthy cells to a malignant tumor. The model is shown in Figure 2. As in the model with two mutations, let $X_0(t)$, $X_i(t)$ ($i = 1, 2, \dots, k-1$) and $X_k(t)$ denote the numbers of healthy cells, premalignant cells with $i$ gene mutations, and malignant cells at time $t$, respectively. For $e \le t$, the following probability generating functions are considered, and Then, $h(t)$ yields For a constant number of healthy cells, $X_0(t) = N$, and time-constant model parameters, $h(t)$ is also given by, The detailed derivation can be found in references [30,38]. Evidence suggested that fewer than 15 driver gene mutations might be responsible for driving and maintaining the initiation and progression of the tumor [5,34]. Comprehensive sequencing efforts revealed that two to eight driver mutations in healthy cells are required for a tumor [11]. Thus, the model with more than two mutations was used to estimate the maximum number of gene mutations in lung cancer. When the lung cancer incidence rates from the US Surveillance, Epidemiology, and End Results (SEER) registry were used as a test case, the fitting results showed that two to eight driver gene mutations were required to drive the incidence of lung cancer [38]. Moreover, the results suggested that three driver gene mutations are the most probable number causing lung cancer [35,44]. Hence, the model with three gene mutations is usually used to analyze the risk of lung cancer due to external factors [21,28].
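For a model with k mutations, a simple way to explore the age dependence of the hazard is the rare-disease approximation h(t) ≈ ν_{k−1}·E[X_{k−1}(t)], with the mean cell numbers obtained from linear differential equations. The sketch below is an illustration under stated assumptions, not the exact solution from [30,38]: it assumes constant parameters, a constant number of healthy cells, and that a mutation leaves the parent cell in place. The function name and arguments are hypothetical.

```python
import math

def multistage_hazard(t_end, n_cells, mutation_rates, net_growth, steps=5000):
    """Approximate hazard nu_{k-1} * E[X_{k-1}(t_end)] of a k-mutation model.

    mutation_rates: [nu_0, ..., nu_{k-1}]
    net_growth:     [gamma_1, ..., gamma_{k-1}], gamma_i = b_i - d_i
    """
    k = len(mutation_rates)
    if len(net_growth) != k - 1:
        raise ValueError("need one net growth rate per premalignant stage")

    def derivative(means):
        # means[i] holds E[X_{i+1}]; its inflow is nu_i * E[X_i] (nu_0 * N for i = 0).
        out = []
        for i in range(k - 1):
            inflow = mutation_rates[0] * n_cells if i == 0 else mutation_rates[i] * means[i - 1]
            out.append(inflow + net_growth[i] * means[i])
        return out

    def shifted(means, slope, factor):
        return [m + factor * s for m, s in zip(means, slope)]

    means = [0.0] * (k - 1)
    dt = t_end / steps
    for _ in range(steps):  # classical fourth-order Runge-Kutta
        k1 = derivative(means)
        k2 = derivative(shifted(means, k1, 0.5 * dt))
        k3 = derivative(shifted(means, k2, 0.5 * dt))
        k4 = derivative(shifted(means, k3, dt))
        means = [m + dt * (a + 2.0 * b + 2.0 * c + e) / 6.0
                 for m, a, b, c, e in zip(means, k1, k2, k3, k4)]
    return mutation_rates[-1] * means[-1]

# Illustrative values only: a three-mutation model without clonal expansion.
h_three = multistage_hazard(50.0, 1e7, [1e-6, 1e-5, 1e-4], [0.0, 0.0])
```

Without clonal expansion the three-mutation hazard is ν₀ν₁ν₂N·t²/2, the power-law form of the original multistage model; non-zero net growth rates bend this curve upwards.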

The model with genomic instability
Genomic instability plays a vital role in the development of almost all human cancers and drives the occurrence and progression of tumors [3]. Little et al. described a multistage model of carcinogenesis with genomic instability, based on the model presented by Nowak et al. [24][25][26][27]. The model with genomic instability is displayed in Figure 3; it contains a cancer stage (horizontal direction) and a genomic instability stage (vertical direction). In the model, the cancer stage involves mutations of oncogenes or tumor suppressor genes, and genomic instability is caused by mutations in stability genes.

Figure 3. The model with genomic instability for carcinogenesis. $\mu_i(t)$ and $\mu_{i,GI}(t)$ denote the mutation rates of oncogenes or tumor suppressor genes in cells with genetic stability and in cells with genetic instability at time $t$, respectively. $\upsilon_i(t)$ and $\upsilon_{i,GI}(t)$ denote the mutation rates of stability genes in cells with genomic stability and in cells with genetic instability at time $t$, respectively. $b_i(t)$ ($b_{i,GI}(t)$) and $d_i(t)$ ($d_{i,GI}(t)$) denote the birth rate and death rate of cells without genetic instability (cells with genetic instability) at time $t$, respectively.
This model with genomic instability has often been used to analyze the mechanism of chromosome instability or microsatellite instability theoretically [24][45][46][47]. Little et al. gave the detailed derivation of the model with genomic instability and fitted it to the incidence rate of colon cancer in whites from the SEER registry for the years 1973-2002 [25]. They found that the model with five mutations of tumor suppressor genes or oncogenes and two instability mutations was better than other models for colon cancer. For lung cancer, the main type of genomic instability is chromosomal instability, which is characterized by abnormalities in the number of chromosomes [48][49][50]. Thus, we focus on the model with one genomic instability mutation and two mutations in oncogenes or tumor suppressor genes. Zöllner et al. used this model to study lung cancer carcinogenesis in the Mayak workers [28]. However, they did not analyze the mechanism of genomic instability and only compared the model with genomic instability to the two-stage and three-stage models without genomic instability.
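The compartment structure of the model with genomic instability can be written down as a small transition list. This is a sketch of one reading of the model description above: compartments are pairs (i, g) with i cancer-stage mutations and g instability mutations, horizontal moves use the rates μ, vertical moves use the rates υ, and cells that have completed all cancer-stage mutations are treated as malignant and make no further moves. The function name and the string labels are hypothetical.

```python
from itertools import product

def build_transitions(n_cancer_stages=2, n_gi_levels=1):
    """List (source, target, rate_label) for every allowed mutation step."""
    transitions = []
    for i, g in product(range(n_cancer_stages + 1), range(n_gi_levels + 1)):
        if i == n_cancer_stages:
            continue  # malignant compartment: no further mutation steps modeled
        # Horizontal step: one more oncogene or tumor suppressor gene mutation.
        mu_label = f"mu_{i}" if g == 0 else f"mu_{i},GI"
        transitions.append(((i, g), (i + 1, g), mu_label))
        # Vertical step: one more stability gene mutation.
        if g < n_gi_levels:
            upsilon_label = f"upsilon_{i}" if g == 0 else f"upsilon_{i},GI"
            transitions.append(((i, g), (i, g + 1), upsilon_label))
    return transitions

# One instability mutation and two cancer-stage mutations, as considered for lung cancer.
lung_transitions = build_transitions(n_cancer_stages=2, n_gi_levels=1)
for source, target, label in lung_transitions:
    print(source, "->", target, "at rate", label)
```

With one instability level the rates υ_{i,GI} do not appear; they only enter when more than one destabilizing mutation is allowed.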
3. The impact of external environmental exposure on lung cancer risk

The effect of single factor based on epidemiological data
Lung cancer has one of the highest incidence and mortality rates among malignant tumors worldwide [51][52][53]. Studies indicate that more than 90% of lung cancer is closely related to external factors such as cigarette smoking, radiation, air pollution, dietary habits and so on [44]. Among them, cigarette smoking and radiation are two major factors that increase the risk of lung cancer. Moolgavkar et al. [19] studied the influence of smoking on lung cancer mortality by analyzing the intensity and duration of smoking in the US over the years 1975-2000. Their results suggested that changes in smoking behavior had a great influence on lung cancer mortality.
Lung cancer incidence and mortality in Nagasaki and Hiroshima, Japan, depend strongly on radiation exposure, and these data are often used to evaluate lung cancer risk in environments with high levels of radiation [30,38,54]. Here, we mainly discuss the impact of a single environmental factor on lung cancer risk by fitting the models described above to epidemiological data. We use the lung cancer incidence data from Nagasaki and Hiroshima and from the Osaka Cancer Registry in Japan as test cases. The data can be downloaded from http://www.rerf.jp and http://www.iph.pref.osaka.jp/omc/ocr/, respectively; the patients in the Osaka Cancer Registry were not exposed to the radiation. The models described above are fitted to the data from Nagasaki and Hiroshima for the years 1958-1987 and to the data from the Osaka Cancer Registry for the years 1974-1997. Maximum likelihood is employed to estimate the optimal parameter values of the models [30]. The lag time from the first malignant cell to a clinically detectable tumor is assumed to be 5 years [21,28]. A Wilcoxon rank-sum test comparing the observed and simulated data gives a smallest two-tailed P value greater than 0.2, so there is no significant difference between them and the models fit the lung cancer incidence data well.
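The fitting step can be sketched as a maximum likelihood estimate with a fixed 5-year lag. The sketch assumes a Poisson likelihood for the number of cases in each age group and uses the rare-disease hazard as a stand-in for the full model; both are assumptions for illustration, since the likelihood actually used is given in [30]. The ages, person-years and "observed" counts are synthetic values generated from the model itself, and a coarse grid search replaces a proper optimizer.

```python
import math

LAG_YEARS = 5.0  # lag from the first malignant cell to detection, as assumed in the text

def approx_hazard(age, scale, gamma):
    """Rare-disease hazard scale * (exp(gamma * t) - 1) / gamma at t = age - lag,
    where scale stands for nu0 * nu1 * N and gamma for the net growth rate."""
    t = max(age - LAG_YEARS, 0.0)
    return scale * math.expm1(gamma * t) / gamma

def neg_log_likelihood(params, ages, cases, person_years):
    """Poisson negative log-likelihood, dropping the parameter-free log(n!) term."""
    scale, gamma = params
    total = 0.0
    for age, n, py in zip(ages, cases, person_years):
        mean = approx_hazard(age, scale, gamma) * py
        total += mean - n * math.log(mean)
    return total

# Synthetic example: expected counts generated from known parameters.
TRUE_PARAMS = (2e-7, 0.10)
ages = [45.0, 55.0, 65.0, 75.0]
person_years = [1e5, 1e5, 1e5, 1e5]
cases = [approx_hazard(a, *TRUE_PARAMS) * py for a, py in zip(ages, person_years)]

# Coarse grid search over hypothetical candidate values.
grid = [(s, g) for s in (1e-7, 2e-7, 4e-7) for g in (0.08, 0.10, 0.12)]
best = min(grid, key=lambda prm: neg_log_likelihood(prm, ages, cases, person_years))
```

In practice the grid search would be replaced by a numerical optimizer, and the goodness of fit could then be checked with a rank-sum test between observed and fitted rates.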
For the model with two gene mutations, the following relations follow from the expressions of $p$ and $q$: $p + q = -(b - d - \nu_1)$ and $pq = -b\nu_1$. The optimal parameter values of the model are given in Table 1 for male and female patients. This reveals that radiation may induce cell division ($b$) and cause or increase cell death.

Table 1. The optimal values of parameters in the model with two gene mutations for the data from Nagasaki and Hiroshima (the data with radiation) and the Osaka Cancer Registry (the data without radiation). Columns: Parameters; The data with radiation; The data without radiation.

The model with three driver gene mutations fits the data better than the model with two driver gene mutations. However, not all parameters of this model can be identified from the data alone. To address this issue, the most commonly used method is to fix the values of some parameters on the basis of biological results. The results of the model with three driver gene mutations for the effect of radiation on lung cancer risk can be found in reference [30]; they suggest that radiation increases the risk of lung cancer mainly by inducing gene mutations.
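The identifiability problem can be made concrete for the two-mutation model with constant parameters. In the standard closed form, the hazard depends on the biological parameters only through three combinations: ν₀N/b and the two roots p and q of x² + (b − d − ν₁)x − bν₁ = 0. The sketch below, with hypothetical function names and illustrative values, builds two different biological parameter sets that share these combinations and therefore give the same hazard curve.

```python
import math

def to_identifiable(nu0, nu1, b, d, n_cells):
    """Map biological parameters to the combinations (nu0*N/b, p, q)."""
    gamma = b - d - nu1
    root = math.sqrt(gamma * gamma + 4.0 * b * nu1)
    return nu0 * n_cells / b, 0.5 * (-gamma - root), 0.5 * (-gamma + root)

def from_identifiable(x, p, q, b, n_cells):
    """Recover one biological parameter set for a chosen birth rate b,
    using p + q = -(b - d - nu1) and p * q = -b * nu1."""
    nu1 = -p * q / b
    d = b + (p + q) - nu1
    nu0 = x * b / n_cells
    return nu0, nu1, b, d, n_cells

def hazard(t, x, p, q):
    """Hazard written only in terms of the identifiable combinations."""
    decay = math.exp(-(q - p) * t)
    return x * p * q * (decay - 1.0) / (q - p * decay)

# Illustrative values only.
set_a = (1e-7, 1e-6, 1.0, 0.9, 1e7)
combos_a = to_identifiable(*set_a)
set_b = from_identifiable(*combos_a, b=2.0, n_cells=1e7)  # different b, d, nu0, nu1
combos_b = to_identifiable(*set_b)

h_a = [hazard(t, *combos_a) for t in (20.0, 40.0, 60.0, 80.0)]
h_b = [hazard(t, *combos_b) for t in (20.0, 40.0, 60.0, 80.0)]
```

Because `set_a` and `set_b` cannot be told apart from incidence data alone, one parameter (here the birth rate) has to be fixed from biological knowledge, or only the combinations are reported.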
In the model with genomic instability, we set the mutation rates to be the same within the cells with genomic instability and within the cells without genomic instability, respectively. In addition, $\gamma_{1,GI} = \gamma_1$, since genomic instability does not affect the net growth rate of cells [30]. The optimal parameter values of the model in Table 2 suggest that radiation mainly increases the mutation rates of genes, especially the mutation rate of stability genes, in lung cancer. Radiation does not, however, increase the rates of cell reproduction, possibly because it results in high death rates among cells.

Table 2. The optimal values of parameters in the model with two oncogenes or tumor suppressor genes and one genomic instability for the data from Nagasaki and Hiroshima (the data with radiation) and the Osaka Cancer Registry (the data without radiation). Columns: Parameters; The data with radiation; The data without radiation.

The functional relationship between radiation intensity and the model parameters is an open issue in the field of cancer. Zaballa and co-workers analyzed lung cancer mortality in the Wismut cohort and indicated that the clonal expansion of cells responds nonlinearly to the radon exposure rate [55]. Zöllner and co-workers applied models with two and three mutations to analyze the impact of plutonium exposure on lung cancer risk, which indicated that the radiation effect showed a delayed response at an early stage and dropped significantly with age [28]. In addition, studies suggested that the relationship between the radiation dose and the net proliferation rate of cells is nonlinear [28][55][56][57][58].

The joint effect of various carcinogens exposure
Cancer risk is affected by various carcinogens. However, analyzing the joint effect of exposure to several carcinogens on cancer risk is hard to implement because full information on the various factors is lacking. Therefore, studies of cancer risk due to multiple carcinogens are still relatively scarce. For lung cancer, radiation exposure and cigarette smoking are two major causes of increased incidence and mortality. Research has shown that more than 33% of lung cancer cases were closely associated with smoking, while almost 7% were related to radiation exposure [59]. The joint influence of smoking and radiation exposure on lung cancer risk is notable for smokers who smoke less than a pack of cigarettes per day, whereas for heavy smokers, who smoke a pack or more per day, it appears to be additive or even sub-additive.
The two-mutation model is often used to study cancer risk due to environmental factors, since its parameters can be identified by fitting epidemiological data. For example, the net proliferation rate and the mutation rate per cell division can be studied to analyze the effect of radiation and smoking intensity on lung cancer risk. The relationship between these model parameters and the radiation and smoking intensity can be described by a function that depends on the radiation dose, the smoking index and the time of exposure. Many studies suggested that the relationship between the radiation dose rate and the net proliferation rate of cells is nonlinear, with the net growth rate of cells increasing markedly once the dose rate exceeds a critical value [28][55][56][57][58]. Recently, Castelletti et al. [36] proposed a mechanistic model with molecular pathways, based on the two-mutation model, to study the risk of lung adenocarcinoma due to smoking and radiation exposure. Using molecular data from Caucasian and Asian patients [60,61], they found that radiation mainly acts on the pathway involving transmembrane receptor mutants, while smoking affects the pathway involving transducer mutants. In addition, they found no interaction between the mechanisms of smoking and radiation in lung adenocarcinoma, and the relationship between smoking intensity and the net growth rate of cells is an exponential function.
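The exposure-response relationships described above can be sketched as simple functions that modify the net growth rate. The exponential dependence on smoking intensity follows the description in the text; the threshold form for the radiation dose rate is only one hypothetical way to encode a marked increase above a critical value. All function names, coefficients and units are illustrative assumptions and are not taken from the cited studies.

```python
import math

def net_growth_with_smoking(gamma0, cigarettes_per_day, coeff):
    """Net growth rate that depends exponentially on smoking intensity."""
    return gamma0 * math.exp(coeff * cigarettes_per_day)

def net_growth_with_dose_rate(gamma0, dose_rate, threshold, slope):
    """Net growth rate that rises only above a critical dose rate
    (hypothetical piecewise-linear form)."""
    return gamma0 * (1.0 + slope * max(dose_rate - threshold, 0.0))

def smoking_intensity_at(age, start_age, stop_age, cigarettes_per_day):
    """Piecewise-constant smoking history: intensity between start and stop age."""
    return cigarettes_per_day if start_age <= age < stop_age else 0.0

# Illustrative values only: baseline net growth 0.1 per year, one pack per day
# between ages 20 and 50.
GAMMA0 = 0.1
gamma_by_age = {
    age: net_growth_with_smoking(GAMMA0, smoking_intensity_at(age, 20, 50, 20), 0.02)
    for age in (10, 30, 60)
}
```

Because the history is piecewise constant, the resulting net growth rate is constant on each age interval, which matches the subinterval treatment of time-varying parameters in the two-mutation model.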

Conclusions and perspectives
In this paper, we mainly describe the applications of mechanistic models based on biological knowledge to the study of lung cancer. The model with two hits is the most basic model; it views complex carcinogenesis as two rate-limiting genomic events. There is plenty of work on this model, including its solutions, properties and applications. With the development of next-generation DNA sequencing techniques, more and more biological information about cancer has become known, and more complex and specific models are required for studying cancer. Therefore, extended models based on the two-hit model have been developed. These models build a bridge between mathematics and biology.
The mechanistic models based on biological discoveries can explain well the impact of external environmental exposure on lung cancer risk. Although many models involve additional pathways such as genomic instability, these mathematical models may not be broad enough to describe the complexity of carcinogenesis. In addition, not all biological parameters of the model can be obtained by fitting the data alone, which makes parameter estimation a challenge. The common approaches to this issue are to assume suitable values for some parameters from known information, or to estimate combinations of parameters instead of single parameters [62]. The fitting results so far have been obtained under simplifying hypotheses such as time-constant parameters and neglecting the growth of malignant cells. Nevertheless, the information on malignant cells may be very valuable for the study of lung cancer risk due to external environmental exposure. Therefore, more detailed biological processes should be included in the cancer model. Besides, some papers provided other models to analyze the dynamics of cell populations and the key processes of tumorigenesis, such as multi-scale models, age-structured models, stochastic reaction-diffusion models and so on [63][64][65][66][67][68][69].
There is a lot of work on the mathematical modeling of cancer; however, work on cancer models with variable parameters is still lacking. In addition, the types of lung cancer and external factors other than smoking and radiation should be considered when studying lung cancer risk. With the development of DNA sequencing technology, information from TCGA analyses is commonly used to study the sequence of gene mutations in a given tumor [70][71][72][73][74][75]. Thus, specific models incorporating this information should be developed to deepen the understanding of the mechanisms of carcinogenesis. Future studies of lung cancer should incorporate information on gene regulatory pathways and several external factors.