Cancer types with high numbers of driver events are largely preventable

There is a long-standing debate on whether cancer is predominantly driven by extrinsic risk factors such as smoking, or by intrinsic processes such as errors in DNA replication. We have previously shown that the number of rate-limiting driver events per tumor can be estimated from the age distribution of cancer incidence using the gamma/Erlang probability distribution. Here, we show that this number strongly correlates with the proportion of cancer cases attributable to modifiable risk factors for all cancer types except the ones inducible by infection or ultraviolet radiation. The correlation was confirmed for three countries, three corresponding incidence databases and risk estimation studies, as well as for both sexes: USA, males (r = 0.80, P = 0.002), females (r = 0.81, P = 0.0003); England, males (r = 0.90, P < 0.0001), females (r = 0.67, P = 0.002); Australia, males (r = 0.90, P = 0.0004), females (r = 0.68, P = 0.01). Hence, this study suggests that the more driver events a cancer type requires, the more of its cases are due to preventable anthropogenic risk factors.


INTRODUCTION
It is not known whether intrinsic or extrinsic factors play the key causative role in cancer. Historically, extrinsic factors such as smoking, carcinogenic chemicals and fumes, and ionizing and ultraviolet radiation were the most clearly demonstrated risk factors for cancer (Doll, 1980; Fraumeni, 1982; Lyman, 1992). However, the role of intrinsic factors has recently been brought to attention by the work of Tomasetti and Vogelstein, who proposed that the majority of cancers develop due to replicative mutations occurring during stem cell division (Tomasetti & Vogelstein, 2015; Tomasetti, Li & Vogelstein, 2017). As this challenges the widely accepted dominant role of extrinsic risk factors, further quantitative studies of the relative contributions of extrinsic and intrinsic factors to carcinogenesis are required (Wu et al., 2018).
We have previously shown that the number of rate-limiting driver events per tumor can be estimated from the age distribution of cancer incidence using the gamma/Erlang probability distribution, both for adult (Belikov, 2017) and childhood (Belikov, Vyatkin & Leonov, 2021) cancers. Here, we study the correlation of this number with the percentage of cancer cases due to modifiable risk factors. This parameter is widely used in epidemiological studies and is also called the population attributable fraction (PAF) (Mansournia & Altman, 2018). It shows, for example, what percentage of lung cancer cases are caused by smoking tobacco. The combined PAF shows the overall contribution of all potentially modifiable risk factors, which usually include air pollution, occupational hazards, ionizing radiation, smoking, alcohol, poor diet, insufficient exercise, obesity, infection and ultraviolet radiation. By definition, PAF increases with both the prevalence of exposure to the risk factor and the relative risk of cancer associated with such exposure (Mansournia & Altman, 2018). The magnitude of the relative risk, in turn, characterizes the carcinogenic strength of the risk factor. Hence, we hypothesized that prevalent and strong risk factors should induce many more carcinogenic (driver) events in the general population than less prevalent and weak risk factors, or than internal processes alone.
Indeed, we show that the numbers of driver events per tumor predicted by the gamma/Erlang distribution strongly correlate with combined PAFs for most cancers, with the exception of cancers with a large contribution from infection or ultraviolet radiation.
This suggests that cancer types with higher numbers of driver events are more dependent on anthropogenic risk factors.

Population attributable fractions data
Population attributable fractions (PAFs) combining all risk factors were obtained directly from published open-access articles separately for each cancer type and sex.
PAFs for the USA were obtained from Table 2 in Islami et al. (2018). Briefly, Islami et al. applied a simulation method in which numbers from repeated draws were generated for all relative risks, exposure levels, and numbers of cancer cases and deaths, allowing for uncertainty in the data. The simulation process was replicated 1000 times for each sex and age-group stratum. The numbers from repeated draws were used to calculate the proportion and number of attributable cancer cases and deaths and their 95% confidence intervals. By using exposure prevalence (Pi) at the exposure category i and the corresponding relative risks (RRi), PAFs for categorical exposure variables for each stratum of sex and age group were calculated using the following formula:
$$\mathrm{PAF} = \frac{\sum_i P_i (RR_i - 1)}{\sum_i P_i (RR_i - 1) + 1}$$
Islami et al. used the above approximate formula for all associations, with a few exceptions. All cervical cancers were attributed to human papillomavirus infection and all Kaposi sarcomas to Kaposi sarcoma herpesvirus/human herpesvirus 8 infection. Because of the lack of data on anal human papillomavirus infection, 88% of anal cancers were attributed to human papillomavirus before applying the simulation method. PAFs for excess ultraviolet radiation-associated melanomas were estimated using the difference between observed melanoma incidence rates by sex and age group in the general population and the rates in blacks. To calculate the overall attributable proportion and number of cancer cases or deaths for a given cancer type when there were several risk factors, it was assumed that the risk factors had no interactions.
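The categorical PAF calculation and the no-interaction combination of several risk factors can be illustrated with a short sketch. It assumes the widely used Levin-type approximation PAF = ΣPi(RRi − 1) / [ΣPi(RRi − 1) + 1], and it assumes that "no interactions" is implemented as the common product rule 1 − Π(1 − PAFj); neither the function names nor the input numbers come from Islami et al., and the simulation over repeated draws is not reproduced.

```python
def levin_paf(prevalences, relative_risks):
    """PAF for one risk factor with several exposure categories.

    prevalences[i] is the exposure prevalence P_i (as a fraction) and
    relative_risks[i] the relative risk RR_i of exposure category i.
    """
    excess = sum(p * (rr - 1.0) for p, rr in zip(prevalences, relative_risks))
    return excess / (excess + 1.0)


def combined_paf(pafs):
    """Combine PAFs of several risk factors, assuming no interactions.

    Assumption: the product rule 1 - prod(1 - PAF_j) is used.
    """
    remaining = 1.0
    for paf in pafs:
        remaining *= 1.0 - paf
    return 1.0 - remaining


# Hypothetical inputs: two exposure categories for a single risk factor.
paf_single = levin_paf([0.2, 0.1], [2.0, 5.0])

# Hypothetical second risk factor with a PAF of 0.2.
paf_total = combined_paf([paf_single, 0.2])
```

With these illustrative inputs the single-factor PAF is 0.375 and the combined PAF is 0.5, which is lower than the plain sum of the two fractions (0.575) because the product rule does not count the same case twice.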
PAFs for England were obtained from Table 2 in Brown et al. (2018) (Whiteman et al., 2015b). This method does not permit estimation of the fractions of cancers arising through synergistic effects of causal factors (Whiteman et al., 2015b). For Epstein-Barr virus and human papillomavirus, where mechanistic knowledge strongly suggests that the presence of infection in a cancer is sufficient to infer that infection caused the cancer, the PAF was assumed to be equivalent to the prevalence of viral DNA in tumour cells (Antonsson et al., 2015). Kaposi sarcoma herpesvirus is recognised as a necessary cause of Kaposi sarcoma, and thus the PAF was assumed to be 100% (Antonsson et al., 2015). The number of lung cancer cases expected in Australian adults in the absence of smoking was calculated by applying the estimated incidence rates of lung cancer in never smokers in the CPS II study to the population of Australia (Pandeya et al., 2015). The number and percentage of lung cancer cases attributable to smoking was then calculated by subtracting the expected number of cases from those actually observed (Pandeya et al., 2015). For the primary melanoma analysis, the difference was estimated between the observed numbers of melanoma cases in Australian residents (i.e., 'exposed' to high ambient ultraviolet radiation in Australia) and the expected number of cases assuming the population was exposed to levels of ambient ultraviolet radiation experienced by an 'ancestral' population for many Australians, the UK population (Olsen et al., 2015).
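The subtraction approach described for smoking and lung cancer (observed cases minus cases expected at never-smoker rates) can be sketched as follows. This is only an illustration: the age groups, rates, population sizes and case count are hypothetical, and the function name is not taken from Pandeya et al.

```python
def attributable_by_subtraction(observed_cases, reference_rates, population):
    """Attributable cases and PAF from observed versus expected counts.

    reference_rates: incidence per 100,000 per year in the unexposed
    reference group (e.g., never smokers), keyed by age group.
    population: population size in each of the same age groups.
    """
    expected = sum(
        reference_rates[group] * population[group] / 100000.0
        for group in reference_rates
    )
    attributable = observed_cases - expected
    return attributable, attributable / observed_cases


# Hypothetical numbers for two age groups.
rates_never_smokers = {"50-59": 10.0, "60-69": 20.0}
population_by_age = {"50-59": 500000, "60-69": 300000}

attributable_cases, paf_smoking = attributable_by_subtraction(
    1100, rates_never_smokers, population_by_age
)
```

Here 110 cases would be expected at the reference rates, so 990 of the 1100 observed cases (90%) would be attributed to the exposure.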
No modification or processing of PAF data was performed.

Australia incidence data
Australia cancer incidence data were downloaded from the Cancer Incidence in Five Continents (CI5) Volume XI Age-specific curves Online Analysis tool (http://ci5.iarc.fr/CI5-XI/Pages/age-specific-curves_sel.aspx). CI5 is published approximately every five years by the International Agency for Research on Cancer (IARC) and the International Association of Cancer Registries (IACR) and provides comparable high quality statistics on the incidence of cancer from cancer registries around the world. Volume XI contains information from 343 cancer registries in 65 countries for cancers diagnosed from 2008 to 2012. Incidence is calculated as the number of new cancer cases reported each calendar year per 100,000 population in each 5-year age group. The data were downloaded separately for males and females for each cancer type listed in Table 2.

Estimation of the number of driver events per tumor
For analysis, the incidence data were imported into GraphPad Prism 9 (http://www.graphpad.com/). The following age groups were selected: "5-9 years", "10- [...] Prior age groups were excluded due to possible contamination by childhood subtype incidence, and "85+ years" was excluded due to an undefined age interval. If in the first several age groups ("5-9 years", "10-14 years", "15-19 years") incidence initially decreased with age, reflecting contamination by childhood subtype incidence, these values were removed until a steady increase in incidence was detected. The middle age of each age group was used for the x values, e.g., 17.5 for the "15-19 years" age group. Incidence (new cancer cases per calendar year per 100,000 population) for each age group and each cancer type was used for the y values. Data for different countries, as well as for males and females, were analyzed separately.
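The data preparation described above can be sketched in a few lines. This is a hedged illustration rather than the authors' procedure: the function names are hypothetical, the incidence values are invented, and "until a steady increase in incidence was detected" is approximated by dropping leading points while the next value is lower, limited to the first three age groups.

```python
def age_group_midpoint(label):
    """Middle age of a 5-year group, e.g. '15-19 years' -> 17.5."""
    low, high = label.replace(" years", "").split("-")
    return (int(low) + int(high) + 1) / 2.0


def trim_initial_decrease(ages, rates, max_groups=3):
    """Drop leading points while incidence is still decreasing with age.

    Assumption: only the first `max_groups` age groups may be dropped,
    and a point is dropped when the following value is lower.
    """
    start = 0
    while (
        start < max_groups
        and start + 1 < len(rates)
        and rates[start + 1] < rates[start]
    ):
        start += 1
    return ages[start:], rates[start:]


# Hypothetical incidence per 100,000 with an early childhood-type decline.
labels = ["5-9 years", "10-14 years", "15-19 years",
          "20-24 years", "25-29 years", "30-34 years"]
incidence = [2.0, 1.5, 1.0, 1.2, 2.0, 4.0]

x_all = [age_group_midpoint(label) for label in labels]
x_values, y_values = trim_initial_decrease(x_all, incidence)
```

In this example the first two age groups are removed, so the fitted series starts at the "15-19 years" midpoint of 17.5.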
Data were analyzed with Nonlinear regression using the following User-defined equation for the gamma distribution:
$$Y = A \cdot \frac{x^{k-1} e^{-x/b}}{b^{k}\,\Gamma(k)}$$
The amplitude parameter A was constrained to "Must be between zero and 100000.0" and scale and shape parameters b and k to "Must be greater than 0.0". "Initial values, to be fit" for all parameters were set to 50. All other settings were kept at default values, e.g., Least squares fit and No weighting.
The numerical value of the shape parameter k rounded to the nearest integer was interpreted as the number of driver events per tumor (Belikov, 2017).
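The fitting step can be sketched outside Prism as follows. This is a minimal sketch under stated assumptions, not the procedure Prism runs: the model is taken to be the amplitude-scaled gamma density Y = A x^(k−1) e^(−x/b) / (b^k Γ(k)), and a coarse grid search over k and b, with A solved in closed form by least squares, stands in for Prism's iterative nonlinear regression. The function names, grid ranges and synthetic data are hypothetical.

```python
import math


def gamma_model(x, A, b, k):
    """Amplitude-scaled gamma density with scale b and shape k."""
    log_pdf = (k - 1.0) * math.log(x) - x / b - k * math.log(b) - math.lgamma(k)
    return A * math.exp(log_pdf)


def fit_gamma_grid(ages, rates, k_grid, b_grid):
    """Unweighted least-squares fit by grid search over (k, b).

    For each (k, b) the best amplitude A has a closed form because the
    model is linear in A. Returns (sse, A, b, k) for the best grid point.
    """
    best = None
    for k in k_grid:
        for b in b_grid:
            shape = [gamma_model(x, 1.0, b, k) for x in ages]
            denom = sum(s * s for s in shape)
            if denom == 0.0:
                continue
            A = sum(s * y for s, y in zip(shape, rates)) / denom
            if not 0.0 <= A <= 100000.0:  # same bound on A as in the text
                continue
            sse = sum((A * s - y) ** 2 for s, y in zip(shape, rates))
            if best is None or sse < best[0]:
                best = (sse, A, b, k)
    return best


# Synthetic, noise-free incidence generated with A=5000, b=10, k=6.
ages = [17.5 + 5.0 * i for i in range(14)]  # midpoints 17.5 ... 82.5
rates = [gamma_model(x, 5000.0, 10.0, 6.0) for x in ages]

k_grid = [1.0 + 0.5 * i for i in range(29)]  # 1.0 ... 15.0
b_grid = [2.0 + 1.0 * i for i in range(29)]  # 2.0 ... 30.0

sse, A_fit, b_fit, k_fit = fit_gamma_grid(ages, rates, k_grid, b_grid)
driver_events = round(k_fit)  # shape parameter rounded to nearest integer
```

On real registry data an iterative optimizer would normally replace the grid, since the true parameters will not sit exactly on grid points; the grid is used here only to keep the sketch dependency-free and deterministic.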

Correlation of the predicted numbers of driver events per tumor with PAFs
Obtained k values were correlated to population attributable fractions (PAFs) in GraphPad Prism 6 using the inbuilt Correlation tool at default settings, e.g., Pearson correlation with two-tailed P value. Cancer types were sorted into two classes, and correlation was performed separately for each class. Cancer types in which infection (Helicobacter pylori, hepatitis B virus, hepatitis C virus, Kaposi sarcoma herpesvirus/human herpesvirus 8, human immunodeficiency virus and human papillomavirus) or ultraviolet radiation contributed to more than 30% of cases, for a given country according to the published PAF data (Whiteman et al., 2015a; Brown et al., 2018; Islami et al., 2018), were assigned to Class 2 (non-anthropogenic). The rest were assigned to Class 1 (anthropogenic), which included cancers with substantial contribution from air pollution, occupational exposure, exposure to ionizing radiation, smoking and exposure to secondhand smoke, alcohol intake, poor diet (red and processed meat, insufficient fiber, vegetables, fruit and calcium), excess body weight, insufficient physical activity, insufficient breastfeeding, postmenopausal hormone therapy and oral contraceptives, according to the published PAF data (Whiteman et al., 2015a; Brown et al., 2018; Islami et al., 2018).
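The class assignment and the per-class Pearson correlation can be illustrated with a small sketch. The records below are invented placeholders, not values from this study, and the function names are hypothetical. The two-tailed P value reported by Prism is not computed here; it would follow from a Student t distribution with n − 2 degrees of freedom.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def assign_class(infection_or_uv_share, threshold=30.0):
    """Class 2 if infection or UV accounts for more than 30% of cases."""
    return 2 if infection_or_uv_share > threshold else 1


# Hypothetical records: (cancer type, k, combined PAF %, infection/UV share %).
records = [
    ("type A", 3, 10.0, 0.0),
    ("type B", 5, 25.0, 0.0),
    ("type C", 8, 60.0, 5.0),
    ("type D", 11, 80.0, 0.0),
    ("type E", 2, 95.0, 90.0),
    ("type F", 3, 70.0, 60.0),
]

by_class = {1: [], 2: []}
for name, k, paf, share in records:
    by_class[assign_class(share)].append((k, paf))

# Correlate k with PAF separately for Class 1 (anthropogenic).
r_class1 = pearson_r(
    [k for k, _ in by_class[1]],
    [paf for _, paf in by_class[1]],
)
```

Splitting before correlating matters because Class 2 records combine few predicted driver events with high PAFs, so pooling them with Class 1 would weaken the linear trend.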

RESULTS
To estimate the numbers of driver events per tumor, the gamma distribution was fitted to the actual age distributions of incidence separately for males and females in three countries: USA, England and Australia (Fig. 1 and Table 1). The fits were generally excellent (R² = 0.99 for 22 cancer types), except for brain cancer (R² = 0.98), thyroid cancer (R² = 0.97), and several virus-induced cancers: pharyngeal (R² = 0.98), nasopharyngeal (R² = 0.93), vulvar (R² = 0.98), cervical (R² = 0.77), Kaposi sarcoma (R² = 0.67) and Hodgkin lymphoma (R² = 0.34). Due to the unsatisfactory fits, the last three cancer types were excluded from further analysis. Successful fitting of the remaining 27 cancer types allowed the estimation of the numbers of driver events per tumor using the shape parameter of the gamma distribution.
Plotting the correlation of the number of driver events per tumor predicted from the gamma distribution with the estimated percentage of cases due to modifiable risk factors obtained from the published studies revealed that cancers appear to cluster into two classes.

Class 1, which included the majority of cancers (18), demonstrated a linear correlation, whereas Class 2, representing the minority (9), clustered in the upper left corner of the plot in a cloud-like fashion. Inspection of Class 2 revealed that it consists entirely of cancers with a substantial (>30%) contribution of infection to their pathogenesis, plus melanoma. Class 2 was therefore named "non-anthropogenic", as infections and ultraviolet radiation existed long before the advent of human civilization. Interestingly, all cancers in Class 1 were induced by factors that arose with human civilization, such as air pollution, occupational hazards, ionizing radiation, smoking, alcohol, poor diet, insufficient exercise, obesity, insufficient breastfeeding, postmenopausal hormone therapy and oral contraceptives. Therefore, Class 1 was termed "anthropogenic". The correlation of the predicted number of driver events per tumor with the estimated percentage of cases due to modifiable risk factors is shown for cancers in males in Fig. 2 and Table 2, and for cancers in females in Fig. 3 and Table 3. Anthropogenic cancers indeed exhibit a strong, significant positive correlation for all studied countries and for both sexes, whereas for non-anthropogenic cancers the correlations are not significant. Amongst anthropogenic cancers, the correlation is generally stronger for males than for females. Interestingly, the correlation is stronger for USA females (r = 0.81, P = 0.0003) than for English (r = 0.67, P = 0.002) and Australian (r = 0.68, P = 0.01) females, but weaker for USA males (r = 0.80, P = 0.002) than for English (r = 0.90, P < 0.0001) and Australian (r = 0.90, P = 0.0004) males. This observation holds true even when identical sets of cancer types are compared (Fig. 4).
These differences are likely explained by differing exposures to risk factors between countries and between sexes, as well as by variations in the screening, diagnostic and reporting protocols of different countries, and in the methodologies of the PAF estimation studies. The role of population genetics also cannot be ruled out.

DISCUSSION
One of the most interesting findings of this study is the clustering of all cancers into two classes, termed here anthropogenic and non-anthropogenic. A possible explanation for this dichotomy is that the human body has evolved some protective countermeasures against cancer risk factors that have been present for millions of years, whereas it appears unprepared for the novel risk factors brought by our civilization. For example, ultraviolet radiation has always been present on Earth, and although melanocytes cannot completely protect their DNA, and a lot of DNA damage occurs, it is likely that they developed a very slow division rate (Halaban et al., 1986) to avoid conversion of this damage into mutations for as long as possible. This may explain why only a few rate-limiting driver events are predicted for melanoma despite the extensive DNA damage that melanocytes receive: the rate-limiting step in this case is cell division and not DNA damage. Similarly, the human body has had plenty of time to adapt to viruses and develop protective mechanisms, such as RNAi-mediated destruction of double-stranded RNA (Maillard et al., 2019), the interferon-driven innate immune system (Schlee & Hartmann, 2016), and T- and B-cell responses of adaptive immunity (Mueller & Rouse, 2008). This may explain why the incidence rates of virus-induced cancers are low, and why fewer driver events are predicted than would be expected from the linear correlation. It is also clear that viruses induce cancer via different mechanisms than chemical carcinogens (Butel, 2000; Mesri, Feitelson & Munger, 2014), and thus the development of such cancers may not be described by the Poisson process underlying the Erlang distribution (Belikov, 2017; Belikov, Vyatkin & Leonov, 2021). Indeed, many of the virus-induced cancers have rather poor fits of the Erlang distribution to their age distributions of incidence (Table 1).

The strong positive correlation of the predicted number of driver events per tumor with the contribution from anthropogenic risk factors suggests that the more driver events a given cancer type requires, the less likely they are to occur by chance (e.g., due to replication errors), and the more their induction depends on anthropogenic carcinogens. This is in accord with the mainstream view that the environment and lifestyle are the major contributors to carcinogenesis, but conflicts with the recently proposed view that the majority of cancers develop due to replicative mutations occurring during stem cell division (Tomasetti & Vogelstein, 2015; Tomasetti, Li & Vogelstein, 2017). The latter view is based on predominantly mouse data handpicked from varied publications and processed through calculations with non-obvious assumptions, and thus has been widely criticized (Ashford et al., 2015; Gotay, Dummer & Spinelli, 2015; O'Callaghan, 2015; Rozhok, Wahl & De Gregori, 2015; Giovannucci, 2016; Wu et al., 2016; Wensink, Vaupel & Christensen, 2017).
It is also interesting to speculate why the observed correlations are stronger for males than for females. One likely explanation is that males are generally more exposed to chemical mutagens, e.g., through smoking and work in hazardous industries (Whiteman et al., 2015a; Brown et al., 2018; Islami et al., 2018), directly inducing mutations in the DNA,