Journal Pre-proof Tobacco smoking related mutational signatures in classifying smoking associated and non-smoking associated non-small cell lung cancer.

Introduction: Patient-reported smoking history is frequently used as a stratification factor in NSCLC-directed clinical research. However, this classification does not fully reflect the mutational processes in a tumour. Next-generation sequencing can identify mutational signatures associated with tobacco smoking, such as single-base signature 4 (SBS4) and indel-based signature 3 (ID3). This provides an opportunity to redefine the classification of smoking and non-smoking associated NSCLC based on individual genomic tumour characteristics and could contribute to reducing the lung cancer stigma. Methods: Whole genome sequencing data and clinical records were obtained from three prospective cohorts of metastatic NSCLC (N=316). Relative contributions and absolute counts of SBS4 and ID3 were combined with relative contributions of age-related signatures to divide the cohort into smoking (‘smoking high’) and non-smoking associated (‘smoking low’) clusters. Results: The smoking high (N=169) and smoking low (N=147) clusters differed significantly in tumour mutational burden, signature contribution and mutational landscape. This signature-based classification overlapped considerably with smoking history. Yet, 26% of patients with an active smoking history were included in the smoking low cluster, of whom 52% harboured an EGFR/ALK/RET/ROS1 alteration, and 4% of patients without smoking history were included in the smoking high cluster. These discordant samples had similar genomic contexts to the rest of their respective cluster. Conclusions: A substantial subset of metastatic NSCLC is classified differently into smoking and non-smoking associated tumours when based on smoking-related mutational signatures than when based on smoking history. This signature-based classification more accurately classifies patients based on genome-wide context and should therefore be considered as a stratification factor in clinical research.


Introduction
Lung cancer is the leading cause of global cancer-related mortality. 1 Approximately 85% of lung cancer is non-small cell lung cancer (NSCLC), which is a notoriously heterogeneous disease. 2 It has become clear that within this heterogeneity, specific subgroups of NSCLC may be defined, which potentially derive greater benefit from certain treatments. Some of these subgroups, for example, tumours with squamous cell carcinoma histology or KRAS transversion mutations such as KRAS G12C or G12V mutations, are more prevalent in patients who smoke or have previously smoked; while tumours harbouring an EGFR mutation or ALK translocation are more prevalent in patients who have never smoked. [3][4][5] Therefore, NSCLC is often divided into smoking associated and non-smoking associated tumours based on patient-reported smoking history. However, this division falls short since tumours with non-smoking associated carcinogenesis may also occur in patients who smoke.
Additionally, clinical smoking history can be subject to recall bias and does not account for possible passive smoke exposure.
Fortunately, more precise tools than clinical smoking history are available to select individual patients for specific treatments, such as targeted next-generation sequencing (NGS) and programmed death-ligand 1 (PD-L1) tumour proportion score. Clinical smoking history might still help guide molecular testing as some targets that are much more common in patients who have never smoked, such as gene fusions, might require additional testing to confirm. However, the current guidelines recommend testing all patients with adenocarcinoma for molecular drivers, regardless of clinical smoking history. 6 Therefore, in the era of personalized treatment and precision medicine, clinical smoking history has limited diagnostic or therapeutic consequences in daily clinical practice.
In contrast, in clinical research the classification of patients into 'smokers' and 'never smokers' based on clinical smoking history is still frequently used as a stratification factor and as a basis for subgroup analyses. This highlights a gap between clinical practice and clinical research that could come at the expense of the external validity of clinical trials. There is a need to bridge this gap by implementing a more precise classification method than patient-reported clinical smoking history.
Several techniques enabling this classification method are already in practice. NGS, including targeted panels, whole-exome sequencing (WES) and whole genome sequencing (WGS), allows for an in-depth analysis of the lung cancer genome. Several genome-based studies highlighted major differences in oncogenic events between lung cancer in patients who smoke and patients who have never smoked, including different types of single base substitutions (SBS), doublet base substitutions (DBS) and small insertions and deletions (indels), which can group together to derive distinct biologically-relevant mutational signatures. [7][8][9] For instance, the SBS mutational signature 4 (SBS4) is characterised by transcriptional strand bias for C>A mutations. This signature was found to be strongly associated with tobacco smoking and to correlate with the extent of tobacco smoke exposure. Similar to SBS4, the indel-based mutational signature 3 (ID3) is also associated with tobacco smoking. 10 Therefore, these signatures seem to provide an accurate way of classifying smoking and non-smoking associated tumours. However, the tobacco smoking mutational signatures have not yet found their way to randomized controlled trials.
In this study, we aim to provide a genomic classification of smoking- and non-smoking associated NSCLC based on the observed frequencies of the smoking-related signatures SBS4 and ID3. This could allow for a more accurate subgrouping of NSCLC for future clinical research. To this end, we leveraged high-quality WGS data obtained from three uniform prospective cohorts of metastatic NSCLC.

Patient cohort and study procedures
We selected patients with metastatic NSCLC who were included under the protocol of the Center for Personalized Cancer Treatment consortium (CPCT-02 Biopsy Protocol, ClinicalTrials.gov no. NCT01855477), the Whole genome sequencing Implementation in standard Diagnostics for Every cancer patient study (Samsom et al. 11 ) and the Drug Rediscovery Protocol study (ClinicalTrials.gov no. NCT02925234). All three trials were approved by the local institutional review board and were conducted in accordance with good clinical practice guidelines and the Declaration of Helsinki's ethical principles for medical research. All patients provided written informed consent before any study procedure. Core needle biopsies were taken following local institutional guidelines, aiming to take 2-4 biopsies with 18 G needles. A minimum tumour percentage of 20% and a minimum DNA yield of 50 ng were required. Matched whole blood samples were collected to discriminate somatic mutations from germline DNA background variations. The handling, processing and sequencing of the samples have previously been described in detail for these cohorts. [11][12][13] The WGS data was made available by the Hartwig Medical Foundation.
Here we present the in-depth analysis of patients with metastatic NSCLC who were included between July 2012 and October 2020 in 5 different hospitals in the Netherlands, and of whom clinical records were available. We collected demographic and clinical information including age, sex, disease stage at diagnosis, date of diagnosis of metastatic disease, smoking history, treatment(s) before and after study biopsy and pathological information from the local pathology reports including histology and PD-L1 expression. Smoking history was abstracted from the electronic patient charts. Patients who had previously smoked or were currently smoking were defined as having an active smoking history.
Patients with less than 1 pack year were considered to have never smoked.

Supervised clustering of samples based on smoking-related mutational signatures
The processing and analysis of the WGS data are described in detail in Supplementary Data 1. 7,12,[14][15][16][17][18][19][20][21][22] Mutational signature contribution was determined by the number of somatic mutations falling into the 96 single base substitution (SBS), 78 doublet base substitution (DBS) and 83 indel (ID) contexts, as described in the COSMIC catalog (https://cancer.sanger.ac.uk/signatures/). These contexts are defined by the substitution class and the sequence context immediately 3' and 5' to the mutated base. Each mutational signature is therefore characterised by the predominant substitution class(es), and also the predominant sequences in those classes. 23 The relative mutational signature contribution was determined relative to the total tumour mutational burden (TMB).
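As an illustration of how the 96 SBS contexts arise, the following minimal sketch enumerates them as six pyrimidine-referenced substitution classes combined with the four possible bases on either side of the mutated base (6 x 4 x 4 = 96). The function name and the bracket notation are illustrative choices and are not taken from the study pipeline.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written from the pyrimidine of the mutated base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs_contexts():
    """Return the 96 trinucleotide contexts in 5'[ref>alt]3' notation."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]


contexts = sbs_contexts()
print(len(contexts))  # 96
print(contexts[0])    # A[C>A]A
```

Counting the somatic mutations of a sample in each of these contexts yields the 96-element profile to which the SBS signatures are fitted.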
To classify the samples as smoking or non-smoking associated, we calculated the proportion of the relative contribution of single base mutational signature SBS4 (tobacco smoking) compared to the summed relative contributions of SBS1 (age), SBS4 and SBS5 (age). In addition, we calculated the proportion of the indel mutational signature ID3 (tobacco smoking) compared to the summed relative contributions of ID1 (mismatch repair deficiency (MMRd)/age), ID2 (MMRd/age) and ID3. Inspired by Lee et al. 7, we used these relative proportions of SBS4 and ID3 together with the absolute counts of SBS4 and ID3 to form two distinct clusters (k-means; k=2), which we termed smoking high and smoking low, reflecting high or low values of these proportions and absolute counts.
Prior to clustering, these values were centred and scaled appropriately.
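The clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the study pipeline: the dictionary keys, the toy values, the z-score scaling and the deterministic seeding of the two centres are all hypothetical choices made for the example, and a plain Lloyd's k-means is written out with the standard library so that the sketch is self-contained.

```python
from statistics import mean, pstdev


def smoking_features(sample):
    """Four features per sample: SBS4 and ID3 proportions plus absolute counts."""
    sbs_denom = sample["rel_SBS1"] + sample["rel_SBS4"] + sample["rel_SBS5"]
    id_denom = sample["rel_ID1"] + sample["rel_ID2"] + sample["rel_ID3"]
    return [
        sample["rel_SBS4"] / sbs_denom if sbs_denom else 0.0,
        sample["rel_ID3"] / id_denom if id_denom else 0.0,
        float(sample["n_SBS4"]),
        float(sample["n_ID3"]),
    ]


def scale(rows):
    """Centre each column to mean 0 and scale it to unit standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def kmeans2(rows, n_iter=100):
    """Lloyd's algorithm with k=2; label 1 is seeded from the highest SBS4 proportion."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][0])
    centres = [rows[order[0]][:], rows[order[-1]][:]]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min((0, 1), key=lambda k: sum((a - b) ** 2 for a, b in zip(r, centres[k])))
            for r in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centres[k] = [mean(c) for c in zip(*members)]
    return labels


# Hypothetical toy samples (not study data).
samples = [
    {"rel_SBS1": 0.05, "rel_SBS4": 0.60, "rel_SBS5": 0.15,
     "rel_ID1": 0.20, "rel_ID2": 0.10, "rel_ID3": 0.50, "n_SBS4": 30000, "n_ID3": 900},
    {"rel_SBS1": 0.04, "rel_SBS4": 0.55, "rel_SBS5": 0.20,
     "rel_ID1": 0.25, "rel_ID2": 0.10, "rel_ID3": 0.45, "n_SBS4": 25000, "n_ID3": 700},
    {"rel_SBS1": 0.20, "rel_SBS4": 0.00, "rel_SBS5": 0.40,
     "rel_ID1": 0.50, "rel_ID2": 0.30, "rel_ID3": 0.00, "n_SBS4": 0, "n_ID3": 0},
    {"rel_SBS1": 0.25, "rel_SBS4": 0.02, "rel_SBS5": 0.35,
     "rel_ID1": 0.45, "rel_ID2": 0.30, "rel_ID3": 0.05, "n_SBS4": 150, "n_ID3": 20},
]

labels = kmeans2(scale([smoking_features(s) for s in samples]))
clusters = [("smoking low", "smoking high")[lab] for lab in labels]
print(clusters)
```

Scaling before clustering matters here because the absolute counts span several orders of magnitude more than the proportions and would otherwise dominate the Euclidean distance.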

Statistics
Statistical analysis of the clinical characteristics was performed using IBM SPSS Statistics software, version 25. Continuous data was compared with Student's t-test or the Mann-Whitney U test as appropriate. Means are presented with standard deviations (SD), medians with interquartile ranges (IQR). Categorical data was compared with a chi-square test. Correlations were analysed with Spearman's rho. Genomic differences were tested in the statistical platform R (v4.1.1) using the Wilcoxon rank-sum test with multiple testing correction (Benjamini-Hochberg). Mutational enrichment (or depletion) of genes was tested using Fisher's exact test with multiple testing correction (Benjamini-Hochberg). In figures, P-values (or q-values) are denoted as * (P < 0.05), ** (P < 0.01) and *** (P < 0.001).
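For readers unfamiliar with the Benjamini-Hochberg procedure, the sketch below shows the step-up adjustment and the star notation used for visualization. It is an illustration only: the study performed these tests in R, the function names are hypothetical, and the P-values are made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values (q-values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stars(value):
    """Map a P-value or q-value to the asterisk notation used in the figures."""
    if value < 0.001:
        return "***"
    if value < 0.01:
        return "**"
    if value < 0.05:
        return "*"
    return ""


raw_p = [0.0002, 0.02, 0.03, 0.5]  # hypothetical P-values
q = benjamini_hochberg(raw_p)
print([stars(v) for v in q])  # ['***', '*', '*', '']
```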

Patient cohort
The WGS data of 316 biopsies of metastatic NSCLC were analysed (Supplementary Figure S1). The cohort consisted of a majority of females (57%), mean age at diagnosis was 62 (±10) years, and adenocarcinoma was the most prevalent histology (75%). With respect to recorded clinical smoking history, 11% were currently smoking, 58% had previously smoked, 28% had never smoked and 3% had an unknown smoking history. Patients with an active smoking history had a median of 25 pack years (IQR 13-39).
Based on the supervised signature-based clustering, we categorized 169 samples as smoking high, and 147 samples as smoking low (Figure 1a, Supplementary Figure S2). The signature-based clusters differed significantly with regard to sex, smoking history, prior treatment and histology (Table 1). The distribution of recorded smoking history within these signature-based clusters is depicted in Figure 1b. Of the patients who had never smoked, 4% were included in the smoking high cluster, and 26% of the patients with an active smoking history were included in the smoking low cluster. Patients with an active smoking history in the smoking high cluster had significantly more pack years than those with an active smoking history in the smoking low cluster (28 versus 14, P<0.001). Of the squamous cell carcinomas, 30% (N=10) were included in the smoking low cluster; of these, 5 patients had an active smoking history, 4 patients had never smoked and one patient had an unknown smoking history. PD-L1 expression did not differ between clusters (P=0.70), or between those with an active smoking history and those who had never smoked (P=0.16).

Overview and differences of the genomic landscape between signature-based clusters
An overview of genomic characteristics has been summarized in
Beyond the expected differences in SBS4 and ID3 abundance due to their usage as clustering features (Supplementary Table S1), we also observed differences in additional mutational signatures between the two clusters (Figure 3c). Within the smoking high cluster, we observed significantly higher contributions of SBS signatures SBS18, SBS29, and SBS5 compared to the smoking low cluster. SBS4, SBS18 and SBS29 are all characterised by transcriptional strand bias for C>A substitutions, which would explain why these signatures cluster together. SBS18 is suggested to be related to damage by reactive oxygen species (ROS), and SBS29 is associated with tobacco chewing.
SBS5 also has an unknown aetiology, although its mutational spectrum seems to be increased in cancers related to tobacco exposure and seems to be related to age. 10 is also associated with exposure to tobacco smoke, had the highest contribution in the smoking high cluster. 10 In the smoking low cluster, SBS40, SBS1, SBS13, and SBS2 had the highest SBS contributions. SBS1 has a clock-like mechanism related to age, and it is characterised by C>T mutations caused by spontaneous deamination of 5-methylcytosine. SBS2, mainly consisting of C>T mutations, and SBS13, mainly consisting of C>G mutations, are both linked to activity of the AID/APOBEC enzymes. 10 It has been suggested that the activation of APOBEC in cancer could be caused by previous viral infection or tissue inflammation, which suggests a role of inflammation in the development of non-smoking associated lung cancer. 24 The relative contribution of the two APOBEC signatures was significantly higher in the smoking low cluster than in the smoking high cluster (16% versus 6%, P<0.001). SBS40 is also correlated with age and has a large similarity to SBS5. 10 DBS6 had the highest DBS signature contribution and ID1 had the highest ID signature contribution in the smoking low cluster. DBS6 still has an unknown aetiology, but seems to be related to age. 10 ID1 is associated with age and with slippage of the template DNA strand during DNA replication. It is often found in cancers with DNA MMRd and MSI. 10 The differences in TMB and relative mutational signature contribution between the two clusters (Figure 3a,c) were also investigated in the different histological subtypes (Supplementary Table S2).
Most observed differences were consistent across the histological subtypes. However, the two APOBEC signatures (SBS2 and SBS13) did not differ significantly between the smoking high and smoking low clusters in the squamous cell subgroup (3% versus 5%, P=0.062 and 5% versus 13%, P=0.237, respectively), but did in the adenocarcinoma subgroup (P<0.001). The differences in smoking-related signatures were consistent across all histological subgroups.
Altered landscape of putative driver genes
Next, we investigated differences and/or enrichment within the somatic inventory of perturbed genes between the two signature-based clusters (Figure 3d, Supplementary Figure S4-5). Of the known driver oncogenes, EGFR mutations and ALK fusions were significantly more prevalent in the smoking low cluster than in the smoking high cluster (50% vs 9%, P<0.001; 13% vs 0%, P<0.001; respectively). KRAS mutations were significantly more prevalent in the smoking high cluster (28% vs 5%, P<0.001). Table 2 illustrates the differences in frequency of oncogene driver alterations in the non-squamous cell carcinoma samples between the signature-based clusters and the groups based on clinical smoking history.

Discordances
We further analysed the samples of patients who had never smoked but were included in the smoking high cluster (N=4) and found that all harboured high (>10) or medium-high (>5) TMB. In addition, these samples also reflected the mutational signatures found within the rest of the smoking high cluster and showed an absence of, or minor, APOBEC signature contribution (Table 3). With regard to oncogene driver alterations, we identified a KRAS G12C mutation, an EGFR L858R mutation with concomitant EGFR amplification, and a BRAF G469A-activating mutation. The fourth sample did not harbour a known driver oncogene.
The samples of patients with an active smoking history captured within the smoking low cluster (N=56), contained only minor contributions of the dominant signatures of the smoking high cluster (   Figure S6a). The distribution of smoking history within the clusters again revealed a considerable degree of discordance between smoking history and mutational signatures ( Figure S6c,d, Figure 1b for the WGS clusters). A few more patients with an active smoking history were classified as smoking low with the TSO500 than with WGS. This can be explained by the lower discriminatory power of a targeted panel compared to WGS, as several (non-coding) genomic regions which are affected by (smoking-related) mutational processes are not included in the target-regions of the TSO500. This is further illustrated by the fact that the number of informative samples improves when extending the TSO500 regions to allow for the capture of additional mutations ( Figure S6a (upper track) and S6b).
Extensions of the TSO500 regions also generally improved the concordance (F1) of both approaches between WGS and genomic subsets ( Figure S6a).
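The concordance measure (F1) mentioned above can be illustrated with a short sketch. How the F1 score was computed in this study is not specified in the text, so the choices here are assumptions: the WGS-based labels are treated as the reference, 'smoking high' is treated as the positive class, and the labels themselves are hypothetical.

```python
def f1_score(reference, predicted, positive="smoking high"):
    """F1 of `predicted` against `reference` for one positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six samples (not study data).
wgs_labels = ["smoking high"] * 3 + ["smoking low"] * 3
panel_labels = ["smoking high", "smoking high", "smoking low",
                "smoking low", "smoking low", "smoking high"]

print(round(f1_score(wgs_labels, panel_labels), 3))  # 0.667
```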

Discussion
In this study, we aimed to investigate a more accurate classification than clinical smoking history in NSCLC. We showed that clustering metastatic NSCLC tumours into smoking associated and non-smoking associated mutagenesis based on the SBS4 and ID3 mutational signatures derived from WGS data is a feasible classification method.
Our classification shows that there is large overlap in clinical smoking history and classification based on SBS4 and ID3 contributions. This also revealed a degree of discordance between these two reportedly had never smoked. 26 This confirms that the mutational processes that have occurred in the tumour are not fully reflected by patient-reported smoking history.
Our signature-based clustering resulted in two distinct clusters with different TMB, mutational signature contributions and distinct mutational landscapes. We found that tumours with a high TMB, high SBS4, SBS18 and SBS29 contribution and with KRAS mutations were predominantly classified as smoking high. These signatures and most KRAS mutations in NSCLC are characterised by transversion mutations, which would explain why they group together. Tumours with a low TMB, high APOBEC signature contribution and EGFR mutations or ALK fusions were predominantly classified as smoking low. Genome-based studies have reported similar findings when using clinical smoking history to classify tumours. 8,9,26 The tumours from patients who had never smoked but were clustered as smoking high had a similar genotype to the other tumours in the smoking high cluster, which suggests smoke exposure despite a negative smoking history. Tumours from patients with an active smoking history in the smoking low cluster had similar genotypes to the rest of this cluster, including a high percentage of oncogenic driver alterations such as EGFR mutations and ALK fusions. This suggests that our classification based on SBS4 and ID3 is more accurate than reported smoking history in grouping NSCLCs by similar genomic context. As TMB was strongly correlated to SBS4 contribution, this might suggest that classification based on TMB would yield similar results.
However, other causes of high TMB, such as MSI, might lead to more misclassifications. Interestingly, PD-L1 expression did not differ between the two clusters. Several studies have suggested upregulation of PD-L1 expression in patients with an active smoking history, 27,28 which leads to the expectation of higher PD-L1 expression in the smoking high cluster. The fact that we found no difference between the smoking high and smoking low clusters supports studies that have found no association between PD-L1 expression and smoking status. 29,30
A signature-based classification based on genomic SBS4 and ID3 contribution instead of clinical smoking history could have several clinical implications. First, as we have shown that the frequency of targetable driver oncogenes is higher in those with a low smoking signature contribution than in those who have never smoked, a low smoking signature contribution suggests an increased likelihood of oncogene-driven NSCLC regardless of smoking status. Therefore, if a driver alteration has not been detected during routine diagnostics, a low smoking signature contribution could warrant further investigation to identify rarer oncogenic driver alterations. Further investigation should include comprehensive RNA analysis for the detection of gene fusions, including those with unknown fusion partners, and kinase domain duplications (KDD). Although rare, these oncogenic drivers can provide an important target for treatment. For example, several reports have demonstrated sensitivity of tumours with an EGFR-KDD to EGFR TKIs. 31,32 Second, replacing the terms 'smoker' and 'never smoker' that are coined by clinical smoking history could contribute to reducing the stigma and self-blame around lung cancer. It has been suggested that the stigma of lung cancer is still a significant barrier in reducing the lung cancer burden in global society. 33 Therefore, the potential impact of the label 'smoker' on patients' wellbeing should not be underestimated. Next, in randomized trials investigating immunotherapy, smoking history can be of special interest as a predictive biomarker due to the current assumption that smoking leads to an accumulation of mutations that in turn could generate a higher number of neoantigens. These neoantigens could potentially predict response to immunotherapy. However, as a considerable percentage of patients with an active smoking history did not actually harbour high (or any) smoking signature contribution, the reliability of smoking history as a stratification factor and predictive biomarker in these trials can be questioned. As SBS4 has been found to have a potential predictive value for response to immunotherapy, it could provide a more accurate stratification factor than smoking history. 34,35 Also, it is possible that the subgroup of patients with low SBS4 contribution would derive less benefit from immunotherapy than would be expected based on smoking history or PD-L1 expression, since PD-L1 expression did not differ between the smoking high and smoking low clusters. Additionally, a small group of EGFR-mutated samples were classified as smoking high, which might suggest that these patients are part of the limited subpopulation of EGFR-mutated NSCLC that does derive benefit from immunotherapy. 36 However, the predictive value of SBS4 and ID3 in oncogene-driven NSCLC is yet to be determined.
Implementing such a classification method in genome-based research should constitute little additional effort as this data is already available. Also, the WIDE study investigators have recently shown that WGS for patients with metastatic cancer is feasible in routine clinical practice. 13 We do, however, appreciate that the implementation in clinical trials would still be a hurdle to overcome.
Currently, many clinical trials already require archival or fresh tumour tissue to be sent in for genome testing during the screening period. Our TSO500 in silico analysis showed that the TSO500 panel allows for SBS mutational signature calling in the vast majority of cases, whereas the ID signatures are more challenging to retrieve. However, as the ID3 signature is associated with the SBS4 signature, the lack of the ID signatures should not substantially alter the conclusions. This provides an opportunity for clinical trials to incorporate mutational signature analysis in the genome testing procedure during screening without the need to perform WGS. Similarly, in daily practice, where WGS is currently not common, mutational signature analysis with targeted panels could still help identify those with a higher likelihood of harbouring a (rare) oncogenic driver. However, the optimal cut-off between a high and a low smoking signature contribution still warrants further investigation.
This study has certain limitations that should be considered. Most importantly, a gold standard in distinguishing smoking associated from non-smoking associated carcinogenesis is lacking. In the absence of a gold standard, accuracy analyses are unreliable and were therefore not performed.
Additionally, mutational signatures infer the dominant processes of mutagenesis within a tumour genome; however, they do not necessarily reflect the driving cause of carcinogenesis. For instance, cells of normal lung epithelium can also have SBS4 contribution without this leading to carcinogenesis. 37 Our samples of EGFR-driven NSCLC with high SBS4 contribution further illustrate this. Next, many of the patients in our cohort were included because the treating physician deemed WGS to have added clinical value in the patient's treatment course, which could have led to a selection bias. Most patients in our cohort had also received previous systemic therapy; however, these therapies do not induce the same mutations as tobacco exposure and thus have no influence on the smoking-related signatures. Lastly, we did not collect outcome data. Consequently, the prognostic or predictive value of our classification method has not been determined.
Despite these limitations, our study also has several strengths. To the best of our knowledge, it is the first to focus on the discordances between clinical smoking history and smoking associated mutational processes in the NSCLC genome. Additionally, previously published genomic cohorts are often small, only focus on patients without smoking history, or primarily included early-stage lung cancer. 8,9,38 Our comprehensive cohort allows for an in-depth analysis of metastatic tumours from patients with and without smoking history.
To conclude, our mutational signature-based classification of smoking associated and non-smoking associated NSCLC is more accurate in grouping tumours with similar genomic contexts together