Mutperiod: Analysis of periodic mutation rates in nucleosomes


Introduction
Nucleosomes significantly impact DNA damage and repair, leading to two distinct periodic effects on mutation rates [1,2]. First, DNA repair activity is inhibited within nucleosomes relative to linker DNA [1], often resulting in higher mutation rates at positions closer to the central dyad axis of nucleosomes. This mechanism creates a visible wave-like periodicity corresponding to the nucleosome repeat length (190 base pairs (bp) in human cells), resulting in what is termed translational periodicity [3]. Second, both DNA damage and repair can be modulated by the rotational setting of nucleosomal DNA. At positions where the minor groove of DNA faces the histone octamer, repair is generally inhibited, increasing mutation rates [2,4]. In contrast, at positions where the minor groove faces away from the histone octamer, damage formation associated with ultraviolet (UV) light is increased, leading to the opposite mutation pattern in skin cancers such as melanoma [1,2]. In either case, both mechanisms cause a periodic effect on mutation rates, called rotational periodicity, with a period corresponding to the 10.2 bp DNA helical repeat observed within nucleosomes [5].
Previous studies have characterized somatic mutation rates in nucleosomes for a variety of cancers and observed a wide range of both translational and rotational periodic effects [1,2]. In some cases, these studies have used these periodicities to elucidate the molecular mechanism responsible for the observed mutational patterns. For example, analyses of these periodicities have revealed that differences in UV damage formation are responsible for elevated mutation rates at outward rotational settings in skin cancers [1,2], while inhibition of base excision repair is likely responsible for elevated mutation rates at inward rotational settings in a variety of cancers [2,4]. Hence, analyzing nucleosome mutational periodicities can provide key insights into the etiology of somatic mutations in different cancers.
However, software tools for quantifying nucleosome periodicities are limited and have previously not been developed with accessibility to the broader scientific community in mind. To meet this need, we developed a freely available software package called mutperiod. Mutperiod allows users to quantify, compare, and visualize nucleosome periodicities across distinct mutation data sets and is implemented with a robust user interface for maximal accessibility. We expect that the availability and flexibility of the mutperiod toolset will allow more researchers to analyze nucleosome periodicities in mutation data sets such as sequenced tumor genomes and better understand the interplay between chromatin and mutagenesis. In order to showcase these capabilities, we use mutperiod to show that mismatch repair contributes to nucleosome periodicity of somatic mutations in esophageal adenocarcinoma.

Availability and Installation of mutperiod
Source code and instructions for installation and use are available through the git repository at https://github.com/bmorledge-hampton19/mutperiod. In brief, mutperiod can be installed on Linux systems using the apt install command and the personal package archive (PPA) at https://launchpad.net/~ben-morledge-hampton/+archive/ubuntu/mutperiod.

Input data required to run mutperiod
Mutperiod requires a genome fasta file, a nucleosome map, and mutation data as inputs in order to run. Specific information on the required formats for these inputs is present at the mutperiod git repository, linked above. The results in this paper were generated using mutation data from the International Cancer Genome Consortium [6], the hg19 genome assembly, and a nucleosome map of intergenic nucleosome positions called by Pich et al. from MNase-seq data [7]. These data are derived from lymphoblastoid cells, but have been used previously to map mutation data in a variety of cell types [2]. The code used to generate the nucleosome map is available at https://bitbucket.org/bbglab/nucleosome-periodicity/src/master/nucleosomes/. For the analysis of mutational periodicities in different chromatin domains, we used the entire set of called nucleosome positions (not just intergenic) derived from the MNase-seq data. The nucleosome positions were stratified using chromatin domains defined for either normal human lung fibroblasts (NHLF) or the GM12878 B-lymphocyte cell line [8].

Defining rotational and translational periodicities
For rotational periodicities, mutations are counted in a radius of 73 bp around the dyad center, corresponding to the 147 bp of DNA within the nucleosome [5]. For translational periodicities, the radius is set to 1000 bp to encompass multiple nucleosomes.

Quantification of periodicity
Mutperiod utilizes the Lomb-Scargle periodogram R package, lomb, to quantify the nucleosome mutational periodicities similarly to previous work [1,2]. In order to produce meaningful periodograms, mutperiod requires that any input data have at least 5000 mutations within mapped nucleosome dyads. Periods between 5 bp and 25 bp are tested for rotational periodicity data and periods between 50 bp and 250 bp are tested for translational periodicity data. An oversampling factor of 100 is used. Mutperiod calculates a signal-to-noise ratio (SNR) representing the strength of the observed periodicity. The SNR is computed by dividing the maximum power in the periodogram (the signal) by the median power of all other periods not within 0.5 units of the period with maximum power (the noise), as previously described [2].
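The SNR calculation described above can be illustrated with a minimal pure-Python sketch. It uses the classical normalized Lomb-Scargle formula on an evenly spaced grid of test periods. The function names and the grid are assumptions made for illustration only, and the frequency grid, oversampling, and normalization used by the lomb R package may differ.

```python
import math
from statistics import mean, median, variance

def lomb_scargle_power(positions, counts, period):
    """Classical normalized Lomb-Scargle power at one test period.

    Illustrative only; assumes the counts are not all identical.
    """
    omega = 2 * math.pi / period
    y_mean = mean(counts)
    resid = [y - y_mean for y in counts]
    # Time offset tau makes the sine and cosine terms orthogonal.
    tau = math.atan2(sum(math.sin(2 * omega * t) for t in positions),
                     sum(math.cos(2 * omega * t) for t in positions)) / (2 * omega)
    cos_terms = [math.cos(omega * (t - tau)) for t in positions]
    sin_terms = [math.sin(omega * (t - tau)) for t in positions]
    cos_part = sum(r * c for r, c in zip(resid, cos_terms)) ** 2 / sum(c * c for c in cos_terms)
    sin_part = sum(r * s for r, s in zip(resid, sin_terms)) ** 2 / sum(s * s for s in sin_terms)
    return (cos_part + sin_part) / (2 * variance(counts))

def periodicity_snr(positions, counts, periods):
    """Return (peak period, SNR): peak power over median power of the other periods."""
    powers = [lomb_scargle_power(positions, counts, p) for p in periods]
    peak_index = max(range(len(periods)), key=powers.__getitem__)
    peak_period = periods[peak_index]
    # Noise: all tested periods more than 0.5 units away from the peak period.
    noise = [pw for p, pw in zip(periods, powers) if abs(p - peak_period) > 0.5]
    return peak_period, powers[peak_index] / median(noise)

# Synthetic rotational example: a 10.2 bp wave across dyad offsets -73..73.
offsets = list(range(-73, 74))
synthetic_counts = [100 + 20 * math.cos(2 * math.pi * t / 10.2) for t in offsets]
test_periods = [5 + 0.1 * i for i in range(201)]  # hypothetical grid, 5 to 25 bp
peak_period, snr = periodicity_snr(offsets, synthetic_counts, test_periods)
```

Because the SNR is a ratio of powers, the choice of normalization constant in the periodogram does not affect it.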

Comparison of periodicities
Mutperiod facilitates the comparison of periodicities across different backgrounds or experimental conditions by first allowing the user to stratify mutation data cohorts at the beginning of the pipeline. Later in the pipeline, the separate cohorts are compared using the SNR values of the relevant periodicities. Either a permutation test (for aggregate data) or a Wilcoxon rank sum test (for individual tumor data) is used to determine the statistical significance of the comparison.
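As a rough illustration of the individual-tumor comparison, the following sketch implements a two-sided Wilcoxon rank sum test using the normal approximation with a tie correction and no continuity correction. This is an assumption-laden stand-in, not mutperiod's implementation: the function name is hypothetical, and an exact test, such as one provided by a statistics package, may give slightly different P values for small samples.

```python
import math
from statistics import NormalDist

def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank sum test (normal approximation, tie-corrected).

    Returns (U statistic for sample_a, approximate p-value).
    """
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at index i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p_value
```

In this setting, sample_a and sample_b would hold the per-tumor SNR values of the two cohorts being compared.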

Mutperiod analysis pipeline
Mutperiod processes mutation data through a hybrid Python and R pipeline. Users are expected to provide a genome fasta file, a nucleosome map, and mutation data as inputs. Mutperiod has native support for data from the ICGC portal or a slightly modified bed format that contains information on the nature of the mutations. These formats allow users to easily leverage available mutation data sets or adapt their own data for the pipeline. Mutation cohorts can be stratified out of the input data by various background conditions for comparison later in the pipeline. Mutperiod directly supports stratification of mutation data by microsatellite stability and mutation signature using previously developed R packages [9,10]. Further stratifications can be manually designated by users in the mutation data input file.
Mutperiod uses the given inputs to count the mutations in and around nucleosomes, producing a table of mutation counts for each nucleosome position relative to the dyad center. Mutations can be counted within the radius of a single nucleosome for rotational periodicity or within several adjacent nucleosomes for translational periodicity. The resulting counts can be normalized in mutperiod using either the surrounding sequence context or a given mutation background from another input data set. After mutations are counted and optionally normalized, the relevant periodicities are quantified using a Lomb-Scargle periodogram, for which a signal-to-noise ratio is calculated [2].
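The normalization step can be pictured with the sketch below, which divides the raw count at each dyad-relative position by a background (expected) count at the same position. The text above does not specify the exact normalization formula, so the simple ratio, the function name, and the dictionary layout are all assumptions made for illustration.

```python
def normalize_counts(raw_counts, background_counts):
    """Divide raw counts by background counts at each dyad offset.

    Both arguments map dyad-relative position -> count. Positions with no
    background are reported as None rather than dividing by zero.
    (Hypothetical helper; the ratio is an assumed form of normalization.)
    """
    normalized = {}
    for offset, raw in raw_counts.items():
        expected = background_counts.get(offset, 0)
        normalized[offset] = raw / expected if expected > 0 else None
    return normalized
```

For example, a raw count of 10 against a background of 5 at the dyad yields a normalized value of 2.0.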
In order to investigate the link between nucleosomes and DNA damage and repair, it is vital to be able to compare the relative strengths of mutational periodicities across different genetic backgrounds, tumor types, or experimental conditions. Mutperiod can compare the strength of the periodicities between different data sets or in stratified subsets of a single data set (e.g., microsatellite stable versus instable tumors; see below) and determine if they are significantly different, either by computing the SNR of user-generated random permutations of the aggregate mutation data or by using a Wilcoxon rank sum test for individual tumors. Furthermore, mutperiod can output plots of mutation counts with respect to the relevant periodicities to help users visualize them beyond the simple statistical output. These plots are similar to those presented in Fig. 1.
The entirety of the analysis pipeline supports an intuitive user interface through Python-based Tkinter dialogs. In addition, a command line interface that installs with mutperiod can be used to manually pass inputs to the pipeline or simply invoke the user interface. Besides offering increased user accessibility, the user interface helps facilitate larger volumes of data processing by allowing the user to select multiple files or one or more directories to recursively search for all relevant files. All the selected files are then run through the pipeline together.

Analysis of mutational periodicities in esophageal adenocarcinoma reveals a role for mismatch repair
We tested mutperiod's ability to quantify differences in nucleosome mutational periodicities by analyzing the potential role of DNA mismatch repair in promoting these periodicities in esophageal adenocarcinomas (ESAD) [11]. Somatic mutations in ESAD display clear translational and rotational periodicities in nucleosomes [2], but the molecular mechanism responsible for these periodicities is unclear. In esophageal tumors, the translational periodicity is characterized by greater mutation rates in nucleosomes relative to linker DNA, and the rotational periodicity is characterized by greater mutation rates at positions where the minor groove of nucleosomal DNA faces inward, towards the histone octamer [2]. These periodicities suggest that repair inhibition may be an important mechanism [1,2,4]. The principal mutation signature associated with esophageal cancers is signature 17, which is characterized by an abundance of T > G mutations, as well as a lower frequency of C > T, T > A, and T > C mutations in certain sequence contexts [12]. Because T > G mutations are often linked to misincorporation of oxidized guanine opposite adenine during replication [13], we hypothesized that differential mismatch repair (MMR) in nucleosomes may play a role in creating the observed mutational periodicities.
To test this hypothesis, we used mutperiod to compare the periodicities of microsatellite stable (MSS) and instable (MSI) esophageal tumors. Microsatellite stability is an indicator of MMR proficiency [14,15], as loss of MMR prevents the repair of replication errors in repetitive microsatellite regions, where they accumulate rapidly and lead to genomic instability. The data produced by mutperiod reveal markedly different periodicities across MSS and MSI tumors (Fig. 1). The SNR value for the aggregate MSS tumor data shows a nearly 15-fold increase compared to the MSI data for the translational periodicity (1573 vs. 107) and an almost 3-fold increase for the rotational periodicity (658 vs. 236).
The MSS data set contains approximately three times as many mutations mapped to nucleosomes as the MSI data set, which could affect the calculated SNR, since SNR tends to scale with sample size. To account for differences in sample size, we used mutperiod to analyze a subset of the aggregated MSS mutation data that matched the count of mutations (N = 2,921,950) in the MSI data set. For both translational and rotational periodicities, the subset of the MSS mutation data still had much higher SNR than the MSI data (Supplementary Fig. S1). We repeated this analysis for 100 random subsets of the MSS data. In all cases, the SNR of the MSS subset was much higher than the MSI data (Supplementary Fig. S2).
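The size-matched subsetting described here amounts to drawing a random sample without replacement from the larger cohort. A minimal sketch is shown below; the function name and the use of a seed for reproducibility are assumptions, not details taken from mutperiod.

```python
import random

def matched_subsample(mutations, target_count, seed=None):
    """Draw target_count mutations without replacement from a larger cohort.

    `mutations` is any list of mutation records; a fixed seed makes the
    draw reproducible. (Hypothetical helper for illustration.)
    """
    rng = random.Random(seed)
    return rng.sample(mutations, target_count)
```

Calling this helper with 100 different seeds would produce 100 random subsets, each of which could then be passed through the periodicity analysis and compared against the smaller cohort.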
To confirm that the differences in SNR between the MSS and MSI mutation data were statistically significant, we performed a permutation test. We randomly permuted the MSS and MSI labels for the aggregated mutation data, while maintaining the same number of mutations in the permuted MSS and MSI classes, and calculated the difference in SNR using mutperiod. It is interesting to note that the median difference in SNR in the permuted classes was greater than zero (Fig. 2), likely reflecting the higher SNR in the permuted class with more mutations (e.g., the permuted MSS class). Across 100 random permutations of the aggregate data, all produced a lower difference in SNR than observed between MSS and MSI tumors for both rotational and translational periodicities (P < 0.01 for both cases; Fig. 2).
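The label-permutation procedure can be sketched generically as follows. Here `statistic` is a placeholder for whatever function computes the SNR of a set of mutations, and the add-one P value convention is an assumption of this sketch rather than a documented detail of mutperiod.

```python
import random

def permutation_test(group_a, group_b, statistic, n_permutations=100, seed=0):
    """One-sided permutation test on statistic(group_a) - statistic(group_b).

    Labels are shuffled while keeping both group sizes fixed. Returns the
    observed difference and an add-one p-value. (Illustrative sketch.)
    """
    rng = random.Random(seed)
    observed = statistic(group_a) - statistic(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # The first n_a shuffled items play the role of the permuted group A.
        diff = statistic(pooled[:n_a]) - statistic(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value
```

Under the add-one convention used in this sketch, zero exceedances in 100 permutations gives P = 1/101, or about 0.0099.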
In addition to the analysis of the aggregate data, we analyzed the distributions of SNR values for individual MSS and MSI tumors. The data indicated that individual MSS tumors had a significantly higher translational (P = 0.010) and rotational (P = 0.030) periodicity than MSI tumors (Fig. 3), even though MSI tumors on average had a greater number of mutations. In summary, these results indicate that mismatch repair status is significantly associated with the strength of mutational periodicity in nucleosomes, both in the aggregate mutation data and in individual tumors.
When analyzing the period with the maximum power in each tumor, we found that while many tumors in the MSI data set have the expected median periods of ~190 bp and 10.2 bp in translational and rotational data, respectively, they also have a much greater variability in the peak periodicity values compared to the MSS tumors (Fig. 4). This is particularly apparent for translational periodicities, where the MSI tumors had a much higher standard deviation (SD = 56) and interquartile range (IQR = 74) in the values of the peak periodicity than the MSS tumors (SD = 24.83; IQR = 4.56). There was a similarly elevated variability of peak rotational periodicities for the MSI tumors (SD = 2.49) relative to the MSS tumors (SD = 0.68). This is consistent with previous results, indicating that MSI tumors may display a range of MMR-deficient phenotypes [16], which could account for the variability in periodicity values. The outliers among the MSI samples may represent tumors where MMR was disabled earlier in tumorigenesis or more severely. Taken together, these findings suggest that the nucleosomal mutational periodicities observed in ESAD tumors are associated with proficient mismatch repair.
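For readers who wish to compute comparable spread statistics on their own per-tumor peak periods, a minimal sketch follows. It assumes the sample standard deviation and an IQR based on linearly interpolated quartiles; which estimators were used for the values reported above is not stated, so these choices are assumptions.

```python
from statistics import quantiles, stdev

def spread_of_peak_periods(peak_periods):
    """Return (sample SD, IQR) for a list of per-tumor peak periods.

    Quartiles use the 'inclusive' interpolation method. (Assumed estimators.)
    """
    q1, _, q3 = quantiles(peak_periods, n=4, method="inclusive")
    return stdev(peak_periods), q3 - q1
```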

Nucleosomal mutation periodicities vary in different chromatin domains
The analysis described above used a nucleosome map encompassing all intergenic regions of the human genome. However, other analyses may benefit from a more targeted nucleosome map. To address this need, mutperiod is equipped with the ability to stratify nucleosome maps using a bed file of genomic loci. Possible targets for stratification include genomic regions with distinct patterns of histone post-translational modifications or chromatin states. As an example, we stratified the complete nucleosome map (i.e., intergenic and genic nucleosomes) into nucleosome subsets associated with either heterochromatin or transcribed domains, as defined in a previous study [8].
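Conceptually, stratifying a nucleosome map by a bed file means keeping only the dyads that fall inside the listed regions. The sketch below illustrates the idea for a single chromosome, assuming non-overlapping regions given as 0-based, half-open (start, end) pairs as in the bed format. The function name and data layout are hypothetical and do not reflect mutperiod's internal code.

```python
import bisect

def stratify_dyads(dyad_positions, regions):
    """Keep dyads that fall inside any [start, end) region.

    Single chromosome; `regions` is a list of non-overlapping (start, end)
    pairs in 0-based half-open coordinates. (Illustrative sketch.)
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    kept = []
    for dyad in dyad_positions:
        # Index of the last region whose start is <= dyad, if any.
        i = bisect.bisect_right(starts, dyad) - 1
        if i >= 0 and dyad < regions[i][1]:
            kept.append(dyad)
    return kept
```

Running the same filter once with heterochromatin regions and once with transcribed regions would yield the two nucleosome subsets to be compared.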
The differences between mutational periodicities in heterochromatin vs. transcribed regions were striking, with the SNR of mutational periodicities in transcribed regions being significantly lower than for nucleosomes in heterochromatin domains (Fig. 5). Since there were more nucleosomes associated with heterochromatin domains, we analyzed a subset of heterochromatin domain nucleosomes that matched the count for transcribed regions (Supplementary Fig. S3). The translational and rotational SNRs for the heterochromatin subset were still much higher than those of the transcribed domain nucleosomes (compare Fig. 5B,D with Supplementary Fig. S3). A similar stratification using a chromatin map from a different cell type (i.e., GM12878 B-lymphocyte cell line instead of normal human lung fibroblasts) showed similar results (Supplementary Fig. S4). In summary, mutperiod can be used to characterize mutation periodicities in distinct genomic or chromatin domains. In this case, our analysis suggests that in esophageal adenocarcinomas, nucleosome mutation periodicities are more prominent in heterochromatin, and less prominent in transcribed genes.

Discussion
While recent studies have highlighted the prevalence and importance of periodic mutation rates in nucleosomes, accessible software tools to analyze these mutational periodicities have been lacking. To meet this need, we have developed an accessible and freely available software package called mutperiod. We use mutperiod to show that DNA mismatch repair (MMR) contributes to periodic mutation rates in esophageal adenocarcinomas, highlighting the power of mutperiod in identifying new mechanisms that contribute to mutational periodicity in nucleosomes.
A key feature of mutperiod is its ease of use and accessibility. It is well known that easily accessible software tools are more likely to be used than inaccessible tools by members of the scientific community [17,18]. We have taken special care to design mutperiod with ease of access in mind, in order to encourage other researchers to leverage nucleosome periodicities in their work and to increase the efficiency of their research. Much of this accessibility comes from a simple installation protocol and an intuitive user interface. Moreover, mutperiod is lightweight, and the above analysis of esophageal tumor data was run in a single day without the use of a computing cluster or any particularly expensive hardware (omitting the permutation test, which was much more computationally expensive). For this analysis, much of the runtime comes from the external package, MSISeq [9], which is beyond the scope of this research to optimize. The next most computationally expensive portion of the pipeline comes from mapping individual mutations to the given nucleosome map. This process has been optimized by sorting both inputs and comparing across each in parallel, guaranteeing worst-case linear runtime. We are confident that mutperiod can be leveraged even by researchers with limited computational experience and resources.
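The sorted, side-by-side comparison mentioned above can be pictured as a two-pointer sweep over both inputs. The sketch below is an illustration under simplifying assumptions (a single chromosome, integer positions, both lists already sorted) and is not mutperiod's actual implementation. Its running time is linear in the size of the two inputs plus the number of mutation-dyad pairs reported.

```python
from collections import Counter

def count_offsets_sorted(mutation_positions, dyad_positions, radius):
    """Count mutations at each dyad-relative offset with a single sweep.

    Both position lists must be sorted in ascending order. A mutation is
    counted once for every dyad within `radius` bp of it.
    (Hypothetical helper for illustration.)
    """
    counts = Counter()
    first = 0  # index of the first dyad that can still reach the current mutation
    for pos in mutation_positions:
        # Permanently skip dyads that lie too far to the left.
        while first < len(dyad_positions) and dyad_positions[first] < pos - radius:
            first += 1
        # Record every dyad within reach without moving `first`.
        k = first
        while k < len(dyad_positions) and dyad_positions[k] <= pos + radius:
            counts[pos - dyad_positions[k]] += 1
            k += 1
    return counts
```

With a radius of 73, each mutation is matched against a single nucleosome window for the rotational analysis, while a radius of 1000 lets a mutation contribute to several neighboring dyads for the translational analysis.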
Somatic mutations in esophageal adenocarcinomas have among the strongest rotational and translational periodicities in nucleosomes of any cancer type [2], but the molecular mechanism(s) responsible for these mutational periodicities have been unclear. It has been previously suggested that differential base excision repair (BER) may be responsible for these mutational periodicities [2], since BER is regulated by both the translational and rotational positioning of the DNA base lesion in the nucleosome [4,19], and because many oxidative DNA lesions, which are associated with mutagenesis in esophageal tumors [20,21], are repaired by BER [22]. However, the principal mutation signature associated with esophageal cancers (signature 17) has been linked to misincorporation of 8-oxoguanine (8-oxoG) nucleotides opposite adenine during replication [13]. BER enzymes such as hOGG1 are unable to efficiently repair 8-oxoG lesions paired with adenine [23]; instead, these misincorporated 8-oxoG lesions are likely repaired by MMR [24,25]. Our results support the hypothesis that inhibition of MMR in nucleosomes contributes to mutational periodicities in esophageal tumors. This hypothesis is supported by a recent report that colorectal cancers with DNA polymerase epsilon mutator alleles, which generate DNA replication errors that are repaired by MMR, also show a rotational mutation periodicity in nucleosomes [26]. Our results are consistent with biochemical studies demonstrating that nucleosomes inhibit mismatch repair [27] and showcase the power of mutperiod in elucidating this relationship in human tumors.
Our analysis using mutperiod also demonstrates that the observed nucleosome mutation patterns are much more prominent in heterochromatin than in transcribed genes. This suggests that these mutation periodicities may not significantly impact the occurrence of potential driver mutations in esophageal adenocarcinoma, since these driver mutations typically occur in transcribed exons. A key question is what molecular mechanism causes this difference in mutational periodicity between heterochromatin and transcribed euchromatin. One possibility is that nucleosome positioning is more dynamic in transcribed domains, either within individual cells or between different cell types, resulting in weaker nucleosome periodicities. Alternatively, it is possible that transcription-coupled repair of oxidative damage [28] may diminish the impact of nucleosomes on mutation rates in transcribed domains. Moreover, transcribed DNA is known to be enriched for histone H3 K36 methylation, which can recruit mismatch repair proteins and influence mutagenesis [29,30]. It is possible that high levels of H3 K36 methylation promote efficient mismatch repair in transcribed nucleosomes, thereby diminishing the impact of nucleosomes on mutagenesis. While future studies will be needed to test these hypotheses, our findings highlight the utility of mutperiod in characterizing mutation patterns in nucleosomal DNA.