Evolution of the Highly Repetitive PEVK Region of Titin Across Mammals

The protein titin plays a key role in vertebrate muscle where it acts like a giant molecular spring. Despite its importance and conservation over vertebrate evolution, a lack of high quality annotations in non-model species makes comparative evolutionary studies of titin challenging. The PEVK region of titin—named for its high proportion of Pro-Glu-Val-Lys amino acids—is particularly difficult to annotate due to its abundance of alternatively spliced isoforms and short, highly repetitive exons. To understand PEVK evolution across mammals, we developed a bioinformatics tool, PEVK_Finder, to annotate PEVK exons from genomic sequences of titin and applied it to a diverse set of mammals. PEVK_Finder consistently outperforms standard annotation tools across a broad range of conditions and improves annotations of the PEVK region in non-model mammalian species. We find that the PEVK region can be divided into two subregions (PEVK-N, PEVK-C) with distinct patterns of evolutionary constraint and divergence. The bipartite nature of the PEVK region has implications for titin diversification. In the PEVK-N region, certain exons are conserved and may be essential, but natural selection also acts on particular codons. In the PEVK-C, exons are more homogenous and length variation of the PEVK region may provide the raw material for evolutionary adaptation in titin function. The PEVK-C region can be further divided into a highly repetitive region (PEVK-CA) and one that is more variable (PEVK-CB). Taken together, we find that the very complexity that makes titin a challenge for annotation tools may also promote evolutionary adaptation.

titin PEVK region comparative genomics gene prediction molecular evolution One goal of modern biology is to connect the underlying molecular structure of a protein with physiological function. Studies that compare protein sequences in an evolutionary context can illuminate the regions of proteins that are most essential, under functional constraint, or responding to natural selection (e.g., Perutz 1983;Hughes and Nei 1989;Kreitman and Akashi 1995;Galindo et al. 2003;Zhao et al. 2009;Finseth et al. 2015). With the advent of next generation sequencing, the focus of such studies has often moved beyond single gene analyses to comparing entire genomes or transcriptomes (e.g., Finseth et al. 2014;Karlsson et al. 2014;Zhang et al. 2014;Pervouchine et al. 2015;Chikina et al. 2016;Partha et al. 2017). Yet, genes that are large, repetitive, and/or poorly annotated can be left behind by this approach. One such example is titin (TTN), a giant filamentous protein expressed in the muscles of all bilaterian metazoans (Steinmetz et al. 2012) that plays a key role in muscle elasticity (Linke 2018) and is linked to various human muscular diseases (Savarese et al. 2016). At nearly 4,000 kD and greater than 1 micrometer in length, titin (also known as connectin) is among the largest proteins found in vertebrates (Bang et al. 2001). Titin plays a fundamental role in vertebrate striated muscle, where it acts as a giant molecular spring responsible for passive and active muscle elasticity (reviewed in Linke 2018). Titin spans an entire half sarcomere from z-disk to m-line and its I-band region is composed of three domains: the proximal tandem Ig The general role titin plays in sarcomere structure is conserved among vertebrates; however, the elastic and physiological properties of vertebrate muscle vary dramatically across species and tissues. Titin, and the I-band in particular, are at least partially responsible for some of these functional differences (Freiburg et al. 2000, Prado et al. 2005, Trombitás et al. 2000 and likely contribute to variation in muscle physiology among tetrapods (Manteca et al. 2017). For example, cysteine residues in the I-band of titin may be responsible for mechanochemical evolution of vertebrate titin (Manteca et al. 2017). Likewise, variation in the length and structure of alternatively spliced titin isoforms can affect passive and active stiffness of muscles within a single species (Freiburg et al. 2000;Linke et al. 1998, Prado et al. 2005, Powers et al. 2016, Monroy et al. 2017. Moreover, mutations to the TTN gene have been associated with various cardiomyopathies and skeletal muscular dystrophies in humans (Savarese et al. 2016).
Despite its central role in muscle physiology, evolution, and disease, the TTN gene is often poorly or partially annotated in non-model species. One issue is the sheer size of TTN; it is composed of more than 100 kbp with a protein length of . 30,000 amino acids and the gene is more than 60 times the length of the average eukaryotic gene (Bang et al. 2001, Granzier et al. 2007. TTN also contains 364 exons (363 coding exons and a first non-coding exon) with lengths that range from just a few base pairs to greater than 17,000 base pairs, and theoretically can produce more than one million splice variants (Bang et al. 2001, Guo et al. 2010. The abundance of alternatively spliced isoforms means that numerous tissues, samples, and individuals are required for complete cDNA-based or RNA-based annotations. The PEVK region of TTN, an important determinant of titin and muscle elasticity, also presents a problem for most annotation tools. The PEVK region contains over 100 short, repetitive exons consisting of $70% proline (P), glutamate (E), valine (V), and lysine (K) residues. These features mean that PEVK exons are often missed by automated annotation tools and previous studies have relied upon manual annotation of this region (e.g., Freiburg et al. 2000;Granzier et al. 2007). Nevertheless, the PEVK region is potentially an important target of selection over evolutionary time; it is evolutionarily labile, varies in length, exon structure, and amino acid content across some vertebrates, and may contribute to evolutionary adaptations in myofibril and whole muscle stiffness (Witt et al. 1998;Greaser et al. 2002;Granzier et al. 2007). The PEVK region also appears to have a hierarchical structure, with greater sequence divergence in the N-terminal among vertebrates than the C-terminal (Witt et al. 1998). With the wealth of genomic data now available, more detailed analyses of the PEVK region across vertebrates are possible and key to understanding its role in muscle evolution.
Here, we characterize the genomic structure of the PEVK region of TTN to gain insight into the structure-function relationships of I-band titin and its evolution across mammals. We first developed a custom tool, PEVK_Finder, to annotate exons within the PEVK region across diverse mammalian species. We then compared exon structure and sequence content both within one individual's PEVK region and across species. Finally, we performed evolutionary analyses to examine the nature of selection acting on the PEVK region of TTN.

PEVK_Finder development and optimization
We developed a tool called PEVK_Finder to annotate the PEVK region of TTN across vertebrates (Table S1). PEVK_Finder annotates exons according to three criteria: 1) minimum exon length (12), 2) minimum PEVK ratio per exon (0.54), and 3) sliding window length (10; determination of optimal parameter values described below). For a given species, PEVK_Finder translates the complete TTN sequence into three forward reading frames and stores the resulting amino acid sequences in memory. PEVK_Finder then utilizes a sliding window approach to identify windows with a PEVK ratio above a minimum threshold. Note that the PEVK_Finder algorithm also incorporates the amino acid alanine (A) in its PEVK ratio calculations, based on previously observed PPAK motifs in titin (Greaser et al. 2002). However, all ratios reported in the text and figures of this paper will be referred to simply as PEVK ratio. All windows that meet the PEVK ratio requirement are stored in memory, and overlapping windows are combined into discrete sequences with unique start and end coordinates. To determine exonic boundaries, PEVK_Finder searches for paired donor and acceptor splice sites within each sequence using canonical mammalian nucleotide splicing patterns (Burset et al. 2000). Although these paired sites are traditionally used to define the 59 (donor) and 39 (acceptor) ends of intronic sequences, PEVK_Finder searches for acceptor and donor pairs of adjacent introns. For example, the acceptor site of an intron marks the 59 end of a likely PEVK exon, while the donor site of the following intron marks the 39 end of the exon. When no acceptor/ donor pairs are found in a given sequence, PEVK_Finder discards the current sequence and moves on to the next likely PEVK exon. When multiple acceptor and/or donor sites are found in a given sequence, the sub-sequence with the minimum distance between any pair of acceptor and donor sites is considered the most likely PEVK exon. We confirmed that each non-overlapping exon was represented by a single reading frame and combined all resultant exons into a single, large exon set. The algorithmic workflow of PEVK_Finder is summarized in Figure S13. The PEVK_Finder program was written in Python using the Biopython bioinformatics tools package (Cock et al., 2009).
We optimized PEVK_Finder on the well-annotated human and mouse TTN sequences, before applying the tool to a diverse set of mammalian species (Table S1). To determine an optimal set of baseline parameters for PEVK_Finder, we subjected the untranslated DNA reference sequence of human and mouse TTN to a range of parameter settings in PEVK_Finder and compared those results to cDNA-annotated PEVK regions. The complete human and mouse TTN reference DNA were downloaded from the NCBI RefSeq database (O'Leary et al. 2016). The coordinates and DNA sequences for cDNA-annotated human and mouse PEVK exons were gathered from the Ensembl genome browser (Yates et al. 2016 ; Table S1). Human TTN exons 112-224 were defined as the human PEVK region, as determined by Granzier et al. (2007). Although human TTN exon 225 is also part of the PEVK region, the second $half of the exon also encodes the first exon of the distal tandem Ig segment. Exon 224 is the last exclusively-PEVK exon in the PEVK region, and exon 225 was therefore excluded from all further analyses. Note that we refer to human exon numbers according to the NCBI consensus CDS database and provide NCBI exon numbers, location, sequences, and additional numbering schemes in Table S2. Mouse TTN exons 109-207 were defined as the mouse PEVK region, as determined by manual inspection of cDNA-annotated sequences.
All possible combinations of minimum exon length (10-30), minimum PEVK ratio (0.45-0.83), and sliding window length (10-30) were used to generate $17,000 PEVK exon sets per species. The parameter ranges used for optimization testing were determined by examining the minimum and mean exon lengths and PEVK ratios of all human and mouse PEVK exons. To determine the lower boundary for each parameter range, a number slightly below the minimum was chosen, and a number higher than the mean was chosen to determine the upper boundary. Each resulting exon set was compared with the corresponding cDNA annotated exon set for each species using the Basic Local Alignment Search Tool (BLAST; Altschul et al. 1990). Only exons with 100% identity as determined by BLAST were retained for the next steps of optimization. If multiple hits per exon met these criteria, only the hit with the highest bit score was retained.
A match score was calculated to determine how well parameter sets recover the annotated PEVK regions for the $17,000 parameter combinations per species. The match score is a weighted score that prioritizes exons that recapitulate annotated exons ("recovered exons"; 70%), rewards identical exons ("perfect exons"; 10%), and minimizes exons identified by PEVK_Finder that are not found in the annotations ("extraneous exons"; 20%). Recovered exons were calculated as the proportion of the cDNA annotated exons identified by PEVK_Finder and included exons that generated both partial and full BLAST hits; perfect exons were defined as the proportion of recovered exons that were 100% identical for the entire length of the annotated exons; extraneous exons were calculated by subtracting the number of PEVK_Finder exons with no matches in the annotated database from 100 and dividing that number by 100. When the number of matchless exons found by PEVK_Finder was .100, the extraneous exons score was 0. The final match score was the sum of the three separate weighted scores and ranged from 0 to 1. The parameter space that encompassed the highest match scores was used to determine an optimal set of parameter ranges that yielded the most accurate sets of PEVK exons for a range of mammalian species. All match score scripts were written in R (R Core Team 2016).

PEVK Finder validation
To evaluate the utility of PEVK_Finder, we compared the performance of PEVK_Finder on human and mouse TTN with a suite of popular gene annotation tools: Augustus (Stanke and Morgenstern 2005), FGENESH (Salamov and Solovyev 2000), geneid (Parra, Blanco and Guigó 2000) and GENSCAN (Burge and Karlin 1997). We used each tool with default parameters to predict the exon-intron coordinates of PEVK exons in both the human and mouse TTN sequences. Exon coordinates generated by each tool were manually compared with cDNA-annotated PEVK exon coordinates for the corresponding species. Exons generated by any of the four tools with a partial match (.= 50%) to a corresponding cDNA exon were considered a match, and a tool exon that spanned two or more cDNA exons was considered a single match. cDNA exons with no overlapping coordinates in the tool exon set were considered missing exons. Tool exons with no overlapping coordinates in the cDNA exon set were considered novel exons.

Phylogeny construction and PEVK region characterization across mammals
We downloaded 43 mammalian TTN sequences representing 16 major orders from the NCBI RefSeq database (Table S1). Our taxonomic sampling is a representative subset from Zhao et al. (2009), which generated a phylogeny of 59 mammals that evolved to fill diverse niches including subterranean, aquatic and nocturnal niches. The genomic TTN sequences for all 43 mammals in the species list were run through PEVK_Finder and used to create novel PEVK exon sets. Forty-one of these exon sets were used for further analysis. To restrict PEVK exons to the PEVK region, terminal exons were manually determined based on the exon spacing and sequence composition patterns observed in the terminal exons of the cDNA-annotated human and mouse PEVK regions. Specifically, the first PEVK exon (human exon 112) has an amino acid sequence identical or nearly identical to "EIPPVVAPPIPLLLPT-PEEKKPPPKRI" and is $3,000 or fewer nucleotides before human TTN exon 114, which has an ortholog in all but one species and can thus be easily recognized. The last PEVK exon has an amino acid sequence identical or nearly identical to "AKAPKEEAAKPKGPI" and is generally $3,000 or fewer nucleotides after the second to last exon. Exons identified by PEVK_Finder that are located beyond these landmarks may indeed be true PEVK exons, but all exon sets used for further analyses were restricted to this exon range for the sake of remaining consistent with previous studies of the PEVK region.
Using a full time-calibrated phylogeny of extant mammalian species, we created a phylogeny of 39 of the 41 mammalian species in our study to compare PEVK exons across vertebrates and set duck-billed platypus as the outgroup (Kumar et al. 2017). Two bat species, Myotis lucifugus and Myotis davidii, were not included in the phylogeny because their PEVK regions recapitulated those of other closely related bat species already represented in the figure. Phylogeny importation and editing was performed in R using the Phytools package (Revell 2012; R Version 3.3.2, R Core Team 2016). To facilitate visual comparisons of PEVK regions, we generated exon-intron diagrams for each species with a custom R script that plotted the coordinates of PEVK exons in each exon set, colorcoded by percent PEVK per exon.
PEVK_Finder exons were compared with cDNA-based annotations or in silico predicted exons for the 41 mammalian exon sets. The NCBI gene prediction program Gnomon (Souvorov et al. 2010) predicts the optimal coding sequence using a combination of partial alignments and ab initio modeling. PEVK_Finder annotations were compared with cDNA-based Gnomon predictions when cDNA was available and were compared with in silico Gnomon predictions when no cDNA was available. For simplicity, we refer to both sets of predictions as "Gnomon exons." We calculated exons identified by PEVK_Finder or Gnomon, exons that were identified by both ("consensus exons"), and the total number of unique exons identified by both tools. Mean differences between Gnomon and PEVK_Finder were determined using a paired t-test, pairing by species.

Evolutionary analyses
To facilitate comparisons among PEVK_Finder exons, we attempted to identify "orthologous exons". We use the term "orthologous" to distinguish exons that descend from a common DNA ancestral sequence from those that are related by duplication (i.e., paralogous). Orthologous PEVK exons among species were determined using the reciprocal best BLAST method, with the exception that orthologs were determined for each exon rather than the entire TTN gene (Tatusov et al. 1997;Bork and Koonin 1998;Koonin 2005). The nucleotide sequence of the human PEVK exon set was used as the query and compared with exon sets from the other species. Potential orthologs were called when a given human exon's top BLASTn species hit (minimum e-value, maximum bit score among hits) returned the original query exon when performed reciprocally. Only orthologs found in the PEVK exon sets of .10 species (including human) were included in further analyses. The remaining sets of orthologous pairs are referred to as "confident orthologs". Confident orthologs were aligned by codon with the Clus-talW (Goujon et al. 2010) algorithm as implemented in MEGA7 (Kumar et al. 2016), using all other default parameter settings. Codons missing .15% of data among species were eliminated from the alignment. We estimated average evolutionary divergence over all sequence pairs for each confident ortholog following Tamura et al. (2004) as implemented in MEGA7 (Kumar et al. 2016). Values for each ortholog were calculated in MEGA7 with default parameters and 100 bootstrap replications.
We also evaluated repetition and duplication within the PEVK region by comparing all PEVK exons for a single species with each other. For a given species, all possible exon pairs in the PEVK exon set were aligned, and pairwise nucleotide substitutions per exon were calculated using MEGA Proto and MEGA-CC (Kumar et al. 2012). The PEVK region was divided into the subregions PEVK-N and PEVK-C, defined in Results: PEVK region hierarchical structure. The pairwise exon alignments were divided into quadrants representing PEVK-C:PEVK-N (Quadrant I), PEVK-C:PEVK-C (Quadrant II), PEVK-N:PEVK-N (Quadrant III), and PEVK-N: PEVK-C (Quadrant IV) comparisons. For each species, we calculated mean substitutions per exon for all quadrants, and results were compared with a one-way ANOVA and a Tukey's HSD post-hoc test. To evaluate repetition at the nucleotide level, we concatenated exonic sequences from the human PEVK-C region and generated a dot plot in the Dotlet web browser (Junier and Pagni 2000). We additionally made self-dot plots using the R Dotplot package (Delaney 2017) for all species using the full PEVK-C genomic sequence for each species, including intronic sequences. Finally, we used a BLAST search to find human LINE-1 elements in other mammals. We used the intronic sequence between human TTN exons 180 and 181 as the query and compared with all 41 mammal TTN sequences.
The PEVK-C region was further divided into two subregions PEVK-CA and PEVK-CB for single-species analyses, also defined in We tested for positive selection acting on individual codons using site models as implemented in the Phylogenetic Analysis by Maximum Likelihood (PAML) software package (Yang 2007). For each ortholog alignment, a phylogenetic tree was estimated using the M0 model, then models M0 (one ratio), M1a (neutral), and M2a (positive selection), were run. For each model, v was set first at 0.5, then at 1 and then lastly at 3. We recorded Lnl values for each model and test statistics for two model comparisons: M0 vs. M1a and M1a vs. M2a. The PAML x2 calculator was used to determine the p-value of each comparison. All sites under selection, as determined by Bayes Empirical Bayes analysis (Yang et al. 2005), were recorded for estimates with posterior probabilities . 0.5.
Finally, we tested for charge shifts in codons under positive selection. For each codon, the corresponding amino acid for each species was assigned one of four values based on residue charge: hydrophobic/ uncharged, polar, positively charged, or negatively charged. The ancestral charge status of each node in the corresponding mammal tree was estimated using an equal rates (ER) model. Ancestral state reconstruction was performed in R using the Phytools package (Revell 2012; R Core Team 2016).

Data Availability
The PEVK_Finder source code is available for download at https:// github.com/kmuenzen/pevk_finder_public. Tables S1 and S2 contain species names, NCBI accession numbers and TTN exon numbers for sequences used in this study. Scripts used in this study have been made available at https://github.com/kmuenzen/pevk_finder. Supplemental materials are available at Figshare: https://doi.org/10.25387/ g3.7544900.

PEVK_Finder parameter optimization
PEVK_Finder was developed to annotate the PEVK region of TTN across mammals. We determined optimal parameters (exon length, window size, and PEVK ratio) for downstream applications by comparing PEVK_Finder results with the human and mouse annotations and calculating a match score. For each human and mouse model, we tested a total of $17,000 parameter combinations. Optimal match scores for human were achieved when minimum exon length was 10-15 nucleotides, PEVK ratio was 0.54-0.55, and window length was 10 nucleotides (Figure 1a). Optimal match scores for mouse were achieved when minimum exon length was 12-14 nucleotides, PEVK ratio was 0.53-0.54, and window length was 10 nucleotides ( Figure  S1a). Default consensus parameters were therefore set as follows: window length = 10 nucleotides, PEVK ratio = 0.54, and minimum exon length = 12 nucleotides.
Using the optimal parameter values, PEVK_Finder identified 109 exons in the PEVK region in human TTN (113 annotated) and 93 PEVK exons in mouse TTN (99 annotated). In humans, 106 PEVK_ Finder exons recovered annotated exons (94%) and 70 of these were perfect matches. Of the 36 non-perfect exon matches, 7 were overestimations of exon length and 29 were underestimations. When PEVK_ Finder exon length was .50% reduced relative to the annotated length, the annotated exons were often long, with low PEVK ratios. In these cases, PEVK_Finder tended to truncate the exons in regions due to low PEVK ratio (e.g., human TTN exon 114). When PEVK_Finder overestimated exon length, all PEVK_Finder exons were ,100 nucleotides long and length differences were marginal (3-4 amino acids difference). Patterns were similar with the mouse TTN, with the exception that overestimations were more frequent (12 of 37 non-perfect exon matches), possibly due to non-canonical splice sequences ( Figure S1c).

PEVK_Finder validation
To evaluate the utility of PEVK_Finder, we compared results generated from PEVK_Finder with four different gene annotation tools. Overall, PEVK_Finder outperformed all other tools when identifying PEVK exons in the TTN gene. PEVK_Finder identified 97% of human TTN exons, while no other tool identified more than 65% of human PEVK exons ( Figure 1b). PEVK_Finder also missed far fewer exons than GENSCAN in both human and mouse TTN (gray circles Figures 1c, S1c). PEVK_Finder was also the only tool that identified .80% of both human and mouse exons in the PEVK region (Figures 1b, S1b). FGE-NESH also identified a high proportion of mouse exons (81%), but was far less effective than PEVK_Finder at annotating human exons (62%). PEVK_Finder performed slightly better on human TTN (94%) than mouse TTN (90%). PEVK_Finder also annotated exons across a range of PEVK ratios and in regions of short, high-density exons (Figures 1c, S1c); these regions were a challenge for GENSCAN. In the few instances when PEVK_Finder did fail to annotate exons, it was usually those exons with low PEVK ratios. GENSCAN also tended to lump distinct exons together, likely due to missed splicing sites. PEVK_Finder, FGE-NESH, and GENSCAN could be applied to both mouse and human TTN, whereas geneid and Augustus are only suitable for human data. In addition to identifying known exons, PEVK_Finder discovered novel putative PEVK exons in both the human (1; Figure 1c) and mouse (2; Figure S1c) TTN sequences. However, the "novel" exon identified in human TTN is technically outside the bounds of the pre-defined PEVK region (exon 112-224).
We extended our evaluation of PEVK_Finder by evaluating its performance on other species with annotated TTN sequences. Of the 41 mammalian species tested, only five have cDNA data for TTN in NCBI (Table S3). Human TTN has the most complete annotation, with consensus cDNA from 11 isoforms. The American Pika (Ochotona princeps) has cDNA data from 5 isoforms, the house mouse (Mus musculus) and the Gairdner's Shrewmouse (Mus pahari) have cDNA data from 3 isoforms each, and the Chinese Rufous Horseshoe Bat (Rhinolophus sinicus) has cDNA data from 1 short isoform. For the remaining 36 species, we compared PEVK_Finder results with the annotations provided by NCBI generated by the Gnomon program (Souvorov et al. 2010). Across all species, PEVK_Finder identified significantly more exons (mean: 89.4, SD: 8.33) than the Gnomon-based annotations (mean: 73.9, SD: 10.6; t = 9.37, df = 40, P = 1.23 Ã 10 211 ; Table S3). The number of exons identified by PEVK_Finder represents a significantly higher percentage of the total exons (mean: 93.6%, SD: 2.35) than Gnomon-based annotations (mean: 76.7%, SD: 9.29; t = 11.10, P = 8.71 Ã 10 214 ; Table S3). PEVK_Finder also identified many more putative novel PEVK exons that were not described previously by either Gnomon or cDNA (mean: 22.7; SD: 10.17).
PEVK region hierarchical structure PEVK_Finder successfully generated exon sets for 41 of the 43 species tested. We were unable to obtain exon sets for two species (Sorex araneus and Leptonychotes weddellii) due to non-canonical splicing sequences that were not compatible with those used by PEVK_Finder to identify exon-intron boundaries. The PEVK region of the remaining 41 species exhibited structural similarities in both exon-intron spacing and PEVK content per exon (Figure 2). The total number of exons identified by PEVK_Finder ranged from 81 -116 exons, 74 -109 of which were within the bounds of the exons defining the start and end of the PEVK region (Table S3). Interestingly, the species with the smallest number of total PEVK exons are either aquatic diving mammals (Odobenus rosmaruswalrus (81), Tursiops truncatusdolphin (79), Neomonachus schaunslandi -Hawaiian monk seal (83)) or a burrowing mammal, Condylura cristata (82). However, two of these species also show gaps in their PEVK regions, which may be real or may be due to assembly artifacts (see below).
Exon structure of the upstream $half of the PEVK region is relatively conserved over mammalian evolution, while the downstream $half varies considerably. Based on this observation, we divided the PEVK region into two distinct regions: a structurally conserved region ("PEVK-N"; human exons 112-165) and a structurally variable region ("PEVK-C"; human exons 166 -224). Because the end of the PEVK-N region and the beginning of the PEVK-C region varies 1-5 exons between species, a visual landmark was used to manually determine the PEVK-N/C boundary for each species. The manually determined PEVK-N/C boundary is preceded by 3-4 very closely spaced, low PEVK-ratio exons, and is followed by 3-4 more low-PEVK ratio exons and a dense, high PEVK-ratio region (Figure 2). Across mammals, intron/exon boundaries and the number of exons are more similar across the PEVK-N (mean: 50.00, SD: 2.51) vs. the PEVK-C region (mean: 39.46, SD: 8.12, f = 0.10, P , 1.70 Ã 10 211 ) (Figure 2, Table  S3). Several species (e.g., chimpanzee, dolphin, walrus and alpaca) also show large deletions of PEVK exons in the PEVK-C region. The walrus TTN sequence contains a large string of .9,000 'N' bases, indicating that the gap in the walrus PEVK-C region is likely due to low assembly quality in this region. Alpaca TTN also contains several strings of N's Figure 1 PEVK Finder tool optimization and evaluation of human TTN. a) Match scores of human PEVK exon sets were generated using different combinations of minimum exon length, PEVK ratio and sliding window length parameter settings. b) PEVK_Finder recovered more PEVK exons than other existing gene prediction tools (GENSCAN, Augustus, FGENESH and geneid). c) PEVK_Finder outperformed GENSCAN at recovering the exon-intron distribution of human TTN PEVK exons identified by cDNA. Vertical lines indicate exon boundaries, and the thickness of the lines is determined by the exon coordinates. Gray circles indicate exons that were missed by either PEVK Finder or GENSCAN, and black asterisks indicate automatically annotated exons that were not annotated by cDNA. In this figure, the first exon indicated by an asterisk is technically outside the pre-defined bounds of the PEVK region. The PEVK ratio scale is given in the figure. that range from 500-5,000 bases long, which are likely responsible for the large gaps. However, while several small strings of 10-150 'N' bases are found in the PEVK-C region of chimpanzee and dolphin TTN, no large strings of Ns are present. The large deletions in the chimpanzee and dolphin may therefore represent true PEVK exon "deserts". Alternatively, the gaps could also be an artifact of non-canonical splice sites or small strings of N's interfering with splice site detection or PEVK ratios. Complete genomic sequences of TTN and/or RNA-based annotations could improve resolution of PEVK exons in these species.
The observed sequence variability within an individual's PEVK-C region (Figure 3a, Quadrant II; self-dot plots of the PEVK-C region for each species, Figures S14-23) suggested that the PEVK-C could be further subdivided into a highly similar PEVK-CA region (exons 166-215 in human TTN) and a more variable PEVK-CB region ($last 9 exons of the PEVK-C). The PEVK-CA/PEVK-CB boundary for each species was manually determined using a visual landmark, where the final exon in the densest part of the PEVK-C region marked the last exon of the PEVK-CA region, and the first exon of the PEVK-CB region directly followed the last PEVK-CA exon. The mean substitutions per exon varied significantly among the nine subregion comparisons (one-way ANOVA: F 8,360 = 211.6; P = 2 Ã 10 216 ; Figure S25). Post hoc comparisons indicated that the number of substitutions per exon for PEVK-CA:PEVK-CA comparisons (mean: 0.64; SD: 0.13; Box v) were significantly lower than all other PEVK-CA comparisons (P , 0.0001 in all cases; Box v vs. Boxes i-iv, vi-ix), as well self-comparisons including the PEVK-CB:PEVK-CB (P , 0.0001; Box iii vs. Box ix). Additionally, the number of substitutions per exon for PEVK-CB: PEVK-CB exons were significantly greater than the PEVK:N-PEVK-N self-comparisons (P , 0.0001; Box iii vs. Box vii). PEVK-CB:PEVK-CB exon comparisons also show similar levels of variability as between region comparisons, such as PEVK-CA and PEVK-CB exon comparisons (P = 0.93, Box iii vs. ii; P = 0.9, Box iii vs. vi). These additional analyses suggest that the PEVK-CB region is more variable in sequence than the PEVK-CA region in an individual, and that sequences with high similarity to each other are concentrated in the PEVK-CA region. Likewise, the self-dot plots (Figures S14-S23) show that each species has some highly repetitive regions in the PEVK-CA (appearance of boxes in lower left corner). The distinct patterns of sequence variability with the PEVK-C suggested that the PEVK-CA and PEVK-CB may be evolving differently. Intriguingly, cardiac N2B titin isoforms do not contain exons in the PEVK-CA, and may provide a functional basis for variable selection on the two regions (Greaser et al. 2002;Guo et al. 2010). There is significantly more glutamate (E), valine (V) and lysine (K) per exon in the PEVK-N region (gray bars) and more proline (P) per exon in the PEVK-C region (black bars) across all 41 species. Bars represent mean 6 SE. Asterisks denote significance at P , 0.05. e) Ratio of glutamate (E) per exon in human TTN PEVK. PEVK_ Finder confirms that there is relatively more glutamate (E) per exon in the PEVK-N region compared to the PEVK-C region.
In humans, the movement of long interspersed nuclear elements (LINEs) facilitated the restructuring of the PEVK-C region, resulting in a large triplication event in the PEVK-C (Bang et al. 2001). We used selfdot plots of the PEVK-C region to examine the incidence of large-scale duplications and triplications (off-axis parallel lines) across each species ( Figure S24). We find evidence of large duplications in chimpanzee, orangutan, Norway rat, house mouse and star-nosed mole in the PEVK-CA region. Triplications are only apparent in koala and human ( Figures S14b, S16b, S24). In koala, the internal segment of the triplication is flanked by identical sequences, similar to the human triplication event, though it is not known if the sequence is a LINE-1 element as in humans, as LINEs are species-specific. Previous work based on nucleotide divergence placed the insertion of the LINE-1 elements in human PEVK-C after the divergence of humans and other primates (Bang et al. 2001). However, here we find evidence that both orangutan and chimpanzee also have the LINE-1 element in the PEVK-C, though they only show 1 element each (2 in humans; Figures S16a-c, S24). While in humans, the LINE-1 restructuring of the PEVK-C produces a large triplication event, the region is only duplicated in orangutans and the chimpanzee shows two large repeats with a truncated third repeat (Figures S16a, S16c, S24).
Previous work found that exons in the region corresponding to the PEVK-N were proportionally richer in glutamate (Greaser 2001;Labeit et al. 2003;Forbes et al. 2005). PEVK-Finder replicates that finding here, suggesting again that PEVK_Finder is able to reproduce previously described PEVK exons in humans (Figure 3e).

Evolutionary analysis of the PEVK region across mammals
We performed a reciprocal best BLAST (Tatusov et al. 1997;Bork and Koonin 1998;Koonin 2005) to detect orthologs in the PEVK region. Overall, orthology across the PEVK region is low, with only 22 confident orthologs detected in the PEVK-N region (43% of human PEVK-N exons) and 13 in the PEVK-C region (22% of PEVK-C exons) (Figures 4, S2). More orthologs are detected in the PEVK-N region than the PEVK-C region (red/orange, Figure S2), consistent with a more conserved exon structure in the PEVK-N region. Exons in the PEVK-C region display greater sequence divergence across the mammalian tree (mean: 0.11; SE: 0.01) than the PEVK-N region (mean 0.08; SE: 0.01), though this difference is not significant (t = 1.67, P = 0.11) and primarily driven by three exons in the PEVK-C region ( Figure S3).
We also examined the nature of selection acting on codons in the PEVK region. Specifically, we tested for positive selection using randomsites models in PAML, which allowed v to vary among sites but not lineages. For six sets of orthologous exons (human exons 114, 135, 137, 138, 145 and 199), models that allowed certain codons to evolve under positive selection (v . 1; M2a) were significantly better than models that restricted codons to be conserved (0 , v , 1) or neutral (v = 1; M1a; Figure 4; Table 1). Model M1a vs. M0 was also significant for six exons investigated (results not shown). Five exons with sites under positive selection were in the PEVK-N region, and one was in the PEVK-C region. In total, 20 codons were under positive selection (Figures S5-S10; Table 1). Results were qualitatively similar for all initial v values.
The polarity of amino acids can influence protein-binding interactions and solubility, and therefore may contribute to adaptive evolution of titin. Therefore, we used ancestral state reconstruction to examine charge transitions in TTN exons with confident orthologs. Our analyses show that codons 42 of exon 114 and 6 of exon 135 are under positive selection and have substitutions that change the charge of the codon (Figures S11-S12). Codon 42 of exon 114 features shifts from hydrophobic to polar amino acids from a minimum of 3 to $6 times, with shifts occurring in the branches leading to bats, beaver, cetaceans, lemur and Northern treeshrew. This codon also shifts from a hydrophobic to a positively charged amino acid in Mus musculus. Codon 6 of exon 135 reveals four independent transitions from positively charged to polar amino acids across mammals. Intriguingly, in cetaceans, this codon shifts from positively charged to hydrophobic. Thus, in both instances of codons with evidence for charge transitions, cetaceans are one of the few groups to experience changes in polarity and sometimes do so in unique ways ( Figure S12). Together, these charge changes and low PEVK exon numbers suggest that titin in cetaceans and other aquatic diving mammals may be under different selective pressures than terrestrial mammals.

DISCUSSION
PEVK_Finder improves the annotation of the PEVK region across mammals The PEVK region of TTN possesses many attributes that confound standard gene prediction tools; it contains numerous short, repetitive exons that vary within and across species in sequence content, exon structure, and length. However, it is this very complexity that makes the PEVK region a promising source of raw material for evolutionary adaptation of muscle (Freiburg et al. 2000;Tompa 2003;Prado et al. 2005). Robust annotations of the PEVK region are therefore needed to link molecular evolution of TTN with physiological performance over evolutionary time.
While accurate gene prediction may be achieved with RNA-Seq or similar methods, relying on transcriptomic data for gene prediction is not always feasible due to time and cost constraints. For example, only 5 of the 43 species examined in this study had cDNA data supporting their TTN annotations at the time of the study. The plethora of alternatively spliced isoforms of TTN further complicates RNA-based gene predictions, as data from numerous tissues, developmental stages, and/or individuals are required to generate complete annotations (e.g., Savarese et al. 2018). To this end, we developed PEVK_Finder, which combines ab initio and signature-based approaches to gene prediction and targets the PEVK region of TTN (Wang et al. 2004;Lerat 2010;Sleator 2010). Our tool depends solely on nucleotide sequence and searches for exons with motifs and patterns common to well-annotated PEVK regions (i.e., human, mouse). Similar custom tools have been developed for classes of transposable elements, which are also highly repetitive and often missed by standard annotation tools (Lerat 2010).
Here, we find that PEVK_Finder consistently improves the annotation of the PEVK region across mammals. One major challenge of annotating repetitive regions with short exons is that exons are commonly missed. PEVK_Finder consistently detects significantly more PEVK exons than standard annotation tools, both in model and nonmodel species (Figures 1b, 1c; Table S3). Specifically, PEVK_Finder outperforms other tools in regions of short, high-density exons and across broad ranges of PEVK ratios. PEVK_Finder also identifies  Exons 114,116,122,125,130,135,136,137,138,141,142,143,144,146,152 153, and 161 are either constitutively expressed or expressed in .95% of TTN transcripts in skeletal muscle according to Savarese et al. (2018). Details about codons under selection are in Figures S5-S12. As in figure 1c, the first exon in this figure is technically outside the pre-defined bounds of the human PEVK region.
numerous putative PEVK exons that are not found in cDNA annotations or identified by other bioinformatics tools (Figures 1c, S1c; Table  S3). It is possible that these novel exons are not transcribed, but TTN has many alternatively spliced isoforms that may not be completely characterized by cDNA. For example, manual annotation of genomic TTN sequences has discovered novel TTN exons in the past (e.g., Freiburg et al. 2000;Granzier et al. 2007).
While PEVK_Finder improves available annotations in non-model species, we caution that it does not perfectly replicate the PEVK region in human and mouse TTN. PEVK_Finder occasionally misses, truncates, or joins distinct exons, albeit at a lower frequency than the other tools tested. Exons with low PEVK ratios, internal splice sites and noncanonical splice sites cause most of these errors. Indeed, we were unable to obtain exon sets for two of the 43 species we tested due to the presence of non-canonical splice sites. In the future, incorporation of hidden Markov models into the tool could improve issues with splicing and variable PEVK ratios. Given this, we argue that PEVK_Finder is most useful for determining major trends in PEVK structure and evolution in non-model species, as exemplified here. When complete, perfectly resolved annotations are required in a non-model species, PEVK_Finder should be used in conjunction with RNA-Seq or cDNA sequencing from numerous tissues, individuals, and developmental times. Likewise, pairing PEVK_Finder with another annotation tool, such as Gnomon, could also provide more comprehensive annotations (Table S3).

The evolution of the PEVK region
Our results reveal contrasting patterns of constraint and divergence across the PEVK region, suggesting two or three subregions with distinct evolutionary dynamics. The PEVK-N region shows relatively conserved length and exon structure over evolutionary time, but evidence of diversifying selection and more variable amino acid content (Figures 2,4). In contrast, the PEVK-C region varies dramatically in length and exon number across mammals and can be further divided into the highly similar PEVK-CA and more variable PEVK-CB (Figures 2, 3a, 3c, S14-S23). Overall, the documented patterns argue for selection maintaining particular, "essential" PEVK-N and PEVK-CB exons over evolutionary time, with diversifying selection targeting specific codons in the PEVK-N. For example, 17 out of 22 PEVK-N exons with confident orthologs are either constitutively expressed or expressed in .95% of TTN transcripts in skeletal muscle (Figure 4). Likewise, cardiac N2B titin contains exons in the PEVK-N and PEVK-CB, but not the PEVK-CA (Greaser et al. 2002;Guo et al. 2010). Conservation over both evolutionary time and across isoforms suggest these exons may play key roles in vertebrate muscle function. Additionally, five PEVK-N exons have codons that experienced adaptive evolution and may be implicated in diversification of muscle function over mammalian evolution. Specifically, codons in exons 114 and 135 experienced major shifts in charge over mammalian evolution and may influence titin-protein binding. Conversely, expansion and contraction of the total length of the PEVK-C region, rather than selection on any particular exon, dominates the evolution of the PEVK-C. Such length variation has the potential to contribute to functional differences, as altering the size of the titin "spring" through alternative splicing of more or fewer PEVK repeats can affect titin's compliance, though neutral processes may also be relevant (reviewed in Linke 2018). Future work can focus on disentangling the effects of natural selection acting on specific codons from PEVK length variation, and how both contribute to evolutionary adaptation of TTN.
What explains the discrepancy between the molecular patterns documented in the PEVK-N and PEVK-C regions? Intraspecific sequence comparisons of PEVK exons offer clues. Within a single species' titin sequence, the sequence content of the PEVK-CA is more repetitive than the PEVK-N or PEVK-CB (Figures 3, S5, S14-S23, S25). This suggests a mechanistic basis for expansion and contraction of the PEVK-CA over the course of mammalian evolution. Replication slippage and recombinatorial repeat expansions often occur in regions with tandem repeats, and may result in exon duplication and loss (Tompa 2003). Mobile LINE element insertions have also been implicated in TTN remodeling by causing PEVK exon duplication or differential splicing (Granzier et al. 2007). For example, we find that retrotransposition of LINE-1 elements facilitated tandem duplications of longer PEVK-CA regions in chimpanzee and orangutan, in addition to the previously reported human duplication (Bang et al. 2001;Figures S24, S16). Because we find that the PEVK-CA contains more repeats than the PEVK-N and PEVK-CB, we argue that any process generating exon duplication or loss occurs most frequently in the PEVK-CA.

Implications for titin function
The length and exon structure of the PEVK region vary remarkably over evolutionary time, with implications for titin functionality. The PEVK region of titin has long been known to contribute to passive stiffness of muscle (Gautel and Goulding 1996;Linke et al. 1998). Through alternative splicing, titin can be expressed as isoforms with varying lengths of the PEVK domain which correlate with the passive properties of different muscle types (Freiburg et al. 2000;Prado et al. 2005). To date, at least 4000 alternative splicing events and twelve distinct isoforms have been identified in human, mouse, and rabbit tissues (Freiburg et al. 2000;Prado et al. 2005;Savarese et al. 2018). Muscles with long PEVK segments have low passive stiffness whereas muscles with shorter segments have higher passive stiffness (Freiburg et al. 2000;Prado et al. 2005). The repetitive nature of PEVK-C exons may allow for variability in the length and stiffness of titin isoforms. Repeats within the PEVK region have been postulated to be functionally equivalent to fulfill entropic chain requirements (Tompa 2003). While this may be the case for repeats within the PEVK-C region, it is less clear for repeats within the PEVK-N region. Alternatively, variation in exon number across the PEVK region may be a product of neutral processes that result in exon skipping and loss, without affecting TTN function. While a mechanism has not yet been elucidated, several researchers have proposed that during muscle activation, titin stiffness increases in the presence of calcium possibly by titin-actin binding (Herzog et al. 2012;Nishikawa et al. 2012;Schappacher-Tilp et al. 2015). Evidence suggests that there is a small direct effect of calcium on PEVK stiffness but that this increase cannot fully account for active muscle stiffness (Tatsumi et al. 2001;Labeit et al. 2003). Within the PEVK region, exons can be composed of strings of negatively charged glutamate residues (E-rich motifs). Results from single molecule experiments have shown that E-rich PEVK segments bind to actin filaments, which produces a viscous load that could possibly resist the sliding of the thin filament along the thick filament during muscle contraction (Kellermayer and Granzier 1996;Labeit et al. 2003;Nagy et al. 2004;Bianco et al. 2007). If this interaction is necessary for increasing titin stiffness in activated muscle, splicing of exons that encode for amino acids responsible for changing charge or binding affinity to actin could have dramatic effects on titin function. For example, several studies have implicated titin in the enhancement of active muscle force with stretch (Leonard and Herzog 2010;Powers et al. 2016;Hessel et al. 2017). If titin stiffness failed to increase as a result of a change in amino acid charge and/or binding to actin or other proteins, muscles could show decreased force during stretch and alter an organism's ability to move. Our data show that PEVK-N exons are conserved and confirm previous work suggesting that PEVK-N exons have a higher percentage of E-rich motifs than PEVK-C exons (Greaser 2001;Labeit et al. 2003;Forbes et al. 2005; Figure 3e). In addition, PEVK-N exons 114 and 135 exhibit charge shifts that could affect PEVK interactions with other proteins (Figures S11, S12). However, no studies to date have shown calcium dependent PEVK-actin binding (Bianco et al. 2007;Nagy et al. 2004;Linke et al. 2002). Thus, it is unlikely this interaction underlies the increase in titin stiffness during muscle activation. It remains unclear how E-rich motifs or change in charge may affect titin function. Further studies on the expression of PEVK-N exons could serve to elucidate the role of the PEVK in active and passive muscle.
Several cardiac and skeletal muscle diseases have been associated with mutations in the TTN gene including the PEVK region (Chauveau et al. 2014;Savarese et al. 2016;Schafer et al. 2017). Understanding how titin has evolved across mammals may provide insight about the effects that TTN mutations have on muscle physiology and disease. Our analyses have identified evolutionarily conserved PEVK exons in the PEVK-N that have previously been shown to be constitutively expressed (Savarese et al. 2018). These exons may play fundamental roles in titin function and therefore represent sites that could lead to severe deficits in function if modified. Of the 127 TTN sequence variants that have been associated with human muscular disorders, 29 are located in the I-band, 5 of which are found in the PEVK region (Chauveau et al. 2014). Cardiomyopathies are often associated with truncating variants in the A-band, although a few variants have been identified in the I-band as well (Schafer et al. 2017). Interestingly, to date, most of the mutations in the PEVK region that have been linked to neuromuscular disease are located in the PEVK-C region (Savarese et al. 2016). It is possible that there are more disease-causing variants in the PEVK-C compared to the PEVK-N region, or it could be that variants in the PEVK-N region have yet to be identified. It is also possible that the highly repetitive nature of the PEVK-C may facilitate errors during replication that increase the frequency of deleterious mutations. Alternatively, it has been suggested that some I-band variants could be circumvented via differential splicing (Schafer et al. 2017). Clearly, further studies on the essentiality of particular exons for muscle function and mechanisms of disease are crucial for better understanding TTN-related disorders.

Conclusions
In summary, PEVK_Finder provides a useful method for determining major trends in PEVK structure and evolution in non-model species. By characterizing the PEVK region of titin in mammals, we identify two potential pathways through which selection could shape titin function: by changes in the amino acid sequences of specific codons, and by variations in the size of the PEVK region. In the PEVK-N, exons with confident orthologs that are constitutively expressed in isoforms are strong candidates for future work testing the essentiality of particular exons that underlie titin stiffness and muscle function. Together, the construction of custom annotation tools for challenging-to-annotate genes can facilitate novel insights into the diversification and nature of selection acting on important proteins like titin.