Nontriplet feature of genetic code in Euplotes ciliates is a result of neutral evolution

Significance In this work, we provide compelling evidence that Euplotes genetic code violates the triplet nature of the genetic decoding that was thought to be universal. Thus, Euplotes possess the most extreme example of genetic code variation described so far. The nontriplecy arises from abundant ribosomal frameshift sites with no regulatory function, where stop-codons distant from the 3′ transcript end specify +1 or +2 ribosomal frameshifting with high accuracy. We show that this violation of the triplet coding in Euplotes is brought about and further maintained by neutral evolution rather than selective processes but still is irreversible.


Determinationing of the transcript strand
Transcript strands were determined as follows. If a transcript contained the poly-A tail either as a poly-A tail at the 3'-end or as a poly-T tail at the 5'-end, the transcript strand was determined accordingly. If no poly-A tail was present, we considered a pairwise alignment of the transcript with one of its orthologs in four combinations of orientations of two transcripts. In each set of four pairwise alignments, the longest conserved ORF was inferred and for this ORF the d N /d S values were calculated. Orientation corresponding to the lowest d N /d S value was then selected. If the selected orientation was not consistent across orthologs, it was assigned by manual curation of the alignments.

General outline and the parameters
Stop codons in Euplotes spp, in addition to being signals of translation termination, can also function as ribosomal frameshift sites (1, 2) and may potentially be read through. To identify sequences of coding ORFs separated by possible frameshifting or stop-codon readthrough events, we rely on three features of ORFs known to be the hallmarks of coding sequences: (i) ORF length, (ii) protein identity between orthologs, (iii) signature of negative selection (low d N /d S ). For each feature, we infer a set of thresholds based on the transcriptomic data (see below) and then demonstrate their validity and robustness by comparing the results to RiboSeq data and checking their adherence to known properties of euplotid frameshift sites. If an ORF does not satisfy any of the threshold-defined criteria, it is considered potentially non-coding. We consider only euplotid ORFs with established orthologous sequences in other Euplotes spp. and hence protein-and nucleotide sequence alignments may be constructed and, accordingly, parameters such as sequence identity or d N /d S may be obtained. We should note here that our method is expected to perform rather poorly on very short proteins (<30 amino acids), as identity and d N /d S values would not be reliably calculated.
The general outline of the algorithm is as follows. We start with the ORF in a transcript with the longest alignment with an ortholog and require it to be longer than a certain threshold. Next, we expand it on both sides, adding adjacent ORFs assuming +1 or +2 frameshifts or stop-codon readthrough, hence gradually constructing the translated region (TR). Hereinafter, for brevity, we refer to orthologous ORFs in alignments simply as ORFs. For each such expansion, we check that the added ORF is longer than another threshold. Both thresholds are selected so that the number of predicted frameshifts is not excessive. In addition, we check that the protein identity with orthologs is sufficiently high, and d N /d S is sufficiently low, using as a reference Euplotes genes without frameshifts. We also check that both values are relatively constant along the predicted TR, using parameters inferred by the analysis of a large set of mammalian proteins (for the identity), and the Poisson statistics for d N /d S . The details are given below.
The procedure used to set all thresholds aimed at robustness of the resulting set of FSs. Robustness of the results was defined as the absence of substantial changes in the set of predicted FSs upon minor changes of the thresholds. The thresholds were optimized sequentially: first, we looked into stability of the number of predicted FSs with respect solely to the length thresholds (Suppl. Fig. S1c). Next, having selected length thresholds yielding stable results, we applied them with different combinations of thresholds on the uniformity of the protein identity and d N /d S . The final set of thresholds (listed in Table S1) was used to predict FSs on which the main results were obtained.

Primary ORF and its extension
In each transcript with the determined strand we selected an ORF such that (i) its length exceeded a given threshold (30 amino acids) and (ii) after formal translation it provided the longest amino acid alignment with a transcript from a different Euplotes species. The threshold for the length of primary ORFs was established by the same procedure as the one used to establish other threshold lengths (see below). These primary ORFs turned out to be long, demonstrating high sequence identity and low d N /d S (Suppl. Fig. S1f-i), hence we considered them to be bona fide protein-coding ORFs.
The TR is then constructed by extending the primary ORF with shorter ORFs adjacent at the 5'and 3'-ends. All three reading frames are considered, hence accounting for stop codon readthrough and +1 and +2 frameshifting. If adjacent ORFs in several reading frames satisfied the length, identity, and d N /d S conditions (see below), we used the longest ORF if its length was more than 10/9 of the second-longest ORF, otherwise we selected the added ORF based on higher protein identity to the closest ortholog.

Extending ORFs -length thresholds
The number of predicted FSs depends on the minimal-length thresholds for adjacent ORFs (Suppl. Fig. S1c). This dependency is particularly strong when the length threshold is low. This is caused by a combinatorial explosion in non-coding ORFs erroneously added to TRs. On the other hand, while high thresholds yield high-confidence FS predictions, many genuine FS are not detected. We selected thresholds, 25 and 15 codons for the downstream and upstream TR extension, respectively, which corresponded to the sharpest changes in the curve slopes, so as to yield the maximal number of FSs with the minimal number of false negatives.
Additionally, we set a soft threshold for the minimal length of adjacent ORFs based on the distributions of protein identity values of the adjacent ORFs (Suppl. Fig. S1de). That is, we allowed a longer adjacent ORF to be added to the TR if it passed the conditions on the identity level. The rationale for that is that, while at the N-and C-ends of the protein alignments the identity levels tended to deviate largely for small window sizes, for larger windows (approximately >60 amino acids), this effect became negligible (Compare Suppl. Fig. S1g and Suppl. Fig. S1de). Thus, the longer ORFs were subjected solely to the threshold-based identity checks, and the shorter, to the checks of the uniformity of identity (see below).

Protein identity thresholds
For long adjacent ORFs (with lengths above the soft threshold), we apply the threshold equal to the lowest observed protein identity for primary ORFs, that is 0.65. In this case it is impossible to apply the filter on uniformity of identity along the protein sequence, as only few non-overlapping windows of the size equal to the ORF length fit into the protein, and the distribution of identity may not be built on these few data points.
Conversely, for short adjacent ORFs (with lengths between the hard and soft thresholds), the threshold on identity cannot be set, as identities at N-and C-termini may be lower than those of the middle parts of a protein (Suppl. Fig. S1de). However, a relaxed identity-based criterion can be set here, based on the prediction of identity at the termini given identity in the middle. For that, we considered 20317 (3) well-annotated groups of mammalian orthologs from the OMA database and aligned them with ClustalΩ (4). For each group of orthologs, we constructed a distribution of identities calculated in non-overlapping windows of size n (ranging from 3 to 60 amino acids), situated between nucleotides n and L-n, L being the alignment length, that is, in the middle part of the alignment. For each such distribution we used the obtained mean μ and standard deviation σ to obtain Z-scores for identity calculated in the remaining N-and C-terminal windows. Next, for each window size n, we constructed distributions of the obtained Z-scores across groups of orthologs and calculated one-sided 95% confidence intervals for these distributions (Suppl. Fig.  S1ab). In the alignments of euplotid transcripts, if the observed Z-score of the difference in the identity in an adjacent ORF and the predicted translated region was in the constructed 95% confidence interval for the respective window size, this ORF was accepted as an extension of the TR, and the respective FS or codon readthrough event was predicted. Otherwise this ORF was filtered out as non-coding.

d N /d S uniformity threshold
Our d N /d S uniformity threshold was defined as follows. The d N /d S values were calculated separately for the longest ORF and for an added ORF. Next, the obtained rates were compared with a permutation statistic and the respective p-value was obtained (Suppl. Fig. S1i). This p-value was considered as the uniformity parameter, for which a threshold value 0.85 was obtained from the bimodal distribution produced by all adjacent ORFs (Suppl. Fig. S1i). At that, as the d N /d S rates of the primary ORF and the extending ORFs are expected to come from the same distribution, only the left peak of the p-value distribution is considered.

Pipeline
The resulting procedure is as follows: 1.
For each stranded transcript from each purified euplotid transcriptome (see Methods: Transcriptome quality assessment), the euplotid ortholog from our dataset yielding the longest nucleotide alignment is obtained. If no ortholog has been found, the transcript is discarded. All nucleotides with no alignment to an orthologous sequence are discarded. If the primary ORF of the sequence is not aligned with the primary ORF of an ortholog, the transcript is discarded.

2.
For the primary ORF, a codon-wise alignment is built with TranslatorX(5). The d N /d S rate of the longest ORF is calculated (see Materials and Methods: Calculation of identity and d N /d S values). At this stage, we consider the primary ORF as a translated region (TR) and initialize the extension procedure, an iteration of which is described below (steps 3-5).

3.
ORFs corresponding to +1 and +2 frameshifting and stop-codon readthrough at the ends of the TR are considered. For each such adjacent ORF, its length is calculated. If the length of an adjacent ORF is smaller than the hard threshold of 15 amino acids for the 5'-end ORFs and 25 amino acids for the 3'-end ORFs, the ORF is discarded.

4.
For each remaining adjacent ORF, a codon-wise alignment with the ortholog is built and the protein identity and d N /d S rates are calculated.

5.
If the length of an adjacent ORF exceeds the soft length threshold of 60 codons, the minimal protein identity threshold of 0.65 is applied to the product of its translation. If this criterion is not satisfied, the ORF is discarded.

6.
If the length of an adjacent ORF is below the soft length threshold, the Z-score is calculated for the adjacent ORF protein identity value based on the parameters of the identity distributions for the entire TR. The Z-score threshold is determined from the length of an adjacent ORF using our results obtained for mammalian orthologs. ORFs with the Z-scores higher than obtained thresholds are discarded.

7.
If the length of an adjacent ORF is below the soft length threshold, the d N /d S uniformity parameter is calculated. If the calculated value is below 0.85, the respective ORF is discarded.

8.
If by this step no adjacent ORFs have passed filters 3, 5, 6, and 7, the TR is returned for the regarded transcript. Otherwise, among remaining adjacent ORFs, the longest one is selected and assumed translated with the respective event (+1 or +2 frameshifting or stop-codon readthrough). The ORF is added to the TR and steps 3-8 are repeated.

9.
The resulting set of translated ORFs constituting the TR and respective FS or stop codon readthrough events are the output. Supplementary Table S1.

Validation based on E. crassus RiboSeq data
To validate the procedure for predicting coding sequences and the obtained set of parameters, we used a RiboSeq dataset of E. crassus (see Materials and Methods: Sample collection and library preparation). Predicted coding ORFs for E. crassus can be verified by mapping the RiboSeq reads on the transcript. For each putatively coding ORF, if it is covered with a sufficiently high number of RiboSeq reads, the prediction is considered as true positive. Conversely, if an ORF in a predicted TR is not covered with reads or is covered with a significantly lower density of reads than the primary ORF, the prediction is considered a false positive. If a high number of reads covers a predicted untranslated region, it is considered a false negative. Otherwise, the prediction is true negative. Unfortunately the triplet periodicity of the RiboSeq data (13) was insufficient for the inference of the translated reading frame. Therefore we estimated only the ability of the algorithm to predict coding sequences regardless of the frame.
We analyzed how each parameter (Suppl. Tab. S1) influences the true-positive rates (TPRs) and the false-positive rates (FPRs) of +1 FS events in E. crassus, while keeping other parameters constant (Suppl. Fig. S2). Our parameters, selected independently using transcriptomic data, yield the cumulative TPR of 0.543 for 5'-end ORFs and of 0.508 for 3'-end ORFs, and the net FPR of 0.0.

Validation based on known properties of euplotid FSs
In previous studies, euplotid FSs were predicted in 2-30% of transcripts and shown to involve a specific A-rich trinucleotide motif (1, 6, 7). Using our algorithm, we predicted frameshift sites in 3.9% of transcripts (192 sequences out of 4903). At that, stop codons identified as Euplotes FSs feature the same strong nucleotide context as the one reported previously (1) (Figure 1b), whereas stop-codons predicted to be signals of translation termination do not fall within this motif (Figure 1b), as expected.

Validation based on known properties of euplotid transcripts
The d N /d S values of the predicted TRs fell into the 95% confidence interval of [0.044, 0.324] with the mean of 0.163, being similar to the ones calculated for model organisms, e.g. for the human population (8).
In addition, while the AT-content differed in coding regions, 3'-UTRs, and 5'-UTRs, there were no differences in the AT-content between the regions upstream and downstream of FSs (Figure 1e). This result fell in line with previously reported patterns of nucleotide composition around start and stop codons in ciliates (1, 7). The predicted 3'-UTRs, i.e. the sequences between the terminal stop codon and the poly-A tail, were extremely short, being on average 33 nt, and featured an enhanced density of stop codons, as expected (9) (Figure 1gh).

Code availability
The pipeline implementation written in Python 3.7 along with the data analysis protocols are available at https://github.com/sofyagdk/euplotes.          Dataset S1 (separate file). Coding regions of 4903 sequences studied Table containing 6 columns. Each row corresponds to a sequence from the orthology group alignments. column "orthology_filename" contains the unique filename of the orthology group file (Dataset S4), column "orthology_filename" contains a unique identifier of the sequence (same as in Dataset S5-S6), column "coding_region" contains hyphen-separated boundaries of transcribed regions, the even positions corresponding the start of the region (0-based), the odd positions corresponding to the position of stop codons. The column "main_frame" contains the hyphen-separated boundaries of the primary coding frame. The other two columns are the same as "coding_region" and "main_frame", but the coordinates of the coding regions are given for the respective columns in the alignment files.
Dataset S2 (separate file). Coding regions of Sequences with frameshifts Table contains sequences with frameshift sites. The first five columns are the same as in Dataset S1, the last column shows the number of frameshift sites in the particular sequence.

Dataset S3 (separate file). Orthologous sequence identifiers
Table containing 11 columns, rows correspond to orthology groups. Column "filename" contains the filename containing the alignment of the orthologous group, column "len" contains a number from 2 to 9 which shows the number of sequences in the orthology group. The remaining nine columns contain a unique sequence identifier of the sequence of the particular species, or None if the orthology group does not contain a sequence from this species.

Dataset S4 (separate file). Alignments of orthologous sequences
Archive named ortholog_alignments.tar.gz contains a folder with files. Each file contains one ortholog gene group, each ortholog gene group consists of 2 to 9 orthologous sequences of different Euplotes species analyzed.
Dataset S5 (separate file). Contamination information. Table with contaminating species, the accession codes of tested sequences, and the number of euplotid transcriptome sequences similar to some sequence in the contamination dataset, given for each species separately.