On distinguishing between canonical tRNA genes and tRNA gene fragments in prokaryotes

ABSTRACT Automated genome annotation is essential for extracting biological information from sequence data. The identification and annotation of tRNA genes is frequently performed by the software package tRNAscan-SE, the output of which is listed for selected genomes in the Genomic tRNA database (GtRNAdb). Here, we highlight a pervasive error in prokaryotic tRNA gene sets on GtRNAdb: the mis-categorization of partial, non-canonical tRNA genes as standard, canonical tRNA genes. Firstly, we demonstrate the issue using the tRNA gene sets of 20 organisms from the archaeal taxon Thermococcaceae. According to GtRNAdb, these organisms collectively deviate from the expected set of tRNA genes in 15 instances, including the listing of eleven putative canonical tRNA genes. However, after detailed manual annotation, only one of these eleven remains; the others are either partial, non-canonical tRNA genes resulting from the integration of genetic elements or CRISPR-Cas activity (seven instances), or attributable to ambiguities in input sequences (three instances). Secondly, we show that similar examples of the mis-categorization of predicted tRNA sequences occur throughout the prokaryotic sections of GtRNAdb. While both canonical and non-canonical prokaryotic tRNA gene sequences identified by tRNAscan-SE are biologically interesting, the challenge of reliably distinguishing between them remains. We recommend employing a combination of (i) screening input sequences for the genetic elements typically associated with non-canonical tRNA genes, and ambiguities, (ii) activating the tRNAscan-SE automated pseudogene detection function, and (iii) scrutinizing predicted tRNA genes with low isotype scores. These measures greatly reduce manual annotation efforts, and lead to improved prokaryotic tRNA gene set predictions.


Introduction
Canonical transfer RNAs (tRNAs) implement the genetic code by coupling mRNA codons to their respective amino acids during protein synthesis. Of the 64 codons in the genetic code, three generally serve as stop codons. The remaining 61 ('sense') codons are decoded into 20 amino acids by tRNAs with various complementary anticodons. Given that some anticodons can recognize more than one codon -through wobble base pairing [1,2] -organisms carry fewer than 61 types of tRNA [3]. Together with knowledge of wobble base pairing rules, systematic studies of tRNA gene sets have yielded important generalities regarding tRNA requirements for efficient translation; standard tRNA gene sets have been identified for each of the three domains of life [4,5]. In Archaea -the focus of this work -a standard set of 46 tRNA types, each encoded by a single gene copy, has been proposed (examples are depicted in Fig. 1A). Standard tRNA gene sets are poised to play a pivotal role in understanding tRNA gene set function and evolution. However, a range of reported deviations from these sets (e.g. [3,6]) may be masking their role in tRNA biology. Specifically, two types of tRNA gene set deviations can occur: (i) evolutionary adaptations of the translationally active tRNA gene set in some organisms (see below), and (ii) artifactual deviations due to imperfections in genome sequencing/assembly, or downstream computational analyses of putative canonical tRNA genes. This manuscript aims to distinguish between these two cases, using the tRNA gene sets carried by the archaeal Thermococcaceae family as an example.
As a result of remarkable progress in experimental methods for determining the genome sequences, automated annotation has become an essential tool for extracting biological information from sequence data. In the field of tRNA biology, several tRNA gene prediction tools are available [7][8][9][10][11][12][13], the most widely used of which is tRNAscan-SE [7,14,15]. tRNAscan-SE effectively performs the challenging tasks of identifying putative tRNA genes from input DNA sequence, and annotating each predicted gene -including the definition of the anticodon and amino acid carried (i.e. the tRNA isotype). Additionally, tRNAscan-SE distinguishes between three isotypes carrying the anticodon 5'-CAU-3': (i) initiator tRNA-Met(CAU), which serves in the initiation stage of translation, (ii) elongator tRNA-Met (CAU), which functions in peptide chain elongation, and (iii) tRNA-Ile(CAU), in which the anticodon is posttranscriptionally modified to recognize isoleucine codon 5'-AUA-3' [16].
The final output of tRNAscan-SE is a biochemically detailed, automatically annotated list of putative tRNA genes from the input DNA sequence (e.g. genome assembly, plasmid). The tRNAscan-SE output for a selected set of complete (or near-complete) genomes is collected in the publicly accessible Genomic tRNA database (GtRNAdb), with all predicted tRNA genes falling into one of five categories ('tRNAs decoding standard 20 AA', 'Selenocysteine tRNAs (TCA)', 'Possible suppressor tRNAs (CTA,TTA,TCA)', 'tRNAs with undetermined or unknown isotypes', and 'Predicted pseudogenes') [3,17]. The accumulation and comparison of complete canonical tRNA gene set data (i.e. tRNA genes in the first category) paves the way for conclusions regarding the evolution of tRNA gene sets across the tree of life [5,[18][19][20][21][22][23][24][25][26]. In addition, such information may have important medical implications [27][28][29][30][31]. For example, both prokaryotic and eukaryotic tRNA gene sets have been shown to evolve rapidly in response to changing translational demands [32][33][34], and a correlation has been demonstrated between tRNA gene copy number and bacterial growth rate [35]. However, conclusions regarding tRNA gene sets are dependent on the accuracy and proper use of the tRNA detection software.
Here, we evaluate the performance of tRNAscan-SE, through a detailed manual inspection of the tRNA genes it identifies and annotates, within a set of 20 genomes from the archaeal family Thermococcaceae. This family is an archaeal clade within the phylum Euryarchaeota (according to the National Center for Biotechnology Information (NCBI) taxonomy [36]), or the phylum Methanobacteriota (according to the Genome Taxonomy Database, GTDB [37]). The Thermococcaceae family consists of at least six genera (i.e. Pyrococcus, Thermococcus, Thermococcus A, Thermococcus B, Thermococcus C, and Palaeococcus) that are typically found in hydrothermal marine environments [38]. Members of the Thermococcaceae family reportedly adhere very closely to the standard tRNA gene set of Archaea [38, page 364]. Given this information, the proportion of Thermococcaceae genomes listed in GtRNAdb as encoding variant tRNA gene sets is -as described below -unexpectedly high (seven deviant genomes out of the 20 listed genomes = 35%). However, manual annotation of the predicted tRNA gene sets resolves many of the apparent differences, leaving only a single probably truly divergent tRNA gene set.

Thermococcaceae tRNA gene sets
All Thermococcaceae family tRNA gene sets available in GtRNAdb (Data Release 19; June 2021) [3] were downloaded (see Supplementary Table S6). Details of the 20 organismsincluding members of the Pyrococcus, Thermococcus, Thermococcus A, Thermococcus B, and Palaeococcus genera - are provided in Supplementary Table S1. In addition, the tRNA gene sets of these organisms were predicted from the genome assemblies (downloaded from NCBI) using locally run tRNAscan-SE (version 2.0.6, the same version that was reportedly used for the current GtRNAdb listings for all 20 organisms), with standard settings for Archaea (option −A) and pseudogene detection activated (see Supplementary Table  S7). To display the output, options −H and −−detail were added.

Manual annotation of tRNA gene sequences
Putative tRNA genes of interest were manually annotated using the following six steps: (i) Identify the 3' CCA terminus (nucleotides 74-76), the discriminator nucleotide (nucleotide 73) and the acceptor stem (nucleotides 1-7, base pairing with nucleotides 72-66) [39]. (ii) Identify the T-arm prior to nucleotide 66 (5′>>>>>UUCRANN<<<<<3′, where R is base A or G, and each >< indicates a base pair). (vi) Identify nucleotides 8-9 (between the acceptor stem and D-arm) and 26 (between the D-arm and anticodon arm), which play a role in realization of the three dimensional structure of tRNA molecules.

GtRNAdb lists divergent tRNA gene sets for seven of 20 Thermococcaceae genomes
The tRNA gene sets of 20 genomes belonging to the Thermococcaceae family are listed in GtRNAdb (Data Release 19; June 2021) (Supplementary Table S1). Our phylogenetic tree of these 20 genomes confirmed the phylogenyinformed taxonomy of the Thermococcaceae family established by Rinke et al [44] and accurately resolved the relationships between the five genera to which these genomes belong, with Palaeococcus forming the earliest diverging branch (Supplementary Figure S1).
All 20 of the genomes of interest are complete genomes assembled into single circular contigs, with an average size of 1.9 Mb (ranging from 1.7 Mb to 2.2 Mb) and a mean genomic GC content of 47% (ranging from 40% to 56%). Genome descriptions have been published for 18 of the 20 genomes (see Supplementary Table S1). Inspection of the associated GtRNAdb tRNA gene set entries revealed that, of the 20 genomes, only 13 (65%) are predicted to carry the standard archaeal tRNA gene set of 46 canonical tRNA types encoded by single-copy genes (see Introduction), while the remaining seven (35%) are predicted to differ. There are 15 distinct anomalies across the seven deviant genomes, consisting of twelve additional and three absent putative tRNA genes ( Table 1; Fig. 1A).
When the same set of genomes were analysed by running tRNAscan-SE locally (version 2.0.6, see Materials and Methods), six of the genes causing tRNA gene set deviations were flagged as possible pseudogenes. Inspection of the documentation for the entries of these genomes on GtRNAdb indicated that automated pseudogene detection was not activated during the associated tRNAscan-SE run, resulting in the inclusion of the six potential pseudogenes within the canonical, standard tRNA gene sets for the organisms involved.

Manual annotation reveals only a single divergent tRNA gene set
Given the unexpectedly high degree of deviation from the standard archaeal tRNA gene set, the 929 tRNA genes predicted across the 20 Thermococcaceae genomes were subjected to careful manual inspection (see Materials and Methods). This revealed that the 15 observed anomalies fall into three categories, each of which is described below (see also Table 1).

Category 1: Erroneously predicted tRNA genes arising from tRNA gene fragments
Eight of the 15 observed anomalies result from the erroneous listing of partial tRNA gene sequences as either standard tRNA genes (seven cases) or genes encoding tRNAs of undetermined isotypes (one case) in GtRNAdb ( Table 1). As described further below, we found that the partial tRNA sequences are the result of two processes: (i) the acquisition of fragments of tRNA genes from viral genomes by CRISPR-Cas activity (accounting for one anomaly; category 1a in Table 1), and (ii) the partial duplication of tRNA genes during the integration of horizontally transferred genomic elements (accounting for seven anomalies; category 1b in Table 1).
Thermococcus nautili 30-1 (NCBI Reference Sequence NZ_CP007264.1) is predicted to carry two distinct genes encoding Leu-CAA (positions 10,012-10,099 and 706,451-706,668). The first of these has characteristics typical of canonical leucyl-tRNA genes (e.g. 88-bp long, isotype score 134.8), while the second does not (e.g. 218-bp long, very low isotype score of 2.3). Furthermore, the RNA product of the second is predicted to form an unusual secondary structure, in which the T-arm is the only readily identifiable tRNA feature (see Supplementary Text S1). Closer inspection of the T. nautili 30-1 genome reveals that the second, atypical Leu-CAA sequence occurs within a CRISPR array of over 40 identical short repeats interspersed with unique spacer sequences that are derived from foreign, invading DNA ( Fig. 2A) (for a recent review of CRISPR-Cas systems, see [45]). The tenth spacer of this array appears to be derived from a viral tRNA gene, encompassing the T-arm and one side of the acceptor stem ( Fig. 2B-C). A search for nucleotide sequence matches to the spacer-10 sequence and viral DNA sequences (using the Nucleotide Basic Local Alignment Search Tool, blastn) revealed a good match between the spacer and the Thr-TGT gene of several isolates from the Myoviridae family, a family of bacteriophage known to infect Archaea and Bacteria (Fig. 2D). Based on these results, we conclude that the tRNA gene fragment in T. nautili 30-1 is derived from a viral tRNA gene. While tRNA genes have been documented in a range of bacteriophage genomes [46,47], CRISPR spacers do not commonly contain viral tRNA gene fragments [48,49]; to the best of our knowledge, this is the first report of a tRNA gene fragment appearing in a CRISPR-Cas array. Together, these results are consistent with the CRISPR-Cas-mediated incorporation of a viral tRNA gene fragment into the T. nautili 30-1 genome, which is erroneously listed as a canonical, standard tRNA gene in GtRNAdb.
Inspection of a further seven of the deviating tRNA genes revealed that they involve fragments of tRNA genes resulting from insertion of mobile genetic elements into the genome. A particularly clear example of such an element has been  previously reported in Pyrococcus yayanosii CH1 (NCBI Reference Sequence NC_015680.1; [50]), one of the 20 Thermococcaceae genomes: the ~21.4 kb mobile genetic element PYG1 lies between partial and complete Gln-TTG copies ( Fig. 3A-C) [51]. Both the partial and complete Gln-TTG sequences are listed in GtRNAdb as distinct, standard tRNA genes (Fig. 3D-E). Our dataset includes a further six examples of partial tRNA genes resulting from similar integration events, with evidence for the integration of between ~9 kb and ~23 kb of viral DNA into various archaeal tRNA genes (see Supplementary Table S2). These findings are in line with published results showing that the incorporation of mobile genetic elements (e.g. temperate phages, integrative and conjugative elements (ICEs)) into prokaryotic genomes is a widely occurring source of tRNA gene fragments; tRNA genes are common targets for the integration of such elements, presumably as a result of tRNA gene stability and pervasiveness [52][53][54]. Indeed, manual inspection of arbitrary prokaryotic GtRNAdb entries rapidly revealed numerous instances of similar partial tRNA genes in other Archaea, and Bacteria, many of which are erroneously listed as standard, canonical tRNA genes on GtRNAdb (for examples from  various taxa, see Supplementary Table S3). The ICE integration process typically results in one complete and one partial copy of the target tRNA gene at opposing ends of the integrated element, providing a reliable method for their delineation [55]. Notably, the identification of these types of partial tRNA genes, and their full tRNA gene counterparts, can also provide evidence of genome rearrangement events. For example, our analysis indicates the integration of a ~20.7 kb prophage into the canonical Arg-TCT gene of the common ancestor of Thermococcus gammatolerans EJ3 and Thermococcus kodakarensis KOD1, followed by an inversion event in the KOD1 lineage that led to the occurrence of the partial and full tRNA gene copies on opposite strands of the chromosome, and a considerable distance apart (~179 kb; see Supplementary Table S2).

Category 2: Erroneously assigned tRNA isotypes resulting from ambiguous input sequence
The Thermococcus litoralis DSM 5473 genome (NCBI Reference Sequence NC_022084.1; [56]) contains seven of the 15 anomalies, including all three of the missing tRNA genes (Leu-CAG, Pro-GGG, Gln-CTG) and four additional tRNA genes (Leu-NAG, Pro-NGG, Und-NTG, Leu-CAA) ( Fig. 1A; Table 1). Careful analysis of the primary and secondary structures strongly indicates that three of the additional tRNA genes account for the three missing tRNA genes: additional gene Leu-NAG is the apparently missing gene Leu-CAG, Pro-NGG is Pro-GGG, and Und-NTG is Gln-CTG (see Supplementary Text S1 for tRNA structures). In each case, the presence of one or more N (i.e. identity unknown) nucleotides in the tRNA gene sequence -and the T. litoralis DSM 5473 genome sequence -hampers the ability of tRNAscan-SE to predict the anticodon and/or isoacceptor family of the putative tRNA gene. Together, these errors account for a further six of the anomalies ( Table 1). The seventh anomaly is described above in category 1b (a tRNA gene fragment resulting from the integration of incoming DNA). Hence, after manual annotation, the T. litoralis DSM 5473 genome is predicted to encode the standard archaeal tRNA gene set.

Category 3: Legitimate, additional tRNA genes
In the two categories above, manual annotation has accounted for 14 of the 15 observed tRNA gene set anomalies. The remaining anomaly is an additional copy of the Ala-TGC gene in Palaeococcus pacificus DY20341 (NCBI Reference Sequence NZ_CP006019.1; [57]). The two Ala-TGC gene copies reportedly occur within distinct, distally separated ribosomal RNA (rrn) operons, each consisting of a 16S gene, Ala-TGC, and a 23S gene. With the exception of an additional 32bp stretch of DNA in one of the 23S genes, the two reported rrn loci are identical in primary sequence (including Ala-TGC). Notably, the other 19 Thermococcaceae genomes each encodes only a single rrn operon (and hence only one Ala-TGC gene; see Supplementary Table S4) [58]. These results are consistent with intra-genomic duplication of the rrn operon in Pa. pacificus DY20341, leading to an actually divergent tRNA gene set in this taxonomically distinct organism (see Supplementary Figure S1). However, it should be noted that long stretches of nearly identical DNA sequence (such as rrn operons) pose considerable technical challenges for genome sequencing and assembly, and hence that rrn copy numbers (and any tRNA genes contained) may be inaccurate [59,60].
The results so far show that, after manual annotation, the tRNA gene sets encoded in the 20 Thermococcaceae genomes adhere closely to the standard archaeal tRNA gene set of 46 single-copy tRNA genes; only a single probable deviationduplication of Ala-TGC in Pa. pacificus DY20341 -remains (Fig. 1B).

The tRNAscan-SE isotype score as an indicator of non-canonical tRNA genes
Manual annotation requires expert knowledge of tRNA structure, and detailed manual annotation may be too time consuming for very large data sets. It is therefore desirable to devise automated methods for distinguishing between canonical and non-canonical tRNA genes. As reported above, the automated detection of possible pseudogenes offered by tRNAscan-SE is valuable in this regard, as it flags multiple tRNA gene fragments identified here as probable noncanonical tRNA genes. In addition, during our manual annotation process, we noted lower than usual isotype scores for non-canonical tRNA genes. Specifically, nine predicted tRNA genes contributing to the deviations of the standard archaeal tRNA gene set exhibited relatively low isotype model scores (see Table 1). Therefore, we performed a more systematic analysis of the value of both approaches (pseudogene detection feature and isotype score) for detecting non-canonical tRNA genes. To this end, we obtained the isotype scores of all predicted tRNA genes for the 20 Thermococcaceae genomes from GtRNAdb and locally run tRNAscan-SE (version 2.0.6; see Materials and Methods), and extracted the tRNA genes flagged as possible pseudogenes from the output of locally run tRNAscan-SE.
In each case, 929 tRNA genes were predicted, with isotype scores ranging between 2.3 and 138.2 (median = 115.8; Fig. 4A-B; Supplementary Table S6). While the isotype scores of most predicted tRNA genes fell into a Gaussian-like large peak with a maximum of ~120, eight putative tRNA geneseach corresponding to a partial tRNA gene sequence listed in Table 1 -received low isotype scores (<85; Fig. 4C-D). One of these low-scoring putative tRNA genes -Thermococcus sp. ES1 Und-NNN (127-bp long; isotype score 16.8) -is called as a non-canonical tRNA of undetermined or unknown isotype by tRNAscan-SE (and in GtRNAdb). A further six were identified as pseudogenes by locally run tRNAscan-SE, yet listed as standard tRNA genes in GtRNAdb. The lack of pseudogene definition in GtRNAdb appears to result from the active disablement of pseudogene checking by tRNAscan-SE (as can be seen under the 'Run Options/Stats' tab for individual organisms listed in GtRNAdb). The decision to deactivate the possible pseudogene detection feature when scanning archaeal (and bacterial) genomes for the current GtRNAdb listings was made in order to avoid misclassifying some low-scoring, but likely canonical, putative tRNA genes as pseudogenes (Todd Lowe, personal communication).
The final relatively low-scoring putative tRNA gene -Val-CAC of T. kodakarensis KOD1 (isotype score 83.4) -is called as a standard tRNA gene by both tRNAscan-SE and GtRNAdb. However, manual inspection of the primary sequence and the predicted secondary structure reveals defects in the acceptor stem and D-arm that render the putative tRNA highly unlikely to be canonically functional (see  Table 1. tRNA gene categories are generated by tRNAscan-SE. Note that in panels A and C (GtRNAdb) the pseudogene category ('pseudo') is absent, due to disabling of pseudogene detection function (see 'Run Options/Stats' tab in GtRNAdb listings).
Supplementary Text S1). The tRNA with the next lowest isotype model score was Pyrococcus furiosus DSM 3638 Arg-GCG (isotype score 88.1). Given that this single-copy tRNA gene encodes an essential tRNA isotype, it is highly likely to give rise to a functional, canonical tRNA. Indeed, its relatively low isotype score can be accounted for by the absence of a single nucleotide in the acceptor stem (G70), which could arise from a point mutation or an error in the DSM 3638 reference genome sequence (see Supplementary Text S1). Overall, these results show that for the tRNA gene dataset from the 20 Thermococcaceae genomes, a combination of (i) the automated detection of probable pseudogenes/fragments, (ii) the tRNAscan-SE isotype score, and (iii) manual inspection of tRNA genes with borderline isotype scores (~60-90) can be used as a valuable indicator of the nature of putative tRNA genes (canonical versus non-canonical).
To investigate whether the above observations extend to a wider archaeal dataset, the distribution of isotype scores for all putative tRNA genes in 210 complete (or nearly complete) archaeal genomes listed in GtRNAdb were examined. For these 210 organisms, approximately 10,000 putative tRNA genes were listed in GtRNAdb and also predicted by locally run tRNAscan-SE (the latter with pseudogene detection activated; see Materials and Methods). Isotype scores ranged between −14.5 and 155.9, with a median of 105.9 ( Fig. 5; Supplementary Table S7). As observed for the 20 Thermococcaceae genomes, the distribution of isotype scores for the tRNAs encoded in the 210 archaeal genomes showed a large Gaussian-like peak near 110 and a long tail of lowscoring putative tRNA genes. tRNAscan-SE predicted 839 putative tRNA genes scoring below 85, accounting for ~8% of all predicted tRNA genes, and ~94% of all predicted possible pseudogenes. Notably, tRNAscan-SE predicts 742 standard tRNA genes with isotype scores below 85. On closer inspection, many of these are unlikely to be canonically functional (e.g. are predicted to encode truncated tRNAs, or tRNAs with unusual secondary structures). Yet some, particularly those with isotype scores between ~60 and ~90, are predicted to form tRNAs that may indeed be functional (for an example, see Thermoplasmatales archaeon BRNA1 Ala-CGC in Supplementary Text S1). We also examined the value of other types of scores generated by tRNAscan-SE (e.g. HMM score, infernal score) in identifying tRNA gene fragment but, compared to the isotype score, these resulted in a less clear division of canonical and non-canonical tRNA genes (see Supplementary Figure S2). In summary, while both the automated pseudogene detection and isotype scores are clearly informative, they currently do not offer a watertight separation of canonical and non-canonical tRNA genes in Archaea.

Discussion
In this work, we have performed a detailed manual analysis of the computationally predicted tRNA gene sets listed in GtRNAdb for each of 20 high quality genome sequences from the archaeal Thermococcaceae family. Foremost, we highlight that the tRNAscan-SE software has done a truly impressive job of identifying putative tRNA genes in these organisms; with a degree of manual curation, sensible and translationally complete canonical tRNA gene sets are recognizable in all 20 organisms (see Fig. 1B). Additionally, and arguably equally usefully, tRNAscan-SE identifies a number of non-canonical tRNA gene fragments resulting from biologically relevant and widespread phenomena: CRISPR-Cas activity and the integration of incoming genetic elements (see Table 1). However, the process of distinguishing between these two categories (canonical and non-canonical), particularly on the widely-used, public database GtRNAdb, leaves some room for improvement.
Our analyses provide insights that allow progress towards three goals: (i) obtaining estimates of the accuracy with which tRNAscan-SE predicts, annotates, and categorizes putative tRNA genes, (ii) identifying specific sources of error, with the aim of highlighting specific avenues for possible further improvement of tRNAscan-SE and/or interpretation of the output (both by GtRNAdb and independent users), and (iii) gaining insight into the properties and evolutionary dynamics of archaeal tRNA gene sets. Each of these is discussed below.

(i) The accuracy of automated tRNA gene prediction, annotation, and categorization on GtRNAdb
In eleven independent cases (see Table 1, rows 1-11), our manual curation process yielded conclusively different annotations of predicted tRNA genes than those obtained by the automated process (listed in GtRNAdb). These included the complete removal of eight predicted tRNA genes from the canonical tRNA gene sets, and the re-annotation of a further three to other -previously absent -isotypes (Table 1, rows 9-11). Overall, fourteen apparent deviations were removed from a set of 929 predicted tRNA genes, yielding an error rate of ~1.5% for the 20 Thermococcaceae GtRNAdb listings.

(ii) Observed sources of error in automated tRNA gene annotations, and possible solutions
When predicting the above tRNA gene sets, one source of error was the annotation of tRNA-like sequences as standard tRNA genes (or, less commonly, tRNA genes of unknown or unidentified isotypes) in GtRNAdb. Eight such instances were detected within our dataset of 20 Thermococcaceae genomes (see Table 1). The existence of these sequences was attributed to two distinct mechanisms: the integration of horizontally transferred genetic elements into host tRNA genes (seven instances), and CRISPR-Cas activity (one instance).
The first mechanism -integration of genetic elements -is a general process in both Archaea and Bacteria [52,53] and, as such, similar tRNA gene-like remnants are expected to be a pervasive feature of prokaryotic genomes. Indeed, in addition to the seven examples detected in our Thermococcaceae dataset, we manually identified numerous tRNA gene fragments -many of which are mis-categorized on GtRNAdb as canonical tRNA genes -resulting from phage integration events in Archaea and Bacteria (see Supplementary Table S3). For example, GtRNAdb lists Pseudomonas fluorescens SBW25 as carrying two copies of the Cys-GCA gene, with these occurring on either side of a putatively identified ~121 kb genomic island [61]. One copy is low-scoring (tRNA-Cys-GCA-2-1, isotype score 29.0), and the corresponding putative tRNA was not detected in the mature tRNA pool of this bacterium [32]. Notably, a third possible source of genomic sequences with tRNA-like features in Bacteria is transfer-messenger RNA (tmRNA) genes [62]. Such tmRNA genes encode RNA products of a few hundred base pairs, which perform a role in the quality control of translation (as opposed to elongation). Since tmRNA molecules carry some tRNA features, they have the potential to be identified and annotated by tRNAscan-SE. An example is seen in the GtRNAdb listing for Thiobacillus denitrificans ATCC 25259: tRNA-Und-NNN-3-1 (91-bp; isotype score 23.7) is actually the 3' end of a tmRNA gene.
The accurate detection and annotation of pervasive tRNA-like genomic sequences poses a clear challenge to any tRNA identification software. To minimize the resulting overestimates of canonical tRNA genes in Archaea and Bacteria, we recommend that users predict canonical tRNA gene sets -or indeed, non-canonical tRNA gene sets -of interest using the following process: (i) prescreening the genomes of interest for integrative genetic elements, tmRNA genes, and CRISPR loci, and flagging any partial and full tRNA sequences associated with such regions (numerous computational databases and prediction tools are available for the identification of such elements [8,55,[63][64][65][66][67][68][69]), (ii) using locally run tRNAscan-SE, paying particular attention to the pseudogene detection settings, and (iii) manually inspecting tRNAscan-SE output, using the tRNAscan-SE isotype score to focus on tRNA genes with isotype scores below ~85 (many of which are likely to encode non-canonical tRNA genes), and any possible pseudogenes detected. Further, we suggest that the addition of two new tRNA gene categories to the output of tRNAscan-SE -attR (in line with established nomenclature denoting the ends of integrative elements; e.g. [54]), and tmRNA -would be both useful and biologically meaningful.
A second source of error in our dataset was the presence of N nucleotides in input genome sequence leading to the misannotation of putative tRNA genes. Every such misannotation generated two errors: a false positive tRNA gene (due to the erroneous presence of the mis-annotated tRNA gene), and the corresponding false negative (due to the erroneous absence of the correctly annotated tRNA gene). Three examples were detected in our dataset, resulting in six errors (Table 1). This source of error could be minimized by a combination of users excluding (or, at a minimum, noting) genomes containing significant numbers of Ns, and/or by the addition of a note to the tRNAscan-SE output when Ns are detected within a putative tRNA gene sequence.

(iii) The properties and evolutionary dynamics of archaeal tRNA gene sets
The results reported here provide insight into the evolutionary dynamics of tRNA gene sets in Archaea. Primarily, while the GC content of the 20 analysed genomes varies widely (between 40% and 56%; see Supplementary Table S1), the predicted tRNA gene sets exhibit remarkable stability; with the likely exception of a single duplicate tRNA gene in Pa. pacificus DY20341, all 20 organisms contain a tRNA gene set encoded by 46 single-copy tRNA genes (see Fig. 1B). These observations are consistent with previous suggestions of a 'standard' archaeal tRNA gene set [4,5], with occasional, small -and perhaps evolutionarily fleetinginstances of divergence.

Disclosure statement
No potential conflict of interest was reported by the author(s).