Evolution of group II introns

Present in the genomes of bacteria and eukaryotic organelles, group II introns are an ancient class of ribozymes and retroelements that are believed to have been the ancestors of nuclear pre-mRNA introns. Despite long-standing speculation, there is limited understanding about the actual pathway by which group II introns evolved into eukaryotic introns. In this review, we focus on the evolution of group II introns themselves. We describe the different forms of group II introns known to exist in nature and then address how these forms may have evolved to give rise to spliceosomal introns and other genetic elements. Finally, we summarize the structural and biochemical parallels between group II introns and the spliceosome, including recent data that strongly support their hypothesized evolutionary relationship.


Introduction
Investigating the evolution of mobile DNAs involves unique challenges compared to other evolutionary studies. The sequences of mobile DNAs are usually short and evolve rapidly, resulting in limited phylogenetic signals. The elements often transfer horizontally, which prevents the linkage of their evolution to that of their host organisms or other genes in the organism. Finally, many mobile elements themselves consist of multiple components that may have different evolutionary histories. All of these complicating factors apply to group II introns and must be considered when trying to understand their evolutionary history.
Group II intron retroelements consist of an RNA and a protein component. The RNA is a ribozyme (catalytic RNA) that is capable of self-splicing in vitro, while the intron-encoded protein (IEP) is a reverse transcriptase (RT) whose open reading frame (ORF) is contained within the intron RNA sequence [1][2][3][4][5][6]. The two components cooperate intricately to carry out a series of inter-related reactions that accomplish intron splicing and retromobility. In addition to the 2- to 3-kb retroelement form, group II introns have evolved into many variant forms and spread throughout all domains of life. They are present in bacteria, archaebacteria, mitochondria, and chloroplasts but are notably excluded from nuclear genomes, with the exception of presumably inert sequences transferred to the nucleus as segments of mitochondrial DNA [7,8].
Group II introns have attracted considerable attention, in part due to their hypothesized relationship to eukaryotic pre-mRNA introns. The purpose of this review is to carefully consider the evidence available regarding the evolutionary history of group II introns. We present a summary of the multiple types of group II introns known to exist in nature and discuss a model for how the variant forms arose and subsequently evolved into spliceosomal introns and other elements.

Structure and properties of group II introns
The biochemical and genetic properties of group II introns have been described in depth elsewhere [1,3,5,6,9-14] and are summarized briefly here. Of the 2- to 3-kb intron sequence, the RNA component corresponds to approximately 500 to 900 bp, split between the first approximately 600 bp and the last approximately 100 bp of the intron sequence (red shading in Figure 1A). After transcription, the RNA folds into a complex structure that carries out splicing [12,14-18]. There is little conservation of primary sequence among all group II intron RNAs, but the introns fold into a common secondary structure that consists of six domains (Figure 1B). Domain I is very large and comprises about half of the ribozyme. Among other roles, it serves as a structural scaffold for the entire ribozyme and, importantly, recognizes and positions the exon substrates for catalysis [19][20][21]. Domain V is a small, highly conserved domain that contains the so-called catalytic triad AGC (or CGC for some introns), which binds two catalytically important metal ions [22,23]. Domain VI contains the bulged A motif that is the branch site during the splicing reaction. Splicing is accomplished by two transesterification reactions that produce ligated exons and an excised intron lariat (Figure 2A) [24,25]. For some group II introns, the RNA component alone can self-splice in vitro under appropriate reaction conditions, typically with elevated concentrations of magnesium and/or salt.
The IEP is encoded within the loop of the RNA domain IV ( Figure 1) and is translated from the unspliced precursor transcript. The IEP contains seven sequence blocks that are conserved across different types of RTs, as well as the X domain that is the thumb structure of the RT protein but is not highly conserved in sequence ( Figure 1A) [26][27][28][29]. Downstream of domain X are DNA binding (D) and endonuclease (En) domains, which are critical for retromobility [30][31][32][33].
Both the RNA and IEP are required for splicing and mobility reactions in vivo. The translated IEP binds to the unspliced intron structure via the RT and X domains, which results in RNA conformational adjustments leading to splicing (Figure 2A) [34][35][36][37][38]. The role of the IEP in splicing is known as maturase activity because it results in maturation of the mRNA. After splicing, the IEP remains bound to the lariat to form a ribonucleoprotein (RNP) that is the machinery that carries out a retromobility reaction [35,39].
For most group II introns, the mobility reaction is highly specific to a defined target sequence of approximately 20 to 35 bp known as the homing site. The mechanism of mobility is called target-primed reverse transcription (TPRT) [6,10,31,40-44]. The RNP first recognizes and unwinds the two strands of the target, and the intron RNA reverse splices into the top strand of the DNA (Figure 2B). The reaction is the reverse of splicing but utilizes DNA exons rather than RNA exons, and so part of the target site specificity comes from the intron-binding site 1 (IBS1)-exon-binding site 1 (EBS1), IBS2-EBS2, and δ-δ′ pairings between the intron RNA and DNA exons. The IEP facilitates reverse splicing in the same way as it does forward splicing, that is, by helping the ribozyme fold into its catalytic conformation. In addition, the IEP contributes to target site specificity through interactions of its D domain with the DNA exons. The bottom strand of the target DNA is cleaved by the En domain, either 9 or 10 bp downstream of the insertion site, to create a 3′ OH that is the primer for reverse transcription of the inserted intron [31,45]. Repair processes convert the inserted sequence to double-stranded DNA, although the repair activities involved differ across host organisms [46][47][48].
Figure 1 (legend, continued): Watson-Crick pairing interactions that are important for exon recognition are IBS1-EBS1, IBS2-EBS2, and δ-δ′ (for IIA introns), which are shown with teal, orange, and brown shadings, respectively, and connected with black lines. For IIB and IIC introns, the 3′ exon is recognized instead through an IBS3-EBS3 pairing (not shown). The ε-ε′, λ-λ′, and γ-γ′ interactions are also indicated, because they have potential parallels in the spliceosome (Figure 5); other known tertiary interactions are omitted for simplicity. Both the RNA and DNA structures depicted correspond to the L. lactis ltrB intron. EBS, exon-binding site; IBS, intron-binding site; ORF, open reading frame.
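The base-pairing contribution to target recognition (for example, EBS1 in the intron RNA pairing with IBS1 in the DNA exon) can be illustrated with a minimal Python sketch. The function name and the example sequences are hypothetical and chosen only for illustration; they are not taken from any real intron. The sketch counts strict Watson-Crick pairs only and ignores wobble pairs and the protein-DNA contacts made by the IEP.

```python
# Watson-Crick partners for an RNA base pairing with a DNA base.
# Assumption: wobble pairs (for example, G-T) are not counted in this sketch.
RNA_DNA_PAIRS = {"A": "T", "U": "A", "G": "C", "C": "G"}


def count_watson_crick_pairs(ebs_rna, ibs_dna):
    """Count Watson-Crick pairs between an EBS (RNA, written 5'->3')
    and an IBS (DNA, written 5'->3').

    The two strands pair in antiparallel orientation, so the IBS is
    read in reverse while the EBS is read forward.
    """
    if len(ebs_rna) != len(ibs_dna):
        raise ValueError("EBS and IBS must have the same length")
    return sum(
        RNA_DNA_PAIRS.get(rna_base) == dna_base
        for rna_base, dna_base in zip(ebs_rna.upper(), reversed(ibs_dna.upper()))
    )


# Hypothetical 6-nt example: a fully complementary site and a site with one mismatch.
perfect = count_watson_crick_pairs("GUUGUG", "CACAAC")
one_mismatch = count_watson_crick_pairs("GUUGUG", "CACAAT")
```

A scoring scheme of this kind captures only the RNA-DNA pairings; as noted above, the D domain of the IEP also contributes to target site specificity.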
Relevant to this review is a key distinction in the character of group II introns in bacteria compared to introns in mitochondria and chloroplasts. In bacteria, the introns behave mainly as mobile DNAs that survive by constant movement to new genomic sites, whereas in organelles, they are less mobile [5,49,50]. This can be inferred from genome sequences because the majority of intron copies in bacteria are truncated or inactivated, and many are surrounded by other mobile DNAs [49,51]. Most bacterial introns are located outside of housekeeping genes so that their splicing does not greatly affect the host biology. On the other hand, in organelles, group II introns are almost always located in housekeeping genes, which necessitates that they splice efficiently [1,15]. Organellar introns are rarely truncated and frequently have lost mobility properties altogether to become splicing-only entities. As opposed to bacterial introns, organellar introns have taken up a more stable residence in genomes, potentially assuming roles in gene regulation because their splicing factors are under nuclear control (below).
Figure 2 (legend, partial): (A) The splicing reaction. Splicing is intrinsically RNA-catalyzed and occurs for naked RNA in vitro; however, under physiological conditions, the IEP is required as well. The IEP binds to the RNA structure to enable it to adopt its catalytic conformation and accomplish splicing. In the first transesterification step of splicing, the 2′ OH of the branch site adenosine initiates nucleophilic attack on the 5′ splice junction, yielding cleaved 5′ exon and a lariat-3′ exon intermediate. In the second transesterification, the 3′ OH of the 5′ exon attacks the 3′ splice site to form ligated exons and intron lariat. The IEP remains tightly bound to the lariat to form a mobility-competent RNP particle. (B) The mobility reaction, known as target-primed reverse transcription (TPRT). The RNP product of splicing recognizes the DNA target site and reverse splices into the top strand. The En domain cleaves the bottom strand and the free 3′ OH is the primer for reverse transcription. Host repair activities, which vary across organisms, complete the process. IEP, intron-encoded protein.

Major classes of group II introns
The varieties of group II introns can be classified either according to their RNA or IEP components. Group II introns were initially classified as IIA or IIB based on the RNA sequence and secondary structure characteristics of introns in mitochondrial and chloroplast genomes [15]. A third variation of RNA structure was subsequently identified in bacteria, IIC [52,53]. These three classes each exhibit considerable variation, especially IIB introns, and classes can be further subdivided (for example, IIB1 and IIB2) [15,54]. The most prominent difference among IIA, IIB, and IIC ribozymes is the mechanism of exon recognition, because each class uses a distinct combination of pairing interactions to recognize the 5′ and 3′ exons (that is, different combinations of IBS1-EBS1, IBS2-EBS2, IBS3-EBS3, and δ-δ′ pairings [15,17,19,21,55]).
Alternatively, group II introns can be classified according to phylogenetic analysis of their IEP amino acid sequences. Eight IEP classes have been defined: mitochondrial-like (ML), chloroplast-like (CL), A, B, C, D, E, and F [28,50,56]. The two classification systems are useful for different purposes. Classes IIA, IIB, and IIC apply to all introns regardless of whether they encode an IEP, whereas the IEP-based classes are more specific and correspond to phylogenetic clades. The correspondence between the ribozyme and IEP classifications is shown in Table 1. IIA and IIB introns are found in bacteria, mitochondria, and chloroplasts, while IIC introns are only present in bacteria [15,49,53,57]. Among IEP-classified introns, all forms are found in bacteria, whereas only ML and CL introns are found in mitochondria and chloroplasts ( Table 2). There is some relation between IEP classes and host organisms. For example, within bacteria, CL2 introns are almost exclusively found in Cyanobacteria, while class B introns are found exclusively in Firmicutes [50,51].

Intron variations that deviate from the 'standard' retroelement form
Reconstructing the evolution of group II introns requires an accounting of all known intron forms and their distribution. Here, we describe the range of variants that differ from the 'standard' retroelement form diagrammed in Figure 1.
Introns lacking En domains in the IEP
Approximately a quarter of group II intron IEPs in organelles and over half in bacteria lack an En domain [44,50,51], including all introns of classes C, D, E, and F and a minority of CL introns (Figure 3B). The En domain belongs to the prokaryotic family of H-N-H nucleases [30,58], suggesting that the En domain was appended to an ancestral IEP that had only RT and X domains. If true, then at least some of the lineages of En-minus introns (classes C, D, E, F) represent a form of group II introns that predated acquisition of the En domain.
With regard to mobility mechanisms, En-minus introns are unable to form the bottom strand primer and require an alternative pathway. It has been shown for these introns that the primer is provided by the leading or lagging strand of the replication fork during DNA replication [33,59-62]. Some En-minus introns (namely, IIC/class C) use a different specificity in selecting DNA target sites. Rather than recognizing a homing site of 20 to 35 bp, IIC introns insert at the DNA motifs of intrinsic transcriptional terminators, while a smaller fraction inserts at the attC motifs of integrons (imperfect inverted repeat sequences that are recognized by the integron's integrase) [49,52,63-69].
Introns with 'degenerated' IEPs that have lost RT activity
Among mitochondrial and chloroplast introns, many IEPs have lost critical RT domain residues (for example, the active site motif YADD) or lost alignability altogether to some of the conserved RT motifs (for example, trnKI1 in plant chloroplasts, nad1I4 in plant mitochondria, and psbCI4 in Euglena chloroplasts) (Figure 3C) [27,28,70,71]. These divergent IEPs have undoubtedly lost RT activity and presumably have lost mobility function as well, although the splicing (maturase) function likely endures [27]. A well-studied example is the chloroplast IIA intron trnKI1, which is located in an essential tRNA-Lys gene. The IEP encoded by this intron, MatK, aligns with other RTs only across motifs 5 to 7, with the upstream sequence being unalignable with motifs 0 to 4; however, the domain X sequence is clearly conserved, suggesting the maintenance of the maturase function [27,44]. MatK has been shown biochemically to bind to multiple chloroplast IIA introns, supporting the hypothesis that it has evolved a more general maturase activity that facilitates splicing of multiple IIA introns in plant chloroplasts [70,72].
In bacteria, degenerations of the IEP sequences are rare because the great majority of non-truncated intron copies are active retroelement forms. The only known example is O.i.I2 of Oceanobacillus iheyensis, which encodes an IEP of the ML class that lacks the YADD and other motifs. The fact that the ORF has not accumulated stop codons suggests that it retains maturase activity, particularly because its exons encode the DNA repair protein RadC [50].
Introns with LAGLIDADG ORFs
A small set of group II introns do not encode RT ORFs but instead encode proteins of the family of LAGLIDADG homing endonucleases (LHEs) and are presumably mobile through a distinct pathway that relies on the LHE (Figure 3D). LHEs in group II introns were first identified in several fungi, although an example has since been identified in the giant sulfur bacterium Thiomargarita namibiensis [73][74][75][76]. LHEs are a well-studied class of mobility proteins associated with group I introns, and they promote mobility by introducing double-stranded DNA breaks at alleles that lack the introns [2]. Consistent with this role, the LAGLIDADG ORFs in group II introns of the fungi Ustilago and Leptographium were shown biochemically to cleave intronless target sequences [77,78]. However, the IEP of Leptographium did not promote splicing of the host intron, as occurs for some group I intron-encoded LHEs [77,79]. To date, all identified LHE-encoding group II introns in both mitochondria and bacteria belong to the IIB1 subclass and are located in rRNA genes [73,80].
Introns without IEPs
Group II introns without IEPs have lost retromobility properties and exist as splicing-only elements (Figure 3E). They are present in both bacteria and organelles but are especially prevalent in mitochondrial and chloroplast genomes [15]. For example, in angiosperms, there are approximately 20 ORF-less group II introns in each mitochondrial and chloroplast genome [70,71,81,82]. These plant organellar introns have been inherited vertically for over 100 million years of angiosperm evolution, consistent with their lack of a mobility-promoting IEP. Because the introns are situated in housekeeping genes in each organelle, efficient splicing is enabled by many splicing factors supplied by the host cells (below). In organellar genomes of fungi, protists, and algae, ORF-less group II introns are also common but less prevalent than in plants. Many of these introns contain remnants of IEP sequences, pointing to a sporadic and ongoing process of loss of the IEP and retromobility [53,83-86].
In bacteria, ORF-less group II introns are rare. Among the known examples, the ORF-less introns nearly always reside in genomes containing related introns whose IEPs may act in trans on the ORF-less introns [50]. Splicing function in trans has in fact been demonstrated experimentally for an IEP in a cyanobacterium [87]. The sole known exception to this pattern is the C.te.I1 intron in Clostridium tetani, for which no IEP-related gene is present in its sequenced genome. C.te.I1 self-splices robustly in vitro, and it was speculated that the intron might not require splicing factors in vivo [88,89]. This example lends plausibility to the possibility that the ribozyme form of group II introns may exist and evolve in bacteria apart from the retroelement form; however, this would be rare because C.te.I1 is the only example of this type among over 1,500 known copies of group II introns in bacteria [90].
Introns with 'degenerated' ribozymes
Many group II introns in mitochondria and chloroplasts have defects in conserved ribozyme motifs, such as mispaired DV or DVI helices or large insertions or deletions in catalytically important regions (Figure 3F) [15,44,71,91,92]. For such introns, secondary structure prediction with confidence is difficult or impossible, and these introns have presumably lost the ability to self-splice. Consistent with this inference, no plant mitochondrial or chloroplast group II intron has been reported to self-splice in vitro.
These examples illustrate that group II introns have repeatedly lost their splicing capability in organelles. To compensate, cellular splicing factors have evolved independently in different organisms to enable efficient splicing of the introns that lie in housekeeping genes. Similar to the case of ORF-less group II introns, there has been a conversion from retromobility to splicing-only function, and splicing is under the control of the host nuclear genome.
Group III introns
The most extreme examples of degenerated RNA structures are group III introns, found in Euglena gracilis chloroplasts (Figure 3G) [106]. These introns are approximately 90 to 120 nt in length and sometimes contain only DI and DVI motifs. Euglena chloroplasts are replete with >150 group III and degenerated group II introns, many located in essential genes. Because group III introns lack a DV structure, it is thought that a generalized machinery consisting of trans-acting RNAs and/or proteins facilitates their excision from cellular mRNAs.
Trans-splicing introns
Some group II intron sequences in plant mitochondria and chloroplasts have been split through genomic rearrangements into two or more pieces that are encoded in distant segments of the genome (Figure 3H) [71,107,108]. The intron pieces are transcribed separately and then associate physically to form a tertiary structure that resembles a typical group II intron. The majority of trans-splicing introns are split into two pieces with the break point located in DIV. However, the Oenothera nad5I3 and Chlamydomonas psaAI1 introns are tripartite, containing breaks in both DI and DIV [108,109]. These and other trans-splicing introns require multiple splicing factors for efficient processing. In the case of psaAI1 in Chlamydomonas reinhardtii chloroplasts, as many as twelve proteins are required in the trans-splicing reaction [110,111]. For some introns, the evolutionary timing of the genomic rearrangement can be specified. The nad1I1 intron is cis-splicing in horsetail but trans-splicing in fern and angiosperms, indicating that the genomic rearrangement occurred after horsetail split from the fern/angiosperm lineage over 250 million years ago [112,113]. No trans-splicing introns have yet been reported in bacteria.
Altered 5′ and 3′ splice sites
While the vast majority of group II introns splice at specific junction sequences at the boundaries of the introns (5′GUGYG…AY3′), a number of group II introns have attained plasticity that allows them to splice at other points (Figure 3I). A set of fungal rRNA introns was identified that splice 1 to 33 nt upstream of the GUGYG motif. The alteration in splicing property was attributed to specific ribozyme structural changes, including an altered IBS1-EBS1 pairing, and loss of the EBS2 and branch site motifs [74]. These changes were inferred to have evolved independently multiple times. All of the introns are of the IIB1 subclass and the majority encodes a LAGLIDADG IEP [74]. Interestingly, a similar situation was found for the bacterial intron C.te.I1 of C. tetani, which exhibits analogous structural deviations and splices eight nucleotides upstream of the GUGYG motif [89]. Alterations of the 3′ splice site have also been reported. About a dozen class B introns are known that contain insertions at the 3′ end of the intron, called domain VII, which result in a shift of splicing to approximately 50 to 70 nt downstream of the canonical 3′AY boundary sequence at the end of domain VI (Figure 3J) [114][115][116].
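As an illustration of the canonical boundary consensus (5′GUGYG…AY3′, where Y is a pyrimidine), the following minimal Python sketch tests whether a candidate intron sequence begins and ends with the expected motifs. The function name and test sequences are hypothetical, and the check covers the terminal motifs only; it says nothing about the ribozyme structure, and introns with altered splice sites such as those described above would fail it by design.

```python
import re

# 5' GUGYG ... AY 3', with Y = pyrimidine (C or U).
# Assumption: the interior of the intron may be any RNA sequence.
CANONICAL_BOUNDARIES = re.compile(r"GUG[CU]G[ACGU]*A[CU]")


def has_canonical_boundaries(intron_seq):
    """Return True if the sequence starts with GUGYG and ends with AY.

    DNA input is accepted by converting T to U before matching.
    """
    rna = intron_seq.upper().replace("T", "U")
    return CANONICAL_BOUNDARIES.fullmatch(rna) is not None


# Hypothetical toy sequences (far shorter than any real group II intron).
canonical = has_canonical_boundaries("GUGCGAAAAAU")
shifted_5prime = has_canonical_boundaries("GUGAGAAAAAU")
```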
Alternative splicing
The fact that group II introns can utilize 5′ and 3′ splice sites separated from the 5′ GUGYG and AY3′ sequences allows for the possibility of alternative splicing. The first report of this was in Euglena chloroplasts, where several group III introns spliced in vivo using noncognate 5′ or 3′ splice sites [117,118]. The frequencies of these splicing events, however, were low, being detected by RT-PCR, and the resultant proteins were truncated due to frame shifts and stop codons, which together raise the possibility that this is a natural error rate in splicing rather than regulated alternative splicing per se.
In bacteria, alternative splicing at the 3′ splice site was found for B.a.I2 of Bacillus anthracis. In that case, two in vivo-utilized sites are located 4 nt apart (each specified by a γ-γ′ and IBS3-EBS3 pairing), which result in two protein products, one consisting of the upstream exon ORF alone and the other a fusion of upstream and downstream ORFs [119]. In a more dramatic example, the C. tetani intron C.te.I1 utilizes four 3′ splice sites, each specified by a different DV/VI repeat. Each resulting spliced product is a distinct fusion protein between the 5′ exon-encoded ORF and one of four downstream exon-encoded ORFs [88]. The latter example resembles alternative splicing in eukaryotes because several protein isoforms are produced from a single genetic locus ( Figure 3K).
Twintrons
A twintron is an intron arrangement in which one group II intron is nested inside another intron as a consequence of an intron insertion event (Figure 3L). For a twintron to splice properly, often the inner intron must be spliced out before the outer intron RNA can fold properly and splice [118,120,121]. Twintrons are common in Euglena chloroplasts, where they were first described and where approximately 30 of the 160 introns are in twintron arrangements [106]. Several twintrons are known in bacteria; however, splicing of these twintrons does not appear to greatly impact cellular gene expression, because the twintrons are intergenic or outside of housekeeping genes [51,122]. Twintrons in the archaebacterium Methanosarcina acetivorans have a particularly complex arrangement [123]. There are up to five introns in a nested configuration but no coding ORFs in the flanking exons. Based on the boundary sequences of the introns, it can be concluded that the introns have undergone repeated cycles of site-specific homing into the sequences of other group II introns. These repeated insertions are balanced by deletions of intron copies through homologous recombination. For these introns, the twintron organizations do not affect host gene expression but provide a perpetual homing site in the genome for group II introns.
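The inside-out splicing order of nested introns can be sketched as a simple recursion. In the minimal Python example below, the `Intron` class, the function names, and the toy transcript are all hypothetical; the sketch assumes the strict case in which every inner intron is removed before the intron that contains it, which the text describes as often, but not always, required.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Intron:
    """An intron whose body may contain further nested introns."""
    name: str
    parts: List[Union[str, "Intron"]] = field(default_factory=list)


def splice_order(parts, order=None):
    """List intron names so that each inner intron precedes its host intron."""
    if order is None:
        order = []
    for part in parts:
        if isinstance(part, Intron):
            splice_order(part.parts, order)  # nested introns are removed first
            order.append(part.name)          # then the intron that contained them
    return order


def ligated_exons(parts):
    """Join the top-level exon segments that remain after all introns are removed."""
    return "".join(part for part in parts if isinstance(part, str))


# Hypothetical transcript: exon, outer intron containing one inner intron, exon.
transcript = [
    "EXON1",
    Intron("outer", ["gugcg", Intron("inner", ["guguga"]), "acgau"]),
    "EXON2",
]
order = splice_order(transcript)
mature = ligated_exons(transcript)
```

The same recursion extends to deeper nesting, such as the configurations of up to five introns described for Methanosarcina acetivorans.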

Molecular phylogenetic evidence for the evolution of group II introns
While there has been much speculation about intron evolution, it remains difficult to obtain direct evidence for specific models. For group II introns, clear phylogenetic conclusions can only be drawn when analyzing closely related introns. This is because only closely related sequences allow the extensive alignments needed for robust phylogenetic signals. Such analyses have indicated multiple cases of horizontal transfers among organisms. Some of the inferred examples are as follows: from an unknown cyanobacterial source to Euglena chloroplasts [124]; from unknown sources into a cryptophyte (red alga; Rhodomonas salina) [125] or a green alga (Chlamydomonas) [126]; between mitochondrial genomes of diatoms and the red alga Chattonella [127]; and from the mitochondrion of an unknown yeast to Kluyveromyces lactis [127,128]. In bacteria, it was concluded that group II introns from multiple classes have transferred horizontally into Wolbachia endosymbionts, because the resident introns are of different classes [129]. More broadly, horizontal transfers among bacteria appear to be relatively common because many bacteria contain introns of multiple classes [51,130,131].
Beyond identification of horizontal transfers, unfortunately, global phylogenetic analyses result in poor phylogenetic signals because the number of characters available (that is, those that are unambiguously alignable for all introns) decreases to at most approximately 230 aa for the ORF and approximately 140 nt for the RNA [57]. With such reduced-character data sets, clades are clearly identified in bacteria corresponding to classes A, B, C, D, E, F, ML, and CL [28,50,56,132]; however, relationships among the clades are not well supported. Notably, when IEPs of organellar introns are included in trees along with bacterial introns, the organellar IEPs cluster with the ML and CL clades of bacteria, indicating that introns of mitochondrial and chloroplast genomes originated from the ML and CL lineages of bacteria [28]. A global analysis with all known organellar and bacterial intron IEPs is not possible because of extreme sequence divergence of many organellar introns.
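The way the usable character count shrinks as more divergent sequences are added can be illustrated with a minimal Python sketch. It uses a deliberately simple criterion, keeping only alignment columns in which no sequence has a gap; this is an assumption made for illustration and is not the procedure used in the cited analyses, where alignability is judged more carefully. The function name and the toy alignment are hypothetical.

```python
def gap_free_columns(alignment, gap_chars="-."):
    """Return indices of alignment columns in which no sequence has a gap.

    `alignment` is a list of equal-length aligned sequences (strings).
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if not any(residue in gap_chars for residue in column)
    ]


# Hypothetical toy alignment of three short protein fragments.
toy_alignment = ["MK-LV", "MKALV", "M-ALV"]
usable = gap_free_columns(toy_alignment)
```

Each additional divergent sequence can only remove columns from the usable set, which is one reason global data sets retain far fewer characters than alignments of closely related introns.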
The limited phylogenetic resolution for group II introns was attributed to several potential factors [57]. First, the amino acid data sets had substantial levels of saturation (that is, repeated changes per amino acid), which decreased the signal-to-noise ratio. Second, the sequences of some clades had extreme base composition biases that could distort the results (for example, GC-rich genomes have biased amino acid composition that can cause artifacts; this is especially true for class B introns). In addition, there were problematic taxon-sampling effects (differences in trees depending on which intron sequences were included). These complications underscore the difficulty of obtaining rigorous evidence for the evolution of group II introns and the need for exercising caution in drawing interpretations and conclusions. In the future, identifying the basis for these effects may allow for compensation and optimization that may produce more satisfying conclusions.

Coevolution of ribozyme and IEP and the retroelement ancestor hypothesis
Over a decade ago, it was noticed that there is a general pattern of coevolution among group II intron IEPs and their RNA structures [53,133]. Specifically, each phylogenetically supported IEP clade corresponds to a distinct RNA secondary structure. Coevolution of RNA and IEP should not be surprising given the intimate biochemical interactions between ribozyme and protein during the splicing and mobility reactions. However, coevolution clearly has not occurred for group I ribozymes and their IEPs. Group I introns have been colonized by four families of IEPs, and there is evidence for a constant cycle of ORF gain and loss from group I ribozymes [134][135][136][137].
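The coevolution pattern described above, in which each IEP clade corresponds to a distinct RNA structure, amounts to a one-to-one mapping between the two classifications. A minimal Python sketch of such a consistency check follows. The function name and the placeholder labels are hypothetical, and the labels deliberately do not reproduce the actual correspondences given in Table 1.

```python
from collections import defaultdict


def classifications_are_one_to_one(assignments):
    """Check that IEP clades and RNA structure types map one-to-one.

    `assignments` is an iterable of (iep_clade, rna_structure) pairs,
    one pair per intron.
    """
    rna_by_iep = defaultdict(set)
    iep_by_rna = defaultdict(set)
    for iep_clade, rna_structure in assignments:
        rna_by_iep[iep_clade].add(rna_structure)
        iep_by_rna[rna_structure].add(iep_clade)
    # Every clade must pair with exactly one structure, and vice versa.
    return all(len(s) == 1 for s in rna_by_iep.values()) and all(
        len(s) == 1 for s in iep_by_rna.values()
    )


# Placeholder labels only; these are not real class assignments.
concordant = [("clade_1", "struct_A"), ("clade_1", "struct_A"), ("clade_2", "struct_B")]
discordant = concordant + [("clade_2", "struct_A")]
```

A discordant pair in real data would not by itself establish reassortment of IEPs and ribozymes; it would only flag an intron for closer examination.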
Coevolution is a central principle for deciphering the history of group II introns. Importantly, it simplifies the reconstruction from two independent histories to a single history. Based on the pattern of coevolution, a model was set forth to explain the history of group II introns, which was called the retroelement ancestor hypothesis [53,133]. The model holds that group II introns diversified into the major extant lineages as retroelements in bacteria, and not as independent ribozymes. Subsequently, the introns migrated to mitochondria and chloroplasts, where many introns became splicing-only elements.
Phylogenetic analyses have in general supported the initial observation of coevolution, because both RNA and IEP trees define the same clades of introns, thereby excluding extensive exchanges between ribozymes and the different classes of IEPs [57]. However, caveats remain. The most obvious one is the fact that some group II introns encode LHE proteins rather than RT proteins. The invasion of group II ribozymes by LHEs occurred at least once in bacteria and multiple times in fungal mitochondria [74,76]. So far, these exceptions are limited in number and do not significantly undermine the overall pattern of coevolution. A second caveat comes from topology tests between the IEP and RNA trees, which indicated a conflict [57] (topology tests are mathematical techniques for evaluating and comparing different trees). As noted in that study, the conflict could be explained by either discordant evolution (reassortment of IEPs and ribozymes) or convergence of RNA or IEP sequences that masks their true evolutionary relationships. While the source of the conflict was not resolved, more recent data support the latter explanation (L. Wu, S. Zimmerly, unpublished).

A model for the evolution of group II introns
Diversification within Eubacteria
The retroelement ancestor model continues to be consistent with available data and is elaborated here to show how it can explain the emergence of the known forms and distribution of group II introns (Figure 4). The ancestral group II intron is hypothesized to have been a retroelement in Eubacteria that consisted of a ribozyme and intron-encoded RT component and had both mobility and self-splicing properties. The earliest introns would have behaved as selfish DNAs [49], which then differentiated in Eubacteria into several retroelement lineages (A, B, C, D, E, F, ML, CL). The IEP initially would have consisted of a simple RT, similar to RTs of classes C, D, E, and F, while the En domain was acquired subsequently from H-N-H nucleases present in Eubacteria [30,58]. The En domain would have provided the benefit of enhanced mobility properties and/or allowed the introns to exploit new biological niches.
Of the three target specificities known for bacterial introns (insertion into homing sites, after terminator motifs, and into attC sites) [64,65], any of these specificities could have been used by the ancestor, although homing is by far the most prevalent specificity, occurring for all lineages but class C. Horizontal transfers would have driven the dissemination of group II introns across species. Some group II introns took up residence in housekeeping genes, particularly in cyanobacteria and for CL and ML lineages [51,138,139]. These introns would have had to splice efficiently to avoid inhibiting expression of the host genes. Limited numbers of introns deviated from the 'standard' retroelement form, including ORF-less introns, introns with degenerate IEPs, twintrons, and alternatively splicing introns. Most of these lost mobility properties but maintained splicing ability. Some introns adapted altered mechanisms of 5′ and 3′ exon recognition and altered 5′ or 3′ intron termini [71,72,74,89,116,117,119,123].
Migration to archaebacteria and organelles Introns belonging to the lineages CL, D, and E migrated from Eubacteria to archaebacteria [51,123]. The direction of migration can be inferred from the lower number and diversity of introns in archaebacteria compared to Eubacteria. Introns of the CL and ML lineages migrated from Eubacteria to mitochondria and chloroplasts. The introns could have been contained within the original bacterial endosymbionts that produced each organelle or been introduced by subsequent migrations. Horizontal transfers of introns among mitochondrial and chloroplast genomes created a diversity of IIA and IIB introns in both organellar genomes [124][125][126][127][128].
Diversification within organelles Within mitochondria and chloroplasts, the character of group II introns changed to become more genomically stable and less selfish. The introns took up residence in housekeeping genes, which necessitated efficient splicing, and which was enabled by host-encoded splicing factors [71,93-96]. While many group II introns maintained retromobility, many more degenerated in their RNA and/or IEP structures or lost the IEPs entirely, leading to immobile introns. In plants, the introns proliferated greatly to copy numbers of approximately 20 per organelle, with nearly all IEPs being lost. At least two IEPs migrated from the plant mitochondrial genome to the nucleus to encode four splicing factors that are imported to the mitochondria and possibly chloroplasts for organellar intron splicing [71,85].
In fungi, a small fraction of ORF-less introns acquired an IEP of the LAGLIDADG family, which permitted mobility through the homing endonuclease mechanism. In mitochondria and chloroplasts, introns sporadically became trans-splicing due to genomic rearrangements that split intron sequences [71,107-109,112,113]. In Euglena chloroplasts, the introns degenerated on a spectacular scale to become group III introns. The earliest euglenoids are inferred to be intron-poor while the later branching euglenoids harbor more introns, pointing to a process of intron proliferation within Euglena chloroplasts [140,141].
Caveats It should be kept in mind that this model is contingent upon the available sequence data. One cautionary note is that our picture of group II introns in bacteria may be skewed, because for the data available the introns were identified bioinformatically in genomes based on the RT ORF. This may result in some oversight of ORF-less group II introns; however, the numbers of those introns do not appear to be large. In a systematic search of bacterial genomes for domain V motifs, nearly all introns identified were retroelement forms [50]. There was one example uncovered of a group II intron with a degenerate IEP, and only a few ORF-less introns, all in genomes with closely related introns where an IEP may act in trans on the ORF-less intron. A single independent, ORF-less group II intron was found out of 225 genomes surveyed. Hence, it seems safe to predict that relatively few ORF-less introns have been overlooked in bacteria, unless they have domain V structures unlike those of known group II introns.

Origin of group II introns
If the ancestor of extant group II introns was a retroelement, where did that retroelement come from? The simplest scenario is that pre-existing ribozyme and RT components combined into a single element, creating a new mobile DNA. An interesting alternative possibility is that a self-splicing RNA might have arisen at the boundaries of a retroelement to prevent host damage by the mobile DNA [142].
There are many potential sources for the ancestral RT component, because a myriad of uncharacterized RTs exist in bacterial genomes, most of which could potentially correspond to forms that were co-opted by the primordial group II intron [143]. Because there is little evidence that bacterial RTs other than group II introns are proliferative elements, it is possible that the property of mobility emerged only after the RT became associated with the RNA component. Similarly, there are many structured RNAs in bacteria that could have given rise to the ancestral group II ribozyme, including noncoding RNAs, riboswitches, or even a fragment of the ribosome [144][145][146]. The primordial RNA component would not necessarily have been self-splicing like modern group II introns, but upon associating with the RT, it would have generated a simple retroelement, which then became specialized and/or optimized to become the efficient retroelement that was then the ancestor of the different lineages. Although the topic of the ultimate origin of group II introns is interesting to consider, any model will be speculative.
Which class of modern group II introns best represents the ancestral group II intron retroelement? It is often claimed in the literature that IIC introns are the most primitive form of group II introns [13,14,18,147]. While this idea is consistent with the small size of IIC introns, it is only weakly supported by phylogenetic data. The study cited provides a posterior probability of only 77% in Bayesian analysis in support of the conclusion (and <50% with neighbor-joining or maximum parsimony methods), whereas 95% is the usual standard for making conclusions with Bayesian analysis [148]. In more recent phylogenetic analyses, IIC introns are also seen often as the earliest branching of group II introns, albeit with weak or inconsistent support [57]. Interestingly, additional classes of group II introns have been uncovered more recently in sequence data, and some of these are as good or better candidates for most ancestral intron (L. Wu, S. Zimmerly, unpublished).

Structural parallels between group II introns, spliceosomal introns and the spliceosome
Major parallels The concept that group II introns were the ancestors of spliceosomal introns emerged shortly after the discovery of multiple intron types (spliceosomal, group I, group II introns) [149][150][151]. Since then, mechanistic and structural evidence has accumulated to the point that few if any skeptics remain. This is a shift from the early years when it was argued that mechanistic constraints could have resulted in convergent evolution of mechanisms and features [152].
The major similarities and parallels for the two intron types are summarized here. In terms of splicing mechanisms, the overall pathways for group II and spliceosomal introns are identical, with two transesterifications and a lariat intermediate (Figure 2A). The chemistry of the two splicing steps shares characteristics with regard to their sensitivities to Rp and Sp thiosubstitutions. An Rp thiosubstitution (that is, sulfur atom substituted for the Rp nonbridging oxygen) at the reacting phosphate group inhibits both steps of the reaction for both group II and spliceosomal introns, whereas Sp substitutions do not, suggesting that different active sites are used for the two reactions [153][154][155][156]. This contrasts with data for group I introns, for which Rp substitutions inhibited only the first splicing step, and Sp substitutions inhibited only the second step, which is consistent with reversal of a reaction step at a common active site [157,158]. The shared sensitivities for the reactions of group II and spliceosomal introns suggest that similar active sites are used for the two types of introns, with the group II-like active site being maintained during evolution of spliceosomal introns.
Structurally, there are many parallels between group II intron RNAs and spliceosomal snRNAs, which run the gamut from being clearly analogous to being speculative. The most obvious parallel is the branch site motif that presents the 2′OH of a bulged A to the 5′ splice site for the first step of splicing. For group II introns, the bulged A is contained within a helix of domain VI; in the spliceosome the same bulged structure is formed by the pairing of the U2 snRNA to the intron's branch point sequence (Figure 5) [159]. Intron boundary sequences are also quite similar and presumably function analogously, being 5′ GU-AY 3′ for group II introns and 5′ GU-AG 3′ for spliceosomal introns (Figure 5). The first and last nucleotides of each intron have been reported to form physical interactions that are essential for an efficient second step of splicing [160][161][162].
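The boundary consensus sequences quoted above can be made concrete with a short illustrative sketch. It assumes only the two consensus patterns given in the text (5′ GU-AY 3′ and 5′ GU-AG 3′, with Y denoting a pyrimidine, C or U); the function name and the toy sequences are hypothetical and are not taken from any cited study. Matching these dinucleotides shows how similar the two boundary patterns are; it is not a method for identifying or classifying real introns.

```python
# Y is the IUPAC code for a pyrimidine, which is C or U in RNA.
PYRIMIDINES = {"C", "U"}


def boundary_classes(intron: str) -> set:
    """Return which boundary consensus patterns an intron sequence matches.

    Only the first two and last two nucleotides are inspected:
    "GU-AY" is the group II consensus and "GU-AG" the spliceosomal
    consensus, as quoted in the text. (Hypothetical helper for illustration.)
    """
    seq = intron.upper().replace("T", "U")  # accept DNA-style input
    matches = set()
    if len(seq) < 4 or not seq.startswith("GU"):
        return matches
    if seq[-2] == "A" and seq[-1] in PYRIMIDINES:
        matches.add("GU-AY")
    if seq.endswith("AG"):
        matches.add("GU-AG")
    return matches


# Toy sequences, invented for illustration only.
print(boundary_classes("GUGCGACUUAC"))   # ends in AC, so it matches GU-AY
print(boundary_classes("GTAAGTCCTTAG"))  # DNA input ending in AG, matches GU-AG
```

The two patterns share the 5′ GU and the penultimate A and differ only at the final position (pyrimidine versus G), which is the similarity the text refers to.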
For group II introns, the active site is in domain V, with two catalytically important metal ions being coordinated by the AGC catalytic triad and the AY bulge [147]. A similar structure is formed in the spliceosome by pairings between the U2 and U6 snRNAs, which bear an AGC motif and AU bulge (Figure 5) [23]. The equivalence between the two active sites has been supported experimentally through the substitution of the DV sequence of a group II intron for the analogous positions in the snRNAs of the minor spliceosome (in that case the U12-U6atac snRNA pairing rather than U2-U6) [163]. The substitution demonstrates that the group II intron sequence can assume a functional structure at the putative active site of the spliceosome. More recently, the equivalence of the two active sites was taken to a new level using thiosubstitution and metal rescue experiments, in which a thiosubstitution inhibits a splicing step, but is rescued by metal ions that coordinate sulfur better than magnesium does. These experiments demonstrated that the AGC and bulged AU motifs of the U6-U2 active site coordinate catalytic metal ions as predicted from the crystal structure of the group IIC intron [164].
A further active site parallel comes from the discovery in the group II crystal structure of a triple helix between the AGC base pairs in domain V and two bases of the J2/3 strand (Figure 5A) [147]. This structure is hypothesized to be recapitulated in the active site of the spliceosome, with an AG of the ACAGAGA motif forming the triple base pairs with the AGC of the U6-U2 helix (Figure 5B). Experiments for the yeast spliceosome using covariation-rescue and cross-linking methods support the hypothesized triple base pairs in the spliceosome and lend further support for this active site parallel [165].
A final clear parallel between group II introns and spliceosomal introns was revealed by the crystal structure of a portion of the Prp8 protein, a 280-kDa protein (in yeast) located at the heart of the spliceosome. A region of Prp8 cross-links to the 5′ and 3′ exons and also to the intron's branch site, indicating its proximity to the spliceosome's active site. Surprisingly, the crystal structure of a major portion of yeast Prp8 revealed that the cross-linking portion is composed of a reverse transcriptase domain fold [166]. In fact, the existence of an RT domain in Prp8 had been previously predicted correctly based on sensitive sequence pattern profiles [167]. Thus, the active site region of the spliceosome appears to contain remnants of both an ancestral ribozyme (snRNA pairings) and an ancestral group II RT (Prp8), which together strongly support the idea that the eukaryotic spliceosome and nuclear pre-mRNA introns are highly elaborate derivatives of ancient, retromobile group II introns.
Less clear yet plausible parallels Additional parallels between group II intron and spliceosomal intron RNAs are credible but less clear. The loop 1 structure of U5 snRNA is predicted to be analogous to the EBS1 loop of group II introns, a substructure that forms base pairs with the 5′ exon of group II introns, thereby delivering the 5′ exon to the active site (Figure 1A). Supporting the parallel, the loop 1 structure of U5 forms cross-links with both the 5′ and 3′ exon boundary sequences [168]. An experiment supporting functional equivalence demonstrated that the EBS1 stem-loop of the bI1 intron of yeast mitochondria could be deleted and then rescued with a stem-loop supplied in trans that had either the native bI1 stem-loop sequence or the loop 1 sequence of the U5 snRNA [169]. However, because the function of the EBS1 loop sequence is to form base pairs with the exon's IBS1, and the U5 loop sequence is fortuitously capable of base pairing with the IBS1 of bI1 (but not other group II introns), the significance of the experiment is less clear. Interestingly, while the EBS1 loop sequence of IIB and IIC introns pairs with only the 5′ exon, the EBS1 loop of IIA introns pairs with both 5′ and 3′ exons (IBS1-EBS1 and δ-δ′ interactions; Figure 1), making the putative parallel more similar for IIA introns than for IIB or IIC introns [170].
[Figure 5 legend, continued: For group II introns, selected nucleotide positions critical for splicing are shown, while the sequences shown for snRNAs correspond to the 95% consensus for the U2, U5, and U6 snRNA sequences present in Rfam [203]. The blue square inset shows an alternative secondary structure model for the ISL of U6, which is less compatible with DV of group II introns but is formed for naked snRNAs. The green square indicates an alternative four-way junction structure, also formed by naked snRNAs. Question marks indicate the interactions found in group II introns for which no equivalent interactions are reported in snRNAs. See text for a full description.]
The 2-bp ε-ε′ interaction of group II introns has been proposed to be equivalent to an experimentally detected pairing between the U6 snRNA and a sequence near the 5′ end of the intron (Figures 1 and 5) [12,171-173]. While the analogy is reasonable, the U6 pairing was initially reported as 3 bp and later evidence suggested it to be up to 6 bp [174,175]; it remains unclear whether or to what extent the two pairings are analogous structurally and functionally.
Finally, the λ-λ′ interaction of group II introns is a three-way interaction that connects the ε-ε′ interaction (and hence the 5′ end of the intron) to the distal stem of domain V (Figures 1 and 5). The parallel in snRNAs is proposed to be a triple base pair between a subset of nucleotides in the ACAGAGA motif and the internal stem-loop (ISL) helix of U6. While this structural parallel remains a possibility, it appears difficult for the ACAGAGA motif to simultaneously form the ε-ε′-like and λ-λ′-like interactions.
Missing or questionable structural parallels It is important not to ignore features that are not shared between group II and spliceosomal introns, in the rush to pronounce the two types of introns equivalent. Each type of intron has features not found or reported in the other. For example, the γ-γ′ interaction of group II introns is a Watson-Crick base pair between a J2/3 nucleotide and the last position of the intron, but it has not been reported for spliceosomal introns (Figures 1 and 5). The putatively equivalent nucleotides in the snRNAs would be a residue of the ACAGAGA box and the last nucleotide (G) of the intron.
Two critical pairings that occur in the spliceosome but not in group II introns are temporal pairings formed during spliceosome assembly but not catalysis [176]. The U1 snRNA pairs to the 5′ end of the intron during splice site recognition and assembly, only to be replaced before catalysis by a pairing between U6 and the 5′ end of the intron. Similarly, the extensive pairings between the U6 and U4 snRNAs occur during spliceosome assembly but are disrupted and replaced by the U6-U2 pairing. Both of these transient RNA-RNA pairings can be predicted to have arisen during the evolutionary advent of the spliceosome, for the purposes of assembly and/or regulation.
On the other hand, helices Ia and III of the U2-U6 structure (Figure 5) occur during catalysis, but have no equivalent in group II introns, and perhaps even conflict with the structural organization of group II intron RNAs. Helix Ia introduces a spacer between the catalytic AGC motif, the branch site motif and triple helix motif, potentially introducing a structural incompatibility between spliceosomal and group II introns. In any case, group II introns do not have an equivalent helix Ia structure. More problematic is helix III, which is not present in group II introns, and appears to conflict with proposed structural parallels for the ACAGAGA sequence. In [175], it was proposed that helix III is shortened to approximately 4 bp during catalysis, but might form more fully during assembly. Again, because this established helix has no group II intron equivalent, it may have originated during evolution of the spliceosome.
A modest discrepancy involves the secondary structure of the ISL of U6 and the DV structure of group II introns. The secondary structure of the ISL is usually drawn with an AU bulge opposite an unpaired C (blue square, Figure 5) [177]. However, chemical modification protection data with purified, activated spliceosomes instead suggested an alternative structure that is more similar to group II introns. The alternative structure does not form for naked snRNAs, but it may form in the context of the spliceosome [163,175]. Another perplexing difference between intron types is the break of the catalytic helix into helix Ib and the ISL.
Finally, it is notable that secondary structure models for snRNA pairings have changed over the years, and there are proposed differences in snRNA pairings for yeast versus mammalian snRNAs, despite the fact that the relevant sequences are identical [178][179][180][181][182]. NMR structural analysis of the naked U2-U6 sequences revealed a four-way junction structure (Figure 5B) [180], which was subsequently supported by genetic data in yeast [183]. The four-way junction was proposed to form for the first step, with the three-way junction forming for the second step. However, there is no evidence for the four-way junction structure in the mammalian spliceosome, most recently based on RNA modification protection data of purified, activated U5-U6-U2 spliceosomes [175].

The pathway for the evolution of spliceosomal introns from group II introns
Because virtually all eukaryotic genomes contain introns and spliceosomes, with the few exceptions attributed to losses [184][185][186], the spliceosome was necessarily present in the last eukaryotic common ancestor (LECA). Thus, evolution of ancestral group II introns to the spliceosome would have occurred prior to the LECA. Evidence from genome comparisons indicates that the LECA contained a multitude of introns [187]. Indeed, it is doubtful that such a complex machinery as the spliceosome would have arisen on account of a few introns.
Models for the conversion of group II introns to the spliceosome are not well refined, and multiple scenarios are possible [188][189][190][191]. At some point prior to the LECA, group II introns likely invaded the nuclear genome and proliferated as mobile DNAs. The invading group II intron(s) could have come from the genome of the alphaproteobacterium that became the mitochondrial endosymbiont or alternatively could have been transferred from a bacterium to the nuclear genome after establishment of the mitochondrion. Rampant intron propagation would leave many introns interrupting essential genes, which would require the maintenance of splicing to ensure cell viability. Consequently, the cell evolved splicing factors to facilitate and eventually control splicing of the introns. Debilitating mutations in ribozyme sequences would occur easily through point mutations, leading to many copies of splicing-deficient introns in the genome. On the other hand, discarding such defective introns by precise deletions of entire introns would be rare. The cell could have solved this problem by evolving a general splicing machinery that acts in trans, leaving the introns free to lose all their ribozyme structures except for certain boundary sequences. The end result was the transfer of splicing catalysis from individual ribozyme units scattered throughout the genome to a single trans-acting RNP machinery that could act on all intron copies.
Because the modern spliceosome is ostensibly an elaborate derivative of a mobile group II intron RNP, it follows that at a time point prior to the LECA, the ribozyme structure of group II introns fragmented into the U2, U5, and U6 snRNA components of the spliceosome. In addition, the RT protein expanded in length through domain accretion, with the fusion of an RNase H domain, MPN/JAB1 (nuclease) domain, and possibly other domains that form portions of the modern 280-kDa Prp8 protein [167,192]. Additional protein splicing factors such as Sm and SR proteins were incorporated into the spliceosomal machinery. The U1 and U4 snRNAs and snRNPs were added as new regulatory or facilitating activities, since they do not have equivalents in group II introns.
One intriguing model for the emergence of the spliceosome predicts that proliferation of mobile group II introns was the driving force for invention of the nuclear membrane [188,193]. The model is based on the likelihood that splicing would have been slow compared to transcription and translation processes. In an uncompartmentalized cell, translation would therefore occur before mRNAs were fully spliced, yielding nonfunctional proteins. By separating transcription and translation, the nuclear membrane ensured that only fully spliced transcripts were translated.
Several studies have experimentally addressed evolutionary issues of group II introns. One series of studies sought to reproduce the fragmentation of a group II ribozyme into a trans-splicing intron-in-pieces. It was shown that a retromobile IIA intron could be split into multiple functional trans-splicing RNA transcripts, with the break points distributed throughout the sequence and not only in domain IV as occurs for nearly all natural trans-splicing introns [189,194,195]. In a separate series of studies, the question was addressed as to why group II introns do not function optimally in nuclear genomes, where they are apparently excluded in functional form in nature. It was found that the introns spliced in the cytoplasm rather than the nucleus and that transcripts were subject to nonsense-mediated decay (NMD) and poor translation. Further dissection showed that transcripts were mislocalized to foci in the cytoplasm and that the excised intron lariat formed RNA-RNA pairings with spliced mRNAs that inhibited their translation. It was inferred that these phenomena demonstrate an incompatibility of group II introns with eukaryotic cellular organization and may have been responsible for the ejection of group II introns from nuclear genomes during evolution [190,196,197].

What other elements did group II introns evolve into?
In addition to spliceosomal introns, group II introns are believed to be the ancestors of non-LTR retroelements, a major class of mobile DNAs in eukaryotes [31]. The RTs of group II introns and non-LTR retroelements are related phylogenetically and share sequence motifs 0 and 2a, which are absent from other RTs except diversity-generating retroelements (DGRs) (2a), retroplasmids (2a), and possibly retrons (2a) [143,191,198,199]. Moreover, the retromobility mechanisms of group II and non-LTR elements are similar, with both called target-primed reverse transcription (TPRT) because they involve cleavage of the DNA target to produce a primer for reverse transcription [31,200]. As mobile group II introns were present in the nucleus prior to the LECA, it is plausible that some invading group II introns produced the non-LTR family retroelements in the nucleus through the loss of their ribozyme and splicing functions but retention of mobility functions.
Moreover, it is clear that group II introns spawned other RT-containing units. A subset of CRISPR/Cas elements contain an RT gene, either as a free-standing ORF or fused to a cas1 gene (denoted G2L1 and G2L2 (group II-like 1 and 2) [143,201]). By sequence, these RTs might be mistaken for group II introns except that no ribozyme RNA structure is present [143]. The cas1 gene encodes a nuclease that helps integrate short sequences of phage or plasmid into CRISPR arrays, lending cellular immunity to DNAs containing those sequences [202]. The RT genes found within CRISPR/Cas systems are almost certainly derived from group II intron retroelements due to their close sequence similarity. It seems likely that they use a mechanism related to TPRT to integrate the new protospacer sequences into CRISPR arrays.
Three additional types of group II-related RTs exist in bacteria, denoted G2L3, G2L4, and G2L5 [143]. These are not associated with CRISPR/Cas systems and also lack ribozyme structures. It is unknown whether these RTs are part of mobile DNAs or participate in as yet unidentified functions.

Conclusions
Group II introns are compact and versatile retroelements that have successfully colonized genomes across all domains of life and have given rise to many variant forms. Current data are consistent with the model that the retroelement form (that is, the form diagrammed in Figure 1) was the ancestor of extant group II introns and was the driver for their spread and survival. The evolutionary success of group II introns may be linked to the multifunctionality of their splicing and mobility reactions, which allowed them to spread as selfish DNAs, and then derivatize into adaptable forms that shed either splicing or mobility properties. Interestingly, there is much overlap in variant forms of group II introns found in bacterial and organellar genomes (ORF-less introns, twintrons, altered 5′ splice sites, alternative splicing, degenerate IEP sequences, LAGLIDADG IEPs; Figure 4), which suggests that these derivative forms represent general ways that group II introns can differentiate. The low numbers of derivatives in bacteria suggest that the nonmobile derivatives do not persist long in bacterial genomes, whereas derivatized introns in organelles may persist indefinitely as splicing-only elements, and potentially provide benefits of gene regulation through nuclear control of their splicing.
With regard to the evolutionary pathway of group II introns into spliceosomal introns, important insights over the past 2 years have largely erased doubts about the longstanding hypothesis that the spliceosome descended from group II introns. Indeed, there are no credible competing hypotheses for the origin of the spliceosome. Still, the specifics of the pathway and the full scope of mechanistic parallels remain to be resolved. Additional insight may be forthcoming from structural elucidations of the spliceosome and comparisons to group II intron structures, as well as genomic comparisons of early branching eukaryotes, which may give information about introns in the LECA and potentially suggest evolutionary intermediates or pathways. Overall, the elucidation of group II intron biology, structure, and evolution remains an important facet in understanding the evolution and dynamics of eukaryotic genomes.