Euglena in time: Evolution, control of central metabolic processes and multi-domain proteins in carbohydrate and natural product biochemistry

Summary
Euglena gracilis is a eukaryotic microalga that has been the subject of scientific study for hundreds of years. It has a complex evolutionary history, with traces of at least four endosymbiotic genomes and extensive horizontal gene transfer. Given the importance of Euglena in terms of evolutionary cell biology and its unique taxonomic position, we initiated a de novo transcriptome sequencing project in order to understand this intriguing organism. By analysing the proteins encoded in this transcriptome, we can identify an extremely complex metabolic capacity, rivalling that of multicellular organisms. Many genes have been acquired from what are now very distantly related species. Herein we consider the biology of Euglena in different time frames, from evolution through control of cell biology to metabolic processes associated with carbohydrate and natural products biochemistry.


Introduction
The ease with which Euglena can be cultured has made them one of the most highly studied eukaryotes, playing a pivotal role in the development of cell biology and biochemistry. Euglena gracilis, in particular, has long been investigated for the production of vitamins A, C, E (Takeyama et al., 1997) and essential amino acids, and is also a good source of polyunsaturated fatty acids (Korn, 1964). When grown aerobically in light it produces an insoluble β-1,3-glucan storage polymer, paramylon (Rodríguez-Zavala et al., 2010), which can make up around 85% of the dry weight of the organism. In contrast, under anaerobic conditions, wax esters comprise over 50% of the dry weight of some strains of Euglena (Inui et al., 1982).
Genome sequencing of Euglena has been hampered to date by its large and complex genome (approximately 2 Gbp in size, with 80% repetitive sequence; Mark Field, private communication), which has arisen from a series of endosymbiotic events during its evolution. Aside from typical eukaryotic epigenetic modifications, including DNA methylation and histone acetylation, the genome of Euglena also contains the modified nucleotide Base J (glucosylated hydroxythymidine), also found in kinetoplastids (Borst and Sabatini, 2008), which complicates DNA sequencing by restricting polymerase processivity. Additionally, Euglena has the ability to extensively process mRNA during transcription (Tessier et al., 1992), altering the sequences before translation; hence the proteome of Euglena would be difficult to predict from its genome. To avoid the complications of algal genome sequencing (Rismani-Yazdi et al., 2011) and to begin to explore the full metabolic capability of Euglena, we sequenced the transcriptome of Euglena gracilis var. saccharophila (O'Neill et al., 2015).
Transcript analysis identified 22,814 predicted protein-encoding genes in phototrophic Euglena cells, whilst 26,738 were evident in heterotrophic cells, accounting for 32,128 non-redundant predicted proteins overall, including 8,890 splice variants. This indicates that there is a dramatic shift in metabolic capability that depends on growth conditions. All of the genes necessary for cellular housekeeping activities are encoded, as well as those for the biosynthesis of vitamins, amino acids and complex carbohydrates; a number of novel protein sequences are also evident, whose activity is difficult to predict at this time.
This transcriptome reveals a wealth of information about how the metabolic capacity in Euglena has evolved. This complex evolutionary history, combined with horizontal gene transfer (Henze et al., 1995), gives Euglena a huge biosynthetic capability, obtained from diverse sources, and has allowed the evolution of a complex genome which shows features of higher eukaryote control mechanisms. There is also evidence for evolution of novel enzymes, giving Euglena a unique metabolic competency. These observations are based on the 14,389 predicted proteins with BLASTP hits, which provide prospective functional annotation; the remaining 17,739 non-redundant predicted proteins show no significant homology to any known protein, and their function is currently unknown.
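The counts quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes that the phototrophic and heterotrophic totals and the non-redundant total count the same kind of object, so that inclusion-exclusion applies, which the text does not state explicitly. The function name `partition` and the variable names are hypothetical.

```python
# Counts quoted in the text.
phototrophic = 22_814     # predicted in light-grown (phototrophic) cells
heterotrophic = 26_738    # predicted in dark-grown (heterotrophic) cells
non_redundant = 32_128    # non-redundant predicted proteins overall
with_blastp_hit = 14_389  # predicted proteins with a BLASTP hit
without_hit = 17_739      # predicted proteins with no significant homology

# The annotated and unannotated sets should add up to the total.
assert with_blastp_hit + without_hit == non_redundant


def partition(light: int, dark: int, union: int) -> dict:
    """Split two overlapping counts by inclusion-exclusion.

    Assumption (not stated in the text): `union` is the size of the
    combined set of the `light` and `dark` predictions.
    """
    shared = light + dark - union
    return {
        "shared": shared,
        "light_only": light - shared,
        "dark_only": dark - shared,
    }


split = partition(phototrophic, heterotrophic, non_redundant)
annotated_fraction = with_blastp_hit / non_redundant

print(split)
print(f"annotated fraction: {annotated_fraction:.1%}")
```

Under that assumption, roughly half of the non-redundant set would be shared between the two growth conditions, and a little under half of all predicted proteins carry a prospective annotation.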

Genome evolution
The kingdom-level distribution of the top BLASTP hits illustrates the huge diversity of sources of genetic material present in the Euglena genome, obtained from horizontal gene transfer, and highlights its complex genetic history (Fig. 1A).
The Euglenoids are related to the pathogenic protozoa Trypanosomes and Leishmania (Fig. 1B) and are extremely difficult to classify, even using modern molecular techniques (Linton et al., 2010). Since the split of Euglena from other members of the Euglenozoa over one billion years ago (Parfrey et al., 2011), there is evidence for a red algal endosymbiont, which transferred some genes to the eventual Euglena nuclear genome and has since been lost (Maruyama et al., 2011). Subsequently, there was endosymbiosis of a eukaryotic green alga (Martin et al., 1992), with transfer of many genes to the nucleus, including those for the maintenance of the chloroplast. Thus, the genetic material in Euglena is derived from: the ancestral protozoa; the mitochondrion, related to alpha-proteobacteria; the red algal endosymbiont; the primary photosynthetic host; and the primary photosynthetic endosymbiont, related to cyanobacteria, obtained from the green alga (Fig. 2). Although Euglena is colloquially referred to as a green alga, its core nuclear genome is more closely related to that of Trypanosomes than to that of other eukaryotic algae.

Figure 1
[...] (Letunic and Bork, 2011; Parfrey et al., 2011). Organisms in green are photosynthetic.

Figure 2
[...] Most of the genetic material was transferred to the nucleus. 5. The ancestral plant cells diversified to form green algae and plants, golden algae and red algae (Moreira et al., 2000). 6. A red algal cell was taken up by the ancestor of Euglenids and some of the DNA transferred to the nucleus (Maruyama et al., 2011). 7. The red alga was then lost and the ancestor of photosynthetic Euglena formed an endosymbiotic relationship with a green alga (Gockel and Hachtel, 2000), which has been subsequently lost in several independent Euglenid lineages. 8. Nuclear and chloroplast DNA were subsequently transferred from the green alga to the nucleus of the Euglenid ancestor. 9. The nucleus and mitochondria were lost from the plant cell to leave the final chloroplast.

Figure 3
[...] To form Base J a thymidine is hydroxylated by JBPs, to form hydroxymethyl-deoxy-uracil (HOMedU), and then glucosylated by JGT, a novel glucosyltransferase. The glucose is removed by an as yet unidentified glucosidase, leaving HOMedU. R1 = 3′ DNA strand, R2 = 5′ DNA strand. (B) JBP2 is thought to initiate de novo hydroxylation of thymidine. The product is then glucosylated by JGT to form novel Base J residues, which are recognised by JBP1, leading to hydroxylation of nearby thymidine residues. Further glucosylation by JGT subsequently leads to a local amplification of the Base J silencing signal (Borst and Sabatini, 2008).
In addition to the nuclear genome, some genetic material is retained in the chloroplast (Hallick et al., 1993) and the mitochondrion, which has an unusually fragmented gene organisation (Spencer and Gray, 2011). The Euglenoid chloroplast has undergone substantial genetic rearrangement during its evolution (Hrdá et al., 2012). This has included the transfer of several important genes from the chloroplast genome to the nucleus, from where they can be identified in the Euglena transcriptome, including the protease clpP (lm.95241 and lm.15675), the membrane protein cemA (lm.59206) and the photosystem I component ycf3 (lm.46611).

Evolution of controls
To control the expression and activity of proteins, eukaryotes use sophisticated mechanisms, with more intricate systems amongst the higher organisms. Euglena has many of the classical mechanisms, as well as some that are more unusual and more complex than any found elsewhere.

Base J
Base J is a modified DNA base, formed by hydroxylation and glucosylation of thymidine (Fig. 3A). It is uniquely found amongst the Euglenozoa, including the important human pathogens Trypanosomes and Leishmania (van Leeuwen et al., 1998). It prevents RNA polymerases passing along the DNA strand, silencing the modified region of the genome, and is of key importance to the control of surface coat protein expression in Trypanosomes (Borst and Sabatini, 2008). In Leishmania, some of the Base J is not located around the telomeres, as is the case in Trypanosomes (van Luenen et al., 2012), and it has been shown to prevent transcriptional read-through, regulating transcription termination (Reynolds et al., 2014). In contrast, Base J in Euglena is found throughout the genome where it makes up approximately 0.2% (1 in 500) of the bases (Dooijes et al., 2000).
The two proteins involved in the initial hydroxylation of thymidine are well studied in Trypanosomes and Leishmania (DiPaolo et al., 2005). There are matching homologues encoded in the Euglena transcriptome, though the JBP1 homologue (dm.72228), essential in Leishmania (Genest et al., 2005), is not found in the transcriptome of the light-grown cells. Unlike JBP1, JBP2 (dm.12798 in Euglena) does not actually bind Base J, but instead appears to initiate de novo Base J biosynthesis, whilst JBP1 amplifies this signal (Fig. 3B) (Cliffe et al., 2009).
Recently, two separate groups identified the Base J glucosyltransferase as a single copy in the Trypanosome genomes (Bullard et al., 2014; Sekar et al., 2014). This enzyme was noted as being distantly related to GT-A type glycosyltransferases and to have variable loops between conserved regions in other kinetoplastid species. In the Euglena transcriptome there is one homologue (dm.53028) of the Trypanosome transferase, with an alternative splice variant containing a short N-terminal region of unclear function (dm.53027). The enzyme(s) for removal of the glucose unit have not, to date, been identified in any organism. After removal of the glucose the HOMedU could be either dehydroxylated or the base could be excised and replaced (Borst and Sabatini, 2008).

Gene silencing
RNA-mediated gene silencing is a ubiquitous control mechanism found in bacteria (Marraffini and Sontheimer, 2010), plants (Baulcombe, 2004) and animals (Berezikov and Plasterk, 2005). Three main components make up the machinery in eukaryotes: Dicer-like (DCL), which cleaves [...] (Hutvagner and Simard, 2008); and RNA-dependent RNA polymerase (RDRP), which amplifies the silencing (Baulcombe, 2007). Whilst some sequenced algae do not have any of the necessary components, most retain some capability for gene silencing (Cerutti et al., 2011).

Table 1
Components of the RNA silencing machinery. Transcripts for genes involved in gene silencing pathways were identified in the Euglena transcriptome (Brodersen and Voinnet, 2006).
Euglena is known to possess some capacity for RNA-silencing: successful RNAi knockdown experiments of a gene involved in vitamin C biosynthesis (Ishikawa et al., 2008) and a photoreceptor protein (Ntefidou et al., 2003) have been performed in this organism. Arabidopsis encodes four copies of DCL, which are differentially expressed under stress conditions (Liu et al., 2009). The Euglena transcriptome also encodes four DCL proteins, which appear to have diverged from a single copy rather than being acquired by repeated horizontal gene transfer or endosymbiosis. Argonautes are split between the piwi subfamily and the AGO subfamily, which have undergone duplication, expansion and loss in different lineages (Hutvagner and Simard, 2008). Humans have four genes for each subfamily, whilst Arabidopsis has ten members of the AGO subfamily and no piwi-like genes, and Leishmania has lost both of these subfamilies entirely.
There are no members of the piwi subfamily encoded in the Euglena transcriptome, but there are four AGO family proteins, with three being closely related and one more divergent, possibly suggesting two separate gene acquisitions.
Many of the other components of the gene silencing machinery are also present in the Euglena transcriptome, including helicases, histone methylases, histone deacetylases and cytosine methylases (Table 1), and these are variably expressed in the light-grown and dark-grown cells. There is no RNA-dependent RNA polymerase (RDRP) present, which suggests that Euglena is incapable of producing trans-acting RNAs or utilising the aberrant transcript pathway (Brodersen and Voinnet, 2006). It is possible that another enzyme is capable of acting to amplify the signal, as is suggested for mammals and insects, which also lack the RDRP (Baulcombe, 2007). In Arabidopsis this amplification is involved in development, where the original double-stranded RNA might be diluted during growth (Fahlgren et al., 2006), which is conceivably not required in a single-celled organism such as Euglena.

O-GlcNAc
As a complement to protein phosphorylation, a key regulator of protein activity is the addition and removal of serine- and threonine-linked N-acetylglucosamine (O-GlcNAc) (Love and Hanover, 2005). This modification impacts on cellular signalling and nutrient response, including cross-talk with protein phosphorylation, which competes for the same sites on the protein (Fig. 4) (Zeidan and Hart, 2010). There are three putative GT41 O-GlcNAc transferases in the Euglena transcriptome (lm.52466, dm.35031 and lm.92993), the latter of which is only present in the light-grown cells. In humans there is only one transferase gene (Vocadlo, 2012), whilst plants have two distinct enzymes, SEC, similar to the animal enzyme, and SPY (Olszewski et al., 2010). There are, however, no homologues of the human O-GlcNAcase, which would reverse the O-GlcNAc addition, encoded in the Euglena transcriptome or in plants, suggesting that a non-canonical hydrolase may carry out this reaction. Hence it is evident that not only does Euglena carry out this 'higher-eukaryote' protein glycosylation, it employs a more complex system than 'higher' multi-cellular organisms.

Evolution of protein architecture
Euglena has enzymes for the biosynthesis of many diverse compounds, including amino acids, vitamins, complex carbohydrates and polyunsaturated fatty acids (O'Neill et al., 2015). These capabilities have been obtained from many diverse sources through evolution. Aside from mutation-based evolution, such as the massive expansion and diversification of carbohydrate-active enzyme families (Cantarel et al., 2009), Euglena appears to have made extensive use of gene fusions to produce novel domain arrangements, a few examples of which are discussed below.

Figure 4
[...] (Vocadlo, 2012). This modification is orthogonal to kinase-mediated phosphorylation.

Alternative splicing
Many of the transcripts obtained from sequencing represent alternative splicing variants, information that would not be available from genome sequencing. For example, transcripts lm.75841 and lm.75842 share an identical N-terminus, coding for a glycosyltransferase, but the former has a C-terminal extension, not present in the latter, which encodes a peroxisomal protein (Fig. 5A). The ratio of the expression of the long to short isoforms was estimated to be approximately 11:1 in the light sample, but in the dark no short sequence variant is detectable. This suggests that in the light the enzyme activity is required in both subcellular locations, but in the dark it is not required in the cytosol. Hence Euglena appears to make use of alternative splicing to control subcellular targeting of a single gene product, as has been seen for many enzymes (Danpure, 1995), including glycolytic enzymes in fungi (Freitag et al., 2012) and amino acid metabolic enzymes in plants (Gebhardt et al., 1998).

Di-domain carbohydrate-active enzymes
Whilst fusion of carbohydrate binding modules (CBMs) to carbohydrate-active enzymes is relatively common in nature, and contiguity of multiple glycoside hydrolase domains in a single protein is well known, it is much rarer to find glycosyltransferases as part of a protein containing other domains. Examples include the sea-squirt Oikopleura dioica (Medie et al., 2012), which encodes a cellulose synthase (GT2) and a β-glucan hydrolase (GH6), and the previously discussed O-GlcNAc transferases, which are found in most eukaryotes and some bacteria, and encode tetratricopeptide repeats in addition to the GT41 GlcNAc-transferase domain (Lubas et al., 1997).
Two transcripts were identified in the Euglena transcriptome that encode proteins with two carbohydrate-active enzyme domains (Fig. 5B). The first protein (lm.71174) has a putative GT11 fucosyltransferase domain and a putative GT15 mannosyltransferase domain. The active site of the former does not contain the second arginine in the HxRRxD motif (Takahashi et al., 2000), whilst the latter contains the nucleophile and a zwitterionic ion-binding motif (Lobsanov et al., 2004). It is possible that this enzyme may act to transfer both fucose and mannose to the same N-glycan core. Alternatively, this enzyme may transfer mannose on to a fucosylated glycan, to which the GT11 domain, acting as a carbohydrate-binding module, directs it.
A second di-domain protein (dm.47703) is composed of a GT1 sugar transferase, most closely related to bacterial sterol β-glucuronic acid transferases, linked to a C-terminal GH78 α-rhamnosidase domain. Both domains appear to have an intact active site, suggesting that both activities are viable (Cui et al., 2007; Mulichak et al., 2004). This di-domain protein might conceivably be involved in cleaving rhamnose from a small molecule and adding a glucuronic acid moiety, a sugar addition that is known to facilitate subcellular relocalisation and xenobiotic detoxification (Tukey and Strassburg, 2000).

Tryptophan biosynthesis
The biosynthesis of tryptophan from chorismate, via anthranilate, is typically carried out by five sequential enzymatic reactions (Fig. 6) (Crawford, 1975). The first reaction is catalysed by a two-component anthranilate synthase, though these proteins are often fused. In Euglena, there are two transcripts encoding both components, one of which contains an additional N-terminal aminotransferase, which has not previously been noted. The next four reactions in the biosynthetic pathway are normally located on separate proteins, though there is sometimes one gene fusion (Cohn et al., 1979). Uniquely in Euglena, a single transcript encodes all four enzymes, which has not previously been observed (Schwarz et al., 1997). This unusual fusion construct has three domains related to fungal enzymes and a bacterial isomerase; the domain sequence is not in biosynthetic order. The final synthase, required for indole formation and condensation to serine, is composed of a hetero-tetramer in bacteria (α₂β₂) (Miles, 2006) and a homo-dimer in fungi ((αβ)₂) (Matchett and DeMoss, 1975). The tetra-functional enzyme from the Euglena transcriptome only contains the β-subunit, whilst the α-chain is encoded on a separate transcript. This suggests that tryptophan biosynthesis is catalysed by a hetero-tetramer composed of two copies of the multi-function protein and two copies of the tryptophan synthase α-chain. This once again highlights the unique capabilities that Euglena displays.

Natural product synthases
No polyketides or non-ribosomal peptides have been confirmed in Euglena to date. However, there are transcripts apparent for the complex multi-domain secondary metabolite synthases needed to make such compounds, as is evident for an increasing array of algae now that genome/transcriptome sequence data is becoming available (Sasso et al., 2012).
Polyketides comprise a huge range of compounds, formed by repeated condensation of acetate units, followed by variable reduction and further elaboration. Broadly speaking, polyketide synthases (PKSs) can be large multi-domain proteins (type I) or composed of discrete proteins with individual functions (type II), although there are other architectures possible (Shen et al., 2007). Fourteen potential PKSs were identified in Euglena as having the key ketosynthase domain, but attempts to predict the structures of the compounds synthesised by these, using SBSPKS (Anand et al., 2010) and the PKS/NRPS Analysis Web-site (Bachmann and Ravel, 2009), were not successful. Analysis of the domain sequences of these enzymes using DELTA-BLAST allows some putative predictions to be made. For example, the largest polyketide synthase encoded in the Euglena transcriptome (lm.8157) contains, in addition to a fully reducing PKS module, two enoyl hydratases and an HMG-CoA synthase (Fig. 7A). These proteins have been characterised in bacterial gene clusters as single-domain proteins, such as PksG, H and I in bacillaene biosynthesis, which add a β-methyl branch to polyketides (Butcher et al., 2007). Whilst the association of other components of the complete synthase cannot be predicted, this domain architecture suggests the formation of a methyl-branched alkane, which could be part of a polyketide, or alternatively may be included in a fatty acid structure.
Non-ribosomal peptide synthetases (NRPSs) are multi-domain proteins that join amino acids together to form small peptides with a diversity of function. Five proteins encoded in the Euglena transcriptome contain both A-domains, for the activation of the amino acids, and C-domains, for formation of the peptide bonds, with the peptidyl carrier protein (PCP) acting as a carrier for the growing peptide. For example, lm.32232 contains a complete module in the characteristic C-A-PCP domain order, followed by an additional module lacking the PCP and an extra N-terminal C-domain with no associated A-domain or PCP, which is occasionally seen in microbial NRPSs. The A-domains specify the amino acid but the prediction programmes SBSPKS (Anand et al., 2010) and NRPS/PKS Analysis Web-site (Bachmann and Ravel, 2009) are incapable of dealing with these sequences, probably because of the evolutionary distance from the bacterial and fungal species for which these pieces of software were designed.

Figure 7
Multidomain natural product megasynthases in Euglena. (A) lm.8157 is composed of 5 modules of a fully reducing polyketide synthase (in red) followed by three modules (in blue) that add a β-methyl branch to polyketides and a thioesterase domain (in grey) for release of the product from the acyl carrier protein (green). (B) lm.32232 is a non-ribosomal peptide synthase with two adenylation domains for activation of specific amino acids and three condensation domains for formation of peptide bonds.
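The domain organisation described for lm.32232 can be made concrete with a small sketch that groups a linear list of NRPS domains into modules, starting a new module at each C-domain. The domain order used here is only one possible reading of the description (an extra C-domain, then C-A-PCP, then C-A); it is an assumption, not an annotation taken from the transcriptome. The function name `split_modules` is hypothetical.

```python
from collections import Counter

# Canonical NRPS elongation module: condensation, adenylation, carrier.
COMPLETE_MODULE = ["C", "A", "PCP"]


def split_modules(domains):
    """Group a linear domain list into modules, one per C-domain.

    A new module is opened at every "C"; any other domain is attached
    to the module currently being built.
    """
    modules = []
    for domain in domains:
        if domain == "C" or not modules:
            modules.append([domain])
        else:
            modules[-1].append(domain)
    return modules


# Assumed domain order for lm.32232 (one reading of the description).
domains = ["C", "C", "A", "PCP", "C", "A"]

modules = split_modules(domains)
complete = [m for m in modules if m == COMPLETE_MODULE]
counts = Counter(domains)

print(modules)
print(dict(counts))
```

With this reading, the domain totals agree with the Figure 7 caption (two adenylation and three condensation domains), and only one of the three groups is a complete C-A-PCP module.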

Conclusions
The transcriptome of Euglena reveals a huge diversity of proteins, many of which have no known equivalents in other species. Although Euglena is only a unicellular organism, its number of genes and its control mechanisms are as sophisticated as those typically found in higher eukaryotes, if not more so. A diverse array of metabolic enzymes has been acquired through the complex evolutionary history of Euglena over the last 1.6 billion years, since their divergence from plants and other algae. A number of Euglena genes encode fusions of several domains on one polypeptide, acquired either through horizontal gene transfer from microorganisms or generated as novel fusions within Euglena. This suggests that there is some selective pressure that favours placing several domains from single pathways on one protein, rather than relying on diffusion or assembling multi-component complexes non-covalently. Perhaps the large cytoplasmic volume of Euglena cells, with a highly flexible cell shape and no vacuole, renders the diffusion of metabolites or proteins insufficient for efficient metabolism.
The novel and complex evolution of Euglena provides a wealth of unexpected information accessed from its transcriptome. There are many unusual features in genome and transcriptome dynamics, including mechanisms for metabolic control that are more complex than in 'higher' multi-cellular organisms. In addition, Euglena offers major opportunities for metabolic engineering and for the production of added-value biomolecules (vitamins, natural products, essential amino acids), and provides inspiration for the transfer of its diverse metabolic capabilities into other host systems.