Comprehensive transcriptome analysis provides new insights into nutritional strategies and phylogenetic relationships of chrysophytes

Background Chrysophytes are protist model species in ecology and ecophysiology and important grazers of bacteria-sized microorganisms and primary producers. However, they have not yet been investigated in detail at the molecular level, and no genomic and only little transcriptomic information is available. Chrysophytes exhibit different trophic modes: while phototrophic chrysophytes perform only photosynthesis, mixotrophs can gain carbon from bacterial food as well as from photosynthesis, and heterotrophs solely feed on bacteria-sized microorganisms. Recent phylogenies and megasystematics demonstrate an immense complexity of eukaryotic diversity with numerous transitions between phototrophic and heterotrophic organisms. The question we aim to answer is how the diverse nutritional strategies, accompanied or brought about by a reduction of the plasmid and size reduction in heterotrophic strains, affect physiology and molecular processes. Results We sequenced the mRNA of 18 chrysophyte strains on the Illumina HiSeq platform and analysed the transcriptomes to determine relations between the trophic mode (mixotrophic vs. heterotrophic) and gene expression. We observed an enrichment of genes for photosynthesis, porphyrin and chlorophyll metabolism for phototrophic and mixotrophic strains that can perform photosynthesis. Genes involved in nutrient absorption, environmental information processing and various transporters (e.g., monosaccharide, peptide, lipid transporters) were present or highly expressed only in heterotrophic strains that have to sense, digest and absorb bacterial food. We furthermore present a transcriptome-based alignment-free phylogeny construction approach using transcripts assembled from short reads to determine the evolutionary relationships between the strains and the possible influence of nutritional strategies on the reconstructed phylogeny. We discuss the resulting phylogenies in comparison to those from established approaches based on ribosomal RNA and orthologous genes. Finally, we make functionally annotated reference transcriptomes of each strain available to the community, significantly enhancing publicly available data on Chrysophyceae. Conclusions Our study is the first comprehensive transcriptomic characterisation of a diverse set of Chrysophyceaen strains. In addition, we showcase the possibility of inferring phylogenies from assembled transcriptomes using an alignment-free approach. The raw and functionally annotated data we provide will prove beneficial for further examination of the diversity within this taxon. Our molecular characterisation of different trophic modes presents a first such example.

Results. We sequenced the mRNA of 18 chrysophyte strains on the Illumina HiSeq platform and analysed the transcriptomes to determine relations between the trophic mode (mixotrophic vs. heterotrophic) and gene expression. We observed an enrichment of genes for photosynthesis, porphyrin and chlorophyll metabolism for phototrophic and mixotrophic strains that can perform photosynthesis. Genes involved in nutrient absorption, environmental information processing and various transporters (e.g. monosaccharide, peptide, lipid transporters) were present or highly expressed only in heterotrophic strains that have to sense, digest and absorb bacterial food. We furthermore present a transcriptome-based alignment-free phylogeny construction approach using transcripts assembled from short reads to determine the evolutionary relationships between the strains and the possible influence of nutritional strategies on the reconstructed phylogeny. We discuss the resulting phylogenies in comparison to those from established approaches based on ribosomal RNA and orthologous genes. Finally, we make functionally annotated reference transcriptomes of each strain available to the community, significantly enhancing publicly available data on Chrysophyceae. 1 INTRODUCTION 51 Recent phylogenies and megasystematics demonstrate an immense complexity of eukaryotic diversity 52 with numerous transitions between phototrophic and heterotrophic organisms (Adl et al., 2012;Keeling, 53 2004; Boenigk et al., 2015). While a primary endosymbiosis of a cyanobacterium into a eukaryotic host 54 cell (thus originating eukaryotic photosynthesis) is considered to be a singular event (Keeling, 2004), 55 except for the case of the cercozoan genus Paulinella, secondary and tertiary endosymbiosis, i.e., the 56 acquisition of a eukaryotic algae by a eukaryotic host cell, occurred several times (Keeling, 2004 (Andersen 97 able to work on even incompletely assembled sequences and are less affected by misassemblies.

115
To our knowledge, alignment-free methodologies have not yet been applied to transcript sequences.

116
Recent studies that used transcriptome data to infer phylogenies either use sequencing technologies which 117 produce long reads such as Roche 454 (Borner et al., 2014), or short read sequences in combination 118 with available reference genomes of the species (Wen et al., 2013). Usually, all transcripts that belong 119 to a set of orthologous genes are used for a combined multiple sequence alignment (Peters et al., 2014), 120 from which the trees are then built. However, certain transcripts might not be expressed (and hence not 121 observed) under study conditions, which may significantly reduce the set of available genes with complete 122 orthology information. Additionally, properly dealing with alternative transcripts of the same gene may 123 be non-trivial. We therefore describe an alignment-free k-mer approach for assembled transcriptomes, 124 apply it to the 18 chrysophyte RNA-seq datasets and discuss the resulting phylogenies in comparison to a 125 gene-based approach.

126
In summary, this article makes three contributions.   3. On the methodological side, we discuss phylogenetic inference from assembled transcriptomes 132 based on alignment-free k-mer methods.

133
The main objectives of our study are (1) to investigate relations between trophic mode and molecular 134 processes at the transcriptome level and (2) to use abundantly available transcriptome data as an additional 135 source for phylogeny reconstruction.

137
Strain cultivation and sample preparation 138 All strains were grown at 15 • C in a light chamber with 75-100µE illumination (1 E[instein] is defined as 139 the energy in 6.022 × 10 23 photons) and a light:dark cycle of 16:8 hours. Light intensities were adapted 140 to conditions allowing for near maximum oxygen evolution but still below light saturation in order to 141 avoid adverse effects (Rottberger et al., 2013). Due to different pH requirements of the investigated 142 strains different media were used: Most heterotrophic strains were grown in inorganic basal medium 143 (Hahn et al., 2003) with the addition of Listonella pelagia strain CB5 as food bacteria (Hahn, 1997), 144 exceptions from this are mentioned separately below. The inorganic basal medium for the axenic strains 145 was supplemented with 1 g/l of each of nutrient broth, soytone and yeast extract (NSY; Hahn et al. (2003)) 146 in order to allow for heterotrophic growth. Poteriospumella lacustris strains JBM10, JBNZ41 and JBC07 147 as well as Poterioochromonas malhamensis strain DS were grown axenically in the culture collection 148 of the working group. For details on origin, isolation procedure and axenicity of the axenic strains, see (Guillard and Lorenzen, 1972). We did not expect strong effects of the media on gene expression and 153 tested for this during data analysis (see Results and Discussion).

154
Cells for RNA isolation were harvested by centrifugation at 3000 g for 5 to 10 minutes at 20 • C. RNA 155 extraction was carried out under sterile conditions using TRIzol (Life Technologies, Paisley, Scotland -156 protocol modified). Pellets were ground in liquid nitrogen and incubated for 15 min in TRIzol. Chloroform 157 was added and the mixture was centrifuged to achieve separation of phases. The aqueous phase was 158 transferred to a new reaction tube and RNA was precipitated using isopropanol (incubation for 1h at 159 -20 • C and centrifugation). The RNA pellet was washed three times in 75% ethanol and re-suspended in  Quality control and preprocessing of sequencing data 168 The quality control tool FastQC (v0.10.1; Andrews (2012)) was used to analyse the basepair quality 169 distribution of the raw reads. Adapter sequences at the ends of the reads were removed using the 170 preprocessing software Cutadapt (v1.3; Martin (2011)). Cutadapt was also used to trim bad quality bases 171 with a quality score below 20 and discard reads with a length below 70 bp after trimming.     Manuscript to be reviewed overview maps, human diseases and drug development. Own implementations were used to perform a 205 hypergeometric test for each KEGG pathway and pathway visualisation. All mappings of genes to KEGG 206 pathways and pathways with a significant enrichment were reported.

235
The data is provided as a public resource at the European Nucleotide Archive (ENA) database, study 236 accession PRJEB13662 (Beisser et al., 2016). We provide both raw read sequences and assembled 237 transcript sequences (see below).

238
Assembly of transcripts 239 The cleaned reads were de-novo assembled to transcripts with the software Trinity and then quantified

243
The statistics for the assembled transcriptomes are shown in Figure 1, sorted by GC content of 244 the strains, which incidentally also separates the trophic modes (Fig. 1A). Trinity outputs assembled   merase, spliceosome, RNA processing, ribosome, proteasome, ubiquitin system and protein processing.

285
All species, except FU22KAK and JBM08, cover the essential modules. These observations mostly imply 286 good coverage and completeness of the assembled transcriptomes. The Dinobryon strain FU22KAK lacks 287 part of the central carbohydrate metabolism and nucleotide metabolism, which hints to a quality issue 288 which was already described in the last section. Apoikiospumella mondseeiensis (strain JBM08) contains 289 complete gene sets for the pathway modules, but misses some of the gene sets necessary for the structural 290 complexes, which is also evident from the highly expressed pathways (see next section) and likely caused 291 by the transcriptional state of the cell.

292
General molecular characterisation 293 In general, the most actively transcribed pathways comprise ribosome synthesis as well as protein 294 processing and biosynthesis of several amino acids (Figure 4). Since ribosome maintenance as well 295 as general transcription and translation are essential for protein production and functioning of the cell, 296 the genes present in these pathways were expected to be highly expressed. In heterotrophic strains, 297 genes affiliated with oxidative phosphorylation were also highly expressed. In contrast, photosynthetic   Manuscript to be reviewed  Manuscript to be reviewed separation is on the one hand due to differences in gene expression, but also depends on the group-specific 329 presence or absence of genes in either heterotrophic or mixotrophic taxa. We will focus on the two 330 aspects (1) group-specific genes and (2) differences in gene expression of common genes in the following 331 paragraph.

332
Influence of nutritional strategies 333 Heterotrophic chrysophytes presumably evolved from photosynthetic ancestors. This results in the 334 conclusion that heterotrophy is on the one hand a reduction at the cellular level, the reduction of the plastid, 335 and along with that a reduction of metabolic pathways associated with the plastid, i.e. photosynthesis,  Lhca4). These two strains also expressed few genes of the photosystem I and II, cytochrome b6/f complex, 369 electron transport and F-type ATPase. These findings indicate that the loss of photosynthesis in these  Another reduction can be seen in protein export pathways. The phototrophic strain expressed five 381 exclusive genes for the SRP (signal recognition particle), which could be an indication of protein transport 382 through the chloroplast membrane for pathways taking place in the plastid.

383
Taken together, the reduction of photosynthesis seems to start with the reduction of cost-intensive 384 enzymes and pathways such as carbon fixation by RuBisCo which seems to be abandoned first and is 385 missing in all investigated heterotrophs. The next step seems to be the reduction of photosystem I and II 386 whereas genes for the photosynthetic electron transport are still present in a number of heterotrophs and 387 seem to be reduced in a later step.   Table 1). For E. coli it was shown that it utilizes two ways to form glutamate. These differ in the fact that 494 one way (glutamate dehydrogenase) is energetically efficient (no direct requirement for ATP), while the 495 other one (GOGAT pathway: glutamine synthetase plus glutamate synthase) uses ATP (Helling, 2002).

496
The choice among these parallel pathways in biosynthesis has been hypothesized to control the speed synthesis (Helling, 2002). Differences in expression and gene content in the above pathways point to 500 known divergences in growth rates for mixo-and heterotrophic species (Boenigk et al., 2006).

501
Related to the amino acid metabolism we find higher expressed lysosomal enzymes for the break 502 down and usage of incorporated biomolecules for the heterotrophic group.

514
We aim at developing a method to use available transcriptome data to place strains into their evolu-515 tionary context and reconstructing their phylogenetic relationship. Especially when marker genes are not 516 yet sequenced, but transcriptome data is readily available, this information can be valuable.

536
For the alignment-based approach (right panel) transcripts were annotated with KEGG genes, pathways 537 and KEGG orthology information (Kanehisa and Goto, 2000). The KEGG orthology information was 538 used to find orthologous genes between all strains as a faster alternative to pair-wise bi-directional BLAST 539 searches. Overlapping regions between all 18 strains are detected, which are thereupon used to construct 540 multiple sequence alignments with MAFFT (Katoh and Standley, 2013). One or several transcripts, 541 constituting splice variants or alternative assemblies, were included if they feature an adequate length.

542
Based on the multiple sequence alignments at the nucleotide level, a model test was performed and a 543 bootstrapped maximum-likelihood phylogeny estimated. In general, the complete workflow can take 544 hours to days, depending on the number of species and number of orthologous genes.

545
In contrast, the alignment-free approach does not need orthology information or multiple sequence 546 alignments, omitting the steps marked in red in Figure 8. Based on the transcript sequences, k-mers were 547 counted (4-mers and 6-mers) and used to calculate transcriptome signatures (Eq. 1). Using the Euclidean 548 distance between the signatures, trees were thereupon constructed applying the Unweighted Paired Group

550
In the following, we describe how we computed transcriptomic signatures. We first computed 551 normalized oligonucleotide usage deviations (OUDs), the ratio of observed excess counts of k-mers in a 552 transcriptome to the expected count under a null model, following Reva and Tümmler (2004).

553
To determine the oligonucleotide usage deviations (OUD) among transcriptomes, the observed number N(z) of each k-mer z is compared to its expected value and normalized. We define Figure 8. Overview of alignment-free (left) and alignment-based gene-centric (right) approaches of phylogenetic inference. Both approaches include the assembly of reads to transcripts using Trinity (Grabherr et al., 2011). Based on the transcript sequences in the alignment-free approach, k-mers were counted (4-mers and 6-mers) and used to calculate transcriptome signatures. Using the euclidean distance between the signatures, trees were constructed using the Unweighted Paired Group Method with Arithmetic Mean (UPGMA). In contrast, for the alignment-based approach, the steps marked in red differ.
Here the transcripts were annotated with KEGG genes, pathways and KEGG orthology information (Kanehisa and Goto, 2000). The KEGG orthology information was used to find orthologous genes between all strains. Multiple sequence alignments were constructed with MAFFT (Katoh and Standley, 2013) using overlapping regions of the transcripts of all 18 strains. One or several transcripts, constituting splice variants or alternative assemblies, were included if they feature an adequate length. Based on the multiple sequence alignments, a model test was performed and a bootstrapped maximum-likelihood phylogeny estimated.

Manuscript to be reviewed
where E 0 (z) = (L − k)/4 k is the expected count of k-mer z assuming uniform distribution of all k-mers in 554 a transcriptome of length L, and E 1 (z) is the expected count of k-mer z using mononucleotide content, 555 corresponding to an i.i.d. model (independent identically distributed). 556 We computed tetranucleotide and hexanucleotide signatures containing 4 4 = 256 and 4 6 = 4096 elements using Jellyfish (v1.1.2; Marçais and Kingsford (2011)). The Euclidean distance between two signatures x and y of 4 k possible k-mers was used, defined as (2) The abundance of tetra-and hexanucleotides was calculated over the transcripts that were de-novo of multiple sequence alignments. We therefore had to settle for fewer genes and this also led to the 581 consideration of alignment-free approaches.

582
In contrast to these transcriptome phylogenies, the SSU phylogeny of considered strains is depicted Manuscript to be reviewed lacustris probably due to a high conservation in the gene. All k-mer based phylogenies indicate a higher 617 similarity between Poteriospumella lacustris strain JBM10 and Poteriospumella lacustris strain JBC07.

618
The same results were found by Stoeck et al. (2008) using the SSU and a concatenation of sequences for 619 three protein-coding genes (alpha-tubulin, betatubulin and actin) and rDNA fragments (SSU rDNA, ITS1, 620 5.8S rDNA, ITS2). In contrast, the multigene approach shows a higher similarity between Poteriospumella 621 lacustris strains JBM10 and JBNZ41.

622
Overall, for closely related species in the same genus or a taxonomy of higher orders the k-mer 623 approach on coding sequences worked best, but it has to be noted that it did not reproduce the SSU phy-624 logeny nor did the other three phylogenetic approaches. Evaluation on the SSU rRNA-based methodology 625 has shown that SSU phylogenies are sufficiently reliable and robust for evaluating relationships between 626 organisms related at the genus level and higher (Liu and Jansson, 2010). Therefore, the current SSU 627 phylogeny will probably also undergo changes with new sequences becoming available, but at the moment 628 depicts the most reliable phylogeny at the genome level and higher. Furthermore, the Chrysophyceae 629 exhibit a high phylogenetic diversity with up to 10% pairwise differences in nucleotides in the SSU 630 sequence and it was shown previously that the SSU rRNA gene offers here a higher resolution than

690
In addition to the comparison of chrysophyte physiology by trophic mode, we presented an alignment-691 free approach to use the transcriptomic sequences to infer phylogenetic relationships. We use a k-mer 692 based approach which provides several benefits for transcripts assembled from short read RNA-Seq 693 data. Our best result was obtained using a mononucleotide-normalized 6-mer phylogenetic approach on