Analysis of mRNA processing at whole transcriptome level, transcriptomic profile and genome sequence refinement of Trypanosoma cruzi

The genomic sequence of Trypanosoma cruzi, the protozoan causative of Chagas disease was published more than a decade ago. However, due to their complexity, its complete haploid predicted sequence and therefore its genetic repertoire remains unconfirmed. In this work, we have used RNAseq data to improve the previous genome assembly of Sylvio X10 strain and to define the complete transcriptome at trypomastigote stage (mammalian stage). A total of 22,977 transcripts were identified, of which more than half could be considered novel as they did not match previously annotated genes. Moreover, for the first time in T. cruzi, we are providing their relative abundance levels. We have identified that Sylvio X10 trypomastigotes exhibit a predominance of surface protein genes, specifically those encoding trans-sialidase and mucin-like proteins. On the other hand, detailed analysis of the pre-mRNA processing sites revealed some similarities but also some differences in the spliced leader and different polyadenylation addition sites compared to close related kinetoplastid parasites. Our results also confirm that transcription is bidirectional as occur in other kinetoplastids and the proportion of forward-sense and reverse-sense transcripts is almost equivalent, demonstrating that a strand-specificity does not exist.

While other kinetoplastids that cause human diseases such as Leishmania (Leishmaniasis) and T. brucei (sleeping sickness) have been and are still extensively studied regarding genomics and transcriptomics, allowing the description of most of its molecular regulatory mechanisms through its life cycle and host interaction, T. cruzi remains poorly understood 7 .
To date, some attempts to describe the global genome of T. cruzi have been performed but his high intra-species variability has made of this a complex challenge [17][18][19][20][21] . In recent years T. cruzi has gained some attention at transcriptomic and proteomic level 3,22-25 but contrary to other kinetoplastids, little is known about the parasite molecular switches. In this work, we performed RNAseq analysis of Sylvio X10 strain to describe the trypomastigote transcriptome, but importantly it helped to improve significantly the previous genome sequence of this strain and also allow describing for the first time some of its main regulatory molecular mechanisms of T. cruzi mRNA processing.

Results and Discussion
Genome correction, transcriptome assembly and quantification. RNA from trypomastigotes of T.
cruzi (strain Sylvio X10) was sequenced after polyA+ selection on Illumina MiSeq sequencing platform, generating 17,332,912 of paired-end reads (length: 75 nucleotides). Raw data was deposited to the SRA database under accession number PRJNA546488. Raw reads trimmed and filtered (maximum and minimum length: 100 and 50 respectively, minimum mean quality 25, phred score based) were mapped to SylvioX10 genome deposited on Tritryp (http://tritrypdb.org/, "TcruziSylvioX10-1") allowing up to three mismatches with bowtie2. Considering that about 81.75% of the total reads were successfully mapped to the reference genome, we use it as bona fide genome reference.
The Sylvio X10 genome is composed by 47 "chromosome-like sequences" (C-LS) that are long scaffolds, no complete chromosomes. Thus, it is important to notice that non-aligned reads may correspond to undefined genomic regions (gaps) or missing chromosomes. Consequently, RNAseq reads were used to improve genomic assembly, such as the correction of deletions, insertions, SNPs etc. For this, we performed an assembly correction using Pilon, a bioinformatic tool for correcting drafted haploid and diploid microbial assemblies using paired end reads. Also, a previous paired DNAseq reads under the NCBI bioproject number PRJNA395140 were included to this genomic correction.
Results of Pilon improvement are summarized in Fig. 1. About 68% of the total genome bases were confirmed, corresponding to about 28 Mb of information (including gaps). Sequence correction affected mainly coding regions, 3,415 SNPs and 39 ambiguous bases were corrected, 72,887 insertions removed and 1,810 previous deletions were also corrected. Eighteen of the forty-seven C-LS decrease in total length up to 23 Kb as C-LS 4; the longest C-LS (1) decrease 7.2 Kb and the second longest C-LS decrease 12.8 Kb, while the shortest (C-LS 47) decreased 7.2 Kb. In total, 133.35 Kb of nucleotide insertions were trimmed from the previous assembly, while 43.66 Kb of new information was added to the new genome. In total, 177.02 Kb were corrected. In addition, it is important to notice that C-LS 17 and 47 had the lowest DNA-RNA sequencing coverage and the biggest ratio gaps per assembled base. This may be indicative that these sequences correspond to structural non-coding chromosomes, ambiguous assemblies or spurious sequences. But it clearly needs to be confirmed by further deeper genomic studies (Annotation in Supplementary File 1).
After genome correction 82.35% of the trimmed RNAseq reads were successfully mapped and were used in further analysis.
In trypanosomatids a discriminatory mechanism for the initiation of transcription at individual loci is absent in most genes. Thus, long RNAs containing more than one gene are transcribed at the same time by the polymerase II into long polycistronic transcription units (PTUs), requiring further mRNA maturation process that had been described before and named as trans-splicing and polyadenylation [26][27][28] . In this sense, the total transcripts were obtained in two steps: firstly, primary assembly (potential PTUs) was performed using Stringtie identifying 9,108 transcripts, corresponding to previous transcriptomic analysis 24,29 , but this is less than 50% of the recently predicted ORF's (20,058) that constitutes the T. cruzi genome 21 , or even the total genes annotated in Sylvio X10 (20,619).
We cannot discard the presence of unprocessed and stable PTUs in T. cruzi in our primary assembly as it has been describe in some instances in tryponosomatids 30 . However, since translation regulation relays mainly on mRNA stability, we found potential mature messengers from the same PTUs showing clear different levels of abundance ( Fig. 2A), some of them match to more than one predicted gene. Surprisingly, most of PTUs did not match to any predicted genes.
In order to define the mature transcripts, we took advantage of the 39-nucleotide mini-exon sequence (SL) and polyA tail present in the 5′ and 3′-UTRs respectively of all mRNAs-31-33 , as described by Rastrojo et al. 34 . 1,086,343 reads containing the SL sequence were trimmed and aligned back to the reference genome identifying 42,700 total SL-insertion sites (Fig. 2B), whereas 85,452 reads containing polyA-reads and 14,512 poly-Adenylation addition sites were identified. SL-containing sequences aligning in contrary orientation in any single PTU, confirming that in T. cruzi trans-splicing occurs only in the same orientation of the PTU similar to what has been described in Leishmania major Friedlin transcriptome 34 .
Considering that the previous annotated genome sequence was modified by our RNA analysis, the new Sylvio X10 genome was re-annotated using Companion and our transcriptomic evidence.
Using the SL and polyA insertion sites to define the 5′ start and 3′ end, respectively, most of the primary assembled PTUs were finally divided into 18,666 mature mRNAs. Of these, 1,172 were annotated as polycistronic, corresponding to sequences that contain one or more annotated ORFs, but that were not divided into mature transcripts by the presence of SL or poly-Adenylation-containing reads. Eight hundred seventy-two were annotated as truncated transcripts, corresponding to transcripts that do not cover its entire predicted ORF length (5′ or 3′ sequence). Although polycistronic and truncated mRNAs needs to be confirmed by future genomic and transcriptomic analysis, most cases correspond to low coverage transcripts. This may simply reflect that they are low-abundant transcripts at trypomastigote stage.
The remaining reads (about 3 million paired-end) that did not map to the reference genome or to kinetoplastid known sequences, were assembled de novo into potential transcripts using rnaSPAdes. Total potential assembled transcripts were filtered by coverage (RDC >= 10), length (>1 kb), sequence redundancy (≥90%) and contained ORFs (just contigs containing one ORF were considered). Finally, we obtained a total of 4,311 extra-transcripts (About 9.62 Mb). The abundance of these transcripts were also calculated using Stringtie and the sequences were added to the final transcriptome obtaining a total of 22,977 potential mRNAs. Those transcripts were named using the same nomenclature in other trypanosomatids: Trypanosoma cruzi "TC", Strain; "SX10", Polycistron Number, Monocistron number and Sense; "F", "R" or "X" (no sense identified) e.g TCSX10.1.1.F. www.nature.com/scientificreports www.nature.com/scientificreports/ De novo assembled transcripts were named using the contig number instead of polycistron and monocistron numbers, by default sorted by length (increasing order) and adding "ND" (non-defined) e.g TCSX10_ND.1.F, the sense was defined by the encoded ORF.
Taken together our results indicate that about the 79.95% of the T. cruzi genome (strain Sylvio X10) corresponds to coding sequence and therefore about 20% correspond to non-conding structural sequences. This is in sharply contrast with previous genomic analysis were only 37.73% of the total genome were defined as corresponding to coding sequences. Thus, our results have more than duplicated this estimation, largely improving the previous genomic annotation (Fig. 3). Besides, taking into account the length of ND transcripts, the haploid genome for Sylvio X10 may be higher than previously reported (at least 51 Mb).
In related species such as Leishmania, transcription initiation preferentially occurs at divergent strand-switch regions (SSR), where PTUs raise in opposite directions on opposing DNA strands 15,16 . In the Sylvio trypomastigote stage, we have identified a total of 226 SSR transcription initiation sites across the entire genome and up to 15 in a single C-LS (TcX10_chr24). All C-LSs contains at least one SSR (except TcX10_chr17) and 4.8 on average per chromosome. There is a medium positive correlation (0.54) between the number of SSR and C-LS length. On the other hand, there is a strong positive correlation between the total transcripts per chromosome and the truncated/ polycistronic (0.86 and 0.84, respectively) transcripts found (as expected), whilst, there is no significant correlation between the number of forward, reverse and undefined transcripts (Fig. S1).
Seventeen thousand three hundred thirty-four transcripts have a FPKM (fragments per kilobase per million mapped reads) greater than or equal to 10 and quartile 3 includes values up to 66, whereas 59 transcripts have the maximum values (outliers, FPKM greater than or equal to 1,000) as shown in Fig. S2. These 59 top abundant transcripts (with FPKM greater or equal to 1,000) and its predicted function are listed in Table 1. Surprisingly, 83% (49) of the 59 more abundant transcripts correspond to de novo assembled transcripts and therefore were not annotated on the actual available genome for this strain.
The top abundant transcript corresponds to a ND transcript (TCSX10_ND.1840.F) containing an ORF coding for a highly conserved (in T. cruzi) elongation factor 1-alpha (EF-1-alpha). Different roles have been assigned to EF-1-alpha in various cellular processes, including metabolism, cytoskeletal organization, oncogenic transformation, apoptosis and anti-apoptosis and in T. cruzi has been described as potential regulator of gene expression 35 . Interestingly a second elongation factor (EF-2, TCSX10_ND.7486.F) was found in this top abundant list. The second and fifth top abundant transcripts correspond to mucin-like glycoproteins (TCSX10_ND.7941.F and and TCSX10_ND.5081.F) that are likely associated with parasite cell invasion and survival in Tc I strains 36 . Furthermore, 4 of these top abundant mRNAs in Sylvio X10 corresponds to heat shock proteins, as occurs in Leishmania (promastigotes) where they make up to 2.1% of the total protein in unstressed conditions 37 .
RNA binding proteins (RBPs), due their role in regulation of gene expression may represent one of the most important gene family for which parasite stage differential expression/transcript abundance information is of special interest. Unexpectedly, we found just one transcript (TCSX10_ND.7601.F, FPKM = 2301.46829) showing RNA recognition motif 3 (RRM3) of type I (polyA-binding proteins: PABPs) which is conserved in proteins that bind to the poly(A) tail of most eukaryotic mRNAs 38 . It may constitute a potential drug target for a translational inhibitor at trypomastigote stage that deserve deeper analysis across the entire parasite's life cycle. Additionally, we found being part of this list 17 mRNAs coding for ribosomal proteins (9 for 60S, 5 for 40S and one for S25, L24 and L37). It has been demonstrated that striking differences in ribosomal composition of trypanosomatids www.nature.com/scientificreports www.nature.com/scientificreports/ comparing to other eukaryotes (mainly the mammal host) may lead also to specific drug targets in Chagas disease [39][40][41] . The remaining abundant transcripts encode canonical histones H2B and H2A (TCSX10_ND.4352.F, TCSX10_ND.4559.F) and some others corresponding to proteins involved in cell metabolism.
The minimum transcript length was of 200 nucleotides and the maximum about 20.5 Kb (TCSX10.4085.2.F, a polycistronic transcript not divided by SL-containing or polyA-containing reads but containing 2 ORFs) and TCSX10_ND.1.F (a de novo assembled) transcript of about 20.2 Kb containing an ORF of 6,635 residues encoding a conserved transferase. The predominant length for Sylvio X10 transcripts is from 800 nucleotides to 2.2 Kb.
Finally, 12,146 of the C-LS transcripts (65%) and 4,197 (97%) of the ND transcripts were successfully identified by blastx hit to nr protein database, non-blasted transcripts were annotated as "Not Found".
Functional enrichment analysis. In order to define and visualize their functions, mature transcripts obtained in this study were ascribed gene ontology IDs (GO terms) with the Blast2GO tool. Level 3 of the gene ontology (GO terms) families mapped (at least 10 transcripts to be considered significant) were used to visualize the complete classification and functional enrichment as show in Fig. 4. In agreement to previous results, we found a set of transcripts ubiquitously abundant on epimastigotes, amastigotes and trypomastigotes 24 : 149 related to microtubule movement/processes (biological process/molecular function), 125 for chromosome organization/ DNA packing (cellular component) and 17 for stress response (biological process) among others.
In Sylvio X10 trypomastigotes, the most abundant transcripts corresponding to cellular components belongs to membrane products (up to 200), cytoplasm and ribosomes. A total of 1,198 transcripts were classified into this level. GO terms associated to biological process corresponds to specific objectives that the organism is genetically programmed to achieve, including mRNAs that are implicated along the complete process or in the final outcome. A total 2,812 transcripts correspond to this group. Since the sequenced RNA was extracted from trypomastigotes (non-replicative infective form), is not surprising that the most represented biological process corresponds to pathogenesis (316), proteolysis (185), protein phosphorylation (142), cell adhesion (90) necessary in infection steps. These findings are also in agreement to Berná et al. 24 , which describes the enrichment of processes related to movement, adhesion, invasion and signalling in trypomastigotes (undefined strain). GO terms of Molecular function include transcripts involved on specific activities, and we found that in the Sylvio X10 transcriptome this is the GO family that includes most of the transcripts mapped (4,624). ATP-binding (480), protein-binding also www.nature.com/scientificreports www.nature.com/scientificreports/  www.nature.com/scientificreports www.nature.com/scientificreports/ known as glycoprotein-binding (398), exo-alpha-sialidase activity (334), protein kinases (134) and nucleic acid binding constitutes the top 5 most enriched functions.
Additionally, to the transcriptomic classification by GO terms, a second functional-based classification allow us the identification of the most abundant gene families over-represented across the entire transcriptome (Fig. S3). A total of 1,632 functionally related gene families were identified (17,929 transcripts), comprising 239 families (containing 10 or more genes) and corresponding to about the 52% (11,960) of the total transcriptome. Interestingly, 928 transcripts matching to Interpro GO ids did not show similar functions to other transcripts (single family) and 5,048 were not identified. This grouping of transcripts into functional families confirms that the trypomastigote physiological needs are covered by the transcription of multi-copy genes. As expected, some of these families correspond to the most abundant genes annotated along the T. cruzi genome 21 such as; 1,329 transcripts coding for sialidase activity, 235 mucin-like proteins, 284 RHS (retrotransposon hot-spot) and 338 kinase proteins. Interestingly, about 428 transcripts corresponded to WD40-repeat-containing domain superfamily, proteins which are found in all eukaryotes and implicated in a variety of functions ranging from signal transduction and transcription regulation to cell cycle and apoptosis 42 . Specifically, WD40 motifs may act as a site for protein-protein interaction, and proteins containing WD40 repeats are known to serve as platforms for the assembly of protein complexes or mediators of transient interplay among other proteins. Other interesting abundant functionally related families are the P-loop containing nucleoside triphosphate hydrolase (695 transcripts), which is composed by proteins containing the most frequent domain of nucleotide-binding proteins 43 . This superfamily is characterized by the concanavalin A-like lectin/glucanase domain which includes proteins like glycosyl hydrolases, lectins, lyases, Beta-D-xylosidases, vp4 sialic acid binding protein among others 44 .

Trans-splicing and poly-Adenylation sites.
The addition of about 39-nt mini-exon (SL) to the 5′ start of all mRNAs is an essential part of the maturation process in trypanosomatids 45,46 . As a result, an AG dinucleotide has been described as the consensus sequence for SL trans-splicing, but unfortunately, no specific signal for polyadenylation had been defined in Trypanosomatids, just a global polypyrimidine tracts of variable length is supposed to compose these regions 34,[47][48][49][50] . Small differences between Leishmania and T. brucei in nucleotide composition surrounding the AG nucleotides and polyadenylation sites have been described before, suggesting that slightly different specific mechanisms may control the mRNA maturation process across trypanosomatid species 34,48 . In contrast, in-depth analysis of these sites in T. cruzi has not been performed to date. Thus, in this work, analysis of SL-containing and polyA-containing reads allowed us to differentiate between mature transcripts and long RNAs or PTUs, but also to analyse trans-splicing and polyadenylation sites to single nucleotide resolution.
We search for sequence enrichment/motifs associated to SL-addition sites by the calculation of the weblogo sequence on the ±30 nucleotide region from the trans-splicing and polyadenylation sites (Fig. 5). A total of 18,589 SL-addition sites were found. In agreement to previous results 34,48 , the AG dinucleotide was determined as the conserved insertion signature with 97.39% and 97.19% of probability for each base, respectively (Fig. 5A). Surprisingly, in contrast to Leishmania 34 (having a Cytosine) but concordantly with T. brucei 48 , an Adenine nucleotide was the most probable (31.84%) residue before the dinucleotide AG (position -3), followed by Cytosine (29.18%), Thymine (26.53%) and Guanine (12.42%), interestingly at position -4 Guanine is the most probable nucleotide (31.20%) as occurs in L. major but contrary to T. brucei where a polyT tract starts and continue as the most frequent composition up to 50 nucleotides upstream.
The most marked differences between T. cruzi with L. major and T. brucei in the DNA regions surrounding the SL addition sites are in the pyrimidine nucleotide enrichment. Whereas in Leishmania C-T dinucleotides clearly consist about the 70% of nucleotides present upstream and downstream the AG dinucleotide, in T. cruzi, this pattern is different. T. cruzi just conserved the C-T pattern in the upstream 5′ region while downstream region is composed of purine nucleotides (A-G) up to 60% (Fig. 5A). T. brucei, also only had the C-T pattern in the upstream 5′ region but downstream A-T dinucleotides are the most frequent bases, indicating that in general, T. cruzi transcripts show more proportional nucleotide composition and G + C profiles (as ocurr in L. major) than T. brucei.
Regarding the polyadenylation signals, trypanosomatids, contrary to other eukaryotes, do not share the conserved AAUAA polyadenylation motif 51,52 . In contrast, an A-T, A-T, A-T in T. brucei and A-G, A-T, A-T for L. major have been described as the most probable dinucleotides preceding the polyadenylation site. In addition, sequence heterogeneity surrounding ±40 bases of these motifs between these T. cruzi related species have been determined. L. major displays a more variable sequence composition upstream and downstream polyadenylation www.nature.com/scientificreports www.nature.com/scientificreports/ site (specially the 5′) 34,47 whilst, in T. brucei is possible to visualize a uniform pattern in both extremes composed by T-A nucleotides. In the T. cruzi, we found and analysed 9,789 polyadenylation sites that allow us to identify and visualize the surrounding genomic region (Fig. 5B).
Interestingly and contrary to other trypanosomatids where a AA dinucleotide has been found on the polyadenylation site 34,53 , we found that a single nucleotide is the most probable signal of polyadenylation start. Cytosine is the most frequent signalling nucleotide (45.32%), Adenine and Guanine are almost equally frequent bases (26.34% and 21.52%, respectively) and Thymine is the less frequent nucleotide with 6.79%. Likewise, our analysis allowed us to identify abundant Thymine composition (up to 95.45% at position −1) in upstream region, and a higher T + A composition (up to 89%) on the downstream sites, starting from +2 position. Together, our results indicate that mRNA maturation processes in T. cruzi may differ significantly from its closely related kinetoplastid counterparts.

conclusions
In this work and for first time in T. cruzi, we have described some of the most important transcriptomic features in the tripomastigote stage of Sylvio X10 strain, as well as to contribute to the improvement of its genome sequence. Firstly, we corrected about 177 Kb of ambiguous genomic regions using RNAseq data and secondly we identified, classified, quantified and annotated 22,977 mature mRNAs. Their identification and quantification has allowed us to identify the main probable regulatory elements, and to significantly improve the entire genomic annotation.
We have also identified the transcription initiation SSRs across the genome, and determined that are positively correlated to the "chromosomal" length and therefore there is no strand-specificity as it occurs in Leishmania. Slighter abundance of Forward transcripts, also suggest that PTUs in T. cruzi are equally transcribed and that according to the genomic annotation the packaged genes may be or not functionally related. Besides, our results have identified that Sylvio X10 trypomastigotes exhibit a predominance of surface protein mRNAS, mainly those encoding trans-sialidase and mucin-like proteins which also constitutes the most expanded gene families, confirming that the gene copy number act as secondary regulator mechanism of protein expression.
Finally, but equally important, we have identified and described the trans-splicing and polyadenylation sites in T. cruzi which are different to other closely related kinetoplastids such as Leishmania and T. brucei, mainly on the polyadenylation signal and surrounding sequences.

Material and Methods
Parasite cultures and RNA extraction. The Sylvio X10 strain was obtained from Dr. M. Miles (London School of Hygiene and Tropical Medicine, London, UK) through the European program ChagasEpiNet. Vero cells were grown in RPMI medium supplemented with 5% fetal bovine serum (FBS), 100 UI/mL of antibiotics mixture, 10 µg/mL streptomycin and 2 mM glutamine at 37 °C in an atmosphere of 5% CO2 until the cells reached 80% confluence. The cell monolayer was subsequently infected with metacyclic trypomastigotes of T. cruzi Sylvio X10 strain. After 4 days, the supernatant medium was collected, Vero cells and amastigotes were removed by centrifugation at 1000 g by 5 minutes. Trypomastigotes were collected by centrifugation at 1600 g for 10 minutes. Three biological replicates where mixed before RNA isolation.
RNA was isolated using the RNeasy Mini Kit (Qiagen) and treated with RNAse-free DNAse I. RNA samples were quantified by absorbance at 260 nm using the Nanodrop ND-1000 (Thermo Scientific), all samples showed an A260/A280 ratio higher than 2.0. In addition, RNA integrity was checked in a bioanalyzer (Agilent 2100) resulting on RIN value higher than 8.
RNAseq and transcriptome analysis. RNA-seq was performed at the Massive Sequencing Platform of Cantoblanco (CSIC-PCM, Madrid, Spain). Standard libraries for massive sequencing were generated using the TruSeq RNA Sample Prep Kit (Illumina). Briefly, poly-A + RNA was selected by oligo-dT chromatography, and RNA fragmentation was achieved using divalent cations under elevated temperature. Afterwards, these fragments were used to generate a cDNA library, and cDNA fragments corresponding in size to about 450-550 bp were isolated from an agarose gel. Paired-end reads of 100 nucleotides were obtained, and raw reads were subject to quality-filtered using the standard Illumina process and analysed using FASTQC tool 54 .
Reads were mapped to the reference genome using Bowtie2 55 , transcript assembly and abundance was calculated with Stringtie 56 and assembly corrections were performed using Pilon 57 . Transcript identification/classification was performed by Blas2GO suite 58 , Blastx using the protozoan non-redundant (nr) proteome downloaded from the NCBI and in-house python scripts. Trans-splicing and polyadenylation sites were identified as described before by Rastrojo et al. 34 and the local version of the WebLogo tool 59 . All figures and statistical analysis were performed using R and Rstudio 60 .
Additional bioinformatics tools were used to handle, parse, analyse or visualize sequencing data such as samtools 61 , CD-HIT 62 and IGV 63 .

Data availability
Raw data has been submitted to the NCBI repository (SRA) under the accession number PRJNA546488. The new genome version, transcript sequences and annotation file have been also submitted to the TritrypDB for its public access. Both data will be immediately available after the publication of our manuscript.