The prediction of a pathogenesis-related secretome of Puccinia helianthi through high-throughput transcriptome analysis

Many plant pathogen secretory proteins are known to be elicitors or pathogenic factors, which play an important role in the host-pathogen interaction process. Bioinformatics approaches make possible the large-scale prediction and analysis of secretory proteins from the Puccinia helianthi transcriptome. The internet-based software SignalP v4.1, TargetP v1.1, Big-PI predictor, TMHMM v2.0 and ProtComp v9.0 were utilized to predict the signal peptides and the signal peptide-dependent secreted proteins among the 35,286 ORFs of the P. helianthi transcriptome. A total of 908 ORFs (accounting for 2.6% of the total proteins) were identified as putative secretory proteins containing signal peptides. The length of the majority of proteins ranged from 51 to 300 amino acids (aa), while the signal peptides were mostly 18 to 20 aa long. Signal peptidase I (SpI) cleavage sites were found in 463 of these putative secretory signal peptides, and 55 proteins contained the lipoprotein signal peptide recognition site of signal peptidase II (SpII). Out of 908 secretory proteins, 581 (64.0%) have functions related to signal recognition and transduction, metabolism, transport and catabolism. Additionally, 143 putative secretory proteins were categorized into 27 functional groups based on Gene Ontology terms, including 14 groups in biological process, seven in cellular component, and six in molecular function. Gene Ontology analysis of the secretory proteins revealed an enrichment of hydrolase activity. Pathway associations were established for 82 (9.0%) secretory proteins. A number of cell wall degrading enzymes and three proteins homologous to Phytophthora sojae effectors were also identified, which may be involved in the pathogenicity of the sunflower rust pathogen. This investigation proposes a new approach for identifying elicitors and pathogenic factors.
The eventual identification and characterization of 908 extracellularly secreted proteins will advance our understanding of the molecular mechanisms of interactions between sunflower and rust pathogen and will enhance our ability to intervene in disease states.


Background
Sunflower rust, caused by Puccinia helianthi Schw., is a widespread disease of sunflower (Helianthus annuus L.) throughout the world and may cause significant yield losses and loss of seed quality. P. helianthi is an obligate pathogen and completes its life cycle on sunflower. Although P. helianthi is a pathogen of great economic importance, little is known about the molecular mechanisms involved in its pathogenicity and host specificity.
Pathogen secretory proteins and host plant defense interactions involve complex signal exchanges at the plant surface and at the interface between the pathogen and the host [1,2]. Plant pathogens are endowed with a special ability to interfere with physiological, biochemical, and morphological processes of the host plants through a diverse array of extracellular effectors. These are present or active at the intercellular interface or delivered inside the host cell to reach their cellular target and facilitate infection or trigger defense responses [3][4][5].
Thus, genes encoding extracellular proteins have a higher probability of being involved in virulence.
Amino-terminal signal peptides are responsible for transporting virulence factors [20]. The N-terminal signal peptides can be classified into four types based on the recognition sequences of signal peptidases. The first class is composed of "typical" signal peptides, which are cleaved by one of the various type I SPases of Bacillus subtilis [21][22][23]; most secretory proteins with this signal peptide are secreted into the extracellular environment. This group also includes signal peptides with a so-called twin-arginine motif (RR-motif) that are transported via the twin-arginine translocation pathway (Tat pathway). In bacteria, the Tat translocase is found in the cytoplasmic membrane and exports proteins to the cell envelope or to the extracellular space [24]. The second class comprises lipoprotein signal peptides cleaved by the lipoprotein-specific (type II) SPase of B. subtilis (Lsp) [25,26]. Secretory proteins with the aforementioned signal peptides are transported via the general secretion pathway (Sec-pathway) [27]. The third class constitutes prepilin-like proteins cleaved by the prepilin-specific SPase ComC, and the fourth class consists of the signal peptides of ribosomally synthesized bacteriocins and pheromones [28,29]. These signal peptides lack a hydrophobic H-domain and can be removed from the mature protein by a subunit of the ABC transporter or by specific SPases.
An examination of the pathogenesis-related secretome of P. helianthi is important for understanding the molecular mechanism of pathogen-host interaction. Here, we performed a high-throughput transcriptome analysis of proteins containing a signal peptide. We analyzed a total of 35,286 ORFs of the P. helianthi transcriptome using the SignalP v4.1, TMHMM v2.0, TargetP v1.1, TatP v1.0 and big-PI predictor bioinformatics tools to identify secretory proteins.

Isolates and culture conditions
Rust-infected sunflower leaves were collected separately in paper bags and air dried at room temperature for 24 h. Spores from mature uredial pustules were then brushed off the leaves and stored at 4-5°C. The collected inocula were inoculated on the universal susceptible line 7350. After 10-15 days, urediniospores of a single pustule were used to inoculate two-week-old susceptible plants to produce purified isolates. Subsequently, fresh urediniospores of each isolate were collected from rusted leaves by flicking the leaves against parchment paper; the fresh spores were then dried for 3 days in a desiccator and stored individually at −80°C. In this experiment, the transcriptome data were obtained from P. helianthi isolate SY.

Puccinia helianthi transcriptomic data sets
We constructed a P. helianthi reference transcriptome from urediniospores at different growth stages (fresh urediniospores at 0 h, and spores germinated for 4 and 8 h). The cDNA library was sequenced on the Illumina HiSeq™ 2500. For the assembly library, raw reads were filtered to remove those containing an adapter and those with more than 5% unknown nucleotides. Low-quality reads, in which more than 20% of bases had a low Q-value (≤10), were also removed. Clean reads were assembled de novo with the Trinity program, yielding 59,409 transcripts with a mean size of 1394 bp. Sequence data have been uploaded to the Short Read Archive (https://www.ncbi.nlm.nih.gov/sra) of the National Center for Biotechnology Information (NCBI) under accession number SRP059519. The secretory proteins were predicted according to the N-terminal amino acid sequences of 35,286 ORFs (Additional file 1).
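The three read-filtering rules described above can be illustrated with a minimal Python sketch. It is not the software used in this study: the function name, its parameters and the adapter string used in any example are hypothetical, and qualities are assumed to be already decoded into integer Phred scores.

```python
from typing import List


def passes_filter(seq: str, quals: List[int], adapter: str = "",
                  max_n_frac: float = 0.05, low_q: int = 10,
                  max_low_q_frac: float = 0.20) -> bool:
    """Return True if a read survives the three filters described in the text.

    seq     -- nucleotide sequence of the read
    quals   -- per-base Phred quality scores (same length as seq)
    adapter -- adapter sequence to screen for (hypothetical; empty = skip)
    """
    if not seq or len(seq) != len(quals):
        return False
    # 1. Discard reads that still contain adapter sequence.
    if adapter and adapter in seq:
        return False
    # 2. Discard reads with more than 5% unknown nucleotides (N).
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # 3. Discard reads in which more than 20% of bases have Q <= 10.
    low = sum(1 for q in quals if q <= low_q)
    return low / len(quals) <= max_low_q_frac
```

A read is kept only when all three checks pass; the thresholds are exposed as parameters so that the 5% and 20% limits can be varied.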

Prediction and validation of excretory/secretory (ES) proteins
ORFs fulfilling the following four criteria were defined as the computational secretome: (a) the ORF contains an N-terminal signal peptide; (b) the ORF has no transmembrane domains; (c) the ORF has no GPI-anchor site; and (d) the sequence does not contain a localization signal that may target it to mitochondria or other intracellular organelles. Table 1 summarizes the bioinformatic tools used in this study. SignalP v4.1, TMHMM v2.0, TargetP v1.1, ProtComp v9.0 and the big-PI predictor were employed to identify expected secretory proteins of P. helianthi. SignalP was used to predict classical secretory proteins in eukaryotes, with each protein sequence truncated at 70 amino acids before filtering. The criterion for predicting signal peptide proteins was L = −918.235 − 123.455 × (mean S score) + 1983.44 × (HMM score), with L > 0. TargetP was used to predict mitochondrial proteins with a cut-off of 0.95 for mitochondrial proteins and 0.90 for proteins in other locations. Transmembrane proteins were predicted with TMHMM v2.0 using default options. The putative proteins generated from the transcriptome were first analyzed by SignalP to predict classical secretory proteins on the basis of a D-score greater than 0.5. The proteins identified were then analyzed with TMHMM to retain classical secretory proteins without transmembrane segments. Proteins that passed the first two steps were evaluated by TargetP to identify mitochondrial proteins. Once mitochondrial proteins had been identified and excluded, the sub-cellular localization of the remaining secretory proteins was predicted with ProtComp. Those assigned to extracellular (secreted) categories were considered pathogenic secretory proteins.
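The sequential filter described above can be sketched in Python as follows. This is an illustrative sketch only: the `OrfPrediction` record, its field names and the "extracellular" label are hypothetical stand-ins for values parsed from the individual tools, and treating a TargetP mitochondrial score below 0.95 as "not mitochondrial" is an assumption about how the cut-off was applied.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class OrfPrediction:
    """Per-ORF predictions collected from the tools (hypothetical layout)."""
    orf_id: str
    signalp_d: float         # SignalP D-score
    tm_helices: int          # number of helices predicted by TMHMM
    gpi_anchor: bool         # True if big-PI predicts a GPI-anchor site
    targetp_mito: float      # TargetP mitochondrial score
    protcomp_location: str   # ProtComp sub-cellular localization label


def signalp_l_score(mean_s: float, hmm_score: float) -> float:
    """Discriminant L from the text; a signal peptide is called when L > 0."""
    return -918.235 - 123.455 * mean_s + 1983.44 * hmm_score


def is_secreted(p: OrfPrediction, d_cutoff: float = 0.5,
                mito_cutoff: float = 0.95) -> bool:
    """Apply criteria (a)-(d) in the order given in the text."""
    return (
        p.signalp_d > d_cutoff                        # (a) signal peptide
        and p.tm_helices == 0                         # (b) no TM domain
        and not p.gpi_anchor                          # (c) no GPI anchor
        and p.targetp_mito < mito_cutoff              # (d) not mitochondrial
        and p.protcomp_location == "extracellular"    # ProtComp: secreted
    )


def computational_secretome(preds: Iterable[OrfPrediction]) -> List[str]:
    """Return the IDs of ORFs that pass every filter."""
    return [p.orf_id for p in preds if is_secreted(p)]
```

Keeping each criterion as a separate condition makes it straightforward to report how many ORFs are removed at each step.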

Analysis of signal peptide sequences
To further examine the length of signal peptide sequences, the secretory proteins obtained from the previous step were analyzed using a custom Perl script. Lipoprotein signal peptides were predicted with LipoP v1.0, which distinguishes among lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins [41]. Signal peptides with an RR-motif were screened with TatP v1.0, and homology among the signal peptide sequences was evaluated following alignment by Clustal Omega.

ES proteins annotation
Predicted ES proteins were annotated with InterProScan and gene ontology (GO) terms for protein domain and family classification [42]. GO term enrichment analysis was performed using the DAVID bioinformatics resource [43]. KAAS (KEGG Automatic Annotation Server) was used for functional annotation by BLAST search against the manually curated KEGG database [44] and provided insight into BRITE functional hierarchies and KEGG pathway maps [45]. The ES proteins were independently assessed for homology matches against NCBI's nonredundant protein database and for orthologs against the Cluster of Orthologous Groups of proteins (COG) database using BLAST with a permissive search strategy (E-value: 1e-10). Finally, ES proteins with a potential pathogenic function were predicted by BLAST searches against the Pathogen Host Interaction (PHI) database (identity > 25%, E-value: 1e-10).
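The identity and E-value thresholds used for the PHI search can be applied to BLAST results as in the minimal Python sketch below. It assumes the hits are available in the standard 12-column tabular format of BLAST+ (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that the E-value threshold is inclusive; the function name and any identifiers in example input are hypothetical, and this is not the script used in this study.

```python
import csv
import io
from typing import Dict, Tuple


def filter_hits(tabular_text: str, min_identity: float = 25.0,
                max_evalue: float = 1e-10
                ) -> Dict[str, Tuple[str, float, float, float]]:
    """Keep, for each query, the best-scoring hit that passes both cut-offs.

    Returns {query_id: (subject_id, percent_identity, evalue, bitscore)}.
    """
    best: Dict[str, Tuple[str, float, float, float]] = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        # Skip blank, comment or malformed lines.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        identity = float(row[2])
        evalue = float(row[10])
        bitscore = float(row[11])
        if identity > min_identity and evalue <= max_evalue:
            # Retain only the highest bit score per query.
            if query not in best or bitscore > best[query][3]:
                best[query] = (subject, identity, evalue, bitscore)
    return best
```

Selecting one best hit per query is a design choice of this sketch; retaining all passing hits would only require storing a list per query.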

ORF length of the secretory proteins from P. helianthi
To examine the ORF length of the predicted secretory proteins from P. helianthi, 35,286 P. helianthi ORFs were analyzed by bioinformatics tools and 908 (2.6%) ORFs were identified as secretory proteins. Among them, 728 proteins contained the complete ORF. The longest protein was 1001 amino acids (aa) and the shortest one was 34 aa. The length of most secretory proteins (79.8% of the total identified proteins with a complete ORF) was between 51 and 300 aa. Within this group, 41.0% of them were 101-200 aa long. Thus, we suggest most secretory proteins probably fall in the shorter length range (Fig. 1).
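A length distribution of this kind can be tabulated with a few lines of Python. The sketch below is illustrative only; the bin boundaries are an assumption chosen to match the ranges quoted in the text and may differ from those used in Fig. 1.

```python
from collections import Counter
from typing import Dict, Iterable


def bin_label(length: int) -> str:
    """Assign a protein length (aa) to a bin; boundaries are assumed."""
    if length <= 50:
        return "<=50"
    if length <= 100:
        return "51-100"
    if length <= 200:
        return "101-200"
    if length <= 300:
        return "201-300"
    return ">300"


def length_distribution(lengths: Iterable[int]) -> Dict[str, float]:
    """Return the percentage of proteins falling in each length bin."""
    lengths = list(lengths)
    if not lengths:
        return {}
    counts = Counter(bin_label(n) for n in lengths)
    return {label: 100.0 * c / len(lengths) for label, c in counts.items()}
```

Summing the percentages of the 51-100, 101-200 and 201-300 bins gives the share of proteins between 51 and 300 aa.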

Characteristics of signal peptides of predicted secretory proteins in P. helianthi
Analysis of the signal peptides of the 908 predicted secretory proteins revealed that signal peptide length ranged from 10 to 34 aa (mean = 21 aa), with the largest share (35.8%) between 18 and 20 aa. Signal peptides of 19 aa were the most abundant, accounting for 13.7% (Fig. 2). All 908 signal peptide sequences were aligned with Clustal Omega. Homology among the signal peptide sequences was low, with the highest similarity (66.7%) observed between signal peptide sequences KU994941 and KU994981. No protein with an RR-motif signal peptide was found by TatP v1.0, whereas 463 proteins contained secretory pathway signal peptides cleavable by SPaseI and 55 proteins harbored lipoprotein signal peptides cleavable by SPaseII. N-terminal transmembrane helices were found in 30 proteins, and the remaining 360 could be localized to cytoplasmic organelles. Thus, most of the secretory proteins were determined to be secreted through the general secretion pathway (Sec-pathway).

Amino acid composition of signal peptides of predicted secretory proteins in P. helianthi
The distribution of the 20 amino acids in the signal peptides was statistically analyzed; in descending order of frequency, the most common residues were L, S, T and R. The hydrophobic amino acid leucine (L) showed an appearance rate of 16.1%, followed by serine (S) at 10.8% (Fig. 3). The occurrence of the negatively charged hydrophilic amino acid aspartate (D) was the lowest, accounting for 0.5%.
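Residue frequencies of this kind can be computed by pooling all signal peptide sequences and counting each amino acid. The following Python sketch shows one way to do so; the function name is hypothetical and the sketch is not the script used in this study.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def residue_frequencies(signal_peptides: Iterable[str]
                        ) -> List[Tuple[str, float]]:
    """Return (residue, percent) pairs pooled over all signal peptides,
    sorted by descending frequency (ties broken alphabetically)."""
    counts: Counter = Counter()
    for peptide in signal_peptides:
        counts.update(peptide.upper())
    total = sum(counts.values())
    if total == 0:
        return []
    pairs = [(aa, 100.0 * n / total) for aa, n in counts.items()]
    return sorted(pairs, key=lambda item: (-item[1], item[0]))
```

The first element of the returned list is the most frequent residue, so a descending ranking can be read directly from the result.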
In general, the C-terminal region of signal peptides contains an enzyme recognition site. Relative to this cleavage site, the residues upstream were numbered −1, −2, and −3 and those downstream +1, +2, and +3. Between positions −3 and +3, valine (V) was the residue most likely to occupy position −3, at a frequency of 26.7%. The frequency of serine (S) at position −2 was 16.5%, alanine (A) had a 49.1% chance of being at position −1, and glutamine (Q) was found at position +1 12.9% of the time (Table 2). Interestingly, most amino acids were widely used across positions −3 to +3 in sunflower rust, but no H, K, or Y was observed at position −1. This indicates that amino acids near the cleavage site, particularly at positions −3 and −1, are relatively conserved.
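The position numbering around the cleavage site can be made concrete with a short Python sketch. It assumes that each protein is supplied with the length of its predicted signal peptide, so that position −1 is the last residue of the signal peptide and +1 is the first residue of the mature protein; the function names and any example sequence are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

POSITIONS = (-3, -2, -1, 1, 2, 3)


def cleavage_site_profile(proteins: Iterable[Tuple[str, int]]
                          ) -> Dict[int, Counter]:
    """Count residues at positions -3..+3 around the cleavage site.

    proteins -- (sequence, signal_peptide_length) pairs
    """
    profile: Dict[int, Counter] = {pos: Counter() for pos in POSITIONS}
    for seq, sp_len in proteins:
        for pos in POSITIONS:
            # -1 maps to the last signal peptide residue (index sp_len - 1),
            # +1 to the first mature residue (index sp_len).
            idx = sp_len + pos if pos < 0 else sp_len + pos - 1
            if 0 <= idx < len(seq):
                profile[pos][seq[idx].upper()] += 1
    return profile


def residue_percent(profile: Dict[int, Counter], pos: int,
                    residue: str) -> float:
    """Percentage of proteins carrying `residue` at position `pos`."""
    total = sum(profile[pos].values())
    return 100.0 * profile[pos][residue] / total if total else 0.0
```

Residues that never occur at a given position simply have a count of zero, which is how an absence such as H, K or Y at position −1 would show up.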

Annotation of excretory/secretory (ES) of P. helianthi
All ES proteins identified were searched for sequence homology against our non-redundant dataset using BLAST. It was found that 581 (64.0%) computationally predicted ES proteins shared similarities with known proteins. A total of 143 ES proteins could be annotated in Gene Ontology (GO) and were classified into 27 functional groups, including 14 groups in biological process, seven in cellular component, and six in molecular function (Fig. 4). Within biological process, "metabolic process" (GO:0008152) with 63 ES proteins and "cellular process" (GO:0009987) with 26 ES proteins were predominant. In the category of cellular component, the three main groups were "extracellular region" (GO:0005576, 19 ES proteins), "cell" (GO:0005623, 18 ES proteins), and "cell part" (GO:0044464, 18 ES proteins). The categories "catalytic activity" (GO:0003824) and "binding" (GO:0005488) were most common in molecular function, represented by 63 and 37 ES proteins, respectively.

Function prediction of predicted secretory proteins in P. helianthi
Out of 908 secretory proteins queried against our nonredundant dataset using BLAST, 581 had functional descriptions, of which 279 had clear functional descriptions and 302 were predicted as hypothetical, conserved hypothetical, uncharacterized, or unnamed proteins. The 908 secretory proteins were also queried against the COG database for functional classification (Fig. 5). A total of 80 proteins could be assigned to the COG classification, of which 26 (32.5%) potentially participated in the transport and metabolism of carbohydrates (G; Fig. 5), followed by 23.8% involved in post-translational modifications, protein turnover, and molecular chaperones (O; Fig. 5). Proteins participating in inorganic ion transport and metabolism; replication, recombination and repair; transcription; and amino acid transport and metabolism each accounted for only 1.3% (P, L, K, E; Fig. 5). Of the 908 proteins, 188 had annotations based on InterPro, of which 62 (33.0%) were hydrolases, including 19 peptidases, 15 glycoside hydrolases, seven esterases, five phosphatases, four each of ribonucleases and polysaccharide deacetylases, and three each of alpha/beta hydrolases and glucanases (Table 5). Peptidase, glycoside hydrolase, pectinesterase, polysaccharide deacetylase, pectate lyase and glucanosyltransferase were found to be possibly related to cell wall degradation. Nine proteins contained an MD-2-related lipid-recognition (ML) domain, six contained a lipocalin/cytosolic fatty-acid binding domain, and three contained a tyrosinase copper-binding domain. Six were annotated as lipocalin, four as the proteinase inhibitor I25 cystatin, four as apolipoprotein, three as ribosomal protein, one as thaumatin, and two as the cysteine-rich allergen V5/Tpx-1-related secretory protein. The functions of most predicted secretory proteins are still unknown.
Fig. 4 Gene ontology annotation of the secretory proteins of Puccinia helianthi. The best hits were aligned to the GO database, and 143 putative secretory proteins were assigned to at least one GO term. Most consensus sequences were grouped into three major functional categories and 27 sub-categories.

Discussion
Proteins are the major functional components of living organisms. Many pathogenic microbes can secrete proteins into host cells to promote their infection process [46]. Therefore, analysis of secretory proteins in the pathogen genome or transcriptome will help reveal pathogenic mechanisms. According to the signal peptide hypothesis [47], the destination of a secretory protein is determined by its signal peptide, which is cleaved off when the protein reaches its destination. A free online program, SignalP, has been developed that accurately identifies eukaryotic signal peptides [48,49]. An analysis of 47 known secretory proteins and 47 other proteins of C. albicans by SignalP v2.0 showed that the predictions obtained were credible [30].
Signal peptide structures from various proteins commonly contain a positively charged N-region, a hydrophobic H-region and a neutral polar C-region. In the C-terminal region, helix-breaking proline and glycine residues and the small uncharged residues often found at positions −3 and −1 determine the signal peptide cleavage site [50]. In P. helianthi, valine was observed most frequently (26.7%) at position −3 and alanine was most likely to be at position −1 (49.1%), while histidine, lysine and tyrosine were not observed at position −1. This indicates that amino acids at the −3 and −1 positions are relatively conserved, which might guarantee the recognition accuracy of signal peptidases.
Numerous algorithms are freely available for the prediction of protein structures, functions and interactions. Analyses of entire S. cerevisiae genome databases have included identification of GPI-anchored proteins [51], a prediction of protein sub-cellular localization [52] and a prediction of the "typical" secretory protein with the Internet-based software SignalP v3.0, TargetP v1.01, Big-PI predictor and TMHMM v2.0 [33]. Bioinformatics approaches made the large-scale prediction and analysis of ES proteins of helminths possible, which included a comprehensive BLAST analysis to annotate the function of the ES proteins [53]. Thus, one approach to rapidly analyze the entire P. helianthi transcriptome and to predict its secretome is to utilize a wide range of appropriate and efficient bioinformatics tools. After screening 35,286 ORFs of transcriptome data, 908 (2.6%) were predicted as secretory proteins. These putative secretory proteins were small: up to 79.8% of them were between 51 and 300 aa, with signal peptide lengths mostly between 18 and 20 aa. The short length of the secretory proteins is likely due to the lack of a reference genome for P. helianthi and the unavoidable limitations of de novo transcriptome reconstruction. In signal peptides, the frequency of leucine (L), a hydrophobic amino acid, reached 16.1%. Abundant hydrophobic amino acids may be relevant to the secretion of secretory proteins and their subsequent destination. Most of the amino acids in signal peptides were aliphatic, mostly neutral amino acids or hydroxyl- or sulfur-containing amino acids. These amino acids may be important for the physicochemical properties of the secretory proteins, making it easier for the signal peptide to cross the plasma membrane and enhancing its signal guidance function. The prediction results showed that most of the signal peptides of the 908 putative secretory proteins were cleaved by SpI. The majority of the secretory proteins in P. helianthi are therefore likely transported via the general secretory pathway. Furthermore, no signal peptide contained the RR-motif, which may indicate that the Tat pathway does not exist or has a minor role in P. helianthi.
Signal peptides can guide secretory proteins to subcellular locations and play a key role in the process of metabolism. Signal peptide sequence analysis of all 908 secretory proteins showed that sequence similarity is low, which indicates high sequence variability, consistent with previous reports [34]. The low conservation might contribute to accurate positioning and specific metabolic functions of individual secretory proteins. Among the 908 secretory proteins, most of those with functional descriptions are responsible for the transport and metabolism of carbohydrates, which is similar to previous research on Bradyrhizobium japonicum [54] and Rhizobium etli [55]. This implies that a great deal of the materials needed for rust pathogen development and infection may involve sugars, inorganic salts, and small organic molecules, which can be used as cofactors and to meet pathogen energy requirements. Our GO enrichment analysis indicated that hydrolase activity, carbohydrate metabolic process, and peptidase activity were significantly enriched in the putative secretory proteins. This suggests that the rust pathogen P. helianthi can secrete various types of extracellular hydrolases, which may include nucleases that degrade the genetic material of the host plant and interfere with host genetic metabolism. Additional hydrolases may be responsible for cell wall degradation, thereby making the host conducive to rust pathogen colonization by destroying the host cell structure and accelerating the process of infection. In addition, the secretory proteins also contain relatively unique serine proteases and similar proteins. In fungi, serine proteases are closely linked with pathogen infection and are often used to degrade host plant proteins [56]. This suggests that serine proteases may also be associated with the rust infection process. Cysteine peptidases (CPs) play important roles in facilitating the survival and growth of mammalian parasites [57]. CPs found in the sunflower rust pathogen could, in turn, also be associated with virulence to the host. In addition, two cysteine-rich secretory proteins identified as calcium chelating serine proteases [58] could be candidate effectors of this pathogen [59]. Three proteins similar to effectors of P. sojae were also found and might be similarly correlated with the pathogenicity of P. helianthi. These candidate proteins may provide more insight into common pathogenesis pathways utilized by both P. sojae and P. helianthi, but more experimental evidence is necessary to confirm the biological roles of P. helianthi effectors.
Proteins containing the conserved ML domain are involved in lipid recognition or metabolism and are particularly important for the recognition of pathogen-related processes such as lipopolysaccharide (LPS) binding and signaling [60]. LPS and glycoproteins have been detected in the neck region of haustoria [61]. Proteins containing the ML domain in P. helianthi may, therefore, play a role in the recognition of host lipid-related products.
The thaumatin protein is considered a model pathogen-response protein domain for pathogenesis-related (PR) proteins involved in systemic acquired resistance and stress responses in plants, although its precise role is unknown [62]. Thaumatin-like secreted proteins of rust fungi may alter the plant signalling pathway and have also been reported in the Melampsora secretome [63]. Future research into the role of thaumatin in sunflower rust infection will provide a better understanding of its function.
Among these 908 secretory proteins in P. helianthi, the majority were unclassified, possibly because rust fungi are biotrophic species and require specific genes in their life cycle. Similar results were reported in the wheat rust fungus P. striiformis f. sp. tritici [64,65].

Conclusion
In this study, various open source bioinformatics tools were used to predict and analyze ES proteins from the P. helianthi transcriptome. Out of 35,286 ORFs of transcriptome data, 908 (2.6%) were predicted as secretory proteins, and most were short proteins. A BLAST analysis was used to annotate the function of the ES proteins and provided further evidence for some proteins as candidates participating in the infection process of P. helianthi. A BLAST search against the PHI database yielded a total of 43 secretory proteins that could be involved in pathogenicity, and three secretory proteins were predicted to be similar to the effectors of P. sojae. Therefore, this investigation provides a novel approach for identifying elicitors and pathogenic factors. It also establishes a sound foundation for understanding the structures and functions of the pathogenic factors of P. helianthi. In conclusion, our data can be used as a candidate gene resource for further computational or wet lab research to unveil the molecular mechanisms underlying the interaction between sunflower and P. helianthi.