De novo transcriptome of Phakopsora pachyrhizi uncovers putative effector repertoire during infection

Phakopsora pachyrhizi, which causes Asian soybean rust (ASR), secretes effector proteins to manipulate host immunity and promote disease. To date, only a small number of effectors have been identified from transcriptome studies. To obtain a more comprehensive understanding of P. pachyrhizi candidate secreted effector proteins (CSEPs), we sequenced the transcriptome using two next-generation sequencing technologies. Short-read Illumina RNA-Seq data was used for reducing base-calling errors for long-read PacBio Iso-Seq. After initial de novo assemblies for RNA-seq and error correction of transcripts for Iso-Seq followed by filtering, we obtained 8,528, 27,647, 26,895, and 17,141 non-plant, non-soybean transcripts at 3, 7, 10, and 14 days after inoculation, respectively. We identified a repertoire of CSEPs of which a majority was expressed during the later stages of infection, and many that could be bioinformatically associated with haustoria. This approach for identifying CSEPs improves our current understanding of the P. pachyrhizi effectorome, and these CSEPs are expected to be a valuable resource for future studies of P. pachyrhizi-soybean interactions.


Introduction
Phakopsora pachyrhizi, the causal agent of Asian soybean rust (ASR), is a serious threat to soybean production worldwide and can cause yield losses greater than 80% if not controlled [1][2][3]. Despite its economic importance, there are currently no known commercial cultivars with durable resistance to ASR. P. pachyrhizi, an obligate biotrophic fungus, forms specialized structures called haustoria. During infection, haustoria absorb nutrients from the host and deliver secreted effector proteins into host cells [4,5]. P. pachyrhizi effectors play significant roles to manipulate host immunity and promote disease [6,7]. Therefore, identifying P. pachyrhizi effectors and understanding their functions is crucial to curb soybean losses to ASR. Genome and transcriptome analyses of several rust fungi have proven to be efficient approaches for the discovery of genes predicted to encode candidate secreted effector proteins (CSEPs) [8,9]. Fungal CSEPs are identified based on the presence of signal peptides for secretion, high percentage of cysteine residues, lack of transmembrane domains and small size [10][11][12][13]. The primary approaches for genome-based effector discovery have relied on transcriptomes to identify effectors during soybean-P. pachyrhizi interaction [14][15][16]. Analyses of P. pachyrhizi transcriptomes based on short-read sequencing have identified limited numbers of haustorial CSEPs from different P. pachyrhizi isolates [14,15]. For instance, 156 P. pachyrhizi effector candidates (PpECs) from the Thai-1 isolate and 35 P. pachyrhizi candidate effectors (Pp-CSEPs) from the GA-05 isolate were predicted from haustorial transcriptome and cDNA library sequencing, respectively [14,15]. The present study describes identification of several hundreds of novel CSEPs from long-read isoform and short-read RNA-Seq during P. pachyrhizi infection, and it more than doubles the list of previously published P. pachyrhizi haustorial effector candidates. Furthermore, these datasets derived from the LA04-1 isolate augment the P. pachyrhizi genome and transcriptome annotations, and provide a valuable resource for studying molecular mechanisms during P. pachyrhizi pathogenicity. (Everris NA Inc.) at 3 weeks following germination. When plants reached the fourth trifoliate stage they were transferred to the USDA BSL-3 plant pathogen containment facility at Fort Detrick [17]. P. pachyrhizi inoculum was prepared as previously described [18] using isolate LA04-1, which was collected in Louisiana, USA in 2004 and is fully pathogenic on Williams 82 [19]. Spore concentrations were measured with a hemocytometer and adjusted to approximately 1 × 10 5 spores ml −1 . Inoculum was applied with an atomizer until leaves were saturated. Plants were incubated in a 20°C dew chamber for 24 h, and then transferred to a 25°C greenhouse. Infected leaf tissue was collected at 3, 7, 10, and 14 days after inoculation (dai; Fig. 1), and immediately frozen in liquid nitrogen and stored at −80°C.

RNA extraction
Frozen leaf tissue was ground in liquid nitrogen and total RNA was extracted using the RNeasy plant mini kit (Qiagen, Valencia, CA, USA). Total RNA was treated with DNase I (TURBO DNA-free™ Kit, ThermoFisher Scientific, CA, USA) to remove genomic DNA contamination. Samples were quantified with a Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA) and quality was assessed with a NanoDrop 2000 (ThermoFisher Scientific, CA, USA). All RNA was high quality; A260/280 and A260/230 ratios were between 1.9 and 2.1. DNase-treated total RNA was used for making cDNA libraries for both short-read sequencing on Illumina and PacBio Isoform Sequencing (Iso-Seq; Fig. 2a).

cDNA library preparation and sequencing
The DNA-free total RNA was ribosomal depleted (Ribo-Zero rRNA Removal Kit -Plant Leaf, Illumina, San Diego, CA, USA) for short-read sequencing. Strand-specific cDNA libraries were prepared from

Illumina RNA-Seq and de novo assembly of the short-reads
The raw reads obtained from Illumina short-read sequencing were quality assessed using FastQC v.0.11.2 [20] to determine the necessity of trimming low-quality reads. Paired-end read trimming was conducted by Trimmomatic 0.36 [21] using sliding window 4:15 and excluding reads below a minimal length of 36. The trimmed paired-end reads were examined by FastQC again to confirm improvement of read quality. Trimmed paired-end RNA-Seq reads from 3, 7, 10, and 14 dai were mapped against the unmasked soybean genome v2.1 (https:// plants.ensembl.org/Glycine_max) [22] using STAR 2.5.3a aligner [23] to remove soybean reads. The non-soybean short reads were de novo assembled using Trinity v2.6.6 [24]. BLASTn (Basic Local Alignment Search Tool) search was performed on non-soybean transcripts using Blastplus v2.6.0 (NCBI: National Center for Biotechnology Information) [25,26] to remove any insect, bacterial, plant, and other non-fungal transcripts with a query coverage greater than 80% and identity greater than 95% (Fig. 2b). The final de novo transcriptome containing nonplant, non-soybean transcripts for each time point was used for prediction of CSEPs.

PacBio Iso-seq data processing and error correction
The subreads generated from six SMRT cells were pooled together for each time point and processed following the RS_IsoSeq protocol (https://www.pacb.com/support/software-downloads) available in SMRT analysis package v2.3.0 [27]. The analysis resulted in high-and low-quality isoform sequences that were combined into a final transcript sequence dataset. BLASTn (NCBI; non-redundant nucleotide database) was performed on all transcripts to remove plant sequences. Non-soybean Illumina short reads was used as a reference for error correcting non-plant Iso-Seq transcripts using three error correction algorithms (LoRDEC; HALC; Hercules) [28][29][30]. The combined errorcorrected, non-plant transcripts, and non-corrected non-plant transcripts were further filtered to remove residual soybean transcripts (Fig. 2b). The non-plant, non-soybean error-corrected or non-corrected transcripts were used for prediction of CSEPs. The PacBio Iso-Seq data processing, filtering, and selection of transcripts are described fully in supplementary method S1.

Assessment of assembly completeness
To provide a quantitative overview of the level of completeness achieved for the de novo assemblies and the error-corrected transcripts, the BUSCO (Benchmarking Universal Single-Copy Orthologs) v2 pipeline was used [31]. This was accomplished using the BUSCO compared to the predefined set of 1,335 Basidiomycota single-copy orthologs from the OrthoDB v9.1 database. We calculated the number of complete (length is within two standard deviations of the mean length of the given BUSCO), duplicated (complete BUSCOs represented by more than one transcript), fragmented (partially recovered BUSCOs) and missing (not recovered) in each of the four de novo assemblies and the two errorcorrected PacBio transcriptomes.

P. pachyrhizi effector prediction pipeline
To predict P. pachyrhizi CSEPs, open reading frames (ORFs) were identified within all non-plant, non-soybean transcripts by TransDecoder v3.0 (https://transdecoder.github.io/) [24], at a minimum protein length of 50 amino acids. All predicted ORFs were scanned for signal peptides using SignalP v 4.1 (Fig. 2c) [32] at a D-cutoff value of 0.45. To exclude membrane proteins, the resulting peptides were scanned for transmembrane domains using TMHMM v2.0 [33]. The standalone prediction programs EffectorP 1.0 [34] and EffectorP 2.0 [35] were used to predict P. pachyrhizi CSEPs (Fig. 2c). For maximum sensitivity, a union of P. pachyrhizi CSEPs predicted by either EffectorP 1.0 or Effector 2.0 was obtained. A BLASTp [25,26] search against the NCBI non-redundant protein database was performed on P. pachyrhizi CSEPs to identify protein sequence homology to fungal species. P. pachyrhizi CSEPs from each time point were clustered using CD-HIT v4.6.8 [36] with 100% amino acid identity to remove sequence redundancies, however, CSEPs with greater than one amino acid difference in sequence were retained and considered as a different CSEP. P. pachyrhizi CSEPs predicted from both the de novo assemblies and the error-corrected non-plant, non-soybean transcripts were combined for each time point (7 and 10 dai) and CD-HIT v4.6.8 with 100% amino acid identity was performed to remove redundancies but retained CSEPs with greater than one amino acid difference in sequence. To identify unique P. pachyrhizi CSEPs within a time point, CD-HIT-2D [36] with 100% amino acid identity was performed between time points. CSEPs with differences in more than one amino acid were retained and considered as a different CSEP. All transcript sequences with SNPs and allelic variants that produced different amino acids were placed into separate clusters by clustering programs such as CD-HIT and CD-HIT-2D when the threshold was set to 100% amino acid identity. However, protein sequences arising from transcripts having splice variants were placed into the same cluster if the protein sequence had 100% amino acid identity or if one protein sequence was a subset of the other. APOPLASTP (an effector protein localization predictor in the apoplast) [37] was performed on all P. pachyrhizi CSEPs to identify predicted apoplastic and nonapoplastic effectors (Fig. 2c). For subcellular localization prediction of P. pachyrhizi CSEPs, LOCALIZER (an effector protein localization predictor in the plant cells) [38] was used for each time point (Fig. 2c). The previously published P. pachyrhizi haustorial transcriptome [14] and P. pachyrhizi haustorial cDNA libraries [15] were mapped to P. pachyrhizi predicted CSEPs identified herein using GMAP aligner [39] with default settings to determine bioinformatically which of the predicted CSEPs were expected to be expressed in haustoria. Sequence homology identification when comparing sequences from two different isolates [14,15] was considered when there was any sequence that matched between the haustorial transcriptomes [14,15] and the P. pachyrhizi predicted CSEPs identified. P. pachyrhizi predicted CSEPs for each time point were also processed using in-house Python scripts (v3.5) to identify the number and percentage of cysteine residues within each CSEP.

De novo assembly of P. pachyrhizi transcriptome from short-reads
In this study, we present new, high-quality, protein-coding transcriptomes for four time points during P. pachyrhizi infection: 3, 7, 10, and 14 dai ( Fig. 1) representing early infection with established haustoria (3 dai), rapid colonization (7 and 10 dai), and sporulation stages (10 and 14 dai) [40]. RNA-seq libraries, generated from infected soybean leaves, yielded between 46 and 55 million paired-end reads per time point. Quality filtering removed less than approximately 2% of the raw reads. This resulted in high quality RNA-Seq datasets that contained between 45 and 54 million paired-end reads per time point. Prior to the de novo assemblies, all filtered raw reads were mapped against the soybean genome v2.1 to remove soybean reads from the dataset. The filtered non-soybean short reads were used to provide a comparative reference for the transcript isoform sequences obtained from PacBio Iso-Seq at 7 and 10 dai. The filtered non-soybean reads were also de novo assembled with a default k-mer of 25 using the Trinity assembler. The number of Trinity-assembled transcripts ranged between 35,144 and 62,719 transcripts with an average transcript length between 1,003 bp and 1,466 bp and a N50 length between 1,360 and 2,119 bases for the four time points (Supplementary Table S1). The assembled transcripts were decontaminated by discarding sequences (eg. insect, bacterial, plant, and other non-fungal) from the transcriptome using NCBI BLASTn. BLAST to NR database was executed and was configured to generate taxonomy_id, percentage identity and query coverage for each BLAST hit. All transcripts with hits to insects, bacteria, plant and other non-fungal sequences were removed if they had a query coverage over 80% and a sequence identity of more than 95%. A total of 72-75% transcripts were classified as probable contaminant and discarded. This resulted in a total of 8,528, 14,997, 14,716, and 17,141 non-plant, non-soybean transcripts with a N50 of 1,170, 1,903, 2,116, and 2,616 bases and a mean transcript size of 909, 1,294, 1,395, and 1,651 bases for 3, 7, 10, and 14 dai, respectively (Supplementary Table  S1). The non-plant, non-soybean transcripts were used to predict P. pachyrhizi CSEPs. Supplementary Table S1 summarizes the statistics of the sequence reads and the de novo transcriptome assemblies.

P. pachyrhizi transcriptome from PacBio Isoform sequencing
The transcriptome of P. pachyrhizi was generated using PacBio Iso-Seq on P. pachyrhizi-infected soybean leaf tissues collected at 7 and 10 dai. To increase the representation of different mRNA populations of P. pachyrhizi that encoded CSEPs, non-size selected cDNA libraries were generated and sequenced. A total of 386,655 and 243,529 circular consensus sequences (CCS) were generated from six SMRT cells with an average length of 656 and 915 bases for 7 and 10 dai, respectively (Supplementary Table S2). By applying the standard Iso-Seq classification and clustering protocol on the CCS, we produced 165,337 and 106,029 consensus transcripts including 21,019 and 14,703 polished high-quality (HQ) transcripts, and 130,695 and 87,154 polished lowquality (LQ) transcripts with a N50 of 587 and 946 bases for 7 and 10 dai, respectively (Supplementary Table S2). The 165,337 and 106,029 consensus transcripts were filtered for plant transcripts using NCBI BLASTn and sequence similarity searches were conducted against two databases (NCBI non-redundant nucleotide database and soybean genome v2.1). This generated 132,181 (7 dai), 80,725 (10 dai) plant transcripts and 33,156 (7 dai), 25,304 (10 dai) non-plant transcripts (Supplementary Table S2). The three error correction pipelines (LoRDEC, HALC, and Hercules) used non-soybean short-read RNA-Seq datasets as a reference for the 33,156 (7 dai) and 25,304 (10 dai) nonplant transcript error correction. The error-corrected non-plant transcripts were further filtered to remove residual soybean transcripts, which generated 20,506 (7 dai), 13,125 (10 dai) error-corrected soybean transcripts and 12,650 (7 dai), 12,179 (10 dai) error-corrected non-soybean transcripts (Supplementary Table S3). In general, the error correction led to improved CSEP prediction, more transcripts covered the full-length of known secreted proteins, longer ORFs and better completeness results in BUSCO assessments. The LoRDEC and HALC error correction outperformed Hercules with our transcript datasets. The best error-corrected transcripts were derived from all three error correction programs, which resulted in 72% and 79% of the total nonplant, non-soybean transcripts while that of the non-corrected transcripts was only 27% and 20% for 7 and 10 dai (Supplementary Table  S4). A total of 12,650 and 12,179 error-corrected non-plant, non-soybean transcripts were used to predict P. pachyrhizi CSEPs for both 7 and 10 dai. A detailed summary of the processed Iso-Seq, error correction and selection of non-plant, non-soybean transcripts is provided in Supplementary Table S2, Supplementary Table S3 and Supplementary  Table S4.

P. pachyrhizi transcriptome completeness analysis
To provide a quantitative overview of the level of completeness achieved for the P. pachyrhizi de novo transcriptomes and the PacBio transcripts, we assessed transcriptome completeness in terms of gene content. BUSCO [31] was used to assess the quality and completeness of P. pachyrhizi transcriptome assemblies during infection. This program used tBLASTn to identify conserved single-copy orthologs among groups of organisms. We used the BUSCO OrthoDB Basidiomycota database to search P. pachyrhizi transcriptome assemblies for the presence, absence, duplication, and/or fragmentation of 1,335 conserved singlecopy Basidiomycota orthologs. For the four de novo assemblies, we obtained 107 (8%), 598 ( Table S5). The PacBio transcripts showed a lower completeness than the de novo transcriptome datasets. The relatively high number of BUSCO hits over the time course, represented the increase in P. pachyrhizi fungal transcripts within the sample. The high number of duplicate hits as evident in both de novo assemblies and PacBio transcripts may, in theory, represent allelic variation (polymorphism among haplotype nuclei) in the sample used to construct the assembly, gene duplication and/or mechanisms such as alternative splicing.

Prediction of P. pachyrhizi candidate effectors
P. pachyrhizi effectors exhibit the essential ability to manipulate host cell processes and facilitate infection during soybean host interactions [6,7]. In the present study, a combination of programs was used to predict P. pachyrhizi CSEPs and their localization (Fig. 2c). We analyzed the predicted proteomes of P. pachyrhizi during infection based on unigenes using Transdecoder 3.0 and predicted 22,455, 63,347, 70,218, and 62,179 potential proteins at 3, 7, 10, and 14 dai, respectively (Supplementary Table S6). Fungal effectors are secreted, and therefore, the first step aimed at identifying P. pachyrhizi CSEPs was performed to identify secreted proteins. Typical characteristic features of many known fungal secreted effectors include the presence of signal peptides (SP), a lack of transmembrane domains and the absence of endoplasmic reticulum and mitochondria-targeting motifs [11]. Based on these features, a total of 673, 2,022, 2,810, and 1,922 P. pachyrhizi secreted proteins were predicted across time points with a signal peptide (SP) and lacked transmembrane domains. By using two fungal effector predictors (Effector P 1.0 and Effector P 2.0), we identified 133, 417, 597, and 281 predicted CSEPs during P. pachyrhizi infection ( Fig. 3a (Fig. 3a). Additionally, 18 predicted CSEPs were common across all time points (Supplementary Table S7). Fungal effectors are typically small and rich in cysteines [11]. We manually examined the P. pachyrhizi predicted CSEPs and found all predicted CSEPs with a length shorter than 395 amino acids, and 58 (43%), 155 (37%), 189 (31%), and 115 (40%) of the identified CSEPs were rich (≥4) in cysteine residues (Supplementary Table S7).
Several fungal pathogens are known to secrete effector proteins that function in the apoplast [37] or enter plant cells, co-localize in   Table S7). Other rust pathogens such as Puccinia graminis f. sp. tritici and P. triticina also showed high proportions of predicted nuclear-targeted effectors (~4.5%) and chloroplast-targeted effectors (3.35%) [38]. Haustoria of P. pachyrhizi are responsible for the production and secretion of numerous effector proteins, with several of these effectors targeted to the host cytoplasm where they may serve to promote infection [7,41]. To date, only 156 PpECs from a total of 4,492 P. pachyrhizi haustorial transcripts have been reported [14] while another 35 Pp-CSEPs were reported from a P. pachyrhizi haustorial cDNA library [15], of which 16 effector candidates overlap from the two datasets. We expected that the previous haustorial transcriptomes may have underpredicted CSEPs due to many sequences being less than full-length and thus lacking the predicted 5' N-terminal signal peptide. Therefore, we sought to retrieve additional P. pachyrhizi haustorial effector candidates by aligning the previously published 4,492 P. pachyrhizi haustorial transcripts [14] and 35 Pp-CSEPs from P. pachyrhizi haustoria [15] to our predicted CSEPs. This alignment identified a total of 7, 88, 143, and 83 P. pachyrhizi haustorial CSEPs across time points (Fig. 4). Of these, 4, 70, 105, and 67 transcripts at each time point were predicted to encode haustorially-expressed CSEPs that had not been previously identified. Furthermore, 1, 41, 83, and 45 were unique within a time point (Fig. 4). Conversely, this analysis identified 3, 18, 38, and 16 transcripts at each time point that were predicted to encode haustoria-expressed CSEPs that were previously reported as PpECs or Pp-CSEPs [14,15]. Of particular interest, PpEC23, a small secreted cysteine-rich effector previously identified and characterized to suppress plant immunity [6,14] was expressed at 10 dai. Based on this bioinformatic re-analysis of available haustorial transcriptome data, we were able to add 184 new predicted haustoria-expressed CSEPs to the previously reported 175 Pp effector candidates (Supplementary Table S7).
The annotation was an important part of our analysis, because it enabled us to assess and understand the functional descriptions of P. pachyrhizi CSEPs based on BLAST matches. We used BLASTp to search against the NCBI non-redundant protein database with a cut-off value set to less than 80% query coverage and less than 95% query identity. In total, 7, 25, 89, and 19 P. pachyrhizi CSEPs across time points were Pp-specific, as they showed significant homology to Pp candidate effectors reported previously (Supplementary Table S7) [14,15]. In addition, BLAST hits for several P. pachyrhizi CSEPs were annotated as hypothetical or secreted proteins specific to rust fungi, indicating that they are similar to other fungal proteins but that the functional knowledge of them is lacking (Supplementary Table S7).

Discussion
We report the sequencing, annotation, and de novo assembly of the P. pachyrhizi transcriptome during infection using two next-generation sequencing technologies. A total of 8,528, 27,647, 26,895, and 17,141 non-plant, non-soybean transcripts was generated at 3, 7, 10, and 14 days after inoculation, respectively. We probed our P. pachyrhizi de novo assemblies for transcriptome quality and quantifications of completeness. Transcriptome assembly completeness in terms of gene content might be assessed based on evolutionary expectations, so that recovery of conserved genes serves as an alternative measure for the overall completeness [31]. The higher BUSCO completeness alignment of de novo Illumina transcripts when compared to PacBio transcripts suggest that transcripts assembled from Illumina sequencing contained more of the expected core conserved Basidiomycota genes, and indirectly indicates that transcripts of more genes were captured. This result was also consistent with the BLAST search, in which more Basidiomycota proteins matched the de novo transcripts than that of the PacBio transcripts. Even though we found several BUSCO genes in our assemblies, lack of some BUSCO genes might result from the fact that originally BUSCO sets were developed to estimate completeness of genomic assemblies and did not require expression evidence [31]. The de novo transcripts contained higher N50 compared to the PacBio transcripts. This could be due to the high coverage of RNA-Seq sequencing and the use of multiple settings/assembler in our de novo assembly; while the sequencing depth in PacBio Iso-Seq was still relatively low. Additionally, lower N50 could be due to increase in the representation of smaller fragments as non-size selected cDNA libraries were generated and sequenced, and therefore smaller fragments were loaded preferentially on the PacBio sequencer [42].
The soybean rust fungus P. pachyrhizi is an obligate biotroph that derives its nutrients entirely from living host cells. Like other plantassociated fungal pathogens, P. pachyrhizi secretes an arsenal of effector proteins via specialized structures called haustoria during infection. The haustoria-host cell interface appears to mediate an active interaction involving extensive movement of nutrients as well as signaling/ defense molecules [43]. The development of a method to isolate haustoria from infected tissues was a remarkable finding in efforts to understand haustorial function and identify haustorial-specific genes [14,44]. More recently, another method used a direct proteomics approach to identify effector candidates from P. triticina Race 1 haustoria isolated with a specific monoclonal antibody [45]. Numerous effector proteins from rust fungi have been identified and characterized [46][47][48][49][50]. However, only a few haustorially expressed rust effector proteins have been identified and characterized to date that include RTP1 in the bean rust Uromyces fabae, AvrL567, AvrM, AvrP123, and AvrP4 in the flax rust Melampsora lini, PGTAUSPE-10-1 in the wheat stem rust fungus Puccinia graminis f. sp. tritici, and PpEC23 in the soybean rust P. pachyrhizi [5,6,14,[51][52][53].
The de novo transcriptome assemblies presented here expand the transcriptomic resources available for P. pachyrhizi. A total of only 175 candidate effector proteins were previously identified in the haustoria transcriptomes from two different isolates of P. pachyrhizi [14,15]. In comparison to the numbers of predicted effectors in other rust fungi, it is likely that haustoria-expressed effectors have been under-predicted for P. pachyrhizi, in part, because many transcripts were truncated due to the sequencing technology used. The previously published P. pachyrhizi haustorial transcriptome sequences were assembled from shotgun pyrosequencing [14]. This information gap prevents us from obtaining a comprehensive understanding of how P. pachyrhizi effectors manipulate host defense and metabolism. We reasoned that combined long-read and short-read RNA sequencing would provide more fulllength transcripts and thus improve P. pachyrhizi CSEP prediction.
To predict P. pachyrhizi CSEPs in this study, we used similar effector prediction approaches as previously reported [34,35]. Using this approach with a further search using BLASTp, we catalogued a total of 1,428 CSEPs across all time points during P. pachyrhizi infection. When the previous haustoria transcriptomes were aligned to the 1,428 predicted CSEPs, 321 of the CSEPs predicted to be expressed in the haustoria were obtained. Of these, 184 had not been previously reported as haustoria-expressed CSEPs, most likely, because the haustoria-derived sequences were not full-length and thus not annotated with a signal peptide in the original SignalP analyses. The percentage of sequence similarity from the two different P. pachyrhizi isolates ranged from 3.3% to 100% when previous haustorial transcriptomes [14,15] were mapped to P. pachyrhizi CSEPs identified in this study. All predicted CSEPs identified for both 7 and 10 dai using long-read sequencing were obtained from full-length transcripts as predicted by SMRT PacBio Iso-Seq analysis. At 7 dai, there were no partial transcripts identified and therefore did not affect CSEP prediction. At 10 dai, we identified six partial transcripts that were not identified as CSEPs because they lacked the 5′-N terminal signal peptide sequence. The SignalP program failed to predict the signal peptide cleavage sites in these amino acid sequences. However, these predicted CSEPs were identified from shortread sequencing having the 5′-N terminal signal peptide sequence. Thus, our sequencing strategy of using Illumina paired-end short-reads and PacBio SMRT long-reads coupled with re-analysis of the haustoria transcriptome sequences enabled us to approximately double the number of predicted haustoria-expressed CSEPs for P. pachyrhizi. These results highlight one of the shortcomings of fungal transcriptome assemblies derived either from Illumina's Solexa platform or 454 shortread sequencing.
The most established methods for identifying fungal effectors such as map-based cloning, analysis of fungal secretomes and screening of Expressed Sequenced Tag (EST) libraries [11] are less sensitive and labor intensive. Thus, bioinformatics-based prediction pipelines and tools have been developed recently to efficiently identify candidate effectors based on the typical features, such as small size, the presence of signal peptide and cysteine-rich proteins [11,54]. All of the predicted CSEPs identified in this study were small (shorter than 395 amino acids) and 31-43% were cysteine-rich (≥4 cysteine residues). Among the 1,428 CSEPs, 38% of them were annotated as hypothetical protein or as secreted protein, of which 31% gave hits to known fungal proteins using BLASTp. These predicted CSEPs may potentially provide a valuable resource for further understanding of P. pachyrhizi pathogenicity. Even though about 61% of the predicted CSEPs had no hits or had very low homology to other non-fungal species in a BLASTp search, most of them were expressed in planta, suggesting that they may be essential for the pathogenicity and virulence of P. pachyrhizi.

Conclusions
The de novo transcriptome assemblies of P. pachyrhizi during infection were performed to identify genes encoding putative effectors involved in compatible soybean-rust interactions. The effectorome of P. pachyrhizi provides novel insights into the molecular pathogenicity of this important biotrophic fungus. The P. pachyrhizi transcriptome analysis revealed a total of 1,428 new predicted CSEPs during infection including 184 new haustorial effectors, of which 112 haustorial effectors were predicted at later stages of P. pachyrhizi infection. Several of these new CSEPs may potentially function in altering of the host cell processes to suppress plant defenses, allowing the P. pachyrhizi rust pathogen to colonize susceptible soybean leaves. A high number of genes encoding CSEPs were annotated as hypothetical protein or as secreted protein with hits to other fungal species during disease development. This dataset will aid future studies focused on identifying the molecular mechanisms governing the interactions between P. pachyrhizi and its soybean host, including transcriptome changes in both organisms. Furthermore, the characterization of CSEPs involved in promoting disease will provide new insights for prevention and control of the economically important soybean rust disease caused by P.
pachyrhizi. As we were finalizing this manuscript, a consortium of public and private groups announced the genome sequences of three P. pachyrhizi isolates (K8108, MG2006, and PPUFV02) hosted by the Joint Genome Institute Mycocosm fungal genomics resource (https:// mycocosm.jgi.doe.gov/Phapa1/Phapa1.home.html). We expect that our data on full-length P. pachyrhizi transcripts, although it is from a different isolate, will be useful in helping the community to understand more about P. pachyrhizi gene structure, which will facilitate research on genome assembly and annotation, SNP identification, and effectordriven marker development for disease resistance breeding programs.

Data availability
All Illumina RNA-Seq and PacBio Iso-Seq raw data generated for this work were deposited at DDBJ/EMBL/GenBank under the BioProject PRJNA565800 https://www.ncbi.nlm.nih.gov/bioproject/PRJNA565800. The TSA project for the Illumina RNA-Seq filtered de novo assemblies for each time point has been deposited at DDBJ/EMBL/GenBank under the accession GHWP00000000, GHWQ00000000, GHWR00000000 and GHWS00000000. The version described in this paper is the first version, GHWP01000000, GHWQ01000000, GHWR01000000 and GHWS01000000. The TSA project for the corrected PacBio Iso-Seq transcriptomes has been deposited at DDBJ/EMBL/GenBank under the accession GHWK00000000 and GHWL00000000. The version described in this paper is the first version, GHWK01000000 and GHWL01000000. The original non-filtered assemblies can be accessed in Figshare under the direct URL: https://doi.org/10.25380/iastate.9916217 (for Illumina RNA-Seq de novo transcriptomes); and under the direct URL: https://doi.org/ 10.25380/iastate.9916226 (for PacBio Iso-Seq transcriptome). All datasets generated for this study are included in the manuscript and/or the Supplementary files.

Funding
This project has been funded through the Iowa State University Plant Sciences Institute, USDA-NIFA Hatch project number 3808, and the United States Department of Agriculture Agricultural Research Service project number 8044-22000-046-00D. The funding bodies have no role to play in the design of this study, collection of materials, data analysis, interpretation of data, and in the writing of the manuscript.

Declaration of competing interest
The authors declare no conflict of interest.

Commercial endorsement disclaimer
Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.

Equal opportunity/non-discrimination statement
The U.S. Department of Agriculture (USDA) prohibits discrimination in all its programs and activities on the basis of race, color, national origin, age, disability, and where applicable, sex, marital status, familial status, parental status, religion, sexual orientation, genetic information, political beliefs, reprisal, or because all or a part of an individual's income is derived from any public assistance program. (Not all prohibited bases apply to all programs.) Persons with disabilities who require alternative means for communication of program information (Braille, large print, audiotape, etc.) should contact USDA's TARGET Center at (202) 720-2600 (voice and TDD). To file a complaint of discrimination write to USDA, Director, Office of Civil Rights, 1400 Independence Avenue, SW, Washington, DC 20250-9410 or call (800) 795-3272 (voice) or (202) 720-6382 (TDD). USDA is an equal opportunity provider and employer.