Comparative Genomics of the Sigatoka Disease Complex on Banana Suggests a Link between Parallel Evolutionary Changes in Pseudocercospora fijiensis and Pseudocercospora eumusae and Increased Virulence on the Banana Host

The Sigatoka disease complex, caused by the closely-related Dothideomycete fungi Pseudocercospora musae (yellow sigatoka), Pseudocercospora eumusae (eumusae leaf spot), and Pseudocercospora fijiensis (black sigatoka), is currently the most devastating disease on banana worldwide. The three species emerged on bananas from a recent common ancestor and show clear differences in virulence, with P. eumusae and P. fijiensis considered the most aggressive. In order to understand the genomic modifications associated with shifts in the species virulence spectra after speciation, and to identify their pathogenic core that can be exploited in disease management programs, we have sequenced and analyzed the genomes of P. eumusae and P. musae and compared them with the available genome sequence of P. fijiensis. Comparative analysis of genome architectures revealed significant differences in genome size, mainly due to different rates of LTR retrotransposon proliferation. Still, gene counts remained relatively equal and in the range of other Dothideomycetes. Phylogenetic reconstruction based on a set of 46 conserved single-copy genes strongly supported an earlier evolutionary radiation of P. fijiensis from P. musae and P. eumusae. However, pairwise analyses of gene content indicated that the more virulent P. eumusae and P. fijiensis share complementary patterns of expansions and contractions in core gene families related to metabolism and enzymatic degradation of plant cell walls, suggesting that the evolution of virulence in these two pathogens has, to some extent, been facilitated by convergent changes in metabolic pathways associated with nutrient acquisition and assimilation. In spite of their common ancestry and shared host-specificity, the three species retain fairly dissimilar repertoires of effector proteins, suggesting that they likely evolved different strategies for manipulating the host immune system. Finally, 234 gene families, including seven putative effectors, were exclusively present in the three Sigatoka species, and could thus be related to adaptation to the banana host.

Page | 3 (S2 Table). With respect to non-LTR retrotransposons, the long interspersed elements (LINEs) were the 1 dominant repeat class and were particularly enriched in P. musae (7.9 Mb, 27%) and P. eumusae (2.4 Mb, 2 19%). In contrast, LINE elements make up just 5.1% (2 Mb) of the repetitive fraction in P. fijiensis ( Fig. 2B; 3 S2 Table). Among Class II elements, a large expansion of the TIR/hAT transposase is apparent in the 4 genome of P. fijiensis (4.2 Mb, 11.1%) but this subfamily is only limitedly present in P. eumusae (0.02 Mb, 5 0.2%) and P. musae (0.1 Mb, 0.4%) (S2 Table). Finally, several cases of species-specific gains and losses 6 of particular types of Class I and Class II elements were observed. For example, the Class I DIRS Nagro 7 and Penelope-like (PLE) elements were absent from P. eumusae and P. musae, respectively, although they 8 were both present in P. fijiensis. Also, the Class II PiggyBac element was found present only in P. eumusae 9 and P. musae, while the Helitron element was only in P. fijiensis (S2 Table). Next to the well characterized 10 Class I and Class II TEs, a considerable amount of unclassified repeat elements that represent newly 11 evolved, species-specific repeats were also discovered in the genomes of the three species. Notably, 12 unclassified repeats occupied a higher fraction of the repetitive content in P. eumusae (3.7/12.6 Mb, 13 29.7%), as compared to P. musae (6.1/29.2 Mb, 21.1%) and P. fijiensis (7.4 Mb/37.3, 19.8%), consistent 14 with the later historical appearance of this species ( Fig. 2B; S2 Table).

15
The efficacy and specificity of RIP in transposable elements and beyond 16 differs among the three species. 17 In addition to impacting genome evolution, differences in the repertoire of TEs among the three species 18 also imply differences in TE activity and possibly genome defenses against mobile genetic elements. A 19 major defense mechanism against TE activity in fungi is mediated by repeat-induced point mutation (RIP) 20 [5,6], a homology-based process that causes C:G to T:A transitions to duplicated regions of DNA during 21 meiosis, thus rendering TEs inactive through mutation [7]. RIP can also affect other types of DNA elements,

22
including gene duplicates that are frequently inactivated by the RIP-mediated introduction of stop codons 23 [8].

24
Analysis by RIPCAL [9] indicated that the genomes of the three species are all subject to RIP (S3 Table).

25
Overall, P. musae has a higher number of RIP loci predicted in its genome (5070), followed by P. eumusae 26 (3820) and P. fijiensis (2591). However, given the differences in genome sizes, a larger fraction of the P. 3 fulva (97.2%), P. lingam (99.8%), Z. tritici (97.9%), and others [2][3][4]. Notably, they are also inconsistent with 4 the high density of TEs in the genomes of the three Sigatoka complex species, perhaps suggesting that 5 RIP cannot effectively defend them against TE activity. Furthermore, while the majority (2449/2591, 94.5%) 6 of the putative RIP loci in P. fijiensis are co-localized with the identified repeat elements, in contrast, 10% 7 (518/5070) of the RIP loci in P. musae and an impressive 33% (1257/3820) in P. eumusae are not 8 associated with repeat elements. This could suggest abundant spillage of RIP in P. eumusae outside the 9 duplicated target sequence into the adjoining non-duplicated sequences and/or that multicopy genes in 10 P. eumusae may be more sensitive to RIP than in P. musae, P. fijiensis, and other Dothideomycetes [2, 3, 11 10]. Furthermore, 2163, 3164, and 4539 of putative protein-coding genes that represent 19.4%, 29.8%, 12 and 34.6% of the total genes in P. eumusae, P. musae, and P. fijiensis, respectively, were found located 13 within a 2 kb region flanking the RIP loci in each species. These included 32 (28.6% of the total predicted 14 effectors), 35 (31.8% of the total predicted effectors), and 37 (35.2% of the total predicted effectors) of 15 putative effectors present in P. eumusae, P. musae, and P. fijiensis, respectively (S3 Table)  To understand similarities and differences in the functional properties of the three species, we annotated 21 each species proteome, by assignment into the functional categories of the eukaryotic orthologous groups

22
(KOG) database [12]. KOG has four major functional categories, i.e. cellular processing and signaling, 23 information storage and processing, metabolism, and poorly characterized proteins with unknown 24 functions. A total of 6292 (59.7%), 6783 (61.3%), and 7323 (55.9%) of the predicted proteins in P. musae, 25 P. eumusae, and P. fijiensis, respectively were assigned to KOGs. A KOG-based breakdown of the species' 26 proteomes indicated that the percentage of proteins allocated to each of the four higher functional 27 categories of KOG was comparable among the three species, although the absolute total number of Page | 5 proteins assigned to the same KOG category could be different among them (S6 Fig.). For example, a 1 total of 1873 proteins from P. musae, 2066 proteins from P. eumusae and 2164 from P. fijiensis, 2 corresponding to 17.8%, 18.7% and 16.5% of their proteomes, respectively, were assigned KOGs in the 3 cellular processes and signaling category. Similar results were also obtained when examining the 4 distribution of KOGs from each species within the functional categories of metabolism (Pm: 1946 proteins, 5 18.45%; Pe: 2026 proteins, 18.31%; Pf: 2235 proteins, 17.05%), information storage and processing (Pm: 6 1188 proteins, 11.26%; Pe: 1312 proteins, 11.85%; Pf: 1413 proteins, 10.78%), and the category of poorly 7 characterized ones (Pm: 1285 proteins, 12.18%; Pe: 1379 proteins, 12.46%; Pf: 1511 proteins, 11.53%) 8 (S6 Fig.). The pattern was also conserved when the proteomes were annotated based on the 25 9 subcategories of KOG, in which case P. fijiensis generally exhibited the highest number of proteins 10 annotated in all but three of the sub-categories (i.e. N: Cell wall/membrane envelope biogenesis, Y:

12
Proportionally to their proteome sizes, however, the three species do not exhibit any significantly large 13 differences in the percentage of proteins distributed across the 25 KOG subcategories (S6 Fig.), indicating 14 that, based on their KOG profiles, they execute a fairly similar spectrum of biological activities.

15
Further orthology-based comparative analysis of the species' gene and proteome complements indicated 16 that a total of 6307 protein-coding gene families containing at least one gene copy in each of the three 17 species were shared by all three species that represent their core proteome complement (Fig. 4A). KOG-18 based functional annotations revealed that nearly a third (2076) of the core gene families encode 19 hypothetical proteins that could not be assigned to any of the four higher-level categories of KOG. A total 20 of 4782 KOG terms could be assigned to the remaining 4231 families, with the ones involved in "cellular 21 processing and signaling" (1508), and "metabolism" (1311) being more abundant as compared to families 22 involved in "information storage and processing" (1008) or are "poorly characterized" families with 23 unknown function (955) (S7 Fig.). Note that some gene families receiving KOG annotations could be 24 assigned to more than one functional category of KOG. Thus, the distribution of KOG annotations for the 25 core proteome of the three species follows a pattern similar to that of their individual full proteomes.

Page | 6
indicating that the overall gene complement of this species is more divergent as compared to the gene 1 complements of the other two pathogens. Consequently, the number of species-specific protein-coding 2 genes retrieved from P. fijiensis (3442/13 107, 26.2%) was higher as compared to P. eumusae (1759/11 3 064, 15.9%) and P. musae (1867/10 548, 17.7%) (Fig. 4A), which is in line with the earlier branching of P. 4 fijiensis from the last common ancestor of all three species [13]. The KOG-based functional annotations 5 of the species-specific genes revealed that the overwhelming majority of these genes in P. musae (1652),   were predicted to encode for secreted proteins, including 6 putative effectors that could be required for 16 virulence specifically on the banana host (S8 Fig.). Similarly, of the species-specific genes, 2176, 1403, 17 and 1120 genes in P. fijiensis, P. musae and P. eumusae, respectively, are putative orphans, as no 18 homologs could be identified in none other species (Fig. 4B). The overwhelming majority of orphan genes 19 (~95%) in the three pathogens encode for hypothetical proteins without any protein domains or putative 20 functional roles assigned to them (S8 Fig.), suggesting that they may promote micro-evolutionary 21 divergence of the three species and likely virulence on the banana host as well.

22
Analysis of copy-number variations (CNV) reveals parallel patterns of gene 23 family expansions and contractions between P. fijiensis and P. eumusae. 24 Analysis of CNV among P. musae, P. eumusae, and P. fijiensis indicated that clustering of the species 25 based on the pattern of expansions and reductions in core gene families, and especially the ones related 26 to metabolism, is more respectful of the species virulence profiles rather than their evolutionary Page | 7 relationships, suggesting that P. fijiensis and P. eumusae exhibit somewhat concerted patterns of CNV 1 (Fig. 6).

2
To investigate whether the pattern of parallel changes in size in gene families shared between P. fijiensis 3 and P. eumusae extended beyond core gene families, the analysis was expanded to include gene families 4 that are shared by at least two of the species but not necessarily the third one (i.e. it could be absent in 5 the third species). The number of equally sized gene families between either two of the three species was 6 then enumerated and further classified into three major groups, based on the pair of species that were 7 sharing the equivalent family sizes. For example, in group Pe/Pf, P. eumusae and P. fijiensis share equal 8 family sizes but different than P. musae, in group Pe/Pm, P. eumusae and P. musae share equal family 9 sizes but different than P. fijiensis, and finally in group Pm/Pf, P. musae and P. fijiensis share equal family 10 sizes but different than P. eumusae. To avoid uncertain CNV status, species-specific genes were not 11 considered in this analysis. Once more, the pairwise comparisons showed that a significantly higher 12 number of the gene families had exactly the same copy number shared between P. eumusae and P. 13 fijiensis (Pe/Pf group: 1742 gene families), rather than between P. musae and P. fijiensis (Pm/Pf group: 14 1127 gene families) or between P. musae and P. eumusae (Pe/Pm group: 945 gene families) (S9 Fig.).

15
This result reinforces the confidence that P. eumusae and P. fijiensis share a more similar pattern of gene

23
Although clustering of P. fijiensis together with P. eumusae, when considering variations in gene families 24 related to metabolism, could be due to convergent expansions and contractions in these two species, an Page | 8 changes in gene copy numbers of their metabolic families. In specific, we first identified, based on KOG 1 annotations, gene families associated with metabolism in the nine Capnodiales species, and then using 2 the CNV for each family among the different species, we performed hierarchical clustering in order to 3 elucidate the pattern of copy number changes that has emerged during evolution. Hierarchical clustering 4 of the species based on copy number changes in the 1503 metabolic gene families that were identified in 5 at least one of the nine species, clustered P. eumusae together with P. fijiensis, supporting the occurrence 6 of parallel evolution (S10A Fig.). In addition, P. musae still clustered together with P. eumusae and P. 12 fijiensis together with P. eumusae. However, excluding these 130 gene families from the analysis still 13 clustered the two more virulent species together (S10B Fig.), indicating that their clustering is likely due to 14 parallel expansions and contractions in these two species rather than changes that took place solely in P.  (S11A Fig.). Hierarchical clustering of the species, based on the GO-distribution profiles (i.e. by 1 enumerating the number of genes assigned to each category of GO) of their entire proteomes, produced 2 the expected tree topology that was congruent with the species phylogenetic relationships (S11B Fig.). To 3 identify GO terms that define a tree topology that is respective of the species virulence profiles, we used a 4 random forest (RF) approach, a statistical method that can be used for an unbiased ranking and filtering 5 from large datasets of biomarkers (e.g. genes) that are associated with a given molecular signature or 6 pattern [15-17]. Using this approach, a total of 24 GO terms were identified that could possibly underlay 7 the clustering of P. eumusae together with P. fijiensis, when considering changes in the predicted 8 proteomes of the three species (S11C Fig.). Consequent mapping of these GO terms on a directed acyclic 9 graph (DAG) that illustrates the connections among the different terms in the form of parent-to-child 10 relationships, indicated that the majority of identified GO terms (16/24) are associated with metabolic 11 processes (GO: 0008152) (S12 Fig.). Included as child nodes are five terms that are directly related to

CAZy annotations and characterization of plant cell wall degrading enzymes 22
(PCWDEs) suggest small differences among the three species but also more 23 similar profiles for P. eumusae and P. fijiensis as compared to P. musae. 24 Page | 10 matter for their nutrition [18]. Traditionally, CAZymes have been organized into five major superfamilies, 1 comprised of glycoside hydrolases (GHs), glycosyltransferases (GTs), polysaccharide lyases (PLs), 2 carbohydrate esterases (CEs), and carbohydrate-binding modules (CBMs) [19]. Of these, CBMs do not   Table). More specifically, a total of 222, 236 and 244 putative GHs from 61, 60 and 59 different 17 GH families were recovered from the genomes of P. musae, P. eumusae, and P. fijiensis, respectively,  Page | 11 P = 0.217). However, at the individual family level many differences can be present between the two 1 groups (S4 Table; S14 Fig.). Family GH25, for example, which includes enzymes with lysozyme activities 2 that cleave the bacterial cell-wall polymer peptidoglycan and are produced by many organisms as a  Table). Such differences in the specific enzymatic repertoire of different fungi are not uncommon and

14
although within each individual order many deviations from the expected species phylogeny could be 15 observed (Fig. 7). Of marked importance, P. eumusae clustered together once more with P. fijiensis rather 16 than P. musae, as expected based on the phylogenetic placement of the three species, indicating that P.   Table). Moreover, within the group of PCWDEs, the ones directed towards  Table). The higher number of hemicellulases in these species is not unusual among plant pathogenic 6 fungi, which in general have a much higher and more diverse arsenal of lignocellulolytic (i.e. 7 decomposition of cellulose, hemicellulose, and lignin) enzymes as compared to pectinases [3,18]. It 8 should be noted that from the above calculations we excluded families such as GH1, GH3, GH5, and GH9, 9 whose members can act both on plant and fungal cell walls, and which have been placed in the generic 10 category of cell wall degraders.

11
Despite the fact that the three banana pathogens share similar numbers in PCWDEs, overall, they display 12 some differences at the individual family level. Such differences can be both in the diversity of families 13 present in each species as well as in the number of members in each family (S6 Table; Table;   thus revealing that they share more similar patterns of average family sizes for PCWDEs as compared to 24 P. musae (Fig. 7).

25
When compared to other Dothideomycetes, the median numbers of hemicellulases and other types of Page | 13 hemibiotrophic fungi from the Capnodiales clade (S6 Table) [2, 3]. Along the same lines, we could not 1 distinguish any clear differences in the overall arsenal of PCWDEs between monocot and dicot infecting 2 species. At the individual family level, however, the Mann-Whitney U test indicated that family GH74 3 involved in the degradation of cellulose is again underrepresented (or entirely missing) in the genomes of 4 P. musae, P. eumusae, and P. fijiensis as compared to the five hemibiotrophic species of Capnodiales. In 5 contrast, families CE1 and CE3 of esterases, whose members display hemicellulolytic activity, family 6 GH51, whose members are involved in the degradation of hemicellulose-pectin complexes, and families 7 GH78 and GH115, which include many enzymes with pectinolytic activity, are overrepresented in the 8 genomes of P. musae, P. eumusae, and P. fijiensis as compared to the five hemibiotrophic Capnodiales 9 (S6 Table). Such differences at the individual family level could be the result of adaptation of P. musae, P.  Table).

19
Annotation of the core enzymes involved in the biosynthesis of secondary 20

metabolites (SMs) reveals that the three Sigatoka species potentially produce 21 a diverse but only partially overlapping array of SMs. 22
Filamentous fungi secrete numerous secondary metabolites (SMs) to alter and adapt to their environment 23 [23]. These low-molecular-weight metabolic products display a wide range of chemical structures that   13 Ts1-to-Ts5, and 4 in Pf: TsI-to-TsIV) (S7 Table). No DMATs were detected in any of the three species. The

17
Unfortunately, a complete and reliable annotation of the full biosynthetic gene clusters in which the 18 identified core SM genes are embedded was not possible, mainly due to the highly fragmented genome 19 assemblies, especially for P. eumusae and P. musae.

20
Analysis by AntiSMASH and manual curations of the domain architectures of the core enzymes indicated 21 that nearly all of the PKSs present in the predicted proteome of the three species belong to the subcategory 22 of interative type I PKSs, with a higher number predicted in P. eumusae (n=10) than in P. musae (n=7) 23 and in P. fijiensis (n=6) (S7 Table). As expected, no type II PKSs were found in any of the three species, 24 while one type III was identified in P. eumusae, in agreement with the rare presence of these two PKS types   Table), alluding to the production of aliphatic compounds or reduced polyketide chains, respectively.

6
To better understand the type of SMs that might be produced by the three species, we performed a 7 phylogenetic analysis with other fungal core PKS enzymes that are involved in the biosynthesis of well 8 characterized SMs, such as aflatoxins, fumonisins, and others (S17 Fig.). A comprehensive list and activity in animals [28]. However, despite the fact that the P. eumusae Pks6 is annotated as a reducing 20 type, both P. eumusae and P. musae seem to lack orthologues of Pks4 from F. graminearum, and thus, 21 they are unlikely to produce zearalenone [29]. Moreover, the orthologous PksC, Pks3, and PksIII enzymes 22 from P. musae, P. eumusae, and P. fijiensis, respectively were clustered (bootstrap value of 99%) with core 23 enzymes that mediate the biosynthesis of the anthraquinone endocrocin in species of Aspergillus spp.

24
Anthraquinones are well-known for their array of industrial and medical uses but also as precursors to the 25 synthesis of aflatoxin intermediates [30]. Clustered with the orthologous PksB, Pks2, and PksII (bootsrap 26 of 100%) from P. musae, P. eumusae, and P. fijiensis, respectively, was EfPks1, which is involved in the

21
Taken together, the phylogenetic analysis indicates that although the three pathogens share some 22 orthologous core PKS enzymes, they still exhibit considerable variation in the arsenal of SMs that they 23 potentially produce, some of which may bare structurally similarity, at least in their backbone structure, to 24 already characterized phyto-and mycotoxins.

25
Similar overall results were extracted by an analysis of the NRPSs present in the three banana pathogens 26 (S18 Fig.). Like PKSs, NRPSs are megasynthases consisting of several enzymatic modules that elongate

12
Within the SID subgroup, the three A domains of the multimodular NpsA, Nps1, and NpsI enzymes from 13 P. musae, P. eumusae, and P. fijiensis, respectively, were seen clustered with high to median support (ML

19
To gain a deeper insight into the pathogenic potential of the three species that constitute the Sigatoka 20 disease complex, we characterized their secretomes, placing an emphasis on identifying and comparing 21 their arsenal of candidate effector repertoires. We broadly defined as effectors the subgroup of secreted 22 proteins that were shorter than 250 amino acids in length with a cysteine content that was at least two-fold  shared by the three species, while BlastP (e-value: 1e-5, alignment coverage > 50%) against the NCBI nr 1 database and the JGI fungal genome database was used to identify putative homologs in other fungal 2 species and beyond. As for the full proteome, we defined "core", as those effectors shared by the three 3 species and other fungal species as well, while we classified "lineage-specific" as the subset of core 4 effectors that are present only in the three pathogens that constitute the Sigatoka disease complex. We 5 also considered effectors that are found in only one of the three pathogens but not in the other two as 6 "species-specific", while we classified "orphans" as the subcategory of species-specific effectors that do 7 not have homologs in any other fungal species.

8
Secretome and effector identification was performed using the bioinformatics workflow presented in S20 predicted in the genomes P. musae, P. eumusae, and P. fijiensis, respectively, indicating that the three 12 species employ secretome arsenals of comparable size to the secretomes of most other hemi-biotrophic 13 fungi, but smaller as compared to necrotrophic pathogens (Mann-Whitney U test, P-value = 0.01) (S8 14   Table). Using the criteria listed above, a total of 110, 112, and 105 putative effector proteins could be 15 retrieved from the predicted secretomes of P. musae, P. eumusae, and P. fijiensis, respectively. Pfam and  Table). Notably, a second paralog of Avr4, which we termed Avr4-2, could be found  Table). Notably, RALFs are a family of  Table). Both these effectors are shown to be virulence factors in their respective species and thus 8 their homologs in P. eumusae might have a similar role in virulence as well. The third effector from P.   Table). This could potentially represent a 4 th paralog of the 14 Ecp2 effector family present in P. musae, although its sequence is highly diverse as compared to the other 15 three Ecp2 paralogs present in this species. Finally, search for Pfam domains in the species-specific 16 effectors of P. fijiensis identified one effector with a hit to the Ser-Thr-rich glycosyl-phosphatidyl-inositol-17 anchored membrane family (PF10342), two effectors with a hit to a cutinase (PF01083) and a peptidase 18 (PF13933) enzyme, respectively and two more effectors containing putative RALF domains (PF05498) (S9 19   Table). Neither of these two effectors, however, shared significant homology over the entre protein to the 20 RALF-like effector identified in P. musae.

21
The clustering analysis suggests that the three pathogens, despite their very close evolutionary 22 relationships, common infection biology and host range, exhibit a considerably diverse arsenal of effector 23 proteins that could have contributed to their differences in virulence. However, caution needs to be taken 24 regarding the numbers listed above, as when the effector repertoire of each species was used as query in 25 Blastp searches (e-value: 1e-5, alignment coverage > 50%) against the entire proteome of the other two 26 species, then additional putative homologs could be identified that were not annotated as effectors, either 27 because they were larger than 250 amino acids in length or because they were not predicted as secreted 28 proteins in the other species (S9 Table; S21B-D Fig.). For example, BlastP-based query of the P. eumusae Page | 21 effector repertoire against the predicted effectorome and proteome of P. musae, returned 39 and 75 protein 1 hits, respectively, indicating that the true number of effectors shared by the two species could be 2 considerably higher. From the additional 36 proteins that were retrieved as blast hits against the entire 3 proteome of P. musae, 7 were larger than 250 amino acids, while the remaining 29 were missing a signal 4 peptide. Manual curation of a randomly selected set of six of these non-secreted proteins indicated that in 5 two cases they represented misannotations and an alternative ORF could be found that corresponded to 6 a putatively homologous secreted protein. When considering all the blastp hits of a species effectorome 7 against the entire predicted proteome of the other two species, then the number of species-specific and 8 orphan effectors is reduced to 33 and 20 in P. musae, 27 and 17 for P. eumusae, and 43 and 36 in P. 9 fijiensis, respectively (S9 Table; S21B-D Fig.). This, however, still remains a relatively high number of 10 species-specific and orphan effectors in each species, while P. eumusae and P. musae continue to share 11 a larger number of putative homologous effectors as compared to putative effectors shared between P.