Genome-wide increased copy number is associated with emergence of super-fit clones of the Irish potato famine pathogen Phytophthora infestans

The plant pathogen that caused the Irish potato famine, Phytophthora infestans, continues to reemerge globally. These modern epidemics are caused by clonally reproducing lineages. In contrast, a sexual mode of reproduction is observed at its center of origin in Mexico. We conducted a comparative genomic analysis of 47 high coverage genomes to infer changes in genic copy number. We included samples from sexual populations at the center of origin as well as several dominant clonal lineages sampled worldwide. We conclude that sexual populations at the center of origin are diploid as was the lineage that caused the famine, while modern clonal lineages showed increased copy number (3x). Copy number variation (CNV) was found genome-wide and did not to adhere to the two-speed genome hypothesis. Although previously reported, tetraploidy was not found in any of the genomes evaluated. We propose a model of super-fit clone emergence supported by the epidemiological record (e.g., EU_13_A2, US-11, US-23) whereby higher copy number provides fitness leading to replacement of prior clonal lineages.

pathogen is thought to have originated in central Mexico (Goss et al. 2014; Grünwald and Flier 45 2005) where it is found existing alongside two closely related, endemic sister-taxa defining 46 Phytophthora clade 1c, namely P. mirabilis and P. ipomoeae (Galindo and Hohl 1985;Flier et al. 47 2002)( Figure 1A). Elsewhere in the world it emerges as clonal lineages (Fry et al. 1993; Hu et al. 48 2012; Fry et al. 2013;Cooke et al. 2012). These emergent clonal lineages are frequently 49 ephemeral, disappearing after a season or two ( Figure S1). However, high fitness (hereafter 50 referred to as 'super-fit') clones occasionally emerge and become dominant, replacing the 51 formerly dominant lineages. While this pathogen continues to reemerge globally, we know very 52 little about the mechanisms involved in pathogen emergence and genomic features that are 53 associated with these newly emerging, dominant and super-fit clones. A the lineage that caused the Great Famine. However, more recent work identified HERB-1 as the 74 famine causing lineage ), a lineage that differs from US-1, but might be 75 ancestral to US-1 (Martin et al. , 2016. During the mid-1990s, late blight re-emerged in 76 the US as novel clonal genotypes that were not previously observed Goodwin 1997a, 77 1997b having a sexually reproducing population at its center of origin as well as re-emerging clonal 84 epidemics in the US, and most of the rest of the world, consisting of distinct clonal lineages that 85 displace older clonal lineages. 86 87 The P. infestans genome has been characterized as having a two-speed genome. These two 88 speeds refer to two compartments: gene-dense regions, containing predominantly 89 housekeeping genes, and gene-sparse regions enriched for effectors (proteins that are secreted 90 from the pathogen and associated with infection) including RxLR genes (Haas et  older lineages. They also reported large changes in copy number variation (CNV), gene loss, 97 mutations, and gene expression patterns that distinguished 13_A2 from previous lineages. 98 These genomic changes are thought to underlie its emergence. 99 100 In addition to the two-speed genome model, several studies have documented variation in 101 ploidy. Phytophthora species are considered to be diploid (Brasier 1992). Extensive cytological 102 work documented that P. infestans was primarily diploid, yet indicated that some isolates might 103 be of higher ploidy (Sansome and Brasier 1973;Sansome 1977 We re-sequenced genomes of P. infestans to explore variation in gene copy number and in a 126 representative global sample that included a sexual population and select members of clonal 127 lineages. We combined our genome data with recently published whole genome data to obtain 128 a population of 47 high coverage samples ( Figure 2) that provide power for testing the 129 hypotheses of finding difference in ploidy, CNV and genic content in P. infestans. For this study, 130 we defined ploidy as a genome wide change in copy number (i.e., whole genome duplication), 131 whereas copy number refers to a change observed at the sub-chromosomal level. We tested 132 the hypotheses that sexual populations were diploid with little CNV while clonal populations 133 were predominantly triploid or tetraploid with high CNV. We also tested the hypothesis that 134 CNV and presence absence polymorphism are enriched in gene-sparse, effector rich portions of 135 the genome (as expected by the two-speed genome hypothesis). We also expected to find CNV 136 and presence/absence polymorphisms differed in clonal vs. sexual populations. Finally, we 137 tested the hypothesis that similar changes in CNV might be observed in other heterothallic 138 Phytophthora species for which genomic data for populations was available, such as P. To understand variation in CNV and gene content, we resequenced and used previously 148 published, populations of the potato late blight pathogen, P. infestans, from the center of origin 149 in Mexico (n = 16), and dominant clonal lineages in the US, Europe and South America ( Figure  150 2A). To allow for robust inference of gene copy number we used only genomes with a genic 151 average adjusted read depth (AARD) of 12x or greater (Figure 2A; Figures supplement 2A, B, C). 152 This resulted in a total of 47 high quality P. infestans genomes (Supplementary table 1; Figure  153 2B). 154 155 Genic copy number varies continuously in P. infestans 156 157 We observed genic CNV among populations ( Figure 3A) and a gradient of genic copy number 158 ranging from predominantly 2x to predominantly 3x ( Figure 3B). We did not observe classes of 159 individuals that would represent tetraploid individuals. Isolates from the United States 160 belonging to clonal lineages have a gradient of gene copy number ( Figure 3C). Diploid strains in 161 US lineages were mostly found in the well-represented lineage US-22 (n = 3) and US-23 (n = 1). 162 Similarly, in Europe isolates that were both predominantly diploid and triploid were observed. 163 The exception to this balance of ploidy appeared to be in South America where almost the 164 entire sample appeared to be predominantly triploid ( Figures  The degree of CNV was also explored for samples where tissue was extracted from historical 184 herbarium samples (HERB-1: M-0182896, Pi1889; US-1: Kew122, Kew126) ( Figure 3C). These 185 samples were not cultured on media and were not exposed to the modern fungicide metalaxyl. 186 These samples demonstrated variability in gene copy number as well ( Figure 3A, C) suggesting 187 that CNV may have been a natural condition in clonal lineages of P. infestans. Note that two of 188 the four samples that we determined to be of sufficient sequence depth to call copy number 189 were from the 20 th century (Kew122 and Kew126 both collected in 1955) and clustered with Genic copy number variation was not associated with specific classes of genes 231 232 If genic copy number was associated with the phenotype we would expect CNV to be 233 associated with a particular type of gene class (e.g., pathogenicity factors). We found that in the 234 sexually reproducing population from Mexico that was predominantly diploid all gene 235 categories had more 2X genes than 3X genes ( Figure 5; green). Similarly, for the populations 236 Gene copy number variation occurred in core orthologous genes 253 254 Core orthologous Phytophthora genes are genes that were reported to occur once and only 255 once in P. infestans, P. ramorum, and P. sojae (Haas et al. 2009) and are thought to be highly 256 conserved. Therefore, we would expect only low levels of CNV in these genes. To test this 257 hypothesis, we plotted all core orthologous genes present at 3x by their 5' and 3' intergenic 258 distances ( Figure 6). We observed substantial numbers of genes inferred to have three copies 259 (3x) among core orthologous genes in the gene dense portion of each genome ( Figure 6). This 260 indicates that this portion of the genome may be more dynamic than previously thought. The phenomenon of genic CNV is shared with other members of the Phytophthora genus 272 273 We explored if variation in ploidy apparent in P. infestans is observed in other heterothallic 274 Phytophthora taxa. We looked at species for which population level genome data was available 275 including P. andina (clade 1c), P. parasitica (clade 1), and P. capsici (clade 2)(clades as assigned 276 by Blair et al. 2008). The taxon P. andina appears to be diploid in our limited sample ( Figure 6). 277 However, we observed more heterozygous positions than in the other taxa ( Figure 6). This is 278 consistent with the interpretation that P. andina is a homoploid hybrid that arose from a cross 279 between P. infestans and another undescribed Phytophthora species (Goss et al. 2011). The 280 more distantly related P. parasitica appeared diploid as well. However, its relatively high 281 sequence depth allowed resolution of minor peaks indicating that a fraction of genes occur at 282 three copies (particularly in the sample P1569). The taxon most distantly related to P. infestans 283 included in our analysis was P. capsici. Three of the P. capsici samples appeared to be diploid 284 while one sample (Pc389) appeared to be triploid. These results suggest that our findings of 285 variation in ploidy and CNV within P. infestans are also shared among other species of 286 Phytophthora. To characterize the emergence of new clonal lineages of the Irish famine pathogen 303 Phytophthora infestans we resequenced whole genomes of select populations from several 304 dominant clonal lineages as well as the first sexual populations from the center of origin in 305 Mexico. Prior work focused primarily on select individuals rather than populations and did not 306 include sexual populations. The genomes were compared with previously sequenced, high 307 quality genomes to determine ploidy, CNV and gene content. Recent epidemiological records 308 indicated that new clonal lineages have emerged repeatedly in the US and Europe ( Figure  309 supplement 1). For example, the lineage US-1 was the first to establish in the US, but was 310 eventually displaced by US-8, US-11 and more recently by US-23 (Fry et al. 2015)( Figure  311 Supplement 1). Similarly, populations in the UK were displaced by 13_A2 in the last decade and 312 Clonal lineages show higher copy number than sexual populations at the center of origin 317 318 Populations studied show a gradient of CNV from 2x to 3x ( Figure 3B). Populations of P. 319 infestans that are sexually reproducing at the species' center of diversity in Mexico are diploid 320 ( Figure 3C). This contrasted with dominant clonal populations from the rest of the world that 321 are predominantly triploid. This suggests that there may be a connection between ploidy, 322 epidemic fitness, and mode of reproduction. 323 324 Isolates were predominantly diploid or triploid, but not tetraploid 325 326 We observed no tetraploid individuals as reported previously ). In fact, we 327 reanalyzed some of the same samples and data including the European lineage 13_A2 328 previously characterized as being tetraploid. In our analysis 13_A2 had mostly three gene 329 copies (Figure 3), which is in agreement with a more recent report (Li et al. 2017). Part of this 330 discrepancy is due to changes in technology. Plotting histograms of allele balance has typically 331 included all variants, including homozygous genotypes. Because homozygous sites are much 332 more abundant than heterozygous sites this tends to drive the scaling of the plot. To avoid this, 333 previous work limited plots to a frequency range of 0.2 to 0.8. We subset our data to only the 334 heterozygous genotypes, resulting in a plot from 0 to 1, and subset the data by omitting 335 variants with unusually high or low sequence depth. This is a significant improvement in 336 methodology for inferring ploidy or CNV based on allele balance (Knaus and Grünwald 2018). 337 338 Gene loss occurred within individuals in both sexual populations and clonal lineages 339 340 We tested the hypothesis that gene loss was shared by ancestry. This would provide the 341 expectation that members of a clonal lineage show fixed polymorphisms within members of the 342 clonal lineage. We used breadth of coverage to identify presence/absence of genes relative to 343 the reference genome. Instead we found that individuals within a clonal lineage (e.g. from 344 South America or US-1; Figure 4)  Our expectation following the proposed two-speed genome hypothesis was to find CNV 354 enriched in the gene-sparse, transposon and effector rich portion of the genome where CNV 355 could provide a means of creating novel paralogs. To our surprise CNV affects housekeeping 356 genes and effectors equally and is randomly dispersed throughout the whole genome ( Figure  357 5). In the diploid genomes from Mexico we found that core orthologous genes, pseudogenes, 358 and several pathogenicity factors were all predominantly 2x ( Figure 5). Genomes of clonally 359 reproducing strains from South America and the lineage US-1 were found to have core 360 orthologous genes, pseudogenes, and pathogenicity factors that were predominantly 3x. We 361 also expected CNV to be higher in pathogenicity factors than in core orthologs, yet levels of 362 CNV were not different regardless of gene class ( Figure 5). 363 364 Variation in ploidy can be found in other but not all Phytophthora species 365 366 We also evaluated if changes in ploidy could be observed in other heterothallic Phytophthora 367 species. We used genomes for moderate population sizes from P. andina, P. parasitica (= P. andina had one haplotype from each parental species as expected and were predominantly 2x 373 copy number. P. parasitica, a distant relative of P. infestans basal to clade 1, was diploid. 374 However, two strains (P10297 and P1569) had minor peaks at our expectation for three copies 375 indicating that fractions of these genomes may vary in copy number at 3x. Our ability to resolve 376 these peaks was likely due to the high sequence depth of these samples relative to the other 377 available taxa. In clade 2, the more distant P. capsici appeared predominantly diploid for 3 378 strains; however, one strain (Pc389) was triploid. These results suggest that variation in ploidy 379 and/or copy number may be a common feature throughout the Phytophthora genus. The late blight pathogen P. infestans continues to re-emerge causing financial loss for farmers 412 and threatening food security, particularly in developing countries (Fry et al. 2015 sequenced at a variable position. In a diploid heterozygote the expectation is that each allele 464 will be observed at an equal frequency, or a ratio of one half. A triploid heterozygote will be 465 expected to have alleles observed at a ratio of one third. A tetraploid heterozygote will be 466 expected to have alleles observed at a ratio of one quarter. Note that some combinations are 467 indistinguishable and therefore uninformative. For example, a tetraploid heterozygote with 468 only two alleles (e.g., A/A/C/C) will have each allele observed at a ratio of one half. This will be 469 indistinguishable from our expectation from a diploid heterozygote. 200,000 bp windows were made using the allele ratio data. This window size was chosen for P. 491 infestans because it was sufficiently large to include a population of heterozygous positions (we 492 observed a heterozygous position every 1-2 kbp) but small enough to obtain fine scale 493 resolution. The data were then assigned to bins ranging from 0 to 1 that are of width 0.02 and 494 the bin with the greatest density was used as a summary for the window. This is analogous to 495 the modal frequency. This summary was then categorized to a ploidy level by assigning it to the 496 closest expected ratio (i.e., 1/2, 2/3, 3/4, 4/5). Each genome was now summarized into 497 windows of ploidy. In order to assign copy number to genes the coordinates of each gene were 498 referenced in the windowed genome and the copy number of the window where the gene was 499 located was used to assign a copy number to the gene. This is critical because we do not expect 500 most genes to contain enough heterozygous positions to infer an accurate estimate of copy 501 number. Once a copy number was determined a confidence in this estimate was made by 502 subtracting the observed proportion from the determined proportion and dividing by the bin 503 width so that the value ranges from zero to one. Calculations were performed in R (R Core 504 Team 2018) and using vcfR Grünwald 2017, 2018). 505 506 Gene loss based on breadth of coverage 507 508 In order to determine gene loss we measured breadth of coverage (BOC) for each gene in each 509 genome. We used SAMtools mpileup ) to count per position sequence coverage 510 over all 18,179 genes in the P. infestans T30-4 genome (Haas et al. 2009). From this data, the 511 number of positions that were sequenced at least once and a median of coverage were 512 collected. Breadth of coverage was calculated by dividing the number of positions that were 513 sequenced at least once by the gene length (i.e., the proportion of positions sequenced in a 514 gene). We used a BOC of zero to indicate the loss of a gene. 515 516 Gene class and density 517 518 Published gene annotations (Haas et al. 2009) were used to assign genes to gene classes (core, 519 pseudogene, RxLR, etc.). The flanking intergenic region (FIR) lengths (i.e., intergenic distances) 520 were calculated using a previously available script 521 (https://figshare.com/articles/Calculate_FIR_length_perl_script/707328). This information was 522 used to create FIR plots for individuals and populations from Mexico, South America, and the 523 lineage US-1 using R (R Core Team 2018) and ggplot2 ). In order to explore 524 whether genes of a particular class from populations from Mexico, South America, and the 525 lineage US-1 were enriched for a particular copy number the genes were assigned a copy 526 number (based on allele balance) and plotted as box and whisker plots using ggplot2 (Wickham 527 2009). In order to visualize whether genes determined to have three copies were more 528 abundant in the gene dense or gene sparse portion of the genome FIR plots were created as 529 above except using core orthologous genes that were determined to have three copies. 530 531 Copy number variation in other species of Phytophthora 532 533 In order to address whether copy number variation occurred in other species of Phytophthora 534 we queried NCBI for samples that had Illumina sequence data as well as an assembled genome 535 reference for the species. These data were processed as the P. infestans data were processed. 536 In order to visualize these data in a phylogenetic context a tree from Martin et al.

Read depth
The inference of copy number, including gene loss, is dependent on sequence depth. For example, the loss of a gene may be determined by a lack of sequence coverage. Similarly, the presence of three copies of a gene may be inferred when the sequence depth for alleles at a locus are present in multiples of 1/3. Insufficient sequence depth could result in the incorrect inference that a gene has been lost when it may actually be below the level of detection, perhaps because it is in a region of the genome that is difficult to sequence. Similarly, if a gene exists as two copies but is sequenced at 3X coverage it would only be possible to incorrectly infer that it has three copies. In order to address the issue of sufficient sequence depth we began by filtering the available genomes to only analyze genomes of relatively high sequence depth. Sequenced read depth for genes was used to determine whether a sample had sufficient coverage for further analysis. Per position read depth was measured using SAMtools mpileup . A GC corrected estimate of per gene sequence depth was then calculated following Raffaele et al. (2010): Here ARD(g) is the median read depth for a gene. Raffaele et al. (2010) reported average read depth but we have used median read depth to provide a robust estimate, but have retained the nomenclature of Raffaele and colleagues. mARD is the median read depth over all genes and mARD GC is the median read depth for genes in the same GC percentile as the gene being evaluated. This results in a GC adjusted average read depth A ARD(g).
A threshold of 12X coverage was used to determine suitability for analysis. Because our estimate of copy number relied on calculating the frequency of allele depth we chose a number that was greater than 10. Also, because we expected some samples to have three copies we chose a threshold that would divide evenly by two and three. Lastly, the choice of threshold was a compromise between attempting to select the highest coverage samples available, but also retaining as many samples as possible. The samples KM177497, M.0182897, and M.0182904  were omitted because they did not appear to have enough data to process. Violin plots of were created of A ARD(g) for each available sample using ggplot2    (Martin et al., 2015), PIC97136, PIC97442 (this report) were considered but omitted based on the criterion that they had a genic AARD less than 12.

Allele balance: P. infestans
Data were processed as in the section 'Gene copy number inference based on allele balance' and in Knaus and Grünwald (2018  . The allele depth (AD) and genotypes were extracted from the vcfR object and the first and second most abundant alleles were extracted from the allele depth. The genotype information was used to subset the allele depth information to only heterozygous positions. The 15th and 85th percentiles were calculated for each sample and each alelle (the first and second most abundant alleles) and used as an inclusion threshold for depth filtering. A frequency for each allele was then calculated by dividing its allele depth by the sum of the first and second most abundant alleles. This information was used to plot each histogram. Samples are presented in the same order as in Figure                       . The allele depth (AD) and genotypes were extracted from the vcfR object and the first and second most abundant alleles were extracted from the allele depth. The genotype information was used to subset the allele depth information to only heterozygous positions. The 15th and 85th percentiles were calculated for each sample and each alelle (the first and second most abundant alleles) and used as an inclusion threshold for depth filtering. A frequency for each allele was then calculated by dividing its allele depth by the sum of the first and second most abundant alleles. This information was used to plot each histogram. These methods are also described in Knaus and Grünwald (2018).

Gene loss categories
Gene loss, relative to T30-4, was more frequent in RxLR genes and was more frequent in Mexican isolates (sexually reproducing) than US-1 isolates (clonally reproducing). Core orthologous genes had among the lowest count of lost genes for Mexican, South American, and US-1 isolates. RxLR effector genes had the highest number of gene losses with Mexican isolates having a similar number of losses as South American isolates and US-1 isolates had a non-significantly lower number of gene losses.