Genome-Wide Increased Copy Number is Associated with Emergence of Dominant Clones of the Irish Potato Famine Pathogen Phytophthora infestans

The plant pathogen implicated in the Irish potato famine, Phytophthora infestans, continues to reemerge globally. Understanding changes in the genome during emergence can provide insights useful for managing this pathogen. Previous work has relied on studying individuals from the United States, South America, Europe, and China reporting that these can occur as diploids, triploids, or tetraploids and are clonal. We studied variation in sexual populations at the pathogen’s center of origin, in Mexico, where it has been reported to reproduce sexually as well as within clonally reproducing, dominant clones from the United States and Europe. Our results newly show that sexual populations at the center of origin are diploid, whereas populations elsewhere are more variable and show genome-wide variation in gene copy number. We propose a model of evolution whereby new pathogen clones emerge predominantly by increasing the gene copy number genome-wide.

Phytophthora clade 1c, namely, P. mirabilis and P. ipomoeae (4,5,51). Elsewhere in the world, it emerges as clonal lineages (6-9). These emergent clonal lineages are frequently ephemeral, disappearing after a season or two (8,10). However, novel clones occasionally emerge and become dominant, replacing the formerly dominant lineages. While this pathogen continues to reemerge globally, we know very little about the mechanisms involved in pathogen emergence and the genomic features that are associated with these newly emerging, dominant clones.
P. infestans exhibits two distinct lifestyles worldwide. In central Mexico, the pathogen exists as a sexual, randomly mating population (1-3,52). Throughout much of the remainder of the world, P. infestans is distributed as distinct clonal lineages that reproduce mitotically. Until the early 1990s, a single lineage, US-1, dominated the global populations (11). US-1 was thought to be the lineage that triggered the Great Famine. However, more recent work identified FAM-1 as the famine-causing lineage (12), a lineage that differs from but might be ancestral to US-1 (13,14). During the mid-1990s, late blight reemerged in the United States as novel clonal genotypes that had not been previously observed (15,16). The epidemiologically most notable genotypes included US-8 and US-11, which were characterized as having resistance to the fungicide metalaxyl. During the late 2000s, novel lineages emerged in the United States, including US-22, US-23, and US-24 (7,8). Similar observations were made in Europe, where the 13_A2 clonal lineage became dominant in the late 2000s and where it displaced 6_A1 in the United Kingdom and other previously existing clonal lineages (9). While populations in most of Europe are clonal, sexual populations have been described in northern Europe (9,17-22). The global population structure of P. infestans is therefore characterized as having a sexually reproducing population in Mexico as well as reemerging clonal epidemics in the United States and most of the rest of the world (except northern Europe), consisting of distinct clonal lineages that displace older clonal lineages.
The P. infestans genome has been characterized as being a two-speed genome. These two speeds refer to two compartments: gene-dense regions containing predominantly housekeeping genes, and gene-sparse regions enriched for effectors (proteins that are secreted from the pathogen and associated with infection), including RxLR genes (23,24), which are genes containing an arginine, any amino acid, leucine, and arginine motif. It is thought that dramatic changes to the gene-sparse, transposon- and effector-rich portion of the genome are responsible for most of the adaptation in clonal lineages. For example, Cooke et al. (9) studied the recent emergence of the 13_A2 clonal lineage in the United Kingdom that largely displaced clonal lineages existing in the United Kingdom by about 2008. This study documented that this lineage was more aggressive, thus outcompeting and displacing older lineages. They also reported large changes in copy number variation (CNV), gene loss, mutations, and gene expression patterns that distinguished 13_A2 from previous lineages. These genomic changes are thought to underlie its emergence.
In addition to the two-speed genome model, several studies have documented variation in ploidy. Phytophthora species are considered to be diploid (25). Extensive cytological work documented that P. infestans was primarily diploid yet indicated that some isolates might be of higher ploidy (26,27). Several cytological studies indicated that individuals from sexual populations in Mexico were diploid, whereas individuals from clonal populations elsewhere frequently exhibited higher levels of ploidy (28,29). More recently, Yoshida et al. (12) analyzed whole-genome sequences to show that the allele balance (e.g., the frequency of each allele sequenced at heterozygous positions) for some individuals was triploid or tetraploid. This observation of higher ploidy was further supported by work combining microsatellite analyses, flow cytometry, and high-throughput sequencing of 18 genomes (predominantly from the Netherlands) (30). This body of prior cytological and genomic work provides support for a model that clonal populations are often triploid or tetraploid, while some populations/strains might be diploid. However, these observations are based on individual samples, not allowing broader inferences about populations at large, and have not included a representative sample from sexual populations.
We resequenced genomes of P. infestans to explore variation in gene copy number in a representative global sample that included a sexual population and select members of clonal lineages. We combined our genome data with recently published whole-genome data to obtain a population of 47 high-coverage samples (see Text Files S1 and S2 in reference 53) that provide power for testing hypotheses about differences in ploidy, CNV, and genic content in P. infestans. For this study, we defined ploidy as a genome-wide change in copy number (i.e., whole-genome duplication), whereas copy number refers to a change observed at the subchromosomal level. We tested the hypotheses that sexual populations were diploid with little CNV, while clonal populations were predominantly triploid or tetraploid with high CNV. We also tested the hypothesis that CNV and presence/absence polymorphisms are enriched in gene-sparse, effector-rich portions of the genome (as expected from the two-speed genome hypothesis). We also expected to find that CNV and presence/absence polymorphisms differed in clonal versus sexual populations. Finally, we tested the hypothesis that similar changes in CNV might be observed in other heterothallic Phytophthora species for which genomic data for populations were available, such as P. parasitica and P. capsici. Our findings provide a new perspective on how plasticity in ploidy, copy number, and presence/absence polymorphisms contribute to the emergence of the Irish potato famine pathogen and other Phytophthora pathogens.

RESULTS
Resequencing populations of P. infestans. To understand variation in CNV and gene content, we resequenced and used previously published populations of the potato late blight pathogen, P. infestans, from the center of origin in Mexico (n = 16) and dominant clonal lineages in the United States, Europe, and South America (Fig. 1). To allow for robust inference of gene copy number, we used only genomes with a genic adjusted average read depth (AARD) of 12× or greater (see Text S3 in reference 53). This resulted in a total of 47 high-quality P. infestans genomes (see Text S1 in reference 53).
[Figure caption fragment, beginning missing] ... (2,52). In order to attain high-quality samples from the literature and our own resequencing for the inference of copy number variation, only samples with at least 12× adjusted average read depth (AARD) were included (see Text S1 in reference 53).
Genic copy number varies continuously in P. infestans. We observed genic CNV among populations (Fig. 2A) and a gradient of genic copy number ranging from predominantly 2× to predominantly 3× (Fig. 2B). We did not observe classes of individuals that would represent tetraploid individuals. Isolates from the United States belonging to clonal lineages have a gradient of gene copy number (Fig. 2C). Strains in U.S. lineages that were predominantly 2× were mostly found in the well-represented lineage US-22 (n = 3) and in US-18 (n = 1). Similarly, in Europe, isolates that were both predominantly 2× and 3× were observed. The exception to this balance of copy number appeared to be in South America, where almost the entire sample was predominantly 3× (Fig. 2C). Samples from Mexico had a low percentage of gene copy numbers assigned to 3× (<20%; Fig. 2A and C), and the majority of genes occurred in two copies. While previous studies focused on variation in ploidy (12,26-30), our work supports variation in genome size in P. infestans occurring largely at a subgenomic level; Mexican, FAM-1, and US-22 samples were predominantly 2× with narrow variation that can be interpreted as diploidy, whereas samples from South America, US-1, other U.S. lineages, and those from Europe showed large variation (Fig. 2A). The variation in CNV was also explored for samples where tissue was extracted from historical herbarium samples (FAM-1: M-0182896, Pi1889; US-1: Kew122, Kew126) (Fig. 2C). These samples were not cultured on medium and were not exposed to the modern fungicide metalaxyl and demonstrated variability in gene copy number as well (Fig. 2A and C), suggesting that CNV may have been a natural condition in clonal lineages of P. infestans. Note that two of the four samples that we determined to be of sufficient sequence depth to call copy number were from the 20th century (Kew122 and Kew126, both collected in 1955; see Text S1 in reference 53) and clustered with US-1 (14), while the other two were from the 19th century and clustered with FAM-1 (M-0182896 collected in 1877 and Pi1889 collected in 1889). This indicates that CNV was observed throughout the time series of the data and was not restricted to modern samples that were cultured on medium.
Gene loss occurs in both clonal and sexual populations. We explored the hypothesis that gene loss (relative to the reference genome T30-4) had occurred collectively within a lineage or independently. The breadth of coverage (BOC) for a gene is the proportion of positions that were sequenced at least once in the reference genome (24). For example, a BOC of 0.75 would indicate that 75% of the positions in a gene were sequenced at least once. We used a BOC of 0 to define a gene loss event and presented samples for populations that included at least six individuals (groupings with more were randomly subset to a sample size of six) (Fig. 3; see Text S5 in reference 53). Gene loss was most pronounced in RxLR and crinkler (CRN) effectors but was found in all gene classes (average range of 0 to 1 for core, CAZy, necrosis inducing-like protein [NPP1], secreted small cysteine-rich protein [SCR], and elicitin) (see Text S5 in reference 53). Gene loss among the isolates from Mexico ranged from 38 to 112 gene deletions. However, we found only one shared deletion among all samples within the clonal lineage (Fig. 3, bottom panel). Clonally reproducing isolates from South America demonstrated a loss of 39 to 63 genes, with only 9 gene losses shared in common among these isolates. Among 6 individuals belonging to lineage US-1, we observed a range of loss of 21 to 68 genes but only 5 gene losses common among all of the sampled lineages (Fig. 3). Gene loss is a dominant feature of the gene-sparse regions, which harbor >95% of the genes subject to gene loss in all samples. However, the specific gene lost within any particular sample is unique and random and apparently affects clonal and sexual populations equally.
Genic copy number variation was not associated with specific classes of genes. We found that in the sexually reproducing population from Mexico that was predominantly diploid, all gene categories had more 2× genes than 3× genes (Fig. 4, green). In contrast, for the populations from South America (orange) and US-1 (red), which were clonally reproducing, we found that all gene classes had more 3× genes than 2× genes regardless of gene family. CNV occurs throughout gene space without a preference for functional annotation (Fig. 4).
[Figure panel labels: Mexico; South America]
Gene copy number variation occurred in core orthologous genes. Core orthologous Phytophthora genes are reported to occur only once in P. infestans, P. ramorum, and P. sojae (23) and are thought to be highly conserved. Based on the two-speed genome hypothesis, one might expect higher copy number to preferentially occur in the gene-sparse region. We plotted all core orthologous genes present at 3× by their 5′ and 3′ intergenic distances (Fig. 5). We observed substantial numbers of genes inferred to have three copies (3×) among core orthologous genes in the gene-dense portion of each genome (Fig. 5). This indicates that this portion of the genome may be more dynamic than previously thought.
The phenomenon of genic CNV is shared with other members of the Phytophthora genus. We explored whether the variation in ploidy apparent in P. infestans is observed in other heterothallic Phytophthora taxa. We looked at species for which population-level genome data were available, including P. andina (clade 1c), P. parasitica (clade 1), and P. capsici (clade 2) (clades as assigned by Blair et al. [31]). The taxon P. andina appears to be diploid in our limited sample (Fig. 6). However, we observed more heterozygous positions than in the other taxa (Fig. 6). This is consistent with the interpretation that P. andina is a homoploid hybrid that arose from a cross between P. infestans and another undescribed Phytophthora species (32). The more distantly related P. parasitica appeared diploid as well. However, its relatively high sequence depth allowed resolution of minor peaks, indicating that a fraction of genes occur at three copies (particularly in the sample P1569). The taxon most distantly related to P. infestans included in our analysis was P. capsici. Three of the P. capsici samples appeared to be diploid, while one sample (Pc389) appeared to be triploid. These results suggest that our findings of variation in ploidy and CNV within P. infestans are also shared among other species of Phytophthora.

DISCUSSION
To characterize the emergence of new clonal lineages of the Irish famine pathogen, Phytophthora infestans, we resequenced whole genomes of select populations. We focused on contrasting several dominant clonal lineages in the United States as well as sexual populations from the center of origin in Mexico for which we were able to obtain samples. Prior work (see below) focused primarily on individuals rather than populations and did not include sexual populations. The genomes were compared with previously sequenced, high-quality genomes to determine ploidy, CNV, and gene content. Recent epidemiological records indicated that new clonal lineages have emerged repeatedly in the United States and Europe (see Text S4 in reference 53). For example, the lineage US-1 was the first to establish itself in the U.S. but was eventually displaced by US-8, US-11, and more recently, by US-23 (1) (see Text S4 in reference 53). Similarly, populations in the United Kingdom were displaced by 13_A2 in the past decade and, more recently, by 6_A1 (9). While variation in ploidy has been described in individuals from clonal lineages of P. infestans, our work provides several new key insights based on population-level patterns, expanding on prior work focusing on single clonal strains.
[Caption fragment, likely FIG 4, beginning missing] Isolates from Mexico (green), where P. infestans is sexually reproducing, had a gene copy number predominantly of two for all classes of genes. Isolates from South America and US-1, both considered clonally reproducing, had a gene copy number predominantly of three for all gene classes. Gene copy number varies throughout gene space and is not associated with function. Box and whisker plots summarize points that represent samples (n = 6) and the proportion of genes that were either 2× or 3× (based on the total number of 2× and 3× genes). A sample size of six was used (as in Fig. 3 ...
Clonal lineages show higher copy numbers than sexual populations at the center of origin. The populations studied show a gradient of CNV from 2× to 3× (Fig. 2B). Populations of P. infestans that are sexually reproducing at the species' center of diversity in Mexico are predominantly diploid (Fig. 2C). This contrasts with dominant clonal populations from the rest of the world, which are predominantly triploid. This provides support for the hypothesis that there may be a connection between copy number, epidemic fitness, and mode of reproduction. Higher copy number might increase expression of advantageous genes. This hypothesis is, however, difficult or impossible to test experimentally and is not experimentally supported.
Isolates were predominantly diploid or triploid but not tetraploid. We observed only diploid and triploid strains and, in contrast to a previous report (12), no tetraploid individuals. We reanalyzed some of the same samples and data, including the European lineage 13_A2, previously characterized as being tetraploid. In our analysis, 13_A2 had mostly three gene copies and would thus be classified as being triploid (Fig. 2), which is in agreement with a more recent report (30). Part of this discrepancy is due to changes in technology. Plotting histograms of allele balance has typically included all variants, including homozygous genotypes. Because homozygous sites are much more abundant than heterozygous sites, this tends to drive the scaling of the plot. To avoid this, previous work limited plots to a frequency range of 0.2 to 0.8. We subset our data to only the heterozygous genotypes, resulting in a plot of 0 to 1, and further subset the data by omitting variants with unusually high or low sequence depth. This is a significant improvement in methodology for inferring ploidy or CNV based on allele balance (33).
FIG 5 Gene copy number does not follow the two-speed genome hypothesis. Core orthologous genes with a copy number of three are enriched in the gene-dense (rather than gene-sparse) regions of the genome. The background for each panel is a heatmap indicating gene abundance as in Fig. 3. Orange points (with transparency) are plotted over this background, where each point is a core orthologous gene that was determined to have three copies and was positioned based on its 3′ (y axis) and 5′ (x axis) intergenic distance as in Fig. 3. Background grid and axis ticks are identical to those in Fig. 3.

Gene loss occurred within individuals in both sexual populations and clonal lineages. We tested the hypothesis that gene loss was shared by ancestry. This would provide the expectation that members of a clonal lineage show fixed polymorphisms within that clonal lineage. We used breadth of coverage to identify the presence/absence of genes relative to the reference genome. Contrary to this expectation, we found that individuals within a clonal lineage (e.g., from South America or US-1; Fig. 3) showed gene loss within individuals at a rate similar to that of the sexual population (Mexico; Fig. 3). Furthermore, gene loss affected many gene families, including effectors, and was located throughout the genome. This is consistent with the hypothesis that pathogenicity factors are enriched in the gene-sparse portion of the genome (23,34,35).
CNV is found throughout the genome and affects all gene families, including core genes and effectors, equally. Our expectation following the proposed two-speed genome hypothesis (23,24) was to find CNV enriched in the gene-sparse, transposon- and effector-rich portion of the genome, where CNV could provide a means of creating novel paralogs. To our surprise, CNV affects housekeeping genes and effectors equally (Fig. 4) and is randomly dispersed throughout the whole genome. In the diploid genomes from Mexico, we found that core orthologous genes, pseudogenes, and several pathogenicity factors were all predominantly 2×. Genomes of clonally reproducing strains from South America and the lineage US-1 were found to have core orthologous genes, pseudogenes, and pathogenicity factors that were predominantly 3×. We also expected CNV to be higher in pathogenicity factors than in core orthologs, yet levels of CNV were not different, regardless of gene class.
[Figure panel labels: Clade 1; Other Clade 1c species]
Variation in copy number can be found in other Phytophthora species. We also evaluated if changes in ploidy could be observed in other heterothallic Phytophthora species. We used genomes for moderate population sizes from P. andina, P. parasitica (= P. nicotianae), and P. capsici available at the Sequence Read Archive to address this question (Fig. 6). Within Phytophthora clade 1c, P. andina appeared predominantly diploid. P. andina has been recognized as a hybrid with two parental species, one of which is P. infestans, while the other hybrid parent is unknown (32,36). The genomes of P. andina had one haplotype from each parental species as expected and were predominantly 2× copy number. P. parasitica, a distant relative of P. infestans basal to clade 1, was diploid. However, two strains (P10297 and P1569) had minor peaks at our expectation for three copies, indicating that fractions of these genomes may vary in copy number at 3×. Our ability to resolve these peaks was likely due to the high sequence depth of these samples relative to the other available taxa. In clade 2, the more distant P. capsici appeared predominantly diploid for 3 strains; however, one strain (Pc389) was triploid. These results suggest that variation in ploidy and/or copy number may be a common feature throughout the Phytophthora genus, consistent with other recent reports (37,38).
We propose a model of emergence where triploid clones emerge and eventually displace prior clonal lineages. Our work provides striking support for a model of predominantly diploid populations at the center of origin reinforced by sexuality and predominantly triploid clonal lineages elsewhere in the world (Fig. 7). In this model, novel clonal lineages emerging globally are predominantly triploid. These triploid lineages might be more fit and thus able to displace other extant lineages. A new lineage emerging from a sexual cross in Mexico is expected to be initially diploid and will gradually show an increase to three copies per gene. Older previously dominant lineages might thus be more triploid (e.g., US-1) than dominant younger lineages (e.g., US-23). Some lineages are ephemeral (e.g., US-18, US-22). The recently emerged diploid lineage US-22 was only observed between 2009 and 2011 and might be less fit (1,10) and, curiously, shows predominantly 2× copies per gene. To the best of our knowledge, all lineages that became dominant in space and time are or were triploid, with the exception of FAM-1. It remains to be established if higher genic copy number confers higher epidemic fitness to a clonal lineage. Experimentally addressing this question might prove challenging given the fact that CNV is a whole-genome phenomenon. However, there are several studies supporting the idea that some clonal lineages (which are 3× in our analysis) displacing older lineages were indeed fitter. Kato and colleagues showed that US-8 strains have larger lesions and sporulate more than US-1 strains (39). Similarly, Cooke and colleagues, using mark-recapture methods in the field, showed that the 13_A2 strains were among the most aggressive clones compared to the strains evaluated and outcompeted previously dominant clonal lineages (9).
Conclusions. The late blight pathogen P. infestans continues to reemerge, causing financial loss for farmers and threatening food security, particularly in developing countries (1). We report the observation that P. infestans isolates are diploid in central Mexico, where they reproduce sexually, and that emerging dominant clonal lineages are predominantly triploid. These findings provide novel support for the hypothesis that a change in copy number might drive emergence of clonal lineages of the Irish famine pathogen.

MATERIALS AND METHODS
Sequence alignment and variant calling. The samples came from previously published sources (9, 12-14, 23, 24) as well as 11 new Phytophthora infestans genomes we sequenced (see Text S1 in reference 53). Isolates US040009, FP-GCC, US100006, FL2009P4, and ND822Pi were sequenced at the UC Davis Genome Center. Isolates PIC97136, PIC97146, PIC97335, PIC97442, PIC97750, and PIC97785 were sequenced at Oregon State University's Center for Genome Research and Biocomputing on an Illumina HiSeq 2000 platform. Additionally, five samples each of P. mirabilis and P. ipomoeae (see Text S2 in reference 53) were sequenced at Oregon State University's Center for Genome Research and Biocomputing on an Illumina HiSeq 2000 platform. All other samples were obtained from publicly available repositories (see Text S1 and S2 in reference 53). Newly sequenced genomes are publicly available at the Sequence Read Archive (BioProject number PRJNA542680; Text S3 in reference 53).
The FASTQ format files were aligned to the P. infestans T30-4 reference (23) using the Burrows-Wheeler Aligner MEM algorithm (BWA-MEM) 0.7.10 (40,41). The resulting SAM format file was converted to BAM format, the mate information was fixed, and the MD and NM tags were added using SAMtools (41). PCR and optical duplicates were marked using Picard's MarkDuplicates (42). The per gene sequence depth and coverage over all T30-4 genes were calculated using SAMtools mpileup (41). From the mpileup data, the number of positions that were sequenced at least once and a median of coverage were calculated. In order to correct our measure of coverage for GC bias, we calculated an adjusted average read depth (AARD) (24). A median was chosen as a robust alternative to an average; however, we refer to our measure here as AARD to be consistent with the existing literature. The genes were sorted into bins based on percentiles of GC content. The adjusted median read depth was then obtained by multiplying the median read depth for each gene by the ratio of the median read depth of all genes to the median read depth of all genes in the GC bin of the gene. The AARD for each genome was summarized using violin plots (43), and a mean AARD of at least 12 was used as the threshold for including a genome in further analysis.
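The GC correction described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R code; the function name `gc_adjusted_depth`, the equal-count binning, and the default of four bins are assumptions made for the example, since the number of percentile bins is not stated here.

```python
from statistics import median

def gc_adjusted_depth(depths, gc_contents, n_bins=4):
    """Return GC-adjusted per-gene read depths.

    depths      -- per-gene median read depth
    gc_contents -- per-gene GC fraction, same order as depths
    n_bins      -- number of equal-count GC bins (assumed value)
    """
    if len(depths) != len(gc_contents):
        raise ValueError("depths and gc_contents must have equal length")
    n = len(depths)
    # Rank genes by GC content and assign each to a percentile bin.
    order = sorted(range(n), key=lambda i: gc_contents[i])
    bin_of = [0] * n
    for rank, i in enumerate(order):
        bin_of[i] = min(rank * n_bins // n, n_bins - 1)
    # Median depth over all genes and within each GC bin.
    global_median = median(depths)
    bin_median = {
        b: median(d for i, d in enumerate(depths) if bin_of[i] == b)
        for b in set(bin_of)
    }
    # Scale each gene by (median of all genes) / (median of its GC bin).
    adjusted = []
    for i, d in enumerate(depths):
        m = bin_median[bin_of[i]]
        adjusted.append(d * global_median / m if m > 0 else 0.0)
    return adjusted
```

With two GC bins whose medians are 10 and 20, every gene sitting at its bin median is rescaled to the overall median of 15, which is the intended removal of GC-dependent depth differences.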
Variants were called from the BAM files for diploid genotypes to create genomic variant call format (gVCF) files using the Genome Analysis Toolkit (GATK) HaplotypeCaller (44,45). Diploid genotypes were called using the GATK's GenotypeGVCFs. The samples P10127, P10650, P11633, P12204, P1362, P6096, and P7722 were flagged by the GATK's HaplotypeCaller as having legacy quality encoding. These samples were run with the option fix_misencoded_quality_scores to accommodate this.
Gene copy number inference based on allele balance. The inference of gene copy number was made based on the ratio of alleles observed at heterozygous positions (12,30). The VCF specification (46) provides the option for variant callers to report the number of times each allele was sequenced at a variable position. In a diploid heterozygote, the expectation is that each allele will be observed at an equal frequency or a ratio of one half. A triploid heterozygote will be expected to have alleles observed at a ratio of one third. A tetraploid heterozygote will be expected to have alleles observed at a ratio of one quarter. Note that some combinations are indistinguishable and therefore uninformative. For example, a tetraploid heterozygote with only two alleles (e.g., A/A/C/C) will have each allele observed at a ratio of one half. This will be indistinguishable from our expectation from a diploid heterozygote. The ratio of alleles observed at each variable position has been used by other authors to make inferences about ploidy (12,30). Shortcomings of the present use of the ratio of alleles are that it has been presented graphically as a histogram and that the data appear "noisy" in that they do not form a strong consensus at an expected CNV value. A problem with the graphical representation of data arises when a large number of samples are to be explored or when the genome is subset into a large number of fractions, such as in windowing analyses. A numerical summary table provides the ratio of alleles observed in any genome or in any fraction of a genome. The problem of noisy data may in part be due to variants of low quality (i.e., technical error) or potential variation in ploidy throughout a genome or subgenomic region (i.e., biological variation).
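The expected ratios, and the ambiguity noted for A/A/C/C genotypes, can be enumerated directly. The sketch below is illustrative only; the function name `expected_major_allele_fractions` is hypothetical, and it assumes biallelic heterozygous sites summarized by the fraction of the more abundant allele.

```python
from fractions import Fraction

def expected_major_allele_fractions(copies):
    """Expected fraction of the more abundant allele at a biallelic
    heterozygous site, for every allele dosage possible at this copy number."""
    if copies < 2:
        raise ValueError("a heterozygote needs at least two copies")
    # k copies of one allele and (copies - k) of the other, k = 1..copies-1.
    return sorted({Fraction(max(k, copies - k), copies) for k in range(1, copies)})

for n in (2, 3, 4):
    print(n, [str(f) for f in expected_major_allele_fractions(n)])
```

For four copies the set contains both 1/2 and 3/4, which shows why a tetraploid A/A/C/C site cannot be told apart from a diploid heterozygote on allele balance alone.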
The challenge of identifying high-quality variants and numerically summarizing them was addressed by our method of allele balance analysis (33). The data were quality filtered using the sequence depth of the most abundant allele for all variants in a genome. An 80% confidence interval was created to eliminate variants with the lowest 10% and highest 10% sequence coverage. This confidence interval was then applied to the second-most-abundant allele as well. The VCF file was further subset to only heterozygous positions. The allele balance ratio for each heterozygous variant was calculated by dividing the number of times the most abundant allele was sequenced by the number of times the most abundant allele and the second-most-abundant allele were sequenced, resulting in a proportion. Finally, 200,000-bp windows were made using the allele ratio data. This window size was chosen for P. infestans because it was sufficiently large to include a population of heterozygous positions (we observed a heterozygous position every 1 to 2 kbp) but small enough to obtain fine-scale resolution. The data were then assigned to bins ranging from 0 to 1 that are 0.02 frequencies wide, and the bin with the greatest density was used as a summary for the window. This is analogous to the modal frequency. This summary was then categorized to a ploidy level by assigning it to the closest expected ratio (i.e., 1/2, 2/3, 3/4, 4/5). Each genome was now summarized into windows of ploidy. In order to assign copy numbers to genes, the coordinates of each gene were referenced in the windowed genome, and the copy number of the window where the gene was located was used to assign a copy number to the gene. This is critical because we do not expect most genes to contain enough heterozygous positions to infer an accurate estimate of copy number. 
Once a copy number was determined, a confidence in this estimate was made by subtracting the observed proportion from the determined proportion and dividing by the bin width so that the value ranges from 0 to 1. Calculations were performed in R (47) and using vcfR (33,48).
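The per-window summary described above (allele balance per heterozygous variant, 0.02-wide bins, modal bin, closest expected ratio) can be sketched as below. This is a simplified Python illustration under stated assumptions, not the vcfR implementation: the depth-based quality filter and the confidence score are omitted, the function name `window_copy_number` is hypothetical, and the bin midpoint is used as the modal frequency.

```python
from collections import Counter

# Expected major-allele ratios named in the text, keyed by copy number.
EXPECTED = {2: 1 / 2, 3: 2 / 3, 4: 3 / 4, 5: 4 / 5}

def window_copy_number(variants, n_bins=50):
    """Summarize one genomic window.

    variants -- iterable of (major_depth, minor_depth) integer pairs for the
                heterozygous positions in the window
    n_bins   -- number of bins between 0 and 1 (50 bins = 0.02 wide)
    Returns the copy number whose expected ratio is closest to the modal
    allele balance, or None if the window has no usable variants.
    """
    counts = Counter()
    for major, minor in variants:
        total = major + minor
        if total == 0:
            continue
        # Integer arithmetic avoids floating-point edge effects at bin borders.
        counts[min(major * n_bins // total, n_bins - 1)] += 1
    if not counts:
        return None
    modal_bin = max(counts, key=counts.get)
    modal_freq = (modal_bin + 0.5) / n_bins  # midpoint of the densest bin
    return min(EXPECTED, key=lambda c: abs(EXPECTED[c] - modal_freq))
```

A gene would then inherit the copy number of the window that contains its coordinates, as described in the text.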
Gene loss based on breadth of coverage. In order to determine gene loss, we measured breadth of coverage (BOC) for each gene in each genome. We used SAMtools mpileup (41) to count the per position sequence coverage over all 18,179 genes in the P. infestans T30-4 genome (23). From these data, the number of positions that were sequenced at least once and a median of coverage were collected. Breadth of coverage was calculated by dividing the number of positions that were sequenced at least once by the gene length (i.e., the proportion of positions sequenced in a gene). We used a BOC of 0 to indicate the loss of a gene.
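As a minimal sketch of this calculation, assuming the per-position depths for a gene have already been extracted from the mpileup output into a list (the function names are hypothetical):

```python
def breadth_of_coverage(position_depths):
    """Proportion of positions in a gene sequenced at least once."""
    if not position_depths:
        raise ValueError("gene has no positions")
    covered = sum(1 for depth in position_depths if depth >= 1)
    return covered / len(position_depths)

def is_lost(position_depths):
    """A gene is scored as lost when no position is covered (BOC of 0)."""
    return breadth_of_coverage(position_depths) == 0
```

A gene with three of four positions covered has a BOC of 0.75 and is scored as present; only a gene with no covered positions counts as a loss.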
Gene class and density. Published gene annotations (23) were used to assign genes to gene classes (core, pseudogene, RxLR, etc.). The flanking intergenic region (FIR) lengths (i.e., intergenic distances) were calculated using a previously available script (https://figshare.com/articles/Calculate_FIR_length_perl_script/707328). This information was used to create FIR plots for individuals and populations from Mexico, South America, and the lineage US-1 using R (47) and ggplot2 (43). In order to explore whether genes of a particular class from populations from Mexico, South America, and the lineage US-1 were enriched for a particular copy number, the genes were assigned a copy number (based on allele balance) and plotted as box and whisker plots using ggplot2 (43). In order to visualize whether genes determined to have three copies were more abundant in the gene-dense or gene-sparse portion of the genome, FIR plots were created as described above but using core orthologous genes that were determined to have three copies.
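The idea behind FIR lengths can be illustrated with a short sketch. This is not the Perl script cited above; it is an assumed, simplified version in which genes on a single contig are given as 1-based inclusive coordinates, overlapping neighbors yield a distance of 0, and genes at contig ends have no value on that side. The function name is hypothetical.

```python
def flanking_intergenic_lengths(genes):
    """genes -- list of (gene_id, start, end, strand) on ONE contig,
    with 1-based inclusive coordinates and strand "+" or "-".
    Returns {gene_id: (five_prime_len, three_prime_len)}; a side with no
    neighboring gene is reported as None."""
    ordered = sorted(genes, key=lambda g: g[1])
    result = {}
    for i, (gene_id, start, end, strand) in enumerate(ordered):
        left = right = None
        if i > 0:
            left = max(start - ordered[i - 1][2] - 1, 0)
        if i + 1 < len(ordered):
            right = max(ordered[i + 1][1] - end - 1, 0)
        # On the minus strand the 5' flank lies to the right of the gene.
        result[gene_id] = (left, right) if strand == "+" else (right, left)
    return result
```

Genes with short distances on both sides fall in the gene-dense compartment of a FIR plot, whereas genes with long flanking distances fall in the gene-sparse compartment.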
Copy number variation in other species of Phytophthora. In order to address whether copy number variation occurred in other species of Phytophthora, we queried NCBI for samples that had Illumina sequence data as well as an assembled genome reference for the species. These data were processed as the P. infestans data were processed. In order to visualize these data in a phylogenetic context, a tree from Martin et al. (49) was obtained from TreeBase (50). The data were then plotted in R (47).
Data and code. All R code and data necessary to reproduce the figures are available on GitHub (https://github.com/grunwaldlab/P_infestans_CNV).