Genome-wide DNA Variants Identify Genetic Diversity and Population Structure of Leptosphaeria maculans in Western Canada

Background: Leptosphaeria maculans is a serious concern for canola production in Canada. For effective management, knowledge of the pathogen’s genetic variability and population structure is a prerequisite. Despite some information on race dynamics of the western Canadian L. maculans population in recent years, genetic diversity based on a large number of genome-wide DNA variants has not been investigated. Results: From 1,590 L. maculans isolates collected from 23 eld sites in three provinces: Manitoba, Saskatchewan and Alberta, Canada, in the years 2007-2008 and 2012-2014, 150 representative isolates were selected and whole-genome sequenced, and 31,870 polymorphic DNA variants (SNPs and InDels) were used to study L. maculans genetic diversity and population structure. Cluster analysis showed that the genetic diversity levels and isolate groupings varied with the number and genomic regions of the variants involved; isolates collected in 2012-2014 were more genetically diverse than those collected in 2007-2008 when genome-wide variants were considered. The genome wide association study (GWAS) detected variants in egn4_Lema_T86290 (AvrLm4-7), egn4_Lema_T86300 and egn4_Lema_T86310 associated with the year of collection, but no variants was found to be associated with the province or specic location from which the isolates were collected. Population structure analysis indicated the presence of three distinct sub-populations in western Canada. While isolates from Saskatchewan were mainly of one sub-population (sub-pop1), the Alberta isolates comprised two sub-populations (sub-pop1 and sub-pop2), and all the 3 subpopulations were found in Manitoba. Conclusion: The genetic diversity of the western Canadian L. maculans population varied among provinces. It was highly admixed in Manitoba, followed by that in Alberta. The Saskatchewan population had the lowest genetic diversity. Signicant genome variation between 2007-2008 and 2012-2014 occurred in the genes egn4_Lema_T86290 (AvrLm4-7), egn4_Lema_T86300 and egn4_Lema_T86310), with AvrLm4-7 becoming much wide DNA variants of the L. isolates investigated genetic diversity and population structure of using 31,870 DNA variants identied from 150 based on in in monitor changes in race structure of the blackleg pathogen in maculans from the raw sequence quality scores. Variant tables of isolates were exported, ltered (Q call>25, MAF ≥ 5, depth ≥ 10), and combined for further SNP data mining. The SNP distribution in specic genomic regions and variant composition was investigated. To explore variant density in different blocks, AT- and GC- block segments in supercontigs (SC) were rst determined by Excel VBA programming with sliding windows of 120 nucleotides. In this study, any continuous nucleotide stretch meeting the criteria of GC content <33.9% (15) and length ≥ 1000 base pairs was considered an AT-block. All “N” masked genome regions were excluded from the AT-block assessment. To predict genes located in AT blocks, three criteria were employed: (1) a gene is entirely located in an AT-block, if the full sequence of a gene is contained in AT block, (2) part of the gene sequence overlaps with an AT-block, and (3) a gene entirely resides in a GC-block with one end (5’ end or 3’ end) falls within 1000 bases to an AT-block.

found a high degree of gene exchange among eld L. maculans populations of in France [4,15] or a rapid countrywide diffusion of novel virulence alleles wherever novel resistance sources were used [15]. Fourteen minisatellite markers were used to characterize a worldwide collection of 465 isolates of L. maculans. Clustering analyses partitioned genotypes into distinct populations corresponding to major geographic regions, along with two differentiated populations in western Canada [16].
In recent years, DNA variants, including SNPs (Single Nucleotide Polymorphisms) and InDels (Insertions/Deletions), have become popular molecular markers for genetic diversity studies. SNPs can be categorized into transitions (C to T, G to A) and transversions (C to G, A to T, T to G or C to A), although transitions are more common than transversions. The Illumina GoldenGate genotyping assay can be used to analyse 384-3072 SNP loci across multiple individuals simultaneously [17]. A GoldenGate assay was used to assess the genetic diversity of 59 Australian L. maculans isolates collected from different years, regions and cultivars, and there was no correlation between SNP haplotypes and cultivars or the year of collection, indicating a high rate of sexual reproduction and evolutionary diversity in the pathogen [18].
Next generation sequencing (NGS) has ignited a revolution in the life sciences; it enables the inexpensive and rapid discovery of DNA variants in many organisms. Although NGS has found broad application in fundamental genome and genetic research, there are limited reports on the identi cation of genome-wide DNA variants in plant pathogens, including L. maculans. Zander et al. [19] reported 21,814 genome-wide SNPs in L. maculans based on a study of two Australian isolates. In a preliminary phylogenetic analysis, isolates from Western Australia clustered together and those collected from Brassica juncea stubble were identical. These SNPs provide a novel characterization of the genetic diversity for this pathogen population [19].
Genome sequencing revealed that the L. maculans genome has an isochore-like structure [20], which can be divided into AT and GC-rich blocks, probably caused by the ampli cation of transposable elements and repeat-induced point (RIP) mutations. The RIP mechanism causes nucleotide substitutions from C to T and G to A and is a premeiotic repeat-inactivation mechanism speci c to fungi that creates genetic diversity in the fungal genome [20].
Methods to detect genetic diversity and population structure fall into two categories: model-based parametric approaches and non-parametric approaches. Model-based methods, exempli ed by STRUCTURE [21] and snmf, a function of the R package LEA [22], assumes hypothetical K populations characterized by allele frequencies at a locus, probabilistically assigning individuals to each of these populations and allowing for admixture estimation of the genome originating from a given population. While other programs differ from STRUCTURE in using variation inference (fastStructure) [23] or a maximum likelihood framework (ADMIXTURE) [24], they proposed an explicit causal model assuming linkage equilibrium between loci and Hardy Weinberg equilibrium between alleles. Nonparametric methods such as PCA rst reduces the dimensionality of a variant frequency matrix and then identi es groups of individuals.
In this study, we applied NGS technology to sequence 162 of 1590 characterized isolates collected from different years and regions. Taking advantage of the abundance of genome-wide DNA variants, we investigated the genetic diversity and population structure of L. maculans isolates collected in western Canadian in the years of 2009~2009 and 2012~2014 using both parametric (snmf) and non-parametric methods including PCA and clustering. The results could provide valuable information and resources for blackleg management.

Results
Selection of isolates for re-sequencing Cluster analysis was conducted using phenotypic data from bioassays on a set of differential hosts (Table S1) inoculated individually with 1590 L. maculans isolates from western Canada. Six clusters were identi ed, containing 125, 325, 179, 143, 195 and 623 isolates, respectively (S Fig. 1). A total of 162 isolates (Table S2) The NGS short reads were assembled on the reference genome of L. maculans 'brassicae' isolate v23.1.3. Breadth and depth of coverage were examined to evaluate the quality of sequencing. Both coverages varied with isolates and supercontigs (SCs); four isolates with low depth and breadth of coverage were removed from the analysis. The remaining 158 isolates showed breadth of coverage of >80% (S Fig. 2 A), and ve SCs (SC 27,28,33,34,36) had lower breadth of coverage ranging from 49.6 to 73.8% (S Fig. 2 B). For depth of coverage, 88% of the isolates displayed values >20. The coverage for SCs ranged from 12 to 26 except for SC30, which was as high as 621. In summary, the average breadth of coverage was 91% and the average depth of coverage was 36.

Variant discovery
Compared with the reference genome of L. maculans, about 920,000 DNA variants (SNPs and InDels) were identi ed; among them were 104,744 polymorphic biallelic variants with MAF ≥1 and miss call rates≤50%. Considering cluster analysis and PCA were to be performed in this study, miss call rates >5% would be considered statistically consequential [25], therefore, additional eight isolates were removed. After ltering against the criteria of phred quality (Q call ≥25), minimum allele frequency (MAF ≥5), and miss call rate (≤5%), 31,870 polymorphic variants were selected for subsequent analysis. Variant numbers in each isolate ranged from 8,591 to 14,997 with an average of 9,575, as compared with the reference genome.
To understand DNA variation among the isolates, the sequences of 150 retained isolates were paired to each other to nd polymorphic variant numbers between isolates. The numbers varied dramatically with 11,175 isolate pairs in the range 508 to 17,708. About 4,000 isolate pairs showed 4,000~5,000 polymorphic variants, and 610 pairs displayed 15,000~18,000 polymorphic variants (S Fig. 3). This would suggest that approximately 56% of the 31,870 total polymorphic variants identi ed may be found in a single isolate pair.

Uneven variant distribution among genomic regions
As expected, variant numbers were highly correlated with SC size (R 2 =0.95, data not shown). There were 2,884 variants detected in the longest supercontig SC0, but only 14 variants in the shortest SC40 (Table 1). The L. maculans genome has an isochore-like structure, characterized by alternating AT-and GC-blocks [20]. Variant density in the two blocks was compared along with coding and non-coding regions. Variant density in the AT blocks and the non-coding regions was 0.88 and 0.78 variants/Kb, higher than the average for the whole genome (0.71 variants/Kb) (Table 1). In contrast, GC blocks and coding regions had lower variant density (0.62 and 0.61 variants/Kb). This suggested uneven variant distribution in the contrasting regions, which becamemore evident when individual supercontigs were examined.
For instance, almost twice as high as in coding regions (Table 1).
To understand if genes in AT-blocks and GC-blocks mutate differently, we calculated variant density in genes in the two bipartite regions. There were 12,469 genes annotated in the L. maculans genome. We sorted genes in AT-rich regions into three classes: class A genes resided entirely in an AT-rich block, class B genes were those in which only part of their sequences overlapped with an AT-block, and class C genes were located solely within an GC-equilibrated region with one of its ends falling within 2,000 bases from an AT-block (Table S3). In the end, 1,158 genes were assigned to AT-blocks, leaving 11,311 genes assigned to GC-blocks.
There were 10,042 variants detected in 3,971 genes in the GC-blocks, and 757 variants in 341 genes in the AT-blocks, with an average of 2.5 variants/gene in the GC-blocks and 2.2 variants/gene in the AT-blocks. Class A had 28 genes, with 25 located in SC30. Despite the higher variant density in AT blocks, none of the DNA variants was found in genes with full sequences residing in an AT-block. Class B was comprised of 248 genes; 80 of them had 274 variants distributed in 28 SCs. Class C were made up of 881 genes, including avirulence genes previously characterized, such as AvrLm4-7 [26], AvrLm3 [27], AvrLm2 [28] and AvrLm6 [29], suggesting class C might serve as a pool of effectors. Class C displayed 746 variants in 259 genes.
L. maculans genes could be classi ed into small, secreted protein-encoding genes (SSP-genes) and non-SSP genes. We inspected coding regions to determine if DNA mutation was biased towards SSP genes or non-SSP-genes. Rouxel et al. [20] identi ed 652 SSPs, from which we selected 587 annotated genes, carrying gene names beginning with egn4_Lema, for variant density comparison. In this study, 327 variants were found in 153 SSP genes, and 10,668 variants were detected in 4364 non-SSP genes.
On average, SSP genes had 2.1 variants/gene and non-SSP genes had 2.4 variants/gene. Variant numbers uctuated considerably among genes (Fig. 1). The top 2 SSP genes, egn4_Lema_T086290.1(AvrLm4-7) and egn4_Lema_T015100.1 had only 37 and 23 variants, respectively, while the top 2 non-SSP genes, egn4_Lema_T015070.1 and egn4_Lema_T015110.1, had 101 and 100 variants, respectively. Next to AvrLm4-7, AvrLm3 was detected with nine SNPs (six nonsynonymous SNPs and three synonymous), and the other avirulence genes carried three SNPs at most, indicating that some non-SSP genes might have mutated more frequently than SSP genes.

Dominance of transition variants
Variants are classi ed into transition, transversion and InDels (insertions and deletions). The variant composition of these four types was examined. Transition variants were dominant in the L. maculans genome, accounting for 68.0% of all variants, followed by transversion variants at 24.5%. Variant composition in different genomic regions was also investigated. When compared at the genome level, AT blocks contained more transition variants (75.0%) and fewer transversion (19.9%) and InDel variants (5.1%). In contrast, GC blocks had a reduced proportion of transition (63.2%) but higher percentage of transversion (28%) variants and Indels (8.8%) (S Fig. 4). In other words, when compared with GC blocks, AT blocks had 19% higher transition variants, but 29% fewer transversion variants and 42% fewer InDels.
Comparatively, the difference in variant composition between coding and non-coding regions, and SSP and non-SSP genes showed was less signi cant that between AT-and GC-blocks (S Fig. 4). Of course, the ratio varied considerably depending on the genes considered. For an example, gene egn4_Lema_T086290.1 (AvrLm4-7) had 36 transitions but only one transversion and no InDels; on the other hand, gene egn4_Lema_T007850.1 had ve transversions but only one transition.

Phylogeny of L. maculans isolates
This study involved 26 isolates from AB, 59 from SK and 65 from MB. Considering the abundance of variants identi ed, genetic diversity among these isolates was investigated rst.
The variants were discovered in different genomic regions or genes of interest, such as AT-blocks (13,851 variants), GC-blocks (18,019), coding regions (10,994), non-coding regions (20,876), SSP-genes (952), non-SSP genes (10,042), genes in AT-blocks (326), genes in GC blocks (10,668) and avirulence genes (48). Given the difference in genetic diversity as measured by variants in different regions, hierarchical cluster analysis was performed using the variants from different sources and their dendrograms were compared. Visual side-by-side comparison of overcrowded dendrograms was inconvenient due to the number of isolates involved, dendrograms were then compared to each other by their cophenetic matrices instead since a phylogenetic dendrogram is the graphical representation of a cophenetic matrix [30]; a coe cient value close to 1 would represent a signi cant similarity between the two dendrograms in the comparison (Fig. 2). The similarity of dendrograms between genome-wide variants and variants from different genomic regions could be classi ed into three tiers, the rst tier included AT-blocks, GC-blocks, codingregions, non-coding regions and non SSP genes, with correlation coe cients of >0.94. The second tier encompassed SSP genes, and genes in AT and GC blocks, from which the correlation coe cient ranged from 0.60 to 0.87. The third tier was Avr genes, with a correlation coe cient of only 0.14. It was found, however, that the similarity between any of these dendrograms with wholegenome dendrogram was only partly correlated with the number of variants involved. For instance, the genes in AT-blocks had almost 14 times as many variants as SSP genes, but its correlation coe cient (0.60) was less than the latter (0.73) when the two dendrograms were compared with that of genome-wide variants (Fig. 2).
We then chose to describe three dendrograms based on our interest in pathogen-host interaction. the rst was created by 48 SNPs in avirulence genes, the second constructed by the 952 variants in SSP genes, and the third was from genome-wide variants, to illustrate their difference in measuring genetic diversity. The three dendrograms are hereafter referred to as the Avr dendrogram, the SSP dendrogram and the whole-genome dendrogram.
The Avr dendrogram consisted of two main groups (I, II), each with two subgroups (Fig. 3 A). the Avr dendrogram showed avirulence gene diversity, it also revealed that some isolates shared an identical set of variants in multiple avirulence genes because the Euclidean distance among them was 0. For example, 33 isolates were clustered as leaf A in subgroup A (Fig. 3 A); they were literally the same in terms of avirulence gene sequences. So were the 25 isolates as leaf B in subgroup D. It was noticed, however, that isolates collected from the same location were observed in different clads, for instance, 19 isolates from Melfort, SK, were found in subgroup A, C and D, which denoted avirulence gene diversity for a speci c location.
The SSP dendrogram had two main groups (I, II) (Fig.3 B). Group I contained 127 isolates divided into two subgroups (A, B), and group II had only 23 isolates further divided into two subgroups (C, D). Group II consisted of isolates from MB and AB, but not SK. The Avr, SSP and whole-genome dendrograms resulted in different isolate groupings. This suggested that the level and complexity of genetic diversity varied with the set of variants used to examine these isolates. Regardless, the three dendrograms indicated high genetic diversity among these isolates. As shown by SSP and whole-genome dendrograms, the level of genetic diversity varied with the source of the isolates, with MB isolates being most diverse, followed by AB isolates. SK isolates appeared least diverse. Considering that use of variants in Avr or SSP genes alone would only re ect genomic diversity with respect to the L. maculans interaction with the differential hosts, and that genome-wide variants would be more informative and inclusive, we used genome-wide variants for further analysis.
PCA analysis of L. maculans isolates To further illustrate the relationships of the isolates collected at different locations and times, PCA was performed using genomewide variants. PC1, accounting for 17.8% of variation, and PC2 for 6.1%, were used to map the isolates (Fig. 4). Except for ve outliers (CR07-48, CR07-58, CR07-97, 12CC-481 and 12CC-419) from AB, all SK and AB isolates clustered together close to the origin on the negative side along with approximately half of the MB isolates. The other isolates from MB, however, segregated widely on the right side of the PCA map (Fig. 4). This indicated that isolates from SK and AB were of similar genetic composition, while MB isolates were highly diverse, which was consistent with the previously described cluster analysis.  (Fig. 4).
The difference in genetic diversity between years and locations may be re ected in variant changes. In order to nd DNA variants associated with years and regions, we carried out a GWAS study using TASSEL 5.2.31 [31] with the variants as genotype source and collection years, locations and provinces as traits. TASSEL offered two different models, the generalized linear model (GLM) and the multiple linear model (MLM) for marker-trait association. We tested the validity of GLM (PCA) and MLM (PCA+K) by evaluating their quantile-quantile plots (QQ-plots). A QQ-plot deviating from the null hypothesis across the entire standard distribution re ects undetected population strati cation or cryptic relatedness. Consequently, seemingly signi cant variants could be statistically unreliable, whereas truly signi cant variants would generate deviations only at the end of plot range [32]. While GLM (PCA) employs principal components only as covariates to control the population structure, MLM (PCA+K) uses both principle component and kinship to eliminate or reduce the confounding effect of a real association. The MLM (PCA+K) test proved to have higher correction e ciency because the base of its QQ-plot was consistent with the reference line (Fig. 5B), as compared with the GLM (PCA) approach (Fig. 5A). Therefore, Manhattan plots of MLM (PCA+K) tests were examined to nd variants associated with years, locations and provinces, and the variants were mapped to SCs with signi cant p values for potentially associated traits. There were the variants associated with the year (Fig. 5C), province (Fig. 5D) and location of the eld (Fig. 5E) of collections. To reduce false positives, the false discovery rate (FDR) approach [33] was adopted for selection of variants. As a result, none of the variants was associated with the province or eld site of collection, while 143 variants in SC12 were associated with the year of collection signi cantly. Of the 143 variants, 20 were located in egn4_Lema_T086290 (AvrLm4-7), 29 in egn4_Lema_T086300 and six in egn4_Lema_T086310. The rest of the variants were located in non-coding regions in proximity to these three genes. The other four signi cant variants were found in SC20, SC8, SC10 and SC11.
Population structure of L. maculans Initially, the popular Bayesian-based STRUCTURE software was tested for population structure analysis; it imposed tremendous computational burden due to the large number of variants involved, making implementation extremely ine cient. Landscape and ecological association (LEA), an R package [22], was used to perform STRUCTURE-like analysis and estimate admixture coe cients e ciently with improved algorithms [21,34]. LEA was run rst to establish a curve of cross-entropy versus number of ancestral populations, the optimal number K=3 indicated three ancestral populations in western Canada (Fig. 6A), which are hereafter referred to as sub-pop1, sub-pop2 and sub-pop3. The function snmf of LEA differentiated the 150 isolates by partitioning their genome into the three ancestral populations, with probability denoted by an admixture coe cient (Fig. 6B). Sub-pop1 appeared to be the dominant sub-population, consisting of 76% of the 150 isolates with a correlation coe cient of >0.8 .

Discussion
This study characterized genome wide DNA variants of the L. maculans isolates collected in western Canada and investigated the genetic diversity and population structure of L. maculans using 31,870 genome-wide DNA variants identi ed from 150 isolates based on temporal and spatial samplings. These isolates represented the L. maculans population in western Canada in 2007-2008 and 2012-2014. To the best of our knowledge, this was the rst time that the abundance of genome-wide DNA variants was applied to the study of genetic diversity in L. maculans.
Our study found that transition variants were the dominant type, especially in AT-blocks, which had 19% more transition variants than in GC-blocks. This con rmed the presence in L. maculans of RIP mutations, caused by a premeiotic repeat-inactivation mechanism, leading to C-to-T and G-to-A changes [35], which resulted in a bipartite structure, namely, alternating AT-and GCblocks [20]. The SNPs in AvrLm4-7 were reportedly affected by RIP mutations [5] because the AvrLm4-7 gene is located in an AT-block border. With this in mind, we examined the full sequences of all 28 genes contained in AT-blocks, but no variant was detected in these genes. Of the 28 genes, 25 were located in SC30. The BLAST annotation revealed that these genes were related to mitochondrion functions (data not shown), suggesting that SC30 is located in mitochondria. The DNA mutations in these mitochondrial genes might not be inherited due to their extremely detrimental effect on L. maculans fundamental biological function, or the RIP mechanism may not work on mitochondrial DNA.
Pathologists and breeders have long understood the importance of genetic variation in L. maculans with regard to the effectiveness and durability of disease resistance; the outcome of host-pathogen interaction often leads to avirulence gene mutation in the pathogen and consequently resistance breakdown in canola/rapeseed cultivars within short periods of time [9,36). Avirulence genes are members of SSP-encoding genes that play key roles in interactions with hosts [37]. Selection pressure tends to indirectly target the genes localized in rapidly-evolving transposable element-rich regions [38]. For this reason, SSP genes are supposed to mutate at a higher frequency than non-SSP genes. Counter-intuitively, some non-SSP genes accumulated more mutations than SSP genes showed by this study.
Genetic diversity of L. maculans has been studied with a number of molecular markers, including those use for mating types [4] and avirulence genes [18,19] studied the genetic diversity of the L. maculans population in Australia using 384 SNPs selected from the list of 21,814 SNPs described by Zander et al. [19] , and determined that the SNPs were located in 76 SCs. This raised the question of how variants should be chosen for a study. We carefully selected 31,870 variants (SNPs and InDels) by applying stringent quality parameters without the need for further validation because it is common practice to use NGS-derived variants on fundamental studies of humans [39], plants [40], and fungi [41]. In this study, we categorized variants by their genomic regions including AT-blocks, GC-blocks and SSP genes, and used these variants to perform cluster analyses. It was shown that dendrograms using variants from different regions differed; genetic diversity measured by variants selected from different genomic regions could only partially represent that based on genome-wide variants, with cophenetic correlation coe cients ranging from 0.14 to 0.96. We compared Avr and SSP dendrograms with genome-wide dendrograms, and reached three conclusions. (1) The Avr dendrograms exhibited the lowest level of genetic diversity, because it had only 53 leaves. This was due to the fact that some isolates shared the same leaf, such as those in leaf A or leaf B (Fig. 3 A), while both SSP and whole-genome (3) For isolates collected in the same years or from the same locations, isolates were highly dispersed across wholegenome dendrograms. Therefore, it would be bene cial to provide variant distribution information for a phylogenetic tree so that the genetic diversity could be better understood. For example, a dendrogram constructed from variants in SSP genes identi ed the similarity of isolates as related to their interaction with the host.
The L. maculans clonal population had been previously concluded according to mating type proportion and identical genotypes [16,42]. We also observed that isolates from a location clustered together in the Avr dendrogram based on 48 variants. For example, there were 19 isolates collected from Melfort, SK, in 2007-2008 used in this study, and 11 of them shared the same genotype and cladded together in leaf B (Fig. 3A). When 31,870 variants were considered in the analysis, the whole genome dendrogram depicted a different picture; despite the 11 isolates that were clustered in the same subgroup (Fig. 3C), they were distributed into different clads. From the perspective of this study, clonal multiplication might be true, but to a small geographical area; it still remains unclear how the area should be de ned reasonably.
Our GWAS result showed that there were no variants associated with province or speci c location, which meant that the variability identi ed were not geographically speci c. Interestingly, 143 variants were found to be associated with the year of collection, with 20 of them located in AvrLm4-7. It was also found that AvrLm4-7 was among the most mutated of the SSP genes, indicating AvrLm4-7 mutated signi cantly between 2007-2008 and 2012-2014. Theoretically, an Avr gene mutates in response to selection pressure imposed by its cognate R genes [43]. In western Canada, however, the major gene integrated into blackleg resistant cultivars was Rlm3 [44]; therefore, it is understandable that AvrLm3 exhibited nine SNPs. It was bit surprising that AvrLm4-7 had ve times as many SNPs as in AvrLm3, despite the reality that the corresponding R genes had not be used commercially in the region. All isolates selected from 2012-2014 carried Avr4-7, compared to only 50% of the isolates from 2008-2009 carried Avr4-7. This would show a substantial increase of AvrLm4-7 in the L. maculans population between 2007-2008 and 2012-2014, and this is consistent with previous race-pro ling results that indicated that the frequency of AvrLm3 decreased from 2008 to 2012 in western Canada, while AvrLm4-7 increased based on phenotypical testing on host differentials [44]. AvrLm7 was proven to have a masking effect on AvrLm3 [45]; thus, these mutations in AvrLm4-7 might be an effort for L. maculans to disrupt AvrLm3 function. Therefore, the loss of function of AvrLm3 or the overcoming of Rlm3 by L. maculans [46], was realized likely through the activation of AvrLm4-7 because the six non-synonymous SNPs in AvrLm3 do not compromise AvrLm3 function in the absence of AvrLm7 [27]. In addition, AvrLm4-7 was reported to contribute to pathogen tness [47]; any environmental stress which threatens pathogen tness might trigger DNA mutations in the gene.
From the population structure study, at least three sub-populations were identi ed in western Canada based on the isolates collected in both 2007-2008 and 2012-2014. Of these sub-populations, sub-pop1 was apparently the predominant group, but genomic integration with subpop2 and subpop3 was detected in isolates from MB and AB. This indicated that the L. maculans population in western Canada was also panmictic, as those previously reported in Australia [18] and France [4,15]. A panmictic population allows for high gene ow. Consistent with previous ndings using minisatellite markers [16,48], isolates from MB were the most admixed and the admixture proportion also changed with time. There were three ancestral populations present in  [16] suggested that the L. maculans population in western Canada appeared to be derived from a source in Ontario. This may explain why MB isolates had the highest level of genetic diversity, with a higher proportion of subpop2 and subpop3 than SK isolates. However, it raises the question of why subpop2 was considerably higher at Camrose and Olds in AB than locations in SK. It is possible that subpop2 in AB might have been transported from MB, or there might be another pathway for the subpopulation to spread from MB to AB, while avoiding SK. The data from the current study can not seem to clarify that.

Conclusion
In this study, 31,870 polymorphic variants were identi ed from 150 L. maculans isolates collected from western Canada in 2007-2008 and in 2012-1014. For the both collection times, three sub-populations (sub-pop1, sub-pop2, and sub-pop3) were differentiated with uneven spatial distribution across the region. Sub-pop1 was the major sub-population of the L. maculans population in SK. The AB L. maculans population was composed of sub-pop1 and sub-pop2, while all three sub-populations were prevalent in MB elds.  (Table S2) were used as differential hosts to determine the presence/absence of Avr genes following established methodology [49]. Hierarchical clustering the 1590 isolates was implemented using the squared Euclidean distance and centroid linkage implemented by the function hclust in the R statistical package v3.6.2 (https://www.rdocumentation.org/packages/stats/versions/3.6.2), and isolates were rst randomly selected from major branches of the resulting dendrogram to represent different races, then collection locations were considered to include canola growth elds as many as possible. The number per location varied due to isolate availability. Finally 162 isolates were chosen for sequencing.

Extraction of fungal DNA
The isolates selected were cultured in 50 ml sterile centrifuge tubes containing 10 ml liquid V8 medium (800 ml of distilled water, 200 ml of V8 juice, 0.7 g of calcium carbonate, and 100 mg of streptomycin sulfate) in an incubator (Lab Line Orbit Environment shaker) at 20 o C and 300 rpm for 7 days after which a mycelial ball of L. maculans had formed. The mycelial balls were collected and dried in a Labconco Benchtop Freeze Dryer (Labconco, Kansas City, MO), and then employed for DNA extraction using the EZNA Fungal DNA mini kit (Omega Bio-Tech, Norcross, GA) following the manufacturer's instructions. Sequence alignment and SNP discovery The NGS data was analyzed using the software package Lasergene Genomics Suite 13 including SeqMan NGen, SeqMan Pro and QSeq (DNASTAR, Madison, WI) [50]. Paired-end short reads of 100 bp length were rst assembled to the reference genome sequence of L. maculans 'brassicae' isolate v23.1.3 using SeqMan NGen following whole genome reference-guided work ow with mer size set at 100 nucleotides and minimum match at 80%. The high minimum match threshold was intended to deal with repetitive sequences in L. maculans. Because SeqMan NGen constantly computes match percentage for any alignment of 50 bases to compare with the minimum match threshold in the course of assembly, all alignment with a minimum match greater than the threshold would be dropped. The sequence of wa74_scaffold00338 (GenBank: FO906748.1) containing the gene AvrLm3, which was previously reported to be absent from the aforementioned reference genome [20], was included as the reference template to detect variants in AvrLm3 and its anking regions. Variants (SNPs and InDels) relative to the L. maculans reference isolate were determined using QSeq based on the following ltering criteria: P not ref ≥90%, Q call≥25, minimum SNP percentage≥90%, depth≥10 (https://www.dnastar.com/manuals/ seqman-ngen/17.1/en/topic/part-a-setting-up-the-assembly-inseqman-ngen). Short InDels were detected in comparison with the reference genome by MAQ algorithm [51]. The algorithm uses mate-pair information in original sequence les and estimates the error probability of each read alignment. The error probabilities are also summarized for the nal genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores. Variant tables of isolates were exported, ltered (Q call>25, MAF ≥ 5, depth ≥ 10), and combined for further SNP data mining. The SNP distribution in speci c genomic regions and variant composition was investigated. To explore variant density in different blocks, AT-and GC-block segments in supercontigs (SC) were rst determined by Excel VBA programming with sliding windows of 120 nucleotides. In this study, any continuous nucleotide stretch meeting the criteria of GC content <33.9% (15) and length ≥1000 base pairs was considered an AT-block. All "N" masked genome regions were excluded from the AT-block assessment. To predict genes located in AT blocks, three criteria were employed: (1) a gene is entirely located in an AT-block, if the full sequence of a gene is contained in AT block, (2) part of the gene sequence overlaps with an ATblock, and (3) a gene entirely resides in a GC-block with one end (5' end or 3' end) falls within 1000 bases to an AT-block.
Variant imputation and data analysis There were 40,000 missing calls in the variant table to be used for subsequent analysis. PCA and cluster analysis does not support miss data, so miss calls in our SNP panel were imputed prior to the analysis. Though there were various widely-used imputation algorithms such as BEAGLE, IMPUTE and MACH [52], they were found di cult to be applied onto haploid L. maculans, for example, a pedigree le was unavailable for our isolates randomly collected from elds. To overcome this, an R script was developed to impute the miss calls based on a reference haplotype. The script was tested on multiple simulation SNP panels with 40,000 miss calls and imputation accuracy reached 90~95%. The imputed variant panel was then converted into a table of 0 and 1 for PCA and cluster analysis [53].
PCA analysis was performed using the R package FactoMineR [57] and Factoextra [58]. As a popular tool for multivariate data analysis, FactoMineR offered methods for PCA and CA (correspondence analysis); the latter was specially designed for processing categorical data sets. Considering that the L. maculans DNA variant panel was binary consisting of 0 and 1, both PCA and CA methods were compared only to nd slight differences in terms of eigenvalues and isolate positions on their maps of the top two principal components. For ease of plotting, PCA was selected to differentiate and group isolates.
GWAS was performed using all variants in the software TASSEL v5.2.31 [31]. The insertion and deletion was converted into "+" and "-" prior to work ow implementation. General linear model (GLM) and mixed linear model (MLM) were primarily tested to nd an option t for our study. Correlation method was selected for the built-in PCA algorithm and normalized_IBS for kinship.

Competing interests
The authors declare that they have no competing interest.
Ethics approval and consent to participate Not applicable Consent to publish Not applicable

Availability of data and materials
The datasets used and/or analyzed during the current study available from the corresponding author on reasonable request.

Funding
This work was funded by a competitive grant from Agriculture and Agri-Food Canada. The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author contributions FY conceived of and designed the study; QC conducted the experiments, analyzed data and drafted the manuscript; GP and RK provided important information and materials. All authors reviewed the manuscript and approved the nal draft.