Identification of Genome-Wide Variants and Discovery of Variants Associated with Brassica rapa Clubroot Resistance Gene Rcr1 through Bulked Segregant RNA Sequencing

Clubroot, caused by Plasmodiophora brassicae, is an important disease on Brassica species worldwide. A clubroot resistance gene, Rcr1, with efficacy against pathotype 3 of P. brassicae, was previously mapped to chromosome A03 of B. rapa in pak choy cultivar “Flower Nabana”. In the current study, resistance to pathotypes 2, 5 and 6 was shown to be associated with Rcr1 region on chromosome A03. Bulked segregant RNA sequencing was performed and short read sequences were assembled into 10 chromosomes of the B. rapa reference genome v1.5. For the resistant (R) bulks, a total of 351.8 million (M) sequences, 30,836.5 million bases (Mb) in length, produced 120-fold coverage of the reference genome. For the susceptible (S) bulks, 322.9 M sequences, 28,216.6 Mb in length, produced 109-fold coverage. In total, 776.2 K single nucleotide polymorphisms (SNPs) and 122.2 K insertion / deletion (InDels) in R bulks and 762.8 K SNPs and 118.7 K InDels in S bulks were identified; each chromosome had about 87% SNPs and 13% InDels, with 78% monomorphic and 22% polymorphic variants between the R and S bulks. Polymorphic variants on each chromosome were usually below 23%, but made up 34% of the variants on chromosome A03. There were 35 genes annotated in the Rcr1 target region and variants were identified in 21 genes. The numbers of poly variants differed significantly among the genes. Four out of them encode Toll-Interleukin-1 receptor / nucleotide-binding site / leucine-rich-repeat proteins; Bra019409 and Bra019410 harbored the higher numbers of polymorphic variants, which indicates that they are more likely candidates of Rcr1. Fourteen SNP markers in the target region were genotyped using the Kompetitive Allele Specific PCR method and were confirmed to associate with Rcr1. Selected SNP markers were analyzed with 26 recombinants obtained from a segregating population consisting of 1587 plants, indicating that they were completely linked to Rcr1. Nine SNP markers were used for marker-assisted introgression of Rcr1 into B. napus canola from B. rapa, with 100% accuracy in this study.

associated with Rcr1 region on chromosome A03. Bulked segregant RNA sequencing was performed and short read sequences were assembled into 10 chromosomes of the B. rapa reference genome v1.5. For the resistant (R) bulks, a total of 351.8 million (M) sequences, 30,836.5 million bases (Mb) in length, produced 120-fold coverage of the reference genome. For the susceptible (S) bulks, 322.9 M sequences, 28,216.6 Mb in length, produced 109-fold coverage. In total, 776.2 K single nucleotide polymorphisms (SNPs) and 122.2 K insertion / deletion (InDels) in R bulks and 762.8 K SNPs and 118.7 K InDels in S bulks were identified; each chromosome had about 87% SNPs and 13% InDels, with 78% monomorphic and 22% polymorphic variants between the R and S bulks. Polymorphic variants on each chromosome were usually below 23%, but made up 34% of the variants on chromosome A03. There were 35 genes annotated in the Rcr1 target region and variants were identified in 21 genes. The numbers of poly variants differed significantly among the genes. Four out of them encode Toll-Interleukin-1 receptor / nucleotide-binding site / leucine-rich-repeat proteins; Bra019409 and Bra019410 harbored the higher numbers of polymorphic variants, which indicates that they are more likely candidates of Rcr1. Fourteen SNP markers in the target region were genotyped using the Kompetitive Allele Specific PCR method and were confirmed to associate with Rcr1. Selected SNP markers were analyzed with 26 recombinants obtained from a segregating population consisting of 1587 plants, indicating that they were completely linked to Rcr1. Nine SNP markers were used for Introduction Materials and Methods Development of F 2 lines from a F 1 segregating population and evaluation of the F 2 lines for resistance to clubroot Pak choy (B. rapa subsp. chinensis) cv. "Flower Nabana" (FN), a hybrid developed by Evergreen Y.H. Enterprises, Anaheim, CA, USA, is highly resistant to pathotype 3 of P. brassicae. Rcr1 was identified and genetically mapped in 1587 plants from a F 1 population that segregated for resistance (R) and susceptibility (S) [11]. This population was derived from a cross between FN, which is heterozygous at Rcr1 locus, and canola line ACDC, which is a clubroot-susceptible B. rapa doubled haploid developed at the Saskatoon Research Centre of Agriculture and Agri-Food Canada. An additional 200 plants in the F 1 population were self-pollinated, but most of the plants did not produce seed due to self-incompatibility in B. rapa. Seed was obtained from 38 plants, resulting in 38 F 2 lines that could be evaluated for resistance to multiple pathotypes of P. brassicae. These lines were tested for resistance to pathotypes 2, 3, 5 and 6 under controlled conditions at the University of Guelph, ON, Canada. Plant seedlings were grown in soil-less mix in tall narrow plastic pots (164 mL conetainers, Stuewe & Sons Inc., Corvallis, OR, USA) and inoculated with 5 ml of resting spore suspension (1×10 6 spores ml −1 ) at 10 days after seeding. Roots of 7-14 plants per line were assessed for clubroot severity at 6 weeks after inoculation using a standard 0 to 3 scale where: 0 = no clubbing; 1 = small clubs only; 2 = moderate clubs; and 3 = severe clubbing. A disease severity index (DSI) [63,64] was calculated using the following formula: The highly susceptible Shanghai pak choy cv. "Mei Qing Choi" (Stokes Seeds, ON, Canada) was included as a susceptible control for each group to ensure that the inoculation was highly effective. Evaluation of the F 2 lines and controls for resistance to the pathotypes of P. brassicae was replicated two times.

RNA-seq and sequence alignment
At 15-days post-inoculation, root tissue from 9 R plants was combined to form the R bulk and from 9 S plants to form the S bulk; together, the two bulks comprised one biological replicate as described by Chu et al. 2014 [11]. In total, three replicates were assessed. The total RNA from each replicate was isolated using an RNeasy Plant Mini Kit (Qiagen; Toronto, ON, Canada) with oncolumn deoxyribonuclease (DNase) digestion using a Qiagen RNase-Free DNase set following the manufacturer's instruction. The cDNA library was prepared using TruSeq RNA Sample Preparation Kits v2 (Illumina; San Diego, CA, USA). RNA-seq was carried out using a Illumina Hiseq 2500 platform (Plant Biotechnology Institute, National Research Council, Saskatoon, SK, Canada) and global gene expression by the RNA-seq was conducted using the B. rapa reference genome v1.2 (B. rapa subsp. pekinensis cv. "Chiifu") [11]. In the current study, the short reads from the RNA-seq were aligned to the B. rapa reference genome v1.5, which was downloaded at http://brassicadb.org/brad/downloadOverview.php. The reference genome consists of 10 chromosomes and 40,357 scaffolds. The total lengths of chromosomes and scaffolds are about 258 million bases (Mb) and 27 Mb, equivalent to about 90% and 10% of the reference genome, respectively. To simplify downstream data analysis, only the 258 Mb chromosome sequences were used in the current study, so the short reads from the RNA-seq project were assembled into 10 chromosomes A01 to A10 of B. rapa. The program SeqMan NGen 11 (DNASTAR, Madison, WI, USA) was used for short read assembly. Two methods were used: 1) short reads from three biological replicates each of the R and S bulks were assembled into the reference genome, resulting in six assembly files-hereafter, this method is called the single sample assembly (SSA); 2) short reads from a pool of the three R bulks and a pool of the three S bulks were assembled into the reference genome, resulting in two assembly fileshereafter called the pooled sample assembly (PSA). Standard assembling and filtering parameters were used.

Identification of variants
Discovery of variants (SNPs and InDels) in comparison with the DNA sequences in the B. rapa "Chiifu" [65] were performed using SeqMan Pro 11 (DNASTAR, Madison, WI, USA) with Q call ! 15 and depth ! 5. Analysis of variance and t-tests (LSD) were used to assess the variation in SNPs and InDels [66].

Genotyping SNP markers
Selected SNPs identified in the target region were confirmed using the Kompetitive Allele Specific PCR (KASP) method (http://www.lgcgroup.com/) following the manufacture's instruction. PCR reactions were performed using a StepOne Plus Real Time PCR System (Applied Biosystem, Mississauga, ON, Canada). FN8, a line derived from the B. rapa cv. FN and homozygous at the Rcr1 locus, was included for this study.
To validate the SNP markers for MAS selection with Rcr1, DNA samples from 38 F 1 plants from the cross between FN and the susceptible B. rapa line ACDC and 26 recombinants from the B. rapa population that was originally used to identify Rcr1 [11] were assessed. The accuracy of the SNP markers for use in MAS was determined by testing the recombinants from a BC 1 population derived from B. napus introgressed with Rcr1 described by Chu et al, 2014 [11].

Resistance to selected pathotypes
Rcr1 was initially identified based on resistance to pathotype 3 (Williams' differentials) [41] of P. brassica, the most prevalent pathotype identified in western Canada [11]. The resistant donor FN was found to be resistant to multiple pathotyes of P. brassica [43]. To determine if resistance to other pathotypes in the R donor cv. FN was also associated with Rcr1, we tested 38 F 2 lines derived from the F 1 segregating population for resistance to pathotypes 2, 3, 5 and 6. As expected, FN conferred resistance to pathotype 3 (0 DSI), but ACDC (95 DSI) and "Mei Qing Choi" (100 DSI) were highly susceptible (Fig 1). Eight of the 38 lines (F 2 -6, 18, 19, 24, 29, 30, 34 and 41) were highly susceptible to pathotype 3 (severity > 80 DSI) so their parental F 1 plants likely did not carry Rcr1. Severity was low to moderate in the other 30 lines (0-52 DSI), which indicates that the parental F 1 plants for these lines likely carried the Rcr1 gene. To further confirm the F 1 phenotypes determined by F 2 lines, DNA samples from the original 38 F 1 plants were assessed using two microsatellite markers, MS7-9 and sN8591, that flanked Rcr1 [11]. The 30 F 1 parental plants of the F 2 lines with low to moderate DSI carried the allele associated with resistance; the 8 F 1 plants with high DSIs in their F 2 lines carried the allele associated with susceptibility (Fig 1), confirming the phenotypes in the 38 F 1 plants determined by DSIs in the F 2 lines.
The clubroot reaction of the parental lines and the F 2 lines to pathotypes 2, 5 and 6 was also assessed. The parental cv. FN was highly resistance to pathotypes 2, 5 and 6, but ACDC and "Mei Qing Choi" were highly susceptible (Fig 1). The eight F 2 lines that were susceptible to pathotype 3 were also highly susceptible to pathotypes 2, 5 and 6 (> 80 DSI). In contrast, clubroot severity was low to moderate (0-55 DSI) in the other 30 lines (Fig 1). Therefore, we conclude that resistance to pathotypes 2, 5 and 6 of P. brassicae was associated with resistance to pathotype 3 that was used for identification of Rcr1 in the 38 lines.

RNA-seq and sequence alignment
The average sequence counts and average accumulated sequence length were 125.9 million (M) and 11,093.8 Mb per R bulk, and 114.0 M and 10,005.8 Mb per S bulk using the SSA method (Table 1). This provided an average depth of coverage of the reference genome of 43 fold in R bulks and 39 fold in S bulks. The average sequence length assembled into the reference genome was 88 bases. More sequences were assembled into the longer chromosomes A03 and A09, while fewer sequences were assembled into the shorter chromosomes A04 and A10 ( Table 1). The sequence counts assembled into the genome for each chromosome were highly correlated to chromosome length (r = 0.90) for both R and S bulks.
Short reads from three R bulks and three S bulks were further assembled using the PSA method. As observed previously with the SSA method, more sequences were aligned into the longer chromosomes A03 and A09 and fewer sequences were aligned into the short chromosome A04 and A10 (Table 1). A total of 351.8 M sequences, 30,836.5 Mb in length, with coverage of 120 fold of the reference genome were assembled into B. rapa chromosomes from the pool of three R bulks, and 322.9 M sequences, 28,216.9 Mb in length, with 109 fold coverage were assembled from the pool of three S bulks (Table 1). About 50% of the aligned sequences Evaluation of plants from 38 F 2 lines for resistance to clubroot. A, Plants were inoculated with pathotype 3 and confirmed with molecular markers MS7-9 and sN8591 flanking Rcr1 described by Chu et al [11]. "+":heterozygous alleles, one allele from R parent and the other one from S parent, "-":homozygous alleles from S parent; B, Inoculated with pathotype 2; C, Inoculated with pathotype 5; D, Inoculated with pathotype 6. Names of the lines indicate below X axis. Letters F, A, and M represent B. rapa cv. or breeding line FN, ACDC and "Mei Qing Choi", respectively. Disease severity indexes (DSI) were calculated from clubroot severity ratings. R and S are the phenotypes assigned for each line based on DSIs. The experiments were replicated two times with similar results.

Chromosome
Number of sequences (x 10 6 ) Accumulated length of sequences (bases x 10 6 ) were aligned to the top strand of the reference DNA and the other 50% to the bottom strand using the either the SSA or PSA methods (S1 Table).

Identification of variants
Variants in both R and S samples were very frequent, with means of about 605.9 K SNPs and 82.2 K InDels per R bulk and 579.4 K SNPs and 76.3 K InDels per S bulk by using the SSA method. There was a strong positive correlation between the numbers of SNPs compared to InDels (r = 0.99 in both R and S bulks). The numbers of SNPs and InDels varied among the chromosomes (Fig 2), indicating that the longer chromosomes (A03, A09) carried more variants than the shorter chromosomes (A04, A10). However, the proportion of SNPs and InDels was similar among chromosomes, with about 88% and 12%, respectively, in both R and S bulks (S1 Fig).
The numbers of SNPs and InDels identified using the PSA method were higher than with the SSA method, with 776.2 K SNPs and 122.2 K InDels in the R bulk, and 762.8 K SNPs and 118.7 K InDels in the S bulk. As with the SSA method, there was a strong positive correlation between the numbers of SNPs and InDels with the correlation coefficient (r = 0.99) in either R or S bulks. The proportion of SNPs and InDels on each chromosome was 87% and 13% in both the R and S bulks (S1 Fig), which is a slightly lower proportion of SNP and higher proportion of InDel identified using the PSA method than the SSA method. As observed with the SSA method, the numbers of SNPs and InDels identified using PSA method (Fig 2) were higher on the longer chromosomes.

Comparison of variants in R and S bulks
Variants identified in the R and S samples could be the same (monomorphic, mono) or different (polymorphic, poly) (Fig 3). Mono variants comprised 73% (range 63.7-75.6%) of the variants identified across the B. rapa genome using the SSA method, and poly variants 27% (range 24.4%-36.3%). There were no differences among chromosomes in the proportion of mono and poly variants except for chromosome A03, where the clubroot resistance gene Rcr1 is located, which carried more ploy variants and fewer mono variants (64%, P 0.05) than the other chromosomes (Fig 4).
With the PSA method, the proportion of mono variants was higher than for the SSA method for each chromosome (Fig 4). Overall, approximately 5% of the poly variants identified by SSA method were identified as mono using the PSA method. On chromosome A03, there were 66% mono variants and 34% poly variants, very similar to the results from SSA (Fig 4).

Analysis of variants in the target region
A previous study reported that Rcr1 was mapped in an interval of 0.28 centi-Morgan on B. rapa chromosome A03 and located in the 24.26-24.54 Mb of A03 [11]. In the current study, the target region was first analyzed using the SSA method. There were 35 genes annotated in the region, based on the reference genome v1.5 (Table 2). Three genes (Bra038750, Bra019396 and Bra019394) did not show any expression and no short reads were assembled into the reference genome, so no variants could be identified. Short reads with sequence counts < 20 and number of variants < 1 was associated with 11 genes (Bra038751, Bra038747, Bra019411, Bra019408, Bra019404, Bra019402, Bra019401, Bra019399, Bra019397, Bra019393 and Bra019392). More than 20 sequence counts with > 1 variants were associated with 21 genes in both R and S bulks ( Table 2). Information on BLASTX (best hit) to A. thaliana and gene ontology annotation for the genes was obtained from http://brassicadb.org/brad/index.php (S2 Table). Four genes (Bra019409, Bra019410, Bra019412 and Bra019413) encode TNL-class disease resistance proteins. The numbers of poly variants between the R and S bulks among the genes in the coding regions was assessed because the poly variants represent differences in the DNA sequences between the R and S bulks. The numbers of poly variants differed significantly among the 21 genes (Table 3).Three TNL genes (Bra019409, Bra019410, and Bra019413) had the most poly variants, with means of 81.7, 75.0 and 34.7 poly variants per gene, respectively, followed by four genes (Bra019406, Bra038754, Bra019390 and Bra038757) with means of 33.3, 31.3, 29.0 and 24.0. These seven genes exhibited more poly variants than the other 14 genes (P 0.05). The fourth TNL-class gene (Bra019412), with mean of 4.0 poly variants, was one of the 14 genes that showed few polymorphic variants ( Table 3).
The PSA method identified more poly variants among the 21 genes in the target region than the SSA method (Table 3). For both methods, three TNL genes (Bra019409, Bra019410, and Bra019413) consistently carried the highest number of poly variants among the 21 genes.

Analysis of variant types in the TNL genes
The poly variants that uniquely occurred in the R bulks but not in the S bulks (Fig 3) with depth > 5 in both samples were further assessed using the PSA method. A total of 108 poly variants were identified from the coding sequences in the four TNL genes (S3 Table). SeqMan Pro software was used to sort the 108 variants that affect amino acid sequences into four groups: non-synonymous, nonsense, frameshift and synonymous variants. Non-synonymous variants occurred in each of the TNL genes, but Bra019409 and Bra019410 carried many more non-synonymous variants than the other two TNL genes (Fig 5). Three nonsense variants were identified in Bra019409 and one in Bra019410, but none were identified in the other two genes. Frameshift variants caused by InDels were found in Bra019409 and Bra019413, and synonymous variants occurred in each of the genes except Bra019412 (Fig 5). Overall, Bra019409 and Bra019410 carried higher number of poly variants that uniquely occurred in the R samples and could affect the amino acid sequence in their proteins.

Genotyping SNP markers identified in the Rcr1 region
The DNA sequences flanking 11 SNPs in the four TNL genes (S3 Table) and three SNPs (SNP_A03_18, 19, and 32) in the gene Bra019406 that was one of genes harboring more poly variants (Table 3) were obtained from the B. rapa reference genome (S4 Table). These SNP markers, which spanned about 42 Kb in the target region (Table 4), were genotyped using the KASP method. As expected, each SNP marker was polymorphic between the R parental line FN8 and the S parental line ACDC (Fig 6), which confirmed that a SNP was present at each locus. In an allelic discrimination plot, PCR products from the FN8 (R) were usually close to the right bottom and ACDC (S) to the left top (Fig 6). This indicates that they carried homozygous alleles that might be associated with R and S, respectively. In contrast, when DNA samples from the 38 F 1 plants (Fig 1) were analyzed using the 14 SNP markers, the PCR products from 30 R plants were in the middle (Fig 6). This indicates that these plants were heterozygous at  this locus (Table 4), with one allele from the R parent and the other one from the S parent. PCR products from the eight S plants were close to the right top (Fig 6), where the PCR product from the susceptible line ACDC was located, indicating that they carried homozygous alleles from the S parent (Table 4). These results confirm that the 14 SNP markers identified by RNA-seq were completely associated with Rcr1 on chromosome A03 in this population.
To determine if the SNP markers were located in the target region, we analyzed 26 recombinants, including LP9-1, LP5-12, LP5-71 and LP5-183 that were identified by the closest flanking markers, MS7-9 and sN8591 from 1587 F 1 plants with five SNP markers (SNP_A03_08, 09, 12, 16, and 32) that reside in the genes Bra019413, Bra019412, Bra019410, Bra019409 and Bra019406 respectively (Fig 6). Each SNP marker co-segregated with phenotypes and no recombination event was found in these plants (Fig 7). This indicates that the SNP markers were completely associated with the clubroot resistance gene Rcr1.

Validation of SNP markers for MAS of Rcr1 in canola breeding
To assess the potential use of the SNP markers in MAS for introgression of Rcr1 into canola, DNA samples were assessed from seven recombinants in the BC 1 populations consisting of 176 plants, which were previously used for validating Rcr1 linked to microsatellite markers for MAS [11], with the 14 SNP markers (Table 4). Five SNP markers (SNP_A03_16, 18,19,32,103) were polymorphic in the B. rapa population; however, they were monomorphic in the B. napus population. Polymorphism was identified in the rest of 9 SNP markers ( Table 5). The SNP alleles from the R parent occurred in 3 R plants (BN-4, 27, and 58) and the alleles from  Table 5). The accuracy for MAS in the BC 1 population using the 9 KASP SNP markers was 100%, and the SNP markers completely co-segregated with the phenotypes.

Discussion
More than 10 clubroot resistance genes including CRa, CRb, CRc, CRk, Crr1, Crr2, Crr3 and Crr4 were identified previously in vegetable cultivars of B. rapa [9,10,[12][13][14][15][16][17][18][19][20][21][22][23]. These genes were identified based on the reaction to the P. brassicae pathotypes prevalent in Japan, Korea and China, which would likely be classed into pathotypes 2 or 4 based on Williams' differential hosts [41]. At least five pathotypes of P. brassicae (2, 3, 5, 6 and 8, again based on Williams' differentials) occur in Canada. Pathotype 3 is the most prevalent on canola [67,68]. Rcr1 was previously mapped on chromosome A03 based on resistance to pathotype 3 [11], but the donor cv. FN was resistant to each of the pathotypes previously identified in Canada [43]. To determine if Rcr1 confers resistance against the range of Canadian pathotypes, a F 2 population consisting of 38 lines was developed. The F 2 lines were equivalent to self-pollinated BC 1 populations (BC 1 S 1 ). There were two genotypes "Rr", heterozygous at Rcr1 locus, which should be resistant and "rr", homozygous susceptible alleles, which should be susceptible phenotype in F 1 . Theoretically, segregation for resistance and susceptibility from "Rr" plants in the F 2 lines is 3:1 and that from "rr" should be 0:1. Ideally, X 2 analysis would have been performed. Unfortunately, the small number of plants available for evaluation of plant resistance in this study made X 2 analysis impractical. We speculated that the eight F 2 lines with high severity were from "rr" F 1 plants and the 30 F 2 lines with low to moderate severity were from "Rr" F 1 plants. Similar results were obtained by repeating the experiments. In addition, the results were confirmed by analysis of the molecular markers (Fig 1, Table 4). Thus, the genotypes of the F 1 plants determined by the phenotypes in F 2 lines in this study were reliable, even though 7 to 14 plants each line for each pathotype were tested per replicate. In some of the F 2 lines that were derived from "Rr" F 1 plants, their DSIs were as low as zero. This could be caused by segregating distortion so no "rr" plants appeared in the small populations. In addition, disease escapes from "rr" plants could contribute to the zero DSIs. Resistance to pathotypes 2, 3, 5 and 6 was completely linked in each of the 38 lines. However, variation was usually found among DSIs of the same F 2 lines responding to different pathotypes. The reason for the variation could be that the proportions of the genotypes (RR, Rr and rr) for those "Rr" derived F 2 lines against each   pathotype were not exactly same. It also could be caused by some escapes from "rr" plants especially for those "rr" derived F 2 lines. The original intent was to assess 200 F 1 plants in this study, but only these 38 plants produced sufficient seed in the F 2 generation for assessment. Resistance to pathotype 8 was also associated with resistance to pathotype 3 in one iteration of the assessment (data not show), but there was not sufficient seed to repeat that portion of the study. Resistance to this range of pathotypes indicates that Rcr1 could confer broad spectrum resistance to many of the pathotypes of P. brassicae that are present in Canada. Another possibility is that the resistance to this range of pathotypes could be controlled by linked genes. Clearly, this still needs to be determined. Analysis of the recombinants selected from the F 1 population derived from the cross ACDC x FN. Phenotypes (R for resistant, S for susceptible) on the right, molecular markers (sS2093, sR12302I, sN8591, MS7-9, A3-020, sR6340I and sJ5076F) on the top were described previously [11]. The recombinants were analyzed with five SNP markers (SNP_A03_08, 09, 12, 16 and 32) residing in the genes Bra019413, Bra019412, Bra019410, Bra019409 and Bra019406 respectively using the KASP method. They are highlighted in the red rectangle. PCR products amplified from R alleles are denoted in light grey and those from S alleles in black. The four recombinants in blue rectangles were identified with the closest flanking markers sN8591 and MS7-9 [11]. RNA-seq generates a wealth of short DNA sequence reads from random places in the transcriptome. Transcriptome analysis and identification of SNPs based on RNA-seq have been carried out recent years in many organisms, including Brassicas such as B. napus [53,69,70] and B. rapa [11,54]. The current study focused on characterization of variants in the B. rapa population that carried Rcr1, and identified the most probable candidates for Rcr1 based on DNA variants and development of SNP markers tightly linked to the gene for MAS. Also, SNP variants were found to be the most common DNA sequence variation, accounting for over 80% of the B. rapa transcriptome in the current study. Similar results were reported in rice, with an average of 99,955 putative SNPs and 14,617 putative InDels [71] and the human genome, where SNPs made up about 3 million of the 4 million DNA sequence [72]. Recently, RNA-seq of human airway epithelial cells revealed that over 90% of variants were SNPs [73].
A wealth of SNPs and InDels were identified in the B. rapa genome using both the SSA and PSA approaches, which generally produced similar results. The SSA method, in which three biological replicates of R and S bulks were assessed, provided estimates of variance that could be used to assess the statistical significance of variation of SNPs and InDels. This approach is very useful for identifying differentially expressed genes through RNA-seq. In contrast, the PSA approach, which merged the data across biological replicates, is much less amenable to statistical analysis. However, it produced a greater depth of sequencing data, providing more reliable results in discovery of variants. Therefore, this approach would be preferred in future SNP discovery through bulked segregant RNA sequencing.
In both R and S samples, variants from the reference genome DNA sequences in B. rapa Chinese cabbage "Chiifu" v1.5 [65] were very frequent. One factor that likely contributed to this high frequency of variants was that the parental lines were pak choy and canola, rather than Chinese cabbage. Deep sequencing, with coverage of 120 fold of the reference genome in R and 109 fold in S samples, could also have contributed to the wealth of variants in the current study. More than 70% of the variants between R and S bulks were monomorphic in all chromosomes except A03, where a significantly higher percentage of polymorphic variants were present. The high frequency of poly variants could indicate that there was a higher level of difference between the R and S bulks in Rcr1 and its surrounding region than in other parts of genome.
Rcr1 was fine mapped in an interval of 0.28 centi-Morgan, flanking by microsatellite markers MS7-9 and sN8591 on B. rapa chromosome A03 [11]. In this study, we identified more than 400 variants (Table 3) in the target region. Fourteen SNP markers that reside in five genes  (Bra019406, Bra019409, Bra019410, Bra019412 and Bra019413) in the target region were genotyped using KASP method. Each of the markers co-segregated with Rcr1, indicating that they were completely linked to Rcr1 and closer to Rcr1 than the markers previously described by Chu et al 2014 [11]. However, it was not possible to define the Rcr1 location further with the recombinants in the mapping population, because no recombination events could be identified in the set of recombinants. Therefore, it was essential to obtain accurate SNP and Indel profile on the genes of interest through analysis of RNA-seq data. Four of the five genes identified encode proteins related to disease resistance that belong to TNL families. Non-synonymous, nonsense, frameshift and synonymous variants were identified in the TNL genes. It is unlikely that the genes Bra019412 and Bra019413 are the candidates for Rcr1; although they showed differential expression between R and S bulks [11], few poly variants, especially variants that cause changes in amino acid sequence, were identified in these genes. Bra019409 and Bra019410 carried higher number of variants that could cause changes in amino acid sequence and higher number of synonymous variants, previously considered as 'silent', but which can cause changes in protein expression, conformation and function [74]. Therefore, it is more likely that either of these genes is a candidate for Rcr1. It is also possible that a combination of the TNL genes is required for the expression of Rcr1 resistance. Rcr1 was mapped in the same interval as one of the cloned clubroot resistance gene CRa [11]. However, it was not illustrated how CRa was related to the TNL genes in the target region [38]. The complete CRa coding sequence is 4223 bp (GenBank: AB751517.1). We searched the B. rapa genome using the 4223 bp sequence at http://brassicadb.org/brad/blastPage.php and found that CRa was homologous to the four TNL genes Bra019410, Bra019412, Bra019409 and Bra019413 with score and E value at 1905, 0.0; 823, 0.0; 739, 0.0 and 504, e-141 respectively. This indicates that CRa is a gene homologous to the TNL genes especially Bra019410, but obviously different from any of the TNLs. Identifying the TNL gene that corresponds with Rcr1 and relationship of Rcr1 with CRa will be addressed after Rcr1 has been cloned. Breeding for clubroot resistance in canola at many locations in Canada is severely constrained because the pathogen, or the pathotype of interest, does not occur at or near the site of established breeding institutions. Use of highly specific markers in MAS could be to accurately assess the reaction of lines under controlled (i.e. indoor) conditions. A number of techniques are available for genotyping SNPs. NGS is an emerging method of SNP genotyping that is being increasingly adopted for discovery applications. However, this method can be expensive and time-consuming in terms of informatics needs, and currently generates datasets with a large proportion of missing data [75]. KASP offers cost-effective and scalable flexibility in applications [76]. In total, 14 robust SNP markers were confirmed that were completely associated with Rcr1, based on testing the recombinants identified from 1587 plants. Most of the SNPs proved to be very reliable markers in use for introgression of Rcr1 into elite canola breeding lines. Furthermore, MAS with the polymorphic SNP markers was confirmed to have detected Rcr1 in the BC 1 population described by Chu et al [11] with 100% accuracy in this study. Therefore, we conclude that these SNP markers could provide an effective and robust basis for introgression of Rcr1 into canola using MAS.