Sequencing 4.3 million mutations in wheat promoters to understand and modify gene expression

Significance Induced mutations in regulatory regions can be used to modulate the spatial or temporal expression of genes, whereas mutations in the coding regions can be used to generate knockouts or allelic variants to study gene function. Given the absence of regulatory limitations, these mutations can be also used in commercial applications. In this study, we sequenced 4.3 million induced mutations in the promoters and 4.7 million in the coding regions of most genes from the tetraploid wheat variety Kronos and deposited them in public databases. We provide examples of how this resource can be used to understand gene function, modulate gene expression, and generate changes in valuable wheat agronomic traits.


Supporting Information Text
Method S1.Library preparation and capture.
For the promoter capture, we re-used DNA sequencing libraries generated in our previous exome capture study (1), which included 1,513 M2 plants from the tetraploid wheat cultivar 'Kronos' mutagenized with Ethyl Methane Sulfonate (EMS).Briefly, genomic DNA libraries were constructed using the High-Throughput Library Preparation kits from KAPA Biosystems, Inc. (Wilmington, MA, USA, catalog number KK8234) with dual-SPRI bead size selection.The libraries were barcoded using NEXTflex-96 TM oligos 1-48 (Bioo Scientific, Austin, TX, USA, catalogue number 514105) and amplified by PCR (five cycles).The products were purified with Agencourt AMPure beads (Beckman Coulter, A63881) on the Sciclone G3 platform and were eluted in ultrapure DNase/RNase free distilled water and stored at -80 °C.
To capture the gene promoter regions, we used a previously published assay (2) based on 2-kb regions upstream of 110,790 high-confidence wheat genes annotated in Chinese Spring (CS) RefSeq v1.0.Probes for this assay were ordered from Roche (SeqCap EZ Prime Developer Probes 96 Reaction, catalog number 8247633001).For the captures, we pooled 24 DNA libraries together (125 ng/sample, 3 µg in total) and mixed the DNA with blocking oligos for Illumina adapters and repetitive DNA sequences (developer reagent).The DNA mixture was dried using a speed vacuum centrifuge.Capture hybridization and washes were done according to recommended protocols from Roche SeqCapEZ User Guide 4.2 v7.Briefly, the DNA pellet was dissolved in the buffers provided in the SeqCap EZ Hybridization and Wash Kit (Roche, catalog number 5634253001), denatured at 95 °C for 10 minutes, and hybridized to the SeqCap EZ Prime Developer probes for the wheat promoter regions (2) (Roche, catalog number 8247633001) for 70 h at 48 °C in a thermocycler (lid temperature set to 58 °C).The hybridization reaction was recovered using Dynabeads M-270 streptavidin beads (Invitrogen, 653-06) and washed according to the manufacturer's protocol (Roche, catalog number 5634253001).We pooled captures from 48 sequencing libraries and sequenced them in one lane of Illumina NovaSeq S4 (PE150) at the UC Davis Genome Center.

Method S2. Identification of SNPs using MAPS.
As a genome reference, we used the CS RefSeq v1.0 genome (3) with the improved CS RefSeq v1.1 annotation.For simplicity, and since the genome coordinates from v1.0 and v1.1 are the same, we will refer to the reference as CS RefSeq v1.1.We aligned the reads from the mutant plants to CS RefSeq v1.1 and called SNPs using default mpileup parameters and mapping quality higher than 20.We then used the MAPS pipeline to select bases in the reference covered by at least one read at quality higher than 20 in a minimum number of samples (MinLib parameter = 38).The minimum number of reads carrying the mutation (minimum coverage, henceforth, MC) required to call that mutation was established independently for homozygous (HomMC) and heterozygous (HetMC).We used four levels of stringency, designated from the lowest to the highest as HetMC3/HomMC2, HetMC4/HomMC3, HetMC5/HomMC3, HetMC6/HomMC4.For each library, we used the lowest stringency level that resulted in a %EMS > 98%, and we refer to this method as EMS98%.This threshold was sufficient to reduce the percent of non-EMS transitions (A>G/T>C) below 0.7 % in the worst lines.The adjustment of the stringency level by library resulted in the elimination of more SNPs from the low-quality libraries and the inclusion of more mutations from the good quality libraries relative to the fixed HetMC5/HomMC3 stringency level used in the previous exome capture.The new strategy significantly increased the number of called mutations while maintaining a low estimated error.

Method S3. Error rate determination
We used four different levels of stringency to call mutations based on the minimum number of reads showing the mutation in heterozygous and homozygous states.The most stringent combination required a heterozygous mutation to be observed in at least six reads and a homozygous mutation to be observed in at least four reads (henceforth, HetMC6/HomMC4).The most liberal criteria required the heterozygous mutation to be observed in at least three reads and a homozygous mutation to be observed in at least two (henceforth, HetMC3/HomMC2).We also used HetMC4/HomMC3 and HetMC5/HomMC3 as intermediate criteria.
The EMS treatment is expected to produce G to A (G>A) or C to T (C>T) mutations (henceforth EMS mutations).At each stringency level, we determined the percent of C to T and G to A mutations, which is designated hereafter as %-EMS.Although we eliminate all the non-EMS mutations from the final database, we cannot eliminate the G>A or C>T errors included among the real mutations induced by the EMS treatment.To estimate the number of these residual errors we used the frequency of reciprocal non-EMS transitions (A>G or T>C SNPs, henceforth, A>G/T>C), which have a similar frequency to the G>A/C>T changes among random mutations (5).
To select a %-EMS threshold, we tested different %-EMS stringency levels and found that ≥ 98% EMS was sufficient to reduce the percent of non-EMS transitions below 0.7 % in the worst lines.We then selected mutations in each line at the first HetMC/HomMC level that had ≥ 98 % EMS mutations.Most of the libraries (940) exceeded the 98%-EMS threshold at the lowest stringency level (HetMC3/HomMC2).Among the other libraries, 389 exceeded the threshold at HetMC4/HomMC3, 120 at HetMC5/HomMC3 and 64 at the highest stringency level (HetMC6/HomMC4).Twenty-two libraries did not reach the 98% EMS threshold even at the highest stringency level (HetMC6/HomMC4) and were eliminated.

Method S4. Variant Effect Prediction.
After creating a VCF file with all the EMS mutations, we predicted the effects and SIFT scores (4) of all mutations using the Variant Effect Predictor (VEP) program (5) from Ensembl tools release 103 in offline mode.Release 50 of the T. aestivum VEP cache file was downloaded from Ensembl (ftp://ftp.ensemblgenomes.org/pub/plants/release-50/variation/vep/triticum_aestivum_vep_50_IWGSC.tar.gz).Commands used to install VEP, run the prediction and count the number of different types of mutations were included in the script file 'variant_effects_prediction_for_exome_capture_SNPs.sh'deposited in Zenodo together with the VCF files (doi.org/10.5281/zenodo.7754136).

Method S5. Estimation of the proportion of mutated "EMS-accessible" G residues.
The large number of mutations detected in promoter and exome captures provided a unique opportunity to estimate the saturation of our mutant libraries, which equals the proportion of "accessible" GC nucleotides carrying at least one mutation in the population (henceforth, mutated sites).The number of mutations per site follows an approximate Poisson distribution, with most of the sites lacking mutations (n0), a large number carrying one mutation (n1) and rapidly decreasing numbers of sites carrying more than one mutation (n2 to n>7).From our data we can count the number of sites that have one or more mutations, but we need to infer n0 from the Poisson distribution.The main parameter of Poisson distribution is the mean (λ = average number of mutations per site).We estimated the value of λ that maximized the adjustment of the predicted number of sites with mutations in one, two or three lines to the observed values.To do that we selected the λ that minimized the normalized differences between the observed and expected data or (O-E) 2 /E.In this optimization we excluded the sites with mutation in 4 or more lines to minimize the risk of including residual heterogeneity sites since we observed excess mutation relative to the expected from a Poisson distribution.
Once we optimized λ, we used it to estimate n0 and the accessible GC sites as follows: λ = total EMS mutations / accessible sites λ = total EMS mutations / (n0 + mutated sites) and EMS-accessible GC sites = n0 + (mutated sites) and % saturation of the EMS-accessible space = mutated sites / EMS-accessible GC sites.Table S1  Table S7.Effect of a mutation in a conserved SPL binding site on VRN-A1 expression during spike development.The mutation in the promoter of the VRN-A1 gene was identified in mutant line K4679.Expression was evaluated in young spikes (5 mm long) in sister lines with and without the mutation.Transcript levels were estimated by qRT-PCR relative to ACTIN, which was used as internal control.

Supporting Tables
VRN-A1 specific qRT-PCR primers were from a previously published study (6).Transcript levels are expressed as fold-ACTIN using the Ct method.
ANOVA combining the two experiments and using experiment as block  S9.Effect of a mutation in a conserved LFY-binding site on VRN-A1 expression during spike development (Waddington 3.25).The induced EMS mutation was identified in line K2944 in the VRN-A1 promoter.Transcript levels were estimated by qRT-PCR relative to ACTIN, which was used as internal control.VRN-A1 specific qRT-PCR primers were from a previously published study (6).Transcript levels are expressed as fold-ACTIN using the Ct method.

.
Summary of mutations and uniquely mapped EMS mutations at different stringency levels.
* EMS-type mutations = G>A and C>T, excluding residual heterogeneity (RH).† Reciprocal transitions A>G and T>C were used to estimate the error rate in EMS mutations.

Table S2 .
Correlations between promoter and exome capture for number of mutations and percent EMS and reciprocal non-EMS transitions across 1,492 common mutant lines at HetMC5/HomMC3 and 1,465 common mutant lines at EMS98%.Correlations were calculated using values obtained with both the previous global threshold HetMC5/HomMC3 for the exome capture remapped to CS RefSeq v1.1 and the new threshold method adjusted by library (EMS98%).

Table S3 .
Comparison of SNP effects between previous (1) and new exome capture results.The new results are based on mapping the reads to the CS RefSeq v1.1 reference and using the EMS98% stringency method to call mutations.Annotation of SNP effects was performed using VEP from ENSEMBL (Method S3).
* EMS-type mutations = G>A and C>T SNPs, excluding residual heterogeneity SNPs † A>G and T>C SNPs ‡ A SIFT score < 0.05 was used to classify predicted deleterious missense mutations.

Table S4 .
Shared SNPs between promoter capture and exome capture.

Table S5 .
Number of genes and EMS-type SNPs in the promoter and re-mapped exome capture per chromosome.
* The Kronos contigs were used only in the Promoter capture to identify SNPs and were also considered in the sequencing space.

Table S6 .
Estimated average percent of GC sites accessible to EMS mutagenesis and proportion of these sites with at least one EMS-type mutation (saturation), using a Poisson approximation.
* The reduced number of observed mutations relative to the expected from Poisson distribution is likely affected by the MAPS pipeline, which eliminates mutations present in >1 line within the 48 lines included in one sequencing batch to eliminate potential homoeologous SNPs.† Mutated sites / Estimated EMS-accessible GC sites.‡ The % saturation estimated from the Poisson adjustment was validated by dividing the number of duplicated mutations present in >1 line by the total number of EMS mutations, which provides a similar probability.

Table S8 .
Phenotypic effect of a mutation within a conserved SPL-binding site within the VRN-A1 promoter.The mutation was identified in line K4679, and its effect on days to heading (DTH) and spikelet number per spike (SNS, avg.two taller tillers) was evaluated in two experiments.
VRN1 transcript levels relative to ACTIN (Ct method)

Table S10 .
Phenotypic effects of the K2944 mutation in a conserved LFY-binding site within the VRN-A1 promoter.The effect on spikelet number per spike (SNS) in the main shoot was evaluated in two independent experiments using BC1F2 and BC1F3 plants (derived from a single heterozygous BC1F2 plant).The effect on days to heading (DTH) and leaf number (LN) was evaluated in one experiment using BC1F3 plants.Data transformed using a power transformation (**-2) to improve normality of residuals.For the untransformed data the P values for Genotype was 0.0002.
1In Exp 1 no transformation restored normality of residuals so we used a non-parametric test Average spikelet number per spike (SNS)

Table S11 .
Relative abundance and % of heterozygous mutations among non-EMS errors.