Characterization of the genomic sequence data around common cutworm resistance genes in soybean (Glycine max) using short- and long-read sequencing methods

The common cutworm (CCW, Spodoptera litura Fabricius) is one of the pests that most severely infest soybean (Glycine max L. Merr.). In a previous report, quantitative trait locus (QTL) analysis of CCW resistance, using recombinant inbred lines derived from a cross between the susceptible cultivar ‘Fukuyutaka’ and the resistant cultivar ‘Himeshirazu’, identified two antixenosis resistance QTLs, CCW-1 and CCW-2. To reveal sequence variation between these two cultivars, whole-genome resequencing was performed using the Illumina HiSeq2000 (75,632,747 and 91,540,849 reads). The generated datasets can be used for fine mapping and gene isolation of CCW-1 and CCW-2, as well as for revealing more detailed genetic differences between ‘Fukuyutaka’ and ‘Himeshirazu’.


Specification
Subject: Plant science
Specific subject area: Agricultural and Biological Sciences, Genomics of soybean (Glycine max)
Type of data: Figure and fastq/fasta files
How data were acquired: Whole genomes of soybean cultivars 'Fukuyutaka' and 'Himeshirazu' were sequenced using the Illumina HiSeq2000 short-read sequencer. The sequence of the unique genomic region in CCW2 was amplified by genomic polymerase chain reaction (PCR) and sequenced using the MinION nanopore long-read sequencer (type R9.4, Oxford Nanopore Technologies Ltd., UK [ONT]).
Data format: Raw sequencing reads (fastq), Binary Alignment Map (BAM) and analyzed files (fasta)
Parameters for data collection: The common cutworm susceptible soybean cultivar 'Fukuyutaka' and resistant cultivar 'Himeshirazu' were used in this work. Their seeds are available from Genebank in NARO (https://www.gene.affrc.go.jp/databases_en.php). Genomic DNA for the sequencing was prepared from new leaves of one individual.
Description of data: HiSeq: Sequencing libraries were prepared with 1 μg DNA input, using the TruSeq DNA PCR-Free Library Preparation Kit (Illumina). Library pools were quantified by qPCR, loaded on the HiSeq2000 patterned flow cells and clustered on an Illumina cBot in accordance with the manufacturer's protocol. Flow cells were sequenced on the Illumina HiSeq2000 with 2 × 100 bp reads. Demultiplexing of sequencing data was performed with bcl2fastq2. MinION: Amplicons were obtained by amplification from the genomic DNA of 'Himeshirazu'. A total of 1 μg amplicon was end-repaired and used for library construction. The MinION sequencing was run using MinKNOW (version 1.7.3). The resulting FAST5 files were converted to FASTQ files using the Albacore basecaller (version 1.1.0, ONT). The raw reads were assembled using Canu (version 1.

Value of the Data
• The genomic data of soybean cultivars susceptible and resistant to the common cutworm can be used to develop molecular markers for detecting quantitative trait loci and isolating genes.
• The sequence data for the inserted genomic region of 'Himeshirazu' in the CCW2 region can be used for fine mapping of a candidate gene.
• These data can be used for the development of DNA markers and can contribute to marker-assisted selection in soybean breeding.
MinION: We determined the inserted sequences in the CCW2 region observed in 'Himeshirazu'. The amplified fragment, whose length was estimated at about 18 kbp from PCR analysis, was sequenced using the Oxford Nanopore MinION platform (Oxford Nanopore Technologies Ltd., Oxford, UK). We obtained 28,725 raw reads, of which only 18 remained after trimming and quality control by Canu. The lengths of these 18 reads were bimodally distributed, ranging from 18,023 bp to 41,188 bp (Table 3). Given the estimated size of the region, we suspected that the longer reads were artifacts. To test this possibility, we conducted a homology search among the 18 reads using BLASTN. Each of the 14 shorter reads shared a single homologous region with the others, whereas the four longer reads (Nos. 15-18; 34,355 bp, 33,401 bp, 36,324 bp and 41,188 bp, respectively) each contained two regions homologous to the shorter reads. We confirmed that each long read carried a tandem duplication of a shorter read using mummer-4.0.0beta2 [8]. We also conducted a homology search of the 18 reads against the Gmax275 genome sequence and found homology to Chr07 interrupted by gaps of 7.2-7.5 kbp (Table 3). We therefore concluded that the longer reads were chimeric and excluded them from the assembly. Finally, we constructed a consensus sequence from the 14 remaining reads. We also confirmed that, relative to the corresponding region on Chr07 of the Gmax275 reference genome, the consensus sequence contained the target insertion observed in 'Himeshirazu', which appears as a long gap in the alignment (Fig. 3). These data will be useful for fine mapping of CCW-2 and for identifying the responsible gene.
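The length argument above can be illustrated with a minimal Python sketch. This is not the authors' procedure: they identified the chimeric reads by BLASTN and MUMmer, and the simple length cutoff here only stands in for their first suspicion that reads about twice the amplicon size are artifacts. The function name, the cutoff ratio of 1.7 and the read names are assumptions made for illustration. The four long read lengths and the 18,023 bp minimum come from the text; the remaining short-read length is a placeholder.

```python
# Approximate amplicon size estimated from PCR analysis (from the text).
EXPECTED_AMPLICON_BP = 18_000
# Hypothetical cutoff: reads at least this many times the expected size are flagged.
CHIMERA_RATIO = 1.7


def split_reads_by_length(read_lengths, expected_bp=EXPECTED_AMPLICON_BP,
                          ratio=CHIMERA_RATIO):
    """Split reads into (kept, flagged) dicts using a simple length cutoff."""
    kept, flagged = {}, {}
    for name, length in read_lengths.items():
        target = flagged if length >= ratio * expected_bp else kept
        target[name] = length
    return kept, flagged


# Reads 15-18 use the lengths reported in the text; read_02 is a placeholder.
example_lengths = {
    "read_01": 18_023,  # shortest read reported in the text
    "read_02": 18_500,  # hypothetical short read
    "read_15": 34_355,
    "read_16": 33_401,
    "read_17": 36_324,
    "read_18": 41_188,
}

kept, flagged = split_reads_by_length(example_lengths)
print("kept:", sorted(kept))
print("flagged as possible chimeras:", sorted(flagged))
```

A length cutoff alone cannot show that a long read is a tandem duplicate, so flagged reads would still need an alignment-based check such as the BLASTN and MUMmer comparisons described above.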

Sample collection and DNA extraction
Samples for HiSeq: Soybean cultivars 'Fukuyutaka' and 'Himeshirazu' were cultivated in a greenhouse at the National Agriculture and Food Research Organization (NARO) in Tsukuba, Ibaraki, Japan, and kept in the dark for one week before DNA extraction to reduce organelle content. Leaves were collected from about five seedlings each of 'Fukuyutaka' and 'Himeshirazu' (seeds from a single individual), and DNA was extracted from the bulked leaves using the protocol of Peterson et al. [9] with some modifications.
Samples for MinION: 'Himeshirazu' was cultivated in an artificial climate chamber at NARO. Genomic DNA was extracted from the newest fresh leaves of 'Himeshirazu' using the CTAB method with the following modifications. Leaves were homogenized in liquid nitrogen, and the tissue was transferred to preheated 2× CTAB DNA extraction buffer (2% CTAB, 0.1 M Tris-HCl pH 8.0, 1.4 M NaCl, 1% PVP, 20 mM EDTA) containing 80 μg/ml proteinase K. The samples were incubated in a water bath at 55 °C for 15 min and mixed occasionally by gentle inversion of the tubes. After removal from the water bath, an equal volume of chloroform-isoamyl alcohol (24:1) was added and the samples were mixed by inversion. They were spun down at 3000 rpm, and the supernatant was transferred to a new tube. The supernatant was combined with an equal volume of isopropanol, mixed by inversion and centrifuged at 14000 rpm for 5 min (MX-201, TOMY Seiko Co., Ltd, Tokyo, Japan). The pellets were washed twice with 70% ethanol, air-dried at room temperature and dissolved in 50 μl of low TE buffer (10 mM Tris-HCl, 0.1 mM EDTA pH 8.0). The DNA concentration was measured using a NanoDrop (Thermo Fisher Scientific Inc., USA) and a Qubit (Thermo Fisher Scientific Inc.).
MinION: A total of 10 ng of DNA from 'Himeshirazu' was used in the PCR with primers CCW2-2_F (5'-TGACTGATCCTGCTGTGAGAATGTT-3') [Chr07:4559602-4559619] and CCW2-8_R (5'-TGTAACGTAGGAAAATGACAACACTACATC-3') [Chr07:4602994-4602971] to amplify a region of approximately 11 kb in the reference Gmax275 genome. PCR was performed on the GeneAmp PCR System 9700 (Thermo Fisher Scientific Inc.) using PrimeSTAR GXL DNA Polymerase (Takara Bio Inc., Shiga, Japan). The PCR conditions were as follows: initial denaturation at 94 °C for 1 min, followed by 30 cycles of denaturation at 98 °C for 10 s and combined annealing and extension at 68 °C for 10 min. The PCR products were electrophoresed on a 0.8% agarose gel with the HindIII DNA ladder (Takara Bio Inc., Shiga, Japan) and stained with ethidium bromide. The amplicon size from 'Himeshirazu' was approximately 18 kb (between 9416 bp and 23130 bp fragment of
Table 3 Summary of blastn results. The 18 "pass" reads aligned to the target sequence of the reference genome (Gmax275

Identification of the unique genomic sequence in the CCW-2 region of 'Himeshirazu' using MinION long-read sequence data
The 28,725 reads derived from the MinION sequencing platform were input to canu-1.6 with the options (-p asm -d gmax_amplicon genomeSize=15000 correctedErrorRate=0.5 -nanopore-raw all.fastq gnuplotTested=true useGrid=false). After quality control and trimming, only 18 long reads remained. The homology of the 18 reads to the CCW-2 region was analyzed by blastn in BLAST+ [16], which detected an insertion region of 7.2-7.5 kb that did not hit the reference sequence (Table 3). Four of the 18 reads showed a tandem repeat sequence and were about twice the length of the PCR product, suggesting that these four reads are chimeras. A consensus sequence was then generated from the 14 remaining MinION reads. Comparison of the consensus sequence with the Gmax275 reference genome detected a 7.7-kb insertion (breakpoint junction at Chr07:4588576-4588579 [TGGA]) (Fig. 3).
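The idea of an insertion appearing as a stretch of the query that does not hit the reference can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis pipeline used here. The `Hit` record, the function name and all coordinates are hypothetical, and the sketch assumes plus-strand hits given as 1-based query and subject intervals, as in blastn tabular output.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """One alignment block; field names follow blastn tabular columns."""
    qstart: int  # 1-based start on the query (read or consensus)
    qend: int    # 1-based end on the query
    sstart: int  # 1-based start on the subject (reference)
    send: int    # 1-based end on the subject


def largest_query_gap(hits: List[Hit]) -> Optional[Tuple[int, int, int]]:
    """Return (gap_start, gap_end, gap_length) for the largest stretch of the
    query left uncovered between hits, or None if no such stretch exists."""
    if not hits:
        return None
    ordered = sorted(hits, key=lambda h: h.qstart)
    best = None
    covered_to = ordered[0].qend  # right-most query position covered so far
    for hit in ordered[1:]:
        gap_len = hit.qstart - covered_to - 1
        if gap_len > 0 and (best is None or gap_len > best[2]):
            best = (covered_to + 1, hit.qstart - 1, gap_len)
        covered_to = max(covered_to, hit.qend)
    return best


# Hypothetical hits: two blocks flanking an unaligned 7,700-bp query segment
# while remaining adjacent on the reference.
example_hits = [
    Hit(qstart=1, qend=5000, sstart=100001, send=105000),
    Hit(qstart=12701, qend=18000, sstart=105001, send=110300),
]
print(largest_query_gap(example_hits))
```

Tracking the right-most covered query position, rather than comparing only neighboring hits, avoids reporting a false gap when one hit is nested inside another.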

Declaration of Competing Interest
The authors declare that they have no competing financial interests or personal relationships that can influence the work reported in this paper.