Transcriptomic dataset of cultivated (Sesamum indicum), wild (S. mulayanum), and interspecific hybrid sesame in response to induced Macrophomina phaseolina infection

We report here transcriptome sequencing data from control and infected sesame genotypes. Sesame is an emerging oilseed crop [1]. The destructive soil-borne fungus Macrophomina phaseolina (Tassi) Goid. causes charcoal rot of sesame, leading to high (>50%) yield loss. Most of the high-yielding sesame cultivars (Sesamum indicum) of India are susceptible to charcoal rot. Wild sesame, Sesamum mulayanum, shows a high degree of tolerance against many pathogens [2]. We earlier developed an interspecific hybrid between Indian cultivated sesame and S. mulayanum. The parents and the F6 recombinant constitute the three experimental genotypes in the present report. The seedlings were infected with M. phaseolina, and the transcriptome data of the infected and control (mock-inoculated) plants are presented. RNA-seq on the Illumina NovaSeq 6000 platform generated 2.9 × 10⁸ paired-end reads. We deposited the data in the NCBI Sequence Read Archive (SRA) under accession number PRJNA642699. The de novo assembly of clean reads generated 106,295 unigenes with an average length of 1,342 bp, covering 1.42 × 10⁸ nucleotides. Screening of the 106,295 unigenes with MISA and SAMtools identified 26,880 simple sequence repeats (SSRs), 90,181 single nucleotide polymorphisms (SNPs), and 25,063 insertions/deletions (InDels). Apart from mono-nucleotide repeats, di-nucleotide repeats (42.51%) were the most abundant among the SSRs, followed by tri-nucleotide repeats (14.28%). Subsequently, we designed 22,494 primer pairs based on perfect di- and tri-nucleotide SSRs. Transitions (Ts, 60%) were the most abundant substitution type among the SNPs, followed by transversions (Tv, 40%), with a Ts/Tv ratio of 1.48. The development of genic-SSR markers and SNP information will pave the way for molecular marker-assisted breeding of sesame for tolerance against charcoal rot.


© 2020 The Author(s

Value of the Data
• This is the first report of a de novo transcriptome dataset of a commonly cultivated Indian sesame (Sesamum indicum), a wild sesame (S. mulayanum) and their interspecific hybrid in response to infection by Macrophomina phaseolina (Mp), the cause of charcoal rot.
• This transcriptome dataset can help unravel the mechanism of resistance to Mp by identifying defence-related genes and pathways involved in the plant-pathogen interaction in sesame.
• Processed SSR/SNP data can be used to develop molecular markers for charcoal rot tolerance.
• The dataset will foster future molecular marker-assisted breeding of sesame.

Data Description
In the present report, we performed transcriptome analyses using Illumina technology on leaf RNA samples of three sesame genotypes (S. indicum, S. mulayanum, and an interspecific hybrid, designated as recombinant throughout the manuscript) in the control and Mp-infected states. This analysis generated a total of 290,670,920 raw sequencing reads from a 200 bp insert library (Table 1, Supplementary Fig. 1). Table 2 gives details of the quality of the RNA used for library preparation. After screening the quality of the data (base quality and Phred score), the raw reads in FASTQ format were submitted to the NCBI Sequence Read Archive (SRA) in BioProject PRJNA642699 under the accession IDs SRX8648465, SRX8648466, SRX8648467, SRX8648469, SRX8648470, and SRX8648471 (Supplementary Table 1). A total of 286,237,946 clean reads were obtained after trimming the adaptors and removing low-quality bases. De novo assembly of these reads with the Trinity program resulted in 106,295 unigenes, with an average of 94.64% Q > 30 and 46.76% GC (Table 3). The length of the unigenes ranged from 201 to 14,433 bp, with an average length of 1,342 bp. There were 27,070 (25.46%) unigenes with a length of 200-499 bp and 31,059 (29.21%) unigenes with a length of 500-999 bp. Unigenes longer than 1,000 bp and longer than 2,000 bp accounted for 25,049 (23.56%) and 23,117 (21.74%), respectively (Fig. 1, Table 3).
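The read and unigene counts above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the bin labels for the two longest classes ("1000-1999 bp" and ">=2000 bp") are an interpretation, adopted because the four reported counts add up exactly to the 106,295 unigenes, and the variable names are hypothetical.

```python
# Counts copied from the Data Description text.
raw_reads = 290_670_920
clean_reads = 286_237_946
total_unigenes = 106_295

# Assumption: the two longest classes are read as non-overlapping bins,
# since the four counts then sum to the total number of unigenes.
length_bins = {
    "200-499 bp": 27_070,
    "500-999 bp": 31_059,
    "1000-1999 bp": 25_049,
    ">=2000 bp": 23_117,
}

# Share of raw reads retained after adaptor trimming and quality filtering.
retained_pct = 100 * clean_reads / raw_reads

# Share of unigenes falling in each length class.
bin_pct = {label: 100 * n / total_unigenes for label, n in length_bins.items()}

bins_cover_all_unigenes = sum(length_bins.values()) == total_unigenes

print(f"clean reads retained: {retained_pct:.2f}%")
for label, pct in bin_pct.items():
    print(f"{label}: {pct:.2f}%")
print("bins sum to total:", bins_cover_all_unigenes)
```

The percentages printed this way are rounded to two decimals, so they may differ in the last digit from values in the text that appear to have been truncated.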
Notes to the sequencing statistics table: […] the number of reads after filtering. Clean Bases: number of clean reads multiplied by read length, reported in gigabases (G). Error Rate: average sequencing error rate e, related to the Phred score by Q_phred = -10 log10(e). Q20: percentage of bases with a base-call accuracy greater than 99%. Q30: percentage of bases with a base-call accuracy greater than 99.9%. GC content: percentage of G and C among all bases.
We identified 26,880 SSRs from 20,842 (19.6%) unigenes, with an average frequency of one SSR per 5.3 kb. More than one SSR was present in 4,596 unigenes, and the number of SSRs in compound formation was 1,811 (Table 4). Apart from the mono-nucleotide repeats (11,352, 42.23%), di-nucleotide (11,429, 42.51%) and tri-nucleotide repeats (3,840, 14.28%) together constituted 56.79% of the identified SSRs (Fig. 2, Table 5). The microsatellite frequency decreased with the increase of repeat units for all the SSR types (Fig. 3). The repeat numbers ranged from 10-24 for mono-nucleotides, 6-14 for di-nucleotides, 5-13 and 38 for tri-nucleotides, 5-8 for tetra-nucleotides, 5-7 for penta-nucleotides, and 5-7 and 12 for hexa-nucleotides (Supplementary Table 2). In the di-nucleotide SSR class, AG/CT was the largest SSR motif (5,808, 21.60%), followed by AT/AT (3,244, 12.06%) and AC/GT […] Table 4). The di-nucleotide repeats (3,242, 55.17%) were most abundant in the 5′ UTR, whereas tri-nucleotide (815, 82.74%) and mono-nucleotide (2,833, 61.70%) repeats were preferentially present in the CDS and the 3′ UTR, respectively (Fig. 5, Supplementary Table 4). We were unable to classify 13,617 (54.32%) […] (Fig. 6, Table 6). The primer sequences, expected product sizes and Tm for 7,638 genic-SSR primer pairs are provided in Supplementary Table 5. We identified 90,181 SNP loci, and the average SNP density in the whole transcriptome was 0.63 per kb.
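To make the SSR screening reported above more concrete, the following is a minimal sketch of a perfect-microsatellite search using regular expressions. It is not MISA itself and does not reproduce its handling of compound SSRs. The minimum repeat counts are an assumption taken from the lower bounds of the repeat ranges reported here (10 for mono-, 6 for di-, and 5 for tri- to hexa-nucleotides), and the function names and example sequence are hypothetical.

```python
import re

# Assumed minimum repeat counts per motif length, taken from the lower
# bounds of the repeat ranges reported in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    A simplification: matches are searched independently per motif length
    and compound SSRs are not merged.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, repeats))
    return sorted(hits)


# Hypothetical test sequence with one di-, one mono- and one tri-nucleotide SSR.
example = "GGCT" + "AG" * 7 + "TTCG" + "A" * 12 + "CGG" + "CAT" * 5 + "GG"
ssrs = find_ssrs(example)
print(ssrs)
```

Start positions are 0-based. The primitive-motif check prevents a mono-nucleotide run such as poly-A from also being reported as an "AA" di-nucleotide repeat.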
The details of the SNPs, the SNP-containing unigenes, and their position and distribution are presented in Supplementary Table 6. The number of SNPs in each sesame genotype and infection stage varied from 40,287 to 49,929 (Table 7). The SNPs were further classified into non-coding and coding types. The proportion of non-coding SNPs was lowest in the infected S. indicum sample (60.08%) and highest in the S. mulayanum control (62.25%). Synonymous SNPs (22.21-24.12%) were more frequent than non-synonymous SNPs (15.20-15.80%) (Table 7). Details of the non-synonymous SNPs, including the nucleotide and predicted amino acid substitutions, are given in Supplementary Table 7. The total numbers of transition (Ts) and transversion (Tv) mutations were 53,974 (60%) and 36,270 (40%), respectively, with a Ts/Tv ratio of 1.48 (Fig. 7, Table 8). Among the transition mutations, G/A and C/T showed high occurrences of 15.50% and 15.37%, respectively. The C/G (5.31%) and G/C (5.20%) mutations occurred more frequently than the other transversion types (Table 8). Finally, 25,063 insertions and deletions (InDels) were recorded, with an average of one InDel per 5.69 kb of transcriptome sequence. The details of the InDels are presented in Supplementary Table 8.
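The transition/transversion classification used above follows the standard rule: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any purine-pyrimidine exchange is a transversion. The sketch below illustrates that rule and recomputes the ratio from the two reported totals. The function names are hypothetical and this is not the pipeline used to produce Table 8.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def substitution_type(ref, alt):
    """Classify a single-base substitution as 'Ts' (transition) or 'Tv' (transversion)."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "Ts" if same_class else "Tv"


def ts_tv_ratio(substitutions):
    """Ts/Tv ratio for an iterable of (ref, alt) pairs."""
    counts = Counter(substitution_type(ref, alt) for ref, alt in substitutions)
    return counts["Ts"] / counts["Tv"]


# Ratio recomputed from the totals reported in the text.
reported_ts, reported_tv = 53_974, 36_270
reported_ratio = reported_ts / reported_tv
print(f"Ts/Tv from reported totals: {reported_ratio:.3f}")
```

Recomputed this way, the ratio is close to 1.49 when rounded to two decimals, in line with the reported value of 1.48.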

Experimental Design, Materials and Methods
The overall experimental design is depicted in Fig. 8 .

Plant material
In this study, the parental sesame genotypes were a high-yielding cultivar of Indian sesame (Sesamum indicum L., IC 131989, NBPGR germplasm collection, India) and wild sesame (S. mulayanum Nair). The third genotype was a recombinant inbred line (RIL), which we developed through interspecific hybridization. We maintain these genotypes in the experimental plots of Madhyamgram Experimental Farm (MEF), Bose Institute, Kolkata, India.

In vitro infection
A pure culture of Macrophomina phaseolina (Mp) was maintained on potato dextrose agar (PDA) plates at 30 ± 1 °C for active growth. The inoculum was prepared by multiplication in potato dextrose (PD) broth under agitation until the development of micro-sclerotia (48 h). The micro-sclerotia were collected by filtration, rinsed and diluted with sterile distilled water. In parallel, healthy seeds of the three sesame genotypes were surface-disinfected with 0.5% mercuric chloride for 10 min and rinsed three times in autoclaved double-distilled water. The seeds were transferred aseptically to plastic pots filled with pre-autoclaved soil-rite for germination. The seedlings (21 days old) were transplanted to identical pots containing 350 micro-sclerotia g⁻¹ of soil-rite. We maintained the pots in a growth chamber for 72 h with a 14 h light/10 h dark cycle. The control sets were mock-inoculated with sterile water and kept under the same conditions [3]. The experiment was laid out in a completely randomized design (CRD). There were three replicates for both the control and the Mp-inoculated genotypes, each having four plants: three sesame genotypes × four biological replicates × two treatments (control, infected).

RNA isolation and library construction for sequencing
Total RNA was isolated from leaves of the control and infected plants of the three genotypes using the Spectrum™ Plant Total RNA Kit (Sigma). The Agilent Bioanalyzer 2100 system (Agilent Technologies, USA) with the RNA Nano 6000 Assay Kit was used to check RNA integrity (RIN) and quantity. Total RNA (1 μg) was processed using the NEBNext® Poly(A) mRNA Magnetic Isolation Module (NEB E7490), and six libraries were prepared with the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA) following the manufacturer's instructions (E7530). Index codes were added to attribute sequences to each sample. After the quality check procedures, mRNA was enriched from total RNA using oligo(dT) beads. The mRNA was then fragmented randomly in NEBNext First Strand Synthesis Reaction Buffer (5X) using divalent cations under elevated temperature.
For the synthesis of the first strand of cDNA, we used M-MuLV Reverse Transcriptase (RNase H⁻) and random hexamer primers. This was followed by generation of the second strand by nick translation, for which we used a custom second-strand synthesis buffer (Illumina) containing dNTPs, RNase H and DNA polymerase I. The remaining overhangs were converted into blunt ends through exonuclease/polymerase activities. After adenylation of the 3′ ends of the DNA fragments, NEBNext adaptors with a hairpin loop structure were ligated to prepare for hybridization. The AMPure XP system (Beckman Coulter, Beverly, USA) was used to purify the library fragments and to select cDNA fragments in the range of 250-300 bp. Subsequently, 3 μl of USER Enzyme (NEB, USA) was added to the adaptor-ligated cDNA, and the reaction mixture was incubated at 37 °C for 15 min, followed by 5 min at 95 °C. PCR was conducted with Phusion High-Fidelity DNA polymerase, universal primers and Index (X) primers to amplify the size-selected cDNA. After purification, the PCR products were used to build the library.
The quality of the library was evaluated with the Agilent Bioanalyzer 2100 system. The cBot Cluster Generation System was used to cluster the index-coded samples using the PE Cluster Kit cBot-HS (Illumina). After cluster generation, the six libraries were sequenced on the Illumina NovaSeq 6000 to generate paired-end reads.

De novo transcriptome assembly
Raw reads in FASTQ format that passed Illumina's quality control were first processed with in-house Perl scripts. Reads containing adapters or poly-N and low-quality reads were removed to obtain the clean reads [4]. The GC content, sequence duplication level, Q20 and Q30 of the clean data were then calculated. We deposited the raw reads in FASTQ format in the NCBI SRA database. All the downstream analyses were based on the high-quality clean data. Transcriptome assembly was accomplished with the clean reads using Trinity, with 'min_kmer_cov' set to 2 and all other parameters set to default [5]. The de novo transcriptome filtered by Corset (V 1.05) was used as a reference.
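The read filtering and the Q20/Q30/GC statistics described above can be sketched in Python as follows. This is an illustration, not the in-house Perl scripts, whose exact criteria are not given here. The filter thresholds (more than 10% N, or more than 50% of bases with Phred quality of 5 or less) and the Phred+33 quality encoding are assumptions, and all function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (sequence, quality) pairs from FASTQ text lines (4 lines per read)."""
    records = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(records) - 3, 4):
        yield records[i + 1], records[i + 3]


def phred_scores(quality, offset=33):
    """Decode a quality string; Phred+33 encoding is assumed."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(e) to get the base-call error probability e."""
    return 10 ** (-q / 10)


def keep_read(seq, quality, max_n_frac=0.10, low_q=5, max_low_q_frac=0.50):
    """Assumed filter: drop reads rich in N or dominated by low-quality bases."""
    n_frac = seq.upper().count("N") / len(seq)
    scores = phred_scores(quality)
    low_q_frac = sum(q <= low_q for q in scores) / len(scores)
    return n_frac <= max_n_frac and low_q_frac <= max_low_q_frac


def summarize(reads):
    """Percentage of bases at Q>=20 and Q>=30, and GC percentage, over all reads."""
    total = q20 = q30 = gc = 0
    for seq, quality in reads:
        scores = phred_scores(quality)
        total += len(scores)
        q20 += sum(q >= 20 for q in scores)
        q30 += sum(q >= 30 for q in scores)
        gc += sum(base in "GC" for base in seq.upper())
    if total == 0:
        return {"Q20": 0.0, "Q30": 0.0, "GC": 0.0}
    return {"Q20": 100 * q20 / total, "Q30": 100 * q30 / total, "GC": 100 * gc / total}


# Hypothetical two-read example: one good read and one poly-N read.
example_fastq = "@read1\nACGT\n+\nIIII\n@read2\nNNNN\n+\n####\n".splitlines()
reads = list(parse_fastq(example_fastq))
clean = [read for read in reads if keep_read(*read)]
stats = summarize(clean)
print(len(reads), len(clean), stats)
```

For paired-end data, a real pipeline would normally discard both mates when either one fails the filter. That step is omitted here for brevity.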

Ethics Statement
This work did not involve any human or animal subject.

Supplementary Materials
Supplementary material associated with this article can be found in the Mendeley Data repository at http://dx.doi.org/10.17632/nk27dkn5d7.1

Declaration of Competing Interest
GG is funded by an intramural grant of Bose Institute, Department of Science and Technology, Government of India. DD and VA are funded by research fellowships from the University Grants Commission of India. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.