Whole Genome Sequencing and Analysis of Godawee, a Salt Tolerant Indica Rice Variety

Godawee is a cultivated salt tolerant Oryza sativa rice variety indigenous to Sri Lanka, but its genetic basis is unknown. The whole genome of Godawee was sequenced using Illumina paired-end technology. The reads were mapped to Oryza sativa Nipponbare reference genome. Investigation of genome wide variation patterns resulted in the identification of 2,231,717 SNPs and 480,460 InDels. In silico analysis identified 192,249 non-synonymous SNPs in 31,287 genes. Variants in 28 Salt Tolerance Related Genes (STRGs) were examined. The OsHKT 2;1 gene had the largest number of SNPs in comparison to the reference genome. 16 non-synonymous SNPs that were predicted to have functional effects either by SIFT, Provean or SNAP algorithms were detected in the STRGs. The upstream regions of the STRGs were examined for cis-regulatory elements found in STIFDB, Plant CARE and PLACE databases. The most striking of this was the WRKY cis-acting family of elements which were found in abundance in the upstream regions of OsAPx8, OsMSR2, OsTIR1, OsHKT2;3, OsHKT14 and OsSOS1 genes. Sequencing and whole genome analysis of Godawee helped to understand the genetic basis of its salinity tolerance which is a complex trait involving multiple factors.


Introduction
Rice has been consumed in tropical and subtropical Asia from as far back as 8000 BC [1].Over the years the process of farming and selection for specific features has changed the wild rice, Oryza rufipogon, consumed by ancient people into the improved species, Oryza sativa, which is the major staple food of a third of the world population today [2].
The United Nations Population Fund (UNFPA) estimates that the world population would reach 9.6 billion by the year 2050 [3].To ensure food security for this increasing population food production has to increase by approximately 44 million metric tons annually from now on.Unpredictable climate change and abiotic stresses such as soil water scarcity (drought) and salinity hamper achieving this goal [4][5][6][7][8].
Accumulation of water-soluble salts in the soil is high in arid and semi-arid regions.No part of the world however is safe from salinization [9].Soil that has an electrical conductivity of >4 dS m-1 is considered to be saline soil [10].The osmotic effect of saline water reduces the uptake of water by plants.The excess ions adversely affect plant cells by efflux of intracellular water.Salinity is also known to increase the nutritional imbalance in plants [11].These mechanisms play a major role in inhibiting plant growth [12].
Rice is the most sensitive cereal crop for salinity [13].The loss of rice yield in saline lands is estimated to be 30-50% per annum [14].During evolution plants have developed mechanisms to sense changes in the environment and adapt to them [15].Rice has also developed mechanisms to withstand salinity [16].People have selected and cultivated such salt tolerant rice varieties from time immemorial.
There are three common techniques that are used to produce rice plants that can tolerate salinity.They are conventional breeding; identification of quantitative trait loci (QTLs) and marker assisted selection or direct screening of rice plants in a saline environment [17,18]; and genetic modification by introducing Salt Tolerance Related Genes (STRGs).The recently introduced technique of targeted gene editing promises great potential for the alteration of genes without affecting characteristics of elite genotypes [19][20][21][22].Genetic modification requires the identification of the relevant genes in stress tolerant pathways and genetic variants in those genes that confer functional advantage.Whole genome sequencing of plants on next generation genome sequencing platforms makes it possible to do this faster and more efficiently than before [23].
Godawee, a Sri Lankan traditional rice variety known for its salinity tolerance [24,25], was sequenced as the initial step of a whole genomebased polymorphism study to identify genes involved in salinity tolerance and genetic variants in those genes that confer functional advantage.In this paper we report the initial results of this work.leaves using a DNeasy Plant Mini Kit (Qiagen, Germany) as mentioned in the manufacturer's protocol.

Sequence and assembly
Library construction and sequencing: A paired-end sequencing library with inserts of approximately 600 bp in size was constructed using 50ng of genomic DNA according to the manufacturer's protocol using a Nextera sample preparation kit (Illumina Inc., USA).Agarose gel electrophoresis was performed to confirm the fragment sizes of the library.Approximately 15 pg of the DNA library was denatured using freshly prepared 0.2N NaOH and sequencing was performed using the TruSeq v3 kit (Illumina Inc., USA) on a Miseq platform.The initial cluster density was 1548 K/mm2.At the end of a 55-hour long sequencing run 18773.7 MB of sequencing reads were produced.All reads passed the Illumina quality filter.Of these reads 11500 MB (54 million sequencing reads) were of high quality (above Q30).The reads were obtained in FASTQ file format for further analysis.
Reads mapping: Nextera DNA adapter trimmed reads generated on the Illumina Miseq platform were trimmed to remove low quality bases and filtered using 'Trimmomatic' (http://www.usadellab.org/cms/?page=trimmomatic/) by retaining the bases with a minimum Phred quality score of 30.Burrows Wheeler alignment (BWA) tool [26] with default parameters was used for alignment of all the quality filtered paired end reads to Oryza sativa L. cv.Nipponbare reference genome -the unified build release Os-Nipponbare-Reference-IRGSP-5.0 (International Rice Genome Sequencing Project).BWA aligner tool [26] with default parameters was also used to map reads to chloroplast (DDBJ Acc.No: X15901) and mitochondrial (DDBJ Acc.No: DQ167400) genome references.
The sequences that did not map to the Nipponbare reference genome were assembled into contigs using the ABySS tool with default parameters [27,28].Contigs larger than 1000 bp were filtered and Blasted against the Oryza sativa japonica reference genome to identify the divergent homologs in the Godawee genome followed by annotation using Blast2GO software through Blastn analysis against the NCBI nucleotide database [29,30].

Variants identification and annotation
The aligned reads in the SAM file was converted to BAM and sorted using Sort of SAMtools V0.1.19[31].PCR duplicates were removed using the Picard tools V0.1.19(https://broadinstitute.github.io/picard/).The GATK tool [32] was used with default parameters for variant calling followed by identification of SNPs and InDels.
The variant calling file (VCF) was filtered using parameters such as mapping quality ≥ 55, base quality ≥ 30, variant quality ≥ 90, number of reads per base between 5 and 75 and the distance of adjacent variants ≥ 5.The rice7 gene model database for Oryza sativa (http:// sourceforge.net/projects/snpeff/files/databases/v3_6/snpEff_v3_6_rice7.zip) and SnpEff V3.6 tools were used to annotate the filtered variants [33].
SNPs and InDels were annotated as genic and intergenic.The SNPs were separated on the basis of transition (C/T and G/A) and transversion (C/G, T/A, A/C and G/T).Further classification of SNPs and InDels found in the genic region was carried out on the basis of their location (i.e.exons, introns, 5'UTRs, 3'UTRs, and splice-site regions).SNPs in coding region were also classified into synonymous (cause no change in amino acid), non-synonymous (cause change in amino acid), stop gain (introduce a stop codon), stop loss (remove existing stop codon), start gain (introduce a start codon), and start loss (remove existing start codon).
Densities of SNPs and InDels in every 100 kb region were calculated to locate hyper-and hypo-variable regions of the genome.Chromosome regions containing SNPs/kb>5 or InDels/kb>1 were defined as hyper-variable regions whereas those containing SNPs/ kb<0.1 or InDels/kb<0.01were defined as hypo-variable regions.

Next generation sequencing (NGS) and mapping
Whole genome sequencing of Godawee generated 11.5 GB of extremely high quality sequence data (54 million sequencing reads).Phred quality score of more than 90% of the reads was >30.These reads were first mapped to Oryza sativa L. cv.Nipponbare reference genome using the Burrows Wheeler Alignment (BWA) Tool [26].The overall mapping rate was 0.97 (52.7 × 10 6 reads).The depth of coverage was 22x.The reads covered 91.65% of the reference genome after duplicate masking (Figure 1).Majority of the mapped reads (78.7%=41.5 × 10 6 reads) were uniquely mapped to chromosomes.The remainder mapped to multiple locations on the reference genome (Figure 1).Unmapped reads (1.3 × 10 6 ) were de novo assembled (described later).

Variant identification
The SNPs and InDels in O. sativa L. cv.Godawee were determined with reference to the Nipponbare reference genome.A total of 4,334,404 variants (3,747,311 SNPs and 587,093 InDels) were identified prior to quality filtering.Quality filtering yielded 2,231,717 SNPs and 480,460 InDels (235,905 insertions and 244,555 deletions).Homozygous to heterozygous variant ratios were calculated to be 5.27 for SNPs and 4.65 for InDels.The highest total number of variants was on chromosome 1 (316,158) and the lowest number was on chromosome 9 (171,807).Average number of variants per chromosome was 226,015 at the rate of one variant for every 138 bases (Table 1).The density of occurrence of variants was calculated to determine the genomic distribution of SNPs and InDels.The highest SNP density was on chromosome 10 (692.1 per 100 kb) and the lowest density was on chromosome 5 (509.4 per 100 kb).The highest InDel density was on chromosome 11 (143.2 per 100 kb) and the lowest was on chromosome 5 (108.6 per 100 kb) (Table 1).The average density of variants across the genome was 600.3 SNPs per 100 kb and 128.7 InDels per 100 kb.The genome comprised of 2485 high SNP density regions (SNP/kb>5) and 22 low SNP density regions (SNP/kb<0.1).Chromosome 1 harboured the highest number of high SNP density regions (308) and chromosome 9 harboured the lowest number of high SNP density regions (160) (Table 2).In comparison, the genome only contained 3 high InDel density regions (InDel/kb>5) and 90 low InDel density regions (InDel/kb<0.1).The highest numbers of low InDel density regions were detected in chromosome 5.

Annotation of variants
Analysis of SNPs showed that 757,841 were genic and 1,473,876 were intergenic (Figure 2a).Among the SNPs found in genic regions 129,515 (40.25%) were synonymous variants (silent variants) and 192,249 (59.75%) were non-synonymous variants.Among nonsynonymous variants 186,054 (57.82%) were missence variants and 6,195 (1.93%) were nonsense variants.Non-synonymous variants were found to be distributed over 31,287 genes and the SNP number varied between 1 to 241 SNPs per gene.In the genic regions the number of SNPs/kb ranged between 0.03 and 87.61.The transition to transversion ratio of SNPs ranged between 2.26 in chromosome 4 and 2.54 in chromosome 3 with an average of 2.41.Outliers of non-synonymous SNPs/kb in genic regions was calculated using Z-score and 13,016 genes were classified as outliers due to the presence of more than 2 non-synonymous SNPs/kb (Figure 3).Analysis of InDels showed that 142,938 were genic and 337,522 were intergenic.Further analysis showed that 28,739 InDels were in coding regions (Figure 2b).The majority of InDels were either mono-nucleotide (43.18%) or dinucleotide (17.07%).InDels with lengths ranging from -203 to +375 bp were detected.

Analysis of the variants in the salt tolerance related genes (STRGs) and functional effects of non-synonymous SNPs
Salt tolerant genes were identified by searching scientific literature and by examining gene expression data of the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/).Salt tolerant pathways are known to consist of approximately 28 Salt Tolerance Related Genes (STRGs) (Table 3).Analysis of these genes showed the presence of 2249 SNPs and 632 InDels.The majority of SNPs and InDels were in introns and other non-coding regions.The coding regions contained 78 non-synonymous SNPs and 76 synonymous SNPs (Table 3).Nonsynonymous SNPs ranged from 1 (OsNAC6, OsMAPK4, OsAPx8, OsMAPK5, OsAFB2, OsSIK1, OsHKT1;4 and OsNAC) to a maximum of 32 (OsHKT2;1) per gene.Non-synonymous (Ka) to synonymous SNP (Ks) ratio can be calculated to investigate positive (Ka/Ks>1), negative (Ka/Ks<1) or neutral (Ka/Ks=1) evolutionary selection of the genes.The Ka/Ks ratio showed 6 STRGs under positive selection, 7 STRGs under negative selection and 4 STRGs under neutral selection (Table 3).Table 3: Details of the SNPs and InDels identified in the salt tolerance related genes (STRG).
The summary of prediction of the effect of SNPs on protein function in STRGs is in Table 4.Only two SNPs were found to be deleterious by more than one algorithm.They were OsHKT 2;1 (1312T>C / Phe438Leu) and OsHKT 2;3 (470T>C / Ile157Thr).

Annotation of unmapped reads
De novo assembly of unmapped reads resulted in 283,020 contigs.
Since the average rice gene size is 3223 bp (Rice genome annotation project), the resultant contigs larger than 1000 bp (500 contigs) were selected for further analysis and blasted against the Oryza sativa japonica reference genome to identify the divergent homologs in the Godawee genome.The contigs did not show significant similarity with any sequence in the reference.These contigs were annotated with Blast2GO software using the Blastn algorithm in order to search for matches in the NCBI RefSeq database and annotated with Gene Ontology terms [29,30].Blast2GO analysis using the Blastn algorithm is summarised in Figure 4.This showed that the majority of the contigs matched Oryza sativa indica sequences (49.8%) or Oryza sativa japonica sequences (16.4%).
A total of 10 known CAREs associated with stress responsive pathways were analysed against upstream regions of the 28 STRGs and compared with the Oryza sativa reference genome in order to understand gene regulation.
CAREs involved in WRKY encoded transcriptional repression of gibberellin signalling were observed in high frequencies in the upstream regions of all 28 genes [40].The number of WRKY CAREs in the upstream region of the OsAP × 8 (Ascorbate peroxidase 8) gene in the Godawee genome, as compared to the reference (Nipponbare), was significantly higher.Furthermore, these were found in higher numbers in OsMSR2, OsTIR1, OsHKT2;3, OsHKT1;4 and OsSOS1 genes in Godawee compared to their counterparts in the reference.

Discussion
In this paper we report the sequencing and analysis of the Godawee genome with special reference to the genes related to salinity tolerance.We generated high quality sequence reads of the Godawee genome and mapped it to the Oryza sativa L. cv.Nipponbare reference genome, and performed a bioinformatic analysis.
The majority of the reads were uniquely mapped with others mapping to multiple locations.The mapping rate obtained was 0.97 with a depth of 22x.The reads covered 91.65% of the reference genome.In agreement with previous reports of similar genome wide studies of other rice varieties 8.35% of the reads did not map to the reference genome [41,42].Further analysis of the unmapped reads showed that the reads consisted of sequences which mapped to other Oryza sativa subspecies as has been reported previously [43].
Analysis of mapped reads showed that the distribution of SNPs in the Godawee genome was diverse.The number of SNPs in many regions was low indicating that there were SNP deserts as has been reported previously [41,44,45].A large SNP desert was detected in chromosome 5 between 9. The ratio of transition to transversion of SNPs that was found to be 2.427 with a higher number of transition SNPs than transversion SNPs is comparable with the ratios reported previously [44,46,47].It is hypothesised that the higher frequency of transition SNPs than transversion SNPs is likely to contribute to conservation of protein structure and RNA stability during evolution [48].Among transitions, C/T SNPs appeared to be more abundant than A/G SNPs.This is comparable with the previous findings in monocots [41,46,49].On the other hand, in agreement with previous reports on the occurrence of transversions, the frequency of T/A transversion SNPs in Godawee was higher than that of A/C, G/T and C/G transversion SNPs [42,46].Plant salinity tolerance occurs in two phases known as osmotic and ionic.The osmotic phase starts with an increase in salt concentration around the roots and is characterised by reduced rate of shoot growth and reduction in number of tillers.Ionic phase starts with the accumulation of salts above the toxic levels in old leaves and is characterised by the death of leaves [50].Mechanism of salinity tolerance in plants can be categorised as (1) ion homeostasis and compartmentalization, (2) ion transport and uptake, (3) biosynthesis of osmoprotectants and compatible solutes, (4) activation of antioxidant enzyme and synthesis of antioxidant compounds, (5) synthesis of polyamines, (6) generation of nitric oxide (NO), and (7) hormone modulation and involved several pathways [51].The STRGs that were identified for this investigation are involved in these mechanisms.
Analysis of the variants in the 28 STRGs in Godawee showed that all were polymorphic.Among the 2881 variants identified in STRGs, 2723 variants were found to be in non-coding regions, whereas 158 were in the coding regions.In order to predict the nature of selection on individual genes, the ratio of non-synonymous to synonymous SNPs (Ka/Ks ratio) was calculated for each gene [52].The results showed that OsAPx8, OsHKT1;1, OsNAC and OsPSY1 genes were under neutral conditions of evolution in Godawee.Several other STRGs were found to be still evolving under positive or negative selection.The most significant of all the STRGs was the highly polymorphic OsHKT2;1 gene (32, non-synonymous and 36 synonymous SNPs).OsHKT2;1 gene encodes a Na+/K+ transporter, a key component in Na+ efflux from roots and a major contributor to salinity tolerance in plants [53].The highly polymorphic nature of this gene suggest that it could be playing a major role in salinity tolerance of Godawee.Two SNPs, the 1312T>C SNP of the HKT2;1 gene that results in the substitution of the amino acid Phenylalanine by Leucine at position 438 in the HKT2;1 protein and the 470T>C SNP of the HKT2;3 gene that results in the substitution of the amino acid Isoleucine by Threonine at position 157 in HKT2;3 protein were predicted to be affecting their function by at least two function prediction algorithms.Amino acid motifs in conserved regions are known to be vital for protein function and stability [34,54].In addition to these two SNPs detailed analysis of all 28 STRGs to understand the genetic basis of salinity tolerance of Godawee resulted in the identification of 14

Figure 1 :
Figure 1: Classification of the Godawee reads mapped to Nipponbare reference genome.

Figure 2 :
Figure 2: Annotation of SNPs and InDels dentified in Godawee genome.(a) SNP annotation (b) InDel annotation.Annotated SNPs and InDels were classified based on their locations.

Figure 3 :
Figure 3: Distribution and skewness of the number of non-synonymous SNPs per kb in 31,287 genes of Godawee.The outlier value calculation indicated that 13,016 genes contained >2 non-synonymous SNPs per kb (blue bars).

Figure 4 :
Figure 4: Results of Gene Ontology (GO) analysis of unmapped reads using Blastn algorithm (Blast against NCBI nucleotide database).
2 and 13.4 Mb with 1.21 SNP/kb.Two more SNP deserts were identified in chromosome 4 (25.3 to 27.3 Mb, 0.91 SNP/kb) and 8 (23.8 to 24.4 Mb, 0.51 SNP/ kb).SNP deserts of 2.1 Mb and 0.7 Mb in size have been reported in monocots by Rathinasabapathi et al., (2015) [41].The presence of similar SNP deserts in Godawee rice may indicate the highly conserved function of these regions.

Table 1 :
Number and density of SNPs and InDels in the Godawee genome.

Table 2 :
Number of high and low SNP and InDel density regions in chromosomes.