Comparative analysis of the transcriptomes of two rice subspecies reveals differentially expressed genes associated with phenotypic differences and reproductive isolation

Background: Rice has been used as a model plant to study adaptation, genome evolution and reproductive isolation among species and the genetics and evolution of complex traits. Two subspecies of cultivated rice, Oryza sativa ssp. indica and O. sativa ssp. japonica , with reproductive isolation and differences in morphology and phenotypic differences, were established during the process of rice domestication. Results: To understand how domestication has changed the transcriptomes of the two rice subspecies and given rise to the phenotypic differences, we obtained approximately 700 Gb RNA-Seq data from 26 indica and 25 japonica plants, and identified 97,005 transcribed fragments and 7702 novel transcriptionally active regions. We also identified 1857 (4.58% in all genes) differentially expressed genes (DEGs) between indica and japonica rice. According to previous population genetic analyses, these DEGs may associate with the phenotypic differences between the two subspecies. Functional annotation of these DEGs indicates that they are involved in cell wall biosynthesis and reproductive processes. Furthermore, compared with the non-DEGs, the DEGs from both subspecies had more 5′ flanking regions with low polymorphism to divergence ratios, indicating a stronger positive selection pressure on the regulation of the DEGs. Conclusion: This study improves our understanding of the rice genome by comparatively analyzing the transcriptomes of indica and japonica rice and identifies DEGs that may be responsible for the reproductive isolation and phenotypic differences between the two rice subspecies.


4
In addition to being a staple food that feeds over 50% of the world's population, rice has also been used as a model plant for molecular, genetics and comparative genomic studies (Khush 1997 Indica rice is mainly cultivated in tropical and subtropical regions, whereas japonica rice is planted mainly in temperate regions or at higher altitudes in tropical and subtropical regions. During the process of domestication and breeding, these two subspecies have evolved characteristic morphological and agronomic traits that may contribute to intraspecific phenotypic adaptations (Vaughan et al. 2003;Wu et al. 2003; Zheng et al. 2016). The morphological features, including leaf color, seed size and apiculus hair length, cannot be used alone to definitively distinguish between the two subspecies due to the presence of morphological variants (Kato 1928;Matsuo 1997;Oka 1988). According to a previous report, the hybrid progenies of these two subspecies are sterile, and it is difficult to utilize the heterosis of their hybrids (Yang et al. 2012). However, the molecular mechanisms underlying the reproductive isolation and phenotypic differences remain largely unknown.
Thus far, there have been many studies on the molecular basis of the phenotypic differences between the two subspecies using different methods (Beukert et al. 2017; Yuan et al. 2017). Among the methods used to search for candidate genes from hybrids of indica and japonica rice, the most important one is the quantitative trait locus (QTL) (Huang et al. 2010; Zheng et al. 2017). Using this method, several studies have managed to isolate and characterize candidate genes that are 5 expressed in key tissues or at certain developmental stages from the two subspecies, indicating that gene expression variations contribute greatly to the phenotypic differences between the two subspecies (Lu et  between indica and japonica rice. Some of the identified DEGs were then confirmed by reverse transcription polymerase chain reactions (RT-PCR). This study has the following conclusions: (1) we report that the two rice subspecies have different gene expression profiles; (2) we propose the relationship between the DEGs and the phenotypic differences; and (3) we speculate that nucleotide diversity in gene regulatory regions may associate with the phenotypic differences and play an important role in crop domestication. Collectively, the results of this study improve our understanding of the rice genome.

Sample Collection and Total RNAIsolation
In a previous study, we collected seeds from 825 Oryza sativa ssp.  (Table S1).
For total RNA isolation, three 40-mm-long young panicles were collected from each rice plant. These panicles were subjected to RNA isolation using TRIzol reagent (Invitrogen, USA), according to the manufacturer's instructions. The purity and integrity of the RNA were determined by the NanoPhotometer ® spectrophotometer (IMPLEN, USA) and the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, USA), respectively. Thereafter, rRNA was removed from the total RNA samples and the rRNA-depleted total RNA was subjected to library 7 construction using the NEBNext ® Ultra™ Directional RNA Library Prep Kit for Illumina ® (NEB, USA), according to the manufacturer's instructions.

RNA-Seq Data Filtering and Assembly
Quality control of the raw data was performed with an in-house-developed PERL script. Low quality reads, reads with adaptor sequences and reads containing poly-N stretches were removed. The number of remaining clean reads for each sample is shown in Table S2. At the same time, Q20, Q30 and GC content of the clean reads were calculated (Table S2). To further evaluate the quality of the RNA-Seq data, we
We performed principal variance component analysis to evaluate the differences in gene expression levels between the two subspecies, and the Pearson correlation coefficients between RNA-Seq data and RT-PCR results were also calculated. To identify a gene as differentially expressed, we used the Cuffdiff program within the 8 Cufflinks. With the application of a model based on the negative binomial distribution (Trapnell et al. 2010), Cuffdiff presents statistical routines for the determining of differential expression in the transcripts of gene expression data. To control the false discovery rate, we filtered out genes that were expressed at a level of less than 20 mapped reads in any of the 51 plants. We also calculated the Bayesian posterior P-value of each gene using a previously described method (Storey 2004).

Data
To identify polymorphisms in all the DEGs and non-DEGs, we explored the were confirmed by a resampling test, which was repeated 1000 times.

GO Analysis
To annotate the functions of the DEGs, we performed Gene Ontology (GO) analysis using a GO Seq R package, in which the gene length bias was corrected. We also calculated the FDR-corrected P-values for the GO terms using hypergeometrical distribution. GO terms with corrected P-values less than 0.05 were considered to be 9 significantly enriched.

Quantitative RT-PCR Analysis
Quantitative RT-PCR was performed using the SYBR ® Premix Ex Taq TM Kit (TaKaRa, Japan) on an ABI PRISM 7900HT platform (Applied Biosystems, USA), according to the manufacturer's instructions. For each sample, two biological replicates were analyzed by three technical repeats. The rice ubiquitin-1 gene (LOC_Os03g13170) was used as the internal reference gene. The primers for qRT-PCR analysis are listed in Table S5.

Subspecies
To characterize the gene expression profiles at the reproductive stage of the two rice subspecies, we extracted total RNA from 40-mm-long young panicles of 26 indica and 25 japonica plants (Table S1). For each plant, equal amounts of total RNA isolated from three panicles was mixed to establish an RNA pool, which was then further processed using a previously described method ( (Table S2).
After filtering out low-quality reads and reads containing adaptor sequences (7% of all reads), we obtained approximately 4.52 billion clean reads (345. 48 and 331.99 Gb for indica and japonica plants, respectively) (Table S2). These clean reads were then mapped to the reference japonica rice genome (Nipponbare; IRGSP v7.0), with at most two mismatches allowed for each read, while ignoring the reads that were mapped to more than two locations in the reference genome (multi-mapped reads).
According to these criteria, we filtered out (~15%) of the clean reads and mapped 70.36-89.88% of the clean reads to the reference genome ( Fig. 1a; Table S3).
We then calculated the mapping rates for each sample and the average mapping rates for the two subspecies. The results showed a significant difference in the average mapping rate between the two subspecies. The average mapping rate for indica plants was 77.66%, whereas that for japonica was 83.88% (t=19.275 and p<0.001). Due to the fact that we used the japonica genome as the reference, we suspect that this difference may have been caused by mapping biases. Therefore, we decided to employ a previously constructed pseudo-transcriptome (Koenig et al 2013) of indica rice as the reference genome for the reads obtained from the indica rice samples. As a result, there was no significant difference in the average mapping rate between the two subspecies. The newly calculated average mapping rate for indica plants was 84.01%, whereas that for japonica remained 83.88% (t=0.12 and p=0.91).

Regions
We subsequently assembled the RNA-Seq clean reads into 97,005 transcribed fragments with a mean length of 884 bp (ranging from 50 to 20674 bp) (Fig 2a).
These fragments were aligned to the sequences of known rice genes (MSU 7.0) The sequence alignment results are shown in Fig 2b. Among the assembled fragments, 26.31% overlapped with exons, 36.31% fell into annotated exons, 16.57% fell into introns, and the remaining 20.77% were non-gene sequences (Fig 2b). We also found that 42% of rice genes had alternative transcripts. This percentage was much lower than that previously estimated based on microarray results (Jung et al. 2013).
In total, we identified 4579 novel transcriptionally active regions based on the sequences of rice genes and non-coding RNAs (ncRNAs) deposited in the Ensembl ncRNA and National Center for Biotechnology Information (NCBI) EST "others" databases (E-value<1e-6, Table S4). Among these regions, 61.7% were unknown protein-coding regions, 67.2% could be transcribed into single-exon transcripts, 42.71% shared more than 90% sequence identity with at least one known EST, and 12.14% were likely to encode non-coding RNAs. In addition, our RNA-Seq data expanded the untranslated regions (UTR) of 11.23% of rice genes.

Identification of DEGs between the Two Rice Subspecies
According to the MSU rice genome annotation database, 74.71-88.49% of the mapped reads overlapped with annotated genes for each sample. The number of reads mapped to a certain annotated gene ranged from 2 to more than 300,000, with a median number of 23 and 32 for indica and japonica plants, respectively. There were 40,580 genes mapped by two reads in at least two plants. In addition, we were able to establish linear relationships between the number of reads mapped to a certain gene and the gene expression level (FPKM) in all the samples (0.77 <R 2 <0.94; Fig. 3a). Given that we found significant differences in the number of reads mapped to a certain gene between the indica and japonica plants (t-test; P=0.012), we concluded that the two rice subspecies had significantly different gene expression profiles. Furthermore, principal variance component analysis also revealed significant differences in gene expression levels between japonica and indica individuals (Fig 3b).
We then employed fragments per kilobase of transcript per million mapped reads (FPKM) values to estimate the expression level of various genes. To validate the estimation results, we examined the relative expression levels of three genes with clear functional annotations (Table S5)  Of these DEGs, 576 and 1281 were expressed at lower and higher levels, respectively, in indica rice than in japonica rice (Fig. 5a). We then mapped these DEGs to the rice genome and found they are distributed throughout the 12 chromosomes (P>0.05; Fig. 5b). Furthermore, using REEF software (sliding window size was set to 500 kb to 1 Mb and the step size was set to 10 kb), we found that the DEGs tended to group in close proximity along the chromosomes (FDR <0.05) (Copper et al. 2006). However, we could not conclude that the DEGs were clustered.

Functional Annotation of the DEGs
We annotated the DEGs using enrichment analysis tool. According to the annotations, 21 Gene Ontology (GO) terms were significantly enriched (P<0.05; Fig.   6). These terms could be classified into two groups. The first group consisted of reproduction-related terms, including "reproduction", "recognition of pollen", 13 "multi-organism reproductive process" and "cell recognition". The second group consisted of cell wall biosynthesis-related terms, including "cell wall organization", "cellulose biosynthetic process", "cellulose metabolic process", "cellulose synthase activity" and "UDP-glucosyl transferase activity". The GO analysis showed that the genes encoding the binding proteins were enriched in these co-expressed genes (Fig. 6). Interestingly, genes expressed at lower levels in indica rice than in japonica rice were annotated with reproduction-related terms, whereas those expressed at higher levels in indica rice than in japonica rice were annotated with cell wall biosynthesis-related terms. Notably, the DEGs annotated with reproduction-related terms included the MADS-box genes, which comprise a gene family associated with flower development and reproduction (Becker and Theissen 2003). We identified six MADS-box genes, namely OsMADS26, OsMADS15 , OsMADS37 , OsMADS63 , OsMADS74, and OsMADS8 that were expressed at significantly different levels between the two subspecies (Table S1). Consistent with these findings, among the ten most significant morphological differences between the two subspecies, six were reproduction-related traits.

Effect of Selection Pressure on the Regulation of the DEGs
To explore whether artificial selection may have affected the expression of the DEGs, we investigated the gene regulatory regions (i.e., 5' and 3' flanking regions) and gene-coding regions of the DEGs (Fig 7). Specifically, we searched for genomic regions with polymorphism to divergence (π/Dxy) ratios ranked in the lowest 5% Nonetheless, these studies have demonstrated that the percentages of DEGs between two species are generally higher than those between two subspecies. For example, a previous transcriptomic study found that 3.3% of the 18,242 genes detected were expressed at significantly different levels between maize and its progenitor teosinte (Swanson-Wagner et al. 2010), whereas in another study, the percentage of DEGs between wild and weedy sunflowers (Helianthus annuus) was 5% (Lai et al. 2008). In this study, we found that 4.5% of all the genes detected were expressed at significantly different levels between indica and japonica rice, suggesting that the strong artificial selection altered the expression of a small proportion of rice genes. According to our GO analysis, genes that were expressed at lower levels in indica rice than in japonica rice were annotated with reproduction-related terms, whereas those that were expressed at higher levels in indica rice than in japonica rice were annotated with cell wall biosynthesis-related terms.
Therefore, it is reasonable to speculate that differences in gene expression patterns are one of the reasons for the reproductive isolation between the two rice subspecies.
The contribution of genetic drift and selection pressures to the genetic and phenotypic differences between species has become a research hotspot in DEGs is a strong indication for the involvement of positive selection during speciation. In this study we looked at the role of RNAs transcript in shaping the changes in cell wall biosynthesis and reproductive processes that occurred during rice domestication. We've found that they play an important role in this process.
Despite almost 20 years of genomics and genome-enabled studies of crop domestication, we still know remarkably little about the genetic basis of most domestication traits in most crop species. Early studied tended to go for "lowhanging fruit" -simple traits that were controlled by just one or two genes with easily identifiable mutations. Far more difficult is figuring out the more subtle developmental changes that were critical for a lot of the changes during crop domestication. This study offers a step in that direction, by examining one regulatory mechanism that has been critical for modulating domesticationassociated changes in rice cell wall biosynthesis and reproductive processes.

Conclusions
In summary, we identified 97,005 transcribed fragments, 7702 novel transcriptionally active regions and 1857 differentially expressed genes (DEGs) between indica and japonica rice. These results enhance our understanding of the rice genome by identifying DEGs that may be responsible for the reproductive isolation and phenotypic differences between the two rice subspecies. The evolution of Asian cultivated rice, a major crop for more than half of the world's population, has been a key topic of study. Our expression variations and sequence diversity confirm of gene expression and regulation regions allowed us to deduce the relationships between genome evolution and phenotype diversity, providing much new insight into the origin and domestication process of rice.