Characterization of the transcriptional divergence between the subspecies of cultivated rice (Oryza sativa)

Cultivated rice consists of two subspecies, Indica and Japonica, that exhibit well-characterized differences at the morphological and genetic levels. However, the differences between these subspecies at the transcriptome level remains largely unexamined. Here, we provide a comprehensive characterization of transcriptome divergence and cis-regulatory variation within rice using transcriptome data from 91 accessions from a rice diversity panel (RDP1). The transcriptomes of the two subspecies of rice are highly divergent. Japonica have significantly lower expression and genetic diversity relative to Indica, which is likely a consequence of a population bottleneck during Japonica domestication. We leveraged high-density genotypic data and transcript levels to identify cis-regulatory variants that may explain the genetic divergence between the subspecies. We identified significantly more eQTL that were specific to the Indica subspecies compared to Japonica, suggesting that the observed differences in expression and genetic variability also extends to cis-regulatory variation. Using RNA sequencing data for 91diverse rice accessions and high-density genotypic data, we show that the two species are highly divergent with respect to gene expression levels, as well as the genetic regulation of expression. The data generated by this study provide, to date, the largest collection of genome-wide transcriptional levels for rice, and provides a community resource to accelerate functional genomic studies in rice.

The unique natural and agronomic selection pressures placed on the wild progenitors and early protodomesticates resulted in drastic changes at the genetic level. Work by Huang et al. [2] showed considerable reduction in genetic diversity in Indica and Japonica compared with O. rufipogon. Such drastic reductions in genetic diversity are common following domestication. Moreover, the transition from an out-crossing/heterogamous nature of O. rufipogon to the autogamous breeding system of cultivated rice likely led to greater partitioning of genetic diversity among the two subspecies, and further differentiation of the two groups. These large genetic differences have been recognized for nearly a century as hybrids between Indica and Japonica exhibit low fertility [3]. More recently, these genetic differences have been realized with the availability of high density molecular markers and full genome sequences for both Indica and Japonica [2,[4][5][6][7][8][9][10][11][12]. For instance, Ding et al. [4] showed that approximately 10% of the genes in the Indica and Japonica genomes showed evidence of presence-absence variation or asymmetrical genomic locations. Several other studies have highlighted genetic differences between the subspecies as structural variants differences, gene acquisition and loss, transposable element insertion and single nucleotide polymorphisms [2,[5][6][7][8][9][10][11][12].
While the morphological and genetic differences of Indica and Japonica have received considerable attention, few studies have investigated the divergence between the two subspecies at transcriptome level [13][14][15]. Walia et al. [13] utilized genome-wide expression profiling to characterize the transcriptional responses for two Indica and Japonica cultivars to salinity. This study was performed to elucidate the mechanisms underlying the contrasting responses to stress exhibited by the cultivars, rather than examine the transcriptional difference between the subspecies. Moreover, separating genotypic differences from subspecies differences is not feasible with the low number of cultivars used in these studies. Lu et al. [14] compared transcriptional profiles of two Indica accessions and a single Japonica accessions and identified many novel transcribed regions, highlighted alternative splicing differences, and differentially expressed genes between accessions. Although these studies provided insights into the transcriptional differences between Indica and Japonica, given the small sample size, the scope for extending conclusions to a population level is limited. Jung et al. [15] leveraged the large number of public microarray databases to compare transcriptional diversity between the two subspecies. The 983 publicly available Affymetrix microarrays were classified into Indica and Japonica subspecies based on the cultivar name. This study showed that considerable differences in expression levels were evident between the two subspecies. However, large proportion of information is likely lost due to the heterogeneity in sample types (e.g. tissue, developmental stage) and varying growth conditions. Thus, a more highly controlled study that utilized a larger panel with genotypic information would provide greater insight into the differences in expression levels, as well as provide a mechanism for connecting transcriptional differences between the two subspecies with genetic variation.
The objective of this study is to examine the genetic basis of the transcriptional variation at a population level within the O. sativa species. By combining population and quantitative genetics approaches, we aim to elucidate the genetic basis of transcriptional divergence between the two subspecies. To this end, we generated transcriptome data using RNA sequencing on shoot tissue for a panel of 91 diverse rice accession selected from the Rice Diversity Panel1 (RDP1) [16][17][18]. Here, we show that transcriptional diversity between Indica and Japonica subspecies is consistent with diversity at the genetic level. Moreover, we connect transcriptional differences between the two subspecies with divergent patterns of cis-regulatory variation. This study is the first to document the transcriptional divergence between the major subspecies of cultivated rice at a population level, and provides insight into the genetic mechanisms that have shaped this transcriptional divergence.

Results
We selected 91 accessions to represent the genetic diversity within Rice Diversity Panel 1 (RPD1). Using the subpopulation assignment described by Zhao et al. [16] and Famoso et al. [17], shoot transcriptome data was generated for 23 tropical japonica, 23 indica, 21 temperate japonica, 13 admixed, 9 aus, and 2 aromatic accessions. Genes with low variance or expression within the expression set were filtered out, as these genes are uninformative for downstream analyses focused on natural variation in gene expression. A total of 25,732 genes were found to be expressed (>10 read counts) in at least one or more of the 91 accessions. This equates to about 46% of the genes present in the rice genome (total of 55,986 genes in MSUv7 build).

Divergence between the Indica and Japonica subspecies are evident at the genetic and transcriptional levels
To examine patterns of variation within the transcriptomics data, we performed principle component analysis (PCA) of transcript levels for the 91 accessions. Prior to PCA, lowly expressed genes were removed if they were not expressed (<10 reads) in at least 20% of the samples. This filtering removed approximately 33,311 genes, resulting in a total of 22,675 genes that were used for the principal component analysis based on the normalized read counts. For the genetic analysis, we used 32,849 SNPs. PCA analysis of the expression matrix resulted in a clear separation between the two subspecies along PC1, suggesting a significant transcriptional divergence between Indica and Japonica (Fig. 1). The first PC accounted for approximately 26.8% of the variation in gene expression. While PC1 was able to differentiate between the two subspecies at the transcriptional level, no clear clustering of accessions was observed along other PCs (Fig. 1). These results suggest that the two subspecies of cultivated rice have divergent transcriptomes, but the transcriptomes of the subpopulations are more similar. Consistent with these results, differentiation between the subspecies was clearly evident along PC1 using the genetic (SNP) data alone (Fig. 1a,b). The clustering of accessions along PCs 2-4 for the SNP data was consistent with those described by Zhao et al. [16] (Fig. 1), and were effective in discerning the two subpopulations in rice. These results collectively suggest that the two subspecies are highly divergent at the genetic and transcriptional levels.

Differential expression analysis reveals contrasting expression between subspecies
To further explore the differences and identify genes that display divergent expression between the two subspecies, the 91 accessions were first classified into Indica and Japonica-like groups, using the program STRUCTURE with the assumption of two groups and no admixture [19]. A total of 35 accessions were assigned to the Indica subspecies, while 56 were assigned to the Japonica sub- Fig. 1 Principle component analysis of markers and gene expression matrices. The top four principle components from PCA analysis of the genotypic data are pictured in A and B to illustrate the divergence of the major subpopulations in rice. The panels in C and D summarize PCA of expression data. PVE: percent variation explained by each component species. Next, a linear mixed model was fit for each of the 26,675 genes, where subspecies was considered a fixed effect and accession as a random effect. A total of 7,417 genes were found to exhibit contrasting expression between the two subspecies (FDR ≤0.001, Additional file 3). Of these genes, 4,210 (57%) showed significantly higher expression in Japonica relative to Indica, while 3,207 (43%) showed higher expression in Indica relative to Japonica.
This divergent expression levels observed between the two subspecies could be the result of the presence or absence of genes within the subspecies. To this end, we sought to identify genes showing a presence-absence expression variation (PAV). Genes with a read count greater than 10 were considered as expressed and coded as 1 while those with read counts less than 10 were coded as 0. These genes were further filtered, so that genes that were expressed in at least 20%, but no more than 80% of the samples were retained for downstream analyses. A logistic mixed effects model was fit for the 4,163 genes meeting this criteria. In total, 1,980 genes showed evidence of PAV between the two subspecies (FDR < 0.001; Additional file 3). This analysis, enriched for genes that were expressed at higher frequency in Japonica rice compared to Indica. For instance, 1,435 genes were found to be expressed at a significantly greater frequency in Japonica relative to Indica, while only 545 were found to be expressed predominately in Indica. Moreover, we detected significant enrichment for GO terms associated stress response (GO:00006950) and response to biotic stress (GO:0009607), as well genes with kinase activity (GO:0016301). Within Indica-specific genes, only a single GO category was enriched for oxygen binding activity (GO:0019825; Table 1). Moreover, 173 were identified with no evidence of expression in Indica while only 18 were identified in Japonica. Collectively, these results suggest that the divergence between Indica and Japonica subspecies may be due, in part, to differences in mean expression levels as well as presence-absence expression variation.

Japonica subspecies exhibits reduced genetic and transcriptional diversity
Several studies have shown that the unique domestication history of the two subspecies has resulted in large differences in the overall genetic diversity between the two subspecies, with Indica being more genetically diverse than Japonica [2,[20][21][22]. We next explored the variation in gene expression within each subspecies. Two metrics were used to examine the differences in diversity at both the genetic and transcriptional levels within each subspecies: nucleotide diversity (π) and the coefficient of variation (CV). Diversity analyses within each subspecies may be influenced by differences in sample size. Since the number of Japonica accessions were greater than Indica, a subset of 35 Japonica accessions were randomly selected for diversity analyses. The results for the full set of 56 Japonica accessions are provided as Fig. S1.
Expression diversity was estimated using the coefficient of variation (CV) for 22,675 genes. CV was significantly different between the two subspecies (Wilcoxon rank sum test, p < 0.0001; Fig. 2). The Indica subspecies exhibited approximately 12.6% higher expression diversity compared to Japonica. On average, CV in the Indica subspecies was 3.46, while in the Japonica subspecies the mean CV was 3.07. These results suggest that the transcriptional diversity is lower in the Japonica subspecies compared to Indica. CV estimates using the complete set of Japonica accession were similar (CV: 3.46 and 3.10 for Indica and Japonica, respectively; Fig. S1).
Genetic diversity within each subspecies was estimated using π for 33,543 SNPs in randomly selected 35 Indica and 35 Japonica accessions. Similar differences were observed for π as CV, however the differences between Table 1 Gene onotology (GO) enrichment analysis for genes exhibiting significant presence-absence expression variation (PAV) (FDR < 0.001). GO enrichment was conducted using AgriGO using the MSU V7 genome build without transposable elements as a background. GO enrichment was conducted separately for genes expressed predominately in each subspecies

Fig. 2
Genetic and expression diversity within Indica and Japonica accessions. a The coefficient of variation was used as an estimate of the diversity in gene expression within each subspecies. A subset of 35 Japonica accessions were randomly selected for diversity analyses to ensure that sample sizes were equal between the two subspecies. The vertical dashed lines represent the mean CV within each subspecies. b Site-wise nucleotide diversity (π) was used as an estimate of the genetic diversity within each of the subspecies using 36,901 SNPs described by [16] subspecies was much greater (Wilcoxon rank sum test, p < 0.0001; Fig. 2). The Indica subspecies showed a 64.7% higher nucleotide diversity (π) compared to Japonica. On average, π estimates were 0.26 for Indica and 0.17 for Japonica. These results are consistent with reports by Huang et al. [2] and Garris et al. [23], and are in agreement with the expression diversity reported above. Together these data suggest that the Japonica subspecies exhibits less genetic and transcriptional diversity compared to Indica.

Gene expression is heritable in cultivated rice
The above analyses shows a strong differentiation between the subspecies at transcriptional and genetic levels, and presents a possible linkage between expression and genetic diversity. However, the extent of variation in gene expression that can be accounted by genetic variation is not yet determined. To estimate the extent to which variation in gene expression is under genetic control, a mixed model was fit to the expression of each of the 22,675 genes and the variance between accessions was estimated. The significance of the random between − accession term was determined using a likelihood-ratio test. The broad-sense heritability (H 2 ) was estimated as the proportion of the total variance explained by between-accession variance to total variance. A total of 11,895 genes showed a significant between − accession variance (FDR < 0.001; H 2 ≥ 0.47), which accounts for approximately 53% of the genes expressed in at least 20% of the samples ( Fig. 3a To determine the extent to which additive genetic effects could explain variance in gene expression, a genomic relationship matrix was constructed using 32,849 SNPs following VanRaden [24] and variance components were estimated using a mixed linear model for each gene. A total of 10,125 genes were identified with significant h 2 (Additional file 4). Of these, 234 genes had highly heritable expression (h 2 ≥0.75), while 2,750 genes showed moderate heritability (0.5 ≤ h 2 <0.75) (Fig. 3b). An additional 7,141 genes showed low narrow sense heritability (h 2 <0.5). Collectively, these results indicate that many genes in the rice transcriptome are under genetic control.

Genetic variability of gene expression is considerably different between subspecies
The analyses above indicate that the two subpopulations differ at the transcriptional and genetic levels, and that for many genes, variation in expression can be explained by genetic effects. We next asked whether the heritability of gene expression is different between the two subspecies. To this end, the expression dataset was partitioned into Indica and Japonica subsets and genes with low expression in each subspecies were removed (expressed in less than 20% of the samples). Since the number of accessions for the two subspecies are unequal, 35 Japonica accessions were randomly sampled to ensure equal sample size, and the number of genes that were expressed in each subspecies were counted. Here, a gene was considered as expressed if 10 or more reads mapped to the gene in 20% or more of the samples. A total of 22,444 genes were found to be expressed in at least 20% of the samples for the Japonica subspecies, while 22,068 were found to be expressed in the Indica subspecies. A large number of genes were common to both subspecies (21,166 genes). A total of 1,278 genes were found to be uniquely expressed in Japonica, and 902 were found to be uniquely expressed in Indica.
A total of 5,005 genes exhibited significant H 2 in Indica and 3,338 genes in Japonica (FDR < 0.001; Additional file 5). For these genes, H 2 ranged from 0.67 to 0.98 in Indica and 0.67 to 0.97 in Japonica. A larger number of genes were identified with significant additive genetic variance, with 6,804 identified in Indica and 5,103 found in Japonica. For these genes, narrow-sense heritability ranged from 0.201 to 0.953 in Indica and 0.220 to 0.948 in Japonica. Interestingly, few genes showed significant heritable expression in both subspecies. For instance, only 1,681 and 2,644 genes were found to have significant H 2 and h 2 , respectively, in both Indica and Japonica. Moreover, a comparison of H 2 and h 2 between subspecies showed that for many genes, heritability estimates were considerably different between Indica and Japonica (Fig. 4).
To systematically identify genes showing significant differences in H 2 or h 2 ( H 2 and h 2 , respectively) between subspecies, accessions were randomly partitioned into two groups of equal size and the difference in heritability was estimated between groups. The resampling approach was repeated 100 times. A total of 1,860 genes showed significant differences in H 2 (p < 0.01) between the two subspecies, with a minimum absolute difference in H 2 of 0.40. Fewer genes were identified with a significant difference in h 2 between Japonica and Indica (Additional file 6). Only 1,325 genes were found with significant differences in h 2 between Indica and Japonica, and the absolute difference in h 2 ranged from 0.54 to 0.95 (Fig. 4).
These differences in heritability may be due to insufficient phenotypic variation (e.g. lack of expression diversity), or changes in the genetic or environmental factors that contribute to phenotypic variation. Thus, to further examine the potential causes of the observed differences in heritability, we quantified the expression diversity (CV), genetic variation and environmental variation within each subspecies for genes exhibiting H 2 and h 2 , as well as those with shared heritable variation. For genes exhibiting subspecies-specific genetic variability, the loss of heritability was largely due to an increase in environmental effects on phenotypic variation in the subspecies lacking heritability rather than loss of phenotypic variation. This is clearly evident in Additional file 2. The mean CV for H 2 genes decreased slightly in subspecies lacking genetic variability. However, for these same genes the proportion of phenotypic variation that was explained by environmental effects increased significantly in subspecies lacking genetic variability. Collectively, these results suggest that the differences in heritability exhibited between the subspecies is driven largely by loss of genetic variability and an increase in environmental effects rather than a loss of phenotypic variation.
Interestingly, several genes that have been reported to have divergent genetic variants between Indica and Japonica were found within H 2 and h 2 genes. For instance, DOPPELGANGER1 (DPL1) showed significantly higher H 2 and h 2 in Indica relative to Japonica (H 2 : 0.92 and 0.27, respectfully, p H 2 = 0.011; h 2 : 0.81 and 0.17, p h 2 = 0.004; Fig. 4e). However for DOPPELGANGER2, the converse was true. Significantly higher H 2 and h 2 was observed in Japonica relative to Indica (H 2 : 0.87 and 0.03, p H 2 < 0.001; h 2 : 0.77 and 0, respectfully, p h 2 = 0.005; Fig. 4f ). Mizuta et al. [25] showed that DPL1 and DPL2 are important regulators of Indica-Japonica hybrid incompatibility, and non-functional alleles arose independently for DPL1 and DPL2 within the Indica and Japonica subspecies respectively. Thus the results reported by Mizuta . b Comparisons of narrow sense heritability between the two subspecies. Red colored points in A and B indicate genes with significantly heritable expression (FDR < 0.001). Differences in broad (c) and narrow sense heritability (d) between Indica and Japonica. The difference in heritability is calculated as I . e-h Standardized expression of agronomically important genes showing differences in genetic variability between subspecies. The heritability is provided below each box plot. I: Indica, J: Japonica et al. [25] are consistent with the divergent genetic variability in expression observed in our study. In addition to DPL1 and DPL2, a gene that is important for the regulation of shoot growth/ architecture, MOC1, also displayed divergent genetic variability between subspecies. MOC1 showed significant differences in both H 2 and h 2 (Fig. 4h). Collectively, these results show that the two subspecies are divergent at the transcriptional and genetic levels. Moreover, many genes exhibit large differences in genetic variability between the Indica and Japonica, suggesting that these genes may be regulated by divergent genetic mechanisms.

Joint eQTL analysis assesses cis-regulatory divergence between subspecies
The differences in the narrow-sense heritability between subspecies observed for some genes suggest a divergence in the genetic regulation of these genes. Using the transcriptional and genotypic data for this population, we next sought to identify genetic variants that can explain this divergent genetic regulation. To this end, a joint eQTL analysis was conducted across subspecies using the eQTL Bayesian model averaging (BMA) approach described by Flutre et al. [26]. With this approach, the posterior probability of specific configurations can be formally tested; in other words, the probability that an eQTL is present/active in both the Indica and Japonica subspecies or unique to a given subspecies can be determined. The 91 accessions were classified into Indica and Japonica subspecies using STRUCTURE as described earlier, yielding 35 Indica-type and 56 Japonica-type accessions. eQTLs were modeled for genes showing significant H 2 in at least one subspecies (6,307 genes) and 274,499 SNPs. For each gene, associations were tested for SNPs within 100kb of the transcription start site. A total of 5,097 genes were detected with one or more eQTL at an FDR of 0.05 (Additional file 7). This equates to approximately 81% of the genes displaying heritable expression, and indicates that a large portion of genes with heritable expression are regulated by variants in close proximity to the gene.
To identify eQTL genes that were specific to a given subspecies, the SNP with the highest probability of being the eQTL was selected for each gene, and the posterior probability for all three configurations (Indica-specific, Japonica-specific, and across subspecies) was compared. Of the 5,097 eQTL genes detected, 80% (4,077 genes; 3,826 unique SNPs) were detected across subspecies, 18% (914 genes; 880 unique SNPs) were detected for Indica accessions, and only 2% (106 genes; 103 unique SNPs) were detected in Japonica accessions. These results indicate that while a large portion of cis-eQTLs are shared across the two subspecies of cultivated rice, many genes are regulated by unique cis regulatory mechanisms that are specific to the Indica subspecies.

Signatures of selection are evident among subspecies specific eQTL
The presence or absence of cis-regulatory variants within a given subspecies may be the result of the unique domestication histories that have shaped Indica and Japonica, and/or driven by environmental adaptation of the wild progenitors from which they were derived. The absence of variation at the eQTL SNP could be due to sampling during differentiation of the wild progenitors or during domestication (e.g. lost purely by chance), or due to selective pressures imposed by the environment or humans. In the case of selection, we expect to see reduced genetic diversity around the eQTL compared to the rest of the genome. To determine whether the absence of subspecies-specific eQTL are the result of selection, we calculated the average nucleotide diversity (π) in 100 Kb windows around significant subspeciesspecific eQTL within each subspecies and compared these values to the overall average π for 100 Kb windows across the genome within each subspecies using a two-sided t-test. Comparisons within each subspecies of π for eQTLs and the genome-wide average should account for the inherent differences in π between the two subspecies.
Consistent with what would be expected under selection, a significant reduction in nucleotide diversity was observed for eQTL SNPs that were absent in a subspecies, as well as for regions around subspecies-specific eQTL (Fig. 5). For instance, for Indica-specific eQTL, the average π in Japonica was approximately 22% lower than the genome-wide average (0.138 and 0.176, respec-tively; p < 1 × 10 −15 ). Similarly, the average π in Indica for Japonica-specific eQTL was about 16% lower than the genome-wide average (0.235 and 0.279, respectively; p = 3.85 × 10 −10 ). Interestingly, slightly higher nucleotide diversity was observed for regions around subspeciesspecific eQTL in subspecies in which they were detected compared to genome-wide nucleotide diversity, as well as for shared eQTL when compared to genome-wide nucleotide diversity. Collectively, these results indicate that the absence of eQTL within a given subspecies may be the result of selective pressures that reduced genetic diversity within the eQTL regions.
Given the small sample size in the current study (n = 91) we sought to confirm these results using resequencing data for a larger population of 3,024 diverse rice accessions [27][28][29][30]. To this end, we extracted SNP information for 3,024 rice accessions in the same 100 Kb window surrounding eQTL, and examined π within each subpopulation for these regions. As above, π within these regions were compared with genome-wide averages for 100 kb windows. Consistent with the results derived from the 91 accessions, π within subspecies-specific eQTL was lower in subpopulations lacking the eQTL (Fig.6). For instance, for the Japonica subpopulations (japonica-x, subtropical japonica, temperate japonica, and tropical japonica) π estimates for Indica-specific eQTL were considerably lower than those for Indica subpopulations (indica-1A, indica-1B, indica-2, indica-3, and indica-x). The converse was true for Japonica-specific eQTL, with lower π observed in Indica subpopulations relative to Japonica. However for the shared eQTL, π estimates were higher than the genome-wide averages, suggesting that genetic diversity within regions that regulate gene expression is maintained. To identify specific loci that may have been targeted by selection, we selected eQTL regions with an average π within a 100 Kb window that was below the 5% quantile for genome-wide average for a given subspecies. Consistent with the results above, we observed a greater frequency of low diversity eQTL regions in subspecies lacking the subspecies-specific eQTL. For instance, approximately 11% of the 880 Indica-specific Nucleotide diversity at cis-eQTL. a Nucleotide diversity (π) for the most significant SNP for each cis-eQTL. The distribution of π is pictured fro each subspecies and each eQTL type. b Distribution of π for 100 Kb windows around the most significant SNP for each cis-eQTL. Genome-wide (GW) π was determined by randomly selecting X SNPs that were more than 100 kb from a cis-eQTL and low diversity SNPs (MAF <0.1 in both subspecies) were removed prior to analyses. Asterisks indicate a significant differences determined via Tukey's test between eQTL types (p < 1 × 10 −8 ) eQTL were found in regions of low diversity in Japonica (π Jap ≤ 0.0645). While for Japonica-specific eQTL, 14% (14 of the 103) eQTL regions were lying in regions of low diversity in Indica (π Ind ≤ 0.1617). However, for shared eQTL and for subspecies in which the subspeciesspecific eQTL was detected, the converse was true. Only a small percentage of eQTL regions were found within regions of low diversity. For instance, approximately 3.5% Fig. 6 Nucleotide diversity at cis-eQTL within subpopulations for 3,053 rice accessions. Average nucleotide diversity (π) for 100 kb regions surrounding Indica-specific, Japonica-specific, and shared eQTL are pictured in panels A, B, and C, respectively. For each, subpopulation and class of eQTL (e.g.Indica-specific, Japonica-specific, and shared) π was calculated for each SNP within 100 kb of the most significant eQTL SNP. π for the eQTL windows were compared to a genome wide (GW) average in which regions with eQTL and site with low diversity (MAF <0.01 in 10 of 12 subpopulations) were excluded. Asterisks indicate significant differences between GW and eQTL regions determined using a two-sided Student's t-test (* p < 0.05; ** p < 0.001). Subpopulations are named following [27] (aro: aromatic; ind1A: indica-1A; ind1B: indica-1B; ind2: indica-2; indx: indica-X; japx: japonica-X; subtrop: subtropical japonica; temp: temperate japonica; trop: tropical japonica) of shared eQTL were found in regions of low diversity in both Indica and Japonica, and less than 1% of subspecies eQTL were found in regions of low diversity in the subspecies in which they were detected. Collectively these results suggest that selective pressures may have shaped the cis-regulatory divergence of the Indica and Japonica subspecies.

Discussion
The differentiation between the Indica and Japonica subspecies of cultivated rice has been intensively characterized at the morphological, biochemical, and genetic levels [2,3,[5][6][7][8][9][10][11][12][31][32][33][34][35]. However, the divergence at the transcriptional levels remains understudied. Here, we provide a comprehensive analysis of the transcriptional and cisregulatory divergence between the major subspecies of rice, and show that the presence or absence of cis regulatory variants within the subspecies is a component of this divergence.
The transcriptional divergence is most evident in the large number of expressed genes showing differences in the magnitude or frequency of expression. Of the 25,732 genes showing evidence of expression in the current study, approximately 29% showed significant differences in expression levels between the two subspecies. Moreover, approximately 8% of expressed genes showed evidence of presence-absence expression variation. While few studies have examined the differences in expression levels between diverse populations of Indica and Japonica, recent studies have utilized whole genome sequencing to shed light on the genetic differentiation between the subspecies of cultivated rice [2,27]. In a recent study, Wang et al. [27] found that on average approximately 15% of all genes showed evidence of PAV between the genomes of Indica and Japonica accessions, further indicating that PAV is pervasive between the subspecies of cultivated rice. While the number of PAV reported by Wang et al. [27] are nearly two fold higher than those reported in the current study, it is important to note that only a single tissue was sampled for 91 accessions at a single time point. Therefore, while the expression data provides considerable insight into transcriptional variation in cultivated rice, it likely captures only a portion of the total transcriptome given the lack of temporal and spatial resolution. Moreover, Wang et al. [27] captured PAV using 3,010 resequenced rice genomes, while the current study utilized only a fraction of the variation of Wang et al. [27] with RNA sequencing of 91 accessions. Thus, increased sample size via larger populations and more sampling within tissue and developmental context may lead to a better agreement between PAV at the genome and transcriptional levels.
One major challenge for genomic studies that utilize both Indica and Japonica accessions is choosing an appropriate genome. While cultivated rice consists of two subspecies, many studies that have used accessions from both subspecies often map sequences to the Nipponbare reference genome [16,17,27,30,36]. Several studies have highlighted structural variation both within and between subspecies of cultivated rice. Thus, some genomic features may not be shared between diverse accessions and Nipponbare [10,37]. The current study highlights many differences between the Indica and Japonica subspecies, but does so under the assumption that the genomes of the two subspecies should not be too different. The overall high colinearity of the genomes of the two subspecies and the ability to recover fertile F1 individuals from Indica-Japonica hybrids suggests that this is a reasonable assumption.

Potential causes of transcriptional divergence between Indica and Japonica
Lower mean expression values or absence of expression in a given subspecies may be the result of both heritable and non-heritable effects. The availability of high density SNP information for RDP1 allowed us to begin to elucidate the genetic basis of the observed transcriptional divergence between the subspecies of cultivated rice. A notable portion of genes with evidence of PAV or DE also showed differences in genetic variability between the subspecies (13% and 9% of DE genes showed differences in H 2 and h 2 , respectively, and 20% and 15% of PAV genes showed differences in H 2 and h 2 ), indicating that for many genes, the genetic mechanisms that regulate expression may be different between the two subspecies. However, many genes that display divergent expression patterns have non-significant differences in genetic variability. There are several explanations for this. For one, the thresholds used to identify genetically divergent genes were quite stringent. For instance, genes must have a difference in genetic variability in either the broad sense greater than 0.4022 between subspecies to be labeled as statistically significant, and in the narrow sense 0.5364. Therefore, it is possible that many more DE or PAV genes have different genetic architectures in the two subspecies, but were missed because of the stringency of statistical threshold. A second possibility is that many of the genes showed divergent expression are influenced greatly by the environment, and thus have low heritability. Thus, these genes would be filtered out in these genetic analyses.
The heritable transcriptional divergence may be due to genetic variants that influence gene expression and are divergent between Indica and Japonica. These include large structural variants (e.g. deletions, insertions, inversions, and/or duplications), or SNPs that may act in cis or trans to influence gene regulation. While high density SNP information is available for this population and can be leveraged to identify SNPs that regulate expression and are divergent between the subspecies, the identification of larger structural variants that influence expression is only attainable through full genome sequencing, which is not currently available for RDP1. As more genetic resources become available for RDP1 this would be a promising future direction to resolve the causal basis of these transcriptional differences.
The availability of high density SNP information for RDP1 allowed us to begin to elucidate the genetic basis of the observed transcriptional divergence between the subspecies of cultivated rice, and classify genetic effects into those that are common between subspecies, or unique to a given subspecies. While the eQTL-BMA approach has proven to be a powerful framework for assessing the specificity of eQTL for a given tissue or population, one potential limitation of eQTL-BMA is that the framework only allows us to model cis-eQTL. Trans-eQTLs are often difficult to detect due the penalties associated with the large number of statistical tests performed, and because trans-eQTL often have small effect sizes and thus require larger dataset for detection. Several studies in humans have shown that cis-eQTL typically only explain 30-40% of genetic variation in expression [38][39][40]. Thus, the divergent regulatory variants captured in the current study only reflect a portion of the differences in genetic variation between the two subspecies. Further studies are necessary to shed light on the contribution of trans-regulatory variants on the genetic differentiation between Indica and Japonica transcriptomes.
The joint eQTL analysis facilitated the identification of 5,097 genes associated with one or more SNP in cis. For most of these genes (81%), the cis-regulatory variant was shared between both subspecies, indicating that much of the cis-regulatory variation is common between the two subspecies. This high degree of overlap is somewhat expected. For one, both Indica and Japonica originate from populations of the same species, Oryza rufipogon. Moreover, crosses between Indica and Japonica often produce viable offspring, indicating a high degree of colinearity and functional similarity between the genomes. Thus, while considerable differentiation between founder Oryza rufipogon populations has been reported and further divergence has likely occurred since domestication, the common origin and inter-specific comparability suggests that the transcriptional regulation and genome structure is similar [2].

Functional significance of transcriptional divergence
The current study elucidates the transcription divergence between the major subspecies of cultivated rice. Many of these genes with divergent expression, genetic variability, or regulatory variation have been reported to be underlying important agronomic traits, such as photoperiod adaptation and development. Therefore these observed differences may have potential agronomic significance. However, further studies are necessary to determine whether these expression patterns are conserved in other tissues or developmental stages.
Among these divergent genes, we identified three genes (OsPhyA, OsPhyC, and OsCO3), that have been reported to be associated with the timing of reproductive development in response to day length that had significant heritability in Indica only. The two phytochrome genes, OsPhyA and OsPhyC are activated under long-day conditions and repress flowering time through OsGhd7 [41,42]. Although no studies have shown whether OsCO3 participates directly in the pathway involving OsPhy genes, disruption of OsCO3 interferes with photoperiod sensitivity and/or flowering time [43]. For instance, Kim et al. [43] showed that the overexpression of OsCO3 delayed flowering under short-day conditions. In most rice varieties, short-days promote the transition from vegetative to reproductive growth [44]. However, temperate japonica rice varieties, which are adapted to higher latitudes have been selected to initiate flowering in long-days to escape the negative impact of low temperatures in autumn on pollen fertility [45][46][47]. All genes showed heritable expression only in the Indica subspecies, indicating that in the Japonica subspecies expression variation may be driven largely by non-genetic effects. Moreover, the patterns of genetic variability for these genes are consistent with their potential role in the adaptation of flowering in different environments for Indica and Japonica.
In addition to genes regulating rice phenology, several genes were identified that have been reported to play important roles in the regulation of shoot architecture (D18, MT2b, and MOC1). For instance, two genes dwarf 18 (D18) and Metallothionein2b (MT2b) have been reported to regulate plant height [48,49]. D18 encodes a GA-β hydroxylase and is involved with GA biosynthesis. Loss of function mutants exhibit a severe dwarf phenotype [48]. Interestingly, D18 was found have an Indica-specific eQTL, but did not exhibit a difference in H 2 or h 2 between the two subspecies (p = 0.046 and p = 0.19, respectively), indicating that genetic differences may be confined to local regions around D18.

Conclusion
The morphological and genetic differences between subspecies of cultivated rice have been studied extensively, however the divergence of Indica and Japonica at the transcriptional and regulatory levels is largely unresolved. Here, we provide, to date, the first detailed populationlevel characterization of transcriptional diversity within cultivated rice, and assess the divergence in trancriptomes and expression variation between Indica and Japonica. We find that many agronomically important genes exhibit differences in expression levels, and/or cis-regulatory variation between the subspecies. These resources provided by this study can serve as a foundation for future functional genomics studies in rice and other crops, and can be further utilized to connect gene function with natural variation in gene expression.
Seeds were dehusked manually and germinated in the dark for two days at 28°C in a growth cabinet (Percival Scientific), and were exposed to light (120 μmol m −2 s −1 ) twelve hours before transplanting to acclimate them to the conditions in the growth chamber. The seeds were transplanted to 3.25" x 3.25" x 5" pots filled with Turface MVP (Profile Products) in a walk-in controlled environment growth chamber (Conviron). The plants were cultivated in the absence of intentional stress conditions. The pots were placed in 36" x 24" x 8" tubs, that were filled with tap water. Fours days after transplanting the tap water was replaced with half-strength Yoshida solution [50] (pH 5.8). The pH of the solution was monitored twice daily and was recirculated from a reservoir beneath the tubs to the growth tubs. The temperatures were maintained at 28°C and 25°C in day and night respectively and 60% relative humidity. Lighting was maintained at 800 μmol m −2 s −1 using high-pressure sodium lights (Phillips).

RNA extraction and sequencing
Ten days after transplant, aerial parts of the seedlings were excised from the roots and frozen immediately in liquid nitrogen. The samples were ground with Tissuelyser II (Invitrogen) and total RNA was isolated with RNAeasy isolation kit (Qiagen) according to manufacturer's instructions. On-column DNAse treatment was performed to remove genomic DNA contamination (Qiagen). Sequencing was performed using Illumina HiSeq 2500. Sixteen RNA samples were combined in each lane. Two biological replicates were used for each accession.

Sequence alignment, expression quantification, and differential expression analysis
Quality control for raw reads was performed using the package FastQC [51]. The Illumina 101-bp single-end reads were screened and trimmed using Trimmomatic to ensure each read has average quality score larger than 30 and longer than 15 bp, and were aligned to the rice genome (Oryza sativa MSU Release 6.0) using TopHat (v.2.0.10), allowing up to two base mismatches per read. Reads mapped to multiple locations were discarded [52,53]. The number of reads for each gene sequence was counted using the HTSeq-count tool with the "union" resolution mode [54]. For down-stream genetic analyses, a variance stabilized transformation was performed on normalized read counts to provide approximately homoskedastic values in DEseq2 [55].
To identify genes that exhibited differential expression between the two subspecies, a mixed linear model was fit that included subspecies as the main fixed effect and accession as a random effect in lme4 [56]. This full model was compared to a reduced model the lacked subspecies as a fixed effect using a likelihood-ratio test. Prior to differential expression analysis, expression levels were quantile normalized to ensure a Gaussian distribution. Benjamini and Hochberg's method was used to control the false discovery rate, and genes with an FDR ≤0.001 were considered differentially expressed [57].
Genes showing differences in presence-absence expression variation (PAV) was determined using a mixedeffects logistic regression model. Briefly, for each sample the expressed genes (number of reads >10) were assigned 1, while those with 10 or less reads were assigned a 0. A logistic regression model was fit using the 'glmer' func-tion in 'lme4' and included subspecies as a fixed effect and accession as random [56]. The significance of the fixed effect of subspecies was determined by comparing the full model above with a reduced model that lacked subspecies using a likelihood-ratio test. Benjamini and Hochberg's method was used to control the false discovery rate, and genes with an FDR ≤0.001 were considered as having presence-absence expression variation [57].

Subspecies classification
The 91 accessions were classified into two subspecies using the software STRUCTURE [19]. Briefly, the software was run using the 44k SNP data, assuming two subpopulations (K=2), with 20000 MCMC replicates and a burn-in of 10000 MCMC replicates.

Expression and genetic diversity analyses
Principle component analysis of gene expression was conducted for the 91 accessions using 22,675 genes after variance stabilizing transformation. For, PCA of SNP data the 44k dataset described by Zhao et al. [16] was used. SNPs with a MAF <0.10 were removed prior to PCA analysis.
The coefficient of variation (CV) was used to estimate the diversity in gene expression within the Indica and Japonica subspecies. Prior to estimating CV genes with low expression (i.e. those with read counts of ≤10 in ≥20% of the samples) were removed, leaving a total of 22,503 genes in Japonica and 21,719 genes in Indica. For the estimation of π, SNPs were extracted for each subspecies and SNPs with MAF < 0.05 were excluded. In total 201,891 SNPs were retained for Indica and 161,715 for Japonica. π was estimated at each SNP using the site-pi function in VCFtools [58].

Heritability estimates
Heritability, both in the broad (H 2 ) and narrow sense (h 2 ), was estimated across subspecies for 22,675 genes that were expressed in both Indica and Japonica. To estimate H 2 a mixed model was fit using lme4 where accession was considered a random effect, and significance of H 2 was assessed using a restricted likelihood-ratio test in the RLRTsim package [56,59]. Benjamini and Hochberg's method was used to control the false discovery rate, and genes with an FDR ≤0.001 were considered to have significant genetic variability [57]. To assess hertiability in the narrow sense (h 2 ) a mixed model was fit in asreml-R [60]. Briefly, a genomic relationship matrix (G) was estimated according to [24] using the approximately 36,901 SNPs described by Zhao et al. [16]. G is estimated as G = ZcsZcs m , where Z cs is the centered and scaled marker matrix and m is the number of markers. A likelihood-ratio test was used to assess significance and Benjamini and Hochberg's method was used to control the false discovery rate. Genes with an FDR ≤0.001 were considered to have significant genetic variability [57].
Heritability was assessed within subspecies using the same approaches as described above. However, due to the unequal sample size for the Indica and Japonica subspecies, a random set of 35 Japonica accessions were selected. Genes showing low expression (< 10reads in < 20% of samples) in either subspecies were removed prior analysis, leaving 22,444 genes in Japonica and 22,068 genes in Indica.

Assessing differences in genetic variability between subspecies
To identify genes showing significant differences in genetic variability (H 2 or h 2 ) between subspecies, a permutation approach was used. Here, the 91 accessions were randomly partitioned into two groups of equal size (35 accessions each). Heritability was estimated as described above and the difference in heritability between each group was calculated. The resampling approach was repeated 100 times for both H 2 and h 2 . This process effectively estimated a null distribution of H 2 and h 2 values. The heritability estimates for each subspecies was used to calculate the differences in H 2 and h 2 between the two subspecies as H 2 = H 2 J − H 2 I or h 2 = h 2 J − h 2 I . These values were compared with the null distribution to assess significance.

Joint cis-eQTL analysis
eQTLs were jointly detected using the eQTL-BMA (Bayesian model averaging) described by Flutre et al. [26] for 26,675 genes and 274,499 SNPs (MAF >0.10) [61]. Prior to eQTL mapping BLUPs for each gene was calculated and the gene expression level of each gene was transformed into the quantiles of a standard Normal distribution with ties broken randomly. To control for the effects of population structure the first four PCs derived from PCA analysis of 44k SNP dataset were included in the linear model. Briefly, to identify eQTL and control false discovery rate (FDR) a gene-level permutation approach was used within the eQTL-BMA software. Using the eqtlbma_bf program, 10,000 permutations were performed with the following settings: -maf 0.1, -nperm 10000, -trick 1, -tricut 10 and -error uvlr. Genes were considered to have an eQTL if the FDR < 0.05. These permutations were used to estimate π 0 , the probability for a gene to have no eQTL in any subspecies. Here, expression from both Japonica and Indica samples were analyzed together with the option -error uvlr specified. Next, a hierarchical model with an expectation-maximization algorithm was used to estimate hyper-parameters and configuration probabilities using the eqtlbma_hm program. These configurations were Indica-specific, Japonica-specific, and present in both subspecies. Lastly, the eqtlbma_avg_bfs program was run to obtain (i) the posterior probability (PP) of a gene to have an eQTL in at least one subspecies, (ii) PP for a SNP to be the causal SNP for the eQTL, (iii) PP for the SNP to be an eQTL, (iv) PP for the eQTL to be present in one subspecies, and (v) PP for the eQTL to be present for a specific configuration. SNP-gene pairs were determined to be specific to a given subspecies or shared if the PP > 0.5 for a given configuration.

Detecting evidence of selection at cis-eQTL
To determine whether the absence of an eQTL was due to of selection, first SNPs from the HDRA dataset within 100kb of each significant eQTL were extracted for the 91 accessions [61]. For each SNP, nucleotide diversity was determined using the site-pi function in VCFtools and was averaged across the 100kb window [58]. Secondly, a genome-wide diversity level was determined for each subspecies. Here, SNPs that were within 100kb of an eQTL were excluded, as well as those that exhibited low diversity in both subspecies (MAF <0.1 in both Indica and Japonica). Nucleotide diversity was determined as described above for each SNP, and the average was taken for 100kb windows. For each class of eQTL (e.g. Indica-specific, Japonica-specific, and shared), a two-sided Student's ttest was performed to assess whether the mean π was different from the genome-wide average for each subspecies and class of eQTL.
A similar approach was taken for the 3kg data [30]. For each eQTL SNP, all SNPs within 100kb of the eQTL SNP was extracted from the 4.8M core SNP data. The MAF was determined for each of the 12 subpopulations in the 3kg data, and SNPs that had low diversity (MAF <0.01) in 10 of the 12 subpopulations were excluded from further analyses. As above, π was calculated for each site. An average π was determined for each subpopulation at each eQTL by taking the average π across the 100kb window. To obtain a genome wide average, eQTL regions were excluded and π was obtained for each subpopulation by averaging π across the 100kb region. Finally, as above a two-sided Student's t-test was performed to assess whether the mean π was different from the genome-wide average for each subpopulation and class of eQTL.