Genomics Analyses Reveal Unique Classification, Population Structure and Novel Allele of Neo-Tetraploid Rice

Neo-tetraploid rice (NTR) is a useful new germplasm that developed from the descendants of the autotetraploid rice (ATR) hybrids. NTR showed improved fertility and yield potential, and produced high yield heterosis when crossed with indica ATR for commercial utilization. However, their classification, population structure and genomic feature remain elusive. Here, high-depth genome resequencing data of 15 NTRs and 18 ATRs, together with 38 publicly available data of diploid rice accessions, were analyzed to conduct classification, population structure and haplotype analyses. Five subpopulations were detected and NTRs were clustered into one independent group that was adjacent to japonica subspecies, which maybe the reason for high heterosis when NTRs crossed with indica ATRs. Haplotype patterns of 717 key genes that associated with yield and other agronomic traits were revealed in these NTRs. Moreover, a novel specific SNP variation was detected in the first exon of HSP101, a known heat-inducible gene, which was conserved in all NTRs but absent in ATRs, 3KRG and RiceVarMap2 databases. The novel allele was named as HSP101–1, which was confirmed to be a heat response factor by qRT-PCR, and knockout of HSP101–1 significantly decreased the thermotolerance capacity of NTR. Interestingly, HSP101–1 was also specifically expressed in the anthers of NTR at pre-meiotic and meiosis stages under optimal environment without heat stress, and its loss-of-function mutant showed significant decrease in fertility of NTR. The construction of first genomic variation repository and the revelation of population structure provide invaluable information for optimizing the designs of tetraploid rice breeding. The detection of specific genomic variations offered useful genomic markers and new directions to resolve high fertility mechanism of NTR.


Background
Whole-genome duplication (WGD) or polyploidization played vital role in evolution, speciation and biological complexity of higher plants ( Van de Peer et al. 2009). Polyploids, including autotetraploids and allopolyploids, are widespread in flowering plants and 60%-70% flowering plants have one or more polyploid ancestor (Blanc and Wolfe 2004;Masterston 1994). The advantage of polyploids in plant growth, yield and environmental fitness prompt the crop breeders to explore their usage in modern crop trait improvement (Van de Peer et al. 2009;Renny-Byfield and Wendel 2014).
Autotetraploid rice is a germplasm that generated by artificial genome duplication of diploid rice and their development enriched the germplasm resources of rice (Tu et al. 2003;Luan et al. 2007;Cai et al. 2007;Lu et al. 2020). Although biological advantages, wide adaptability and high hybrid vigor were observed in autotetraploid rice Shahid et al. 2012;Wu et al. 2013;Tu et al. 2014;Yang et al. 2014), their production and commercial utilization were restricted by many factors, especially fertility (Shahid et al. 2010;He et al. 2011;Shahid et al. 2013;Wu et al. 2014;Wu et al. 2015;Li et al. 2017;Chen et al. 2018;Li et al. 2018). After many years of tremendous efforts, a newly developed tetraploid rice with high fertility, i.e. neo-tetraploid rice, was developed by our research group through crossing of autotetraploid rice Guo et al. 2016;Guo et al. 2017;Bei et al. 2019;Yu et al. 2020;Ghaleb et al. 2020), which may become new useful resources for rice breeding (Koide et al. 2020). Compared with ATR, several agronomic traits improved significantly in NTR, such as seed setting and grain number per panicle, and also contained non-parental variant genes associated with fertility and yield . Hybrids between NTR and ATR showed high heterosis and production (Guo et al. 2017;Ghaleb et al. 2020). Consequently, genetic diversity between ATR and NTR populations need to be assessed urgently to illustrate the population structure and genetic divergence.
The revelation of genetic divergence and population structure is very important either in maximizing hybrid vigor, optimizing breeding design or discovering trait associated variations. The population structure in plant species, such as wheat (Cavanagh et al. 2013;Chao et al. 2010;Maccaferri et al. 2005), maize (Romay et al. 2013), sorghum (Morris et al. 2013), barley (Malysheva-Otto et al. 2006), pearl millet (Serba et al. 2019) and weedy rice (De Leon et al. 2019), were illustrated using microsatellites or simple sequence repeat (SSR) markers and single-nucleotide polymorphisms (SNPs). With the development of next-generation sequencing, whole genome re-sequencing was used as a valid method to conduct genotyping in populations to analyze the genetic diversity. In diploid rice, population structure analysis based on genome re-sequencing data classified 3010 diverse accessions into nine subpopulations, which was accordant with their geographic distribution (Wang et al. 2018b).
Based on the genotypic data of population, haplotype analysis can be used to illustrate the genetic diversity. Haplotype analysis of genes that regulate important agronomic traits has been used to identify trait associated variations and functional genetic markers. The SNP variations in BADH1 gene were detected by highthroughput SNP genotyping method from 16 varieties differing in salt tolerance and aroma (Singh et al. 2010). Haplotype analysis of grain weight gene, GW2, in diverse indica and aromatic genotypes discovered four new SNPs and regrouped the haplotypes into four categories (Dixit et al. 2013). In another study, haplotype analysis of 21 salt stress-responsive candidate genes in 103 wild rice accessions with contrasting salt tolerance was conducted, which revealed the major alleles in different geographic regions (Mishra et al. 2016). Haplotype analysis using deep genome re-sequencing is becoming a reliable method to inspect genetic diversity. Genome resequencing of 104 rice varieties provided 18 million variations for genetic diversity analysis and haplotype reconstruction of agronomically important genes (Duitama et al. 2015). Using the sequencing data of 3 K rice genome panel, haplotypes of 120 genes regulating grain yield and grain quality were explored, and the assessment of superior haplotypes showed potential in developing next-generation tailor-made rice (Abbai et al. 2019). The construction of haplotype patterns of agronomically important genes in tetraploid rice, based on the previously reported genes in diploid rice, would provide valuable resources for their functional analysis.
In the present study, the genomic re-sequencing data of neo-tetraploid rice and autotetraploid rice lines was used to reveal the genome features, population structure and haplotype patterns. By these analyses, we planned (1) to determine the subspecies (indica-japonica) classification and population structure of neo-tetraploid rice, (2) to construct the haplotype patterns of agronomically important genes in tetraploid rice population, (3) to mine the novel specific alleles of neo-tetraploid rice, and (4) to validate the function of neo-tetraploid rice specific genes.

Genome Re-Sequencing and Variation Detection in Neo-Tetraploid and Autotetraploid Rice
The genome re-sequencing data of 15 neo-tetraploid rice lines (NTRs) and 18 autotetraploid rice lines (ATRs) with about 46-fold sequencing depth were used in this study, which included about 2 billion high quality pairend sequencing reads with an average ratio of Q30 score of 93.98% (Table S1). The average genome coverage ratios of MSU7 and R498 reference genome were 95.31% and 93.46%, respectively (Table S2). About 66.9 million variations, ranged from 0.21 to 3.50 million variations in each individual, were identified in 33 tetraploid rice (including ATR and NTR) lines when compared with MSU7 reference genome. Approximately 85.7 million variations, ranged from 1.26 to 3.44 million variations in each individual, were identified in 33 tetraploid rice (including ATR and NTR) lines when mapped onto R498 reference genome (Table S3).
The genomic variations from 33 tetraploid rice lines were combined based on their genomic positions, and a total of 7,445,008 and 7,880,885 variations were detected between 33 tetraploid rice lines and reference genomes of MSU7 and R498, respectively. A total of 0.79 million moderate-to-high effect variations were identified based on the annotation results from SnpEff software, which accounted for 10.61% of the total variations against MSU7 reference genome. The percentage of modifier and low variations were 6.96% and 82.42%, respectively (Table S4) (Table S5). The distributions of variation density were illustrated using "Circos" plot ( Fig. S1 and Fig. S2), and low-diversity regions (LDRs) were found in Chr5 (10-14 Mb) and Chr6 (2-6 Mb). These LDRs contained relatively low number of variations in indica ATRs, japonica ATRs, javanica ATRs and neo-tetraploid rice (NTRs) against two reference genomes.

Genomic Classification and Population Structure Analysis of Neo-Tetraploid Rice
There are large genetic divergences between the genomes of the two Asian cultivated rice (Oryza sativa L.) subspecies, indica and japonica. Mostly, the mapping rate is high and variations are less when the sequencing sample and the reference genome belonging to the same subspecies. Subspecies classification of NTRs was detected using the difference of mapping rate and genomic variation number between NTRs and two reference genomes (japonica reference genome, MSU7, and indica reference genome, R498). With minimal coverage of 5X, the average mapping rates were 92.39% and 89.13% with MSU7 and R498 in all NTRs, respectively. On the contrary, the average variation numbers of all NTRs were 1, 903,495 and 2,818,017 against MSU7 and R498, respectively. The tendency was consistent with the known japonica materials in the previous studies, which suggested that the genome sequences of NTRs were similar to japonica ( Fig. S3 and Fig. S4). Moreover, to identify the fractions of indica and japonica genomes present in ATRs and NTRs, 2,785,296 SNPs and InDels between MSU7 and R498 were used as markers. The results were consistent with the genome coverage and variation number, which also showed that NTRs have more fraction of MSU7 genomes than R498 (Fig. 1d). The classification was further validated using phenol reaction according to Cheng's index, and the glume color of indica ATR, Huanghuazhan-4x, and its diploid counterpart were dyed to dark purple, while NTRs, japonica ATR Taichung65-4x and its diploid counterpart were unstained (Fig. S5). These results indicated that NTRs can be classified into japonica subspecies.
Together with 38 published genome sequencing accessions, principal component analysis (PCA) and phylogenetic relationship analysis classified all materials into five different groups, namely Oryza rufipogon (common wild rice), indica, japonica, javanica (tropical japonica) and NTR, and the classifications of most samples were in accordance with their known subspecies features (Table S1, Fig. 1a, b). To validate the population structure results, the ADMIXTURE analysis was conducted in the k values (subgroup number) ranging from 2 to 8, and k = 5 was chosen as the best subgroup number because the lowest cross-validation error was detected when k = 5 (Fig. 1c, Table S6). These results indicated that the NTRs can be classified into an independent subgroup in tetraploid rice population. The independent position of NTRs compared to other type of tetraploid lines, especially indica autotetraploid lines, demonstrated the potential high heterosis effect between NTRs and ATRs, which was consistent with our previous results of F 1 hybrid's performance . Notably, genomic data also revealed that all NTRs contained wide compatibility gene S 5 n , which would play important roles in the breeding of intersubspecific tetraploid rice hybrids (Fig. S6).

Haplotype Analysis in Tetraploid Rice Population
Haplotype pattern of 55,801 annotated genes in rice whole genome (MSU7) was determined to evaluate the genetic diversity of tetraploid rice population. A total of 432,290 haplotypes were identified, with an average of 7.75 haplotypes per gene. The distribution of number of haplotypes per gene revealed that 40,013 out of 55,801 genes have 2 to 15 haplotypes (Fig. S7). However, 7772 genes only have one haplotype. GO enrichment analysis revealed that those zero-diversity genes participated in the basic key metabolic processes like GTP binding, proton-transporting ATPase activity and NADH dehydrogenase activity (Table S7).
In order to harness the genetic diversity of tetraploid rice for breeding, the haplotype variations of previously cloned genes associated with yield and other agronomically important traits were identified in tetraploid rice population (Table S8). In 717 agronomically important genes, 2912 haplotypes were detected, with an average of 4.06 haplotypes per gene (Table S9). The distribution of haplotypes per gene revealed that 563 out of 717 genes have 2 to 15 haplotypes per gene, while 147 genes have only one haplotype in tetraploid rice accessions (Fig. S8). For instance, the yield related genes Hd3a (LOC_ Os06g06320), NAL1 (LOC_Os04g52479) and sd1 (LOC_ Os01g66100) have 4, 6 and 9 haplotypes, while TAC1 (LOC_Os09g35980) gene only have one haplotype.

Identification of Novel Specific Alleles in Neo-Tetraploid Rice
The results of haplotype analysis were further used to identify NTR specific alleles compared to ATRs. The haplotype patterns of 55,801 annotated genes of MSU7 reference genome among 18 ATRs, 15 NTRs and 30 diploid rice lines (DRs) were compared. A total of 269 shared unique haplotypes were found in the NTRs, but no unique haplotype was detected in the ATRs when compared with the DR genomes, suggesting that NTRs may contain novel variations, which is neither present in DRs, nor in ATRs (Table S10). Moreover, the independent position of population structure and significant genetic differences between NTR and ATR populations prompted us to analyze their allele differentiation. Firstly, the haplotypes of 15 NTRs were compared with 18 ATRs one by one, and 3189 specific haplotypes were identified in 15 NTRs. Huaduo 8 and Huaduo 2421 contained 1014 and 964 specific haplotypes, respectively, which were higher than other NTRs (Fig. 2a). The intersections of haplotypes in each NTR line were analyzed by UpSet plot, and the numbers of haplotypes that shared by more than 12 NTRs are shown in Fig. 2b. The two haplotypes, which shared by all 15 NTR lines, and 14 haplotypes that shared by 13 NTR lines, were identified.
One allele was considered as NTR specific allele when all the haplotypes of NTRs were uniform but different from ATRs. Five NTR specific haplotypes were identified, and two of them have homozygous variations, including LOC_Os05g44340 (HSP101, heat response gene) and LOC_Os10g25730 (hypothetical protein) (Fig. 3a,  Fig. S9). Of these five genes, one gene, HSP101, was highly expressed in anther at meiosis stage in NTR Huaduo 1 under natural environment (Fig. S10a). One base substitution was detected in HSP101 of all NTR lines but not in ATRs. The allele distribution of HSP101 in 18 ATRs and 35 DRs were further investigated, after removing the low-quality sequenced materials, the allele of HSP101 could be divided into two types, namely indica (HSP101-i) and japonica (HSP101-j) types. However, HSP101 allele was different and uniform in all NTRs, with a unique base substitution from G to C that caused genomes present in autotetraploid rice (ATRs) and neo-tetraploid rice (NTRs). Sample names start with "Huaduo" are NTRs. Accession 96,025 and the accessions with the extensions of "-4x" are ATRs, and samples from different ecotypes were distinguished by color. Samples with extensions of "-G", "-I" represents japonica and indica diploid rice, respectively. Common wild rice samples start with "Huaye" or end with "-W" a missense mutation in exon 1 and amino acid change from proline to alanine (Fig. 3b). Haplotype network analysis revealed that novel allele of HSP101 has a closer relationship to the japonica haplotype of HSP101 (Fig.  3a). This variation was validated by sanger sequencing in 15 NTRs and 15 ATRs (Table S11), while the variation was not detected in diploid rice lines of the RFGB (Table  S12) and RiceVarMap2 (Table S13) databases, which suggested that it is a novel specific variation in NTRs. Therefore, this haplotype was named as "HSP101-1".
Function Analysis of HSP101-1 in Seedling Thermotolerance and Fertility of Neo-Tetraploid Rice The function of HSP101-1 in neo-tetraploid rice was first validated at seedling stage. qRT-PCR results displayed that HSP101-1 was up-regulated in leaf, sheath and root at seedling stage in Huaduo 1 after heat treatment, and the expression was up-regulated by 653.19-, 1204.37-and 1843.17-fold change in leaf, sheath and root, respectively. While, the expression decreased gradually after heat shock in Huaduo 1 (Fig. 4a, b and c). These results suggested that HSP101-1 is a heat stress response factor in neo-tetraploid rice. To evaluate the heat-tolerance of materials with haplotypes of HSP101-1, thermotolerance experiment was conducted at seedling stage using 2 NTR lines (HSP101-1), Huaduo 1 and Huaduo 2543, and Huanghuazhan (HSP101-i), Huanghuazhan-4x (HSP101-i), Taichung 65 (HSP101-j), Taichung 65-4x (HSP101-j), and the NTR parental lines, Jackson-4x (HSP101-j) and 96,025 (HSP101-j), were used as controls. The results showed that NTRs with the haplotype of HSP101-1 have enhanced thermotolerance capability and produced stronger seedlings than all controls. The heat sensitivity indexes of seedlings were 0.072 and 0.084 in Huaduo 1 and Huaduo 2543, while the heat sensitivity index in controls were ranged from Other than seedling stage, HSP101-1 was also presumed to play important role during reproductive stage. RNA-Seq data revealed that HSP101-1 was highly expressed in anther at meiosis stage in neo-tetraploid rice Huaduo 1 under natural environment (Fig. S10a). Secondly, gene expression of HSP101 in diploid rice, Nipponbare from RiceXPro database suggest that this gene was expressed in ovary at 1 day after flowering (Fig.  S11). The difference in expression patterns indicated that this gene may play divergent functions in diploid and neo-tetraploid rice. qRT-PCR results further confirmed that HSP101-1 was specifically expressed in anthers at PMA (pre-meiotic interphase), MA I (meiosis I) and MA II (meiosis II) under optimal conditions (without heat stress) (Fig. 6a). Knockout of HSP101-1 in neo-tetraploid rice cause 24.99% reduction in seed setting and 31.78% reduction in pollen fertility (Fig. 6b, c and d). Cytological observation on pollen development of hsp101-1 and wild type Huaduo 1 revealed degeneration of some microspores in tetrad stage (Fig. S13g) and asynchronous development of microspores in late microspore stage ( Fig.  S13l and S13p). These results revealed the function of HSP101-1 in regulating fertility of neo-tetraploid rice. Moreover, the seed setting of late and early seasons, which have different temperatures at flowering stage, were compared in NTRs and ATRs. In the high temperature environment during early season (Fig.  S10b), the seed set of ATRs were declined by 8.33% compared to late season, while the seed setting of NTRs increased by about 6.35% (Fig. S10c, Yu et al. 2020). These results indicated that HSP101-1 not only regulate the thermotolerance ability at seedling stage, but also play important role during reproductive stages of NTRs.

Construction of the First Large Scale Genomic Variation Repository of Tetraploid Rice
The high-quality genome sequencing and genomic variation data provided a way to precisely resolve the population structure of plants, which also facilitated the detection of functional genes in rice (Wang et al. 2018b), maize (Lai et al. 2010), soybean (Lam et al. 2011) and cotton (Ma et al. 2018). NTRs are important fertile materials for tetraploid rice breeding, and the development of NTRs provided rice breeders a new direction for the improvement and commercial utilization of tetraploid rice (Guo et al. 2017;Koide et al. 2020;Ghaleb et al. 2020). Understanding the genetic variations of NTRs will largely facilitate their utilization and gene functional analysis. In this context, the genomic variations of 15 NTRs and 18 ATRs were detected by using whole genome re-sequencing. This is the first genomic variation repository of tetraploid rice that will help us to find the responsible factors governing yield related traits in polyploid rice. Here, a total of 2 billion high quality pair-end sequencing reads of tetraploid rice were generated that covered the MSU7 and R498 reference genome by about 46.85-fold coverage depth. A total of 7,445,008 and 7, 880,885 variations were detected between 33 tetraploid Fig. 5 Thermotolerance evaluation of HSP101-1 knockout mutant. a Thermotolerance ability of wild type Huaduo 1 and hsp101-1 at seedling stage. b, c Heat sensitivity index of seedling length and root length of Huaduo 1 and hsp101-1 rice lines and reference genomes of MSU7 and R498, and the low-diversity region on Chr5 was overlapped with recently published genomes of four rice genotypes (Angad et al. 2020). These genomic variations will play an important role to decipher the underlying molecular mechanisms associated with yield performance and further improvement of tetraploid rice.

Neo-Tetraploid Rice Was Classified as Independent japonica Subpopulation with Unique Variations
Population structure and subspecies classification are important characterizations of new crop germplasm, which helps breeders for their proper utilization. As a novel kind of tetraploid rice, there is no information about the classification of NTR. The results of PCA, AD-MIXTURE and phylogenetic analysis classified tetraploid (including NTRs and ATRs) rice accessions into five subpopulations, which were consistent with their features of subspecies classification. Together with the indica/japonica reference genome coverage rate and genomic variations, NTRs can be classified into japonica subspecies with independent position. The population structure of tetraploid rice will be helpful for finding new directions of tetraploid rice improvement. As the high heterosis effect between two rice subspecies, japonica and indica, has been reported, the subspecies classification information might be useful to find the best combination with the highest heterosis (Huang et al. 2015). The results of the present classification explained that why NTRs produced high heterosis when crossed with indica tetraploid rice in a previous study (Guo et al. 2017). More importantly, wide-compatible gene, S 5 n , was Fig. 6 Impact of HSP101-1 on the fertility of neo-tetraploid rice. a Expression pattern analysis of HSP101-1 in neo-tetraploid rice Huaduo 1 using qRT-PCR. b Plant morphology, panicle and pollen fertility of wild type Huaduo 1 (left) and hsp101-1 (right). c, d Comparison of seed setting and pollen fertility between wild type Huaduo 1 and hsp101-1. 2 L, 2S, 3 L, 3S, 4 L, 4S: leaf and stem at meiosis stage, bicellular pollen stage and 5 days after flowering; M, M1, M3: young panicle at the length of 0.5 cm 1 cm and 5 cm; PMAA, PMAO, MA1A, MA1O, MA2A, MA2O, SCPA, SCPO, SCP2A, SCP2O, BCPA, BCPO: anther and ovary at pre-meiotic interphase, meiosis I, meiosis II, microspore early, microspore late and bicellular pollen stages. Scale bars: 20 cm for plant, 1 cm for panicle and 100 μm for pollen also detected in NTRs in this study that was consistent with previous reports, which indicated the promising utilization of NTRs in construction of tetraploid rice hybrids, especially subspecies hybrids (Chen et al. 2008;Guo et al. 2017;Chen et al. 2019;Ghaleb et al. 2020). Although further studies are required to comprehensively reveal fertility mechanism, these findings partially explained that why hybrids between NTRs and ATRs have higher fertility than the hybrids generated form ATRs. The independent population position of NTRs also indicated the existence of NTR specific variations, which provides new information for rice genetic variation resource library and would be important for the improvement of tetraploid rice.
The detection of unique genomic variations that present in the genome of NTRs, but not in ATRs, was considered as the interesting results, which supported the theory of "genome shock" (McClintock 1984;Xu et al. 2014). The autotetraploid rice was developed from the genome doubling of diploid rice. However, few genomic variations were detected between autotetraploid rice and their diploid counterpart. This phenomenon has been reported by previous studies, for example, there were only 185 homozygous variations (78 SNPs and 107 InDels) between ATR 02428-4x and its diploid counterpart 02428-2x ; and only 81 SNPs and 182 InDels were identified in ATR, T449, compared to its diploid counterpart, E249 (Chen et al. 2018). These variations between ATRs and their diploid counterparts tend to be occurred randomly in genome, so there was no unique haplotype detected between ATRs and diploid rice. In allopolyploids, several findings of novel variations support the theory of "genome shock" (McClintock 1984). Novel genomic variations have been detected in the polyploids of Arabidopsis (Madlung et al. 2005), Brassica napus (Gaeta et al. 2007;Song et al. 1995) and wheat (Han et al. 2005;Ozkan et al. 2001;Shaked et al. 2001). Recently, enormous amount of homoeologous exchanges (HEs) were found in segmental allotetraploid rice, which responsible for their genomic and phenotypic variations . Here, NTRs were developed from the crossing decedents of ATRs from different ecotypes, which can be considered as a new segmental allotetraploid rice. The hypothesis is that the unique haplotypes in NTRs compared with DRs may occur during their breeding process of early generations, and the beneficial alleles were selected and fixed in subsequent generations.
Novel Allele, HSP101-1, Is Indispensable for Thermotolerance and High Fertility in Neo-Tetraploid Rice Climate change or global warming may aggravate the decrease in yield and quality of rice (Battisti and Naylor 2009;Welch et al. 2010). The genetically improved rice is considered important to alleviate the threats. HSP101 is a member of HSP100/ClpB family, which is a crucial heat shock protein that regulates thermotolerance to cells, and over-expression of HSP101 significantly improved the thermotolerance ability of diploid rice (Katiyar-Agarwal et al. 2003). The function of HSP101 in heat tolerance was already reported (Hong and Vierling 2001), and its pleiotropic function in regulating development and fitness in plants was also found (Nieto-Sotelo et al. 2002;Tonsor et al. 2008). The accumulation of HSP101 was found in mature maize kernels under optimal temperature and disappeared after 3 days of imbibition. Its negative influences were revealed between HSP101 and growth rate of the primary root in addition to its function in thermotolerance (Nieto-Sotelo et al. 2002). HSP101 may enhance plant fitness in normal growth conditions, since the loss of function of HSP101 reduced about 33% fruits in Arabidopsis thaliana (Tonsor et al. 2008).
The identified novel allele of HSP101-1 in neotetraploid rice was specific to all the high yielding neo-tetraploid rice lines, which was absent in autotetraploid rice, 3KRG and RiceVarMap2 database. Consistent with diploid rice, HSP101-1 is a heat response factor at seedling stage in neo-tetraploid rice, and its association with NTR seedling heat tolerance was further proved by CRISPR/Cas9 gene knockout. HSP101-1 also showed potential to improve fertility and yield in the high temperature environment of early season. Other than thermotolerance, HSP101-1 also plays functional role in reproductive development. The expression of HSP101-1 was detected in anthers at pre-meiotic (PMA), meiosis I and meiosis II stages of neo-tetraploid rice. Moreover, loss-offunction mutant of HSP101-1 showed significant reduction in seed setting of neo-tetraploid rice. Hence, the novel allele of HSP101-1 is indispensable for high fertility in neo-tetraploid rice. These results revealed enriched allelic diversity of HSP101 gene and suggested promising potential function of this novel allele in heat stress and maintaining high fertility of neo-tetraploid rice.

Conclusions
Large-scale genomic variations of autotetraploid rice and neo-tetraploid rice lines, and population structure analysis classified neo-tetraploid rice lines into japonica subspecies with independent subpopulation position. Haplotype patterns of whole genome genes, especially yield related key genes, were analyzed in tetraploid rice population. Functional validation indicated that the novel haplotype of HSP101-1 plays important role in both seedling heat stress and fertility in neo-tetraploid rice. Here, the construction of first large-scale genomic repository of autotetraploid and neo-tetraploid rice, and the revelation of classification and population structure will play important role in optimizing the designs of tetraploid rice breeding. The specific variations in NTRs will facilitate further marker-assisted selection and functional genomics of tetraploid rice.

Calculation of Quantity and Quality of Genome Sequencing Data
Genome sequencing data of tetraploid rice, including 15 neo-tetraploid rice lines (NTRs) and 18 autotetraploid rice lines (ATRs) were collected from our bioproject PRJNA526117. The previously published diploid rice genome sequencing data of 38 accessions (8 common wild rice lines, 15 indica rice lines and 15 japonica rice lines) were downloaded from the NCBI SRA database with the bioproject accession number PRJNA396096 , PRJNA526117 ) and PRJNA407820 (Wang et al. 2018a) using the Aspera Connect (version 3.7.4.147727) software, and the downloaded SRA files were transformed into fastq format using the fastqdump command in SRAToolkit (http://ncbi.github.io/ sra-tools/). The quantity and quality of raw sequencing data was evaluated using the FastQC (v0.11.6) software (Andrews 2010) and the low-quality sequencing reads were removed.

Reads Mapping and Variation Calling
The sequencing reads that passed the quality control process were mapped onto the MSU7 (Nipponbare, O. sativa japonica) and R498 (Shuhui498, O. sativa indica) reference genome using BWA (0.7.17-r1188) (Li and Durbin 2009), and MarkDuplicates in Picard (Broad Institute 2019) were used to eliminate data of PCR duplication. The SAM files were sorted, indexed, and converted to BAM format using SAMtools (version 1.9) (Li 2011). The genomeCoverageBed in bedtools (v2.27.1) was used to estimate the reference genome coverage (Quinlan and Hall 2010). Genome Analysis Toolkit (GATK, version 3.8-0) was used to call the variations from the alignment file, and the analysis pipeline was constructed based on the GATK best practices (Mckenna et al. 2010). The variations were annotated by SnpEff (4.3 s) based on the annotation GFF3 files of two reference genomes (Cingolani et al. 2012), and variations were compared by using VCFtools (0.1.16) software (Danecek et al. 2011). Variation density of ATRs and NTRs was calculated using 100 Kb sliding windows, and the density distribution was illustrated using TBtools . In order to estimate the fraction of japonica and indica genomes present in the ATRs and NTRs, the variations between MSU7 and R498 genomes were detected by aligning the R498 to MSU7 genome. The generated SNPs and InDels were used as markers to calculate the fractions, and the results were illustrated by bar plots.
Population Structure and Phylogeny Analysis SNP variations were selected using "SelectVariants" from GATK (3.8-0) (Mckenna et al. 2010), and PCA (principal components analysis) was conducted using GCTA (1.91.4) software ) based on the whole genome SNP variations and the result was plotted using a script based on ggplot2 (Wickham 2016). A python script (Ortiz 2019) was used to convert the SNP VCF file to PHYLIP format. The PHYLIP format was further converted to MEGA file using MEGA X (version 10.0.4) (Kumar et al. 2018). Phylogenetic neighbor-joining tree was constructed with the bootstrap test of 1000, and the tree was visualized using FigTree (v1.4.3) software (https://github.com/rambaut/figtree). The population structure was analyzed using ADMIXTURE (version 1.3.0) software (Alexander and Lange 2011) with k (the number of groups) ranged from 2 to 8, and the k = 5 was a sensible modeling choice with the lowest cross validation error. The ancestry distributions of individuals were visualized using R script.

Phenol Reaction of NTR Grains
The phenol reaction method was used to validate the indica-japonica classification of 15 NTRs . Autotetraploid rice line Huanghuazhan-4x and its diploid counterpart, autotetraploid rice line Taichung65-4x and its diploid counterpart were used as indica and japonica controls. Fifty grains were randomly selected from each line and were soaked into 2% phenol solution for 48 h, and were washed by water. The color of glumes was observed and photos were taken of all dried seeds.

Haplotype Detection, Validation, and Construction of Gene Phylogenetic Tree
To conduct haplotype analysis, the nonsynonymous variations of 15 NTRs and 18 ATRs were selected and combined into one file. Then, the variations (SNPs and InDels) were used to construct the haplotype by our python script. The haplotypes with homozygous variations were considered as main haplotypes, while the haplotypes with heterozygous bases were still been counted in the total haplotype numbers. Firstly, a total of 717 agronomically important genes for yield that previously reported (Yonemaru et al. 2010;Abbai et al. 2019) in diploid rice were summarized (Table S8). The haplotype patterns of those genes in tetraploid rice population were further analyzed and haplotype number was counted. Secondly, haplotype analysis of 55,801 genes that annotated in rice reference genome (MSU7) was conducted in 15 NTRs and 18 ATRs. The zerodiversity genes, which were conserved in all tetraploid rice lines, were identified and the GO enrichment analysis were conducted using the agriGO v2.0 (Tian et al. 2017). To construct gene phylogenetic tree, the gene consensus sequences in our lines were obtained using consensus command in BCFtools (Danecek et al. 2011) and the phylogenetic neighbor-joining was constructed with the bootstrap test of 1000 in MEGA X (version 10.0.4) (Kumar et al. 2018). Haplotype network of HSP101 was constructed using the sequence of each sample by DnaSP6 (Rozas et al. 2017) and Network software (Bandelt et al. 1999). Gene structure was plotted using IBS . Primers were designed using Primer Blast software in NCBI, and the segment of neo-tetraploid and autotetraploid containing variations were amplified and the PCR products were sequenced using Sanger sequencing. The sequences were aligned using ClustalW Multiple Alignment in BioEdit software. The NTR specific variation in HSP101-1 was checked in the RFGB ) and RiceVarMap2  databases.

Real-Time Quantitative Reverse Transcription PCR (qRT-PCR) Analysis
The seedlings of neo-tetraploid rice line, Huaduo 1, at 3leaf stage was treated at 42°C for 1 h and then seedlings were transferred to growth chamber at 28°C for 1, 2 and 3 h. The leaf (L), leaf sheath (S) and root (R) after each hour were collected and kept in liquid nitrogen and stored at − 80°C. The samples from untreated seedlings were also collected as CK, and all the samples were taken in three biological replications. Total RNA from each sample was extracted and the reverse transcription was conducted using Evo M-MLV RT Kit with gDNA Clean for qPCR (Accurate Biotechnology). Primers used for qRT-PCR were designed using Primer Blast software in NCBI. The reaction of reverse transcription was as follow: incubation for 2 min at 42°C, followed by RT process of 37°C for 15 min, 85°C for 5 s and then stored at 4°C. The qRT-PCR was performed on Lightcy-cler480 (Roche) using the Hieff qPCR SYBR Green Master Mix (YEASEN). The reaction conditions were as follow: 30 s at 95°C, 40 cycles of 95°C denaturation for 5 s and 58°C annealing and extension for 20s. Three biological replications were performed in this experiment. Actin was used as an internal reference for qRT-PCR. The relative gene expression was calculated by the 2 -ΔΔCT method (Livak and Schmittgen 2001).

Construction of CRISPR/Cas9 Vector and Characterization of Gene Knockout Mutants
The targetDesign tool of CRISPR-GE was used to design two target-site primers for NTR specific gene, HSP101-1, using its genomic DNA sequence (http://skl.scau.edu. cn/home/) (Table S14, Xie et al. 2017). The constructed CRISPR/Cas9 vector (Butt et al. 2019;Ma et al. 2019) was transferred into NTR, Huaduo 1. Mutations were validated by Sanger sequencing using gene-specific target site primer. Sanger sequencing data were decoded using DSDecodeM program of CRISPR-GE tool kit (Fig.  S12, Table S15, Xie et al. 2017). The thermotolerance of hsp101-1 was checked in T1 generation by thermotolerance assay (see below). Seed setting of 16 T1 plants were investigated and compared with wild type, Huaduo 1. Pollen fertility was observed by using 1% I 2 -KI method with Motic BA200 microscope (Ghouri et al. 2019). Pollen development was observed by a whole mount eosin B confocal laser scanning microscopy (WE-CLSM) according to Wu et al. (2014).

Heat Treatment and Thermotolerance Assays
Heat treatment of two neo-tetraploid rice (NTR) lines, Huaduo 1 and Huaduo 2543, were performed based on the previously established protocol for diploid rice ) with minor modifications. Indica autotetraploid Huanghuazhan-4x and its diploid counterpart, japonica autotetraploid Taichung65-4x and its diploid counterpart and the parental lines of NTR were used as controls. To test basal thermo-tolerance ability, threeday-old seedlings that grown in an incubator at 28°C in water were treated at 48°C for 1 h, and then the seedlings were transferred to growth chamber at 28°C for about 72 h. After that, the root and seedling lengths of treated and untreated samples (CK) were measured and their morphology was photographed. The index of heat sensitivity was calculated based on the decrease in seedling length compared to the untreated samples.

Expression Pattern Analysis of HSP101-1 in Neo-Tetraploid Rice
The expression information of HSP101-1 was firstly predicted using our transcriptome data of Huaduo 1 , and expression pattern was illustrated using pheatmap (Kolde, 2019). Secondly, the expression pattern of HSP101-1 was validated by qRT-PCR (see above). A total of 21 samples, including leaf and stem at meiosis stage, bicellular pollen stage and 5 days after flowering, young panicle at the length of 0.5 cm, 1 cm and 5 cm, anther and ovary at meiosis I, meiosis II, microspore early, microspore late and bicellular pollen stages, were used to illustrate the expression pattern of HSP101-1 in neo-tetraploid rice Huaduo 1. The expression pattern of HSP101 in Nipponbare were retrieved from RiceXPro database (Sato et al. 2010).
Additional file 1: Figure S1. Circos plot illustrating the variation (including SNPs and InDels) density of different types of tetraploid rice against MSU7 reference genome. Figure S2. Circos plot illustrating the variation (including SNPs and InDels) density of different types of tetraploid rice against R498 reference genome. Figure S3. Difference of 5-fold mapping coverage ratio between japonica reference genome MSU7 and indica reference genome R498. Figure S4. Comparison of genomic variation counts that generated based on japonica reference genome MSU7 and indica reference genome R498. Figure S5. Phenol reaction of neo-tetraploid rice lines. Figure S6. Sequencing coverage of wide compatibility gene S5 (LOC_Os06g11010) in autotetraploid rice lines and neo-tetraploid rice lines. Figure S7. Distribution of haplotypes per gene of the 55,801 annotated genes in tetraploid rice. Figure S8. Distribution of haplotypes per gene in the 717 agronomically important genes in tetraploid rice. Figure S9. The phylogenetic tree of LOC_Os04g09310 (a), LOC_Os05g02840 (b), LOC_Os05g43000 (c) and LOC_Os10g25730 (d) in autotetraploid rice and neo-tetraploid rice lines. Figure S10. The expression pattern analysis of HSP101 and the impact of different temperatures during flowering to the seed setting in tetraploid rice lines. Figure S11.
The expression pattern analysis of HSP101 in diploid rice Nipponbare using the RiceXPro database. Figure S12. Sanger sequencing chromatogram of hsp101-1 mutants. Figure S13. Cytological observation of pollen development of hsp101-1 and wild type Huaduo 1.
Additional file 2: Table S1. Summary of quantity and quality of genome re-sequencing data. Table S2. Genome coverage of sequencing reads to MSU7 and R498 reference genome. Table S3. Number of genomic variations against MSU7 and R498 reference genome. Table S4. Number of genomic variations from different effect levels against MSU7 reference genome. Table S5. Variation density of different types of autotetraploid rice against MSU7 and R498 reference genome. Table S6. Cross-validation error of different k values in ADMIXTURE analysis. Table  S7. GO enrichment analysis of zero-diversity genes in tetraploid rice population. Table S8. Annotation of 717 agronomically important genes that previously reported in diploid rice. Table S9. Haplotype patterns of 717 agronomically important genes in autotetraploid rice (ATR) and neotetraploid rice (NTR) lines. Table S10. Neo-tetraploid rice (NTR) specific haplotypes compared with 30 diploid rice cultivars. Table S11. Validation of neo-tetraploid rice specific variation in HSP101-1 using Sanger sequencing. Table S12. The genomic variations in 3024 diploid rice accessions of HSP101 from rice functional genomics and breeding (RFGB) database. Table S13. The genomic variations in diploid rice accessions of HSP101 from RiceVarMap2 database. Table S14. Two target-site primers for NTR specific gene, HSP101-1. Table S15. Mutations of HSP101-1 knockout plants.