Genome-wide comparative transcriptome analysis of CMS-D2 and its maintainer and restorer lines in upland cotton

Cytoplasmic male sterility (CMS) conferred by the cytoplasm from Gossypium harknessii (D2) is an important system for hybrid seed production in Upland cotton (G. hirsutum). The male sterility of CMS-D2 (i.e., A line) can be restored to fertility by a restorer (i.e., R line) carrying the restorer gene Rf1 transferred from the D2 nuclear genome. However, the molecular mechanisms of CMS-D2 and its restoration are poorly understood. In this study, a genome-wide comparative transcriptome analysis was performed to identify differentially expressed genes (DEGs) in flower buds among the isogenic fertile R line and sterile A line derived from a backcross population (BC8F1) and the recurrent parent, i.e., the maintainer (B line). A total of 1464 DEGs were identified among the three isogenic lines, and the Rf1-carrying Chr_D05 and its homeologous Chr_A05 had more DEGs than other chromosomes. The results of GO and KEGG enrichment analysis showed differences in circadian rhythm between the fertile and sterile lines. Eleven DEGs were selected for validation using qRT-PCR, confirming the accuracy of the RNA-seq results. Through genome-wide comparative transcriptome analysis, the differential expression profiles of CMS-D2 and its maintainer and restorer lines in Upland cotton were identified. Our results provide an important foundation for further studies into the molecular mechanisms of the interactions between the restorer gene Rf1 and the CMS-D2 cytoplasm.


Background
Cotton is the most important fiber crop and an important oil-producing crop worldwide. As in other crop plants, utilization of heterosis is an important way to improve yield in cotton production. To date, most commercial cotton hybrids have been produced by artificial emasculation and pollination (AEP) in China [1] and India (http://www.cicr.org.in/), which is a time-consuming, labor-intensive and costly process. In addition, the purity of hybrid seeds produced by AEP cannot be guaranteed as some artificial emasculation may not completely remove the pollen. The cytoplasmic male sterility (CMS) system is an ideal tool for hybrid seed production, and it has been widely used to facilitate the use of heterosis in many crops [2]. CMS-D2 is one of the two major types of CMS [3][4][5][6] in cotton and has contributed to cotton heterosis utilization. Rf1 is the restorer gene and can recover the fertility of CMS-D2. Considering the importance of the CMS and restoration system, numerous molecular mapping studies have been conducted on of Rf1 in cotton [7][8][9][10][11][12][13]. Recently, a backcross population (BC 8 F 1 ) with plants distinguished as male fertile (F) or sterile (S) was generated and used to map the Rf1 gene by our group [14]. However, there have been few studies on the molecular mechanism of the restorer gene.
Over the past several years, next-generation sequencing (NGS) has been used in numerous research areas, resulting in high-throughput production of massive DNA and RNA data [15]. As a powerful tool for studying global transcriptional networks, transcriptome sequencing provides high-resolution data and has been widely used in many crops. In cotton, it has been used to study boll development [16], fiber development [17][18][19], leaf senescence [20], gland morphogenesis [21], abiotic stress responses [22][23][24], biotic stress responses [25,26], RNA editing in relation to CMS-D8 [27], and genic male sterility [28]. Differential display and gene chips were used to study the expression levels of differentially expressed genes (DEGs) associated with the fertility of CMS-D8 in cotton [29,30]. However, the global gene expression patterns of CMS-D2 and its interaction with its restorer gene Rf1 are still unknown. Now that the genome sequences of G. raimondii [31,32], G arboreum [33], and G hirsutum [34,35] have been published, gene annotation can be better performed, which will improve genome-wide transcriptome sequencing and analysis in cotton.
To better understanding the gene expression profiles affected by the restorer gene Rf1 in Upland cotton with the CMS-D2 cytoplasm, RNA-seq by the Illumina NGS technology was used in this study to identify DEGs in flower buds of fertile (i.e., restorer R line) and sterile (i.e., CMS A line) plants of a backcross population (BC 8 F 1 ) and its recurrent parent, i.e., the maintainer B line. GO and KEGG enrichment analysis showed that genes related to circadian rhythms were significantly affected by the presence of the restorer gene. The results from this study will serve as a foundation for further studies of the molecular mechanisms of interaction between the restorer gene Rf1 and the CMS-D2 cytoplasm.

Plant materials
In our previous study [14], the sterile line ZBA with the CMS-D2 cytoplasm was crossed with the restorer line Zhonghui46, and then the maintainer B line (designated dB3) with the normal fertile Upland cotton (AD1) cytoplasm was used as the recurrent male parent to backcross with the F 1 plants to construct a BC 8 F 1 population. In this segregating population, the sterile plants (designated dZB3) were considered to be the CMS-D2 A line, and the fertile plants (designated dZK3) were considered to be the restorer R line. All materials were provided by Institute of Cotton Research (ICR), Chinese Academy of Agricultural Science (CAAS). The BC 8 F 1 population and recurrent parent were grown in the Experimental Farm, ICR-CAAS, Anyang, Henan province, China. A randomized complete block design with three biological replications was used, and crop management practices followed local recommendations. On sunny days of about 30°C, flowering buds of about 3 mm in length (at roughly the stage of male meiosis) were collected and combined from 50 plants for each genotype in each replication. All harvested samples were snap-frozen in liquid nitrogen and stored at −80°C before use.

RNA extraction, RNA-seq library construction and sequencing
Total RNA was isolated using the Sigma Spectrum Plant Total RNA kit (Sigma-Aldrich, USA) according to the manufacturer's protocol. The concentration of each RNA sample was measured using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies Inc., USA). Nine individual libraries (three samples for each of the three genotypes) were constructed with an Illumina RNA TruSeq kit (Illumina, USA) per the manufacturer's instructions using 5 μg of total RNA. Subsequently, PCR amplification was performed using Phusion DNA polymerase (NEB, USA) for 15 PCR cycles, and f cDNA fragments of 300-500 bp were isolated from a 2% low range ultra agarose gel (Bio-Rad, USA). After quantification by TBS380 (Picogreen, Invitrogen, USA), the paired-end libraries were then sequenced using the Illumina HiSeq™ 2500 system (2 × 151 bp read length) at Shanghai Majorbio Bio-pharm Biotechnology Co., Ltd. (Shanghai, China).

Quantitative RT-PCR (qRT-PCR) validation
First-strand cDNA was generated from 1 μg total RNA from individual replications using a PrimerScript RT Reagent kit (Perfect Real Time, TaKaRa, Japan). Quantitative real-time RT-PCR was performed using SYBR® Premix Ex TaqTM (Perfect Real Time, TaKaRa, Japan) according to the manufacturer's instructions. Primers for qPCR were designed using the Primer Express software (Applied Biosystems, Foster City, CA, USA), synthesized commercially (Tianyi Huiyuan Biotechnology, Beijing, China), and are shown in Additional file 1. PCR analysis was performed using a CFX96TM instrument (Bio-Rad, USA). Each reaction contained 2 μl cDNA template, 800 nM of each primer and 10 μl 2 × SYBR® Premix Ex TaqTM, with ddH 2 O to bring the final volume to 20 μl. The reaction was pre-denatured at 95°C for 30 s, followed by 40 cycles of denaturation at 95°C for 5 s, annealing at 58°C for 20 s and extension at 72°C for 30 s. A melting curve was generated for each sample at the end of each run to determine the specificity of the amplified products. Each gene was analyzed in triplicate, and controls without template were also included. Actin was used as an internal control. The threshold cycle (Ct) values of each reaction were determined automatically by the instrument software, and the relative amount of each gene to the internal control was calculated using the eq. 2 −ΔΔCt , where ΔΔCt = (Ct target − Ct actin) sample X − (Ct target − Ct actin) sample 1. The whole assay protocol was repeated three times to ensure the reliability of the assay data. The standard deviations of the data were determined from the three independent experiments. The statistical significance of expression differences was analyzed using the Student's t-test.

Transcriptome sequencing and mapping
In this study, near-isogenic A, B and R lines each comprising three individual biological samples of 3 mm-long flowering buds at the stage of male meiosis were used to construct cDNA libraries for a deep Illumina sequencing. After filtering the raw reads, 48 population were obtained (Additional file 2). More than 90% of these clean reads were mapped to the G. hirsutum TM-1 reference genome (Additional file 3). The deep RNA-seq had a 90.55-91.89% genome coverage of the predicted genes in Upland cotton. In total, 62,001 of the 70,478 predicted transcripts in the reference TM-1 genome were identified in this study and were used for a further analysis.

GO and KEGG classification of the expressed genes
Blast2go was used to retrieve the GO functional annotations, and the results showed that 46,150 of the 62,001 predicted transcripts were successfully assigned GO annotations within the three main GO categories and 57 sub-categories (Fig. 1a). 'Metabolic process' (32,285 genes; representing 69.9% of transcripts in the biological process category), 'cellular process' (28,157 genes; 61.0%), and 'single-organism process' (23,292 genes; 50.5%) had the highest numbers of genes in the biological process category. 'Cell' (21,221 genes; representing 46.0% of transcripts in the cellular component category), 'cell part' (20,897 genes; 45.3%) and 'organelle' (14,269 genes; 30.9%) had the most genes in the cellular component category. 'Catalytic activity' (23,001 genes; representing 49.8% of transcripts in the molecular function category), 'binding' (22,866 genes; 49.5%) and 'transporter activity' (2677 genes; 5.8%) were the most important sub-categories in the molecular function category (Additional file 4). In addition, a total of 23,211 transcripts were categorized into 175 pathways (Additional file 5), among which metabolic pathways, biosynthesis of secondary metabolites and ribosome pathways contained the most transcripts (Fig. 1b).

Global Transcriptome changes
The number of reads mapped to the predicted transcripts of the TM-1 reference genome was calculated as the expression level for each gene. The following three comparisons of gene expression levels were performed: B (dB3) vs. A (dZB3), which had the isogenic nuclear genomes (containing the recessive non-functional rf1 allele) but different cytoplasms and fertility; B (dB3) vs. R (dZK3), both of which were isogenic and fertile but differed in their cytoplasms and Rf1 alleles; and A (dZB3) vs. R (dZK3), both of which had the same CMS-D2 cytoplasm but differed in fertility and Rf1 alleles. A total of 728 (442 upregulated and 286 downregulated), 918 (524 upregulated and 394 downregulated) and 456 (176 upregulated and 280 downregulated) DEGs were identified in the above three comparisons, respectively (Additional files 6-8). These DEGs represented a total of 1464 non-redundant genes, including 1368 that were distributed across the 26 chromosomes of G. hirsutum and 96 genes on 56 scaffolds (Fig. 2). It is interesting to note that Chr_D05 (with restorer gene Rf1) and the homeologous Chr_A05 (99.5 DEGs vs. 48.7 DEGs) carried more DEGs than the other chromosomes. Furthermore, among the 1464 DEGs, three possible mitochondrial targeted protein-coding genes (Gh_D01G1128, Gh_D06G0518 and Gh_A03G1169) and five possible chloroplast targeted protein-coding genes (Gh_A13G2212, Gh_A05G2854, Gh_A12G0821, Gh_A12G0217 and Gh_ D11G3195) were differentially expressed between dZK3 and dB3, and three possible chloroplast targeted proteincoding genes (Gh_Sca078114G01, Gh_D01G0297 and Gh_A07G1517) were differentially expressed between dZB3 and dB3. These DEGs may be affected by the CMS-D2 cytoplasm.
The distribution of unique and common DEGs for the three comparisons is shown in Fig. 3. The results indicated that 251 of 728 DEGs were unique to B (dB3) vs. A (dZB3), 408 of 918 were unique to B (dB3) vs. R (dZK3), and 192 of 456 were unique to A (dZB3) vs. R (dZK3). Compared with R (dZK3, containing the restorer gene), 136 common DEGs were identified in both B (dB3) and A (dZB3) containing the non-restoring gene. Compared with B (dB3, with normal Upland cotton cytoplasm), 349 common DEGs were identified in both A (dZB3) and R (dZK3), which contained the CMS-D2 cytoplasm. Compared with the male sterile A line (dZB3), 103 common DEGs were identified in the fertile B (dB3) and R (dZK3) lines.

GO and KEGG enrichment analysis of DEGs
For the 728 DEGs between B (dB3) and A (dZB3), 'metabolic process' , 'catalytic activity' and 'single-organism process' were the three most common GO terms (Additional file 9), and 'metabolic pathways' , 'biosynthesis of secondary metabolites' and 'microbial metabolism in diverse environments' were the three most common KEGG pathways (Additional file 10).
Seven DEGs associated with the GO terms 'molecular transducer activity' and 'electron carrier activity' were specifically upregulated and downregulated, respectively in dB3. For the 918 DEGs between B (dB3) and R (dZK3), 'metabolic process' , 'cellular process' and 'catalytic activity' were the three most common GO terms (Additional file 11), while the three most common pathways (Additional file 12) were the same as in B (dB3) and A (dZB3). Six DEGs associated with the 'cell junction' and 'symplast' were specifically upregulated in R (dZK3). For the 456 DEGs between A (dZB3) and R (dZK3), 'metabolic process' , 'cellular process' and 'binding' were the three most common GO terms (Additional file 13), and 'metabolic pathways' , 'biosynthesis of secondary metabolites' and 'drug metabolism cytochrome P450' were the three most common pathways (Additional file 14). Eleven DEGs associated with growth, six with structural molecule activity and five with electron carrier activity were specific upregulated in dZB3.
To identify significant GO categories and KEGG pathways among the three comparisons, further GO and KEGG enrichment analyses were performed. The GO categories 'negative regulation of circadian rhythm' , 'transcription regulatory region DNA binding' and 'regulatory region nucleic acid binding' had the highest enrichment ratios between the maintainer B line (dB3) and the CMS-D2 A (dZB3) line (Additional file 15), while 'longday photoperiodism' , 'negative regulation of sequencespecific DNA binding transcription factor activity' and 'negative regulation of circadian rhythm' had the highest enrichment ratios between the A line (dZB3) and the restorer R (dZK3) line (Additional file 16). ' Allene-oxide cyclase activity' , 'response to wounding' and 'oxidoreductase activity' had the highest enrichment ratios between the B (dB3) and the R (dZK3) lines (Additional file 17).
The three primary KEGG pathways with the highest ratios were 'circadian rhythm' , 'alpha-linolenic acid metabolism' ele ctr on ca rri er ac tiv ity pr ot ein bin din g tra ns cr ipt ion fa cto r ac tiv ity bin din g tra ns po rte r ac tiv ity ch an ne l re gu lat or ac tiv ity pr ot ein ta g ca ta lyt ic ac tiv ity m et all oc ha pe ro ne ac tiv ity gu an yl− nu cle ot ide ex ch an ge fa cto r ac tiv ity nu tri en t re se rv oir ac tiv ity en zy m e re gu lat or ac tiv ity m ole cu lar tra ns du ce r ac tiv ity str uc tu ra l m ole cu le ac tiv ity tra ns lat ion re gu lat or ac tiv ity nu cle ic ac id bin din g tra ns cr ipt ion fa cto r ac tiv ity an tio xid an t ac tiv ity m ole cu lar fu nc tio n re gu lat or  and 'sesquiterpenoid and triterpenoid biosynthesis' between the B (dB3) and A (dZB3) lines (Additional file 18); 'circadian rhythm' , 'protein processing in endoplasmic reticulum' and 'photosynthesis' between the A (dZB3) and R (dZK3) lines (Additional file 19); and 'protein processing in endoplasmic reticulum' , 'alpha-linolenic acid metabolism' and 'thyroid hormone synthesis' between the B (dB3) and R (dZK3) lines (Additional file 20). The results showed that the circadian rhythm pathway was an important and common pathway that was affected during meiosis.

Analysis of DEGs on Chr_D05 and DEGs related to circadian rhythms
In our previous study [14], the restorer gene Rf1 was shown to be located on Chr_D05 near position 54,287,522. In this study, Gh_D05G3189 and Gh_ D05G3427 near the target region were found to be specifically expressed in the fertile R lines but were not expressed in the A or B lines. To further understand the effect of DEGs from regions adjacent to Rf1, GO enrichment analysis of 105 DEGs on Chr_D05 was performed. The results demonstrated that 'sesquiterpene synthase activity' and '(+)-delta-cadinene synthase activity' were the two major GO terms with the highest enrichment ratios, while 'sesquiterpenoid and triterpenoid biosynthesis' , 'protein processing in endoplasmic reticulum' and 'carotenoid biosynthesis' were the three major pathways identified in KEGG enrichment analysis. To examine the correlation between the expression of the DEGs in different samples, a heatmap analysis was performed based on the FPKM values of the 105 DEGs on Chr_D05 with the restorer gene and 16 DEGs related to the circadian rhythm (Fig. 4). The results showed that DEGs participating in sesquiterpene synthase activity and (+)-delta-cadinene synthase activity were all expressed preferentially in the B line, while most of the genes related to protein processing in the endoplasmic reticulum were highly expressed in the R line. Furthermore, it was interesting to find that most DEGs related to the circadian rhythm were highly expressed in the R and A lines with the CMS-D2 cytoplasm, implying a possible connection between the circadian rhythm and the CMS-D2 cytoplasm.

Validation of RNA-seq data by qRT-PCR
To validate the RNA-seq data using real-time qRT-PCR, 11 DEGs were selected based on high fold-changes (Gh_A12G1505), specific expression in certain genotypes (Gh_A08G0004), chromosomal location on Chr_D05 (Gh_D05G0902, Gh_D05G1016, Gh_D05G3189, and Gh_D05G3427), and association with the circadian rhythm (Gh_D02G0690, Gh_A11G0920, Gh_A11G0926, Gh_D09G1513, and Gh_D12G1525). The expression patterns of these genes are shown in Fig. 5. The results showed that except for the Gh_D09G1513 gene, the expression patterns as determined by qRT-PCR were consistent with those obtained by RNA-seq, confirming the accuracy of the RNA-seq results in this study.

SNP identification of DEGs on Chr_D05
The DEGs located on Chr_D05 with the restorer gene Rf1 were chosen for identification of SNPs among the three lines (genotypes). For the 105 DEGs on Chr_D05, 11 SNP loci in 11 DEGs were identified between the sequences from the R line and those from the nonrestoring genome, i.e., the A and B lines, including seven loci in exons and four loci downstream of the coding sequences (Table 1). Among these genes, Gh_D05G3129, Gh_D05G3141, Gh_D05G3211 and Gh_D05G3427 were located within the predicted target region of Rf1. Therefore, some of them may be related to the fertilityrestoring gene, especially Gh_D05G3427, which is a proton pump-interactor 1-like gene that was expressed specifically in the restorer line.

Discussion
Illumina sequencing and sequence annotation The CMS system is considered the most important tool and is ideal for cotton hybrid seeds production. A restorer line containing a restorer gene is the determinant for the CMS system. Thus, to understand restorer genes, a large number of molecular mapping studies have been conducted. However, there have been no reports about how the restorer gene Rf1 affects gene expression. In the present study, transcriptome sequencing was performed to generate large amounts of cDNA sequence hirsutum used as the reference genome, more than 90% of clean reads were mapped to the reference genome. In total, 62,001 of the 70,478 predicted transcripts in the reference genome were identified in this study through gene annotation. Thus, the transcriptomic data in this study met the basic requirements needed for a comparative analysis. Finally, 1464 DEGs were identified among the three lines, many of which could serve as potential targets for future studies aimed at discovering the molecular mechanism of nucleo-cytoplasmic interactions.

DEGs in the restorer Gene located on chromosome c5
The 1464 DEGs were mapped to 26 chromosomes and 56 scaffolds of G. hirsutum. Chr_D05 and its homeologous chromosome Chr_A05 were the two chromosomes with the most DEGs. In our previous study, the restorer gene Rf1 was mapped to Chr_D05 [14]. This implied that the expression profiles of these genes may be affected by the restorer gene. Sesquiterpene synthase activity, (+)-delta-cadinene synthase activity and carotenoid biosynthesis were identified as important pathways according to the GO enrichment analysis of the 105 DEGs on Chr_D05. Cotton (+)-delta-cadinene synthase has been reported as a sesquiterpene cyclase that catalyzes a branch-point step leading to the biosynthesis of sesquiterpene phytoalexins, including gossypol [38][39][40]. In plants, carotenoids are crucial for various biological processes, such as photosynthesis, photoprotection, and regulation of growth and development [41][42][43][44], as well as responses to the environment [45,46]. During field tests, the fertility of CMS-D2 restorer containing the restorer gene was affected by the environment. Therefore, whether there are correlations between terpene biosynthesis and functions of the restorer gene requires further study. In our study, Gh_D05G3427, which had a SNP and specifically expression in the restorer line, was identified in the predicted target region of Rf1 on Chr_D05. It is a proton pump-interactor 1-like gene (PPI1). Previous studies have demonstrated that the PPI1 is a novel protein that can interact with the C-terminal autoinhibitory domain of the plasma membrane (PM) H(+)-ATPase [47]. PM H + −ATPases are important for plant nutrient acquisition and can be detected at the whole plant level [48][49][50]. Furthermore, some PM H + −ATPases only expressed in anther tissues have been identified [51][52][53], implying that this type of genes is important for male gametogenesis. In this study, the PM H + −ATPases regulatory gene Gh_D05G3427 was identified specifically in the restorer line. Thus, it could be a potentially  important gene that interacts with the restorer gene and affects male gametophyte development. Further study of this gene is needed to elucidate the genetic and molecular mechanism of fertility restoration associated with Rf1.
The circadian rhythm pathway and its relationship with pollen development Previous research has shown that the circadian rhythm pathway is involved in the promotion of reproductive organs development in the vegetative stage in higher plants [54][55][56], photosynthesis [57,58], starch metabolism [59][60][61], phytohormone response [61][62][63], hypocotyl elongation [64,65], and plant-pathogen interaction [66]. Additionally, some research has indicated that the circadian rhythm pathway is involved in the male sterility transition [67,68]. In this current study, several genes associated with the circadian rhythm were identified, some of which comprise interlocking transcriptional feedback loops that play important roles in the plant central clock. Some loops integrate environmental factors, such as light and temperature, into the central clock through the input signaling pathway and import the rhythm signal into downstream signaling pathways through output signaling pathways [69,70]. Here, circadian rhythm differences between the fertile and sterile lines were also identified, and the differential expression profiles of the genes related to the circadian rhythm were confirmed by qRT-PCR. However, how the restorer gene regulates the circadian rhythm, which in turn regulates male fertility, needs a further study.

Conclusions
Through genome-wide comparative transcriptome analysis, 1464 DEGs were identified in flower buds among the fertile R line, maintainer B line and sterile A line. The Rf1-carrying Chr_D05 and the homeologous Chr_A05 had more DEGs than the other chromosomes. qRT-PCR further confirmed the accuracy of the RNAseq results. The circadian rhythm pathway was identified as an important pathway differing between the fertile and sterile lines by GO and KEGG enrichment analysis.
In the predicted target region of Rf1 on Chr_D05, Gh_D05G3427 was found to be expressed specifically in the restorer line and to have a restorer line specific SNP.
Our results provide useful data for future investigations into the molecular mechanisms of nucleo-cytoplasmic interaction in CMS cotton.
Ethics approval and consent to participate All the cotton lines used and analyzed were public and available for non-commercial purpose. This article did not contain any studies with human participants or animals performed by any of the authors.