Genome wide search to identify reference genes candidates for gene expression analysis in Gossypium hirsutum

Background Cotton is one of the most important commercial crops as the source of natural fiber, oil and fodder. To protect it from harmful pest populations number of newer transgenic lines have been developed. For quick expression checks in successful agriculture qPCR (quantitative polymerase chain reaction) have become extremely popular. The selection of appropriate reference genes plays a critical role in the outcome of such experiments as the method quantifies expression of the target gene in comparison with the reference. Traditionally most commonly used reference genes are the “house-keeping genes”, involved in basic cellular processes. However, expression levels of such genes often vary in response to experimental conditions, forcing the researchers to validate the reference genes for every experimental platform. This study presents a data science driven unbiased genome-wide search for the selection of reference genes by assessing variation of > 50,000 genes in a publicly available RNA-seq dataset of cotton species Gossypium hirsutum. Result Five genes (TMN5, TBL6, UTR5B, AT1g65240 and CYP76B6) identified by data-science driven analysis, along with two commonly used reference genes found in literature (PP2A1 and UBQ14) were taken through qPCR in a set of 33 experimental samples consisting of different tissues (leaves, square, stem and root), different stages of leaf (young and mature) and square development (small, medium and large) in both transgenic and non-transgenic plants. Expression stability of the genes was evaluated using four algorithms - geNorm, BestKeeper, NormFinder and RefFinder. Conclusion Based on the results we recommend the usage of TMN5 and TBL6 as the optimal candidate reference genes in qPCR experiments with normal and transgenic cotton plant tissues. AT1g65240 and PP2A1 can also be used if expression study includes squares. This study, for the first time successfully displays a data science driven genome-wide search method followed by experimental validation as a method of choice for selection of stable reference genes over the selection based on function alone. Electronic supplementary material The online version of this article (10.1186/s12870-019-1988-3) contains supplementary material, which is available to authorized users.


Abstract
Background: With the advent of newer breeds and transgenic varieties of commercial crops, qPCR (quantitative polymerase chain reaction) experiments have become extremely popular for quick expression checks. Selection of appropriate reference genes plays a critical role in quantifying the expression of target gene. Most commonly used reference genes in expression studies are the "house-keeping genes", involved in basic cellular processes. However, expression levels of such genes often vary in response to experimental conditions, forcing the researchers to validate the reference genes in every experiment. This study presents a data science driven unbiased genomewide search results for selection of reference genes by assessing variation of >50,000 genes in a publicly available RNA-seq dataset of cotton species Gossypium hirsutum. Selected candidate genes were validated experimentally across 33 samples from normal and transgenic G. hirsutum plants, harvested from different areas of the plant at different time points under various developmental conditions. Experimental validation also includes commonly used genes from literature to suggest the most stable set of 5 genes to be used for assessment of quantitative expression in cotton plants ( Fig.1). Result: Five genes (TMN5, TBL6, UTR5B, AT1g65240, CYP76B6) identified by data-driven analysis, along with two commonly used reference genes for cotton found in literature (GhPP2A1 and GhuBQ14) were validated using qPCR in a set of 33 experimental samples consisting of different tissues (leaves, square, stem and root), different stages of leaf (young and mature) and square development (small, medium and large) in both transgenic and non-transgenic plants. Expression stability of the genes was evaulated using four different algorithms -DeltaCT, Genorm, BestKeeper and Normfinder. GhPP2A1 and TMN5 were identified as the most stable genes, followed by GhuBQ14 across all the samples tested. Conclusion: This study, for the first time successfully displays a data science driven genome-wide search method followed by experimental validation as a method of choice for selection of stable reference genes for experiment with cotton species. Based on the results we recommend use of GhPP2A1, TMN5 and GhuBQ14 as the optimal candidate reference genes in qPCR experiments with normal or transgenic cotton plant tissues.

Background
Gene expression analysis by qPCR is the most reliable, accurate and cost-effective technique for studying differential gene expression [1]. This technique has an edge over other commonly employed methods in gene expression like Northern blot hybridization and RT-PCR (reverse transcription polymerase chain reaction), owing to its high sensitivity, specificity, accuracy and broad dynamic range [2,3]. However, the major pitfalls affecting qPCR data are the quality and integrity of RNA samples, efficiency of cDNA synthesis, and PCR efficiency leading to inter-sample variations in amplifications [4]. Therefore, seletion of a normalization strategy plays a crucial role for the interpretation of biologically meaningful data, and the use of an appropriate reference gene or genes as the method for normalizations is the most preferrred [5].
A good reference gene should have a constant level of expression across various conditions and its expression is assumed to be unaffected by experimental parameters [6,7] or between cells of different tissues [8]. Traditionally used reference genes for qPCR studies were carryovers from previous studies which were adopted for non/semi-quantitative methods [9], and it was cautioned to discontinue their usage for qPCR, as numerous studies began to demonstrate that the transcript levels of these genes can also vary considerably under different physiological conditons [10] Availability of genome-wide expression data through high throughput experiments like microarray has helped to identify stable genes that could be used for normalization in qPCR experiments [1]. Among the initial work done [11], the main focus was on evaluating a set of known reference gene candidates for stability of expression using several normalization algorithms -Genorm [12], Normfinder [13] and BestKeeper [14]. However, some researchers also tried assessing gene stability using bioinformatics approach [15], statistical measures like the coefficient of variation (CV) [16], and the difference in DNA entropies in promoters driving the expression of reference and tissue specific genes [17]. Assuming that true reference genes should follow Normal distribution across samples, it was attempted to discover reference genes for expression studies in Soybean using CV and p-value from a normality test [18]. In another attempt, an automated workflow called findRG [19] was proposed to find reference genes in different plant species and human cancers using the CV as the primary measure.
Cotton is one of the most economically important cultivated crop. It is the major source of natural fiber for the textile industry and an important target for genetic modification [20]. Transgenic technology has been applied to cotton for improving the agronomic traits, tolerance to insects, resistance to herbicides and fiber qualities [21]. We, in this study have used Bollgard II (Monsanto, procurrred commercially) transgenic cotton that contains two Bacillus thuringiensis genes Cry1Ac and Cry2Ab proven to have good insecticidal efficacy against cotton bollworm (Helicoverpa armigera) and other Lepidopteran larvae when they feed on it. Numerous studies have focused in identifying the most suitable reference gene for different cotton species under biotic and abiotic stresses [10].
However, study on the effect of expression of a foreign or transgene becomes very important, as various studies have indicated an altered expression of endogenous genes because of the experession of transgenes in both plants [22,23] and animals [24]. Thus, the influence of the transgene on the expression levels of endogenous gene, especially those used as reference genes needs to be thoroughly evaluated [25] prior to the start of a study.
In this paper, we have employed an unbiased genome-wide study from a publicly available gene expression dataset for cotton (G. hirsutum) to identify potential reference gene candidates. We have also evaluated two popular reference genes, GhPP2A1 [10,26] and GhuBQ14, a polyubiquitin [10,27,28,] through the same experimental conditions, to finally select 3 most stable genes for gene expression analysis in cotton. Another polyubiquitin, UBQ10 [29] and reference genes involved in metabolic processes like actin [21,30,31,32], 5SrRNA [33] and histone 3 [27,30,35] has not been included in this study, since they are liable to be regulated under metabolic stresses. We have addressed the expression stability of the potential candidate reference genes to be used in both nontransgenic and transgenic lines of G. hirsutum under various experimental conditions comprising of different tissues (leaves, stem, squares and root), age categories (one month to three month old plant), developmental stages of leaves (young and mature leaves) and square (small, medium and large squares). A data-driven analysis approach complemented with experimental validation used in this study can be extended to other scientific model system with large number of data.

Results
Statistical Analysis of RNA-sequence data Out of the 51,272 genes from RNA-seq dataset with available JGI annotation, 40,135 genes were further filtered that had non-zero expression value (FPKM) in at least 50% of the 55 experimental conditions [36] used in the dataset. Silhouette analysis indicated that only two clusters were most optimal for the analysis. A representation of the two clusters in (CV, MAD, 1-p) hyperspace is shown in

Selection of Commonly Used Reference Genes
Commonly used reference genes employed in cotton research in recent years have been listed in Additional file 2. Most commonly used gene in literature -GhuBQ14, a polyubiquitin and GhPP2A1, a gene with known stability under biotic stress conditions [26] were included in experimental validation along with the other candidate genes selected from RNA-seq data.

Selection of Primers
Melt curve analysis from the top 10 selected genes using pooled cDNA as the target gave 5 genes that met the criteria for good primers (Additional file 3). There was no amplification, or high Cq values (> 35 Cq), observed for non-template controls. Calculation of primer efficiencies using five-fold dilution of cDNA for all the five reference gene primers gave r 2 > 0.97 and 90-110% efficiency (E) values ( Table 1) Expression behaviour of candidate gene in transgenic and normal tissue samples Analysis of expression of the reference genes in transgenic and normal tissue (Fig. 3) samples showed GhPP2A1 to be the most stable reference gene having the lowest median level and uniform expression in both transgenic and non-transgenic leaf, square, stem and root tissues, followed by TMN5. TBL6 showed a similar median level of expression comparable to TMN5, but showed higher variation across the study groups. UTR5B, although showed much lower median expression level than TMN5 and TBL6 exhibited more variations across the study groups.

Expression behaviour of reference genes across various plant parts
Analysis of expression of the reference genes across various plant parts (Fig. 4) leaf, root, square and stem showed GhPP2A1 to be the best reference gene having the lowest median level and uniform expression across various plant parts with most of the values falling in 2 nd and 3 rd quartile. However, the expression of GhPP2A1 in stem tissue is slightly higher when compared to other tissues. This was followed by TMN5, which also showed least variation in expression in various plant parts.
Expression behaviour of reference gene across two age category The analysis of expression of the reference genes between two age category (Fig. 5) one and three month(s) old showed GhPP2A1 to be the best reference gene having the lowest median level and uniform expression across both the age categories followed by TMN5.
Expression behaviour of reference gene across developmental stages in leaf and square tissues GhPP2A1 and TMN5 shows to be a good reference gene across young and mature leaf, follwed by TBL6 (Fig. 6). GhPP2A1 and UTR5B shows lower median expression levels in different developmental stages of square tissues (Fig. 7). AT1g65240 and CYP76B6 showed uniform expression levels in all the developmental stages of square tissue however, shows higher expression values and thus not appropriate to be a good reference gene candidate.
Comprehensive stability analysis of the candidate reference genes Stability analysis of all seven-candidate reference genes using Cq values from all study samples were carried out using RefFinder tool. Geometric means of ranks obtained from all the four algorithms (Delta CT, BestKeeper, Normfinder, and Genorm) was used to rank the most stable genes. The order of comprehensive stability ranking (Fig. 8) was GhPP2A1>TMN5>GhuBQ14>TBL6>UTR5B AT1g65240>CYP76B6. Based on the primer efficiency test, we took forward five novel genes for further experiments and also included GhuBQ14 and GhPP2A1. GhUBQ14, coding for polyubiquitin is one of the most widely used reference gene and known to be stable under various abiotic stresses [10,45], and GhPP2A1, an enzyme involved in Phosphatase activity has been reported to be stable under biotic stress, Helicoverpa infections [10,26]. The expression stability of these seven genes were checked by qPCR in experimental sets including different tissues, age category, developmental stages and different genotypes. The three genes in the decreasing order of stability as ranked by RefFinder was GhPP2A1, TMN5 and GhuBQ14 (Fig. 8). GhPP2A1 encoding protein phosphatase 2A1 and TMN5, encoding for Transmembrane 9 superfamily member 5 protein, which functions in protein localization to membrane, appeared to be the most stable reference gene under all the tested experimental conditions.

Discussion
AT1g65240 and CYP76B6 were the least stable genes, and both these genes showed high expression in square tissue. The protein coded by AT1g65240 gene, Aspartic proteinase-like protein has been shown to be expressed in seed pods and proposed to play a role in the processing and degradation of the storage proteins in the seeds [42], and thus explaining its high expression levels in the square tissues. Similarly, CYP76B6, geraniol hydroxylase, belongs to the family of CYP76 coding for cytochrome P450 enzyme, that catalyzes the single or double oxidation of all linear monoterpenols, derivatives of which are particulary found in flower, fruit and young leaves [43] and hence failed to qualify as a reference gene in this study. This is the first time, validation of reference genes to be used in qPCR was done combining the data driven approach and experimental validation executed on the same platform for non-transgenic and transgenic cotton plants. This study becomes importants as the transgenic cotton plants accounts for 95% of the cotton grown in the total cotton growing area in India [44] , and periodic estimation of transgene expression levels is a critical check point, to ascertain the functionality of these transgenic crops. However, this study only covers the expression stability of the reference genes under normal field conditions and does not include the effect of biotic and abiotic stress conditions on the candidate reference genes.

Conclusion
The present study has employed a data -driven approach for identification and validation of reference genes to be used for qPCR studies in transgenic and non transgenic lines of G. hirsutum under various experimental conditions. Out of the five new candidate reference genes TMN5, TBL6, UTR5B, CYP76B6 and At1g65240 analysed , TMN5 shows a stable level of expression in all the conditions tested in our experiments and hence stands out to be a potential reference gene.
GhPP2A1 and GhuBQ14, previously validated [45] as good reference gene candidate for non transgenic cotton tissues showed the same stability in transgenic cotton plant. We recommend use of these three genes as reference genes in qPCR experiments with cotton species G. hirsutum.

Methods
Workflow for the analysis comprised of two broad sections -statistical analysis of publicly available RNA-seq data to identify candidate reference genes with least variations and validation of the genes in experiment (Fig. 1). To assess stability of a gene, two measures were adopted -(i) CV = x ̂ σ x where x ̂ and σ x are mean and standard deviation of a variable x respectively and (ii) the normality p-value measured by the Shapiro-Wilks Test (p-value < 0.05 indicates the distribution to be away from Normal) [18]. CV, albeit most frequently used, is affected by outliers, and hence fails to be a robust measure. To address this, a third parameter -Median Absolute Deviation (MAD = m e d i a n x -x ̂ where x ̂ is the median of x) [38] was used after normalization with median. MAD is a measure of the spread of the distribution and being based on medians, is less susceptible to deviations by outliers.

Gene Expression Data Acquisition
An ideal set of reference genes should have low or similar statistical variation across samples (represented by low values of CV and MAD) and should behave as much like a normal distribution (high value of normality p-value or low values of 1 -p-value). Therefore, a k-medoids clustering of genes were clustered based on the values of the three statistical parameters -CV, MAD (normalized to respective z-scores) and 1 -p-value using the PAM (Partitioning around Medoids) algorithm originally proposed by Kaufman and Rousseeuw [39]. Medoid based clustering approach was chosen over the more commonly used k-means method to reduce the effect of outlier genes in cluster determination. Optimal number of clusters required is calculated using the Silhouette graphical method of Rosseeuw [40]. Cluster having the lowest medoid values for each of the three parameters was selected and genes in the cluster was ranked using the Euclidean distance d = C V 2 + M A D 2 + ( 1 -p ) 2 (all parameters replaced by their z-scores) in this three-parameter hyperspace. Top 100 genes with least values of Euclidean distance were selected and their subcellular location and function were analyzed using Gene Ontology annotation.

Selection of Commonly used Reference Genes from Literature
Most commonly used reference genes were shortlisted from literature keeping in mind their frequencies of usage in published scientific literature on cotton from 2016 till recent. Researchers use no unique keywords to report studies on reference genes for expression studies. Hence, a very broad methodology was adopted in which all articles in PubMed were searched for occurrence of any of the terms "reference gene" or "control gene" or "housekeeping gene" along with co-occurrence of the term Gossypium hirsutum anywhere in title and abstract. Obtained abstracts were manually curated to find the relevant articles that described studies on reference genes specifically in the context of Total RNA isolation Total RNA was extracted from fresh tissues as described previously [41] with slight modifications.
Briefly, fresh tissue were homogenized to a fine paste in precooled mortar and pestle using 10 ml of extraction buffer (400 μl of β-mercaptoethanol and 4% polyvinyl pyrrolidone) per gram of the plant tissue. An equal volume of water saturated phenol, chloroform and isoamyl alcohol, at a ratio 25:24:1, was added, mixed thoroughly and centrifuged. Separated aqueous phase was extracted with chloroform-isoamyl alcohol, followed by addition of lithiun chloride (LiCl) to a final concentration of 3 M. After overnight incubation at −20˚C, the RNA precipitate was re-suspended in 2M LiCl, centrifuged and washed with 70% ice-cold ethanol. Pellet was air-dried and dissolved in 500 μl of sterile water.
RNA integrity was checked on 1% agarose gel (1%, w/v) and the concentration and purity was The reaction conditions were -preincubation at 95°C for 10 seconds followed by the amplification for 45 cycles (95°C -10 second; 60°C -15 seconds; and 72°C -15 seconds). Two independent biological samples of each experimental condition were evaluated in technical triplicates for all the reference genes studied. Negative controls in which the cDNA was replaced with nuclease free water were also included for each primer pair.

Analysis of Gene Expression Stability
From Quantification cycle (Cq) values of each gene in qPCR experiments mean of non-template control (NTC) Cq values were subtracted to obtain C q = C q (sample) -Mean C q (NTC) and relative expression as 2 -Cq for each replicate. Geometric mean of expression values of replicates are plotted for the chosen reference genes across different samples as depicted in results.
To analyze stability of expression, mean Cq values for each gene for all 33 experiments were fed into the RefFinder tool [18] that integrates the currently available major computational programs (Genorm, Normfinder, BestKeeper, and the comparative DeltaCt method) to compare and rank the tested candidate reference genes. Comprehensive stability rank of each gene was calculated as the geometric mean of stability rank given by each method.   Figure 1 Work Flow to identify candidate reference genes with least variations and validation of the genes in experiment.

Figure 2
Cluster of genes in the three-dimensional space of CV, MAD and 1-p obtained using the PAM method. Genes marked in red represent cluster #1 with least medoid values as shown in