pcr: an R package for quality assessment, analysis and testing of qPCR data

Background Real-time quantitative PCR (qPCR) is a broadly used technique in the biomedical research. Currently, few different analysis models are used to determine the quality of data and to quantify the mRNA level across the experimental conditions. Methods We developed an R package to implement methods for quality assessment, analysis and testing qPCR data for statistical significance. Double Delta CT and standard curve models were implemented to quantify the relative expression of target genes from CT in standard qPCR control-group experiments. In addition, calculation of amplification efficiency and curves from serial dilution qPCR experiments are used to assess the quality of the data. Finally, two-group testing and linear models were used to test for significance of the difference in expression control groups and conditions of interest. Results Using two datasets from qPCR experiments, we applied different quality assessment, analysis and statistical testing in the pcr package and compared the results to the original published articles. The final relative expression values from the different models, as well as the intermediary outputs, were checked against the expected results in the original papers and were found to be accurate and reliable. Conclusion The pcr package provides an intuitive and unified interface for its main functions to allow biologist to perform all necessary steps of qPCR analysis and produce graphs in a uniform way.


INTRODUCTION
Real-time quantitative PCR (qPCR) is a commonly used technique to analyze the relative gene expression level in the biomedical research. In most cases, small scale experiments are designed to quantify the level of mRNA among experimental conditions. Some advanced machines and optimized protocols have simplified the experiments to a highly efficient one-step process, allowing the effective analysis of a large scale of qPCR data. However, all processes for assessing the quality of the data, performing the analysis and reporting the results are not done in the most uniform way across the literature (Bustin & Nolan, 2004). Different analysis models have been proposed and implemented in different software environments (Pabinger et al., 2014). Furthermore, requirements and guidelines for reporting qPCR data were independently introduced (Bustin et al., 2009). Table 1 Average C T value for c-myc and GAPDH at different input amounts. In this report, we introduce an open source R package for performing quality assessment, modeling and testing for statistical significance of qPCR data in a uniform way. In its current version, the pcr package implement two methods for relative quantification of mRNA expression proposed originally by Livak & Schmittgen (2001), in addition to the necessary steps to check the assumption of these methods. Also, we implement a number of methods to check for statistical significance in qPCR data which were introduced in SAS by Yuan et al. (2006). Finally, the package provides unified interface to make the analysis accessible and the ability to make graphs of the different analysis steps for visual inspection and preparation of publication-level figures. We start by describing the process for generating the data in the original papers, briefly introduce the methods and apply them to the original data using the pcr.

Data sources
To illustrate the usage of the pcr package and to apply it to qPCR data, we used real qPCR datasets from two published papers. In addition, we compared the results obtained by the pcr package to that of the original paper to ensure the reliability. In their original article, Livak & Schmittgen (2001) collected total RNA from human tissues; brain and kidney. c-myc and GAPDH primers were then used for cDNA synthesis and used as input in the PCR reaction. Seven different dilutions where used as input to the PCR reaction (three replicates each), this dataset was referred to as ct3 and shown in Table 1. Six replicates for each tissue were run in separate tubes. This dataset was referred to as ct1 through this document and shown along with the difference calculations in Tables 2 and 3. At the second work, Yuan et al. (2006) extracted total RNA from Arabidopsis thaliana plant treated and control samples (24 samples each), and performed qPCR analyses using MT7 and ubiquitin primers. This dataset was referred to as ct4 and shown the results of the different testing methods that applied in the original paper in Table 4. In all datasets, the raw C T (Cycle Threshold), also known as C q , was recorded (Bustin et al., 2009) and will be used as input to the pcr analysis pipeline.

Statistical methods
In contrast with the absolute quantification of the amount of mRNA in a sample, the relative quantification uses a internal control (reference gene) and/or a control group (reference group) to quantify the mRNA of interest relative to these references. This relative quantification was sufficient to draw conclusions in most of the biomedical applications involving qPCR. A few methods were developed to perform these relative quantification. These methods generally require different assumptions and models for the analysis. The most common two of these methods were described here in the following sections.

The comparative C T methods
The comparative C T methods assume that the cDNA templates of the gene/s of interest as well as the control/reference gene have similar amplification efficiency, and also that this amplification efficiency is near perfect. This means that, at a certain threshold during the linear portion of the PCR reaction, the amount of the gene of the interest and the control double each cycle. Another assumption is that the expression difference between two genes or two samples can be captured by subtracting one (gene or sample of interest) from another (reference). The final assumption is that the reference doesn't change with the treatment or the course in question. The formal derivation of the double delta C T model is described here Livak & Schmittgen (2001). Briefly, the C T is given by: And the relative expression by: where C T ,q is the difference in the C T (or their average) of a gene of interest and a reference gene in a group of interest. C T ,cb is the the difference in the C T (or their average) of a gene of interest and a reference gene in a reference group. The error term is given by: where s 1 is the standard deviation of a gene of interest and s 2 is the standard deviation of a reference gene.

Standard curve
In comparison, this model doesn't assume perfect amplification but rather actively use the amplification in calculating the relative expression. Therefore, when the amplification efficiency of all genes are 100% both methods should give similar results. The standard curve method is applied using two steps. First, serial dilutions of the mRNAs from the samples of interest are used as input to the PCR reaction. The linear trend of the log input amount and the resulting C T values for each gene are used to calculate an intercept and a slope. Secondly, these intercepts and slopes are used to calculate the amounts of mRNA of the genes of interest and the control/reference in the samples of interest and the control sample/reference. These amounts are finally used to calculate the relative expression in a manner similar to the later method, just using division instead of subtraction. The formal derivation of the model is described here (Yuan et al., 2006). Briefly, the amount of RNA in a sample is given by: And the relative expression is given by: 10 log amount (5) where C T is the cycle threshold of a gene. b is the intercept of C T log10 input amount. m is the slope of C T . And the error term is given by: where: where s is the standard deviation.X is the average. cv is the coefficient of variation or relative standard deviation.

Statistical significance tests
Assuming that the assumptions of the first methods are holding true, the simple t -test can be used to test the significance of the difference between two conditions ( C T ). t -test assumes, in addition, that the input C T values are normally distributed and the variance between conditions are comparable. Wilcoxon test can be used if sample size is small, and those two last assumptions are hard to achieve. Two use the linear regression here. A null hypothesis is formulated as following, C T ,target,treatment − C T ,control,treatment = C T ,target,control − C T ,control,control .
This is exactly the C T value as explained earlier. So the C T is estimated and the null is rejected when C T = 0.

Quality assessment
Fortunately, regardless of the method used in the analysis of qPCR data, The quality assessment can be done in a similar way. It requires an experiment similar to that of calculating the standard curve. Serial dilutions of the genes of interest and controls are used as input to the reaction and different calculations are made. The amplification efficiency is approximated be the linear trend between the difference between the C T value of a gene of interest and a control/reference ( C T ) and the log input amount. This piece of information is required when using the C T model. Typically, the slope of the curve should be very small and the R 2 value should be very close to one. A value of the amplification efficiency itself is given by 10 −1/slope and should be close to 2. Other analysis methods are recommended when this is not the case. Similar curves are required for each gene using the C T value instead of the difference for applying the standard curve method. In this case, a separate slope and intercept for each gene are required for the calculation of the relative expression.

Availability & installation
The pcr packages is available on CRAN, the main repository for R packages and can be installed by invoking install.packages in an R (≥3.4.2) session. The package's source code is also available on github, https://github.com/MahShaaban/pcr along with the development version.

# i n s t a l l t h e p c r p a c k a g e f r o m CRAN i n s t a l l . p a c k a g e s ( ' p c r ' )
The examples shown in this article are explained in greater details in the package vignette that can be accessed through browseVignette('pcr'). Moreover, the package documentation provides detailed instruction on the input and the output of each function (e.g., ?pcr_analyze).

Functionality & user interface
The pcr package provides different methods for performing quality assessment, modeling and testing real-time qualitative PCR data through the unified interface of three functions: pcr_assess, pcr_analyze and pcr_test, respectively.
Quality assessment pcr_assess provides two methods for assessing the quality of qPCR data. These are 'efficiency' and 'standard_curve' to calculate the amplification efficiency and gene standard curves as described in the methods section. The following code block applies both methods to the dataset ct3, shown in Table 1. Using the argument plot as TRUE in the pcr_assess function provides the a graphic presentation of the amplification and the standard curves as shown in Fig. 1.

Analysis models
Similarly, pcr_analyze provides two methods to model the C T values and calculates the relative expression of target genes. 'delta_delta_ct' performs the C T method described previously. The average relative expression of the target gene in the condition of interest is given by the Eqs. (1) and (2) and the standard deviation by Eq. (3). The calculations are applied to the dataset 'ct1', shown in Table 2 and Fig. 2A. 'relative_curve' performs the relative standard curve quantification, average relative expression/amount of the target gene in the condition of interest is given by Eqs. (4) and (5) and the standard deviation by Eqs. (6) and (7). The calculation is applied to the same dataset 'ct1' and is shown in Table 3 and Fig. 2B.  Figure 2 Relative expression of c-myc in human brain and kidney tissues. Total RNA from human brain and kidney tissues were used to synthesize cDNA and real-time quantitative PCR reaction with cmyc and GAPDH primers. C T values from six independent replicates were used to calculate the expression of c-myc in the kidney normalized by GAPDH and relative to the brain using The   The pcr package workflow In Fig. 3, we suggest a workflow based on three steps; Quality assessment, modeling and testing for statistical significance, these can be applied by using the pcr package functions; pcr_assess, pcr_analyze and pcr_test respectively. The workflow is based on the C T from two kinds of experiments. First, the C T values for the genes of interest and a reference gene in the different experimental groups. Second, the C T values of the same genes from a serial dilution experiment. The amplification efficiency can be calculated by calculating the C T from the first and the slope and the intercept from the second experiment. The C T model and the relative expression are then calculated and tested. In the case of the two groups comparison, any of the testing methods can be applied otherwise the linear regression can be used for multiple group comparisons. Alternatively, the standard curve methods can be applied by calculating the amount of cDNA in each sample from the C T values and the individual curves for each gene. These amounts are finally used to obtain the relative expression of the genes of interest. Pabinger et al. (2014) surveyed the tools used to analyze qPCR data across different platforms. They included nine R packages which provide very useful analysis and visualization methods. Table 6 shows a comparison of the features of these nine packages from the original publication, in addition to the pcr package. Some of these packages focuses one certain model and some are designed to handle high-throughput qPCR data (e.g., chipPCR). Most of these packages are hosted in CRAN and a few on the Bioconductor so they adhere to Bioconductor methods and data containers (e.g., qpcR).

Comparison with other packages
In comparison, pcr provides a unified interface for different quality assessment, analysis and testing models. The different functions take similar inputs to perform different analysis steps which requires minimal formating at the end of the user. The main input and the output are tidy data.frame, and the package source code follows the tidyverse practices. This package targets the small scale qPCR experimental data and the R users. The interface and documentation choices were made with such users in mind and require no deep knowledge in specific data structures or complex statistical models. Users can go from the raw C T values through different analysis steps to publication graphs in a few simple lines of code.

Limitations & future directions
The current version of the pcr package (1.1.0) provides only methods to estimate the expression of genes in a certain condition relative to another. Other methods were proposed for absolute quantification of the copy number of genes in samples (Whelan, Russell & Whelan, 2003). Also, the comparative C T methods assume that the PCR reaction has a close to perfect amplification rates. Other methods were proposed to model the data when this assumption is not true (Liu & Saint, 2002;Tichopad et al., 2003). We are planning to implement methods for absolute quantification and dealing with less than perfect amplification efficiency cases in a future version of the package.

CONCLUSION
To sum, the pcr package is an open source R package for quality assessing, modeling and testing real-time quantitative PCR data. The package provide an intuitive and unified interface for its main functions to allow biologist to perform all necessary steps of qPCR analysis and produce graphs in a uniform way.