Introduction

A microRNA (miRNA) is a non-coding RNA molecule that plays an important role in RNA silencing and post-transcriptional regulation of gene expression1,2. There is increasing evidence that miRNAs are closely related to various complex human diseases. Each miRNA can interact with hundreds of genes and plays various roles in tumorigenesis, metastasis, and proliferation3. In addition, a single gene can also be targeted by multiple miRNAs, which constitutes complex miRNA-target interactions3.

The interplay among miRNAs plays an important role in the development of complex human diseases. A network provides a natural way to model the interactions among miRNAs, with nodes representing miRNAs and edges representing interactions between miRNAs4. The main advantages of network-based approaches include their applicability to large amounts of data, resistance to disturbances, and ease of visual interpretation5. The development of network theory makes it possible to calculate the global properties of these networks, providing insight into the behavior of the systems they represent6. A gene co-expression network is an undirected graph in which nodes represent genes or probes, and a pair of nodes is connected by an edge if there is a significant co-expression relationship between them7. An edge weight indicates the magnitude of the co-expression (e.g., correlation) between the pair of nodes connected by the edge. A gene co-expression network focuses on the interplay of multiple genes, checking whether these genes are over- or under-expressed simultaneously.

By comparing the gene network based on diseased samples (cases) and that based on non-diseased samples (controls), one can learn which gene pairs are involved in the development of the disease. We call a network of genes a gene differential correlation network if the edges in the network connect pairs of genes having significantly different edge weights (i.e., correlations) between the gene co-expression network based on cases and that based on controls (Fig. 1). Constructing differential correlation networks differs from detecting differential mean expression in that it can identify changes in co-expression between diseased and normal subjects and reveal potential biological interactions even when genes do not show different mean expression levels8,9. To the best of our knowledge, no research has yet constructed differential correlation networks using miRNAs.

Figure 1

An illustration of a differential correlation network. In this toy example, the correlations between node 1 and node 5 and between node 5 and node 6 are different between the network for controls and the network for cases.

To construct a differential correlation network, we usually first test, for each pair of nodes, whether the correlation between the pair is the same in cases and in controls. We then connect a pair of nodes if the difference of the correlations between cases and controls is greater than a threshold. The benchmark statistical method for testing equal correlation of a pair of random variables between two independent populations is Fisher's Z-transformation test. However, it is sensitive to violations of the normality assumption, which cannot be guaranteed in real data analysis. Violating the normality assumption inflates the type I error rate (i.e., the false positive rate).
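As a point of reference, a minimal R sketch of Fisher's Z-transformation test for comparing the correlations of two independent samples is given below. The function name fisherZTest and its arguments are ours for illustration; the corTest package referenced in the Code availability section provides an implementation of this test.

# Fisher's Z-transformation test of H0: corr(X, Z) is the same in cases and controls.
# x1, z1: observations in cases; x0, z0: observations in controls.
fisherZTest <- function(x1, z1, x0, z0) {
  r1 <- cor(x1, z1)
  r0 <- cor(x0, z0)
  z1t <- atanh(r1)                         # Fisher's Z transform: 0.5 * log((1 + r) / (1 - r))
  z0t <- atanh(r0)
  se  <- sqrt(1 / (length(x1) - 3) + 1 / (length(x0) - 3))
  stat <- (z1t - z0t) / se                 # approximately N(0, 1) under H0 and normality
  pval <- 2 * pnorm(-abs(stat))
  list(statistic = stat, p.value = pval)
}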

Some improved tests for equal correlation have been proposed to be robust against violations of the normality assumption. However, those methods have limitations. For instance, the two methods (twocor and twopcor) proposed by Wilcox10 are bootstrapping-based and hence computationally intensive. To construct a gene differential correlation network, we need to test equal correlation between cases and controls for G(G-1)/2 pairs of gene probes, where G is the number of gene probes, which is usually large (~20,000) in genome-wide data analysis. Hence, it is not efficient to use bootstrapping-based methods to construct gene differential correlation networks.

ROS-DET (a combination of robust correlations and hypothesis testing) proposed by Kayano et al. (2011) focuses only on pairs of genes that have positive correlations in one subject group and negative correlations in the other11. It ignores scenarios where two correlations are significantly different but have the same direction. The results of other methods without bootstrapping, such as Zou's method12 and the HC4 (heteroscedasticity-consistent estimators) method13, can be unsatisfactory even under the normality assumption14. Hence, there is a great need for a robust and fast test for equal correlation. In this article, we proposed 6 novel tests for equal correlation and performed systematic simulation studies and a real data analysis to compare the performance of these new methods with that of existing methods. For the real dataset, we construct a robust differential correlation network. Note that the term 'robustness' in this article differs from that in the theory of robust networks, in which robustness means that a network continues to perform well after attacks. In this article, robustness means that a method performs well when model assumptions, such as the normality assumption, are violated.

Results

Simulation studies

We compared the 6 proposed equal-correlation tests (ST1, ST2, ST3, ST4, ST5, ST6) with 4 existing tests (twopcor, twocor, twohc4cor10, and Fisher's Z-transformation test) in systematic simulation studies, in which we evaluated whether the 6 proposed methods could achieve higher power than the 4 existing methods while keeping the nominal type I error rate (0.05) when the random variables X and Z are generated from normal or non-normal distributions. The methods twopcor and twocor are bootstrapping-based.

Following Wilcox (2009), we generated the observations of the random variables X and Z from g-and-h distributions15 (see Section 3 of Supplementary Document I for the definition of a g-and-h distribution) for cases and controls. In a g-and-h distribution, the parameters g and h are both non-negative. If both g and h are equal to zero, the g-and-h distribution is the standard normal distribution. A positive value of g indicates a skewed distribution, and a positive value of h indicates that the g-and-h distribution has a heavier tail than the standard normal distribution. As g becomes larger, the distribution becomes more asymmetric; as h becomes larger, its tail becomes heavier. To compare the performance of the different methods for distributions with different skewnesses and tails, we considered ten scenarios: (1) g = 0.2 and h = 0.2 (asymmetrical distribution with heavy tail); (2) g = 0.5 and h = 0.5 (more asymmetrical distribution with heavier tail); (3) g = 1 and h = 1 (most asymmetrical distribution with heaviest tail); (4) g = 0.2 and h = 0 (asymmetrical distribution with relatively light tail); (5) g = 0.5 and h = 0 (more asymmetrical distribution with relatively light tail); (6) g = 1 and h = 0 (most asymmetrical distribution with relatively light tail); (7) g = 0 and h = 0.2 (symmetrical distribution with heavy tail); (8) g = 0 and h = 0.5 (symmetrical distribution with heavier tail); (9) g = 0 and h = 1 (symmetrical distribution with heaviest tail); and (10) g = 0 and h = 0 (standard normal distribution).
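For illustration, a sketch of a g-and-h random number generator in R is given below. It assumes the standard Tukey g-and-h transformation of a standard normal variate, consistent with the definition referenced in Section 3 of Supplementary Document I; the function name rgh is ours.

# Generate n draws from a g-and-h distribution by transforming standard normal variates.
rgh <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  if (g == 0) {
    z * exp(h * z^2 / 2)                       # g = 0: symmetric; h > 0 thickens the tails
  } else {
    (exp(g * z) - 1) / g * exp(h * z^2 / 2)    # g > 0: adds skewness; h > 0 thickens the tails
  }
}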

Also following Wilcox (2009), we generated observations for a pair of random variables X and Z. We first generated random numbers x1 and e1 for cases and x2 and e2 for controls from g-and-h distributions. We then generated random numbers z1 and z2 via the formulas z1 = θ1x1 + λj(x1)e1 for cases and z2 = θ2x2 + λj(x2)e2 for controls, j = 1, 2, 3. The parameters θ1 and θ2 indicate the magnitude of the correlations between z1 and x1 and between z2 and x2, respectively. For example, θ1 = 0 indicates that z1 and x1 are uncorrelated, θ1 > 0 indicates that they are positively correlated, and θ1 < 0 indicates that they are negatively correlated. The functions λj(x), j = 1, 2, 3, define three variance patterns (vp) for z: for vp1, λ1(xi) = 1; for vp2, λ2(xi) = |xi| + 1; for vp3, \({\lambda }_{3}({x}_{i})=\frac{1}{|{x}_{i}|+1}\). With vp2, the conditional variance of Z given X is small when X is close to its mean and increases as X moves away from it; with vp3, the conditional variance is large when X is close to its mean and decreases as X moves away from it. The random error terms e1 and e2 have the same distributions as x1 and x2, respectively14.
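A sketch of this data-generating process in R is given below, reusing the rgh() generator sketched above; the function and argument names are ours.

# Generate n paired observations (x, z) for one group via z = theta * x + lambda(x) * e,
# where x and e are independent g-and-h variates and vp selects the variance pattern.
simPair <- function(n, theta, g, h, vp = 1) {
  x <- rgh(n, g, h)
  e <- rgh(n, g, h)
  if (vp == 1) {
    lambda <- rep(1, n)          # vp1: constant variance
  } else if (vp == 2) {
    lambda <- abs(x) + 1         # vp2
  } else {
    lambda <- 1 / (abs(x) + 1)   # vp3
  }
  z <- theta * x + lambda * e
  data.frame(x = x, z = z)
}

# Example: a power scenario with theta1 = 0 for cases and theta2 = 1 for controls.
cases    <- simPair(n = 100, theta = 0, g = 0.2, h = 0.2, vp = 1)
controls <- simPair(n = 100, theta = 1, g = 0.2, h = 0.2, vp = 1)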

To evaluate the effect of sample size on the performances of the equal-correlation tests, we considered 3 different sample sizes: (1) 30 cases and 30 controls; (2) 100 cases and 100 controls; and (3) 200 cases and 200 controls.

To evaluate the type I error rates of the tests, we considered two scenarios: (1) θ1 = θ2 = 0; and (2) θ1 = θ2 = 1. To evaluate the powers of the tests, we set θ1 = 0 and θ2 = 1. In total, we have 10 × 3 × 3 × 2 = 180 scenarios for evaluating Type I error rates and 10 × 3 × 3 × 1 = 90 scenarios for evaluating powers.

For each scenario, we generated 100 datasets. For each dataset, we generated 1000 pairs of random variables X and Z for cases and controls, respectively. Because the bootstrapping-based methods (twopcor and twocor) are computationally intensive, we evaluated their performance only in the scenarios where nCases = nControls = 100, g = 0.2, and h = 0.2 (nCases is the number of cases and nControls is the number of controls).

To evaluate the performance of the equal-correlation tests in the simulation studies, we used the Type I error rate and the power. Figures 2 and S6 show that the median Type I error rates of twocor and ST5 did not exceed the nominal level 0.05 in any scenario, which demonstrates the excellent robustness of twocor and ST5. The median Type I error rates of ST1 and ST2 (when h = 0, θ1 = 0, θ2 = 0) and ST6 (when the variance pattern is vp3, h = 0, θ1 = 1, θ2 = 1, and n = 100 or 200) were only slightly higher than 0.05 in a few scenarios, which shows that symmetry does not affect the type I error rates of ST1, ST2, and ST6, but the heavy tail does. For twopcor, the median Type I error rates were higher than 0.05 when the variance pattern was vp1 or vp2. When θ1 = 1 and θ2 = 1 (i.e., the correlations between X and Z are non-zero, but the same, in both cases and controls), the median Type I error rates of twohc4cor and ST4 were higher than 0.05; for ST3, the median Type I error rates were higher than 0.05 when θ1 = 0 and θ2 = 0 (i.e., the correlations between X and Z are zero in both cases and controls).

Figure 2

Boxplots of type I error rates. Upper left panel: the scenario with θ1 = 0, θ2 = 0, g = 0.2, h = 0.2, nCases = nControls = 100, vp1; upper middle panel: the scenario with θ1 = 0, θ2 = 0, g = 0.2, h = 0.2, nCases = nControls = 100, vp2; upper right panel: the scenario with θ1 = 0, θ2 = 0, g = 0.2, h = 0.2, nCases = nControls = 100, vp3; lower left panel: the scenario with θ1 = 1, θ2 = 1, g = 0.2, h = 0.2, nCases = nControls = 100, vp1; lower middle panel: the scenario with θ1 = 1, θ2 = 1, g = 0.2, h = 0.2, nCases = nControls = 100, vp2; lower right panel: the scenario with θ1 = 1, θ2 = 1, g = 0.2, h = 0.2, nCases = nControls = 100, vp3.

For the power analyses, the median powers of all six methods increased as the sample size increased (see Fig. 3). For all six methods, the powers were smaller when the variance pattern was vp2. Among the methods (ST1, ST2, ST5, ST6, and twocor) whose median Type I error rates were smaller than 0.05 in almost all simulation scenarios, ST5 and twocor always had the highest power, regardless of sample size, distribution, and variance pattern. The median powers of ST5 were almost equal to those of twocor in all scenarios.

Figure 3

Boxplots of powers. Upper left panel: the scenario with θ1 = 0, θ2 = 1, g = 0.2, h = 0.2, nCases = nControls = 100, vp1; upper middle panel: the scenario with θ1 = 0, θ2 = 1, g = 0.2, h = 0.2, nCases = nControls = 100, vp2; upper right panel: the scenario with θ1 = 0, θ2 = 1, g = 0.2, h = 0.2, nCases = nControls = 100, vp3; lower left panel: the scenario with θ1 = 0, θ2 = 1, g = 0, h = 0.2, nCases = nControls = 30, vp2; lower middle panel: the scenario with θ1 = 0, θ2 = 1, g = 0, h = 0.2, nCases = nControls = 100, vp2; lower right panel: the scenario with θ1 = 0, θ2 = 1, g = 0, h = 0.2, nCases = nControls = 200, vp2.

We summarized the simulation results in Fig. 4 and Table S1. Figure 4 plots the number nreject of scenarios with a mean type I error rate significantly >0.05 versus the median rank of power m. For each scenario and each equal-correlation test, we used a one-sample t-test of the null hypothesis H0 that the mean of the 100 estimated type I error rates from the 100 simulated data sets is ≤0.05 versus the alternative hypothesis Ha that this mean is >0.05. If the p-value of the one-sample t-test was <0.05, we claimed that the mean type I error rate was >0.05 for this scenario and this equal-correlation test. For a given scenario, we ranked in terms of power the equal-correlation tests that did not reject the null hypothesis H0. For ranks with ties, average ranks were used. For the equal-correlation tests that rejected the null hypothesis H0, we set their ranks as missing values. Since each scenario evaluating power (θ1 = 0, θ2 = 1) has two corresponding scenarios evaluating the Type I error rate (θ1 = θ2 = 0 or θ1 = θ2 = 1), we also set ranks as missing values if either of the two corresponding scenarios had a mean Type I error rate >0.05. We then obtained the median m of the ranks for each equal-correlation test. The right panel of Fig. 4 is based on all scenarios. The proposed test ST5 is located at the bottom-left corner of the right panel of Fig. 4 (nreject = 0, m = 1.75), indicating that it had no inflated type I error rates and the highest overall power, and hence performed best. The methods twocor and twopcor do not appear in the right panel of Fig. 4 because they are bootstrapping-based and computationally intensive; we included them only in the scenarios where nCases = nControls = 100, g = 0.2, and h = 0.2. The left panel of Fig. 4 is the plot for these scenarios, from which we can see that twocor performed best (nreject = 0, m = 1.5) and ST5 performed second best (nreject = 0, m = 2).
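For one scenario and one test, this decision rule can be sketched in R as follows. The vector typeI here holds placeholder values; in the actual analysis it would contain the 100 estimated type I error rates for that scenario and test.

# One-sided one-sample t-test of H0: mean type I error rate <= 0.05
# versus Ha: mean type I error rate > 0.05.
set.seed(123)
typeI <- runif(100, min = 0.03, max = 0.07)               # placeholder type I error rates
tt <- t.test(typeI, mu = 0.05, alternative = "greater")
inflated <- tt$p.value < 0.05                             # TRUE: claim mean type I error rate > 0.05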

Figure 4

nreject is the number of scenarios with a mean Type I error rate > 0.05; m is the median rank of power. We set ranks as missing values if either of the two corresponding Type I error rates (two scenarios with the same sample size, variance pattern, g, and h) is larger than 0.05. The right panel is based on all scenarios; the left panel is based on the scenarios that include twopcor and twocor.

Real data analysis

We downloaded the miRNA dataset GSE15008 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse15008) from the public data repository Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo) to construct differential correlation networks of miRNAs using 6 equal-correlation tests (3 proposed tests and 3 existing tests). GSE15008 contains 677 miRNAs measured in 174 non-small-cell lung cancer (NSCLC) tissues and 187 adjacent normal tissues from patients. After data preprocessing, 178 miRNAs were kept for the differential correlation analysis (details of the data preprocessing are given in Section 4 of Supplementary Document I). The quantile plots (Fig. S2) and plots of the top 2 principal components (Fig. S3) based on the 178 miRNAs showed no obvious patterns.

Before constructing differential correlation networks, we inspected the normality of the expression data of the 178 selected miRNAs using the Shapiro–Wilk test for cancer tissues and for normal tissues, separately. The normality assumption is violated for the majority of miRNAs. Specifically, 80.34% of the FDR-adjusted p-values for testing normality are smaller than 0.05 for cancer tissues, 63.48% are smaller than 0.05 for normal tissues, and 88.20% are smaller than 0.05 for either cancer tissues or normal tissues. Please see Table S4 for the p-values and adjusted p-values of the miRNAs. We also performed a Wilcoxon signed-rank test for each of the 178 selected miRNAs to check if they are differentially expressed between cancer tissues and adjacent normal tissues. Their FDR-adjusted p-values are given in Table S5.

We randomly divided the 174 NSCLC samples into two equal parts. One part (87 samples) formed the cases of the discovery set and the other (87 samples) formed the cases of the validation set. Similarly, we randomly divided the 187 normal samples into two roughly equal parts. One part (94 samples) formed the controls of the discovery set and the other (93 samples) formed the controls of the validation set.

We compared ST1, ST5, ST6, Fisher’s test, twohc4cor, and twocor in the real data analysis. We did not include ST2 because of its low power in the simulation studies. We did not include ST3 and ST4 because they were no better than ST5 and ST6 in the simulation studies. We did not include twopcor because of its long computational time and its low power in the simulation studies.

We claimed that the differential correlation of a pair of miRNAs is validated if its FDR-adjusted p-value is < 0.05 in the discovery set and its raw p-value is < 0.05 in the validation set. We applied the Benjamini and Yekutieli method16 to calculate the FDR-adjusted p-values.

We call the network formed by the pairs of miRNAs with validated differential correlations the validated differential correlation network. We visualized the validated differential correlation networks with Cytoscape.

We chose the miRNA having the maximum number of edges in the validated differential correlation network as the hub miRNA. We applied the web-tool miRSystem17 to predict the genes targeted by the hub miRNA and to obtain the KEGG pathways enriched in these genes.

To evaluate the differential correlation networks in the real data analysis, we used the proportion of validated edges in the discovery set, \({\rm{r}}=\frac{{e}_{1}}{{e}_{2}}\), where e1 is the number of validated edges (i.e., pairs of miRNAs having FDR-adjusted p-values < 0.05 in the discovery set and raw p-values < 0.05 in the validation set) and e2 is the number of edges detected based on the discovery set alone (i.e., pairs of miRNAs having FDR-adjusted p-values < 0.05 in the discovery set).
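The validation criterion and the proportion r can be sketched in R as follows; p.disc and p.valid are placeholder vectors of raw p-values for all miRNA pairs in the discovery and validation sets.

set.seed(1)
p.disc  <- runif(1000)                          # raw p-values, discovery set (placeholder)
p.valid <- runif(1000)                          # raw p-values, validation set (placeholder)
p.adj   <- p.adjust(p.disc, method = "BY")      # Benjamini-Yekutieli FDR adjustment
detected  <- p.adj < 0.05                       # edges detected in the discovery set
validated <- detected & (p.valid < 0.05)        # edges also confirmed in the validation set
e2 <- sum(detected)
e1 <- sum(validated)
r  <- if (e2 > 0) e1 / e2 else NA               # proportion of validated edges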

The numbers e1 of validated edges obtained by the 6 tests are 0 (ST1), 35 (ST5), 37 (ST6), 71 (Fisher), 0 (twohc4cor), and 440 (twocor). ST6 had the highest validation rate (r = 100.00%), followed by ST5 (r = 94.59%), Fisher (r = 88.75%), and twocor (r = 74.83%). In terms of running time, Fisher was the fastest of the 6 methods, taking only 2.42 seconds. ST1 was the second fastest, taking 5.97 seconds. ST5 and ST6 took 35.92 seconds and 37.32 seconds, respectively. Twohc4cor took 89.88 seconds, around 2.5 times as long as ST5/ST6. Twocor, the bootstrapping-based method, took the longest time: 31429.21 seconds (i.e., 8.73 hours). The results of the differential correlation analyses of the 6 methods for the real dataset GSE15008 are summarized in Table 1.

Table 1 Summary of the differential correlation analyses of the 6 methods for the real dataset GSE15008.

Figure 5 shows the validated differential correlation networks based on ST5, ST6, Fisher, and twocor. The hubs in the differential correlation networks obtained by ST5 and ST6 are the same, i.e., hsa-miR-143, which targets 531 genes (Table S2). These 531 genes are enriched in 5 KEGG pathways (LEISHMANIASIS, PRION DISEASES, GNRH SIGNALING PATHWAY, ECM RECEPTOR INTERACTION, and ALZHEIMER'S DISEASE) (Table S3). The hub detected by Fisher is hsa-miR-492, which targets 107 genes (Table S2). These 107 genes are not enriched in any KEGG pathway. The hub detected by twocor is hsa-miR-363, which targets 573 genes. These 573 genes are enriched in 5 KEGG pathways (DILATED CARDIOMYOPATHY, CALCIUM SIGNALING PATHWAY, ECM RECEPTOR INTERACTION, LONG TERM POTENTIATION, and REGULATION OF ACTIN CYTOSKELETON) (Table S3).

Figure 5

Validated differential correlation networks.

Note that the miRNAs in a differential correlation network do not necessarily have different mean expression levels between cases and controls (i.e., they are not necessarily differentially expressed). Hence, constructing a differential correlation network can identify disease-associated miRNAs that would be missed in a differential expression analysis. For example, in our real data analysis, the miRNA hsa-miR-152 in the differential correlation networks detected by ST5 and ST6 is not differentially expressed between cancer tissue and adjacent normal tissue (the FDR-adjusted p-value of the Wilcoxon signed-rank test equals 0.9997). Please see Fig. 6 (scatter plots of the expression levels of hsa-miR-152 and hsa-miR-145) and Fig. 7 (histograms of hsa-miR-152) for cases and controls, separately.

Figure 6

Scatter plots of the expression levels of hsa-miR-152 and hsa-miR-145.

Figure 7

Histograms of hsa-miR-152 for cancer tissues and for normal tissues, separately.

Discussion

A differential correlation network of miRNAs, characterizing the differences between the co-expression network among cases and that among controls, could help us understand how pairs of miRNAs affect the disease of interest. In this article, we proposed six novel robust tests (ST1, ST2, ST3, ST4, ST5, and ST6) for equal correlation and compared them with four existing tests (twocor, twopcor, Fisher's Z-transformation test, and twohc4cor) in the construction of differential correlation networks. Simulation studies showed that ST5 performed as well as twocor (a bootstrapping-based method) and had the highest power among the tests that kept the nominal type I error rate in all simulation scenarios. The real data analysis also showed the good performance of ST5: ST5 had the second highest validation rate (r = 94.59%), behind ST6 (r = 100%) and ahead of Fisher (r = 88.75%) and twocor (r = 74.83%). Furthermore, ST5 is computationally fast. Hence, the proposed equal-correlation test ST5 could be used to construct robust differential correlation networks in genomic data analysis.

Although twocor showed good performance in the simulation studies, it detected too many differential correlation edges (e1 = 395, e2 = 537) but had the smallest validation rate (r = 73.56%) in the real data analysis. Moreover, as a bootstrapping-based method, twocor is computationally intensive. Therefore, twocor is probably not suitable for constructing differential correlation networks in genomic data analysis, where the number of nodes is large.

In the real data analysis, the ranking of the proportions of validated edges is ST6, ST5, Fisher, and twocor. ST1 and twohc4cor failed to detect any significant differential correlation based on the discovery set, which indicates that ST1 and twohc4cor are not powerful methods, as was also shown in the simulation studies. In the validated differential correlation networks of ST5, ST6, and twocor, the top three nodes having the largest number of edges are the same (hsa-miR-143, hsa-miR-145, hsa-miR-363). miRSystem predicted that hsa-miR-143 (the hub detected by ST5 and ST6), hsa-miR-145, and hsa-miR-363 (the hub detected by twocor) are connected to pathways related to lung cancer, such as the GNRH signaling pathway18 (empirical p-value 0.01490), the calcium signaling pathway19 (empirical p-value 0.01872), and the TGF-BETA signaling pathway20 (empirical p-value 0.02706). However, hsa-miR-492 (detected by Fisher) is not related to lung cancer (Table S3). We surmise that the hub selected by Fisher is not related to lung cancer because of the high false positive rate of Fisher's Z-transformation test.

From Fig. 3, we observed that the power increases as the sample size increases in our simulation studies, which matches our intuition. Specifically, in Fig. 3 the bottom left panel (nCases = nControls = 30), bottom middle panel (nCases = nControls = 100), and bottom right panel (nCases = nControls = 200) have increasing ranges of median powers: [0.1, 0.5], [0.3, 0.75], and [0.5, 0.95], respectively. Figure S6 contains more examples (Figure S6 can be downloaded at https://sites.google.com/view/weiliangqiu/supplementary-documents/figure-s6).

This study has a couple of limitations. First, the sample size (174 NSCLC tissues and 187 normal tissues) of the GEO dataset GSE15008 is not very large, and we did not find an independent dataset for validation. Instead, we randomly split the GSE15008 dataset into a discovery set and a validation set. Second, we have not yet derived the asymptotic or approximate distribution of the ST5 test statistic. Instead, we numerically demonstrated that the distribution of the ST5 test statistic under the null hypothesis can be approximated by the chi-square distribution with one degree of freedom. Further research on the asymptotic distribution of ST5 is needed.

Although ST5 combines ST3 and ST4, the corresponding chi-square test statistic still has one degree of freedom, like ST1, ST2, ST3, and ST4. ST6 also combines ST3 and ST4, but it has 2 degrees of freedom. That is, ST5 uses more information without increasing the degrees of freedom. We think this is a possible reason why ST5 overall performed better than ST1, ST2, ST3, ST4, and ST6. Further investigations are warranted.

Heavy tails (h) have more effect on the type I error rate than symmetry (g). For instance, for Fisher's test, the median type I error rates are 0.05 (setting: θ1 = θ2 = 1, g = 0, h = 0, vp1, nCases = nControls = 200), 0.52 (setting: θ1 = θ2 = 1, g = 1, h = 0, vp1, nCases = nControls = 200), 0.52 (setting: θ1 = θ2 = 1, g = 0, h = 1, vp1, nCases = nControls = 200), and 0.85 (setting: θ1 = θ2 = 1, g = 1, h = 1, vp1, nCases = nControls = 200). Please see Figure S6.

In summary, we proposed 6 robust tests for equal correlation to construct differential correlation networks and found that ST5 had overall good performance in both the simulation studies and the real data analysis. However, if the expression data follow a normal distribution, Fisher's Z-transformation test should be used because of its high computational efficiency and high power. When the expression data violate the normality assumption for the majority of genes, the ST5 test is highly recommended for constructing differential correlation networks to characterize the effects of the interactions between genes (not limited to miRNAs) on diseases (not limited to lung cancer), and hence to help uncover the molecular mechanisms of complex human diseases.

Methods

To construct a differential correlation network for a set of miRNAs, we first test, for each pair of miRNAs, whether their correlation among cases is the same as that among controls using an equal-correlation test. We then set a criterion to determine whether the test is significant (i.e., whether the pair of miRNAs should be connected by an edge in the differential correlation network). To determine whether a test is significant, we need to adjust for multiple testing, since G(G-1)/2 tests are performed, where G is the total number of miRNAs. We claim that the differential correlation of a pair of miRNAs is validated if its FDR-adjusted p-value is < 0.05 in the discovery set and its raw p-value is < 0.05 in the validation set.
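A sketch of this pipeline in R is given below; equalCorTest is a placeholder for any equal-correlation test that returns a p-value (e.g., one of the tests defined in the following subsections), and exprCase and exprControl are assumed to be samples-by-miRNA expression matrices. The function and argument names are ours.

# Test every pair of miRNAs for differential correlation and adjust for the
# G(G-1)/2 tests with the Benjamini-Yekutieli method.
buildDiffCorNetwork <- function(exprCase, exprControl, equalCorTest, alpha = 0.05) {
  G <- ncol(exprCase)
  pairs <- t(combn(G, 2))                       # all G(G-1)/2 pairs of miRNAs
  pvals <- apply(pairs, 1, function(ab) {
    equalCorTest(exprCase[, ab[1]], exprCase[, ab[2]],
                 exprControl[, ab[1]], exprControl[, ab[2]])
  })
  padj <- p.adjust(pvals, method = "BY")
  data.frame(node1 = pairs[, 1], node2 = pairs[, 2],
             p.value = pvals, p.adj = padj, edge = padj < alpha)
}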

Novel equal-correlation tests

Given two random variables X and Z, we would like to test whether the correlation corr(X, Z) in cases is the same as that in controls. Inspired by the joint test of equal mean and equal variance proposed by Ahn and Wang21 and the improved versions of Ahn and Wang's equal-variance tests proposed by Qiu et al.22, we propose 6 robust tests (ST1, ST2, ST3, ST4, ST5, ST6) for equal correlation that do not require the normality assumption.

ST1 Test

Consider the following logistic regression:

$$log(\frac{{{\rm{p}}}_{i}^{I}}{1-{{\rm{p}}}_{i}^{I}})={\beta }_{0}^{I}+{\beta }_{1}^{I}{{\rm{w}}}_{i}^{I},\,i=1,2,3,\ldots ,{n}_{0}+{n}_{1}$$
(1)

where

$${{\rm{w}}}_{i}^{I}=\{\begin{array}{c}\frac{{n}_{1}({x}_{i}-{\bar{x}}_{1})({z}_{i}-{\bar{z}}_{1})}{\sqrt{{\sum }_{{y}_{j}=1}{({x}_{j}-{\bar{x}}_{1})}^{2}{\sum }_{{y}_{j}=1}{({z}_{j}-{\bar{z}}_{1})}^{2}}},\,{y}_{i}=1,\,i=1,\,2,\,3,\,\ldots ,\,{n}_{0}+{n}_{1}\,\\ \frac{{n}_{0}({x}_{i}-{\bar{x}}_{0})({z}_{i}-{\bar{z}}_{0})}{\sqrt{{\sum }_{{y}_{j}=0}{({x}_{j}-{\bar{x}}_{0})}^{2}{\sum }_{{y}_{j}=0}{({z}_{j}-{\bar{z}}_{0})}^{2}}},\,{y}_{i}=0,\,i=1,\,2,\,3,\,\ldots ,\,{n}_{0}+{n}_{1},\,\end{array}$$

n0 represents the number of controls, n1 represents the number of cases, yi = 1 indicates that the i-th subject is a case, and yi = 0 indicates that the i-th subject is a control. \({{\rm{p}}}_{i}^{I}=Pr({y}_{i}=1|{{\rm{w}}}_{i}^{I})\) is the probability that the i-th subject is a case given \({{\rm{w}}}_{i}^{I}\). \({\bar{x}}_{1}=\sum _{{y}_{i}=1}\frac{{x}_{i}}{{n}_{1}}\) and \({\bar{x}}_{0}=\sum _{{y}_{i}=0}\frac{{x}_{i}}{{n}_{0}}\) are sample means of X in cases and controls, respectively. \({\bar{z}}_{1}=\sum _{{y}_{i}=1}\frac{{z}_{i}}{{n}_{1}}\) and \({\bar{z}}_{0}=\sum _{{y}_{i}=0}\frac{{z}_{i}}{{n}_{0}}\) are sample means of Z in cases and controls, respectively. The score test statistic of the above logistic regression for testing the null hypothesis \({H}_{0}^{I}\): \({\beta }_{1}^{I}=0\) is

$${{\rm{T}}}^{I}=\frac{{({{\rm{U}}}^{I})}^{2}}{{\rm{var}}({{\rm{U}}}^{I})}\mathop{\to }\limits^{{H}_{0}^{I}}{\chi }_{1}^{2},$$
(2)

where \({\chi }_{1}^{2}\) is the chi squared distribution with one degree of freedom,

$${{\rm{U}}}^{I}=\sum _{i=1}^{{n}_{0}+{n}_{1}}({y}_{i}-\bar{y}){{\rm{w}}}_{i}^{I},$$
$${\rm{var}}({{\rm{U}}}^{I})=\bar{y}(1-\bar{y})\sum _{i=1}^{{n}_{0}+{n}_{1}}{({{\rm{w}}}_{i}^{I}-{\bar{w}}^{I})}^{2},$$

and

$${\bar{w}}^{I}=\sum _{i=1}^{{n}_{0}+{n}_{1}}\frac{{{\rm{w}}}_{i}^{I}}{{n}_{0}+{n}_{1}}.$$

That is, we reject \({H}_{0}^{I}:{\beta }_{1}^{I}=0\) when \(|{{\rm{U}}}^{I}|\) (and hence TI) is large. We can show that \({{\rm{U}}}^{I}=\sum _{i=1}^{{n}_{0}+{n}_{1}}({y}_{i}-\bar{y}){{\rm{w}}}_{i}^{I}\propto \widehat{{\rho }_{1}}-\widehat{{\rho }_{2}}\), where ρ1 and ρ2 are the correlations between X and Z in cases and in controls, respectively (see Section 1 of Supplementary Document I). Hence, testing \({H}_{0}^{I}:{\beta }_{1}^{I}=0\) is equivalent to testing ρ1 = ρ2.

Note that in logistic regression (1), the response yi is modeled conditionally on the covariate \({{\rm{w}}}_{i}^{I}\).

We denote the score test in (2) as ST1 Test.
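A minimal R sketch of the ST1 test, assembled directly from the formulas above, is given below. The function name st1Test and its arguments are ours; the authors' implementation is provided in the corTest package referenced in the Code availability section.

# ST1: score test for equal correlation based on the logistic regression (1).
# x1, z1: observations of X and Z in cases; x0, z0: observations in controls.
st1Test <- function(x1, z1, x0, z0) {
  n1 <- length(x1); n0 <- length(x0)
  # per-subject weights w_i^I; each group's weights sum to n_k times its Pearson correlation
  w1 <- n1 * (x1 - mean(x1)) * (z1 - mean(z1)) /
        sqrt(sum((x1 - mean(x1))^2) * sum((z1 - mean(z1))^2))
  w0 <- n0 * (x0 - mean(x0)) * (z0 - mean(z0)) /
        sqrt(sum((x0 - mean(x0))^2) * sum((z0 - mean(z0))^2))
  w <- c(w1, w0)
  y <- c(rep(1, n1), rep(0, n0))
  ybar <- mean(y)
  U    <- sum((y - ybar) * w)                           # score U^I, proportional to rho1_hat - rho2_hat
  varU <- ybar * (1 - ybar) * sum((w - mean(w))^2)      # var(U^I)
  stat <- U^2 / varU                                    # T^I, approximately chi-squared(1) under H0
  list(statistic = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}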

ST2 Test

The ST2 test is an improved version of ST1, obtained by replacing the sample means and sample variances with medians and squared median absolute deviations (MADs), since the median and MAD are more robust to outliers than the mean and standard deviation. Consider the following logistic regression:

$$log(\frac{{p}_{i}^{{\rm{II}}}}{1-{p}_{i}^{{\rm{II}}}})={\beta }_{0}^{{\rm{II}}}+{\beta }_{1}^{II}{{\rm{w}}}_{i}^{{\rm{II}}},$$
(3)

where \({p}_{i}^{II}=Pr({y}_{i}=1|{{\rm{w}}}_{i}^{{\rm{II}}}).\)

The related formulas are shown below:

$${{\rm{U}}}^{{\rm{II}}}=\sum _{i=1}^{{n}_{0}+{n}_{1}}({y}_{i}-\bar{y}){{\rm{w}}}_{i}^{{\rm{II}}}$$
$${w}_{i}^{{\rm{II}}}=\{\begin{array}{c}\frac{({x}_{i}-{\tilde{x}}_{1})({z}_{i}-{\tilde{z}}_{1})}{\sqrt{ma{d}_{1}{({x}_{j})}^{2}ma{d}_{1}{({z}_{j})}^{2}}},\,{y}_{i}=1\\ \frac{({x}_{i}-{\tilde{x}}_{0})({z}_{i}-{\tilde{z}}_{0})}{\sqrt{ma{d}_{0}{({x}_{j})}^{2}ma{d}_{0}{({z}_{j})}^{2}}},\,{y}_{i}=0\end{array}$$

where madk(x) and madk(z) are the MADs of x and z computed within the group with yi = k (k = 0, 1),

\({\tilde{x}}_{1}=median({x}_{i})\,{\rm{and}}\,{\tilde{z}}_{1}=median({z}_{i})\,{\rm{for}}\,{y}_{i}=1,\)

\({\tilde{x}}_{0}=median({x}_{i})\,{\rm{and}}\,{\tilde{z}}_{0}=median({z}_{i})\,{\rm{for}}\,{y}_{i}=0,\)

$${\rm{var}}({{\rm{U}}}^{{\rm{II}}})=\bar{y}(1-\bar{y})\sum _{i=1}^{{n}_{0}+{n}_{1}}{({w}_{i}^{{\rm{II}}}-{\bar{w}}^{II})}^{2}$$
$${\bar{w}}^{II}=\sum _{i=1}^{{n}_{0}+{n}_{1}}\frac{{w}_{i}^{{\rm{II}}}}{{n}_{0}+{n}_{1}}$$

The ST2 test statistic for testing the null hypothesis \({H}_{0}^{II}:{\beta }_{1}^{II}=0\) is

$${T}^{{\rm{II}}}=\frac{{({{\rm{U}}}^{{\rm{II}}})}^{2}}{{\rm{var}}({{\rm{U}}}^{{\rm{II}}})}\mathop{\to }\limits^{{H}_{0}^{II}}{\chi }_{1}^{2}.$$
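A sketch of the ST2 weights in R is given below; the score statistic is then assembled exactly as for ST1. Note that R's mad() includes a normal-consistency constant by default, so constant = 1 is used here to obtain the raw MAD assumed in the formulas (whether that constant should be included is our assumption).

# ST2 weights: medians and raw MADs replace the means and standard deviations of ST1.
w2 <- function(x, z) {
  (x - median(x)) * (z - median(z)) / (mad(x, constant = 1) * mad(z, constant = 1))
}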

ST3 Test

To obtain more robust weights than \({{\rm{w}}}_{i}^{I}\) and \({{\rm{w}}}_{i}^{II}\), we utilized the M-type correlation10 in the ST3 test. The idea of the M-type correlation is to replace the sample mean with a robust, trimmed-mean-like estimate when calculating the sample correlation. Please refer to Section 2 of Supplementary Document I for the details of the ST3 test.

ST4 Test

In the ST4 test, we defined the weights \({{\rm{w}}}_{i}^{{\rm{IV}}}\) based on Spearman's rank correlation. The formulas are given below:

$${{\rm{U}}}^{{\rm{IV}}}=\sum _{i=1}^{{n}_{0}+{n}_{1}}({y}_{i}-\bar{y}){{\rm{w}}}_{i}^{{\rm{IV}}}$$
$${{\rm{w}}}_{i}^{{\rm{IV}}}=\{\begin{array}{c}\frac{{n}_{1}^{2}-1-6{d}_{i}^{2}}{{n}_{1}^{2}-1},\,{y}_{i}=1\\ \frac{{n}_{0}^{2}-1-6{d}_{i}^{2}}{{n}_{0}^{2}-1},\,{y}_{i}=0\end{array}$$

where di = rank(xi) − rank(zi), with ranks computed within cases for case subjects and within controls for control subjects, and

$${\rm{var}}({{\rm{U}}}^{{\rm{IV}}})=\bar{y}(1-\bar{y})\sum _{i=1}^{{n}_{0}+{n}_{1}}{({{\rm{w}}}_{i}^{{\rm{IV}}}-{\bar{w}}^{{\rm{IV}}})}^{2}$$
$${\bar{w}}^{{\rm{IV}}}=\sum _{i=1}^{{n}_{0}+{n}_{1}}\frac{{{\rm{w}}}_{i}^{{\rm{IV}}}}{{n}_{0}+{n}_{1}}$$
$${T}^{{\rm{IV}}}=\frac{{({{\rm{U}}}^{{\rm{IV}}})}^{2}}{{\rm{var}}({{\rm{U}}}^{{\rm{IV}}})}\mathop{\to }\limits^{{H}_{0}^{IV}}{\chi }_{1}^{2}$$
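A minimal R sketch of the ST4 test based on the Spearman-type weights above is given below; the function name st4Test is ours, and ranks are computed within each group per the definition of di.

# ST4: score test with weights built from within-group Spearman rank differences.
st4Test <- function(x1, z1, x0, z0) {
  n1 <- length(x1); n0 <- length(x0)
  d1 <- rank(x1) - rank(z1)                       # rank differences within cases
  d0 <- rank(x0) - rank(z0)                       # rank differences within controls
  w  <- c((n1^2 - 1 - 6 * d1^2) / (n1^2 - 1),     # w_i^IV for cases
          (n0^2 - 1 - 6 * d0^2) / (n0^2 - 1))     # w_i^IV for controls
  y  <- c(rep(1, n1), rep(0, n0))
  ybar <- mean(y)
  U    <- sum((y - ybar) * w)                     # U^IV
  varU <- ybar * (1 - ybar) * sum((w - mean(w))^2)
  stat <- U^2 / varU                              # T^IV, approximately chi-squared(1) under H0
  list(statistic = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}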

ST5 Test

Our systematic simulations (see the Results section) showed that the performances of ST3 and ST4 complement each other. Therefore, we combined the test statistics of ST3 and ST4 to form the test statistic of ST5. Specifically, the test statistic of ST5 is the average of TIII and TIV:

$${T}^{{\rm{V}}}=({T}^{{\rm{III}}}+{T}^{{\rm{IV}}})/2.$$

It is challenging to derive the asymptotic distribution of the test statistic TV. We conjecture that the null distribution of TV is close to the chi-square distribution with one degree of freedom \(({\chi }_{1}^{2})\). To numerically support this conjecture, we generated a simulated dataset with 50,000 pairs of random variables X and Z for 100 cases and 100 controls from a g-and-h distribution under the null hypothesis \({H}_{0}^{V}\) that the correlation between X and Z in cases is the same as that in controls (see the Simulation studies section for more details). We then calculated the value of TV for each of the 50,000 pairs of random variables X and Z and drew the histogram of these 50,000 values. Figure 8 shows that the histogram of the test statistic TV is very close to the density of the chi-square distribution with one degree of freedom. Hence, in this article, we assume that \({T}^{{\rm{V}}}\mathop{\to }\limits^{{H}_{0}^{V}}{\chi }_{1}^{2}\). Further investigation is warranted.
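In R, the combination and its p-value under the conjectured chi-square(1) reference can be written as follows; T3 and T4 hold placeholder values of the ST3 and ST4 statistics for one pair of variables.

T3 <- 3.1                                        # placeholder ST3 statistic for one pair
T4 <- 2.7                                        # placeholder ST4 statistic for the same pair
T5 <- (T3 + T4) / 2                              # ST5 statistic
p5 <- pchisq(T5, df = 1, lower.tail = FALSE)     # p-value under the conjectured chi-squared(1) null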

Figure 8

Histogram of the values of the test statistic TV in testing for equal correlation based on 50,000 pairs of simulated random variables X and Z having equal correlation between cases and controls. The red curve is the density function of the chi square distribution with one degree of freedom \(({{\chi }}_{1}^{2})\).

ST6 Test

We can combine ST3 and ST4 based on the following multiple logistic regression:

$$log(\frac{{p}_{i}^{VI}}{1-{p}_{i}^{VI}})={\gamma }_{0}+{\gamma }_{1}{w}_{i}^{{\rm{III}}}+{\gamma }_{2}{w}_{i}^{{\rm{IV}}},$$
(4)

where \({p}_{i}^{VI}=Pr({y}_{i}=1|{w}_{i}^{{\rm{III}}},{w}_{i}^{{\rm{IV}}})\). The score test statistic for testing the composite null hypothesis \({H}_{0}^{VI}:{\gamma }_{1}=0\) and \({\gamma }_{2}=0\) is

$${T}^{{\rm{VI}}}={U}_{joint}^{T}\times {[c\widehat{o}v({U}_{joint})]}^{-1}\times {U}_{joint}\mathop{\to }\limits^{{H}_{0}^{VI}}{\chi }_{2}^{2},$$

where

$${U}_{joint}={({U}^{III},{U}^{IV})}^{T},$$
$$\widehat{cov}({U}_{joint})=\bar{{\rm{y}}}(1-\bar{{\rm{y}}}){\rm{\Omega }},$$
$${\rm{\Omega }}=(\begin{array}{cc}\sum _{i=1}^{{n}_{0}+{n}_{1}}{({w}_{i}^{{\rm{III}}}-{\bar{w}}^{{\rm{III}}})}^{2} & \sum _{i=1}^{{n}_{0}+{n}_{1}}({w}_{i}^{{\rm{III}}}-{\bar{w}}^{{\rm{III}}})({w}_{i}^{{\rm{IV}}}-{\bar{w}}^{{\rm{IV}}})\\ \sum _{i=1}^{{n}_{0}+{n}_{1}}({w}_{i}^{{\rm{III}}}-{\bar{w}}^{{\rm{III}}})({w}_{i}^{{\rm{IV}}}-{\bar{w}}^{{\rm{IV}}}) & \sum _{i=1}^{{n}_{0}+{n}_{1}}{({w}_{i}^{{\rm{IV}}}-{\bar{w}}^{{\rm{IV}}})}^{2}\end{array}).$$

The proof that the asymptotic null distribution of \({T}^{{\rm{VI}}}\) is \({\chi }_{2}^{2}\) (i.e., \({T}^{{\rm{VI}}}\mathop{\to }\limits^{{H}_{0}^{VI}}{\chi }_{2}^{2}\)) is given in Supplementary Document II.
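A minimal R sketch of the ST6 joint score test is given below. It assumes the two weight vectors wIII and wIV (each of length n0 + n1) have already been computed, and the function name st6Test is ours.

# ST6: joint score test combining the ST3 and ST4 weights; 2 degrees of freedom under H0.
st6Test <- function(wIII, wIV, y) {
  ybar <- mean(y)
  U <- c(sum((y - ybar) * wIII), sum((y - ybar) * wIV))   # joint score vector (U^III, U^IV)
  W <- cbind(wIII - mean(wIII), wIV - mean(wIV))
  Omega <- t(W) %*% W                                     # 2 x 2 matrix of centered cross-products
  covU  <- ybar * (1 - ybar) * Omega                      # estimated covariance of the joint score
  stat  <- drop(t(U) %*% solve(covU) %*% U)               # T^VI, approximately chi-squared(2) under H0
  list(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}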

Code availability

The R package corTest, which includes functions implementing ST1, ST2, ST3, ST4, ST5, ST6, and Fisher's Z-transformation test, can be downloaded from https://CRAN.R-project.org/package=corTest.