A robust Pearson correlation test for a general point null using a surrogate bootstrap distribution

In this note we present a robust bootstrap test with good Type I error control for testing the general point null hypothesis H0: ρ = ρ0. The test is carried out using what we term a surrogate bootstrap distribution. It was inspired by the studentized permutation test for H0: ρ = 0, which has been proven to be exact in certain scenarios and asymptotically correct overall. We show that the bootstrap based test is robust, in terms of proper Type I error control, across a variety of distributional scenarios.


Introduction
The term "correlation" was introduced by Galton as a synonym for regression, and the term "coefficient of correlation" was first used by Edgeworth, who also derived the first sample estimator of linear association [1], [2]. Estimation and testing based on the correlation coefficient is one of the most widely used approaches for examining the association between two continuous variables. Even though Edgeworth first derived the sample estimator of the correlation coefficient, it is Karl Pearson who is credited with much of its development via his 1896 paper [3], which is why we often refer to the "Pearson correlation coefficient," given by

ρ = μ_XY / (σ_X σ_Y),  (1)

where −1 < ρ < 1, μ_XY is the covariance between two random variables X and Y, σ_X is the standard deviation of X and σ_Y is the standard deviation of Y. Throughout this note let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be n paired observations from a non-degenerate joint distribution F_XY(x, y). The well-known sample estimator for ρ at (1) is given as

r̂ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / [√(Σ_{i=1}^n (X_i − X̄)²) √(Σ_{i=1}^n (Y_i − Ȳ)²)],  (2)

where X̄ = Σ_{i=1}^n X_i/n and Ȳ = Σ_{i=1}^n Y_i/n. The estimator r̂ is also well known to be the maximum likelihood estimator of the bivariate normal correlation parameter.
The sampling distribution of r̂ at (2) under bivariate normality was derived by R. A. Fisher [4]. For the case when ρ = 0, "Student" [5] surmised that √[(n − 2)r̂²/(1 − r̂²)] has a t-distribution with n − 2 degrees of freedom. The t-distribution based test of H0: ρ = 0 versus H1: ρ ≠ 0, assuming bivariate normality, is generally what is found in most statistical software packages. For the more general bivariate normal case when ρ ≠ 0 the exact distribution of r̂ at (2) is a bit more unwieldy [1]. Remarkably, Fisher discovered a large-sample variance-stabilized normal approximation based on what is now known as Fisher's z-transformation, given as

z = (1/2) log[(1 + r̂)/(1 − r̂)].  (3)

This transformed variable tends to normality much faster than the classic distribution-free asymptotic normal approximation. The first-order variance of z at (3) is 1/(n − 3). Various refinements of Fisher's work have been carried out over the years in terms of both bias reduction and refined variance expressions [1]. For the more general test of H0: ρ = ρ0 (including H0: ρ = 0 versus one-sided alternatives), Fisher's z-transformation approach, again under bivariate normality assumptions, is what most current statistical software packages employ. Alternatively, the "distribution free" large-sample approximation for the sampling distribution of r̂ may be derived using the multivariate delta method as outlined in Serfling [6], page 126: first note that we can express the sample estimator as r̂ = g(V), where V is the vector of sample means of the components (X_i, Y_i, X_i², Y_i², X_iY_i), such that, as per Serfling [6], we have

r̂ ∼ AN(ρ, n⁻¹ dΣd′).  (5)

In expression (5), Σ is the 5 × 5 variance-covariance matrix for the vector of the components of the summands of V, given by (X_1, Y_1, X_1², Y_1², X_1Y_1).
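To make the classical procedures concrete, the sketch below implements the sample correlation of Eq. (2) and the Fisher z-based one-sided test of H0: ρ = ρ0 in plain Python. The function names are ours and the snippet is illustrative only, not the paper's code.

```python
import math

def pearson_r(x, y):
    # Sample Pearson correlation, Eq. (2).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z_pvalue(r, n, rho0=0.0):
    # One-sided p-value (H1: rho > rho0) via Fisher's z-transformation:
    # z = atanh(r) is approximately N(atanh(rho0), 1/(n - 3)) under
    # bivariate normality.
    z = 0.5 * math.log((1 + r) / (1 - r))
    z0 = 0.5 * math.log((1 + rho0) / (1 - rho0))
    stat = (z - z0) * math.sqrt(n - 3)
    return 0.5 * math.erfc(stat / math.sqrt(2))  # 1 - Phi(stat)
```

For H0: ρ = 0 the t-based test described above instead refers √[(n − 2)r̂²/(1 − r̂²)] to a t distribution with n − 2 degrees of freedom.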
The elements of d, which will play a key role in our application, are given at (6) as the gradient of g evaluated at the population moments,

d = ( ρμ_X/σ_X² − μ_Y/(σ_Xσ_Y), ρμ_Y/σ_Y² − μ_X/(σ_Xσ_Y), −ρ/(2σ_X²), −ρ/(2σ_Y²), 1/(σ_Xσ_Y) ).  (6)

More refined approximations for the sampling distribution of r̂ via Edgeworth techniques for non-normal populations are given in [7], [8].
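As a numerical check on (5) and (6), the plug-in variance estimate n⁻¹ d̂ Σ̂ d̂′ can be computed directly by estimating Σ with the sample covariance matrix of (X, Y, X², Y², XY) and evaluating the gradient d at the sample moments. The sketch below is our own illustrative Python (names are ours), not the paper's code.

```python
import math

def delta_method_var(x, y):
    # Plug-in estimate of Var(r-hat) = n^{-1} d S d' from Eq. (5):
    # S is the sample covariance matrix of (X, Y, X^2, Y^2, XY) and d is
    # the gradient of the correlation as a function of the five moments.
    n = len(x)
    feats = [[a, b, a * a, b * b, a * b] for a, b in zip(x, y)]
    means = [sum(col) / n for col in zip(*feats)]
    sigma = [[sum((f[i] - means[i]) * (f[j] - means[j]) for f in feats) / n
              for j in range(5)] for i in range(5)]
    mux, muy, m2x, m2y, mxy = means
    vx, vy = m2x - mux * mux, m2y - muy * muy
    sx, sy = math.sqrt(vx), math.sqrt(vy)
    rho = (mxy - mux * muy) / (sx * sy)
    d = [rho * mux / vx - muy / (sx * sy),   # elements of d at (6)
         rho * muy / vy - mux / (sx * sy),
         -rho / (2 * vx),
         -rho / (2 * vy),
         1 / (sx * sy)]
    quad = sum(d[i] * sigma[i][j] * d[j] for i in range(5) for j in range(5))
    return quad / n
```

A quick sanity check: when Y = X the sample correlation is identically 1 for any resample, and the estimated asymptotic variance collapses to zero.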
There have been several investigations examining the robustness of the bivariate normality assumption as it pertains to Type I error control for testing H0: ρ = ρ0 using Fisher's z-transformation, or the more specific case of H0: ρ = 0 using Student's t distribution. One of the earliest of these investigations is due to Pearson [9], who concluded that "the results suggest that the normal bivariate surface can be mutilated and distorted to a remarkable degree without affecting the frequency distribution of r," which has since been disproved [7]. Havlicek and Peterson [10] also incorrectly concluded "that the Pearson r is insensitive to extreme violations of the basic assumptions of normality." There has been a substantial number of simulation based studies, primarily in the psychology literature, that note when tests about the correlation coefficient that incorrectly assume bivariate normality tend to fail [11]–[15].
We repeat the simulation study of DiCiccio and Romano [16] and re-examine Type I error control for testing H0: ρ = ρ0 based on Fisher's z-transformation, the classic large-sample approximation and our new bootstrap approach. To the best of our knowledge there has not been an extensive simulation study of the straightforward asymptotic approximation at (5) obtained by using the corresponding moment estimators in place of their population counterparts and setting ρ = ρ0 under the null. DiCiccio and Romano [16] do include results for the specific case of testing H0: ρ = 0.
As an alternative to either assuming bivariate normality or using a large-sample asymptotic approximation, one may consider a permutation test approach for the specific test H0: ρ = 0 [17]. The key sticking point that is often overlooked by practitioners is that the permutation test is only exact in terms of Type I error control when using the metric ρ to test H0: F_XY = F_X F_Y. Only under specific distributional assumptions, e.g. bivariate normality or certain families of elliptical distributions, is the permutation test exact for testing H0: ρ = 0. Heuristic investigations of permutation testing about the correlation coefficient go back several decades [18]. Even recently, Tuǧran [19] makes the same mistakes of the past when drawing conclusions regarding the permutation test's applicability under normality versus non-normality.
Only very recently did DiCiccio and Romano [16] provide a careful theoretical treatment of the subject, summarizing the methodologies and providing guidance on the most appropriate permutation approach for testing H0: ρ = 0. In their work, DiCiccio and Romano [16] provide a novel method of obtaining a permutation test that is exact when testing H0: ρ = 0 is equivalent to testing H0: F_XY = F_X F_Y, and that asymptotically controls the Type I error when testing H0: ρ = 0 is not equivalent to testing H0: F_XY = F_X F_Y. The key to their approach is standardizing the test statistic using the large-sample variance at (5) under H0: ρ = 0, such that as n → ∞ the quantiles of the studentized test statistic r̂/√(n⁻¹ d̂Σ̂d̂′) under both the permutation distribution and the true sampling distribution "converge almost surely to the corresponding quantiles of the standard normal distribution." In fact, the specific form of the standardization they utilize for testing H0: ρ = 0 can be traced back to Tschuprow [20].
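A studentized permutation test of this style can be sketched as follows for H0: ρ = 0. Under ρ = 0 the variance n⁻¹d̂Σ̂d̂′ reduces to μ̂22/(n μ̂20 μ̂02) in central-moment notation; that algebraic simplification, the function names, and the choice of random permutations are ours, so this is a hedged illustration of a DiCiccio-Romano-style test rather than their code.

```python
import math
import random

def _student_stat(x, y):
    # Studentized correlation for H0: rho = 0: sqrt(n) * r / tau, where
    # tau^2 = mu22 / (mu20 * mu02) is the rho = 0 form of n * d S d'.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    m11 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    m20 = sum((a - mx) ** 2 for a in x) / n
    m02 = sum((b - my) ** 2 for b in y) / n
    m22 = sum((a - mx) ** 2 * (b - my) ** 2 for a, b in zip(x, y)) / n
    r = m11 / math.sqrt(m20 * m02)
    tau = math.sqrt(m22 / (m20 * m02))
    return math.sqrt(n) * r / tau

def studentized_permutation_pvalue(x, y, B=999, seed=3):
    # Two-sided studentized permutation test of H0: rho = 0 using B
    # random permutations of y.
    rng = random.Random(seed)
    t_obs = abs(_student_stat(x, y))
    y_perm = list(y)
    count = 0
    for _ in range(B):
        rng.shuffle(y_perm)
        if abs(_student_stat(x, y_perm)) >= t_obs:
            count += 1
    return (count + 1) / (B + 1)
```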
As an overall strategy for testing H0: ρ = 0, the permutation approach of DiCiccio and Romano [16] appears to be the most robust strategy to date in terms of controlling Type I error rates across a variety of distributional assumptions. In particular, it is clear that the bivariate normality assumption relied upon by many statistical software packages when testing H0: ρ = 0 is not a particularly robust assumption. As noted by R. C. Geary [21]: "Normality is a myth; there never was, and never will be, a normal distribution." One might extend this quote even more strongly to the existence of bivariate normality in real-life data analysis problems involving tests about the correlation coefficient.
In this note we build upon the work of DiCiccio and Romano [16] by introducing a bootstrap-like resampling approach for testing the general null hypothesis H0: ρ = ρ0, which has virtually the same properties as the permutation approach for testing H0: ρ = 0. The general difference between our approach and that of DiCiccio and Romano [16] is that the permutation test is essentially a sampling-without-replacement method under the null hypothesis, while ours is a sampling-with-replacement method under the null hypothesis. This subtle difference allows us to develop a more general testing approach for H0: ρ = ρ0. Both approaches rely on a properly standardized test statistic based on (5) under the null such that the Type I error is controlled exactly in certain scenarios and asymptotically in general. The test can also be inverted to provide a confidence interval for ρ as an alternative to the more standard bootstrap confidence interval methods [22].

Materials and methods
The main thrust of the bootstrap correlation test of H0: ρ = ρ0 is to generalize the approach of DiCiccio and Romano [16] by noting that we can approximate the distribution function of the sample correlation estimator r̂|H0, given F_XY|H0, using a surrogate distribution function. Bootstrap samples will be drawn from the surrogate distribution function F_ST|H0 ≈ F_XY|H0, which is described in detail below.
In terms of the preliminaries, let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be an i.i.d. bivariate sample from an absolutely continuous distribution F_XY with marginal distributions denoted F_X and F_Y. In terms of our application, let the 2 × 2 positive definite correlation matrix for the standardized pair under H0 be

Γ = [ 1  ρ0 ; ρ0  1 ].  (7)

Next, represent the Cholesky decomposition of Γ as Γ = AA′, where

A = [ 1  0 ; ρ0  √(1 − ρ0²) ],  (8)

such that A⁻¹ is defined. The Cholesky decomposition is a key component of the bootstrap test that we propose below.
Denote the n × 2 matrix of standardized observations as (U, V), where U_i = (X_i − μ_X)/σ_X and V_i = (Y_i − μ_Y)/σ_Y, respectively. Apply the transformation (S_i, T_i)′ = A(U_i, V_i)′, where A is the decomposed matrix at (8), such that the transformed observations are given as

S_i = U_i and T_i = ρ0 U_i + √(1 − ρ0²) V_i.  (11)

Next denote σ²_{r̂(X,Y)} = n⁻¹ d_XY Σ_XY d′_XY, where Σ_XY is the 5 × 5 variance-covariance matrix for the vector (X_1, Y_1, X_1², Y_1², X_1Y_1) given earlier at (5) and d_XY is the 1 × 5 vector defined at (6). For the transformed variables similarly denote σ²_{r̂(S,T)} = n⁻¹ d_ST Σ_ST d′_ST, where Σ_ST is the 5 × 5 variance-covariance matrix for the vector (S_1, T_1, S_1², T_1², S_1T_1) and the vector d_ST is defined similarly to d_XY at (6), now using the moments based on S and T in place of X and Y.
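The transformation at (11) can be verified numerically: if U and V are independent with mean 0 and variance 1, then (S, T) has population correlation exactly ρ0. A minimal sketch (function name ours):

```python
import math
import random

def impose_null_correlation(u, v, rho0):
    # Transform standardized pairs (U_i, V_i) to (S_i, T_i) via the
    # Cholesky factor of [[1, rho0], [rho0, 1]], Eq. (11):
    #   S_i = U_i,  T_i = rho0*U_i + sqrt(1 - rho0**2)*V_i.
    # Assumes U and V are independent with mean 0 and variance 1.
    s = list(u)
    t = [rho0 * a + math.sqrt(1 - rho0 ** 2) * b for a, b in zip(u, v)]
    return s, t
```

Drawing U and V as independent standard normals and applying the function with ρ0 = 0.6 yields a sample correlation close to 0.6 for large n.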
Under H0: ρ = ρ0 the parameter ρ is "known." The estimator for the variance of r̂(X, Y)|H0 ≡ r̂0(X, Y) will be denoted ŝ²_{r̂0(X,Y)} at (13), where population moments are replaced with their corresponding sample moment counterparts throughout d_XY and Σ_XY, and ρ = ρ0. In a similar fashion, denote the estimator for the variance of r̂(S, T)|H0 ≡ r̂0(S, T) as ŝ²_{r̂0(S,T)}. The result follows using the same arguments as in Theorem 2.1, given the key feature that μ_X, σ_X, μ_Y and σ_Y are known quantities.
The idea is to draw bootstrap resamples under the null hypothesis H0: ρ = ρ0 such that E(r̂0(S*, T*)|H0) → ρ0. This is done by first drawing independent bootstrap samples from the marginal distribution functions F̂_X and F̂_Y and applying an empirical version of the transformation at (11), i.e. replacing μ_X, σ_X, μ_Y and σ_Y with their respective sample counterparts. Essentially we are drawing nonparametric bootstrap samples from F̂_ST|H0 to estimate the bootstrap distribution of the sample variance for r̂0(S, T)|H0, with the scale based on the underlying properties of the distribution F_ST|H0. We can then rescale based on the sample variance for r̂0(X, Y)|H0 in order to have an asymptotically valid testing procedure. More specifically, we can follow the steps below to test H0: ρ = ρ0 versus H1: ρ > ρ0 as an extension of the rescaling ideas put forward by DiCiccio and Romano [16] for testing the specific hypothesis H0: ρ = 0. Note also that the procedure we propose can be modified easily to handle the alternatives H1: ρ < ρ0 and H1: ρ ≠ ρ0. Our testing algorithm now follows these steps:

1. Calculate the Pearson sample correlation r̂0(X, Y) (the subscript is added for notational convenience), which has the same functional form as r̂ (= r̂0(X, Y)) given at (2), and its corresponding variance estimate under H0, ŝ²_{r̂0(X,Y)}, given at (13).

2. Draw samples of size n independently and with replacement from each marginal distribution F̂_X and F̂_Y, and denote the pairs of independently drawn observations as (X*_1, Y*_1), (X*_2, Y*_2), ..., (X*_n, Y*_n).

3. Standardize each observation using the observed sample means and sample standard deviations as U*_i = (X*_i − X̄)/s_X and V*_i = (Y*_i − Ȳ)/s_Y.

4. Denote the n × 2 matrix of standardized bootstrap observations as (U*, V*).

5. Apply the transformation at (11) to obtain S*_i = U*_i and T*_i = ρ0 U*_i + √(1 − ρ0²) V*_i.
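Putting the steps together, the sketch below is our own hedged Python rendering of the surrogate bootstrap test of H0: ρ = ρ0 versus H1: ρ > ρ0. The studentization writes n⁻¹d̂Σ̂d̂′ in an equivalent central-moment form with ρ fixed at ρ0; all names are ours, and the final p-value step (comparing the observed studentized statistic to the bootstrap studentized statistics) reflects our reading of the rescaling idea rather than verbatim pseudocode from the paper.

```python
import math
import random

def _corr(x, y):
    # Sample Pearson correlation, Eq. (2).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def _var_rhat(x, y, rho):
    # Delta-method variance of r-hat (Eq. 5) via sample central moments,
    # with the null value rho imposed.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    mu = lambda p, q: sum((a - mx) ** p * (b - my) ** q
                          for a, b in zip(x, y)) / n
    m20, m02, m22 = mu(2, 0), mu(0, 2), mu(2, 2)
    v = (m22 / (m20 * m02)
         + rho ** 2 / 4 * (mu(4, 0) / m20 ** 2 + mu(0, 4) / m02 ** 2
                           + 2 * m22 / (m20 * m02))
         - rho * (mu(3, 1) / (m20 ** 1.5 * m02 ** 0.5)
                  + mu(1, 3) / (m02 ** 1.5 * m20 ** 0.5)))
    return v / n

def surrogate_bootstrap_test(x, y, rho0=0.0, B=500, seed=1):
    rng = random.Random(seed)
    n = len(x)
    r_obs = _corr(x, y)                                       # step 1
    t_obs = (r_obs - rho0) / math.sqrt(_var_rhat(x, y, rho0))
    c = math.sqrt(1 - rho0 ** 2)
    exceed = 0
    for _ in range(B):
        xs = [rng.choice(x) for _ in range(n)]                # step 2:
        ys = [rng.choice(y) for _ in range(n)]                # independent marginal resamples
        mx, my = sum(xs) / n, sum(ys) / n
        sx = math.sqrt(sum((a - mx) ** 2 for a in xs) / n)
        sy = math.sqrt(sum((b - my) ** 2 for b in ys) / n)
        u = [(a - mx) / sx for a in xs]                       # steps 3-4: standardize
        v = [(b - my) / sy for b in ys]
        s = u                                                 # step 5: impose rho0
        t = [rho0 * a + c * b for a, b in zip(u, v)]
        r_b = _corr(s, t)
        t_b = (r_b - rho0) / math.sqrt(_var_rhat(s, t, rho0))
        if t_b >= t_obs:
            exceed += 1
    return (exceed + 1) / (B + 1)  # one-sided bootstrap p-value
```

The alternatives H1: ρ < ρ0 and H1: ρ ≠ ρ0 follow by changing the tail comparison.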
The surrogate distribution F̂_ST is defined through the marginal distributions F̂_X and F̂_Y and the transformation at Step 5 in our algorithm, and ϕ(·) and Φ(·) denote the standard normal density and cumulative distribution functions, respectively. In Eqs (18) and (19) the polynomials p_i depend on F_ST in a smooth way, and if the distance between F̂_ST and F_ST|H0 is asymptotically negligible then expression (20) holds. Using standard techniques it is straightforward to show that F̂_ST converges to F_ST|H0.

Comment 2.3.
For the special and important case H0: ρ = 0, the only practical difference between our approach and the permutation based method of DiCiccio and Romano [16] is that ŝ_{r̂*0(S,T)} is recalculated for each bootstrap replicate, due to the sampling-with-replacement mechanism, whereas the permutation value is fixed across all permutations under the null. We do show in our simulation study, however, that the bootstrap test for H0: ρ = 0 appears to work slightly better than the permutation test in some cases, likely due to the finer grid of values produced by resampling with replacement as compared to without replacement.

Results
For our simulation study we tested H0: ρ = ρ0 versus H1: ρ > ρ0 for ρ0 = 0.0, 0.3 and 0.6 with sample sizes n = 10, 25, 50, 100, 200. Each simulation utilized 10,000 Monte Carlo replications and the bootstrap resampling algorithm used B = 500 resamples. We compared Type I error control between the studentized permutation test of DiCiccio and Romano [16], the new bootstrap test, Fisher's z-transformation and the straight large-sample approximation given at Eq (5), using sample moment estimators for the variance estimation. The tests examined Type I error control at α = 0.05, as in DiCiccio and Romano [16].
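The simulation design can be illustrated with a small-scale sketch: generate bivariate normal data with correlation ρ0 via the transformation at (11), apply the one-sided large-sample (delta-method) test at α = 0.05, and record the rejection rate. Replication counts and the seed below are illustrative and far smaller than the 10,000 used in the study; all names are ours.

```python
import math
import random

def _corr_and_var(x, y, rho):
    # r-hat and its delta-method variance (Eq. 5) in central-moment
    # form, with the null value rho imposed.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    mu = lambda p, q: sum((a - mx) ** p * (b - my) ** q
                          for a, b in zip(x, y)) / n
    m20, m02, m11, m22 = mu(2, 0), mu(0, 2), mu(1, 1), mu(2, 2)
    r = m11 / math.sqrt(m20 * m02)
    v = (m22 / (m20 * m02)
         + rho ** 2 / 4 * (mu(4, 0) / m20 ** 2 + mu(0, 4) / m02 ** 2
                           + 2 * m22 / (m20 * m02))
         - rho * (mu(3, 1) / (m20 ** 1.5 * m02 ** 0.5)
                  + mu(1, 3) / (m02 ** 1.5 * m20 ** 0.5)))
    return r, v / n

def type1_error(rho0, n=100, reps=1000, seed=7):
    # Monte Carlo estimate of the one-sided large-sample test's
    # rejection rate at alpha = 0.05 under H0: rho = rho0.
    rng = random.Random(seed)
    c = math.sqrt(1 - rho0 ** 2)
    z_crit = 1.645  # upper 5% point of N(0, 1)
    rejections = 0
    for _ in range(reps):
        u = [rng.gauss(0, 1) for _ in range(n)]
        w = [rng.gauss(0, 1) for _ in range(n)]
        x = u
        y = [rho0 * a + c * b for a, b in zip(u, w)]  # corr(X, Y) = rho0
        r, var = _corr_and_var(x, y, rho0)
        if (r - rho0) / math.sqrt(var) > z_crit:
            rejections += 1
    return rejections / reps
```

Swapping the Gaussian draws for standardized non-normal marginals reproduces the kinds of distributional scenarios examined in Tables 1-3.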
We utilized the same marginal distributions for F_X and F_Y as DiCiccio and Romano [16], page 1218, for the case H0: ρ = 0 in our simulation study, including the multivariate normal (MVN) with mean zero and identity covariance. One can see DiCiccio and Romano [16] for more detail regarding these specific distributions.
For the general case H 0 : ρ = ρ 0 we utilized the marginal distributions above, standardized the marginal variables X and Y to mean = 0 and variance = 1 and applied the transformation at (11) under H 0 . The results are contained in Tables 1-3.
The first striking point is how inflated the Type I error can be for the test based on Fisher's z-transformation. For the t_4.1 distribution at H0: ρ = 0.3 we see an estimated Type I error rate of 0.1847 at n = 200. In the other direction, the test based on Fisher's z-transformation can also have a much lower than anticipated Type I error rate, e.g. see H0: ρ = 0.3, n = 100 for the circular distribution.
In terms of the straight large-sample approximation [6], we see that it performs fairly well across all distributions for samples of size n ≥ 25, but has inflated Type I error rates for n = 10, usually around twice the desired α.
For the specific case H0: ρ0 = 0 we see that the bootstrap test and permutation test are comparable, with the bootstrap test subtly outperforming the permutation test in certain scenarios, even for large sample sizes. In Table 4 we compare the power of the bootstrap test versus the exact Student's t-test under multivariate normality assumptions. We see that the bootstrap test compares favorably to a test based on optimal assumptions, and one that is limited by the form of the null hypothesis, i.e. the exact Student's t-test for H0: ρ0 = 0 is specific to this form of the null hypothesis and does not generalize to other values of ρ0. We see from Table 4 that the exact test has a slight power advantage for small samples (n = 10), which diminishes rapidly as a function of larger sample sizes. A similar analogy is found in comparing the Wilcoxon rank-sum test with the two-sample t-test in terms of efficiency, where the two-sample t-test is more efficient than the Wilcoxon rank-sum test only under the strict and often unrealistic assumption of normality.

Example
As a straightforward example of our approach we tested H0: ρ = ρ0 versus H1: ρ > ρ0 for ρ0 = 0 and ρ0 = 0.3 using lactate levels measured in the blood and the cerebrospinal fluid (CSF) of 13 female subjects [20]. The data are provided in Table 5. A scatterplot of the paired data is given in Fig 1. We used SAS 9.4 (SAS Institute, Cary, NC) to test the marginal normality of the data using the Shapiro-Wilk test, the multivariate skewness and kurtosis relative to a bivariate normal distribution using the built-in Mardia tests, and the overall hypothesis of bivariate normality using the Henze-Zirkler T test. The results from SAS are given in Table 6. We see that none of the tests rejects a given aspect of normality at α = 0.05, though we note that there is low power to do so given n = 13. The estimated correlation between blood and CSF lactate levels was r̂ = 0.57. For the test H0: ρ = 0 versus H1: ρ > 0 the p-values were 0.0198, 0.0398, 0.0770 and 0.0628 for the Fisher's z-transformation based test, the large-sample test, the bootstrap test and the studentized permutation test of DiCiccio and Romano [16], respectively. The bootstrap and permutation tests used 5,000 resamples. If we were testing at level α = 0.05, the permutation and bootstrap tests agree in not rejecting H0, while the tests based on Fisher's z-transformation and the large-sample approximation indicate rejection of H0. Given the small sample size and the results of our simulation study, we would recommend the permutation or bootstrap p-values as the more useful calculations for drawing the correct conclusion, particularly given the closeness of the Fisher's z-transformation and large-sample approximation p-values, both from procedures that we demonstrated did not control the Type I error in small samples.
For the test H0: ρ = 0.3 versus H1: ρ > 0.3 the p-values were 0.149, 0.1399 and 0.1792 for the Fisher's z based test, the large-sample test and the bootstrap test, respectively; the studentized permutation test of DiCiccio and Romano [16] does not generalize to this null hypothesis. None of the three applicable tests rejects H0 at α = 0.05.

Discussion and conclusion
In this note we present a robust bootstrap test with good Type I error control for testing the general hypothesis H0: ρ = ρ0. The test was inspired by the studentized permutation test given by DiCiccio and Romano [16] for testing H0: ρ = 0, which was proven to be exact in certain scenarios and asymptotically correct overall. We believe that statistical software packages should employ both the bootstrap test and the studentized permutation test given by DiCiccio and Romano [16] as the default tests over tests based on bivariate normal assumptions. In addition, it should be noted that the bootstrap test can be inverted to form a confidence interval about ρ. This will be examined in our future work.