Regression models of Pearson correlation coefficient

We propose two simple regression models of Pearson correlation coefficient of two normal responses or binary responses to assess the effect of covariates of interest. Likelihood-based inference is established to estimate the regression coefficients, upon which bootstrap-based method is used to test the significance of covariates of interest. Simulation studies show the effectiveness of the method in terms of type-I error control, power performance in moderate sample size and robustness with respect to model mis-specification. We illustrate the application of the proposed method to some real data concerning health measurements.


Introduction
Most regression models have been developed to describe the relationship between the expected value of response(s) and a number of covariates (predictors). In some situations, it is desired to understand the influence of covariates on the strength of Pearson correlation between two responses. For example, in some psychological study, it is of interest to understand the impact of age and/or sex on the association between physical functionality and mental functionality of the elder (Thomas et al., 2016).
One traditional measure for such association is the partial correlation coefficient, i.e., the correlation coefficient of two responses after removing the covariate effect (Anderson, 2003). However, its inference depends on the normality assumption and it misses the connection to the actual values of the covariates which is supposed to have different effects on the association. In terms of the rank-based counterpart, Liu et al. (2018) proposed covariateadjusted partial Spearman's rank correlation using probability scale residuals, which is particularly useful for ordinal responses.
Another method is to use a conditional approach to assess the correlation at different levels of the covariates. Bartlett (1993) proposed to model the Fish z-transformed correlation by a polynomial regression of a single covariate z. His model builds a linear regression up on paired observations of (z k , r(z k )), k = 1, . . . , K, where r(z k ) is the sample correlation coefficient of two responses at z k over K distinct levels of z. One clear drawback of this method is that it requires repeated measures of responses at z k , which would lead to the demand of a large total sample size and even larger when multiple covariates are included. Other studies along this direction include Yu and Dunn (1982) and Paul (1989). Wilding et al. (2011) proposed a model to relate a transformed correlation coefficient of two normal responses (y 1 and y 2 ) with a linear combination of covariates (z 1 , . . . , z p ) through the probit link, i.e., (1 + ρ)/2 = (γ 0 + γ 1 z 1 + · · · + γ p z p ), where ρ = corr(y 1 , y 2 ) and is the distribution function of the standard normal variable. (Restricted maximum likelihood corrected LRT is used to test the significance of covariates.) We point out that the linear transformed correlation (1 + ρ)/2 guarantees the range for a distribution function. However, it can lose efficiency when ρ is known positive (since it can be modelled directly without transformation).
All the aforementioned methods consider the continuous responses. In many situations, there are only binary responses available, e.g., after dichotomization. In this paper, we propose two simple regression models of Pearson correlation coefficient of two normal responses and two binary responses (Section 2). Specifically, when the correlation coefficient is known positive, the logistic link function is used and when the correlation coefficient is arbitrary in (−1, 1), the hyperbolic tangent (or Fisher z-transformation) link function is used. Likelihood-based inference is developed to estimate the regression coefficients (Section 3). We demonstrate the performance of the proposed method by simulation in Section 4 and illustrate the applications in great detail to a real data concerning physical functionality and mental functionality of the elder (Section 5). We defer the technical details in Appendix.

Model
Let (y 1 , y 2 ) be a pair of responses of a subject with correlation coefficient ρ = corr(y 1 , y 2 ). Let x 1 , . . . , x p denote p associated covariates of the paired responses, which follow a joint distribution f (x 1 , . . . , x p ). Consider modelling ρ as a monotonic function of the linear combination of these covariates through where x = (1, x 1 , . . . , x p ) and β = (β 0 , β 1 , . . . , β p ) . When ρ > 0, we use the logistic function h(x) = (1 + e −x ) −1 , referred by link 1. When −1 < ρ < 1, we use the hyperbolic tangent function h(x) = tanh(x) = (e 2x − 1)/(e 2x + 1), referred by link 2. We adopt these two commonly used link functions to map the real line to intervals (0, 1) and (−1, 1), respectively, as they are analytically simple. Other links can also be used, e.g., the probit link for ρ > 0. When the correlation is known positive, the logistic function is preferred as it is more efficient for estimation and easy for interpretation. Let h and h denote the first and second derivatives of h, respectively. Our goal is to assess model (1).

Bivariate binary responses
Assume that both y 1 and y 2 are binary responses. Let E(y 1 ) = p 1 and E(y 2 ) = p 2 . For a, b = 0, 1, let I ab = I(y 1 = a, y 2 = b), where I(·) is the indicator function, and p ab = E(I ab ) = P(y 1 = a, y 2 = b). Clearly, p 00 + p 01 + p 10 + p 11 = 1. By definition, we have Inversely, given p 1 , p 2 and ρ, we can obtain For j = 1, 2, we further model the expected value of y j by a logistic regression through (1) and (5), p ab is a function of (β, η).) By (5), the log-likelihood is Then, the first and second derivatives of with respect to β are, respectively where The detailed expressions of e ab and f ab under the two models of h are given in Appendix. When E(I ab | x) = p ab , we have E(s) = 0.

Inference
Let {(y 1 , y 2 , x 1 , . . . , x p ) : i = 1, . . . , n} denote independent samples of (y 1 , y 2 , x 1 , . . . , x p ) of size n. For i = 1, . . . , n and j = 1, 2, denote The gradient (first derivative) and the Hessian matrix (second derivative) of the log-likelihood over n independent samples with respect to β are s(β, where the expectation is taken with respect to the joint distribution of (y i1 , y i2 , x i ). Let I = E{s i (β, η)s i (β, η)} denote the Fisher information matrix with respect to β. It is noted that unlike the usual regression of the mean value of response, the regularity condition does not hold for the model (1) (Tsiatis, 2006, Theorem 3.2). (It can be easily verified that I = − .) Let η be the maximum likelihood estimate (MLE) of η obtained from the marginal models. For instance, the marginal MLEs of the nuisance parameters under the bivariate normal response case are Next, we estimate β by using the Newton-Raphson method through iterating where β (r) is the estimate of β at the rth iteration. The convergence of Newton-Raphson method depends on the initial point and the negative definitiveness of the Hessian matrix, which guarantees the (unique) global maximizer of the log-likelihood function. The initial estimate β (1) can be chosen such that h(x β (1) ) is in the range of ρ. However, the Hessian matrix involves the plug-in estimators of the nuisance parameters. It is not trivial to show its negative definitiveness theoretically. Nevertheless, numerical study shows the condition holds with all eigenvalues being negative. Denote the root of the Newton-Raphson method by β.  (7) converges.
where is the covariance matrix given in the proof.
The significance of the overall model is assessed by testing the hypothesis H 0 : β 1 = · · · = β p = 0. Denote the β 1 , . . . , β p ) and cov( β −0 ) is the estimate of the covariance of β −0 . By Theorem 3.1, under the null hypothesis, W follows a chi-square distribution of degrees of freedom p asymptotically. The significance of individual covariate x j (j = 1, . . . , p) is assessed by testing the hypothesis H 0j : β j = 0 using the statistic w j = β j / se( β j ), where se( β j ) is the estimate of the standard deviation of β j , which is the square root of the (j, j)th element of cov( β −0 ). The null distribution of w j is asymptotically standard normal. Since the exact expression of is rather complicated, we use bootstrap to estimate it in practice.

Simulation study
We conduct simulations to examine the performance of the proposed model for making inference in estimation and testing.

Setup
Unlike possible large numbers of covariates for the mean model, a small number of relevant covariates are usually adequate for modelling the correlation coefficient. Consider the number of covariates p to be two and three, respectively. Set the sample size n to be 100, 200, 400 and 1000, respectively.
For bivariate normal responses, set the marginal variances σ 2 1 = σ 2 2 = 1. For both bivariate normal responses and bivariate binary responses, we set both γ 1 and γ 2 to be a vector of p + 1 zeros for the marginal mean models. It simplifies the data generation without simplifying the estimation procedure. Let x 1 , . . . , x p be independent uniform random variables in (0,1).
For the correlation coefficient model, set β 0 = 0.25. Under the nonnull model, set β 1 ,. . . ,β p under two link functions as in Table 2, where only the first covariate is significant. Under the null model, simply set all β 1 , . . . , β p to be zeros. We set (0.25, 0, . . . , 0) as the initial values for β under both link functions.
Throughout, we set the significant level to be 0.05 for both the global test and individual test, and set the number of replications, B, to be 1000.

Results
First, denote the empirical root of mean square error of β by ERMSE is the estimate of β in the bth replication. It measures the performance of the estimation in terms of magnitude. Second, denote the directional consistency rate (CR) by CR i is the true correlation coefficient in the bth replication and ρ (b) ). It measures the performance of estimation in terms of sign direction. Table 3 presents the type-I error of the global test and individual tests under the null model. It is seen that the proposed resampling test controls the type-I error well at the nominal level.
Columns 4-5 and 9-10 of Table 4 present the ERMSE and CR under the nonnull model considered in Section 4.1. It is seen that as the sample size increases the ERMSE decreases while the CR increases as expected. The CR ranges from 85% to 100% indicating that the proposed model yields a good fit in the sign direction of the correlation coefficient. Columns 7-9 and 12-15 of Table 4 present the powers of the global test and individual tests of the covariates. It is seen that both the global test and the individual test for the significant covariate (i.e., x 1 ) gain power Table 2. Specifications of β 1 , . . . , β p under p = 2, 3 and two link functions.  as the sample size increases as desired. The rejection rates for the insignificant covariates (i.e., x 2 and x 3 ) are close to the nominal level as desired. We note that the power performance depends on the actual model. In general, we see the proposed method works well under moderate sample size.

Sensitivity analysis
We adopt the simulation model from Wilding et al. (2011) for bivariate normal responses, where y 1 ∼ N(1 + 2x 1 , 1), y 2 ∼ N(1.5 + 1.2x 1 , 1) and their correlation coefficient is given by ρ(y 1 , y 2 ) = 2 (0.5 + β 1 x 1 ) − 1 with x 1 being a uniform random variable in (0, 1). Consider β 1 to be 0, 0.5 and 1, respectively, to represent the null case and two nonnull cases. Set n to be 30, 60 and 150, respectively. We compute the rejection rate for testing the significance of x 1 using the proposed method with link 2 in comparison with Wilding et al. (2011)'s method by which the model is correctly specified. This serves as a sensitivity analysis under model mis-specification. Table 5 reports the rejection rates under the considered cases. It is seen that under the null case when β 1 = 0 the proposed method controls the type-I error well. Under the nonnull cases when β 1 = 0.5 and 1 the proposed method is less powerful than Wilding et al.'s method when the sample size is 30 and becomes comparable in power when the sample size gets larger ( 60).

Real data application
Our data are taken from the survey of National Health Measurement Study in 2005-2006(Fryback, 2009. It collects responses to health-related quality of life questionnaires and health status questions through Short Form SF-36 questionnaires from 3844 older adults in the continental United States (1641 males and 2203 females).

Bivariate normal responses
First, we illustrate the application of the proposed model to the bivariate normal responses case. The SF-36 items were aggregated into eight health concepts, which were further summarized into the physical component summary (PCS) and the mental component summary (MCS), both in the range of (0, 100) with higher scores indicating better physical and mental health functions, respectively [28]. Several medical studies showed that the PCS declined significantly with age, but that the MCS did not change with age (Kim, 2016;Ware Jr et al., 1994). To continue along these lines, we applied the proposed method to further investigate the effect of gender and marital status. We used the Fisher z-transformation (i.e., tanh link) to accommodate negative value of correlation. The estimate model for the correlation between PCS and age is found to be tanh(ρ(PCS, age)) = −0.203 − 0.145 × MARRIED + 0.018 × SEX, where MARRIED is 1 if married or living with a partner and 0 otherwise, and SEX is 1 if female and 0 if male. The corresponding p-values for marital status and gender are < 0.001 and 0.592, respectively, indicating strong significance of marriage and insignificance of gender. The negative coefficient of MARRIED implies that the status of being married or living with a partner aggravates the declination of PCS with age (−0.283 for people married or living with a partner and −0.209 for people living by themselves). In other words, the people who are married or living with a partner have relatively lower physical functionality than the people who live by themselves when they age in general. On the other hand, our model concurred with the conclusion of insignificant correlation between MCS and age. In addition, neither gender nor marital status has significant effect on this correlation. (The details are omitted.) Moreover, we applied the proposed model to study the age effect on the correlation of PCS and MCS. (Throughout this section the observations of age are standardized, denoted by AGE * , before fitting the model as covariate.) Insignificant result is found and supported by the visual inspection of the correlations over 20 age intervals determined by the percentiles (Figure 1(a)). This result is different from the finding of Wilding et al. (2011), who showed a nonlinear downward trend of the correlation coefficient over age based on the 2003-2004 Health Outcome Survey data. When we further include gender and marital status as covariates, there is no significant gender effect  either (p-value 0.312) as also seen in Figure 1(b). However, there appears a very mild effect of marital status (pvalue 0.236) in the way that the status of being married or living with a partner slightly reduces the correlation from otherwise slight positive to the level of near zero (Figure 1(c)). (The fitted model is tanh(ρ(PCS, MCS)) = 0.148 − 0.016 × AGE * − 0.063 × MARRIED − 0.052 × SEX.)

Bivariate binary responses
Second, we applied the proposed method to model the correlation of two binary indicators of stroke (STR) and diabetes (DIA) using the covariates of age, gender and marital status. Since the correlation is known positive, the logistic link is used. The estimated model which includes the second-order age effect is logistic(ρ(STR, DIA)) = − 2.050 − 1.352 × AGE * − 0.090 × AGE * 2 + 0.938 × MARRIED + 0.270 × SEX with strong significance of the linear age effect (p-value < 0.001) and marriage effect (p-value 0.028) and insignificance of the second-order term of age (p-value 0.786) and gender effect (p-value 0.675). It is seen from Figure 2(a) that the correlation exhibits a parabola shape in age where the largest value (about 0.25) occurs around age 55 and then starts to decline afterwards. This indicates that the association of stroke and diabetes is strongly age dependent. The diabetes becomes a less significant risk factor to stroke after a certain age, e.g., 70 (Kim, 2016). The status of married or living with a partner contributes significantly to the positive correlation (0.145 for married vs 0.106 for otherwise) especially for people who are more than 50 years old (Figure 2(c)), which could be explained by the evidence of low physical functionality found before.

Concluding remarks
We propose two simple regression models of Pearson correlation coefficient of two continuous responses or two binary responses. Likelihood-based inference is developed to estimate the model and test the significance of covariates. Our method for the binary response case is new to the literature and is useful for analysing data when outcomes are observed in binary form (such as yes or no) as seen in many psychological and sociological studies. The proposed method is easy to implement and computationally affordable. (The Newton-Raphson iteration converges in a few steps.) We have made R package available upon request.
The limitation of the present paper is that the proposed method does not handle the case of mixed responses, i.e., one continuous response and one binary response, or more generally an ordinal response categorized from a latent variable, such as 'mild', 'moderate', 'severe'. Regression model of correlation involving latent variable(s) is worth investigation. We will report the result in a separate work.

Disclosure statement
All authors declare no conflict of interest.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon request.