A machine learning-based approach for estimating and testing associations with multivariate outcomes

David Benkeser; Andrew Mertens; John M. Colford; Alan Hubbard; Benjamin F. Arnold; Aryeh Stein; Mark J. van der Laan

doi:10.1515/ijb-2019-0061

Open Access Published by De Gruyter August 13, 2020

A machine learning-based approach for estimating and testing associations with multivariate outcomes

David Benkeser , Andrew Mertens , John M. Colford , Alan Hubbard , Benjamin F. Arnold , Aryeh Stein and Mark J. van der Laan

From the journal The International Journal of Biostatistics

https://doi.org/10.1515/ijb-2019-0061

Abstract

We propose a method for summarizing the strength of association between a set of variables and a multivariate outcome. Classical summary measures are appropriate when linear relationships exist between covariates and outcomes, while our approach provides an alternative that is useful in situations where complex relationships may be present. We utilize machine learning to detect nonlinear relationships and covariate interactions and propose a measure of association that captures these relationships. A hypothesis test about the proposed associative measure can be used to test the strong null hypothesis of no association between a set of variables and a multivariate outcome. Simulations demonstrate that this hypothesis test has greater power than existing methods against alternatives where covariates have nonlinear relationships with outcomes. We additionally propose measures of variable importance for groups of variables, which summarize each groups’ association with the outcome. We demonstrate our methodology using data from a birth cohort study on childhood health and nutrition in the Philippines.

Keywords: canonical correlation; epidemiology; machine learning; multivariate outcomes; variable importance

1 Introduction

In statistical analysis, we often encounter situations where a correlation between a multivariate set of covariates and a multivariate outcome is of scientific interest. A classical approach to this problem is canonical correlation analysis [1], [2]. Canonical correlation maximizes the correlation between a linear combination of the multivariate outcome and a linear combination of the covariates. Several test statistics have been proposed for significance testing of canonical correlation including Wilks’ Λ [3], the Hotelling-Lawley trace [1], [2], the Pillai-Bartlett trace [4], [5], and Roy’s largest root [6]. One potential drawback to canonical correlation analysis is that it may fail to identify associations when nonlinear relationships exist between covariates and outcomes. This has led to recent interest in nonlinear extension of canonical correlation analysis including kernel canonical correlation analysis [7], [8], [9].

Another analytic approach in settings with multivariate outcomes is latent variable analysis. Many definitions of latent variables appear in the literature, and we refer interested readers to more thorough discussions in [10], [11], [12], [13], [14]. A commonly used definition is that a latent variable is a hypothetical construct that cannot be measured by the researcher and that describes some underlying characteristic of the participant. Others have strongly rejected the use of latent variables, instead opting for empirical explanations of observed phenomenon [15]. For our purposes it suffices to say that latent variable analysis tends to entail an unsupervised grouping of the observed outcomes into a set of lower-dimensional latent features. One technique commonly employed to this end is principal components analysis (PCA). Researchers often use PCA to reduce the observed multivariate outcome to a low-dimensional (e.g., univariate) outcome and may examine the factor loadings to ascribe a-posteriori meaning to the reduced outcomes. Researchers further may use these outcomes to test for associations of covariates. However, such tests may fail to identify associations between predictors and outcomes due to the unsupervised nature of the outcomes’ construction.

In this work, we propose an alternative method for measuring association between a multivariate outcome and a set of covariates. Our approach is related to canonical correlation analysis, but rather than maximizing the correlation between linear functions of covariates and outcomes, we maximize the predictive accuracy of a flexible machine learning algorithm and a convex combination of the outcome. The method identifies the univariate outcome that is “most easily predicted” by the covariates and a summary measure of how well that “easiest-to-predict” outcome may be predicted. This approach adapts to complex relationships between covariates and outcomes and identifies associations in situations where traditional canonical correlation does not. However, in contrast to recent nonparametric canonical correlation proposals, our method provides a measure of association on a familiar scale, and also provides asymptotically justified inference, including confidence intervals and hypothesis tests. In certain situations, our method may further provide a novel means of constructing a clinically interpretable latent variable. However, we view our method as a latent variable approach only in the sense of Harman (1960) [16], who described latent variables as a convenient means of summarizing a number of variables in many fewer factors. Our approach fits in this definition by providing a single summary measure of the strength of association between covariates and multivariate outcomes.

2 Methodology

Suppose we observe n independent copies of O := (X₁, …, X_D, Y₁, …, Y_J), where X := (X₁, …, X_d) is a D-dimensional vector of covariates and Y := (Y₁, …, Y_J) is a J-dimensional vector of real-valued outcomes. Without loss of generality, we assume that each outcome has mean zero and standard deviation one and otherwise make no assumptions about the true distribution P₀ of O. We define Yω:=∑j=1JωjYj as a convex combination of the outcomes, where ωj≥0 for all j and ∑jωj=1. We are interested in the outcome Yω that is “easiest-to-predict” based on X using a particular prediction function. To solidify some key ideas, for now assume that we are given weights ω and a prediction function ψω that takes x as input and outputs ψω(x), a real-valued prediction of Yω. We assume for now that such a prediction function is available at the outset, e.g., a machine learning algorithm trained on external data.

A measure of how well a given algorithm ψω predicts Yω is mean squared-error,

(1)ξ0,ω(ψω):=E0[{Yω−ψω(X)}2],

which measures the average squared distance between the prediction made by ψω and the outcome Yω. Here we use the notation E₀ {f (X, Y)} to denote the average value of f (X, Y) under the true distribution P₀. Mean squared-error can be scaled to obtain a measure of accuracy that does not depend on the scale of the outcome. Specifically, let μω:=E0(Yω) denote the marginal mean of Yω, which by assumption equals zero, and define

ρ0,ω2(ψω):=1−ξ0,ω(ψω)ξ0,ω(μω),

as the proportional reduction in mean squared-error when using ψω as opposed to μω to predict Yω. We refer to this as a nonparametric R², as this measure may be interpreted similarly as the R² measure in the familiar context of linear models: values near to one indicate that nearly all of the variance in Yω is explained by ψω. However, we note that the squared notation here is a misnomer in that ρ0,ω2(ψω) can be negative, which indicates that the marginal mean of Yω provides a better fit than ψ0,ω. Nevertheless, we maintain the colloquial squared-notation. If ψω provides reasonably accurate predictions of Yω then ρ0,ω2(ψω) may serve as a summary measure of the strength of the association between X and Y. For example, rejecting the null hypothesis that ρ0,ω2(ψω)=0 may allow us to conclude that there is an association between X and at least one of the components of Y.

With these ideas in mind, we turn now to the setting where we must simultaneously (i) develop a function ψω to predict the combined outcome, (ii) learn a reasonable choice of weights ω, and (iii) based on choices, develop a sensible estimate of an association between X and Y. If ψω is restricted to linear functions of X, and when X and Y are of relatively low dimension, then canonical correlation analysis provides a straightforward means of accomplishing these tasks. However, when more aggressive data fitting approaches are employed to develop ψω, over fitting becomes a concern and may prevent simple correlative statistics from being used to provide accurate summaries of associations. To address these challenges, we propose an approach centered around cross-validation and data-adaptive target parameters [17]. Here, we provide a high-level overview of the procedure, which allows us to introduce notation to describe the parameter that we are ultimately interested in estimating. In the following subsection, we describe in detail how the full estimation procedure unfolds in a given data set.

First, we note that for a given ω, the maximizer of ρ0,ω2(ψω) over all functions mapping an observation x to a real value is ψ0,ω:=E0(Yω|X). Due to linearity of expectation,

(2)E0(Yω|X)=E0(∑j=1JωjYj)=∑j=1JωjE0(Yj|X).

This implies that for any ω, a prediction function for Yω may be constructed by building a prediction function (or equivalently, an estimate of the conditional mean) for each Y_j and weighting those prediction functions by ω. An alternative approach is to use multivariate regression methods to simultaneously predict the whole vector Y [18]. However, we find the univariate approach appealing in that it allows a rich collection of univariate learning methods to be utilized. One approach that is particularly appealing is regression stacking or super learning [19], [20], [21], the motivation for and details of which we discuss in detail in the Appendix. Given the choice of algorithm for developing a prediction function, we proceed with a description of our approach.

To begin, we assume that the data are randomly partitioned into K splits of approximately equal size. For k = 1, …, K, we use T_{n, k} ⊂ {1, …, n} to denote the indices of observations in the k-th training sample (consisting of all the splits expecting the k-th) and V_n,k to denote the indices of observations in the k-th validation sample (i.e., the k-th split itself).

To develop a procedure for sensibly determining outcome weights, we propose an algorithm Ω that maps a given set of data indices into weights that are most promising for prediction using the selected algorithm. The weights based on the k-th training sample are selected by maximizing a (nested) cross-validated measure of accuracy of predictions made by the learner of the combined outcome. We use ωn,k=Ω(Tn, k) denote the weights selected in the k-th training set.

Next, the k-th training set is used to develop a prediction function for Yωn,k. We denote by Ψωn,k a mapping between a given set of data indices and an algorithm for predicting Yωn,k, and let ψn,k=Ψωn,k(Tn,k). Thus, ψn,k is a prediction function developed in the k-th training set that maps x to a prediction of Yωn,k, the composite outcome based on weights also developed in the k-th training set.

Finally, using the respective validation samples, we estimate the following data-adaptive parameter to assess association between X and Y,

ρ0n,Ω2(Ψ):=1−∑k=1Kξ0,ωn,k(ψn,k)∑k=1Kξ0,ωn,k(Y¯n,k),

where Y¯n,k is the training sample average of Yωn,k and we remind readers of the definition of ξ0,⋅ in equation (1). The quantity ρ0n,Ω2(Ψ) describes average performance of the training sample-specific learners for predicting each training sample-specific outcome. The notation for this parameter deliberately connotes that the parameter is dependent on P₀ (via the mean squared error portion), the observed data (through the random training splits), our specific procedure for developing weights Ω, and the chosen learner Ψ.

2.1 Detailed approach

We now provide a step-by-step approach of how our procedure unfolds in a given data set. We include accompanying illustrations for the two-fold cross-validation case (Figures 1 and 2).

$Figure 1: A flowchart illustrating computation of two-fold cross-validated measure of association between X and Y. This procedure maps the full data set into ρn,Ω2(Ψ)${\rho }_{n,\text{{\Omega}}}\hat{2}\left({\Psi}\right)$, an estimate of ρ0n,Ω2(Ψ)${\rho }_{0n,\text{{\Omega}}}\hat{2}\left({\Psi}\right)$. Notation: Tn, k, the k-th training sample; Vn,k, the k-th validation sample; Ψj (Tn, k), the learner for the j-th outcome fit in the k-th training sample; Ω (Tn, k), outcome weights computed in the k-th training sample; ΨΩ(Tn,k)(Tn,k)${{\Psi}}_{\text{{\Omega}}\left({T}_{n,k}\right)}\left({T}_{n,k}\right)$, the learner fits from the k-th training sample combined using weights also computed in the k-th training sample; ρn,Ω2(ΨΩ)${\rho }_{n,\text{{\Omega}}}\hat{2}\left({{\Psi}}_{\text{{\Omega}}}\right)$, the cross-validated nonparametric R2 for the composite learner.$

Figure 1:

A flowchart illustrating computation of two-fold cross-validated measure of association between X and Y. This procedure maps the full data set into ρn,Ω2(Ψ), an estimate of ρ0n,Ω2(Ψ). Notation: T_{n, k}, the k-th training sample; V_n,k, the k-th validation sample; Ψ_j (T_{n, k}), the learner for the j-th outcome fit in the k-th training sample; Ω (T_{n, k}), outcome weights computed in the k-th training sample; ΨΩ(Tn,k)(Tn,k), the learner fits from the k-th training sample combined using weights also computed in the k-th training sample; ρn,Ω2(ΨΩ), the cross-validated nonparametric R² for the composite learner.

$Figure 2: A flowchart illustrating two-fold outcome weighting procedure, which maps a set of observed data indices, Tn, k into ωn,k${\omega }_{n,k}$, a vector of convex weights for the outcome Y. Notation: Tn, k, the k-th training sample; Vn, k, the k-th validation sample; Tn, k, k′, the k′-th nested training sample; Vn, k, k′, the k′-th nested validation sample; Ψω(Tn,k,k′)${{\Psi}}_{\omega }\left({T}_{n,k,{k}\hat{\prime }}\right)$, the prediction function for the multivariate outcome Yω${Y}_{\omega }$ trained using the k′-th nested training sample; ρn,k,ω2(Ψω)${\rho }_{n,k,\omega }\hat{2}\left({{\Psi}}_{\omega }\right)$, the cross-validated R2 for the composite learner.$

Figure 2:

A flowchart illustrating two-fold outcome weighting procedure, which maps a set of observed data indices, T_{n, k} into ωn,k, a vector of convex weights for the outcome Y. Notation: T_{n, k}, the k-th training sample; V_{n, k}, the k-th validation sample; T_{n, k, k′}, the k′-th nested training sample; V_{n, k, k′}, the k′-th nested validation sample; Ψω(Tn,k,k′), the prediction function for the multivariate outcome Yω trained using the k′-th nested training sample; ρn,k,ω2(Ψω), the cross-validated R² for the composite learner.

Step 1.

Partition the data. Randomly partition the full data set into K splits of approximately equal size. Let T_n,k⊂{1, …, n} to denote the indices of observations in the k-th training sample and V_n,k to denote the indices of observations in the k-th validation sample for k = 1, …, K.

Step 2.

Train a learner for the multivariate outcome using the training data. For each split k, use the training sample to train the chosen learner of the multivariate outcome, resulting in learner Ψ (T_{n, k}) for k = 1, …, K.

Step 3.

Fit outcome weights using the training data. We now detail our procedure Ω for mapping a given training set into a vector of weights. The following procedure (illustrated in Figured 2 for K′ = 2) is applied in each of the K training samples.

Step 3a.

Partition the given training data. Given a training data set consisting of observations {O_i : i ∈ T_{n, k}}, randomly partition the data into K′ nested splits of approximately equal size. Let T_n,k,k′ ⊂ T_{n, k} denote the indices of observations in the k′-th training sample nested in T_{n, k} and similarly use V_{n, k, k′} to denote the indices of observations in the k′-th validation sample nested in T_{n, k} for k′ = 1, …, K′.

Step 3b.

Fit learner for each outcome in training data. For k′ = 1, …, K′, develop a prediction function using observations T_n,k,k′ of the multivariate outcome, resulting in prediction function Ψω(Tn,k,k′) for a given set of weights ω. We note that if the procedure for developing the prediction function itself involves cross-validation, this step will involve further nested cross-validation of T_{n, k, k′}.

Step 3c.

Compute cross-validated R². For any ω, we can use the data in the k-th validation sample to compute an estimate of mean squared-error of Ψω(Tn,k,k′),

ξn,ω,k,k′(Ψω(Tn,k,k′)):=1|Vn,k,k′|∑i∈Vn,k,k′{Yω,i−Ψω(Tn,k,k′)(Xi)}2.

The cross-validated mean squared-error is the average over the validation folds, ξn,ω,k(Ψω):=1K′∑k′=1Jξn,ω,k,k′(Ψω(Tn,k,k′)). Similarly, define

Y¯ω,n,k:=1|Tn,k|∑i∈Tn,kYω,i and ξn,ω(Y¯ω,n,k):=1|Tn,k|∑i∈Tn,k{Yω,i−Y¯ω,n,k}2,

as the empirical average weighted outcome among observations in T_{n, k} and the empirical mean squared-error for predicting the composite outcome Yω using the training sample mean Y¯ω,k. For any ω, we can compute the cross-validated nonparametric R² measure as

ρn,k,ω2(Ψω):=1−ξn,ω(Ψω)ξn,ω(Y¯ω,n,k) .

Step 3d.

Maximize cross-validated R². Compute ωn,k=argmaxωρn,k,ω2(Ψω) using standard optimization software. In our simulation and data analysis, we used an interior algorithm with augmented Lagrange multipliers [22], [23].

Step 4.

Combine learners based on outcome weights. Given weights ωn,k, compute the prediction function ψn,k=Ψωn,k(Tn,k) for predicting Yωn,k.

Step 5.

Compute cross-validated nonparametric R². Using each validation sample, compute the mean squared-error for the combined learner,

ξn,k(ψn,k):=1|Vn,k|∑i∈Vn,k{Yωn,k,i−ψn,k(Xi)}2,

which provides an evaluation of the performance of the composite learner ψn,k for predicting the composite outcome Yωn,k based on the training sample-specific weights. We define ξn,Ω(ΨΩ):=1K∑k=1Kξn,k(ψn,k) as the average measure across the K splits. Similarly, estimate the mean squared-error of the composite empirical mean

Y¯ωn,k:=1|Tn,k|∑i∈Tn,kYωn,k,i,

via

ξn,k(Y¯ωn,k):=1|Vn,k|∑i∈Vn,k{Yωn,k,i−Y¯ωn,k}2,

and define ξn,Ω(Y¯Ω):=1K∑k=1Kξn,k(Y¯ωn,k). The estimate of ρ0n,Ω2(Ψ) is

ρn,Ω2(Ψ):=1−ξn,Ω(ΨΩ)ξn,Ω(Y¯Ω).

2.2 Inference

An asymptotically justified confidence interval can be constructed for ρ0n,Ω2(Ψ) using influence function-based methodology [17]. We propose to construct a Wald-style 100(1−α)% confidence interval based on log{ξn,Ω(ΨΩ)/ξn,Ω(Ψ¯Ω)} and back-transform this interval to obtain an interval on the original scale. Thus, our interval is of the form

1−exp[log{ξn,Ω(ΨΩ)ξn,Ω(Y¯Ω)}±z1−α/2σnn1/2],

where z1−α/2 is the 1−α/2 quantile of the standard normal distribution and σn2 is a consistent estimate of the asymptotic variance of log{ξn,Ω(ΨΩ)/ξn,Ω(Y¯Ω)}. Similarly, a level α one-sided test of the null hypothesis of no association between X and Y can be performed by rejecting the null hypothesis whenever

log{ξn,Ω(ΨΩ)/ξn,Ω(Y¯Ω)}σn/n1/2<zα.

A closed-form variance estimator is constructed as follows. For k = 1, …, J, we define

Dn,k,Ψ,Ω(O):={Yωn,k−ψn,k(X)}2−ξn,k(ψn,k) , and

Dn,k,Y¯,Ω(O):={Yωn,k−Y¯ωn,k}2−ξn,k(Y¯ωn,k) .

These equations represent cross-validated estimates of the influence function of ξn,k(ψn,k) and ξn,k(Y¯ωn,k), respectively. We define In,k:=(Dn,k,Ψ,Ω,Dn,k,Y¯,Ω)T as the bivariate estimated influence function and use the delta method to obtain a cross-validated estimate of the influence function of log{ξn,Ω(ΨΩ)/ξn,Ω(Y¯Ω)}. The estimated gradient of this transformation at the parameter estimates is gn:=(ξn,Ω−1(ΨΩ),−ξn,Ω−1(Y¯Ω))T. A consistent variance estimator is

σn2:=gnT{1K∑k=1K1|Vn,k|∑i∈Vn,kIn,k(Oi)In,kT(Oi)}gn.

3 Variable importance

Often we are not only be interested in an overall summary of the association between X and Y, but also in the relative importance of each component of X. Machine learning is often criticized as a “black box” approach, in which it is difficult to understand the relative contributions of different variables [24]. To provide better understanding of the “black box,” we propose to study differences in the estimated association between X and Y when considering different subsets of variables. Our proposal is similar to variable importance measures proposed for specific machine learning algorithms, such as random forests [25], because they measure the change in predictive performance with and without each predictor variable considered. Although existing approaches may have poorly behaved statistical inference [26], the present approach yields straightforward, asymptotically justified inference.

To assess the importance of a subset S ⊂ {1, …, D} of variables X˜:={Xs:s∈S}, we propose to repeat our procedure, but restrict only to variables not in this subset. We obtain an estimate ρn,Ω˜2(Ψ˜Ω˜) of the cross-validated performance measure for the composite prediction algorithm based on this subset of variables. The joint importance of the variables in X˜ is defined by a contrast between ρ0n,Ω2(ΨΩ) and ρ0n,Ω˜2(ΨΩ˜) such as

(3)Δ0n(ΨΩ, Ψ˜Ω˜):=ρ0n,Ω2(ΨΩ)−ρ0n,Ω˜2(ΨΩ˜).

Point estimates for (3) are constructed by plugging in the estimates, and asymptotically justified confidence intervals and hypothesis tests about these estimates are constructed using influence functions as in the previous section.

4 Simulations

We evaluated the finite-sample performance of the proposed estimators in two simulations. In the first, we studied the operating characteristics of our estimator. This simulation confirms that the asymptotic theory developed in previous sections leads to reasonable finite-sample performance for the estimators. In the second simulation, we compared our proposed method to existing methods for assessing associations between X and Y. This simulation shows that our method correctly identifies associations in situations where existing methods do not.

4.1 Simulation 1: operating characteristics

We generated 1000 simulated data sets by independently sampling n copies of X₁, X₂, X₃, X₇, X₈, X₉ from a Uniform(0,4) distribution and n copies of X₄, X₅, X₆ from a Bernoulli distribution with success probabilities of 0.75, 0.25, and 0.5. We sampled ϵ1,ϵ2,ϵ3 independently from a Normal (0,5) distribution and let

Y1=X1+2X2+4X3+X4+2X5+4X6+2X7+ϵ1 ,

Y2=X1+2X2+4X3+X4+2X5+4X6+2X8+ϵ2 , and

Y3=X1+2X2+4X3+X4+2X5+4X6+2X9+ϵ3 .

The optimal R-squared for each outcome is about 0.60. The true optimal weighting scheme for the outcomes is ω0=(1/3, 1/3, 1/3) and the true optimal R-squared for the optimally composite outcome is approximately 0.81. We considered sample sizes of n = 100, 500, 1000, 5000. For computational simplicity, we considered a small learner library consisting of a main terms linear regression (function SL.glm in SuperLearner package), an intercept-only regression (SL.mean), and a forward-stepwise selection linear regression model (SL.step.forward). We used 10 folds for each layer of cross-validation, K=K^* = J = 10.

The estimates of predictive performance were biased downwards in smaller sample sizes (Figure 3). However, even with n = 100 the average estimated performance was only about 5% too small. The confidence interval coverage was less than nominal for small sample sizes, but had nominal coverage in larger samples.

$Figure 3: Results of the first simulation study. Left: Bias of ρn,Ω2(ΨΩ)${\rho }_{n,\text{{\Omega}}}\hat{2}\left({{\Psi}}_{\text{{\Omega}}}\right)$. Right: Estimated coverage of nominal 95% intervals (with Monte Carlo confidence intervals) for the proposed confidence interval.$

Figure 3:

Results of the first simulation study. Left: Bias of ρn,Ω2(ΨΩ). Right: Estimated coverage of nominal 95% intervals (with Monte Carlo confidence intervals) for the proposed confidence interval.

We excluded X₂ and then excluded X₇ to estimate the additive difference in R-squared values (equation (3)) for these two variables. Examining the data-generating mechanism, we note that X₂ was important for predicting each individual outcome and should be important for predicting the composite outcome. The optimal R-squared for predicting each individual outcome without X₂ was 0.52, and the additive importance of X₂ was 0.60 − 0.52 = 0.08. For the composite outcome, the optimal weightings were unchanged (1/3, 1/3, 1/3), but the optimal R-squared without X₂ decreased to 0.68 and the additive importance of X₂ for the composite outcome was 0.81 − 0.68 = 0.13. In contrast, X₇ was important only for predicting Y₁ and had no effect on predicting Y₂ or Y₃. Therefore, the optimal composite outcome was expected to upweight Y₂ and Y₃ when excluding X₇, and the importance of X₇ for predicting the optimal composite outcome may be minimal. The optimal weights without X₇ were (0.24, 0.38, 0.38), whereas the optimal R-squared without X₇ was 0.79, leading to an additive importance of 0.81 − 0.79 = 0.02.

The importance measures for X₂ exhibited substantial bias (25% truth) when n = 100, but both measures were unbiased with sample sizes >1000 (Figure 4). The nominal 95% confidence interval coverage was less than nominal in small samples, but the coverage was >90% for sample sizes >500.

Figure 4:

Results of second simulation study. Left: Bias of the estimated variable importance measure comparing a super learning that uses all of the variable to one that omits either X₂ (triangle) or X₇ (diamond). Right: Estimated coverage for nominal 95% intervals (with Monte Carlo confidence intervals) for the proposed confidence intervals.

4.2 Simulation 2: power of associative hypothesis tests

In this simulation, X was simulated as above. The outcome Y was simulated as a 10-dimensional multivariate normal variate with mean μ=(μ1(x),…,μ10(x))T and covariance matrix Σ. Letting Σ_{i, j} denote the (i, j) entry in Σ, we set Σ_{i, i} = 5 for i = 1, …, 10. We also set Σ_{1, 2} = Σ_{1, 3} = Σ_{4, 7} = −2 and Σ_{3, 9} = Σ_{4, 9} = Σ_{5, 9} = 2, while also appropriately setting the corresponding lower triangle elements to these values. All other off-diagonal elements were set to zero. We studied three settings. The first setting had no association between X and Y, and we set μj(x)=0 for j = 1, …, 10. The second setting had a linear association between X and one component of Y. Here, we set μ6(x)=−2+0.75x1 and μj(x)=0 for j ≠ 6. The third setting had a nonlinear association between X and one component of Y. Here, we set μ6(x)=−2+0.75(x−2)2 and μj(x)=0 for j ≠ 6.

We studied the power of level 0.05 tests of the hypothesis of no association between X and Y. Power was estimated as 100 times the proportion of the 1,000 total simulations where the null hypothesis was rejected. We tested this hypothesis with the proposed one-sided hypothesis test with all layers of cross validation set to five-fold. We used a super learner library that included an intercept-only regression, a main terms regression, and a generalized additive model [27]. We also tested this hypothesis using four common tests of canonical correlation: Wilks’ Λ, the Hotelling-Lawley trace, the Pillai-Bartlett trace, and Roy’s largest root. Rather than comparing these statistics to their respective approximate asymptotic distributions, we used permutation tests to compute p-values [28]. We also studied the power of a test based on a principal components approach. For this test, we constructed a composite outcome based on the first principal component of Y and subsequently fit a main-terms linear regression on X. The null hypothesis of no association was rejected whenever the p-value associated with the global F-test for this linear regression was less than 0.05.

Under the null hypothesis, each canonical correlation-based and the PCA-based test had approximately nominal type I error (Figure 5, left panel). However, our proposed test was found to be conservative, falsely rejecting the null hypothesis less than 1% of the time. In the second scenario, the canonical correlation-based tests had greater power to reject the null hypothesis than our proposed test at the two smallest sample sizes (middle panel). The PCA-based test had poor power, as the first principal component fails to adequately reflect the component of Y for which a relationship with X exists. In the third scenario, only our proposed test had power to correctly reject the null hypothesis (right panel).

Figure 5:

Power of various level 0.05 tests for rejecting the null hypothesis of no association between X and Y under the null hypothesis (left), linear association (middle), and nonlinear association (right).

4.3 Additional simulations

In the web supplementary material, we include several additional simulations that explore various settings. In particular, we repeat simulation one using a more aggressive super learner library and using a more complex data generating process. The results in terms of bias and confidence interval coverage are similar as with the results presented here. We also repeated simulation 2 using covariates that each have a Normal distribution, a setting classically associated with canonical correlation analysis. Our results are again largely similar to the results of simulation 2 presented here.

5 Data analysis

Neurocognitive impairment may affect 250 million children under five years globally [30]. The Healthy birth, growth, and development knowledge integration initiative was established, in part, to inform global public health programs toward optimizing neurocognitive development [31]. Neurocognitive development is often studied through studies that enroll pregnant women and follow their children through early childhood and adolescence. Covariate information about the child’s parents, environment, and somatic growth is recorded at regular intervals and in early adolescence, children complete tests that measure diverse domains of neurocognitive development such as motor, mathematics, and language skills. Researchers are often interested in assessing the correlation between covariate information and neurocognitive development. Such an assessment may be useful for developing effective prediction algorithms that identify children at high risk for neurocognitive deficits [32].

The ongoing Cebu Longitudinal Health and Nutrition Study (CLHNS) enrolled Filipino women who gave birth in 1983–1984, and the children born to these women have been followed prospectively [33], [34]. The long-term follow-up with these children has enabled researchers to quantify the long-term effects of prenatal and early childhood nutrition and health on outcomes (adolescent and adult health, economics, and development) [35], [36]. We are interested in assessing the strength of correlation between prenatal and early childhood data and later schooling achievement. We are also interested in understanding the extent to which somatic growth (height and body weight) early in life associates with later neurocognitive outcomes [37], [38]. In 1994, achievement tests were administered to 2166 children in three subjects: mathematics, and English and Cebuano languages.

We applied the present methods to assess association of the three test scores with variables collected from birth to age two years (Table 1). For variables that had missing values, we created an indicator of missingness, and missing values of the original variable were set = 0. The library of candidate super learner algorithms included a random forest, gradient boosted machines, and elastic net regression algorithms. The tuning parameters for each algorithm were chosen via nested five-fold cross-validation. We used 10-fold cross-validation for each layer of cross-validation.

Table 1:

Variables used to predict achievement test scores in Cebu Longitudinal Health and Nutrition Study analysis.

Group	Variables
Health care	Health care access, use of preventive health care
Household	Child: adult ratio, child dependency ratio, crowding index, urban score
Socioeconomic status	Total income, socioeconomic status
Water and sanitation	Sanitation, access to clean water
Parental	Mother age, father age, mother height, mother education (y),
Parental	Father education (y), marital status, mother age first child, parity
Growth	Weight-for-age Z-score, height-for-age z-score [29] (0, 6, 12, 18, 24 mo)
Other	Mother smoked during pregnancy, child’s sex, gestational age at birth

The estimated cross-validated R-squared for predicting test scores was 0.24 (95% CI: 0.21, 0.27) for mathematics, 0.31 (95% CI: 0.28, 0.34) for English, and 0.23 (95% CI: 0.20, 0.26) for Cebuano. The estimated association with the composite outcome was 0.32 (95% CI: 0.28, 0.35). The estimated association with the optimally weighted outcome was only slightly higher than predicting an equally weighted outcome (0.30, 95% CI: 0.27, 0.34), as well as an outcome weight based on a linear combination based on the first principal component of the test scores (0.28, 95% CI: 0.25, 0.31).

We computed variable importance measures by repeating the procedure, eliminating groups of variables and estimating the additive change in performance. We found that the child’s sex and parental information were responsible for the largest proportion of the association with achievement test score performance (Figure 6). However, the somatic growth variables, as a group, modestly increased the association, with an estimated change in association of 0.01 (95% CI: 0.00, 0.02; p-value = 0.02).

Figure 6:

The change in R-squared for predicting each outcome and the composite outcome when eliminating groups of variables. Abbreviations: CI, confidence interval; SES, socioeconomic status.

6 Discussion

Our proposed method provides a new means of summarizing associations with multivariate outcomes and can provide powerful tests for association in large samples when complex and nonlinear relationships are present. We found that existing tests provide greater power in small samples against alternatives with linear relationships between covariates and outcomes. This is unsurprising as our approach relies on several layers of cross-validation, which may stretch small samples too thin. Thus, we suggest that in small samples, it may be preferable to adopt traditional approaches to assessing associations in multivariate settings.

We also found that our test was conservative under the null hypothesis. This can be explained by the fact that the true value of our data-adaptive target parameter was often less than zero, while our test was based on testing a value of zero for this parameter. We suggest that a permutation test could be used to construct a hypothesis test with better operating characteristics; however, this may often prove to be computationally intractable in practice. In such cases, we may instead opt for the more conservative, but less computationally intensive test.

A possible criticism of our approach is the data-adaptive nature of the parameter ρ0n,Ω2(ΨΩ). One could argue that this parameter is only a useful measure of association insofar as the relevant prediction functions approximate ψ0,ω0, the unknown true conditional mean of the optimally combined outcome. This strengthens the argument for using super learning to a prediction function that is as accurate as possible. It would be interesting to extend the approach to estimate target parameters where rather than conditioning on the training-sample-specific learners in the parameter definition, we instead estimate some correlation that depends on ψ0,ω0, as in [39]. However, this would require a more involved estimation procedure, such as one-step estimation or targeted minimum loss estimation. Similarly, our test of the null hypothesis of no association will only powered against alternatives where linear combinations of outcomes are associated with X. In the presence of more complex relationships, our test may not have power.

An interesting extension suggested by a review is to use cross-validation to select between our proposed measure of association and traditional canonical correlation analysis (CCA). This would require developing an objective criteria that could be used in a given training sample to determine whether there was sufficient evidence to suggest that CCA provides an adequate summary of the association between X and Y. For example, if a linear model for each component of Y has the highest (nested) cross-validated risk in a given training sample, then we may believe that CCA will appropriately describe the association between X and Y. Then, we could use the corresponding validation sample to compute the canonical correlation. Such a modification of our procedure may yield increased power in situations where linear relationships truly are present, while providing an adaptive way to switch to a nonlinear approach. We leave to future work the theoretical and practical study of such an approach. Future extensions may consider nonlinear combinations of the outcome, perhaps via alternating conditional expectation [40], which would bring our proposal closer in scope to that of kernel CCA methods.

Though our proposal is computationally intensive, the vast majority of the computational burden lies in fitting the candidate learners, which can be implemented in a parallelized framework. The computational burden of the procedure may be further reduced by clever choices of number of cross-validation folds. For example, in the context of using super learning to separately develop a prediction function of each of the J outcomes, if K = K′ and we also build a super learner based on K-fold cross-validation, then the procedure requires fitting K3∑j=1JMj total candidate learners. However, if K′ = K − 1 and we use (K − 2)-fold cross-validation to construct the super learner, it is possible to re-use candidate learner fits at different points in the procedure, thereby drastically decreasing the computational burden of the procedure. In fact, we can show that this approach requires only (K3+5K)/6×∑j=1JMj candidate learner fits. Thus, for K = 10, the latter approach is expected to execute in 17.5% the time it would take if all layers of cross-validation are set equal to 10. An R package implementing our methods is available (https://github.com/benkeser/cvma).

It may also be of interest to consider extensions for estimating covariate-adjusted effects of variables on data-adaptively-combined outcome; estimation of these effects and associated inference may be facilitated via cross-validated targeted maximum likelihood [17], [41].

Corresponding author: David Benkeser, Department of Biostatistics and Bioinformatics, Emory University, School of Public Health, Atlanta, 30322, USA, E-mail: benkeser@emory.edu

Funding source: National Heart, Lung, and Blood Institute

Award Identifier / Grant number: 1R01HL137808-01A1

Funding source: Bill and Melinda Gates Foundation

Award Identifier / Grant number: OPP1147962

Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: This research was funded by National Heart, Lung, and Blood Institute (grant no. 1R01HL137808-01A1) and Bill and Melinda Gates Foundation (grant no. OPP1147962).
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Appendix: Super learner

An appealing approach for developing a prediction function is regression stacking [19], [20], also known as super learning [21]. Here we describe how to develop a super learner separately for each of the J outcomes, though in theory, this approach could be extended to simultaneous prediction of the vector Y. We first construct a library of M_j candidate learners for each of the J outcomes. The library of estimators can include parametric model-based estimators as well as machine learning algorithms. Parametric model-based methods could include linear regression of Y_j on X, while machine learning-based methods could include random forests [25], gradient boosted machines [42], or deep neural networks [43]. If X is high-dimensional, learners can incorporate dimension-reduction strategies. For example, we may include a main-terms linear regression of Y_j on all variables in X, while also including a linear regression using only a subset of X chosen based on their univariate association with Y_j. The library of learners can also include different choices of tuning parameters for machine learning algorithms, an important consideration for methods that require proper selection of many tuning parameters (e.g., deep learning).

Because there is no way to know a-priori which of these myriad estimators will be best, a cross-validated empirical criterion is used to adaptively combine, or ensemble, the various learners. This approach is appealing in the present application in that the best learner for each outcome might be different. By allowing the data to determine the best ensemble of learners for each outcome, we expect to obtain a better fit over all outcomes and thereby obtain a more accurate summary of the association between X and Y. Indeed, the super learner is known to be optimal in the sense that, under mild assumptions, its goodness-of-fit is essentially equivalent to the (unknown) best-fitting candidate estimator [44], [45]. A full treatment of super learning [21] and the R package [46] are available. Here, we provide a brief overview of the procedure.

Super learning is often based on K^*-fold cross validation, and we illustrate the method for K^* = 2 in Figure 7. The super learner for a given outcome Y_j may be implemented in the following steps:

$Figure 7: A flowchart illustrating a two-fold super learner procedure, which maps a set of observed data indices, Fn*${F}_{n}\hat{\text{{\ast}}}$ into Ψj,βn,j(Fn*)${{\Psi}}_{j,{\beta }_{n,j}}\left({F}_{n}\hat{\text{{\ast}}}\right)$ an ensemble prediction function for outcome Yj. Notation: Tn,k, the k-th training sample; Vn,k, the k-th validation sample; Ψj,m(Tn,k), the m-th learner for the j-th outcome fit in the k-th training sample; ξn,k(Ψj,m)${\xi }_{n,k}\left({{\Psi}}_{j,m}\right)$, the k-th cross-validated mean squared-error; ξn(Ψj,m)${\xi }_{n}\left({{\Psi}}_{j,m}\right)$, cross-validated mean squared-error; βn,j${\beta }_{n,j}$, a J-length vector of super learner weights; Ψj,m(Fn*)${{\Psi}}_{j,m}\left({F}_{n}\hat{\text{{\ast}}}\right)$, the m-th learner for the j-th outcome fit using {Oi:i∈Fn*}$\left\{{O}_{i}:i\in {F}_{n}\hat{\text{{\ast}}}\right\}$.$

Figure 7:

A flowchart illustrating a two-fold super learner procedure, which maps a set of observed data indices, Fn* into Ψj,βn,j(Fn*) an ensemble prediction function for outcome Y_j. Notation: T_n,k, the k-th training sample; V_n,k, the k-th validation sample; Ψ_j,m(T_n,k), the m-th learner for the j-th outcome fit in the k-th training sample; ξn,k(Ψj,m), the k-th cross-validated mean squared-error; ξn(Ψj,m), cross-validated mean squared-error; βn,j, a J-length vector of super learner weights; Ψj,m(Fn*), the m-th learner for the j-th outcome fit using {Oi:i∈Fn*}.

Step 1:

Partition the data. Starting with a data set consisting of observations {Oi:i∈Fn*} for Fn*⊆{1,…,n}, we randomly partition the data into K^* splits of approximately equal size. We define the k^*-th training sample as the observed data with observations in the k*-th split removed. We use Tn,k*⊂Fn* to denote the indices of observations in the k^*-th training sample and Vn,k* to denote the indices of observations in the k^*-th validation sample for k^* = 1, …, K^*.

Step 2:

Fit learners in training data. We now use the observations in the training sample to fit each of the M_j candidate learners. We use Ψ_j,m to denote the m-th candidate learner for the j-th outcome. The learner Ψ_j,m takes as input a set of indices and returns a prediction function for Y_j. We use Ψj,m(Tn,k*) to denote the m-th candidate learner for the j-th outcome fit in the k^*-th training sample. For example, Ψ_j,m might correspond with fitting an ordinary least squares linear regression of Y_j on X, in which case Ψj,m(Tn,k*) corresponds to the fitted linear predictor based on the regression fit in the k^*-th training sample and Ψj,m(Tn,k*)(x) corresponds to the linear predictor of the fitted regression evaluated at the observation x.

Step 3:

Evaluate learners in validation data. Next, we use the data in the k^*-th validation sample to evaluate each of the learners Ψj,m(Tn,k*), m=1,…,Mj via

ξn,k∗(Ψj,m):=1|Vn,k∗|∑i∈Vn,k∗{Yj,i−Ψj,m(Tn,k∗)(Xi)}2,

where we use |Vn,k*| to denote the number of observations in the k^*-th validation sample. We obtain an overall estimate of the performance of the m-th learner via

ξn(Ψj,m):=1K∗∑k∗=1Kξn,k∗(Ψj,m).

Step 4:

Find ensemble weights. The cross-validation selector m˜ is defined as the single learner with the smallest value of ξn and one may use Ψj,m˜(Fn*) as the learner for the j-th outcome, i.e., we refit the cross-validation-selected learner using all the data and use this as prediction function. However, it is often possible to improve on the fit of the cross-validation selector by considering ensemble learners of the form Ψj,β=∑mβj,mΨj,m, where the weights βj,m are non-negative for all m and sum to 1. The prediction function Ψj,βj takes as input X, computes the prediction of Y_j using each of the M_j learners, and returns a convex combination of these predictions. We can compute βn,j:=arg minβjξn(Ψj,βj), the choice of weights that minimizes cross-validated mean squared-error, using standard optimization software.

Step 5:

Refit learners using full data. We next refit the learners that received non-zero weight in βn,j using all observations in Fn*. We denote these fits Ψj,m(Fn*) for m = 1, …, M_j.

Step 6:

Combine learners into super learner. The super learner is the convex combination of the M_j learners using weights βn,j and is denoted by Ψj(Fn*):=Ψj,βn,j(Fn*).

References

1. Hotelling, H. The most predictable criterion. J Educ Psychol 1935;26:139. https://doi.org/10.1037/h0058165.Search in Google Scholar

2. Hotelling, H. Relations between two sets of variates. Biometrika 1936;28:321–77. https://doi.org/10.1093/biomet/28.3-4.321.Search in Google Scholar

3. Wilks, SS. Certain generalizations in the analysis of variance. Biometrika 1932;24:471–94. https://doi.org/10.2307/2331979.Search in Google Scholar

4. Bartlett, M. The statistical significance of canonical correlations. Biometrika 1941;32:29–37. https://doi.org/10.1093/biomet/32.1.29.Search in Google Scholar

5. Pillai, KS. On the distribution of the largest or the smallest root of a matrix in multivariate analysis. Biometrika 1956;43:122–7. https://doi.org/10.2307/2333585.Search in Google Scholar

6. Roy, SN. The individual sampling distribution of the maximum, the minimum and any intermediate of the p-statistics on the null-hypothesis. Sankhyā: Indian J Stat 1945;7:133–58.Search in Google Scholar

7. Nandy, RR, Cordes, D. Novel nonparametric approach to canonical correlation analysis with applications to low cnr functional mri data. Magn Reson Med 2003;50:354–65. https://doi.org/10.1002/mrm.10537.Search in Google Scholar PubMed

8. Andrew, G, Arora, R, Bilmes, J, Livescu, K. Deep canonical correlation analysis. In: International conference on machine learning; 2013:1247–55 pp.Search in Google Scholar

9. Michaeli, T, Wang, W, Livescu, K. Nonparametric canonical correlation analysis. In: International conference on machine learning; 2016:1967–76 pp.Search in Google Scholar

10. Glymour, C, Scheines, R, Spirtes, P, Kelly, K. Discovering causal structure: artificial intelligence, philosophy of science, and statistical modeling. New York: Academic Press; 1987.10.1016/B978-0-12-286961-7.50010-XSearch in Google Scholar

11. Sobel, M. Causal inference in latent variable models. In: von Eye, A, Clogg, C, editors. Latent variables analysis: applications for developmental research. Thousand Oaks, CA, USA: Sage Publications, Inc.; 1994:3–35 pp.Search in Google Scholar

12. Edwards, JR, Bagozzi, RP. On the nature and direction of relationships between constructs and measures. Psychol Methods 2000;5:155. https://doi.org/10.1037/1082-989x.5.2.155.Search in Google Scholar PubMed

13. Hägglund, G. Milestones in the history of factor analysis. In: Structural equation modeling: present and future. Lincolnwood, IL, USA: Scientific Software Inc.; 2001.Search in Google Scholar

14. Bollen, K. Latent variables in psychology and the social sciences. Annu Rev Psychol 2002;53:605–34. https://doi.org/10.1146/annurev.psych.53.100901.135239.Search in Google Scholar PubMed

15. Skinner, BF. About behaviorism: Vintage; 1976.Search in Google Scholar

16. Harman, HH. Modern factor analysis. University of Chicago Press; 1960.Search in Google Scholar

17. Hubbard, AE, Kherad-Pajouh, S, van der Laan, MJ. Statistical inference for data adaptive target parameters. Int J Biostat 2016;12:3–19. https://doi.org/10.1515/ijb-2015-0013.Search in Google Scholar PubMed

18. Izenman, AJ. Modern multivariate statistical techniques: regression, classification and manifold learning. New York, USA: Springer; 2008.10.1007/978-0-387-78189-1Search in Google Scholar

19. Wolpert, DH. Stacked generalization. Neural Network 1992;5:241–59. https://doi.org/10.1016/s0893-6080(05)80023-1.Search in Google Scholar

20. Breiman, L. Stacked regressions. Mach Learn 1996;24:49–64. https://doi.org/10.1007/bf00117832.Search in Google Scholar

21. van der Laan, MJ, Polley, EC. Super learner. Stat Appl Genet Mol Biol 2007;6:1–23. https://doi.org/10.2202/1544-6115.1309.Search in Google Scholar

22. Ye, Y. Interior algorithms for linear, quadratic, and linearly constrained non-linear programming, Ph.D. thesis: Department of ESS, Stanford University; 1987.Search in Google Scholar

23. Ghalanos, A, Theussl, S. Rsolnp: general non-linear optimization using augmented lagrange multiplier method, r package version 1.16; 2015.Search in Google Scholar

24. Freitas, AA. Comprehensible classification models: a position paper. ACM SIGKDD Explor Newslett 2013;15:1–10. https://doi.org/10.1145/2594473.2594475.Search in Google Scholar

25. Breiman, L. Random forests. Mach Learn 2001;45:5–32. https://doi.org/10.1023/a:1010933404324.10.1023/A:1010933404324Search in Google Scholar

26. Strobl, C, Zeileis, A. Danger: high power!– exploring the statistical properties of a test for random forest variable importance: Department of Statistics: University of Munich Technical ReportsAccessed: November 17, 2016.Search in Google Scholar

27. Hastie, T, Tibshirani, R. Generalized additive models: Wiley Online Library; 1990.Search in Google Scholar

28. Menzel, U. CCP: Significance tests for canonical correlation analysis (CCA), r package version 1.1; 2012. Available from: https://CRAN.R-project.org/package=CCP.Search in Google Scholar

29. WHO Multicentre Growth Reference Study Group. WHO child growth standards: length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age: methods and development. Geneva: World Health Organization; 2006.Search in Google Scholar

30. Black, M, Walker, S, Fernald, L, Andersen, C, DiGirolamo, A, Lu, C, et al.. Lancet early childhood development series steering committee, early childhood development coming of age: science through the life course. Lancet 2017;389:77–90. https://doi.org/10.1016/S0140-6736(16)31389-7.Search in Google Scholar

31. Jumbe, N, Murray, JC, Kern, S. Data sharing and inductive learning – toward healthy birth, growth, and development. N Engl J Med 2016;374:2415–7. https://doi.org/10.1056/NEJMp1605441.Search in Google Scholar

32. Walker, SP, Wachs, TD, Gardner, JM, Lozoff, B, Wasserman, GA, Pollitt, E, et al.. Child development: risk factors for adverse outcomes in developing countries. Lancet 2007;369:145–57. https://doi.org/10.1016/s0140-6736(07)60076-2.Search in Google Scholar

33. Cebu Longitudinal Health and Nutrition Survey. Available from: http://www.cpc.unc.edu/projects/cebu[Accessed: November 17, 2016].Search in Google Scholar

34. Adair, LS, Popkin, BM, Akin, JS, Guilkey, DK, Gultiano, S, Borja, J, et al.. Cohort profile: the cebu longitudinal health and nutrition survey. Int J Epidemiol 2011:619–25. https://doi.org/10.1093/ije/dyq085.Search in Google Scholar PubMed PubMed Central

35. Daniels, MC, Adair, LS. Growth in young Filipino children predicts schooling trajectories through high school. J Nutr 2004;134:1439–46. https://doi.org/10.1093/jn/134.6.1439.Search in Google Scholar PubMed

36. Daniels, MC, Adair, LS. Breast-feeding influences cognitive development in Filipino children. J Nutr 2005;135:2589–95. https://doi.org/10.1093/jn/135.11.2589.Search in Google Scholar PubMed

37. Carba, DB, Tan, VL, Adair, LS. Early childhood length-for-age is associated with the work status of Filipino young adults. Econ Hum Biol 2009;7:7–17. https://doi.org/10.1016/j.ehb.2009.01.010.Search in Google Scholar PubMed PubMed Central

38. Smithers, LG, Lynch, JW, Yang, S, Dahhou, M, Kramer, MS. Impact of neonatal growth on IQ and behavior at early school age. Pediatrics 2013;132:e53–60. https://doi.org/10.1542/peds.2012-3497.Search in Google Scholar PubMed PubMed Central

39. Feng, J, Williamson, B, Simon, N, Carone, M. Nonparametric variable importance using an augmented neural network with multi-task learning. In: International conference on machine learning; 2018:1496–505 pp.Search in Google Scholar

40. Breiman, L, Friedman, JH. Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 1985;80:580–98. https://doi.org/10.1080/01621459.1985.10478157.Search in Google Scholar

41. Zheng, W, van der Laan, MJ. Asymptotic theory for cross-validated targeted maximum likelihood estimation, UC Berkeley Division of Biostatistics Working Paper Series; 2010 Paper 273.10.2202/1557-4679.1181Search in Google Scholar PubMed PubMed Central

42. Friedman, JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001:1189–232. https://doi.org/10.1214/aos/1013203451.Search in Google Scholar

43. Ripley, BD. Pattern recognition and neural networks. Cambridge, UK: Cambridge University Press; 2007.Search in Google Scholar

44. van der Laan, MJ, Dudoit, S, Keles, S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol Biol 2004;3:1–23. https://doi.org/10.2202/1544-6115.1036.Search in Google Scholar PubMed

45. van der Laan, MJ, Dudoit, S, van der Vaart, A. The cross-validated adaptive epsilon-net estimator. Stat Decis 2006;24:373–95. https://doi.org/10.1524/stnd.2006.24.3.373.Search in Google Scholar

46. Polley, E, LeDell, E, Kennedy, C, Lendle, S, van der Laan, M. Superlearner: super learner prediction, r package version 2.0-21; 2013. Available from: http://CRAN.R-project.org/package=SuperLearner.Search in Google Scholar

Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/ijb-2019-0061).

Received: 2019-05-22

Accepted: 2020-06-18

Published Online: 2020-08-13

This work is licensed under the Creative Commons Attribution 4.0 International License.

A machine learning-based approach for estimating and testing associations with multivariate outcomes

Abstract

1 Introduction

2 Methodology

2.1 Detailed approach

Step 1.

Step 2.

Step 3.

Step 3a.

Step 3b.

Step 4.

Step 5.

2.2 Inference

3 Variable importance

4 Simulations

4.1 Simulation 1: operating characteristics

4.2 Simulation 2: power of associative hypothesis tests

4.3 Additional simulations

5 Data analysis

6 Discussion

Step 1:

Step 2:

Step 3:

Step 4:

Step 5:

Step 6:

References

Supplementary Material

Journal and Issue

Articles in the same Issue