Fusing Sufficient Dimension Reduction with Neural Networks

We consider the regression problem where the dependence of the response Y on a set of predictors X is fully captured by the regression function E(Y | X)=g(B'X), for an unknown function g and low rank parameter B matrix. We combine neural networks with sufficient dimension reduction in order to remove the limitation of small p and n of the latter. We show in simulations that the proposed estimator is on par with competing sufficient dimension reduction methods in small p and n settings, such as minimum average variance estimation and conditional variance estimation. Among those, it is the only computationally applicable in large p and n problems.


Introduction
In this paper we focus on the regression problem where the dependence of the response Y on a set of predictors X is fully captured by the regression function E(Y | X). Moreover, we further assume there exists a linear projection, or, reduction of the predictors that encapsulates all the modeling information in X about E(Y | X).
Specifically, we assume the conditional distribution of Y given a p-variate random vector X satisfies the regression model Y = g(B X) + ε, (1) where B ∈ R p×k of rank k < p, ε ∈ R is a random variable with E(ε | X) = 0 and Var(ε) = E ε 2 = η 2 < ∞, and g is an unknown continuously differentiable non-constant function. The projection B X is a linear sufficient dimension First we fit an arbitrary neural net to the data, and in the second stage we refine the estimate with a specific architecture using a bottleneck. The premise of the two stage NN − SDR estimator is conceptually similar to MAVE with the difference that we use neural nets as universal function approximators compared to nonparametric local linear smoothing methods. The advantage of this approach is that it retains the accuracy of state of the art SDR methods while it can be easily deployed to large scale datasets frequently encountered in applications. It also obtains predictions at nearly no additional computational cost compared to fully nonparametric methods used in MAVE and CVE. Further, the extension of the proposed NN − SDR estimator to online learning, where new data are dynamically added, is straightforward.
The paper is organized as follows. In Section 2 we give a short overview of the theoretical foundations of SDR, and in Section 3 we present neural nets and the notation used throughout. In Section 4 we propose the novel two stage estimator and in Section 5 describe the algorithm. Then in Section 6 we draw the analogy to existing SDR methods and demonstrate its performance in Sections 7,8,9 via simulations and data examples. Our concluding remarks are in Section 10.

Mean Subspace
Let (Ω, F, P) be a probability space, Y a univariate continuous response and X a p-variate continuous predictor, jointly distributed, with (Y, X ) : Ω → R p+1 .
The interest of this paper is to estimate the mean subspace denoted by S E(Y |X) (see [12,13,27]). Definition. A linear space S E(Y |X) ⊆ R p is called a mean subspace if for any basis B ∈ R p×k of S E(Y |X) with dim(S E(Y |X) ) = k ≤ p, An equivalent characterisation of S E(Y |X) is given by where P S is the orthogonal projection on the space S with respect to the usual inner product. This can be seen by letting B ∈ R p×k be a basis of S E(Y |X) , then P S E(Y |X) X = B(B B) −1 B X. The only random element is B X and therefore the sigma algebras generated by B X and P S E(Y |X) X are the same. This yields the equivalence of (2) and (3).
For a predictor vector X whose density is supported on a convex set, with positive definite variance-covariance matrix, Var(X) = Σ x , the mean subspace model (2) is equivalent to the regression model (1) in the Introduction. Their equivalence derives from so that span{B} = S E(Y |X) , where span is the column space of B.
The mean subspace S E(Y |X) captures all the information in X about Y that is contained in the first conditional moment E(Y | X). That is, if we are only interested in the conditional mean, X can be replaced by B X ∈ R k without any loss of information. When k is significantly smaller than p, this results in substantial savings in reducing the complexity of the modeling problem.
Our focus in this paper is the estimation of the mean subspace. By the equivalence of (2) and (3) the sufficient reduction depends only on the subspace S E(Y |X) and not on a particular basis. Therefore, without loss of generality, we let B ∈ S(p, k), where S(p, k) = {V ∈ R p×k : V V = I k }, (5) denotes the Stiefel manifold, that comprises of all p × k matrices with orthonormal columns.

The Multi Layer Perceptron (MLP)
In this section we briefly review the concept of a Multi Layer Perceptron (MLP [19,33,18,23]) and introduce the corresponding notation.
An MLP is the concatenation of layers. Each layer consists of simple functions is a matrix of weights, b (l) is the bias vector of layer l, and together they form an affine transformation, on which the activation function φ(·) is applied component-wise. The formal definition is provided next. Definition. A Multi Layer Perceptron (MLP) with N layers from R p → R is a function with the following structure where Θ = (W 1 , b 1 , . . . , W N , b N ) and the l-th layer is given by with weights W (l) ∈ R n l ×n l−1 , bias b (l) ∈ R n l , and a non-constant, continuous activation function φ (l) : R → R that is applied component-wise.
The notation Θ = (W 1 , b 1 , . . . , W N , b N ) means that all parameters of an MLP are collected in vectorised form into the vector Θ = (vec(W 1 ), b 1 , . . . , vec(W N ), b N ) ∈ R N j=1 n l−1 n l +n l , where the operation vec : R n l−1 ×n l → R n l−1 n l stacks the columns of a matrix one after another. Note that in general an MLP does allow for multi-dimensional output, but for the sake of simplicity we only allow univariate responses and therefore only univariate outputs.
The first layer that receives the input x is called the input layer, and the last layer is the output layer. All other layers are called hidden layers. A widely used activation function is the so called ReLU (Rectified Linear Unit) given by The ReLu activation function will be used throughout this paper. Other popular choices include sigmoid functions like the tangens-hyperbolicus. i.e., x = (x 1 , . . . , x 4 ) ∈ R 4 . The first layer f (1) has output dimension 6, or 6 so called neurons, W 1 ∈ R 6×4 , b 1 ∈ R 6 . The second layer, f (2) , has 4 neurons with W 2 ∈ R 4×6 , b 2 ∈ R 4 , and the output layer, f (3) , has 1 neuron with W 3 ∈ R 1×4 , b 2 ∈ R. The arrows represent the weights of the layer. At each node (neuron), the bias is added before the activation function φ (l) is applied.
The universal approximator theorem [21,Thm 3] established that Multi Layer Perceptrons (MLPs) are universal approximators of functions. Theorem 1, which asserts that any continuously differentiable function can be approximated arbitrarily close on compact sets by an MLP, reproduces it. Theorem 1. Let MLP ∞ be the set of all one layer MLP's with arbitrarily many neurons in the first layer and the activation function φ is non-constant and bounded, then MLP ∞ is uniformly m dense in C m (R p ) on compact sets, where C m (R p ) is the space of all m-times differentiable functions on R p .

Input
Hidden 1 Hidden 2 Output Figure 1: Example Architecture of a 3 Layer MLP.
An application of Theorem 1 with m = 1, yields that for g(B x) ∈ C 1 (R p ) of model (1), for every arbitrary compact set K ⊂ R p and for all ν > 0, there exists a one layer MLP f MLP1 (·; Θ) such that Therefore, the conditional expectation E(Y | X = x) = g(B x) and its gradients can be approximated arbitrarily close on compact sets by a one layer MLP. This serves as the basis for the proposed estimation procedure in Section 4.

Neural Net SDR
Theorems 2 and 3 present two ways of identifying B in model (1) at the population level. They serve as the motivation for the proposed estimators. Theorem 2. Assume model (1) holds. Let whereg Therefore, which completes the proof.
For V ∈ S(p, k), let where (9) is the target function at the population level for MAVE and identifies S E(Y |X) as shown in Theorem 3.
Next we define the three different neural nets we use in the proposed estimators. Definition. Using the notation for MLP's in Section 3, we define where V ∈ S(p, k) given in (5).
Our estimation method is run in two stages. The first uses g NN_OPG (x; Θ 1 ) in (12) as an estimator forg in Theorem 2 . The second, or, refinement stage, estimates g in model (1) with g NN_wrap (x; Θ 2 ) in (13) and g(B x) in model (1) with g NN (x; (V, Θ 2 )) in (14). Both (12) and (14) are used to estimate E(Y | X = x) but the latter uses the specific structure of model (1) for refinement.
The MLP in (14) is the same as in (13) save for an additional input layer with the identity as the activation function; i.e., φ (1) (x) = x. Further, the first layer forms a bottleneck, as depicted in Figure 2, since V x ∈ R k with k p. (12) serves as an estimate for E(Y | X = x) = g(B x) : R p → R, (13) for g : R k → R, and (14) is a refined estimate of g(B x) in model (1). The bottleneck of the MLP in (14) is conceptually similar to autoencoders [see, e.g., [25,26]] with the important difference that the latter are analogous to nonlinear principal components and unsupervised; that is, independent of the response. Figure 2 illustrates the MLP in (14) with input dimension p = 4, x = (x 1 , . . . , x 4 ) ∈ R 4 . The first layer represents the 2-dimensional linear reduction V ∈ S(4, 2) and the rest of the network coincides with (13).
For the proposed estimator, we can use any MLP that has more neurons than p in the first hidden layer in (12) and (13). For the sake of simplicity we opted for a 1 layer MLP with 512 neurons as default, since this gave satisfactory results in simulations. Further, the performance in simulations was robust against different architectures if sufficient regularisation is applied via dropout in the training of the MLP [see [35]].

Initial Estimator
We assume (Y i , X i ) i=1,...,n is a random sample from the joint distribution of Y and X given by model (1). Let  Figure 2: Illustration of g NN in (14) be the objective function for the initial estimator, where L : R × R → [0, ∞) is a loss function. The training of the initial MLP in (12) is carried out by minimizing the objective function in (15), The resulting g NN_OPG (x; (1) if the squared error loss, is used. (16), which is an estimate for ∇g(B X i ) ∈ R p . We let that is an estimator for Σ ∇ in (7). The NN_OPG estimator is defined as where v 1 , . . . , v k are the first k eigenvectors of (18).
By Theorem 2, under model (1), span{Σ ∇ } = span{B} = S E(Y |X) . If we assume that (18) is a consistent estimator for (7), then the NN_OPG estimator in (19) is consistent for S E(Y |X) in model (1). B NN_OPG in (19) is used as an initial starting value for the optimization in (21) in order to obtain the refined estimator B NN .
The loss function L is determined by model (1) and the conditional distribution of Y | X. If the response Y and predictors X are continuous and the error term in (1) has a conditional Gaussian distribution, then the squared error loss function corresponds to the likelihood function. If Y is Bernoulli or multinomial distributed, then the cross entropy loss function can be used, and if Y is Poisson distributed then the deviance is the natural choice for the loss function. In general, the loss function is the relevant part of the likelihood in the conditional distribution of Y | X and agrees with the loss function in generalized linear models for conditional distributions in the exponential family.

Refinement Estimator
The second stage is the refinement of the initial estimator in (19). The NN_OPG estimator is obtained via the gradient of the trained MLP in (12). The training of the function g NN_OPG : R p → R in (16) suffers from the curse of dimensionality if the input dimension is large. In this case, the accuracy of the estimation of (18) is adversely affected as learning a nonlinear function and its gradient with a high dimensional input space is difficult. The refinement procedure explicitly incorporates the defining assumption of model (1) that a lower dimension projection of the input, B X, can replace the original input X. This is realised via the function g NN in (14).
Definition. The target function for the refinement estimator is given by is a loss function, and V ∈ S(p, k). Further, we set and the NN refinement estimator is given by B NN .
The simultaneous optimization with respect to V and Θ 2 in (21) corresponds to simultaneous estimation of the sufficient reduction B and the link function g in model (1). The partially trained function T NN (·, Θ 2 ) is an estimate for (8) if the squared error loss function (17) is used.
Under squared error loss, if (13) is a consistent estimator for g in (1), By Theorem 3, S E(Y |X) = span{B} = span{arg min V∈S(p,k) T (V)} and the expectation would be that, subject to regularity conditions, the NN refinement estimator is consistent.
Nevertheless, proving the consistency of (19) and the refinement estimator B NN in (21) requires neural nets consistently estimate any function g from a sample of model (1). To the best of the authors' knowledge there is no such result available in the literature.
The optimization in (21) is solved via stochastic gradient descent training [see Section 5 or any other first order training algorithm for neural nets]. These algorithms require a starting value for the parameters, (V, Θ 2 ), to be trained. In simulations the accuracy of the refined estimate B NN in (21) was very sensitive to the initialization of V. We conjecture that a consistent estimator for B in (1), such as (19), is required in order to obtain a consistent estimate from the refinement procedure in (21).

Algorithm
In this section the computation algorithm of the estimates B NN_OPG in (19) and B NN in (21) is given. Both estimators depend on training a MLP using tensorflow [1] with an R interface provided by the R-package [2].
For regularisation during the training, we apply dropout with a rate of 0.4 (see [35]) after each fully connected hidden layer; i.e., the nodes are randomly set to 0 with probability 0.4 during each update step in the training procedure.
..,n , we fix the batch_size to m ≤ n, and the number of epochs to ep. Let f MLP N (x; Θ) be an N -layer MLP. The objective function is given by A rough outline of stochastic gradient descent (SGD) is given in Algorithm 1.
If the sample size n is not a multiple of the batch size m, then in the last run of the inner loop the sum of gradients is extended to n. Further, if there are restrictions placed on some of the parameters, the corresponding part of the parameter vector Θ is projected back to the restricted set after applying the update step. This is the case for V ∈ S(p, k) in (14), where the projection back onto the Stiefel manifold in (5) is done via a polar decomposition.
An important feature of stochastic gradient descent training is that the complexity is linear in the sample size n if the number of epochs ep and the batch size m are chosen independently of n. Moreover, in simulations we observed that fewer epochs ep suffice to find well trained neural nets for large samples.
The proposed NN estimator for the mean subspace is a two stage procedure, as described next.
Stage 1: Obtain the B NN_OPG estimator in (19) by solving the optimization in (16). This is done via the stochastic gradient descent (SGD) algorithm with random initialization of the starting value Θ 1 to obtain the estimate (16).
In the second stage, we use the weights and bias obtained by training g NN_OPG in (12) as initialization for the parameters of g NN_wrap in (13), and B NN_OPG as initial value for V ∈ S(p, k) in (14).
This two stage initialization scheme is important for the performance of the proposed estimator since a random initialization of the parameters of the second stage adversely affects the accuracy of the estimator.

Analogy to Minimum Average Variance Estimation (MAVE) and Outer Product Gradient (OPG)
In this section we draw the analogy of OPG and MAVE, both introduced by [40], with the proposed estimators. The theoretical motivation for OPG is given by Theorem 2 and for MAVE by Theorem 3. First we start by describing the MAVE algorithm briefly.
where a = g(V X 0 ) ∈ R, b = ∇g(V X 0 ) ∈ R k . Therefore we obtain the following approximation for σ 2 (V X 0 ) in (9), for some weights w i,0 that sum to 1 ( i w i,0 = 1). The weights play a crucial role in the estimation. They are given by for a k dimensional kernel K(·), and a bandwidth h ∈ R + .
Since K(·) =K( · 2 ) for a monotone decreasing univariate kernelK(·) : R → R + , [40] recognized that the weights depend only on the distance of V (X l − X 0 ) (the further V X i is away from V X 0 the worse the linear expansion is and the less weight we assign).
Then, an estimator for σ 2 (V X 0 ) in (9) is given bŷ and the target function T (V) is estimated by where the weights are given in (23). The Gaussian kernel is usually used with bandwidth satisfying h = h n ∝ n −1/(4+k) , as typically done in nonparametric function estimation, in order to obtain optimal asymptotic properties.
Definition. The MAVE estimator for S E(Y |X) = span{B} in model (1) is given by

Analogy of NN − SDR estimation to MAVE
The optimization in (24) corresponds to local linear smoothing of E(Y i | V X i ) with weights given in (23). After the local linear estimatesâ j ,b j in (25) are obtained, assume that the weights in (23) (25) can be written as where L is the squared error loss and E(Y i | V X i ) the local linear smooth. Under this simplifying assumption, (27) is the same as (8) except that the conditional expectation is estimated via local linear smoothing in MAVE as opposed to neural nets for NN in (21).

Analogy of NN_OPG to OPG
The OPG estimator estimates (7) in Theorem (2) via local linear smoothing of ∇g(B X i ). Specifically, if V = I p in (25), we let (a j , b j ) n j=1 denote the solutions of the optimization in (25). Then, b j is an estimate for ∇g(B X j ) and is an estimator for Σ ∇ in Theorem 2.
Definition. The outer product gradient (OPG) estimator for S E(Y |X) = span{B} is defined as where v 1 , . . . , v k are the first k eigenvectors of (28).
An iterative algorithm to solving the optimization problem in (26) is given in [40]. The OPG estimator in (29) can be used as an initial value for the optimization procedure of the MAVE estimator.

Simulations
We compare the estimation accuracy of NN estimation with the forward model based sufficient dimension reduction methods, and mean outer product gradient estimation (meanOPG), mean minimum average variance estimation (meanMAVE [40]) and the recently developed conditional variance estimator (CVE) [14]. The first two, meanOPG and meanMAVE, are implemented in the R-package [37], and CVE in the R package CVarE [24].
We report results for three architectures for g NN_OPG and g NN_wrap used in NN estimation. The first is a single layer MLP with 128 hidden neurons, the second has 512 and the third is a two layer MLP with 48 hidden neurons each. The results  N (0, 1).
were largely undifferentiated for hidden neuron values between 128 and 512. For the two layer MLP, we obtained similar results for more than 48 neurons, which is already a small number. All three architectures use dropout (see [35]) with probability 0.4 4 after each fully connected hidden layer except in the reduction layer of the g NN . All architectures in (12) are trained in (16)  We consider the same six models (M1-M6) as in [14], which are reproduced in Table 1. Throughout, we set p = 20, b 1 = (1, 1, 1, 1, 1, 1, 0 where e j denotes the p-vector with jth element equal to 1 and all others are 0. In M7, the first three columns are the identity vectors and b 4 = (2e 4 + e 5 )/ √ 5 which is taken from [16]. The error term ε is independent of X for all models. In M2, M3, M4, M5 and M6, ε ∼ N (0, 1). For M1, ε has a generalized normal distribution GN (a, b, c) with densitiy f ε (z) = c/(2bΓ(1/c)) exp((|z − a|/b) c ) [see [34]], with location 0 and shape-parameter 0.5 for M1, and the scale-parameter is chosen such that Var(ε) = 0.25. The dimension k is assumed to be known throughout.
The variance-covariance structure of X in models M1 and M4 satisfies Σ i,j = 0.5 |i−j| for i, j = 1, . . . , p. In M5, X is uniform with independent entries on the p-dimensional hyper-cube. The link functions of M4 is studied in [40], but we use p = 20 instead of 10 and a non identity covariance structure for M4. In M2, Z ∼ 2Bernoulli(0.3) − 1 ∈ {−1, 1}, where 1 q = (1, 1, ..., 1) T ∈ R k , this yields that X has a mixture normal distribution with a mixture probability of 0.3. M7 is a challenging four dimensional model studied in [16].
We generate r = 100 replications of models M1 -M7 and estimate B using the different sufficient dimension reduction methods. The accuracy of the estimates is assessed using which lies in the interval [0, 1]. The factor √ 2k normalizes the distance, with values closer to zero indicating better agreement and values closer to one indicating strong disagreement.
We report the average acc.err and their standard deviations in Table 2. All three network architectures, NN 128 − SDR, NN 512 , NN 48,48 yield similar results, highlighting the robustness of the method with respect to the architecture. We choose NN 512 as our default setup for the following simulations in Section 8. For M1 and M2, CVE yields the most accurate estimation of the reduction B, followed by the NN − SDR estimators. OPG and MAVE show the worst performance for the first two models. In M3, the NN − SDR estimators are on par with CVE and OPG, whereas MAVE exhibits the worst performance. In M4, the NN − SDR estimators are on par with OPG, MAVE, while CVE is slightly worse than the rest. In M5, OPG and MAVE are the most accurate, with CVE nearly on par. For M6, the NN − SDR estimators yield the best results followed by OPG and MAVE. M7 is challenging for all methods, with MAVE, OPG, and NN 512 − SDR the best performing three.
The NN 512 − SDR estimator is better or on par with OPG, MAVE, and CVE except for M5. This is not surprising in the case of NN − SDR and OPG/MAVE as they are built on a similar idea. The main difference is that MAVE uses local linear smoothing instead of neural nets.   Table 3 we report the mean and standard deviation for the out of sample prediction errors in M1-M7 over r = 100 replications. For each data set and replication, we sampled a test set with sample size 1000 from each model and predicted the response Y via the predict function in R for OPG, MAVE, and CVE. For NN − SDR, the predictions are given by g NN (X new , ( B NN , Θ 1 )) in (14). For M1 and M2, CVE gives the smallest out of sample prediction errors, followed by the NN

Large sample size simulation
In this section we simulate data from models M6 and M7 and increase both the number of predictors p and the sample size n. We monitor the estimation accuracy by acc.err in (30) as in Section 7, the out of sample prediction error and the required time for the estimation of a reduction.
We examined two simulation settings. In the first, we simulated from model M7 using the same p = 20 and increased the sample size significantly (n = 2 u , u = 7, 9,11,13). The results are displayed in Table 4. We do not report values for To explore how simultaneous growth of the sample size and the number of predictors affect performance, the second simulation revisits M6, where we successively increase both the sample size n and p. The sample sizes considered are n ∈ {1000, 4000, 16000, 64000, 256000} with corresponding p ∈ {32, 63, 126, 253, 506}, which is roughly p ∝ √ n. We observed that for larger sample sizes, fewer epochs in the training phase of the neural net suffice. To demonstrate this, the number of epochs was reduced as n and p increased, as follows. For (n, p) = (1000, 32), 200 and 400 epochs were used in the two steps of the refined NN, respectively, and at each subsequent setting, epoch numbers were halved.
The results of this simulation are shown in Table 5, which reports the mean and standard deviation (in parentheses) over 10 repetitions of acc.err in (30), the out of sample prediction errors, and the runtime as measured internally via the user time obtained by the R function system.time(). The advantage of NN emerges in Table 5. As both n and p grow, MAVE is no longer computable in realistic time. For example, for n = 64000, p = 253, one calculation for MAVE takes about 12 hours to complete. Hence, we report only one value for acc.err and prediction error. In contrast, NN takes about 9 minutes to complete one run for the same setting and about 28 minutes to complete one run for n = 256000, p = 506. For n = 1000, 4000, 16000, and p = 32, 63, 126, NN − SDR exhibits slightly higher values of estimation error and lower values of out-of-sample prediction error than MAVE.
The mean runtimes of the two methods are plotted against the sample size in Figure 3. We see that the runtime for MAVE explodes to exceed 12 hours only for one dataset at sample size 64000. On the other hand, NN computes in reasonable time.
Thus, NN − SDR is the only forward model based SDR method that is applicable to truly large data while obtaining small estimation and out-of-sample prediction errors. Moreover, for smaller data sets, both in terms of n and p, it maintains competitive performance.

Data Analysis
We analyze three data sets. The first in Section 9.1 is of relatively small sample size (n = 506) and number of predictors (p = 12), the second in Section 9.2 is of large n (21, 613) and small p (16), and the third in Section 9.3 is of very large n (382, 168) and small to medium p = 40.

Boston Housing
In this section we apply the refined NN estimator on the Boston Housing data and compare its performance with the other two mean subspace SDR methods, MAVE and CVE. This data set has been extensively used as a benchmark for assessing regression methods [see, for example, [22]], and is available in the R-package mlbench. The data comprise of 506 instances of 14 variables from the 1970 Boston census, 13 of which are continuous. The binary variable chas, indexing proximity to the Charles river, is omitted from the analysis since all three methods operate under the assumption of continuous predictors. The target variable is the median value of owner-occupied homes, medv, in $1, 000. The 12 predictors are crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft), indus (proportion of non-retail business acres per town), nox (nitric oxides concentration (parts per 10 million)), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted distances to five Boston employment centres), rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10, 000), ptratio (pupil-teacher ratio by town), lstat (percentage of lower status of the population), and b stands for 1000(B − 0.63) 2 where B is the proportion of blacks by town.   We set the dimension of the reduction B to two; i.e., k = 2, for all three methods and compute prediction errors using squared error loss and leave-one-out cross validation. The NN with one layer and 512 neurons is fitted on the n − 1 training data and compute the predicted value for the left out data point. Both CVE and MAVE were applied to the standardized training data. The mean and standard deviation (in parentheses) of the 506 prediction errors are displayed in Table 6. The CVE method results in the smallest prediction error followed by NN − SDR, which, on the other hand, has the smallest standard error. MAVE is the least accurate. The analysis for k = 1 yielded similar results. In this example of small n-small p, nonparametric methods are expected to do well, which is what we observe for CVE followed by MAVE. Nevertheless, the performance of the large sample NN − SDR method is roughly on par with both.

KC Housing
Further, we use kc_house_data set in the R package MAVE to compare NN − SDR estimation with MAVE. The data set contains 21613 observations on 20 variables. The target variable is price, the price of a sold house. We use 16 predictors after omitting id, date, and zip code: bedrooms (number of bedrooms), bathrooms (number of bathrooms), sqft_living (square footage of the living room(, sqrt_log (square footage of the log), floors (total floors in the house), waterfront (whether the house has a view a waterfront(1: yes, 0: not)), view (unknown), condtion (condition of the house), grade (unknown), sqft_above (square footage of house apart from basement), sqft_basement (square footage of the basement), yr_built (built year), yr_renovated (year when the house was renovated), lat (latitude coordinate), long (longitude coordinate), sqft_living15 (living room area in 2015(implies some renovations)), sqrt_lot15 (lot area in 2015(implies some renovations)).
We perform 10-fold cross-validation in order to obtain an unbiased estimate of the out of sample prediction error. We set k = 1 and report the average fraction of the mean squared prediction error divided by the variance of the response on the test set, as well as its standard error, in Table 7. Our NN − SDR estimator has out of sample mean squared error that is about half the variance of the response on the test set, whereas MAVE's is less than 2 percent lower than the variance of the response. This means that the NN − SDR regression explains roughly half of the total variance in the response whereas MAVE hardly explains any. Further, even though the MAVE reduction is estimated in roughly the same time as NN − SDR, in 6 out of the 10 folds the predict function for MAVE produces an error. We also report the 10-fold cross-validated prediction error for CVE, which yields the best result as it explains more than 70% of the total variance in the response but could not be computed, in its current implementation, on a personal computer. 5 The coefficients of the reductions are given in Table 8. NN − SDR extracts information from all variables as it places non-zero weights of varying size on all. MAVE, on the other hand, selects waterfront and the co-linear sqft_living, sqft_above, sqft_basement (sqft_living = sqft_above + sqft_basement) and drops all other variables. Moreover, it allocates the same weight to the collinear variables with opposite signs, effectively discounting all three and ultimately declaring only waterfront relevant.
These results indicate that MAVE breaks down in the analysis of this data set. Since sqft_living = sqft_above + sqft_basement, we dropped sqft_basement to investigate the effect of collinearity. In Figure 4, we plot the response versus the MAVE reduction computed on all predictors in the left panel, versus the MAVE reduction without sqft_basement and versus the NN − SDR reduction in the right panel. The reduced predictors are strikingly different. The NN − SDR reduction is smooth and captures a clear nonlinear heteroskedastic relationship with price. The plot in the left panel captures the failure of MAVE to extract the predictive information in the predictors, as no apparent pattern emerges. Moreover, the data are arbitrarily split in the groups defined by the binary waterfront variable. Once the collinearity is removed, MAVE captures the relationship between Y and X but nevertheless it again splits the data into two new arbitrary classes for the renovated and non-renovated houses. This variable takes either value 0 (not renovated) or the renovation year that ranges between 1934 and 2015. The black points in the middle panel correspond to 0 and red to the period 1934-2015.
We further draw attention to the semblance of the data clouds across the two categories in the middle panel and the NN − SDR reduction in the right panel. Both MAVE and NN − SDR discover the same pattern, with the correlation  In Table 8, we also provide the coefficients of the last two eigenvectors, corresponding to the two smallest eigenvalues in decreasing order, of the sample covariance matrix of the predictors. The next to last places most of the weight on waterfront and the last on sqft_living and sqft_above, sqft_basement. Moreover, the vector of coefficients of the MAVE reduction based on all predictors in the first column seems to be the sum of the last and the down-weighted second to last eigenvectors of the sample covariance matrix of X. This relates to the fact that the sample covariance matrix of X is singular of rank 16 = p − 1. Thus, the last eigenvector dominates all others and largely agrees with the MAVE reduction coefficients. We investigate the effect of collinearity on MAVE and CVE in Section 9.2.1. Table 8: Estimated linear reductions for the kc_house_data data set. Z are the remaining 16 predictors when dropping sqft_basement from X. The last column v 17 (X), v 16 (X) are the PCA coefficients corresponding to the last two smallest eigenvalue of Var(X) for un-scaled X.

The case of singular Σ x
We consider the effect of collinear predictors on the sufficient dimension reduction techniques MAVE, CVE, and NN − SDR. We assume that Σ x = Var(X) is singular and show that, in this case, the mean subspace is not uniquely identifiable.
Let p = 10 and Y = (B X) 2 + 0.5ε, where ε ∼ N (0, 1) is independent from X and B = e 4 the fourth standard basis vector. We draw 100 random samples (Y i , X i ) i=1,...,n of size n = 100 from this model and calculate the MAVE, CVE and NN 512 estimators of B. The median, mean and standard deviation of the estimation errors for the subspace in (30) are reported in Table 9.  In this example, in particular, MAVE seems to focus solely on estimating U instead of B. This does not hold in general. We offer an explanation by setting c = αc in (31) for a scalar α. Following the rationale below (31),B = B + αU is a reduction for any α. Since MAVE, CVE and NN − SDR work withB ∈ S(p, k), α determines the weight placed on U relative to B. For large α, U dominates the reductionB and MAVE fails to identify the mean subspace. In contrast, CVE and NN − SDR remains robust in its ability to accurately estimate the mean subspace.
We conjecture that MAVE's vulnerability is numerical in nature and relates to the implementation algorithm in the MAVE package. We also conjecture that CVE and NN − SDR are more robust than MAVE.

Beijing Air Quality Data
The Beijing Multi-Site Air-Quality Data [42] available at the UCI machine learning repository 6 includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites in Beijing. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. After removing missing data entries, the data contains 382168 complete measurements.
The target P M 2.5 [ug/m 3 ] is the concentration of particle matter in the air with less than 2.5 micrometres in diameter.
We included the categorical variables to demonstrate that NN − SDR can handle dummy variables even though it is not designed for this. Given the large sample size we used 2 epochs for the first stage and 3 for the second refinement stage of the training. Due to the large sample size MAVE and CVE are infeasible to compute while NN-SDR executes in less than 3 minutes per fold run on the CPU of personal computer. As a comparison, we included the linear model (lm) as well as the Multivariate Adaptive Regression Splines (mars, [17,20]), as both can be applied to large regressions, provided p < n and are computationally efficient.
In Table 10, the mean of the 10-fold cross validation prediction errors is reported. The linear model exhibits the worst performance, as expected. NN − SDR improves upon the linear model for all choices of dimension we examined, beats mars for k = 3, 4 and obtains the minimum MSPE for k = 4. Thus, not only is NN − SDR the best method with respect to predictive accuracy, but it also provides an assessment of the true structural dimension of the relationship (k = 4) between the response and the predictors. This confirms the improved performance of mars, a multivariate nonparametric fitting method, over the linear model and points to the nonlinearity of the relationship.

Discussion
We introduced the novel NN − SDR estimator for the mean subspace. NN − SDR combines a sufficient dimension reduction approach with neural nets to first reduce the predictor vector and then estimate its functional relationship with the response and predict it. The estimator is shown to be competitive with state-of-the-art SDR approaches, such as MAVE and CVE, in simulations and data applications. Moreover, it is the only one among them that is computationally feasible for big data, where both p and n are large, even on personal computers.
In view of our simulation results, the NN − SDR estimator appears to be consistent. Nevertheless, we could not resolve the theoretical challenges involving neural nets to formally prove consistency, as this would require showing the consistency of neural net estimates, which remains an open problem.
A particularly attractive feature of NN − SDR, in contrast to MAVE and CVE, is that it is naturally configured for online training if new data become available due to the stochastic gradient descent algorithm described in Section 5. Specifically, the algorithm effortlessly updates the parameters of NN − SDR with further gradient steps using the new data.