Slice inverse regression with score functions

We consider non-linear regression problems where we assume that the response depends non-linearly on a linear projection of the covariates. We propose score function extensions to sliced inverse regression problems, both for the first-order and second-order score functions. We show that they provably improve estimation in the population case over the non-sliced versions, and we study finite sample estimators and their consistency given the exact score functions. We also propose to learn the score function as well, either in two steps, i.e., first learning the score function and then learning the effective dimension reduction space, or directly, by solving a convex optimization problem regularized by the nuclear norm. We illustrate our results on a series of experiments.


Introduction
Non-linear regression and related problems such as non-linear classification are core tasks in machine learning and statistics. In this paper, we consider a random vector x ∈ R^d, a random response y ∈ R, and a regression model of the form y = f(x) + ε, which we want to estimate from n independent and identically distributed (i.i.d.) observations (x_i, y_i), i = 1, . . ., n. Our goal is to estimate the function f from these data. A traditional key difficulty in this general regression problem is the lack of parametric assumptions regarding the functional form of f, leading to a problem of non-parametric regression. This is often tackled by searching implicitly or explicitly for a function f within an infinite-dimensional vector space. While several techniques exist to estimate such a function, e.g., kernel methods, local averaging, or neural networks (see, e.g., [14]), they all suffer from the curse of dimensionality, that is, the rate of convergence of the estimated function to the true function (in any relevant performance measure) can only decrease as a small power of n, and this power cannot be larger than a constant divided by d. In other words, the number n of observations required for any level of precision is exponential in the dimension.
A classical way of bypassing the curse of dimensionality is to make extra assumptions regarding the function to estimate, such as the dependence on an unknown low-dimensional subspace, as done by projection pursuit or neural networks. More precisely, throughout the paper, we make the following assumption:

(A1) For all x ∈ R^d, we have f(x) = g(w^⊤x) for a certain matrix w ∈ R^{d×k} and a function g : R^k → R. Moreover, y = f(x) + ε with ε independent of x, with zero mean and finite variance.
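Assumption (A1) is easy to simulate. The sketch below generates data from a multiple-index model with k = 2; the particular link function g, the Gaussian inputs, and the noise level are illustrative choices, not prescribed by the assumption.

```python
import numpy as np

def sample_multiple_index(n, d=10, k=2, seed=0):
    """Draw (x_i, y_i) from a multiple-index model y = g(w^T x) + eps.

    The link g and the distribution of x below are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    w, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal e.d.r. basis
    x = rng.standard_normal((n, d))
    z = x @ w                                         # projections w^T x
    g = z[:, 0] + np.sin(z[:, 1])                     # example link function g
    y = g + 0.1 * rng.standard_normal(n)              # additive noise eps
    return x, y, w

x, y, w = sample_multiple_index(1000)
```

Only the subspace spanned by the columns of w matters, which is why the QR factorization is used to fix an orthonormal representative.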
The subspace of R^d spanned by the k columns w_1, . . ., w_k ∈ R^d of w has dimension less than or equal to k, and is often called the effective dimension reduction (e.d.r.) space. The model above is often referred to as a multiple-index model [38]. We will always make the assumption that the e.d.r. space has exactly dimension k, that is, the matrix w has rank k (which implies that k ≤ d).
Given w, estimating g may be done by any technique in non-parametric regression, with a convergence rate which requires a number of observations n exponential in k, with methods based on local averaging (e.g., Nadaraya-Watson estimators) or on least-squares regression [see, e.g., 14, 29]. Given the non-linear function g, estimating w is computationally difficult, because the resulting optimization problem may not be convex and thus may exhibit several local minima. The difficulty is often even stronger, since one often wants to estimate both the function g and the matrix w.
imsart-ejs ver. 2014/10/16 file: "Sliced Inverse regression with score functions".tex date: April 18, 2018
Our main goal in this paper is to estimate the matrix w, with the hope of obtaining a convergence rate where the inverse power of n is proportional to k and not d. Note that the matrix w is only identifiable up to a (right) linear transform, since only the subspace spanned by its columns is characteristic.
Method of moments vs. optimization. This multiple-index problem, with the goal of estimating w only, can be tackled from two points of view: (a) the method of moments, where certain moments are built so that the effect of the unknown function g disappears [6, 23], a method that we follow here and describe in more detail below. These methods rely heavily on the model being correct, and in the instances that we consider here lead to provably polynomial-time algorithms (most often linear in the number of observations, since only moments are computed). In contrast, (b) optimization-based methods use implicitly or explicitly non-parametric estimation, e.g., using local averaging methods to design an objective function that can be minimized to obtain an estimate of w [35, 13]. The objective function is usually non-convex and gradient descent techniques are used to obtain a local minimum. While these procedures offer no theoretical guarantees due to the potential unknown difficulty of the optimization problem, they often work well in practice, as we have observed in our experiments.
In this paper, we consider and improve a specific instantiation of the method of moments, which partially circumvents the difficulty of joint estimation by estimating w directly without the knowledge of g. The starting point for this method is the work by Brillinger [6], which shows, as a simple consequence of Stein's lemma [26], that if the distribution of x is Gaussian, (A1) is satisfied with k = 1 (e.d.r. space of dimension one, i.e., a single-index model), and the input data have zero mean and identity covariance matrix, then the expectation E(yx) is proportional to w. Thus, a certain expectation, which can easily be approximated given i.i.d. observations, simultaneously eliminates g and reveals w.
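Brillinger's observation is easy to verify numerically. The sketch below uses Gaussian inputs with identity covariance and an illustrative cubic link (our choice, not Brillinger's), and checks that the empirical moment E(yx) aligns with w:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 200_000
w = rng.standard_normal(d)
w /= np.linalg.norm(w)

x = rng.standard_normal((n, d))                    # Gaussian, zero mean, identity covariance
y = (x @ w) ** 3 + 0.1 * rng.standard_normal(n)    # single-index model, monotone link

m = x.T @ y / n                                    # empirical E(y x)
cos = abs(m @ w) / np.linalg.norm(m)               # alignment with the true direction
```

With this link, Stein's lemma gives E(yx) = E(g'(w^⊤x)) w, so `cos` should be close to one for large n.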
While the result above provides a very simple algorithm to recover w, it has several strong limitations: (a) it only applies to normally distributed data x, or more generally to elliptically symmetric distributions [7]; (b) it only applies to k = 1; and (c) in many situations with symmetries, the proportionality constant is equal to zero and thus we cannot recover the vector w. This has led to several extensions in the statistical literature, which we now present.
Using score functions. The use of Stein's lemma with a Gaussian random variable can be directly extended using the score function S_1(x), defined as the negative gradient of the log-density, that is, S_1(x) = −∇ log p(x) = −∇p(x)/p(x), which leads to the following assumption:

(A2) The distribution of x has a strictly positive density p(x) with respect to the Lebesgue measure, which is differentiable and such that p(x) → 0 when ‖x‖ → +∞.

We will need the score to be sub-Gaussian to obtain consistency results. Given Assumption (A2), [28] showed, as a simple consequence of integration by parts, that if Assumption (A1) is satisfied with k = 1, then E(yS_1(x)) is proportional to w, for all differentiable functions g, with a proportionality constant that depends on w and ∇g. This leads to the "average derivative method" (ADE) and thus replaces the Gaussian assumption by the existence of a differentiable log-density, which is much weaker. This however does not remove the restriction k = 1, which can be lifted in two ways that we now present.
Sliced inverse regression. Given a normalized Gaussian distribution for x (or any elliptically symmetric distribution), if (A1) is satisfied, then, almost surely in y, the conditional expectation E(x|y) belongs to the e.d.r. subspace. Given several distinct values of y, the vectors E(x|y), or any estimates thereof, will hopefully span the entire e.d.r. space, so that we can recover the entire matrix w, leading to "slice inverse regression" (SIR), originally proposed by Li and Duan [23], Duan and Li [12] and Li [21]. This allows estimation with k > 1, but is still restricted to Gaussian data. In this paper, we propose to extend SIR by the use of score functions to go beyond elliptically symmetric distributions, and we show that the new method combining SIR and score functions is formally better than the plain ADE method.
From first-order to second-order moments. Another line of extension of the simple method of Brillinger [6] is to consider higher-order moments, namely the matrix E(yxx^⊤) ∈ R^{d×d}, which, for normally distributed input data x and if (A1) is satisfied, is proportional (in a particular form to be described in Section 2.2) to the Hessian of the function g, leading to the method of "principal Hessian directions" (PHD) from Li [22]. Again, k > 1 is allowed (more than a single projection), but this is still limited to elliptically symmetric data. Janzamin, Sedghi and Anandkumar [19] proposed to use second-order score functions to go beyond this assumption. In order to define this new method, we consider the following assumption:

(A3) The distribution of x has a strictly positive density p(x) with respect to the Lebesgue measure, which is twice differentiable and such that p(x) → 0 and ∇p(x) → 0 when ‖x‖ → +∞.
Given (A1) and (A3), one can show [19] that E(yS_2(x)) is proportional to the Hessian of the function g, where S_2(x) = ∇²p(x)/p(x) = S_1(x)S_1(x)^⊤ − ∇S_1(x) is the second-order score function, thus extending the Gaussian situation above, where S_1 was a linear function and S_2(x) was, up to linear terms, proportional to xx^⊤. In this paper, we propose to extend the method above to allow an SIR estimator for the second-order score functions, where we condition on y, and we show that the new method is formally better than the plain method of [19].
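The contrast between the two orders can be sketched numerically for standard Gaussian inputs, where S_2(x) = xx^⊤ − I: with an even link g(t) = t² (an illustrative choice), the first-order moment E(yS_1(x)) vanishes while E(yS_2(x)) reveals the direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 200_000
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
x = rng.standard_normal((n, d))                    # standard Gaussian: S2(x) = x x^T - I
y = (x @ w) ** 2 + 0.1 * rng.standard_normal(n)    # even link: ADE fails

# First-order moment E(y S1(x)) = E(y x) vanishes for this even link ...
m1 = x.T @ y / n

# ... but E(y S2(x)) = E(y x x^T) - E(y) I reveals the direction.
M2 = (x * y[:, None]).T @ x / n - np.mean(y) * np.eye(d)
top = np.linalg.eigh(M2)[1][:, -1]                 # eigenvector of the largest eigenvalue
```

Here M2 concentrates around 2ww^⊤, so its top eigenvector is aligned with w while m1 is near zero.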
Learning score functions through score matching. Relying on score functions immediately raises the following question: is estimating the score function (when not available) really simpler than our original problem of non-parametric regression? Fortunately, a recent line of work [1] has considered this exact problem and formulated the task of density estimation directly in terms of score functions, which is particularly useful in our context. We may then use the data, first to learn the score, and then to use the novel score-based moments to estimate w. We will also consider a direct approach that jointly estimates the score function and the e.d.r. subspace, by regularizing with a sparsity-inducing norm.
Fighting the curse of dimensionality. Learning the score function is still a non-parametric problem, with the associated curse of dimensionality. If we first learn the score function (through score matching) and then learn the matrix w, we do not escape that curse, while our direct approach is empirically more robust.
Note that Hristache, Juditsky and Spokoiny [16] suggested iterative improvements of the ADE method, using elliptic windows which shrink in the directions of the columns of w, stretch in all other directions, and tend to flat layers orthogonal to w. Dalalyan, Juditsky and Spokoiny [11] generalized the algorithm to multi-index models and proved √n-consistency of the proposed procedure in the case when the structural dimension is not larger than 4, and a weaker dependence for d > 4. In particular, they provably avoid the curse of dimensionality. Such extensions are outside the scope of this paper.

Contributions.
In this paper, we make the following contributions:
• We propose score function extensions to sliced inverse regression problems, both for the first-order and second-order score functions. We consider the infinite sample case in Section 2 and the finite sample case in Section 3. They provably improve estimation in the population case over the non-sliced versions, while we study in Section 3 finite sample estimators and their consistency given the exact score functions.
• We propose in Section 4 to learn the score function as well, in two steps, i.e., first learning the score function and then learning the e.d.r. space parameterized by w, or directly, by solving a convex optimization problem regularized by the nuclear norm.
• We illustrate our results in Section 5 on a series of experiments.

Estimation with infinite sample size
In this section, we focus on the population situation, where we can compute expectations and conditional expectations exactly. We consider finite sample estimators with known score functions in Section 3, with consistency results in Section 3.3, and with learned score functions in Section 4.

SADE: Sliced average derivative estimation
Before presenting our new moment which will lead to the novel SADE method, we consider the non-sliced method, which is based on Assumptions (A1) and (A2) and score functions (the methods based on the Gaussian assumption will be derived later as corollaries). The ADE method is based on the following lemma:

Lemma 1 (ADE moment [28]). Assume (A1), (A2), the differentiability of g and the existence of the expectation E(∇g(w^⊤x)). Then E(S_1(x)y) is in the e.d.r. subspace.
Proof. Since y = f(x) + ε, and ε is independent of x with zero mean, we have
E(S_1(x)y) = E(S_1(x)f(x)) = −∫ ∇p(x) f(x) dx = ∫ p(x) ∇f(x) dx = E(∇f(x)) = w E(∇g(w^⊤x)),
by integration by parts, which leads to the desired result. Note that in the integration by parts above, the decay of p(x) to zero for ‖x‖ → +∞ is needed.
The ADE moment above only provides a single vector in the e.d.r. subspace, which can only lead to recovery for k = 1, and only if E(g′(w^⊤x)) ≠ 0, which may not be satisfied, e.g., if x has a symmetric distribution and g is even.
We can now present our first new lemma, the proof of which relies on similar arguments as for SIR [23], extended to score functions. Note that we do not require the differentiability of the function g.

Lemma 2 (SADE moment). Assume (A1) and (A2). Then E(S_1(x)|y) is in the e.d.r. subspace almost surely (in y).
Proof. We consider any vector b ∈ R^d in the orthogonal complement of the subspace Span{w_1, . . ., w_k}. We need to show that b^⊤E(S_1(x)|y) = 0 with probability 1. We have, by the law of total expectation,
b^⊤E(S_1(x)|y) = E( E( b^⊤S_1(x) | w_1^⊤x, . . ., w_k^⊤x, y ) | y ).
Because of Assumption (A1), we have y = g(w^⊤x) + ε with ε independent of x, and thus E(b^⊤S_1(x) | w^⊤x, y) = E(b^⊤S_1(x) | w^⊤x). We now prove that almost surely E(b^⊤S_1(x) | w^⊤x) = 0, which is sufficient to prove Lemma 2. We consider the linear transformation of coordinates x̃ = w̃^⊤x ∈ R^d, where w̃ = (w_1, . . ., w_k, w_{k+1}, . . ., w_d) is a square matrix with full rank, obtained by adding a basis of the subspace orthogonal to the span of the k columns of w. Then, if p̃ is the density of x̃, we have p̃(x̃) = |det w̃|^{−1} p(w̃^{−⊤}x̃), and the score of x̃ is S̃_1(x̃) = w̃^{−1}S_1(x); since w̃^⊤b has its first k components equal to zero, b^⊤S_1(x) only involves the last d − k components of S̃_1(x̃). It is thus sufficient to show that, for all x̃_1, . . ., x̃_k and all j > k, the conditional expectation of [S̃_1(x̃)]_j vanishes, which follows from ∫ ∂_{x̃_j} p̃(x̃) dx̃_j = 0, a consequence of Assumption (A2). This leads to the desired result.
The key differences are now that:
• Unlike ADE, by conditioning on different values of y, we have access to several vectors E(S_1(x)|y) ∈ R^d.
• Unlike SIR, SADE does not require the linearity condition from [21] anymore, and can be used with any smooth enough probability density.
In the population case, we will consider the following matrix (using the fact that E(S_1(x)) = 0):
V_{1,cov} = E( E(S_1(x)|y) E(S_1(x)|y)^⊤ ),
which we will also denote E( E(S_1(x)|y)^{⊗2} ), where for any matrix a, a^{⊗2} denotes aa^⊤. The matrix above is positive semi-definite, and its column space is included in the e.d.r. space. If it has rank k, then we can exactly recover the entire subspace by an eigenvalue decomposition. When k = 1, which is the only case where ADE may be used, the following proposition shows that if ADE allows recovery of w, then so does SADE.
We will also consider the matrix
Ṽ_{1,cov} = E( y² E(S_1(x)|y)^{⊗2} )
(note the presence of the extra term y²), because of its direct link with the non-sliced version. Note that we make the weak assumption of existence of the matrices V_{1,cov} and Ṽ_{1,cov}, which is satisfied for the majority of problems.

Proposition 1. Assume (A1) and (A2), with k = 1, as well as differentiability of g and existence of the expectation E(g′(w^⊤x)). The vector w may be recovered from the ADE moment (up to scale) if and only if E(g′(w^⊤x)) ≠ 0. If this condition is satisfied, then SADE also recovers w up to scale (i.e., V_{1,cov} and Ṽ_{1,cov} are different from zero).

Proof. The first statement is a consequence of the proof of Lemma 1. If SADE fails, that is, for almost all y, E(S_1(x)|y) = 0, then E(S_1(x)y|y) = 0, which implies that E(S_1(x)y) = 0 and thus ADE fails. Moreover, we have, using operator convexity [33],
Ṽ_{1,cov} = E( E(yS_1(x)|y)^{⊗2} ) ⪰ ( E(yS_1(x)) )^{⊗2},
showing that the new moment dominates the ADE moment, which provides an alternative proof that the rank of Ṽ_{1,cov} is at least one if E(yS_1(x)) ≠ 0.
Elliptically symmetric distributions. If x is normally distributed with mean vector µ and covariance matrix Σ, then we have S_1(x) = Σ^{−1}(x − µ), so that E(S_1(x)|y) = Σ^{−1}(E(x|y) − µ), and we recover the result from [23]. Note that the lemma then extends to all elliptical distributions with density of the form ϕ(½(x−µ)^⊤Σ^{−1}(x−µ)), for a certain function ϕ : R_+ → R. See [23].
Failure modes. In some cases, slice inverse regression does not span the entire e.d.r. space, because the inverse regression curve E(S_1(x)|y) is degenerate. For example, this can occur if k = 1, y = h(w_1^⊤x) + ε, h is an even function and w_1^⊤x has a distribution symmetric around 0. Then E(S_1(x)|y) ≡ 0, and it is thus a poor estimate of the desired e.d.r. directions [10].
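This failure mode is easy to reproduce. In the sketch below (Gaussian inputs, so S_1(x) = x, with an illustrative even link), all slice means, and hence a slice-mean version of the SADE matrix, are close to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, H = 5, 50_000, 10
w = np.eye(d)[:, 0]                      # e.d.r. direction e_1, for clarity
x = rng.standard_normal((n, d))          # Gaussian: S1(x) = x
y = (x @ w) ** 2 + 0.1 * rng.standard_normal(n)   # even link, symmetric input

# Equal-count slices on y; the inverse regression curve E(S1(x)|y)
# is identically zero here, so every slice mean is small.
order = np.argsort(y)
V1 = np.zeros((d, d))
for h in np.array_split(order, H):
    mh = x[h].mean(axis=0)
    V1 += (len(h) / n) * np.outer(mh, mh)

top_eig = np.linalg.eigvalsh(V1)[-1]     # should be near zero: SADE degenerates
```

No eigenvector of V1 carries information about w here, which is exactly the degeneracy described above.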
A second drawback of SIR occurs for classification tasks, for example, y ∈ {0, 1}. In this case, we have only two slices (i.e., possible values of y), and SIR can recover only one direction in the e.d.r. space [9].
Li [22] suggested another way to estimate the e.d.r. space which can handle such symmetric cases: principal Hessian directions (PHD). However, this method uses the normality of the vector x. As in the SIR case, we can extend this method using score functions so that it applies to any distribution; we refer to the extension as SPHD, which we now present.

SPHD: Sliced principal Hessian directions
Before presenting our new moment which will lead to the SPHD method, we consider the non-sliced method, which is based on Assumptions (A1) and (A3) and score functions (the method based on the Gaussian assumption will be derived later as corollaries). The method of [19], which we refer to as "PHD+", is based on the following lemma (we reproduce the proof for readability):

Lemma 3 (second-order score moment (PHD+) [19]). Assume (A1), (A3), twice differentiability of the function g and existence of the expectation E(∇²g(w^⊤x)). Then E(S_2(x)y) has a column space included in the e.d.r. subspace.
Proof. Since y = f(x) + ε, and ε is independent of x, we have
E(S_2(x)y) = E(S_2(x)f(x)) = ∫ ∇²p(x) f(x) dx = ∫ p(x) ∇²f(x) dx = E(∇²f(x)),
using integration by parts twice and the decay of p(x) and ∇p(x) for ‖x‖ → ∞. This leads to the desired result, since ∇²f(x) = w∇²g(w^⊤x)w^⊤. This was proved by Li [22] for normal distributions.
Failure modes. The method does not work properly if rank(E(∇²g(w^⊤x))) < k. For example, if g is a linear function, then E(∇²g(w^⊤x)) ≡ 0 and the estimated e.d.r. space is degenerate. Moreover, the method fails in symmetric cases: for example, if g is an odd function with respect to any variable and p(x) is an even function, then rank(E(∇²g(w^⊤x))) < k.
We can now present our second new lemma, the proof of which relies on similar arguments as for PHD [22], extended to score functions (again, no differentiability is assumed on g):

Lemma 4 (SPHD moment). Assume (A1) and (A3). Then E(S_2(x)|y) has a column space within the e.d.r. subspace almost surely.
Proof. We consider any a ∈ R^d and any b ∈ R^d orthogonal to the e.d.r. subspace, and prove that a^⊤E(S_2(x)|y)b = 0 almost surely. We use the same change of coordinates as in the proof of Lemma 2, x̃ = w̃^⊤x ∈ R^d, with density p̃(x̃) = |det w̃|^{−1} p(w̃^{−⊤}x̃); the second-order score of x̃ is then S̃_2(x̃) = w̃^{−1}S_2(x)w̃^{−⊤}. Since b̃ = w̃^⊤b = (0, . . ., 0, b̃_{k+1}, . . ., b̃_d), the quantity a^⊤S_2(x)b only involves entries [S̃_2(x̃)]_{ij} with j > k, and it suffices to show that E( [S̃_2(x̃)]_{ij} | x̃_1, . . ., x̃_k ) = 0 for all x̃_1, . . ., x̃_k and all j > k. This follows from integrating ∂²_{x̃_i x̃_j} p̃(x̃) over x̃_j, which gives zero for j ∈ {k+1, . . ., d} by the decay of ∇p̃ from Assumption (A3). This leads to the desired result.
In order to combine several values of y, we will estimate the matrix
V_2 = E( [E(S_2(x)|y)]² ),
which is the expectation with respect to y of the square (in the matrix multiplication sense) of the conditional expectation from Lemma 4, as well as
Ṽ_2 = E( y² [E(S_2(x)|y)]² ).

Proposition 2. Assume (A1) and (A3). The column space of E(yS_2(x)) is included in the e.d.r. subspace, and rank[Ṽ_2] ≥ rank( E(yS_2(x)) ).

Proof. The first statement is a consequence of the proof of Lemma 3. Moreover, using the Löwner-Heinz theorem about operator convexity [33],
Ṽ_2 = E( [E(yS_2(x)|y)]² ) ⪰ ( E(yS_2(x)) )²,
showing that the new moment dominates the PHD moment, thus implying that rank[Ṽ_2] ≥ rank( E(yS_2(x)) ).
Therefore, if rank( E(yS_2(x)) ) = k, then rank[Ṽ_2] = k (note that there is no such simple proof for V_2).
Elliptically symmetric distributions. When x is a standard Gaussian random variable, then S_2(x) = −I + xx^⊤, and thus E(S_2(x)|y) = E(xx^⊤|y) − I, and we recover the sliced average variance estimation (SAVE) method of Cook and Weisberg [8]. However, our method applies to all distributions (with known score functions).

Relationship between first and second order methods
All considered methods have their own failure modes; they are compared in Table 1.
Our conditions rely on certain expected covariance matrices having full possible rank. When the function g is selected randomly among all potential functions from R^k to R, rank deficiencies typically do not occur, and it would be interesting to show that they indeed appear with probability zero for certain random function models.
Table 1: Comparison of different methods using score functions.

Estimation from finite sample
In this section, we consider finite sample estimators for the moments we have defined in Section 2. Since our extensions are combinations of existing techniques (using score functions and slicing), our finite-sample estimators naturally rely on existing work [17, 39].
In this section, we assume that the score function is known.We consider learning the score function in Section 4.

Estimator and algorithm for SADE
Our goal is to provide an estimator for V_{1,cov} = E( E(S_1(x)|y)^{⊗2} ) = Cov(S_1(x)) − E( Cov(S_1(x)|y) ). We use the natural consistent estimator (1/n) Σ_{i=1}^n S_1(x_i)S_1(x_i)^⊤ of Cov(S_1(x)). In order to obtain an estimator of E( Cov(S_1(x)|y) ), we consider slicing the real numbers into H different slices I_1, . . ., I_H, which are contiguous intervals that form a partition of R (or of the range of all y_i, i = 1, . . ., n). We then compute an estimator of the conditional expectation (S_1)_h = E(S_1(x)|y ∈ I_h) with empirical averages: denoting p̂_h the empirical proportion of the y_i, i = 1, . . ., n, that fall in the slice I_h (which is assumed to be strictly positive), we estimate
(Ŝ_1)_h = (1/(n p̂_h)) Σ_{y_i ∈ I_h} S_1(x_i).
We then estimate Cov(S_1(x)|y ∈ I_h) by
(Ŝ_1)_{cov,h} = (1/(n p̂_h − 1)) Σ_{y_i ∈ I_h} ( S_1(x_i) − (Ŝ_1)_h )^{⊗2}.
Note that it is important here to normalize the covariance computation by (n p̂_h − 1)^{−1} (the usual unbiased normalization of the variance) and not (n p̂_h)^{−1}, to allow consistent estimation even when the number of elements per slice is small (e.g., equal to 2).
We finally use the following estimator of V_{1,cov}:
V̂_{1,cov} = (1/n) Σ_{i=1}^n S_1(x_i)S_1(x_i)^⊤ − Σ_{h=1}^H p̂_h (Ŝ_1)_{cov,h}.
The final SADE algorithm is thus the following:
- Divide the range of y_1, . . ., y_n into H slices I_1, . . ., I_H. Let p̂_h > 0 be the proportion of the y_i, i = 1, . . ., n, that fall in slice I_h.
- For each slice I_h, compute the sample mean (Ŝ_1)_h and covariance (Ŝ_1)_{cov,h}.
- Compute the estimator V̂_{1,cov} above.
- Find the k largest eigenvalues of V̂_{1,cov} and let ŵ_1, . . ., ŵ_k be eigenvectors in R^d corresponding to these eigenvalues.
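The steps above can be sketched as follows, with equal-count slices and the known score passed as a callable; the single-index data-generating model at the bottom is an illustrative choice.

```python
import numpy as np

def sade(x, y, score, H, k):
    """Minimal sketch of the SADE estimator with a known first-order score.

    Uses equal-count slices and the within-slice covariance correction,
    V1 = Cov(S1(x)) - sum_h p_h Cov_h, with the unbiased 1/(n_h - 1) scaling.
    """
    n, d = x.shape
    s = score(x)                                   # S1(x_i), shape (n, d)
    V1 = s.T @ s / n                               # Cov(S1(x)) since E S1(x) = 0
    order = np.argsort(y)
    for idx in np.array_split(order, H):           # equal-count slices
        ph = len(idx) / n
        sh = s[idx] - s[idx].mean(axis=0)
        V1 -= ph * (sh.T @ sh) / (len(idx) - 1)    # subtract p_h * Cov_h
    vals, vecs = np.linalg.eigh(V1)
    return vecs[:, -k:]                            # eigenvectors of the k largest eigenvalues

# Gaussian inputs, so S1(x) = x; single-index example with a cubic link.
rng = np.random.default_rng(0)
d, n = 8, 100_000
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
x = rng.standard_normal((n, d))
y = (x @ w) ** 3 + 0.1 * rng.standard_normal(n)
w_hat = sade(x, y, lambda z: z, H=20, k=1)[:, 0]
```

For this monotone link, the inverse regression curve varies strongly with y, so the leading eigenvector of V1 aligns with w.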
Choice of slices. There are different ways to choose the slices I_1, . . ., I_H:
- all slices have the same length: we take the maximum and the minimum of y_1, . . ., y_n, and divide the range of y into H equal intervals;
- we can also use the distribution of y to ensure a balanced number of observations per slice, choosing the slice boundaries as empirical quantiles of y so that each slice contains the same number of points (for simplicity, we assume that n is a multiple of H).
Later on, in our experiments, we use the second way, where every slice has c = n/H points, and our consistency result applies to this situation as well. Note that there is a variation of SADE which uses standardized data (as for SIR in [21]).
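The two slicing schemes can be sketched as follows (the toy y values are illustrative); note how equal-width slices can be badly unbalanced while equal-count slices are balanced by construction:

```python
import numpy as np

y = np.array([0.1, 3.2, 0.4, 2.8, 1.0, 0.2, 2.9, 0.5])
H = 4

# Equal-width slices: cut [min(y), max(y)] into H intervals of the same length.
edges = np.linspace(y.min(), y.max(), H + 1)
width_slice = np.clip(np.digitize(y, edges[1:-1]), 0, H - 1)

# Equal-count slices: each slice receives c = n / H consecutive order statistics.
order = np.argsort(y)
count_slice = np.empty_like(order)
for h, idx in enumerate(np.array_split(order, H)):
    count_slice[idx] = h
```

Here the equal-width assignment puts four points in the first slice and none in the third, while every equal-count slice holds exactly two points.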
Computational complexity. The first step of the algorithm requires O(n) elementary operations, the second O(nd²), the third O(Hd²), and the fourth O(kd²) operations. The overall dependence on the dimension d is quadratic, while the dependence on the number of observations is linear in n, as is common in moment-matching methods.
Estimating the number of components. Our estimation method does not depend on k until the last step, where the eigenvectors corresponding to the k largest eigenvalues are selected. A simple heuristic to select k, similar to the selection of the number of components in principal component analysis, is to select the largest k such that the gap between the k-th and (k + 1)-th eigenvalues is large enough. This could be made more formal using the technique of [21] for sliced inverse regression.
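The eigengap heuristic can be sketched as follows; this is a rough analogue of the PCA elbow rule, not the formal procedure of [21]:

```python
import numpy as np

def choose_k(eigenvalues, k_max=None):
    """Heuristic: pick k at the largest gap among the sorted eigenvalues."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]   # descending order
    if k_max is None:
        k_max = len(lam) - 1
    gaps = lam[:k_max] - lam[1:k_max + 1]          # consecutive eigengaps
    return int(np.argmax(gaps)) + 1

# Example spectrum with two dominant directions.
print(choose_k([2.1, 1.8, 0.05, 0.04, 0.01]))      # prints 2
```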

Estimator and algorithm for SPHD
We follow the same approach as for the SADE algorithm above, leading to the following algorithm, which estimates V_2 = E( [E(S_2(x)|y)]² ) and computes its principal eigenvectors. Note that E( E(S_2(x)|y ∈ I_h) ) = E(S_2(x)) = 0.
- Divide the range of y_1, . . ., y_n into H slices I_1, . . ., I_H. Let p̂_h > 0 be the proportion of the y_i, i = 1, . . ., n, that fall in slice I_h.
- For each slice, compute the sample mean (Ŝ_2)_h of S_2(x):
(Ŝ_2)_h = (1/(n p̂_h)) Σ_{y_i ∈ I_h} S_2(x_i).
- Compute the weighted covariance matrix V̂_2 = Σ_{h=1}^H p̂_h (Ŝ_2)_h², find the k largest eigenvalues and let ŵ_1, . . ., ŵ_k be eigenvectors corresponding to these eigenvalues.
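A minimal sketch of this algorithm, with equal-count slices and the known second-order score passed as a callable; the even-link example (an illustrative choice) is exactly the case that defeats first-order slicing but not SPHD:

```python
import numpy as np

def sphd(x, y, score2, H, k):
    """Minimal sketch of the SPHD estimator with a known second-order score.

    Estimates V2 = E[(E(S2(x)|y))^2] with equal-count slices.
    """
    n, d = x.shape
    order = np.argsort(y)
    V2 = np.zeros((d, d))
    for idx in np.array_split(order, H):
        ph = len(idx) / n
        Mh = score2(x[idx]).mean(axis=0)        # per-slice mean of S2(x), a (d, d) matrix
        V2 += ph * (Mh @ Mh)                    # square in the matrix multiplication sense
    vals, vecs = np.linalg.eigh(V2)
    return vecs[:, -k:]

# Standard Gaussian inputs: S2(x) = x x^T - I.
rng = np.random.default_rng(0)
d, n = 6, 100_000
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
x = rng.standard_normal((n, d))
y = (x @ w) ** 2 + 0.1 * rng.standard_normal(n)   # even link

def s2(xb):
    return np.einsum('ni,nj->nij', xb, xb) - np.eye(xb.shape[1])

w_hat = sphd(x, y, s2, H=20, k=1)[:, 0]
```

Each slice mean concentrates around a multiple of ww^⊤, so the top eigenvector of V2 aligns with w even though the link is even.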
Computational complexity. The first step of the algorithm requires O(n) elementary operations, the second O(nd²), the third O(Hd³), and the fourth O(kd²) operations. The overall dependence on the dimension d is cubic, hence the method is slower than SADE (but still linear in the number of observations n).

Consistency for the SADE estimator and algorithm
In this section, we prove the consistency of the SADE moment estimator and the resulting algorithm when the score function is known. Following Hsing and Carroll [17] and Zhu and Ng [39], we can get √n-consistency for the SADE algorithm under very broad assumptions on the problem.
In this section, we focus on the simplest set of assumptions, to pave the way for the analysis with the nuclear norm in future work. The key novelty compared to [17, 39] is a precise non-asymptotic analysis with explicit constants.
We now formulate and prove the main theorem, where ‖·‖_* denotes the nuclear norm, defined as ‖A‖_* = tr(A^⊤A)^{1/2}.

Theorem 1. Under assumptions (L1)-(L4), for any δ < 1/n, we obtain a bound on ‖V̂_{1,cov} − V_{1,cov}‖_* that holds with probability not less than 1 − δ.

The proof of the theorem can be found in Appendix A.2. The leading term is of order 1/√n, with a tail which is sub-Gaussian. We thus get a √n-consistent estimator of V_{1,cov}. The dependency in d could probably be improved, in particular when using slices of sizes c that tend to infinity (as done in [24]).

Learning score functions
All previous methods require knowing the score function of first or second order. In practice, we do not have such information, and we have to learn score functions from sampled data. In this section, we only consider the first-order score function ℓ(x) = S_1(x) = −∇ log p(x).
We first present the score matching approach of [1], and then apply it to our problem.

Score matching to estimate score from data
Given the true score function ℓ(x) = S_1(x) = −∇ log p(x) and some i.i.d. data generated from p(x), score matching aims at estimating the parameters of a model ℓ̂(x) of the score function, by minimizing an empirical quantity aiming to estimate
J(ℓ̂) = ½ E ‖ℓ̂(x) − ℓ(x)‖².
As is, the quantity above leads to consistent estimation (i.e., pushing ℓ̂ close to ℓ), but seems to require the knowledge of the true score ℓ(x). A key insight from [1] is to use integration by parts to get (assuming the integrals exist):
J(ℓ̂) = ½ E ‖ℓ(x)‖² + E( ½ ‖ℓ̂(x)‖² − (∇·ℓ̂)(x) ),
where (∇·ℓ̂)(x) = Σ_{j=1}^d ∂ℓ̂_j(x)/∂x_j denotes the divergence of ℓ̂. The first part of the right-hand side does not depend on ℓ̂, while the two other parts are expectations under p(x) of quantities that only depend on ℓ̂. Thus J(ℓ̂) can be well approximated, up to a constant, by:
R̂_score(ℓ̂) = (1/n) Σ_{i=1}^n ( ½ ‖ℓ̂(x_i)‖² − (∇·ℓ̂)(x_i) ).

Parametric assumption. If we assume that the score is a linear combination of finitely many basis functions, we will get a consistent estimator of these parameters. That is, we make the following assumption:

(A4) The score function ℓ(x) is a linear combination of known basis functions ψ_j(x), j = 1, . . ., m, where ψ_j : R^d → R^d, that is, ℓ(x) = Σ_{j=1}^m ψ_j(x)θ*_j for some θ* ∈ R^m. We assume that the score function and its derivatives are square-integrable with respect to p(x).
In this paper, we consider for simplicity a parametric assumption for the score. In order to go towards non-parametric estimation, we would need to let the number m of basis functions grow with n (exponentially, with no added assumptions), and this is an interesting avenue for future work. In the simulations of Section 5, we consider a simple set of basis functions which are localized functions around observations; these can approximate reasonably well in practice most densities, and led to good estimation of the e.d.r. subspace. Moreover, if we have the additional knowledge that the components of x are statistically independent (potentially after linearly transforming them using independent component analysis [18]), we can use separable functions for the scores.
We introduce the notation Ψ(x) = (ψ_1(x), . . ., ψ_m(x)) ∈ R^{d×m}, so that the score function ℓ̂ we wish to estimate has the form ℓ̂(x) = Ψ(x)θ for some θ ∈ R^m. The empirical score-matching objective may then be written as:
R̂_score(θ) = (1/n) Σ_{i=1}^n ( ½ θ^⊤Ψ(x_i)^⊤Ψ(x_i)θ − (∇·(Ψθ))(x_i) ),
which is a quadratic function of θ (the divergence term being linear in θ), and can thus be minimized by solving a linear system, in running time O(m³ + m²dn) (to form the matrix and to solve the system).
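As a concrete instance of this quadratic problem, consider the purely illustrative family of linear score models ℓ̂(x) = Ax (i.e., basis functions of the form ψ(x) = x_b e_a, so m = d²); the empirical objective then reduces to ½ tr(AĈA^⊤) − tr(A), with Ĉ the empirical second moment, minimized in closed form at A = Ĉ^{−1}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000

# Gaussian data with a non-identity covariance; the true score is l(x) = Sigma^{-1} x.
L = rng.standard_normal((d, d)) / np.sqrt(d)
Sigma = L @ L.T + 0.5 * np.eye(d)
x = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T

# Linear score family l_hat(x) = A x.  Since div(A x) = tr(A), the empirical
# objective (1/n) sum_i [ 0.5 ||A x_i||^2 - div ] = 0.5 tr(A C A^T) - tr(A),
# with C the empirical second moment, is minimized in closed form at A = C^{-1}.
C = x.T @ x / n
A = np.linalg.inv(C)

A_true = np.linalg.inv(Sigma)
err = np.linalg.norm(A - A_true) / np.linalg.norm(A_true)
```

This family is exact for centered Gaussian data; for general densities one would use richer bases, as discussed above.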
Given standard results regarding the convergence of (1/n) Σ_{i=1}^n Ψ(x_i)^⊤Ψ(x_i) ∈ R^{m×m} to its expectation, and of the vector of empirical divergences ( (1/n) Σ_{i=1}^n (∇·ψ_j)(x_i) )_{j=1,…,m} ∈ R^m to its expectation, we get a √n-consistent estimation of θ* under simple assumptions (see Theorem 2).

Score matching for sliced inverse regression: two-step approach
We can now combine our linear parametrization of the score with the SIR approach outlined in Section 3. The true conditional expectation is E(ℓ(x)|y) = E(Ψ(x)|y)θ* and belongs to the e.d.r. subspace. We consider H different slices I_1, . . ., I_H, and the following estimator, which simply replaces the true score by the estimated score (i.e., θ* by θ̂), and highlights the dependence on θ.
The estimator V̂_{1,cov} can be rewritten as a quadratic function of θ: defining α_{i,j} = 1 if y_i and y_j fall in the same slice and α_{i,j} = 0 otherwise, each slice mean is an average of the ℓ(x_i) within the slice, and using the linearity ℓ(x) = Ψ(x)θ, the estimator V̂_{1,cov}(θ) becomes an explicit quadratic form in θ with coefficients depending only on the α_{i,j}, the p̂_h, and the Ψ(x_i). In the two-step approach, we first solve the score matching optimization problem to obtain an estimate θ̂ of the optimal parameters θ*, and then use it to get the k largest eigenvectors of the covariance matrix V̂_{1,cov}(θ̂). This approach works well in low dimensions, when the score function can be well approximated; otherwise, we may suffer from the curse of dimensionality: a good approximation of the score function may require an exponential number of basis functions. In Section 4.3, we consider a direct approach aiming at improving robustness.
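The two-step pipeline can be sketched end to end; here the score is fitted within an illustrative linear family ℓ̂(x) = Ax (exact for centered Gaussian inputs, where the score-matching minimizer is the inverse empirical second moment) and then plugged into a slice-mean variant of SADE:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, H = 6, 100_000, 20

# Correlated Gaussian inputs (the score is treated as unknown by the estimator).
L = rng.standard_normal((d, d)) / np.sqrt(d)
Sigma = L @ L.T + 0.5 * np.eye(d)
x = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
y = (x @ w) ** 3 + 0.1 * rng.standard_normal(n)

# Step 1: score matching within the linear family l_hat(x) = A x  ->  A = C^{-1}.
A = np.linalg.inv(x.T @ x / n)
s = x @ A.T                                   # estimated scores S1(x_i)

# Step 2: SADE on the estimated scores (equal-count slices, slice means).
order = np.argsort(y)
V1 = np.zeros((d, d))
for idx in np.array_split(order, H):
    mh = s[idx].mean(axis=0)
    V1 += (len(idx) / n) * np.outer(mh, mh)
w_hat = np.linalg.eigh(V1)[1][:, -1]          # top eigenvector estimates w
```

For Gaussian inputs, E(S_1(x)|y) is exactly proportional to w, so the top eigenvector of V1 recovers the e.d.r. direction despite the score being estimated.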
Consistency. Let us also provide a consistency result in the case of an unknown score function, under Assumption (A4), for the two-step algorithm. We make additional assumptions (M1)-(M3); in particular, a certain second-moment matrix of the basis functions is assumed to be non-degenerate, and we let λ_0 denote its minimal eigenvalue.
Now let us formulate and prove a non-asymptotic result relating the real and the estimated e.d.r. spaces. We need to define a notion of distance between subspaces spanned by two sets of vectors; we use the squared trace error R²(w, ŵ) [15]. Using Corollary 4.1 and its discussion in [30], we can derive a bound on R(w, ŵ), where w and ŵ span the real and the estimated e.d.r. spaces respectively. Using Weyl's inequality [27], |λ_i(V̂_{1,cov}) − λ_i(V_{1,cov})| ≤ ‖V̂_{1,cov} − V_{1,cov}‖ for all i = 1, . . ., d, and taking ε = (λ_k − λ_{k+1})/2 = λ_k/2 (since λ_{k+1} = 0), we can now formulate our main theorem for the analysis of the SADE algorithm:

Theorem 4. Consider assumptions (A1), (A2), (A4), (L1)-(L4), (M1), (M2), (M3). For δ < 1/n and n large enough, i.e., n > c_1 + c_2 log(1/δ) for some positive constants c_1 and c_2 not depending on n and δ, we obtain a bound on R(w, ŵ) that holds with probability not less than 1 − δ.
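The subspace error can be computed from projection matrices; since the exact normalization of R² from [15] is not reproduced above, the sketch below uses one common convention (0 for equal subspaces, 1 for orthogonal ones):

```python
import numpy as np

def subspace_distance(w, w_hat):
    """Distance between the column spans of w and w_hat via projection matrices.

    One common convention (the exact normalization of R^2 in the text may
    differ): || P - P_hat ||_F / sqrt(2 k), invariant to the choice of basis.
    """
    q, _ = np.linalg.qr(w)                   # orthonormal basis of span(w)
    qh, _ = np.linalg.qr(w_hat)              # orthonormal basis of span(w_hat)
    P, Ph = q @ q.T, qh @ qh.T               # orthogonal projections
    k = w.shape[1]
    return np.linalg.norm(P - Ph) / np.sqrt(2 * k)

b = np.eye(4)[:, :2]                                  # span{e1, e2}
same = subspace_distance(b, np.eye(4)[:, [1, 0]])     # same subspace
orth = subspace_distance(b, np.eye(4)[:, 2:])         # orthogonal subspace
```

The distance depends only on the spans, not on the particular spanning vectors, which matches the identifiability of w up to a right linear transform.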
We thus obtain a convergence rate in 1/√n, with an explicit dependence on all constants of the problem. The dependence on the dimension d and the number of basis functions m is of the order (ignoring logarithmic terms) O( d^{3/2}/n^{1/2} + m^{5/2}‖θ*‖/n^{1/2} ). For k > 1, our dependence on the sample size n is improved compared to existing work such as [11] (while it matches the dependence of [16] for k = 1), but this comes at the expense of assuming that the score functions satisfy a parametric model.
For a fixed number of basis functions m, we thus escape the curse of dimensionality, as we get a polynomial dependence on d (which we believe could be improved). However, when no assumptions are made on the score functions, the number m, and potentially the norm ‖θ*‖, has to grow with the number of observations n. We are then faced with a traditional non-parametric estimation problem, and the number m will need to grow with n depending on the smoothness assumptions we are willing to make on the score function, with an effect on ‖θ*‖ (and probably λ_0, which we neglect in the discussion below). While a precise analysis is out of the scope of this paper and left for future work, we can make an informal argument as follows: in order to approximate the score with precision ε within a set of basis functions where ‖θ*‖ is bounded, we need m(ε) basis functions. We thus end up with two sources of error, an approximation error ε and an estimation error of order O(m(ε)^{5/2}/n^{1/2}). Typically, if we assume that the score has a number of bounded derivatives proportional to the dimension, or if we assume that the input variables are independent so that the score can be estimated separately for each dimension, then m(ε) is of order 1/ε^{1/r}, where r is independent of the dimension, leading to an overall rate which is independent of the dimension.
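This informal balancing argument can be written out explicitly (a sketch using only the orders of magnitude stated above):

```latex
% With m(\varepsilon) \asymp \varepsilon^{-1/r}, the total error is
\varepsilon + \frac{m(\varepsilon)^{5/2}}{n^{1/2}}
  \asymp \varepsilon + \frac{\varepsilon^{-5/(2r)}}{n^{1/2}} .
% Equating the two terms gives
\varepsilon^{1 + 5/(2r)} \asymp n^{-1/2}
\quad\Longrightarrow\quad
\varepsilon^\ast \asymp n^{-r/(2r+5)},
% an overall rate that does not depend on the dimension d.
```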

Score matching for SIR: direct approach
We can also combine these two steps to try to avoid the curse of dimensionality. Our estimation of the score, i.e., of the parameter θ, is only done to be used within the SIR approach, where we expect the matrix V1,cov to have rank k. Thus, when estimating θ by minimizing Rscore(θ), we may add a regularization that penalizes large ranks for V1,cov(θ), where we highlight the dependence on θ ∈ R^m. By enforcing the low-rank constraint, our aim is to circumvent a potentially poor estimation of the score function, which may nevertheless be sufficient for the task of estimating the e.d.r. space (we observe a better behavior in our simulations in Section 5).
We introduce the matrix A(θ). We may then penalize the nuclear norm of A(θ), or potentially consider norms that take into account that we look for a rank k (e.g., the k-support norm on the spectrum of A(θ) [25]). Combining the two penalties, we obtain a convex optimization task. Efficient algorithm. Following [3], we consider reweighted least-squares algorithms. The trace norm admits a variational form (see [4]), and the optimization problem (4.5) can be reformulated accordingly. Note that the objective function is a quadratic function of θ. We decompose the matrix A, rearrange the regularizer term, and introduce the corresponding notation. We can now estimate the complexity of this algorithm. First, we need to evaluate the quadratic form in Rscore.
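The reweighted least-squares idea can be illustrated on a generic trace-norm-regularized problem (a simplified stand-in for (4.5), not the paper's exact objective): minimize (1/2)‖A − B‖_F² + λ‖A‖_*, alternating between a smoothed weight matrix L = (AAᵀ + ε²I)^{1/2} and a quadratic minimization in A.

```python
import numpy as np

def trace_norm_irls(B, lam=0.5, eps=1e-6, iters=300):
    """Reweighted least squares for
    min_A 0.5 * ||A - B||_F^2 + lam * ||A||_*,
    using the variational form
    ||A||_* = 0.5 * inf_{L > 0} [ tr(A^T L^{-1} A) + tr(L) ],
    smoothed with eps to keep L invertible."""
    n = B.shape[0]
    A = B.copy()
    for _ in range(iters):
        # Reweighting step: L = (A A^T + eps^2 I)^{1/2}.
        w, U = np.linalg.eigh(A @ A.T + eps**2 * np.eye(n))
        L = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T
        # Quadratic step: A = (I + lam * L^{-1})^{-1} B.
        A = np.linalg.solve(np.eye(n) + lam * np.linalg.inv(L), B)
    return A

# In the limit, the solution is singular-value soft-thresholding of B.
Q = np.linalg.qr(np.arange(25.0).reshape(5, 5) + np.eye(5))[0]
B = Q[:, :3] @ np.diag([3.0, 1.0, 0.1]) @ Q[:, :3].T
A = trace_norm_irls(B, lam=0.5)
```

Because each reweighting step leaves a quadratic problem, the same scheme applies when the data-fitting term is the quadratic Rscore objective in θ.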

Experiments
In this section, we provide numerical experiments for SADE, PHD+ and SPHD on different functions f. We denote the true and estimated e.d.r. subspaces by E and Ê respectively, defined from w and ŵ.

Known score functions
Consider a Gaussian mixture model with two components in R^d, with weights θ = (6/10, 4/10). Contour lines of this distribution for d = 2 are shown in Figure 1.
The error ε has a standard normal distribution. To estimate the quality of an estimated e.d.r. subspace, we use the square trace error R²(w, ŵ), where w and ŵ are the true and the estimated e.d.r. vectors respectively, and P and P̂ are the corresponding projectors. Note that (4.3) is a special case of this formula with orthonormal matrices. To show the dominance of SADE over SIR (which should only work for elliptically symmetric distributions), we consider a rational multi-index model with d = 10. Here the true e.d.r. space is spanned by the two vectors (1, 0, 0, . . ., 0) and (0, 1, 0, . . ., 0). The number of slices is H = 10, we consider numbers of observations n from 2^10 to 2^16 equally spaced in logarithmic scale, and we conduct 100 replicates to obtain means and standard deviations of the logarithms of the square trace errors, divided by √100 = 10 (to assess the significance of differences), as shown in Figure 2. Even in this simple model, the ordinary SIR algorithm does not work properly, because the distribution of the inputs x has no elliptical symmetry. When n → ∞, its squared trace error tends to a nonzero constant depending on the properties of the density function, whereas SADE shows good performance with slope −1 (corresponding to a √n-consistent estimator). We now compare the moment methods SADE, PHD+ and SPHD. Although the goal of this paper is to compare moment-matching techniques, we also compare them with the state-of-the-art MAVE method [34], [31], [32]. It is worth noting three properties which we have already discussed concerning these methods: 1. Sliced methods have a wider application area than unsliced ones: SADE is stronger than ADE and SPHD is stronger than PHD+. 2. SADE cannot recover the entire e.d.r. space in several cases, for example, in classification tasks or symmetric cases. For those cases, we should use second-order methods (i.e., methods based on S_2). 3. MAVE works better, but the goal of this paper is to compare moment-matching techniques. In Figure 11, we provide examples where MAVE suffers from the curse of dimensionality and performs worse than moment-matching methods.
We conduct three experiments, where H = 10, d = 10, the error term ε has a standard normal distribution, the numbers of observations n range from 2^10 to 2^16 equally spaced in logarithmic scale, and we use 10 replicates to estimate sample means and standard deviations: • Rational model, where the dimension of the effective dimension reduction subspace is k = 2.
Results are shown in Figure 3: the first-order method SADE works better than the second-order methods, and SPHD works better than PHD+, i.e., slicing makes the method more robust. • Classification problem of the form (5.4). The error of SADE is close to 0.5 (Figure 4), which means that the method finds only one direction of the e.d.r. space. Moreover, SPHD gives better results than PHD+.
We chose 100 Gaussian kernels as basis functions, with centers X_{i,j} forming a uniform grid on the square [−2, 2]² (note that this does not imply any notion of Gaussianity for the underlying density, and the Gaussian kernel here could be replaced by any differentiable local function).
We choose α = 0.01, σ = 1 and conduct 100 replicates with numbers of observations n from 2^10 to 2^16 equally spaced in logarithmic scale. The results are presented in Figure 6: the square trace error tends to zero as n increases. In high dimensions, we cannot use a uniform grid because of the curse of dimensionality. Instead, we use n Gaussian kernels centered at the sample points X_i. Note that we can then only recover an approximate score function, but our goal is the estimation of the e.d.r. subspace.
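In one dimension, the closed-form score-matching fit with such a Gaussian basis can be sketched as follows (our own minimal illustration, not the paper's code: we fit ŝ(x) = ψ(x)ᵀθ to samples from a standard normal, whose true score is −x, and the ridge parameter plays the role of the α = 0.01 above):

```python
import numpy as np

def fit_score_1d(x, centers, sigma=1.0, alpha=1e-2):
    """Closed-form score matching for a linear-in-parameters model
    s(x) = psi(x)^T theta with Gaussian kernel basis functions.
    Minimizing (1/2) E[s(x)^2] + E[s'(x)] over theta gives
    theta = -(G + alpha I)^{-1} b, with G = E[psi psi^T], b = E[psi']."""
    diff = x[:, None] - centers[None, :]              # shape (n, m)
    psi = np.exp(-diff**2 / (2 * sigma**2))           # basis values
    dpsi = -(diff / sigma**2) * psi                   # their derivatives
    G = psi.T @ psi / len(x)
    b = dpsi.mean(axis=0)
    theta = -np.linalg.solve(G + alpha * np.eye(len(centers)), b)
    return theta

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
centers = np.linspace(-3, 3, 20)
theta = fit_score_1d(x, centers, sigma=1.0, alpha=1e-2)

def s_hat(t):
    # Estimated score at point t (sigma = 1, matching the fit above).
    return np.exp(-(t - centers)**2 / 2) @ theta
```

In the bulk of the distribution, ŝ should be close to the true score of N(0, 1), namely s(t) = −t.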
We use our reweighted least-squares algorithm to solve the convex problem (4.5).We conduct several experiments on functions from the previous section.
We plot the performance as a function of the kernel bandwidth h to assess the robustness of the methods. In Figure 7, we show the relationship between the square trace error and h for the quadratic model (5.5) with d = 10 and n = 1000. In Figures 8, 9 and 10, we show the relationship between the square trace error and σ for the rational model (5.2) with d = 10, n = 1000; d = 20, n = 2000; and d = 50, n = 5000. We see that for large d, the one-step algorithm is more robust than the two-step algorithm. Moreover, the experiments show that even with rough score-function estimates, correct estimation of the e.d.r. space can be performed. Comparison with MAVE. We consider the rational model (5.3) with n = 10000, k = 10 and data dimension d in the range [20, 150]. The probability density p(x) has independent components, each a mixture of two Gaussians with weights (6/10, 4/10), means (−1, 1) and standard deviations (1, 2). The results for MAVE, SADE with known score and SADE with unknown score are shown in Figure 11. Both SADE with known and with unknown score lose in low dimensions, but they are more resistant to the curse of dimensionality (as expected, since MAVE relies on non-parametric estimation in a space of dimension k, which is here larger). Moreover, the complexity of our moment-matching technique is linear in the number of observations n, while MAVE is superlinear.

Conclusion
In this paper, we considered a general non-linear regression model under the assumption that the response depends on an unknown k-dimensional subspace. Our goal was the direct estimation of this unknown k-dimensional space, which is often called the effective dimension reduction or e.d.r. space. We proposed new approaches (SADE and SPHD), combining two existing techniques (sliced inverse regression (SIR) and score function-based estimation). We obtained consistent estimation for k > 1 using only the first-order score, and proposed explicit approaches to learn the score from data. It would be interesting to extend our sliced extensions to learning neural networks: indeed, our work focused on the subspace spanned by w and cannot identify individual columns, while [20] showed that by using proper tensor decomposition algorithms and second-order score functions, the columns of w can be consistently estimated (in polynomial time).
Thus, the deviation from the population version may be split into four terms as follows; we now bound each term separately.
Bounding T_1. The range cannot grow too fast, i.e., it grows as log n. Indeed, assuming without loss of generality that Ey = 0, we can bound max{y, ...}. For the second term T_{4,2} above, if we select any element indexed by (a, b), we can use Bernstein's inequality [5, Theorem 2.10] to get that the deviation exceeds the stated bound with probability less than e^{−t}. We can also get an upper bound for this quantity by using −η_a instead of η_a. Thus, with t = log(8d²/δ), we get that all d(d + 1)/2 absolute deviations are less than 2τ.
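For reference, a standard one-sided form of Bernstein's inequality (as in [5, Theorem 2.10]), for independent zero-mean variables X_i with |X_i| ≤ c and total variance bounded by v, reads:

```latex
% Bernstein's inequality (one-sided form), for independent zero-mean
% X_1, \dots, X_n with |X_i| \le c and \sum_i \mathbb{E}[X_i^2] \le v:
\mathbb{P}\left( \sum_{i=1}^n X_i \ge \sqrt{2vt} + \frac{ct}{3} \right)
  \le e^{-t}, \qquad t > 0 .
```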

Fig 1. Contour lines of Gaussian mixture pdf in 2D case.

Fig 4. Mean and standard deviation of R²(E, Ê) divided by √10 for the classification problem (5.4).
Effect of slicing. We now study the effect of slicing and show that, in the population case, it is superior to the non-sliced version, as it recovers the true e.d.r. subspace in more situations. Proposition 2. Assume (A1) and (A3). The matrix w may be recovered from the moment in Lemma 3 (up to a right linear transform) if and only if E[∇²g(w⊤x)] has full rank. If this condition is satisfied, then SPHD also recovers w up to scale.
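As a concrete illustration of when this condition matters (a standard textbook example, not one of the paper's experiments), consider a symmetric single-index model, for which first-order moments vanish while the second-order condition holds:

```latex
% Symmetric single-index example: first-order methods fail,
% second-order methods succeed.
y = (w^\top x)^2 + \varepsilon, \qquad x \sim \mathcal{N}(0, I_d):
\quad \mathbb{E}[g'(w^\top x)] = 2\,\mathbb{E}[w^\top x] = 0,
\qquad \mathbb{E}[g''(w^\top x)] = 2 \neq 0,
```

so first-order moments cannot identify w, while second-order methods such as SPHD can.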