Sufficient Dimension Reduction via Principal Lq Support Vector Machine

Abstract: Principal support vector machine was proposed recently by Li, Artemiou and Li (2011) to combine the L1 support vector machine and sufficient dimension reduction. We introduce the principal Lq support vector machine as a unified framework for linear and nonlinear sufficient dimension reduction. Noting that the solution of the L1 support vector machine may not be unique, we set q > 1 to ensure the uniqueness of the solution. The asymptotic distributions of the proposed estimators are derived for q > 1. We demonstrate through numerical studies that the proposed L2 support vector machine estimators improve existing methods in accuracy and are less sensitive to the choice of the tuning parameter.


Introduction
The emergence of computer power and the increase in storage capabilities have provided scientists the necessary tools to collect and store high dimensional data. In an effort to reduce the dimensionality of the data before applying classical techniques for inference, sufficient dimension reduction has seen great development in the recent statistics literature. The main objective of sufficient dimension reduction is to estimate a p × d matrix β with d < p, such that

Y ⫫ X | β^T X, (1)

where Y is the response, X is a p-dimensional predictor, and ⫫ denotes statistical independence. The column space of β in (1) is called a dimension reduction space. Under mild assumptions (Cook, 1998a; Yin, Li and Cook, 2008), the intersection of all dimension reduction spaces is a dimension reduction space itself. This unique minimum dimension reduction space is called the central space, and is denoted by S_{Y|X}. The dimensionality of S_{Y|X} is called the structural dimension. We assume the existence of the central space throughout this article. Since the introduction of the seminal sliced inverse regression method in Li (1991), many sufficient dimension reduction procedures have been proposed in the literature, such as Cook and Weisberg (1991), Cook (1998b), Xia et al. (2002), Li, Zha and Chiaromonte (2005), and Li and Wang (2007). More recently, Li, Artemiou and Li (2011) proposed the principal support vector machine, which combines the ideas of sliced inverse regression (Li, 1991), contour regression (Li, Zha and Chiaromonte, 2005) and the support vector machine (Cortes and Vapnik, 1995; Vapnik, 1998). By employing the L1 support vector machine and focusing on separating hyperplanes rather than slice means, the principal support vector machine improves the accuracy of popular inverse regression estimators.
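As a concrete illustration of model (1), the following minimal simulation (our own toy setup, not one of the paper's examples) generates data in which Y depends on a ten-dimensional predictor only through a single linear combination β^T X, so that Y is independent of X given β^T X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 10
beta = np.zeros(p)
beta[0] = 1.0                                  # central space spanned by the first coordinate
X = rng.standard_normal((n, p))
# Y depends on X only through beta^T X, so model (1) holds with d = 1
Y = np.sin(X @ beta) + 0.2 * rng.standard_normal(n)
print(X.shape, Y.shape)
```

Any sufficient dimension reduction method applied to such data should recover the direction β up to sign and scale.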
As demonstrated elegantly in Li, Artemiou and Li (2011), when we apply a modified L1 support vector machine to a binary response Y, the normal vector ψ of the optimal hyperplane ψ^T X − t = 0 naturally belongs to the central space S_{Y|X}. For a continuous response, the predictors are separated into several slices according to the values of the responses, and multiple support vector machines are implemented to find the optimal hyperplanes that separate these slices. The principal eigenvectors of the normal vectors from these hyperplanes, known as the principal L1 support vector machine estimators, thus recover the central space. In spite of the popularity of the L1 support vector machine among practitioners and researchers, the corresponding objective function is not strictly convex and may have multiple optimal solutions (Burges and Crisp, 1999). More specifically, if the optimal hyperplane is described by an equation ψ^T X − t = 0 for some ψ ∈ R^p and t ∈ R, then the L1 support vector machine may have multiple optimal solutions that share the same ψ but have different t. On the other hand, the Lq support vector machine with q > 1 enjoys a unique solution due to the strict convexity of its objective function; see, for example, Burges and Crisp (1999) and Abe (2002). This motivates us to consider the Lq support vector machine with q > 1 for sufficient dimension reduction.
We extend Li, Artemiou and Li (2011) and propose the principal Lq support vector machine with q > 1 in this article. The principal Lq support vector machine inherits the benefits of the principal L1 support vector machine, and combines both linear and nonlinear sufficient dimension reduction in a general framework. By focusing on the theoretical development of the principal Lq support vector machine with q > 1, we clearly demonstrate the connections and differences between our proposal and the existing principal L1 support vector machine estimator. As we will see later, both estimators depend on a tuning parameter known as the misclassification penalty. When the misclassification penalty goes to infinity, these estimators become equivalent. Our proposal improves the accuracy of the existing estimators at the sample level, and it enjoys the additional benefit of being less sensitive to the choice of the misclassification penalty. Along with the theoretical development of the principal Lq support vector machine estimator, we develop a more complete asymptotic theory for the existing support vector machine literature.

Principal Lq support vector machine
Let {(X_i, Y_i), i = 1, ..., n} be an i.i.d. sample of (X, Y). Denote Σ = var(X) and X̄ = n^{-1} ∑_{i=1}^n X_i. Suppose Y is a binary random variable with values ±1. The Lq support vector machine (Abe, 2010) is defined through the following optimization problem:

minimize ψ^T ψ + (λ/n) ∑_{i=1}^n ξ_i^q over (ψ, t, ξ), subject to ξ_i ≥ 0 and Y_i(ψ^T X_i − t) ≥ 1 − ξ_i, i = 1, ..., n. (2)

Here λ > 0 is a tuning parameter often referred to as the cost or misclassification penalty. The vector ξ = (ξ_1, ..., ξ_n)^T collects the misclassification distances, with ξ_i = 0 for correctly classified points and ξ_i > 0 for incorrectly classified points. The separating hyperplane ψ^T X − t = 0 is described by ψ ∈ R^p and t ∈ R. The solution (ψ*, t*) to this minimization problem gives the optimal hyperplane. For fixed ψ and t, minimizing (2) over ξ leads to the solution ξ_i* = {1 − Y_i(ψ^T X_i − t)}_+, where a_+ = max(a, 0). Plugging ξ_i* into (2) leads to the following unconstrained minimization problem:

minimize ψ^T ψ + (λ/n) ∑_{i=1}^n {1 − Y_i(ψ^T X_i − t)}_+^q over (ψ, t). (3)

At the population level, (3) corresponds to

ψ^T ψ + λ E[{1 − Y(ψ^T X − t)}_+^q]. (4)

In a regression setting the response Y is a continuous variable. Let A_1 and A_2 be two disjoint sets in the range of Y, and define Ỹ = I(Y ∈ A_2) − I(Y ∈ A_1) to be the discretized response variable. We modify (4) and define the following objective function:

Λ(ψ, t) = ψ^T Σ ψ + λ E[{1 − Ỹ(ψ^T X − t)}_+^q], (5)

where ψ^T Σ ψ and Ỹ replace ψ^T ψ and Y in (4), respectively. Replacing Y with Ỹ allows us to handle continuous as well as discrete responses Y in (5). As we will see in the next theorem, adding Σ in the first term of (5) is essential to the unbiasedness of the resulting principal Lq support vector machine estimator.
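To make the unconstrained formulation concrete, the following sketch minimizes a sample analogue of (5) with a general-purpose optimizer for q = 2. The simulated data, the choice λ = 1, and the use of BFGS in place of the paper's quadratic programming solution are our own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, q, lam = 200, 3, 2, 1.0
X = rng.standard_normal((n, p))
Ytil = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))  # binary pseudo-response
Sigma = np.cov(X, rowvar=False)

def objective(theta):
    psi, t = theta[:p], theta[p]
    # slack (1 - Ytil(psi^T X - t))_+ as in the hinge-type loss of (5)
    slack = np.maximum(1.0 - Ytil * (X @ psi - t), 0.0)
    return psi @ Sigma @ psi + lam * np.mean(slack ** q)

res = minimize(objective, np.zeros(p + 1), method="BFGS")
psi_hat = res.x[:p]
print(psi_hat / np.linalg.norm(psi_hat))
```

For q = 2 the squared hinge is continuously differentiable, so a quasi-Newton method is adequate here; the recovered direction concentrates on the first coordinate, which drives the pseudo-response.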
Theorem 1. Suppose E(X|β^T X) is a linear function of β^T X, where β is as defined in (1). Then the minimizer (ψ_0, t_0) of the objective function Λ(ψ, t) in (5) satisfies ψ_0 ∈ S_{Y|X}.
Theorem 1 suggests that we can estimate the central space S_{Y|X} via minimization of the objective function (5). For a population level objective function such as Λ(ψ, t) in (5), the minimizer is denoted by (ψ_0, t_0). For a sample level objective function such as (2), we denote the minimizer by (ψ*, t*, ξ*). With q = 1 in (5), Λ(ψ, t) reduces to the objective function proposed in Li, Artemiou and Li (2011). Although there is a unique value ψ_0 that minimizes Λ(ψ, t) in this case, the value of t_0 that minimizes Λ(ψ, t) is not unique. This is because the second term of the objective function Λ(ψ, t) is not a strictly convex function of t when q = 1. On the other hand, the second term becomes a strictly convex function of t when q > 1, which guarantees the uniqueness of the solution (ψ_0, t_0). Without interrupting the flow of the main article, we provide in Appendix B sufficient conditions for the existence of a non-unique minimizer t_0 of Λ(ψ, t) when q = 1.
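The non-uniqueness of t for q = 1 can also be seen in a toy sample-level computation (the four scores and labels below are our own illustrative construction, not the population conditions of Appendix B): the L1 hinge objective is constant in t over a whole interval, while the squared hinge is minimized essentially at a single point:

```python
import numpy as np

# scores s_i = psi^T x_i and labels chosen so the L1 objective in t
# is flat on [-1, 1] while the L2 objective is 2 + 2t^2 there
s = np.array([0.0, 2.0, 0.0, -2.0])
y = np.array([1.0, 1.0, -1.0, -1.0])

def hinge_obj(t, q):
    return np.sum(np.maximum(1.0 - y * (s - t), 0.0) ** q)

t_grid = np.linspace(-2, 2, 4001)
for q in (1, 2):
    vals = np.array([hinge_obj(t, q) for t in t_grid])
    minimizers = t_grid[np.isclose(vals, vals.min())]
    print(f"q={q}: minimizing t's span "
          f"[{minimizers.min():.2f}, {minimizers.max():.2f}]")
```

Here the q = 1 objective attains its minimum on the entire interval [−1, 1], whereas the q = 2 objective is minimized only near t = 0.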

Sample estimation algorithm
Given the i.i.d. sample {(X_i, Y_i), i = 1, ..., n}, we study the sample algorithm of the principal Lq support vector machine to estimate the central space S_{Y|X}. We first develop a general result for q > 1 and then focus on q = 2 for our implementation. Let Σ̂ be the sample covariance estimator. The sample version objective function of the principal Lq support vector machine can be modified from (2) as follows.
Proposition 1. For q > 1, the solution ζ* of (7) is expressed in terms of α, the solution to the optimization problem (8). We relegate the proof of Proposition 1 to Appendix A.
Note that Ỹ has entries ±1 and α^T α = (α ⊙ Ỹ)^T (α ⊙ Ỹ). Thus for the special case of q = 2, (8) reduces to the quadratic programming problem (9). For the corresponding problem with q = 1, one can show that solving the sample version of (5) leads to ζ* = (1/2) Z^T (α ⊙ Ỹ), with α being the solution to (10). One can follow the proof of Theorem 3 in Li, Artemiou and Li (2011) for the derivation of (10). Comparing (9) and (10) reveals an interesting fact: the two problems become equivalent as λ → ∞. We will discuss this property further in the numerical studies section. It is easy to see that using q > 2 in Proposition 1 does not yield a quadratic programming problem. While the asymptotic theory will be developed for any q > 1, our numerical studies focus on q = 2. We present the principal L2 support vector machine algorithm to conclude this section. Suppose for now that the structural dimension d of the central space S_{Y|X} is known.
1. Compute the sample covariance estimator Σ̂ and center the predictors.
2. Let q_r, r = 1, ..., H − 1, be the equally spaced sample percentiles of {Y_1, ..., Y_n}.
3. For each q_r, construct Ỹ_i^r = I(Y_i > q_r) − I(Y_i ≤ q_r). Let ζ̂^r be the solution of (9) with Ỹ replaced by Ỹ^r.
4. Compute the eigenvectors û_1, ..., û_d corresponding to the d largest eigenvalues of ∑_{r=1}^{H−1} ζ̂^r (ζ̂^r)^T.
5. Estimate S_{Y|X} by the subspace spanned by û_1, ..., û_d.
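The algorithm above can be sketched end to end as follows. This is a simplified illustration: we substitute a smooth quasi-Newton minimization of the sample objective for the quadratic programming step (9), and the simulated single-index model is our own choice:

```python
import numpy as np
from scipy.optimize import minimize

def principal_l2svm(X, Y, d=1, H=5, lam=1.0):
    """Sketch of the principal L2 support vector machine: slice Y at sample
    percentiles, fit one squared-hinge SVM per slicing, and take the leading
    eigenvectors of the sum of outer products of the fitted normal vectors."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    M = np.zeros((p, p))
    for r in range(1, H):
        qr = np.quantile(Y, r / H)                 # dividing point q_r
        Ytil = np.where(Y > qr, 1.0, -1.0)         # discretized response
        def obj(theta, Ytil=Ytil):
            psi, t = theta[:p], theta[p]
            slack = np.maximum(1.0 - Ytil * (Xc @ psi - t), 0.0)
            return psi @ Sigma @ psi + lam * np.mean(slack ** 2)
        psi_r = minimize(obj, np.zeros(p + 1), method="BFGS").x[:p]
        M += np.outer(psi_r, psi_r)
    evals, evecs = np.linalg.eigh(M)               # ascending order
    return evecs[:, ::-1][:, :d]                   # d leading eigenvectors

rng = np.random.default_rng(2)
n, p = 300, 6
X = rng.standard_normal((n, p))
Y = X[:, 0] + 0.2 * rng.standard_normal(n)         # central space spanned by e1
B = principal_l2svm(X, Y, d=1, H=5)
print(np.abs(B[:, 0]))
```

In this toy model the leading eigenvector concentrates on the first coordinate, the direction used to generate Y.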

Asymptotic results for the principal Lq support vector machine
In this section we discuss the asymptotic results for the principal Lq support vector machine with q > 1. Assume E(X) = 0 without loss of generality. First we introduce the following notation: θ = (ψ^T, t)^T, X† = (X^T, −1)^T, W = (X^T, Ỹ)^T, and Σ† = diag(Σ, 0), where diag(A, B) denotes a block diagonal matrix with A and B on the block diagonals. Λ(ψ, t) in (5) can then be rewritten as E{m(θ, W)}, where

m(θ, W) = θ^T Σ† θ + λ(1 − Ỹ θ^T X†)_+^q.

Denote the corresponding sample version objective function as E_n{m(θ, W)}. Let θ_0 and θ̂ be the minimizers of E{m(θ, W)} and E_n{m(θ, W)}, respectively. Before we state the asymptotic distribution of θ̂ in Theorem 2, the gradient and the Hessian matrix of the Lq objective function E{m(θ, W)} are provided in the next two propositions.
Theorem 2. Suppose the conditions in Propositions 2 and 3 are satisfied. Then √n(θ̂ − θ_0) converges in distribution to a multivariate normal with mean 0 and covariance matrix H^{-1} var{D_θ m(θ_0, W)} H^{-1}, where H is given in Proposition 3.
To apply Theorem 2 to the proposed estimator of S_{Y|X}, recall from the algorithm in Section 3 that for a fixed dividing point q_r we have a corresponding Ỹ^r, r = 1, ..., H − 1.

Theorem 3. Suppose the conditions in Propositions 2 and 3 are satisfied. Then √n vec(V̂ − V) converges to a multivariate normal distribution with mean 0.

Let D be a diagonal matrix with diagonal elements being the d leading eigenvalues of V. Let U = (u_1, ..., u_d), where the u's are the d leading eigenvectors of V. Denote Û = (û_1, ..., û_d) correspondingly. We obtain the asymptotic distribution of Û as a result of Corollary 1 in Bura and Pfeiffer (2008).
Corollary 1. Suppose the conditions in Propositions 2 and 3 are satisfied, and V has rank d. Then √n vec(Û − U) converges to a multivariate normal distribution with mean 0.
In the last step of the principal L2 support vector machine algorithm in Section 3, we extract the d leading eigenvectors of V̂. Since d is unknown in practice, one needs to estimate the dimensionality of S_{Y|X} before successful implementation of the proposed algorithm. We propose a modified BIC criterion for this purpose. Define the criterion G_n(k) in (15), where ρ_i(V̂) denotes the ith largest eigenvalue of V̂. Then we estimate d by d̂, the maximizer of G_n(k) over k = 0, 1, ..., p. Similar criteria have been used in Zhu, Miao and Peng (2006) and Wang and Yin (2008). Our criterion differs from the one used in Li, Artemiou and Li (2011) in that we include the number of slices H, the predictor dimensionality p, and the misclassification penalty λ in (15). The consistency of d̂ is established next.
Theorem 4. Suppose the conditions in Propositions 2 and 3 are satisfied, and V has rank d. Then lim_{n→∞} P(d̂ = d) = 1.

Nonlinear sufficient dimension reduction
The dimension reduction in (1) aims at finding d features that are linear combinations of the original predictors, and will be referred to as linear sufficient dimension reduction. Let φ: R^p → R^d be nonlinear functions satisfying (16). Identifying φ(X) is known as nonlinear sufficient dimension reduction. Model (16) was first formulated in Cook (2007), and has been studied in Wu (2008) and Fukumizu, Bach and Jordan (2009). Note that (17) is parallel to (5) with q = 2. Let (ψ_0, t_0) be the minimizer of Λ(ψ, t) over all (ψ, t) ∈ H × R. Let σ{φ(X)} be the σ-field generated by φ(X). Under proper conditions, one can show that ψ_0(X) is measurable with respect to σ{φ(X)}, which means ψ_0 is a function of the sufficient predictor φ(X).
The derivation follows Theorem 2 in Li, Artemiou and Li (2011) and is thus omitted. Based on the i.i.d. sample {(X_i, Y_i), i = 1, ..., n}, we now describe the principle for the sample level estimation. Suppose H can be spanned by {h_1, ..., h_G}, where we choose h_j ∈ H to satisfy E_n{h_j(X)} = 0, j = 1, ..., G. Define Ψ ∈ R^{n×G} with the element in the ith row and jth column being h_j(X_i). The sample version of (17) becomes (18). Let (c*, t*) be the minimizer of (18) over (c, t) ∈ R^G × R. Denote Ỹ = (Ỹ_1, ..., Ỹ_n)^T and P_Ψ = Ψ(Ψ^T Ψ)^{-1} Ψ^T. Parallel to Proposition 1, we have the following result.

Proposition 4. The solution c* of (18) is expressed in terms of α*, the solution to the quadratic programming problem (19).

Following similar procedures as in Li, Artemiou and Li (2011), we describe the details of carrying out nonlinear sufficient dimension reduction through a reproducing kernel Hilbert space as follows. For the function class H, we use the reproducing kernel Hilbert space based on a mapping κ: R^p × R^p → R. Common choices of κ include the Gaussian radial kernel and the polynomial kernel. Define the kernel matrix K_n ∈ R^{n×n}, with κ(X_i, X_j) being the element in the ith row and jth column of K_n. Define Q_n = I_n − J_n/n, where I_n is the n × n identity matrix and J_n is the n × n matrix whose entries are all 1. Let w_g be the eigenvector corresponding to λ_g, the gth largest eigenvalue of Q_n K_n Q_n, for g = 1, ..., n. From Proposition 2 in Li, Artemiou and Li (2011), we know Ψ becomes (w_1, ..., w_G). After plugging Ψ = (w_1, ..., w_G) into (19) and applying Proposition 4, we get c* ∈ R^G. Recall from the sample level algorithm in Section 3 that Ỹ^r = (Ỹ_1^r, ..., Ỹ_n^r)^T, where Ỹ_i^r = I(Y_i > q_r) − I(Y_i ≤ q_r) and q_r denotes the equally spaced sample percentiles of {Y_1, ..., Y_n} for r = 1, ..., H − 1. When we replace Ỹ in (19) with Ỹ^r, the corresponding solution c* becomes c^{r*}. Let û_1, ..., û_d be the d leading eigenvectors of ∑_{r=1}^{H−1} c^{r*} (c^{r*})^T. For t = 1, ..., d and g = 1, ..., G, denote the gth component of û_t by û_tg. For i = 1, ..., n and g = 1, ..., G, denote the ith component of w_g by w_gi. From (16), we have φ(x) ∈ R^d as a nonlinear reduction of x ∈ R^p. At the sample level, the tth component of φ̂(x) is then estimated from the û_tg and the kernel evaluations at x.
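A compact sketch of the kernel-based procedure just described (illustrative only: the bandwidth, the ridge-type penalty c^T c standing in for the covariance-operator term, and the smooth minimization in place of the quadratic program (19) are our own simplifications):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
n, p, G, H, lam = 200, 5, 20, 5, 1.0
X = rng.standard_normal((n, p))
phi = np.sin(X[:, 0])                       # true nonlinear sufficient predictor
Y = phi + 0.2 * rng.standard_normal(n)

# Gaussian kernel matrix and double centering Q_n K_n Q_n
gamma = 1.0 / (2.0 * p)                     # illustrative bandwidth choice
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * D2)
Q = np.eye(n) - np.ones((n, n)) / n
evals, evecs = np.linalg.eigh(Q @ K @ Q)
Psi = evecs[:, ::-1][:, :G]                 # top-G eigenvectors as basis functions

M = np.zeros((G, G))
for r in range(1, H):
    Ytil = np.where(Y > np.quantile(Y, r / H), 1.0, -1.0)
    def obj(theta, Ytil=Ytil):
        c, t = theta[:G], theta[G]
        slack = np.maximum(1.0 - Ytil * (Psi @ c - t), 0.0)
        return c @ c + lam * np.mean(slack ** 2)
    c_r = minimize(obj, np.zeros(G + 1), method="BFGS").x[:G]
    M += np.outer(c_r, c_r)

u = np.linalg.eigh(M)[1][:, -1]             # leading eigenvector
phi_hat = Psi @ u                           # in-sample estimate of phi(X_i)
rho = abs(spearmanr(phi_hat, phi)[0])
print(f"|Spearman| between phi_hat and phi: {rho:.2f}")
```

Because the accuracy criterion in the numerical studies is the Spearman correlation, which is invariant under monotone transformations, recovering any monotone transformation of φ suffices.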

Numerical studies
We use synthetic examples as well as a real data analysis to demonstrate the finite-sample performance of the proposed methods in this section.

Example 1: This example is designed to compare the principal Lq support vector machine estimators for linear sufficient dimension reduction. As it has been demonstrated in Li, Artemiou and Li (2011) that the principal L1 support vector machine can consistently outperform popular methods such as sliced inverse regression (Li, 1991), sliced average variance estimation (Cook and Weisberg, 1991), and directional regression (Li and Wang, 2007), we focus on comparing the principal L1 support vector machine with the newly proposed principal L2 support vector machine estimator. Consider models with X ∼ N(0, I_p), σ = 0.2, and ε ∼ N(0, 1) independent of X. We set q_r as equally spaced sample percentiles of {Y_1, ..., Y_n} for r = 1, ..., H − 1, and define Ỹ_i^r = I(Y_i > q_r) − I(Y_i ≤ q_r). Let the sample size be n = 100, the number of slices H = 10, 20, 50, and p = 10, 20, 30. Suppose β ∈ R^{p×d} is a basis of the central space. Denote its sample estimator by β̂. We measure the accuracy of β̂ by ∆ = ∥P_β − P_β̂∥, where P_β = β(β^T β)^{-1} β^T, P_β̂ = β̂(β̂^T β̂)^{-1} β̂^T, and ∥•∥ is the Frobenius norm.
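The accuracy measure Δ can be computed directly from the two bases (the numerical estimate below is a hypothetical example, not a value from Table 1):

```python
import numpy as np

def proj(B):
    """Orthogonal projection matrix onto the column space of B."""
    return B @ np.linalg.inv(B.T @ B) @ B.T

def delta(B_true, B_hat):
    """Frobenius distance between projections: the accuracy measure Delta."""
    return np.linalg.norm(proj(B_true) - proj(B_hat))

beta = np.array([[1.0], [0.0], [0.0]])
beta_hat = np.array([[0.99], [0.10], [0.05]])   # a hypothetical estimate
print(round(delta(beta, beta_hat), 3))
print(delta(beta, 2.5 * beta))                  # invariant to basis rescaling
```

Because Δ compares projection matrices rather than the bases themselves, it is invariant to the choice of basis for each subspace.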
The results are summarized in Table 1. The entries are of the form a(b), giving the means and standard errors of ∆ based on 200 repetitions. Smaller values in Table 1 mean better estimation. In all models across different combinations of p and H, we see that the principal L2 support vector machine consistently improves over its L1 counterpart for λ = 1. When λ increases to 10 and 100, the estimation improves for both the L1 and L2 support vector machines, and the difference between the two methods becomes smaller. This verifies the theoretical finding from Section 3, where we showed that the two algorithms become equivalent as λ → ∞.

Example 2: This example examines the validity of estimating the structural dimension d via the modified BIC criterion (15). We include Model I and Model III from the previous example, and compare the principal Lq support vector machine with q = 1 and q = 2. The misclassification penalty is fixed at λ = 1. Across p = 10, 20, 30, H = 10, 20 and n = 200, 300, 400, we report in Table 2 the proportions that d is correctly estimated based on 200 repetitions. We see that both principal support vector machine estimators work reasonably well for Model I, where the true d = 1. In the more challenging case of Model III, where d = 2, the superiority of the principal L2 support vector machine becomes more obvious. The estimator d̂ based on the principal L1 support vector machine can perform very badly, especially when n = 200 or p = 30. As n increases and p decreases, both methods improve and attain a higher proportion of correctly identified d.
We repeated the experiments for p = 10 and λ = 10, 100 to further investigate the role of λ. We show the results in Table 3 along with the results for λ = 1 (which are the same as in Table 2). For Model I, the criterion remained perfect when q = 2, while its performance deteriorated when q = 1. For Model III, the performance deteriorated for larger λ for both q = 1 and q = 2. When λ = 10, q = 2 still outperformed q = 1, and for λ = 100 the two were mostly equivalent.

Example 3: This real data analysis demonstrates the effect of the misclassification penalty λ on the principal support vector machine estimators. Consider the concrete slump test data studied in Yeh (2007). The response variable is the concrete flow. There are 7 predictors: cement, slag, fly ash, water, superplasticizer (SP), coarse aggregate, and fine aggregate. The sample size is n = 103. Fixing H = 20 and d = 1, we compare the L1 and L2 estimators across λ = 1, 10, 1000. We report the components of β̂ in Table 4. Although the two estimators are seemingly different when λ = 1, they become very close to each other when λ = 1000. This confirms our findings in Section 3. In the first row of Figure 1, we provide scatterplots of Y versus β̂^T X based on the principal L1 support vector machine estimators. We see the patterns change significantly as λ increases. From the scatterplots in the second row of Figure 1, we see that the principal L2 support vector machine estimators are less sensitive to the choice of λ.

Example 4: We study nonlinear sufficient dimension reduction via the principal Lq support vector machine in this example. In addition to Model III, we consider Models IV and V, with X ∼ N(0, I_p), σ = 0.2, and ε ∼ N(0, 1) independent of X.
Set λ = 1, n = 100, p = 10, 20, 30, and H = 10, 20, 50. Based on the description in Section 5, we aim to find a monotone transformation of the sufficient predictor φ(X) for Models III, IV and V, respectively. To measure the accuracy of the nonlinear sufficient dimension reduction estimators, we report the absolute value of the Spearman correlation between φ̂(X) and φ(X). Note that this measure is invariant under monotone transformations. Table 5 is based on 200 repetitions, where values closer to 1 mean better estimation. The Gaussian radial basis kernel κ(X_i, X_j) = e^{−γ∥X_i − X_j∥^2} is used. We set the tuning parameter as γ = 1/(E∥X − X′∥)^2, where X and X′ are independent copies of N(0, I_p). Since the principal L2 support vector machine is mainly designed to improve existing linear sufficient dimension reduction estimators, it is comforting to observe in Table 5 that the principal L2 support vector machine is slightly better than its L1 counterpart for nonlinear sufficient dimension reduction as well.

Figure 1. Scatterplots of Y versus β̂^T X across λ = 1 (first column), λ = 10 (second column), and λ = 1000 (third column), for q = 1 (first row) and q = 2 (second row).
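The kernel tuning described above can be reproduced as follows (a sketch: the Monte Carlo approximation of E∥X − X′∥ and the toy monotone-transformation check are our own illustrations):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
p = 10
# gamma = 1 / (E||X - X'||)^2 for X, X' independent N(0, I_p),
# approximated here by a Monte Carlo average over 5000 independent pairs
X1 = rng.standard_normal((5000, p))
X2 = rng.standard_normal((5000, p))
mean_dist = np.mean(np.linalg.norm(X1 - X2, axis=1))
gamma = 1.0 / mean_dist ** 2
print(f"gamma ~ {gamma:.3f}")

# The Spearman correlation is invariant under strictly monotone
# transformations, which is why it is a fair accuracy measure when
# phi is recovered only up to a monotone transformation
phi = rng.standard_normal(100)
phi_hat = np.exp(2 * phi)            # a strictly increasing transformation
print(abs(spearmanr(phi, phi_hat)[0]))
```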

Discussion
We propose the principal Lq support vector machine for sufficient dimension reduction. Compared with its L1 counterpart, the principal Lq support vector machine estimator is more robust to the choice of the misclassification penalty, and enjoys more accurate estimation of the central space.
In an effort to combine the weighted support vector machine and sufficient dimension reduction, Shin et al. (2014) proposed probability-enhanced sufficient dimension reduction. A misclassification reweighting scheme for the principal L1 support vector machine was studied in Artemiou and Shu (2014). The development of weighted Lq support vector machines is worth exploring. Another open question concerns the choice of the tuning parameter λ. The bootstrap method in Ye and Weiss (2003) could potentially be used to facilitate the selection of λ, and the theoretical justification of such a procedure needs future investigation. A related limitation of our study is the lack of investigation of the role of λ in the theoretical framework. Another interesting question currently investigated by the authors is the use of equality instead of inequality in the constraint in (2). This leads to the least squares SVM (LSSVM) introduced by Suykens et al. (2002). In the classification context, the LSSVM gives an analytic solution, whereas the Lq SVM requires quadratic programming; the LSSVM suffers, however, in the sense that every point is considered a support vector. Whether similar advantages hold in the dimension reduction framework is still under investigation.
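To illustrate the LSSVM alternative mentioned above: replacing the inequality constraints with equalities reduces training to a single linear system. The sketch below uses a linear kernel and our own simulated two-class data; see Suykens et al. (2002) for the general formulation:

```python
import numpy as np

def lssvm_fit(X, y, gamma=1.0):
    """Least squares SVM: the KKT conditions of the equality-constrained
    problem form one linear system, so no quadratic programming is needed.
    Linear kernel for simplicity; Omega_ij = y_i y_j <x_i, x_j>."""
    n = len(y)
    Omega = (y[:, None] * y[None, :]) * (X @ X.T)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate([[0.0], np.ones(n)])
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    w = X.T @ (alpha * y)                     # primal normal vector
    return w, b

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(2.0, 1.0, (40, 2)),
               rng.normal(-2.0, 1.0, (40, 2))])
y = np.concatenate([np.ones(40), -np.ones(40)])
w, b = lssvm_fit(X, y)
print(np.mean(np.sign(X @ w + b) == y))       # training accuracy
```

The price of the analytic solution is that the α_i are generically all nonzero, i.e., every point acts as a support vector.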
From the definition θ = (ψ^T, t)^T, we have (23), and we get (24). Because E(∥X∥^2) < ∞, we have E(1 + ∥X∥^2)^2 < ∞, and from E(∥X∥^{q−1}) < ∞ we have that condition 2 of Lemma 1 is implied by (24). Now that the two conditions of Lemma 1 are verified, for w ∉ N_θ(m), take derivatives of m(θ, w) to obtain (25). Taking the expectation of (25) and applying Lemma 1 gives the desired result. ✷

To compute the derivative of the expectation of a non-Lipschitz function, two additional lemmas are needed before we prove Proposition 3. The first applies when U and V are linearly independent, and the second covers the case when they are linearly dependent. Let D_{ϵ=0} denote the operation of first taking the derivative with respect to ϵ and then evaluating the derivative at ϵ = 0.
Lemma 2. Let U and V be random variables, h(u, v) a measurable R^k-valued function, and b a constant. Suppose: 1. the joint distribution of (U, V) is dominated by the Lebesgue measure; 2. for each v, the function u ↦ f_{U|V}(u|v) is continuous, where f_{U|V} denotes the conditional probability density function of U given V; 3. each component h_i of h satisfies the stated integrability condition. Then, for any constant a, the function ϵ ↦ E{(b − U + ϵ(η − V))^{q−1} h(U, V) I(U + ϵV < a + ϵη)} is differentiable at ϵ = 0.

Proof. We need to show that, for each i = 1, ..., k, the limit in (27) exists. By the mean value theorem for integration, there exists ξ ∈ (0, ϵ) such that the displayed inequality holds, where the inequality follows from assumptions 2 and 3. By the dominated convergence theorem, the limit in (27) becomes (28). Applying the generalized Leibniz integral rule to (28), we get the desired result. ✷

Lemma 3. Let U and V be linearly dependent random variables and h(u) a measurable R^k-valued function. Suppose: 1. the distribution of U is dominated by the Lebesgue measure; 2. conditions analogous to those of Lemma 2 hold.
Proof. Suppose, without loss of generality, V = κU for some κ > 0. The generalized Leibniz integral rule then applies, and under the condition that U = a and V = κU we have the desired result. ✷

Proof of Proposition 3. The first term of D_θ[E{m(θ, W)}], namely (2ψ^T Σ, 0)^T, is jointly differentiable with derivative 2 diag(Σ, 0). We focus on the second term of D_θ[E{m(θ, W)}]. We first consider the case ỹ = 1 and verify the directional differentiability of the function (ψ, t) ↦ E{X†(1 + t − ψ^T X)^{q−1} I(ψ^T X < t + 1) | Ỹ = 1}. To do this, let δ and ψ be linearly independent vectors in R^p and let η be a number. The directional derivative along (δ^T, η)^T is the derivative, with respect to ϵ at ϵ = 0, of the corresponding perturbed expectation; its first term is equal to 0. Since this holds for all (δ^T, η)^T, the function (ψ, t) above is directionally differentiable. If δ and ψ are linearly dependent vectors in R^p, then δ^T X and ψ^T X are linearly dependent random variables, and we apply Lemma 3 in a similar fashion to arrive at the same directional derivative. The case ỹ = −1 can be proved similarly. Hence the directional derivative of D_θ[E{m(θ, W)}] is given by equation (14). ✷

Proof of Theorem 2.
The proof is similar to Jiang, Zhang and Cai (2008) and is thus omitted. A different approach can also be seen in Koo et al. (2008).

Denote the indicator function I_j = I{(X, Ỹ) ∈ I_j} and define P_j = P(I_j). Assume there exists a unique minimizer ψ. Since the first term in (30) is not affected by the value of t, we ignore it in this development. Assume E(X) = 0 without loss of generality. We focus on the second term of (30). Now define s = min{s_1, s_2}, where s_1 = min{1 − (ψ^T X − t) for (X, Ỹ) ∈ I_3} and s_2 = min{−1 − (ψ^T X − t) for (X, Ỹ) ∈ I_4}. According to Figure 2, the value of s is either the minimum distance from the purple circles to the blue dashed line, or the minimum distance from the red crosses to the green dashed line. Instead of the original separating hyperplane ψ^T X − t = 0, we now consider the new hyperplane ψ^T X − t′ = 0, where t′ = t − s. Note that s > 0, 1 − (ψ^T X − t′) = 1 − (ψ^T X − t) − s, and 1 + (ψ^T X − t′) = 1 + (ψ^T X − t) + s. With the new separating hyperplane, we observe:

1. All the points that were in I_1 satisfy 1 − (ψ^T X − t) < 0 and Ỹ = 1. Thus 1 − (ψ^T X − t′) < 0, and these points will still be correctly classified.
2. All the points that were in I_2 satisfy 1 − (ψ^T X − t) = 0 and Ỹ = 1. Thus 1 − (ψ^T X − t′) < 0, and these points will still be correctly classified.
3. All the points that were in I_3 satisfy 1 − (ψ^T X − t) > 0 and Ỹ = 1. Because s ≤ s_1 = min{1 − (ψ^T X − t) for (X, Ỹ) ∈ I_3}, we have 1 − (ψ^T X − t′) ≥ 1 − (ψ^T X − t) − s_1 ≥ 0. These points will now either continue to be incorrectly classified or become correctly classified as a point on the support vector; the latter happens if 1 − (ψ^T X − t) = s.
4. All the points that were in I_4 satisfy 1 + (ψ^T X − t) < 0 and Ỹ = −1. Because s ≤ s_2 = min{−1 − (ψ^T X − t) for (X, Ỹ) ∈ I_4}, we have 1 + (ψ^T X − t′) ≤ 1 + (ψ^T X − t) + s_2 ≤ 0.
These points will either continue to be correctly classified as non-support points or become correctly classified as a point on the support vector; the latter happens if −1 − (ψ^T X − t) = s.

Figure 2. All circles correspond to Ỹ = 1 and all crosses correspond to Ỹ = −1. The black circles, the blue circles and the purple circles belong to I_1, I_2 and I_3, respectively. The red crosses, the green crosses and the orange crosses belong to I_4, I_5 and I_6, respectively. The dashed blue line, the solid black line and the dashed green line correspond to ψ^T X − t = 1, ψ^T X − t = 0 and ψ^T X − t = −1, respectively.
Following Li, Artemiou and Li (2011), we discuss nonlinear sufficient dimension reduction via the principal L2 support vector machine in this section. Let H be a reproducing kernel Hilbert space of functions of X with inner product ⟨•, •⟩_H. We assume H to have finite dimensionality for technical convenience, although this is not required in general. Let Σ: H → H be the covariance operator such that ⟨f_1, Σf_2⟩_H = cov{f_1(X), f_2(X)}.

Table 1
Estimating S_{Y|X} via the principal Lq support vector machine. The means and standard errors of ∆ are reported based on 200 repetitions in Example 1.

Table 2
Estimating the structural dimension d via the principal Lq support vector machine. The proportions that d̂ = d are reported based on 200 repetitions in Example 2.

Table 3
Proportions of correct estimation of the structural dimension d via the principal Lq support vector machine for λ = 10, 100 and p = 10, for q = 1 and q = 2.

Table 4
Comparing the principal Lq support vector machine estimators across different λ. Components of β̂ are reported for the real data in Example 3.

Table 5
Estimating φ(X) for nonlinear sufficient dimension reduction. The means and standard errors of the Spearman correlation are reported based on 200 repetitions in Example 4.