Support Vector Regression for Right Censored Data

We develop a unified approach for classification and regression support vector machines for data subject to right censoring. We provide finite sample bounds on the generalization error of the algorithm, prove risk consistency for a wide class of probability measures, and study the associated learning rates. We apply the general methodology to estimation of the (truncated) mean, median, quantiles, and for classification problems. We present a simulation study that demonstrates the performance of the proposed approach.

1. Introduction. In many medical studies, the quantity of interest is the time until some event occurs, where typical events of interest include death and cancer remission. The time at which the event of interest occurs is called the failure time. Learning the failure time distribution function, or quantities that depend on this distribution, as a function of the medical state of the patient, is one of the main goals of survival analysis research. Since the medical study may end before the failure event occurs, and the patient may drop out of the study, data are typically subject to right censoring. In this case, the failure time is not known, and instead, a lower bound on the failure time is given. Consequently, when applying learning methods to data from such studies, one needs to take into account the censored nature of the observations. Estimation of a patient's failure time quantities, such as the expectation and the median, is usually done under stringent assumptions on the failure time distribution function. Commonly used distribution models include parametric models such as the Weibull distribution, and semiparametric models such as proportional hazard models (see Lawless, 2003, for both). Even when less stringent models such as nonparametric estimation are used, it is typically assumed that the distribution function is smooth in both time and covariates (Dabrowska, 1987; Gonzalez-Manteiga and Cadarso-Suarez, 1994). These assumptions seem restrictive, especially when considering today's high-dimensional data settings.
* The first author was funded in part by a Gillings Innovation Laboratory (GIL) award at the UNC Gillings School of Global Public Health. The second author was funded in part by NCI grant CA142538.
In this paper, we propose a support vector machine (SVM) learning method for right censored data. The choice of SVM is motivated by the fact that SVM learning methods are easy-to-compute techniques that enable estimation under weak or no assumptions on the distribution (Steinwart and Christmann, 2008). SVM learning methods, which we review in detail in Section 2, are a collection of algorithms that attempt to minimize the risk with respect to some loss function. An SVM learning method typically minimizes a regularized version of the empirical risk over some reproducing kernel Hilbert space (RKHS). The obtained minimizer is referred to as the SVM decision function. The SVM learning method is the mapping that assigns to each data set its corresponding SVM decision function.
We adapt the SVM framework to right censored data as follows. First, we represent the distribution's quantity of interest as a Bayes decision function, i.e., a function that minimizes the risk with respect to a loss function. We then construct a data-dependent version of this loss function using inverse-probability-of-censoring weighting (Robins et al., 1994). We minimize a regularized empirical risk with respect to this data-dependent loss function to obtain an SVM decision function for censored data. Finally, we define the SVM learning method for censored data, or censored SVM learning method for short, as the mapping that assigns to every censored data set its corresponding SVM decision function.
Note that unlike the standard SVM decision function, the censored SVM decision function is obtained as the minimizer of a data-dependent loss function. In other words, for each data set, a different minimization loss function is defined. Moreover, minimizing the empirical risk no longer consists of minimizing a sum of i.i.d. observations. Consequently, different techniques are needed to study the theoretical properties of the censored SVM learning method.
We prove a number of theoretical results for the proposed censored SVM learning method. We first prove that the censored SVM decision function is measurable and unique. We then show that the censored SVM learning method is a measurable learning method. We provide a probabilistic finite-sample bound on the difference in risk between the learned censored SVM decision function and the infimum risk within the RKHS. We further show that the censored SVM learning method is consistent for every probability measure for which the censoring is independent of the failure time given the covariates, and the probability that no censoring occurs is positive given the covariates. Finally, we compute learning rates for the censored SVM learning method. We also provide a simulation study that demonstrates the performance of the censored SVM learning method. Our results are obtained under some conditions on the approximation RKHS and the loss function, which can be easily verified. We also assume that the estimation of the censoring probability at the observed points is consistent.
One drawback of the proposed approach is the need to estimate the censoring probability at observed points. This estimation is required in order to use inverse-probability-of-censoring weighting for constructing the data-dependent loss function. We remark that in many applications it is reasonable to assume that the censoring mechanism is simpler than the failure-time distribution; in these cases, estimation of the censoring distribution is typically easier than estimation of the failure distribution. For example, the censoring may depend only on a subset of the covariates, or may be independent of the covariates; in the latter case, an efficient estimator exists. Moreover, when the only source of censoring is administrative, in other words, when the data are censored because the study ends at a prespecified time, the censoring distribution is often known to be independent of the covariates. The results presented in this paper hold for any censoring estimation technique. We present results for both correctly specified and misspecified censoring models. We also discuss in detail the special cases of the Kaplan-Meier and the Cox model estimators (Fleming and Harrington, 1991).
While the main contribution of this paper is the proposed censored SVM learning method and the study of its properties, an additional contribution is the development of a machine learning framework for right censored data. The principles and definitions that we discuss in the context of right censored data, such as learning methods, measurability, consistency, and learning rates, are independent of the proposed SVM learning method. This framework can be adapted to other learning methods for right censored data, as well as for learning methods for other missing data mechanisms.
Other learning algorithms have been suggested for survival data. Biganzoli et al. (1998) and Ripley and Ripley (2001) used neural networks. Johnson et al. (2004), Shivaswamy et al. (2007), Shim and Hwang (2009), and Zhao et al. (2011), among others, suggested different versions of SVM. As far as we know, the theoretical properties of these algorithms have never been studied. In the context of multistage decision problems, Goldberg and Kosorok (2012) proposed a Q-learning algorithm for right censored data for which a theoretical justification is given. However, the algorithm discussed therein is not an SVM learning method, and it is assumed that the censoring is independent of both failure time and covariates. We believe that this work is an important step in developing methodology for learning from survival data.
The paper is organized as follows. In Section 2 we review right-censored data and SVM learning methods. In Section 3 we discuss the use of SVM for survival data when no censoring is present. Section 4 discusses the difficulties that arise when applying SVM to right censored data and proposes a censored SVM learning method. Section 5 contains the main theoretical results, including finite sample bounds and consistency. Simulations appear in Section 6. Concluding remarks appear in Section 7. The lengthier proofs are provided in the Appendix.

2. Preliminaries.
In this section we establish the notation used throughout the paper. We begin by introducing right censored data (Section 2.1). We then discuss loss functions (Section 2.2).
Finally, we discuss SVM learning methods (Section 2.3). For right censored data we follow Fleming and Harrington (1991) (hereafter abbreviated FH91). For the loss function and the SVM definitions we follow Steinwart and Christmann (2008) (hereafter abbreviated SC08).
2.1. Right Censored Data. We assume that the data consist of $n$ independent and identically distributed random triplets $D = \{(Z_1, U_1, \delta_1), \ldots, (Z_n, U_n, \delta_n)\}$. The random vector $Z$ is a covariate vector that takes its values in a compact set $\mathcal{Z} \subset \mathbb{R}^d$. The random variable $U$ is the observed time, defined by $U = T \wedge C$, where $T \ge 0$ is the failure time and $C$ is the censoring time. The failure indicator is $\delta = 1\{T \le C\}$, i.e., $\delta = 1$ whenever a failure time is observed and $\delta = 0$ otherwise.
Let $S(t|Z) = P(T > t|Z)$ be the survival function of $T$, and let $G(t|Z) = P(C > t|Z)$ be the survival function of $C$. We make the following assumptions:
(A1) $C$ takes its values in the segment $[0, \tau]$ for some finite $\tau > 0$, and $\inf_{z \in \mathcal{Z}} G(\tau-|z) \ge 2K > 0$.
(A2) C is independent of T , given Z.
The first assumption assures that there is a positive probability of censoring over the observation time range ([0, τ ]). Note that the existence of such a τ is typical since most studies have a finite time period of observation. In the above, we also define F (t−) to be the left-hand limit of a right continuous function F with left-hand limits. The second assumption is standard in survival analysis and ensures that the joint nonparametric distribution of the survival and censoring times, given the covariates, is identifiable.
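The data structure above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' simulation design: the function name `simulate_right_censored`, the exponential distributions, and the value `tau = 2.0` are all assumptions chosen for illustration. The censoring time is capped at `tau` and drawn independently of the failure time and the covariate, so that (A1) and (A2) hold by construction.

```python
import numpy as np

def simulate_right_censored(n, tau=2.0, seed=0):
    """Draw n triplets (Z, U, delta) with U = min(T, C) and delta = 1{T <= C}."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=n)                         # covariate in the compact set [0, 1]
    t = rng.exponential(scale=1.0 + z)                        # failure time; its mean depends on Z
    c = np.minimum(rng.exponential(scale=2.0, size=n), tau)   # censoring time in [0, tau]
    u = np.minimum(t, c)                                      # observed time U = min(T, C)
    delta = (t <= c).astype(int)                              # 1 if the failure is observed
    return z, u, delta

z, u, delta = simulate_right_censored(1000)
censoring_rate = 1.0 - delta.mean()                           # fraction of censored observations
```

In this illustrative design, P(C >= tau) = exp(-1) > 0, so the lower bound in (A1) is satisfied with 2K = exp(-1).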
We assume that the censoring mechanism can be described by some simple model. Below, we consider two possible examples, although the main results do not require any specific model. First, we need some notation. For every $t \in [0, \tau]$, define $N(t) = 1\{U \le t, \delta = 0\}$ and $Y(t) = 1\{U > t\} + 1\{U = t, \delta = 0\}$. Note that since we are interested in the survival function of the censoring variable, $N(t)$ is the counting process for the censoring, and not for the failure events, and $Y(t)$ is the at-risk process for observing a censoring time. For a cadlag function $A$ on $(0, \tau]$, define the product integral $\phi(A)(t) = \prod_{0 < s \le t} (1 + dA(s))$ (van der Vaart and Wellner, 1996). Define $P_n$ to be the empirical measure, i.e., $P_n f = n^{-1} \sum_{i=1}^n f(Z_i, U_i, \delta_i)$.
Example 1. Independent censoring: Assume that $C$ is independent of both $T$ and $Z$. Define
$$\hat\Lambda(t) = \int_0^t \frac{P_n\, dN(s)}{P_n Y(s)}.$$
Then $\hat G_n(t) = \phi(-\hat\Lambda)(t)$ is the Kaplan-Meier estimator for $G$. $\hat G_n$ is a consistent and efficient estimator for the survival function $G$ (FH91).
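The Kaplan-Meier estimator of the censoring survival function in Example 1 can be sketched as follows. This is a minimal illustration that takes the product integral over the observed times with the counting process N and the at-risk process Y exactly as defined above; the function name `km_censoring_survival` is hypothetical, and the increment "number censored at s divided by number at risk at s" is the standard Nelson-Aalen increment, assumed here.

```python
import numpy as np

def km_censoring_survival(u, delta):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring (delta == 0) as the event.

    Returns the distinct observed times and G_hat evaluated at each of them.
    """
    u = np.asarray(u, dtype=float)
    delta = np.asarray(delta, dtype=int)
    times = np.unique(u)
    g_hat = np.empty_like(times)
    surv = 1.0
    for j, s in enumerate(times):
        # Y(s): observed time beyond s, or censored exactly at s.
        at_risk = np.sum(u > s) + np.sum((u == s) & (delta == 0))
        # Jump of the censoring counting process N at s.
        d_cens = np.sum((u == s) & (delta == 0))
        if at_risk > 0:
            surv *= 1.0 - d_cens / at_risk      # product-integral factor 1 - dLambda_hat(s)
        g_hat[j] = surv
    return times, g_hat

times, g_hat = km_censoring_survival([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0])
```

For the four observations in the usage line, the estimate stays at 1 at the first (uncensored) time, drops to 2/3 at the censored time 2, is unchanged at time 3, and drops to 0 at the last censored time.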
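Example 2. The proportional hazards model: Consider the case that the hazard of $C$ given $Z$ is of the form $e^{Z'\beta} d\Lambda$ for some unknown vector $\beta \in \mathbb{R}^d$ and some continuous unknown nondecreasing function $\Lambda$ with $\Lambda(0) = 0$ and $0 < \Lambda(\tau) < \infty$. Let $\hat\beta$ be the zero of the estimating equation
$$\Phi_n(\beta) = P_n \int_0^\tau \left( Z - \frac{P_n Z e^{\beta' Z} Y(s)}{P_n e^{\beta' Z} Y(s)} \right) dN(s),$$
and define
$$\hat\Lambda(t) = \int_0^t \frac{P_n\, dN(s)}{P_n e^{\hat\beta' Z} Y(s)}.$$
Then $\hat G_n(t|z) = \phi(-e^{\hat\beta' z} \hat\Lambda)(t)$ is a consistent and efficient estimator for the survival function $G$ (FH91).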
Even when no simple model for the censoring mechanism is assumed, the censoring distribution can be estimated using a generalization of the Kaplan-Meier estimator of Example 1.
Example 3. Generalized Kaplan-Meier: Let $k_\sigma : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ be a kernel function of width $\sigma$. Define
$$\hat\Lambda(t|z) = \int_0^t \frac{P_n k_\sigma(z, Z)\, dN(s)}{P_n k_\sigma(z, Z) Y(s)}.$$
Then the generalized Kaplan-Meier estimator is given by $\hat G_n(t|z) = \phi(-\hat\Lambda)(t|z)$, where the product integral $\phi$ is defined for every fixed $z$. Under some conditions, Dabrowska (1987, 1989) proved consistency of the estimator and discussed its convergence rates.
Usually we denote the estimator of the survival function of the censoring variable $G(t|Z)$ by $\hat G_n(t|Z)$ without referring to a specific estimation method. When needed, the specific estimation method will be discussed. When independent censoring is assumed, as in Example 1, we denote the estimator by $\hat G_n(t)$.
Remark 4. By Assumption (A1), $\inf_{z \in \mathcal{Z}} G(\tau|z) \ge 2K > 0$, and thus if the estimator $\hat G_n$ is consistent for $G$, then, for $n$ large enough, $\inf_{z \in \mathcal{Z}} \hat G_n(\tau|z) > K > 0$. In the following, for simplicity, we assume that the estimator $\hat G_n$ is such that $\inf_{z \in \mathcal{Z}} \hat G_n(\tau|z) > K > 0$. In general, one can always replace $\hat G_n$ by $\hat G_n \vee K_n$, where $K_n \to 0$. In that case, for all $n$ large enough, $\inf_{z \in \mathcal{Z}} \hat G_n(\tau|z) > K > 0$, and for all $n$, $\inf \hat G_n > 0$.

2.2. Loss Functions.
Let the input space (Z, A) be a measurable space. Let the response space Y be a closed subset of R. Let P be a measure on Z × Y.
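A function $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ is called a loss function if it is measurable. We say that a loss function $L$ is convex if $L(z, y, \cdot)$ is convex for every $z \in \mathcal{Z}$ and $y \in \mathcal{Y}$. We say that a loss function $L$ is locally Lipschitz continuous if for every $a \ge 0$ there exists a constant $c_L(a) \ge 0$ such that
$$\sup_{z \in \mathcal{Z},\, y \in \mathcal{Y}} |L(z, y, s) - L(z, y, s')| \le c_L(a)\, |s - s'|, \qquad s, s' \in [-a, a].$$
We say that $L$ is Lipschitz continuous if there is a constant $c_L$ such that the above holds for any $a$ with $c_L(a) = c_L$.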
For any measurable function f : Z → R we define the L-risk of f with respect to the measure P as R L,P (f ) = E P [L(Z, Y, f (Z))]. We define the Bayes risk R * L,P of f with respect to loss function L and measure P as inf f R L,P (f ), where the infimum is taken over all measurable functions f : Z → R. A function f * L,P that achieves this infimum is called a Bayes decision function.
We present a few examples of loss functions and their respective Bayes decision functions. In the next section we discuss the use of these loss functions for right censored data.
Example 5. Binary classification: Assume that $\mathcal{Y} = \{-1, 1\}$. We would like to find a function $f : \mathcal{Z} \to \{-1, 1\}$ such that for almost every $z$, $P(f(z) = Y | Z = z) \ge 1/2$. One can think of $f$ as a function that predicts the label $y$ of a pair $(z, y)$ when only $z$ is observed. In this case, the desired function is the Bayes decision function $f^*_{L,P}$ with respect to the loss function $L_{BC}(z, y, s) = 1\{y \cdot \mathrm{sign}(s) \ne 1\}$, which equals one exactly when the prediction is wrong. In practice, since the loss function $L_{BC}$ is not convex, it is usually replaced by the hinge loss function $L_{HL}(z, y, s) = \max\{0, 1 - ys\}$.
Example 6. Expectation: Assume that $\mathcal{Y} = \mathbb{R}$. We would like to estimate the expectation of the response $Y$ given the covariates $Z$. The conditional expectation is the Bayes decision function $f^*_{L,P}$ with respect to the squared error loss function $L_{LS}(z, y, s) = (y - s)^2$.
Example 7. Median and quantiles: Assume that $\mathcal{Y} = \mathbb{R}$. We would like to estimate the median of $Y|Z$. The conditional median is the Bayes decision function $f^*_{L,P}$ for the absolute deviation loss function $L_{AD}(z, y, s) = |y - s|$. Similarly, the $\alpha$-quantile of $Y$ given $Z$ is obtained as the Bayes decision function for the loss function
$$L_\alpha(z, y, s) = \begin{cases} -(1 - \alpha)(y - s) & \text{if } s \ge y, \\ \alpha (y - s) & \text{if } s < y. \end{cases}$$
Note that the functions $L_{HL}$, $L_{LS}$, $L_{AD}$, and $L_\alpha$ for $\alpha \in (0, 1)$ are all convex. Moreover, all these functions except $L_{LS}$ are Lipschitz continuous, and $L_{LS}$ is locally Lipschitz continuous when $\mathcal{Y}$ is compact.
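The correspondence between loss functions and Bayes decision functions in Examples 5-7 can be checked numerically in the covariate-free case. The sketch below is illustrative only: the function names are hypothetical, the quantile loss is taken in its standard pinball form, and the minimizer is located by a plain grid search over constant predictions rather than by any method from the paper.

```python
import numpy as np

def hinge_loss(y, s):
    return np.maximum(0.0, 1.0 - y * s)

def squared_loss(y, s):
    return (y - s) ** 2

def absolute_loss(y, s):
    return np.abs(y - s)

def pinball_loss(y, s, alpha):
    r = y - s
    # alpha * r when the prediction is below y, (1 - alpha) * |r| when it is above.
    return np.where(r >= 0, alpha * r, (alpha - 1.0) * r)

def grid_minimizer(loss, y, grid, **kwargs):
    """Constant prediction on the grid with the smallest empirical risk."""
    risks = np.array([loss(y, s, **kwargs).mean() for s in grid])
    return grid[np.argmin(risks)]

rng = np.random.default_rng(0)
y = rng.exponential(1.0, size=5000)
grid = np.linspace(0.0, 5.0, 501)

s_mean = grid_minimizer(squared_loss, y, grid)             # close to the sample mean
s_median = grid_minimizer(absolute_loss, y, grid)          # close to the sample median
s_q90 = grid_minimizer(pinball_loss, y, grid, alpha=0.9)   # close to the 0.9-quantile
```

With the grid spacing of 0.01 used here, each minimizer agrees with the corresponding sample statistic up to roughly one grid step.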
2.3. Support Vector Machine (SVM) Learning Methods. Let L be a convex locally Lipschitz continuous loss function. Let H be a separable reproducing kernel Hilbert space (RKHS) of a bounded measurable kernel on Z (for details regarding RKHS, the reader is referred to SC08, Chapter 4).
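Let $D_0 = \{(Z_1, Y_1), \ldots, (Z_n, Y_n)\}$ be a set of $n$ i.i.d. observations drawn according to the probability measure $P$. Fix $\lambda > 0$ and let $H$ be as above. Define the empirical SVM decision function
$$f_{D_0,\lambda} = \arg\min_{f \in H} \; \lambda \|f\|_H^2 + R_{L,D_0}(f), \tag{1}$$
where
$$R_{L,D_0}(f) = \frac{1}{n} \sum_{i=1}^n L(Z_i, Y_i, f(Z_i))$$
is the empirical risk.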
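For some sequence $\{\lambda_n\}$, define the SVM learning method L as the map
$$(\mathcal{Z} \times \mathcal{Y})^n \times \mathcal{Z} \to \mathbb{R}, \qquad (D_0, z) \mapsto f_{D_0,\lambda_n}(z), \tag{2}$$
for all $n \ge 1$. We say that L is measurable if it is measurable for all $n$ with respect to the minimal completion of the product σ-field on $(\mathcal{Z} \times \mathcal{Y})^n \times \mathcal{Z}$. We say that L is (L-risk) $P$-consistent if
$$\lim_{n \to \infty} P\big( D_0 \in (\mathcal{Z} \times \mathcal{Y})^n : R_{L,P}(f_{D_0,\lambda_n}) \le R^*_{L,P} + \varepsilon \big) = 1 \tag{3}$$
for all $\varepsilon > 0$. We say that L is universally consistent if for all distributions $P$ on $\mathcal{Z} \times \mathcal{Y}$, L is $P$-consistent.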
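For the squared error loss of Example 6, the regularized empirical risk minimization that defines the SVM decision function has a closed-form solution. The following is a minimal sketch under assumptions that are not part of the text: a Gaussian kernel with a hypothetical width of 0.5, the representer form f = sum_j coef[j] k(., Z_j), and the illustrative function names `gaussian_kernel`, `svm_objective`, and `fit_ls_svm`. Setting the gradient of the objective to zero gives coef = (K + n*lam*I)^{-1} y.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    """Gaussian kernel matrix between the rows of a (m x d) and the rows of b (n x d)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / width ** 2)

def svm_objective(coef, K, y, lam):
    """lam * ||f||_H^2 plus the empirical squared-error risk, for f = sum_j coef[j] k(., z_j)."""
    return lam * coef @ K @ coef + np.mean((y - K @ coef) ** 2)

def fit_ls_svm(z, y, lam, width=0.5):
    """Closed-form minimizer of svm_objective over the span of the kernel sections."""
    n = len(y)
    K = gaussian_kernel(z, z, width)
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)
    return coef, K

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * z[:, 0]) + 0.1 * rng.standard_normal(200)
coef, K = fit_ls_svm(z, y, lam=1e-3)

# Evaluate the fitted decision function at the new point z = 0.5.
f_at_half = gaussian_kernel(np.array([[0.5]]), z) @ coef
```

Other convex losses such as the hinge or absolute deviation loss do not admit a closed form and would require a numerical convex solver; the squared loss is used here only because it keeps the sketch short.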
We summarize some known results regarding SVM learning methods.
Theorem 8. Let $L$ be a convex locally Lipschitz continuous loss function. Choose $0 < \lambda_n < 1$ such that $\lambda_n \to 0$ and $\lambda_n^2 n \to \infty$. Then
(a) The empirical SVM decision function $f_{D_0,\lambda_n}$ exists and is unique.
(d) If the RKHS H is dense in the set of integrable functions on Z, then the SVM learning method L is universally consistent.
The proof of (a) follows from SC08, Lemma 5.1 and Theorem 5.2. For proof of (b), see SC08, Lemma 6.23. The proof of (c) follows from SC08 Theorem 6.24. The proof of (d) follows from SC08, Theorem 5.31, together with Theorem 6.24.
3. SVM for Survival Data without Censoring. In this section we present a few examples of the use of SVM for survival data. We discuss the case in which no censoring is introduced.
We show how different quantities of the conditional distribution of T given Z can be represented as Bayes decision functions. We then show how SVM learning methods can be applied to these estimation problems and review theoretical properties of these SVM learning methods. In the next section we will explain why the standard SVM techniques cannot be employed directly when censoring is introduced.
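Let $(Z, T)$ be a random vector where $Z$ is a covariate vector that takes its values in a compact set $\mathcal{Z} \subset \mathbb{R}^d$, $T$ is a failure time that takes its values in $\mathcal{T}$, and $(Z, T)$ is distributed according to a probability measure $P$ on $\mathcal{Z} \times \mathcal{T}$.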
Note that the conditional expectation $E_P[T|Z]$ is the Bayes decision function for the least squares loss function $L_{LS}$. In other words,
$$E_P[T \mid Z] = \arg\min_f E_P\big[ L_{LS}(Z, T, f(Z)) \big],$$
where the minimization is taken over all measurable real functions on $\mathcal{Z}$ (see Example 6). Similarly, the conditional median and the $\alpha$-quantile of $T|Z$ can be shown to be the Bayes decision functions for the absolute deviation loss function $L_{AD}$ and $L_\alpha$, respectively (see Example 7). In the same manner, one can represent other quantities of the conditional distribution $T|Z$ using Bayes decision functions.
Defining quantities of the survival function as Bayes decision functions is not limited to regression (i.e., to a continuous response). Classification problems can also arise in the analysis of survival data (see, for example, Ripley and Ripley, 2001; Johnson et al., 2004). For example, let $\rho$, $0 < \rho < \tau$, be a cutoff constant. Assume that survival to a time greater than $\rho$ is considered as death unrelated to the disease (i.e., remission) and a survival time less than or equal to $\rho$ is considered as death resulting from the disease. Denote
$$Y(T) = \begin{cases} 1 & \text{if } T > \rho, \\ -1 & \text{if } T \le \rho. \end{cases} \tag{4}$$
In that case, the decision function that predicts remission when the probability of $Y = 1$ given the covariates is greater than 1/2 and failure otherwise is a Bayes decision function for the binary classification loss $L_{BC}$ of Example 5.
For regression problems, $Y$ is typically the identity function, and for classification $Y$ can be defined, for example, as in (4). Let $L$ be a convex locally Lipschitz continuous loss function. Note that this includes the loss functions $L_{LS}$, $L_{AD}$, $L_\alpha$, and $L_{HL}$. Define the empirical decision function as in (1) and the SVM learning method L as in (2). Then it follows from Theorem 8 that, for an appropriate RKHS $H$ and regularization sequence $\{\lambda_n\}$, L is measurable and universally consistent.

4. Censored SVM.
In the previous section, we presented a few examples of the use of SVM for survival data. In this section we explain why standard SVM techniques cannot be applied directly when censoring is introduced. We then explain how to use inverse-probability-of-censoring weighting (Robins et al., 1994) to obtain a censored SVM learning method. Finally, we show that the obtained censored SVM learning method is well defined.
Let $D = \{(Z_1, U_1, \delta_1), \ldots, (Z_n, U_n, \delta_n)\}$ be a set of $n$ i.i.d. random triplets of right censored data (see Section 2.1). Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex locally Lipschitz loss function. Let $H$ be a separable RKHS of a bounded measurable kernel on $\mathcal{Z}$. We would like to find an empirical SVM decision function. In other words, we would like to find the minimizer of
$$\lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^n L(Z_i, Y(T_i), f(Z_i)), \tag{5}$$
where $\lambda > 0$ is a fixed constant, and $Y : \mathcal{T} \to \mathcal{Y}$ is a known function. The problem is that the failure times $T_i$ may be censored, and thus unknown. While a simple solution is to ignore the censored observations, it is well known that this can lead to severe bias (Tsiatis, 2006).
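The bias from discarding censored observations is easy to see in a toy simulation. The sketch below is an illustration under assumed distributions (unit-mean exponential failure times and uniform censoring on [0, 2], neither taken from the paper); the comparison with the mean of the latent failure times is possible only because the simulation has access to them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
t = rng.exponential(1.0, size=n)       # latent failure times, mean 1
c = rng.uniform(0.0, 2.0, size=n)      # censoring times, independent of the failure times
u = np.minimum(t, c)                   # observed times
delta = t <= c                         # True when the failure is observed

naive_mean = u[delta].mean()           # average over uncensored observations only
true_mean = t.mean()                   # average of the latent failure times
```

Long failure times are more likely to be censored, so the uncensored subsample over-represents short failure times and `naive_mean` falls well below `true_mean`.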
In order to avoid this bias, one can reweight the uncensored observations. Note that at time T i , the i-th observation has probability G(T i − |Z i ) ≡ P (C i ≥ T i |Z i ) not to be censored, and thus, one can use the inverse of the censoring probability for reweighting in (5) (Robins et al., 1994).
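The reason this reweighting removes the bias can be written out in one step. The following is a sketch for a generic integrable function $h$ (notation introduced here for illustration), assuming (A2) and $G(T-\mid Z) > 0$; it uses the facts that $U = T$ on the event $\{\delta = 1\}$ and that $\delta = 1\{C \ge T\}$.

```latex
\begin{aligned}
E\left[ \frac{\delta\, h(Z, U)}{G(U- \mid Z)} \,\Big|\, Z, T \right]
  &= E\left[ \frac{1\{C \ge T\}\, h(Z, T)}{G(T- \mid Z)} \,\Big|\, Z, T \right] \\
  &= \frac{h(Z, T)}{G(T- \mid Z)}\, P(C \ge T \mid Z, T)
   = \frac{h(Z, T)}{G(T- \mid Z)}\, G(T- \mid Z)
   = h(Z, T).
\end{aligned}
```

Taking $h(Z, T) = L(Z, Y(T), f(Z))$ shows that, with the true $G$, the weighted loss on the observed data has the same expectation as the loss on the uncensored data.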
More specifically, define the random loss function $L^n$ by
$$L^n\big(D, (z, u, \delta, s)\big) = \frac{\delta\, L(z, Y(u), s)}{\hat G_n(u|z)},$$
where $\hat G_n$ is the estimator of the survival function of the censoring variable based on the set of $n$ random triplets $D$ (see Section 2.1). When $D$ is given, we denote $L^n_D(\cdot) \equiv L^n(D, \cdot)$. Note that the function $L^n_D$ is no longer random. In order to show that $L^n_D$ is a loss function, we need to show that $L^n_D$ is a measurable function.
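Lemma 9. Let $L$ be a convex locally Lipschitz loss function. Assume that the estimation procedure $D \mapsto \hat G_n(\cdot|\cdot)$ is measurable. Then for every data set $D$, the function $L^n_D$ is measurable.
Proof. By Remark 4, the function $(u, z) \mapsto 1/\hat G_n(u|z)$ is well defined. Since by definition, both $Y$ and $L$ are measurable, we obtain that $(z, u, \delta, s) \mapsto \delta L(z, Y(u), s)/\hat G_n(u|z)$ is measurable.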
We define the empirical censored SVM decision function to be
$$f^c_{D,\lambda} = \arg\min_{f \in H} \; \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^n L^n_D(Z_i, U_i, \delta_i, f(Z_i)). \tag{6}$$
The existence and uniqueness of the empirical censored SVM decision function is ensured by the following lemma:
Lemma 10. Let $L$ be a convex locally Lipschitz loss function. Let $H$ be a separable RKHS of a bounded measurable kernel on $\mathcal{Z}$. Then there exists a unique empirical censored SVM decision function.
Proof. Note that given D, the loss function L n D (z, u, δ, ·) is convex for every fixed z, u, and δ.
Hence, the result follows from Lemma 5.1 together with Theorem 5.2 of SC08.
Note that the empirical censored SVM decision function is just the empirical SVM decision function of (1), after replacing the loss function $L$ with the loss function $L^n_D$. However, there are two important implications to this replacement. First, empirical censored SVM decision functions are obtained by minimizing a different loss function for each given data set. Second, the second term in the minimization problem (6), namely,
$$\frac{1}{n} \sum_{i=1}^n L^n_D(Z_i, U_i, \delta_i, f(Z_i)) = \frac{1}{n} \sum_{i=1}^n \frac{\delta_i\, L(Z_i, Y(U_i), f(Z_i))}{\hat G_n(U_i|Z_i)},$$
is no longer constructed from a sum of i.i.d. random variables.
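For the squared error loss with Y the identity, the censored minimization problem reduces to a weighted kernel ridge regression, which gives a compact way to see the data-dependent weights in action. The sketch below is an illustration under several assumptions that are not from the paper: independent censoring estimated by Kaplan-Meier (evaluated at the left limit, and assuming no ties among censoring times), a Gaussian kernel of hypothetical width 0.5, failure times truncated at tau, and illustrative function names throughout.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    """Gaussian kernel matrix between the rows of a (m x d) and the rows of b (n x d)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / width ** 2)

def km_censoring_left_limit(u, delta):
    """Kaplan-Meier estimate G_hat(U_i-) of the censoring survival function.

    Censoring (delta == 0) is the event. Assumes no ties among censoring times.
    """
    n = len(u)
    order = np.argsort(u, kind="stable")
    censored = delta[order] == 0
    at_risk = n - np.arange(n)                            # still under observation at each sorted time
    factors = np.where(censored, 1.0 - 1.0 / at_risk, 1.0)
    surv_after = np.cumprod(factors)                      # G_hat just after each sorted time
    surv_before = np.concatenate(([1.0], surv_after[:-1]))
    g_left = np.empty(n)
    g_left[order] = surv_before
    return g_left

def censored_objective(coef, K, y, w, lam):
    """lam * ||f||_H^2 plus the weighted empirical squared-error risk."""
    return lam * coef @ K @ coef + np.mean(w * (y - K @ coef) ** 2)

def fit_censored_ls_svm(z, u, delta, lam, width=0.5):
    n = len(u)
    w = delta / km_censoring_left_limit(u, delta)         # weights delta_i / G_hat; zero if censored
    K = gaussian_kernel(z, z, width)
    # Stationarity of the weighted objective: (W K + n*lam*I) coef = W y, with y = U.
    coef = np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * u)
    return coef, K, w

# Simulated data: failure time truncated at tau, censoring independent of (T, Z).
rng = np.random.default_rng(0)
n, tau = 500, 2.0
z = rng.uniform(0.0, 1.0, size=(n, 1))
t = np.minimum(rng.exponential(scale=0.5 + z[:, 0]), tau)
c = np.minimum(rng.exponential(scale=3.0, size=n), tau)
u, delta = np.minimum(t, c), (t <= c).astype(float)
coef, K, w = fit_censored_ls_svm(z, u, delta, lam=1e-3)
```

Censored observations receive weight zero and uncensored ones are up-weighted by the inverse of the estimated probability of remaining uncensored, so later failure times, which are more likely to be censored, count for more. For losses without a closed form, the same weights would enter a numerical convex solver.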
We would like to show that the learning method defined by the empirical censored SVM decision functions is indeed a learning method. We first define the term learning method for right censored data or censored learning method for short.
Choose 0 < λ n < 1 such that λ n → 0. Define the censored SVM learning method L c , as L c (D) = f c D,λn for all n ≥ 1. The measurability of the censored SVM learning method L c is ensured by the following lemma, which is an adaptation of Lemma 6.23 of SC08 to the censored case.
Lemma 12. Let L be a convex locally Lipschitz loss function. Let H be a separable RKHS of a bounded measurable kernel on Z. Assume that the estimation procedure D → G(·|·) is measurable.
Then the censored SVM learning method L c is measurable, and the map D → f c D,λn is measurable.
Proof. First, by Lemma 2.11 of SC08, for any $f \in H$, the map (z, is measurable. By Lemma 10, $f^c_{D,\lambda_n}$ is the only element of $H$ satisfying By Aumann's measurable selection principle (SC08, Lemma A.3.18), the map $D \mapsto f^c_{D,\lambda_n}$ is measurable with respect to the minimal completion of the product σ-field on $(\mathcal{Z} \times \mathcal{T} \times \{0, 1\})^n$. Since
5. Theoretical Results. In the following, we discuss some theoretical results regarding the censored SVM learning method proposed in Section 4. In Section 5.1 we discuss finite sample bounds. In Section 5.2 we discuss consistency. Learning rates are discussed in Section 5.3. Finally, censoring model misspecification is discussed in Section 5.4.
Define the censoring estimation error to be the difference between the estimated and true survival functions of the censoring variable, that is, $\mathrm{Err}_n(t, z) = \hat G_n(t|z) - G(t|z)$.
Let $H$ be an RKHS over the covariates space $\mathcal{Z}$, where we assume throughout this section that $\mathcal{Z}$ is a compact subset of $\mathbb{R}^d$. Let $B_H \equiv \{f \in H : \|f\|_H \le 1\}$ be the unit ball in the RKHS $H$.
Denote by $N(B_H, \|\cdot\|_\infty, \varepsilon)$ the $\varepsilon$-covering number of $B_H$ with respect to the norm $\|\cdot\|_\infty$, i.e., the minimum number of sup-norm $\varepsilon$-balls that cover $B_H$.
We are now ready to establish a finite-sample bound for the generalization error of censored SVM learning methods:
Theorem 13. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex, locally Lipschitz continuous loss function satisfying $L(z, y, 0) \le 1$ for all $(z, y) \in \mathcal{Z} \times \mathcal{Y}$. Let $H$ be a separable RKHS over $\mathcal{Z}$ with continuous kernel $k$ satisfying $\|k\|_\infty \le 1$. Let $(Z, T, C)$ be distributed according to $P$. Let $\hat G_n(t|Z)$ be an estimator of the survival function of the censoring variable and assume (A1)-(A2). Then, for any fixed regularization constant $\lambda > 0$, $n \ge 1$, $\varepsilon > 0$, and $\eta > 0$, with probability not less than
The proof appears in Appendix A.1. We remark that the bound above depends on the distribution $P$ through the constant $K$ and the error terms $A_2$ and $\mathrm{Err}_n$.
For the Kaplan-Meier estimator (see Example 1), under some conditions, bounds on the random error $\|\mathrm{Err}_n\|_\infty$ were established (Bitouzé et al., 1999). In this case we can replace the bound of Theorem 13 with a more explicit one.
More specifically, let $\hat G_n$ be the Kaplan-Meier estimator. Then, for every $n \ge 1$ and $\varepsilon > 0$ the following Dvoretzky-Kiefer-Wolfowitz-type inequality holds (Bitouzé et al., 1999, Theorem 2): where $K_0 = P(T \ge \tau)$ is a lower bound on the survival function at $\tau$, and where $C_o$ is some universal constant (see Wellner, 2007, for a bound on $C_o$). Fix $\eta > 0$ and write Some algebraic manipulations then yield As a result, we obtain the following corollary:
Corollary 14. Consider the setup of Theorem 13. Assume that the censoring variable $C$ is independent of both $T$ and $Z$. Let $\hat G_n$ be the Kaplan-Meier estimator of $G$. Then for any fixed regularization constant $\lambda$, $n \ge 1$, $\varepsilon > 0$, and $\eta > 0$, with probability not less than $1 - \frac{7}{2} e^{-\eta}$,
5.2. $\mathcal{P}$-universal Consistency. In this section we discuss consistency of the censored SVM learning method $L^c$ proposed in Section 4. In general, $P$-consistency means that (3) holds for all $\varepsilon > 0$.
Universal consistency means that the learning method is $P$-consistent for every probability measure $P$ on $\mathcal{Z} \times \mathcal{T} \times \{0, 1\}$. In the following we discuss a more restrictive notion than universal consistency, namely $\mathcal{P}$-universal consistency. Here, $\mathcal{P}$ is the set of all probability distributions for which there is a constant $K$ such that conditions (A1)-(A2) hold. We say that a censored learning method is $\mathcal{P}$-universally consistent if (3) holds for all $P \in \mathcal{P}$. We note that when the first assumption is violated for a set of covariates $\mathcal{Z}_0$ with positive probability, there is no hope of learning the optimal function for all $Z \in \mathcal{Z}$, unless some strong assumptions on the model are enforced. The second assumption is required for proving consistency of the learning method $L^c$ proposed in Section 4.
However, it is possible that other censored learning techniques will be able to achieve consistency for a larger set of probability measures.
In order to show $\mathcal{P}$-universal consistency, we utilize the bound given in Theorem 13. This bound depends on four different terms: the approximation error $A_2$, the entropy of the ball $B_H$, the (local) Lipschitz constant $c_L$, and the error in the estimation of the survival function $G$. We need the following assumptions:
(B1) $H$ is a separable RKHS over $\mathcal{Z}$ with universal kernel $k$ satisfying $\|k\|_\infty \le 1$ (see Remark 15 below for the definition of a universal kernel).
(B2) There are constants $a > 1$ and $p > 0$, such that for every $\varepsilon > 0$, the entropy of $B_H$ is bounded as follows:
(B3) There is a constant $q > 0$, such that the (local) Lipschitz constant is bounded by $c_L(\lambda) \le c \lambda^q$ for all $\lambda$ large enough.
(B4) $\hat G_n$ is consistent for $G$ and there is a finite constant $s > 0$ such that $P(\|\mathrm{Err}_n\|_\infty \ge b n^{-1/s}) \to 0$ for any $b > 0$.
Before we state the main result regarding $\mathcal{P}$-universal consistency, we present some examples for which the assumptions above hold:
Remark 15. A continuous kernel $k$ whose corresponding RKHS $H$ is dense in the class of continuous functions over $\mathcal{Z}$ is called universal. Examples of universal kernels include the Gaussian kernels and the Taylor kernels. For more details, the reader is referred to SC08, Chapter 4.6. Recall that $R^*_{L,P} = \inf_f R_{L,P}(f)$, where the infimum is taken over all measurable functions $f$. For universal kernels, we have $\inf_{f \in H} R_{L,P}(f) = R^*_{L,P}$ (SC08, Corollary 5.29).
Remark 16. The entropy bound (10) of Assumption (B2) is satisfied for both Taylor and Gaussian kernels for all p > 0 (see SC08, Section 6.4).
By Assumption (B4) and the fact that $\|\mathrm{Err}_n\|_\infty < 1$, there exists a constant $c_2 = c_2(\eta)$ that depends only on $\eta$, such that for all $n \ge 1$, This inequality, together with the bound on c λ given above yield that with probability of at least Assumption (B1), together with Lemma 5.15 of SC08 yield that $A_2(\lambda) \to 0$ as $\lambda \to 0$. By assumption, λ (q+1)/2 n min{2(p+1),s} → ∞ ensures that the second expression in the RHS of (12) converges to zero, which implies (3). Since (3) holds for every $P \in \mathcal{P}$, we obtain $\mathcal{P}$-universal consistency.

5.3. Learning Rates.
In the previous section we discussed $\mathcal{P}$-universal consistency, which ensures that for every probability $P \in \mathcal{P}$, the learning method $L^c$ asymptotically learns the optimal function. In this section we would like to study learning rates.
We define learning rates for censored learning methods similarly to the definition for regular learning methods (see SC08, Definition 6.5):
Definition 20. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a loss function. Let $P \in \mathcal{P}$ be a distribution. We say that a censored learning method $L^c$ learns with a rate $\{\varepsilon_n\}_n$, where $\{\varepsilon_n\} \subset (0, 1]$ is a sequence decreasing to 0, if for some constant $c_P > 0$, all $n \ge 1$, and all $\eta \in [0, \infty)$, there exists a constant $c_\eta \in [1, \infty)$ that depends on $\eta$ and $\{\varepsilon_n\}$ but not on $P$, such that
In order to study the learning rates, we need an additional assumption:
(B5) There exist constants $c_3$ and $\beta \in (0, 1]$ such that where $A_2$ is the approximation error function defined in (7).
Proof. By Assumption (B5) and Eq. (11), where $a$ is defined in Assumption (B2), and $c_2 = c_2(\eta)$ depends on $\eta$ but not on $n$ (see proof of Theorem 19). Choose $\lambda_n$ to be a sequence that behaves like $n^{-1/((2\beta + q + 1) \min\{p + 1,\, s/2\})}$. Then, it follows from (12) that where $c_P$ is independent of $\eta$.

5.4. Misspecified Censoring Model.
In the previous subsection we showed that under conditions (B1)-(B4), the censored SVM learning method $L^c$ is $\mathcal{P}$-universally consistent. While one can choose the Hilbert space $H$ and the loss function $L$ in advance such that conditions (B1)-(B3) hold, condition (B4) need not hold when the censoring mechanism is misspecified. In the following, we consider this case.
Let $\hat G_n(t|z)$ be the estimator of the survival function for the censoring variable. The deviation of $\hat G_n(t|z)$ from the true survival function $G(t|z)$ can be divided into two terms. The first term is the deviation of the estimator $\hat G_n(t|z)$ from its limit, while the second term is the difference between the estimator limit and the true survival function. When the model is correctly specified, and the estimator is consistent, the second term vanishes. More formally, let $G_P(t|z)$ be the limit of the estimator under the probability measure $P$, and assume it exists. Define the errors
$$\mathrm{Err}_{n1}(t, z) = \hat G_n(t|z) - G_P(t|z), \qquad \mathrm{Err}_2(t, z) = G_P(t|z) - G(t|z).$$
Note that $\mathrm{Err}_{n1}$ is a random function that depends on the data, the estimation procedure, and the probability measure $P$, while $\mathrm{Err}_2$ is a fixed function that depends only on the estimation procedure and the probability measure $P$.
In the following, we would like to find finite sample bounds when the censoring model is misspecified. This will be done by refining the proof of Theorem 13. First, we need to introduce the concept of clipping. We say that a loss function $L$ can be clipped at $M > 0$ if, for all $(z, y, s) \in \mathcal{Z} \times \mathcal{Y} \times \mathbb{R}$, $L(z, y, \bar{s}) \le L(z, y, s)$, where $\bar{s}$ denotes the clipped value of $s$ at $\pm M$, that is, $\bar{s} = \max\{-M, \min\{M, s\}\}$.
For a function $f$, we define $\bar{f}$ to be the clipped version of $f$, i.e., $\bar{f} = \max\{-M, \min\{M, f\}\}$.
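Clipping is simple to state in code. The sketch below is an illustration only, using the absolute deviation loss and assuming that the responses lie in [-M, M] (the situation in which clipping a prediction cannot increase this loss); the names `clip` and `absolute_loss` and the value M = 2 are hypothetical.

```python
import numpy as np

def clip(s, M):
    """Clipped value of s at +/- M: max{-M, min{M, s}}."""
    return np.maximum(-M, np.minimum(M, s))

def absolute_loss(y, s):
    return np.abs(y - s)

M = 2.0
rng = np.random.default_rng(0)
y = rng.uniform(-M, M, size=1000)        # responses inside [-M, M]
s = rng.normal(0.0, 5.0, size=1000)      # unrestricted predictions

loss_raw = absolute_loss(y, s)
loss_clipped = absolute_loss(y, clip(s, M))   # never larger than loss_raw in this setting
```

Clipping projects each prediction onto the interval [-M, M], which already contains the response, so the distance between prediction and response can only shrink.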
Theorem 23. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex, locally Lipschitz continuous loss function that satisfies $L(z, y, 0) \le 1$ for all $(z, y) \in \mathcal{Z} \times \mathcal{Y}$ and that can be clipped at $M > 0$, and let the constant $W$ be defined by (13). Let $G_P = \lim \hat{G}_n$ and assume that condition (A1) holds for both $G$ and $G_P$.
As a consequence of Theorem 23, we can prove that even under misspecification of the censored data model, the censored learning method $\mathfrak{L}^c$ achieves the optimal risk, up to a constant that depends on $E_P(G_P - G)$, which is the expected distance between the limit of the estimator and the true distribution. If the estimator estimates reasonably well, one can hope that this term is small, even under misspecification.
We now show that the additional condition of Corollary 24 holds for both the Kaplan-Meier estimator and the Cox model estimator.
Example 25 (Kaplan-Meier estimator). Let $\hat{G}_n$ be the Kaplan-Meier estimator of $G$. Let $G_P$ be the limit of $\hat{G}_n$. Note that $G_P$ is the marginal distribution of the censoring variable. It follows from (9) that condition (14) holds for all $s > 2$.
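As an illustration of how a Kaplan-Meier estimate of the censoring survival function enters the weighted loss, the following Python sketch computes the estimate and the resulting inverse-probability-of-censoring weights $\delta_i / \hat{G}_n(T_i)$. The function names are hypothetical, the data are assumed to contain no tied observation times, and covariates are ignored in the weights, so this is a simplified sketch rather than the Matlab implementation of Supplement A.

```python
def km_censoring_survival(times, deltas):
    """Kaplan-Meier estimate of G(t) = P(C > t), with no ties assumed.

    times[i] is the observed time min(failure, censoring) and deltas[i] is 1
    for an observed failure and 0 for a censored observation. Censoring is
    treated as the event of interest.
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    steps = []  # (censoring time, estimate right after that time)
    surv = 1.0
    for rank, i in enumerate(order):
        at_risk = n - rank  # observations with time >= times[i]
        if deltas[i] == 0:
            surv *= 1.0 - 1.0 / at_risk
            steps.append((times[i], surv))

    def G_hat(t):
        value = 1.0
        for u, s in steps:
            if u > t:
                break
            value = s
        return value

    return G_hat


def ipcw_weights(times, deltas):
    """Weights delta_i / G_hat(T_i); censored observations get weight 0."""
    G_hat = km_censoring_survival(times, deltas)
    return [1.0 / G_hat(t) if d == 1 else 0.0 for t, d in zip(times, deltas)]


def censored_ad_risk(predict, zs, times, deltas):
    """Weighted empirical risk for the absolute deviation loss."""
    weights = ipcw_weights(times, deltas)
    total = sum(w * abs(t - predict(z)) for w, z, t in zip(weights, zs, times))
    return total / len(times)


times = [1.0, 2.0, 3.0, 4.0]
deltas = [1, 0, 1, 0]
print(ipcw_weights(times, deltas))
```

Uncensored observations that occur after earlier censoring events receive weights larger than one, which compensates for the observations lost to censoring.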
Example 26 (Cox model estimator). Let $\hat{G}_n$ be the estimator of $G$ when the Cox model is assumed (see Example 2). Let $G_P$ be the limit of $\hat{G}_n$. It was shown that the limit $G_P$ exists, regardless of the correctness of the proportional hazard model (Goldberg and Kosorok, 2011). Moreover, for all $\varepsilon > 0$, and all $n$ large enough, where $W_1$, $W_2$ are universal constants that depend on the set $\mathcal{Z}$, the variance of $Z$, and the constants $K$ and $K_0$, but otherwise do not depend on the distribution $P$ (see Goldberg and Kosorok, 2011, Theorem 3.2, and conditions therein). Fix $\eta > 0$ and write Some algebraic manipulations then yield Hence, condition (14) holds for all $s > 2$.
6. Simulation Study. In this section we illustrate the use of the censored SVM learning method proposed in Section 4 via a simulation study. We consider five different data generating mechanisms, including one-dimensional and multidimensional settings, and different types of censoring mechanisms. We compute the censored SVM decision function with respect to the absolute deviation loss function $L_{AD}$. For this loss function, the Bayes risk is attained by the conditional median (see Example 7). We choose to compute the conditional median and not the conditional mean, since censoring prevents reliable estimation of the unrestricted mean survival time when no further assumptions on the tail of the distribution are made (see discussion in Karrison, 1997; Zucker, 1998; Chen and Tsiatis, 2001). We compare the results of the SVM approach to the results obtained by the Cox model and to the Bayes risk. We also test the effects of ignoring the censored observations.
Finally, for multidimensional examples, we also check the benefit of variable selection.
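The link between the absolute deviation loss and the median can be checked numerically. The sketch below uses a small, made-up, uncensored sample and a hypothetical helper name; it searches a grid of constant predictions for the one with the smallest empirical absolute deviation risk.

```python
import statistics


def empirical_ad_risk(ys, s):
    """Average absolute deviation of the constant prediction s from ys."""
    return sum(abs(y - s) for y in ys) / len(ys)


ys = [1.0, 2.0, 3.0, 10.0, 50.0]             # illustrative sample with a heavy right tail
candidates = [k / 10 for k in range(0, 501)]  # grid 0.0, 0.1, ..., 50.0
best = min(candidates, key=lambda s: empirical_ad_risk(ys, s))
print(best, statistics.median(ys), statistics.mean(ys))
```

The minimizer coincides with the sample median and is unaffected by the size of the largest observation, unlike the mean, which is one reason the median is a natural target when the right tail is obscured by censoring.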
The algorithm presented in Section 4 was implemented in the Matlab environment. For the implementation we used the Spider library for Matlab. The Matlab code for both the algorithm and the simulations can be found in Supplement A. The distribution of the censoring variable was estimated using the Kaplan-Meier estimator (see Example 1). We used the Gaussian RBF kernel $k_\sigma(x_1, x_2) = \exp(-\sigma^{-2}\|x_1 - x_2\|_2^2)$, where the width of the kernel $\sigma$ was chosen using cross-validation. Instead of minimizing the regularized problem (6), we solve the equivalent problem (see SC08, Chapter 5): where $H$ is the RKHS with respect to the kernel $k_\sigma$, and $\lambda$ is some constant chosen using cross-validation. Note that there is no need to compute the norm of the function $f$ in the RKHS $H$ explicitly. The norm can be obtained using the kernel matrix $K$ with coefficients $k_{ij} = k(Z_i, Z_j)$ (see SC08, Chapter 11). The risk of the estimated functions was computed numerically, using a randomly generated data set of size 10000.
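To make the role of the kernel matrix concrete, the following Python sketch evaluates the Gaussian RBF kernel and computes the squared RKHS norm of a function of the form $f = \sum_i \alpha_i k_\sigma(Z_i, \cdot)$ as $\alpha^\top K \alpha$. The function names and the coefficient vector are hypothetical, and the finite kernel expansion of $f$ is an assumption of the sketch.

```python
import math


def gaussian_rbf(x1, x2, sigma):
    """k_sigma(x1, x2) = exp(-||x1 - x2||^2 / sigma^2) for equal-length tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / sigma ** 2)


def kernel_matrix(points, sigma):
    """Matrix with entries k_ij = k_sigma(points[i], points[j])."""
    return [[gaussian_rbf(p, q, sigma) for q in points] for p in points]


def rkhs_norm_squared(alpha, K):
    """Squared RKHS norm alpha' K alpha of f = sum_i alpha_i k(Z_i, .)."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))


Z = [(0.0,), (0.5,), (1.0,)]   # illustrative one-dimensional covariates
K = kernel_matrix(Z, sigma=0.5)
alpha = [1.0, -0.5, 0.25]      # hypothetical expansion coefficients
print(rkhs_norm_squared(alpha, K))
```

Only kernel evaluations at the sample points are needed, so the regularization term can be computed without any explicit representation of the feature space.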
In some simulations the failure time is distributed according to the Weibull distribution (Lawless, 2003). The density of the Weibull distribution is given by where $\kappa > 0$ is the shape parameter and $\rho > 0$ is the scale parameter. Assume that $\kappa$ is fixed and that $\rho = \exp(\beta_0 + \beta' Z)$, where $\beta_0$ is a constant, $\beta$ is some coefficient vector, and $Z$ is the covariate vector. In this case, the failure time distribution follows the proportional hazard assumption, i.e., the hazard rate is given by $h(t|Z) = \exp(\beta_0 + \beta' Z)\,d\Lambda(t)$, where $\Lambda(t) = t^\kappa$. When the proportional hazard assumption holds, estimation based on Cox regression is consistent and efficient (see Example 2; note that the distribution discussed there is that of the censoring variable and not of the failure time; nevertheless, the estimation is similar). Thus, when the failure-time distribution follows the proportional hazard assumption, we use the Cox regression as a benchmark.
In the first setting, the covariates $Z$ are generated uniformly on the segment $[-1, 1]$. The failure time follows the Weibull distribution with shape parameter 2 and scale parameter $-0.5Z$. Note that the proportional hazard assumption holds. The censoring variable $C$ is distributed uniformly on the segment $[0, c_0]$, where the constant $c_0$ is chosen such that the mean censoring percentage is 30%. We used 5-fold cross-validation to choose the kernel width and the regularization constant among the set of pairs $(\lambda^{-1}, \sigma) = (0.1 \cdot 10^i, 0.05 \cdot 2^j)$, $i, j \in \{0, 1, 2, 3\}$. We repeated the simulation 100 times for each of the sample sizes 50, 100, 200, 400, and 800.
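The search over the 16 candidate pairs can be organized as in the following Python sketch. All function names are hypothetical, and `fold_risk` stands for a user-supplied routine that fits the censored SVM on the training indices and returns the risk on the validation indices; the fitting step itself is not shown.

```python
import random


def parameter_grid():
    """All pairs (1/lambda, sigma) = (0.1 * 10**i, 0.05 * 2**j), i, j in {0, 1, 2, 3}."""
    return [(0.1 * 10 ** i, 0.05 * 2 ** j) for i in range(4) for j in range(4)]


def five_fold_indices(n, seed=0):
    """Shuffle 0..n-1 and deal the indices into 5 folds of nearly equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[k::5] for k in range(5)]


def select_pair(grid, folds, fold_risk):
    """Return the grid pair with the smallest average validation risk.

    fold_risk(pair, train_idx, val_idx) is a hypothetical user-supplied
    callable that fits on train_idx and returns the risk on val_idx.
    """
    def cv_risk(pair):
        total = 0.0
        for k, val_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            total += fold_risk(pair, train_idx, val_idx)
        return total / len(folds)

    return min(grid, key=cv_risk)


grid = parameter_grid()
folds = five_fold_indices(50)
print(len(grid), [len(fold) for fold in folds])
```

Since the grid is parameterized by $\lambda^{-1}$, the regularization constant passed to the solver would be the reciprocal of the first entry of the selected pair.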
In Figure 1, the conditional medians obtained by the censored SVM learning method and by Cox regression are plotted. The true median is plotted as a reference. In Figure 2
Note that the proportional hazard assumption holds for $Z^2$, but not for the original covariate $Z$. In Figure 3, the true, the SVM, and the Cox regression medians are plotted. In Figure 4, we compare
The results for the third and the fourth settings appear in Figure 5 and Figure 6, respectively.
We compare the risk of standard SVM that ignores censored observations, censored SVM, censored SVM with variable selection, and the Cox regression. We performed variable selection for censored SVM based on recursive feature elimination as in Guyon et al. (2002, Section 2.6). When the proportional hazard assumption holds (setting 3), SVM performs reasonably well, although the Cox model performs better, as expected. When the proportional hazard assumption fails to hold (setting 4), SVM performs better than the Cox model, and it seems that the risk of the Cox regression converges, but not to the Bayes risk (see Example 26 for discussion). Both figures show that variable selection achieves a slightly smaller median risk at the price of higher variance, and that ignoring the censored observations leads to higher risk.
In the fifth setting, we consider a non-smooth conditional median. We also investigate the influence of using a misspecified model for the censoring mechanism. The covariates $Z$ are generated uniformly on the segment $[-1, 1]$. The failure time is normally distributed with expectation $3 + 3 \cdot 1\{Z < 0\}$ and variance 1. Note that the proportional hazard assumption does not hold for the failure time. The censoring variable $C$ follows the Weibull distribution with shape parameter 2 and scale parameter $-0.5Z + \log(6)$, which results in a mean censoring percentage of 40%. Note that for this model, the censoring is independent of the failure time only given the covariate $Z$ (see Assumption (A2)). Estimation of the censoring distribution using the Kaplan-Meier estimator corresponds to estimation under a misspecified model. Since the censoring follows the proportional hazard assumption, estimation using the Cox estimator corresponds to estimation under the true model. We use 5-fold cross-validation to choose the regularization constant and the width of the kernel, as in setting 1.
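A data set of this form can be simulated with the Python standard library, as sketched below. The function names are hypothetical. The sketch assumes that the stated scale parameter enters on the log scale, so that the Weibull scale passed to `random.weibullvariate` is $\exp(-0.5Z + \log 6)$; this reading is an assumption of the sketch and is not confirmed by the text.

```python
import math
import random


def true_conditional_median(z):
    """Median of the normal failure time, 3 + 3 * 1{z < 0}."""
    return 3.0 + 3.0 * (1.0 if z < 0 else 0.0)


def simulate_setting_five(n, seed=0):
    """Return a list of (z, observed_time, delta) triples.

    Assumption: the Weibull scale is exp(-0.5 * z + log(6)) with shape 2.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.uniform(-1.0, 1.0)
        failure = rng.gauss(true_conditional_median(z), 1.0)
        scale = math.exp(-0.5 * z + math.log(6.0))
        censoring = rng.weibullvariate(scale, 2.0)  # (scale, shape)
        delta = 1 if failure <= censoring else 0
        data.append((z, min(failure, censoring), delta))
    return data


data = simulate_setting_five(20000)
censored_fraction = sum(1 - delta for _, _, delta in data) / len(data)
print(round(censored_fraction, 2))
```

Under this assumed parameterization the simulated censoring fraction can be compared with the 40% reported for this setting, which gives a quick sanity check on the reading of the scale parameter.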
In Figure 7, the conditional medians obtained by the censored SVM learning method, using both the misspecified and the true model for the censoring, and by Cox regression are plotted. The true median is plotted as a reference. In Figure 8, we compare the risk of the SVM method using both the misspecified and the true model for the censoring. We also checked the effect of ignoring the censored observations. Both figures show that in general SVM does better than the Cox model, regardless of how the censoring distribution is estimated. The difference between the misspecified and the true model for the censoring is small, and the corresponding curves in Figure 7 almost coincide. Figure 8 shows again that there is a non-negligible price for ignoring the censored observations.
7. Concluding Remarks. We studied an SVM framework for right censored data. We proposed a general censored SVM learning method and showed that it is well defined and measurable. We derived finite sample bounds on the deviation from the optimal risk. We proved risk consistency and computed learning rates. We discussed misspecification of the censoring model. Finally, we performed a simulation study to demonstrate the censored SVM method.
We believe that this work illustrates an important approach for applying support vector machines to right censored data, and to missing data in general. However, many open questions remain and many possible generalizations exist. First, we assumed that the censoring is independent of the failure time given the covariates, and that the probability that no censoring occurs is positive given the covariates. It would be interesting to study the consequences of violating one or both assumptions. Second, we have used inverse-probability-of-censoring weighting to correct the bias induced by censoring. In general, this is not always the most efficient way of handling missing data (see, for example, van der Vaart, 2000, Chapter 25.5). Third, we discussed only right-censored data and not general missing mechanisms. We believe that further developing SVM techniques that are able to better utilize the data and to perform under weaker assumptions in more general settings is of great interest.

Proof.
Note that where i.e., $R_{L^G,D}$ is the empirical loss function with the true censoring distribution function. Using conditional expectation, we obtain that for every $f \in H$, We conclude that We would like to replace the bound above with a bound that does not depend on the functions $f^c_{D,\lambda}$ and $f_{P,\lambda}$. To do that, we first bound the norm of both $f_{P,\lambda}$ and $f^c_{D,\lambda}$. We start with $f_{P,\lambda}$. Since $L$ is a convex locally Lipschitz continuous loss function, we obtain that $f_{P,\lambda}$ exists (see Section 3). Since and $L(z, y, 0) \le 1$, we conclude that $\|f_{P,\lambda}\|_H \le \lambda^{-1/2}$. Moreover, since $\|k\|_\infty \le 1$, it follows from Lemma 4.23 of SC08 that $\|f_{P,\lambda}\|_\infty \le \lambda^{-1/2}$. In addition, for $f \in \lambda^{-1/2} B_H$, where $B_H$ is the unit ball of $H$, we have $|L(z, y, f(z))| \le |L(z, y, f(z)) - L(z, y, 0)| + L(z, y, 0) \le c_L(\lambda^{-1/2})\lambda^{-1/2} + 1$.
We are now ready to bound the second expression of (16).
where the last inequality follows from (18) and Assumption (A1). Consequently, we obtain $|R_{L^G,D}(f^c_{D,\lambda}) - R_{L^n_D,D}(f^c_{D,\lambda})| + |R_{L^n_D,D}(f_{P,\lambda}) - R_{L^G,D}(f_{P,\lambda})| \le B\,\|\mathrm{Err}_n\|_\infty / K$.
Combining this bound with (19), and applying additional algebraic calculations, yields the assertion.
Proof. The proof is based on utilizing (13) to replace the bound for f c D,λ obtained in (21) of Theorem 13 with a bound on f c D,λ . Note that since L can be clipped at M , for every f ∈ H we have Hence, we can obtain a similar bound to (15) We start by bounding A n . Note that Starting with the first expression of A n , note that by (17) and the arguments that follow, f P,λ ∞ ≤ λ −1/2 ≤ (Kλ) −1/2 . Repeating the arguments that lead to (20), together with Lemma 27, we obtain that for any f ∈ (Kλ) −1/2 B H |R L G ,P ( f ) − R L,P ( f )| > B 2 2η + 2 log 2N (B H , · ∞ , (Kλ) 1/2 ε) n + 2εK −1 c L ((Kλ) −1/2 ) , with probability not more than e −η , where B = 1 K c L (Kλ) −1/2 (Kλ) −1/2 + 1 . As for the second expression in A n , note that for any clipped function f , we have |R L G ,D ( f ) − R L n D ,D ( f )| ≤ P n δL(Z, Y, f (Z)) G(T |Z) − P n δL(Z, Y, f (Z)) G P (T |Z) where the last inequality follows from condition (A1).
We now bound B n . Note that B n ≤ R L n D ,D (f P,λ )) − R L n D ,D ( f c P,λ ) The second and third components of B n can be bounded by (23) and (24), respectively. We now bound the first expression of B n . Define h L (f )(Z, T ) = L(Z, T, f (Z)) − L(Z, T, f * P (Z)) where f * P ∈ argmin f R L,P (f ), and where the minimum is taken over all measurable functions f : R L n D ,D (f P,λ )) − R L n D ,D ( f c P,λ ) =P n δL(Z, T, f P,λ (Z)) − δL(Z, T, f c P,λ (Z)) G n (Z|T ) ≤ 1 K P n L(Z, T, f P,λ (Z)) − L(Z, T, f c P,λ (Z)) = 1 K P n h L (f P,λ ) − h L ( f c P,λ ) .
It follows from Bernstein's inequality (see Eq. (7.41) of SC08, for details) that with probability not less than 1 − e −η ,
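Proof. Assume that $N(\mathcal{F}, \|\cdot\|, \varepsilon) = K < \infty$; otherwise the assertion is trivial. Let $f_1, \ldots, f_K$ be such that $\mathcal{F} \subset \bigcup_{k=1}^{K} B(f_k, \varepsilon)$, where $B(f, \varepsilon)$ is the $\varepsilon$-ball with respect to the norm $\|\cdot\|$, centered at $f$. Note that since $|\bar{s} - \bar{t}| \le |s - t|$ for all $s, t \in \mathbb{R}$, we have $\|\bar{f}_1 - \bar{f}_2\| \le \|f_1 - f_2\|$ and hence $\bar{\mathcal{F}} \subset \bigcup_{k=1}^{K} B(\bar{f}_k, \varepsilon)$. The result follows.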

SUPPLEMENTARY MATERIAL
Supplement A: Matlab Code. Please read the file README.pdf for details on the files in this folder.