Tree-based censored regression with applications in insurance

Abstract: We propose a regression tree procedure to estimate the conditional distribution of a variable which is not directly observed due to censoring. The model that we consider is motivated by applications in insurance, including the analysis of guarantees that involve durations, and claim reserving. We derive consistency results for our procedure, and for the selection of an optimal subtree using a pruning strategy. These theoretical results are supported by a simulation study, and by two applications involving insurance datasets. The first concerns income protection insurance, while the second deals with reserving in third-party liability insurance.


Introduction
In numerous applications of survival analysis, analyzing the heterogeneity of a population is a key issue. For example, in insurance, many risk evaluations are linked with the analysis of duration variables, such as lifetime, the time between two claims, or the time between the opening of a claim and its closure. A strategic question is then to determine clusters of individuals that represent different levels of risk. Once such groups have been identified, it becomes possible to improve pricing, reserving or marketing targeting. In this paper, we show how to adapt the CART methodology (Classification And Regression Trees) to a survival analysis context, with such applications in mind. The presence of censoring is a specific feature of data involving duration variables. Here, these variables appear naturally in the applications we consider, either because we are focusing on lifetimes, or because we are interested in quantities that are observed only when some event has occurred (typically, the final settlement of a claim). The procedure we develop is shown to be consistent, while its practical behavior is investigated through a simulation study and two real dataset analyses.
The CART procedure (Breiman et al. (1984)) is a natural candidate for dealing with such problems, since it simultaneously provides a regression analysis (which allows us to consider nonlinearity in the way the response depends on covariates) and a clustering of the population under study. Moreover, its tree-based algorithmic simplicity makes it easy to implement. It consists of successively splitting the population into less heterogeneous groups. A model selection step then allows us to select from this recursive partition a final subdivision into groups of observations of reasonable size, with simple classification rules to assign an individual to one of these classes. Tree-based methods have met with many successes in medical applications, due to the need for clinical researchers to define interpretable classification rules for understanding the prognostic structure of data (see e.g., Fan, Nunn and Su (2009), Gao, Manatunga and Chen (2004), Ciampi, Negassa and Lou (1995), Bacchetti and Segal (1995)). In survival analysis, a recent review of these methods can be found in Bou-Hamad, Larocque and Ben-Ameur (2011). Let us also mention Wey, Wang and Rudser (2014), who recently considered tree-based estimation of a censored quantile regression model, extending the methodology of Wang and Wang (2009). For insurance applications, Olbricht (2012) highlighted their usefulness for approximating mortality curves in a reinsurance portfolio and comparing them to German life tables in a nonparametric way, but based on fully observed data, which is not the case in the present paper.
As already mentioned, one of the most delicate problems when dealing with survival analysis is the presence of censoring in the data, and the necessity to correct the bias it introduces when using statistical methods. Our approach is based on the IPCW ("Inverse Probability of Censoring Weighting") strategy; see van der Laan and Robins (2003), Chapter 3.3. It consists of determining a weighting scheme that compensates for the lack of complete observations in the sample. Therefore, our procedure is connected with the technique presented in Molinaro, Dudoit and van der Laan (2004). The main differences in our approach involve the specificity of the weighting scheme we consider (based on the Kaplan-Meier estimator of the censoring distribution) and the fact that we do not only focus on a duration (subject to censoring); our interest lies in the conditional distribution of a related variable, which is observed only if the duration is. This particular framework is motivated by applications in insurance where the final claim amount to be paid is known only after the claim has been settled, which can take several years in some cases. Another difference with Molinaro, Dudoit and van der Laan (2004) is that their approach requires modeling the conditional distribution of the censoring. In our case, no such model is required since we use weights based on a Kaplan-Meier estimator (Kaplan and Meier (1958)), and our strategy relies on Kaplan-Meier integrals (see e.g., Stute (1999), Gannoun et al. (2005) and Lopez, Patilea and Van Keilegom (2013) for the application of similar strategies to censored regression).
The paper is organized as follows. In Section 1, we describe specific details of the censored observations we consider. Section 2 is devoted to the description of the regression tree procedure, and its adaptation to the presence of censoring. Its consistency is shown in Section 3. A simulation study and two real data examples from the insurance field are respectively presented in Sections 4 and 5.

Observations and general framework
This section aims to describe the type of observations we have at our disposal (Section 1.1) and define the regression function we wish to estimate (Section 1.2). Section 1.3 is devoted to the nonparametric estimation of the distribution function of the variables involved in our model.

Censored observations
In the following, we are interested in a random vector (M, T, X), where M ∈ R^p, T ∈ R^+ is a duration variable, and X ∈ X ⊂ R^d denotes a set of random covariates that may have an impact on T and/or M. The presence of censoring prevents the direct observation of (M, T), while X is always observed. Next, let us introduce a censoring variable C ∈ R^+. For the sake of simplicity, we assume that T and C are continuous random variables. We also assume, for convenience but without loss of generality, that the components of M are all strictly positive. The variables that are observed instead of (M, T) are (Y, δ, δM), where Y = inf(T, C) and δ = 1_{T ≤ C}. Compared to a classical censoring regression scheme, such as the one described for example in Stute (1993), the variables M_i correspond to quantities that are observed only when the individual i is fully observed. An illustration of such a case is described in Section 5.2, where T represents the time before a claim is fully settled, and M the total corresponding amount (only known at the end of the claim settlement process). The censored regression framework of Stute (1993) can be seen as a special case, taking M = T.

Regression function
Our aim is to understand the impact of X, and possibly T, on M. More precisely, we wish to estimate the function

π_0 = arg min_{π ∈ P} E[φ(M, π(T, X))],   (1.1)

where P is a subset of an appropriate functional space and φ a loss function.
In the following, we will restrict ourselves to real-valued functions π. Table 1 shows the different types of regression models corresponding to different possible choices of φ, and the corresponding set P. These examples cover mean regression and quantile regression.

Table 1. Expressions for π_0 for some classical choices of φ and P. The notation L^p(R^d) indicates a restriction to the set of functions π(x, t) which do not depend on t, and, for a random vector U, q_{τ,U}(u) denotes the τ-th conditional quantile of M with respect to U, that is, the value m_u such that P(M ≤ m_u | U = u) = τ.
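To make these loss choices concrete, the short numerical sketch below (our own illustration; the function names are not from the paper) checks that minimizing the empirical risk over constants recovers the mean under the squared loss φ(m, π) = (m − π)² and the median under the pinball loss with τ = 0.5:

```python
import numpy as np

def squared_loss(m, c):
    # phi(m, pi) = (m - pi)^2: its population minimizer is E[M]
    return (m - c) ** 2

def pinball_loss(m, c, tau):
    # pinball (quantile) loss: its population minimizer is the
    # tau-th quantile of M
    u = m - c
    return np.maximum(tau * u, (tau - 1.0) * u)

def best_constant(m, loss, grid):
    """Brute-force minimizer of the empirical risk over a grid of
    candidate constants."""
    risks = [loss(m, c).mean() for c in grid]
    return grid[int(np.argmin(risks))]

rng = np.random.default_rng(0)
m = rng.exponential(scale=2.0, size=20000)
grid = np.linspace(0.0, 10.0, 2001)

c_mean = best_constant(m, squared_loss, grid)
c_med = best_constant(m, lambda mm, c: pinball_loss(mm, c, 0.5), grid)
```

The same mechanism with a general τ yields the conditional-quantile entries of Table 1.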

Estimation of the distribution function of (M, T, X)
In this framework, the empirical distribution function of (M, T, X) cannot be computed, since M and T are not directly observed. Since most statistical methods rely on this nonparametric estimator, an effort should be made to find an alternative estimator that takes censoring into account. Due to classical identifiability issues, an assumption on the way C depends on the variables (M, T, X) must be specified. In the following, we assume that Assumption 1 below holds.
Assumption 1. Assume that: (i) C is independent of T; (ii) P(T ≤ C | M, T, X) = P(T ≤ C | T).

Under Assumption 1, observe that, for all functions ψ ∈ L^1,

E[ψ(M, T, X)] = E[ δ ψ(M, Y, X) / (1 − G(Y−)) ],   (1.2)

where G(t) = P(C ≤ t). The function G is usually unknown. However, Assumption 1 ensures that it can be estimated consistently by the Kaplan-Meier estimator (see Kaplan and Meier (1958)), i.e.,

Ĝ(t) = 1 − ∏_{i : Y_i ≤ t} ( 1 − (1 − δ_i) / Σ_{j=1}^n 1_{Y_j ≥ Y_i} ),

since T and C are independent, and P(T = C) = 0 for continuous random variables (see Stute and Wang (1993) for consistency of Kaplan-Meier estimators). Therefore, a natural estimator of F(m, t, x) = P(M ≤ m, T ≤ t, X ≤ x) is

F̂(m, t, x) = (1/n) Σ_{i=1}^n δ_i 1_{M_i ≤ m, Y_i ≤ t, X_i ≤ x} / (1 − Ĝ(Y_i−)),   (1.3)

whose weights coincide with those of Stute (1993), due to a connection between Ĝ and the jumps of the Kaplan-Meier estimator of the distribution of T (see Satten and Datta (2001)).
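In code, this weighting scheme can be sketched as follows (a minimal illustration assuming continuous, tie-free data; the helper names are ours, and Ĝ is obtained by applying the Kaplan-Meier product-limit construction to the censoring indicator 1 − δ):

```python
import numpy as np

def censoring_survival(y, delta):
    """Kaplan-Meier estimator of S_C(t-) = P(C >= t) = 1 - G(t-),
    treating censoring as the event of interest (indicator 1 - delta).
    Assumes continuous data (no ties)."""
    order = np.argsort(y)
    y_s, cens_event = y[order], 1 - delta[order]
    n = len(y_s)
    at_risk = n - np.arange(n)  # number of subjects still at risk
    factors = np.where(cens_event == 1, 1.0 - 1.0 / at_risk, 1.0)
    surv = np.cumprod(factors)  # product-limit survival curve

    def s_minus(t):
        k = np.searchsorted(y_s, t, side="left")  # number of Y_j < t
        return 1.0 if k == 0 else surv[k - 1]

    return s_minus

def ipcw_weights(y, delta):
    """IPCW weights W_{i,n} = delta_i / (n * (1 - G_hat(Y_i-)));
    censored observations get weight zero but still enter G_hat."""
    s_minus = censoring_survival(y, delta)
    n = len(y)
    return np.array([d / (n * s_minus(t)) if d == 1 else 0.0
                     for t, d in zip(y, delta)])
```

When there is no censoring (δ_i ≡ 1), every weight reduces to 1/n, and the weighted sums used below coincide with ordinary empirical means.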
Remark 1.1. Assumption 1 is a natural extension of the identifiability condition considered by Stute (1993). Alternative assumptions have been proposed by several authors for censored regression. For example, Van Keilegom and Akritas (1999), Heuchenne and Van Keilegom (2010a), and Heuchenne and Van Keilegom (2010b) assume that T and C are independent conditionally on X (in the absence of an additional variable M). A special case of Assumption 1 is the situation where (M, T, X) is independent of C. But, as shown in Stute (1993) (where Assumptions (i) and (ii) p. 91 are identical to ours in the case where T = M), Assumption 1 is more general. However, it still introduces constraints on the way C is allowed to depend on the covariates. An alternative would be to assume that (M, T) is independent of C conditionally on X. A way to adapt our approach to this framework would be to replace the Kaplan-Meier estimator Ĝ by the conditional Kaplan-Meier estimator of Beran (1981) and Dabrowska (1989), as in Lopez (2011) (see also Lopez, Patilea and Van Keilegom (2013)). However, this complicates the procedure due to the introduction of kernel smoothing with respect to X, with potentially erratic behavior when the dimension d of the covariates is high. We therefore restrict ourselves to the condition in Assumption 1, which is well adapted to the practical applications we have in mind (see Section 5).
Remark 1.2. In practice, we use a learning sample to build the regression tree, and a validation sample to select the best-adapted subtree (further details in Section 2.3). Suppose that the learning sample is of size n, while there are v observations in the test sample. In this situation, the estimator Ĝ can be computed either from the learning sample (n observations) or from the whole sample (n + v observations), this latter option leading to a slight modification in the definition of Ĝ. As we will explain in Section 2.3, we use this second strategy in practice, which has no significant consequence in the theory, provided that v is at most of the same order as n.

Adapting CART to survival data with Kaplan-Meier weights
This section is devoted to the description of our regression tree methodology, adapted to censoring. Section 2.1 explains the growing procedure, i.e., the successive partitions of the observations into elementary classes, while Section 2.2 shows the link between a subtree extracted from this procedure and an estimator of the regression function. Section 2.3 presents the pruning strategy for selecting our final estimator.

Growing the tree
The building procedure of a regression tree is based on the definition of a splitting criterion that furnishes partition rules at each step of the algorithm. More precisely, at each step s, a tree with L_s leaves is constituted, each of these leaves representing disjoint subpopulations of the initial n observed individuals. In our case, the rules used to create these populations are based on the values of Y and X. More precisely, writing X̃ = (T, X) and T = R^+ × X, the leaves correspond to a partition T_1^(s), ..., T_{L_s}^(s) of the space T. The individual i belongs to the subpopulation of the leaf l if X̃_i ∈ T_l^(s). At step s + 1, each leaf is likely to become a new node of the tree by making use of the splitting criterion. Let X̃^(j) denote the j-th component of X̃. In the absence of censoring, to partition the subpopulation of the l-th leaf into two subpopulations, one determines, for each component X̃^(j), the threshold x_l^(j) achieving

L_l(j, x_l^(j)) = min_x min_{(γ_1, γ_2) ∈ Γ²} ∫ [ φ(m, γ_1) 1_{x̃^(j) ≤ x} + φ(m, γ_2) 1_{x̃^(j) > x} ] 1_{x̃ ∈ T_l^(s)} dF_n(m, x̃),   (2.1)

where Γ ⊂ R, x̃ = (t, x), and F_n denotes the empirical distribution of (M, T, X). The first term of (2.1) can be seen as an estimator of E[φ(M, γ_1) 1_{X̃^(j) ≤ x} 1_{X̃ ∈ T_l^(s)}]. Then, one determines j_0 = arg min_{j=1,...,d+1} L_l(j, x_l^(j)). Next, the partition of the population of the l-th leaf is performed by separating the individuals having X̃_i^(j_0) ≤ x_l^(j_0) from those having X̃_i^(j_0) > x_l^(j_0). In our framework, the empirical distribution function F_n is unavailable. The idea is then to replace F_n in (2.1) by F̂ defined in (1.3). In other words, in the previous regression tree procedure, the empirical means that we would use in the absence of censoring are replaced by weighted sums, with weight W_{i,n} = δ_i n^{-1} [1 − Ĝ(Y_i−)]^{-1} being assigned to the i-th observation in order to compensate for the presence of censoring.
An important remark can be made in view of both the definition of the splitting criterion and the weights W_{i,n}. The splitting criterion consists of a rule which is based on the values of X̃, whose first component T is unobserved for the censored individuals. Hence, under random censoring, this procedure cannot be understood as a rule to perform classification of all the observations in the sample; only uncensored individuals are classified. Nevertheless, the fact that the censored ones are not assigned to any leaf of the tree does not constitute an obstacle in view of performing the growing procedure. Indeed, if the i-th individual is censored, W_{i,n} = 0. Therefore, at each step, a censored observation could be assigned to any subpopulation without modifying the value of L_l(j, x_l^(j)). This does not mean that the information contained in censored observations is not used, since they play an important role in computing Ĝ, and thus W_{i,n}.
To summarize, the aforementioned procedure thus produces clusters of individuals with rules to assign uncensored observations to one of them. The question of how to assign a censored observation should be considered separately; see the application in Section 5.2. The details of our modified CART algorithm (with censoring weights) are as follows:

Step 0: compute the estimator Ĝ from the dataset with n individuals.
Step 1: initialization. Consider the tree with only one leaf (L_1 = 1), corresponding to the population composed of all n_U uncensored observations.

Step s: splitting. Consider the tree obtained at step s − 1, with L_{s−1} leaves. Each leaf l corresponds to a set T_l^(s−1) containing e_l uncensored observations:
s.1 if e_l = 1, or if all observations in the leaf have the same values of X̃, do not split;
s.2 else, the leaf becomes a node in the next tree: determine j_0 and x_l^(j_0) as above, and split the leaf into two new leaves accordingly.
Set L_s the new number of leaves (L_s = L_{s−1} if every leaf satisfied condition s.1). Go to step s + 1, unless L_s = L_{s−1}. The procedure stops when all leaves are in case s.1. This produces the maximal tree from which our final estimator is extracted.
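A single split search under the weighted criterion can be sketched as follows (our own illustration for the squared-loss case, where the optimal γ on each side is the weighted mean; `best_split` and its arguments are hypothetical names):

```python
import numpy as np

def best_split(x_col, m, w, loss):
    """Scan the candidate thresholds of one component of X~ and return
    (risk, threshold) minimizing the censoring-weighted loss, with a
    constant gamma fitted on each side by weighted mean (the minimizer
    for the squared loss)."""
    best_risk, best_x = np.inf, None
    for x in np.unique(x_col)[:-1]:   # the largest value cannot split
        left = x_col <= x
        right = ~left
        if w[left].sum() == 0.0 or w[right].sum() == 0.0:
            continue                  # no uncensored mass on one side
        g1 = np.average(m[left], weights=w[left])
        g2 = np.average(m[right], weights=w[right])
        risk = (w[left] * loss(m[left], g1)).sum() + \
               (w[right] * loss(m[right], g2)).sum()
        if risk < best_risk:
            best_risk, best_x = risk, x
    return best_risk, best_x
```

Looping this search over the d + 1 components of X̃ = (T, X) and keeping the best pair (j_0, threshold) reproduces one splitting step; censored observations carry weight zero and therefore never change the criterion.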

From the tree to the regression function
Recall that our aim is to estimate the function π_0 in (1.1). Consider a subtree S of the maximal tree built from the algorithm in Section 2.1. We now describe how this subtree can be associated with an estimator of π_0. Let K(S) denote the total number of leaves of S. As previously explained, this subtree can be seen as a collection of rules (see Meinshausen (2009) for further formalization of this concept). By construction, a leaf l is associated with a set T_l (recall that the sets T_l are disjoint, with their union equal to T) and a rule R_l(x̃) = 1_{x̃ ∈ T_l} that determines whether an individual is assigned to the corresponding cluster. This induces the following estimator of π_0:

π̂_S(t, x) = Σ_{l=1}^{K(S)} γ̂_l R_l(t, x),  with  γ̂_l = arg min_{γ ∈ Γ} Σ_{i=1}^n W_{i,n} φ(M_i, γ) 1_{X̃_i ∈ T_l}.

The coefficient γ̂_l can be seen as an estimator of

γ_l = arg min_{γ ∈ Γ} E[ φ(M, γ) 1_{X̃ ∈ T_l} ].

Hence, defining π_S(t, x) = Σ_{l=1}^{K(S)} γ_l R_l(t, x), π_S can be seen as a piecewise-constant approximation of π_0, which tends to be closer to π_0 when the partition of T is finely spaced. On the other hand, π̂_S should be close to π_S provided that the sets T_l are not too small. In view of estimating π_0, a crucial issue is thus to extract an appropriate subtree from the maximal tree, corresponding to a good compromise between a sharp partition of T and the necessity of having enough observations in each leaf to estimate the coefficients γ_l well. Achieving this is the aim of the pruning strategy developed in the following section.
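For the squared loss, the per-leaf coefficients γ̂_l are simply weighted means, and π̂_S is the induced piecewise-constant predictor; a minimal sketch (the names are ours, and each leaf is assumed to contain at least one uncensored observation):

```python
import numpy as np

def leaf_estimates(leaf_id, m, w):
    """gamma_hat_l = weighted mean of M over observations in leaf l
    (the minimizer of the weighted squared loss); assumes every leaf
    has positive total weight."""
    return {l: np.average(m[leaf_id == l], weights=w[leaf_id == l])
            for l in np.unique(leaf_id)}

def predict(leaf_id, gammas):
    """pi_hat_S: piecewise-constant prediction, one value per leaf."""
    return np.array([gammas[l] for l in leaf_id])
```

For a quantile loss, the weighted mean would simply be replaced by a weighted quantile within each leaf.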

Remark 2.1. In the presence of right-censored observations, a classical difficulty is handling observations that are close to the right tail of the distribution. Indeed, little information is available on this part of the distribution due to the lack of large uncensored observations. Our procedure is impacted by this problem, which translates into a blow-up of the weights W_{i,n} when Y_i approaches the right tail of the distribution, since 1 − Ĝ(Y_i−) then becomes close to zero.
For this reason, a careful look at the leaves containing large observations is required. In the following, our theoretical results do not cover the case where weights may blow up. In other words, we exclude too-large uncensored observations from the procedure in order to avoid the instability they cause. This is a classical issue in censored regression, where, instead of π_0, one is often required to consider π_0(τ) = arg min_{π ∈ P} E[φ(M, π(T, X)) | T ≤ τ], where τ is strictly smaller than the right endpoint of the support of Y, introducing a small bias.

Selection of a subtree: Pruning algorithm
Denote by K_n ≤ n the number of leaves of the maximal tree. The pruning strategy consists of selecting from the data a subtree Ŝ with K̂ leaves. Let S denote the set of subtrees of the maximal tree. The pruning strategy consists of determining Ŝ(α) such that

Ŝ(α) = arg min_{S ∈ S} { Σ_{i=1}^n W_{i,n} φ(M_i, π̂_S(Y_i, X_i)) + α K(S)/n },   (2.2)

and of using π̂_{Ŝ(α)} as a final estimator of π_0. We will denote by K̂_α the number of leaves in Ŝ(α). A penalty term proportional to K(S)/n was first proposed by Breiman et al. (1984); see also Gey and Nedelec (2005). The procedure consists of starting with α = 0, then progressively increasing its value, in order to determine a sequence 0 < α_1 < ... < α_{K_n} such that K̂_{α_{j+1}} < K̂_{α_j}. The existence of such a sequence has been proved by Breiman et al. (1984). Moreover, it follows from Breiman et al. (1984, p. 284-290) that the corresponding subtrees (Ŝ(α_j))_j are nested. Then, the question is to select the right α_j in this list. To this end, a test sample (see Remark 1.2) of size v is used. More precisely, let

V(α_j) = Σ_{i=n+1}^{n+v} W_{i,v} φ(M_i, π̂_{Ŝ(α_j)}(Y_i, X_i)),   (2.3)

and select α_{j_0} such that V(α_{j_0}) is minimal. This procedure differs from the classical one by the introduction of the weights involving Ĝ. Section 3.3 shows that this strategy remains valid in the presence of censoring.
Observe that different strategies may be used for computing the estimator Ĝ involved in (2.3). We choose to compute it once and for all, i.e., using the whole sample (Y_i, δ_i)_{i=1,...,n+v}, and using this estimator both in the construction of the trees and in the validation step. Alternatively, one could use in the growing step an estimator Ĝ computed from the learning sample and, in the validation step, another one computed from the test sample. We argue that such a strategy is likely to increase the instability of the procedure, since an estimator Ĝ computed only from the information contained in the test sample would usually have poorer performance (usually v ≪ n). Therefore, taking an estimator Ĝ computed from the whole sample seems preferable, observing that correcting for the presence of censoring and selecting the most appropriate tree are two separate problems.
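The validation step can be sketched as follows (our own illustration: `preds[j]` stands for the test-sample predictions of the subtree associated with α_j, and the weights are the IPCW weights computed on the test sample):

```python
import numpy as np

def validation_error(pred, m, w, loss):
    """Censoring-weighted empirical risk V(alpha_j) on the test sample."""
    return np.sum(w * loss(m, pred))

def select_subtree(preds, m, w, loss):
    """Return the index j0 of the candidate subtree minimizing V(alpha_j)."""
    errors = [validation_error(p, m, w, loss) for p in preds]
    return int(np.argmin(errors))
```

Since censored test observations carry weight zero, only fully observed claims contribute directly to V(α_j), while censored ones enter through Ĝ.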
Remark 2.2. This selection criterion, in its uncensored version, has been shown to be consistent for selecting the best subtree in many cases, see Breiman et al. (1984) and Gey and Nedelec (2005). See also Molinaro, Dudoit and van der Laan (2004) for similar strategies. Optimality properties and practical evidence for some of these techniques can be found in van Der Laan, Dudoit and van der Vaart (2006), van Der Laan and Dudoit (2003), and Dudoit et al. (2003).

Consistency of the CART weighted estimator
This section is devoted to the proof of the consistency of the tree procedure. The roadmap of the proof consists of the three following steps:
1. We consider the value of the criterion that we wish to optimize on each leaf of the tree. We provide a quasi-exponential bound for the deviations of the difference between this criterion and the limit that it is supposed to estimate. The result is presented in Theorem 1, Section 3.2.
2. Under some regularity assumptions on this criterion, the consistency of the parameters γ̂_l is obtained for each leaf of the tree; see Proposition 1 (Section 3.2). Next, the consistency of the global regression estimator π̂_S is deduced in Corollary 1.
3. We show in Proposition 2 of Section 3.3 that the pruning strategy is legitimate, in the sense that it leads, from a collection of subtrees, to an estimator which achieves the best convergence rate, up to some smaller remainder terms. This is a consequence of the two previous steps, where the non-asymptotic results that are provided make it easy to track the effect of the size of the tree on estimation quality.

A bound on the deviations of the criterion
We consider in this section a tree with leaves T_l (l = 1, . . . , K), where T_l is a random subdivision of T corresponding to the scheme defined in Section 2.1. Let

M_{n,l}(γ) = Σ_{i=1}^n W_{i,n} φ(M_i, γ) 1_{X̃_i ∈ T_l},   M_l(γ) = E[ φ(M, γ) 1_{X̃ ∈ T_l} ],

and define the relative variation of M_{n,l} − M_l around γ_l as

Δ_l(γ, γ_l) = { M_{n,l}(γ) − M_l(γ) } − { M_{n,l}(γ_l) − M_l(γ_l) }.

The quantity Δ_l(γ, γ_l) is a way of measuring, in leaf l, a normalized variation of the error made by replacing the criterion M_l by its empirical counterpart. The cornerstone of our theoretical results is Theorem 1 below, which furnishes a bound for the deviations of Δ_l. Before stating the result, some assumptions on the regularity of the loss function are required.

Assumption 2.
There exists a constant M < ∞ such that, for all m, sup_{π ∈ Γ} |φ(m, π)| ≤ M. Assumption 2 holds provided that φ is continuously differentiable with respect to π, with uniformly bounded derivative. The second assumption that we need on φ requires us to introduce notation concerning covering numbers. For a class of functions F and a probability measure Q, let N(ε, L²(Q), F) denote the minimum number of L²(Q)-balls of radius ε required to cover the set F. In the following, for a class of functions F with envelope function E (by envelope, we mean that all functions in F are uniformly bounded by E), we will use the following notation:

N(ε, F) = sup_Q N(ε ‖E‖_{L²(Q)}, L²(Q), F),

where the supremum is taken over all probability measures Q.

Assumption 3. Define the class of functions

Φ = { (m, t, x) ↦ φ(m, γ) 1_{(t,x) ∈ A} : γ ∈ Γ, A ⊂ T }.

Assume that, for some positive constants C_1 and w,

N(ε, Φ) ≤ C_1 ε^{−w},

where we recall that the functions in Φ are bounded by M from Assumption 2.
Assumption 3 holds provided that the function φ is regular enough. For example, assume that φ is twice continuously differentiable with respect to π. Also assume that its second-order derivative with respect to π is, for fixed m, Hölder continuous with Hölder constant H_m. If this constant satisfies E[H_m] < ∞, it is easy to check that we are in the situation of Example 19.7 in van der Vaart (1998), and Assumption 3 holds.
We now state the main result of this section.
The introduction of τ is required due to the erratic behavior of the Kaplan-Meier estimator at the right-hand side of the distribution. We therefore need to remove the observations that are too large, which is the purpose of considering only leaves such that T l ⊂ T τ . This type of truncation is classical in censored regression, see e.g., Sánchez Sellero, González Manteiga and Van Keilegom (2005), Heuchenne and Van Keilegom (2010b) and Lopez, Patilea and Van Keilegom (2013).
Sketch of the proof of Theorem 1. Writing Δ_l(γ, γ_l) = Δ_{l,C}(γ, γ_l) + Δ_l^*(γ, γ_l), the probability (3.1) can be decomposed into

P(|Δ_l(γ, γ_l)| ≥ x) ≤ P(|Δ_{l,C}(γ, γ_l)| ≥ x/2) + P(|Δ_l^*(γ, γ_l)| ≥ x/2).   (3.2)

This means that Δ_{l,C} corresponds to the replacement of Ĝ by G in the definition of M_{n,l}, while Δ_l^* corresponds to the deviation we would consider in a situation where the distribution of the censoring is known exactly.
The two probabilities in the decomposition (3.2) are studied separately in Lemmas 1 and 2, respectively. We proceed as follows:
1. Lemma 1 handles the replacement of Ĝ by G in the criterion (corresponding to Δ_{l,C}) via the adaptation of the Dvoretzky-Kiefer-Wolfowitz inequality to the Kaplan-Meier estimator given by Bitouzé, Laurent and Massart (1999).
2. Lemma 2 uses a concentration inequality due to Talagrand (1994) to study the deviations of Δ_l^*, that is, of the criterion we would compute if we knew exactly the distribution of the censoring.
Remark 3.1. The sequence u_n is introduced in Lemma 1.

Remark 3.2. If n + v observations are used to compute Ĝ, n simply becomes n + v in the third exponential term of (3.1), and u_n is replaced by u_{n+v}.

Consistency of the regression tree
Consider a leaf T_l ⊂ T_τ. Once again, restricting ourselves to T_τ is required due to the poor performance of the Kaplan-Meier estimator near the tail of the distribution. Theorem 1 allows us to easily deduce consistency of γ̂_l, up to adding some regularity assumptions on the function φ, which we now present.
We also require some reasonable restrictions on the parameter space Γ.
Assumption 5. Γ is compact, convex with non-empty interior, and for all l = 1, . . . , K, γ l belongs to the interior of Γ.
By the definition of γ̂_l, we have M_{n,l}(γ_l) − M_{n,l}(γ̂_l) ≥ 0, while M_l(γ_l) − M_l(γ̂_l) ≤ 0 by the definition of γ_l. Hence,

0 ≤ M_l(γ̂_l) − M_l(γ_l) ≤ |Δ_l(γ̂_l, γ_l)|.

Moreover, it follows from a second-order Taylor expansion and Assumptions 4 and 5 that

M_l(γ̂_l) − M_l(γ_l) ≥ c (γ̂_l − γ_l)² μ_X̃(T_l),

for some constant c > 0, from which one deduces

(γ̂_l − γ_l)² ≤ c^{-1} |Δ_l(γ̂_l, γ_l)| / μ_X̃(T_l).   (3.4)

The following Proposition 1 then easily follows from (3.4) and Theorem 1.

Here we have used the notation in Theorem 1, and μ_X̃ is defined as in Assumption 4.
This proposition means that in each leaf, the estimator γ̂_l is close to γ_l with high probability. Nevertheless, the term μ_X̃(T_l) shows that estimation performance in the leaf deteriorates when the leaf is "too small" (that is, when the selection rules define a region of the space T which has a small measure with respect to the distribution of X̃). This is a classical issue when proving consistency of regression trees; see e.g., Condition 1 in Chaudhuri (2000) and Condition 1 in Chaudhuri and Loh (2002). Condition (3.5) in Corollary 1 below is clearly linked to this issue since, in a random design, μ_X̃(T_l) represents, in a sense, the number of observations in T_l.
and P(x) = P(‖π̂_S − π_S‖²_{2,τ} > x). Then, for some positive constants C_j, since the intersection of T_l and T_{l′} is empty for l ≠ l′, and using (3.5), Equation (3.6) follows from Proposition 1. To show (3.7), observe, following Remark 3.1, that, since ‖π̂_S − π_S‖²_{2,τ} is bounded (say by a finite constant A), the result follows since P(E_n) ≤ 2.5 u_n.

Consistency of the pruning strategy
The next result shows that penalizing the subtree S by a factor αK(S)/n is a relevant strategy. This idea already seems reasonable in view of (3.7). Indeed, ∫ φ(m, π̂_S) dF(m, t, x) is, due to the regularity assumptions on φ (Assumption 4), of the same order as ‖π̂_S − π_S‖²_{2,τ}, which is of order K(S)/n. Penalizing by αK(S)/n can then be interpreted as compensating the structural decrease of ‖π̂_S − π_S‖_{2,τ} when K(S) increases. Proposition 2 below confirms this.
Define π̂_{Ŝ(α)} as the estimator selected using the pruning strategy with parameter α; in Proposition 2 below, the O(n^{-1}) term does not depend on K_0.
The proof of Proposition 2 is postponed to the appendix. It introduces an optimal choice of complexity K_0 for the selected tree. It is optimal in the sense that K_0 minimizes ∫ φ(m, π_K) dF(m, t, x) over K, that is, the criterion that would be optimized if we knew the distribution F. Proposition 2 means that the penalization strategy gives approximately the same performance as we would have if we knew the optimal complexity K_0. Indeed, the L²-norm of the error is of order K_0 n^{-1}, plus some approximation term (the distance between π_{K_0} and π_0).

Simulations
We investigate here the practical behavior of tree-based estimators for censored data via simulations. For the sake of simplicity, we consider the case where we are interested in the distribution of the lifetime T, thus focusing on estimating π_0(x) = E[T | X = x]. Consider the following simulation scheme (see the parameter values in Table 2; steps 1 and 2, which draw the covariates X_i and the group-specific lifetimes T_i, define four subgroups in the whole population):
3. draw n + v i.i.d. censoring times, Pareto-distributed: C_i ∼ Pareto(λ, μ);
4. from the simulated lifetimes and censoring times, get for all i the actual observed lifetime Y_i = inf(T_i, C_i) and the indicator δ_i = 1_{T_i ≤ C_i};
5. compute the estimator Ĝ from the entire generated sample (Y_i, δ_i)_{1 ≤ i ≤ n+v}.
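Steps 3 to 5 can be sketched as follows (our own illustration: the group cut-points and rates `beta`, and the Pareto parameterization P(C > c) = (μ/c)^λ for c ≥ μ, are assumptions standing in for the exact values of Table 2):

```python
import numpy as np

rng = np.random.default_rng(42)
n_total = 5000  # n + v

# Hypothetical stand-in for steps 1-2: a uniform covariate defining
# four subgroups, each with its own exponential rate beta (so that
# pi_0(X_i) = 1 / beta); the cut-points and rates are illustrative.
x = rng.uniform(size=n_total)
beta = np.select([x < 0.3, x < 0.6, x < 0.8],
                 [0.08, 0.05, 0.16], default=0.5)
t = rng.exponential(1.0 / beta)

# Step 3: Pareto(lambda, mu) censoring times by inverse transform,
# assuming the survival form P(C > c) = (mu / c)^lambda for c >= mu.
lam, mu = 2.0, 5.0
c = mu * rng.uniform(size=n_total) ** (-1.0 / lam)

# Step 4: observed lifetimes and censoring indicators
y = np.minimum(t, c)
delta = (t <= c).astype(int)
```

Step 5 would then apply a Kaplan-Meier estimator to (Y_i, 1 − δ_i) to obtain Ĝ, as in Section 1.3.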
Descriptive statistics corresponding to various simulated datasets (of different sizes) are available in Table 3. To each simulated sample, we fit a regression tree with the algorithm in Section 2.1, and prune it using the strategy in Section 2.3.

Table 2
Parameters involved in the simulation scheme: group-specific means, component probabilities for the four covariate classes [0, 0.3[, [0.3, 0.6[, [0.6, 0.8[ and [0.8, 1], and the Pareto parameters (λ, μ) corresponding to censoring rates of 10%, 30% and 50%.

Then, we compute the weighted squared errors given by

WSE_i = W_{i,n} ( γ̂_{l(i)} − π_0(X_i) )²,

where the i-th observation belongs to leaf l(i), and where we know that π_0(X_i) = 1/β. In order to gain some robustness in our results, we repeated the simulation scheme above 5000 times to compute empirical means of the WSE_i, leading to the MWSE. We also considered different values for (λ, μ) in the censoring process so as to measure the impact of censoring on the procedure's performance (see Table 2 for these values of the Pareto distribution). The performance of the procedure is shown in Figure 1 and Table 4. Clearly, the strength of the censoring has an impact on performance. One can also observe that the performance in the group with the highest mean (Group 2) is worse than in the others, which is linked to the fact that the largest observations are more likely to be censored. However, the hierarchy of the groups in terms of performance cannot be entirely summarized with respect to the typical size of the lifetimes (see Group 4, which has a lower mean, but performs worse than Group 1).
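This error metric can be sketched as follows (a hypothetical toy illustration; `pred` plays the role of γ̂_{l(i)} and `pi0` of the known true mean π_0(X_i) = 1/β):

```python
import numpy as np

def wse(pred, pi0, w):
    """Weighted squared error of each observation: the IPCW weight
    times the squared gap between the fitted leaf value and pi_0(X_i)."""
    return w * (pred - pi0) ** 2

# MWSE: empirical mean of the total WSE over repeated simulations
# (two toy replications here instead of 5000).
replications = [
    (np.array([2.1, 2.1, 9.8]), np.array([2.0, 2.0, 10.0]),
     np.full(3, 1 / 3)),
    (np.array([1.9, 2.2, 10.1]), np.array([2.0, 2.0, 10.0]),
     np.full(3, 1 / 3)),
]
mwse = np.mean([wse(p, p0, w).sum() for p, p0, w in replications])
```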

Applications to real-life insurance datasets
In this section, we consider two applications in insurance. The first, described in Section 5.1, focuses on the prediction of a duration variable only (duration in a state of disability). The second, in Section 5.2, is dedicated to claim reserving, and illustrates the need to introduce a supplementary variable M. In this situation, the key issue is to predict the claim amount, this being known only after some time T , subject to censoring.

Income protection insurance
The real-life database we consider reports the claims of income protection guarantees over six years. It consists of 83,547 claims, with the following information for each claim: policyholder ID, cause (sickness or accident), gender (male or female), socio-professional category (SPC: manager, employee or miscellaneous), age at the claim date, duration in the disability state (possibly right-censored), and commercial network (3 kinds of brokers). All insurance contracts considered have a common deductible of 30 days.
Here, the censoring rate equals 7.2%, the mean observed duration in the disability state is about 100 days (beyond the deductible of 30 days), with a median of 42 days. There is strong dispersion among the observed durations, the standard deviation being 162 days. Our goal is to find a segmentation into several classes of homogeneous individuals, and to predict the duration in the disability state in each class.
To begin, we fit a Cox proportional-hazards model with the (discretized) age at the claim date as covariate, since the recovery rates used in the calculation of technical provisions for this kind of guarantee depend on the age range at the claim date. This adjustment points to the high predictive power of this variable. However, the proportional hazards assumption is thoroughly rejected by all classical statistical tests (likelihood ratio, Wald and log-rank tests). Nevertheless, the obtained results are retained, to enable a comparison with those from the tree approach. We thus try to explain the disability duration by sex, SPC, commercial network, age at the claim date (5 pre-determined classes, due to local prudential regulation) and cause of disability. The final tree (after pruning) is given in Figure 2. We see in Table 5 the significant differences between tree and Cox estimates. These differences can be explained by two phenomena resulting from using the Cox proportional-hazards model:
• our approach directly targets the duration expectation, while the Cox partial likelihood is focused on estimating the hazard rate;
• the estimation of the baseline hazard is very sensitive to the longer durations (mainly concentrated in class e), which affect the estimates of all other classes (whereas our estimation is expected to be less sensitive to this phenomenon for classes a to d).
These differences reinforce the interest of such an approach for incorporating heterogeneity into the reserving process of an insurance portfolio. More generally, the predictive performance of duration models for censored survival endpoints can be assessed using various techniques, including time-dependent ROC curves computed on independent test data (hereafter denoted ROC(t); see, e.g., Heagerty, Lumley and Pepe (2000) and Heagerty and Zheng (2005)). Figure 3 illustrates such ROC curves (at t = 15, 100, 110, which correspond respectively to the first quartile, mean and third quartile of the observed lifetimes), obtained from the previously-built models. The only difference lies in age, which is treated here as a continuous covariate to benefit from one of the strengths of tree-based procedures, namely eliciting good cut-points for continuous covariates and capturing potential nonlinearity. Table 6 gives the value of the AUC (Area Under the Curve) at various time points, corresponding to the previous durations together with the median of the observed lifetimes. The tree approach appears significantly better than the Cox model at predicting lifetime, with an excellent mean AUC of 80%. Once again, and in this more general framework, these results demonstrate the benefit of using trees rather than the Cox model for prediction, whatever the duration threshold under study.
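For reference, the cumulative/dynamic version of ROC(t) reduces, on fully observed data, to a pairwise concordance computation between "cases" (T ≤ t) and "controls" (T > t). The sketch below illustrates AUC(t) on hypothetical toy data; it deliberately ignores the IPCW correction that censored lifetimes would require in practice.

```python
# Sketch of a cumulative/dynamic AUC(t) in the spirit of Heagerty et al.:
# AUC(t) = P(marker_i > marker_j | T_i <= t < T_j).
# Toy version for fully observed lifetimes; with censoring, pairs would be
# reweighted by inverse-probability-of-censoring weights.

def auc_t(times, markers, t):
    """Proportion of concordant (case, control) pairs at horizon t.

    A pair is concordant when the case (T <= t) has the larger marker;
    ties count for one half.
    """
    cases = [m for T, m in zip(times, markers) if T <= t]
    controls = [m for T, m in zip(times, markers) if T > t]
    if not cases or not controls:
        return float("nan")
    concordant = sum((mc > mk) + 0.5 * (mc == mk)
                     for mc in cases for mk in controls)
    return concordant / (len(cases) * len(controls))

# toy data: a higher marker value should indicate a shorter lifetime
T = [10, 20, 120, 200]
marker = [0.9, 0.8, 0.3, 0.1]
a = auc_t(T, marker, 100)  # both short-lived cases outrank both controls
```

Averaging AUC(t) over a grid of thresholds t gives the kind of mean AUC reported in Table 6.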

Reserving in third-party liability insurance
This real-life database was extracted in the 2000s by an international insurance company, and reports about 650 claims related to medical malpractice insurance over seven successive years. The initial dataset contains information about various dates concerning the claims (dates of reporting, opening or closing the case, etc.), contract features, and some data on the associated payments. These payments encompass indemnity payments and ALAE (Allocated Loss Adjustment Expenses), where ALAE are assignable to specific claims and represent fees paid to outside attorneys used to defend those claims. After some pre-processing, one can compute the quantities useful for our purposes, especially the (potentially censored) development times and total payments. Here T_i is the "lifetime" of a claim, that is, the time between its issue date and its settlement date. The censoring time C_i is the delay between the claim issue date and the extraction date of the database, and M_i is the total amount of the i-th claim. The latter is observed only if the claim has been fully settled (32% of the observations are censored). In this setting, it is reasonable to assume that C_i does not depend on (M_i, T_i, X_i), but this would clearly be wrong for covariates depending on the claim issue date. Table 7 summarizes some descriptive statistics about the covariates used when running the weighted CART algorithm to explain the response M_i. As could be expected in this type of business, the data are highly skewed; for instance, many declared claims are assigned no payments because the company is still waiting for a court decision before paying. A parametric model would then be quite difficult to fit, which emphasizes the interest of such techniques. As we have already mentioned, a key issue is to predict the future expenses related to claims that are still under payment.
Typically, computing the conditional expectation M*(m, y, δ, x) = E[M_i | N_i = m, Y_i = y, δ_i = δ, X_i = x] would give the best L²-approximation of the amount M_i based on the information available on claim i. Our aim is then to produce an estimator M̂ of this ideal (but unattainable) predictor. Of course, M* is known if δ_i = 1, that is, M*(m, y, 1, x) = m, but the key issue is to predict it for unsettled claims (δ_i = 0). For such claims, introduce Z_1(m, y) = 1(M > m, T > y) and Z_2(m, y) = M Z_1(m, y). In view of (5.1), we have to estimate the quantities π^{m,y}_{0,1}(x) = E[Z_1(m, y) | X = x] and π^{m,y}_{0,2}(x) = E[Z_2(m, y) | X = x]. Each of these is estimated using the CART procedure described in Section 2. Hence, for each censored claim, we use two regression trees to compute a prediction M̂_i, obtained as the ratio M̂_i = π̂^{N_i,Y_i}_{0,2}(X_i) / π̂^{N_i,Y_i}_{0,1}(X_i). Note that, for each censored claim, the trees we compute are different, since the values of Y_i and N_i differ from claim to claim. We then determine the reserve to be constituted by summing the M̂_i. To check that the proposed amount is reasonable, we can compare the values of M̂_i with the predictions of the experts that are present in the database. The aggregated results are presented in Tables 8 and 9.
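The ratio predictor for a censored claim can be illustrated with a deliberately simplified sketch in which the two regression trees are replaced by within-class means over fully settled claims (a depth-one "tree" on one categorical covariate); the data, class labels and function names are hypothetical, and the IPCW weighting of the actual procedure is omitted for brevity.

```python
# Simplified illustration of M_hat = pi_{0,2}(x) / pi_{0,1}(x) for one
# censored claim. Hypothetical data: the two conditional expectations are
# estimated by group means within a single covariate class, standing in
# for the two regression trees of the actual (weighted) procedure.

def ratio_predictor(settled, censored_claim):
    """settled: list of (M, T, x) for fully settled claims
       (total amount M, development time T, covariate class x).
    censored_claim: (N, Y, x) = (amount paid so far, elapsed time, class)."""
    N, Y, x = censored_claim
    # Z1 = 1(M > N, T > Y) and Z2 = M * Z1, restricted to class x
    z1 = [1.0 if (M > N and T > Y) else 0.0
          for M, T, g in settled if g == x]
    z2 = [M if (M > N and T > Y) else 0.0
          for M, T, g in settled if g == x]
    if not z1:
        return float("nan")  # no settled claim in this class
    pi1 = sum(z1) / len(z1)  # estimate of pi_{0,1}(x)
    pi2 = sum(z2) / len(z2)  # estimate of pi_{0,2}(x)
    return pi2 / pi1 if pi1 > 0 else float("nan")

# toy portfolio: (total amount, development time in years, class)
settled = [(100, 1, "A"), (300, 3, "A"), (500, 5, "A"), (50, 2, "B")]
open_claim = (120, 2, "A")  # 120 paid so far, open for 2 years, class "A"
prediction = ratio_predictor(settled, open_claim)
```

The conditioning on (M > N, T > Y) is what injects the posterior information discussed below: a claim that has stayed open longer is compared only to settled claims that also survived that long, which mechanically pulls the prediction towards the larger amounts.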
The predictions are highly overdispersed for both the "expert" and the "tree" reserves (see Table 8) but, as mentioned earlier, this is not surprising from a business point of view. We observe that our regression tree approach produces reserve amounts which are significantly higher than the reserves made by the experts, except for the lower amounts. We argue that this is linked to the fact that the expert reports are made close to the opening of the claim. In our approach, we use posterior information: if a claim has been open for a long time, our procedure tends to predict a higher final value (claims with a long duration before settlement are more likely to be associated with larger amounts). This difference justifies the use in practice of our technique as a second diagnosis, complementary to expert judgment. Finally, notice in Table 9 that the gap between the two reserves is not necessarily increasing with the level of information. For instance, the tree global reserve is 1.22 times bigger than the expert one when considering two thirds of the censored observations (from the minimum to the 66-th percentile of the censored observations), whereas it is 1.31 times bigger with half of them.

Conclusion
In this paper, we defined a regression tree procedure well-adapted to the presence of incomplete observations due to censoring, and proved its consistency. The framework that we considered is motivated by the field of survival analysis, but also covers related applications, such as claim reserving in insurance. In this type of problem, a duration variable is present (and subject to censoring), together with an additional variable (the claim amount) that is observed only if the observation is uncensored. We presented two practical applications of this technique that demonstrate its feasibility and interest. The next step is to extend the procedure we have developed to random forests, in the same spirit as Hothorn et al. (2006). Indeed, regression tree procedures, although they produce an easily interpretable model, are known for their sensitivity to changes in the dataset. This investigation is left for future work, but we would like to emphasize the role of ensemble methods in improving the predictive abilities of the technique described in the present paper.

Appendix A: Main lemmas
Lemmas 1 and 2 below are the key results required to prove Theorem 1.

Proof. Since T_l ∈ T_τ, we have 1_{(t,x) ∈ T_l} = 0 for t > τ. Let c_G = (1 − G(τ)) and c_F = (1 − F(τ)). We have c_F > 0 and c_G > 0. Therefore, we have

where we have used Assumption 2. Since (1 − G) is bounded away from zero, the empirical mean on the right-hand side is bounded by M c_G^{-1}. On the other hand,

and the probability on the right-hand side can be bounded by 2.5 exp(−2nz^2 + C n^{1/2} z), for some absolute constant C > 0, using the Dvoretzky-Kiefer-Wolfowitz inequality for the Kaplan-Meier estimator proved in Bitouzé, Laurent and Massart (1999). Hence the result follows, with A = c_F^2 c_G^4 (2M)^{-1}, B = C c_F c_G^2 (2M)^{-1}, and u_n = exp(−n^{1/2} c_F c_G [C + n^{1/2} c_F c_G]/2).
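For reference, the inequality of Bitouzé, Laurent and Massart (1999) invoked here can be stated as follows, with notation adapted to this section (F̂ denotes the Kaplan-Meier estimator of F under censoring distribution G); taking λ = n^{1/2} z recovers the bound used in the proof.

```latex
% DKW-type inequality for the Kaplan-Meier estimator
% (Bitouzé, Laurent and Massart, 1999): for all λ > 0,
\mathbb{P}\left( \sqrt{n}\,\sup_{t}\,
  \bigl|(1 - G(t))\,\bigl(\hat{F}(t) - F(t)\bigr)\bigr| > \lambda \right)
\;\le\; 2.5\,\exp\!\left(-2\lambda^{2} + C\lambda\right),
```

so that, with λ = n^{1/2} z, the right-hand side becomes 2.5 exp(−2nz^2 + C n^{1/2} z), as used above.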

Lemma 2. Assume that X is a random vector with d continuous components
where c_G = (1 − G(τ)), as in the proof of Lemma 1. As in Proposition C1 in Appendix C, introduce a sequence of i.i.d. Rademacher variables (ε_i)_{1≤i≤n}, independent of (N_i, Y_i, δ_i, X_i)_{1≤i≤n}, and define the symmetrized process based on the differences f(N_i, Y_i, δ_i, X_i) − ∫ f(n, y, d, x) dP(n, y, d, x). From Proposition C1, we get

P_n ( sup_{l : T_l ⊂ T_τ} sup_{γ ∈ Γ} |Δ*(γ, γ_l)| > A_1 (Z + y) ).

The result follows by applying this inequality to y = nx/(2A_1), where A^c denotes the complement of a set A, e_j denotes the vector of R^{d+1} with all components equal to zero except the (j+1)-th one, and ⟨·, ·⟩ the scalar product in R^{d+1}. It follows from Example 8.4 in van der Vaart and Wellner (1996), combined with points (i) and (ii) in Proposition 8.2 of the same reference (stability properties of VC-classes), that H_d is a VC-class of sets (see van der Vaart and Wellner (1996) for the definition), with VC-index 2(d + 1)(d + 2). As a consequence,

for some universal constant K (see Dudley (1999)), and the result follows from (B.1).

Lemma 5. Under the assumptions of Proposition 2, we have

for some positive constant C_6 < ∞.
We have, due to the regularity of φ,

We distinguish two cases, depending on whether K < K_0 or K > K_0.
A bound for K < K_0. In this case, Δ(K) > 0, and