Bootstrapping Aalen-Johansen processes for competing risks: Handicaps, solutions, and limitations

Abstract: Statistical inference in competing risks models is often based on the famous Aalen-Johansen estimator. Since the corresponding limit process lacks independent increments, it is typically applied together with Lin's (1997) resampling technique involving standard normal multipliers. Recently, it has been shown that this approach can be interpreted as a wild bootstrap technique and that other multipliers, e.g. centered Poisson variables, may lead to better finite sample performance, see Beyersmann et al. (2013). Since the latter is closely related to Efron's classical bootstrap, the question arises whether this or more general weighted bootstrap versions of Aalen-Johansen processes lead to valid results. Here we analyze their asymptotic behaviour, and it turns out that such weighted bootstrap versions in general possess the wrong covariance structure in the limit. However, we demonstrate that the weighted bootstrap can nevertheless be applied for specific null hypotheses of interest and also discuss its limitations for statistical inference. To this end, we introduce different consistent weighted bootstrap tests for the null hypothesis of stochastically ordered cumulative incidence functions and compare their finite sample performance in a simulation study.


Introduction
In the widely used competing risks set-up, survival data is modeled via a simple time-continuous Markov chain, which may be described by an initial state (e.g. "non-failure") and a final state (e.g. "failure"). Here the latter is categorized into different absorbing states which are exclusive and may be interpreted as the "competing" failure causes. In this context the so-called cumulative incidence function (CIF), also called sub-distribution function, is of particular interest. For each absorbing state, i.e. failure cause, it is separately defined as the probability of occurrence of this particular failure type until a given time. Time-simultaneous inference for the CIF is often based on its canonical Aalen-Johansen estimator, see Aalen and Johansen (1978) [1]. However, because of the complicated covariance structure of its standardized limit process, often other tools are needed, depending on the statistical question of interest, to create valid statistical procedures. A worthwhile and very promising possibility to attack this problem is the use of adequate resampling procedures such as Lin's multiplier technique, see Lin (1993, 1997) [26,24] or Martinussen and Scheike (2006) [28] for special examples with medical background. Lin's (1997) [24] resampling idea is as follows: For fixed data, standard normal multipliers are introduced into a proper (resampling) statistic which theoretically possesses the same Gaussian limit distribution as the corresponding normalized Aalen-Johansen process of the CIF. Then the unknown distribution of the Aalen-Johansen process is approximated by repeatedly generating a large number of realizations of the resampling statistic. This approach leads to the construction of valid confidence bands, see Lin (1997) [24].
In the context of hypothesis testing, Bajorunaite and Klein (2007, 2008) [6,7] as well as Sankaran et al. (2010) [36] have also studied Lin's resampling scheme to test for equality of different CIFs in extensive simulation studies. Spitoni et al. (2012) [39] investigated Lin's resampling method for estimating transition probabilities in semi-Markovian models with applications to survival analysis.
As mentioned by Cai et al. (2010) [13], Lin's (1997) [24] multiplier method is a special version of the general wild bootstrap approach, originally introduced by Wu (1986) [41] for inference in regression models. Recently Beyersmann et al. (2013) [10] have provided a rigorous study of the theoretical properties of the wild bootstrap for the Aalen-Johansen estimator in competing risks allowing for independent left-truncation and right-censoring. There it is discussed that other multipliers, such as standardized Poisson variates, may help to construct more accurate confidence bands for the CIF in the competing risks set-up. As explained in that paper, the latter is quite close in spirit to Efron's (1979) [15] classical bootstrap, in which the resampling scheme is generated by drawing with replacement from the sample (or an adequately transformed sample). This motivates the question whether the classical bootstrap or other related resampling techniques may also be applied for statistical inference in one- and two-sample competing risks designs. In particular, the current paper studies (1) the theoretical properties of a general exchangeably weighted bootstrap version of the Aalen-Johansen estimator in this context, covering amongst others the above mentioned wild bootstrap as well as Efron's original bootstrap, and (2) statistical applications and limitations of this general resampling approach for testing different null hypotheses of interest for the CIF.
The weighted bootstrap approach was first introduced for i.i.d. samples by Mason and Newton (1992) [29], see also Praestgaard and Wellner (1993) [34], Putter and van Zwet (1996) [35] as well as van der Vaart and Wellner (1996) [40]. It has then been further developed and generalized to more general schemes, allowing for different dependency structures, by Janssen and Pauls (2003) [21], Janssen (2005) [20], del Barrio et al. (2009) [14] and Pauly (2011) [31]. Here we focus on the technique derived in Janssen (2005) [20] and Pauly (2011) [31]. Inference procedures of interest in competing risks designs are given by one-, two- and k-sample tests for the null hypotheses of equality (which may correspond to the construction of time-simultaneous confidence bands) as well as of ordering of the CIF(s). Here we focus on two-sample problems. It will turn out that for the first problem (i.e. testing equality of CIFs of two independent groups) the wild bootstrap is exceptionally well suited, whereas for the second problem general resampling versions of studentized Pepe (1991) [33] tests lead to consistent inference procedures. The theoretical results are motivated by competing risks designs with independent left-truncation and right-censoring but will also hold for more general counting processes satisfying the multiplicative intensity model, see the monograph of Andersen et al. (1993) [5] for more details.
The paper is organized as follows. In Section 2 we introduce the competing risks model, the CIF and its canonical Aalen-Johansen estimator. After recapitulating the wild bootstrap technique for these estimators, we introduce their general weighted bootstrap versions in Section 3 and analyze their weak convergence. Statistical applications for testing the null hypothesis of ordered CIFs in the two-sample case are given in Section 4 and their finite sample properties are investigated in simulations in Section 5. Finally our results are discussed in Section 6 and all proofs are given in the Appendix.

Notation, model and estimators
To be as general as possible in the competing risks set-up we consider a non-homogeneous Markov chain (X_t)_{t≥0} in continuous time with finite state space {0, 1, ..., k}, k ∈ N. Here state 0 is initial with P(X_0 = 0) = 1, and all other states 1, ..., k, representing the competing risks, are assumed to be absorbing. For convenience we restrict ourselves to the case k = 2 with two absorbing states. The corresponding transition intensities (or cause-specific hazard functions) of (X_t)_{t≥0} from state 0 into state j = 1, 2 will be denoted by α_j and are assumed to exist. Moreover, the event time is given by T = inf{t > 0 | X_t ≠ 0} and is related to the cause-specific hazards via the standard identity P(T > t) = exp(−∫_0^t (α_1(u) + α_2(u)) du), with the meaningful practical interpretation that the all-cause hazard of T is the sum of the cause-specific hazards. Below we are interested in the risk development of this Markov process in time on a given interval [0, t] with t < τ.
For n independent replicates of this Markov chain, corresponding to the observation in time of n individuals, we consider the associated bivariate counting process N = (N_1, N_2), where N_j(t) = Σ_{i=1}^n 1(X_t^{(i)} = j) (2.1) counts the number of observed transitions into state j and 1(·) denotes the indicator function. It is worth noting that, under the given assumptions, the processes N_1 and N_2 are càdlàg and do not jump simultaneously. Moreover, we assume that N fulfills the multiplicative intensity model given in Andersen et al. (1993) [5], i.e. its intensity process λ = (λ_1, λ_2) is given by λ_j(t) = α_j(t) Y(t), (2.2) where Y(t) = Σ_{i=1}^n 1(X_{t−}^{(i)} = 0) is the number of Markov chains without a jump shortly before time t, i.e. the number at risk at t−. It is shown in Andersen et al. (1993, Chapter III) [5] that Assumption (2.2) is satisfied in several important situations with incomplete observations, in particular covering the practically important case in which data is subject to independent right-censoring and left-truncation (or even filtering). For example, left-truncation means that patient i is only "under study" if T_i > L_i, i.e. its event time T_i is greater than its truncation time L_i. We refer to Andersen et al. (1993) [5] for the explicit modelling of these and other incomplete data structures.
We are now interested in deriving statistical inference procedures for the cumulative incidence functions, or sub-distribution functions, F_j(t) = P(T ≤ t, X_T = j), j = 1, 2. The corresponding sub-survival function will be denoted by S_j(t) = 1 − F_j(t), j = 1, 2. Consistent estimators for the CIFs are given by the famous Aalen-Johansen estimators (AJEs) F̂_j(t) = ∫_0^t P̂(T > u−) (J(u)/Y(u)) dN_j(u), j = 1, 2. (2.4) Here J(u) = 1(Y(u) > 0) and P̂(T > u) denotes the Kaplan-Meier estimator of P(T > u).
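For concreteness, the plug-in structure of (2.4) can be illustrated with a short Python sketch (purely illustrative, not from the paper; the encoding of right-censored observations as cause 0 is our assumption, and ties are assumed absent, as under continuous distributions):

```python
import numpy as np

def aalen_johansen(times, causes):
    """Aalen-Johansen estimates of the CIFs F_1, F_2 evaluated at the
    ordered observation times. Encoding (our assumption): `causes` holds
    1 or 2 for an observed transition and 0 for right-censoring."""
    order = np.argsort(times)
    times = np.asarray(times, dtype=float)[order]
    causes = np.asarray(causes)[order]
    n = len(times)
    km = 1.0                           # Kaplan-Meier estimate of P(T > u-)
    F1 = F2 = 0.0                      # running values of the two CIFs
    out1, out2 = [], []
    for i in range(n):
        at_risk = n - i                # Y(u): individuals still at risk
        if causes[i] in (1, 2):        # an observed transition, dN_j(u) = 1
            inc = km / at_risk         # P_hat(T > u-) * dN_j(u) / Y(u)
            if causes[i] == 1:
                F1 += inc
            else:
                F2 += inc
            km *= 1.0 - 1.0 / at_risk  # all-cause Kaplan-Meier update
        out1.append(F1)
        out2.append(F2)
    return times, np.array(out1), np.array(out2)

# without censoring, F1 + F2 equals the empirical distribution of T
rng = np.random.default_rng(0)
t = rng.exponential(1.0, size=200)
c = rng.integers(1, 3, size=200)       # causes 1 and 2, no censoring
grid, F1, F2 = aalen_johansen(t, c)
```

Without censoring the two estimated CIFs sum to the empirical distribution function of T, which is a useful sanity check.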
In addition, we denote the related estimator of the sub-survival function by Ŝ_j(t) = 1 − F̂_j(t). Construction of simultaneous confidence bands for a CIF, say F_1, is in general based on the corresponding normalized process W_n = √n (F̂_1 − F_1), which, under certain regularity assumptions, converges to a zero-mean Gaussian process. For example, a sufficient condition, which we will assume throughout, is the following: For t < τ there exists a deterministic function y with inf_{u∈[0,t]} y(u) > 0 such that sup_{u∈[0,t]} |Y(u)/n − y(u)| →p 0. (2.5) Here, and throughout the paper, "→p" denotes convergence in probability, whereas "→d" stands for convergence in distribution as n → ∞. In particular, under Assumption (2.5), the process W_n admits a representation (2.6) in terms of different local martingales, where for 1 ≤ i ≤ n, j = 1, 2, the summands M_{j;i} are local square integrable martingales. Note that we have suppressed the dependency on the sample size n as well as the appearance of the indicator J(u) in both integrals in (2.6) for better lucidity. As a consequence of (2.5) and (2.6) it follows from the martingale central limit theorems of Chapter IV in Andersen et al. (1993) [5] that W_n →d U, where U is a zero-mean Gaussian process with covariance function ζ(s_1, s_2), s_1 ≤ s_2, given in (2.9). Since the covariance function ζ is unknown and the process U lacks independent increments, resampling techniques are helpful tools for developing inference procedures. Here Lin's resampling technique, as well as the more general wild bootstrap approach (see Beyersmann et al., 2013 [10]), attacks the problem by using an adequate resampling process that in some sense reflects the representation (2.6) and reproduces its distribution in the limit. This will be the starting point of the following section.

Weighted resampling of the Aalen-Johansen estimator
The above mentioned wild bootstrap resampling procedure approximates the limit distribution of W_n by introducing i.i.d. zero-mean random variables G_{j;i}, 1 ≤ i ≤ n, 1 ≤ j ≤ 2, with variance 1 and finite fourth moment into the representation (2.6). Replacing M_{j;i} with G_{j;i} N_{j;i} and all unknown quantities with their estimators leads to the general wild bootstrap version Ŵ_n of W_n as introduced in Lin (1997) [24], see also Beyersmann et al. (2013) [10], where F̂_j and Ŝ_j, j = 1, 2, are the AJEs of F_j and S_j, respectively, see Equation (2.4). Note that we again have suppressed the appearance of the indicator J(u) in both integrals. In Beyersmann et al. (2013) [10] it was shown that the conditional distribution of Ŵ_n asymptotically coincides with the distribution of W_n. That is, given the data, we have convergence in distribution in probability, where U is as in (2.8). In practice, this result is implemented by simulating, for fixed data, a large number of independent copies of the multipliers G_{j;i} to approximate the conditional distribution of Ŵ_n. Here, Lin's (1997) [24] resampling scheme is obtained for standard normal multipliers.
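The Monte Carlo implementation of this idea can be sketched as follows (illustrative only: the matrix D below is a toy stand-in for the estimated martingale residual increments entering Ŵ_n, not the actual AJE quantities):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 21
# toy stand-in for the per-individual residual increment processes
D = rng.normal(size=(n, m)).cumsum(axis=1) / np.sqrt(m)

def wild_bootstrap_sup(D, B=1000, multiplier="normal", rng=rng):
    """B wild bootstrap replicates of sup_s |n^{-1/2} sum_i G_i D_i(s)|;
    multiplier 'normal' mimics Lin's (1997) scheme, 'poisson' the
    centered Poisson variant of Beyersmann et al. (2013)."""
    n = D.shape[0]
    out = np.empty(B)
    for b in range(B):
        if multiplier == "normal":
            G = rng.standard_normal(n)
        else:                          # Poisson(1) - 1: mean 0, variance 1
            G = rng.poisson(1.0, n) - 1.0
        out[b] = np.abs(G @ D).max() / np.sqrt(n)
    return out

sup_normal = wild_bootstrap_sup(D, multiplier="normal")
sup_poisson = wild_bootstrap_sup(D, multiplier="poisson")
# both multiplier choices target the same conditional limit distribution
q_n = np.quantile(sup_normal, 0.95)
q_p = np.quantile(sup_poisson, 0.95)
```

In the real setting such sup-statistic quantiles yield time-simultaneous confidence bands; here they only illustrate the resampling mechanics and the interchangeability of the multiplier laws.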
To obtain a better connection with Efron's classical bootstrap we rewrite (after multiplying by √2) the above wild bootstrap statistic √2·Ŵ_n as a weighted linear statistic (3.2) in an array Z_2n(s) = (Z_{2n;i}(s))_{i≤2n} of real-valued random variables, where the entries Z_{2n;i}(s), 0 ≤ s ≤ t, i = 1, ..., 2n, collect the integral terms belonging to the two causes. For fixed s, the representation (3.2) may thus be interpreted as a wild bootstrap version of the linear statistic √(2n) Σ_{i=1}^{2n} Z_{2n;i}(s). Now recall from Mammen (1992) [27] that for linear statistics in independent observations, the consistency of the wild bootstrap and of Efron's bootstrap go hand in hand. Translating the above representation to the classical bootstrap, where, given the observations, a random sample Z*_{2n;1}(s), ..., Z*_{2n;2n}(s) is drawn with replacement from Z_2n(s), leads to the bootstrap statistic W^E_n(s) = √(2n) Σ_{i=1}^{2n} (Z*_{2n;i}(s) − Z̄_2n(s)). Here Z̄_2n denotes the mean of Z_2n. Following Mason and Newton (1992) [29], the statistic W^E_n can be rewritten in a distributionally equivalent way in terms of multinomial resampling counts. This motivates us to study a general weighted bootstrap version of √2·Ŵ_n, namely W*_n(s) = √(2n) Σ_{i=1}^{2n} w_{2n;i} (Z_{2n;i}(s) − Z̄_2n(s)). (3.3) Here w_2n = (w_{2n;1}, ..., w_{2n;2n}) is an exchangeable vector of random variables that is independent of Z_2n. For example, the choice of Efron's bootstrap weights w_{2n;i} = m_{2n;i} − 1 delivers W*_n = W^E_n. Following Janssen (2005) [20] and Pauly (2011) [31] we impose regularity conditions (3.4)-(3.6) on the weights to achieve convergence of all finite dimensional distributions of the process W*_n(·) as n → ∞; in particular, w_{2n;1} →d Z is required, where Z is a random variable with E(Z) = 0 and Var(Z) = 1. Moreover, it turns out that sufficient conditions for the tightness of W*_n(·) are given by the additional moment conditions (3.7)-(3.9) on products of centered weights. Heuristically, Assumptions (3.7)-(3.9) ensure that the correlation between multiple factors of centered weights decreases quickly enough for large n and a high number of different leading terms. Under these assumptions we can prove the following weak convergence result for the exchangeably weighted bootstrap version (3.3) of the AJE.
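The identity behind the Mason-Newton rewriting can be checked numerically. In the toy sketch below (illustrative data; the vector Z plays the role of the column Z_2n(s) for one fixed s), summing a with-replacement resample is represented by multinomial counts m, and reweighting the centered data with w = m − 1 gives exactly the same linear statistic, up to the common normalization √(2n):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200                               # plays the role of 2n
Z = rng.normal(size=N) / np.sqrt(N)   # fixed "data" Z_{2n;i}(s), s fixed

# Efron's bootstrap weights: multinomial resampling counts minus one
m = rng.multinomial(N, np.full(N, 1.0 / N))
w = m - 1.0                           # exchangeable and summing to zero

# sum over the bootstrap sample, centered at the observed total ...
lhs = (m * Z).sum() - N * Z.mean()
# ... equals the weighted linear statistic in the *centered* data
rhs = (w * (Z - Z.mean())).sum()
```

The equality holds path by path because the weights sum to zero, so the centering terms cancel; this is precisely why (3.3) with Efron weights recovers W^E_n.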
The above theorem shows that the weighted bootstrap with exchangeable weights leads to a bootstrap version of W_n whose limit covariance function differs from the correct asymptotic covariance of the Aalen-Johansen process W_n by the summand ½ ξ(r)ξ(s). In comparison, the wild bootstrap statistic Ŵ_n from the beginning of Section 3 reproduces the correct limit process. The reason for this behaviour can easily be explained in the special case of the classical bootstrap (and it also applies to many other related resampling versions that fall into our approach). Efron's bootstrap version of a linear statistic needs to center each random variable Z_{2n;i} at the mean Z̄_2n. Without this centering, the bootstrap statistic √(2n) Σ_{i=1}^{2n} m_{2n;i} Z_{2n;i} (with conditional expectation (2n)^{3/2} Z̄_2n) would in general not satisfy a non-degenerate conditional limit theorem. However, this centering term affects the (conditional) covariance structure of the bootstrap process. In particular, it can be seen in the Appendix that its asymptotic covariance function ζ*(r, s) is given by the limit (in probability) of Σ_{i=1}^{2n} 2n (Z_{2n;i}(r) − Z̄_2n(r))(Z_{2n;i}(s) − Z̄_2n(s)). In comparison, the asymptotic covariance function of the wild bootstrap version √2·Ŵ_n is given by the limit (in probability) of Σ_{i=1}^{2n} 2n Z_{2n;i}(r) Z_{2n;i}(s), see the proof of Theorem 2 in Beyersmann et al. (2013) [10]. The reason is that, due to the i.i.d. structure of the zero-mean wild bootstrap weights, we can directly work with Z_{2n;i} instead of (Z_{2n;i} − Z̄_2n) to gain a conditional central limit theorem. Actually, Theorem 3.1 even shows that a modified wild bootstrap version of the AJE of the form (3.3) with i.i.d. weights (w_{2n;i})_i and centered variables (Z_{2n;i} − Z̄_2n) instead of Z_{2n;i} would not possess the correct limit structure.
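The effect of the centering on the covariance can be made tangible in a toy computation. For an array with non-vanishing column sums, the uncentered (wild-bootstrap-type) and centered (Efron-type) empirical covariances differ exactly by the product of the column sums, mirroring the extra summand of the theorem. In the hypothetical example below (our construction, not the AJE quantities), Z_i(t) = 1(U_i ≤ t)/n, so the uncentered form approximates the Brownian-motion covariance min(r, s) while the centered form approximates the Brownian-bridge covariance min(r, s) − rs:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
U = rng.uniform(size=n)
r, s = 0.4, 0.7

# toy array with non-vanishing column sums: Z_i(t) = 1(U_i <= t) / n
Z_r = (U <= r) / n
Z_s = (U <= s) / n

cov_wild = n * (Z_r * Z_s).sum()          # uncentered: wild bootstrap form
cov_efron = n * ((Z_r - Z_r.mean()) * (Z_s - Z_s.mean())).sum()
gap = Z_r.sum() * Z_s.sum()               # exact difference of the two

# cov_wild is close to min(r, s) = 0.4,
# cov_efron is close to min(r, s) - r*s = 0.12
```

The identity cov_wild − cov_efron = gap holds exactly for every sample; only the interpretation of the limits is asymptotic.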
This result now leads to the question whether Efron's bootstrap (or other resampling techniques that fall into our approach) is applicable at all for statistical inference about CIFs in competing risks studies. The answer is two-fold. Since W*_n reproduces the wrong covariance of the AJE, it is not applicable directly. This means that the asymptotic limit distribution of transformed versions (such as sup-distances or integral statistics) of the AJE of a CIF that serve as test statistics for a particular problem (such as testing equality or ordering of CIFs) can in general not be reproduced by the corresponding transformed exchangeably weighted bootstrap version (3.3). However, in some situations it may nevertheless be applicable after adding an adequate studentization to the corresponding test statistic, see e.g. Janssen (1997, 2005) [18,20] or [32] for similar examples in the context of testing. Roughly speaking, such a multiplicative studentization works if the statistic we are interested in becomes asymptotically pivotal after studentizing. To explain this statement we give a negative and a positive example. First, let us consider Cramér-von Mises-type statistics for testing equality of CIFs. In this case the asymptotic limit is given by a squared L²-norm of a Gaussian process, which admits a principal components decomposition, and its covariance function is a series depending on all eigenfunctions and eigenvalues of a corresponding integral operator, see Adler (1990) [2] or Shorack and Wellner (2009) [38] for details. In this case it seems plausible that a single studentization alone cannot transform this random variable into another principal components decomposition with predefined eigenvalues and eigenfunctions. Hence the result from Theorem 3.1 is not applicable in this situation.
However, if we consider, e.g., a one- or two-sample version of Pepe's test for the hypothesis of ordered CIFs, then it turns out that the resulting test statistic is asymptotically normal. In this situation a studentized version of the test statistic leads to an asymptotic standard normal distribution, and its finite sample distribution may be approximated by a related studentized bootstrap version. This will be studied in more detail in the next section for the more interesting two-sample case.

Two-sample resampling tests for ordered CIFs
In order to demonstrate the applicability of the above theory we study a specific inference problem of interest. Suppose we are interested in the comparison of two CIFs on a subinterval [t_1, t_2] of [0, τ) with 0 ≤ t_1 < t_2 < τ. Here we would like to test whether the CIFs from two independent groups with the same competing risk, say j = 1, possess a specific order. A practical interpretation may be given by two independent medical studies on the side effects of similar but different drugs. Another example is given in Bajorunaite and Klein (2007, Example 5) [6], where bone marrow transplant studies are compared. Note that similar null hypotheses (mainly the null hypothesis of equality) have already been studied in the literature, see e.g. Gray (1988) [17], Aly et al. (1994) [4], Barmy et al. (2006) [16], Bajorunaite and Klein (2007, 2008) [6,7] or Sankaran et al. (2010) [36] and the references cited therein, where some of them also apply Lin's resampling technique.
In the sequel we extend the notation from Section 2 with a superscript (k) to denote the quantities of the kth group, k = 1, 2. This yields the CIFs F_1^{(k)} for the competing risk j = 1 as well as counting processes N^{(k)} and risk processes Y^{(k)}, where n_k is the sample size of group k = 1, 2. The hypotheses of interest may then be written as H_≤ : {F_1^{(1)} ≤ F_1^{(2)} on [t_1, t_2]} against K : {F_1^{(1)} > F_1^{(2)} on [t_1, t_2]}. To this end, we suggest an integral-type test statistic, namely T_n = √(n_1 n_2 / n) ∫_{t_1}^{t_2} (F̂_1^{(1)}(s) − F̂_1^{(2)}(s)) ρ(s) ds, where n = n_1 + n_2 and ρ : [0, τ] → (0, ∞) is a deterministic and integrable function that allows for different weighting of time intervals of interest, see e.g. Pepe (1991) [33] for a similar choice. Note that such statistics are motivated by related goodness-of-fit problems, see, e.g., Shorack and Wellner (2009) [38] or van der Vaart and Wellner (1996) [40]. Well-known theorems from stochastic process theory then show that T_n is asymptotically N(0, σ²_ζ)-distributed under H_= : {F_1^{(1)} = F_1^{(2)} on [t_1, t_2]}, provided that n_k/n → p_k ∈ (0, 1) for k = 1, 2. Here the limit variance σ²_ζ is a weighted combination of double integrals of ζ^{(1)} and ζ^{(2)}, where ζ^{(k)} denotes the asymptotic covariance function of the Aalen-Johansen process W^{(k)}_{n_k} of group k = 1, 2, see Equation (2.9) above. Note that σ²_ζ > 0 holds if we have α_1^{(k)} > 0 on a set of positive Lebesgue measure in [t_1, t_2] for at least one k = 1, 2, which we assume in the sequel. As already explained at the end of Section 3, we need an asymptotically pivotal test statistic in order to apply our weighted bootstrap result from Theorem 3.1. This is achieved by studentizing T_n, which corrects for the wrong bootstrap limit covariance. To this end, we construct a consistent estimator V²_n by replacing ζ^{(k)} and p_k with their canonical estimators ζ̂^{(k)} and n_k/n. The resulting asymptotic level α test is ϕ_n = 1(T_{n,stud} > u_{1−α}), where u_{1−α} denotes the (1 − α)-quantile of the standard normal distribution and T_{n,stud} = T_n/V_n 1(V_n > 0). We will now construct a weighted resampling version of ϕ_n.
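On a time grid, the integral statistic can be evaluated by simple quadrature. The sketch below is a hedged reconstruction (the exact display is elided in the source; the form follows Pepe-type statistics) applied to hypothetical CIF curves:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule for the integral of y over the grid x."""
    return float(0.5 * ((y[1:] + y[:-1]) * (x[1:] - x[:-1])).sum())

def pepe_statistic(grid, F1_g1, F1_g2, n1, n2, rho=None):
    """Integral-type two-sample statistic in the spirit of Pepe (1991):
    T_n = sqrt(n1*n2/n) * int_{t1}^{t2} (F1^(1) - F1^(2)) * rho ds."""
    n = n1 + n2
    w = np.ones_like(grid) if rho is None else rho(grid)
    return np.sqrt(n1 * n2 / n) * trapezoid((F1_g1 - F1_g2) * w, grid)

# hypothetical CIF curves on [t1, t2] = [0, 1] with F^(1) >= F^(2)
grid = np.linspace(0.0, 1.0, 101)
F_g1 = 0.5 * (1.0 - np.exp(-2.0 * grid))
F_g2 = 0.4 * (1.0 - np.exp(-2.0 * grid))
T_alt = pepe_statistic(grid, F_g1, F_g2, n1=50, n2=100)
T_null = pepe_statistic(grid, F_g1, F_g1, n1=50, n2=100)
```

Under equality the statistic vanishes, while an ordered alternative yields a positive value, which is what the one-sided rejection rule exploits.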
In view of Theorem 3.1 and the martingale representation (2.6) under {F_1^{(1)} = F_1^{(2)}}, a weighted resampling version T*_n of T_n may be given by the analogous weighted integral statistic (4.3), where (w^{(k)}_{2n;i})_{i,k} is an array of exchangeable weights fulfilling (3.4)-(3.9) and Z̄_2n = (1/2n) Σ_{k=1}^2 Σ_{i=1}^{n_k} Z^{(k)}_{2n;i} denotes the overall mean. We like to note that the (−1) in this expression is due to the martingale representation of T_n. As shown below, an application of Theorem 3.1 yields that the conditional distribution of T*_n is asymptotically N(0, σ̃²_ζ)-distributed in probability, where in general σ̃²_ζ ≠ σ²_ζ due to the wrong limit covariance structure of the weighted bootstrap AJE.
As has already been seen in Janssen (2005) [20] as well as Konietschke and Pauly (2014) [23], different classes of weights need different studentizations. For convenience, and to avoid distinguishing between too many cases, we therefore now focus on only two resampling procedures: Efron's bootstrap with weights w_{2n;i} = m_{2n;i} − 1 and the wild bootstrap with w_{2n;i} = G_i. Here (m_{2n;1}, ..., m_{2n;2n}) is a multinomially distributed random vector with sample size 2n = Σ_{i=1}^{2n} m_{2n;i} and equal selection probabilities (2n)^{−1}, and (G_i)_i is a sequence of i.i.d. random variables with E(G_1) = 0, Var(G_1) = 1 and E(G_1⁴) < ∞. However, other resampling tests can be obtained similarly. Motivated by the weighted variance estimator given in Janssen (2005, Section 3) [20], a weighted resampling version of V²_n, say V*²_n, is then given by replacing the weights in V²_n with resampling weights v_{2n;i}. We thereby choose v_{2n;i} = m_{2n;i} in the case of Efron's bootstrap and v_{2n;i} = G²_i in the case of the wild bootstrap. With this choice it is proved in the Appendix that, under H_= : {F_1^{(1)} = F_1^{(2)} on [t_1, t_2]} and the conditions given in Theorem 4.1 below, the conditional distribution of T*_{n,stud} = T*_n/V*_n 1(V*_n > 0) given the data is asymptotically N(0, 1)-distributed in probability. Moreover, the resulting weighted resampling tests (corresponding either to Efron's or wild bootstrap weights) ϕ*_n = 1(T_{n,stud} > c*_n(α)) are consistent and even asymptotically exact, where c*_n(α) is the (data-dependent) (1 − α)-quantile of the conditional distribution of T*_{n,stud} given the data.

Theorem 4.1. Suppose that (2.5) holds for both groups. Then ϕ_n is a consistent and asymptotic level α test, i.e. E_{H_≤}(ϕ_n) → α 1(F_1^{(1)} = F_1^{(2)} on [t_1, t_2]) and E_K(ϕ_n) → 1. If, in addition, σ²_ζ > 0, then ϕ*_n is also consistent and of asymptotic level α. Moreover, ϕ_n and ϕ*_n are asymptotically equivalent, i.e. under H_= it holds that E_{H_=}(|ϕ_n − ϕ*_n|) → 0.

Remark 4.1.
(a) The asymptotic equivalence implies that both tests also possess the same power under contiguous alternatives.
(b) In the case of the wild bootstrap the results remain valid if we omit the centering term Z̄_2n in (4.3) as well as the covariance correction ξ*_n(s, t). Below we denote the resulting test by ϕ^W_n.
(c) Note that the assumption of a deterministic weight function can be relaxed. In particular, it can be shown that the above theorem remains valid for non-deterministic sequences of weight functions ρ_n : [0, τ] → (0, ∞) such that sup_s |ρ_n(s) − ρ(s)| → 0 in probability for an integrable and deterministic function ρ : [0, τ] → (0, ∞). This can be shown using straightforward stochastic process arguments similar to those applied in Brendel et al. (2014) [12].
(d) Utilizing the squared weights v_{2n;i} = G²_i within the wild bootstrap variance estimator can be motivated by corresponding symmetry-type tests with random sign weights G_i ∼ ½(ε_1 + ε_{−1}). Such tests are typically applied in the context of paired data, where the involved studentization of the test statistic is often invariant under reflections of the coordinates, see Janssen (1999) [19] or Konietschke and Pauly (2014) [23] for details and examples. In this case, the resampling version of the studentization remains unchanged since G²_i = 1 holds for this choice of weights. Hence the choice v_{2n;i} = G²_i generalizes this to all covered wild bootstrap procedures.
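The construction of ϕ*_n follows a generic pattern: compute the studentized statistic, generate B conditional replicates, and compare with their empirical (1 − α)-quantile c*_n(α). The sketch below illustrates this pattern with a studentized mean of plain residuals standing in for T_{n,stud} (an analogy, not the paper's exact statistic), using wild bootstrap weights G_i and the squared weights v_i = G_i² in the resampled studentization:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.05
n = 100
x = rng.normal(size=n)    # stand-in residuals, NOT the paper's AJE terms

# studentized statistic standing in for T_{n,stud}
T_stud = np.sqrt(n) * x.mean() / x.std(ddof=1)

# wild bootstrap replicates: multiply centered residuals by i.i.d. G_i
# and re-studentize with the squared weights v_i = G_i^2
B = 2000
reps = np.empty(B)
xc = x - x.mean()
for b in range(B):
    G = rng.standard_normal(n)
    num = np.sqrt(n) * (G * xc).mean()
    den = np.sqrt((G ** 2 * xc ** 2).mean())
    reps[b] = num / den

c_star = np.quantile(reps, 1.0 - alpha)   # data-dependent c*_n(alpha)
reject = bool(T_stud > c_star)
```

Because both the statistic and its replicates are studentized, the replicates are approximately standard normal regardless of the residual scale, which is exactly the pivotality that Theorem 3.1 requires.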
In the next section the finite sample properties of the asymptotic test ϕ_n, Efron's bootstrap test ϕ^E_n (= ϕ*_n with weights w_{2n;i} = m_{2n;i} − 1) and the wild bootstrap test ϕ^W_n from Remark 4.1 with normal multipliers are investigated in a small Monte Carlo study.

Simulations
The testing procedures from the last section are all valid asymptotically, i.e. as n → ∞. In the next step their small sample properties are investigated in a small simulation study with regard to (i) keeping the preassigned error level under the null hypothesis and (ii) their power behaviour under certain alternatives. All simulations were conducted with the help of the R computing environment, version 2.15.0 (R Development Core Team, 2010), each with N_sim = 1000 simulation runs. Moreover, for the resampling tests we have additionally drawn B = 999 bootstrap samples in each simulation step. For the type I error we consider the following simulation set-up: the cause-specific intensities α_j^{(k)} of the event times were modeled such that the CIFs of the first risk coincide in both groups, and in case of censoring we have analyzed situations with equal censoring rates (λ^{(1)}, λ^{(2)}) = (0.5, 0.5) (light censoring) and (λ^{(1)}, λ^{(2)}) = (1, 1) (moderate censoring) as well as unequal censoring distributions with (λ^{(1)}, λ^{(2)}) = (0.5, 1).
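Simulating from cause-specific hazards is straightforward in the constant-hazard case: the event time is exponential with the total hazard, and the cause is drawn with probabilities proportional to the cause-specific hazards. The sketch below is our illustration (exponential censoring and the chosen rates are assumptions, though the constant hazards match the second simulated group described in the text):

```python
import numpy as np

def simulate_competing_risks(n, a1, a2, cens_rate, rng):
    """Competing risks data under constant cause-specific hazards a1, a2
    with exponential censoring (rate 0 = no censoring). Returns observed
    times and causes (1, 2, or 0 for censored). Time-varying hazards, as
    in the first group, would additionally need inversion or thinning."""
    total = a1 + a2
    T = rng.exponential(1.0 / total, size=n)           # event times
    cause = 1 + (rng.uniform(size=n) < a2 / total)     # P(cause 2) = a2/total
    if cens_rate > 0:
        C = rng.exponential(1.0 / cens_rate, size=n)   # censoring times
        return np.minimum(T, C), np.where(T <= C, cause, 0)
    return T, cause

rng = np.random.default_rng(5)
t_obs, c_obs = simulate_competing_risks(5000, 1.0, 1.0, 0.5, rng)
# with a1 = a2 = 1 the CIF of risk 1 is 0.5 * (1 - exp(-2t)),
# matching the equal-CIF null setting described below
```

With these rates roughly a fifth of the observations are censored, and the two causes occur equally often among the events.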
Note that our simulation designs are driven by the cause-specific hazards as suggested in Beyersmann et al. (2009, Section 3.2) [9]. Here the second group corresponds to the typical proportional intensity model with constant cause-specific hazards, whereas individuals of the first group have decreasing and increasing cause-specific hazard rates, respectively.
The results for the type I errors (for α = 0.05) of the three tests can be found in Table 1, where the case without censoring is denoted by (λ^{(1)}, λ^{(2)}) = (0, 0). For easier reading, the result closest to the prescribed 5% level is printed in bold type. Note that in this setting we have equality of the CIFs F_1^{(k)}(t) = 0.5(1 − exp(−2t)), k = 1, 2, of the first risk j = 1, but unequal CIFs of the second risk. It is seen that, in most of the scenarios, the bootstrap test ϕ^E_n based on Efron's multinomially distributed weights has a simulated type I error above the 5% level (sizes in [.049, .074]). Thus, ϕ^E_n tends to be quite liberal. In contrast, the test ϕ_n based on the 95% quantile of the standard normal distribution and the wild bootstrap test ϕ^W_n based on i.i.d. standard normally distributed weights keep the 5% level much better. In most cases, ϕ^W_n (sizes in [.041, .062]) seems to be slightly more accurate than ϕ_n (sizes in [.041, .063]), especially in settings with unbalanced sample sizes (n_1, n_2) = (50, 100).
The results for the power of all tests are presented in Table 2, where simulations have been performed for alternative hypotheses corresponding to c = 0.1, 0.2, ..., 0.9. Here the choice c = 0.9 corresponds to a situation close to the null hypothesis, whereas we move farther into the alternative with decreasing c. Apparently, ϕ^E_n has the greatest power in all scenarios, in line with its quite liberal behaviour. We therefore turn our attention to the differences between ϕ_n and ϕ^W_n. Apart from a few exceptions, ϕ^W_n has marginally greater power than ϕ_n. In particular, all differences in the simulated powers of the two tests lie in the interval [−.006, .012].
Thus, having the simulated type I error rates in mind, there is a clear preference for ϕ^W_n. However, since the improvement over ϕ_n is not very large, we plan to study the behaviour of the presented tests in a more applied paper in the future, where they will additionally be compared with other existing procedures. There, also other resampling versions that fall into our approach (such as the i.i.d. weighted bootstrap, Rubin's Bayesian bootstrap or simply other i.i.d. weights with finite fourth moment, cf. Example A.1) shall be studied in extensive simulations for different settings. On the other hand, the simulation results for the present set-up strongly suggest not to use ϕ^E_n in this context.

Discussion and outlook
We have considered a weighted bootstrap approach for the AJE of a competing risk, including amongst others Efron's classical bootstrap, Rubin's Bayesian bootstrap as well as the wild bootstrap. It turned out that the asymptotic covariance structure of the AJE is not reflected correctly by the weighted bootstrap. This handicap is due to the resampling from centered data, which is a necessity for most of the presented bootstrap procedures. One exception is the wild bootstrap of Lin (1997) [24] and Beyersmann et al. (2013) [10], where this centering is not needed due to the i.i.d. structure of the weights. Nevertheless, we have demonstrated that the covariance problem can be solved for specific inference problems. Roughly speaking, the general weighted bootstrap approach can be used for test statistics (here functionals of AJEs) which are asymptotically pivotal. This has been demonstrated for the unpaired two-sample testing problem of ordered CIFs, where an integral-type statistic is made asymptotically pivotal by an adequate studentization. If, however, the limit distribution of the test statistic is more complicated (e.g. if a variance-stabilizing transformation or studentization cannot produce pivotality), the general weighted bootstrap is not applicable. Hence, when dealing with more complicated settings such as, e.g., nonparametrically testing for equality of different CIFs, the (general) wild bootstrap based on the uncentered observations Z_{2n;i} seems to be the only known and reasonable choice and offers at least some safeguards.
Other possibilities for testing equality of different CIFs than the wild bootstrap will therefore be studied by the authors in a forthcoming paper.
Finally, we like to note that in semiparametric models the above approach may be improved by modifying the presented resampling algorithms as in Lin et al. (2000) [25] or Scheike and Zhang (2003) [37], where the martingale increments dM_{0j;i} in the resampling step are replaced with estimated increments dM̂_{0j;i} rather than dN_{0j;i}.

Appendix: The proofs
Proof of Theorem 3.1. In order to prove the result we have to show (conditional) weak convergence of the finite dimensional (fidi) distributions as well as tightness. For the former we apply Theorem 4.1 in Pauly (2011) [31], and for the latter we use a tightness criterion as in Billingsley (1999) [11]. To verify the fidi convergence of the process, let t_1, ..., t_m ∈ [0, t] be arbitrary time points, where ||·|| denotes the Euclidean distance. This implies condition (4.1) in Pauly (2011) [31]. Now the calculation of (4.2) in Pauly (2011) [31] finishes the proof of the fidi convergence: similarly to Beyersmann et al. (2013) [10], the first sum converges to 2ζ(t_j, t_ℓ) in probability. Moreover, each factor of the second sum has a local martingale representation (A.2) based on the Doob-Meyer decomposition of the counting process N_1 + N_2. Note that each of the first three integrals in (A.2) is also a local square integrable martingale by Theorem II.3.1 of Andersen et al. (1993) [5]. By Rebolledo's martingale limit theorem it is easy to see that each local martingale in (A.2) converges to zero in probability: consider, for instance, the corresponding predictable variation process, which vanishes asymptotically by Condition (2.5); here we have implicitly used the notation of Andersen et al. (1993) [5]. A similar result holds for the other local martingales. The remaining integrals, however, converge to their deterministic limits in probability by the uniform consistency of the AJE and Condition (2.5). This shows (4.2) in Pauly (2011) [31] and thus the desired fidi convergence. It remains to prove conditional tightness of the process. To this end, we apply Theorem 13.5 in Billingsley (1999) [11] and rewrite the relevant increments. Let 0 ≤ r ≤ s ≤ u ≤ t and β = 1. Then, by the measurability of Z_2n and its independence of w_2n, it follows that the relevant fourth moment bound decomposes into terms C_k D_k E_k, where C_k, k = 1, ..., 5, counts the number of possible index combinations each leading to the same expected value.
For example, $C_3 = 3$ due to the corresponding index combinations. The $D_k$ are defined analogously, where the sum runs over all indices $i_1, i_2, i_3, i_4$ that yield the expected value $E_k$. Each case $k = 1, \dots, 5$ is treated separately: Recall that each $Z_{2n;i}$ is represented by a one-jump process $N_{1;i}$ or $N_{2;i}$, so that the resulting term tends to zero in probability by Lemma A.1. Condition (3.7) then yields the negligibility of $C_1 D_1 E_1$. For treating $k = 2$ first note that, by the Cauchy-Schwarz inequality, Assumption (2.5) and the involved $(Y/n)^{-1}$ in the integrand, the asymptotic boundedness of $\max_i n |Z_{2n;i}(y) - Z_{2n;i}(x)|$ in probability (here and below denoted by $O_P(1)$) yields the last inequality. Applying the Hölder$(p, q)$ inequality with $p = 3/4$, $q = 1/4$ to the expectation $E_2$, we arrive at an upper bound for $C_2 D_2 E_2$. Now Conditions (3.7)-(3.9) and straightforward applications of the Cauchy-Schwarz inequality as above imply a bound in which $O_P(1)$ can be chosen independently of $r$, $s$, and $u$. Thus, we have found a common upper bound for $C_k D_k E_k$, $k = 1, \dots, 5$, that equals $O_P(1)$ times a quantity built from the functions $h_n(x, y)$ below.

D. Dobler and M. Pauly
For example, we have $h_n(r, s) = n \sum_{i=1}^{n} (X_{2n;i}(s) - X_{2n;i}(r))^2 + n \sum_{i=1}^{n} (Y_{2n;i}(s) - Y_{2n;i}(r))^2$ in the case $(x, y) = (r, s)$. Due to similarity, we only consider the first term. Since $N_{1;i}$, $1 \le i \le n$, are all one-jump processes, this term admits an integral representation in which the left-continuity of all integrands should be kept in mind, and where $\sigma_j^2(s) = \int_0^s \alpha_j(v)/y(v)\,dv$ for $j = 1, 2$; see Equation (4.1.11) in Andersen et al. (1993) [5]. Similarly, the convergence of the second sum holds with $\sigma_2^2$ instead of $\sigma_1^2$. We can now finish the proof similarly to Beyersmann et al. (2013) [10] by the subsequence principle for convergence in probability: For each subsequence there exists a further subsequence such that, for $P$-almost every $\omega \in \Omega$, there exists a sequence of non-decreasing, continuous functions $H_n$ such that (A.3) is less than or equal to $C (H_n(u) - H_n(r))^{3/2}$ for all $n \ge n_0$ and a constant $C > 0$. Note that $n_0$ and $C$ are independent of $r, s, u \in [0, t]$. Here $H_n$ converges uniformly to a non-decreasing, continuous function $H$. Hence the conditional tightness follows from a slight extension of Theorem 13.5 in Billingsley (1999) [11], pointwise along subsequences, which in turn implies the assertion of this theorem.
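For intuition about the variance function $\sigma_j^2(s) = \int_0^s \alpha_j(v)/y(v)\,dv$, it can be evaluated numerically. The sketch below assumes, purely for illustration, constant cause-specific hazards and no censoring, so that $y(v) = \exp(-(\alpha_1 + \alpha_2)v)$ is the limiting at-risk proportion; these assumptions are not from the paper. In that case the integral has the closed form $\frac{\alpha_j}{\alpha_1 + \alpha_2}\bigl(e^{(\alpha_1+\alpha_2)s} - 1\bigr)$, which the quadrature reproduces.

```python
import numpy as np

def sigma2(s, alpha1, alpha2, j=1, m=20000):
    """sigma_j^2(s) = int_0^s alpha_j(v) / y(v) dv, evaluated by the
    trapezoidal rule. Illustrative assumption: constant cause-specific
    hazards and no censoring, hence y(v) = exp(-(alpha1 + alpha2) * v)."""
    a = alpha1 if j == 1 else alpha2
    v = np.linspace(0.0, s, m)
    f = a / np.exp(-(alpha1 + alpha2) * v)   # integrand alpha_j / y
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(v)))
```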
Proof of Theorem 4.1. As already outlined above, the convergences $T_n \xrightarrow{\mathcal{D}} T \sim N(0, \sigma_\zeta^2)$ and $V_n^2 \xrightarrow{P} \sigma_\zeta^2$ (see Lemma A.1 below) hold under $H_=$. Moreover, $\sigma_\zeta^2 > 0$ holds by assumption. Hence $T_{n,stud}$ is asymptotically standard normal by Slutzky's Lemma. In addition, $\sigma_\zeta^2 > 0$ even holds for $F_1$, and the involved estimators converge pointwise to their continuous counterparts in probability as $n \to \infty$. This is due to the consistency of the AJE for CIFs as well as a similar argument as in Beyersmann et al. (2013) [10]. A simple Polya-type argument now shows that such monotonic process estimators even converge uniformly on $[t_1, t_2]^2$ in probability, which implies the convergence of the weighted integrals over $\hat\zeta_n$ and $\hat\xi_n(r)\hat\xi_n(s)$ and thus the consistency of $V_n^2$ in probability. We now continue by showing consistency of $V_n^{*2}$ and start by proving that the term involving $(\hat\zeta_n - \zeta_n^*)(r, s)$ is negligible. Recall that the $Z_{2n;i}$ are defined as integrals with respect to counting processes. We now pool each quantity in a canonical way by merging the indices into $(N^{(k)}_{j;i})_{i,k}$, and similarly for $J$ and $Y$. Then, after changing the order of integration to $dr\,ds\,dN^{(k)}_{j;i}$, we see that (A.4) is bounded from above by a sum of terms involving the indicators $1(n < l_k \le n + n_1)$ and $1(n + n_1 < l_k)$ together with the estimators $\hat S_1$ and $\hat F_1$, which converges to zero in probability given the data. In the same way it can be shown that the remaining integral, with $(\hat\zeta_n - \zeta_n^*)(r, s)$ replaced by $\xi_n^*(r, s) - \hat\xi_n(r)\hat\xi_n(s)$ in (A.6), also converges to zero in probability given the data, which completes the proof.
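The Polya-type argument used above — pointwise convergence of monotone functions to a continuous limit upgrades automatically to uniform convergence — can be illustrated numerically with a toy sequence (hypothetical, not from the paper): $F_n(x) = x^{1+1/n}$ on $[0, 1]$, which is monotone for each $n$ and converges pointwise to the continuous limit $F(x) = x$.

```python
import numpy as np

def sup_distance(n, grid_size=10001):
    """Sup-norm distance on [0, 1] between the monotone function
    F_n(x) = x**(1 + 1/n) and its continuous pointwise limit F(x) = x
    (toy illustration of the Polya-type uniform convergence argument)."""
    x = np.linspace(0.0, 1.0, grid_size)
    return float(np.max(np.abs(x ** (1.0 + 1.0 / n) - x)))
```

The sup-norm distance shrinks with $n$ (roughly like $1/(en)$ for this toy sequence), even though only pointwise convergence was assumed.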
Finally, we consider the examples mentioned in Remark 3.1(b) and prove that they fulfill the assumptions of Theorem 3.1. The extensions to the two-sample case as mentioned in Section 4 are straightforward.
Proof of Example A.1. We first show that the weights given in (a)-(c) fulfill Conditions (3.4)-(3.9). Since part (a) is the most difficult to prove, we only consider this part and leave the others as an exercise. Moreover, we only show that Condition (3.9) holds, since (3.7) and (3.8) can be shown similarly and the proof of (3.4)-(3.6) can be found in Janssen (2005) [20] and Pauly (2009) [30]. Let $n \ge 2$. We start with a decomposition in which each single expectation is further calculated with the help of the moment generating function of $(m_{2n;i})_i$ or by consulting the monograph of Johnson et al. (1997) [22]. In particular, $E[m_{2n;1} m_{2n;2}] = \operatorname{cov}(m_{2n;1}, m_{2n;2}) + E[m_{2n;1}]^2 = -2n \cdot \frac{1}{4n^2} + 1 = 1 - \frac{1}{2n}$, so that the initial expectation finally equals a quantity of order $O(n^{-2})$. Hence (a) follows. Part (b) can be shown in the same way, and (c) is merely a special case of (b). We will now prove (d) with the help of (b). To this end we rewrite $W_n^*$ as a product, where we utilize in the first and last equality the identity $\sum_i (Z_{2n;i} - \bar{Z}_{2n}) = 0$. Here the first factor $C_\eta^{-1} \sigma_\eta / \bar\eta_{2n}$ on the right-hand side converges to 1 almost surely by the SLLN, and the second factor is the wild bootstrap version (3.3) of the AJE with the weights $G_i = (\eta_i - \mu_\eta)/\sigma_\eta$. Hence the assertion is a consequence of Slutzky's Lemma and part (b). Part (e) is merely a special case of (d).
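As a sanity check on the arithmetic above, $E[m_{2n;1} m_{2n;2}]$ can be recomputed exactly by brute-force enumeration, assuming (as for Efron-type bootstrap weights) that $(m_{2n;1}, \dots, m_{2n;2n})$ follows a multinomial distribution with $2n$ trials and equal cell probabilities $1/(2n)$. This is an illustrative verification with a hypothetical helper, not part of the proof.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def exact_cross_moment(n):
    """E[m_{2n;1} * m_{2n;2}] for (m_{2n;1}, ..., m_{2n;2n}) distributed as
    Multinomial(2n; 1/(2n), ..., 1/(2n)), computed exactly with rational
    arithmetic by enumerating all count vectors. Feasible for small n only."""
    N = 2 * n                           # number of trials = number of cells
    p_power = Fraction(1, N) ** N       # each ordered outcome has prob (1/N)^N
    total = Fraction(0)
    for counts in product(range(N + 1), repeat=N):
        if sum(counts) != N:
            continue
        coef = factorial(N)             # multinomial coefficient N!/(m_1!...m_N!)
        for c in counts:
            coef //= factorial(c)
        total += counts[0] * counts[1] * coef * p_power
    return total
```

For n = 2 this returns 3/4, matching the displayed value $1 - \frac{1}{2n}$.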