Nearly assumptionless screening for the mutually-exciting multivariate Hawkes process.

We consider the task of learning the structure of the graph underlying a mutually-exciting multivariate Hawkes process in the high-dimensional setting. We propose a simple and computationally inexpensive edge screening approach. Under a subset of the assumptions required for penalized estimation approaches to recover the graph, this edge screening approach has the sure screening property: with high probability, the screened edge set is a superset of the true edge set. Furthermore, the screened edge set is relatively small. We illustrate the performance of this new edge screening approach in simulation studies.

In this section, we provide a very brief review of the multivariate Hawkes process. A more comprehensive discussion can be found in Liniger (2009) and Zhu (2013).
Following Brémaud and Massoulié (1996), we define a simple point process N on R_+ as a family {N(A)}_{A ∈ B(R_+)} taking integer values (including positive infinity), where B(R_+) denotes the Borel σ-algebra of the positive half of the real line. Further let t_1, t_2, ... ∈ R_+ be the event times of N. In this notation, N(A) = Σ_i 1_{[t_i ∈ A]} for A ∈ B(R_+). We write N[t, t + dt) as dN(t), where dt denotes an arbitrarily small increment of t. Let H_t be the history of N up to time t. Then the H_t-predictable intensity process of N is defined as

λ(t) dt ≡ E[dN(t) | H_t].  (1)

Now suppose that N is a marked point process, in which each event time t_i is associated with a mark m_i ∈ {1, ..., p} (see, e.g., Definition 6.4.I. in Daley and Vere-Jones, 2003). We can then view N as a multivariate point process {N_j}_{j=1,...,p}, of which the jth component process is given by N_j(A) = Σ_i 1_{[t_i ∈ A, m_i = j]} for A ∈ B(R_+). To simplify the notation, we let t_{j,1}, t_{j,2}, ... ∈ R_+ denote the event times of N_j.
The intensity of the jth component process is λ_j(t) dt ≡ E[dN_j(t) | H_t]. In the case of the linear Hawkes process, this function takes the form (Brémaud and Massoulié, 1996; Hansen, Reynaud-Bouret and Rivoirard, 2015)

λ_j(t) = μ_j + Σ_{k=1}^p Σ_{i: t_{k,i} ≤ t} ω_{j,k}(t − t_{k,i}).  (2)

We refer to μ_j ∈ R as the background intensity, and ω_{j,k}(·): R_+ → R as the transfer function.
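To make (2) concrete, the intensity can be evaluated directly from the event history. The following is a minimal sketch under our own conventions (the function and argument names are illustrative, not the paper's; event times are assumed sorted):

```python
import math

def hawkes_intensity(j, t, event_times, mu, omega):
    """Evaluate the linear Hawkes intensity lambda_j(t) of equation (2).

    event_times[k] holds the sorted event times t_{k,1}, t_{k,2}, ... of
    component k; mu[j] is the background intensity mu_j; omega(j, k, dt)
    is the transfer function omega_{j,k} evaluated at lag dt >= 0.
    """
    rate = mu[j]
    for k, times in enumerate(event_times):
        for t_ki in times:
            if t_ki > t:
                break  # times are sorted, so later events cannot contribute
            rate += omega(j, k, t - t_ki)
    return rate
```

For example, with the transfer function ω(t) = 2t exp(1 − 5t) used in the simulations of Section 3, a single past event at lag 0.2 adds 0.4 to the background rate.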
For p fixed, Brémaud and Massoulié (1996) established that the linear Hawkes process with intensity function (2) is stationary given the following assumption.

Assumption 1. The spectral radius of the matrix Γ ∈ R^{p×p} with entries Γ_{j,k} ≡ ∫_0^∞ ω_{j,k}(Δ) dΔ is strictly smaller than one.
We now define a directed graph with node set {1, ..., p} and edge set

E ≡ {(j, k) : ω_{j,k} ≢ 0, 1 ≤ j, k ≤ p},

for ω_{j,k} given in (2). Let

s ≡ max_{1≤j≤p} card{k : (j, k) ∈ E}  (4)

denote the maximum in-degree of the nodes in the graph. In this paper, we propose a simple screening procedure that can be used to obtain a small superset of the edge set E.

Estimation and theory for the Hawkes process
We first consider the low-dimensional setting, in which the dimension of the process, p, is fixed, and T, the time period during which the point process is observed, is allowed to grow. In this setting, asymptotic properties such as the central limit theorem have been established; for instance, see Bacry et al. (2013) and Zhu (2013). Consequently, estimating the edge set E is straightforward in low dimensions. In high dimensions, when p might be large, we can fit the Hawkes process model using a penalized estimator of the form

minimize_{ω_{j,k} ∈ F, 1≤j,k≤p}  L({ω_{j,k}}; {N_j}_{j=1}^p) + λ Σ_{j,k=1}^p P(ω_{j,k}; {N_j}_{j=1}^p),  (5)

where L(·; {N_j}_{j=1}^p) is a loss function, based on, e.g., the log-likelihood (Bacry, Gaïffas and Muzy, 2015) or least squares (Hansen, Reynaud-Bouret and Rivoirard, 2015); P(·; {N_j}_{j=1}^p) is a penalty function, such as the lasso (Hansen, Reynaud-Bouret and Rivoirard, 2015); λ is a nonnegative tuning parameter; and F is a suitable function class. Then, a natural estimator for E is {(j, k) : ω̂_{j,k} ≢ 0}, where ω̂_{j,k} solves (5). Recently, Reynaud-Bouret and Schbath (2010), Bacry, Gaïffas and Muzy (2015), and Hansen, Reynaud-Bouret and Rivoirard (2015) have established that under certain assumptions, penalized estimation approaches of the form (5) are consistent in high dimensions, provided that the edge set E is sparse. For instance, Hansen, Reynaud-Bouret and Rivoirard (2015) establish an oracle inequality for the lasso estimator for the Hawkes process, given that certain conditions hold on the observed event times. However, to show that these conditions hold with high probability for arbitrary samples, these theoretical results require that the point process is mutually-exciting: that is, an event in one component process can increase, but cannot decrease, the probability of an event in another component process. This amounts to assuming that ω_{j,k}(Δ) ≥ 0 for all Δ ≥ 0, for ω_{j,k} defined in (2).
When the dimension p is large, penalized estimation procedures of the form (5) (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015) become computationally expensive: they require O(Tp²) operations per iteration of an iterative algorithm. This is problematic in contemporary applications, in which p can be on the order of tens of thousands (Ahrens et al., 2013). These concerns motivate us to propose a simple and computationally efficient edge screening procedure for estimating the true edge set E in high dimensions. Under very few assumptions, our proposed screening procedure is guaranteed to select a small superset of the true edge set E.

Organization of paper
The rest of this paper proceeds as follows. In Section 2, we introduce our screening procedure for estimating the edge set E, and establish its theoretical properties. We present simulation results in support of our proposed procedure in Section 3. Proofs of theoretical results are presented in Section 4, and the Discussion is in Section 5.
2. An edge screening procedure

Approach
For j = 1, ..., p, let Λ_j denote the mean intensity of the jth point process introduced in Section 1. That is,

Λ_j dt ≡ E[dN_j(t)].  (6)

Following Equation 5 of Hawkes (1971), for any Δ ∈ R, the (infinitesimal) cross-covariance of the jth and kth processes is defined as

V_{j,k}(Δ) dΔ dt ≡ E[dN_j(t + Δ) dN_k(t)] − δ(Δ) 1_{[j=k]} Λ_j dΔ dt − Λ_j Λ_k dΔ dt,  (7)

where δ(·) is the Dirac delta function, which satisfies ∫_R f(x) δ(x) dx = f(0) for any function f that is continuous at zero. For a given value of Δ, we can estimate the cross-covariance function V_{j,k}(Δ) using kernel smoothing:

V̂_{j,k}(Δ) ≡ (1/(Th)) Σ_i Σ_{i'} K((t_{j,i} − t_{k,i'} − Δ)/h) − Λ̂_j Λ̂_k,  (8)

where Λ̂_j ≡ N_j([0, T])/T, h > 0 is a bandwidth, and K is a kernel function. In this paper, we focus on kernel functions that are bounded by 1 and are defined on a bounded support, i.e., supp(K) ⊆ [−1, 1] (e.g., the Epanechnikov kernel). Let B denote a tuning parameter that defines the time range of interest for V̂_{j,k}: [−B, B]. For any ζ > 0, we define the set of screened edges as

Ê(ζ) ≡ {(j, k) : ‖V̂_{j,k}‖_{2,[−B,B]} ≥ ζ},  (9)

where ‖f‖_{2,[l,u]} ≡ (∫_l^u f²(t) dt)^{1/2} is the L²-norm of a function f on the interval [l, u].
The screened edge set Ê(ζ) in (9) can be calculated quickly: ‖V̂_{j,k}‖_{2,[−B,B]} can be calculated in O(T) computations, and so Ê(ζ) can be calculated in O(Tp²) computations. The procedure can also be easily parallelized.
There are three tuning parameters in the procedure: the bandwidth h in (8), the range B in (9), and the screening threshold ζ in (9). The bandwidth h can be chosen by cross-validation. The range B can be selected based on the problem setting. For instance, when using the multivariate Hawkes process to model a spike train data set in neuroscience, we can set B to equal the maximum time gap between a spike and any spike it could possibly evoke. The screening threshold ζ can be chosen to match the sparsity level that we expect based on prior knowledge. Alternatively, we may use a small value of ζ in order to reduce the chance of false negative edges in Ê(ζ), or a larger value to accommodate limited computational resources in our downstream analysis.
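The full screening step can be sketched in a few lines: estimate each cross-covariance on a grid of lags with an Epanechnikov kernel, approximate its L²-norm on [−B, B] numerically, and threshold at ζ. This is a minimal illustration under our own conventions (grid-based norm approximation, no special treatment of the diagonal pairs j = k), not the authors' implementation:

```python
import numpy as np

def cross_cov_hat(tj, tk, grid, h, T):
    """Kernel estimate of the cross-covariance V_{j,k}(Delta) on a grid of
    lags, in the spirit of (8): smooth the pairwise lags t_{j,i} - t_{k,i'}
    with an Epanechnikov kernel and subtract the product of the estimated
    mean intensities."""
    lags = (np.asarray(tj)[:, None] - np.asarray(tk)[None, :]).ravel()
    u = (lags[None, :] - grid[:, None]) / h
    K = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)  # Epanechnikov
    return K.sum(axis=1) / (T * h) - (len(tj) / T) * (len(tk) / T)

def screen_edges(events, h, B, zeta, T, n_grid=201):
    """Screened edge set E_hat(zeta) of (9): node pairs whose estimated
    cross-covariance has (grid-approximated) L2 norm on [-B, B] >= zeta."""
    grid = np.linspace(-B, B, n_grid)
    dx = grid[1] - grid[0]
    screened = set()
    for j, tj in enumerate(events):
        for k, tk in enumerate(events):
            V = cross_cov_hat(tj, tk, grid, h, T)
            if np.sqrt((V ** 2).sum() * dx) >= zeta:
                screened.add((j, k))
    return screened
```

Each pair costs O(n_grid × n_j n_k) here; the O(T) per-pair cost quoted above corresponds to a single pass over the sorted lags, which this vectorized sketch does not attempt.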

Theoretical results
We consider the asymptotics of triangular arrays (Greenshtein and Ritov, 2004), in which the dimension p is allowed to grow with T. When unrestricted, it is possible to construct extreme networks in which, for instance, the mean intensity Λ_j in (6) diverges to infinity. To rule out such cases, we impose the following regularity assumption.
Assumption 2. There exist positive constants Λ_min, Λ_max, and V_max such that 0 < Λ_min ≤ Λ_j ≤ Λ_max and sup_{Δ∈R} |V_{j,k}(Δ)| ≤ V_max for all 1 ≤ j, k ≤ p, where Λ_j and V_{j,k} are defined in (6) and (7), respectively. Furthermore, Λ_min, Λ_max, and V_max are generic constants that do not depend on p.
Next, we make some standard assumptions on the transfer functions ω j,k in (2).
(c) There exist positive constants b, θ_0, and C such that, for all 1 ≤ j, k ≤ p and for any Δ_1, Δ_2 ∈ R_+: supp(ω_{j,k}) ⊆ [0, b], ω_{j,k}(Δ_1) ≤ θ_0, and |ω_{j,k}(Δ_1) − ω_{j,k}(Δ_2)| ≤ C |Δ_1 − Δ_2|.

Assumption 3(a) guarantees that the multivariate Hawkes process is mutually-exciting: that is, an event may trigger (but cannot inhibit) future events. This assumption is shared by the original proposal of Hawkes (1971). Furthermore, existing theory for penalized estimators for the Hawkes process requires this assumption (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015).
Assumption 3(b) guarantees that the non-zero transfer functions are nonnegligible. Such an assumption is needed in order to establish variable selection consistency (Bühlmann and van de Geer, 2011;Wainwright, 2009) for the penalized estimator (5).
Assumption 3(c) guarantees that the transfer functions are sufficiently smooth; this guarantees that the cross-covariances are smooth (see Section A.2 in Appendix), and hence can be estimated using a kernel smoother (8). Instead of Assumption 3(c), we could assume that ω j,k is an exponential function (Bacry, Gaïffas and Muzy, 2015) or that it is well-approximated by a set of smooth basis functions (Hansen, Reynaud-Bouret and Rivoirard, 2015).
Recall that s was defined in (4). We now state our main result.
Theorem 1. Suppose that the Hawkes process (2) satisfies Assumptions 1-3. Let h = c_1 s^{−1/2} T^{−1/6} in (8) and ζ = 2c_2 s^{1/2} T^{−1/6} in (9) for some constants c_1 and c_2. Then, for some positive constants c_3 and c_4, with probability at least 1 − c_3 T^{7/6} s^{1/2} p² exp(−c_4 T^{1/6}), (a) E ⊆ Ê(ζ), and (b) card(Ê(ζ)) = O(card(E) s^{−1} T^{1/3}).

Theorem 1(a) guarantees that, with high probability, the screened edge set Ê(ζ) contains the true edge set E. Therefore, screening does not result in false negatives. This is referred to as the sure screening property in the literature (Fan and Lv, 2008; Fan, Samworth and Wu, 2009; Fan and Song, 2010; Fan, Feng and Song, 2011; Fan, Ma and Dai, 2014; Liu, Li and Wu, 2014; Song et al., 2014; Luo, Song and Witten, 2014). Typically, establishing the sure screening property requires assuming that the marginal association between a pair of nodes in E is sufficiently large; see, e.g., Condition 3 in Fan and Lv (2008) and Condition C in Fan, Feng and Song (2011). In contrast, Theorem 1(a) requires only that the conditional association between a pair of nodes in E is sufficiently large; see Assumption 3(b).
Theorem 1(b) guarantees that Ê(ζ) is a relatively small set, of size O(card(E) s^{−1} T^{1/3}). Suppose that p² ∝ s^{−1/2} exp(c_4 T^{1/6−ε}) for some positive constant ε < 1/6; this is the high-dimensional regime, in which the probability statement in Theorem 1 converges to one. Then the size of Ê(ζ), O(card(E) s^{−1} T^{1/3}), can be much smaller than p², the total number of node pairs. We note that the rate of T^{1/3} is comparable to existing results for nonparametric screening in the literature (see, e.g., Fan, Feng and Song 2011; Fan, Ma and Dai 2014).
To summarize, Theorem 1 guarantees that under a small subset of the assumptions required for penalized estimation methods to recover the edge set E, the screened edge set Ê(ζ) in (9) is small and contains no false negatives. We note that this is not the case for other types of models. For instance, in the case of the Gaussian graphical model, Luo, Song and Witten (2014) considered estimating the conditional dependence graph by screening the marginal covariances. In order for this procedure to have the sure screening property, one must make an assumption on the minimum marginal covariance associated with an edge in the graph, which is not required for variable selection consistency of penalized estimators (Cai, Liu and Luo, 2011; Luo, Song and Witten, 2014; Ravikumar et al., 2011; Saegusa and Shojaie, 2016).
It is important to note that Theorem 1 considers an oracle procedure, where the tuning parameters depend on unknown parameters. The heuristic selection guidelines suggested at the end of Section 2.1 may not satisfy the requirements of Theorem 1. We leave the discussion of optimal tuning parameter selection criteria for future research. Also, note that the bandwidth h ∝ T^{−1/6} is wider than the typical bandwidth for kernel smoothing, which is T^{−1/3} (Tsybakov, 2009). This is because we aim to minimize a concentration bound on V̂_{j,k} − V_{j,k} (see the proof of Lemma 3 in the Appendix), rather than the usual mean integrated square error as in, e.g., Theorem 1.1 in Tsybakov (2009).

Remark 1. In light of Theorem 1, consider applying a constraint induced by the screened edge set Ê(ζ) to the penalized estimation problem (5):

minimize_{ω_{j,k} ∈ F, 1≤j,k≤p}  L({ω_{j,k}}; {N_j}_{j=1}^p) + λ Σ_{j,k=1}^p P(ω_{j,k}; {N_j}_{j=1}^p)  subject to  ω_{j,k} ≡ 0 for all (j, k) ∉ Ê(ζ).  (10)
Theorem 1 can be combined with existing results on consistency of penalized estimators of the Hawkes process (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015) in order to establish that (10) results in consistent estimation of the transfer functions ω_{j,k}. As a concrete example, Hansen, Reynaud-Bouret and Rivoirard (2015) considered (10) with L({ω_{j,k}}; {N_j}_{j=1}^p) taken to be the least-squares loss, and P(ω_{j,k}; {N_j}_{j=1}^p) a lasso-type penalty. Our simulation experiments in Section 3 indicate that in this setting, (10) can actually have better small-sample performance than (5) when p is very large. Furthermore, solving (10) can be much faster than solving (5): the former requires O(T^{4/3} s^{−1} card(E)) computations per iteration, compared to O(Tp²) per iteration for the latter (using, e.g., coordinate descent; Friedman, Hastie and Tibshirani, 2010). In the high-dimensional regime where p² ∝ s^{−1/2} exp(c_4 T^{1/6−ε}) for some positive constant ε < 1/6, we have that T ≪ p². We note that in order to solve (10), we must first compute Ê(ζ), which requires an additional one-time computational cost of O(Tp²).

Simulation set-up
In this section, we investigate the performance of our screening procedure in a simulation study with p = 100 point processes. Intensity functions are given by (2), with μ_j = 0.75 for j = 1, ..., p, and ω_{j,k}(t) = 2t exp(1 − 5t) for (j, k) ∈ E. By definition, ω_{j,k} ≡ 0 for all (j, k) ∉ E. We consider two settings for the edge set E, Setting A and Setting B. These settings are displayed in Figure 1. In what follows, it will be useful to think about the (undirected) node pairs as belonging to three types. (i) We let

Ẽ ≡ {(j, k) : (j, k) ∈ E or (k, j) ∈ E}  (11)

denote the node pairs that are connected by an edge in at least one direction. (ii) With a slight abuse of notation, we will use Ẽ^c ∩ supp(V) to denote node pairs that are not in Ẽ but have non-zero population cross-covariance, defined in (7). (iii) Continuing to slightly abuse notation, we will use Ẽ^c \ supp(V) to denote node pairs that are not in Ẽ and that have zero population cross-covariance. Throughout the simulation, we set the bandwidth h in (8) to equal T^{−1/6}, and the range of interest B in (9) to equal 5. Thus, h satisfies the requirements of Theorem 1, and [−B, B] covers the majority of the mass of the transfer function ω_{j,k}. However, these simulation results are not sensitive to the particular choices of h or B.
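Setting aside the specific graphs of Settings A and B, a generic way to draw from model (2) is Ogata's thinning algorithm. The sketch below is our own illustration, not the authors' simulation code; the function names and the truncation of the kernel support at a finite lag are our assumptions (for ω(t) = 2t exp(1 − 5t), the mass beyond lag 5 is negligible):

```python
import math
import random

def simulate_hawkes(p, mu, edges, kernel, kernel_max, support, T, seed=0):
    """Simulate a mutually-exciting multivariate Hawkes process on [0, T] by
    Ogata's thinning algorithm. `edges` contains directed pairs (j, k),
    meaning events of component k excite component j via `kernel`, which is
    bounded by `kernel_max` and treated as supported on [0, support]."""
    rng = random.Random(seed)
    events = [[] for _ in range(p)]
    history = []  # accepted (time, mark) pairs, in increasing time order

    def intensity(j, t):
        lam = mu[j]
        for s, k in reversed(history):  # only recent events can contribute
            if t - s > support:
                break
            if (j, k) in edges:
                lam += kernel(t - s)
        return lam

    t = 0.0
    while True:
        # Each past event within `support` contributes at most kernel_max to
        # each of the (at most p) components it excites, so `bound` dominates
        # the total intensity until the next accepted event.
        n_recent = sum(1 for s, _ in history if t - s <= support)
        bound = sum(mu) + p * kernel_max * n_recent
        t += rng.expovariate(bound)
        if t > T:
            break
        lams = [intensity(j, t) for j in range(p)]
        if rng.random() * bound <= sum(lams):
            # Accept the proposal; draw the mark proportionally to lams.
            u, mark, acc = rng.random() * sum(lams), 0, lams[0]
            while acc < u:
                mark += 1
                acc += lams[mark]
            events[mark].append(t)
            history.append((t, mark))
    return events
```

With μ_j = 0.75 and kernel(dt) = 2 dt exp(1 − 5 dt), the kernel's maximum 0.4 is attained at dt = 0.2, so kernel_max = 0.4 is a valid thinning bound.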

Investigation of the estimated cross-covariances
In Setting A, within a single connected component, all of the node pairs that are not in Ẽ are in Ẽ^c ∩ supp(V). However, for the most part, the population cross-covariances corresponding to node pairs in Ẽ^c ∩ supp(V) are quite small, because they are induced by paths of length two and greater. This can be seen from the left-hand panel of Figure 2. Consequently, we expect the proposed screening procedure to work very well in Setting A: for a sufficiently large value of the time period T, there exists a value of ζ such that, with high probability, Ê(ζ) = Ẽ.
In Setting B, six nodes receive directed edges from the same set of four nodes. Therefore, we expect the pairs among these six nodes to be in the set Ẽ^c ∩ supp(V), and to have substantial population cross-covariances. This intuition is supported by the center panel of Figure 2, which indicates that the node pairs in Ẽ^c ∩ supp(V) have relatively large estimated cross-covariances, on the same order as those of the node pairs in Ẽ. In light of Figure 2, we anticipate that for a sufficiently large value of the time period T, the screened edge set Ê(ζ) will contain the edges in Ẽ as well as many of the node pairs in Ẽ^c ∩ supp(V).

Size of smallest screened edge set
We now define ζ* ≡ max{ζ : Ẽ ⊆ Ê(ζ)}, and calculate card(Ê(ζ*)). This represents the size of the smallest screened edge set that contains the true edge set.
Results, averaged over 200 simulated data sets, are shown in Figure 3. We see that in Setting A, for sufficiently large T, card(Ê(ζ*)) = card(Ẽ), which implies that Ê(ζ*) = Ẽ. In other words, in Setting A, the screening procedure yields perfect recovery of the set Ẽ in (11). This is in line with our intuition based on the left-hand panel of Figure 2.
In contrast, in Setting B, even when T is very large, card(Ê(ζ*)) > card(Ẽ), which implies that Ê(ζ*) ⊋ Ẽ. This was expected based on the center panel of Figure 2.
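Given the estimated norms, ζ* and card(Ê(ζ*)) are simple to compute: ζ* is the smallest norm among the pairs that must be retained, and card(Ê(ζ*)) counts all pairs at or above that value. A minimal sketch (the dictionary-based interface is our own convention, not the paper's):

```python
def smallest_screened_superset(norms, true_pairs):
    """Given norms[(j, k)] = estimated L2 norm of V_hat_{j,k} on [-B, B],
    return (zeta_star, size): the largest threshold zeta whose screened set
    still contains every pair in true_pairs, and the size of that set."""
    zeta_star = min(norms[pair] for pair in true_pairs)
    size = sum(1 for v in norms.values() if v >= zeta_star)
    return zeta_star, size
```

For example, if the true pairs have norms 0.9 and 0.5 while a spurious pair has norm 0.7, then ζ* = 0.5 and the smallest screened superset has three pairs.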

Performance of constrained penalized estimation
We now consider the performance of the estimator (10), which we obtain by calculating the screened edge set Ê(ζ), and then performing a penalized regression subject to the constraint that ω_{j,k} ≡ 0 for (j, k) ∉ Ê(ζ). Note that rather than assuming a specific functional form for ω_{j,k}, Hansen, Reynaud-Bouret and Rivoirard (2015) use a basis expansion to estimate ω_{j,k}. Following their lead, we use a basis of step functions of the form 1_{((m−1)/2, m/2]}(t) for m = 1, ..., 6. Instead of applying a lasso penalty to the basis function coefficients (Hansen, Reynaud-Bouret and Rivoirard, 2015), we employ a group lasso penalty for every 1 ≤ j, k ≤ p (Yuan and Lin, 2006; Simon and Tibshirani, 2012). Thus, (10) consists of a squared error loss function and a group lasso penalty. We evaluate the estimates ω̂_{j,k}, where ω̂_{j,k} solves (10). Results are shown in Figure 4. In Setting A, solving the constrained optimization problem (10) leads to substantially better performance than solving the unconstrained problem (5). The improvement is especially noticeable when T is small. In Setting B, solving the constrained optimization problem (10) leads to only a slight improvement in performance relative to solving the unconstrained problem (5), since, as we have learned from Figures 2 and 3, the screened set Ê(ζ) contains many node pairs in Ẽ^c ∩ supp(V). In both settings, solving the constrained optimization problem leads to substantial computational improvements.
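For the step-function basis above, the least-squares covariates have a simple form: the coefficient of 1_{((m−1)/2, m/2]} in ω_{j,k} multiplies the number of events of component k whose lag from the current time falls in that bin. A sketch of this counting step, assuming sorted event times (this helper is our own, not the authors' code):

```python
import bisect

def step_basis_covariate(tk, t, m, width=0.5):
    """Count events of component k whose lag t - s lies in the mth bin
    ((m - 1) * width, m * width], i.e. events s in the half-open interval
    [t - m * width, t - (m - 1) * width). This count multiplies the mth
    basis coefficient of omega_{j,k} in the least-squares regression."""
    lo = bisect.bisect_left(tk, t - m * width)        # first s with lag <= m * width
    hi = bisect.bisect_left(tk, t - (m - 1) * width)  # first s with lag <= (m - 1) * width
    return hi - lo
```

Each covariate lookup is O(log n_k) on sorted times, which is what makes the per-iteration cost of the constrained problem (10) depend only on the screened pairs.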

Proofs of theoretical results
In this section, we prove Theorem 1. In Section 4.1, we review an important property of the Hawkes process, the Wiener-Hopf integral equation. In Section 4.2, we list three technical lemmas used in the proof of Theorem 1. Theorem 1 is proved in Section 4.3. Proofs of the technical lemmas are provided in the Appendix.

The Wiener-Hopf integral equation
Recall that the transfer functions ω = {ω_{j,k}}_{1≤j,k≤p} were defined in (2), the cross-covariances V = {V_{j,k}}_{1≤j,k≤p} were defined in (7), and the mean intensities Λ = (Λ_1, ..., Λ_p)^T were defined in (6). If the Hawkes process defined in (2) is stationary, then for any Δ ∈ R_+,

V_{j,k}(Δ) = Λ_k ω_{j,k}(Δ) + Σ_{l=1}^p (ω_{j,l} * V_{l,k})(Δ),  (13)

where (f * g)(Δ) ≡ ∫_R f(Δ − u) g(u) du denotes convolution. Equation (13) belongs to a class of integral equations known as the Wiener-Hopf integral equations.

Technical lemmas
We state three lemmas used to prove Theorem 1, and provide their proofs in the Appendix. The following lemma is a direct consequence of (13) and our assumptions. Recall that [0, b] is a superset of supp(ω j,k ) introduced in Assumption 3.

Lemma 1. Under Assumptions 1-3, V_{j,k}(Δ) ≥ Λ_min ω_{j,k}(Δ) ≥ 0 for all Δ ∈ R_+ and all 1 ≤ j, k ≤ p. Consequently, for sufficiently large B (namely, B ≥ b), ‖V_{j,k}‖_{2,[−B,B]} ≥ Λ_min ‖ω_{j,k}‖_{2,[0,b]} for every (j, k) ∈ E.
The next lemma shows that the cross-covariance is Lipschitz continuous given the smoothness assumption on ω j,k (Assumption 3(c)). We will use this lemma in the proof of Theorem 1, in order to bound the bias of the kernel smoothing estimator (8). Recall that s, the maximum node in-degree, was defined in (4).

Lemma 2. Under Assumptions 1-3, the cross-covariance function is Lipschitz continuous: for any Δ_1, Δ_2 ∈ R and any 1 ≤ j, k ≤ p, |V_{j,k}(Δ_1) − V_{j,k}(Δ_2)| ≤ s θ_1 |Δ_1 − Δ_2|, where θ_1 ≡ θ_0 Λ_max + b θ_0 V_max + 2C V_max.
Recall that the bandwidth h was defined in (8). The following concentration inequality holds for the estimated cross-covariance.

Discussion
In this paper, we have proposed a very simple procedure for screening the edge set of a multivariate Hawkes process. Provided that the process is mutually-exciting, we establish that this screening procedure leads to a very small screened edge set, without incurring any false negatives. In fact, this result holds under a subset of the conditions required to establish model selection consistency of penalized regression estimators for the Hawkes process (Wainwright, 2009; Hansen, Reynaud-Bouret and Rivoirard, 2015). Therefore, this screening step should always be performed when estimating the graph for a mutually-exciting Hawkes process.
The proposed screening procedure amounts to screening pairs of nodes by thresholding an estimate of their cross-covariance. In fact, this approach is commonly taken within the neuroscience literature, with the goal of estimating the functional connectivity among a set of p neuronal spike trains (Okatan, Wilson and Brown, 2005; Pillow et al., 2008; Mishchencko, Vogelstein and Paninski, 2011; Berry et al., 2012). Therefore, this paper sheds light on the theoretical foundations of an approach that is often used in practice.

A.1. Proof of Lemma 1
Proof. First, we observe that, if V_{l,k} is non-negative for all l and k, then ω_{j,l} * V_{l,k} is non-negative for any j, l, k. Under Assumption 1, we know that (13) holds. We can see from (13) that

V_{j,k}(Δ) ≥ Λ_k ω_{j,k}(Δ) ≥ Λ_min ω_{j,k}(Δ),

where the last inequality follows from Assumption 2 and the first from the non-negativity of the convolution terms, which we now establish. That is, we show that the elements of V are non-negative: V_{l,k}(Δ) ≥ 0 for 1 ≤ l, k ≤ p and Δ ∈ R. Recall the definition (7) of V_{l,k} in the main paper, where the second equality follows from the conditional expectation identity (23). In this proof, we use the Stieltjes integral to rewrite λ_l(t) in (2) as (24). Plugging λ_l(t) from (24) into (22), using the cluster representation of the Hawkes process (see, e.g., Hawkes and Oakes, 1974), and rearranging the terms gives (25). Next, we rewrite (25) by taking the conditional expectation of dN_k or dN_m as in (23). Note here that, when Δ′ < Δ, we condition dN_m on the history up to t − Δ′; when Δ′ > Δ, we condition dN_k on the history up to t − Δ. These cases are discussed separately in the following.
When Δ′ < Δ, each integral in the summation is non-negative, which follows from the definition of λ_m(t) in (2); this establishes (26). Expanding λ_k and Λ_k, and using the nature of the mutually-exciting process, yields the analogous bound (27) for Δ′ ≥ Δ. Applying both (26) and (27) to (25) shows that V_{l,k}(Δ) ≥ 0.

A.2. Proof of Lemma 2
Proof. For any Δ ≥ 0, the integral equation (13) gives an expression for V_{j,k}(Δ). For any x, y ≥ 0, we can therefore write the difference V_{j,k}(x) − V_{j,k}(y) as the sum of a term I and terms II_l for l ∈ E_j, where the last inequality holds since ω_{j,l} ≡ 0 for l ∉ E_j; this yields (30). For I, we know from Assumptions 2 and 3(c) that (31) holds. For II_l, we can expand the convolution. Without loss of generality, we consider only the case that x ≥ y. We can then decompose the integrals into parts on the corresponding intervals,
where we use Assumption 3(c) in the second inequality, Assumption 2 in the third inequality, and the boundedness of ω_{j,l} from Assumption 3(c) in the last inequality.
Recalling that x ≥ y, we obtain (32). Finally, plugging (31) and (32) into (30) gives the claimed Lipschitz bound, where we set θ_1 ≡ θ_0 Λ_max + b θ_0 V_max + 2C V_max. Note that the last inequality holds as long as s ≥ 1. (The result also holds if s = 0: in this case, the second term in (30) is zero for every j, and the bound (31) suffices.)

A.3. Proof of Lemma 3
Recall the form of the estimator of the cross-covariance given in (8). The proof of Lemma 3 uses the following result, Lemma 4, which is based on Proposition 3 of Hansen, Reynaud-Bouret and Rivoirard (2015); for completeness, we provide its proof in Section A.4.

Lemma 4. Suppose that Assumption 1 holds. We have
where c 4 , c 5 , and c 6 are constants.
We are now ready to prove Lemma 3.
Proof. First, we decompose the difference V̂_{j,k}(Δ) − V_{j,k}(Δ) into a bias term and a deviation term, where we use the definition of V in the third equality. Using the fact that the kernel x ↦ K(x/h) is supported on [−h, h], we can bound the bias term, where the first inequality follows from Lemma 2.

A.4. Proof of Lemma 4
Lemma 4 follows directly from the proof of Proposition 3 in Hansen, Reynaud-Bouret and Rivoirard (2015). The only difference is that we want a polynomial bound on the deviation, while Hansen, Reynaud-Bouret and Rivoirard (2015) consider a logarithmic bound. For completeness, we state the proof of Lemma 4 below, but note that it is almost identical to the proof of Proposition 3 in Hansen, Reynaud-Bouret and Rivoirard (2015). We refer the interested reader to the original proof in Section 7.4.3 of Hansen, Reynaud-Bouret and Rivoirard (2015) for more details. Throughout this section, we assume that N ≡ (N_1, ..., N_p)^T is defined on the full real line. We first state some notation that is only used in this section.
1. Following Hansen, Reynaud-Bouret and Rivoirard (2015), we use C^{(i)}_{a_1,a_2,...} to denote a constant that depends only on a_1, a_2, ...; the superscript i indicates that this is the ith constant appearing in the proof.
2. Without loss of generality, we assume that supp(ω_{j,k}) ⊂ (0, 1], as in Hansen, Reynaud-Bouret and Rivoirard (2015).
3. As in Hansen, Reynaud-Bouret and Rivoirard (2015), we introduce a function Z(N) such that Z(N) depends only on {dN(t′), t′ ∈ [−A, 0)}, and there exist two non-negative constants η and d such that (35) holds.
4. We also introduce the (time) shift operator S_t, so that Z ∘ S_t(N) depends only on {dN(t′), t′ ∈ [−A + t, t)}, in the same way as Z(N) depends on the points of N in [−A, 0).
We are now ready to prove the lemma. When proving the bound (34), we only discuss the case when j ≠ k. The proof for the case when j = k follows from the same argument and is thus omitted.
Proof. In this proof, we will consider a probability bound for the event ∫ (Z ∘ S_t(N) − E(Z)) dt ≥ u, where u is given by (43) for some κ ∈ (0, 1) to be specified later. Note that, by applying the bound to −Z(·), we can obtain a bound for |∫ (Z ∘ S_t(N) − E(Z)) dt|. To complete the proof, we will verify the statements (34) and (35) by considering some specific choices of Z(·). For any positive integer k such that x ≡ T/(2k) > A, we have a decomposition whose bound follows from the stationarity of N. As in Reynaud-Bouret and Roy (2006), let {M^x_q}_{q=1}^∞ be a sequence of independent Hawkes processes, each of which is stationary with intensities λ(t) ≡ (λ_1(t), ..., λ_p(t))^T; see Section 3 of Reynaud-Bouret and Roy (2006) for the construction. The resulting coupling bound involves T_{e,q}, the time to extinction of the process M^x_q. The extinction time T_{e,q} is introduced in Sections 2.2 and 3 of Reynaud-Bouret and Roy (2006); roughly speaking, it is the last time at which there is an event for the Hawkes process with intensity λ(t) of the form (2), with background intensity μ ≡ (μ_1, ..., μ_p)^T set to 0 for t ≥ 0. Since the T_{e,q} are identically distributed, we can focus on a single T_{e,q}. Denoting by a_l the ancestral points with marks l, and by H^l_{a_l} the length of the corresponding cluster whose origin is a_l, we can bound the tail of T_{e,q} by the exact argument on page 48 of Hansen, Reynaud-Bouret and Rivoirard (2015). Thus, there exists a constant C^{(1)}_A depending on A such that if we take k = C^{(1)}_A T^κ, for some κ ∈ (0, 1) to be specified later, then the coupling error is at most c_5 exp(−c_4 T^{1−κ}), where c_4 is a constant. Note that x = T/(2k) ≈ T^{1−κ} is larger than A for T large enough (depending on A). Now, note that the event T ≡ {T_{e,q} ≤ T/(2k) − A, for all q = 0, ..., k} depends only on the process N. We will first find a probability bound for the first term in (45); in other words, we will show that (46) holds given the event T. Consider the measurable events Ω_q, where Ñ is a constant that will be defined later and M^x_q|_{[t−A,t)} represents the number of points of M^x_q lying in [t − A, t).
Let Ω = ∩_{0≤q≤k−1} Ω_q. We have P(Ω^c) ≤ Σ_q P(Ω^c_q), where each P(Ω^c_q) can be easily controlled. Indeed, it is sufficient to split [2qx − A, 2qx + x] into intervals of size A (there are about C^{(2)}_A T^{1−κ} of these) and require the number of points in each sub-interval to be smaller than Ñ/2. By stationarity, we then obtain a bound on P(Ω^c_q). Using Proposition 2 in Hansen, Reynaud-Bouret and Rivoirard (2015) with u = Ñ/2 + 1/2, we obtain a bound of order exp(−c Ñ). Note that this control holds for any positive choice of Ñ. Hence, by taking Ñ = C_A T^{1−κ} for C_A large enough, the right-hand side of (50) is smaller than C^{(2)}_A T^{1−κ} exp(−c_4 T^{1−κ}). It remains to obtain the rate of D ≡ P(Σ_q F_q ≥ u/2 and Ω). Next, note that if sup_t M^x_q|_{[t−A,t)} ≤ (l + 1)Ñ for an integer l, then |F_q| ≤ x d((l + 1)^η Ñ^η + 1) + x E(Z).
Hence, cutting Ω^c_q into slices of the type {lÑ < sup_t M^x_q|_{[t−A,t)} ≤ (l + 1)Ñ} and using (50) with Ñ = C_A T^{1−κ}, we control the corresponding terms. In the same way, following Hansen, Reynaud-Bouret and Rivoirard (2015), we can write a Bernstein-type bound with z_b ≡ x d[Ñ^η + 1] + x E(Z) = C_{η,A} d T^{(1−κ)(1+η)}. Then, by stationarity, the variance term involves σ² ≡ E[(Z(N) − E(Z))²]. Going back to (51), by (52) and the fact that log(1 + u) ≤ u, one can choose c_6 in the definition (43) of u (not depending on d) such that u/2 − k z_1 ≥ (2 k z_v z)^{1/2} + (1/3) z_b z for some z = c_4 T^{κ−2η(1−κ)}. One can then choose the parameters (as in the proof of the Bernstein inequality in Massart (2007), page 25) to obtain a bound on the right-hand side of the form e^{−z}. We can then choose c_4 large enough, and depending only on η and A, to guarantee that D ≤ e^{−z} ≤ c_5 exp(−c_4 T^{1−κ}). In summary, we have shown that the desired bound holds given the event T. With a slight abuse of notation, letting c_5 = max(c_5, C_A) gives (49). To complete the proof, we apply the concentration inequality (49) with some specific choices of Z(·).