Spectral clustering in the dynamic stochastic block model

In the present paper, we study a Dynamic Stochastic Block Model (DSBM) under the assumptions that the connection probabilities, as functions of time, are smooth and that at most $s$ nodes can switch their class memberships between two consecutive time points. We estimate the edge probability tensor by a kernel-type procedure and extract the group memberships of the nodes by spectral clustering. The procedure is computationally viable and adaptive to the unknown smoothness of the functional connection probabilities, to the rate $s$ of membership switching and to the unknown number of clusters. In addition, it is accompanied by non-asymptotic guarantees for the precision of estimation and clustering.


Introduction
Networks arise in many areas of research such as sociology, biology, genetics, ecology, information technology and national security, to list a few. An overview of statistical modeling of random graphs can be found in, e.g., [20] and [13]. Stochastic networks are used, for example, to model brain connectivity, gene regulatory networks and protein signaling networks, to monitor cyber and homeland security, and to evaluate and predict social relationships within groups or between groups such as countries.
In this paper, we consider a dynamic network defined as an undirected graph with $n$ nodes whose connection probabilities change in time. Specifically, we observe adjacency matrices $A_t$ of the graph at time instances $\tau_t$, where $0 < \tau_1 < \cdots < \tau_T = b$. Here, $A_t(i,j)$ are Bernoulli random variables with $P_t(i,j) = \Pr(A_t(i,j) = 1)$ that are independent for any $1 \le i < j \le n$, and $A_t(i,j) = A_t(j,i) = 1$ if a connection between nodes $i$ and $j$ is observed at time $\tau_t$ and $A_t(i,j) = A_t(j,i) = 0$ otherwise. We set $A_t(i,i) = 0$ and assume, for simplicity, that the time instances are equispaced and the time interval is scaled to one, i.e., $b = 1$ and $\tau_t = t/T$.
Furthermore, we assume that the network can be described by a Dynamic Stochastic Block Model (DSBM): at each time instant $\tau_t$ the nodes are grouped into $K$ classes $G_{t,1}, \cdots, G_{t,K}$, and the probability of a connection $P_t(i,j)$ is entirely determined by the groups to which the nodes $i$ and $j$ belong at the moment $\tau_t$. In particular, if $i \in G_{t,k}$ and $j \in G_{t,k'}$, then $P_t(i,j) = B_t(k,k')$, where $B_t$ is the connectivity matrix at time $\tau_t$ with $B_t(k,k') = B_t(k',k)$. In this case, for any $t = 1, \ldots, T$, one has
$$P_t = \Theta_t B_t \Theta_t^T, \qquad (1.1)$$
where $\Theta_t$ is a clustering matrix that has exactly one 1 per row, with $\Theta_t(i,k) = 1$ if node $i$ belongs to the class $G_{t,k}$ and $\Theta_t(i,k) = 0$ otherwise.
One of the main problems in this setting is to cluster the nodes and identify the groups that have common probabilities of connections. If one had an oracle that would give the membership assignments (matrices $\Theta_t$), then one could obtain accurate estimators of the matrices $B_t$ and $P_t$ by averaging elements of the adjacency matrices. The objective of the present paper is to suggest a modification of a popular spectral clustering procedure for the case of the DSBM and to study its precision at a time instant $\tau_t$ in a non-asymptotic setting.
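The block structure of the edge probabilities can be illustrated with a short simulation. The following is a minimal sketch, not part of the paper: the function name `sample_dsbm`, the array layout and the choice of a zero diagonal are assumptions made for illustration. It draws symmetric Bernoulli adjacency matrices whose edge probabilities are obtained by looking up the connectivity matrix at the class labels of the two nodes, which is the same as forming the product of the clustering matrix, the connectivity matrix and the transposed clustering matrix.

```python
import numpy as np

def sample_dsbm(labels, B, rng):
    """Draw adjacency matrices from a dynamic stochastic block model.

    labels : (T, n) integer array, labels[t, i] is the class of node i at time t
    B      : (T, K, K) array of symmetric connectivity matrices
    rng    : numpy random Generator
    Returns an integer array A of shape (T, n, n), symmetric with zero diagonal
    (the zero diagonal is an assumption of this sketch).
    """
    T, n = labels.shape
    A = np.zeros((T, n, n), dtype=int)
    for t in range(T):
        # P[i, j] = B_t(label of i, label of j), i.e. Theta_t B_t Theta_t^T
        P = B[t][np.ix_(labels[t], labels[t])]
        # independent Bernoulli draws above the diagonal, then symmetrise
        upper = np.triu(rng.random((n, n)) < P, k=1).astype(int)
        A[t] = upper + upper.T
    return A

# Hypothetical example: 2 classes, 8 nodes, 3 time points
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(3, 8))
B = np.tile(np.array([[0.8, 0.1], [0.1, 0.6]]), (3, 1, 1))
A = sample_dsbm(labels, B, rng)
```

Indexing with `np.ix_` avoids building the clustering matrices explicitly; for small examples the explicit product gives the same matrix.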
In comparison with time-independent networks, there are many fewer results concerning the DSBM. Although approaches developed for time-independent networks can be applied to a temporal network frame-by-frame, they ignore the temporal continuity of the network structures and parameters. By taking advantage of this continuity and of observations at multiple time instances, one can gain better insight into a variety of issues and improve the precision of the inference (see [15] and [29]).
A survey of the papers published before 2010 can be found in Goldenberg et al. [13]. After Olhede and Wolfe [28] established universality of the SBM, several authors investigated the SBM in the dynamic setting. The majority of them described changes in the memberships via Markov-type structures that allow one to model smooth evolution of groups across time. For example, Yang et al. [38] assumed that, for each node, its membership forms a Markov chain independent of other nodes, while the connection probabilities do not change in time. Xu and Hero III [36] allowed both the connection probabilities and the group memberships to change with time via a latent state-space model. Later, Xu [35] and Xu et al. [37] further refined the model by introducing a Markov structure on the memberships. In both papers, the logits of connection probabilities are modeled via a dynamical system model. Some authors [16], [38] presented Bayesian variants of similar ideas. We should also mention Fu et al. [10] and Xing et al. [34], who extended the DSBM to the case of mixed memberships under the assumption that the data follow the multivariate logistic-normal distribution. For example, Xing et al. [34] assumed that the data followed a dynamic logistic-normal mixed membership blockmodel and inferred parameter values by using a Laplace variational approximation scheme, which generalizes the variational approximation developed in Airoldi et al. [1]. None of the papers cited above inferred the number of classes. This shortcoming was corrected by Matias and Miele [27], who proposed a Markov chain model for the membership transitions and inferred the unknown parameters, including the unknown number of classes, via variational approximations of the EM algorithm. The approach of [27] was further extended in a very recent paper of Zhang et al. [39], who assumed a Poisson model on the number of connections with time-independent probabilities that edges appear or disappear at any time instant.
We should also cite an early work of Chi et al. [5] that made no assumptions on the mechanism that governs changes in the cluster memberships and dealt with the problem by introducing two cost functions: the snapshot cost, associated with the error of the current clustering, and the temporal cost, which measures how well the clustering preserves continuity in terms of cluster memberships. Both cost functions are based on the results of the k-means clustering algorithm.
While some of the procedures described in those papers show good computational properties, they come without any guarantees for the accuracy of estimation and clustering. To the best of our knowledge, the only paper that investigates precision of temporal clustering is [15], where the authors apply the spectral clustering to the matrix of averages under the assumption that the sequences B t (k, k ′ ), t = 1, . . . , T form stationary ergodic processes for each k and k ′ and prove consistency of the procedure as T and n tend to infinity.
In this paper, we likewise consider a dynamic network that possesses some kind of continuity, in the sense that neither the connection probabilities $B_t$ in (1.1) nor the class memberships change drastically from one time instant to another. In particular, we assume that, for any pair $k$ and $k'$ of classes, the connection probabilities $B_t(k,k')$ represent values of some smooth function at time $\tau_t$ and, therefore, can be treated as functional data. In addition, we suppose that at most $s$ nodes can switch their class memberships between two consecutive time points. Both assumptions guarantee some degree of consistency of the network in time. Under those assumptions, we extract group memberships of the nodes at every time point by using a spectral clustering procedure and evaluate the error of this procedure. The clustering technique is applied to kernel-type estimators of the edge probability matrices $P_t$ that we construct in the paper. By using the Lepskii method, we achieve adaptivity of the suggested procedure to the unknown temporal smoothness of the functional connection probabilities $B_t(k,k')$ and to the rate $s$ of membership switching. Finally, by setting a threshold on the ratio of the eigenvalues of the estimated probability matrix, we find $\hat K$, the estimated number of clusters, that coincides with the true number of clusters $K$ with high probability.
Our paper makes several key contributions. We present a computationally viable methodology for estimating an edge probability matrix and clustering a time-dependent network that follows the DSBM. The procedure is adaptive to the set of unknown parameters and is accompanied by non-asymptotic guarantees for the precision of estimation and clustering. In order to obtain those results, we develop a variety of new mathematical techniques. In particular, we develop a discrete kernel estimator for an unknown matrix and obtain its adaptive version using the Lepskii method. To the best of our knowledge, neither of those methods has been used in this setting so far. In addition, our analysis in Lemma 2 generalizes the methodologies developed by Friedman et al. [9], Feige and Ofek [8], and Lei and Rinaldo [23] from estimating the spectral norm of a random matrix to the spectral norm of the sum of independent random matrices. Our approach gives a bound that is sharper, by a logarithmic factor, than conventional matrix concentration inequalities such as the matrix Bernstein inequality [32]. Finally, we estimate the number of clusters and provide guarantees of the accuracy of this estimator.
The rest of the paper is organized as follows. Section 2 introduces notations and main assumptions of the paper. Section 3 presents the spectral clustering algorithm and evaluates its error at time $t$ in terms of the estimation error of the matrix of the connection probabilities $P_t$. For this reason, Section 4 describes construction of a kernel-type estimator of the probability matrix at each time point $t$ and evaluates its error. While Sections 4.1 and 4.2 assume that the degree of smoothness $\beta$ of the connection probabilities and the rate $s$ of switching of nodes' memberships are known, Section 4.3 utilizes the Lepskii method for construction of adaptive estimators of the connection probability matrices $P_t$. Section 5 presents upper bounds for the clustering errors. Section 6 offers an estimator for the number of clusters and provides precision guarantees for the clustering procedure with the estimated number of clusters. Section 7 concludes the paper with a discussion. Finally, Section 8, the Appendix, describes construction of a discrete kernel and contains proofs of all statements in the paper.

Notations and assumptions
For any $a, b \in \mathbb{R}$, denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For any two positive sequences $\{a_n\}$ and $\{b_n\}$, $a_n \asymp b_n$ means that there exists a constant $C > 0$ independent of $n$ such that $C^{-1} a_n \le b_n \le C a_n$ for any $n$. For any set $\Omega$, denote the cardinality of $\Omega$ by $|\Omega|$. For any $x$, $[x]$ is the largest integer no larger than $x$.
For any vector $t \in \mathbb{R}^p$, denote its $\ell_2$, $\ell_1$, $\ell_0$ and $\ell_\infty$ norms by, respectively, $\|t\|$, $\|t\|_1$, $\|t\|_0$ and $\|t\|_\infty$. Denote by $\mathbf{1}$ and $\mathbf{0}$ the vectors that have, respectively, only unit or zero elements. Denote by $e_j$ the vector with 1 in the $j$-th position and all other elements equal to zero.
For a matrix $Q$, its $i$-th row and $j$-th column are denoted, respectively, by $Q_{i,*}$ and $Q_{*,j}$. Similarly, reductions of $Q$ to a set of rows or columns in a set $G$ are denoted, respectively, by $Q_{G,*}$ and $Q_{*,G}$. For any matrix $Q$, denote its spectral and Frobenius norms by, respectively, $\|Q\|$ and $\|Q\|_F$. Denote the largest in absolute value element of $Q$ by $\|Q\|_\infty$ and the number of nonzero elements of $Q$ by $\|Q\|_0$.
Denote by M n,K the collection of clustering matrices Θ ∈ {0, 1} n×K . Denote by n t (k) = |G t,k | the number of elements in class G t,k and let n t,max = max k n t (k) and n t,min = min k n t (k), k = 1, . . . , K. We assume that there exists α n independent of T and an absolute constant C α independent of n and T such that If the network is sparse, then α n is small for large n and P t ∞ ≤ C α α n , otherwise, one can just set α n = 1. Denote We shall carry out time-dependent clustering of the nodes in the situation where neither the connection probabilities nor the cluster memberships change drastically from one time point to another. In addition, to make successful clustering possible, the values of probabilities of connection should be sufficiently different from each other, which is guaranteed by the smallest eigenvalues of matrices H t being separated from zero.
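In order to quantify those notions, we consider a Hölder class $\Sigma(\beta, L)$ of functions $f$ on $[0, 1]$ such that $f$ is $l$ times differentiable and
$$|f^{(l)}(x) - f^{(l)}(x')| \le L\, |x - x'|^{\beta - l} \quad \text{for any } x, x' \in [0, 1],$$
where $l$ is the largest integer strictly smaller than $\beta$. We suppose that the following assumptions hold.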

(A1). For
(A2). At most $s$ nodes can change their memberships between any consecutive time instances.
(A3). There exists an absolute constant $C_\lambda$, $1 \le C_\lambda < \infty$, independent of $n$ and $T$ such that

Clustering of the nodes can be recovered only up to column permutations. However, in order for Assumption A1 to hold, we shall assume that the node's labels are fixed and do not depend on $t$. We denote the set of $K \times K$ permutation matrices by $E_K$ and, following [23], consider two measures of clustering precision at time $\tau_t$. The first is the overall relative clustering error at time $\tau_t$ that measures the overall proportion of mis-clustered nodes. The second measure is the highest relative clustering error over the communities at time $\tau_t$. In addition, we study two global measures of clustering accuracy, namely, the overall highest relative error over the communities and the overall highest relative error.

Spectral clustering and its error
Spectral clustering is a common method for community recovery (see, e.g., [17], [18], [23], [26], [30] and [31] among others). The accuracy of spectral clustering depends on how well one can relate the eigenvectors of $P_t = \Theta_t B_t \Theta_t^T$ to the eigenvectors of its estimator $\hat P_t$. For this reason, our first goal will be to obtain an estimator $\hat P_t$ of $P_t$. Subsequently, we shall apply the spectral clustering based on the approximate k-means algorithm suggested by Lei and Rinaldo [23]. Although one can read a description of the algorithm in their paper, for completeness we review it here.
Given a matrix $P \in \mathbb{R}^{n \times n}$, let $U \in \mathbb{R}^{n \times K}$ be the matrix that consists of the first $K$ eigenvectors of $P$. Then [23] suggested investigating the $(1 + \epsilon)$-approximate solution to the k-means problem applied to the $n$ row vectors of $U$, specifically, finding $\hat\Theta \in \mathcal{M}_{n,K}$ and $\hat X \in \mathbb{R}^{K \times K}$ that satisfy
$$\|\hat\Theta \hat X - U\|_F^2 \le (1 + \epsilon) \min_{\Theta \in \mathcal{M}_{n,K},\ X \in \mathbb{R}^{K \times K}} \|\Theta X - U\|_F^2. \qquad (3.1)$$
Then the cluster assignments are given by the estimated $\hat\Theta$. There exist efficient algorithms for solving (3.1), see, e.g., [21]. The procedure is summarized as Algorithm 1.
Algorithm 1 Spectral clustering in the dynamic stochastic block model
Input: Adjacency matrices $A_t$ for $t = 1, \ldots, T$; number of communities $K$; approximation parameter $\epsilon$.
Output: Membership matrices $\hat\Theta_t$ for any $t = 1, \ldots, T$.
Steps:
1: Estimate $P_t$ by $\hat P_{t,r}$ defined in (4.2).
2: Let $\hat U_t \in \mathbb{R}^{n \times K}$ be a matrix representing the first $K$ eigenvectors of $\hat P_{t,r}$.
3: Apply the $(1 + \epsilon)$-approximate k-means algorithm to the row vectors of $\hat U_t$.
4: Obtain the solution $\hat\Theta_t$.

(2.4) are determined by the precision of estimation of $P_t$ by $\hat P_t$. In particular, the following statement holds.

Lemma 1. Let clustering be carried out according to the Algorithm 1 on the basis of an estimator
and, if the right-hand side of (3.2) is bounded by one, then Here, λ min (P t ) is the smallest nonzero eigenvalue of P t .
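The eigenvector and k-means steps of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: plain Lloyd iterations with a farthest-point initialisation are substituted for the $(1+\epsilon)$-approximate k-means solver of [21], the leading eigenvectors are taken to be those with the largest absolute eigenvalues, and the names `kmeans_lloyd` and `spectral_clustering` are hypothetical.

```python
import numpy as np

def kmeans_lloyd(X, K, n_iter=50):
    """Plain Lloyd iterations (a stand-in for approximate k-means)."""
    # farthest-point initialisation: start at row 0, then repeatedly add
    # the row that is farthest from the centers chosen so far
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(P_hat, K):
    """Cluster the rows of the K leading eigenvectors of a symmetric matrix."""
    w, V = np.linalg.eigh(P_hat)
    idx = np.argsort(-np.abs(w))[:K]   # K eigenvalues largest in magnitude
    return kmeans_lloyd(V[:, idx], K)

# Noise-free example: three balanced classes of 20 nodes each
truth = np.repeat(np.arange(3), 20)
Theta = np.eye(3)[truth]
B = 0.1 + 0.4 * np.eye(3)
labels = spectral_clustering(Theta @ B @ Theta.T, 3)
```

In the noise-free case the rows of the eigenvector matrix take exactly $K$ distinct values, so the recovered labels agree with the true classes up to a permutation; with an estimated matrix, the clustering error is governed by the estimation error, as Lemma 1 states.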
Estimation of the edge probability matrices

Construction of the estimator
In order to estimate P t , we choose an integer r ≥ 0, the width of the window, and consider three pairs of sets of integers If t ∈ D r,j , we construct an estimator of P t on the basis of A t+i where i ∈ F r,j , j = 1, 2, 3. For this purpose, we introduce discrete kernel functions W (j) r,l (i) of an integer argument i that satisfy the following assumption where W max is independent of r, j and i, and for j = 1, 2, 3, one has Here |F r,j | is the cardinality of the set F r,j .
One can easily see that the functions $W^{(j)}_{r,l}$, $j = 2, 3$, mimic boundary kernels (the left boundary kernel for $j = 2$ and the right boundary kernel for $j = 3$). Section 8.1 provides an algorithm for the explicit construction of $W^{(j)}_{r,l}$ for any values of $r$, $l$ and $j$. We ought to point out that the dependence of $W^{(j)}_{r,l}$ on $r$ is a weak one, especially as $r$ grows. We estimate the edge probability matrix $P_t$ by
$$\hat P_{t,r} = \sum_{j=1}^{3} I(t \in D_{r,j}) \left\{ \frac{1}{|F_{r,j}|} \sum_{i \in F_{r,j}} W^{(j)}_{r,l}(i)\, A_{t+i} \right\}. \qquad (4.2)$$
Note that, since the sets $D_{r,j}$ are disjoint for different values of $j$, the estimator of $P_t$ always involves just one expression in curly brackets in formula (4.2).
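A windowed estimator of this kind can be sketched in a few lines. The sketch below makes several assumptions that are not taken from the paper: the index sets are taken to be a symmetric window in the interior and one-sided windows of the same half-width near the two ends, all kernel weights are set to one (a boxcar kernel rather than the kernels of Section 8.1), time indices are 0-based, and the name `window_estimator` is hypothetical.

```python
import numpy as np

def window_estimator(A, t, r):
    """Average adjacency matrices over a window around time index t.

    A : (T, n, n) array of adjacency matrices
    t : 0-based time index
    r : window half-width, assumed to satisfy 0 <= r <= T / 2
    """
    T = A.shape[0]
    if t - r >= 0 and t + r <= T - 1:
        offsets = range(-r, r + 1)     # interior: symmetric window
    elif t - r < 0:
        offsets = range(0, r + 1)      # left boundary: look forward only
    else:
        offsets = range(-r, 1)         # right boundary: look backward only
    weights = np.ones(len(offsets))    # boxcar weights (assumption)
    idx = [t + i for i in offsets]
    # weighted sum of the matrices in the window, divided by the window size
    return np.tensordot(weights, A[idx], axes=1) / len(offsets)

# Hypothetical usage on random symmetric 0/1 matrices
rng = np.random.default_rng(0)
U = np.triu(rng.random((10, 5, 5)) < 0.3, k=1).astype(float)
A = U + U.transpose(0, 2, 1)
P_hat = window_estimator(A, t=4, r=2)
```

With `r = 0` the function returns the single snapshot, which corresponds to clustering each time point separately.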

Estimation error
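In order to choose the value of $r$, we evaluate the error $\|\hat P_{t,r} - P_t\|$. Denote
$$\Delta_t(r) = \|\hat P_{t,r} - P_t\| \le \Delta_{1,t}(r) + \Delta_{2,t}(r), \qquad (4.3)$$
where $\Delta_{1,t}(r) = \|\hat P_{t,r} - \tilde P_{t,r}\|$ and $\Delta_{2,t}(r) = \|\tilde P_{t,r} - P_t\|$, with $\tilde P_{t,r} = \mathbb{E}\, \hat P_{t,r}$, represent, respectively, the variance and the bias portions of the error. The following statements provide upper bounds for those quantities.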
The exact expression for C 0,τ is given by formula (8.14) in the Appendix.
Lemmas 2 and 3 together with inequality (4.3) provide an upper bound for $\Delta_t(r)$. Since the bound on $\Delta_{1,t}(r)$ is decreasing and the bound on $\Delta_{2,t}(r)$ is increasing in $r$, there exists a value $r^*$ that ensures the best bias-variance balance. Denote r * = argmin Then, the following lemma provides an upper bound for $\Delta_t(r)$.

Lemma 4. Let (2.1) and Assumptions A1-A4 hold with $\alpha_n \ge C_\alpha^{-1} c_0 \log n / n$. Then, the optimal value of $r$ is
where $[x]$ is the largest integer no greater than $x$ and $C_T$ and $C_s$ are positive constants independent of $n$, $T$, $n_{\max}$, $s$ and $\alpha_n$. Also, with probability at least $1 - 4 n^{-\tau}$ one has
where the constant $C_\Delta$ depends on $\tau$, $c_0$, $W_{\max}$, $\beta$, $L$, $C_\alpha$ and $C_\lambda$.
Note that $\delta_1 < \delta_2$ corresponds to the case where $r^* = 0$, and this situation occurs only if $T$ is rather small or $s$ is large. In particular, $\delta_2 \le \delta_1$ if and $s \le \tilde C_s (n_{\max} \alpha_n)^{-1} n$, (4.10) where $\tilde C_T$ and $\tilde C_s$ are positive constants independent of $n$, $T$, $n_{\max}$, $s$ and $\alpha_n$. In this case, $r^* \ge 1$ and one can take advantage of the smoothness of the connection probabilities and the relative stability of group memberships.

Adaptive estimation
Observe that the value of $r^*$ depends on the values of $s$, $n_{\max}$, $\alpha_n$ and $\beta$, which are unknown; therefore, in practice, the value $r^*$ in (4.8) is unavailable. In order to construct an adaptive estimator, we use the Lepskii method [24], [25]. For any $t$, set
$$\hat r \equiv \hat r_t = \max\left\{ 0 \le r \le T/2 :\ \|\hat P_{t,r} - \hat P_{t,\rho}\| \le 4\, C_{0,\tau} \sqrt{\frac{n \alpha_n}{\rho \vee 1}} \ \text{ for any } \rho < r \right\}. \qquad (4.11)$$
Observe that evaluation of $\hat r$ does not require the knowledge of $s$, $n_{\max}$ or $\beta$. If the network is not sparse, one can set $\alpha_n = 1$. Otherwise, one needs to know $\alpha_n$ for choosing an optimal value of $r$. If $\alpha_n$ is known, the following lemma ensures that the replacement of $r^*$ by $\hat r$ changes the upper bound by a constant factor only.
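The selection rule (4.11) can be sketched as a direct search over window widths. The code below is a rough illustration under assumptions: the constant `C` stands in for $C_{0,\tau}$, whose exact value is given in the Appendix and is not reproduced here; the boxcar window estimator is a simplified stand-in for the kernel estimator; and the names `boxcar_estimator` and `lepskii_window` are hypothetical.

```python
import numpy as np

def boxcar_estimator(A, t, r):
    """Simplified window average (symmetric inside, one-sided at the ends)."""
    T = A.shape[0]
    if t - r >= 0 and t + r <= T - 1:
        lo, hi = t - r, t + r
    elif t - r < 0:
        lo, hi = t, t + r
    else:
        lo, hi = t - r, t
    return A[lo:hi + 1].mean(axis=0)

def lepskii_window(A, t, alpha_n=1.0, C=1.0, r_max=None):
    """Largest r whose estimator is close to every estimator with rho < r."""
    T, n, _ = A.shape
    if r_max is None:
        r_max = T // 2
    P_hat = [boxcar_estimator(A, t, r) for r in range(r_max + 1)]
    best = 0                                  # r = 0 is always admissible
    for r in range(1, r_max + 1):
        admissible = all(
            np.linalg.norm(P_hat[r] - P_hat[rho], 2)      # spectral norm
            <= 4.0 * C * np.sqrt(n * alpha_n / max(rho, 1))
            for rho in range(r)
        )
        if admissible:
            best = r
    return best
```

When the matrices do not change over time, every comparison passes and the widest window is selected; when the network changes abruptly near time `t`, the comparison with the narrow windows fails and the rule falls back to a small `r`.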
Theorem 1. Let clustering be carried out according to Algorithm 1. Let $P_t = \Theta_t B_t \Theta_t^T$ where $B_t = \alpha_n H_t$ and $\Theta_t \in \mathcal{M}_{n,K}$. If (2.1) and Assumptions A1-A4 hold with $\alpha_n \ge C_\alpha^{-1} c_0 \log n / n$, then for any $\tau > 0$, with probability at least $1 - 4 n^{-\tau}$, one has
where (5.2) holds provided the right-hand side of (5.1) is bounded by one, $\delta_1$ and $\delta_2$ are defined in (4.7) and
In addition, if $T$ grows at most polynomially with $n$, so that $T \le n^{\tau_1}$ for some $\tau_1 < \infty$, then for any $\tau > 0$, with probability at least $1 - 4 n^{-(\tau - \tau_1)}$, one has

Theorem 1 provides upper bounds for the local clustering errors at time $\tau_t$ as well as for the maximum clustering errors on the whole time interval. It confirms that, as long as (4.10) holds, at every time instance $\tau_t$ the clustering errors obtained by our algorithm will be smaller than those obtained by separately clustering snapshots of the network at each individual time point. Consequently, the errors will be smaller than those reported in [23]. In particular, by direct calculations, one derives
$$\frac{K \min(\delta_1^2, \delta_2^2)}{\alpha_n^2\, n_{\min}^2} \le \frac{K n}{\alpha_n\, n_{\min}^2} \min\left\{ 1;\ \left( \frac{n \alpha_n}{T^{2\beta}} \right)^{\frac{1}{2\beta+1}} + \sqrt{\frac{n_{\max}\, \alpha_n\, s}{n}} \right\}. \qquad (5.4)$$
For example, if the community sizes are balanced, i.e. there exist positive constants $C_1$ and $C_2$ such that
we immediately obtain the following corollary.

Remark 1. Dense network. Inequalities (5.1), (5.2), (5.6) and (5.3) imply that the precision of clustering is better when $\alpha_n$ is larger. Indeed, if the network is dense, then $\alpha_n = 1$, the estimator $\hat P_{t,\hat r}$ is fully adaptive and with probability at least $1 - 4 n^{-(\tau - \tau_1)}$,

Remark 2. Constant memberships. If group memberships of the nodes remain unchanged over time, then $s = 0$ and one can cluster the average $\bar P$ of the edge probability matrices, rather than the individual matrices $P_t$, on the basis of its observed counterpart $\hat{\bar P}$, where
In this case, $2r + 1 = T$, $W_{\max} = 1$ and the bias portion of the error disappears; hence, $\|\hat{\bar P} - \bar P\| \le C_{0,\tau} \sqrt{n \alpha_n / T}$. Observe that $\lambda_{\min}(\bar P) \ge C_\lambda^{-1} \alpha_n n_{\min}$. Therefore, for any $\tau > 0$, with probability at least $1 - 4 n^{-\tau}$, one has

Remark 3. Constant matrix of connection probabilities.
Consider the situation where nodes of the network can switch memberships in time ($s > 0$) but the matrix of the connection probabilities is constant: $B_t \equiv B$. In this case, Assumption A1 is valid with $\beta = \infty$. Then, for $r < T$, the bias term satisfies $\|\tilde P_{t,r} - P_t\| \le 2\sqrt{2}\, W_{\max} C_\lambda\, \alpha_n \sqrt{n_{\max}\, r s}$. Hence, $\delta_2 = (\alpha_n^3\, n\, n_{\max}\, s)^{1/4}$ and for any $\tau > 0$, with probability at least $1 - 4 n^{-\tau}$, the clustering error at time $\tau_t$ appears as
$$\frac{K n}{\alpha_n\, n_{\min}^2} \min\left\{ 1;\ \sqrt{\frac{n_{\max}\, \alpha_n\, s}{n}} \right\}.$$

Estimating the number of clusters
Lemma 5 implies that the eigenvalues of $\hat P_{t,\hat r}$ can be used to estimate the true number of clusters $K$. Indeed, denote the sorted eigenvalues of any symmetric matrix $X \in \mathbb{R}^{n \times n}$ by $\lambda_1(X) \ge \lambda_2(X) \ge \ldots \ge \lambda_n(X)$. Then, due to the matrix perturbation result $|\lambda_i(X) - \lambda_i(Y)| \le \|X - Y\|$ (see, e.g., Corollary III.2.2 of [3]), obtain
$$\lambda_{K+1}(\hat P_{t,\hat r}) \le \|\hat P_{t,\hat r} - P_t\|, \qquad \lambda_j(\hat P_{t,\hat r}) \ge \lambda_j(P_t) - \|\hat P_{t,\hat r} - P_t\|, \quad j = 1, \cdots, K.$$
Denote $\lambda_{j,t} = \lambda_j(P_t)$, $\hat\lambda_{j,t} = \lambda_j(\hat P_{t,\hat r})$ and
Then, one has
Hence, if $\epsilon_t$ is small enough, then there exists a threshold $\varpi$ such that
while $\hat\lambda_{j+1,t} / \hat\lambda_{j,t}$ can exhibit chaotic behavior for $j \ge K + 1$. Therefore, one can estimate $K$ by
where $\varpi$ is a tuning parameter. Note that the expression for $\hat K_t$ is somewhat similar to the one suggested by Le and Levina [22], with the difference that we use eigenvalues of the adjacency matrix in the situation of a time-dependent network.
The following statement shows that if the eigenvalues of $P_t$ grow at most exponentially and $\lambda_{K,t} = \lambda_{\min}(P_t)$ is large enough, then $\hat K_t$ is an accurate estimator of $K$ with high probability.
Proposition 1. Let Assumptions A1-A3 hold and $\alpha_n \ge C_\alpha^{-1} c_0 \log n / n$. Let for some $w > 0$
where $K$ is the true number of clusters. If
where $\Delta_t(r^*)$ is defined in (4.9), then for any $\tau > 0$, with probability at least $1 - 4 n^{-\tau}$, inequalities (6.2) hold with $\varpi = (3 + w)^{-1}$ and $\hat K_t = K$.
Observe that condition (6.5) on the lowest nonzero eigenvalue of $P_t$ is essentially a necessary condition for accurate clustering. Indeed, $\Delta_t(r^*) \le C_\Delta \min(\delta_1, \delta_2)$ by (4.9) and, by Assumption A3, $\lambda_{\min}(P_t) \ge C_\lambda^{-1} \alpha_n n_{\min}$, so that (6.5) is guaranteed by
On the other hand, the clustering error in (5.1) is bounded above by $R_t(\hat\Theta_t, \Theta_t) \le C_R (2 + \epsilon) K \aleph$, where $\aleph$ is defined in (6.6). Therefore, a small value $R_t(\hat\Theta_t, \Theta_t) \le \delta$ of the clustering error implies that $\aleph \le (C_R (2 + \epsilon))^{-1} \delta / K$, which ensures (6.6) provided that $K$ is large enough. Note also that assumption (6.4) is not restrictive. Indeed, since $\lambda_K(P_t) = \lambda_{\min}(P_t) \ge C_\lambda^{-1} \alpha_n n_{\min}$ and $\lambda_1(P_t) = \lambda_{\max}(P_t) \le C_\lambda \alpha_n n_{\max}$, obtain that
so condition (6.4) always holds, for example, in the case of a balanced model satisfying (5.5). The combination of Theorem 1 and Proposition 1 immediately yields the following corollary.
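The eigenvalue-ratio idea can be illustrated with a short sketch. The displayed form of the estimator is not available in this text, so the code assumes one natural reading: the estimate is the first index $j$ at which the ratio of the $(j+1)$-th to the $j$-th sorted eigenvalue falls below the threshold $\varpi$. The name `estimate_num_clusters` is hypothetical, and the leading eigenvalues are assumed to be positive.

```python
import numpy as np

def estimate_num_clusters(P_hat, varpi):
    """First j such that lambda_{j+1} < varpi * lambda_j (assumed rule)."""
    lam = np.sort(np.linalg.eigvalsh(P_hat))[::-1]   # descending eigenvalues
    for j in range(len(lam) - 1):
        # product form avoids dividing by a near-zero eigenvalue
        if lam[j + 1] < varpi * lam[j]:
            return j + 1
    return len(lam)

# Noise-free example: three balanced classes of 20 nodes each
truth = np.repeat(np.arange(3), 20)
Theta = np.eye(3)[truth]
B = 0.1 + 0.4 * np.eye(3)
K_hat = estimate_num_clusters(Theta @ B @ Theta.T, varpi=0.25)
```

In this example the nonzero eigenvalues are 14, 8 and 8, so the first two ratios stay above the threshold and the third ratio is numerically zero, giving an estimate of three.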

Discussion
In the present paper, we study the DSBM under the assumptions that the connection probabilities, as functions of time, are smooth and that at most s nodes can switch their class memberships between two consecutive time points. We estimate the edge probability tensor by a kernel-type procedure and extract group memberships of the nodes by spectral clustering. The methodology is computationally viable, adaptive to the unknown smoothness of the functional connection probabilities, to the rate s of membership switching and to the unknown number of classes. In addition, it is accompanied by non-asymptotic guarantees for the precision of both estimation and clustering. Since we do not make an assumption that the network is generated by a time-varying graphon, we cannot take advantage of the techniques described in [7] or [40]. Nevertheless, under an appropriate set of conditions, one can possibly improve the precision of clustering using approaches developed in those papers.

Acknowledgments
Marianna Pensky was partially supported by National Science Foundation (NSF) grant DMS-1407475.
It is easy to check that the determinant of the system of equations (8.2) is nonzero and, for every r ≥ 1, the system (8.3) allows to recover a kernel W (1) r,l . For example, for m = 1 we derive Construction of the boundary kernels W b j r −j i j and choose the coefficients, so that the kernel satisfies condition (4.1). The latter leads to the system of (l + 1) linear equations of the form

Proofs of the statements in the paper
Proof of Lemma 1. If $P_t = U D U^T$ and $\hat P_t = \hat U \hat D \hat U^T$, then, by Lemma 5.1 of [23], obtain that there exists an orthogonal matrix $O$ such that
Let $S_{t,k}$ be the subset of nodes in class $G_{t,k}$ that are misclassified. Then, Lemma 5.3 and Theorem 3.1 of Lei and Rinaldo [23] imply that
provided the right-hand side of (8.4) is bounded by one. In order to derive (3.3), observe that

Proof of Lemma 2. Since the case $r = 0$ follows directly from Theorem 1.1 in the Supplementary material of [23], we can assume that $r \ge 1$. Also, in order to simplify the proof, we do not consider the kernels $W^{(j)}_{r,l}$ for each $j = 1, 2, 3$ separately, but instead remove the index $j$, since the proofs are practically identical for all three values of $j$.
Note that here the definition differs from the one in the proof in [23] by a factor of $r$, since we consider a weighted sum of random matrices (instead of a single random matrix). Partitioning $|x^T (\hat P_{t,r} - \tilde P_{t,r}) y|$ into the portions containing the "light pairs" and the "heavy pairs", obtain
In order to obtain upper bounds for the right-hand side of (8.7), we need three supplementary statements, Lemmas 6, 7 and 8, that generalize, respectively, Lemmas 3.1, 4.1 and 4.2 of Lei and Rinaldo [23] to our setting. The proofs of Lemmas 6, 7 and 8 are deferred till Section 8.3.

Lemma 6. Under assumptions of Lemma 2, one has
Pr    sup x,y∈T (i,j)∈L(x,y)
Proof of Lemma 3. First, let us prove that under Assumption A2, one has for any $k$ such that
Let, without loss of generality, $k > 0$. Note that the matrix $\Theta_t - \Theta_{t+k}$ has at most $ks$ nonzero rows, in each of which one entry is 1 and another is -1. If we permute the rows of the matrix $\Theta_t - \Theta_{t+k}$ so that those nonzero rows are the first ones, we obtain that $(\Theta_t - \Theta_{t+k})(\Theta_t - \Theta_{t+k})^T$ is a block-diagonal matrix with the only nonzero block $\Lambda \in \mathbb{R}^{ks \times ks}$ in the top left corner, whose elements have absolute values equal to 0, 1 or 2. Then
which implies (8.15). Note that the upper bound is tight (to see this, consider the case where $s$ elements move from class $i$ to class $j$ at each of $k$ time points). Next, note that $\Delta_{2,t}$ in (4.3) can be decomposed as
$$\|\tilde P_{t,r} - P_t\| \le \|\tilde P_{t,r} - \bar P_{t,r}\| + \|\bar P_{t,r} - P_t\| \equiv \Delta_{2,1,t} + \Delta_{2,2,t}, \qquad (8.16)$$
where
For this purpose observe that $\|\Theta_t\| \le \sqrt{n_{\max}}$ and that
The last two inequalities together with (8.15) imply (8.17). Now, let us prove that
$$\Delta_{2,2,t} = \|\bar P_{t,r} - P_t\| \le \frac{L}{l!}\, W_{\max}\, \alpha_n\, n \left( \frac{r}{T} \right)^{\beta}. \qquad (8.18)$$
Let $j = 1, 2$ or 3 be determined by the value of $t$. Denote
Observe that by Assumptions A1 and A4, for any $k, k' = 1, \ldots, K$, using Taylor's expansion at $i = 0$, one derives
where $|\xi| \le r/T$. Due to Assumption A1, the first sum is equal to zero and
Recall that $B_{t+i} - B_t = \alpha_n (H_{t+i} - H_t)$ and that the spectral norm of a matrix is dominated by the $\ell_1$ norm. Therefore,

Proof of Lemma 4. If $r^* = 0$, then the results of the Lemma follow directly from [23]. Consider $r \ge 1$. Then,
where $C$ depends on $\tau$, $c_0$, $W_{\max}$, $l$, $L$ and $\lambda_{\max}$. Denote
It is easy to see that $F_1(r)$ and $F_2(r)$ are growing in $r$ while $F_3(r)$ is declining. Therefore, the minimum is reached at the point $r$ where $F_1(r) + F_2(r) \asymp \max(F_1(r), F_2(r)) = F_3(r)$. Observe that $F_1(r) = F_3(r)$ if $r = r_1 = \left( (n \alpha_n)^{-1} T^{2\beta} \right)^{1/(2\beta+1)}$ and $F_2(r) = F_3(r)$ if $r = r_2 = \sqrt{(\alpha_n n_{\max} s)^{-1} n}$. Moreover, $\max(F_1(r), F_2(r)) = F_3(r)$ occurs at $r^* = \min(r_1, r_2)$, and we need $r^*$ to be an integer. Then, $\min_r (F_1(r) + F_2(r) + F_3(r)) \asymp F_3(r^*)$ and, plugging $r^*$ into $F_3(r)$, we obtain (4.7).
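The balancing argument in the proof of Lemma 4 can be checked numerically. The sketch below is illustrative only and rests on assumptions: the three terms are taken with unit constants as $F_1(r) = \alpha_n n (r/T)^{\beta}$ (the smoothness bias suggested by (8.18)), $F_2(r) = \alpha_n \sqrt{n_{\max} r s}$ (the membership-switching bias) and $F_3(r) = \sqrt{n \alpha_n / (r \vee 1)}$ (the variance term); the function names and the parameter values are hypothetical.

```python
import numpy as np

def rate_terms(r, n, T, alpha_n, beta, n_max, s):
    """Assumed forms of the three error terms, with unit constants."""
    r = np.asarray(r, dtype=float)
    F1 = alpha_n * n * (r / T) ** beta              # bias from smoothness
    F2 = alpha_n * np.sqrt(n_max * r * s)           # bias from switching
    F3 = np.sqrt(n * alpha_n / np.maximum(r, 1.0))  # variance
    return F1, F2, F3

def balanced_window(n, T, alpha_n, beta, n_max, s):
    """Integer part of min(r1, r2), capped at T / 2."""
    r1 = (T ** (2 * beta) / (n * alpha_n)) ** (1.0 / (2 * beta + 1))
    r2 = np.sqrt(n / (alpha_n * n_max * s)) if s > 0 else np.inf
    return int(np.floor(min(r1, r2, T / 2)))

# Hypothetical parameter values
params = dict(n=1000, T=2000, alpha_n=0.1, beta=2, n_max=300, s=2)
r_star = balanced_window(**params)

# Brute-force minimiser of F1 + F2 + F3 over integer window widths
grid = np.arange(1, params["T"] // 2 + 1)
total = sum(rate_terms(grid, **params))
r_brute = int(grid[np.argmin(total)])
```

For these values the switching term dominates the smoothness term, so the balance point is determined by $r_2$ and agrees with the brute-force minimiser of the sum.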

Proofs of supplementary lemmas
Proof of Lemma 6. First, let us prove that for any C > 0, one has Pr sup x,y∈T (i,j)∈L(x,y) (8.24) Denote u ij = x i y j I(|x i y j | ≤ C α α n r/n) + x j y i I(|x j y i | ≤ C α α n r/n). Consider S = (i,j)∈L(x,y) x i y j ( P t,r (i, j) − P t,r (i, j)) = 1 |F r | 1≤i<j≤n k∈Fr u ij W r,l (k)(A t+k (i, j) − P t+k (i, j)).
Note that the right-hand side is the sum of n(n − 1)|F r | independent variables ξ i,j,k = u ij W r,l (k)(A t+k (i, j) − P t+k (i, j))/|F r | with zero means and absolute values bounded by |ξ i,j,k | ≤ 2W max C α α n /n |F r |, due to |u ij | ≤ 2 C α α n r/n and |A t+k (i, j) − P t+k (i, j)| ≤ 1. Applying Bernstein's inequality and using the fact that i<j u 2 ij ≤ 2 (as proved in the end of Section 3 in the Supplementary material of [23]), obtain Pr    sup x,y∈T 1≤i<j≤n k∈Fr since |F r | ≥ 1 + r. Using the fact that cardinality |T | ≤ exp(n log 14) (see Section 3 in the Supplementary material of [23] with δ = 1/2), obtain (8.24). In order to complete the proof, observe that inequality (8.8) guarantees that the right-hand side in (8.24) is bounded by 2n −τ .
In order to complete the proof, note that inequality (8.10) guarantees that the right hand side of (8.25) is bounded below by 1 − n −τ due to max(r, 2) ≤ |F r | ≤ 3r for r ≥ 1.
be random variables such that Pr(X i = 1 − p i ) = p i , Pr(X i = −p i ) = 1 − p i for some p i > 0. Let X = n i=1 w i X i , p = 1 n n i=1 p i , p max = max 1≤i≤n p i , w max = max 1≤i≤n w i , then for k > max(e 3wmax , 2), Pr(X ≥ kp max n) < e −(k+1)pmaxn ln(k+1)/2wmax .
Proof. E(e λX i ) = p i e w i (1−p i )λ + (1 − p i )e −w i p i λ . Following the same proof as Lemma 2.1.8 in (Alon and Spencer, 2004), we have Let λ = ln[1 + a/pn]/w max , then using (1 + a/n) n ≤ e a , its right hand side is bounded by e a− n i=1 w i p i ln[1+a/pn]/wmax .