Consistency of the maximum likelihood and variational estimators in a dynamic stochastic block model

We consider a dynamic version of the stochastic block model, in which the nodes are partitioned into latent classes and the connection between two nodes is drawn from a Bernoulli distribution depending on the classes of these two nodes. The temporal evolution is modeled through a hidden Markov chain on the node memberships. We prove the consistency (as the number of nodes and time steps increase) of the maximum likelihood and variational estimators of the model parameters, and obtain upper bounds on the rates of convergence of these estimators. We also explore the particular case where the number of time steps is fixed and connectivity parameters are allowed to vary.


Introduction
Random graphs are a suitable tool to model and describe interactions in many kinds of datasets such as biological, ecological, social or transport networks. Here we are interested in time-evolving networks, which are a powerful tool for modeling real-world phenomena where the role or behaviour of the nodes in the network and the relationships between them are allowed to change over time. Indeed, it is important to take into account the evolutionary behaviour of the graphs, instead of just studying separate snapshots as static graphs. We focus on graphs evolving in discrete time and refer to Holme [2015] for an introduction to dynamic networks.
A myriad of dynamic graph models has been introduced in the past few years. We focus here on those based on the (static) stochastic block model [SBM, Holland et al., 1983], in which the nodes are partitioned into classes. In the SBM, class memberships of the nodes are represented by latent variables and the connection between two nodes is drawn from a distribution depending on the classes of these two nodes (a Bernoulli distribution in the case of binary graphs). A first dynamic version of the SBM with discrete time is proposed in Yang et al. [2011]. There, the nodes are partitioned into Q classes and the graphs are binary or weighted. The nodes are allowed to change membership over time, and these changes are governed by independent Markov chains with values in the Q classes, while the connection probabilities are constant over time. Xu and Hero [2014] introduce a state-space model on the logit of the connection probabilities for dynamic (binary) networks with connection probabilities and group memberships varying over time. Unfortunately, their model presents parameter identifiability issues [Matias and Miele, 2017]. Xu [2015] proposes a stochastic block transition model in which the presence or absence of an edge between two nodes at a particular time affects the presence or absence of such an edge at a future time. There, the nodes can change classes over time, new nodes can enter the network, and the connection probabilities are allowed to vary over time. The models in Matias and Miele [2017] and in Becker and Holzmann [2018] are quite similar to that of Yang et al. [2011], except that they allow the connection probabilities to vary and the latter is moreover nonparametric. Bartolucci et al. [2018] extend the model of Yang et al. [2011] to deal with different forms of reciprocity in directed graphs, by directly modeling dyadic relations and with the assumption that the dyads are conditionally independent given the latent variables.
Paul and Chen [2016] and Han et al. [2015] study multi-graph SBM, arising in settings including dynamic networks and multi-layer networks where each layer corresponds to a type of edge. In these two models, the node memberships stay constant over the layers. Pensky [To appear] and Pensky and Zhang [2017] study a dynamic SBM for undirected and binary edges where both connection probabilities and group memberships vary over time, assuming that the connection probabilities between groups are a smooth function of time. Xing et al. [2010] and Ho et al. [2011] introduce dynamic versions of the mixed-membership stochastic block model, allowing each actor to carry out different roles when interacting with different peers. Zreik et al. [2016] introduce the dynamic random subgraph model, given a known decomposition of the graph into subgraphs, in which the latent class membership depends on the subgraph membership and the edges are categorical variables, their types being sampled from a distribution depending on the latent classes of the two nodes. There, a state-space model is used to characterize the temporal evolution of the latent class proportions.
As far as estimation is concerned, different methods of inference are proposed to estimate groups and model parameters. The maximum likelihood estimator (MLE) is not tractable in the SBM, thus neither in its dynamic versions. Variational methods are rather popular to approximate that MLE [Xing et al., 2010, Ho et al., 2011, Han et al., 2015, Paul and Chen, 2016, Zreik et al., 2016, Matias and Miele, 2017, Bartolucci et al., 2018]. Yang et al. [2011] rely on Gibbs sampling and simulated annealing. Pensky and Zhang [2017] propose an estimator of the connection probabilities matrix at each time step by a discrete kernel-type method and obtain a clustering of the nodes thanks to spectral clustering on this estimated matrix. They also give an estimator for the number of clusters. Spectral clustering algorithms are also used by Han et al. [2015] on the mean graph over time and by Liu et al. [2018], who use eigenvector smoothing to get some similarity across time periods (and allow the number of classes to be unknown and possibly varying over time).
Some theoretical results on the convergence of the procedures have been proven, mainly for static graphs. In the static SBM, Celisse et al. [2012] prove the consistency of the MLE and variational estimates as the number of nodes increases, and Bickel et al. [2013] establish their asymptotic normality. Mariadassou and Matias [2015] have a different approach and give sufficient conditions for the groups posterior distribution to converge to a Dirac mass located at the actual groups configuration, for every parameter in a neighborhood of the true one. Rohe et al. [2011] give asymptotic results on the normalized graph Laplacian and its eigenvectors for the spectral clustering algorithm, allowing the number of clusters to grow with the number of nodes. They also provide bounds on the number of misclustered nodes, requiring an assumption on the degree distribution. Lei and Rinaldo [2015] prove consistency for the recovery of communities in the spectral clustering on the adjacency matrix, with milder conditions on the degrees, and also extend this result to degree corrected stochastic block models. Klopp et al. [2017] derive oracle inequalities for the connection probabilities estimator and obtain minimax estimation rates, including the sparse case where the density of edges converges to zero as the number of nodes increases, thus extending previous results of Gao et al. [2015]. Gaucher and Klopp [2019] propose a bound on the risk of the maximum likelihood estimator of network connection probabilities, and show that it is minimax optimal in the sparse graphon model.
In the dynamic setting, fewer theoretical results have been established. Pensky [To appear] derives a penalized least squares estimator of the connection probabilities that is adaptive to the number of blocks and does not require knowledge of the number of classes Q. She shows that it satisfies an oracle inequality. Under the additional assumption that at most n_0 nodes change groups between two time steps, this estimator attains minimax lower bounds for the risk. She also introduces a dynamic graphon model and shows that the estimators (that do not require knowledge of a degree of smoothness of the graphon function) are minimax optimal within a logarithmic factor of the number of time steps. Based on the same dynamic SBM with at most n_0 nodes changing groups between two time steps, Pensky and Zhang [2017] give an upper bound for the (non-asymptotic) error of their estimators of the connection probabilities matrix and group memberships (and also an estimator for the number of clusters). Han et al. [2015] show consistency (as the number of time steps increases but the number of nodes is fixed) of two estimators of the class memberships for dynamic SBM (and more generally multi-graph SBM) in which the node memberships are constant over time but the connection probabilities are allowed to vary and the considered graphs are binary and symmetric. They show that the spectral clustering (on the mean graph over time) estimator of the class memberships is consistent under some stationarity and ergodicity conditions on the connection probabilities. They also prove that the MLE of the class memberships is consistent (i.e. that the fraction of misclustered nodes converges to 0) in the general case (without any structure on the connection probabilities), provided certain sufficient conditions are satisfied.
In their multi-layer model, Paul and Chen [2016] give minimax rates of misclassification under certain conditions on the growth of the types of relations, number of nodes and number of classes, extending the result of Han et al. [2015].
Here, we consider a dynamic version of the binary SBM as in Yang et al. [2011], where each node is allowed to change group membership at each time step according to a Markov chain, independently of the other nodes. We prove the consistency of the connectivity parameter MLE and, under some additional conditions, of the transition matrix MLE, when the number of nodes and of time steps are increasing. We also give upper bounds on the rates of convergence of these estimators. While these upper bounds are known to be non optimal in the static case, where asymptotic normality is obtained with classical rates of convergence [Bickel et al., 2013], these are the first to be established in a dynamic setting for the MLE. As already mentioned, the log-likelihood is intractable (except for very small values of the number of nodes n and the number of time steps T), as it requires summing over Q^{nT} terms. Thus, while its consistency remains an important result, the estimator cannot be computed. A possible alternative is to rely on a variational estimator to approximate the MLE [see for instance Matias and Miele, 2017]. We also establish the consistency of the variational estimator of the connectivity parameter and, under some additional assumptions, that of the variational estimator of the transition matrix, and obtain the same upper bounds on the rates of convergence as for the MLE. In the particular case where the number of time steps T is fixed, we also consider the model of Matias and Miele [2017], in which the connection probabilities are allowed to vary over time, and generalise these results with only the number of nodes increasing. When T = 1, we not only recover the results of Celisse et al. [2012] but extend these by giving rates of convergence. Unlike the models studied in Han et al. [2015] and Paul and Chen [2016], the node memberships in our model evolve over time. Our context is different from Pensky [To appear], which focuses on least squares estimation.
This article is organized as follows. Section 2 introduces our model and notation. More precisely, Section 2.1 describes the dynamic stochastic block model as introduced in Yang et al. [2011], Section 2.2 gives the assumptions we make on the model parameters, Section 2.3 describes the dynamic stochastic block model as in Matias and Miele [2017] for the finite time case and Section 2.4 states the expression of the likelihood of this model to define the MLE. Section 3 establishes the consistency and upper bounds on the rates of convergence for the MLE of the connection probabilities (Section 3.1) and of the transition matrix (Section 3.2). Section 4 is dedicated to variational estimators: Sections 4.1 and 4.2 establish the consistency of the variational estimators of the connection probabilities and transition matrix, respectively, along with upper bounds on the associated rates of convergence. All the proofs of the main results are postponed to Section 5, except those for the fixed T case, which are in Appendix A, while the more technical proofs are deferred to Appendix B.

Dynamic stochastic block model
We consider a set of n vertices, forming a sequence of binary undirected graphs with no self-loops at each time t = 1, . . . , T. The case of a set of directed graphs, with or without self-loops, may be handled similarly. These vertices are assumed to be split into Q latent classes, and we denote by Z_i^t the label of the i-th vertex at time t.
For each node 1 ≤ i ≤ n, we assume that the {Z_i}_{1≤i≤n} are independent and identically distributed (iid) and that each Z_i = (Z_i^t)_{t≥1} is a homogeneous and stationary Markov chain with transition probabilities P_θ(Z_i^{t+1} = l | Z_i^t = q) = γ_{ql}, where Γ = (γ_{ql})_{1≤q,l≤Q} is a stochastic matrix. We let α = (α_1, . . . , α_Q) denote the stationary distribution of the Markov chain. We will also denote Z^t = (Z_1^t, . . . , Z_n^t) and Z^{1:T} = (Z^1, . . . , Z^T). We denote by X^t = (X_{ij}^t)_{1≤i,j≤n} the symmetric binary adjacency matrix of the graph at time t, such that for every pair of nodes 1 ≤ i, j ≤ n, we have X_{ii}^t = 0 and X_{ij}^t = X_{ji}^t. Each X^t follows a stochastic block model so that, conditional on the latent groups {Z_i^t}_{1≤i≤n}, the {X_{ij}^t}_{1≤i<j≤n} are independent Bernoulli random variables with P_θ(X_{ij}^t = 1 | Z_i^t = q, Z_j^t = l) = π_{ql}, where the π_{ql} ∈ [0, 1] are the connectivity parameters. More precisely, conditional on the whole sequence of latent groups {Z_i^t}_{1≤t≤T, 1≤i≤n}, the graphs X^{1:T} = (X^1, . . . , X^T) are assumed to be independent, each X^t having a distribution depending only on {Z_i^t}_{1≤i≤n}. The model is thus parameterized by θ = (Γ, π), with Γ = (γ_{ql})_{1≤q,l≤Q} and π = (π_{ql})_{1≤q,l≤Q}. Note that π is a symmetric matrix in the undirected setup. We denote by P_θ (resp. E_θ) the probability distribution (resp. expectation) of all the random variables {Z_i^t, X_{ij}^t}_{t≥1; i,j≥1} under the parameter value θ. In the following, we assume that we observe {X_{ij}^t}_{1≤i,j≤n, 1≤t≤T} and we denote by θ* = (Γ*, π*) = ((γ*_{ql})_{1≤q,l≤Q}, (π*_{ql})_{1≤q,l≤Q}) the true parameter value, with corresponding probability distribution P_{θ*} and expectation E_{θ*}. We also let 1_A denote the indicator function of the set A and A^c the complement of A in the ambient set. For any integer M ≥ 1, the set ⟦1, M⟧ is the set of integers between 1 and M. For any finite set A, |A| denotes its cardinality. For any configuration z^{1:T}, we denote by N_q(z^t) (resp. N_{ql}(z^{1:T})) the number of nodes assigned to class q by the configuration z^t (resp. the number of transitions from class q to class l in the configuration z^{1:T}), that is N_q(z^t) = |{i ∈ ⟦1, n⟧ ; z_i^t = q}| and N_{ql}(z^{1:T}) = |{(i, t) ∈ ⟦1, n⟧ × ⟦2, T⟧ ; z_i^{t-1} = q, z_i^t = l}|.
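As an illustration, the generative mechanism just described is easy to simulate; the sketch below (all function and variable names are ours, not from the paper) draws the independent membership chains and the conditionally independent Bernoulli edges, and computes the counts N_q(z^t) and N_{ql}(z^{1:T}).

```python
import numpy as np

def simulate_dynamic_sbm(n, T, alpha, Gamma, Pi, rng):
    """Simulate the dynamic SBM: each node follows an independent stationary
    Markov chain on {0, ..., Q-1} with transition matrix Gamma and stationary
    distribution alpha; given the labels, edges are independent Bernoulli."""
    Q = len(alpha)
    Z = np.empty((T, n), dtype=int)
    Z[0] = rng.choice(Q, size=n, p=alpha)            # stationary initial law
    for t in range(1, T):
        for i in range(n):
            Z[t, i] = rng.choice(Q, p=Gamma[Z[t - 1, i]])
    X = np.zeros((T, n, n), dtype=int)
    for t in range(T):
        for i in range(n):
            for j in range(i + 1, n):                # undirected, no self-loops
                X[t, i, j] = X[t, j, i] = rng.binomial(1, Pi[Z[t, i], Z[t, j]])
    return Z, X

def counts(Z, Q):
    """Class sizes N_q(z^t) and transition counts N_ql(z^{1:T})."""
    T, n = Z.shape
    Nq = np.array([[np.sum(Z[t] == q) for q in range(Q)] for t in range(T)])
    Nql = np.zeros((Q, Q), dtype=int)
    for t in range(1, T):
        for i in range(n):
            Nql[Z[t - 1, i], Z[t, i]] += 1
    return Nq, Nql
```

By construction, each N_q(Z^t) row sums to n and the transition counts sum to n(T − 1).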

Assumptions
The assumptions we make on the model parameters are the following.
1. For every q ≠ q′ in ⟦1, Q⟧, there exists l ∈ ⟦1, Q⟧ such that π_{ql} ≠ π_{q′l}.
2. There exists some δ > 0 such that for any (q, l) ∈ ⟦1, Q⟧², we have γ_{ql} ∈ [δ, 1 − δ].
3. There exists some ζ > 0 such that for any (q, l) ∈ ⟦1, Q⟧², we have π_{ql} ∈ [ζ, 1 − ζ].
Assumption 1 is necessary for identifiability of the model: if it does not hold, we cannot distinguish between classes q and q′. Assumption 2 ensures that each Markov chain Z_i is irreducible, aperiodic and recurrent; this assumption could be weakened at the cost of technicalities. In particular, it implies that the stationary distribution α exists. Moreover, Assumption 2 also implies that for any q ∈ ⟦1, Q⟧, we have α_q ∈ [δ, 1 − δ]. Note that this can be seen as an equivalent, in the dynamic case, of Assumption 2 in Celisse et al. [2012] (on the probability distribution of the class memberships). Celisse et al. [2012] moreover impose an empirical version of this assumption, stating that the observed class proportions are bounded away from 0, which holds with high probability. We do not make such an assumption and instead use the fact that the probability of this event converges to 1. Assumption 3 is technical and could also be weakened with additional technicalities. For example, Celisse et al. [2012] also consider the case π_{ql} ∈ {0, 1} (i.e. π_{ql} ∈ {0, 1} ∪ [ζ, 1 − ζ]) whereas we do not. The whole parameter set defined by these constraints is denoted by Θ. In the following, we assume that θ* ∈ Θ.
In what follows, we work up to label permutation on the groups. Indeed, as in any latent group model, the parameters can only be recovered up to label switching on the latent groups. We then define the following notation: for any permutation σ ∈ S_Q, with S_Q the set of permutations on ⟦1, Q⟧, θ^σ = (Γ^σ, π^σ) = ((γ_{σ(q)σ(l)})_{1≤q,l≤Q}, (π_{σ(q)σ(l)})_{1≤q,l≤Q}).
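Concretely, any error measure between an estimate and the truth must be minimized over relabelings; a minimal sketch (our own naming) of computing min_{σ∈S_Q} ‖π̂^σ − π*‖_∞:

```python
import itertools
import numpy as np

def min_perm_error(Pi_hat, Pi_star):
    """Minimum over permutations sigma of the sup-norm error between the
    relabeled estimate (pi^sigma)_{ql} = pi_{sigma(q)sigma(l)} and pi*."""
    Q = Pi_star.shape[0]
    best = np.inf
    for sigma in itertools.permutations(range(Q)):
        err = np.max(np.abs(Pi_hat[np.ix_(sigma, sigma)] - Pi_star))
        best = min(best, err)
    return best
```

The brute-force search over the Q! permutations is only meant for small Q, which is the regime of interest here since Q is fixed.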

Finite time case
If the number of time steps T is fixed, it is possible to let the connection probabilities vary over time. We then consider this case, the connection parameter now being π^{1:T} = (π^1, . . . , π^T) with π^t = (π^t_{ql})_{1≤q,l≤Q} for every t ∈ ⟦1, T⟧ and π^t_{ql} = P_θ(X^t_{ij} = 1 | Z^t_i = q, Z^t_j = l) for any (t, q, l) ∈ ⟦1, T⟧ × ⟦1, Q⟧². Note that this is the more general model of Matias and Miele [2017], in which the model parameter is θ = (Γ, π^{1:T}). Moreover, we introduce the following Assumptions 1' and 3', which are alternate versions of Assumptions 1 and 3 respectively for the finite time case.
1'. For every t ∈ ⟦1, T⟧ and every q ≠ q′ in ⟦1, Q⟧, there exists l ∈ ⟦1, Q⟧ such that π^t_{ql} ≠ π^t_{q′l}.
3'. There exists some ζ > 0 such that for every t ∈ ⟦1, T⟧ and any (q, l) ∈ ⟦1, Q⟧², we have π^t_{ql} ∈ [ζ, 1 − ζ].
Assumption 1' (resp. Assumption 3') expresses that for every t ∈ ⟦1, T⟧, π^t satisfies Assumption 1 (resp. Assumption 3). We also introduce the following additional assumption, which ensures (together with Assumption 1') that the model is identifiable up to a label permutation; see Matias and Miele [2017].
4. For every q ∈ ⟦1, Q⟧ and every t_1, t_2 ∈ ⟦1, T⟧, π^{t_1}_{qq} = π^{t_2}_{qq} =: π_{qq}, and {π_{qq} ; q ∈ ⟦1, Q⟧} are Q distinct values.
Assumption 4 states that the diagonal of π^t does not change over time, and that its values are distinct. We denote by Θ_T the set of parameters satisfying Assumptions 1', 2, 3' and 4. As before, we assume in the following that θ* ∈ Θ_T in the fixed-T case.
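Assumption 4 is easy to verify numerically for a candidate parameter; a small sketch (exact float comparison for illustration, names are ours) checking both conditions:

```python
import numpy as np

def satisfies_assumption_4(Pi_seq):
    """Check Assumption 4 on a sequence (pi^1, ..., pi^T) of Q x Q connectivity
    matrices: the diagonal is constant over time and holds Q distinct values."""
    diags = np.array([np.diag(P) for P in Pi_seq])   # shape (T, Q)
    constant_diag = bool(np.all(diags == diags[0]))
    distinct = len(set(diags[0].tolist())) == diags.shape[1]
    return constant_diag and distinct
```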
In the next section, we study separately the consistency of the connectivity parameter estimator π̂ and that of the transition matrix estimator Γ̂.
Consistency of the maximum likelihood estimate

Connectivity parameter
We first prove the consistency of the maximum likelihood estimator of the connectivity parameter π = (π_{ql})_{1≤q,l≤Q} when the number of nodes and time steps increase. We denote the normalized log-likelihood by M_{n,T}(Γ, π) = (2/(n(n − 1)T)) ℓ(θ) = (2/(n(n − 1)T)) log P_θ(X^{1:T}) and introduce the quantities M(π, A), for any A = (a_{ql})_{1≤q,l≤Q} ∈ 𝒜, the set of Q × Q stochastic matrices, and M(π) = M(π, Ā_π), where Ā_π = argmax_{A∈𝒜} M(π, A). It is worth noticing that M(π), which will be the limiting value of M_{n,T}(Γ, π) when n and T increase (see below), does not depend on Γ.
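For very small n and T, M_{n,T} can be evaluated exactly by brute force, which makes the Q^{nT} blow-up mentioned above concrete (a sketch with our own function names, under the stationary-chain model of Section 2):

```python
import itertools
import numpy as np

def complete_loglik(X, z, alpha, Gamma, Pi):
    """log P_theta(X^{1:T}, z^{1:T}) for one latent configuration z (T x n)."""
    T, n, _ = X.shape
    ll = sum(np.log(alpha[z[0, i]]) for i in range(n))       # initial states
    ll += sum(np.log(Gamma[z[t - 1, i], z[t, i]])            # transitions
              for t in range(1, T) for i in range(n))
    for t in range(T):                                       # edges, i < j
        for i in range(n):
            for j in range(i + 1, n):
                p = Pi[z[t, i], z[t, j]]
                ll += np.log(p) if X[t, i, j] else np.log(1 - p)
    return ll

def M_nT(X, alpha, Gamma, Pi):
    """Normalized log-likelihood (2 / (n(n-1)T)) log P_theta(X^{1:T}),
    summing the complete-data likelihood over all Q^{nT} configurations."""
    T, n, _ = X.shape
    Q = len(alpha)
    logs = [complete_loglik(X, np.array(z).reshape(T, n), alpha, Gamma, Pi)
            for z in itertools.product(range(Q), repeat=n * T)]
    return 2.0 / (n * (n - 1) * T) * np.logaddexp.reduce(logs)
```

Even for Q = 2, n = 20 and T = 10 the sum already has 2^200 terms, which is why Section 4 resorts to a variational approximation.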
Proposition 1. For any sequence {r_{n,T}}_{n,T≥1} increasing to infinity, if log(T) = o(n), we have for all ε > 0
We then conclude on the consistency of the maximum likelihood estimator of the connection probabilities with the following corollary. Note that we also obtain an upper bound on the rate of convergence of this estimator.
Corollary 1. For any sequence {r_{n,T}}_{n,T≥1} increasing to infinity such that r_{n,T} = o(n^{1/4}) and if log(T) = o(n), we have for every ε > 0
We want to obtain equivalent consistency results when the number of time steps T is fixed and only the number of nodes n increases. In that case, denoting by θ̂ = (Γ̂, π̂^{1:T}) the MLE of θ, we have the following corollary, which is the equivalent of Corollary 1.
Corollary 2. If the number of time steps T is fixed, we have for every ε > 0 and for any sequence {r_n}_{n≥1} increasing to infinity such that r_n = o(n^{1/4})
This result states that min_{σ∈S_Q} ‖π*^{1:T} − (π̂^{1:T})^σ‖_∞ converges to 0 in P_{θ*}-probability as n increases, i.e. the MLE of the connection probabilities is consistent up to label switching, and gives an upper bound on the rate of convergence of this estimator. The particular case T = 1 is then a stronger result than that of Celisse et al. [2012], where no rate of convergence is given.

Latent transition matrix
We now prove that the MLE for the transition matrix Γ is consistent when the number of nodes and time steps increase.
Lemma 1. Any critical point θ̂ = (Γ̂, π̂) of the likelihood function ℓ(·) is such that Γ̂ satisfies the fixed point equation (4).
There are two different possible cases for the MLE θ̂:
• Either θ̂ is a critical point of the likelihood function. Then Γ̂ satisfies equation (4).
• Or θ̂ is not a critical point (this can happen if it belongs to the boundary of Θ) and we assume that there exists Γ such that (Γ, π̂) ∈ Θ and (Γ, π̂) satisfies equation (4) (at least for n and T large enough). We then choose (Γ, π̂) as our estimator. By an abuse of notation, we will denote this estimator θ̂ = (Γ̂, π̂) and call it the MLE in the following.
The following result establishes that, asymptotically, any estimator that correctly estimates the connectivity parameter π also recovers the group memberships. This result is similar to Theorem 1 in Mariadassou and Matias [2015].
Proposition 2. For any estimator θ̂ ∈ Θ (at least for n and T large enough), if log(T) = o(n), there exist some positive constants C, C_1, C_2, C_3, C_4 such that for any ε > 0, for any positive sequence {y_{n,T}}_{n,T≥1} such that log(1/y_{n,T}) = o(n), any η ∈ (0, δ) and for n and T large enough, we have
Proposition 3. If log(T) = o(n), for any ε > 0 and any sequence {r_{n,T}}_{n,T≥1} increasing to infinity such that r_{n,T} = o(√(nT/log n)), we have for any σ ∈ S_Q
with {v_{n,T}}_{n,T≥1} a sequence decreasing to 0 such that v_{n,T} = o(√(log(nT)/n)).
Corollary 3. Assume that log(T) = o(n) and min_{σ∈S_Q} ‖π̂^σ − π*‖_∞ = o_{P_{θ*}}(v_{n,T}) with {v_{n,T}}_{n,T≥1} a sequence decreasing to 0 such that v_{n,T} = o(√(log(nT)/n)). Then for any ε > 0 and any sequence {r_{n,T}}_{n,T≥1} increasing to infinity such that r_{n,T} = o(√(nT/log n)), we have the convergence
Remark 1. Note that the upper bound obtained in Corollary 1 on the rate of convergence in probability of π̂ does not ensure that min_{σ∈S_Q} ‖π̂^σ − π*‖_∞ = o_{P_{θ*}}(v_{n,T}) holds. While the latter has never been established (to our knowledge), it is a reasonable assumption.
We want an equivalent of Corollary 3 when the number of time steps T is fixed and the connection probabilities vary over time (the connection parameter being π = π^{1:T} = (π^1, . . . , π^T) with π^t = (π^t_{ql})_{q,l}). For that, we need an equivalent of Proposition 2 in that case.
The following corollary gives the expected result.
Corollary 4. Let the number of time steps T ≥ 2 be fixed. Assume that min_{σ∈S_Q} ‖(π̂^{1:T})^σ − π*^{1:T}‖_∞ = o_{P_{θ*}}(v_n) with {v_n}_{n≥1} a sequence decreasing to 0 such that v_n = o(√(log(n)/n)). Then for any ε > 0 and any sequence {r_n}_{n≥1} increasing to infinity such that r_n = o(√(n/log n)), we have the convergence
The proof of Corollary 4 is the same as that of Corollary 3, but relying on Proposition 4 instead of Proposition 2, and is therefore omitted.

Variational estimators
In practice, we cannot compute the MLE except for very small values of n and T, because it involves a summation over all the Q^{nT} possible latent configurations. Nor can we use the Expectation-Maximization (EM) algorithm to approximate it, because this algorithm requires the conditional distribution of the latent variables given the observations, which is not tractable either. A common solution is to use the Variational Expectation-Maximization (VEM) algorithm, which optimizes a lower bound of the log-likelihood (see for example Daudin et al. [2008]). Let us denote Z_{iq}^t = 1_{Z_i^t = q} for every t, i and q. Using the same approach as in Matias and Miele [2017] for the VEM algorithm in the dynamic SBM, we consider a variational approximation of the conditional distribution of the latent variable Z^{1:T} given the observed variable X^{1:T} in the class of probability distributions parameterized by χ. The quantity to optimize in the VEM algorithm is then J(χ, θ), with KL(·, ·) denoting the Kullback-Leibler divergence and H(·) denoting the entropy. Define χ̂(θ) as a maximizer of χ ↦ J(χ, θ) and the variational estimator of θ as θ̂ = (Γ̂, π̂) = argmax_{θ∈Θ} J(χ̂(θ), θ).
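To make the lower bound concrete, here is a sketch of a quantity of the form J(χ, θ) for a fully factorized mean-field family with responsibilities τ (all names are ours; this simplified family is for illustration only, whereas the approximation of Matias and Miele [2017] retains each node's Markov structure across time):

```python
import numpy as np

def elbo(X, tau, alpha, Gamma, Pi):
    """Lower bound J = E_Q[log p(X, Z; theta)] + H(Q) under a fully factorized
    mean-field family Q(Z) = prod_{i,t} q_it(Z_i^t), with responsibilities
    tau[i, t, q] approximating P(Z_i^t = q | X)."""
    T, n, _ = X.shape
    J = float(np.sum(tau[:, 0, :] @ np.log(alpha)))      # initial states
    for t in range(1, T):                                # Markov transitions
        for i in range(n):
            J += tau[i, t - 1, :] @ np.log(Gamma) @ tau[i, t, :]
    for t in range(T):                                   # Bernoulli edges, i < j
        for i in range(n):
            for j in range(i + 1, n):
                B = (X[t, i, j] * np.log(Pi)
                     + (1 - X[t, i, j]) * np.log(1 - Pi))
                J += tau[i, t, :] @ B @ tau[j, t, :]
    J -= np.sum(tau * np.log(np.clip(tau, 1e-300, None)))  # entropy H(Q)
    return J
```

For a degenerate τ putting mass 1 on a single configuration, the entropy vanishes and J reduces to the complete-data log-likelihood, so J ≤ log P_θ(X^{1:T}) always holds.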

Connectivity parameter
Proposition 5. For any sequence {r_{n,T}}_{n,T≥1} increasing to infinity, if log(T) = o(n), we have for all ε > 0
We conclude on the consistency of the connection probabilities variational estimators as n and T increase thanks to the following corollary.
Corollary 5. For any sequence {r_{n,T}}_{n,T≥1} increasing to infinity such that r_{n,T} = o(n^{1/4}), we have for any ε > 0
We have the following equivalent corollary for a fixed number of time steps.
Corollary 6. If the number of time steps T is fixed, we have for every ε > 0 and for any sequence {r_n}_{n≥1} increasing to infinity such that r_n = o(n^{1/4})

Latent transition matrix
We now prove that Γ̂ is consistent when the number of nodes and time steps increase.
Lemma 2. Any critical point (χ̂, θ̂) of the function J(·, ·) is such that Γ̂ satisfies the fixed-point equation
We assume that (χ̂, θ̂) is a critical point of J(·, ·). Then we have the fixed-point equation
The following proposition gives the consistency and a rate of convergence of this estimator, under an assumption on the rate of convergence of π̂.
Proposition 6. If log(T ) = o(n), for any ǫ > 0 and {r n,T } n,T ≥1 any sequence increasing to infinity such that r n,T = o nT/ log n and for any σ ∈ S Q Then for any ǫ > 0 and {r n,T } n,T ≥1 any sequence increasing to infinity such that r n,T = o nT/ log n , we have the convergence The proof of Corollary 7 is the same as that of Corollary 3, using Proposition 6 instead of Proposition 3 and is therefore omitted.
When the number of time steps T is fixed and the connection probabilities can vary over time, we have the following corollary, which is the equivalent of Corollary 7.
Corollary 8. Let the number of time steps T ≥ 2 be fixed. Assume that min_{σ∈S_Q} ‖(π̂^{1:T})^σ − π*^{1:T}‖_∞ = o_{P_{θ*}}(v_n) with {v_n}_{n≥1} a sequence decreasing to 0 such that v_n = o(√(log(n)/n)). Then for any ε > 0 and any sequence {r_n}_{n≥1} increasing to infinity such that r_n = o(√(n/log n)), we have the convergence
The proof of Corollary 8 is the same as that of Corollary 7, but relying on Proposition 4 instead of Proposition 2, and is therefore omitted.

Proof of Proposition 1
The proof follows the lines of the proof of Theorem 3.6 in Celisse et al. [2012]. Nonetheless, our result is sharper as we establish an upper bound on the rate of convergence (in probability) of the normalised likelihood. We fix some θ ∈ Θ and introduce the quantities ẑ^{1:T} = argmax
Note that Z̃^{1:T} is a random variable that depends on Z^{1:T} and that
Similarly, for any t ∈ ⟦1, T⟧, we have Z̃^t = argmax_{z ∈ ⟦1,Q⟧^n}
We bound the difference between M_{n,T}(Γ, π) and M(π) by introducing three intermediate terms, so that we can write, for any sequence {r_{n,T}}_{n,T≥1} and any ε > 0
In the following, we prove separately the convergence (in P_{θ*}-probability) to zero of the three terms of this sum (while controlling the rate of these convergences). Before starting, let us remark that we have
and
In particular, for every t ∈ ⟦1, T⟧, we have
First term of the right-hand side of (10). We let
Lemma 3. For every t ∈ ⟦1, T⟧, we have
Going back to (13) and applying Lemma 3, we get
Now, using classical dependency rules in directed acyclic graphs [see e.g. Lauritzen, 1996] combined with Assumption 2, we get
This implies that P_{θ*}(sup_{θ∈Θ} T_1 > εr_{n,T}/(3√n)) = 0 as soon as εr_{n,T}/√n ≥ 6 log(1/δ)/(n − 1). Then for any sequence {r_{n,T}}_{n,T≥1} increasing to infinity, for any ε > 0, we have that P_{θ*}(sup_{θ∈Θ} T_1 > εr_{n,T}/(3√n)) → 0 as n and T increase.
Second term of the right-hand side of (10). Let us denote
For the sake of clarity, we study this term on the event {Z^{1:T} = z*^{1:T}}, where z*^{1:T} ∈ ⟦1, Q⟧^{nT} is a fixed configuration. This event induces the definition of Z̃^{1:T} following Equation (8) as
or equivalently, for every t ∈ ⟦1, T⟧,
By definition of ẑ^{1:T} and Z̃^{1:T} respectively, we have the two inequalities
implying the lower and upper bounds
Taking the absolute value gives us an upper bound for T_2(z*^{1:T})
Using Equations (11) and (12), we then obtain the following upper bound for T_2(z*^{1:T})
We use the following concentration result to conclude.
Lemma 4. Let ε, β > 0 and {x_{n,T}}_{n,T≥1} be a sequence of positive real numbers. We let P*_{θ*}(·) denote the probability conditional on {Z^{1:T} = z*^{1:T}} under parameter θ*, i.e. P*_{θ*}(·) = P_{θ*}(· | Z^{1:T} = z*^{1:T}),
with
Let us choose x_{n,T} = log(n) in the above lemma. For any ε > 0, for any sequence {r_{n,T}}_{n,T≥1} increasing to infinity, we have for n and T large enough
Then for n and T large enough, the first term in the right-hand side of inequality (14) is equal to 0 and we have
Third term of the right-hand side of (10). Let us denote
For any fixed configuration z^t ∈ ⟦1, Q⟧^n, analogously to Equation (12), we write
where the count appearing above is the (random) number of nodes classified in group q by the current (random) configuration Z^t while belonging to group q′ in the (deterministic) configuration z^t. Recall that N_q(z^t) is the number of nodes assigned to class q by the configuration z^t, and let us denote
We remark that the definition of Z̃^t implies that Ã^t = argmax_{A^t ∈ 𝒜^t(Z^{1:T})} Φ^t(A^t, π), with 𝒜^t(Z^{1:T}) the (random) subset of stochastic matrices defined for every t ∈ ⟦1, T⟧ by
Let us also denote Ā^t_π = argmax_{A ∈ 𝒜^t(Z^{1:T})} M(π, A). Then
We start by stating a concentration lemma on the random variable N_q(Z^t) for any q ∈ ⟦1, Q⟧ and any t ∈ ⟦1, T⟧.
Lemma 5. For any θ ∈ Θ and any η ∈ (0, δ), let
Building on the previous concentration lemma, the following one gives the convergence in P_{θ*}-probability of the second term in the right-hand side of (15).
Lemma 6. For any ε > 0, any η ∈ (0, δ) and any positive sequence {r_{n,T}}_{n,T≥1},
Then, taking any η ∈ (0, δ), for any ε > 0 and any sequence {r_{n,T}}_{n,T≥1} increasing to infinity, we have the following inequality for n and T large enough
implying that the probability in Lemma 6 converges to 0 as n and T increase, for any ε > 0, as long as log T = o(n). Now, for the first term in the right-hand side of (15), note that we have for every π and every t
In both cases, we get that for every t and π,
thus obtaining the upper bound
Letting
Finally, we bound the first term of the right-hand side of (15) as follows
Applying Markov's Inequality, we obtain
The following lemma gives an upper bound on the expectation appearing in the previous inequality, for any q, l ∈ ⟦1, Q⟧.
Lemma 7. For any q, l ∈ ⟦1, Q⟧ and any t ∈ ⟦1, T⟧, we have the following inequality
This leads to
Then for any ε > 0, for any sequence {r_{n,T}}_{n,T≥1} increasing to infinity, we have the convergence
We proved the convergence to 0 of the three terms in the right-hand side of (10) for any sequence {r_{n,T}}_{n,T≥1} increasing to infinity, as long as log T = o(n). This gives the expected result and concludes the proof.

Proof of Corollary 1
To prove this corollary, we establish the following lemma, which allows us to obtain a rate of convergence of π̂ to π* from a rate of convergence of M_{n,T} to M. Note that this lemma is a bit more general than what we need and gives an equivalent result when the number of time steps T is fixed, which will be useful for Corollary 2.
Lemma 8. Let {F_{n,T}}_{n,T≥1} be any random functions on the set Θ (resp. Θ_T) and M (resp. M_T) defined as before.
Assume that there exists a sequence {v_{n,T}}_{n,T≥1} (resp. {v_n}_{n≥1}) decreasing to 0 such that for every ε > 0, we have the following convergence as n, T → ∞ (resp. n → ∞)
If for any n and T, θ̂ = (Γ̂, π̂) (resp. θ̂ = (Γ̂, π̂^{1:T})) is defined as the maximizer of F_{n,T} on the set Θ (resp. Θ_T), we have the following convergence
The result of Corollary 1 is then a direct consequence of Proposition 1 (choosing the sequence {r²_{n,T}}_{n,T≥1}) and Lemma 8 applied with F_{n,T} = M_{n,T}.

Proof of Proposition 2
The proof follows the lines of the proof of Theorem 3.8 in Celisse et al. [2012]. Nonetheless, our result is sharper as we will establish an upper bound on the rate of convergence (in probability) of the quantity at stake. For any ε > 0, any sequence {y_{n,T}}_{n,T≥1} and η ∈ (0, δ), we write
with Ω_η(θ*) as defined in Lemma 5. We will establish that there exist some positive constants C, C_1, C_2, C_3, C_4 such that for any fixed configuration z*^{1:T} ∈ Ω_η(θ*), any ε > 0, any positive sequence {y_{n,T}}_{n,T≥1} such that log(1/y_{n,T}) = o(n) and n and T large enough, we have
P_{θ*}( P_{θ̆}(Z^{1:T} ≠ z*^{1:T} | X^{1:T}) / P_{θ̆}(Z^{1:T} = z*^{1:T} | X^{1:T}) > εy_{n,T} | Z^{1:T} = z*^{1:T} ) ≤ P_{θ*}( ‖π̂ − π*‖_∞ > v_{n,T} | Z^{1:T} = z*^{1:T} )
Combined with (19) and applying Lemma 5, this gives the desired result. So now we focus on establishing (20).
In what follows, we consider a fixed configuration $z^*_{1:T} \in \Omega_\eta(\theta^*)$ and introduce the Hamming distance between $z^*_{1:T}$ and any other configuration $z_{1:T}$. We let $P^*_{\theta^*}(\cdot)$ denote the probability conditional on $\{Z_{1:T} = z^*_{1:T}\}$ under parameter $\theta = \theta^*$, i.e. $P^*_{\theta^*}(\cdot) = P_{\theta^*}(\cdot \mid Z_{1:T} = z^*_{1:T})$. In the following, we will often use the fact that the variables $\{X^t_{ij}\}$ are independent under $P^*_{\theta^*}$ (with mean value $\pi^*_{z^{*t}_i z^{*t}_j}$), so that we can rely on Hoeffding's inequality. We introduce a sequence $\{v_{n,T}\}_{n,T \geq 1}$ decreasing to 0 and the event $\Omega_{n,T}$ defined accordingly. We bound the probability of interest in (20) as in (21). Thus, the proof of (20) boils down to establishing the desired upper bound on the second term appearing in the right-hand side of (21). We have the bound (22) as long as $nT \geq Q$. For any configuration $z_{1:T}$ such that $\|z_{1:T} - z^*_{1:T}\|_0 = r$, we denote by $r(1), \dots, r(T)$ the number of differences between the two configurations at each time step $t \in \{1, \dots, T\}$, i.e. $r(t) = \|z_t - z^*_t\|_0$, so that $r = \sum_t r(t)$. Moreover, for any parameter $\pi$, we define $D_{n,T}(z_{1:T}, \pi)$ as the subset of indexes $(i, j, t) \in \{1, \dots, n\}^2 \times \{1, \dots, T\}$ with $i < j$ for which the parameter $\pi$ differs between the configurations $z^*_{1:T}$ and $z_{1:T}$, namely
$$D_{n,T}(z_{1:T}, \pi) \coloneqq \big\{(i, j, t) \in I_{n,T} \,;\ \pi_{z^t_i z^t_j} \neq \pi_{z^{*t}_i z^{*t}_j}\big\},$$
with $I_{n,T} = \{(i, j, t) \in \{1, \dots, n\}^2 \times \{1, \dots, T\} \,;\ i < j\}$ the set of indexes over which we sum to compute the conditional log-likelihood. In what follows, we abbreviate $D_{n,T}(z_{1:T}, \pi^*)$ (resp. $D_{n,T}(z_{1:T}, \hat\pi)$) to $D^*$ (resp. $\hat D$). The next lemma gives a decomposition of the main term at stake in (22).
Lemma 9. We have a decomposition of $\log \dfrac{P_{\hat\theta}(Z_{1:T} = z_{1:T} \mid X_{1:T})}{P_{\hat\theta}(Z_{1:T} = z^*_{1:T} \mid X_{1:T})}$.

Combining (22) and Lemma 9, we obtain the corresponding bound, which we then decompose as in (27). We handle these three terms separately in the following. From now on, we consider a configuration $z_{1:T}$ such that $\|z_{1:T} - z^*_{1:T}\|_0 = r = \sum_t r(t)$.
First term in the right-hand side of (27). Recall that $U_1$ is given by (23). We can further decompose this term, for $n$ and $T$ large enough such that $\hat\Gamma \in [\delta, 1 - \delta]^{Q^2}$ (which implies a corresponding lower bound for the stationary distribution $\hat\alpha$). To handle the term $U_1$, we need to lower bound the cardinality of the set $D^*$. This is the purpose of Lemma 10, which is a generalization of Proposition B.4 in Celisse et al. [2012]. This can be done for all the configurations $z_{1:T}$ and all the configurations $z^*_{1:T}$ that belong to some $\Omega_\eta(\theta)$.
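The next step uses a function $k$ on $(0,1)^2$ that is positive exactly off the diagonal; its displayed definition is missing here, and it is presumably the Kullback–Leibler divergence between Bernoulli distributions, recalled below as an assumption:

```latex
k(x, y) = x \log\frac{x}{y} + (1 - x) \log\frac{1 - x}{1 - y},
  \qquad (x, y) \in (0, 1)^2.
% This is KL(B(x) || B(y)): it is nonnegative and vanishes if and only if x = y,
% so the minimum K^* over pairs of distinct connection parameters,
%   K^* = \min_{q,l,q',l' ;\, \pi^*_{ql} \neq \pi^*_{q'l'}} k(\pi^*_{ql}, \pi^*_{q'l'}),
% is positive.
```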
Combining Lemma 10 with the previous bound, we get the corresponding inequality. We also have, for $(x, y) \in (0, 1)^2$, the expression of a function $k$. The function $k$ is positive for every $(x, y)$ such that $x \neq y$; hence, introducing the notation $K^* = \min_{q,l,q',l';\, \pi^*_{ql} \neq \pi^*_{q'l'}} k(\pi^*_{ql}, \pi^*_{q'l'})$ and using (28), we have the stated bound for $n$ large enough. This leads, for any $u > 0$ and $n$ large enough, to the next inequality. Moreover, thanks to Hoeffding's inequality and Assumption 3, we obtain a bound in which $C_\zeta$ is a constant depending on $\zeta$. Finally, using Lemma 10, we have
$$P^*_{\theta^*}\big( U_1 > -\log(1/(\varepsilon y_{n,T})) - 5r \log(nT) \big) \leq \exp\big( \log(1/(\varepsilon y_{n,T})) + 5r \log(nT) \big)
$$
Second term in the right-hand side of (27). We have the following decomposition.
For any $q, l, q', l' \in \{1, \dots, Q\}$, we introduce the sets $F_{qlq'l'}$ and $F_{ql}$, together with
$$G_{ql} = G_{ql}(z_{1:T}, z^*_{1:T}, \pi^*, \hat\pi) \coloneqq (D^* \cup \hat D) \cap F_{ql},$$
i.e. the set of $(i, j, t) \in I_{n,T}$ such that $z^t_i = q$, $z^t_j = l$ and the corresponding entries of $\pi^*$ and $\hat\pi$ differ. Then we bound this term, so that for every $u > 0$ we have the inequality (30). We start by dealing with the first term of the right-hand side of (30). Notice that on the event $\Omega_{n,T}$, we have $|\hat\pi_{ql} - \pi^*_{ql}| / \big(\pi^*_{ql}(1 - \pi^*_{ql})\big) \leq v_{n,T}/\zeta^2$ for every $q, l \in \{1, \dots, Q\}$. The next lemma establishes that any set $D_{n,T}(z_{1:T}, \pi)$ is included in a larger set whose cardinality is bounded. In particular, the random set $\hat D$ is included in a larger deterministic subset.
As the set $G_{ql}$ is random (because $\hat D$ is random), we write a bound in which $D$ is now a deterministic set. By a union bound and Hoeffding's inequality, we have the corresponding inequality for any $D \subset D_{n,T}(z_{1:T})$. This leads to a bound on the first term. For the second term of (30), we get an analogous bound from a union bound and from Lemma 11 (which gives an upper bound on $|D^* \cup \hat D|$). Finally, we obtain the stated upper bound for the second term of (27).

Third term in the right-hand side of (27). We want to bound (in probability) the last term $U_3$. Distinguishing between the cases where $X^t_{ij} = 0$ and $X^t_{ij} = 1$, we obtain a first expression. For any $(q, l) \in \{1, \dots, Q\}^2$, we further introduce the corresponding sets. Centering the $X^t_{ij}$ (under the distribution $P^*_{\theta^*}$), we get a decomposition. Then, on the event $\Omega_{n,T}$ and for $n$ and $T$ large enough such that $|(\hat\pi_{ql} - \pi^*_{ql})/(1 - \pi^*_{ql})| \leq 1/2$ and $|(\hat\pi_{ql} - \pi^*_{ql})/\pi^*_{ql}| \leq 1/2$ for every $q$ and $l$, using the fact that $|\log(1 + x)| \leq 2|x|$ for $x \in [-1/2, 1/2]$, we obtain a further bound. Then, for every $u > 0$, we split as before: for the first term of (31) we use Hoeffding's inequality as previously, and for the second term of (31) we use the same union bound argument. Finally, we obtain the stated upper bound for the third term of (27). We conclude by combining the three bounds on the right-hand side of (27).
This leads directly to inequality (20).
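The elementary bound $|\log(1 + x)| \leq 2|x|$ on $[-1/2, 1/2]$ used in the treatment of $U_3$ above can be checked directly; we record a short derivation for completeness:

```latex
% For x in [0, 1/2]: \log(1+x) \le x \le 2x, by concavity of the logarithm.
% For x in [-1/2, 0], write t = -x \in [0, 1/2]; then
|\log(1+x)| = \log\frac{1}{1-t} \le \log(1+2t) \le 2t = 2|x|,
% where the first inequality uses 1/(1-t) \le 1+2t, i.e.
% (1+2t)(1-t) = 1 + t - 2t^2 \ge 1, which holds since
% t - 2t^2 = t(1-2t) \ge 0 for t \in [0, 1/2].
```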

Proof of Proposition 3
We fix some $\sigma \in S_Q$ and study the convergence in $P_{\theta^*}$-probability of $\hat\gamma_{\sigma(q)\sigma(l)}$ to $\gamma^*_{ql}$, with $\hat\Gamma$ as defined by the fixed point equation (4). First, let us introduce suitable quantities $A_{q,l}$ and $B_q$. Then we can write the quantity at stake as a ratio and obtain the upper bound (33) on the probability of interest.

First term of the right-hand side of (33). For the first term in (33), for any $0 < \lambda < \delta$ (implying $\lambda < \alpha^*_q$ for any $q \in \{1, \dots, Q\}$), we first upper bound the probability $P_{\theta^*}\big( |A_{q,l} - \alpha^*_q \gamma^*_{ql}| > \varepsilon r_{n,T} \sqrt{\log n}/\sqrt{nT} \big)$ for any $\varepsilon > 0$, using the following lemma.
Lemma 12. If $\log T = o(n)$, then for any $\varepsilon > 0$, any sequence $\{r_{n,T}\}_{n,T \geq 1}$ increasing to infinity such that $r_{n,T} = o\big(\sqrt{nT/\log n}\big)$ and any $\eta \in (0, \delta)$, we have the stated convergence for any $\sigma \in S_Q$, with $v_{n,T}$ a sequence decreasing to 0 such that $v_{n,T} = o\big(\sqrt{\log(nT)/n}\big)$.
Then, for the second term of (34), notice that $B_q = \sum_{l=1}^Q A_{q,l}$ and $\sum_{l=1}^Q \gamma^*_{ql} = 1$. Using Lemma 12 again, if $\log T = o(n)$ and $v_{n,T} = o\big(\sqrt{\log(nT)/n}\big)$, we then obtain the corresponding bound. Finally, for the first term of (33), if $y_{n,T}$ is such that $1/y_{n,T} = o\big(\sqrt{nT/\log n}\big)$, if $v_{n,T} = o\big(\sqrt{\log(nT)/n}\big)$ and as long as $\log T = o(n)$, we obtain (35).

Second term of the right-hand side of (33). For the second term of (33), we split it on two complementary events as before: for any $0 < \lambda < \delta$, we have the decomposition (36). We already gave an upper bound on the second term in the right-hand side of (36); let us give one for the first term. Notice that, as $\alpha^*_q \geq \delta$, if $B_q \geq \alpha^*_q - \lambda \geq \delta - \lambda > 0$, we have the corresponding inequality by the mean value theorem. We can then write, for the first term in the right-hand side of (36), as long as $\log T = o(n)$, for $\{y_{n,T}\}_{n,T \geq 1}$ such that $1/y_{n,T} = o\big(\sqrt{nT/\log n}\big)$ and with $v_{n,T}$ such that $v_{n,T} = o\big(\sqrt{\log(nT)/n}\big)$, still using Lemma 12, the corresponding bound. We finally obtain the upper bound (37) for the second term of the right-hand side of (33). We conclude the proof by summing the upper bounds obtained in (35) and (37) and by noticing that
$$P_{\theta^*}\Big( \|\hat\Gamma^\sigma - \Gamma^*\|_\infty > \varepsilon r_{n,T} \sqrt{\log n}/\sqrt{nT} \Big) \leq \sum_{1 \leq q, l \leq Q} P_{\theta^*}\Big( |\hat\gamma_{\sigma(q)\sigma(l)} - \gamma^*_{ql}| > \varepsilon r_{n,T} \sqrt{\log n}/\sqrt{nT} \Big).$$
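For reference, the decomposition behind the upper bound (33) is presumably the elementary ratio bound; a generic sketch, under the assumption that $\hat\gamma_{\sigma(q)\sigma(l)} = A_{q,l}/B_q$:

```latex
\Big| \frac{A_{q,l}}{B_q} - \gamma^*_{ql} \Big|
  \le \frac{\big| A_{q,l} - \alpha^*_q \gamma^*_{ql} \big|}{B_q}
    + \frac{\gamma^*_{ql} \, \big| \alpha^*_q - B_q \big|}{B_q},
% which follows from the exact identity
%   A_{q,l}/B_q - \gamma^*_{ql}
%     = (A_{q,l} - \alpha^*_q \gamma^*_{ql})/B_q + \gamma^*_{ql}(\alpha^*_q - B_q)/B_q
% and the triangle inequality.
```

On the event $\{B_q \geq \alpha^*_q - \lambda \geq \delta - \lambda > 0\}$, the denominators are bounded away from 0, which is how the two terms of (33) are then handled.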

Proof of Proposition 5
We use the following lemma, which states that the quantity we optimize in the VEM algorithm and the log-likelihood are asymptotically equivalent.

Lemma 13. For any $\varepsilon > 0$ and for $n$ and $T$ large enough, we have the stated inequality.

We then conclude by combining this result with Proposition 1.

Proof of Corollary 5
This is a direct consequence of Proposition 5 and of Lemma 8 applied with the functions $F_{n,T} = \frac{2}{n(n-1)T} J(\hat\chi(\cdot), \cdot)$.

Proof of Proposition 6
This proof is quite similar to that of Proposition 3. We fix some $\sigma \in S_Q$ and study the convergence in $P_{\theta^*}$-probability of $\hat\gamma_{\sigma(q)\sigma(l)}$ to $\gamma^*_{ql}$, with $\hat\Gamma$ as defined by the fixed point equation (5).

First, let us introduce the analogues of the quantities used in that proof. Then we can write the quantity at stake in the same form, and we follow the lines of the proof of Proposition 3, using Lemma 14 below instead of Lemma 12, in order to obtain the result.
Lemma 14. For any $\varepsilon > 0$, any sequence $\{r_{n,T}\}_{n,T \geq 1}$ increasing to infinity such that $r_{n,T} = o\big(\sqrt{nT/\log n}\big)$ and any $\eta \in (0, \delta)$, we have the stated convergence for any $\sigma \in S_Q$, with $v_{n,T}$ a sequence decreasing to 0 such that $v_{n,T} = o\big(\sqrt{\log(nT)/n}\big)$.

A Proofs of main results for the finite time case

A.1 Proof of Corollary 2
When the number of time steps is fixed and the connection probabilities vary over time, the conditional log-likelihood is modified accordingly, and the likelihood $\ell^T(\theta)$ is defined as in (2) with $\ell^T_c(\cdot)$ instead of $\ell_c(\cdot)$. The maximum likelihood estimator is then defined as before. As previously, we denote by $M_{n,T}(\Gamma, \pi_{1:T}) = \frac{2}{n(n-1)T} \ell^T(\theta)$ the normalized log-likelihood and introduce the corresponding limiting quantity. We follow the lines of the proof of Proposition 1 in order to prove that, for any sequence $y_n \to +\infty$ and all $\varepsilon > 0$, we have the convergence (38). Choosing $y_n = r^2_n$, we then use Lemma 8 to conclude that, as $r^2_n/\sqrt{n} = o(1)$ by assumption, the stated convergence holds for any $\varepsilon > 0$. In particular, for every $t \in \{1, \dots, T\}$, $\hat\pi^t$ converges in $P_{\theta^*}$-probability to $\pi^{*t}$ up to label switching. Then, let us prove that on the event $\{\min_{\sigma_1, \dots, \sigma_T \in S_Q} \|\hat\pi_{1:T} - \pi^{*\,\sigma_{1:T}}_{1:T}\|_\infty \leq \varepsilon r_n n^{-1/4}\}$ (whose probability converges to 1), for $n$ large enough, the permutation $\sigma_t$ minimizing the distance between $\pi^{*t}$ and $\hat\pi^{t\,\sigma_t}$ is the same for every $t \in \{1, \dots, T\}$. We consider $n$ large enough such that $\varepsilon r_n n^{-1/4} < \min_{1 \leq q \neq l \leq Q} |\pi^*_{qq} - \pi^*_{ll}|/4$. Denoting by $\sigma^1_m, \dots, \sigma^T_m$ the permutations (depending on $n$) minimizing $\|\hat\pi_{1:T} - \pi^{*\,\sigma_{1:T}}_{1:T}\|_\infty$, we have that, for any $1 \leq t \neq t' \leq T$, if some $q, l \in \{1, \dots, Q\}$ satisfy the corresponding relation, then on the event we consider this implies that $q = l$. This means that on this event, the permutation $\sigma^t_m$ minimizing the distance between $\pi^{*t}$ and $\hat\pi^{t\,\sigma_t}$ is the same for every $t \in \{1, \dots, T\}$. We can conclude that
$$P_{\theta^*}\Big( \min_{\sigma \in S_Q} \|\hat\pi^\sigma_{1:T} - \pi^*_{1:T}\|_\infty > \varepsilon r_n/n^{1/4} \Big) = 1 - P_{\theta^*}\Big( \min_{\sigma \in S_Q} \|\hat\pi^\sigma_{1:T} - \pi^*_{1:T}\|_\infty \leq \varepsilon r_n/n^{1/4} \Big) \xrightarrow[n \to \infty]{} 0.$$

A.2 Proof of Proposition 4
First, let us introduce some notation, as in the proof of Proposition 2. For any fixed configuration $z^*_{1:T} \in \Omega_\eta$, we define, for any configuration $z_{1:T}$ and any parameter $\theta$,
$$D_{n,T}(z_{1:T}, \pi_{1:T}) \coloneqq \big\{(i, j, t) \in I_{n,T} \,;\ \pi^t_{z^t_i z^t_j} \neq \pi^t_{z^{*t}_i z^{*t}_j}\big\},$$
and, for any $1 \leq t \leq T$,
$$D^t_{n,T}(z_t, \pi^t) \coloneqq \big\{(i, j) \in \{1, \dots, n\}^2 \,;\ i < j \text{ and } \pi^t_{z^t_i z^t_j} \neq \pi^t_{z^{*t}_i z^{*t}_j}\big\}.$$
As before, we abbreviate $D_{n,T}(z_{1:T}, \pi^*_{1:T})$ (resp. $D_{n,T}(z_{1:T}, \hat\pi_{1:T})$) to $D^*$ (resp. $\hat D$). We also introduce, for any $q, l, q', l' \in \{1, \dots, Q\}$, the quantities $F_{qlq'l'}$, $F_{ql}$, $G_{qlq'l'}$ and $G_{ql}$ as before, according to this definition of $D_{n,T}(z_{1:T}, \pi_{1:T})$. Finally, we introduce analogous quantities for any $t \in \{1, \dots, T\}$ and $q, l, q', l' \in \{1, \dots, Q\}$. Note that we can obtain an equivalent of Lemma 10, with a similar proof, giving that for any configuration $z^*_{1:T}$ in $\Omega_\eta$, any configuration $z_{1:T}$ and any $\theta \in \Theta_T$, $|D_{n,T}(z_{1:T}, \pi_{1:T})| \geq \frac{\gamma^2}{4} nr$.
In the same way, we have an equivalent of Lemma 11 (with a similar proof), giving a bound for any two configurations $z_t$ and $z^*_t$ at time $t$ such that $\|z_t - z^*_t\|_0 = r(t)$ and any parameter $\pi^t = (\pi^t_{ql})_{1 \leq q, l \leq Q}$. Going back to the proof of Proposition 4, we follow the lines of that of Proposition 2, with a few changes. We get the same decomposition as in equation (26), replacing $\pi$ by $\pi^1, \dots, \pi^T$ in the definitions of $U_1$, $U_2$ and $U_3$, and replacing the event $\Omega_{n,T}$ by $\Omega_n = \{\|\hat\pi_{1:T} - \pi^*_{1:T}\|_\infty \leq v_n\}$. For $U_1$, the proof does not change. For $U_2$, we write a modified bound (instead of (29)), so that for every $u > 0$ we obtain the inequality (40). We start by dealing with the first term of (40). Notice that on the event $\Omega_n$, we have $|\hat\pi^t_{ql} - \pi^{*t}_{ql}| / \big(\pi^{*t}_{ql}(1 - \pi^{*t}_{ql})\big) \leq v_n/\zeta^2$ for every $q, l \in \{1, \dots, Q\}$. As the set $G^t_{ql}$ is random (because $\hat D^t$ is random), we write, for every $t \in \{1, \dots, T\}$ and using (39), a bound in which $D$ is now a deterministic set. By a union bound and Hoeffding's inequality, we have the corresponding inequality for any $D \subset D^t_{n,T}(z_t)$. This leads to a bound on the first term of (40). For the second term of (40), we get an analogous bound from a union bound and from (39). Finally, we have the following upper bound for $U_2$:
$$P^*_{\theta^*}\big( \Omega_n \cap \{|U_2| > r \log(nT)\} \big) \leq 2Q^2 T \exp\left( - \frac{r \zeta^4 (\log(nT))^2}{4 Q^4 T^2 v_n^2 n} \right) (2nr)^{2nr+1} + Q^2 T \, P^*_{\theta^*}\left( v_n > \frac{\zeta^2 \log(nT)}{4 Q^2 T n} \right).$$

A.3 Proof of Corollary 6
As in the proof of Proposition 5, using the convergence in Equation (38) and Lemma 13, we obtain the corresponding convergence for any $\varepsilon > 0$. We then conclude by using Lemma 8 applied with $F_{n,T} = \frac{2}{n(n-1)T} J(\hat\chi(\cdot), \cdot)$.

B.1 Proof of Lemma 1

We compute the derivative of the Lagrangian with respect to each parameter $\gamma_{ql}$.
At the critical point $\hat\theta = (\hat\gamma, \hat\pi)$, we obtain that for each $(q, l) \in \{1, \dots, Q\}^2$ the stated relation holds, where $\propto$ means 'proportional to'. The constraint $\sum_l \gamma_{ql} = 1$ gives the normalizing term, and we obtain the announced fixed point equation.
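The computation behind such a fixed point equation is the standard constrained M-step for transition probabilities; a generic sketch, where $A_{ql}$ denotes a hypothetical aggregated (expected or variational) transition count from $q$ to $l$, so that the objective contains the term $\sum_{q,l} A_{ql} \log \gamma_{ql}$:

```latex
% Lagrangian with multipliers \lambda_q for the constraints \sum_l \gamma_{ql} = 1:
\mathcal{L}(\gamma, \lambda)
  = \sum_{q,l} A_{ql} \log \gamma_{ql}
  + \sum_q \lambda_q \Big( \sum_l \gamma_{ql} - 1 \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \gamma_{ql}}
  = \frac{A_{ql}}{\gamma_{ql}} + \lambda_q = 0
  \;\Longrightarrow\; \hat\gamma_{ql} \propto A_{ql},
% and the constraint fixes the normalization:
\hat\gamma_{ql} = \frac{A_{ql}}{\sum_{l'} A_{ql'}}.
```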

B.2 Proof of Lemma 2
We can write the quantity to optimize as $J(\chi, \theta) = E_{Q_\chi}\big[\log P_\theta(X_{1:T}, Z_{1:T})\big] + H(Q_\chi)$ and expand it. Using this expression, we can directly obtain the expected fixed-point equation for the variational estimator of the transition probability from $q$ to $l$.

B.3 Proof of Lemma 3
We rely on the notation introduced in the proof of Proposition 1. For any $t \in \{1, \dots, T\}$, using classical dependency rules in directed acyclic graphs and the expression (9) of $\hat z_t$, we write $\log P_\theta(X_t \mid X_{1:t-1})$ in the corresponding form, and thus obtain a first inequality. Using Bayes' rule, we have
$$\log P_\theta(X_t \mid X_{1:t-1}) = \log P_\theta(X_t, Z_t \mid X_{1:t-1}) - \log P_\theta(Z_t \mid X_{1:t}).$$
Taking the expectation of this quantity with respect to any distribution $Q$ on $Z_t$, we obtain the corresponding identity. Taking now $Q$ as the Dirac distribution located at $\hat z_t$, we have $H(Q) = 0$ and the identity simplifies. Combining Inequalities (43) and (44), we obtain the expected result.
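The step from Bayes' rule to the inequalities above rests on the classical evidence decomposition, recorded here as a standard identity (with $H(Q)$ the Shannon entropy of $Q$):

```latex
% For any distribution Q on Z_t,
\log P_\theta(X_t \mid X_{1:t-1})
  = E_Q\big[ \log P_\theta(X_t, Z_t \mid X_{1:t-1}) \big] + H(Q)
  + KL\big( Q \,\|\, P_\theta(Z_t \mid X_{1:t}) \big).
% Taking Q = \delta_{\hat z_t} gives H(Q) = 0 and
%   KL(Q || P_\theta(Z_t | X_{1:t})) = -\log P_\theta(Z_t = \hat z_t | X_{1:t}) \ge 0,
% hence the lower bound
\log P_\theta(X_t \mid X_{1:t-1})
  \ge \log P_\theta(X_t, \hat z_t \mid X_{1:t-1}).
```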

B.4 Proof of Lemma 4
To prove this lemma, we first establish a control of the expectation of the random variable appearing in the statement.

Lemma 15. We have the stated inequality for any configurations $z^*_{1:T}$ and $z_{1:T}$ and any $\theta \in \Theta$.

We now turn to the proof of Lemma 4. Let us first recall Talagrand's inequality [see e.g. Massart, 2007, page 170, Equation (5.50)].
Lemma 17. There exist $c_1, c_2 > 0$ such that for any $\varepsilon > 0$ and any sequence $\{r_{n,T}\}_{n,T \geq 1}$, as long as $\varepsilon r_{n,T} \sqrt{\log n}/(2\alpha^*_q \gamma^*_{ql} \sqrt{nT}) < 1$, we have the stated inequality.

We then combine the two upper bounds obtained in (49) and (50) in order to conclude, the assumption $\varepsilon r_{n,T} \sqrt{\log n}/(2\alpha^*_q \gamma^*_{ql} \sqrt{nT}) < 1$ being satisfied for $n$ and $T$ large enough because $r_{n,T} = o\big(\sqrt{nT/\log n}\big)$. Using the facts that $\log T = o(n)$, that $r_{n,T}$ increases to infinity and that $v_{n,T} = o\big(\sqrt{\log(nT)/n}\big)$, we obtain the expected result:
$$P_{\theta^*}\Big( \Big| P_{\hat\theta^\sigma}\big(Z^t_i = q,\ Z^{t+1}_i = l \mid X_{1:T}\big) - \alpha^*_q \gamma^*_{ql} \Big| > \varepsilon y_{n,T} \Big) \leq P_{\theta^*}\big( \|\hat\pi^\sigma - \pi^*\|_\infty > v_{n,T} \big) + o(1).$$

B.14 Proof of Lemma 14
This proof is quite similar to that of Lemma 12. For any $\varepsilon > 0$, let us write the corresponding decomposition and upper bound the two probabilities in its right-hand side. We already proved in Lemma 12 that the second term converges to 0, thanks to the assumptions on the sequence $\{r_{n,T}\}_{n,T \geq 1}$. For the first term, let $z_{1:T}$ denote a fixed configuration. Working on the set $\{Z_{1:T} = z_{1:T}\}$ and using the same method as in the proof of Lemma 12, we obtain a bound of the form $\leq 2 Q_{\hat\chi(\theta^\sigma)}(Z_{1:T} \neq z_{1:T})$.
Then we obtain the corresponding bound. For each $z_{1:T}$, we use the following lemma.