Estimating Networks With Jumps

We study the problem of estimating a temporally varying coefficient and varying structure (VCVS) graphical model underlying nonstationary time series data, such as social states of interacting individuals or microarray expression profiles of gene networks, as opposed to the i.i.d. data from a time-invariant model widely considered in the current literature on structure estimation. In particular, we consider the scenario in which the model evolves in a piece-wise constant fashion. We propose a procedure that minimizes the so-called TESLA loss (i.e., temporally smoothed L1 regularized regression), which allows jointly estimating the partition boundaries of the VCVS model and the coefficients of the sparse precision matrix on each block of the partition. A highly scalable proximal gradient method is proposed to solve the resultant convex optimization problem, and the conditions for sparsistent estimation and the convergence rate of both the partition boundaries and the network structure are established for the first time for such estimators.


Introduction
Networks are a fundamental form of representation of relational information underlying large, noisy data from various domains. For example, in a biological study, nodes of a network can represent genes in one organism and edges can represent associations or regulatory dependencies among genes. In a social analysis, nodes of a network can represent actors and edges can represent interactions or friendships between actors. Exploring the statistical properties and hidden characteristics of network entities, and the stochastic processes behind temporal evolution of network topologies is essential for computational knowledge discovery and prediction based on network data.
In many dynamical environments, such as a developing biological system, it is often technically impossible to experimentally determine the network topologies specific to every time point in a discrete time series. Resorting to computational inference methods, such as extant structure learning algorithms, is also difficult because for every model unique to a single time point there may exist as few as a single snapshot of the nodal states distributed according to the model in question. In this paper, we consider an estimation problem under a particular dynamic context, where the model evolves in a piecewise constant fashion, i.e., it stays structurally invariant during unknown segments of time and then jumps to a different structure.
A popular technique for deriving the network structure from an i.i.d. sample is to estimate a sparse precision matrix. The importance of estimating precision matrices with zeros was recognized by (Dempster, 1972), who coined the term covariance selection. The elements of the precision matrix represent the associations or conditional covariances between corresponding variables. Once a sparse precision matrix is estimated, a network can be drawn by connecting variables whose corresponding elements of the precision matrix are non-zero. Recent studies have shown that covariance selection methods based on penalized likelihood maximization can lead to a consistent estimate of the network structure underlying a Gaussian Markov Random Field (Fan et al., 2009; Ravikumar et al., 2008). Moreover, a particular procedure for covariance selection known as neighborhood selection, which is built on ℓ1-norm regularized regression, can produce a consistent estimate of the network structure when the sample is assumed to follow a general Markov Random Field distribution whose structure corresponds to the network in question (Ravikumar et al., 2009; Meinshausen and Bühlmann, 2006; Peng et al., 2009). Specifically, a Markov Random Field (MRF) is a probabilistic graphical model defined on a graph G = (V, E), where V = {1, . . . , p} is a vertex set corresponding to the set of random variables to be modeled (in this paper we call them nodes and variables interchangeably), and E ⊆ V × V is the edge set capturing conditional independencies among these nodes. Let X = (X_1, . . . , X_p)′ denote a p-dimensional random vector, whose elements are indexed by the nodes of the graph G. Under the MRF, a pair (a, b) is not an element of the edge set E if and only if the variable X_a is conditionally independent of X_b given all the rest of the variables X_{V\{a,b}}, that is, X_a ⊥ X_b | X_{V\{a,b}}. A distribution over X can be defined by taking the following log-linear form that makes explicit use of the (presence and absence of edges in the) edge set: p(X) ∝ exp{ ∑_{(a,b)∈E} θ_{ab} X_a X_b }. When the elements of the random vector X are discrete, e.g., X ∈ {0, 1}^p, the model is referred to as a discrete MRF, sometimes known as an Ising model in the statistical physics community; whereas when X is a continuous vector, the model is referred to as a Gaussian graphical model (GGM), because one can easily show that the p(X) above is actually a multivariate Gaussian. MRFs have been widely used for modeling data with graphical relational structures over a fixed set of entities (Wainwright and Jordan, 2008; Getoor and Taskar, 2007). The vertices can describe entities such as genes in a biological regulatory network, stocks in the market, or people in society; while the edges can describe relationships between vertices, for example, interaction, correlation or influence.
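To make the correspondence between precision-matrix zeros and graph edges concrete, the following minimal sketch (the function name and threshold are illustrative, not from the paper) reads off the edge set of a GGM from a given precision matrix.

```python
import numpy as np

def edges_from_precision(omega, tol=1e-8):
    """Edge set encoded by the zero pattern of a precision matrix.

    An edge (a, b) is present iff omega[a, b] is numerically non-zero, i.e.
    X_a and X_b are conditionally dependent given all remaining variables.
    """
    p = omega.shape[0]
    return {(a, b) for a in range(p) for b in range(a + 1, p)
            if abs(omega[a, b]) > tol}

# Toy example: a 3-node chain 0 -- 1 -- 2, so nodes 0 and 2 are conditionally
# independent given node 1 and the entry omega[0, 2] is zero.
omega = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])
print(edges_from_precision(omega))   # {(0, 1), (1, 2)}
```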
The statistical problem we consider in this paper is to estimate the structure of the Gaussian graphical model from observed samples of nodal states in a dynamic world. Traditional methods handle this problem with the assumption that the samples are i.i.d. Let D = {x_1, . . . , x_n} be an independent and identically distributed sample from a p-dimensional multivariate normal distribution N_p(0, Σ), where Σ is the covariance matrix. Let Ω := Σ^{−1} denote the precision matrix, with elements (ω_{ab}), 1 ≤ a, b ≤ p. Then one can obtain an estimator of Ω from D via optimizing a proper statistical loss function, such as the likelihood or a penalized likelihood. As mentioned earlier, the precision matrix Ω encodes the conditional independence structure of the distribution, and the pattern of the zero elements in the precision matrix defines the structure of the associated graph G. There has been a dramatic growth of interest in the recent literature in the problem of covariance selection, which deals with the graph estimation problem above. Existing works range from algorithmic development focusing on efficient estimation procedures, to theoretical analysis focusing on statistical guarantees of different estimators. We do not intend to give an extensive overview of the literature here, but interested readers can follow the pointers below. In the classical literature (e.g. Lauritzen, 1996), procedures are developed for small dimensional graphs and commonly involve hypothesis testing with greedy selection of edges. More recent literature estimates the sparse precision matrix by optimizing a penalized likelihood (Yuan and Lin, 2007; Fan et al., 2009; Banerjee et al., 2008; Rothman et al., 2008; Friedman et al., 2008; Ravikumar et al., 2008; Guo et al., 2010b; Zhou et al., 2008) or through neighborhood selection (Meinshausen and Bühlmann, 2006; Peng et al., 2009; Guo et al., 2010a), where the structure of the graph is estimated by estimating the neighborhood of each node. Both of these approaches are suitable for high-dimensional problems, even when p ≫ n, and can be efficiently implemented using scalable convex program solvers. Most of the above mentioned work assumes that a single invariant network model is sufficient to describe the dependencies in the observed data. However, when the observed data are not i.i.d., such an assumption is not justifiable. For example, when data consist of microarray measurements of the gene expression levels collected throughout the cell cycle or development of an organism, different genes can be active during different stages. This suggests that different distributions, and hence different networks, should be used to describe the dependencies between measured variables at different time intervals. In this paper, we tackle the problem of estimating the structure of the GGM when the structure is allowed to change over time. By assuming that the parameters of the precision matrix change with time, we obtain extra flexibility to model a larger class of distributions while still retaining the interpretability of the static GGM. In particular, as the coefficients of the precision matrix change over time, we also allow the structure of the underlying graph to change. This semi-parametric generalization of the parametric model is referred to as a varying coefficient varying structure (VCVS) model.
Now, let {x_i}_{i∈[n]} ⊂ R^p be a sequence of n independent observations (we use [n] to denote the set {1, . . . , n}) from p-dimensional multivariate normal distributions, not necessarily the same for every observation. Let {B_j}_{j∈[B]} be a disjoint partitioning of the set [n] where each block of the partition consists of consecutive elements, that is, B_j ∩ B_{j′} = ∅ for j ≠ j′, ∪_j B_j = [n] and B_j = [T_{j−1} : T_j] := {T_{j−1}, T_{j−1} + 1, . . . , T_j − 1}. Let T := {T_0 = 1 < T_1 < . . . < T_B = n + 1} denote the set of partition boundaries. We consider the following model, in which observations indexed by elements in B_j are p-dimensional realizations of a multivariate normal distribution with zero mean and covariance matrix Σ^j = (σ^j_{ab})_{a,b∈[p]}:

x_i ∼ N_p(0, Σ^j),   i ∈ B_j,   j ∈ [B].     (1)

Let Ω^j := (Σ^j)^{−1} denote the precision matrix with elements (ω^j_{ab})_{a,b∈[p]}. With the number of partitions, B, and the boundaries of the partitions, T , unknown, we study the problem of estimating both the partition {B_j} and the non-zero elements of the precision matrices {Ω^j}_{j∈[B]} from the sample {x_i}_{i∈[n]}. Note that in this work we study a particular case of the VCVS model, where the coefficients are piece-wise constant functions of time. The scenario where the coefficients are smoothly varying functions of time has been considered in Zhou et al. (2008) for the GGM and in Kolar et al. (2010b) and Kolar and Xing (2009) for the Ising model.
If the partitions {B_j}_j were known, the problem would be trivially reduced to the setting analyzed in the previous work. Dealing with the unknown partitions, together with the structure estimation of the model, calls for new methods. We propose and analyze a method based on time-coupled neighborhood selection, where the model estimates are forced to stay similar across time using a fusion-type total variation penalty and the sparsity of each neighborhood is obtained through the ℓ1 penalty. Details of the approach are given in §2.
The model in Eq. (1) is related to the varying-coefficient models (e.g. Hastie and Tibshirani, 1993) with the coefficients being piece-wise constant functions. Varying coefficient regression models with piece-wise constant coefficients are also known as segmented multivariate regression models (Liu et al., 1997) or linear models with structural changes (Bai and Perron, 1998). The structural changes are commonly determined through hypothesis testing and a separate linear model is fit to each of the estimated segments. In our work, we use the penalized model selection approach to jointly estimate the partition boundaries and the model parameters.
Little work has been done so far towards modeling dynamic networks and estimating changing precision matrices. Zhou et al. (2008) develops a nonparametric method for estimation of a time-varying GGM, where x_t ∼ N_p(0, Σ(t)) and Σ(t) is smoothly changing over time. The procedure is based on the penalized likelihood approach of Yuan and Lin (2007), with the empirical covariance matrix obtained using a kernel smoother. Our work is very different from that of Zhou et al. (2008), since under our assumptions the network changes abruptly rather than smoothly. Furthermore, as we outline in §2, our estimation procedure is not based on the penalized likelihood approach. Estimation of time-varying Ising models has been discussed in Ahmed and Xing (2009) and Kolar et al. (2010b). Yin et al. (2008) and Kolar et al. (2010a) studied nonparametric ways to estimate the conditional covariance matrix. The work of Ahmed and Xing (2009) is most similar to our setting, since they also use a fused-type penalty combined with an ℓ1 penalty to estimate the structure of the time-varying Ising model. Here, in addition to focusing on GGMs, there is an additional subtle, but important, difference from Ahmed and Xing (2009). In this work, we use a modification of the fusion penalty (formally described in §2) which allows us to characterize the model selection consistency of our estimates and the convergence properties of the estimated partition boundaries, which is not available in the earlier work.
The remainder of the paper is organized as follows. In §2, we describe our estimation procedure and provide an efficient first-order optimization procedure capable of estimating large graphs. The optimization procedure is based on the smoothing technique of Nesterov (2005) and converges in O(1/ǫ) iterations, where ǫ is the desired accuracy. Our main theoretical results are presented in §3.
In particular, we show that the partition boundaries are estimated consistently. Furthermore, the graph structure is consistently estimated on every block of the partition that contains enough samples. Numerical studies showing the finite sample performance of our procedure are given in §4. The proofs of the main results are relegated to §6, with some technical details presented in the Appendix.

Notation Schemes:
For clarity, we end the introduction with a summary of the notation used in the paper. We use [n] to denote the set {1, . . . , n} and [l : r] to denote the set {l, l + 1, . . . , r − 1}. We use B_j to denote the j-th block of the partition T; with some abuse of notation, we also use B_j to denote the set [T_{j−1} : T_j]. The number of samples in the block B_j is denoted |B_j|. For a set S ⊂ V , we use X_S to denote the set {X_a : a ∈ S} of random variables. We use X to denote the n × p matrix whose rows are the observations. The vector X_a = (x_{1,a}, . . . , x_{n,a})′ denotes a column of the matrix X; similarly, X_S = (X_b : b ∈ S) denotes the n × |S| sub-matrix of X whose columns are indexed by the set S, and X_{B_j} denotes the |B_j| × p sub-matrix whose rows are indexed by the set B_j. For simplicity of notation, we use \a to denote the index set [p]\{a}, so that X_{\a} = (X_b : b ∈ [p]\{a}). For a vector a ∈ R^p, we let S(a) denote the set of non-zero components of a. Throughout the paper, we use c_1, c_2, . . . to denote positive constants whose values may change from line to line. For a vector a ∈ R^n, define ||a||_1 = ∑_{i∈[n]} |a_i|, ||a||_2 = (∑_{i∈[n]} a_i^2)^{1/2} and ||a||_∞ = max_i |a_i|. For a symmetric matrix A, Λ_min(A) denotes the smallest and Λ_max(A) the largest eigenvalue. For a matrix A (not necessarily symmetric), we use |||A|||_∞ = max_i ∑_j |A_{ij}|. For two vectors a, b ∈ R^n, the dot product is denoted ⟨a, b⟩ = ∑_{i∈[n]} a_i b_i. For two matrices A, B ∈ R^{n×m}, the dot product is denoted ⟨A, B⟩ = tr(A′B). Given two sequences {a_n} and {b_n}, the notation a_n = O(b_n) means that there exists a constant c_1 such that a_n ≤ c_1 b_n; the notation a_n = Ω(b_n) means that there exists a constant c_2 such that a_n ≥ c_2 b_n; and the notation a_n ≍ b_n means that a_n = O(b_n) and b_n = O(a_n). Similarly, we use the notation a_n = o_p(b_n) to denote that b_n^{−1} a_n converges to 0 in probability. The most frequently used symbols are summarized below:

σ_ab : an element of the covariance matrix
Ω : the precision matrix
ω_ab : an element of the precision matrix
⟨·, ·⟩ : the dot product ⟨a, b⟩ = a′b
⟨·, ·⟩ : the dot product between matrices ⟨A, B⟩ = tr(A′B)
ξ_min : the minimum change between regression coefficients, ||θ^{a,j} − θ^{a,j−1}||_2 ≥ ξ_min
θ_min : the minimum size of a coefficient, |θ^{a,j}_b| ≥ θ_min
∆_min : the minimum size of a block, |B_j| ≥ ∆_min

Graph estimation via Temporal-Difference Lasso

In this section, we introduce our time-varying covariance selection procedure, which is based on time-coupled neighborhood selection with a fused-type penalty. We call the proposed procedure the Temporal-Difference Lasso (TD-Lasso).
We start by reviewing the basic neighborhood selection procedure, which has previously been used to estimate graphs in, for example, Peng et al. (2009), Meinshausen and Bühlmann (2006), Ravikumar et al. (2009) and Guo et al. (2010a). We begin by relating the elements of the precision matrix Ω to a regression problem. Let S_a denote the neighborhood of the node a, let S̄_a denote its closure, S̄_a := S_a ∪ {a}, and let N_a denote the set of nodes not in the neighborhood of the node a, N_a = [p]\S̄_a. It holds that X_a ⊥ X_{N_a} | X_{S_a}. The neighborhood of the node a can be read off the non-zero pattern of the elements in the precision matrix Ω, S_a = {b ∈ [p]\{a} : ω_{ab} ≠ 0}. See Lauritzen (1996) for more details. It is a well known result for Gaussian graphical models that the elements of

θ^a = argmin_{θ∈R^{p−1}} E (X_a − X′_{\a} θ)^2

are given by θ^a_b = −ω_{ab}/ω_{aa}. Therefore, the neighborhood of a node a, S_a, is equal to the set of non-zero coefficients of θ^a. Using the expression for θ^a, we can write X_a = ∑_{b∈S_a} X_b θ^a_b + ǫ, where ǫ is independent of X_{\a}. The neighborhood selection procedure is motivated by the above relationship between the regression coefficients and the elements of the precision matrix. Meinshausen and Bühlmann (2006) proposed to solve the following optimization problem

θ̂^a = argmin_{θ∈R^{p−1}} n^{−1} ||X_a − X_{\a} θ||_2^2 + λ ||θ||_1

and proved that, for an i.i.d. sample, the non-zero coefficients of θ̂^a consistently estimate the neighborhood of the node a, under a suitably chosen penalty parameter λ.
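A minimal sketch of this static neighborhood selection step, using scikit-learn's Lasso as a stand-in for the ℓ1-regularized regression above (sklearn's penalty parameterization differs from the λ in the display by constant factors, and the data and threshold below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_neighborhood(X, a, lam, tol=1e-8):
    """Estimate the neighborhood S_a of node a by regressing X_a on the remaining
    columns with an l1 penalty (Meinshausen-Buhlmann neighborhood selection)."""
    n, p = X.shape
    rest = [b for b in range(p) if b != a]
    # sklearn's Lasso minimizes (1/(2n)) * ||y - Z w||_2^2 + alpha * ||w||_1
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, rest], X[:, a])
    return {rest[k] for k, w in enumerate(fit.coef_) if abs(w) > tol}

# Example with data drawn from the 3-node chain used earlier.
rng = np.random.default_rng(0)
omega = np.array([[1.0, 0.4, 0.0], [0.4, 1.0, 0.4], [0.0, 0.4, 1.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(omega), size=500)
print(estimate_neighborhood(X, a=0, lam=0.05))   # expected: {1}
```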
In this paper, we build on the neighborhood selection procedure to estimate the changing graph structure in model (1). We use S^j_a to denote the neighborhood of the node a on the block B_j and N^j_a to denote the nodes not in the neighborhood of the node a on the j-th block, N^j_a = V\S̄^j_a. Consider the following estimation procedure

β̂^a = argmin_{β∈R^{(p−1)×n}} L(β) + pen(β),     (3)

where the loss is defined as

L(β) = (1/2) ∑_{i∈[n]} (x_{i,a} − x′_{i,\a} β_{·,i})^2

and the penalty is defined as

pen(β) = 2λ_1 ∑_{i=2}^{n} ||β_{·,i} − β_{·,i−1}||_2 + λ_2 ∑_{i∈[n]} ||β_{·,i}||_1.

The penalty is constructed from two terms. The first term ensures that the solution is piecewise constant for some partition of [n] (possibly a trivial one); it can be seen as a sparsity inducing term in the temporal domain, since it penalizes the difference between the coefficients β_{·,i} and β_{·,i+1} at successive time points. The second term results in estimates that have many zero coefficients within each block of the partition. The estimated set of partition boundaries

T̂ := {1, n + 1} ∪ {i ∈ {2, . . . , n} : β̂_{·,i} ≠ β̂_{·,i−1}}

contains the indices of points at which a change is estimated, with B̂ := |T̂| − 1 being an estimate of the number of blocks B. The estimated number of blocks B̂ is controlled through the user defined penalty parameter λ_1, while the sparsity of the neighborhood is controlled through the penalty parameter λ_2. Based on the estimated set of partition boundaries T̂, we can define the neighborhood estimate of the node a for each estimated block. Let θ̂^{a,j} = β̂^a_{·,i} for any i ∈ B̂_j (the estimate is constant within each estimated block). Using the estimated vector θ̂^{a,j}, we define the neighborhood estimate of the node a for the block B̂_j as Ŝ^j_a := S(θ̂^{a,j}) := {b ∈ \a : θ̂^{a,j}_b ≠ 0}. Solving (3) for each node a ∈ V gives us a neighborhood estimate for each node. Combining the neighborhood estimates, we obtain an estimate of the graph structure for each time point i ∈ [n].
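The following sketch evaluates the TD-Lasso objective in (3) for a candidate coefficient matrix and extracts the implied partition boundaries. The constants match the reconstruction above (a factor 1/2 on the loss and 2λ1 on the fusion term); this scaling is an assumption for illustration, not a statement of the authors' code.

```python
import numpy as np

def td_lasso_objective(beta, X, a, lam1, lam2):
    """TD-Lasso objective for node a.

    beta : (p-1) x n matrix, one coefficient vector per time point
    X    : n x p data matrix
    lam1 : fusion penalty, pushes beta[:, i] == beta[:, i-1] (piecewise constancy)
    lam2 : l1 penalty, pushes sparsity within each block
    """
    n, p = X.shape
    rest = [b for b in range(p) if b != a]
    resid = X[:, a] - np.einsum('ij,ji->i', X[:, rest], beta)  # x_{i,a} - x_{i,\a}' beta_{.,i}
    loss = 0.5 * np.sum(resid ** 2)
    fuse = 2.0 * lam1 * np.sum(np.linalg.norm(np.diff(beta, axis=1), axis=0))
    l1 = lam2 * np.sum(np.abs(beta))
    return loss + fuse + l1

def estimated_boundaries(beta, tol=1e-8):
    """Time points (1-based) at which the estimated coefficients jump, plus the endpoints."""
    n = beta.shape[1]
    jumps = np.linalg.norm(np.diff(beta, axis=1), axis=0) > tol
    return [1] + [i + 2 for i, j in enumerate(jumps) if j] + [n + 1]
```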
The choice of the penalty term is motivated by the work on penalization using total variation (Rinaldo, 2009; Mammen and van de Geer, 1997), which results in a piece-wise constant approximation of an unknown regression function. The fusion penalty has also been applied in the context of multivariate linear regression (Tibshirani et al., 2005), where coefficients that are spatially close are biased to have similar values. As a result, nearby coefficients are fused to the same estimated value. Instead of penalizing the ℓ1 norm of the difference between successive coefficient vectors, we use the ℓ2 norm in order to enforce that all the changes occur at the same time points.
The objective (3) estimates the neighborhood of one node in a graph for all time points. After solving the objective (3) for all nodes a ∈ V , we need to combine the estimates to obtain the graph structure. We use the following rule to combine {β̂^a}_{a∈V}: an edge between nodes a and b is included in the graph at time point i if at least one of the nodes a or b is included in the estimated neighborhood of the other node at time i. We use the max operator to combine different neighborhoods as we believe that for the purpose of network exploration it is more important to occasionally include spurious edges than to omit relevant ones. For further discussion on the differences between the min and the max combination, we refer an interested reader to Banerjee et al. (2008).

Numerical procedure
Finding a minimizer β̂^a of (3) can be a computationally challenging task for an off-the-shelf convex optimization procedure. We propose to use an accelerated gradient method with a smoothing technique (Nesterov, 2005), which converges in O(1/ǫ) iterations, where ǫ is the desired accuracy. We start by defining a smooth approximation of the fused penalty term. Let H ∈ R^{n×(n−1)} be the matrix with elements H_{i,i} = −1, H_{i+1,i} = 1 and H_{i,j} = 0 otherwise, so that (βH)_{·,i} = β_{·,i+1} − β_{·,i}. With the matrix H we can rewrite the fused penalty term as 2λ_1 ∑_{i=1}^{n−1} ||(βH)_{·,i}||_2 and, using the fact that the ℓ2 norm is self dual (for example, see Boyd and Vandenberghe, 2004), we have the representation

2λ_1 ∑_{i=1}^{n−1} ||(βH)_{·,i}||_2 = max_{U∈Q} ⟨U, βH⟩,   Q := {U ∈ R^{(p−1)×(n−1)} : ||U_{·,i}||_2 ≤ 2λ_1, i ∈ [n−1]}.

The following function is defined as a smooth approximation to the fused penalty,

Ψ_µ(β) := max_{U∈Q} { ⟨U, βH⟩ − (µ/2) ||U||_F^2 },     (7)

where µ > 0 is the smoothness parameter. It is easy to see that

Ψ_µ(β) ≤ 2λ_1 ∑_{i=1}^{n−1} ||(βH)_{·,i}||_2 ≤ Ψ_µ(β) + (µ/2) max_{U∈Q} ||U||_F^2.

Setting the smoothness parameter to µ = ǫ/(2(n − 1)), the correct rate of convergence is ensured. Let U_µ(β) be the optimal solution of the maximization problem in (7), which can be obtained analytically as

U_µ(β) = Π_Q(µ^{−1} βH),

where Π_Q(·) is the projection operator onto the set Q. From Theorem 1 in Nesterov (2005), we have that Ψ_µ(β) is continuously differentiable and convex, with gradient ∇Ψ_µ(β) = U_µ(β)H′ that is Lipschitz continuous.
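A sketch of the smoothed fused penalty Ψ_µ and its gradient, following the construction above (the matrix H, the dual set Q with column radius 2λ1, and the gradient U_µ(β)H′); illustrative code, not the authors' implementation.

```python
import numpy as np

def diff_matrix(n):
    """H in R^{n x (n-1)} with (beta @ H)[:, i] = beta[:, i+1] - beta[:, i]."""
    H = np.zeros((n, n - 1))
    for i in range(n - 1):
        H[i, i] = -1.0
        H[i + 1, i] = 1.0
    return H

def smoothed_fused_penalty(beta, lam1, mu):
    """Return (Psi_mu(beta), grad Psi_mu(beta)) for the Nesterov-smoothed fused penalty.

    Psi_mu(beta) = max_{U in Q} <U, beta H> - (mu/2) ||U||_F^2, where
    Q = {U : ||U[:, i]||_2 <= 2*lam1 for every column i}.  The maximizer U_mu is the
    column-wise projection of (beta H) / mu onto Q, and grad Psi_mu(beta) = U_mu H'.
    """
    n = beta.shape[1]
    H = diff_matrix(n)
    D = beta @ H                                    # columns are successive differences
    U = D / mu
    norms = np.maximum(np.linalg.norm(U, axis=0), 1e-12)
    U_mu = U * np.minimum(1.0, 2.0 * lam1 / norms)  # projection onto Q, column by column
    value = np.sum(U_mu * D) - 0.5 * mu * np.sum(U_mu ** 2)
    grad = U_mu @ H.T
    return value, grad
```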
With the above defined smooth approximation, we focus on minimizing the following objective

F(β) = L(β) + Ψ_µ(β) + λ_2 ∑_{i∈[n]} ||β_{·,i}||_1.

Following Beck and Teboulle (2009) (see also Nesterov (2007)), we define the following quadratic approximation of F(β) at a point β_0:

Q_L(β, β_0) = L(β_0) + Ψ_µ(β_0) + ⟨∇L(β_0) + ∇Ψ_µ(β_0), β − β_0⟩ + (L/2) ||β − β_0||_F^2 + λ_2 ∑_{i∈[n]} ||β_{·,i}||_1,

where L > 0 is a parameter chosen as an upper bound on the Lipschitz constant of ∇L + ∇Ψ_µ. Let p_L(β_0) be a minimizer of Q_L(β, β_0). Ignoring constant terms, p_L(β_0) can be obtained as

p_L(β_0) = argmin_β (L/2) ||β − (β_0 − L^{−1}(∇L(β_0) + ∇Ψ_µ(β_0)))||_F^2 + λ_2 ∑_{i∈[n]} ||β_{·,i}||_1.

It is clear that p_L(β_0) is the unique minimizer, which can be obtained in closed form as a result of soft-thresholding,

p_L(β_0) = soft(β_0 − L^{−1}(∇L(β_0) + ∇Ψ_µ(β_0)), λ_2/L),

where soft(x, τ) = sign(x) max(|x| − τ, 0) is the soft-thresholding operator applied element-wise. In practice, an upper bound on the Lipschitz constant of ∇L + ∇Ψ_µ can be expensive to compute, so the parameter L is determined iteratively. Combining all of the above, we have the following algorithm.

Algorithm: Accelerated Gradient Method for Equation (3).
In the algorithm, γ is a constant used to increase the estimate of the Lipschitz constant L. Compared to the gradient descent method (which can be obtained by iterating β_{k+1} = p_L(β_k)), the accelerated gradient method updates two sequences {β_k} and {z_k} recursively. Instead of performing the gradient step from the latest approximate solution β_k, the gradient step is performed from the search point z_k that is obtained as a linear combination of the last two approximate solutions β_{k−1} and β_k. Since the condition F(p_L(z_k)) ≤ Q_L(p_L(z_k), z_k) is satisfied in every iteration, the algorithm converges in O(1/ǫ) iterations following Beck and Teboulle (2009). As the convergence criterion, we stop iterating once the relative change in the objective value falls below a threshold.
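The accelerated scheme described above can be sketched as follows: a FISTA-style loop with backtracking on L and a relative-change stopping rule, as in the text. Function and variable names are illustrative, and the smoothed fused penalty is recomputed inline so the sketch is self-contained.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def smoothed_fuse(beta, lam1, mu):
    """Value and gradient of the Nesterov-smoothed fused penalty (see the sketch above)."""
    D = np.diff(beta, axis=1)                       # successive column differences, equals beta @ H
    U = D / mu
    norms = np.maximum(np.linalg.norm(U, axis=0), 1e-12)
    U = U * np.minimum(1.0, 2.0 * lam1 / norms)     # project columns onto the ball of radius 2*lam1
    val = np.sum(U * D) - 0.5 * mu * np.sum(U ** 2)
    grad = np.zeros_like(beta)                      # grad = U @ H'
    grad[:, 1:] += U
    grad[:, :-1] -= U
    return val, grad

def td_lasso_fista(X, a, lam1, lam2, eps=1e-3, max_iter=500, gamma=2.0):
    """Accelerated proximal gradient sketch for the TD-Lasso objective of node a."""
    n, p = X.shape
    rest = [b for b in range(p) if b != a]
    Xr, y = X[:, rest], X[:, a]
    mu = eps / (2.0 * (n - 1))                      # smoothness parameter from the text

    def smooth_part(beta):
        resid = y - np.einsum('ij,ji->i', Xr, beta)
        loss, grad_loss = 0.5 * np.sum(resid ** 2), -Xr.T * resid
        pen, grad_pen = smoothed_fuse(beta, lam1, mu)
        return loss + pen, grad_loss + grad_pen

    beta = np.zeros((p - 1, n))
    z, t, L, F_old = beta.copy(), 1.0, 1.0, np.inf
    for _ in range(max_iter):
        fz, gz = smooth_part(z)
        while True:                                 # backtracking search for L
            cand = soft_threshold(z - gz / L, lam2 / L)
            fc, _ = smooth_part(cand)
            d = cand - z
            if fc <= fz + np.sum(gz * d) + 0.5 * L * np.sum(d ** 2):
                break
            L *= gamma
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = cand + ((t - 1.0) / t_new) * (cand - beta)
        beta, t = cand, t_new
        F = fc + lam2 * np.sum(np.abs(beta))
        if abs(F_old - F) <= 1e-6 * max(1.0, abs(F_old)):
            break
        F_old = F
    return beta
```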

Tuning parameter selection
The penalty parameters λ_1 and λ_2 control the complexity of the estimated model. In this work, we propose to use the BIC score to select the tuning parameters. For each node a ∈ V , define a BIC score BIC_a(λ_1, λ_2) that trades off the residual sum of squares of the fit obtained from (3) against the number of estimated blocks and the number of non-zero coefficients within each estimated block. The penalty parameters are then chosen as

(λ̂_1, λ̂_2) = argmin_{λ_1, λ_2} ∑_{a∈V} BIC_a(λ_1, λ_2).

We use this criterion to select the tuning parameters in our simulations, where we search for the best choice of parameters over a grid.
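A sketch of the grid search. Since the exact BIC formula is not reproduced above, the score below uses a generic BIC-type criterion (n·log(RSS/n) plus log(n) times a degrees-of-freedom count based on the estimated blocks and their non-zero coefficients); this form is an assumption, not the paper's exact definition.

```python
import numpy as np
from itertools import product

def bic_score(beta, X, a, tol=1e-8):
    """Generic BIC-type score for the fit of node a (assumed form, not the paper's exact one)."""
    n, p = X.shape
    rest = [b for b in range(p) if b != a]
    resid = X[:, a] - np.einsum('ij,ji->i', X[:, rest], beta)
    rss = max(np.sum(resid ** 2), 1e-12)
    jumps = np.linalg.norm(np.diff(beta, axis=1), axis=0) > tol
    starts = [0] + list(np.where(jumps)[0] + 1)          # first time index of every estimated block
    df = len(starts) + sum(int(np.sum(np.abs(beta[:, s]) > tol)) for s in starts)
    return n * np.log(rss / n) + np.log(n) * df

def select_tuning_parameters(X, lam1_grid, lam2_grid, fit):
    """Pick (lam1, lam2) minimizing the total BIC score over all nodes.

    `fit(X, a, lam1, lam2)` should return the estimated beta for node a,
    e.g. the td_lasso_fista sketch above.
    """
    p = X.shape[1]
    best = None
    for lam1, lam2 in product(lam1_grid, lam2_grid):
        total = sum(bic_score(fit(X, a, lam1, lam2), X, a) for a in range(p))
        if best is None or total < best[0]:
            best = (total, lam1, lam2)
    return best[1], best[2]
```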

Theoretical results
This section addresses the statistical properties of the estimation procedure presented in Section 2. The properties are studied in an asymptotic framework by letting the sample size n grow, while keeping the other parameters fixed. For the asymptotic framework to make sense, we assume that there exists a fixed unknown sequence of numbers {τ_j} that defines the partition boundaries as T_j = ⌊nτ_j⌋, where ⌊a⌋ denotes the largest integer not greater than a. This assures that, as the number of samples grows, the same fraction of samples falls into every partition. We call {τ_j} the boundary fractions. We give sufficient conditions under which the sequence {τ_j} is consistently estimated. In particular, if the number of partition blocks is estimated correctly, then we show that max_{j∈[B]} |T_j − T̂_j| ≤ nδ_n with probability tending to 1, where {δ_n}_n is a non-increasing sequence of positive numbers that tends to zero. If the number of partition segments is over-estimated, then we show that for the distance between two sets A and B defined as

h(A, B) := max_{b∈B} min_{a∈A} |a − b|     (14)

we have h(T̂, T) ≤ nδ_n with probability tending to 1. With the boundary fractions consistently estimated, we further show that, under suitable conditions, for each node a ∈ V the correct neighborhood is selected on all estimated blocks of the partition that are sufficiently large. The proof technique employed in this section is quite involved, so we briefly describe the steps used. Our analysis is based on a careful inspection of the optimality conditions that a solution β̂^a of the optimization problem (3) needs to satisfy. The optimality conditions for β̂^a to be a solution of (3) are given in §3.2. Using the optimality conditions, we establish the rate of convergence for the partition boundaries. This is done by contradiction. Suppose that there is a solution with the partition boundary T̂ that satisfies h(T̂, T) ≥ nδ_n. Then we show that, with high probability, all such solutions fail to satisfy the KKT conditions and therefore cannot be optimal. This shows that all solutions to the optimization problem (3) result in partition boundaries that are "close" to the true partition boundaries, with high probability. Once it is established that T̂ and T satisfy h(T̂, T) ≤ nδ_n, we can further show that the neighborhoods are consistently estimated, under the assumption that the estimated blocks of the partition have enough samples. This part of the analysis follows the commonly used strategy to prove that the Lasso is sparsistent (see for example Bunea, 2008; Wainwright, 2009; Meinshausen and Bühlmann, 2006); however, important modifications are required due to the fact that the positions of the partition boundaries are being estimated.
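For reference, the distance in (14) can be computed as below (a small illustrative helper; the boundary sets follow the paper's 1-based index convention).

```python
def boundary_distance(T_hat, T_true):
    """h(T_hat, T_true): the largest distance from a true boundary to the closest estimated one."""
    return max(min(abs(t_hat - t) for t_hat in T_hat) for t in T_true)

# Over-segmented example: every true boundary has an estimated boundary within 2 time points.
print(boundary_distance([1, 79, 150, 213, 301], [1, 81, 211, 301]))   # -> 2
```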
Our analysis focuses on one node a ∈ V and its neighborhood. However, using the union bound over all nodes in V , we are able to carry over the conclusions to the whole graph. To simplify notation, when it is clear from the context, we omit the superscript a and write β̂, θ̂ and S, etc., to denote β̂^a, θ̂^a and S_a, etc.

Assumptions
Before presenting our theoretical results, we give some definitions and assumptions that are used throughout this section. Let ∆_min := min_{j∈[B]} |T_j − T_{j−1}| denote the minimum distance between change points, ξ_min := min_{a∈V} min_{j∈[B−1]} ||θ^{a,j+1} − θ^{a,j}||_2 denote the minimum jump size, and θ_min := min_{a∈V} min_{j∈[B]} min_{b∈S^j_a} |θ^{a,j}_b| denote the minimum coefficient size. Throughout the section, we assume that the following conditions hold.
A1 There exist two constants φ_min > 0 and φ_max < ∞ such that

φ_min ≤ Λ_min(Σ^j) ≤ Λ_max(Σ^j) ≤ φ_max   for all j ∈ [B].

A2 Variables are scaled so that σ^j_{aa} = 1 for all j ∈ [B] and all a ∈ V .
The assumption A1 is commonly used to ensure that the model is identifiable. If the population covariance matrix is ill-conditioned, the question of correct model identification is not well defined, as the neighborhood of a node may not be uniquely defined. The assumption A2 is made for simplicity of presentation; the common variance can be obtained through scaling.
A3 There exists a constant M > 0 such that

max_{a∈V} max_{j,k∈[B]} ||θ^{a,k} − θ^{a,j}||_2 ≤ M.

The assumption A3 states that the difference between the coefficients on two different blocks, ||θ^{a,k} − θ^{a,j}||_2, is bounded for all j, k ∈ [B]. This assumption is satisfied, for example, whenever the coefficient vectors θ^{a,j} themselves are bounded in the ℓ2 norm.
A4 There exists a constant α ∈ (0, 1] such that

max_{j∈[B]} max_{a∈V} |||Σ^j_{N^j_a S^j_a} (Σ^j_{S^j_a S^j_a})^{−1}|||_∞ ≤ 1 − α.

The assumption A4 states that the variables in the neighborhood of the node a, S^j_a, are not too correlated with the variables in the set N^j_a. This condition is necessary and sufficient for correct identification of the relevant variables in Lasso regression problems (see for example Zhao and Yu, 2006; van de Geer and Bühlmann, 2009). Note that this condition remains sufficient in our case, when the correct partition boundaries are not known.
A5 imposes a lower bound on the minimum coefficient size θ_min. Such a lower bound is necessary, since if a partial correlation coefficient is too close to zero, the corresponding edge in the graph cannot be detected.

A6
The sequence of partition boundaries {T_j} satisfies T_j = ⌊nτ_j⌋, where {τ_j} is a fixed, unknown sequence of boundary fractions belonging to [0, 1].
The assumption is needed for the asymptotic setting. As n → ∞, there will be enough sample points in each of the blocks to estimate the neighborhood of nodes correctly.

Convergence of the partition boundaries
In this subsection we establish the rate of convergence of the estimated partition boundaries for the estimator (3). We start by giving a lemma that characterizes the solutions of the optimization problem (3). Note that the optimization problem in (3) is convex; however, it may have multiple solutions, since it is not strictly convex.

Lemma 1. A matrix β̂ is optimal for the optimization problem (3) if and only if there exists a collection of subgradient vectors {ẑ_k}_{k∈[n+1]} and {ŷ_i}_{i∈[n]}, with ẑ_k an element of the subdifferential of the fusion term ||β_{·,k} − β_{·,k−1}||_2 and ŷ_i an element of the subdifferential of ||β_{·,i}||_1, both evaluated at β̂, such that the stationarity condition (15) holds for all k ∈ [n] and ẑ_1 = ẑ_{n+1} = 0.
The following theorem provides the convergence rate of the estimated boundaries in T̂, under the assumption that the correct number of blocks is known.
Theorem 2. Let {x_i}_{i∈[n]} be a sequence of observations according to the model in (1). Assume that A1-A3 and A5-A6 hold, and suppose that the penalty parameters λ_1 and λ_2 satisfy suitable rate conditions. Let {β̂_{·,i}}_{i∈[n]} be any solution of (3) and let T̂ be the associated estimate of the block partition. Let {δ_n}_{n≥1} be a non-increasing positive sequence that converges to zero as n → ∞ and satisfies ∆_min ≥ nδ_n for all n ≥ 1. Furthermore, suppose that (ξ_min √(nδ_n))^{−1} √(p log n) → 0. Then, if |T̂| = B + 1, we have max_{j∈[B]} |T̂_j − T_j| ≤ nδ_n with probability tending to 1 as n → ∞.

Suppose, for example, that δ_n = (log n)^γ / n for some γ > 1 and ξ_min = Ω(√(log n / (log n)^γ)); then the conditions of theorem 5 are satisfied, and we have that the sequence of boundary fractions {τ_j} is consistently estimated. Since the boundary fractions are consistently estimated, we will see below that the estimated neighborhood S(θ̂^j) on the block B̂_j consistently recovers the true neighborhood S^j.
Unfortunately, the correct number of blocks B may not be known. However, a conservative upper bound B_max on the number of blocks B may be available. Suppose that the sequence of observations is over-segmented, with the number of estimated blocks bounded by B_max. Then the following proposition gives an upper bound on h(T̂, T), where h(·, ·) is defined in (14).

Proposition 3. Let {x_i}_{i∈[n]} be a sequence of observations according to the model in (1). Assume that the conditions of theorem 2 are satisfied. Let β̂ be a solution of (3) and T̂ the corresponding set of partition boundaries, with B̂ blocks. If the number of blocks satisfies B ≤ B̂ ≤ B_max, then h(T̂, T) ≤ nδ_n with probability tending to 1 as n → ∞.
The proof of the proposition follows the same ideas as the proof of theorem 2, and its sketch is given in the appendix.
The above proposition assures us that even if the number of blocks is overestimated, there will be a partition boundary close to every true unknown partition boundary.

Figure 1: The figure illustrates where we expect to estimate the neighborhood of a node consistently. The blue region corresponds to the overlap between the true block (bounded by gray lines) and the estimated block (bounded by black lines). If the blue region is much larger than the orange regions, the additional bias introduced from the samples in the orange region will not considerably affect the estimation of the neighborhood of a node on the blue region. However, we cannot hope to consistently estimate the neighborhood of a node on the orange region.

Correct neighborhood selection
In this section, we give a result on the consistency of the neighborhood estimation. We show that whenever the estimated block B̂_j is large enough, say |B̂_j| ≥ r_n where {r_n}_{n≥1} is an increasing sequence of numbers that satisfies (r_n λ_2)^{−1} λ_1 → 0 and r_n λ_2^2 → ∞ as n → ∞, we have that S(θ̂^j) = S^k, where B_k is the true block that overlaps B̂_j the most. Figure 1 illustrates this idea. The blue region in the figure denotes the overlap between the true block and the estimated block of the partition. The orange region corresponds to the overlap of the estimated block with a different true block. If the blue region is considerably larger than the orange region, the bias coming from the samples in the orange region will not be strong enough to prevent us from selecting the correct neighborhood. On the other hand, since the orange region is small, as seen from Theorem 2, there is little hope of estimating the neighborhood correctly on that portion of the sample.
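The notion of the true block that an estimated block "overlaps the most" can be made concrete with a small helper (illustrative code, assuming blocks are given as index ranges).

```python
def most_overlapping_block(est_block, true_blocks):
    """Index of the true block sharing the most time points with the estimated block."""
    est = set(est_block)
    overlaps = [len(est & set(b)) for b in true_blocks]
    return max(range(len(true_blocks)), key=lambda k: overlaps[k])

# True blocks [1, 81), [81, 211), [211, 301); an estimated block [75, 205) overlaps the
# middle block on 124 points and the first block on only 6, so index 1 is returned.
true_blocks = [range(1, 81), range(81, 211), range(211, 301)]
print(most_overlapping_block(range(75, 205), true_blocks))   # -> 1
```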
Suppose that we know that there is a solution to the optimization problem (3) with the partition boundary T̂. Then that solution is also a minimizer of the following objective

min_{θ^1, . . . , θ^{B̂}} ∑_{j∈[B̂]} { (1/2) ∑_{i∈B̂_j} (x_{i,a} − x′_{i,\a} θ^j)^2 + λ_2 |B̂_j| ||θ^j||_1 } + 2λ_1 ∑_{j=2}^{B̂} ||θ^j − θ^{j−1}||_2.     (17)

Note that the problem (17) does not give a practical way of solving (3), but it will help us to reason about the solutions of (3). In particular, while there may be multiple solutions to the problem (3), under some conditions we can characterize the sparsity pattern of any solution that has the specified partition boundaries T̂.
Lemma 4. Let β̂ be a solution to (3), with T̂ being the associated estimate of the partition boundaries. Suppose that the subgradient vectors satisfy |ŷ_{i,b}| < 1 for all b ∉ S(β̂_{·,i}). Then any other solution β̃ with the partition boundaries T̂ satisfies β̃_{b,i} = 0 for all b ∉ S(β̂_{·,i}).
The above lemma states sufficient conditions under which the sparsity pattern of a solution with the partition boundary T̂ is unique. Note, however, that there may be other solutions to (3) that have different partition boundaries. Now, we are ready to state the following theorem, which establishes that the correct neighborhood is selected on every sufficiently large estimated block of the partition.
Theorem 5. Let {x_i}_{i∈[n]} be a sequence of observations according to the model in (1). Assume that the conditions of theorem 2 are satisfied and, in addition, that A4 holds. Then, if |T̂| = B + 1, it holds that S(θ̂^j) = S^j for all j ∈ [B] with probability tending to 1 as n → ∞.

Under the assumptions of theorem 2, each estimated block is of size O(n). As a result, there are enough samples in each block to consistently estimate the underlying neighborhood structure. Observe that the neighborhood is consistently estimated at each i ∈ B̂_j ∩ B_j for all j ∈ [B], and an error is made only on a small fraction of the samples, those with i ∈ B̂_j \ B_j, which is of order O(nδ_n).
Using proposition 3 in place of theorem 2, it can similarly be shown that, for a large fraction of the samples, the neighborhood is consistently estimated even in the case of over-segmentation. In particular, whenever an estimated block B̂_k overlaps a true block B_j on sufficiently many samples, |B̂_k ∩ B_j| = Ω(r_n), it holds that S(θ̂^k) = S^j with probability tending to one.

Numerical studies
In this section, we present a small numerical study of the proposed algorithm on simulated networks. A full performance test and an application to real world data are beyond the scope of this paper, which mainly focuses on the theory of time-varying model estimation. In all of our simulation studies we set p = 30 and B = 3 with |B_1| = 80, |B_2| = 130 and |B_3| = 90, so that in total we have n = 300 samples. We consider two types of random networks: a chain and a nearest neighbor network. We measure the performance of the estimation procedure outlined in §2 using the following metrics: average precision of the estimated edges, average recall of the estimated edges, and the average F_1 score, which combines precision and recall. Writing Ê_i for the estimated edge set and E_i for the true edge set at time point i, the precision, recall and F_1 score are, respectively,

precision = (1/n) ∑_{i∈[n]} |Ê_i ∩ E_i| / |Ê_i|,   recall = (1/n) ∑_{i∈[n]} |Ê_i ∩ E_i| / |E_i|,   F_1 = 2 · precision · recall / (precision + recall).

Our results are averaged over 50 simulation runs. We compare our algorithm against an oracle algorithm which knows the true partition boundaries exactly; in this case, it suffices to run the algorithm of Meinshausen and Bühlmann (2006) on each block of the partition independently. We use a BIC criterion to select the tuning parameter for this oracle procedure, as described in Peng et al. (2009). Chain networks. We follow the simulation in Fan et al. (2009) to generate a chain network (see Figure 2). This network corresponds to a tridiagonal precision matrix (after an appropriate permutation of nodes). The network is generated as follows. First, we generate a random permutation π of [p]. Next, the covariance matrix is generated by choosing the element at position (a, b) as σ_{ab} = exp(−|t_{π(a)} − t_{π(b)}|/2), where t_1 < t_2 < . . . < t_p and t_i − t_{i−1} ∼ Unif(0.5, 1) for i = 2, . . . , p. This process is repeated three times to obtain three different covariance matrices, from which we sample 80, 130 and 90 observations, respectively.
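The chain-network construction described above can be sketched as follows (illustrative code; the permutation, spacings and per-block sampling follow the description in the text).

```python
import numpy as np

def chain_covariance(p, rng):
    """Chain-network covariance (Fan et al., 2009 style):
    sigma_ab = exp(-|t_pi(a) - t_pi(b)| / 2) for a random permutation pi and
    points t_1 < ... < t_p with Unif(0.5, 1) spacings."""
    t = np.cumsum(rng.uniform(0.5, 1.0, size=p))
    tp = t[rng.permutation(p)]
    return np.exp(-np.abs(tp[:, None] - tp[None, :]) / 2.0)

# Three independent covariance matrices, one per block, sampled as in the simulation (p = 30).
rng = np.random.default_rng(0)
p, sizes = 30, [80, 130, 90]
X = np.vstack([rng.multivariate_normal(np.zeros(p), chain_covariance(p, rng), size=m)
               for m in sizes])
print(X.shape)   # (300, 30)
```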
For illustrative purposes, Figure 3 plots the precision, recall and F 1 score computed for different values of the penalty parameters λ 1 and λ 2 . Table 2 shows the precision, recall and F 1 score for the parameters chosen using the BIC score described in 2.2. The numbers in parentheses correspond to standard deviation. Due to the fact that there is some error in estimating the partition boundaries, we observe a decrease in performance compared to the oracle procedure that knows the correct position of the partition boundaries.
Figure 3: Plots of the precision, recall and F_1 scores as functions of the penalty parameters λ_1 and λ_2 for chain networks. The parameter λ_1 is obtained as 100 · 0.98^{50+i}, where i indexes the y-axis. The parameter λ_2 is computed as 285 · 0.98^{230+j}, where j indexes the x-axis. The white region of each plot corresponds to a region of the parameter space that we did not explore.

Nearest neighbor networks. We generate nearest neighbor networks following the procedure outlined in Li and Gui (2006). For each node, we draw
a point uniformly at random on a unit square and compute the pairwise distances between nodes. Each node is then connected to its 4 closest neighbors (see Figure 4). Since some of the nodes will have more than 4 adjacent edges, we randomly remove edges from nodes whose degree is larger than 4, until the maximum degree of a node in the network is 4. Each edge (a, b) in this network corresponds to a non-zero element in the precision matrix Ω, whose value is generated uniformly on [−1, −0.5] ∪ [0.5, 1]. The diagonal elements of the precision matrix are set to the smallest positive number that makes the matrix positive definite. Next, we scale the corresponding covariance matrix Σ = Ω^{−1} to have diagonal elements equal to 1. This process is repeated three times to obtain three different covariance matrices, from which we sample 80, 130 and 90 observations, respectively.
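A sketch of the nearest-neighbor construction described above (illustrative code; the pruning of high-degree nodes and the diagonal inflation follow the verbal description, with a small margin added to keep the matrix numerically positive definite).

```python
import numpy as np

def nearest_neighbor_precision(p, k, rng):
    """Precision matrix of a nearest-neighbor network (Li and Gui, 2006 style)."""
    pts = rng.uniform(size=(p, 2))                       # nodes in the unit square
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    edges = set()
    for a in range(p):                                   # connect each node to its k closest neighbors
        for b in np.argsort(dist[a])[:k]:
            edges.add((min(a, b), max(a, b)))

    def degrees():
        return np.bincount([v for e in edges for v in e], minlength=p)

    while degrees().max() > k:                           # prune random edges at over-connected nodes
        a = int(np.argmax(degrees()))
        cand = [e for e in edges if a in e]
        edges.remove(cand[rng.integers(len(cand))])

    omega = np.zeros((p, p))
    for a, b in edges:                                   # non-zero entries on [-1,-0.5] U [0.5,1]
        omega[a, b] = omega[b, a] = rng.uniform(0.5, 1.0) * rng.choice([-1.0, 1.0])
    lam_min = np.linalg.eigvalsh(omega).min()
    np.fill_diagonal(omega, max(0.0, -lam_min) + 1e-3)   # make the matrix positive definite
    sigma = np.linalg.inv(omega)                         # rescale so the covariance has unit diagonal
    d = np.sqrt(np.diag(sigma))
    return np.linalg.inv(sigma / np.outer(d, d))

rng = np.random.default_rng(0)
omega = nearest_neighbor_precision(p=30, k=4, rng=rng)
print(np.allclose(np.diag(np.linalg.inv(omega)), 1.0))   # covariance has unit diagonal
```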
For illustrative purposes, Figure 5 plots the precision, recall and F_1 score computed for different values of the penalty parameters λ_1 and λ_2. Table 3 shows the precision, recall and F_1 score for the parameters chosen using the BIC score, together with their standard deviations. In the same table, we give the results of the oracle procedure.

Figure 5: Plots of the precision, recall and F_1 scores as functions of the penalty parameters λ_1 and λ_2 for nearest neighbor networks. The parameter λ_1 is obtained as 100 · 0.98^{50+i}, where i indexes the y-axis. The parameter λ_2 is computed as 285 · 0.98^{230+j}, where j indexes the x-axis. The white region of each plot corresponds to a region of the parameter space that we did not explore.

Conclusion
We have addressed the problem of time-varying covariance selection when the underlying probability distribution changes abruptly at some unknown points in time. Using a penalized neighborhood selection approach with a fused-type penalty, we are able to consistently estimate the times when the distribution changes and the network structure underlying the sample. The proof technique used to establish the convergence of the boundary fractions under the fused-type penalty is novel and constitutes an important contribution of the paper. Furthermore, our procedure estimates the network structure consistently whenever there is a large overlap between the estimated blocks and the unknown true blocks of samples coming from the same distribution. The proof technique used to establish the consistency of the network structure builds on the proof of consistency of the neighborhood selection procedure; however, important modifications are necessary since the times of the distribution changes are not known in advance. Applications of the proposed approach range from cognitive neuroscience, where the problem is to identify changing associations between different parts of the brain when presented with different stimuli, to systems biology, where the task is to identify changing patterns of interactions between genes involved in different cellular processes. We conjecture that our estimation procedure is also valid in the high-dimensional setting, when the number of variables p is much larger than the sample size n. We leave the investigation of the rate of convergence in the high-dimensional setting for future work.

Proof of Theorem 2
We build on the ideas presented in the proof of Proposition 5 in Harchaoui and Lévy-Leduc (2010). Using the union bound, it is enough to show that P[|T_j − T̂_j| > nδ_n] → 0 for all j ∈ [B]. Define the event A_{n,j} as A_{n,j} := {|T_j − T̂_j| > nδ_n} and the event C_n as C_n := {max_{j∈[B]} |T̂_j − T_j| < ∆_min/2}. We show that P[A_{n,j}] → 0 by showing that both P[A_{n,j} ∩ C_n] → 0 and P[A_{n,j} ∩ C_n^c] → 0 as n → ∞. The idea here is that, in some sense, the event C_n is a good event on which the estimated partition boundaries and the true partition boundaries are not too far from each other. Considering the two cases makes the analysis simpler.
First, we show that P[A n,j ∩ C n ] → 0. Without loss of generality, we assume thatT j < T j , since the other case follows using the same reasoning. Using (15) twice with k =T j and with k = T j and then applying the triangle inequality we have Some algebra on the above display gives The above display occurs with probability one, so that the event also occurs with probability one, which gives us the following bound =: P[A n,j,1 ] + P[A n,j,2 ] + P[A n,j,3 ].
Finally, we upper bound the probability of the event A_{n,j,3}. As before, φ_min(T_j − T̂_j)ξ_min/9 ≤ ||R_1||_2 with probability at least 1 − 2 exp(−nδ_n/2 + 2 log n). This gives us an upper bound on P[A_{n,j,3}] which, using lemma 7, converges to zero under the conditions of the theorem, since (ξ_min √(nδ_n))^{−1} √(p log n) → 0. Thus we have shown that P[A_{n,j,3}] → 0. Since the case when T̂_j > T_j is handled similarly, we have proved that P[A_{n,j} ∩ C_n] → 0 as n → ∞.
We proceed to show that P[A n,j ∩ C c n ] → 0 as n → ∞. Recall that C c n = {max j∈[B] |T j − T j | ≥ ∆ min /2}. Define the following events n ]. First, consider the event A n,j ∩ D (m) n under the assumption thatT j ≤ T j . Due to symmetry, the other case will follow in a similar way. Observe that We bound the first term in (22) and note that the other terms can be bounded in the same way. The following analysis is performed on the event n . Using (15) with k =T j and k = T j , after some algebra (similar to the derivation of (20)) the following holds with probability at least 1−2 exp(−nδ n /2+2 log n), where we have used lemma 8. LetT j = ⌊2 −1 (T j + T j+1 )⌋. Using (15) with k =T j and k = T j after some algebra (similar to the derivation of (21)) we obtain the following bound which holds with probability at least 1−c 1 exp(−nδ n /2+2 log n), where we have used lemma 8 twice. Combining the last two displays, we can upper bound the first term in (22) with where we have used lemma 7 to obtain the third term. Under the conditions of the theorem, all terms converge to zero. Reasoning similar about the other terms in (22), we can conclude that P[A n,j ∩ D n ] → 0 as n → ∞. Next, we bound the probability of the event A n,j ∩ D (l) n , which is upper bounded by Observe that Using the same arguments as those used to bound terms in (22), we have that P[D (l) n ] → 0 as n → ∞ under the conditions of the theorem. Similarly, we can show that the term P[D (r) n ] → 0 as n → ∞. Thus, we have shown that P[A n,j ∩ C c n ] → 0, which concludes the proof.

Proof of Lemma 4
Consider T̂ fixed. The lemma is a simple consequence of duality theory, which states that, given the subdifferential vector ŷ_i (which is constant for all i ∈ B̂_j, B̂_j being an estimated block of the partition T̂), all solutions {β̃_{·,i}}_{i∈[n]} of (3) need to satisfy the complementary slackness condition ∑_{b∈\a} ŷ_{i,b} β̃_{b,i} = ||β̃_{·,i}||_1, which holds only if β̃_{b,i} = 0 for all b ∈ \a for which |ŷ_{i,b}| < 1.

Proof of Theorem 5
Since the assumptions of theorem 2 are satisfied, we work on the event E := {max_{j∈[B]} |T̂_j − T_j| ≤ nδ_n}.
On this event, |B̂_k| = O(n). For i ∈ B̂_k, we write x_{i,a} = x′_{i,\a} θ^{a,k} + e_i + ǫ_i, where e_i := x′_{i,\a}(θ^{a,B(i)} − θ^{a,k}) is the bias and B(i) denotes the true block containing sample i. Observe that for all i ∈ B̂_k ∩ B_k the bias e_i = 0, while for i ∈ B̂_k \ B_k the bias e_i is normally distributed with variance bounded by M^2 φ_max under the assumptions A1 and A3.
We proceed to show that S(θ k ) ⊂ S k . Sinceθ k is an optimal solution of (3), it needs to satisfy Now, we will construct the vectorsθ k ,žT k−1 ,žT k andyT k−1 that satisfy (24) and verify that the subdifferential vectors are dual feasible. Consider the following restricted optimization problem where the vector θ k N k is constrained to be 0. Let {θ j } j∈ [B] be a solution to the restricted optimization problem (25). Set the subgradient vectors asžT (24) foř yT k−1 ,N k . By construction, the vectorsθ k ,žT k−1 ,žT k andyT k−1 satisfy (24). Furthermore, the vectorsžT k−1 andžT k are elements of the subdifferential, and hence dual feasible. To show thatθ k is also a solution to (17), we need to show that ||yT k−1 ,N k || ∞ ≤ 1, that is, thatyT k−1 is also dual feasible variable. Using lemma 4, if we show thatyT k−1 ,N k is strict dual feasible, ||yT k−1 ,N k || ∞ < 1, then any other solutionθ k to (17) will satisfyθ k N = 0. From (24) we can obtain an explicit formula forθ S ǩ Recall that for large enough n we have that |B| > p, so that the matrix (XB k S k ) ′ XB k S k is invertible with probability one. Plugging (26) into (24), we have that ||yT LetΣ k andΣ k be defined as For i ∈ [n], we let B(i) index the block to which the sample i belongs to. Now, for any b ∈ N k , we can write i ∈B k , and W b ∈ R |B k | be the vector with components equal to w i b . Using this notation, and We analyze each of the terms separately. Starting with the term T 1 b , after some algebra, we obtain that (32) Recall that we are working on the event E, so that |||Σ k (37) we bound the first two terms in the equation above. We bound the first term by observing that for any j and any b ∈ N k and n sufficiently large with probability 1 − c 1 exp(−c 2 log n). Next, for any b ∈ N k we bound the second term as with probability 1−c 1 exp(−c 2 log n). Choosing ǫ 1 , ǫ 2 sufficiently small and for n large enough, we have that max b |T 1 b | ≤ 1 − α + o p (1) under the assumption A4. We proceed with the term T 2 b , which can be written as Since we are working on the event E the second term in the above equation is dominated by the first term. Next, using (32) together with (37), we have that Combining with Lemma 7, we have that under the assumptions of the theorem We deal with the term T 3 b by conditioning on XB k S k and ǫB k , we have that W b is independent of the terms in the squared bracket in T 3 b , since allžT k−1 ,S ,žT k ,S andŷT k−1 ,S are determined from the solution to the restricted optimization problem. To bound the second term, we observe that conditional on XB k S k and ǫB k , the variance of T 3 b can be bounded as whereη Using lemma 8 and Young's inequality, the first term in (33)  with probability at least 1 − 2 exp(−|B k |/2 + 2 log n). Using lemma 6 we have that the second term is upper bounded by with probability at least 1 − exp(−c 1 |B k |δ ′2 + 2 log n). Combining the two bounds, we have that Var(T 3 b ) ≤ c 1 s(|B k |) −1 with high probability, using the fact that (|B k |λ 2 ) −1 λ 1 → 0 and |B k |λ 2 → ∞ as n → ∞. Using the bound on the variance of the term T 3 b and the Gaussian tail bound, we have that Combining the results, we have that max b∈N k |Y b | ≤ 1 − α + o p (1). For a sufficiently large n, under the conditions of the theorem, we have shown that Sinceẽ i = 0 only on i ∈B k \B k and nδ n /|B k | → 0, the term involvingẽB k is stochastically dominated by the term involving ǫB k and can be ignored. 
Define the following terms Conditioning on XB k S k , the term T 1 is a |S k | dimensional Gaussian with variance bounded by c 1 /n with probability at least 1 − c 1 exp(−c 2 log n) using lemma 8. Combining with the Gaussian tail bound, the term ||T 1 || ∞ can be upper bounded as P ||T 1 || ∞ ≥ c 1 log s n ≤ c 2 exp(−c 3 log n).
Proof. For any 1 ≤ l < r ≤ n, with r − l > v n we have Lemma 7. Let {x i } i∈[n] be independent observations from (1) and let {ǫ i } i∈ [n] be independent N (0, 1). Assume that A1 holds. If v n ≥ C log n for some constant C > 16, then for some constants c 1 , c 2 > 0.
Proof. Let Σ 1/2 denote the symmetric square root of the covariance matrix Σ SS and let B(i) denote the block B j of the true partition such that i ∈ B j . With this notation, we can write x i = Σ B(i) 1/2 u i where u i ∼ N (0, I). For any l ≤ r ∈ B j we have || r i=l x i ǫ i || 2 = || r i=l Σ j 1/2 u i ǫ i || 2 ≤ φ 1/2 max || r i=l u i ǫ i || 2 .
Conditioning on {ǫ i } i , for each b ∈ [p], r i=l u i,b ǫ i is a normal random variable with variance r i=l (ǫ i ) 2 . Hence, || r i=l u i ǫ i || 2 2 /( r i=l (ǫ i ) 2 ) conditioned on {ǫ i } i is distributed according to χ 2 p and where the last inequality follows from (38). Using lemma 6, for all l, r ∈ B j with r − l > v n the quantity r i=l (ǫ i ) 2 is bounded by (1 + C)(r − l + 1) with probability at least 1 − exp(−c 1 log n), which gives us the following bound Proof. For any 1 ≤ l < r ≤ n, with r − l ≥ v n we have ≤ 2 exp(−v n /2) using (35), convexity of Λ max (·) and A1. The lemma follows from an application of the union bound. The other inequality follows using a similar argument.

Proof of Proposition 3
The following proof follows the main ideas already given in the proof of theorem 2, so we provide only a sketch.
Given an upper bound B_max on the number of partitions, we perform the analysis on the event {B̂ ≤ B_max}. We focus on P[h(T̂, T) ≥ nδ_n, |T̂| = B′ + 1] for B′ > B (for B′ = B it follows from theorem 2 that h(T̂, T) < nδ_n with high probability). Let us define the following events: E_{j,1} = {∃ l ∈ [B′] : |T̂_l − T_j| ≥ nδ_n, |T̂_{l+1} − T_j| ≥ nδ_n and T̂_l < T_j < T̂_{l+1}}, E_{j,2} = {∀ l ∈ [B′] : |T̂_l − T_j| ≥ nδ_n and T̂_l < T_j}, and E_{j,3} = {∀ l ∈ [B′] : |T̂_l − T_j| ≥ nδ_n and T̂_l > T_j}.
Using the above events, we have the bound P[h(T̂, T) ≥ nδ_n, |T̂| = B′ + 1] ≤ ∑_{j∈[B]} (P[E_{j,1}] + P[E_{j,2}] + P[E_{j,3}]).
The probabilities of the above events can be bounded using the same reasoning as in the proof of theorem 2, by repeatedly using the KKT conditions given in (15). In particular, we can use the strategy used to bound the event A n,j,2 . Since the proof is technical and does not reveal any new insight, we omit the details.

A collection of known results
This section collects some known results that we have used in the paper. We start with results on the eigenvalues of random matrices. Using standard results on the concentration of spectral norms and eigenvalues (Davidson and Szarek, 2001), Wainwright (2009) derives the following two crude bounds that can be very useful. Under the assumption that p < n, P[Λ_min(Σ̂) ≤ φ_min/9] ≤ 2 exp(−n/2), where Σ̂ denotes the sample covariance matrix of the observations.