Detecting Structural Changes in Longitudinal Network Data

Dynamic modeling of longitudinal networks has been an increasingly important topic in applied research. While longitudinal network data commonly exhibit dramatic changes in its structures, existing methods have largely focused on modeling smooth topological changes over time. In this paper, we develop a hidden Markov multilinear tensor model (HMTM) that combines the multilinear tensor regression model (Hoff 2011) with a hidden Markov model using Bayesian inference. We model changes in network structure as shifts in discrete states yielding particular sets of network generating parameters. Our simulation results demonstrate that the proposed method correctly detects the number, locations, and types of changes in latent node characteristics. We apply the proposed method to international military alliance networks to find structural changes in the coalition structure among nations.

Here y ijt informs the dyadic relationship between actors i and j at time t. While dynamic modeling of tensor-valued longitudinal networks, mainly in a form of reduced-rank decomposition, has been an increasingly important topic in social, biological, and other fields of science, a fully probabilistic treatment of dynamic network process has been a challenging problem due to simultaneous dependence between dyadic and temporal observations that are often associated with fundamental shifts in data generating processes. 1 By employing a Bayesian framework of tensor decomposition by Hoff (2011) and Hoff (2015), we present a Bayesian method to model changepoint process in longitudinal networks. The motivation of our method is firmly based on a common observation in network analysis that longitudinal network datasets frequently exhibit irregular dynamics, implying multiple changes in their data generating processes (e.g. Guo et al., 2007;Heard et al., 2010;Wang et al., 2014;Cribben and Yu, 2016;Barnett and Onnela, 2016;Ridder et al., 2016). Figure 1 shows an example of longitudinal network data with 3 layers of time series and 90 nodes. The data generating parameters of nodes in each network are depicted by 2 dimensional latent traits at each time point. The distance between each pair of nodes represents their probability of connection, so that proximal nodes are more likely to have connections. As clearly shown by the blocks of matrices at the bottom panel, the clustered patterns of connections are well depicted by the node positions, that can be recovered using our method, on the top panel. Colors in the backgrounds of the layers represent latent regimes inferred using our method. In addition to the latent traits that are specific to each network, one can easily notice that the two cluster networks at t = 1 turned into a three cluster network at t = 2. Contrary to the dramatic change of the overall network structure at t = 2, the network at t = 3 exhibits identical node positions to the t = 2 network. The same color indicates layers sharing latent node positions. The goal of our method is to uncover 1) latent traits representing data generating processes at each regime sharing those traits (colored node positions/groupings in the top panel) and 2) the timing of unspecified number of changes (t = {1} for regime 1 and t = {2, 3} for regime 2). t=1 t=2 t=3 Figure 1: A dynamic network with 3 temporal snapshots embedded in their latent (i.e. unobserved) node positions and latent (colored) regimes of layers that can be recovered using our method (top panel). The same network is represented in a tensor format (bottom panel). Gray lines between layers indicate node identity. Nodes in the same color exhibit dense connections whereas nodes in different colors exhibit sparse connections. Same cluster connections are represented by their node group color and inter-cluster connections are colored in white. The patterns of connections of the three snapshots can be represented as a tensor as shown in the bottom panel.
Each network snapshot is shown as a matrix with black dots representing the presence of connections between corresponding node pairs. Olive and yellow colors shown in the backgrounds of top and bottom panels indicate the hidden regimes of latent coordinates shared among layers. t = 1 network has distinct characteristics, consisting of two clusters, whereas t = 2 and t = 3 networks share latent coordinates of three clusters.
Conventional approaches to dynamic network modeling typically extend a static network analysis framework by assuming smooth topological changes over time or applying completely separate models for each time period (Robins and Pattison, 2001;Hanneke et al., 2010;Desmarais and Cranmer, 2012;Snijders et al., 2006Snijders et al., , 2010Westveld and Hoff, 2011;Ward et al., 2013). These methods rely largely on heuristic approaches to detect structural changes in data generating parameters. Recently, several methods for the "network change-point detection" problem have been proposed, noting the importance of irregular changes in network structures. For example, Cribben and Yu (2016) introduce a two-step approach to network change-point detection in which the cosine distances for the principal eigenvectors of time-specific graph Laplacian matrix are used to find change-points given pre-specified significance thresholds. 2 Another group of studies (Guo et al., 2007;Wang et al., 2014) allow parameter values of exponential random graph models (ERGMs) to change over time. However, both models exhibit computational inefficiency. For instance, the maximum size of network analyzed was 11 nodes in Guo et al. (2007) and 6 nodes in Wang et al. (2014). 3 By incorporating the stochastic blockmodel (SBM) framework, which presumes the existence of discrete node groups, Ridder et al. (2016) propose a method to identify a single parametric break. Ridder et al. (2016)'s method compares the bootstrapped distribution of the log-likelihood ratio between a null model and an alternative. However, the asymptotic distribution of a SBM with a break approaches to a mixture of χ 2 -distributions. Hence it does not meet the regularity condition of the log-likelihood ratio test statistic (Drton, 2009). A recent approach by Bartolucci et al. (2018) is also restricted to model changes in group membership in the SBM setting when the number of group is fixed. Likewise, existing methods for the "network change-point detection" problem lack the capacity of a fully probabilistic modeling and fail to incorporate uncertainty in the model structure and parameter estimation.
Our approach diverges from previous methods in two significant ways. First, we build a dynamic model using Hoff (2011Hoff ( , 2015's multilinear tensor regression model (MTRM), which is a multilayer (i.e. tensor) extension of the latent space approach to network data. MTRM allows us to decompose longitudinal network data into node-specific (or row and column) random effects and time-specific (or layer-specific) random effects. 4 For example, let {z i,j,t } be latent propensities to form a link between i and j observed at time t and x i,j,t be a vector of known covariates affecting {z i,j,t }. Then, based on the notion of multilayer exchangeability (Hoff, 2009b), MTRM models the latent propensity of link formation ({z i,j,t }) as a function of covariates, node-specific random effects ({u 1 , . . . , u N }) and time-specific random effects (V t ): where {u 1 , . . . , u N } represent (time-constant) R-dimensional latent node positions and V t is a diagonal matrix of (time-varying) node-connection rules. As we will explain in details, this multiplicative decomposition is highly useful for the joint estimation of time-varying network generation rules in conjunction with latent node positions that are constant for the duration of a hidden regime. Different from SBM formulation, the continuous multidimensional node position formulation let us to model any underlying latent structure including both group-structured networks, treated by SBM, and networks without group substructures that are unable to be modeled by SBM. The second departure of our approach from existing methods is the use of hidden Markov model (HMM) to characterize the change-point process. As shown in other applications (Baum et al., 1970;Chib, 1998;Robert et al., 2000;Cappe et al., 2005;Scott et al., 2005;Frühwirth-Schnatter, 2006;Teh et al., 2006), the conditional independence assumption in HMM turns out to be highly useful to model unknown changes in the latent network traits. More specifically, latent node positions ({u 1 , . . . , u N }), which are constrained to be constant over time in Hoff (2011), are allowed to change over time depending on the transition of hidden states.
The resulting model is a hidden Markov multilinear tensor model (HMTM) as it combines Hoff (2011)'s MTRM with HMM. HMTM assumes that a dynamic network process can be modeled as discrete changes in the latent space representation of network layers at each time point. These changes reflect fundamental shifts in structural properties of network under consideration. For example, structural changes in military alliance networks reflect the transformation of the international system such as the balance of power system during the Concert of Europe, the bifurcated system in the run-up to the World War I, and the bipolar structure during the Cold War, as we will see shortly.
The proposed method has several notable contributions to longitudinal network analysis. First, we show that degree heterogeneity hinders the recovery of meaningful traits in the latent space approach and demonstrate that degree correction formulations (Karrer and Newman, 2011;Chaudhuri et al., 2012) make a crucial difference in the recovery of ground-truth group structures underlying our example data generation. Second, our method provides an important tool to understand dynamic network processes by allowing researchers to model fundamental changes in factors underlying the evolution of longitudinal networks. Changes in longitudinal network data can take a variety of forms and our method does not restrict the types of network generating models. Finally, we provide an open-source R package, NetworkChange, that implements all the methods introduced in the paper including Bayesian model diagnostic tools: the approximate log marginal likelihood (Chib, 1995), the Watanabe-Akaike Information Criterion (WAIC) (Watanabe, 2010). We report the performance test results of these diagnostics.

Latent Space Model for Tensor
Let U = (u 1 , . . . , u N ) ∈ R N ×R be the R-dimensional latent node positions of N nodes and v t = (v 1t , . . . , v Rt ) ∈ R R be a vector exhibiting dimensionspecific node connection rules at time t. In this formulation, network effects are modeled by the product of latent node traits (u i for node i and u j for node j) and layer-specific node-connection rules (v t at time t or tth layer) as follows: where u i , v t , u j = R r=1 u i,r v r,t u j,r and matrix normal(M, U, V) is a N × R matrix-variate normal distribution with mean matrix M, row variance U, and column variance V.
The resulting estimates of node-specific latent variables recover a specific type of similarity between nodes that is easily interpretable (Hoff, 2008). If u i and u j exhibit similar values, they will have similar inner product outcomes with node k's latent position vector u k . This means that the probability of connection with k is analogous for i and j. This corresponds to the notion in network theory that nodes i and j are structurally equivalent (Wasserman and Faust, 1994). In addition, the generation rule parameter v t contains the information on what the distance relationships on each dimension of the U space reveal about their connection probability. For example, v rt > 0 corresponds to the case when a network generation rule for the rth dimension at time t is homophilous (assortative). In words, v rt > 0 indicates that two nodes on rth dimension at time t are more likely to be connected if they are located in the same side of the axis and the magnitude of their product is high. Similarly, v rt < 0 corresponds to the case when a network generation rule for rth dimension at time t is heterophilous (dissortative), so that nodes located on the opposite sides are more likely to be connected than the ones with the same sign.

Degree Correction
One of most important features of the proposed method is detecting changes in meso-scopic network properties, grouping of nodes, such as homophilous or heterophilous groups and core-periphery substructures in a network. The emergence (and changes) of group structures is commonly observed in realworld network data (Borgatti and Everett, 1999;Nowicki and Snijders, 2001;Newman, 2006;Fortunato, 2010;Xu, 2015;Ridder et al., 2016). The formulation of MTRM in Equation (4), however, is designed to recover consistent regression parameters (β) considering network effects as a nuisance parameter. Hence it entails a critical weakness in uncovering latent meso-scopic network features.
Except for exogenous covariates (x i,j,t ), Equation (4) has no treatment to account for degree heterogeneity that has been known to confound the group structure recovery (e.g. Newman, 2006Newman, , 2010Karrer and Newman, 2011). The intuition is that the distribution of degrees in empirical networks is highly heterogeneous and skewed following power law or exponential distributions (Clauset et al., 2009) while the implicit assumption in the group structure recovery is that the expected degree of nodes having a similar role (i.e. proximal in the latent space or belonging to the same group) is simi-lar. This problem is well known in the network science literature and various degree-correction methods have been proposed (Newman, 2006(Newman, , 2010Karrer and Newman, 2011;Chaudhuri et al., 2012;Zhao et al., 2012). In Layer 1, the withingroup link probability of 0.5 and the between-group link probability of 0.2 (homophily) and in Layer 2 the within-group link probability of 0.2 and the between-group link probability of 0.5 (heterophily). Figure 2 illustrates the problem in a simple setting using a 2 layer network where we assume no change in node positions. We generate two undirected networks consisting 3 groups, each group composed of 10 nodes, in Figure  2(A) and Figure 2(B). The number of groups and node membership are identical but the connection rules are opposite. In Figure 2(A), the within group link probability (0.5) is much larger than the between group link probability (0.2) and the probabilities are flipped in Figure 2(B). The node colors indicate group memberships and the lines indicate links. Without assuming a change between 2 layers, our task is to identify 3 hidden groups from the data using their recovered node positions.
We first fit an probit MTRM shown in Equation (4) with an unconditional mean parameter without external covariates. Then, we applied the k-means clustering algorithm to the estimated latent node positions to identify group membership. The results are reported in Figure 2(C). The MTRM fails to distinguish the green group from the blue group and researchers will conclude erroneously that the data are generated by two groups.
A simple fix to this problem is to use a null model to control for the expected level of associations among pairs of nodes. One example is an additive null model (ω ijt ), consisting of the principal eigenvalue (λ princ t = max(|λ(Y t )|)) and its associated eigenvector (Peixoto, 2013): whereũ it is the ith row of the associated eigenvector. 5 In matrix form, we denote the principal eigenmatrix as Ω t . Figure 2(D) shows the results from the linear MTRM on the transformed data (B t = Y t −Ω t ). Colors are allocated by the k-means clustering analysis. The use of a null model allows us to recover three distinct blocs in the data. Of course, one can think of an alternative way of controlling for the expected level of associations by including a list of external covariates. However, when the goal is to identify hidden groups, not coefficients of covariates, using a null model is more intuitive and computationally less expensive than including a list of covariates.

The Proposed Method
In order to develop a dynamic network model for structural changes, we must start from the question, "What constitutes structural changes in networks?" On the one hand, one can think of a change in summary statistics of macro-scopic network properties, such as average shortest path length or network density as a structural change (Cranmer et al., 2014). On the other hand, a change in the population statistics of micro-scopic network properties, such as transitivity or node degree, can be considered as a structural change (Heard et al., 2010;Lung-Yut-Fong et al., 2012;Kolar and Xing, 2012). But global network statistics and local indices cannot fully represent generative processes of dynamic networks as the granularity of the information entailed in such measures is too limited. Instead, studies in network science have paid an increasing amount of attention to meso-scopic features of networks (e.g. community structures, stochastic blocks, core-periphery structures) (Nowicki and Snijders, 2001;Newman, 2006;Fortunato, 2010; 5 Alternatively, one can use a modularity matrix (M ijt ): and k i is the sum of weights for i (Newman and Girvan, 2004). Both methods are available in NetworkChange. Tibély et al., 2011;Sporns, 2014). Technically, various approaches for mesoscopic trait discovery locates nodes in a discrete or continuous latent space on the basis of their similarity. In this paper, a structural change in networks is defined as a change in meso-scopic features of networks. We support this claim by using synthetic examples and show that this perspective is effective enough to recover fundamental aspects of changes in network generation.

Hidden Markov Multilinear Tensor Model
As shown by Chib (1998) and Park (2012), multiple change-point problems are equivalent to the estimation of a nonergodic (or forward-moving) HMM, which has advantages in latent state identification and parameter estimation, thanks to the order constraint in latent states. Let us denote S as a vector of hidden state variables where S t is an integer-valued hidden state variable at t and P as a M × M transition matrix where p i,i is the ith diagonal element of P and M is the number of hidden states. Then, the probability distribution of a degree-corrected longitudinal network data with M − 1 breaks can be modeled as a Markov mixture of Mcomponent MTRMs using the conditional independence assumption of the HMM. Suppose Θ be a collection of parameters that represent a network generating process of a longitudinal network. Then, where p(B t |Θ m ) a generative network model at regime m. Here, the duration of hidden state m follows a geometric distribution of 1 − p mm where p mm is the mth diagonal element of an M ×M transition matrix. The regime change probability can be easily computed using the posterior draws of hidden states (e.g. 1 t−1 )). Following the MTRM, the HMTM decomposes the degree corrected network data at t as a bilinear product of latent node positions and dimension weights subject to hidden state changes: One may concern that the normal distribution of E t does not fit the data very well. In that case, the above model can be modified to include a Student-t distributed error (Carlin and Polson, 1991) where the prior distribution of γ t follows a gamma distribution (G(ν 0 /2, ν 1 /2)).
For prior distributions of U and V, we follow Hoff (2011)'s hierarchical scheme with two major modifications. First, we orthogonalize each column of U St using the Gram-Schmidt process (Björck, 1996;Guhaniyogi and Dunson, 2015) in each simulation step. Hoff (2011)'s hierarchical scheme centers rows of U St around its global mean (µ u,St ) using a multivariate normal distribution. This does not guarantee the orthogonality of each latent factor in U St . The lack of orthogonality makes the model unidentified, causing numerical instability in parameter estimation and model diagnostics (Murphy, 2012;Guhaniyogi and Dunson, 2015).
Second, we use independent inverse-gamma distributions instead of inverse-Wishart distribution for the prior distribution of a variance parameter (Ψ u,St , Ψ v ). The use of inverse-Wishart distribution for the prior distribution of a variance parameter (Ψ u,St , Ψ v ) comes at a great cost because choosing informative inverse-Wishart prior distributions for Ψ u,m and Ψ v is not easy (Chung et al., 2015) and a poorly specified inverse-Wishart prior distribution has serious impacts on the marginal likelihood estimation. In our trials, the log posterior inverse-Wishart density of Ψ u,St and Ψ v often goes to a negative infinity, failing to impose proper penalties. In HMTM, the off-diagonal covariance of U m is constrained to be 0, thanks to the Gram-Schmidt process, and the off-diagonal covariance of V is close to 0 as v t measures time-varying weights of independent U m . Thus, inverse-gamma distributions resolve a computational issue without a loss of information.
The resulting prior distributions of U and V are matrix-variate normal distributions in which each column vector (u i,St and v t ) follows a multivariate normal distribution. We first discuss the prior distribution of U: The prior distributions of V are similar to U but one difference is that only diagonal elements of V t are modeled as a multivariate normal distribution: Then, we complete the model building by introducing HMM-related prior specifications following Chib (1998): where π 0 is the initial probability of a non-ergodic Markov chain (π 0 = (1, 0, . . . , 0)).

Sampling Algorithm
Let Θ indicate a parameter vector beside hidden states (S) and a transition matrix (P): Then, the joint posterior density p(Θ, P, S|B) is where B t−1 = (B 1 , . . . , B t−1 ). Using the conditional independence we decompose the joint posterior distribution into three blocks and marginalize conditional distributions (Liu et al., 1994;van Dyk and Park, 2008): The sampling algorithm of the HMTM can be summarized as follows: Step 1 The sampling of regime specific U, µ, Ψ u consists of the following three steps for each regime m.
4. Orthogonalize U m using the Gram-Schmidt algorithm.
Step 2 The sampling of V, Step 3 The sampling of σ 2 m from IG c 0 +Nm·Nm·Tm Step 4 Sample S recursively using Chib (1998)'s algorithm. The joint conditional distribution of the latent states p(S 0 , . . . , S T |Θ, B, P) can be written as the product of T numbers of independent conditional distributions: Using Bayes' Theorem, Chib (1998) shows that State probabilities given all data The second part on the right hand side is a one-step ahead transition probability at t, which can be obtained from a sampled transition matrix (P). The first part on the right hand side is state probabilities given all data, which can be simulated via a forward-filtering-backwardsampling algorithm as shown in Chib (1998).
During the burn-in iterations, if sampled S has a state with single observation, randomly sample S with replacement using a pre-chosen perturbation weight (w perturb = (w 1 , . . . , w M )).
Step 5 Sample each row of P from the following Beta distribution: where p kk is the probability of staying when the state is k, and j k,k is the number of jumps from state k to k, and j k,k+1 is the number of jumps from state k to k + 1.
We provide the sampling details of the HMTM with a Student-t distributed error the supplementary material.

Assessing Model Uncertainty
We provide three metrics for model diagnostics and break number detection: the approximate log marginal likelihood method, WAIC, and the average loss of break points. The first measure is the approximate log marginal likelihood method using the candidate's estimator (Chib, 1995). Main advantages of the approximate log marginal likelihood are its direct connection with Bayes' theorem and its consistency when models are well identified and MCMC chains converge to the target distribution. A disadvantage of the approximate log marginal likelihood is its computational cost arising from additional MCMC runs at each Gibbs sampling block. Using the Rao-Blackwell approximation, the approximate log marginal likelihood of HMTM with M numbers of latent states (M M ) can be computed as follows: the log prior density of posterior means the log posterior density of posterior means ,v , σ 2 * , P * } are posterior means of MCMC outputs. The log likelihood is computed by summing log predictive density values evaluated at posterior means across all states and over all upper triangular array elements as follows: The computation of the log posterior density of posterior means requires a careful blocking in a highly parameterized model as discussed in Chib (1995). In our HMTM, the log posterior density evaluated at posterior means is decomposed into seven blocs: The second measure of model diagnostics is WAIC (Watanabe, 2010). WAIC approximates the expected log pointwise predictive density by subtracting a bias for the effective number of parameters from the sum of log pointwise predictive density. WAIC approximates leave-one-out cross validation (LOO-CV) in singular models and hence can serve as a metric for out-of-sample predictive accuracy of HMTM (Gelman et al., 2014). Predictive accuracy is a good standard for detecting the number of breaks because overfitting is a major concern in analysis using mixture models and HMMs. Also, the cost of computation is very low as WAIC is computed from MCMC outputs. Note that WAIC of HMTM partitions the data into T pieces of conditional density, and the estimated WAIC scores are dependent upon latent state estimates. The dependence on estimated latent states indicate that our measure of WAIC cannot be used to predict future networks given the sequence of network observation. Instead, we aim to use WAIC to compare predictive accuracies of HMTMs given a varying number of breaks.
Using the formula suggested by Gelman et al. (2014), WAIC for HMTM with M number of latent states (M M ) is indicates a variance, and Θ (g) , P (g) are the gth simulated outputs. Throughout the paper, we report the approximate log marginal likelihood in the deviance scale by multiplying -2 to logp(B upper |M M ) for easy comparison with WAIC following the advice of Gelman et al. (2014): The smaller the deviance, the better the accuracy.
The last measure of model diagnostics is the average loss of break points. The inclusion of redundant break points (e.g. imposing two breaks on a single break process) produces an instability in draws of hidden state variables. An easy way to check the existence of redundant breaks is to estimate average variances of simulated break points. This measure is equivalent to the average loss of break points assuming the simulation mean of break points (τ m ) as true break points: where G is the MCMC simulation size and M is the total number of breaks. The average loss is close to 0 if simulated break points are highly stable. Average Loss becomes larger if at least one of break points swings widely in each simulation.

Simulation Studies
In this section, we check the performance of the proposed method using simulated group-structured network data. We consider five different cases of group-structured network changes: 2. group-structured networks with a group-splitting break (Table 2) 3. group-structured networks with a group-merging break ( Table 3 in Supplementary Material) 6 4. group-structured networks with a group-merging break followed by a group-splitting break (Table 3) 5. group-structured networks with a group-splitting break followed by a group-merging break ( Table 4 in Supplementary Material)

Simulation Setup
Blocks in simulated data were generated by an assortative rule in which nodes belonging to the same group had a higher connection probability (p in = 0.5) than nodes belonging to different groups (p out = 0.05). 7 In the group merging examples, two groups were merged so that the tie formation probability between the members of the two groups changed from p out to p in . In the group splitting examples, an existing group split into two equal size groups so that the connection probability between the members of the two different groups became p out from p in . The length of time layers was 40. The planted break occurred at t = 20 in the case of the single break examples and t = 10 and t = 30 in the case of two breaks. We fit four different HMTMs from no break (M 0 ) to three breaks (M 3 ) and compare their model diagnostics, recovered latent spaces, and time-varying network generation rules.

Block-structured Networks with No Break
We first check how sensitive our proposed method is to the false positive bias (i.e. detecting breaks when there is no break). We generate block-structured networks with no break. 8 Table 1 summarizes the results from the no break    The ground truth is one break and the underlying group structure changes from a two group structure to a three group structure in the middle. case.
The reading of the results starts from model diagnostics in the first column. While the approximate log marginal likelihood incorrectly favors the one break model (M 1 ), WAIC correctly shows that the no break model (M 0 ) fits the data best. Also, the average loss of break points shows somewhat unstable movements. On average, simulated break points of the one break model swing ±0.57 around the estimated break point and simulated break points of the two break model swing ±1.48 around the estimated break points.
If we look at the estimated latent space of the no break model, it correctly recovers the latent two-group structure while the one break model identifies another latent state almost identical to the first one. The approximate log marginal likelihood fails to penalize the recovery of the redundant latent state. However, as we add more breaks, the approximate log marginal likelihood successfully penalizes the model for the existence of redundant states in the two break model and in the three break model.

Block-structured Networks with a Block-splitting Break
Now, we move the case of dynamic network data with a single group-splitting break. That is, the ground truth is that the number of latent groups changes from 2 to 3 in the middle. Table 2 shows the results of the simulation. Again, we start to read from the model diagnostic results in the first column. WAIC correctly identifies the single break model as the best-fitting model while the approximate log marginal likelihood favors the three break HMTM. As we have seen in the previous example, the approximate log marginal likelihood fails to penalize the model with redundant breaks. The source of the problem is the singleton state (a latent state consisting of a single observation). A similar problem has been noticed in finite mixture models with singular components (Hartigan, 1985;Bishop, 2006). In the three break model, the second state has only one observation, which increases the log likelihood dramatically. Since a singleton state is highly unlikely in reality, researchers can ignore false diagnostic results simply by checking the existence of singleton states. 9 Interestingly, the network generation rule in the last column shows almost identical patterns, regardless of the number of imposed breaks. Note that the second dimensional network generation rule (v 2 ) jumps to a large positive number in the middle as the number of groups increases from 2 to 3.
The average loss of break points clearly favors the one break model. Adding redundant breaks increases the average loss of break points significantly. For example, simulated break points of the three break model swing ±1.4 around the estimated break points on average while simulated break points of the one break model stay constant.

Block-structured Networks with Two Breaks
Last, we check whether our proposed method correctly recovers more complicated network changes. The first break is planted at t = 10 and it corresponds to a group-merging change. Another break is planted at t = 30, which corresponds to a group-splitting change. Thus, the number of latent groups changes 3, 2, and 3. Table 3 reports the results of the two break test.
WAIC correctly detects M 2 as the best-fitting model while the approximate log likelihood favors M 3 , the pattern of which is constant in our simulation. Again, the presence of a singleton state in the three break model is the source of the problem for the approximate log likelihood. The average loss of break points correctly favors the two break model. Fitting the one break model increases the average loss of break points because the latent state sampler falls into either of the breaks. In contrast, adding more than two breaks increases the average loss of break points significantly because the existence of a redundant break makes it difficult to pin down break points in simulations.
If we look the recovered latent states, the two break HMTM correctly recovers the two underlying changes between t = 10 and t = 11 (group merging) and between t = 30 and t = 31 (group splitting). Changes in the relative size of network generation rules (the last column) inform us the types of changes underlying network structures go through. For example, series data. In contrast, WAIC relies on the log pointwise predictive density as a measure of the goodness of the fit and its variance as a penalty. Since the log pointwise predictive density is averaged over the entire MCMC scan ( 1 G G g=1 p(B upper t |Θ (g) , P (g) , M M )), it is less sensitive to singular components in high dimensional mixture models like HMTM. This is why WAIC outperforms in the break number detection in the context of HMTM.  Table 3: Simulation Results of Block-structured Networks with Two Breaks. The ground truth is two breaks and the underlying group structure changes from a two group structure to a three group structure right after t = 10 and from a three group structure to a two group structure right after t = 30. The ground truth is two breaks. when the number of groups changes from 2 to 3 in the transition to Regime 3, v 2 returns to its previous level at Regime 1.
Overall, our simulation results clearly show that our proposed model and multiple metrics for model diagnostics work very well in (1) correctly identifying the number of breaks, (2) recovering latent group structure changes, and (3) identifying state-specific latent node positions. The approximate log marginal likelihood performs well where there is no singleton state while WAIC performs steadily regardless of the existence of singleton states. The average loss also shows steady performance, signaling the existence of redundant states in a tested model.

Applications
In this section, we apply our proposed method to the analysis of structural changes in military alliance networks. The structure of military alliance networks reflects the distribution of military power among coalitions of states, which changes over time in response to exogenous shocks to the international system or endogenous network dynamics. However, there has been no study that investigates changes in coalition structures of military alliance networks over time. A main reason is the lack of statistical methods that model unknown structural changes in longitudinal network data.
To illustrate our method, we start from a simple example using a small data set, consisting of seven "major powers" (Austria-Hungary, France, Germany, Italy, Japan, Russia, and the United Kingdom) from 1816 to 1938 (Gibler, 2008). We aggregated every 2 year network to increase the density of each layer. Changing the granularity of aggregation does not change the substantive findings. These seven major powers are main players of the balance of power system in Europe during the 19th century and two world wars in the 20th century. Also, the period from 1816 to 1938 corresponds to the era of shifting alliances among major powers. Thus, the structure of alliance among major powers will clearly display how the distribution of military power in the international system changes over time. Then, we apply our method to a larger data set of 104 postwar states after removing isolated nodes.

Major Power Alliance Network Changes, 1816 -1938
Figure 3 shows the model diagnostic results for the major power alliance data set. We dropped the results for the models with more than 3 breaks as they show strong signs of non-convergence, which indicate the existence of redundant states. All metrics of model diagnostics point to the two break model as the best-fitting model. In particular, the average loss of break points significantly drops in the two break model. Figure 4 visualizes changes in latent node positions of major powers (top) and changing patterns of the major-power network topology (bottom) from the two break model. 10 Node colors (online) indicate clusters of each node using the k means clustering method. Regime-specific network generation rule parameters v rt are reported in axis labels. Several substantive findings are noteworthy.
The first notable finding is the centrality of Austria-Hungary, connecting groups of major powers, between 1816 and 1890. This period includes what historians call "the age of Metternich" (1815-1848) (Rothenberg, 1968). After the end of Napoleonic Wars, Chancellor of Austria-Hungary played an important role in maintaining the European balance of power system. The first dimension of Regime 1 clearly distinguishes Austria-Hungary from the other major powers. In Regime 2 (1854-1890), Germany challenged the position of Austria-Hungary. However, throughout Regime 1 and Regime 2, the network position of Austria-Hungary remained highly critical in the sense that the removal of Austria-Hungary would have made the major power alliance network completely disconnected. In the language of social network analysis, Austria-Hungary filled a "structural hole" in the major power alliance network at the time, playing the role of broker (Burt, 2005(Burt, , 2009Stovel and Shaw, 2012). The second notable finding is the timing of the first break between 1852 and 1854. This break point coincides with the outbreak of the Crimean War. In this war, Russia was defeated by the united powers of Britain, France, Austria-Hungary, and Prussia (Germany). The rise of Germany led by Otto von Bismarck and the defeat of Russia marked the first break in the balance of power system.
The third notable finding is the timing of the second break between 1890 and 1892. Scholars of international relations and historians consider the formation of the Dual Alliance between Germany and Austria-Hungary in 1879 and a sequence of alliances that followed it as a structural change in the balance of power system (Snyder, 1997;Vermeiren, 2016). 11 These series of events transformed a web of shifting alliances into a clearly diverged group structure, consisting of two clusters: Austria-Hungary, Germany, and Italy on the one hand and France, Russia, and the United Kingdom on the other. The network diagram of the third regime (bottom-right) shows members of the two clusters, which formed each side of belligerents in World War I.

Postwar Alliance Network Changes, 1946 -2012
Now, we focus on the postwar period in which the number of states and alliance links among them exploded. After removing isolated nodes, our sample contained 104 states. Figure 5 shows the results of the model diagnostics.
Although the two break model is preferred by the approximate log marginal likelihood, the average loss of break points and WAIC favor the one break model. The estimated break point is between 1978 and 1980. Figure 6 shows changes in latent node positions (top) and changing patterns of the network topology (bottom). What we can see is that the number of latent groups does not change during the postwar period. What changed 11 First, Russia formed alliances with Germany and Austria-Hungary (Three Emperors' Alliance) in 1881. Then, Italy joined Germany and Austria-Hungary (Triple Alliance) in 1882. France, a long-time rival of Germany, formed an alliance with Russia in 1894 to check Germany and Austria-Hungary. In this process, an important cleavage in the alliance networks emerged.

Concluding Remarks
In this article, we presented HMTM as a statistical method to detect and analyze changes in structural properties of longitudinal network data. The proposed method has several advantages over existing dynamic network models. First, the proposed method combines a highly flexible generative model of multilayer networks (MTRM) with a HMM, which has proved to be an effective tool to model irregular dynamics in temporal data. This formulation is flexible enough to accommodate a variety of network representations such as graph Laplacian (Rohe et al., 2011) and motif Laplacian (Benson et al., 2016) as an input data format. Our simulation studies showed that our generative approach is a powerful tool to detect and analyze diverse types of network changes. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  (1946−1978) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Second, the Bayesian inference of HMTM enables researchers to identify the number of network changes in a principled way. Our simulation studies show that WAIC correctly identifies the number of breaks and the type of network changes in all tests while the approximate log marginal likelihood consistently favor overfitted models.
Finally, HMTM provides an important tool to investigate changes in meso-scale structures of longitudinal network data. Meso-scale structural changes are important quantities that reflect fundamental changes in the network generating process, that are unable to be captured by local network properties or global summary indices.
While we only consider undirected networks, our model can be extended to analyze other types of longitudinal network data consisting of directed networks or bipartite networks using a singular value decomposition-based framework (De Lathauwer et al., 2000;Hoff, 2007) and the hierarchical multilinear framework Hoff (2011) in general. Also, a hierarchical Dirichlet process prior can be used to endogenously detect the number of breaks (Beal et al., 2002;Ko et al., 2015;Teh et al., 2006;Fox et al., 2011). Another interesting extension of HMTM is the inclusion of nodal covariates (Volfovsky and Hoff, 2015) or covariates for network effects, where the latent space formulation may serve as an instrument to control for unobserved heterogeneous effects on tie formation. One difficulty in adding covariates in hidden Markov models is to avoid the endogeneity between state transition and effects of covariates (Kim et al., 2008).

MCMC Algorithm for Hidden Markov Tensor Model
For each t layer, generate B t = Y t − Ω t by choosing a null model (Ω t ).

Part 1
Step 1 The sampling of regime specific U, µ, Ψ u consists of the following three steps for 4. Orthogonalize U m using the Gram-Schmidt algorithm.
Step 2 The sampling of V, Step 3 The sampling of β from N (b 1 , B 1 ) where 1(S = m) is the number of time units allocated to state m and µ i,j,t is an element Step 4 The sampling of σ 2 m from IG c0+Nm·Nm·Tm

Part 2
Step 5 Sample S recursively using Chib (1998) Using Bayes' Theorem, Chib (1998) shows that p(S t |S t+1 , Θ, B, P) ∝ p(S t |Θ, B 1:t , P) State probabilities given all data p(S t+1 |S t , P) Transition probability at t . The second part on the right hand side is a one-step ahead transition probability at t, which can be obtained from a sampled transition matrix (P). The first part on the right hand side is state probabilities given all data, which can be simulated via a forward-filtering-backward-sampling algorithm as shown in Chib (1998).
Step 5-1 During the burn-in iterations, if sampled S has a state with single observation, randomly sample S with replacement using a pre-chosen perturbation weight (w perturb = (w 1 , . . . , w M )).

Part 3: p(P|B, S, Θ)
Step 6 Sample each row of P from the following Beta distribution: where p kk is the probability of staying when the state is k, and j k,k is the number of jumps from state k to k, and j k,k+1 is the number of jumps from state k to k + 1.

Sampling of HMTM with a Student-t Distributed Error
We modify the sampling of U based on a Student-t distributed error as follows: where Γ m is a T m by T m diagonal matrix with γ t corresponding to regime m.
Next, the sampling of σ 2 m can be done using IG c 0 +Nm·Nm·Tm Then, the draws of γ t are obtained from G(ν 2 /2, ν 3,t /2) where  Figure 7 demonstrates the performance of discrete block label recovery using a block-merging-splitting example with T = 40. Panel (a) shows the planted block structures at the beginning of each regime and panel (b) shows recovered latent node positions and their block memberships that are identified by Hartigan and Wong (1979)'s k-means clustering algorithm. The two break HMTM correctly recovers the two planted changes at T = 11 (block merging) and T = 31 (block splitting). When block a and block b are merged into one block in Regime 2, the second dimension becomes redundant and the second element of the regime-averaged network generation rule (v 2 ) becomes small (0.98). In Regime 3, the number of blocks increases back to 3 and v 2 returns to its previous level at Regime 1.

Further Simulation Results
In this section, we report additional simulation results that are not reported in the manuscript due to space limitation. First, we report simulation results from a block-merging network change. Then, we report simulation results from block-splitting-merging network changes. The structure of the reported tables is identical the ones discussed in the manuscript.  Table 4 correctly recovers the block-merging latent change and the second dimensional network generation rule (v 2 ) drops from a positive number to 0 in the middle as the number of blocks decreases from 3 to 2.  Table 4: Simulation Analysis of Block Merging MTRM. The ground truth is one break. All setup is analogous to the simulation examples in the main text. Two 10 node blocks were merged into a single block at t = 21. Table 5 summarizes the simulation results from the block-splitting-merging change example. The ground truth is the two break HMTM (M 2 ). WAIC correctly identifies M 2 as the best-fitting model while the approximate log marginal likelihood favors M 3 . Note that Regime 3 of M 3 has only two observations, which inflates the approximate log marginal likelihood. The results of M 2 correctly recovers the block-splitting-merging change in the latent node positions and network generation rules.  Table 5: Simulation Analysis of Block Split-Merging MTRM. The ground truth is two breaks. All parameters are analogous to the example depicted in Table 3 of the main text except the order of the changes.