Model selection for network data based on spectral information

We introduce a new methodology for model selection in the context of modeling network data. The statistical network analysis literature has developed many different classes of network data models, with notable model classes including stochastic block models, latent position models, and exponential families of random graph models. A persistent question in the statistical network analysis literature lies in understanding how to compare different models for the purpose of model selection and evaluating goodness-of-fit, especially when models have different mathematical foundations. In this work, we develop a novel non-parametric method for model selection in network data settings which exploits the information contained in the spectrum of the graph Laplacian in order to obtain a measure of goodness-of-fit for a defined set of network data models. We explore the performance of our proposed methodology on popular classes of network data models through numerous simulation studies, and demonstrate the practical utility of our method through two applications.


Introduction
Network data have witnessed a surge of interest across a variety of fields and disciplines in recent decades, including the study of social networks (Lusher et al., 2013), network epidemiology (involving the spread of disease through networks of contacts) (Morris, 2004), covert networks of criminal activity and terrorism (Coutinho et al., 2020), brain networks (Obando and de Vico Fallani, 2017), financial markets (Finger and Lux, 2017), and more. Network data, as a data structure, are typically represented as a graph (Kolaczyk, 2009), consisting of a set of nodes representing the elements of a population of interest (e.g., researchers in a collaboration network) and a set of pairwise observations or measurements represented as edges between nodes (e.g., co-authorship on a paper).
Many classes of models have been proposed and developed to study and model network data.
A non-exhaustive list of some of the more prominent examples includes exponential families of random graph models (ERGMs) (e.g., Lusher et al., 2013; Schweinberger et al., 2020), stochastic block models (SBMs) (e.g., Holland et al., 1983; Anderson et al., 1992; Wang and Bickel, 2017), latent position models (LPMs) (e.g., Hoff et al., 2002; Sewell and Chen, 2015; Tang et al., 2013; Athreya et al., 2017), and more. Each class offers a unique mathematical platform for constructing models of networks from observed network data, with respective strengths and weaknesses. The exponential family class provides a flexible parametric platform for building models of networks with dependent edges. In contrast, stochastic block models capture network structure and clustering of nodes through a discrete latent space, whereas latent position models capture network structure and edge dependence through latent node positions in (e.g.) a latent Euclidean space.
A persistent challenge in statistical network analysis applications is how to compare different models and select models for specific network data sets. At present, the literature has primarily focused on model selection problems within each class of models, tailoring methods to specific classes of models (SBMs: Wang and Bickel (2017); Latouche et al. (2014); ERGMs: Hunter et al. (2008); LPMs: Ryan et al. (2017)). As a result, there is a gap in the literature concerning methods for comparing model fit or performing model selection across models from different mathematical platforms, e.g., comparing an ERGM to an SBM to an LPM. In this work, we introduce a novel non-parametric methodology for model selection in network data settings that can be applied to a broad class of models under weak assumptions, and which is capable of facilitating comparison of models with different mathematical foundations. Our method utilizes information in the spectrum of the graph Laplacian in order to select a best fitting model for an observed network, and essentially only requires the ability to simulate adjacency matrices from candidate models and to compute eigenvalues of the graph Laplacian derived from those adjacency matrices.
The rest of the paper is organized as follows. Section 2 reviews spectral properties of the graph Laplacian for networks and motivates the use of spectral information in the model selection problem for network data. Our proposed methodology is introduced in Section 3. We present experimental studies and simulations in Section 4, and two applications of our methodology in Section 5.

Spectral properties of the graph Laplacian
We consider simple undirected networks defined on a set of N nodes with corresponding adjacency matrix X ∈ {0, 1}^{N×N}, where X_{i,j} = 1 corresponds to the event that there is an edge between nodes i and j and X_{i,j} = 0 otherwise. We adopt the standard conventions that X_{i,j} = X_{j,i} and X_{i,i} = 0. Extensions of our methodology to directed networks are discussed in Section 4. Extensions to networks with valued edges are possible, but beyond the scope of this work. Let d = deg(X) = (Σ_{j=1}^{N} X_{i,j} : i = 1, . . ., N) ∈ R^N be the vector of node degrees of the network. The Laplacian matrix, also called the graph Laplacian, is defined as L(X) := diag(d) − X, where diag(d) is the N × N diagonal matrix with diagonal d. Since L(X) is symmetric and positive semi-definite (Brouwer and Haemers, 2012), the eigenvalues of L(X) are all real and non-negative. Throughout, we let λ ∈ R^N denote the vector of ordered eigenvalues (from smallest to largest) of the Laplacian matrix L(X). The vector λ depends on the adjacency matrix X through L(X); however, for ease of presentation, we do not make this dependence explicit notationally, as it will be clear from context.
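As a concrete illustration of these definitions, the Laplacian and its spectrum can be computed directly from an adjacency matrix. The sketch below (using numpy; not part of the paper's implementation) computes λ for a 4-cycle, whose Laplacian eigenvalues are known to be 0, 2, 2, 4.

```python
import numpy as np

def laplacian_spectrum(X):
    """Return the ascending eigenvalues of L(X) = diag(d) - X."""
    X = np.asarray(X, dtype=float)
    d = X.sum(axis=1)                 # node degrees
    L = np.diag(d) - X                # graph Laplacian
    return np.sort(np.linalg.eigvalsh(L))  # real eigenvalues of a symmetric matrix

# A 4-cycle: every node has degree 2; the spectrum of L is 0, 2, 2, 4.
C4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]])
lam = laplacian_spectrum(C4)
```

Note that `eigvalsh` exploits the symmetry of L(X), guaranteeing real eigenvalues in exact arithmetic.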
Eigenvalues of Laplacian matrices encode many well known properties of a network. For example, the multiplicity of the eigenvalue 0 corresponds to the number of connected components in the network (Brouwer and Haemers, 2012). The second smallest eigenvalue (possibly 0) is known as the algebraic connectivity (Fiedler, 1973), and measures the overall connectivity of a graph (de Abreu, 2007). It is also used in establishing Cheeger inequalities (Donetti et al., 2006), which have applications in image segmentation (Shi and Malik, 2000), graph clustering (Kwok et al., 2013), and expander graphs (Hoory et al., 2006). The subsequent eigenvalues of the Laplacian matrix are related to the minimal cuts (weighted edge deletions) required to partition a network (Bollobás and Nikiforov, 2004).
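The first property above can be checked numerically: counting near-zero eigenvalues of the Laplacian recovers the number of connected components. A minimal sketch (numpy; illustrative, not from the paper):

```python
import numpy as np

def num_components(X, tol=1e-8):
    """Count connected components via the multiplicity of eigenvalue 0 of L(X)."""
    X = np.asarray(X, dtype=float)
    L = np.diag(X.sum(axis=1)) - X
    lam = np.linalg.eigvalsh(L)
    return int(np.sum(np.abs(lam) < tol))  # eigenvalues numerically equal to zero

# Two disjoint edges: nodes {0, 1} and {2, 3} form two components.
X = np.zeros((4, 4))
X[0, 1] = X[1, 0] = 1
X[2, 3] = X[3, 2] = 1
```

A tolerance is needed because floating-point eigenvalues of a singular matrix are only zero up to rounding error.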
Two undirected graphs with adjacency matrices A and B are isomorphic if there exists a permutation matrix P such that A = P B P^t, which requires that the adjacency matrices be similar, A = P B P^{-1}, noting that a permutation matrix P satisfies P^t = P^{-1}. In such cases, the corresponding graph Laplacian matrices will be similar as well: L(A) = diag(deg(P B P^t)) − P B P^t = P diag(deg(B)) P^t − P B P^t = P L(B) P^t.
Consequently, since L(B) is symmetric, there exists an eigendecomposition L(B) = U D U^t. Hence, L(A) = P (U D U^t) P^t = (P U) D (P U)^t. As a result, if λ is the vector of eigenvalues of L(B), it is also the vector of eigenvalues of L(A). In our context, this implies that two networks whose Laplacian spectra differ cannot be isomorphic, so differing eigenvalues always distinguish non-isomorphic networks. The reverse is not true in general: there exist non-isomorphic graphs with the same spectrum, referred to as cospectral or isospectral graphs (Cvetković et al., 1980). However, numerical evidence suggests that the fraction of (non-isomorphic) cospectral graphs tends to zero as the number of nodes in a graph grows (Brouwer and Haemers, 2012).
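The invariance of the Laplacian spectrum under node relabeling can be verified directly: permuting the rows and columns of an adjacency matrix leaves the eigenvalues unchanged. A small numpy check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian_eigs(X):
    return np.sort(np.linalg.eigvalsh(np.diag(X.sum(axis=1)) - X))

# Random symmetric adjacency matrix on N nodes.
N = 8
B = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
B = B + B.T

# Relabel nodes with a random permutation matrix P: A = P B P^t.
perm = rng.permutation(N)
P = np.eye(N)[perm]
A = P @ B @ P.T

# Isomorphic graphs share the Laplacian spectrum.
same = np.allclose(laplacian_eigs(A), laplacian_eigs(B))
```

The converse check would fail for cospectral graphs, which is exactly the caveat noted above.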
Several applications of spectral decomposition of the Laplacian matrix have been proposed in the network analysis literature. For example, spectral clustering (Von Luxburg, 2007) is a well known clustering algorithm based on the leading eigenvectors of the Laplacian of a similarity matrix. Lei and Rinaldo (2015) established, under mild conditions, the consistency of the spectral clustering method for stochastic block models. Another example is in Newman (2006), where a family of community detection algorithms was proposed for networks based on the spectral decomposition of the graph Laplacian. Lastly, Shore and Lubin (2015) proposed a statistic for evaluating goodness-of-fit for network models reminiscent of the R^2 statistic in regression settings, which compares eigenvalues of the graph Laplacian generated from a model fit to the eigenvalues of the graph Laplacian from a pre-specified null model (typically taken to be a Bernoulli random graph model, referred to as a density only model).
In light of these results, it is natural to regard the vector of eigenvalues λ as a signature of a network, containing important topological and structural information which can be exploited for the purposes of model selection. Our proposed methodology compares the empirical distribution of the spectrum of the graph Laplacian of candidate models to that of an observed network.
Our methodology is motivated by the following considerations regarding properties of the graph Laplacian.
First, if the true data-generating process is in the list of candidate models, the observed eigenvalues derived from an observed network are expected to fall within the spectral distribution of the data-generating process. If, in practice, none of the proposed models is the true generating process, candidate models can still be assessed by their ability to capture the spectrum of the observed graph Laplacian, providing a basis for a model selection method. Second, we can obtain a relative measure of fit among competing models depending on how well the spectrum of the observed graph Laplacian is captured by the candidate models, providing a means not only to select a best fitting model, but also to compare the fit of the best fitting model to the unselected alternatives. Third, our methodology requires no parametric assumptions on the data-generating process and is able to compare models across different mathematical platforms, including models which do not have a well-defined likelihood function or which are constructed through a stochastic process, examples of which include agent-based models (e.g., Snijders et al., 2010; Jackson and Watts, 2002) and generative algorithms based on preferential attachment (e.g., Barabasi and Albert, 1999; Zeng et al., 2013).

Methodology
We outline a methodology for model selection in network data settings which exploits the spectral properties of the graph Laplacian, motivated by the considerations in the previous section. We assume throughout that the network is completely observed, denoted by X_obs. The corresponding observed vector of eigenvalues of the observed graph Laplacian L(X_obs) is denoted by λ_obs. Our fundamental inferential goal is to select a best fitting model for the observed network X_obs from a set of candidate models {M_1, . . ., M_M} (M ≥ 2). We frame the problem as a classification problem and aim to construct a classifier P : R^N → {1, . . ., M}, trained on the spectrum of the graph Laplacian for each of the candidate models, in order to predict a class m ∈ {1, . . ., M} for a given vector of eigenvalues, namely λ_obs. We present our model selection algorithm in Table 1.

Model selection procedure:
1. Simulate K networks X^{(m,1)}, . . ., X^{(m,K)} from each candidate model M_m ∈ {M_1, . . ., M_M}.
2. For each X^{(m,k)}, compute the Laplacian matrix L(X^{(m,k)}) and the corresponding vector of eigenvalues λ^{(m,k)} ∈ R^N.
3. Construct a design matrix D ∈ R^{(KM)×N} by stacking the KM vectors of eigenvalues λ^{(m,k)} to form the rows of D.
4. Train a classifier P : R^N → {1, . . ., M} to predict a model m ∈ {1, . . ., M} using the K simulated vectors of eigenvalues λ^{(m,k)} for each class m ∈ {1, . . ., M} contained in the design matrix D. Feature engineering is advised at this stage.
5. Compute the Laplacian matrix L(X_obs) for the observed network X_obs and the corresponding vector of eigenvalues λ_obs.
6. Predict a class m = P(λ_obs) for the observed network using the trained classifier from Step 4 and select model M_m.
Table 1: Description of the model selection algorithm.
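The six steps of Table 1 can be sketched end-to-end in a few lines. In the sketch below (numpy; illustrative assumptions throughout), two Bernoulli random graph models with different densities stand in for the candidate models, and a simple nearest-centroid rule stands in for the XGBoost classifier used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 40, 100  # network size, simulations per candidate model

def spectrum(X):
    """Ordered eigenvalues of the graph Laplacian of adjacency matrix X."""
    return np.sort(np.linalg.eigvalsh(np.diag(X.sum(axis=1)) - X))

def bernoulli_graph(p):
    """Simulate a simple undirected Bernoulli(p) random graph."""
    U = np.triu((rng.random((N, N)) < p).astype(float), 1)
    return U + U.T

# Steps 1-3: simulate from each candidate model, stack spectra into a design matrix.
models = [lambda: bernoulli_graph(0.05), lambda: bernoulli_graph(0.25)]
D = np.stack([spectrum(m()) for m in models for _ in range(K)])
y = np.repeat(np.arange(len(models)), K)

# Step 4: train a classifier; a nearest-centroid rule stands in for XGBoost here.
centroids = np.stack([D[y == m].mean(axis=0) for m in range(len(models))])

def predict(lam):
    return int(np.argmin(((centroids - lam) ** 2).sum(axis=1)))

# Steps 5-6: classify the observed network's spectrum.
X_obs = bernoulli_graph(0.25)   # "observed" network, truly from the second model
m_hat = predict(spectrum(X_obs))
```

Swapping the toy classifier for XGBoost only changes Step 4; the simulation and spectral feature construction are unchanged.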

Selection of classifier
Real-life networks can possess hundreds, thousands, or even millions of nodes. As the dimension of the vector of eigenvalues of the graph Laplacian is equal to the number of nodes in the network, classification methods based on these eigenvalues will be prone to the usual pitfalls of high-dimensional classification problems. The literature on classification methods is quite extensive, which makes the choice of classifier a critical step in our methodology, although we show in Section 4 that the choice of classifier may not have a significant effect on the results of our methodology under certain circumstances. In light of these results, we consider practical concerns in the implementation of the choice of classifier.
Linear discriminant analysis, which requires the computation of the inverse of a covariance matrix, has been shown in practice to suffer a decay in performance as the number of variables increases while the sample size is fixed (Bickel and Levina, 2004). Alternative methods include support vector machines, neural networks, random forests, and boosting algorithms, which generally perform well in high-dimensional settings (Hastie et al., 2011). Within this class is the eXtreme Gradient Boosting (XGBoost) method, which offers both scalability and state-of-the-art performance (Chen and Guestrin, 2016). In the rest of this paper we use XGBoost exclusively, with the notable exception of Simulation study 5 in Section 4, in which we compare the performance of different classifiers to establish the claim made earlier in this section.

Relative measure of goodness-of-fit
Many classification algorithms return more than just a predicted class, often returning a vector of propensity scores s = (s_1, . . ., s_M) whose entries sum to one (||s||_1 = 1). If several models are considered, the propensity scores for many of the models can shrink simply because of the larger number of classes being considered, meaning that the interpretation of the propensity scores s_1, . . ., s_M can depend on M. To overcome this issue and facilitate the comparison of fit between models, we propose to normalize the propensity scores to obtain a measure of goodness-of-fit which is independent of the number of candidate models M. To this end, we define s̃_m := s_m / max_{1≤m'≤M} s_{m'} to be the normalized score, which is equal to 1 for the highest scoring model. For all remaining models, the normalized score is a measure of the fit of the model relative to the highest scoring model. By rescaling all propensity scores in this manner, the number of models M considered in the candidate set has no effect on the interpretation of the (relative) propensity scores.
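The normalization above is a one-line transformation of the raw score vector; a minimal sketch (numpy assumed):

```python
import numpy as np

def normalize_scores(s):
    """Rescale propensity scores so the best-fitting model gets score 1;
    remaining scores measure fit relative to the best model."""
    s = np.asarray(s, dtype=float)
    return s / s.max()

# Raw propensity scores over M = 3 candidate models (sum to one).
s = np.array([0.1, 0.6, 0.3])
s_tilde = normalize_scores(s)
```

Because each score is divided by the maximum, the normalized vector is unchanged if uninformative candidate models are added or removed, which is the invariance the text asks for.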

Simulation studies
We conduct a number of simulation studies to demonstrate the potential of our proposed methodology. Specifically, we aim to examine the extent to which the signature of a network is contained within the spectrum of the graph Laplacian. Simulation studies permit knowledge of the true data-generating model, which facilitates empirical studies which aim to clarify the conditions under which our proposed methodology is able to successfully differentiate different network models and structural properties of networks.

Simulation study 1: curved exponential families
We study the performance of our methodology on curved exponential families, which have gained popularity in the social network analysis community (e.g., Snijders et al., 2006; Hunter and Handcock, 2006), as well as in other applications (e.g., Obando and de Vico Fallani, 2017; Schweinberger et al., 2020; Stivala and Lomi, 2021). The prominence of curved exponential family parameterizations for random graph models emerged out of a desire to solve challenges related to degeneracy and fitting of early and ill-posed model specifications (Snijders et al., 2006). Additionally, curved exponential family parameterizations are able to parsimoniously model complex sequences of graph statistics, such as degree sequences and shared partner sequences, without sacrificing interpretability (Hunter, 2007; Stewart et al., 2019). A prototypical example used in the social network analysis literature is the geometrically-weighted edgewise shared partner (GWESP) model, which models transitivity through the shared partner sequence (Snijders et al., 2006; Hunter, 2007; Stewart et al., 2019).

Figure 1: Networks simulated from the model specified by (1) with data-generating parameter vector (θ_1, θ_2, θ_3) = (−2.5, 0, 1) (Bernoulli) and (θ_1, θ_2, θ_3) = (−2.5, 0.3, 1) (GWESP). Each column corresponds to one model; the degree and ESP distributions of the GWESP model are shifted rightward relative to the Bernoulli model.
We simulate networks according to the following model:

P_θ(X = x) ∝ exp( θ_1 Σ_{i<j} x_{i,j} + θ_2 e^{θ_3} Σ_{t=1}^{N−2} [1 − (1 − e^{−θ_3})^t] SP_t(x) ),   (1)

where θ_1 ∈ R controls the baseline propensity for edge formation, and θ_2 (together with the decay parameter θ_3) parameterizes the sequence of shared partner statistics

SP_t(x) = Σ_{i<j} x_{i,j} 1(nodes i and j have exactly t common connections),   t = 1, . . ., N − 2.

In words, SP_t(x) counts the number of edges in the network between nodes which have exactly t mutual connections, commonly called shared partners in the social network analysis literature.
While θ_2 ∈ R in general, in typical applications θ_2 ≥ 0 and θ_3 ∈ (0, ∞), as values of θ_3 < −log 2 correspond to models which are unstable in the sense of Schweinberger (2011), and empirical evidence suggests that θ_3 ∈ (0, ∞) in many applications (Schweinberger, 2011; Stewart et al., 2019). The effect that the GWESP model specified by (1) has on the degree and shared partner distributions of networks is visualized in Figure 1, where positive values of θ_2 stochastically encourage network formations with more transitive edges, i.e., edges between nodes with at least one shared partner, relative to the Bernoulli random graph model with θ_2 = 0. This is evidenced by the rightward shift in the ESP distribution of the GWESP model, relative to the Bernoulli model.
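The shared partner statistics SP_t(x) that enter the GWESP term can be computed directly from the adjacency matrix, since (X²)_{i,j} counts the common neighbors of i and j. A short numpy sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def shared_partner_counts(X):
    """SP_t(x): number of edges whose endpoints have exactly t common neighbors,
    returned as an array indexed by t = 0, 1, ..., N-2."""
    X = np.asarray(X, dtype=int)
    N = X.shape[0]
    common = X @ X                      # common[i, j] = # shared partners of i and j
    SP = np.zeros(N - 1, dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            if X[i, j]:
                SP[common[i, j]] += 1
    return SP

# A triangle: each of the 3 edges has endpoints sharing exactly one partner.
T = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
```

Only edges with t ≥ 1 shared partners contribute to the GWESP statistic in (1); SP_0 counts non-transitive edges.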
We take the true data-generating model M to be the curved exponential family specified by (1) with parameter vector θ = (−2.5,θ2, 1), with θ2 on a grid covering the interval [0, 0

Simulation study 2: reciprocity in directed networks
When the adjacency matrix X is undirected, the corresponding Laplacian matrix L(X) is positive semidefinite (Brouwer and Haemers, 2012), resulting in a real-valued vector of eigenvalues λ ∈ R^N. However, when X is the adjacency matrix of a directed network, the graph Laplacian, as defined for undirected networks, may not be positive semidefinite and may have complex-valued eigenvalues. A common adaptation for directed networks in the literature is to consider the incidence matrix B ∈ {0, 1, −1}^{N×|E|}, where |E| is the total number of edges in the network. In each column of the incidence matrix, exactly one element is −1, indicating the node where an edge begins, and exactly one element is 1, indicating the node where that edge ends; every other entry is zero. In this manner, a directed network is completely specified by listing all existing edges as columns that indicate which nodes are connected and the orientation between them. We can adapt our proposed methodology to directed networks by considering the symmetric graph Laplacian defined by L := B^t B (Brouwer and Haemers, 2012).
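Constructing B and the symmetric surrogate Laplacian B^t B from a directed adjacency matrix is mechanical; a minimal numpy sketch (illustrative only):

```python
import numpy as np

def incidence_matrix(X):
    """Oriented incidence matrix B of a directed network: one column per edge,
    with -1 at the edge's tail and +1 at its head."""
    X = np.asarray(X, dtype=int)
    edges = np.argwhere(X == 1)
    B = np.zeros((X.shape[0], len(edges)))
    for e, (i, j) in enumerate(edges):
        B[i, e] = -1.0   # edge begins at node i
        B[j, e] = 1.0    # edge ends at node j
    return B

# Directed 3-cycle: 0 -> 1 -> 2 -> 0.
X = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
B = incidence_matrix(X)
L = B.T @ B   # symmetric and positive semidefinite by construction
```

Since L = B^t B is a Gram matrix, it is symmetric with non-negative real eigenvalues regardless of edge orientations, which is what makes the spectral methodology applicable to directed networks.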
We simulate directed networks from the probability mass function

P_θ(X = x) ∝ exp( θ_1 Σ_{i≠j} x_{i,j} + θ_2 Σ_{i<j} x_{i,j} x_{j,i} ),   (2)

where θ_1 controls the density of the network and θ_2 controls the level of reciprocity. We apply our methodology taking M_1 to be the density only model obtained by fixing θ_2 = 0 in (2), and M_2 to be the general model specified by (2) with unrestricted parameters. We conduct a simulation study by taking θ_1 = −2.5 in both M_1 and M_2, taking θ_2 = 0 in M_1, and varying θ_2 on a uniform grid of 100 values in [0, 1] for M_2. The simulation results in Figure 3 are based on 1000 replications in each case, reconfirming findings in the previous simulation study which suggested that the ability of our methodology to detect the true data-generating model depends on how far θ_2 is from 0, the point at which M_1 = M_2, and on the size of the network. Moreover, this study demonstrates that our methodology can be applied successfully to directed networks.

Simulation study 3: latent position models
Latent variable models for networks, especially latent position models, have witnessed increased popularity and attention since the seminal work of Hoff et al. (2002). In this class of models, nodes are given a latent position z_i ∈ Z (i = 1, . . ., N) in a latent space, typically taken to be Euclidean space (i.e., Z = R^k), although alternative spaces and geometries have been proposed as well, as is the case of ultrametric spaces (Schweinberger and Snijders, 2003), dot product similarities resulting in bilinear forms (Hoff et al., 2002; Athreya et al., 2017), as well as hyperbolic (Krioukov et al., 2010) and elliptic geometries (Smith et al., 2019). Edges in the network are assumed to be conditionally independent given the latent positions of the nodes. Following Hoff et al. (2002), we simulate networks in this study according to

logit P(X_{i,j} = 1 | z_i, z_j) = θ − ||z_i − z_j||_2,   (3)

where θ ∈ R and z_i, z_j ∈ R^k. Under this specification, the odds of two nodes forming an edge decrease in the Euclidean distance ||z_i − z_j||_2 between the positions of the two nodes in the latent metric space.
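Simulating from the latent distance model in (3) amounts to drawing latent positions, converting pairwise distances to edge probabilities through the logistic function, and sampling conditionally independent edges. A sketch under those assumptions (numpy; parameter choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_latent_distance(N, k, theta):
    """Simulate an undirected network with
    logit P(X_ij = 1 | z) = theta - ||z_i - z_j||_2."""
    Z = rng.standard_normal((N, k))                           # latent positions z_i
    dist = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    p = 1.0 / (1.0 + np.exp(dist - theta))                    # sigmoid(theta - dist)
    X = np.triu((rng.random((N, N)) < p).astype(int), 1)      # independent edges
    return X + X.T                                            # symmetrize

X = simulate_latent_distance(N=50, k=2, theta=-2.5)
```

With θ = −2.5, as in the study, the baseline edge probability is low, producing the sparse networks the simulation design targets.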
We explore the ability of our methodology to detect the true dimension of a latent space by generating networks from the latent Euclidean model described above, varying the dimension of the latent metric space k ∈ {1, 2, 3, 4, 5}. Latent positions of nodes are randomly generated from a multivariate normal distribution in dimension k with zero mean vector and identity covariance matrix. The competing candidate models are generated in the same fashion across dimensions 1, . . ., 5. We set θ = −2.5 to ensure a low baseline probability of edge formation, reflecting the sparsity of many real-world networks, and vary the network size N ∈ {50, 100, 150, 200, 250}. We apply our model selection methodology in each case and compute the percentage of times our methodology selects each of the candidate latent space models.
We summarize the results of the simulation study in Figure 4, which demonstrates that our methodology is able to correctly identify the true dimension of the data-generating latent space model, provided the network size is sufficiently large. The diagonal panels in Figure 4 correspond to correct selection of the dimension of the latent space. Of particular note, the problem becomes more challenging as the dimension of the latent space grows, but this effect is mitigated as the network size increases, with most correct selection rates in this study close to 1 for networks of size N = 250.

Simulation study 4: comparing different latent mechanisms
We next study whether our proposed methodology is capable of distinguishing different latent mechanisms for edge formation in a latent position model. The first is the same latent space model specified in (3), while the second replaces the Euclidean distance term −||z_i − z_j||_2 with the dot product z_i^t z_j, commonly referred to as a bilinear form. A related class of latent position models which utilize bilinear forms of latent node positions is the class of random dot product graphs (Athreya et al., 2017). As in the previous simulation study, latent positions of nodes are randomly generated from a multivariate normal distribution with zero mean vector, but this time with covariance matrix σ² I, where I is the identity matrix (of appropriate dimension) and σ² ∈ {0.1, 0.2, . . ., 1.0} is a scale factor. As the scale factor tends to zero, both models converge to a density only model, so detecting the true generating process becomes more difficult. We summarize the results of the simulation study in Figure 5, which demonstrates that our methodology is able to correctly identify the true model (distance based) when compared to a bilinear (similarity based) model. Of particular note, performance improves as the dimension of the latent space increases and as the network size increases, as in the previous studies.

Simulation study 5: effect of the choice of classifier
In this study, we repeat Simulation study 1 using three different classifiers: XGBoost (Chen and Guestrin, 2016), random forests (Ho, 1995; Liaw and Wiener, 2002), and naive Bayes (Hand and Yu, 2001; Majka, 2019). Doing so allows us to examine the effect that the choice of classifier has on the results of this simulation study, as well as to explore the relative effectiveness of each classifier. Figure 6 shows similar performance for all classifiers in this simulation study, with the notable exception of the naive Bayes classifier when networks are of size 25, suggesting that the choice of classifier has a weak effect on the performance of our proposed methodology, provided the network is sufficiently large. In line with conclusions from the previous simulation studies, larger network sizes result in more pronounced model signatures. In light of these results, the effect of the choice of classifier appears to diminish if the model signal is sufficiently strong.

Applications
In order to study the performance of our proposed model selection methodology in applications to real-world network data, we study two network data sets which have previously been analyzed in the literature, providing a baseline for evaluating whether our methodology confirms existing results and knowledge about these networks.

Application 1: Sampson's monastery network
We apply our model selection methodology to the Sampson's monastery network data on social relationships (likeness) among 18 monk novices in a New England monastery in 1968 (Sampson, 1968). Based on the existing literature studying this network, we propose different model structures which are designed to capture the community structure known to be a critical component of the network. In order to model this structure, stochastic block models have been applied to the network (Airoldi et al., 2008), as well as latent position models with a hierarchical group-based prior distribution on the latent positions (Handcock et al., 2007). We consider the following models:

• SBM: M_1-M_4 correspond to stochastic block models with K = 1, 2, 3, 4 blocks (M_1 being equivalent to a density only model).
• LPM: M_5-M_8 correspond to latent position models with model terms for density and reciprocity and latent space dimensions K = 1, 2, 3, 4.

Table 2: Propensity scores s_i and normalized propensity scores s̃_i for models M_1-M_20 for the Sampson's monastery network.
Each model was fit and our model selection methodology was applied to choose a best fitting model. The latent position models were fit using the software of Krivitsky and Handcock (2014) and the stochastic block models using the software of Leger (2016). Table 2 presents the results. The model with the highest propensity score is M_4, the stochastic block model with K = 4 blocks.
It is well-established in the literature that the Sampson's monastery network features strong community structure (Handcock et al., 2007; Airoldi et al., 2008), with three labeled groups. However, statistical analyses have revealed the presence of a potential fourth group, evidenced in analyses which employ mixed membership stochastic block models (Airoldi et al., 2008), as well as in studies employing latent position models which suggest certain nodes may have strong connections to two or more labeled groups (Handcock et al., 2007). Within the context of the models considered here, the choice of a stochastic block model with K = 4 blocks appears to be sufficient to capture the mixing patterns of the communities, as well as the reciprocity, through the inclusion of a reciprocity term. We believe that the expression of transitivity in this network is not sufficiently strong; otherwise, the latent position model with K = 4 groups would potentially serve as a better model, as latent position models are able to capture network transitivity through the latent metric space. Figure 7 supports this claim by simulating networks from M_4 and comparing the empirical triangle count distribution of these simulated networks to the observed number of triangles in the network, demonstrating good model fit in this regard.

Application 2: multilevel school network
We end the section with an application to a multilevel network consisting of 6,607 third grade students over 306 classes across 176 primary schools in Poland in the 2010/2011 academic year (Maluchnik and Modzelewski, 2014).Our interest in this data set lies in the fact that it has already been extensively studied in Stewart et al. (2019), which provides the closest we can get

Figure 2 :
Figure 2: Results of Simulation study 1. (left) Estimate of the correct classification rate with 95% confidence band for networks of sizes N = 25, 50, 75. (right) Estimate of the correct classification rate with 95% confidence band for networks of sizes N = 100, 200, 300.

Figure 3 :
Figure 3: Results of Simulation study 2. Estimates of the correct classification rate with 95% confidence band for various network sizes N .

Figure 4 :
Figure 4: Results of Simulation study 3. Estimates of the correct classification rate with 95% confidence intervals for selected network sizes and across various latent space dimensions. The diagonal panels correspond to correct classification, where the selection rate is desired to be highest.

Figure 5 :
Figure 5: Results of Simulation study 4 comparing a distance based model (true model) to a similarity based model. Estimates of the correct classification rate with 95% confidence band for networks of different sizes and across different dimensions of latent spaces.

Figure 7 :
Figure 7: Fit of the observed number of triangles in the Sampson network relative to the distribution of triangle counts from networks simulated from M_4. The observed number of triangles is indicated in red.

Figure 8 :
Figure 8: Difference in selected statistics between fitted models and Polish school network.