The likelihood-ratio test for multi-edge network models

The complexity underlying real-world systems implies that standard statistical hypothesis testing methods may not be adequate for these peculiar applications. Specifically, we show that the likelihood-ratio test's null-distribution needs to be modified to accommodate the complexity found in multi-edge network data. When working with independent observations, the p-values of likelihood-ratio tests are approximated using a $\chi^2$ distribution. However, such an approximation should not be used when dealing with multi-edge network data. This type of data is characterized by multiple correlations and competitions that make the standard approximation unsuitable. We provide a solution to the problem by providing a better approximation of the likelihood-ratio test null-distribution through a Beta distribution. Finally, we empirically show that even for a small multi-edge network, the standard $\chi^2$ approximation provides erroneous results, while the proposed Beta approximation yields the correct p-value estimation.

Model selection is addressed by operationalizing the principle of parsimony, one of the fundamental concepts in statistical modeling. Statisticians usually view the principle of parsimony as a bias versus variance tradeo : bias decreases, and variance increases as the complexity of a model increases. The t of any model can be improved by increasing the number of parameters. However, a tradeo with the increasing variance must be considered when selecting a model to validate a statistical hypothesis. Parsimonious models should achieve the proper tradeo between bias and variance. Box et al. [3] suggested that the principle of parsimony should lead to a model with "the smallest possible number of parameters for adequate representation of the data". Data-driven selection of a parsimonious model is thus at the core of scienti c research.
We can roughly summarise model selection methods into two groups that address the principle of parsimony di erently. Model selection based on statistical tests and model selection based on informationtheoretic methods [6]. Prominent examples of information-theoretic methods are the AIC [1,2], the BIC [27], or description length minimisation [23,26]. Studying how such methods fare when faced with network data complexity is beyond this article's scope.
The LR test is instead one of the most common examples of statistical tests used for hypothesis testing. Statistical tests allow performing hypothesis testing in the following way. They evaluate how far a test statistic falls from an appropriately constructed null-model. In the LR test, the test statistic is the ratio between the likelihood 0 of the model 0 representing the null-hypothesis and the likelihood of a more complex model representing the alternative hypothesis to be tested. The better the alternative hypothesis is compared to the null, the smaller the test statistic's absolute value .
How small need be to reject the null hypothesis in favor of the alternative? This depends on the null-distribution of . In other words, it depends on the null-hypothesis and its corresponding nullmodel 0 . Assuming that the null-hypothesis is correct, we could generate realizations of model 0 that represent all the possible forms the null-hypothesis could have taken in the data. This provides a null-distribution for the test statistic, i.e., a baseline distribution of the test statistic assuming that the null hypothesis was true. If the alternative hypothesis does not t the observed data well, we can expect the probability of observing from the null-distribution to be relatively large. The reason for this is that does not t the data better than 0 . In other words, there is no (statistical) evidence that we need the more complex model to explain the data, and the null-model 0 is su cient. If the alternative hypothesis is considerably better than the null, the probability of observing under the null-hypothesis will be small. This would give statistical evidence to reject the null hypothesis in favor of the alternative. The p-value of the LR test is precisely the probability of observing a value from the null distribution as small or smaller than .
Standard implementations of the LR test have been developed assuming that the data consist of many independent observations of the same process (i.i.d observations) [21]. Under these circumstances,

3/15 (Submitted for publication)
Wilk's theorem provides a widely used approximation of the null-distribution of to a 2 distribution [24]. Analyzing complex systems, we are often faced with multi-edge network data. These data consist of repeated -and possibly time-stamped -edges ( , ) representing interactions between di erent agents , , the vertices of the network. Examples of such datasets arise in multiple elds, e.g., in the form of human or animal interactions. Usually, events in networks are explicitly dependent on each other. Thus, the crucial assumption of i.i.d observations required by Wilk's theorem is violated. The underlying assumption in network science is that the opposite of i.i.d. is true, i.e., that the presence or absence of edges between some vertex pairs a ects edges between other vertex pairs. This is critical in the presence of phenomena typically found in complex social systems, such as triadic closure [25], structural balance [17], degree-degree correlations [22], and other network e ects.
So what is the implication of such interdependencies? The dependence between di erent observations of a complex system means that some of the statistical tests' properties will not hold when analyzing network data. In particular, we show that the null-distribution of the test statistic of the LR test needs to be modi ed to accommodate such dependence. When this is not done, the results obtained applying LR tests for hypothesis testing cannot be relied upon.
To illustrate this, we employ the gHypEG (generalized hypergeometric ensemble of random graphs) [9,12] to model multi-edge network data. The gHypEG allows the encoding of di erent types of hypotheses in a model, from simple ones like block structures [8] to more complex ones, akin to statistical regression models [5,7]. Such models can then be used to evaluate di erent hypotheses about the data [4].

Statistical Hypotheses and gHypEG 2.1 Multi-Edge Network Data and Hypothesis Formulation
In this article, we deal with multi-edge network data. Each observation → consists of an interaction from a system's agent to another agent . All observations = { → } can be collected as directed edges in a multi-edge network ( , ), where is the set of all interacting agents, the vertices of the network. The matrix denotes the adjacency matrix of the network . Each of its entries reports the number of edges from vertex to vertex , or in other words, the number of observed interactions between agent and agent .
Within this framework, statistical hypotheses are formulated in terms of random graph models. Simple examples of such hypotheses are: (a) Each agent has the same potential to interact as any other.
(b) Di erent agents have di erent potentials to interact. (c) Agents are separated into 2 distinct groups, and agents of one group are more likely to interact with each other than with agents of the other group.
(d) Agents are separated into distinct groups, and agents of one group are more likely to interact with each other than with agents of the other groups.
More complex social and dynamical hypotheses can obviously be formulated depending on the system studied. Testing these hypotheses requires encoding them in statistical models and then comparing the t of these models against each other or against some null-models [21]. We employ the generalized hypergeometric ensemble of random graph (gHypEG) as the statistical model encoding such hypotheses to deal with multi-edge network data. While this is not the only option, we choose the gHypEG because of its versatility and suitability to model multi-edges. In the next sections, we show i) how to formulate a likelihood-ratio test between two of such hypotheses and ii) how the null-distribution of the LR test needs to be modi ed to t complex network data.

The Generalised Hypergeometric Ensembles of Random Graphs (gHypEG)
Before discussing the LR test details, we brie y introduce the generalized hypergeometric ensemble of random graphs. A more formal presentation is provided in [9]. The general idea underlying the gHypEG is to sample edges at random from a prede ned set of possible edges. Hypotheses about the system from which the data is observed can be encoded by either a) changing the number of possible edges in such a set and b) changing the odds of sampling an edge between two vertices instead of others, i.e., by specifying di erent edge sampling weights, or biases. Figure 1 provides a graphical illustration of such a process from the perspective of an agent A. Figure 1: Probabilities of connecting di erent agents according to gHypEG, from the perspective of an agent A. Graphical illustration of (left) uniform edge probabilities, (center) the probability of connecting two vertices is a function of degrees, and (right) the probability of connecting two vertices is function of degrees and propensities Ω (represented as the edge-weight).
On the left-hand side of Fig. 1, the number of possible ways A can interact with the other agents is the same: there are two edge-stubs for each vertex. Moreover, the odds of sampling one edge-stub instead 5/15 (Submitted for publication) of another is 1. I.e., each edge has the same sampling weight, which is denoted as Ω in the following, where , are vertices in . According to the model just described, the probability of observing a multiedge network  with edges depends only on the number Ξ of possible edges between each pair of vertices. This scenario gives rise to a uniform random graph model similar in spirit to the ( , ) model of Erdös and Rènyi [15]. The process of sampling edges from a collection of 2 Ξ possible edges, i.e, Ξ possible edges for each pair of vertices in a directed network with sel oops, is described by the hypergeometric distribution [9]: By setting Ξ = ( / ) 2 , we ensure that the average degree of the observed network  is preserved by the model. This rst scenario corresponds to the hypothesis (a) listed above: each agent has the same potential to interact. Furthermore, in a directed network with self-loops, Ξ = 2 / 2 corresponds to the maximum likelihood estimation of the only model parameter Ξ. We refer to the resulting hypergeometric network model as regular model. The central illustration of Fig. 1 highlights a di erent case. The odds between the di erent interactions are still identical. I.e., there is no preference for A to interact with any of the other agents. However, the actual possibilities of interactions vary between the di erent agents: each agent has a di erent number of edge-stubs for A to connect to. This scenario encodes a di erent potential of interaction for the di erent agents, usually re ected in a heterogeneous degree distribution found in the network . This model encodes hypothesis (b) above. In practice, this hypothesis requires setting different values Ξ for the number of possible edges between each pair vertices , . The probability of observing a network  according to this model changes as follows: where the matrix Ξ Ξ Ξ contains all di erent entries Ξ . The value of Ξ can be freely chosen to encode di erent properties of the system studied (see, e.g., [4]). For example, if we were studying a citation network consisting of citations between scientists, we could set Ξ to ⋅ , where is the number of articles published by scientist . ⋅ would then encode all the possible ways scientist could have cited scientist , through all their respective publications. In most cases, though, Ξ is taken to be out ⋅ in , where in is the observed in-degree of agent , and out its observed out degree. This hypergeometric network model corresponds to a soft con guration model [16], and de nes a network model that preserves the observed degree sequences in expectation [9]. The two models described so far are both characterized by the absence of sampling biases, i.e., interaction preferences between speci c vertex pairs that go beyond what is prescribed by the number of edge-stubs and degrees. GHypEG further expands this formulation modifying the hypergeometric con guration model with additional information available about the system. Speci cally, the probability of connecting two vertices depends not only on the observed degrees (i.e., number of stubs) but also on an independent propensity of two vertices to be connected. Such propensities introduce non-degree related e ects into the model. This result is achieved by changing the odds of connecting a pair of vertices instead of another. The right side of Fig. 1 illustrates this case, where is most likely to connect with vertex , even though has only one available stub.
We collect these edge propensities in a matrix . The ratio between any two elements Ω and Ω of the propensity matrix gives the odds-ratio of observing an edge between vertices and instead of and , independently of the degrees of the vertices. The probability of a graph  depends on the stubs' con guration speci ed by , and on the odds de ned by . Such a probability distribution is described by the multivariate Wallenius' non-central hypergeometric distribution [14,30]: with Ω = ∑ , Ω (Ξ − ).

Encoding Hypotheses
By constraining the number of free parameters in Ω Ω Ω, we can specify hypotheses about the data changing the sampling odds for di erent vertex pairs. For example, we can cluster vertices into multiple groups and verify whether the odds of observing interactions within a group and between a group are di erent [8]. The resulting model is similar to a degree-corrected stochastic block model [18]. Alternatively, we can specify Ω Ω Ω to encode endogenous network properties. E.g., Ω Ω Ω can be utilized to encode triadic closure [5], to verify whether pairs whose interactions will close triads in the network are more likely than others. Finally, di erent e ects contributing to the odds of observing some interactions instead of others can be composed together to formulate more complex hypotheses [7]. The advantage of the approach just described is the ability to encode a wide range of statistical hypotheses within the same modelling framework. This has the practical bene t of allowing the comparison of very di erent models, as they can all be formulated by means of the same probability distribution of Eq. (3). Di erent hypotheses are thus encoded by appropriately choosing the free parameters in Ξ Ξ Ξ and Ω Ω Ω.

7/15 (Submitted for publication)
For clarity, we will focus on simple hypotheses such as those described in the previous section. However, the results shown do hold for any combinations of Ξ Ξ Ξ and Ω Ω Ω. In the particular case of encoding group structures, Ω Ω Ω takes the following form: where is the group of agent , is the group of agent , and , is the propensity of sampling an edge between group and group . In the presence of 2 di erent groups of vertices ( , ), there are 3 possible values that Ω can take: , , (assuming that = ). The ratio / gives the odds between sampling an edge within group A and an edge between group A and B given a value of Ξ.

The Likelihood-ratio (LR) test
We now illustrate how the LR test is used to test a null-hypothesis against an alternative hypothesis about the observed system. The data are used to de ne a graph  with adjacency matrix . Let be some statistical hypothesis. Here, we always assume that each hypothesis is de ned by a gHypEG model that can be encoded by a propensity matrix and a combinatorial matrix . Each model is characterized by several free parameters that we want to t to the data , such that the probability of observing  is maximized. This requirement corresponds to performing a maximum likelihood estimation (MLE) of the free parameters.
Likelihood-ratio statistic. Assume now we have two hypotheses we want to test against each other. Let 0 denote the null-hypothesis and let denote the alternative. The corresponding models are de ned in terms of 0 , 0 and , . To test the alternative hypothesis against the null, we use the likelihood-ratio statistic (0, ), de ned as follows: Definition 1 (Likelihood-ratio statistic). Let  be a graph, 0 be the model corresponding to the nullhypothesis, and the model corresponding to the alternative hypothesis. The likelihood ratio statistic (0, ) is given by where ( , | ) = Pr(| , ) denotes the likelihood of model given the network .
Through the likelihood-ratio statistic, we can perform two types of tests. First, we can perform a standard model selection test to compare a simpler model against a more complex model. This test corresponds to verify whether there is enough evidence in the data that justi es the more complex model or whether the simpler model ts the data well enough. In this scenario, the simple model corresponds to the null-hypothesis, while the more complex model to the alternative.
Second, the likelihood-ratio test can be used to perform a goodness-of-t test. This test allows verifying the quality of the t of a model . By de ning the alternative hypothesis with a model full that perfectly reproduces the observed data (in expectation), we can test whether the t of the model is as good as such an over tting model [21]. In the framework of gHypEGs, the alternative hypothesis is obtained by specifying the parameter matrix Ω Ω Ω full such that the expectation of full corresponds to the observation . This model is the maximally complex model that can be speci ed with a gHypEG and has as many free parameters as entries in the adjacency matrix [9].
Let's now assume that the two models corresponding to the alternative and null hypotheses are nested. This means that 0 can be written as a special case of , and 0 as a special case of . Thus, the null-model (with fewer parameters) can be formulated by constraining some of the alternative model parameters. Thanks to Wilks' theorem [24], if the two models are nested, the number of observations is large, and the observations are independent, the distribution of under the null-hypothesis can be written in terms of (0, ) ∶= −2 log( (0, )), and can be approximated by the 2 distribution with as many degrees of freedom as the di erence of degrees of freedoms between the two models. Letting be the di erence of degrees of freedom between the null and the alternative modes, the -value of the likelihood-ratio test between the two hypotheses is computed as follows: p-value ∶= Pr( 2 ( ) ≥ (0, )).
We reject the null hypothesis in favour of the alternative if the -value is smaller than some threshold .
Distribution of under the null-hypothesis. The question that remains to be answered is whether the conditions provided by multi-edge network data allow Wilks' theorem's application. Unfortunately, in most real-world scenarios, the answer to this question is negative. This is a known issue in statistics, where it arises in the context of multinomial goodness-of-t tests [13,19,20,29]. Because Wallenius' multivariate non-central hypergeometric distribution converges to the multinomial distribution (a formal proof can be found in [32]), we use the results obtained for multinomial tests to nd a better approximation for the null distribution of (0, ) than the 2 approximation of Wilks' theorem. Specically, following the work of Smith et al. [29], we propose approximating the distribution of (0, ) with a Beta distribution. In some special cases there exist analytical solutions for [ (0, )] and 2 [ (0, )] [29]. However, in most situations, we resort to a numerical estimation of them. While a general solution would be optimal, thanks to the ability to generate samples provided by gHypEG models, the parameters' numerical estimation can be nevertheless performed with ease.
The result provided by Theorem 1 greatly helps when performing likelihood-ratio tests involving multi-edge network data. In fact, to estimate the full null distribution of the likelihood-ratio statistic numerically, we would need a considerable number of realisations. In the case of large networks, this is infeasible. Exploiting Theorem 1 instead, we only need to estimate the rst two moments of the distribution under the null hypothesis, which can be done reliably with a smaller number of realisation [28]. For example, in Fig. 2, we show the results of applying Kolmogorov-Smirnov's test to compare the LR's distribution statistics against the Beta distribution tted to increasing sample sizes. The example is constructed from a 40 vertices random graph with 500 undirected edges. The edges are generated according to the hypergeometric con guration model, and the likelihood-ratio test is performed comparing a regular model against the generating con guration model. To build the empirical distribution of the likelihood-ratio statistic, we take = 500 000 samples under the null hypothesis, and we compute the parameters of the asymptotic Beta distribution from an increasing number of independent samples. The results show that with a limited sample size ∼ 1000, most of the observations give a p-value for the Kolmogorov-Smirnov test larger than 0.05. I.e., with a limited sample size , the empirical nulldistribution obtained from is not signi cantly di erent from the Beta distribution whose parameters have been estimated from the realizations. That points to the fact that the asymptotic results of Theorem 1 are acceptable even for a nite number of observations and small sample sizes. The R package ghypernet 1 [10] provides an implementation of the likelihood-ratio test for gHypEG models. The package is Open Source and can be obtained from the CRAN R package repository.

Case studies
We provide two short case studies about the application of the likelihood-ratio test. First, we generate a random undirected graph with = 100 vertices and = 400 directed edges uniformly distributed between each vertex pairs. Utilizing the likelihood-ratio test, we can test the null-hypothesis (a) that each vertex has the same potential of interactions against the alternative hypothesis (b) that di erent agents have di erent interaction potentials. As explained in the previous sections, (a) is encoded by a regular model with one parameter, and (b) by a con guration model. This test corresponds to testing that the degree distribution deviates from that of the regular model. We expect that the test returns high p-values because we choose the null-hypothesis to match the random graph's generating process. The results, obtained from 1000 repetitions of the experiment, con rm this hypothesis with a p-value of the median of 0.44. Similarly, we perform the same experiment generating a random undirected graph from the standard con guration model with a heterogeneous degree distribution. To ease the comparison with the example above, we de ne it by a degree sequence sampled from a geometric distribution with mean chosen such that the expected number of edges in the graph is = 400. This e ectively corresponds to generating data according to the hypothesis (b). In this case, we expect small p-values from the same test done before because the generating model of the data corresponds to the alternative hypothesis. Repeating the experiment 1000 times, for the largest recorded we obtain a p-value < 1 − 20.
The second experiment we perform is with an empirical graph. We use Zachary's Karate Club (ZKC) [31] as a test case. ZKC consists of 34 vertices and 231 undirected multi-edges. As with most empirical graphs, its degree sequence is skewed (empirical skewness is 1.456). Hence, we expect that the test we performed before -comparing the null-hypothesis of a regular model against the hypergeometric con guration model -should reject the null-hypothesis. Performing such a likelihood-ratio test gives a p-value < 1 − 20, which con rms our expectations.
We can further exploit this example to compare the empirical distribution of (0, ) under the null hypothesis with the 2 distribution and the Beta distribution. The result is shown in Fig. 3a. Note that the value of (0, ) for ZKC is 300.338, which is out of scale on the right side of the x-axis of Fig. 3a. In this case, it appears that the Beta distribution and the 2 distribution provide similar ts. However, when we perform a two-sided Kolmogorov-Smirnov test, we get a p-value of 1.45 − 05 for the 2 distribution, which means that we can reject the null hypothesis that the empirical distribution follows a 2 . Performing the same test for the Beta distribution, we get a p-value of 0.4211. That means that we cannot reject the null-hypothesis that the empirical distribution follows a Beta. While this is not enough to claim that the distribution of (0, ) is a Beta, it gives con dence on using the asymptotic results of Theorem 1.
We perform a second experiment to highlight how much, in extreme cases, the 2 distribution can deviate from the empirical distribution of (0, ) under the null-hypothesis. For the ZKC, we now perform a goodness-of-t test of the hypergeometric con guration model. The alternative hypothesis is thus encoded by the maximally complex model tted by gHypEGs and results in a model that xes the expected graph as the observed one, as explained above. The t of its parameters is performed according to what described in [9]. The test can be interpreted as how well the null-model ts the data, which is entirely encoded in the full model. This test results in a p-value of 1.69 − 30. That means that the con guration model is not a good model for the ZKC. This result is hardly surprising, given the well-known community structure present in the empirical graph. In Fig. 3b, we show the empirical distribution of (0, ). There, we notice that while the Beta distribution provides a visually good t, the 2 distribution is heavily shifted to the right. The two-sample Kolmogorov-Smirnov test con rms this result, providing a p-value of 0.169 for the Beta distribution, and < 2.2 − 16 for the 2 distribution. In this last example, it is essential to note that using the 2 distribution would provide a misleading result. Comparing the value of (0, ) for the empirical graph, we see that it is on the right tail of the 2 distribution. Computing a p-value from this distribution would result in a p-value of ≈ 0.005. That means that in this case, we would only weakly reject the null-hypothesis, giving the wrong impression that the ZKC could come from an extreme realization of a simple con guration model. However, this is ruled out by looking at the likelihood-ratio statistics' empirical distribution or simply comparing an empirical graph with a realization from the hypergeometric con guration model [11].  Figure 3: Empirical distribution of (0, 2) for two likelihood-ratio tests perform on ZKC. In the top gure, we perform a likelihood-ratio test for the null-hypothesis that ZKC comes from a regular model against the alternative hypothesis that ZKC comes from a con guration model. The results show that there is strong evidence to reject the null hypothesis. In the bottom gure, we perform a goodness-of-t test for the hypergeometric con guration model on ZKC. Also, in this case, the results show a bad t of the null-model. While in the top-gure, the 2 and the Beta approximations of the likelihood-ratio statistic's empirical distribution give a relatively good t, in the bottom case, it is clear that the 2 does not approximate well the empirical distribution. The shaded area denotes the empirical distribution of (0, 2) under the null-hypothesis, the orange line its 2 approximation, and the green line its Beta approximation. The vertical line denotes the value of the likelihood-ratio statistic for ZKC. In the top gure, such a line is out of the boundaries on the right side and hence not plotted.

Discussion
The study of complex systems is intertwined with network science and advanced multivariate statistics. Hypothesis testing and model selection methods, in particular, need to account for the complexity underlying observations from such systems. Because interactions between system agents tend not to be independent, many standard statistical methods should be employed with care when dealing with network data.
This article has investigated how the likelihood-ratio test needs to be adapted to deal with network models. Likelihood-ratio tests provide a practical methodology for selecting di erent network models and testing statistical hypotheses. However, the characteristics of multi-edge networks require us to adapt the test null-distribution to account for the underlying complexity of network data. When this is not done, we incur the risk of over-(or under-) estimating the p-values of the statistical test, generating contradictory results, as shown in the case study above. With Theorem 1, we provide the means to correctly estimate the p-values for likelihood-ratio tests by means of a Beta distribution. Finally, we provide an implementation of the methods described through the Open Source R package ghypernet. Even though our analysis is focused on the likelihood-ratio test, similar issues may arise with other statistical tools applied to complex networks.
The main limitation of the results presented in this article is the need to numerically estimate the rst two empirical moments of the statistic's null distribution. Although this can be performed easily using our implementation, we will investigate analytical asymptotic estimates for the parameters needed in future research.