Bayesian Testing of Scientific Expectations Under Exponential Random Graph Models

The exponential random graph model (ERGM) is a commonly used statistical framework for studying the determinants of tie formation from social network data. To test scientific theories under the ERGM framework, researchers generally rely on traditional significance testing using p values. This methodology has certain limitations, however, such as its inconsistent behavior when the null hypothesis is true, its inability to quantify evidence in favor of a null hypothesis, and its inability to directly test multiple hypotheses with competing equality and/or order constraints on the parameters of interest. To tackle these shortcomings, this paper presents Bayes factors and posterior probabilities for testing scientific expectations under a Bayesian framework. The methodology is implemented in the R package 'BFpack'. Its applicability is illustrated using empirical collaboration networks and policy networks.


Introduction
The exponential random graph model (ERGM) is one of the most widely used statistical frameworks for explaining and predicting the formation of ties between actors in a network based on the characteristics of the actors and the endogenous characteristics of the network. The framework has resulted in many new insights in various scientific fields, including brain networks in neuroscience (Simpson et al., 2011), board interlocks between firms in management research (Kim et al., 2016), genetic and metabolic networks in biology (Saul & Filkov, 2007), and adolescent friendship networks in sociology (Goodreau et al., 2009).
A statistical analysis under an ERGM is typically executed by first obtaining the maximum likelihood estimates of the network parameters, which quantify the relative importance of the predictor variables, and the corresponding standard errors, which quantify the statistical uncertainty of the estimates given the information in the sample. Subsequently, to test which predictor variables have a statistical effect on the formation of ties in the network, classical p values are often used. When testing a null hypothesis of whether an ERGM parameter equals zero or not, the p value is defined as the probability of observing effects that are at least as extreme as the estimated effect from the observed network under the assumption that the true effect equals zero. There is an increasing debate, however, about the use of p values for statistical hypothesis testing (Benjamin et al., 2018; Mulder & Wagenmakers, 2016; Rouder et al., 2009). For applied researchers, a potential issue is their difficult interpretation, as p values are often misinterpreted in statistical practice (Greenland et al., 2016). Besides this practical issue, p values also have more fundamental limitations, which restrict their usefulness for applied social network research.
A first limitation is that p values cannot be used to quantify the evidence in the data in favor of a null hypothesis; they can only be used to falsify a null hypothesis. This is a limitation when the null hypothesis reflects an important scientific expectation or theory. Leifeld & Schneider (2012), for instance, investigated whether political actors with similar policy preferences form information exchange ties in policy networks. They hypothesized that preference similarity had no additional effect on the formation of ties when covariates based on institutional, relational, and social opportunity structures were also present in the model. When they tested this under the ERGM framework, a p value larger than the significance level was obtained, resulting in a state of disbelief because there was not enough evidence in the data to reject the null, but it was also not possible to state that there was evidence in favor of the null. The cause of this problem is that the p value cannot distinguish between the absence of evidence in the data and evidence of absence of an effect (Dienes, 2014). Thus, if a null hypothesis is of interest that is substantively meaningful and plausible, classical p values may not be preferred.
A second limitation is that the p value is statistically inconsistent when the null is true. This is a direct consequence of the central property that the p value is uniformly distributed under the null. This property is crucial in order to control the type I error probability of incorrectly rejecting the null using a prespecified significance level. A consequence is, however, that there is always a fixed probability (equal to the significance level, typically .05) of incorrectly rejecting a true null hypothesis, even when the size of the sampled network goes to infinity. Practically, this implies that when sampling data from a huge network (which may be very costly and time consuming) and the effect of a key predictor variable on the formation of ties equals zero, there is still a strictly positive probability, equal to the prespecified significance level, of incorrectly rejecting the null hypothesis.
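To make this point concrete, the following small simulation (an illustration of the general phenomenon, not taken from the paper; the function name and settings are our own) draws data from a model in which the null hypothesis of a zero mean is true and records how often a two-sided z test rejects at the .05 level. The rejection rate stays at the significance level no matter how large the sample becomes.

```python
import math
import random

def rejection_rate_under_null(n_obs, alpha=0.05, n_sims=20000, seed=1):
    """Simulate two-sided z tests of H0: mu = 0 when H0 is true (unit variance).
    Returns the fraction of simulated datasets in which H0 is rejected."""
    rng = random.Random(seed)
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal cdf
    rejections = 0
    for _ in range(n_sims):
        # sampling distribution of the sample mean under the null: N(0, 1/n)
        xbar = rng.gauss(0.0, 1.0 / math.sqrt(n_obs))
        z = xbar * math.sqrt(n_obs)
        p_value = 2.0 * (1.0 - Phi(abs(z)))
        rejections += (p_value < alpha)
    return rejections / n_sims

# the type I error rate does not vanish as the sample size grows
print(rejection_rate_under_null(10), rejection_rate_under_null(100000))
```

Both rejection rates hover around .05: collecting more data does not drive the probability of a false rejection to zero, which is exactly the inconsistency described above.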
A final important limitation is that p values are not designed for testing multiple hypotheses. Thus, when multiple hypotheses are formulated based on competing scientific expectations, it is not straightforward to test these hypotheses in a direct manner against each other using p values. Moreover, when a hypothesis also contains order constraints on the parameters of interest, which is often the case in applied research as expectations are often formulated using 'larger than' or 'smaller than' statements (Hoijtink, 2011), p values can only be used for a limited set of hypothesis tests (Silvapulle & Sen, 2004). For example, p values are not available for directly testing nonnested order hypotheses against each other. Testing order hypotheses using post hoc tests is also problematic for another reason: we can obtain conflicting conclusions (e. g., 'β 1 = β 2 ' and 'β 1 = β 3 ' cannot be rejected but 'β 2 = β 3 ' can be rejected), which is untenable in applied statistical practice. Nevertheless, testing order hypotheses is very relevant for social network research. For example, Leifeld (2018) investigated the effect of collegial relationships on the tendency to co-author papers in scientific collaboration networks. Supervisor-supervisee pairs were expected to show a stronger tendency to collaborate on publications than colleagues only working in the same team; same-team colleagues were in turn expected to exhibit a stronger tendency to collaborate than colleagues merely working in the same institution; and people in the same institution were expected to co-author at higher rates than people affiliated with different institutions. In such cases, inferential techniques for order hypotheses are desirable.
To resolve these issues and limitations of p values, this paper proposes a flexible Bayesian test under the ERGM framework. The methodology uses Bayes factors and posterior probabilities, which (i) can be used to quantify the relative evidence in the data in favor of a null hypothesis, (ii) are consistent under general conditions, and (iii) are broadly applicable for testing multiple hypotheses simultaneously and for testing hypotheses with order constraints on the parameters of interest. The computation is relatively fast due to its reliance on Gaussian approximations of the posterior (following large sample theory). To facilitate its usability, the methodology is implemented in the R package BFpack (Mulder, Gu, et al., 2021). The main function BF only requires a fitted (Bayesian) ERGM, obtained using either the ergm package (Hunter et al., 2008) or the Bergm package (Caimo & Friel, 2014).
The paper is organized as follows. Section 2 provides some background information about ERGMs and about the formulation of hypotheses which may involve order constraints on the parameters of interest. Two motivating empirical applications are then discussed in the context of scientific collaboration networks and information exchange in policy networks. Section 3 describes the methodology for Bayesian hypothesis testing under ERGMs and the implementation in the R package BFpack (Mulder, Gu, et al., 2021). In Section 4, the numerical behavior of the test is illustrated when testing a null hypothesis and when testing order hypotheses. In Section 5, we apply the methodology in two empirical applications. Section 6 ends the paper with concluding remarks and a discussion.

Statistical hypotheses under the ERGM framework
Under an ERGM, the probability of an adjacency matrix Y of a network of N actors, where Y ij = 1 denotes a tie (or edge) between actors i and j, and Y ij = 0 denotes the absence of a tie between i and j, is given by

p(Y | β) = exp{β' s(Y, X)} / Σ Y*∈Y exp{β' s(Y*, X)},    (1)

where Y denotes the set of all possible adjacency matrices that can be observed, and s(Y, X) denotes a vector of K sufficient statistics which are assumed to explain the connectivity in the network given the endogenous characteristics of the network Y, such as the tendency to form ties, the tendency to form triangles, or the in-degrees or out-degrees of actors, and the characteristics of the actors, which are summarized in X. The coefficients in the vector β of length K quantify the relative importance of the K sufficient statistics in the formation of ties in the network. Maximum likelihood estimation (Snijders, 2002; Hunter & Handcock, 2006) or Bayesian estimation (Caimo & Friel, 2013) can be used to estimate the coefficients given the observed network.
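For intuition, the normalizing constant in this probability sums over every possible network, which is only tractable for tiny graphs. The sketch below (illustrative; all function and variable names are our own) enumerates all undirected networks among three actors under a model whose single sufficient statistic is the number of edges and verifies that the resulting probabilities form a proper distribution.

```python
import itertools
import math

def ergm_probabilities(n_actors, beta_edges):
    """Enumerate all undirected networks on n_actors and compute their
    ERGM probabilities for a model with a single 'edges' statistic."""
    n_dyads = n_actors * (n_actors - 1) // 2
    # s(Y) = number of edges; unnormalized weight is exp(beta' s(Y))
    weights = [math.exp(beta_edges * sum(cfg))
               for cfg in itertools.product([0, 1], repeat=n_dyads)]
    kappa = sum(weights)  # normalizing constant over all 2^n_dyads networks
    return [w / kappa for w in weights]

probs = ergm_probabilities(3, beta_edges=-1.0)
# probabilities over the 2^3 = 8 possible networks sum to one
print(abs(sum(probs) - 1.0) < 1e-12)
```

For an edges-only model the ERGM reduces to independent Bernoulli ties with tie probability logistic(β edges), which the enumeration reproduces exactly; richer statistics such as triangles break this independence and are the reason the normalizing constant is intractable for realistic networks.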

Application A: Information exchange in policy networks
The first motivating application is the question whether political actors who share similar policy beliefs and political preferences also exchange information in policy networks. Leifeld & Schneider (2012) employed an ERGM to examine the institutional, relational, and social opportunity structures guiding the exchange of information among political actors in a national-level policy network on a contested issue. A policy network is comprised of the interest groups, government agencies, scientific bodies, and other organizations exerting influence over the policy formulation and decision-making process. Actors can exchange technical and scientific or political and strategic information. Explaining under what conditions actors collaborate is instrumental for understanding why certain bills are adopted and why they contain specific elements, for example leaning towards interest group positions. Leifeld & Schneider (2012) found that any two actors' similarity in policy preferences explained political/strategic information exchange between them. But this effect vanished when opportunity structures like shared committee memberships or shared partners were included in the model. This was seemingly at odds with the prior literature, which had posited that collaboration takes place at higher rates within ideologically homogeneous regions of the network, not across political divides (e. g., Henry et al., 2011). Would the lack of statistical significance once control variables were included have been evidence against an association between preference similarity and tie formation? Leifeld & Schneider (2012) concluded that the effect of preference similarity was "absorbed" by opportunity structures, meaning that opportunity structures were mainly conducive to tie formation between actors with similar preferences, hence preferences per se were no longer a significant predictor themselves. The original study did not benefit from a hypothesis testing framework that would have permitted testing the hypothesis that no association existed, controlling for other effects. Below in Section 5, we revisit this question using the original specification and test the hypothesis that there was no effect of preference similarity on information exchange.
H 1 : No effect of preference similarity. The first hypothesis assumes that preference similarity has no effect on political/strategic information exchange in a policy network while controlling for the number of edges, governmental target, scientific source, number of common committees, scientific communication, interest group homophily, influence attribution, geometrically weighted edge-wise shared partners (GWESP), geometrically weighted dyad-wise shared partners (GWDSP), and reciprocity. Mathematically, the hypothesis can be written as

H 1 : β pref sim = 0.

Given the notation of a constrained hypothesis in (2), the vector of coefficients is given by β = (β edges , β pref sim , β gov target , β sci source , β committee , β sci comm , β int group homophily , β infl attr , β GWESP , β GWDSP , β reciprocity )', and the equality matrix under H 1 is given by R E 1 = (0, 1, 0, . . ., 0).
H 2 : Complement. The second hypothesis assumes that preference similarity has a non-zero effect on information exchange while controlling for the same covariates as under H 1 . Mathematically, the hypothesis can be written as

H 2 : β pref sim ≠ 0.

Note that even though hypotheses H 1 and H 2 are often denoted by the null hypothesis H 0 and the alternative hypothesis H 1 , respectively, we use the notation in (2) because the methodology is not limited to testing the traditional null and alternative hypothesis.
Based on the analysis of Leifeld & Schneider (2012), the hypothesis test resulted in a p value of 0.798305, which would result in a nonsignificant effect using a significance level of .05. Because p values can only be used to falsify the null hypothesis, we are in a state of disbelief as there is not enough evidence in the data to reject the null, but we also cannot claim there is evidence in favor of the null that an effect is absent. The underlying cause is that the classical test cannot distinguish between absence of evidence and evidence of absence of an effect (Dienes, 2014).
A secondary question of interest was the ordering of effect sizes for different theoretical elements. Leifeld & Schneider (2012) were primarily interested in opportunity structures, such as shared committees, followed by influence attribution, followed by preference similarity. We will explore in Section 4 whether this theoretical importance ranking of effects translates into an order of effect sizes using an order hypothesis test of β pref sim < β infl attr < β committee .

Application B: Institutional and functional overlap in co-authorship networks
The second motivating application is the nesting of co-authorship ties in institutional settings. Leifeld (2018) used an ERGM to model the co-authorship networks among all political scientists in Germany and Switzerland during a five-year period. Of particular interest in the analysis was the question whether nomological and idiographic researchers, with their different publication norms, would find themselves separated into distinct communities in the co-authorship network. To answer this question, the study also took into account institutional incentives to collaborate: The more functional overlap existed between any two researchers, the greater the expected odds of collaboration between them, ranging from being in the same discipline and country, working at the same institution, being on the same team, all the way down to direct supervision relationships. But due to the lack of a nested hypothesis testing framework, the original study formulated the effect of increasing levels of institutional and functional overlap as three separate hypotheses, one for each level of functional overlap, without regard for the nested nature of the hypotheses. It concluded from increasingly positive coefficients that more functional overlap must lead to greater odds of collaboration. Here, we evaluate the nested nature of expected collaboration vis-a-vis alternative hypotheses stating that all levels of functional overlap have identical positive effects on co-authorship tie formation and that none of the different functional overlaps exerts any effect on tie formation.
H 1 : Increasingly positive effect of functional overlap. The first hypothesis posits that researchers with increasing functional overlap collaborate at higher rates: Researchers with affiliations at the same institution (e. g., university or research institute) are hypothesized to have greater odds of forming collaboration ties than researchers without a shared affiliation. Researchers were expected to form ties at even higher rates within teams, such as chair groups or junior research groups, than within institutions, compared to pairs of researchers without a shared affiliation. Within teams, researcher pairs where one of the researchers was a professor (usually assistant, associate, or full professor) and the other one was not a professor (usually a postdoctoral researcher or PhD student) were expected to have greater odds of co-authorship than the wider team they were embedded in, relative to the reference group of not sharing an affiliation. Mathematically, the hypothesis can be formulated as

H 1 : β supervision > β same team > β same affiliation > 0.

Like in the original study, we control for the number of edges in the network, having shared partners (GWESP), the degree distribution (GWDegree), seniority, gender, gender homophily, geographic distance, topic similarity, the share of English articles among an author's publications, the number of publications, and the similarity between two researchers in their share of English articles. Given the notation of a constrained hypothesis in (2), the vector of coefficients is given by β = (β edges , β supervision , β same team , β same affiliation , β GWESP , β GWDegree , β seniority , β gender , β gender homophily , β geographic distance , β topic similarity , β share Eng.art. , β number publications , β Eng.art.similarity )'.
The matrix with the coefficients for the order hypothesis then contains three rows, encoding the constraints β supervision − β same team > 0, β same team − β same affiliation > 0, and β same affiliation > 0.

H 2 : Positive and constant effect of functional overlap. The second hypothesis posits that researchers who form a supervisor-supervisee pair, researchers on the same team (and who are not in a supervisor-supervisee pair), and researchers having the same affiliation (and who are not members of the same team) are all equally likely to publish together, and these pairs are expected to be more likely to publish together than people who do not have the same affiliation. Mathematically, the hypothesis can be formulated as

H 2 : β supervision = β same team = β same affiliation > 0.

We control for the same covariates as under H 1 . The matrix of equality constraints under this hypothesis contains two rows, encoding β supervision − β same team = 0 and β same team − β same affiliation = 0, and the matrix for the one-sided constraint contains a single row, encoding β same affiliation > 0.

H 3 : No functional overlap effect. The third hypothesis assumes there is no effect of any level of functional overlap, which implies that researchers who form a supervisor-supervisee pair, researchers on the same team (and who do not form a supervisor-supervisee pair), researchers having the same affiliation (and who are not members of the same team), and researchers who do not have the same affiliation all have the same average tendency to publish together. Mathematically, the hypothesis can be formulated as

H 3 : β supervision = β same team = β same affiliation = 0.

We control for the same covariates as under H 1 . The matrix of equality constraints under this hypothesis contains three rows, setting each of these three coefficients to zero.

H 4 : Complement. The fourth hypothesis states that none of the assumptions under H 1 , H 2 , or H 3 is true. We control for the same covariates as under H 1 . Note that hypothesis H 4 covers the parameter space of the coefficients β which does not satisfy any of the constraints under H 1 , H 2 , and H 3 .
Bayesian hypothesis testing

Background information
A Bayesian hypothesis test is proposed that is based on the marginal (or integrated) likelihood of the observed network under hypothesis H t , which is defined by

p t (Y) = ∫ p(Y | β t ) p t (β t ) dβ t ,

where β t denotes the vector of the free parameters under H t , and p t (β t ) denotes the prior probability distribution of the free parameters under H t , which reflects the prior expectation and uncertainty about the free parameters before observing the data. Under an ERGM, the marginal likelihood can be computed using the algorithm of Bouranis et al. (2018). If we consider two competing hypotheses, the ratio of the respective marginal likelihoods is also known as the Bayes factor. For example, the Bayes factor between hypothesis H 1 and H 2 is defined by

B 12 = p 1 (Y) / p 2 (Y).

Table 1 provides some general guidelines for interpretation which may be useful for researchers who are unfamiliar with Bayes factors. Note that these guidelines should not be applied in a strict sense as Bayes factors quantify the relative evidence between hypotheses on a continuous scale. Finally, we note here that Bayes factors are consistent under mild conditions (e. g., O'Hagan, 1995), which implies that the evidence for the true hypothesis goes to infinity as the sample size grows; thus, if H 1 is true, B 12 → ∞ as the network grows. Furthermore, in a Bayesian framework one can specify prior probabilities for the hypotheses, denoted by P (H t ) for t = 1, . . ., T , which reflect the plausibility of the hypotheses (such as scientific expectations or substantive theories) before observing the data. A common default choice is the use of equal prior probabilities, i. e., P (H t ) = 1/T for t = 1, . . ., T . Subsequently, when data are observed, the evidence between the hypotheses in the data is quantified in the Bayes factor, which can be used to update the prior odds of the hypotheses to obtain the posterior odds via

P (H 1 | Y) / P (H 2 | Y) = B 12 × P (H 1 ) / P (H 2 ).

The posterior probabilities of the hypotheses can then be computed via

P (H t | Y) = P (H t ) p t (Y) / Σ t' P (H t' ) p t' (Y).
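The updating rule is simple arithmetic once the marginal likelihoods are available. A minimal sketch (the function name is our own) that converts marginal likelihoods and prior probabilities into posterior hypothesis probabilities:

```python
def posterior_hypothesis_probs(marginal_likelihoods, prior_probs=None):
    """Turn marginal likelihoods p_t(Y) and prior probabilities P(H_t)
    into posterior probabilities P(H_t | Y)."""
    T = len(marginal_likelihoods)
    if prior_probs is None:
        prior_probs = [1.0 / T] * T  # default: equal prior probabilities
    joint = [m * p for m, p in zip(marginal_likelihoods, prior_probs)]
    total = sum(joint)
    return [j / total for j in joint]

# The Bayes factor B_12 = p_1(Y) / p_2(Y) equals 3 here; with equal prior
# probabilities the posterior odds equal the Bayes factor.
post = posterior_hypothesis_probs([3.0, 1.0])
print(post)  # [0.75, 0.25]
```

With unequal prior probabilities, the same function reproduces the posterior odds rule: the prior odds are multiplied by the Bayes factor.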

Prior specification and Bayes factor computation
In order to compute the marginal likelihoods of the hypotheses and, subsequently, the Bayes factors between the hypotheses, one only needs to formulate the prior distributions for the free parameters under the hypotheses. To simplify this endeavor, we use an encompassing prior approach (Klugkist & Hoijtink, 2007) where an unconstrained (or encompassing) prior, denoted by p u (β), is formulated under the full unconstrained ERGM, denoted by H u , and truncations are used under the constrained hypotheses according to

p t (β) ∝ p u (β) I(β ∈ H t ),

where I(•) denotes the indicator function. Thus, instead of needing to formulate prior distributions under all T hypotheses, we only need to formulate an unconstrained prior. Another advantage of this choice of the prior is that it substantially simplifies the computation of the Bayes factors as the marginal likelihoods do not have to be computed; instead, only the prior and posterior distribution under the unconstrained model are needed. In fact, the Bayes factor between a constrained hypothesis H t and the unconstrained model can be written as an extended Savage-Dickey density ratio (e. g., Dickey, 1971; Mulder & Gelissen, 2021):

B tu = [ p u (β E t = 0 | Y) / p u (β E t = 0) ] × [ P u (β O t > 0 | β E t = 0, Y) / P u (β O t > 0 | β E t = 0) ],    (8)

where β E t = R E t β and β O t = R O t β denote the linear combinations of the coefficients that are constrained with equalities and order constraints, respectively, under H t . Thus, we only need to compute the prior and posterior density of the transformed parameter β E t at the equality constrained null value, and the prior and posterior probability that the order constraints hold conditional on the equality constraints under the unconstrained model.
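For a single equality-constrained coefficient with no order constraints, the Savage-Dickey ratio reduces to the posterior density divided by the prior density at the null value. A one-dimensional sketch with a normal prior and an (approximate) normal posterior (function names and the example numbers are our own):

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def savage_dickey_bf(prior_mean, prior_sd, post_mean, post_sd, null_value=0.0):
    """Bayes factor of H_t: beta = null_value against the unconstrained model,
    computed as the ratio of posterior to prior density at the null value."""
    return (normal_pdf(null_value, post_mean, post_sd)
            / normal_pdf(null_value, prior_mean, prior_sd))

# posterior concentrated near zero -> evidence in favor of the point null
print(savage_dickey_bf(0.0, 1.0, 0.05, 0.1) > 1)
# posterior far from zero -> evidence against the point null
print(savage_dickey_bf(0.0, 1.0, 2.0, 0.5) < 1)
```

When order constraints are also present, the extended ratio additionally multiplies this density ratio by the conditional posterior versus prior probability that the order constraints hold.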
An unconstrained multivariate normal prior distribution is specified for the coefficients which is based on the g prior (Zellner, 1986; Liang et al., 2008; Hanson et al., 2014). The prior mean is set to zero. This has two important motivations. First, a prior mean of 0 implies that, a priori, values close to zero are more likely than values far away from zero. This corresponds with applied social research where small effects are generally more likely than large effects. Second, a prior mean of 0 implies that a priori negative values are equally likely as positive values, which is an objective choice for one-sided testing (Mulder et al., 2010). Furthermore, following g prior methodology, the prior covariance matrix is based on the implied covariance structure of the predictor variables that is rescaled to one dyadic observation, i. e., D(X'X)^(-1), where D is the total number of possible dyads in the network and X is the design matrix under the unconstrained model. This results in a weakly informative prior that is dominated by the information in the data. It also ensures that the prior is not "arbitrarily vague", which should be avoided when testing hypotheses using Bayes factors (Liang et al., 2008; Lindley, 1957; Bartlett, 1957). Hence, the unconstrained prior can be formulated as

p u (β | X) = N (0, D(X'X)^(-1)).    (9)
Given the multivariate normal prior for the ERGM parameters, a sample from the unconstrained posterior can be obtained using the Bergm package (Caimo & Friel, 2014). Following large sample theory, a posterior can often be well approximated using a Gaussian approximation (Gelman et al., 2013, Ch. 4). Relatedly, note that the p values under ERGMs are also typically computed using a Wald-type test. For this reason, we approximate the unconstrained posterior of the ERGM parameters with a multivariate Gaussian distribution, i. e.,

p u (β | Y) ≈ N (β̂, Σ̂),

where β̂ and Σ̂ denote the approximated posterior mean and the approximated posterior covariance matrix, respectively. Using this approximation, computing the posterior density and posterior probability in (8) is straightforward, and Bayes factors between the hypotheses of interest can easily be obtained. Finally, note that it is recommended that the input variables are appropriately scaled, allowing the respective coefficients to be tested against each other. This is in line with the general practice of hypothesis testing. Given the relation of the ERGM with logistic regression, we recommend the scaling that was suggested by Gelman et al. (2008), where binary (dummy) variables are shifted to have a mean of 0 and differ by 1, and other variables are centered and scaled to have a standard deviation of 0.5 (so that continuous variables have the same scale as symmetric binary variables with equally sized categories).
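The recommended rescaling can be sketched as follows (a sketch of the Gelman et al. (2008) convention with our own function name; whether the sample or population standard deviation is used is an implementation choice we make here for illustration):

```python
import statistics

def rescale_predictor(values, binary):
    """Rescale an input variable: binary variables are shifted to have mean 0
    (their two observed values then differ by 1); other variables are centered
    and scaled to have standard deviation 0.5."""
    mean = statistics.fmean(values)
    if binary:
        return [v - mean for v in values]
    sd = statistics.pstdev(values)  # population sd of the observed values
    return [0.5 * (v - mean) / sd for v in values]

x_binary = [0, 0, 1, 1]
x_cont = [1.0, 2.0, 3.0, 4.0]
print(rescale_predictor(x_binary, True))   # [-0.5, -0.5, 0.5, 0.5]
print(statistics.pstdev(rescale_predictor(x_cont, False)))  # 0.5
```

After this transformation, a balanced dummy variable and a continuous variable occupy the same scale, so their coefficients can meaningfully be compared and constrained against each other.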

Software implementation in BFpack
The proposed methodology for testing hypotheses using the Bayes factor is implemented in the R package BFpack (Mulder, Gu, et al., 2021). This package contains a variety of Bayes factor testing procedures for equality/order hypotheses for different types of models, including (but not limited to) (multivariate) multiple regression, (multivariate) analysis of (co)variance, generalized linear models, and correlational analysis. The main function is called BF, which only requires a fitted model in the form of an R object. Hypotheses can be formulated under this model as character strings using the hypothesis argument.
To compute the Bayes factors and posterior probabilities among a set of competing hypotheses under an ERGM, a fitted ERGM object of class ergm (from the ergm package; Hunter et al., 2008) or a fitted Bayesian ERGM of class bergm (from the Bergm package; Caimo & Friel, 2014) needs to be plugged into the BF function. The function then extracts the necessary ingredients to compute the Bayes factors and posterior probabilities. Note that when a bergm object is plugged into the function (regardless of the prior that was used to obtain this object), a new Bayesian ERGM is fitted using the proposed prior from which the Bayes factors and posterior probabilities are computed. To get accurate estimates of the Bayes factors and posterior probabilities, it is recommended to collect a sufficient number of posterior draws using the bergm function. Throughout this paper we set main.iter = 100000 (instead of the default setting of 1000 draws), which can also be included as an argument in the BF function and is then passed on to the bergm function.
In addition to the confirmatory hypothesis test of multiple constrained hypotheses of the form (2), BFpack also computes the posterior probabilities of a zero effect, a negative effect, and a positive effect by default, i. e.,

H 0 : β k = 0 versus H 1 : β k < 0 versus H 2 : β k > 0,

for all coefficients β 1 , . . ., β K , using equal prior probabilities. The output of these tests can be used for testing the direction of the individual effects on the formation of ties in the network.
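Under the Gaussian posterior approximation, such exploratory posterior probabilities can be sketched with elementary formulas: a Savage-Dickey density ratio for the zero effect and posterior-versus-prior tail probabilities for the signed effects. This is a stylized one-dimensional illustration under a zero-mean normal prior, not the exact BFpack implementation; all names and numbers are our own.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exploratory_probs(prior_sd, post_mean, post_sd):
    """Posterior probabilities of beta = 0, beta < 0, and beta > 0 under equal
    prior probabilities, given a zero-mean normal prior and a normal posterior
    approximation for a single coefficient."""
    dens = lambda x, m, s: math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    # relative evidence for the point null: Savage-Dickey density ratio
    bf_zero = dens(0.0, post_mean, post_sd) / dens(0.0, 0.0, prior_sd)
    # one-sided Bayes factors: posterior vs prior probability of each sign
    bf_neg = phi(-post_mean / post_sd) / 0.5          # prior mass below 0 is 1/2
    bf_pos = (1.0 - phi(-post_mean / post_sd)) / 0.5  # prior mass above 0 is 1/2
    total = bf_zero + bf_neg + bf_pos
    return bf_zero / total, bf_neg / total, bf_pos / total

p0, pneg, ppos = exploratory_probs(prior_sd=1.0, post_mean=0.8, post_sd=0.2)
print(ppos > p0 > pneg)  # a clearly positive effect dominates
```

Each Bayes factor is taken against the same unconstrained model, so with equal prior probabilities the posterior probabilities are simply the normalized Bayes factors.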

Numerical performance
Simulation studies were conducted to study the behavior of the proposed Bayesian test. As a base model, an ERGM was considered that is based on a policy network from Leifeld & Schneider (2012) with a simplified model specification using the predictor variables edges, preference similarity, influence attribution, common committees, reciprocity, and geometrically weighted edgewise shared partners, with coefficients set to β edges = −3.44, β pref sim = −0.14, β infl attr = 0.39, β committee = 0.26, β reciprocity = 1.24, and β GWESP = 0.69, respectively. The goals of the simulations were (i) to explore the behavior of the test as a function of different true population values and (ii) to illustrate the consistent behavior of the test where the posterior probability of the true hypothesis gradually goes to 1 as the network size grows.
Two hypothesis tests were considered (for a theoretical justification, see Section 2). First, a traditional null hypothesis test was considered of whether the coefficient of preference similarity equals zero or not, i. e.,

H 1 : β pref sim = 0 versus H 2 : β pref sim ≠ 0.

Second, a multiple order hypothesis test was considered of a specific ordering of three coefficients versus an equality-constrained alternative versus the complement, i. e.,

H 1 : β pref sim < β infl attr < β committee , H 2 : β pref sim = β infl attr = β committee , H 3 : neither H 1 nor H 2 .

The third, complement hypothesis covers the parameter space that is not covered by H 1 or H 2 .
In the simulation for the first test, network data were generated using the above values for the coefficients except that the value for β pref sim was varied on a grid from −0.8 to 0.8. For the second test, network data were generated using the above values except that the values for (β pref sim , β infl attr , β committee ) were varied according to (β, 2β, 3β), where β was varied on a grid from −0.4 to 0.4. Thus, when β is positive, the order hypothesis H 1 is true; when β is zero, the equality hypothesis H 2 is true; and when β is negative, the complement hypothesis H 3 is true. For each grid value, 100 networks were generated, and Bayes factors were computed between the hypotheses, which were transformed to posterior probabilities. This was done for networks of 10 actors, 30 actors, and 90 actors, where the structure of the edge covariates of preference similarity, influence attribution, and common committees was based on the observed covariates from Leifeld & Schneider (2012). The edge covariates were standardized so that the coefficients lie on a normalized scale.
Figure 1 shows the median posterior probability of the first test (left panels) and the multiple order test (right panels) for networks of 10 actors (upper panels), 30 actors (middle panels), and 90 actors (lower panels). Before discussing the results, we note that for the larger networks with 90 actors, the ERGM could often not be fitted using the 'ergm' package for large negative effects in the first test, resulting in the omission of the line for values smaller than −0.2 (lower left panel). For the second test (lower right panel), the model could not be fitted for large negative values for many of the generated networks. The lines for small β values are therefore only based on the generated networks that could be fitted using the 'ergm' package.
For both tests, the figure shows the anticipated behavior where, on average, the true hypothesis receives the largest posterior probability and this probability increases for larger networks. Furthermore, we see that the posterior probability of the equality hypothesis (H 1 in the first test and H 2 in the second test) goes to 1 at a slower rate when it is true than the rate observed when the alternative hypotheses are true. This implies that more data are generally needed to obtain a large posterior probability for a true equality hypothesis than for unconstrained or order-constrained hypotheses. Intuitively, this seems reasonable because it is easier to find evidence that a parameter is not exactly 0 (or that several parameters are not exactly equal) than to find evidence that a parameter is exactly 0 (or that several parameters are exactly equal). This behavior is also often observed for Bayes factors in other testing problems.
For the order hypothesis test, it is interesting to see that smaller effects in the direction of the order hypothesis are needed to acquire the same posterior probabilities as effects of the opposite direction under the complement hypothesis. For example, in the case of a network with N = 30 actors (right middle panel of Figure 1), the posterior probability for the order hypothesis H 1 is largest when the true value is about 0.16, while the posterior probability for the complement hypothesis H 3 is largest when the effect β is lower than −0.24. This can be explained by the fact that the order hypothesis covers a smaller parameter space than the complement hypothesis, and therefore the complement receives a larger penalty for model complexity. In fact, the complexity of order hypotheses is implemented via the prior probability that the constraints hold (e. g., see Mulder et al., 2010). This illustrates that the criterion also behaves as an Occam's razor when assessing order hypotheses.

Figure 1: Median posterior probability of the first precise test (left panels) and the multiple order test (right panels) for networks of 10 actors (upper panels), 30 actors (middle panels), and 90 actors (lower panels), based on 100 randomly generated networks for every grid value. For the generated networks with 90 actors, the ERGM could often not be fitted for negative grid values for both tests; for the first test the lines are therefore omitted.
To see this more clearly, Figure 2 displays the Bayes factor of order hypothesis H1 versus the unconstrained model, which is computed as the posterior probability that the order constraints hold divided by the prior probability that the order constraints hold under the unconstrained model; see equation (8). Thus, in the case of overwhelming evidence for the order constraints, the posterior probability that the order constraints hold is approximately 1, and the Bayes factor of the order hypothesis against the unconstrained model is approximately equal to the reciprocal of the prior probability that the constraints hold. Note that the prior probability depends on the prior covariance structure, which depends on the design matrix via equation (9). For example, if the covariates are independent, the prior probability that the order constraints β_pref.sim < β_inf.att < β_comm.comm hold under the unconstrained model is equal to 1/6, which can be explained by the fact that there are 6 possible orderings of 3 parameters. In this case, the Bayes factor of the order hypothesis against the unconstrained model grows to 6 as the ordered effect size increases. This is also what can be observed in Figure 2 for a slightly different prior covariance matrix (based on the data from Leifeld & Schneider, 2012).

Figure 2: The median Bayes factor of the order hypothesis H1: β_pref.sim < β_inf.att < β_comm.comm against the unconstrained model, which is computed as the ratio of the posterior and prior probabilities that the constraints hold, while varying the ordered effect size on the x axis, for different network sizes of N = 10, 30, and 90 actors.
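The 1/6 prior probability under independent covariates can be verified by simulation. The sketch below is illustrative (it is not the BFpack implementation): it draws from an independent, zero-centered unconstrained prior and counts how often one particular ordering of three parameters holds; the reciprocal of this probability is the maximal attainable Bayes factor for the order hypothesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draws from a zero-centered prior with three independent components;
# any exchangeable prior yields the same ordering probability.
draws = rng.standard_normal(size=(200_000, 3))

# Fraction of prior draws satisfying beta_1 < beta_2 < beta_3.
prior_prob = np.mean((draws[:, 0] < draws[:, 1]) & (draws[:, 1] < draws[:, 2]))

print(round(prior_prob, 3))      # close to 1/6 ≈ 0.167
print(round(1 / prior_prob, 1))  # maximal Bayes factor, close to 6
```

With a correlated prior covariance matrix, as induced by the design matrix of Leifeld & Schneider (2012), the same simulation would give a slightly different prior probability, and hence a slightly different upper bound for the Bayes factor, matching Figure 2.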

Empirical applications (revisited)
Application A: Information exchange in policy networks

The main interest was whether preference similarity plays a role in information sharing in policy networks or whether the effect is absorbed by institutional and relational covariates. This is tested under Model 2 of Leifeld & Schneider (2012). When testing these hypotheses, the Bayes factor between H1 and H2 was equal to B12 = 5.20, which implies that the network was 5.20 times more likely to be generated under a model that assumes that preference similarity equals zero than under a model that assumes that preference similarity can attain any real value. This suggests positive evidence in favor of no effect of preference similarity on tie formation. When assuming equal prior probabilities for the two hypotheses, i.e., P(H1) = P(H2) = .5, the posterior probabilities yield P(H1|Y) = .839 and P(H2|Y) = .161. Thus, after observing the data, there is a posterior probability of approximately 84% for the model that assumes that preference similarity plays no role in tie formation in this policy network. Note that no formal decision needs to be drawn based on these posterior probabilities. But if one were to conclude that H1 is the true hypothesis, there would be a conditional error probability of about 16% of drawing the wrong conclusion given the observed network. The secondary question was the order hypothesis of shared committees, followed by influence attribution, followed by preference similarity, versus a (null) hypothesis of equal effects, versus the complement that neither of these two hypotheses holds; see equation (3). The Bayes factors between the hypotheses and the posterior probabilities (when assuming equal prior probabilities of 1/3 for each hypothesis) are given in Table 2. The order hypothesis H1 clearly receives the most evidence, and has a posterior probability of approximately 98% given the observed network.
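The arithmetic behind these posterior probabilities is straightforward: given the Bayes factor of each hypothesis against a common reference hypothesis and the prior probabilities, the posterior probabilities follow by normalization. A minimal sketch (illustrative only; `posterior_probs` is not a BFpack function):

```python
def posterior_probs(bayes_factors, prior_probs=None):
    """Posterior hypothesis probabilities from the Bayes factors of each
    hypothesis against a common reference hypothesis."""
    if prior_probs is None:
        # Equal prior probabilities by default.
        prior_probs = [1.0 / len(bayes_factors)] * len(bayes_factors)
    weights = [b * p for b, p in zip(bayes_factors, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]

# Application A, first test: B12 = 5.20, so relative to H2 the Bayes
# factors of (H1, H2) are (5.20, 1).
p1, p2 = posterior_probs([5.20, 1.0])
print(round(p1, 3), round(p2, 3))  # 0.839 0.161, as reported in the text
```

The same normalization yields the posterior probabilities for the three hypotheses of the second test in Table 2 when each hypothesis receives a prior probability of 1/3.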
We end this application by showing the accuracy of the Gaussian approximation of the posterior of preference similarity, common committees, and influence attribution based on the posterior draws from the bergm function. Figure 3 plots the estimated posterior (dashed black line) and the Gaussian approximation (solid red line). The figure shows that the posterior can be well approximated using a Gaussian distribution. Furthermore, Table 3 provides the additional (standard) output of BFpack on whether each coefficient equals zero, is negative, or is positive, when assuming equal prior probabilities of 1/3 for each of these three hypotheses per coefficient. Overall, we see that the conclusions based on the classical p values and estimates are somewhat comparable to the Bayesian posterior probabilities. The most important difference is that the posterior probabilities are more conservative regarding the evidence against a null hypothesis.
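The accuracy check behind Figure 3 can be sketched as follows: given posterior draws of a coefficient (here simulated as a hypothetical stand-in, since the actual bergm draws are not reproduced), compare a kernel density estimate of the draws with a normal density whose mean and standard deviation are the sample moments of the draws.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(7)

# Hypothetical stand-in for MCMC posterior draws of one ERGM coefficient.
draws = rng.normal(loc=0.45, scale=0.12, size=50_000)

# Gaussian approximation from the sample moments of the draws.
m, s = draws.mean(), draws.std(ddof=1)

# Compare the kernel density estimate with the fitted normal density on a grid.
grid = np.linspace(m - 4 * s, m + 4 * s, 200)
kde = gaussian_kde(draws)(grid)
gauss = norm.pdf(grid, m, s)

max_abs_diff = np.max(np.abs(kde - gauss))
print(max_abs_diff < 0.1)  # the two density curves nearly coincide here
```

For a skewed posterior, such as the same-affiliation effect in Application B, the discrepancy between the two curves would be visibly larger, which is exactly what this kind of overlay diagnoses.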

Application B: Institutional and functional overlap in co-authorship networks
The interest is in testing competing hypotheses about an increasing positive effect of functional overlap, which are tested under the Swiss collaboration network data from Leifeld (2018). The Bayes factors between the hypotheses can be found in the evidence matrix in Table 4. We see that hypothesis H1, which assumes an increasing positive collegial effect, receives decisive evidence against H2 (constant positive collegial effect) and H3 (no collegial effect), and strong evidence against the complement hypothesis. The resulting posterior probabilities when using equal prior probabilities are printed in the last column of Table 4. The posterior probabilities for H1, H2, H3, and H4 are approximately 92%, 2%, 0%, and 6%, respectively. Thus, hypothesis H1, which assumes an increasing positive effect of functional overlap, is clearly the most likely after observing the data. Furthermore, we can safely rule out hypothesis H3, which assumed no functional overlap effect, given its posterior probability of approximately zero. Again, we note that no decision needs to be made about which hypothesis is true because these outcomes can be interpreted on a continuous scale. But if it were concluded that hypothesis H1 is true, it is important to note that there would still be a conditional probability of drawing the wrong conclusion of about 8% given the observed network. We again end the application by checking the accuracy of the Gaussian approximation of the posterior based on the posterior draws from the bergm function; see Figure 4, where the black dashed lines are the estimated posteriors and the solid red lines are the Gaussian approximations. Overall, the Gaussian approximation seems accurate, with a slight deviation for the posterior of the same-affiliation effect due to its skewness. We provide the additional (standard) output of BFpack on whether each coefficient equals zero, is negative, or is positive, when assuming equal prior probabilities of 1/3 for each of these three hypotheses, in Table 5. Overall, we see that the conclusions based on the classical p values and estimates are comparable to the Bayesian posterior probabilities, where the evidence against the null again seems larger for the significance test. This result holds in general (e.g., Berger & Delampady, 1987; Sellke et al., 2001; Wagenmakers, 2007), which is an important motivation for the recent suggestion to change the default significance level in statistical practice from .05 to .005.

Discussion
A Bayes factor test was proposed for testing a broad class of statistical hypotheses with competing equality and/or order constraints on the coefficients under ERGMs. The outcome of the Bayes factor quantifies the relative evidence in the data in favor of a hypothesis with a set of constraints on the ERGM coefficients relative to a hypothesis with alternative constraints on the coefficients. When prior probabilities for the hypotheses are formulated, the Bayes factors can be translated to posterior probabilities of the hypotheses, which quantify the plausibility of each hypothesis after observing the data. Several attractive properties of the methodology were illustrated, such as its consistent behavior, its ability to test multiple hypotheses simultaneously, its ability to quantify the evidence in the data in favor of the null, and its ability to balance fit and complexity as an Occam's razor, also for hypotheses with order constraints. These properties are not shared by classical p values, which are currently dominant for hypothesis testing in applied social network research using ERGMs. The proposed Bayes factors were calculated using unit-information-type g priors (Zellner, 1986; Liang et al., 2008), which were based on the design matrix of the predictor variables under the largest unconstrained (or 'encompassing') model in combination with a zero prior mean. Consequently, equal prior probability mass is placed on negative and positive effects, which makes the prior suitable for testing one-sided and order constraints (see also Mulder & Raftery, 2022, for a discussion of prior specification for order hypotheses when using the BIC). The unit-information prior covariance matrix ensures that the prior is relatively vague and completely dominated by the information in the data, while at the same time not arbitrarily vague, to avoid Bartlett's phenomenon (Bartlett, 1957; Liang et al., 2008). Truncated versions of this unconstrained prior were used under the
constrained hypotheses of interest, such that the priors under all hypotheses of interest are linked to the same unconstrained prior (see Consonni & Veronese, 2008, for a related discussion). This choice greatly simplified the computation of Bayes factors via an extended Savage-Dickey density ratio. It is important to note that various objective Bayes factors that have been proposed in the literature can also be written as Savage-Dickey density ratios, such as partial Bayes factors and fractional Bayes factors (Berger & Pericchi, 1996; O'Hagan, 1995; Mulder, 2014; Mulder & Gu, 2022). Hence, Bayes factors based on Savage-Dickey density ratios have been shown to result in consistent selection behavior in various testing problems (despite certain philosophical aspects of the methodology; see Verdinelli & Wasserman, 1995; Marin & Robert, 2010; Heck, 2019; Mulder, Berger, et al., 2021; Mulder et al., 2022). A thorough investigation of the intricate properties regarding prior specification under Bayesian ERGMs falls outside the scope of the current paper and would be interesting to explore in future research.
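For a precise hypothesis H0: β = 0 under a Gaussian approximation of the posterior and a zero-centered Gaussian unconstrained prior, the Savage-Dickey density ratio reduces to the posterior density at zero divided by the prior density at zero. A minimal numerical sketch, where the posterior mean, posterior standard deviation, and prior standard deviation are hypothetical values chosen for illustration (not taken from the paper):

```python
from scipy.stats import norm

# Hypothetical Gaussian posterior and zero-centered unconstrained prior
# for one ERGM coefficient; the numbers below are illustrative only.
post_mean, post_sd = 0.10, 0.15  # Gaussian approximation of the posterior
prior_sd = 1.0                   # relatively vague unconstrained prior

# Savage-Dickey density ratio: the Bayes factor of H0 (beta = 0) against
# the unconstrained hypothesis equals the posterior density at zero
# divided by the prior density at zero.
bf_01 = norm.pdf(0.0, loc=post_mean, scale=post_sd) / norm.pdf(0.0, loc=0.0, scale=prior_sd)
print(round(bf_01, 2))  # about 5.3: positive evidence in favor of H0
```

The extended Savage-Dickey density ratio used in the paper generalizes this idea to hypotheses combining equality and order constraints, where the order constraints contribute a ratio of posterior and prior probabilities rather than densities.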
Finally, the implementation of the new methodology in the R package BFpack (Mulder, Gu, et al., 2021) allows researchers to apply the methodology in a straightforward manner. Thereby, the paper contributes to the increasing availability of Bayes factors for various statistical models, such as network autocorrelation models (Dittrich et al., 2019, 2020), structural equation models (Gu et al., 2019; Van Lissa et al., 2021), and analysis of variance (Rouder et al., 2012; Mulder & Gu, 2022), to name a few. For a recent overview of applications of Bayes factor testing in the social and behavioral sciences, we refer interested readers to Heck et al. (2022). Furthermore, tutorials on the use of Bayes factors are available from Hoijtink et al. (2019), Masson (2011), and Wagenmakers et al. (2010).

Figure 3: Posterior density estimate based on 1,000,000 posterior draws (dashed black line) and the Gaussian approximation of the posterior (solid red line) for the policy network.

Figure 4: Posterior density estimate based on 1,000,000 posterior draws (dashed black line) and the Gaussian approximation of the posterior (solid red line) for the co-authorship network.

Table 1: Guidelines (Raftery, 1995) for interpreting the Bayes factor B12 of hypothesis H1 against H2.

The Bayes factor B12 quantifies how much more plausible the observed network is under H1 relative to H2. From this definition, it automatically follows that the data were more likely to be observed under H1 than under H2 if B12 > 1, and vice versa. Because of this interpretation, the Bayes factor can be used as a relative measure of the evidence in the data in favor of one hypothesis against another. For example, if B12 = 10 for the hypotheses in Application A, this implies that there is 10 times more evidence in the data for hypothesis H1 than for the alternative H2. Furthermore, Bayes factors satisfy a symmetry property, i.e., B21 = 1/B12, implying that if there is 10 times more evidence for H1 relative to H2, then there is 10 times less evidence in the data for H2 relative to H1. Thus, none of the hypotheses has a special role (unlike the null hypothesis in significance testing). Moreover, the Bayes factor follows a transitivity property, e.g., B31 = B32 × B21, which implies that if H3 receives 5 times more evidence than H2, and H2 receives 10 times more evidence than H1, then H3 receives 50 times more evidence than H1. To facilitate the interpretation of Bayes factors, Table 1 presents guidelines from Raftery (1995).

Table 2 :
Bayes factors between the hypotheses in the second test in Application A, and posterior probabilities based on equal prior probabilities.

Table 3 :
Classical results (MLE, standard error, and p value) and posterior probabilities of a null, a negative, and a positive effect (assuming equal prior probabilities) for all ERGM coefficients of the policy network (Application A).

Table 4 :
Bayes factors between the hypotheses in Application B, and posterior probabilities based on equal prior probabilities.

Table 5 :
Classical results (MLE, standard error, and p value) and posterior probabilities of a null, a negative, and a positive effect (assuming equal prior probabilities) for all ERGM coefficients of the co-authorship network (Application B).