Testing for Equivalence of Network Distribution Using Subgraph Counts

Abstract We consider that a network is an observation, and a collection of observed networks forms a sample. In this setting, we provide methods to test whether all observations in a network sample are drawn from a specified model. We achieve this by deriving the joint asymptotic properties of average subgraph counts as the number of observed networks increases but the number of nodes in each network remains finite. In doing so, we do not require that each observed network contains the same number of nodes, or is drawn from the same distribution. Our results yield joint confidence regions for subgraph counts, and therefore methods for testing whether the observations in a network sample are drawn from: a specified distribution, a specified model, or from the same model as another network sample. We present simulation experiments and an illustrative example on a sample of brain networks where we find that highly creative individuals’ brains present significantly more short cycles than found in less creative people. Supplementary materials for this article are available online.


Introduction
We show that subgraph counts are flexible and powerful statistics for testing distributional properties of networks, when more than one network is observed. Specifically, we use subgraph counts to test the hypotheses that all networks in a sample are generated either from a given distribution, from distributions specified by a given model class, or from the same model as that of another sample.
Our results address the fundamental inference problem raised by the following experiment (Gray et al., 2012): The networks connecting brain regions of individuals of varied levels of creativity are observed. However, while these observations can be assumed to be independent, due to the variability of the brain structure and the instability of the observation technique, they cannot be assumed to be identically distributed; for instance, they need not contain the same numbers of nodes and edges. Unfortunately, this implies that if we identify each network realization with its adjacency matrix, the obtained matrices will be of different sizes. This prevents, for instance, estimating the distribution the adjacency matrices are drawn from as one would from a sample of independent and identically distributed realizations. How, while allowing for such variations, can we test for significant differences between individuals with different levels of creativity?
Formally, we consider that each network is an observation-say G i for each subject-and a collection of observed networks forms a sample-say G = (G 1 , . . . , G N ). Then, our goal is to compare distributional properties of the G i -s as N grows. This parallels more classical statistical settings, where an observation is a vector-such as X i ∈ R k -and a sample is a matrix: X = (X 1 , . . . , X N ) ∈ R k×N . However, the parallel with the classical setting stops there. Indeed, the G i need not be of the same size, or have nodes that are matched; i.e., the X i -s would not have the same dimension, and their entries could be shuffled.
This setting strongly differs from the two settings that have already seen extensive research. The first is the setting where only one or two very large networks are observed, and for which many statistical tests already exist (see Hoff et al. (2012); Ho et al. (2012); Bickel et al. (2012); Sussman et al. (2012); Bhattacharyya and Bickel (2015); Tang et al. (2013); Olhede and Wolfe (2014); Fosdick and Hoff (2015); Klopp et al. (2016); Coulson et al. (2016), to cite but a few). Available statistical tests for network comparison in that setting focus on the case where one or two large networks are observed (Asta and Shalizi, 2014; Gao and Lafferty, 2017; Banerjee and Ma, 2017; Ghoshdastidar et al., 2017), or where one finite network is compared to a fixed model alone (Birmele, 2012; Ali et al., 2014, 2016). The second setting addresses graph samples, as we do here, but under very restrictive assumptions: samples that are independent and identically distributed, where all graphs have the same size, and where nodes across networks can be matched one-to-one. Under these assumptions, it is possible to compare network summaries using classical statistical methods (Simpson et al., 2012; Stoffers et al., 2013; Daianu et al., 2013; Ginestet et al., 2017), or to fit a parametric or semi-parametric model (Simpson et al., 2012; Wang et al., 2019; Durante and Dunson, 2018).
We provide, in the graph sample setting described above, an analog of the multivariate t-test for network samples: methods to test whether a given network sample G presents averages consistent with either a specific model, or with those of another sample. The averages we use are subgraph counts; e.g., the number of copies of a given small graph in the sample. The choice of subgraph counts as statistics is motivated by their success in comparing large networks (Milo et al., 2002; Ali et al., 2014), but also by results in random graph theory and the study of large graphs. In both fields, subgraph counts have proved to be the most powerful tool available to compare networks (Ali et al., 2014, 2016), and are known to have properties similar to moments of random variables when studying large networks (Diaconis and Janson, 2008; Bickel et al., 2012; Lovász, 2012; Chatterjee and Diaconis, 2013). Finally, on account of these results, many efficient algorithms exist to count subgraphs; e.g., Talukder and Zaki (2016); Ortmann and Brandes (2016).
Formally, we represent network samples in a space defined by subgraph counts, and perform comparisons in that space. While other network comparison techniques also use embeddings (Asta and Shalizi, 2014; Gao and Lafferty, 2017; Wang et al., 2019), using subgraph counts presents three key advantages. First, if the G i -s are generated by a blockmodel (Hoff et al., 2012)-one of the most popular random network models to date-and for some families of subgraphs, the embedding is one-to-one; this result is known as the finite forcibility of a family of graphs (Lovász, 2012, Ch. 16). Second, very few assumptions on each G i need to be made as N grows to obtain consistency and asymptotic normality of the image of G in the embedding space. This enables us to work under a very flexible null model. Finally, because it relies on direct summaries of the G i -s (the number of copies of each subgraph, and so on), this embedding remains interpretable, which is part of the appeal of subgraph-based inference.
The main risk in using subgraphs for testing purposes is that we cannot be assured that the subgraphs considered are sufficient to distinguish the null from the data-generating mechanism.
We provide several experiments suggesting that, while possible, this does not appear to be common. Furthermore, this risk is balanced by the interpretability of subgraph counts: if the null and the generating mechanism have the same subgraph counts, they may be equivalent for the purpose of the study at hand. Another shortcoming is that subgraph counts can be highly correlated, especially in denser networks, making the estimation of the inverse of their covariance matrix unstable. One of our contributions is closed-form formulae for these covariance matrices under our null, which reduces but does not fully resolve this issue.
Finally, because subgraph-counting libraries are not standard, implementing the proposed methods is not as easy as for methods that rely on more common, linear-algebra-based data transformation pipelines. To this end, we have made all the code used to perform our analysis available online.
In the remainder of this article, we first introduce subgraph counts and the kernel based random graph model. We then present, successively, the case where all the networks in the sample come from the same kernel model, and the case where each observed network may come from a different kernel. In both cases, we prove asymptotic normality of our estimator, present a plug-in estimator of its variance, describe the limit of the estimator under the alternative hypothesis, and provide representative examples showing the practical utility of the result. Methods to produce the figures, as well as supporting simulations, can be found in the Supplementary Material. We conclude with an analysis of connectome data, and with a discussion.

Subgraph Counts in Kernel Based Random Graphs
We now define our statistics (subgraph counts) and our null model (the kernel based random graph model). Subgraph counts are natural statistics to compare networks for two reasons.
First, subgraph counts intuitively summarize a network through its fundamental building blocks. This has historically given them purchase to address hard fundamental and empirical problems (Rucinski, 1988; Milo et al., 2002; Ali et al., 2014). Second, subgraph counts present tractable analytical properties. We will describe and leverage these properties below, in a manner paralleling what is done in related literature (Bickel et al., 2012; Bhattacharyya and Bickel, 2015; Rucinski, 1988).

Figure 1: Example graphs. There are 2 copies of the triangle in a) and b), but 0 in c). There are 1, 0 and 3 copies of the square in a), b) and c) respectively.
A subgraph count is the number of copies of a given graph in another graph (see Figure 1).
Throughout, we call the subgraph-and denote F -the graph which is counted and G the larger graph in which the counting takes place. All graphs will be simple (unweighted, no self loops or multiple edges). Subgraphs are also termed motifs, pattern graphs or shapes depending on the field (Milo et al., 2002;Alon et al., 1997;Hočevar and Demšar, 2014;Benson et al., 2016).
For clarity, we define subgraph counts formally as follows.

Definition 1 (Graph equivalence '≡'). Fix two graphs F and F'. We say that F is equivalent to F' (or is a copy of F'), and write F ≡ F', if there exists a bijective map φ from the vertex set of F to the vertex set of F' such that ij is an edge in F iff φ(i)φ(j) is an edge in F'.
Definition 2 (Subgraph count X F (G)). Fix two graphs F and G. We denote by X F (G) the number of not-necessarily-induced subgraphs of G equivalent to F; i.e., the number of graphs F' ≡ F with F' ⊂ G, where F' ⊂ G means that the vertex and edge sets of F' are subsets of those of G.
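Definition 2 can be made concrete with a short brute-force sketch (illustrative only; the function names are ours, and real applications should use the dedicated counting algorithms cited above). It counts injective maps of F into G and divides by the number of automorphisms of F, so that each copy is counted exactly once; F's vertices are assumed to be labeled 0, ..., |F|-1.

```python
from itertools import combinations, permutations

def count_subgraphs(F_edges, F_n, G_edges, G_n):
    """Count copies of F in G (Definition 2): not-necessarily-induced
    subgraphs of G equivalent to F, counted up to relabeling."""
    G = {frozenset(e) for e in G_edges}
    count = 0
    # enumerate injective maps from F's vertex set into G's vertex set
    for nodes in combinations(range(G_n), F_n):
        for phi in permutations(nodes):
            if all(frozenset((phi[u], phi[v])) in G for u, v in F_edges):
                count += 1
    # each copy was found once per automorphism of F; divide that out
    F = {frozenset(e) for e in F_edges}
    aut = sum(
        all(frozenset((p[u], p[v])) in F for u, v in F_edges)
        for p in permutations(range(F_n))
    )
    return count // aut

# triangle count in the complete graph K4: one per 3-subset of nodes, so 4
triangle = [(0, 1), (1, 2), (0, 2)]
K4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
print(count_subgraphs(triangle, 3, K4, 4))  # 4
```

This brute force is exponential in |F| and |G| and only meant to pin down the definition; the counting literature referenced in the text scales to real networks.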
With this notation, and calling G a , G b and G c the graphs in Figure 1, we have for instance X F (G a ) = 1 when F is the square. The power of subgraph counts to study networks stems from their inherent linearity.
Indeed, products of subgraph counts are but linear combinations of other subgraph counts.
Intuitively, first observe that a product of two subgraph counts amounts to counting pairs of copies. Then, the product can be recovered by counting the number of copies of all subgraphs that can be induced by a pair of copies. More precisely, in the Appendix we show the following, which is implicitly used in Rucinski (1988):

Lemma 1 (Linearity of subgraph counts (Rucinski, 1988)). For any two graphs F and F', there are factors c H and a set H F F' of subgraphs-the set of subgraphs that can be obtained using one copy of each of F and F' as building blocks-such that for any graph G,

X F (G) X F' (G) = Σ H∈H F F' c H X H (G).

This algebraic property of subgraph counts is the key to the proofs of Rucinski (1988) and Bhattacharyya and Bickel (2015). The property is also crucial to the subgraph counting algorithms of Hočevar and Demšar (2014), and can be found in many other examples. Crucially, as opposed to cases where the model enforces linearity-such as under assumptions of Normality-it is the nature of the statistics (subgraph counts) and of the system (graphs) that makes the problem linear.
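Lemma 1 can be checked numerically in a small case. For F = F' = a single edge, an ordered pair of edge copies is either the same edge, two edges sharing a vertex (forming a path on three vertices), or two disjoint edges; this gives X_edge(G)² = X_edge(G) + 2·X_path(G) + 2·X_2K2(G). A quick sketch (our own notation) verifying this identity on a random graph:

```python
import random
from itertools import combinations

rng = random.Random(3)
n = 8
# a random graph on n nodes, each edge present with probability 1/2
G = {frozenset(e) for e in combinations(range(n), 2) if rng.random() < 0.5}

deg = {v: sum(1 for e in G if v in e) for v in range(n)}
X_edge = len(G)
X_path = sum(d * (d - 1) // 2 for d in deg.values())  # two edges meeting at a node
X_2k2 = sum(1 for e1, e2 in combinations(G, 2) if not (e1 & e2))  # disjoint pairs

# Lemma 1 for F = F' = edge: the product is a linear combination of counts,
# with c_H = 1 for the edge and c_H = 2 for the path and the disjoint pair
assert X_edge ** 2 == X_edge + 2 * X_path + 2 * X_2k2
print("identity holds")
```

Here H F F' consists of the edge, the 3-vertex path, and the two-disjoint-edges graph, with coefficients 1, 2, and 2 respectively.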
The linearity of subgraph counts allows us to use as null the very flexible kernel based model (Lovász, 2012). This framework subsumes most models used in the statistical literature on networks; e.g., blockmodel (Hoff et al., 2012) and dot-product models (Sussman et al., 2012). It has the intuitive structure of affixing to each node i a latent feature (here x i ) and of connecting nodes i and j (conditionally independently) with a probability determined by the node features (here f (x i , x j )).
Definition 3 (Kernel f and random graph G n (f)). Fix a symmetric measurable map f : [0, 1]² → [0, 1], and call it a kernel. We call G n (f) the random graph distribution over graphs with n nodes such that: to each node i is randomly and independently assigned a feature x i ∈ [0, 1], with x i ∼ Unif([0, 1]); and edges then form conditionally independently, with ij an edge with probability f(x i , x j ).

To recover a blockmodel with K blocks, it suffices to consider a partition of [0, 1] into K sets (i.e., (P 1 , . . . , P K ) ∈ P K ([0, 1])) and set f to be constant over each P u × P v . The dot-product model is recovered with a kernel f of finite rank; i.e., f(x, y) = Σ u≤K λ u f u (x) f u (y).
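Definition 3 translates directly into a sampler. The sketch below is our own minimal implementation, with an illustrative two-block blockmodel kernel whose values (0.8 within blocks, 0.1 between) are assumptions for the example, not from the paper:

```python
import random
from itertools import combinations

def sample_kernel_graph(n, f, rng=None):
    """Draw one graph from G_n(f): latent features x_i ~ Unif[0,1] i.i.d.,
    then each edge ij present independently with probability f(x_i, x_j)."""
    rng = rng or random.Random(0)
    x = [rng.random() for _ in range(n)]
    return [(i, j) for i, j in combinations(range(n), 2)
            if rng.random() < f(x[i], x[j])]

# an illustrative 2-block blockmodel kernel: f is constant over each block pair
def f_block(x, y):
    return 0.8 if (x < 0.5) == (y < 0.5) else 0.1

G = sample_kernel_graph(20, f_block)
print(len(G), "edges")
```

Setting f constant everywhere recovers the Erdős-Rényi model, and a finite-rank f recovers the dot-product model, as noted above.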
The model assumes that the vertices' locations in the latent space are independent and identically distributed, so the model does not assume any structure or symmetry among the nodes of the observed networks. However, it can accommodate any such structure, which would then be recovered through estimation. Nonetheless, in some cases it may be relevant to enforce such structure; e.g., in our brain example, to assume that nodes in different hemispheres are unlikely to connect. Doing so trades a risk of misspecifying the model for faster convergence, which might be worthwhile in some cases. We did not find it necessary in our application.
In the kernel framework, subgraph counts have a direct interpretation as moments of f (Lovász, 2012) (see (1)). Specifically, if G ∼ G n (f), then the moments of X F (G) are moments of f . An infinite number of subgraph counts are sufficient statistics to distinguish between any two kernels (Lovász, 2012; Bollobás and Riordan, 2009), as the subgraph counts can be used to define the subgraph metric. However, there are no guarantees on which or how many subgraph counts are needed to distinguish between two kernels. For blockmodels and finite rank models, we know only that a finite number is sufficient (a concept known as finite forcibility; see Lovász (2012, Chapter 16.7 & Appendix 4) for more details).
This makes subgraph counts especially appropriate as statistics in an hypothesis test.
Indeed, for any finite set of subgraphs F, there could exist two distinct kernels f and f' presenting the same counts over F: observing a difference in subgraph counts is sufficient but not necessary to distinguish between kernels. Conversely, not observing a difference in subgraph counts can only imply that we do not observe a difference in the kernels, and therefore that we fail to reject the hypothesis that the kernels are equal. This has implications regarding the power of our test that we explore below.
Unfortunately, all known results on subgraph counts under the kernel model consider the setting where one very large graph is observed. Here we present the tools to address the problem where a sample of graphs is observed.

The simple case: Samples from one kernel
We now present a central limit theorem, as well as practical methods to build confidence regions for the subgraph counts observed in a network sample G = (G 1 , . . . , G N ). In this section, we assume that there is a kernel f such that each G i is drawn independently from G n i (f), where the sizes n i satisfy n i ≥ 2 max F ∈F |F| (for any graph F we write |F| for the number of nodes in F).
Fix F ∈ F and G ∈ G. In this setting, X F (G) is a random variable, and the first parameter to consider is its mean. To compute this mean, let F 1 , . . . , F X F (K G ) be all the copies of F in K G (the complete graph over the nodes of G), so that, using the linearity of the expectation,

E X F (G) = Σ j≤X F (K G ) E 1 {F j ⊂G} .

Then, direct computations show that E 1 {F j ⊂G} does not depend on j (see Proposition 1), and that

E X F (G) = X F (K G ) E 1 {F 1 ⊂G} .

Similar computations for higher moments, aided by Lemma 1, enable us to use the Lindeberg-Feller central limit theorem and the Cramer-Wold device to prove the following, writing μ̂ F (G) = N −1 Σ i≤N X F (G i )/X F (K G i ) for the empirical subgraph densities:

Theorem 1 (Statistical properties of subgraph counts). Fix a set of graphs F, a kernel f and a sequence of sizes n = (n i ) i≤N . Then μ̂ F (G) is an unbiased and asymptotically normal estimator of µ F (f); i.e., E μ̂ F (G) = µ F (f) and there exists a positive semi-definite Σ F (n, f) such that, asymptotically in N,

√N (μ̂ F (G) − µ F (f)) converges in distribution to N(0, Σ F (n, f)).

Crucial to the following is the covariance matrix Σ F (n, f)-which can be obtained by taking the limit in N in (2) for each F, F' ∈ F-and which will enable the computation of confidence regions. Interestingly, its elicitation is more involved than in the study of large networks, where only a few terms dominate. We refer to the Appendix for the proof.
The Appendix's proof relies on understanding the moments of single network counts, which have been studied extensively in the Erdős-Rényi and the exchangeable cases, see for example Rucinski (1988); Coulson et al. (2016).
Remark 1 (Non-random latent positions). Because our proof builds on the foundation of Rucinski (1988), our results extend to the case where the latent x i -s are not random, say fixed to some values. The only modification would be that the sum in (2) should be restricted to graphs H formed by copies of F and F' overlapping over at least an edge (as opposed to overlapping over one node in (2)). This also means that all our methods apply to that case, with the caveat that the resulting test will be conservative (as the variance is overestimated).

Theorem 1 enables testing against the null that all the G i -s are drawn from a given kernel.
Further, the Appendix contains a simulation experiment exploring convergence in small samples. To make this concrete, we consider an example in Figure 2. There, we observe a graph sample G = (G 1 , . . . , G 100 ) and aim to compare it to two kernels f a (in black) and f b (in gray) using Theorem 1; i.e., we assume that for i ∈ [N], G i ∼ G n i (f), and consider the null hypothesis H 0 : f = f a and the alternative H 1 : f = f b . We draw μ̂ F (G) as a white cross. The sizes of the networks in G, the n i , are non-random but not constant; we achieve this by using the sequence of digits of π.
First, since we have specified f a and f b , we can evaluate both µ F (f a ) and µ F (f b ) and draw them on the figure (as white dots). Then, since n = (n i ) i≤N is observed, we can compute Σ F (n, f a ) and Σ F (n, f b ) using Theorem 1, which allows us to draw the confidence ellipses around µ F (f a ) and µ F (f b ) (in shaded black and gray respectively). Finally, since we know the limit distribution and covariance under the null, we can use Σ F (n, f a ) and µ F (f a ) to compute a p-value using the Mahalanobis distance.

Figure 2: The estimate μ̂ F (G) is denoted by a white cross. Overlaid are the expected densities (white dots) and the confidence ellipses (shaded areas) for two alternative kernels f a and f b .

Figure 2 is also useful to understand the power of the proposed test. We see that, since the confidence ellipses do not all overlap, the power is larger than .95. However, if we were to use only two of the subgraphs, the power would be smaller, as the corresponding ellipses overlap. For larger sample sizes, the radii of both ellipses shrink, so that eventually the power tends to 1 using any combination of subgraphs. In the Supplementary Material we explore the power of our test when the expected count is held fixed, but the number of blocks in the null and true models differ.
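To make the p-value computation concrete, here is a self-contained numerical sketch of the one-sample test under an Erdős-Rényi null (a constant kernel f ≡ p, for which the expected edge and triangle densities are p and p³). As a simplification, it estimates the covariance empirically rather than through the closed-form Σ F (n, f), and uses the χ² survival function with 2 degrees of freedom, exp(−T/2); all function names are ours.

```python
import math
import random
from itertools import combinations

rng = random.Random(1)

def sample_er(n, p):
    """Erdos-Renyi graph: the constant kernel f(x, y) = p."""
    return {frozenset(e) for e in combinations(range(n), 2) if rng.random() < p}

def densities(G, n):
    """Normalized counts X_F(G) / X_F(K_n) for F = edge and F = triangle."""
    tri = sum(1 for a, b, c in combinations(range(n), 3)
              if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= G)
    return (len(G) / math.comb(n, 2), tri / math.comb(n, 3))

N, n, p = 400, 12, 0.4
sample = [densities(sample_er(n, p), n) for _ in range(N)]
mu_hat = [sum(s[k] for s in sample) / N for k in range(2)]
mu0 = (p, p ** 3)                          # null densities under f = p

# empirical covariance of the two densities (a plug-in stand-in for Sigma_F)
cov = [[sum((s[a] - mu_hat[a]) * (s[b] - mu_hat[b]) for s in sample) / (N - 1)
        for b in range(2)] for a in range(2)]
det = cov[0][0] * cov[1][1] - cov[0][1] ** 2
inv = [[cov[1][1] / det, -cov[0][1] / det],
       [-cov[0][1] / det, cov[0][0] / det]]

# Mahalanobis statistic; under H0 it is asymptotically chi-square with 2 d.o.f.
d = [mu_hat[k] - mu0[k] for k in range(2)]
T = N * sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
p_value = math.exp(-T / 2)                 # chi-square(2) survival function
print("p-value:", round(p_value, 3))
```

Since the data are drawn from the null, repeated runs produce p-values that are approximately uniform on [0, 1]; substituting a different p in mu0 drives the p-value toward 0.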
In the following we consider the case where instead of testing against the null of a single kernel, we test against the null of a kernel class.

The general case: Samples from different kernels
Here we expand our results to cases where the observed networks may be generated from different kernels. Indeed, in many settings, the sampling mechanism may distort the structure of the underlying kernel; e.g., although the network connecting brain regions can be satisfactorily modeled by a blockmodel (Koutra et al., 2013), the proportion of nodes of each block may be different in different experimental settings, so that each observation is drawn from a different blockmodel.
In this practically important and conceptually challenging new setting, the proof techniques developed for Theorem 1 yield the following.
Theorem 2. Fix a set of graphs F, a sequence of kernels f = (f i ) i∈N and a sequence of sizes n = (n i ) i∈N . Then μ̂ F (G) is an asymptotically normal estimator of µ F (f; N) = N −1 Σ i≤N µ F (f i ).

Therefore, even in this much more flexible setting, we can recover the barycenter of the µ F (f i ). However, the variance now has a more complex structure (see the Appendix for details).
Following the intuition of our example of brain networks, and to make the usefulness of Theorem 2 concrete, we introduce the flexible stochastic blockmodel (FSBm).
Definition 4 (FSBm and embedding shape). For a symmetric matrix B ∈ [0, 1] K×K we call D(B) the set of all possible kernels with the same block structure as B. For a set F of graphs, we call embedding shape the set µ F (B) = {µ F (f) : f ∈ D(B)}.

Figure 3: Testing for a FSBm class. We estimate μ̂ F (G) and plot it as a white cross. Then, we draw in solid color the embedding shapes µ F (B a ) and µ F (B b ). In shaded color we draw the associated confidence regions; p-values can be obtained using the Mahalanobis distance associated with the closest point of the shape.

The most direct way of using the FSBm is to test for all the f i being equal to some blockmodel instance in a class; i.e., assume that all G i -s are drawn from a kernel f and test for the null H 0 : f ∈ D(B). This is achieved by using a composite hypothesis test, and our results allow us to produce confidence regions and p-values using the same tools as before.
We present such an example in Figure 3. There, we observe G = (G 1 , . . . , G 200 ), and consider two FSBm classes generated from B a (in gray) and B b (in black). Then, we assume that all networks in the sample are drawn from a kernel f and test for the null H 0 : f ∈ D(B) for each of the two classes.

We present a further example in Figure 4. There we observe G = (G 1 , . . . , G 10³ ) and consider two FSBm classes generated by B a and B b . We first plot μ̂ F (G) as a white cross. Then, using Definition 4, we plot the convex hulls of the embedding shapes µ F (B a ) and µ F (B b ) (in solid gray and black respectively), wherein-by Theorem 2-µ F (f; N) must lie. Finally, we use a method described in the Supplementary Material to produce the confidence region around each shape (in shaded hue).

Are creative brains different from less creative ones?
We now consider a sample of brain networks G = (G 1 , . . . , G 113 ) (Koutra et al., 2013). This sample was produced as follows: diffusion MRI (dMRI) and structural MRI (sMRI) scans from 113 individuals were collected over 2 sessions at Beijing Normal University (Zuo et al., 2014). Graphs were estimated using the ndmg (Kiar et al., 2018) pipeline. The dMRI scans were pre-processed for eddy currents using FSL's eddy-correct (Smith et al., 2004). FSL's "standard" linear registration pipeline was used to register the sMRI and dMRI images to the MNI152 atlas (Woolrich et al., 2009; Jenkinson et al., 2012; Mazziotta et al., 2001). A tensor model was fit using DiPy (Garyfallidis et al., 2014) to obtain an estimated tensor at each voxel. A deterministic tractography algorithm was applied using DiPy's EuDX (Garyfallidis et al., 2012) to obtain streamlines, which indicate the voxels connected by an axonal fiber tract. Simple graphs were formed by first contracting voxels into graph vertices according to the Desikan parcellation (Desikan et al., 2006), and then by placing an edge between two of the 70 regions if a fiber tract was observed between any pair of their voxels. Further, we have available a covariate C = (c 1 , . . . , c 113 ) measuring each subject's creativity, as assessed by performance on a series of creativity tasks.

Figure 4: Testing for a full FSBm class. The sample G is such that N = 10³ and n i is the i-th term of the size sequence. We estimate μ̂ F (G) and plot it as a white cross. Then, we draw in solid color the convex hulls of the embedding shapes µ F (B a ) and µ F (B b ). In shaded color we draw the associated confidence regions; approximate (and conservative) p-values can be obtained by determining the confidence level at which the observation ceases to be in the confidence region.
To study this network sample while using the covariate C, we introduce a direct extension of our results to compare two network samples:

Corollary 1 (Two-sample test). Fix a set of subgraphs F and two network samples G and G' generated respectively by the kernels f and f' and the network size sequences n and n'.
Then, as both |G| and |G'| tend to infinity, and if min(n, n') ≥ 2 max F ∈F |F|, we have that μ̂ F (G) − μ̂ F (G') is asymptotically normal, and, as (2) applies, we can produce an unbiased, consistent and asymptotically normal estimator of vec(Σ F (n, n', f)), this without specifying or estimating f .
In the following we choose to work with F = { , , }. We chose these for several reasons: in a simulation experiment (see Supplementary Material) we find that using these counts leads to power surpassing the best available alternative in the literature (Tang et al.). We note that our estimator of Σ F (n, n', f) is entrywise Normal, not Wishart as in the classical setting. Thus, we have no guarantee that the estimate of Σ F (n, n', f) is positive definite, and we cannot use Hotelling's T-squared distribution to compute p-values. If the estimate is positive definite, we recommend ignoring the variations in the estimate of Σ F (n, n', f) and using the χ² distribution with |F| degrees of freedom. If the estimate is not positive definite, we recommend using the marginals.
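The two-sample comparison of Corollary 1 can be sketched as follows, here marginally on a single subgraph (the triangle), which sidesteps the positive-definiteness issue just discussed. The samples, kernels and function names are illustrative assumptions, not the paper's data; we use empirical variances in place of the closed-form covariance:

```python
import math
import random
from itertools import combinations

rng = random.Random(2)

def er(n, p):
    return {frozenset(e) for e in combinations(range(n), 2) if rng.random() < p}

def tri_density(G, n):
    tri = sum(1 for a, b, c in combinations(range(n), 3)
              if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= G)
    return tri / math.comb(n, 3)

def two_sample_p(d1, d2):
    """Two-sided z-test on one subgraph density: a marginal version of the
    two-sample comparison, with empirical variances."""
    m1, m2 = sum(d1) / len(d1), sum(d2) / len(d2)
    v1 = sum((x - m1) ** 2 for x in d1) / (len(d1) - 1)
    v2 = sum((x - m2) ** 2 for x in d2) / (len(d2) - 1)
    z = (m1 - m2) / math.sqrt(v1 / len(d1) + v2 / len(d2))
    return 1 - math.erf(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))

# two samples of 150 graphs each, drawn from clearly different kernels
G1 = [tri_density(er(15, 0.3), 15) for _ in range(150)]
G2 = [tri_density(er(15, 0.5), 15) for _ in range(150)]
print(two_sample_p(G1, G2) < 0.01)
```

Drawing both samples from the same kernel instead yields p-values roughly uniform on [0, 1], which is the behavior exploited in the split-sample sanity check below.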
Before analyzing G using our results, we make the following test: we subsample uniformly at random and without replacement from G, yielding G 1 and G 2 such that G 1 ∪ G 2 = G, and use Corollary 1 to test for G 1 and G 2 being drawn from the same kernel f . Unless G presents characteristics that cannot be explained by our results, G 1 and G 2 should be indistinguishable, and we expect to see p-values that are uniformly distributed in [0, 1].
We perform this experiment 100 times, and obtain a sample of p-values for which we fail to reject the null of a uniform distribution using the Kolmogorov-Smirnov test (D = 0.09, p-value = 0.3). For this test we use F = { } because of the small sample size (|G 1 | + |G 2 | = 113); with more subgraphs, the correlation between the counts is so high that the estimated correlation matrix appears singular.
We now use C to split G in two samples. To do so, we choose to build a first subsample G 1 containing the less creative, and a second subsample G 2 containing the more creative.
More precisely, for a quantile q, and denoting Q C the empirical quantile function of C, we let G q 1 collect the networks of subjects with c i ≤ Q C (q), and G q 2 those with c i ≥ Q C (1 − q). Interestingly, for q = 0.5 and q = 0.4, we fail to reject the null that the networks in G q 1 and G q 2 come from the same kernel (see Table 1 for the p-values). However, for q = 0.3 we can reject the null of the same kernel at the 5% level using or , but not .
Thus, we observe that individuals with a very high level of creativity present significantly more and than those with a very low level of creativity. To further confirm this discovery, we undertake the following experiment: we estimate kernels (through the random-dot-product framework (Athreya et al., 2018)) over the samples G, G 0.2 1 , and G 0.2 2 , which we call f, f 1 , and f 2 respectively; then, for several γ ∈ [0, 1], we consider the power of Corollary 1's test when the samples compared are of the same cardinality as G 0.2 1 and G 0.2 2 , but drawn i.i.d. from G 70 (γf 1 + (1 − γ)f) and G 70 (γf 2 + (1 − γ)f) respectively. We find that the proposed test is very powerful, as powerful as a semi-parametric test relying on the random-dot-product structure of the model. This supports the claim that the difference between the samples is only observed for q ≤ 0.3, and not for more central quantiles.

Table 1: Testing for differences between G q 1 and G q 2 . For each F ∈ { , , } and q ∈ {0.1, 0.2, . . . , 0.5} we produce the p-value for the null H 0 : µ F (G q 1 ) = µ F (G q 2 ). The p-values increase with q, except for q = 0.1, in which case |G q 1 | and |G q 2 | are too small for the test to be significant.
We now aim to understand whether the added and arise from a few edges completing partially present shapes, or from entirely new and . To do so, we first observe that if G ∼ G n (f), then the complement graph of G is distributed as G n (1 − f). Therefore, we may apply our tests to the complemented networks, which amounts to comparing counts of fully absent subgraphs across network samples.
Then, applying our tests to the complemented samples (the collections of complements of the graphs in G q 1 and G q 2 ), we can test whether there are significantly more fully absent subgraphs in G q 1 than in G q 2 . We find that we cannot reject this null; i.e., we cannot reject the null of the complemented networks coming from the same kernel for q ≥ 0.3. Therefore, we conclude that the added and in the highly creative arise from a few edges completing partially present and .
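The complementation step above is one line of code; a minimal sketch (our own notation) for a small path graph:

```python
from itertools import combinations

def complement(G_edges, n):
    """Complement graph on n nodes: ij is an edge iff it is not an edge of G.
    If G ~ G_n(f), the complement is distributed as G_n(1 - f)."""
    G = {frozenset(e) for e in G_edges}
    return [e for e in combinations(range(n), 2) if frozenset(e) not in G]

# the path 0-1-2 on three nodes; its complement is the single edge (0, 2)
print(complement([(0, 1), (1, 2)], 3))  # [(0, 2)]
```

Any subgraph-count test can then be run unchanged on the complemented sample, counting fully absent copies instead of present ones.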
One key conclusion of our analysis is that we must use Theorem 2 to study G: indeed, we have just shown that the networks in G cannot all come from the same kernel. Theorem 2 then allows us to report, using the full sample G, a centered and consistent estimate of the average density of , , within human brains: (0.41, 0.13, 0.06). To conclude, we remark that, since all the G i -s in G have the same number of nodes, one could produce a similar analysis using-instead of our results-the µ F (G i ) as if they formed an independent and identically distributed sample. However, as we have just shown, the G i are not identically distributed, and therefore such an analysis could lead to spurious conclusions.

Discussion
We provide the tools to perform statistical inference on a network sample using subgraph counts. Our two main results provide consistency and asymptotic normality of subgraph counts under very flexible conditions. Using these results, we show that subgraph counts are powerful statistics to test whether network samples come from a specified distribution, a specified model, or from the same model.
The key insight we provide is that statistical inference methods paralleling classical ones for standard samples may be obtained for network samples. From this perspective, our results may be seen as providing an analog of a multivariate t-test for network samples.
However, going beyond what our results directly imply, we expect that parallels to ANOVA, model selection, model ranking, and goodness of fit may be obtained for network samples using our proof techniques.
However, since our tests are analogs of the t-test, they rely on global summaries. Therefore, while we can reject the null of two network samples being drawn from the same model, our approach is unable to locate where in the network the difference stems from.

APPENDIX A: Properties of subgraph counts
In the following we formalize certain notions we use loosely in the main body (especially the notion of copy and the sets H F F , as well as the constants c H ), prove all our results, and provide more details on the numerical examples we present.
We start by proving our first lemma, establishing the linearity of subgraph counts. We first define H F 1 F 2 and c H , generalizing definitions given in Rucinski (1988).
Definition 5 (Overlapping copies). For two graphs F 1 and F 2 we write H F 1 F 2 for the set of unlabeled graphs that can be formed by two copies of F 1 and F 2 , and c H for the number of ways a given H ∈ H F 1 F 2 can be built from copies of F 1 and F 2 . Finally, call H* F 1 F 2 the set H F 1 F 2 with F 1 F 2 removed, where F 1 F 2 is the vertex-disjoint union of F 1 and F 2 .
Note that H F 1 F 2 is defined as a quotient set, sometimes called quotient space, through the equivalence relation ≡. In the following we identify the equivalence classes in H F 1 F 2 with any of their element, and therefore will treat the elements of H F 1 F 2 as graphs.
Lemma 2 (Copies pairwise interaction). Fix three graphs F 1 , F 2 and G. Then,

X F 1 (G) X F 2 (G) = Σ H∈H F 1 F 2 c H X H (G).

Proof. We start by writing the product as a sum over ordered pairs of copies:

X F 1 (G) X F 2 (G) = Σ F' 1 ≡F 1 , F' 1 ⊂G Σ F' 2 ≡F 2 , F' 2 ⊂G 1. (3)

Now, from Definition 5, we first note that, by construction of H F 1 F 2 , for each pair F' 1 , F' 2 in the sum, the union F' 1 ∪ F' 2 is a copy of some H ∈ H F 1 F 2 . Therefore we can reindex the sum in (3) by the copies in G of the graphs H ∈ H F 1 F 2 . We then note that, by definition of c H , for each copy of H in G there will be exactly c H pairs (F' 1 , F' 2 ) of copies of F 1 and F 2 in G such that F' 1 ∪ F' 2 equals that copy. Therefore we can simplify the sum above to obtain

X F 1 (G) X F 2 (G) = Σ H∈H F 1 F 2 c H X H (G),

yielding the desired result.
With these tools in hand, we compute the first two moments of $X_F(G)$ when $G \sim G(n, f)$.
Proposition 1. Fix two graphs $F$ and $F'$ and a random graph $G \sim G(|G|, f)$ such that $|F| + |F'| \le |G|$. Then, we have that
$$\mathbb{E}\,X_F(G) = X_F(K_G)\,\mu_F(f)$$
and
$$\operatorname{cov}\big(X_F(G),\, X_{F'}(G)\big) = \sum_{H \in \mathcal{H}^*_{FF'}} c_H\, X_H(K_G)\,\big(\mu_H(f) - \mu_F(f)\,\mu_{F'}(f)\big).$$
Proof. We prove each statement in succession. To begin, call $F_1, \dots, F_m$ the $m = X_F(K_G)$ copies of $F$ in $K_G$ ($m > 0$ because $|F| \le |G|$). Then, $X_F(G) = \sum_{i \in [m]} 1_{\{F_i \subseteq G\}}$, and by linearity of the expectation,
$$\mathbb{E}\,X_F(G) = \sum_{i \in [m]} \mathbb{E}\,1_{\{F_i \subseteq G\}}. \qquad (4)$$
Let us fix $i \in [m]$ and consider $\mathbb{E}\,1_{\{F_i \subseteq G\}}$. Using the law of total probability, conditioning on the latent positions of the vertices of $F_i$, we obtain $\mathbb{E}\,1_{\{F_i \subseteq G\}} = \mu_F(f)$. Therefore, $\mathbb{E}\,1_{\{F_i \subseteq G\}}$ does not depend on $i$, and resuming from (4), we obtain $\mathbb{E}\,X_F(G) = X_F(K_G)\,\mu_F(f)$, which is the desired result.
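As a quick numerical sanity check of the first-moment formula (a sketch, not code from the paper), one can estimate $\mathbb{E}\,X_F(G)$ by Monte Carlo for $F$ the triangle and $G \sim G(6, 0.5)$, an Erdős-Rényi graph, where the formula gives $X_F(K_G)\,\mu_F(f) = \binom{6}{3}(0.5)^3 = 2.5$; the parameters and replicate count are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, reps = 6, 0.5, 20000
triples = list(combinations(range(n), 3))

counts = np.empty(reps)
for r in range(reps):
    # draw a symmetric adjacency matrix of G(n, p)
    A = rng.random((n, n)) < p
    A = np.triu(A, 1)
    A = A | A.T
    # brute-force triangle count X_F(G)
    counts[r] = sum(A[i, j] and A[j, k] and A[i, k] for i, j, k in triples)

expected = len(triples) * p ** 3   # X_F(K_G) * mu_F(f) = C(6,3) * 0.5^3 = 2.5
assert abs(counts.mean() - expected) < 0.1
```

The same recipe applies to any fixed subgraph $F$: enumerate its potential copies in $K_G$ and average the indicator that each is present.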
We now turn to the variance.
Now, we observe that if $F_i$ is disjoint from $F'_j$ (i.e., $F_i \cap F'_j = \emptyset$, which is possible because $|F| + |F'| \le |G|$), then $1_{\{F_i \subseteq G\}}$ is independent from $1_{\{F'_j \subseteq G\}}$, and we have that $\operatorname{cov}(1_{\{F_i \subseteq G\}}, 1_{\{F'_j \subseteq G\}}) = 0$. Therefore, resuming from (6), only the non-disjoint pairs contribute, and we obtain that
$$\operatorname{cov}\big(X_F(G),\, X_{F'}(G)\big) = \sum_{F_i \cap F'_j \ne \emptyset} \big(\mu_{F_i \cup F'_j}(f) - \mu_F(f)\,\mu_{F'}(f)\big).$$
Then, by the exact same transformation we used in Lemma 2, we know that each $F_i \cup F'_j$ is a copy of some $H \in \mathcal{H}^*_{FF'}$ (because $F_i$ and $F'_j$ are not disjoint), and furthermore, for each copy of $H$ in $K_G$ there are $c_H$ such pairs. Therefore,
$$\operatorname{cov}\big(X_F(G),\, X_{F'}(G)\big) = \sum_{H \in \mathcal{H}^*_{FF'}} c_H\, X_H(K_G)\,\big(\mu_H(f) - \mu_F(f)\,\mu_{F'}(f)\big),$$
which is the desired result.
We now turn to the proofs of Theorems 1 and 2. As the second generalizes the first, it is sufficient to prove the second.
Proof. We obtain the result by a joint application of the Lindeberg-Feller central limit theorem and the Cramér-Wold device. To do so, we fix $a \in \mathbb{R}^{|\mathcal{F}|}$ and compute the variance of our estimator projected along $a$.

Computing the variance: First recall that for each $F \in \mathcal{F}$,
$$\hat\mu_F(\mathcal{G}) = \frac{1}{N} \sum_{i \le N} \frac{X_F(G_i)}{X_F(K_{G_i})}.$$
Therefore, denoting by "$\cdot$" the inner product and taking the variance over $\mathcal{G}$, let $s_N^2 = \operatorname{var}\big(a \cdot \hat\mu_{\mathcal{F}}(\mathcal{G})\big)$.
Then, using the independence of the $G_i$s and the bilinearity of the covariance, we have
$$s_N^2 = \frac{1}{N^2} \sum_{i \le N}\ \sum_{F, F' \in \mathcal{F}} a_F\, a_{F'}\, \operatorname{cov}\!\left(\frac{X_F(G_i)}{X_F(K_{G_i})},\ \frac{X_{F'}(G_i)}{X_{F'}(K_{G_i})}\right).$$
To proceed, recall that for each $i \le N$, $G_i \sim G(n_i, f_i)$. Then, we may use Proposition 1 to obtain
$$s_N^2 = \frac{1}{N^2} \sum_{i \le N}\ \sum_{F, F' \in \mathcal{F}} a_F\, a_{F'} \sum_{H \in \mathcal{H}^*_{FF'}} \frac{c_H\, X_H(K_{G_i})}{X_F(K_{G_i})\, X_{F'}(K_{G_i})}\,\big(\mu_H(f_i) - \mu_F(f_i)\,\mu_{F'}(f_i)\big).$$
Because all sums are finite, we can reorder the summations, leading to
$$N s_N^2 = \sum_{F, F' \in \mathcal{F}} a_F\, a_{F'} \sum_{H \in \mathcal{H}^*_{FF'}} \omega_H(n, f; N), \qquad (7)$$
where
$$\omega_H(n, f; N) = \frac{1}{N} \sum_{i \le N} \frac{c_H\, X_H(K_{G_i})}{X_F(K_{G_i})\, X_{F'}(K_{G_i})}\,\big(\mu_H(f_i) - \mu_F(f_i)\,\mu_{F'}(f_i)\big).$$
Convergence of $\omega_H(n, f; N)$: To proceed we must show that $\omega_H(n, f; N)$ converges to a limit as $N$ diverges. We will achieve this by showing that $\omega_H(n, f; N)$ is Cauchy. To do so, observe that the summands are uniformly bounded. Indeed, recall that $X_F(K_G) = \binom{|G|}{|F|} \frac{|F|!}{\operatorname{aut}(F)}$ (see for instance Bollobás and Riordan (2009)), where $\operatorname{aut}(F)$ is the number of automorphisms of $F$ (the number of bijections from the vertex set of $F$ to itself that preserve adjacency). Then, as $|H| < |F| + |F'|$, each summand is of order $n_i^{|H| - |F| - |F'|}$.
Thus, the sequence $\omega_H(n, f; N)$ is Cauchy, and we may call $\omega_H(n, f)$ its limit; i.e., $\omega_H(n, f) = \lim_{N \to \infty} \omega_H(n, f; N)$. Then, resuming from (7), and writing $\Sigma_{\mathcal{F}}$ the matrix indexed by $\mathcal{F}$ such that
$$\big(\Sigma_{\mathcal{F}}(n, f)\big)_{FF'} = \sum_{H \in \mathcal{H}^*_{FF'}} \omega_H(n, f),$$
we have $\lim_{N \to \infty} N s_N^2 = a^\top \Sigma_{\mathcal{F}}(n, f)\, a$.

Satisfying the Lindeberg-Feller condition: To invoke the Lindeberg-Feller central limit theorem, we must show that our sequence verifies the so-called Lindeberg-Feller condition.
Recall that the sequence under study is
$$Y_i = a \cdot \left(\frac{X_{\mathcal{F}}(G_i)}{X_{\mathcal{F}}(K_{G_i})} - \mu_{\mathcal{F}}(f_i)\right),$$
and that the variance of the partial sum is $N^2 s_N^2$. Therefore, the Lindeberg-Feller condition we need to satisfy is the following:
$$\lim_{N \to \infty} \frac{1}{N^2 s_N^2} \sum_{i \le N} \mathbb{E}\big[Y_i^2\, 1_{\{|Y_i| > \epsilon N s_N\}}\big] = 0. \qquad (8)$$
To verify the condition we first fix $\epsilon > 0$. Then, observe that since for each $i$ and $F$ we have $X_F(G_i)/X_F(K_{G_i}) \le 1$ and $\mu_F(f_i) \le 1$, we have that $|Y_i| \le 2\|a\|_1$ by the triangle inequality. Therefore, as $N s_N \to \infty$ as $N$ grows, we may fix an $N_\epsilon$ such that for all $N > N_\epsilon$ we have $2\|a\|_1 \le \epsilon N s_N$. In this setting, the sum in (8) is equal to zero for all $N > N_\epsilon$, and the condition is verified. Therefore, we have that
$$\frac{1}{N s_N} \sum_{i \le N} Y_i \xrightarrow{d} \mathcal{N}(0, 1).$$
Reverting to the notation of the statement of the Theorem, we have that for any $a$,
$$\sqrt{N}\; a \cdot \big(\hat\mu_{\mathcal{F}}(\mathcal{G}) - \bar\mu_{\mathcal{F}}(f)\big) \xrightarrow{d} \mathcal{N}\big(0,\; a^\top \Sigma_{\mathcal{F}}(n, f)\, a\big), \qquad \bar\mu_{\mathcal{F}}(f) = \frac{1}{N} \sum_{i \le N} \mu_{\mathcal{F}}(f_i),$$
which is sufficient to obtain the claimed result for the limit in distribution.
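To illustrate the limit just derived, the following sketch simulates the centered, scaled statistic for $F$ a single edge, exploiting the fact that the edge count of a $G(n_i, p)$ graph is Binomial$\big(\binom{n_i}{2}, p\big)$, so the density $X_F(G_i)/X_F(K_{G_i})$ can be drawn directly. The alternating sizes, density $p$, and tolerances are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, reps = 0.3, 200, 2000
sizes = np.array([5, 8] * (N // 2))      # heterogeneous n_i, as the theorem allows
m = sizes * (sizes - 1) // 2             # X_F(K_{G_i}) for F = edge

stats = np.empty(reps)
for r in range(reps):
    # edge densities X_F(G_i) / X_F(K_{G_i}) of N independent graphs
    densities = rng.binomial(m, p) / m
    stats[r] = np.sqrt(N) * (densities.mean() - p)

# limiting variance: average of var(density_i) = p(1-p)/C(n_i, 2)
sigma2 = (p * (1 - p) / m).mean()
assert abs(stats.mean()) < 0.02
assert abs(stats.var() - sigma2) < 0.15 * sigma2
```

The empirical mean and variance of $\sqrt{N}(\hat\mu_F - p)$ match the Gaussian limit even though the $n_i$ are not all equal, which is the point of Theorem 2.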
We now turn to the proof of Corollary 1.
Proof. The result follows almost immediately from Theorem 1 and Slutsky's theorem.
First, since $|\mathcal{G}|$ and $|\mathcal{G}'|$ tend to infinity and both $n$ and $n'$ are large enough, Theorem 1 applies to each sample separately. As both samples are independent, linear combinations of the two limits are also multivariate Gaussian. Therefore, multiplying the first line by $\sqrt{|\mathcal{G}'|/(|\mathcal{G}| + |\mathcal{G}'|)}$, the second by $\sqrt{|\mathcal{G}|/(|\mathcal{G}| + |\mathcal{G}'|)}$ (both ratios being in $(0, 1)$, the limits in distribution are unaffected), and taking the difference, we obtain, with
$$\Sigma_{\mathcal{F}}(n, n', f) = \lim_{|\mathcal{G}|, |\mathcal{G}'| \to \infty} \left(\frac{|\mathcal{G}'|}{|\mathcal{G}| + |\mathcal{G}'|}\, \Sigma_{\mathcal{F}}(n, f) + \frac{|\mathcal{G}|}{|\mathcal{G}| + |\mathcal{G}'|}\, \Sigma_{\mathcal{F}}(n', f')\right),$$
the desired limit in distribution.
To obtain the consistent estimator of $\Sigma_{\mathcal{F}}(n, n', f)$, we first recall the expression of $(\Sigma_{\mathcal{F}})_{FF'}$ for $F, F' \in \mathcal{F}$. Then observe that, as $|\mathcal{G}|$ and $|\mathcal{G}'|$ grow, each term in this expression can be estimated by its empirical counterpart; furthermore, both $\mu_F(f)$ and $\mu_{F'}(f)$ may be estimated in the same way.
Then, by a direct application of Slutsky's theorem, we have that the resulting plug-in estimator of $\big(\Sigma_{\mathcal{F}}(n, n', f)\big)_{FF'}$ is asymptotically normal, converging at rate $|\mathcal{G}| + |\mathcal{G}'|$, which yields the claimed result.

APPENDIX B: Finite sample experiments
Before we proceed, we consider a simulation experiment to determine how large the sample size $N$ must be for the asymptotic limit to be a satisfactory approximation of the statistic's distribution. We present our results in Figure 5. There, we observe that fairly small $N$, on the order of 100 even with small $n_i$s, may be sufficient. Furthermore, results from Bickel et al. (2012) suggest that larger networks would make this convergence faster.
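A finite-sample experiment of this kind can be sketched as a coverage simulation (not the paper's code): build the asymptotic 95% confidence interval for the edge density from $N = 100$ networks and check that it covers the truth at close to the nominal rate. The graph size, density, and use of a plain z-interval are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, N, reps, n = 0.2, 100, 2000, 10
m = n * (n - 1) // 2        # X_F(K_G) for F = edge
z = 1.96                    # normal 97.5% quantile

covered = 0
for r in range(reps):
    # edge densities of N independent G(n, p) graphs
    dens = rng.binomial(m, p, size=N) / m
    se = dens.std(ddof=1) / np.sqrt(N)
    covered += abs(dens.mean() - p) <= z * se
coverage = covered / reps

# with N = 100 the Gaussian approximation should already be close to nominal
assert 0.92 < coverage < 0.98
```

Repeating this for several $N$ traces out how quickly the nominal level is reached, which is what Figure 5 reports for the full vector of subgraph counts.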
APPENDIX C: Power experiments

APPENDIX C.1: Direct application of Theorem 1

Here we consider the simple problem of testing whether a sample is drawn from a given model, as in Figure 2.
Specifically, we first fix $k$ and consider the following family of blockmodels $B_k$: graphs over 100 vertices and $k$ equisized blocks, of fixed global density $\rho = 1/80$, where the between-block probability of connection is 0, and the within-block probability of connection is constant (set so that the overall density is $\rho$). Then, for $k = 2$ to 4, we draw 12 graphs from $B_k$, yielding $\mathcal{G}_k$. This enables us to investigate the power of our test. Specifically, we estimate the probability of rejecting when we test the null of $B_k$ for $\mathcal{G}_{k'}$. To obtain power estimates, we apply Theorem 1 with a level of .05 over 5000 replicates, yielding the results in Table 2.

Table 2: Power when the number of blocks is misspecified. We find that our test has better power when the null presents fewer blocks than the true model. We also see that as the blocks get denser with larger $k$, it is computationally harder to maintain the power of the test because of the loss of precision when inverting the covariance matrix, which is almost singular.

APPENDIX C.2: Direct application of Corollary 1

Resuming from the main document, Section 5, we undertake the following experiment: we estimate kernels (through the random-dot-product framework (Athreya et al., 2018)) over the samples $\mathcal{G}$, $\mathcal{G}^{0.2}_1$, and $\mathcal{G}^{0.2}_2$, which we call $\hat f$, $\hat f_1$, and $\hat f_2$ respectively; then, for several $\gamma \in [0, 1]$, we consider the power of Corollary 1's test when the samples compared are of the same cardinality as $\mathcal{G}^{0.2}_1$ and $\mathcal{G}^{0.2}_2$.
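The Monte Carlo power estimation used above can be sketched in miniature (this is an Erdős-Rényi stand-in, not the paper's blockmodel test): repeatedly draw a sample of $N = 12$ graphs under an alternative density, run a z-test of the null density, and report the rejection fraction. The null and alternative densities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, reps = 12, 100, 2000
m = n * (n - 1) // 2
rho0, rho1 = 1 / 80, 1 / 60      # null and true densities (illustrative)

rejections = 0
for r in range(reps):
    # edge densities of N graphs drawn under the alternative rho1
    dens = rng.binomial(m, rho1, size=N) / m
    zstat = np.sqrt(N) * (dens.mean() - rho0) / dens.std(ddof=1)
    rejections += abs(zstat) > 1.96
power = rejections / reps
assert power > 0.9
```

The same loop, with the z-test replaced by the chi-square test of Theorem 1 on the full subgraph-count vector, produces the entries of Table 2.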
The purpose of the exercise is to see how large $\gamma$ needs to be for the tests to show reasonably good power, but also to compare the power of our test to others in the literature.
Specifically, we compare to two tests from the literature, which we refer to as semipar and nonpar; these are tightly related to that of Durante and Dunson (2018). Call $\beta_{\text{motif}}$, $\beta_{\text{semipar}}$ and $\beta_{\text{nonpar}}$ the power of our test, of semipar, and of nonpar, respectively. Then, we observe that:
• Power is sharply increasing with $\gamma$ for all tests (Figure 6).
• We cannot reject the null that motif is at least as powerful as semipar for all γ (Table 3).
• We can reject the null that nonpar is more powerful than motif (Table 3).

We now describe the construction of the confidence region and p-value under the null.
The first task is to compute the subgraph densities (the $\mu_F(f_B)$ and $\mu_H(f_B)$). To do so, we sum over the possible block assignments of the subgraph's vertices; practically, this is achieved by nested loops.
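The nested-loop computation can be sketched as follows for $F$ the triangle (a minimal illustration, not the paper's code): sum over all block assignments of the three vertices, weighting by the block proportions `pi` and the connection probabilities `B`, both of which are hypothetical inputs here.

```python
from itertools import product

def triangle_density(pi, B):
    """mu_F(f_B) for F the triangle under a blockmodel with block
    proportions pi and connectivity matrix B, computed by summing
    over block assignments of the three vertices (nested loops)."""
    K = len(pi)
    return sum(pi[a] * pi[b] * pi[c] * B[a][b] * B[b][c] * B[a][c]
               for a, b, c in product(range(K), repeat=3))

# sanity check: a one-block model is Erdos-Renyi, so mu_F = p^3
p = 0.3
assert abs(triangle_density([1.0], [[p]]) - p ** 3) < 1e-12
```

For a general subgraph $H$ the same pattern applies, with one loop per vertex of $H$ and one factor of $B$ per edge.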
That we can compute the subgraph densities under the null directly implies that we can compute both $\mu_{\mathcal{F}}(f_B)$ and $\Sigma_{\mathcal{F}}(f_B)$. Then, the construction of the confidence ellipse, as well as of the p-value using the Mahalanobis distance, are classical statistical inference methods which we need not describe here.

APPENDIX D.3: Figure 3

We first describe the network sample. The sample is such that for each $i \le 200$, $G_i \sim G(\pi[k] + 30, f_B)$, where for $i \ne j \in [3]$, $B_{ii} = .06$, $B_{ij} = .02$, and $f_B(u, v) = B_{\lfloor 3u \rfloor + 1, \lfloor 3v \rfloor + 1}$.
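The Mahalanobis-distance p-value mentioned above can be sketched as follows (a sketch under the asymptotic normal limit, not the paper's code); the function name and arguments are hypothetical, and `Sigma` stands for the null covariance $\Sigma_{\mathcal{F}}(f_B)$.

```python
import numpy as np
from scipy.stats import chi2

def subgraph_pvalue(mu_hat, mu0, Sigma, N):
    """p-value for H0: a sample of N networks has subgraph densities mu0,
    using the Mahalanobis distance of the averaged counts mu_hat under
    the asymptotic normal limit with covariance Sigma / N."""
    d = np.asarray(mu_hat) - np.asarray(mu0)
    d2 = N * d @ np.linalg.solve(Sigma, d)       # squared Mahalanobis distance
    return 1 - chi2.cdf(d2, df=len(d))           # chi-square with |F| degrees of freedom

# when the observed densities equal the null ones, the p-value is 1
assert subgraph_pvalue([0.1, 0.2], [0.1, 0.2], np.eye(2), 50) == 1.0
```

The corresponding $1 - \alpha$ confidence ellipse is the set of $\mu_0$ for which this p-value exceeds $\alpha$.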
We now describe how we produced the embedding shape and its confidence region. First, to produce the embedding shape, we compute $\mu_{\mathcal{F}}(f)$ for a large number of elements $f \in \mathcal{D}(B)$. Practically, we parametrize $\mathcal{D}(B)$ by a $K$-dimensional vector $\theta$, the entries of which are the sizes of each block, and call the associated kernel $f_\theta$. Then, we build a grid $S$ over the $K$-dimensional simplex of step-size 0.01, and for each $\theta \in S$ we compute $\mu_{\mathcal{F}}(f_\theta)$.
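Building such a grid over the probability simplex can be sketched with a stars-and-bars enumeration (an illustrative construction, not the paper's code); the example uses step-size 0.1 for brevity where the text uses 0.01.

```python
from itertools import combinations

def simplex_grid(K, step):
    """All points on the K-dimensional probability simplex whose
    coordinates are nonnegative multiples of `step`."""
    n = round(1 / step)
    pts = []
    # compositions of n into K nonnegative parts via stars and bars:
    # choose K-1 bar positions among n + K - 1 slots
    for bars in combinations(range(n + K - 1), K - 1):
        prev, parts = -1, []
        for b in bars:
            parts.append(b - prev - 1)
            prev = b
        parts.append(n + K - 2 - prev)
        pts.append([c / n for c in parts])
    return pts

grid = simplex_grid(3, 0.1)
assert all(abs(sum(t) - 1) < 1e-9 for t in grid)
assert len(grid) == 66   # C(10 + 3 - 1, 3 - 1) compositions
```

Each grid point $\theta$ then yields one kernel $f_\theta$ and one point $\mu_{\mathcal{F}}(f_\theta)$ of the embedding shape.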
To produce the confidence region, we produce the confidence ellipse for each $\theta \in S$, which we achieve by computing $\Sigma_{\mathcal{F}}(n, f_\theta)$ for each $\theta$ in our grid $S$.

We now describe how we produced the surface where $\mu_{\mathcal{F}}(f)$ may live, as well as its confidence region. First, we know that $\mu_{\mathcal{F}}(f)$ may realize any point in the convex hull of the embedding shape. Therefore, we build the embedding shape as for Figure 3, and then present its convex hull. Building the confidence region is achieved using the following reasoning:
- Observe that for any kernel $f$ and $H \in \mathcal{H}_{FF'}$, $\mu_H(f) \ge \mu_F(f)\,\mu_{F'}(f)$. This can, for instance, be directly recovered from the formulas in Proposition 1. Then, for any sequences $n$ and $f$, we have that $(\Sigma_{\mathcal{F}}(n, f))_{FF'} \ge 0$, so that all correlations are positive, and the greatest amplitude of the confidence ellipse will be found in its first quadrant; i.e., the variance of $a \cdot \hat\mu_{\mathcal{F}}(\mathcal{G})$ will be maximal for some $a \in \mathbb{R}^{|\mathcal{F}|}$ such that $a_F > 0$ for all $F \in \mathcal{F}$.
- For any sequences $n$ and $f$, define $\big(\Sigma^\dagger_{\mathcal{F}}(n, f)\big)_{FF'}$ by replacing each $\mu_H(f)$ appearing in $\big(\Sigma_{\mathcal{F}}(n, f)\big)_{FF'}$ with its upper bound over $\mathcal{D}(B)$.
Then, for any $a$ in the first quadrant we have that $a^\top \Sigma_{\mathcal{F}}(n, f)\, a \le a^\top \Sigma^\dagger_{\mathcal{F}}(n, f)\, a$. Therefore, the maximal radius of the ellipse associated with $\Sigma^\dagger_{\mathcal{F}}(n, f)$ in the first quadrant is larger than that induced by $\Sigma_{\mathcal{F}}(n, f)$. Thus, by the previous item, the maximal radius of $\Sigma^\dagger_{\mathcal{F}}(n, f)$ is larger than that of $\Sigma_{\mathcal{F}}(n, f)$. It follows that the auxiliary sphere of the ellipse associated with $\Sigma^\dagger_{\mathcal{F}}(n, f)$ contains that associated with $\Sigma_{\mathcal{F}}(n, f)$.
- Finally, $\max_{f \in \mathcal{D}(B)} \mu_H(f)$ is dominated by $\mu_H(f^*)$, where $f^*$ is the constant kernel equal to the maximal entry of $B$. In this final setting, we need only compute the $\mu_H(f^*)$, where $f^*$ describes an Erdős-Rényi random graph. This makes the computation straightforward, as we then have $\mu_H(f^*) = \big(\max_{ij} B_{ij}\big)^k$, where $k$ is the number of edges in $H$.
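The domination by the Erdős-Rényi bound can be checked numerically (an illustrative sketch; the block proportions `pi` are an assumption, and `B` reuses the Figure 3 values): compute $\mu_H(f_B)$ by nested loops and compare it with $(\max_{ij} B_{ij})^k$.

```python
from itertools import product

def subgraph_density(H_edges, pi, B):
    """mu_H(f_B) for a graph H given as an edge list on vertices 0..v-1,
    under a blockmodel with proportions pi and connectivity B."""
    v = max(max(e) for e in H_edges) + 1
    K = len(pi)
    total = 0.0
    for assign in product(range(K), repeat=v):
        w = 1.0
        for a in assign:
            w *= pi[a]
        for i, j in H_edges:
            w *= B[assign[i]][assign[j]]
        total += w
    return total

B = [[0.06, 0.02, 0.02], [0.02, 0.06, 0.02], [0.02, 0.02, 0.06]]
pi = [1 / 3, 1 / 3, 1 / 3]                      # equal blocks (assumption)
triangle = [(0, 1), (1, 2), (0, 2)]             # H with k = 3 edges
bound = max(max(row) for row in B) ** 3         # mu_H(f*) = (max B)^k
assert subgraph_density(triangle, pi, B) <= bound
```

The bound is tight exactly when $B$ is constant, in which case $f_B$ itself is the Erdős-Rényi kernel $f^*$.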
Following this argument, our confidence region is the union of the auxiliary spheres of the ellipse associated with $\Sigma^\dagger_{\mathcal{F}}(n, f)$, centered at each point in the convex hull of the embedding shape. Finally, the gray region is built with $B'$, where for $i \ne j \in [4]$, $B'_{ii} = .8$ and $B'_{ij} = .55$.