GTST: A Python Package for Graph Two-Sample Testing

The GTST package is a Python package for performing graph two-sample testing. The test infers whether two samples of graphs were generated from the same probability distribution or not. It is a very general framework as it allows comparison between binary, weighted, directed, node-labelled, node-attributed and edge-labelled graphs. Until now, no package has offered graph two-sample testing, even though the problem is often encountered in fields such as risk management, social sciences and molecular science. The flexibility of the test comes from so-called graph kernels, which allow one to measure similarities between complex graph data. The difference between the two samples is quantified using an empirical estimate of the maximum mean discrepancy, which is a distance on the space of probability measures. Along with testing of graph samples, the package offers various graph kernels, some of which have not been readily available before.


INTRODUCTION
It is increasingly common to obtain sampled data in the form of graph or network realisations. This happens either through the construction of graph structures as summary statistics for multivariate spatial-temporal data, or directly as a graph-network data set, for example from a social network realisation. In such settings, one often needs to draw inferences about two samples of such graphs or networks, that is, to test whether the collection of graphs/networks in one sample was generated from the same distribution as the collection of graphs/networks in the second sample.
Kernel methods have proven to be useful in pattern recognition tasks such as classification and can be further extended as an inference procedure to two-sample hypothesis testing on structured data. The method embeds the graphs into a reproducing kernel Hilbert space (RKHS) via a feature map, which is then extended further to the embedding of a probability distribution. The two-sample null hypothesis is that the generating mechanism behind the two samples is the same, and the test statistic, which is called the maximum mean discrepancy (MMD), is the largest distance between the means of the two sample embeddings [7].
Graph kernels are already well established and widely used for solving classification tasks on graphs, and can further be used to compare samples of graphs and to perform graph screening [14]. They provide a very flexible way of comparing graphs as they exist for a wide range of different graph structures, for example weighted, directed, labelled, and attributed graphs. Their performance depends on their expressiveness, that is, their ability to distinguish non-isomorphic graphs. The difficulty of distinguishing two samples of graphs varies strongly with the type of graphs.
Graph two-sample hypothesis testing is a problem that frequently arises in various disciplines, for example in bioinformatics [1], community detection [6], and risk management [3]. Graph two-sample hypothesis tests have mostly been performed using graph statistics such as the degree centrality and shortest paths. Although these methods can often perform well, they fail to take into account various attributes that are often present in real graphs, such as node labels, edge labels, node attributes and edge weights. The introduction of kernel two-sample hypothesis testing [7] opened the floodgates to testing on such attributes, and therefore provides a flexible way of performing two-sample hypothesis testing. Luckily, there also exists a vast literature on graph kernels [11,14]. Until now, there has been no package which allows one to estimate graphs from real-valued data matrices and perform hypothesis testing in a flexible manner. The package GTST provides three functionalities: 1. Functions to estimate the two-sample hypothesis test statistic; 2. Functions to calculate different graph kernels; and 3. A function to estimate graphs from data matrices.
The package allows other network science researchers who may not be programmers to use the MMD testing framework.
There exists a Python package called GraKel [17] which is dedicated to calculating various graph kernels. The package is very user-friendly, so the GTST user can use all graph kernels available in the GraKel package. The GTST package extends the choice of graph kernels available for graph two-sample testing by also providing a collection of kernels not available in GraKel. These include the fast random walk kernels, which are based on ideas from [10], along with an additional fast random walk kernel for edge-labelled graphs; the Wasserstein Weisfeiler-Lehman graph kernel [18], whose original code was adjusted for the package's needs; and the deep graph kernel [19] and the graph neural tangent kernel [4], whose original code was likewise adjusted for the package's needs.
The MONK estimator, which is a robust estimator of the MMD, was developed in [12], and its authors provide the code online and in a packaged environment. However, we have adjusted the code slightly to allow for robust comparison of samples of different sizes. MMDGraph then estimates the p-value of the test using a bootstrap or a permutation sampling scheme. The package also allows for estimating graphs using sklearn's graphical lasso [2]. Additional preprocessing can be done using the nonparanormal transform [13]. The best graph is found using the EBIC criterion [15]. GTST assumes that the graphs passed are networkx objects [9]. One can additionally use pre-computed kernels to perform tests.

IMPLEMENTATION AND ARCHITECTURE
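Let $G(V, S)$ denote a graph with vertex set $V$ and edge set $S$. In the two-sample testing of graph-valued data, we assume we are given two sets of samples/observations that comprise collections of graph-valued data, $\{G_1, \ldots, G_n\}$ and $\{G'_1, \ldots, G'_{n'}\}$. The graphs in the two samples are all generated independently from two probability spaces $(\Omega, \mathcal{F}, \mathbb{P})$ and $(\Omega, \mathcal{F}, \mathbb{Q})$, and the goal is to infer whether $\mathbb{P} = \mathbb{Q}$. We note that in this general testing framework the vertex sets and edge sets do not have to be equal, but they can be common if it is desirable for the application. This is therefore a very general testing framework.
In the simplest case, where we assume both sets of sample graphs come from a common set of vertices, the sample space $\Omega$ contains all possible edges that can occur in a graph $G$, that is, $\Omega = \{(v_i, v_j) : v_i, v_j \in V,\ i \neq j\}$. As the sample space is discrete, we can define the σ-algebra as the power set of $\Omega$, namely $\mathcal{F} = 2^{\Omega}$. $\mathbb{P}$ then defines the probability of obtaining a certain graph in the sample set of graph-valued data. As an example, we can define a population distribution to be uniform:
$$\mathbb{P}\big(G(V, S)\big) = \left(\tfrac{1}{2}\right)^{|S|} \left(\tfrac{1}{2}\right)^{N - |S|} = 2^{-N},$$
where $N$ is the total number of possible edges and $G(V, S)$ is a graph with $|V|$ vertices and $|S|$ edges. This setting is illustrated in Figure 1.
Now, returning to the concept of two-sample testing for graph-valued data, the goal is to infer whether the two samples of graphs are generated according to the same distribution. This involves developing a statistical test $T\big(\{G_i\}_{i=1}^{n}, \{G'_j\}_{j=1}^{n'}\big)$ to determine from the population samples whether there is sufficient evidence to reject the null hypothesis that both population distributions generating the two samples of graphs are equivalent, where $T$ is a function that distinguishes between the null hypothesis and the alternative hypothesis:
$$H_0: \mathbb{P} = \mathbb{Q} \quad \text{versus} \quad H_1: \mathbb{P} \neq \mathbb{Q}.$$
When the class of functions $\mathcal{B}$ over which the MMD is defined is the unit ball in an RKHS, the squared population MMD becomes kernelized:
$$\mathrm{MMD}^2[\mathcal{B}, \mathbb{P}, \mathbb{Q}] = \mathbb{E}_{G, \tilde{G} \sim \mathbb{P}}\big[k(G, \tilde{G})\big] - 2\, \mathbb{E}_{G \sim \mathbb{P},\, G' \sim \mathbb{Q}}\big[k(G, G')\big] + \mathbb{E}_{G', \tilde{G}' \sim \mathbb{Q}}\big[k(G', \tilde{G}')\big],$$
where $k$ is some graph kernel. The kernel matrix plays a central role and is defined as:
$$K_{ij} = k(\mathcal{G}_i, \mathcal{G}_j), \quad i, j = 1, \ldots, n + n',$$
where $\mathcal{G} = \{G_1, \ldots, G_n, G'_1, \ldots, G'_{n'}\}$ is the data. Note that both samples are included in the data. In the context of two-sample graph testing, we have two data sources: sample 1 of graphs assumed drawn from $\mathbb{P}$ and sample 2 of graphs assumed drawn from $\mathbb{Q}$. It can be useful to order the kernel matrix such that $K$ has the following block structure:
$$K = \begin{pmatrix} K_{\mathbb{P}\mathbb{P}} & K_{\mathbb{P}\mathbb{Q}} \\ K_{\mathbb{Q}\mathbb{P}} & K_{\mathbb{Q}\mathbb{Q}} \end{pmatrix},$$
where $K_{\mathbb{P}\mathbb{Q}}$ is the kernel function evaluated between data points in the sample drawn from $\mathbb{P}$ and data points in the sample drawn from $\mathbb{Q}$. $K_{\mathbb{P}\mathbb{P}}$, $K_{\mathbb{Q}\mathbb{P}}$ and $K_{\mathbb{Q}\mathbb{Q}}$ are defined analogously.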
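The block ordering of the kernel matrix can be illustrated with a small sketch. Everything in it is an assumption made for illustration: graphs are represented as plain adjacency dictionaries rather than networkx objects, and `degree_histogram_kernel` is a toy stand-in kernel, not one of the kernels shipped with GTST.

```python
from collections import Counter

def degree_histogram_kernel(g1, g2):
    """Toy graph kernel (illustrative stand-in): inner product of degree histograms."""
    h1 = Counter(len(nbrs) for nbrs in g1.values())
    h2 = Counter(len(nbrs) for nbrs in g2.values())
    return float(sum(count * h2[deg] for deg, count in h1.items()))

def kernel_matrix(sample_p, sample_q, kernel):
    """Kernel matrix over the pooled data, with sample 1 ordered before sample 2."""
    graphs = list(sample_p) + list(sample_q)
    return [[kernel(a, b) for b in graphs] for a in graphs]

def split_blocks(K, n):
    """Split K into the four blocks; the first n rows/columns belong to sample 1."""
    K_pp = [row[:n] for row in K[:n]]
    K_pq = [row[n:] for row in K[:n]]
    K_qp = [row[:n] for row in K[n:]]
    K_qq = [row[n:] for row in K[n:]]
    return K_pp, K_pq, K_qp, K_qq

# Graphs as adjacency dictionaries: node -> set of neighbours.
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
star4 = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}

sample_p = [path3, triangle]   # sample 1, n = 2
sample_q = [star4]             # sample 2, n' = 1
K = kernel_matrix(sample_p, sample_q, degree_histogram_kernel)
K_pp, K_pq, K_qp, K_qq = split_blocks(K, len(sample_p))
```

Because the pooled data list places sample 1 first, the off-diagonal blocks contain exactly the cross-sample kernel evaluations, and the two samples are free to have different sizes.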

UNBIASED ESTIMATE
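An unbiased estimate of $\mathrm{MMD}^2$ for $n$ and $n'$ samples of graphs is given by:
$$\widehat{\mathrm{MMD}}_u^2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i}^{n} k(G_i, G_j) + \frac{1}{n'(n'-1)} \sum_{i=1}^{n'} \sum_{j \neq i}^{n'} k(G'_i, G'_j) - \frac{2}{n n'} \sum_{i=1}^{n} \sum_{j=1}^{n'} k(G_i, G'_j).$$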

BIASED ESTIMATE
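A biased estimate of $\mathrm{MMD}^2$ for $n$ and $n'$ samples of graphs is given by:
$$\widehat{\mathrm{MMD}}_b^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(G_i, G_j) + \frac{1}{n'^2} \sum_{i=1}^{n'} \sum_{j=1}^{n'} k(G'_i, G'_j) - \frac{2}{n n'} \sum_{i=1}^{n} \sum_{j=1}^{n'} k(G_i, G'_j).$$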
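The biased and unbiased estimates differ only in whether the diagonal terms of the within-sample blocks are included. The following is a minimal sketch of both, assuming a precomputed kernel matrix whose first n rows and columns correspond to the first sample; the function names and the numbers in the example matrix are made up for illustration and are not part of the GTST API.

```python
def mmd2_biased(K, n):
    """Biased MMD^2 estimate: within-sample sums include the diagonal."""
    m = len(K) - n  # size of the second sample (n' in the text)
    s_pp = sum(K[i][j] for i in range(n) for j in range(n))
    s_qq = sum(K[i][j] for i in range(n, n + m) for j in range(n, n + m))
    s_pq = sum(K[i][j] for i in range(n) for j in range(n, n + m))
    return s_pp / n**2 + s_qq / m**2 - 2.0 * s_pq / (n * m)

def mmd2_unbiased(K, n):
    """Unbiased MMD^2 estimate: within-sample sums exclude the diagonal."""
    m = len(K) - n
    if n < 2 or m < 2:
        raise ValueError("both samples need at least two graphs")
    s_pp = sum(K[i][j] for i in range(n) for j in range(n) if i != j)
    s_qq = sum(K[i][j] for i in range(n, n + m)
               for j in range(n, n + m) if i != j)
    s_pq = sum(K[i][j] for i in range(n) for j in range(n, n + m))
    return s_pp / (n * (n - 1)) + s_qq / (m * (m - 1)) - 2.0 * s_pq / (n * m)

# Made-up symmetric kernel matrix for two samples of two graphs each.
K = [
    [1.0, 0.5, 0.2, 0.1],
    [0.5, 1.0, 0.2, 0.1],
    [0.2, 0.2, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
]
biased = mmd2_biased(K, n=2)      # 0.75 + 0.75 - 0.30
unbiased = mmd2_unbiased(K, n=2)  # 0.50 + 0.50 - 0.30
```

Both functions only index into the kernel matrix, so they work unchanged with any graph kernel, including a pre-computed one.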

UNBIASED LINEAR TIME ESTIMATE
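An unbiased estimate can be computed in $O(n)$ time. Assume that $n = n'$ and define $n_2 = \lfloor n/2 \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function; then the linear estimate is computed as:
$$\widehat{\mathrm{MMD}}_l^2 = \frac{1}{n_2} \sum_{i=1}^{n_2} \Big[ k(G_{2i-1}, G_{2i}) + k(G'_{2i-1}, G'_{2i}) - k(G_{2i-1}, G'_{2i}) - k(G_{2i}, G'_{2i-1}) \Big].$$
Although $\widehat{\mathrm{MMD}}_l^2$ has higher variance than $\widehat{\mathrm{MMD}}_u^2$, it is computationally much more appealing.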
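The linear-time estimate needs only four kernel evaluations per consecutive pair of observations, so the full kernel matrix is never formed. The sketch below assumes equal sample sizes; the function name `mmd2_linear` is hypothetical, and the demonstration uses edge counts with a product kernel as a simple stand-in for graphs and a graph kernel.

```python
def mmd2_linear(xs, ys, kernel):
    """Linear-time MMD^2 estimate from consecutive, non-overlapping pairs."""
    if len(xs) != len(ys):
        raise ValueError("the linear estimate assumes equal sample sizes")
    n2 = len(xs) // 2  # floor(n / 2)
    if n2 == 0:
        raise ValueError("need at least two observations per sample")
    total = 0.0
    for i in range(n2):
        a, b = 2 * i, 2 * i + 1  # zero-based counterparts of 2i-1 and 2i
        total += (kernel(xs[a], xs[b]) + kernel(ys[a], ys[b])
                  - kernel(xs[a], ys[b]) - kernel(xs[b], ys[a]))
    return total / n2

def product_kernel(a, b):
    """Stand-in kernel on scalar graph summaries (here: edge counts)."""
    return float(a * b)

# Hypothetical edge counts of the graphs in each sample.
edges_sample_1 = [3, 4, 5, 6]
edges_sample_2 = [1, 2, 1, 2]
estimate = mmd2_linear(edges_sample_1, edges_sample_2, product_kernel)
```

With the product kernel each summand factorises as (x_a - y_a)(x_b - y_b), which makes the result easy to verify by hand: (3 - 1)(4 - 2) = 4 and (5 - 1)(6 - 2) = 16, with mean 10.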

ROBUST ESTIMATE
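Consider the partition of the sample indices into $Q$ disjoint blocks $S_1, \ldots, S_Q$. By the representer theorem [16] we can express a function $f$ within the RKHS as:
$$f = \sum_{i=1}^{n} a_i\, k(\cdot, G_i) + \sum_{j=1}^{n'} b_j\, k(\cdot, G'_j).$$
A robust MMD estimator is found by solving:
$$\max_{c:\ c^\top K c \leq 1}\ \operatorname*{med}_{q \in \{1, \ldots, Q\}} \left\{ \frac{1}{|S_q|} \begin{pmatrix} \mathbf{1}_q \\ -\mathbf{1}_q \end{pmatrix}^{\!\top} K c \right\}, \qquad c = \begin{pmatrix} a \\ b \end{pmatrix}, \qquad K = \begin{pmatrix} K_{\mathbb{P}\mathbb{P}} & K_{\mathbb{P}\mathbb{Q}} \\ K_{\mathbb{Q}\mathbb{P}} & K_{\mathbb{Q}\mathbb{Q}} \end{pmatrix},$$
where $K_{\mathbb{P}\mathbb{Q}}$ is the kernel function evaluated between data points in the sample drawn from $\mathbb{P}$ and data points in the sample drawn from $\mathbb{Q}$, and $\mathbf{1}_q$ is an indicator vector of block $q$. For more details see [12,8].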
The package includes graph kernels such as:
• Random walk kernels, which can be used on weighted, directed, undirected, and bipartite graphs with node labels, node attributes, edge labels, and edge attributes.
• Deep graph kernels, which can be used on node-labelled binary graphs.
• Graph neural tangent kernels, which can be used on binary graphs with node labels/attributes.
• Wasserstein Weisfeiler-Lehman graph kernels, which can be used on undirected graphs with node labels, and on node-attributed graphs with the right embedding scheme.
Other graph kernels can be utilized via the graph kernel library GraKel [17], which is dedicated to calculating various graph kernels. The package is very user-friendly, so the GTST user can use all graph kernels available in the GraKel package. For more details on various graph kernels see [8,17,14].

EXAMPLE
The workflow is as follows:
1) Use two data arrays to estimate two sequences/samples of graphs using the graphical lasso [5]. This step can be skipped if the practitioner already has the samples of graphs in a networkx format [9].
2) Select a graph kernel.
3) Select an estimator of the MMD, or try multiple estimators, to obtain a p-value.
A quick example for the random walk kernel, along with further examples, can be found at https://github.com/ragnarlevi/GTST.
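The final step of the workflow, turning an MMD estimate into a p-value, can be sketched with a permutation scheme on a pre-computed kernel matrix. This is an illustrative stand-alone sketch and does not use or reproduce the GTST API: the function names are hypothetical, the graphs are summarised by made-up edge counts, and a Gaussian kernel on those counts stands in for a graph kernel.

```python
import math
import random

def biased_mmd2(K, idx_p, idx_q):
    """Biased MMD^2 estimate for the index split (idx_p, idx_q) of K."""
    n, m = len(idx_p), len(idx_q)
    s_pp = sum(K[i][j] for i in idx_p for j in idx_p)
    s_qq = sum(K[i][j] for i in idx_q for j in idx_q)
    s_pq = sum(K[i][j] for i in idx_p for j in idx_q)
    return s_pp / n**2 + s_qq / m**2 - 2.0 * s_pq / (n * m)

def permutation_p_value(K, n, num_permutations=999, seed=0):
    """Permutation p-value: reshuffle sample membership and recompute the statistic."""
    rng = random.Random(seed)
    indices = list(range(len(K)))
    observed = biased_mmd2(K, indices[:n], indices[n:])
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(indices)
        if biased_mmd2(K, indices[:n], indices[n:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (num_permutations + 1)

# Made-up edge counts: sparse graphs in sample 1, dense graphs in sample 2.
edges_p = [2, 3, 3, 4, 2, 3, 4, 3]
edges_q = [11, 12, 13, 12, 11, 12, 13, 12]
pooled = edges_p + edges_q

# Gaussian kernel on edge counts (bandwidth 2), a stand-in for a graph kernel.
K = [[math.exp(-((a - b) ** 2) / 8.0) for b in pooled] for a in pooled]
p_value = permutation_p_value(K, len(edges_p))
```

Since the kernel matrix is computed once and only its indices are permuted, the cost of each permutation is independent of how expensive the graph kernel itself is.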

QUALITY CONTROL
The requirements for running the tests for GTST are listed in requirements_dev.txt on the GitHub page. The tests check whether kernel matrices are positive semi-definite and whether the kernels, test statistic, and permutation method for the p-value are able to reject the null when the null is "extremely" false, using known random data sets.
Once the required testing packages have been installed, the tests can be performed by running the command pytest in the root folder, which will take around 15 minutes. This will generate a coverage report which can be found in the htmlcov directory. To view it, run cd htmlcov followed by python -m http.server, and open the localhost link (something like http://localhost:8000/) in a browser.
To test GTST in a clean environment for all Python versions from 3.7 to 3.10, we use Tox. This can be achieved by running tox in the root directory. Note that this takes significantly longer to run, so it is best performed as a final check. Whenever the code is pushed to the remote repository, the Tox test suite is automatically run using GitHub Actions. To investigate this process, consult the file found at .github/workflows/tests.yml on the GitHub page. The output of the report can be found in the GitHub Actions tab.
A coverage report was generated locally using coveralls, and it can also be viewed by clicking the coverage shield on the GitHub page.
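The test statistic used in this case is the largest distance between the expectations of some function with respect to the two probability distributions. Let $\mathcal{B}$ be a class of functions $f: \Omega \to \mathbb{R}$. The maximum mean discrepancy (MMD) is defined as:
$$\mathrm{MMD}[\mathcal{B}, \mathbb{P}, \mathbb{Q}] = \sup_{f \in \mathcal{B}} \Big( \mathbb{E}_{G \sim \mathbb{P}}\big[f(G)\big] - \mathbb{E}_{G' \sim \mathbb{Q}}\big[f(G')\big] \Big).$$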